# Chapter 34: Encoding and Decoding

This notebook covers Python's standard library modules for encoding binary data into text-safe formats and decoding it back. These encodings underpin email transport (MIME), data URLs, and other binary-to-text conversions. The chapter also covers CRC-32 checksums for integrity checks.

## Key Concepts
- **`base64`**: Encode/decode bytes using Base64 and URL-safe Base64
- **`quopri`**: Quoted-printable encoding for mostly-ASCII text with some non-ASCII characters
- **`binascii`**: Low-level binary-to-ASCII conversions, hexlify/unhexlify, CRC-32 checksums
- **`uu`**: Unix-to-Unix encoding (legacy format; the module was deprecated in Python 3.11 and removed in 3.13)
- **Hex and bytes**: Converting between bytes, hex strings, and integers

## Section 1: Base64 Encoding Basics

Base64 encodes arbitrary bytes into printable ASCII using a 64-character alphabet (A-Z, a-z, 0-9, `+`, `/`), with `=` as padding. It is widely used in email (MIME), data URLs, and API tokens.

In [None]:
import base64

# Encode bytes to Base64
original: bytes = b"Hello, World!"
encoded: bytes = base64.b64encode(original)

print(f"Original: {original}")
print(f"Encoded:  {encoded}")
print(f"Type:     {type(encoded).__name__}")

# The result is bytes, but contains only ASCII characters
assert isinstance(encoded, bytes)

# Decode back to original
decoded: bytes = base64.b64decode(encoded)
assert decoded == original

print(f"\nDecoded:  {decoded}")
print("Round-trip successful.")
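
By default, `b64decode` silently discards characters that are not in the Base64 alphabet, which can hide corrupted input. The sketch below contrasts that lenient behaviour with `validate=True`; the sample string with a stray `!` is made up for illustration.

```python
import base64
import binascii

# A Base64 string with a stray '!' that is not in the Base64 alphabet.
noisy: bytes = b"SGVs!bG8="

# Default behaviour: characters outside the alphabet are discarded before decoding.
lenient: bytes = base64.b64decode(noisy)
print(f"Lenient decode: {lenient}")

# Strict behaviour: validate=True rejects the stray character instead.
try:
    base64.b64decode(noisy, validate=True)
    strict_ok: bool = True
except binascii.Error as exc:
    strict_ok = False
    print(f"Strict decode failed: {exc}")
```

Strict validation is usually the safer choice when the input comes from an untrusted source.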

In [None]:
import base64

# Base64 increases size by roughly 33%:
# every 3 input bytes become 4 output bytes, and a final partial
# group is padded with '=' to a full 4 bytes (so 100 -> 136)
for size in [3, 6, 9, 12, 100]:
    data: bytes = b"A" * size
    encoded: bytes = base64.b64encode(data)
    ratio: float = len(encoded) / len(data)
    print(f"Input: {size:>3} bytes -> Output: {len(encoded):>4} bytes (ratio: {ratio:.2f})")
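
The 3-to-4 ratio can be turned into a small formula. The sketch below uses a hypothetical helper, `b64_encoded_length`, and checks it against `b64encode` for input sizes that do and do not need `=` padding.

```python
import base64


def b64_encoded_length(n: int) -> int:
    """Length of padded Base64 output for n input bytes (hypothetical helper)."""
    # Each group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((n + 2) // 3)


for size in [0, 1, 2, 3, 4, 100]:
    encoded: bytes = base64.b64encode(b"A" * size)
    padding: int = encoded.count(b"=")
    assert len(encoded) == b64_encoded_length(size)
    print(f"{size:>3} bytes -> {len(encoded):>3} chars, '=' padding: {padding}")
```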

## Section 2: Base64 with Strings

Since `base64` works with bytes, you need to encode strings to bytes first and decode the result back to a string.

In [None]:
import base64

# Encoding a string: str -> bytes -> base64 bytes -> str
text: str = "Python is great!"
encoded: str = base64.b64encode(text.encode()).decode()

print(f"Text:    {text}")
print(f"Encoded: {encoded}")
assert isinstance(encoded, str)

# Decoding: str -> bytes -> base64 decode -> bytes -> str
decoded: str = base64.b64decode(encoded).decode()
assert decoded == text

print(f"Decoded: {decoded}")
print("\nString round-trip successful.")
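
For non-ASCII text, the character encoding used on both sides has to match. This is a minimal sketch with two hypothetical helpers, `b64_encode_text` and `b64_decode_text`, that name the codec explicitly instead of relying on the default.

```python
import base64


def b64_encode_text(text: str, encoding: str = "utf-8") -> str:
    """Encode text to a Base64 string via an explicit character encoding."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


def b64_decode_text(encoded: str, encoding: str = "utf-8") -> str:
    """Decode a Base64 string back to text via an explicit character encoding."""
    return base64.b64decode(encoded).decode(encoding)


sample: str = "Gr\u00fc\u00dfe, \u4e16\u754c"
as_b64: str = b64_encode_text(sample)

print(f"Text:    {sample}")
print(f"Encoded: {as_b64}")

# The Base64 form is pure ASCII even though the text is not.
assert as_b64.isascii()
assert b64_decode_text(as_b64) == sample

# Decoding with a different codec does not fail here, but yields different text.
assert b64_decode_text(as_b64, encoding="latin-1") != sample
```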

## Section 3: URL-Safe Base64

Standard Base64 uses `+` and `/`, which have special meanings in URLs. URL-safe Base64 replaces them with `-` and `_`. The `=` padding character is unchanged.

In [None]:
import base64

# Generate data that produces + and / in standard base64
data: bytes = bytes(range(256))

# Standard encoding may contain + and /
standard: bytes = base64.b64encode(data)
print(f"Standard contains '+': {b'+' in standard}")
print(f"Standard contains '/': {b'/' in standard}")

# URL-safe encoding replaces + with - and / with _
url_safe: bytes = base64.urlsafe_b64encode(data)
print(f"\nURL-safe contains '+': {b'+' in url_safe}")
print(f"URL-safe contains '/': {b'/' in url_safe}")

assert b"+" not in url_safe
assert b"/" not in url_safe

# Both decode back to the same data
assert base64.urlsafe_b64decode(url_safe) == data
print("\nURL-safe round-trip successful.")

In [None]:
import base64

# Comparing standard vs URL-safe encodings side by side
# Use data that highlights the differences
sample: bytes = b"\xfb\xff\xfe"

std: str = base64.b64encode(sample).decode()
url: str = base64.urlsafe_b64encode(sample).decode()

print(f"Data:     {sample.hex()}")
print(f"Standard: {std}")
print(f"URL-safe: {url}")
print(f"\nDifferences: '+' -> '-' and '/' -> '_'")
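
URL-safe output can still end in `=` padding, which some systems prefer to omit. The sketch below strips the padding on encode and restores it on decode; the helper names are hypothetical and the unpadded form is a common convention, not something the `base64` module provides directly.

```python
import base64


def b64url_encode_nopad(data: bytes) -> str:
    """URL-safe Base64 without trailing '=' padding (hypothetical helper)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode_nopad(text: str) -> bytes:
    """Restore the padding that was stripped, then decode."""
    padding: str = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


for raw in [b"", b"a", b"ab", b"abc", b"\xfb\xff\xfe\x01"]:
    compact: str = b64url_encode_nopad(raw)
    assert "=" not in compact
    assert b64url_decode_nopad(compact) == raw
    print(f"{raw!r:<22} -> {compact!r}")
```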

## Section 4: Base32 and Base16 Encoding

The `base64` module also provides Base32 (a 32-character alphabet: A-Z and 2-7) and Base16 (uppercase hexadecimal) encodings.

In [None]:
import base64

data: bytes = b"Hello!"

# Base64: 6 bits per character
b64: bytes = base64.b64encode(data)
print(f"Base64: {b64.decode():<20} ({len(b64)} chars)")

# Base32: 5 bits per character (uppercase A-Z and 2-7)
b32: bytes = base64.b32encode(data)
print(f"Base32: {b32.decode():<20} ({len(b32)} chars)")

# Base16: 4 bits per character (hex digits)
b16: bytes = base64.b16encode(data)
print(f"Base16: {b16.decode():<20} ({len(b16)} chars)")

# All decode back to the same data
assert base64.b64decode(b64) == data
assert base64.b32decode(b32) == data
assert base64.b16decode(b16) == data
print("\nAll round-trips successful.")
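
Both decoders expect uppercase input unless told otherwise. A short sketch of the `casefold` parameter, using lowercase input chosen for illustration:

```python
import base64
import binascii

lower_hex: bytes = b"48656c6c6f21"  # lowercase hex for b"Hello!"

# Without casefold, lowercase digits are rejected.
try:
    base64.b16decode(lower_hex)
    strict_accepts_lowercase: bool = True
except binascii.Error:
    strict_accepts_lowercase = False

# casefold=True accepts lowercase input for Base16 and Base32.
relaxed16: bytes = base64.b16decode(lower_hex, casefold=True)
b32_lower: bytes = base64.b32encode(b"Hello!").lower()
relaxed32: bytes = base64.b32decode(b32_lower, casefold=True)

print(f"Strict accepts lowercase: {strict_accepts_lowercase}")
print(f"Base16 with casefold:     {relaxed16}")
print(f"Base32 with casefold:     {relaxed32}")
```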

## Section 5: Quoted-Printable Encoding (quopri)

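Quoted-printable encoding is designed for data that is mostly ASCII text with occasional non-ASCII bytes. Bytes outside the printable ASCII range, and the `=` character itself, are encoded as `=XX`, where `XX` is the byte value in hex. It is commonly used in email bodies, and a closely related form (Q-encoding) is used in email headers.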

In [None]:
import quopri

# Encode text with non-ASCII characters
data: bytes = "H\u00e9llo W\u00f6rld".encode("utf-8")
encoded: bytes = quopri.encodestring(data)

print(f"Original bytes: {data}")
print(f"Encoded:        {encoded}")

# The accented characters are encoded as =XX sequences
assert b"=C3" in encoded  # Both é and ö start with byte 0xC3 in UTF-8

# Decode back
decoded: bytes = quopri.decodestring(encoded)
assert decoded == data
print(f"Decoded:        {decoded}")
print(f"As string:      {decoded.decode('utf-8')}")

In [None]:
import quopri

# Decoding =XX sequences
encoded: bytes = b"Hello=20World"
decoded: bytes = quopri.decodestring(encoded)

print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
assert decoded == b"Hello World"  # =20 is the space character

# More examples
examples: list[bytes] = [
    b"100=25 complete",   # =25 is '%'
    b"price=3D100",       # =3D is '='
    b"line1=0D=0Aline2",  # =0D=0A is CRLF
]

print("\nDecoding examples:")
for enc in examples:
    dec: bytes = quopri.decodestring(enc)
    print(f"  {enc.decode():<25} -> {dec!r}")
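
Quoted-printable also limits line length: the encoder breaks long lines with a trailing `=` (a "soft line break") that the decoder removes again. The sketch below encodes one long line, made up for illustration, and checks that the round-trip still matches.

```python
import quopri

# One long logical line with a few non-ASCII characters and no newlines.
long_line: bytes = (("x" * 30 + "\u00e9") * 8).encode("utf-8")

encoded: bytes = quopri.encodestring(long_line)
lines: list[bytes] = encoded.splitlines()

print(f"Input length:  {len(long_line)} bytes on a single line")
print(f"Encoded lines: {len(lines)}")
for line in lines:
    print(f"  {len(line):>2} chars: ...{line[-12:].decode('ascii')}")

# Soft line breaks are removed on decode, so the original line comes back.
decoded: bytes = quopri.decodestring(encoded)
assert decoded == long_line
```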

## Section 6: Binary-ASCII Conversions (binascii)

The `binascii` module provides low-level functions for converting between binary data and various ASCII representations, including hexadecimal.

In [None]:
import binascii

# hexlify: bytes -> hex string (as bytes)
data: bytes = b"\xde\xad\xbe\xef"
hex_str: bytes = binascii.hexlify(data)

print(f"Data:    {data!r}")
print(f"Hex:     {hex_str}")
print(f"As text: {hex_str.decode()}")

assert hex_str == b"deadbeef"

# unhexlify: hex string -> bytes
restored: bytes = binascii.unhexlify(hex_str)
assert restored == data

print(f"\nRestored: {restored!r}")
print("Round-trip successful.")

In [None]:
import binascii

# hexlify and unhexlify with various data
samples: list[bytes] = [
    b"\x00\x00\x00\x00",
    b"\xff\xff\xff\xff",
    b"Hello",
    b"\x01\x02\x03",
]

print(f"{'Bytes':<25} {'Hex'}")
print("-" * 45)
for s in samples:
    h: str = binascii.hexlify(s).decode()
    print(f"{s!r:<25} {h}")
    # Verify round-trip
    assert binascii.unhexlify(h) == s
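
`unhexlify` raises an exception on malformed input, so code that parses untrusted hex needs to handle it. A minimal sketch follows, with a hypothetical `safe_unhexlify` helper. It also shows the optional separator argument of `hexlify` (Python 3.8+).

```python
import binascii
from typing import Optional


def safe_unhexlify(text: str) -> Optional[bytes]:
    """Return decoded bytes, or None if text is not valid hex (hypothetical helper)."""
    try:
        return binascii.unhexlify(text)
    except ValueError:
        # binascii.Error is a subclass of ValueError; this covers odd-length
        # input, non-hex digits, and non-ASCII strings.
        return None


for candidate in ["deadbeef", "abc", "zz"]:
    print(f"{candidate!r:<12} -> {safe_unhexlify(candidate)!r}")

# hexlify accepts a separator and an optional group size.
dashed: bytes = binascii.hexlify(b"\xde\xad\xbe\xef", "-")
grouped: bytes = binascii.hexlify(b"\xde\xad\xbe\xef", "-", 2)
print(f"Dashed:  {dashed.decode()}")
print(f"Grouped: {grouped.decode()}")
```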

## Section 7: CRC-32 Checksums

The `binascii.crc32()` function computes a CRC-32 checksum: a fast, non-cryptographic check value for detecting accidental corruption. It offers no protection against deliberate tampering.

In [None]:
import binascii

# CRC-32 checksum
crc: int = binascii.crc32(b"hello")
print(f"CRC-32 of b'hello':  {crc}")
print(f"Hex:                 {crc:#010x}")

# The result is deterministic
assert isinstance(crc, int)
assert crc == binascii.crc32(b"hello")

# Different data usually produces different checksums
# (collisions are possible, since the result has only 32 bits)
crc2: int = binascii.crc32(b"world")
print(f"\nCRC-32 of b'world': {crc2}")
print(f"Different: {crc != crc2}")

In [None]:
import binascii

# Incremental CRC-32: pass previous result as second argument
chunk1: bytes = b"Hello, "
chunk2: bytes = b"World!"

# Compute in one shot
full_crc: int = binascii.crc32(chunk1 + chunk2)

# Compute incrementally
partial: int = binascii.crc32(chunk1)
incremental_crc: int = binascii.crc32(chunk2, partial)

print(f"Full CRC-32:        {full_crc:#010x}")
print(f"Incremental CRC-32: {incremental_crc:#010x}")
assert full_crc == incremental_crc
print("\nIncremental computation matches.")
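
The incremental form is most useful when the data does not fit in memory or arrives in pieces. The sketch below, with a hypothetical `crc32_of_stream` helper, reads a file-like object in fixed-size chunks; `io.BytesIO` stands in for a real file.

```python
import binascii
import io
from typing import BinaryIO


def crc32_of_stream(stream: BinaryIO, chunk_size: int = 4096) -> int:
    """Compute CRC-32 over a binary stream without loading it all at once."""
    crc: int = 0
    while chunk := stream.read(chunk_size):
        # Feed the running value back in as the second argument.
        crc = binascii.crc32(chunk, crc)
    return crc


payload: bytes = b"123456789"
streamed: int = crc32_of_stream(io.BytesIO(payload), chunk_size=4)

print(f"Streamed CRC-32: {streamed:#010x}")
assert streamed == binascii.crc32(payload)
```

The input `b"123456789"` is the conventional CRC test string; its CRC-32 check value is `0xcbf43926`.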

## Section 8: Hex and Bytes Conversions (Built-in Methods)

Python's built-in `bytes` and `int` types also provide methods for hex conversions, which are often more convenient than `binascii`.

In [None]:
# bytes.hex() and bytes.fromhex()
data: bytes = b"\xca\xfe\xba\xbe"

# Convert bytes to hex string
hex_str: str = data.hex()
print(f"Bytes:    {data!r}")
print(f"Hex:      {hex_str}")
print(f"Type:     {type(hex_str).__name__}")

# Convert hex string back to bytes
restored: bytes = bytes.fromhex(hex_str)
assert restored == data
print(f"Restored: {restored!r}")

# bytes.hex() accepts an optional separator (Python 3.8+)
spaced: str = data.hex(" ")
print(f"\nWith spaces: {spaced}")

coloned: str = data.hex(":")
print(f"With colons: {coloned}")
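
Going the other way takes a little care: `bytes.fromhex()` skips ASCII whitespace but not other separators. The sketch below parses a colon-separated value made up for illustration, and shows grouping with the second argument of `bytes.hex()`.

```python
# bytes.fromhex() ignores whitespace between byte pairs...
spaced_bytes: bytes = bytes.fromhex("ca fe ba be")

# ...but other separators, such as ':', must be removed first.
mac_text: str = "ca:fe:ba:be:00:01"
mac_bytes: bytes = bytes.fromhex(mac_text.replace(":", ""))

# The second argument to hex() groups bytes between separators.
grouped: str = b"\xca\xfe\xba\xbe".hex(" ", 2)

print(f"From spaced hex: {spaced_bytes!r}")
print(f"From MAC text:   {mac_bytes!r}")
print(f"Grouped by 2:    {grouped}")

# Formatting with the same separator restores the original text.
assert mac_bytes.hex(":") == mac_text
```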

In [None]:
# int.to_bytes() and int.from_bytes()
value: int = 305419896  # 0x12345678

# Convert int to bytes (big-endian)
big: bytes = value.to_bytes(4, byteorder="big")
print(f"Value:      {value} (0x{value:08x})")
print(f"Big-endian: {big.hex()}")

# Convert int to bytes (little-endian)
little: bytes = value.to_bytes(4, byteorder="little")
print(f"Little-end: {little.hex()}")

# Convert back
from_big: int = int.from_bytes(big, byteorder="big")
from_little: int = int.from_bytes(little, byteorder="little")
assert from_big == value
assert from_little == value
print(f"\nBoth decode back to: {from_big}")
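
Two details of `to_bytes()` are easy to trip over: the length must be large enough for the value, and negative numbers need `signed=True`. A brief sketch, with a hypothetical `min_byte_length` helper:

```python
def min_byte_length(value: int) -> int:
    """Smallest number of bytes that can hold a non-negative int (hypothetical helper)."""
    return max(1, (value.bit_length() + 7) // 8)


value: int = 305419896  # 0x12345678
print(f"Minimum length for {value:#x}: {min_byte_length(value)} bytes")

# A length that is too small raises OverflowError.
try:
    value.to_bytes(2, byteorder="big")
    too_small_ok: bool = True
except OverflowError as exc:
    too_small_ok = False
    print(f"2 bytes is not enough: {exc}")

# Negative values need signed=True (two's complement).
negative: bytes = (-2).to_bytes(2, byteorder="big", signed=True)
print(f"-2 as signed 16-bit: {negative.hex()}")

# The same bytes decode differently depending on the signed flag.
as_signed: int = int.from_bytes(negative, byteorder="big", signed=True)
as_unsigned: int = int.from_bytes(negative, byteorder="big")
print(f"Signed: {as_signed}, unsigned: {as_unsigned}")
```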

## Section 9: UU Encoding (Legacy)

UU (Unix-to-Unix) encoding is a legacy format for transferring binary files over text-only channels. It is rarely used today: the `uu` module was deprecated in Python 3.11 and removed in Python 3.13 (PEP 594), so the cell below requires Python 3.12 or earlier.

In [None]:
# Requires Python 3.12 or earlier: the `uu` module was removed in Python 3.13
import uu
import io

# UU encode: binary data -> text representation
data: bytes = b"Hello, UU encoding!"
input_buf: io.BytesIO = io.BytesIO(data)
output_buf: io.BytesIO = io.BytesIO()

uu.encode(input_buf, output_buf, name="hello.txt")

encoded: bytes = output_buf.getvalue()
print("UU Encoded:")
print(encoded.decode())

# UU decode: text -> binary data
decode_input: io.BytesIO = io.BytesIO(encoded)
decode_output: io.BytesIO = io.BytesIO()

uu.decode(decode_input, decode_output)

decoded: bytes = decode_output.getvalue()
assert decoded == data
print(f"Decoded: {decoded}")
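
On Python versions without the `uu` module, the line-level functions `binascii.b2a_uu()` and `binascii.a2b_uu()` are still available. The sketch below uses them to encode and decode the data lines only, with hypothetical helper names. It does not write or parse the `begin`/`end` header lines of a full UU file.

```python
import binascii


def uu_encode_lines(data: bytes) -> list[bytes]:
    """UU-encode data as a list of lines; b2a_uu accepts at most 45 bytes per call."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]


def uu_decode_lines(lines: list[bytes]) -> bytes:
    """Decode lines produced by uu_encode_lines back into the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


data: bytes = b"Hello, UU encoding!"
lines: list[bytes] = uu_encode_lines(data)

print("UU data lines:")
for line in lines:
    print(f"  {line!r}")

restored: bytes = uu_decode_lines(lines)
assert restored == data
print(f"Decoded: {restored}")
```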

## Section 10: Practical Patterns

Common encoding patterns used in real applications.

In [None]:
import base64


def create_data_uri(data: bytes, mime_type: str) -> str:
    """Create a data URI from bytes and a MIME type."""
    encoded: str = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def parse_data_uri(uri: str) -> tuple[str, bytes]:
    """Parse a data URI and return (mime_type, decoded_data)."""
    header, encoded = uri.split(",", 1)
    mime_type: str = header.split(":", 1)[1].split(";", 1)[0]
    data: bytes = base64.b64decode(encoded)
    return mime_type, data


# Create a data URI for a small text snippet
content: bytes = b"<h1>Hello</h1>"
uri: str = create_data_uri(content, "text/html")
print(f"Data URI: {uri}")

# Parse it back
mime, decoded = parse_data_uri(uri)
print(f"\nMIME type: {mime}")
print(f"Data:      {decoded}")
assert decoded == content
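
As a cross-check, the standard library can also read `data:` URLs itself: `urllib.request.urlopen()` handles the scheme without any network access. This is a small sketch for comparison, not a replacement for the parsing function above.

```python
import base64
from urllib.request import urlopen

content: bytes = b"<h1>Hello</h1>"
uri: str = "data:text/html;base64," + base64.b64encode(content).decode("ascii")

# urlopen decodes the Base64 body and exposes the media type as a header.
with urlopen(uri) as response:
    body: bytes = response.read()
    media_type: str = response.headers.get_content_type()

print(f"MIME type: {media_type}")
print(f"Data:      {body}")
assert body == content
```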

In [None]:
import binascii


def verify_integrity(data: bytes, expected_crc: int) -> bool:
    """Verify data integrity using CRC-32."""
    actual_crc: int = binascii.crc32(data)
    return actual_crc == expected_crc


# Simulate sending data with a checksum
message: bytes = b"Important data payload"
checksum: int = binascii.crc32(message)
print(f"Data:     {message}")
print(f"Checksum: {checksum:#010x}")

# Verify intact data
print(f"\nIntact:    {verify_integrity(message, checksum)}")

# Verify corrupted data
corrupted: bytes = b"Important data payloak"  # last char changed
print(f"Corrupted: {verify_integrity(corrupted, checksum)}")
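
In practice the checksum usually travels with the data. The sketch below appends the CRC-32 as a 4-byte big-endian trailer and checks it on receipt. The framing layout and helper names are assumptions made for illustration, not a standard format.

```python
import binascii


def add_crc_trailer(payload: bytes) -> bytes:
    """Append the payload's CRC-32 as 4 big-endian bytes (hypothetical framing)."""
    crc: int = binascii.crc32(payload)
    return payload + crc.to_bytes(4, byteorder="big")


def split_and_verify(frame: bytes) -> bytes:
    """Return the payload if its CRC-32 trailer matches, else raise ValueError."""
    if len(frame) < 4:
        raise ValueError("frame too short for a CRC-32 trailer")
    payload, trailer = frame[:-4], frame[-4:]
    if binascii.crc32(payload) != int.from_bytes(trailer, byteorder="big"):
        raise ValueError("CRC-32 mismatch")
    return payload


frame: bytes = add_crc_trailer(b"Important data payload")
print(f"Frame:    {frame.hex()}")
print(f"Verified: {split_and_verify(frame)}")

# Flip one bit in the first byte to simulate corruption in transit.
damaged: bytes = bytes([frame[0] ^ 0x01]) + frame[1:]
try:
    split_and_verify(damaged)
    corruption_detected: bool = False
except ValueError as exc:
    corruption_detected = True
    print(f"Damaged frame rejected: {exc}")
```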

In [None]:
import base64


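def encode_token(user_id: int, timestamp: int) -> str:
    """Create a URL-safe token from user_id and timestamp.

    Base64 is an encoding, not encryption or signing:
    anyone can read or forge a token built this way.
    """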
    payload: bytes = f"{user_id}:{timestamp}".encode()
    return base64.urlsafe_b64encode(payload).decode()


def decode_token(token: str) -> tuple[int, int]:
    """Decode a token back to (user_id, timestamp)."""
    payload: str = base64.urlsafe_b64decode(token).decode()
    user_id_str, ts_str = payload.split(":")
    return int(user_id_str), int(ts_str)


# Create and decode a token
token: str = encode_token(user_id=42, timestamp=1740000000)
print(f"Token:     {token}")

uid, ts = decode_token(token)
print(f"User ID:   {uid}")
print(f"Timestamp: {ts}")

assert uid == 42
assert ts == 1740000000
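
Because Base64 only changes the representation, a token like this can be decoded and altered by anyone. One common way to detect tampering is to attach an HMAC signature. The sketch below assumes a hypothetical shared `SECRET_KEY` and a `payload.signature` layout chosen for illustration; it is not a complete token scheme.

```python
import base64
import hashlib
import hmac

SECRET_KEY: bytes = b"demo-secret-do-not-use"  # hypothetical key, for illustration only


def sign_token(user_id: int, timestamp: int, key: bytes = SECRET_KEY) -> str:
    """Build 'payload.signature', both parts URL-safe Base64."""
    payload: bytes = f"{user_id}:{timestamp}".encode()
    signature: bytes = hmac.new(key, payload, hashlib.sha256).digest()
    parts = (payload, signature)
    return ".".join(base64.urlsafe_b64encode(p).decode("ascii") for p in parts)


def verify_token(token: str, key: bytes = SECRET_KEY) -> tuple[int, int]:
    """Return (user_id, timestamp) if the signature matches, else raise ValueError."""
    payload_b64, signature_b64 = token.split(".")
    payload: bytes = base64.urlsafe_b64decode(payload_b64)
    signature: bytes = base64.urlsafe_b64decode(signature_b64)
    expected: bytes = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch")
    user_id_str, ts_str = payload.decode().split(":")
    return int(user_id_str), int(ts_str)


signed: str = sign_token(user_id=42, timestamp=1740000000)
print(f"Signed token: {signed}")
print(f"Verified:     {verify_token(signed)}")

# Swap in a different payload while keeping the old signature.
forged_payload: str = base64.urlsafe_b64encode(b"1:1740000000").decode("ascii")
forged: str = forged_payload + "." + signed.split(".")[1]
try:
    verify_token(forged)
    forgery_detected: bool = False
except ValueError:
    forgery_detected = True
print(f"Forgery detected: {forgery_detected}")
```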

## Summary

### Base64 (`base64` module)
- **`b64encode(data)`** / **`b64decode(data)`**: Standard Base64 encoding
- **`urlsafe_b64encode(data)`** / **`urlsafe_b64decode(data)`**: URL-safe variant (uses `-` and `_`)
- **`b32encode(data)`** / **`b32decode(data)`**: Base32 encoding
- **`b16encode(data)`** / **`b16decode(data)`**: Base16 (hex) encoding
- Base64 increases data size by approximately **33%** (4 output bytes per 3 input bytes)

### Quoted-Printable (`quopri` module)
- **`encodestring(data)`**: Encode bytes with non-ASCII as `=XX` sequences
- **`decodestring(data)`**: Decode `=XX` sequences back to bytes
- Best for data that is **mostly ASCII** with occasional non-ASCII bytes

### Binary-ASCII (`binascii` module)
- **`hexlify(data)`**: Convert bytes to hex representation (as bytes)
- **`unhexlify(hex_str)`**: Convert hex back to bytes
- **`crc32(data)`**: Compute a CRC-32 checksum (returns `int`)
- CRC-32 supports **incremental** computation via a second argument

### Built-in Hex/Bytes Conversions
- **`bytes.hex(sep)`**: Convert bytes to hex string with optional separator
- **`bytes.fromhex(s)`**: Create bytes from a hex string
- **`int.to_bytes(length, byteorder)`**: Convert int to bytes
- **`int.from_bytes(data, byteorder)`**: Convert bytes to int

### UU Encoding (`uu` module)
- **`uu.encode(input, output)`**: Legacy Unix-to-Unix encoding
- **`uu.decode(input, output)`**: Decode UU-encoded data
- Deprecated in Python 3.11 and removed in 3.13; prefer Base64 for new applications