# Chapter 11: Binary Data and Encodings

This notebook covers the practical side of working with binary data and text encodings. We explore how UTF-8 and other encodings represent characters as bytes, how to handle encoding errors gracefully, and how to work with binary data using `struct` and `memoryview`.

## Key Concepts
- **UTF-8**: Variable-length encoding (1-4 bytes per character), dominant on the web
- **Encoding error handlers**: `strict`, `replace`, `ignore`, `xmlcharrefreplace`, `backslashreplace`, `namereplace`
- **Latin-1**: Bijective encoding mapping every byte 0-255 to a character
- **struct**: Pack and unpack binary data to/from C-compatible formats
- **memoryview**: Zero-copy slicing of binary data
- **BOM**: Byte Order Mark for detecting encoding and endianness

## UTF-8 Variable-Length Encoding

UTF-8 encodes each Unicode code point using 1 to 4 bytes. ASCII characters (U+0000 to U+007F) use a single byte, making UTF-8 backward-compatible with ASCII. Code points U+0080 to U+07FF take two bytes, U+0800 to U+FFFF take three, and U+10000 to U+10FFFF take four.
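To make the byte layout concrete, here is a minimal sketch of a hand-written encoder for a single code point. The function name `utf8_encode_code_point` is hypothetical and exists only for illustration; in real code, `str.encode("utf-8")` is the right tool. The sketch checks itself against the built-in codec.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    # Surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# One sample per length class: U+0041, U+00E9, U+4E2D, U+1F389
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F389"]:
    manual = utf8_encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")  # agrees with the built-in codec
    print(f"U+{ord(ch):04X} -> {manual.hex(' ')}")
```

The leading byte announces the sequence length through its high bits, and every continuation byte starts with `10`, which is what lets a decoder resynchronize after a corrupted byte.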

In [None]:
# UTF-8 uses 1-4 bytes per character
examples: list[tuple[str, str]] = [
    ("A", "ASCII (1 byte)"),
    ("é", "Latin accent (2 bytes)"),
    ("中", "CJK ideograph (3 bytes)"),
    ("🎉", "Emoji (4 bytes)"),
]

print(f"{'Char':<6} {'Codepoint':<12} {'UTF-8 Hex':<20} {'Bytes':<8} {'Binary'}")
print("-" * 75)
for char, desc in examples:
    encoded = char.encode("utf-8")
    binary = " ".join(f"{b:08b}" for b in encoded)
    print(f"{char:<6} U+{ord(char):04X}{'':>5} {encoded.hex(' '):<20} {len(encoded):<8} {binary}")

# Verify byte counts match expectations
assert len("A".encode("utf-8")) == 1
assert len("é".encode("utf-8")) == 2
assert len("中".encode("utf-8")) == 3
assert len("🎉".encode("utf-8")) == 4
print("\nAll byte count assertions passed.")

## Comparing Encodings: UTF-8, Latin-1, UTF-16, ASCII

Different encodings represent the same text differently. Each has trade-offs in terms of compatibility, compactness, and character coverage.
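The compactness trade-off depends on the text itself. The sketch below compares encoded sizes for two illustrative samples (the helper `encoded_sizes` and the sample strings are assumptions made for this example, not part of the notebook). The `-le` codec variants are used so that no BOM inflates the counts.

```python
def encoded_sizes(sample: str) -> dict[str, int]:
    """Return the encoded length in bytes for a few Unicode encodings."""
    return {
        enc: len(sample.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


samples: dict[str, str] = {
    "English": "Hello, world",
    "Chinese": "\u4f60\u597d\u4e16\u754c",  # four CJK ideographs
}

for label, sample in samples.items():
    print(f"{label:<8} {len(sample):>2} chars -> {encoded_sizes(sample)}")
```

For ASCII-heavy text UTF-8 is the smallest of the three, while for CJK text UTF-16 needs two bytes per ideograph against three in UTF-8.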

In [None]:
# Comparing how different encodings represent the same text
text: str = "café"

encodings: dict[str, str] = {
    "utf-8": "Variable-length (1-4 bytes), web standard, ASCII-compatible",
    "latin-1": "Fixed 1-byte, covers Western European languages",
    "utf-16": "Variable-length (2 or 4 bytes), Python's codec prepends a BOM",
    "ascii": "7-bit, English only (will fail on accented chars)",
}

print(f"Encoding {text!r}:\n")
for enc, desc in encodings.items():
    try:
        encoded = text.encode(enc)
        print(f"{enc:>10}: {encoded.hex(' '):<30} ({len(encoded)} bytes)")
        print(f"{'':>10}  {desc}")
    except UnicodeEncodeError as e:
        print(f"{enc:>10}: FAILED - {e}")
        print(f"{'':>10}  {desc}")
    print()

## Encoding Error Handling

When encoding text to bytes, characters that cannot be represented in the target encoding cause errors. Python provides several error handlers to control what happens.
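Before choosing a handler, it can help to see what the default `strict` behavior reports. The following sketch inspects the attributes of the raised `UnicodeEncodeError` to locate the offending characters; the variable names are illustrative.

```python
# Sample: "caf" + U+00E9 (e with acute accent) + space + U+1F40D (snake emoji)
text = "caf\u00e9 \U0001F40D"

failure = None
try:
    text.encode("ascii")  # errors="strict" is the default
except UnicodeEncodeError as err:
    # start/end delimit the first run of unencodable characters in err.object
    bad_chars = err.object[err.start:err.end]
    failure = (err.encoding, err.start, err.end, bad_chars)
    print(f"codec={err.encoding} span=[{err.start}:{err.end}] chars={bad_chars!r}")
    print(f"reason: {err.reason}")
```

Only the first failing run is reported, because `strict` stops at the first error. The other handlers keep going and substitute or drop each unencodable character instead.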

In [None]:
# Encoding error handlers
text: str = "café 🐍"

handlers: list[tuple[str, str]] = [
    ("strict", "Raise UnicodeEncodeError (default)"),
    ("replace", "Replace each unencodable character with '?'"),
    ("ignore", "Silently drop unencodable characters"),
    ("xmlcharrefreplace", "Replace with XML character reference"),
    ("backslashreplace", "Replace with Python backslash escape"),
    ("namereplace", "Replace with \\N{...} escape"),
]

print(f"Encoding {text!r} to ASCII with different error handlers:\n")
for handler, desc in handlers:
    try:
        result = text.encode("ascii", errors=handler)
        print(f"{handler:<20} -> {result}")
        print(f"{'':>20}    {desc}")
    except UnicodeEncodeError as e:
        print(f"{handler:<20} -> UnicodeEncodeError: {e}")
        print(f"{'':>20}    {desc}")
    print()

# Verify behaviors from the test file
assert b"?" in "café".encode("ascii", errors="replace")
assert "café".encode("ascii", errors="ignore") == b"caf"
print("Error handling assertions passed.")

In [None]:
# Decoding error handlers work similarly
# Simulate corrupted UTF-8 data
bad_utf8: bytes = b"caf\xc3"  # Incomplete UTF-8 sequence (missing second byte of 'é')

print(f"Decoding corrupted bytes {bad_utf8!r}:\n")
for handler in ["strict", "replace", "ignore"]:
    try:
        result = bad_utf8.decode("utf-8", errors=handler)
        print(f"{handler:<10} -> {result!r}")
    except UnicodeDecodeError as e:
        print(f"{handler:<10} -> UnicodeDecodeError: {e}")

# Practical pattern: try UTF-8, fall back to Latin-1
def safe_decode(data: bytes, preferred: str = "utf-8") -> str:
    """Decode bytes, falling back to Latin-1, which never fails.

    The fallback cannot raise, but it may yield mojibake if the data
    was not actually Latin-1.
    """
    try:
        return data.decode(preferred)
    except UnicodeDecodeError:
        return data.decode("latin-1")  # Every byte maps to a code point

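# Hoisted out of the f-string: backslashes inside f-string expressions
# are a SyntaxError before Python 3.12
valid_utf8: bytes = b"caf\xc3\xa9"
print(f"\nSafe decode of valid UTF-8:   {safe_decode(valid_utf8)!r}")
print(f"Safe decode of invalid UTF-8: {safe_decode(bad_utf8)!r}  (Latin-1 fallback)")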

## Latin-1 Bijective Property

Latin-1 (ISO 8859-1) has a useful property: every byte value 0-255 maps to exactly one character (the code point with the same number, U+0000 to U+00FF), and each of those characters maps back to exactly one byte. This makes it useful as a "pass-through" encoding that never fails when decoding.
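One consequence of this one-to-one mapping is that decoding with the wrong codec loses no information. As a small sketch (variable names are illustrative), UTF-8 bytes misread as Latin-1 can be recovered by reversing the step:

```python
original = "caf\u00e9"                    # "caf" + e with acute accent
utf8_bytes = original.encode("utf-8")     # b'caf\xc3\xa9'

# Decoding with the wrong codec succeeds but produces mojibake
misread = utf8_bytes.decode("latin-1")
print(f"misread:   {misread!r}")

# Because Latin-1 is bijective, encoding back yields the exact original bytes
assert misread.encode("latin-1") == utf8_bytes
recovered = misread.encode("latin-1").decode("utf-8")
print(f"recovered: {recovered!r}")

# Each decoded character's code point equals the byte value it came from
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
```

The same repair does not work with lossy handlers such as `errors="replace"`, which discard the original byte values.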

In [None]:
# Latin-1 is bijective: every byte 0-255 maps to a character and back
all_pass = True
for i in range(256):
    byte_val: bytes = bytes([i])
    char: str = byte_val.decode("latin-1")
    roundtrip: bytes = char.encode("latin-1")
    if roundtrip != byte_val:
        all_pass = False
        print(f"FAILED at byte {i}")

print(f"Latin-1 bijective property verified for all 256 byte values: {all_pass}")

# Show a sample of the mapping
print(f"\nSample Latin-1 mappings:")
print(f"{'Byte':<8} {'Hex':<8} {'Char':<8} {'Name'}")
print("-" * 55)
import unicodedata
for i in [0, 32, 65, 97, 128, 169, 192, 223, 233, 255]:
    char = bytes([i]).decode("latin-1")
    name = unicodedata.name(char, f"<control U+{i:04X}>")
    display = char if char.isprintable() else "."  # Also masks C1 controls such as 0x80
    print(f"{i:<8} 0x{i:02X}{'':>4} {display:<8} {name}")

# This is why Latin-1 is a safe fallback -- it can decode ANY byte sequence
all_bytes: bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
print(f"\nDecoded all 256 bytes successfully: {len(decoded)} characters")

## struct Module for Binary Data

The `struct` module packs and unpacks binary data according to format strings. This is essential for reading/writing binary file formats and network protocols.
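Binary formats often store many fixed-size records back to back. As a minimal sketch, assuming a hypothetical record layout of a 2-byte id followed by a 4-byte value, a precompiled `struct.Struct` with `iter_unpack` walks such a buffer one record at a time:

```python
import struct

# Hypothetical record layout: big-endian unsigned short id + unsigned int value
RECORD = struct.Struct(">HI")

# Build a buffer of three consecutive records
payload: bytes = b"".join(RECORD.pack(i, i * 1000) for i in range(1, 4))
print(f"Record size: {RECORD.size} bytes, payload: {len(payload)} bytes")

# iter_unpack yields one tuple per record; the buffer length must be
# an exact multiple of the record size
records = list(RECORD.iter_unpack(payload))
for record_id, value in records:
    print(f"  id={record_id} value={value}")
```

Compiling the format once with `struct.Struct` avoids re-parsing the format string on every call, which matters when unpacking many records.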

In [None]:
import struct

# Pack Python values into binary format
# Format: '>' big-endian, 'H' unsigned short, 'I' unsigned int, 'f' float
packed: bytes = struct.pack(">HIf", 256, 100_000, 3.14)
print(f"Packed bytes: {packed.hex(' ')}")
print(f"Packed length: {len(packed)} bytes")

# Unpack binary data back to Python values
unpacked: tuple = struct.unpack(">HIf", packed)
print(f"\nUnpacked values: {unpacked}")
print(f"  unsigned short: {unpacked[0]}")
print(f"  unsigned int:   {unpacked[1]}")
print(f"  float:          {unpacked[2]:.2f}")

# Common format characters
formats: list[tuple[str, str, object]] = [
    ("b", "signed byte", -42),
    ("B", "unsigned byte", 255),
    ("h", "signed short (2 bytes)", -1000),
    ("H", "unsigned short (2 bytes)", 65535),
    ("i", "signed int (4 bytes)", -100_000),
    ("I", "unsigned int (4 bytes)", 3_000_000),
    ("f", "float (4 bytes)", 3.14),
    ("d", "double (8 bytes)", 3.141592653589793),
]

print(f"\n{'Format':<8} {'Type':<25} {'Value':<20} {'Packed Hex'}")
print("-" * 70)
for fmt, desc, val in formats:
    p = struct.pack(f">{fmt}", val)
    print(f"{fmt:<8} {desc:<25} {str(val):<20} {p.hex(' ')}")

In [None]:
import struct

# Practical example: parsing a simple binary header
# Imagine a binary file format with:
#   - Magic number: 2 bytes (unsigned short)
#   - Version: 1 byte (unsigned char)
#   - Record count: 4 bytes (unsigned int)
#   - Timestamp: 8 bytes (double)

HEADER_FORMAT: str = ">HBId"  # big-endian, no padding: 2 + 1 + 4 + 8 = 15 bytes
HEADER_SIZE: int = struct.calcsize(HEADER_FORMAT)

# Write a header
header: bytes = struct.pack(HEADER_FORMAT, 0xCAFE, 2, 1_000_000, 1708500000.0)
print(f"Header size: {HEADER_SIZE} bytes")
print(f"Header hex: {header.hex(' ')}")

# Parse the header back
magic, version, count, timestamp = struct.unpack(HEADER_FORMAT, header)
print(f"\nParsed header:")
print(f"  Magic:     0x{magic:04X}")
print(f"  Version:   {version}")
print(f"  Records:   {count:,}")
print(f"  Timestamp: {timestamp}")

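# struct.calcsize reports the byte count for a format.
# Without a byte-order prefix, sizes and alignment are platform-native,
# so use the same '>' prefix as the formats above to get standard sizes.
print(f"\nFormat sizes:")
for fmt in ["B", "H", "I", "Q", "f", "d"]:
    print(f"  {fmt}: {struct.calcsize('>' + fmt)} bytes")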

## memoryview for Zero-Copy Slicing

`memoryview` provides a way to access the internal data of an object (like `bytes`, `bytearray`, or `array.array`) without copying it. This is important for performance when working with large binary buffers.
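The difference from ordinary slicing, and one caveat, can be sketched as follows (variable names are illustrative). Slicing `bytes` produces an independent copy, while a `memoryview` slice keeps observing the original buffer. While any view is alive, the underlying `bytearray` cannot be resized.

```python
data = bytearray(b"0123456789")

# Slicing a bytes object copies: later changes to data are not reflected
copied = bytes(data)[2:5]

# Both the view and its slice are released when the with-block exits
with memoryview(data) as view, view[2:5] as window:
    data[2] = ord("X")            # in-place write, no resize: allowed
    seen_by_view = bytes(window)  # the slice sees the change

    try:
        data.extend(b"!")         # resizing while a view is exported fails
        resize_blocked = False
    except BufferError:
        resize_blocked = True

# After the views are released, resizing works again
data.extend(b"!")

print(f"copy:  {copied!r}")
print(f"view:  {seen_by_view!r}")
print(f"resize blocked while exported: {resize_blocked}")
print(f"final: {bytes(data)!r}")
```

Using the view as a context manager releases it deterministically, which is a reasonable habit when the buffer needs to grow later.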

In [None]:
import struct

# memoryview allows zero-copy slicing of binary data
data = bytearray(b"Hello, World! Extra data here...")
view = memoryview(data)

# Slicing a memoryview does NOT copy data
slice1 = view[0:5]
slice2 = view[7:12]
print(f"Original: {bytes(data)}")
print(f"Slice [0:5]: {bytes(slice1)}")
print(f"Slice [7:12]: {bytes(slice2)}")

# Modifying through the view modifies the original (zero-copy!)
slice1[0] = ord("h")
print(f"\nAfter modifying view: {bytes(data)}")
print("The original bytearray was modified through the memoryview.")

# Practical: parse fields from a binary buffer without copying
buffer = bytearray(struct.pack(">HI8s", 42, 100, b"TestData"))
mv = memoryview(buffer)

# Extract fields by slicing the memoryview (no copy)
field1 = struct.unpack(">H", mv[0:2])[0]
field2 = struct.unpack(">I", mv[2:6])[0]
field3 = bytes(mv[6:14])
print(f"\nParsed from memoryview:")
print(f"  Field 1 (ushort): {field1}")
print(f"  Field 2 (uint):   {field2}")
print(f"  Field 3 (bytes):  {field3}")

## BOM (Byte Order Mark) Handling

The Byte Order Mark (BOM) is a special Unicode character (U+FEFF) placed at the beginning of a text file to indicate encoding and byte order. UTF-16 and UTF-32 use BOMs to distinguish big-endian from little-endian. In UTF-8 the BOM carries no byte-order information and serves only as an encoding signature.
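On the decoding side, the generic `utf-16` codec reads the BOM to pick the byte order and then drops it, whereas the explicit-endian codecs treat it as an ordinary character. A short sketch with illustrative variable names:

```python
import codecs

# The same text in both byte orders, each prefixed with its BOM
payload_be: bytes = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
payload_le: bytes = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# The generic codec consumes the BOM and selects the matching byte order
decoded_be = payload_be.decode("utf-16")
decoded_le = payload_le.decode("utf-16")
print(f"utf-16 on BE payload: {decoded_be!r}")
print(f"utf-16 on LE payload: {decoded_le!r}")

# An explicit-endian codec keeps the BOM as a leading U+FEFF character
kept = payload_be.decode("utf-16-be")
print(f"utf-16-be on BE payload: {kept!r}")
```

This is the UTF-16 counterpart of choosing `utf-8-sig` over `utf-8` when a UTF-8 file may start with a BOM.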

In [None]:
import codecs

text: str = "Hello"

# UTF-16 encoding adds a BOM by default
utf16: bytes = text.encode("utf-16")
utf16_le: bytes = text.encode("utf-16-le")  # Little-endian, no BOM
utf16_be: bytes = text.encode("utf-16-be")  # Big-endian, no BOM

print(f"UTF-16 (with BOM): {utf16.hex(' ')}")
print(f"UTF-16-LE (no BOM): {utf16_le.hex(' ')}")
print(f"UTF-16-BE (no BOM): {utf16_be.hex(' ')}")

# The BOM bytes
print(f"\nBOM constants:")
print(f"  BOM_UTF16_LE: {codecs.BOM_UTF16_LE.hex(' ')}  (FF FE = little-endian)")
print(f"  BOM_UTF16_BE: {codecs.BOM_UTF16_BE.hex(' ')}  (FE FF = big-endian)")
print(f"  BOM_UTF8:     {codecs.BOM_UTF8.hex(' ')}  (EF BB BF)")

# Detect BOM in data
def detect_bom(data: bytes) -> str:
    """Detect the BOM at the start of binary data."""
    # Check UTF-32 before UTF-16: BOM_UTF32_LE (FF FE 00 00) begins with
    # BOM_UTF16_LE (FF FE), so the shorter match would otherwise shadow it.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    elif data.startswith(codecs.BOM_UTF32_LE):
        return "UTF-32 Little-Endian"
    elif data.startswith(codecs.BOM_UTF32_BE):
        return "UTF-32 Big-Endian"
    elif data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 Little-Endian"
    elif data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 Big-Endian"
    return "No BOM detected"

# Test BOM detection
test_data: list[tuple[str, bytes]] = [
    ("utf-8-sig", text.encode("utf-8-sig")),
    ("utf-16", text.encode("utf-16")),
    ("utf-8", text.encode("utf-8")),
]

print(f"\nBOM detection:")
for enc, data in test_data:
    print(f"  {enc:<12} -> {detect_bom(data):<25} hex: {data[:4].hex(' ')}")

In [None]:
import tempfile
import os

# Practical BOM handling: reading files with BOM
text: str = "Hello, World!"

# Write a file with UTF-8 BOM (common in Windows)
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as f:
    f.write(b"\xef\xbb\xbf")  # UTF-8 BOM
    f.write(text.encode("utf-8"))
    temp_path: str = f.name

try:
    # Reading with 'utf-8' keeps the BOM as a character
    with open(temp_path, "r", encoding="utf-8") as f:
        content_with_bom = f.read()
    print(f"Read with 'utf-8':     {content_with_bom!r}")
    print(f"Starts with BOM char:  {content_with_bom[0] == chr(0xFEFF)}")

    # Reading with 'utf-8-sig' automatically strips the BOM
    with open(temp_path, "r", encoding="utf-8-sig") as f:
        content_no_bom = f.read()
    print(f"\nRead with 'utf-8-sig': {content_no_bom!r}")
    print(f"BOM stripped:          {content_no_bom == text}")
finally:
    os.unlink(temp_path)

print("\nTip: Use 'utf-8-sig' when reading files that may have a UTF-8 BOM.")

## Summary

### Key Takeaways

- **UTF-8** uses 1-4 bytes per code point and is backward-compatible with ASCII; it is the dominant encoding on the modern web
- **Latin-1** is bijective -- every byte 0-255 maps to a character, making it useful as a safe fallback decoder
- **Encoding error handlers** (`strict`, `replace`, `ignore`, `xmlcharrefreplace`, `backslashreplace`, `namereplace`) control what happens when characters cannot be encoded
- The **`struct`** module packs/unpacks binary data to C-compatible formats -- essential for binary file formats and protocols
- **`memoryview`** provides zero-copy slicing of binary buffers, avoiding expensive copies for large data
- **BOM (Byte Order Mark)** indicates encoding and endianness; use `utf-8-sig` to handle UTF-8 files with BOM transparently
- Always be explicit about encodings -- never rely on platform defaults