# Chapter 30: Advanced Data Formats

This notebook covers advanced data processing patterns: CSV dialects and quoting, binary data packing with `struct`, property list files with `plistlib`, and data validation techniques.

## Key Concepts
- **CSV dialects**: Customizing delimiters, quoting, and escaping for non-standard CSV
- **struct module**: Packing and unpacking binary data with format strings
- **plistlib**: Reading and writing Apple property list files
- **Data validation**: Verifying data integrity after parsing

## Section 1: CSV Dialects and Custom Delimiters

The `csv` module supports more than just comma-separated files. You can customize the delimiter, quoting behavior, and line terminators using dialects or keyword arguments.

In [None]:
import csv
import io

# Pipe-delimited data (common in legacy systems)
pipe_data: str = "name|age|city\nAlice|30|NYC\nBob|25|LA"

reader = csv.DictReader(io.StringIO(pipe_data), delimiter="|")
rows: list[dict[str, str]] = list(reader)

print(f"Number of rows: {len(rows)}")
for row in rows:
    print(f"  {row['name']}, age {row['age']}, lives in {row['city']}")

# Verify specific values
print(f"\nFirst name: {rows[0]['name']}")
print(f"Second city: {rows[1]['city']}")

In [None]:
# Registering a custom dialect for reuse
csv.register_dialect(
    "pipes",
    delimiter="|",
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\n",
)

# Write using the custom dialect
output = io.StringIO()
writer = csv.writer(output, dialect="pipes")
writer.writerow(["product", "price", "quantity"])
writer.writerow(["Widget", "9.99", "100"])
writer.writerow(["Gadget", "19.99", "50"])

result: str = output.getvalue()
print("Custom dialect output:")
print(result)

# Read it back with the same dialect
reader = csv.DictReader(io.StringIO(result), dialect="pipes")
for row in reader:
    print(f"  {row['product']}: ${row['price']} x {row['quantity']}")

# Clean up
csv.unregister_dialect("pipes")
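
A dialect can also be written as a class and passed directly, which avoids the global registry. The following is a minimal sketch, assuming a semicolon-separated export; the `SemicolonDialect` name and the sample data are illustrative and not part of the cells above.

```python
import csv
import io


class SemicolonDialect(csv.excel):
    """Hypothetical dialect: Excel defaults, but with ';' as the separator."""

    delimiter = ";"
    lineterminator = "\n"


semicolon_data = 'id;comment\n1;"contains; a semicolon"\n2;plain\n'

# A Dialect subclass can be passed directly, without register_dialect()
semicolon_rows = list(
    csv.reader(io.StringIO(semicolon_data), dialect=SemicolonDialect)
)
print(semicolon_rows)

# Writing with the same class quotes only the field that needs it
out = io.StringIO()
csv.writer(out, dialect=SemicolonDialect).writerows(semicolon_rows)
print(out.getvalue())
```

Subclassing `csv.excel` inherits the remaining settings (quote character, doubled quotes), so only the attributes that differ need to be stated.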

In [None]:
# CSV quoting handles fields with special characters
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["name", "note"])
writer.writerow(["Alice", 'She said "hello"'])
writer.writerow(["Bob", "Line1\nLine2"])
writer.writerow(["Charlie", "simple"])

content: str = output.getvalue()
print("CSV with special characters:")
print(repr(content))
print()
print(content)

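# Embedded quotes are doubled
# (computed outside the f-string: reusing the same quote type inside an
# f-string expression is a SyntaxError before Python 3.12)
has_doubled_quotes: bool = '""hello""' in content
print(f"Contains doubled quotes: {has_doubled_quotes}")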
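
Reading the text back is a quick way to confirm that the quoting is lossless. This is a small round-trip sketch with illustrative variable names; it assumes an in-memory buffer rather than a real file.

```python
import csv
import io

original_rows = [
    ["name", "note"],
    ["Alice", 'She said "hello"'],
    ["Bob", "Line1\nLine2"],
]

# newline="" stops the text layer from translating line endings,
# which the csv documentation recommends for file objects
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(original_rows)

buffer.seek(0)
restored_rows = list(csv.reader(buffer))

print(restored_rows == original_rows)
print(len(restored_rows))  # 3 records, although the text spans more physical lines
```

The embedded newline is why CSV data should be parsed with `csv.reader` instead of being split on line breaks by hand.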

In [None]:
# Different quoting strategies
data_rows: list[list[str]] = [
    ["name", "value"],
    ["Alice", "100"],
    ["Bob", "has, comma"],
]

# QUOTE_MINIMAL: only quote when necessary (default)
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(data_rows)
print(f"QUOTE_MINIMAL:\n{buf.getvalue()}")

# QUOTE_ALL: quote every field
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(data_rows)
print(f"QUOTE_ALL:\n{buf.getvalue()}")

# QUOTE_NONNUMERIC: quote every field that is not an int or float.
# All values here are strings, so "100" is quoted as well.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerows(data_rows)
print(f"QUOTE_NONNUMERIC:\n{buf.getvalue()}")
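
`QUOTE_NONNUMERIC` only behaves differently from `QUOTE_ALL` when the rows contain real numbers. The sketch below uses illustrative data with `int` and `float` values, and also shows the reader side, where unquoted fields are converted to `float`.

```python
import csv
import io

mixed_rows = [["name", "value"], ["Alice", 100], ["Bob", 2.5]]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerows(mixed_rows)
nonnumeric_text = buf.getvalue()
print(nonnumeric_text)  # strings are quoted, numbers are not

# On the reading side, QUOTE_NONNUMERIC converts unquoted fields to float
parsed = list(
    csv.reader(io.StringIO(nonnumeric_text), quoting=csv.QUOTE_NONNUMERIC)
)
print(parsed)
```

Note that the integer `100` comes back as the float `100.0`, so this mode does not preserve the distinction between ints and floats.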

## Section 2: Binary Data with struct

The `struct` module converts between Python values and C-style binary data. This is essential for working with binary file formats, network protocols, and hardware interfaces.

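Format characters (the sizes shown are the standard sizes, which apply when the format starts with `>`, `<`, `!`, or `=`):
- `>` big-endian, `<` little-endian, `!` network byte order (same as big-endian)
- `h` short (2 bytes), `i` int (4 bytes), `q` long long (8 bytes)
- `B` unsigned char (1 byte), `H` unsigned short (2 bytes), `I` unsigned int (4 bytes)
- `f` float (4 bytes), `d` double (8 bytes)
- `s` char[] (a `bytes` value, used with a count such as `10s`), `?` bool (1 byte)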
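
The sizes listed above hold when the format string starts with an explicit byte-order character. Without one, `struct` uses the platform's native layout, which may add alignment padding. A short sketch of the difference (the native result depends on the platform, so no value is asserted for it here):

```python
import struct

# Standard layout: fields are packed back to back, 1 + 4 bytes
standard_size = struct.calcsize("<bi")

# Native layout ("@" is the default): the C compiler's alignment rules apply,
# so padding may be inserted between the byte and the int
native_size = struct.calcsize("@bi")

print(f"standard: {standard_size} bytes")
print(f"native:   {native_size} bytes")
```

For file formats and network protocols, an explicit prefix such as `>` or `<` keeps the layout identical across machines.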

In [None]:
import struct

# Pack a short integer and a float (big-endian)
packed: bytes = struct.pack(">hf", 42, 3.14)

print(f"Packed bytes: {packed}")
print(f"Hex: {packed.hex()}")
print(f"Length: {len(packed)} bytes")
print(f"Is bytes: {isinstance(packed, bytes)}")

# Unpack back to Python values
short_val: int
float_val: float
short_val, float_val = struct.unpack(">hf", packed)

print(f"\nUnpacked short: {short_val}")
print(f"Unpacked float: {float_val:.4f}")
print(f"Float close to 3.14: {abs(float_val - 3.14) < 0.01}")
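
The tolerance check above is needed because `f` stores a 32-bit float, while Python's `float` is 64-bit. A small sketch comparing `f` and `d` for the same value:

```python
import struct

value = 3.14

# Round-trip through 4-byte and 8-byte representations
as_float = struct.unpack(">f", struct.pack(">f", value))[0]
as_double = struct.unpack(">d", struct.pack(">d", value))[0]

print(as_float == value)   # False: precision was lost in the 4-byte float
print(as_double == value)  # True: the 8-byte double matches Python's float
print(abs(as_float - value) < 1e-6)
```

When exact round-trips matter, `d` is the safer choice; `f` halves the storage at the cost of precision.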

In [None]:
# calcsize tells you how many bytes a format requires
formats: dict[str, str] = {
    ">i": "int (4 bytes)",
    ">d": "double (8 bytes)",
    ">hh": "two shorts (4 bytes)",
    ">f": "float (4 bytes)",
    ">q": "long long (8 bytes)",
    ">10s": "10-char string (10 bytes)",
    ">?ii": "bool + two ints (9 bytes)",
}

print("Format sizes:")
for fmt, desc in formats.items():
    size: int = struct.calcsize(fmt)
    print(f"  {fmt:>8s} = {size:2d} bytes  ({desc})")

In [None]:
# Practical example: packing a simple binary message header
# Format: version (unsigned byte) + message_type (unsigned short)
#         + payload_length (unsigned int) + checksum (unsigned int)
HEADER_FORMAT: str = ">BHI I"  # Spaces are allowed for readability
HEADER_SIZE: int = struct.calcsize(HEADER_FORMAT)

# Pack a header
version: int = 1
msg_type: int = 42
payload_len: int = 256
checksum: int = 0xDEADBEEF

header: bytes = struct.pack(HEADER_FORMAT, version, msg_type, payload_len, checksum)
print(f"Header ({HEADER_SIZE} bytes): {header.hex()}")

# Unpack the header
v, mt, pl, cs = struct.unpack(HEADER_FORMAT, header)
print(f"\nVersion: {v}")
print(f"Message type: {mt}")
print(f"Payload length: {pl}")
print(f"Checksum: 0x{cs:08X}")
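
A header like this is usually followed by a payload, and a parser should check lengths before trusting the data. The sketch below is one possible approach, not a definitive protocol: `build_message` and `parse_message` are hypothetical helpers, and using CRC-32 from `zlib` as the checksum is an assumption, since the header above does not specify an algorithm.

```python
import struct
import zlib

# Precompiled struct with the same layout as the header format above
HEADER = struct.Struct(">BHII")


def build_message(version: int, msg_type: int, payload: bytes) -> bytes:
    """Hypothetical helper: prepend a header to the payload."""
    checksum = zlib.crc32(payload)  # assumption: CRC-32 over the payload
    return HEADER.pack(version, msg_type, len(payload), checksum) + payload


def parse_message(data: bytes) -> tuple[int, int, bytes]:
    """Hypothetical helper: validate and split a message."""
    if len(data) < HEADER.size:
        raise ValueError(f"need {HEADER.size} header bytes, got {len(data)}")

    version, msg_type, length, checksum = HEADER.unpack_from(data, 0)
    payload = data[HEADER.size:HEADER.size + length]

    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch")
    return version, msg_type, payload


message = build_message(1, 42, b"hello")
print(parse_message(message))
```

`struct.Struct` compiles the format once, and `unpack_from` reads from an offset without slicing the header out first.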

In [None]:
# Struct with iter_unpack for repeated structures
# Pack a sequence of (x, y) coordinate pairs
POINT_FORMAT: str = ">ff"  # Two floats per point

points: list[tuple[float, float]] = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
packed_points: bytes = b"".join(
    struct.pack(POINT_FORMAT, x, y) for x, y in points
)

print(f"Packed {len(points)} points in {len(packed_points)} bytes")

# Unpack the points one record at a time with iter_unpack (it returns an iterator)
unpacked: list[tuple[float, float]] = [
    (round(x, 1), round(y, 1))
    for x, y in struct.iter_unpack(POINT_FORMAT, packed_points)
]
print(f"Unpacked points: {unpacked}")
print(f"Match original: {unpacked == points}")
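
`iter_unpack` requires the buffer length to be an exact multiple of the record size and raises `struct.error` otherwise. Checking the length first gives a clearer message for damaged input. A minimal sketch, where `read_points` is a hypothetical helper:

```python
import struct

POINT = struct.Struct(">ff")  # one (x, y) pair per record

good = POINT.pack(1.0, 2.0) + POINT.pack(3.0, 4.0)
damaged = good + b"\x00"  # one stray byte at the end


def read_points(buffer: bytes) -> list[tuple[float, float]]:
    """Hypothetical helper: decode points, rejecting partial records."""
    whole, leftover = divmod(len(buffer), POINT.size)
    if leftover:
        raise ValueError(
            f"{leftover} trailing byte(s) after {whole} complete point(s)"
        )
    return list(POINT.iter_unpack(buffer))


print(read_points(good))

try:
    read_points(damaged)
except ValueError as exc:
    print(f"Rejected: {exc}")
```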

## Section 3: Property Lists with plistlib

The `plistlib` module reads and writes Apple property list files, commonly used in macOS and iOS for configuration and data storage. Plists support dicts (with string keys), lists, strings, ints, floats, booleans, bytes, and datetime objects.

In [None]:
import plistlib
from datetime import datetime

# Create a property list
app_config: dict = {
    "AppName": "MyApp",
    "Version": 2,
    "Debug": False,
    "MaxRetries": 3,
    "Timeout": 30.5,
    "Features": ["dark_mode", "notifications", "sync"],
    "BuildDate": datetime(2025, 1, 15, 12, 0, 0),
}

# Serialize to XML plist format
plist_bytes: bytes = plistlib.dumps(app_config, fmt=plistlib.FMT_XML)
print("XML plist output:")
print(plist_bytes.decode("utf-8"))

In [None]:
# Parse the plist back
loaded: dict = plistlib.loads(plist_bytes)

print("Loaded plist data:")
for key, value in loaded.items():
    print(f"  {key}: {value!r} ({type(value).__name__})")

# Types are preserved
print(f"\nVersion is int: {isinstance(loaded['Version'], int)}")
print(f"Debug is bool: {isinstance(loaded['Debug'], bool)}")
print(f"Features is list: {isinstance(loaded['Features'], list)}")
print(f"BuildDate is datetime: {isinstance(loaded['BuildDate'], datetime)}")
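
The same round trip works with files through `plistlib.dump()` and `plistlib.load()`, which expect file objects opened in binary mode. A short sketch using a temporary directory; the `settings.plist` file name and the sample data are illustrative.

```python
import plistlib
import tempfile
from pathlib import Path

settings = {"AppName": "MyApp", "Version": 2, "Features": ["sync"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    plist_path = Path(tmp_dir) / "settings.plist"  # hypothetical file name

    # Plist files are always opened in binary mode, for both formats
    with plist_path.open("wb") as fp:
        plistlib.dump(settings, fp, fmt=plistlib.FMT_BINARY)

    with plist_path.open("rb") as fp:
        restored = plistlib.load(fp)  # the format is detected automatically

print(restored == settings)
```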

In [None]:
# Binary plist format (more compact)
binary_plist: bytes = plistlib.dumps(app_config, fmt=plistlib.FMT_BINARY)
xml_plist: bytes = plistlib.dumps(app_config, fmt=plistlib.FMT_XML)

print(f"XML plist size:    {len(xml_plist)} bytes")
print(f"Binary plist size: {len(binary_plist)} bytes")
print(f"Savings: {len(xml_plist) - len(binary_plist)} bytes")

# Both formats parse to identical data
from_binary: dict = plistlib.loads(binary_plist)
from_xml: dict = plistlib.loads(xml_plist)
print(f"\nBinary == XML: {from_binary == from_xml}")
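
Values outside the supported types are rejected at serialization time. `None` is the most common case, since plists have no null value. The sketch below shows the failure and one possible workaround; `drop_none` is a hypothetical helper, and whether dropping keys is acceptable depends on the application.

```python
import plistlib

raw_settings = {"Theme": "dark", "Proxy": None, "Icon": b"\x89PNG"}

try:
    plistlib.dumps(raw_settings)
    none_rejected = False
except TypeError as exc:
    none_rejected = True
    print(f"Rejected: {exc}")


def drop_none(value):
    """Hypothetical helper: recursively remove None values from dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value if v is not None]
    return value


cleaned = drop_none(raw_settings)
round_tripped = plistlib.loads(plistlib.dumps(cleaned))
print(round_tripped)
```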

## Section 4: Data Validation Patterns

After parsing data from any format (CSV, XML, config files), you should validate it before use. Here are common patterns for data validation in Python.

In [None]:
from typing import Any


def validate_record(record: dict[str, str]) -> list[str]:
    """Validate a parsed CSV record and return a list of errors."""
    errors: list[str] = []

    # Required fields
    required: list[str] = ["name", "age", "email"]
    for field in required:
        if field not in record or not record[field].strip():
            errors.append(f"Missing required field: {field}")

    # Type validation
    if "age" in record and record["age"].strip():
        try:
            age: int = int(record["age"])
            if not (0 <= age <= 150):
                errors.append(f"Age out of range: {age}")
        except ValueError:
            errors.append(f"Invalid age: {record['age']}")

    # Format validation
    if "email" in record and record["email"].strip():
        if "@" not in record["email"]:
            errors.append(f"Invalid email: {record['email']}")

    return errors


# Test with various records
test_records: list[dict[str, str]] = [
    {"name": "Alice", "age": "30", "email": "alice@example.com"},
    {"name": "Bob", "age": "abc", "email": "bob@example.com"},
    {"name": "", "age": "25", "email": "no-at-sign"},
    {"name": "Diana", "age": "200", "email": "diana@example.com"},
]

for record in test_records:
    errors = validate_record(record)
    name: str = record.get("name") or "(empty)"
    if errors:
        print(f"{name}: INVALID - {errors}")
    else:
        print(f"{name}: VALID")
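
When records come straight from `csv.DictReader`, a short row yields `None` for the missing columns, so a validator has to tolerate `None` as well as empty strings. The sketch below is a simplified variant of the validator above (`check_row` is a hypothetical name) that also records the line number of each problem.

```python
import csv
import io

REQUIRED = ("name", "age", "email")


def check_row(row: dict) -> list[str]:
    """Hypothetical validator that tolerates None values from short rows."""
    errors: list[str] = []
    for key in REQUIRED:
        if not (row.get(key) or "").strip():
            errors.append(f"missing {key}")

    age_text = (row.get("age") or "").strip()
    if age_text:
        try:
            int(age_text)
        except ValueError:
            errors.append(f"invalid age: {age_text}")
    return errors


csv_text = (
    "name,age,email\n"
    "Alice,30,alice@example.com\n"
    "Bob,abc\n"                      # short row: email becomes None
    ",25,carol@example.com\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
problems: dict[int, list[str]] = {}
for row in reader:
    row_errors = check_row(row)
    if row_errors:
        problems[reader.line_num] = row_errors

for line_number, row_errors in problems.items():
    print(f"line {line_number}: {row_errors}")
```

Keying the report by `reader.line_num` makes it easy to point a user at the offending line in the source file.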

In [None]:
from dataclasses import dataclass


@dataclass
class ValidatedConfig:
    """A configuration object that validates its fields on creation."""

    host: str
    port: int
    database: str
    pool_size: int = 5
    timeout: float = 30.0

    def __post_init__(self) -> None:
        """Validate all fields after initialization."""
        errors: list[str] = []

        if not self.host:
            errors.append("host cannot be empty")
        if not (1 <= self.port <= 65535):
            errors.append(f"port must be 1-65535, got {self.port}")
        if not self.database:
            errors.append("database cannot be empty")
        if self.pool_size < 1:
            errors.append(f"pool_size must be >= 1, got {self.pool_size}")
        if self.timeout <= 0:
            errors.append(f"timeout must be > 0, got {self.timeout}")

        if errors:
            raise ValueError(f"Invalid config: {'; '.join(errors)}")


# Valid configuration
config = ValidatedConfig(host="localhost", port=5432, database="mydb")
print(f"Valid config: {config}")

# Invalid configurations
invalid_cases: list[dict[str, Any]] = [
    {"host": "", "port": 5432, "database": "mydb"},
    {"host": "localhost", "port": 99999, "database": "mydb"},
    {"host": "localhost", "port": 5432, "database": "mydb", "pool_size": -1},
]

for case in invalid_cases:
    try:
        ValidatedConfig(**case)
    except ValueError as e:
        print(f"Rejected: {e}")
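
Dataclasses do not enforce their type annotations, so values parsed from text (for example a CSV row) need to be converted before the range checks run. The following sketch shows one way to do that with an alternate constructor; `Endpoint` and `from_strings` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical config object built from string input."""

    host: str
    port: int

    def __post_init__(self) -> None:
        # Annotations are not checked at runtime, so verify the type explicitly
        if not isinstance(self.port, int):
            raise ValueError(f"port must be an int, got {type(self.port).__name__}")
        if not self.host:
            raise ValueError("host cannot be empty")
        if not (1 <= self.port <= 65535):
            raise ValueError(f"port must be 1-65535, got {self.port}")

    @classmethod
    def from_strings(cls, raw: dict[str, str]) -> "Endpoint":
        """Convert string fields to their target types, then validate."""
        try:
            port = int(raw["port"])
        except (KeyError, ValueError) as exc:
            raise ValueError(
                f"port missing or not an integer: {raw.get('port')!r}"
            ) from exc
        return cls(host=raw.get("host", "").strip(), port=port)


endpoint = Endpoint.from_strings({"host": "db.internal", "port": "5432"})
print(endpoint)

try:
    Endpoint.from_strings({"host": "db.internal", "port": "abc"})
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Keeping conversion in a classmethod leaves `__post_init__` responsible only for validating already-typed values.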

In [None]:
import xml.etree.ElementTree as ET


def validate_xml_schema(root: ET.Element, required_children: list[str]) -> list[str]:
    """Basic XML structure validation: check for required child elements."""
    errors: list[str] = []

    for child_tag in required_children:
        found: ET.Element | None = root.find(child_tag)
        if found is None:
            errors.append(f"Missing required element: <{child_tag}>")
        elif not (found.text and found.text.strip()):
            errors.append(f"Empty required element: <{child_tag}>")

    return errors


# Validate XML documents
good_xml: str = "<user><name>Alice</name><email>alice@test.com</email></user>"
bad_xml: str = "<user><name>Bob</name></user>"
empty_xml: str = "<user><name></name><email>test@test.com</email></user>"

required: list[str] = ["name", "email"]

for label, xml_str in [("Good", good_xml), ("Missing", bad_xml), ("Empty", empty_xml)]:
    root = ET.fromstring(xml_str)
    errors = validate_xml_schema(root, required)
    status: str = "VALID" if not errors else f"INVALID: {errors}"
    print(f"{label}: {status}")
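
Structural checks only make sense once the document has parsed. `ET.fromstring` raises `ET.ParseError` for input that is not well-formed, so a validator can report that case as an error instead of crashing. A minimal sketch, where `parse_and_check` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def parse_and_check(xml_text: str, required_children: list[str]) -> list[str]:
    """Hypothetical helper: report parse errors and missing or empty children."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"Not well-formed XML: {exc}"]

    # findtext returns None for a missing element and "" for an empty one
    return [
        f"Missing or empty element: <{tag}>"
        for tag in required_children
        if not (root.findtext(tag) or "").strip()
    ]


samples = {
    "Good": "<user><name>Alice</name><email>alice@test.com</email></user>",
    "Unclosed": "<user><name>Bob</name>",
    "Empty": "<user><name></name><email>test@test.com</email></user>",
}

for label, xml_text in samples.items():
    print(f"{label}: {parse_and_check(xml_text, ['name', 'email'])}")
```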

## Summary

### CSV Dialects
- **`delimiter`**: Change separator from comma to pipe, tab, etc.
- **`csv.DictReader`/`csv.DictWriter`**: Row-level dict access with headers
- **`register_dialect()`**: Define reusable dialect configurations
- **Quoting modes**: `QUOTE_MINIMAL`, `QUOTE_ALL`, `QUOTE_NONNUMERIC`
- Embedded quotes are doubled inside a quoted field: `"She said ""hello"""`

### Binary struct Packing
- **`struct.pack(fmt, ...)`**: Convert Python values to bytes
- **`struct.unpack(fmt, data)`**: Convert bytes back to Python values
- **`struct.calcsize(fmt)`**: Calculate the byte size of a format
- **`struct.iter_unpack(fmt, data)`**: Iterate over repeated structures
- Byte order: `>` big-endian, `<` little-endian, `!` network

### plistlib
- **`plistlib.dumps()`/`plistlib.loads()`**: Serialize/deserialize plists
- Supports XML (`FMT_XML`) and binary (`FMT_BINARY`) formats
- Native types: dict, list, str, int, float, bool, bytes, datetime

### Data Validation
- Always validate parsed data before use
- Check required fields, types, ranges, and formats
- Define `__post_init__()` on a dataclass to build self-validating objects
- For batch validation, collect errors in a list instead of raising on the first one