# Data Serialization

**Chapter 8 - Learning Python, 5th Edition**

Serialization converts Python objects into a format that can be stored or transmitted,
then reconstructed later. Python provides built-in support for JSON (human-readable,
cross-language), CSV (tabular data), pickle (Python-native), and struct (binary
packing). Choosing the right format depends on interoperability, readability,
performance, and security requirements.

## Section 1: JSON - Human-Readable, Cross-Language

JSON (JavaScript Object Notation) is the most common serialization format for
web APIs and configuration files. Python's `json` module maps between JSON types
and Python types:

| JSON | Python |
|------|--------|
| object | `dict` |
| array | `list` |
| string | `str` |
| number (int) | `int` |
| number (real) | `float` |
| true/false | `True`/`False` |
| null | `None` |
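
The mapping is not fully symmetric in the Python-to-JSON direction: tuples are written as arrays, and non-string dictionary keys are converted to strings, so a round trip can change the data. A minimal sketch (the `original` dictionary is made up for illustration):

```python
import json

# A tuple value and an integer key: JSON has no equivalent for either.
original = {"point": (1, 2), 1: "one", "nothing": None}

text = json.dumps(original)
restored = json.loads(text)

print(text)                  # {"point": [1, 2], "1": "one", "nothing": null}
print(restored == original)  # False: the tuple became a list, the key became "1"
```

If tuples or integer keys matter, convert them back explicitly after loading.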

In [None]:
import json
from pathlib import Path
import tempfile

work_dir = Path(tempfile.mkdtemp(prefix="ch08_serial_"))

# Basic serialization: dumps() -> string, dump() -> file
data: dict[str, object] = {
    "name": "Learning Python",
    "edition": 5,
    "price": 59.99,
    "in_print": True,
    "topics": ["types", "functions", "classes", "modules"],
    "metadata": None,
}

# Serialize to string
json_str: str = json.dumps(data, indent=2)
print(f"json.dumps() output:\n{json_str}")

# Deserialize from string
restored: dict = json.loads(json_str)
print(f"\njson.loads() type: {type(restored).__name__}")
print(f"Round-trip equal: {data == restored}")

# Serialize to file
json_file = work_dir / "book.json"
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Deserialize from file
with open(json_file, "r", encoding="utf-8") as f:
    from_file: dict = json.load(f)

print(f"\nFile round-trip equal: {data == from_file}")

# Useful options
compact: str = json.dumps(data, separators=(",", ":"))
print(f"\nCompact (no spaces): {compact[:60]}...")

sorted_keys: str = json.dumps(data, sort_keys=True, indent=2)
print(f"\nSorted keys:\n{sorted_keys}")

## Section 2: JSON Custom Encoders and Decoders

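JSON supports only the basic types listed in Section 1. To serialize `datetime`,
dataclass instances, `set`, `Path`, and other Python types, you need a custom
encoder, plus a matching decoder if you want to reconstruct the original objects.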
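
To see why an encoder is needed, the sketch below passes a `date` to `json.dumps()` with no customization and catches the resulting `TypeError`. The `record` dictionary is illustrative, and `default=str` is only a quick fallback: it produces a plain string that will not be turned back into a `date` on loading.

```python
import json
from datetime import date

record = {"title": "Python Meetup", "when": date(2024, 6, 15)}

error_message = ""
try:
    json.dumps(record)
except TypeError as exc:
    error_message = str(exc)
    print(f"Failed: {error_message}")

# Fallback: call str() on anything json cannot encode itself.
fallback = json.dumps(record, default=str)
print(fallback)  # {"title": "Python Meetup", "when": "2024-06-15"}
```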

In [None]:
from datetime import datetime, date
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class Event:
    """An event with a date and attendees."""
    title: str
    date: date
    attendees: set[str]
    location: Path | None = None


# Custom encoder: handles types JSON doesn't natively support
class EnhancedEncoder(json.JSONEncoder):
    """JSON encoder that handles datetime, set, Path, and dataclasses."""

    def default(self, obj: Any) -> Any:
        # Check datetime first: datetime is a subclass of date
        if isinstance(obj, datetime):
            return {"__type__": "datetime", "iso": obj.isoformat()}
        if isinstance(obj, date):
            return {"__type__": "date", "iso": obj.isoformat()}
        if isinstance(obj, set):
            return {"__type__": "set", "items": sorted(obj)}
        if isinstance(obj, Path):
            return {"__type__": "Path", "path": str(obj)}
        if hasattr(obj, "__dataclass_fields__"):
            return {"__type__": type(obj).__name__, **asdict(obj)}
        return super().default(obj)


# Custom decoder hook: reconstructs Python objects from JSON
def enhanced_decoder(dct: dict) -> Any:
    """Object hook that reconstructs custom types from JSON."""
    type_tag = dct.get("__type__")
    if type_tag == "datetime":
        return datetime.fromisoformat(dct["iso"])
    if type_tag == "date":
        return date.fromisoformat(dct["iso"])
    if type_tag == "set":
        return set(dct["items"])
    if type_tag == "Path":
        return Path(dct["path"])
    if type_tag == "Event":
        return Event(
            title=dct["title"],
            date=dct["date"],
            attendees=dct["attendees"],
            location=dct.get("location"),
        )
    return dct


# Create an Event with non-JSON-native types
event = Event(
    title="Python Meetup",
    date=date(2024, 6, 15),
    attendees={"Alice", "Bob", "Charlie"},
    location=Path("/home/user/events"),
)

# Encode
encoded: str = json.dumps(event, cls=EnhancedEncoder, indent=2)
print(f"Encoded Event:\n{encoded}")

# Decode
decoded: Event = json.loads(encoded, object_hook=enhanced_decoder)
print(f"\nDecoded type: {type(decoded).__name__}")
print(f"Decoded:      {decoded}")
print(f"Date type:    {type(decoded.date).__name__}")
print(f"Attendees:    {type(decoded.attendees).__name__} = {decoded.attendees}")
print(f"Location:     {type(decoded.location).__name__} = {decoded.location}")

# Alternative: using the default= parameter (simpler for one-off use)
simple_data = {"timestamp": datetime.now(), "values": {1, 2, 3}}

def quick_serializer(obj: Any) -> Any:
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, set):
        return sorted(obj)  # sorted so the output order is deterministic
    raise TypeError(f"Not serializable: {type(obj)}")

quick_json = json.dumps(simple_data, default=quick_serializer)
print(f"\nQuick serialize: {quick_json}")

## Section 3: CSV - Tabular Data

CSV (Comma-Separated Values) is ubiquitous for tabular data exchange. Python's
`csv` module handles quoting, escaping, and dialect differences.

- `csv.reader` / `csv.writer` - work with lists of values
- `csv.DictReader` / `csv.DictWriter` - work with dictionaries (named columns)
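
As a small illustration of the quoting and escaping the module handles, the sketch below writes one row containing a comma and double quotes to an in-memory buffer and reads it back. It assumes the default `excel` dialect; the row values are made up.

```python
import csv
from io import StringIO

buffer = StringIO()
csv.writer(buffer).writerow([3, 'Cable, 6ft "USB-C"', 12.5])
line = buffer.getvalue()

# The field with a comma is wrapped in quotes; embedded quotes are doubled.
print(repr(line))  # '3,"Cable, 6ft ""USB-C""",12.5\r\n'

parsed = next(csv.reader(StringIO(line)))
print(parsed)      # ['3', 'Cable, 6ft "USB-C"', '12.5']
```

The text survives the round trip, but the numbers come back as strings, so any type conversion is the caller's job.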

In [None]:
import csv
from io import StringIO

# --- csv.writer / csv.reader ---
csv_file = work_dir / "products.csv"

products: list[list[str | float]] = [
    ["id", "name", "price", "category"],
    [1, "Widget", 9.99, "Hardware"],
    [2, "Gadget", 24.99, "Electronics"],
    [3, 'Cable, 6ft "USB-C"', 12.50, "Accessories"],  # Commas and quotes in data
    [4, "Battery Pack", 39.99, "Electronics"],
]

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(products)

# Read back with csv.reader
with open(csv_file, "r", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header: list[str] = next(reader)
    rows: list[list[str]] = list(reader)

print(f"Header: {header}")
for row in rows:
    print(f"  {row}")

# Note: csv.reader returns all values as strings
print(f"\nPrice type from csv.reader: {type(rows[0][2]).__name__} (always str)")

# --- csv.DictWriter / csv.DictReader (named columns) ---
employees_file = work_dir / "employees.csv"

employees: list[dict[str, str | int]] = [
    {"name": "Alice", "department": "Engineering", "salary": 95000},
    {"name": "Bob", "department": "Marketing", "salary": 72000},
    {"name": "Charlie", "department": "Engineering", "salary": 88000},
    {"name": "Diana", "department": "Sales", "salary": 68000},
]

fieldnames: list[str] = ["name", "department", "salary"]

with open(employees_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(employees)

# Read back with DictReader
print("\nDictReader output:")
with open(employees_file, "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    print(f"  Fieldnames: {reader.fieldnames}")
    for row in reader:
        # row is a plain dict (Python 3.8+; an OrderedDict in 3.6 and 3.7)
        print(f"  {row['name']:10s} {row['department']:15s} ${int(row['salary']):,}")

# Other delimiters: pass delimiter= (or a dialect such as csv.excel_tab)
tsv_data = "name\tage\tcity\nAlice\t30\tNew York\nBob\t25\tLondon\n"
reader = csv.reader(StringIO(tsv_data), delimiter="\t")
print("\nTSV (tab-separated):")
for row in reader:
    print(f"  {row}")

## Section 4: Pickle - Python-Native Serialization

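The `pickle` module serializes most Python objects, including instances of your
own classes, to a binary format. It preserves object types, shared references,
and structure. Functions and classes are pickled by reference (their qualified
name), so the code that defines them must be importable when the data is loaded.
Pickle also comes with important caveats: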

**Security Warning**: Never unpickle data from untrusted sources. Pickle can execute
arbitrary code during deserialization, making it a potential attack vector.
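
One way to narrow (not eliminate) this risk is to subclass `pickle.Unpickler` and override `find_class()` so that only an allow-list of globals can be loaded. The sketch below is adapted from the "restricting globals" pattern in the `pickle` documentation. The names `SAFE_BUILTINS`, `RestrictedUnpickler`, `restricted_loads`, and `Sneaky` are illustrative, and the "payload" is a harmless `print` call.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these builtins may be referenced by a pickle.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module: str, name: str):
        # Called whenever the pickle stream asks for a global (class/function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads(), but with the allow-list applied."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Sneaky:
    # __reduce__ tells pickle to call print(...) when the object is loaded.
    def __reduce__(self):
        return (print, ("this ran during unpickling",))


harmless = restricted_loads(pickle.dumps({"ids": [1, 2, 3], "span": range(5)}))
print(harmless)

try:
    restricted_loads(pickle.dumps(Sneaky()))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Blocked: {exc}")
```

With plain `pickle.loads()`, the `Sneaky` payload would have called `print` during loading. An allow-list reduces the attack surface, but for data from outside your control a format such as JSON remains the safer choice.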

In [None]:
import pickle
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UserSession:
    """A user session with complex nested data."""
    user_id: int
    username: str
    login_time: datetime
    permissions: set[str]
    preferences: dict[str, Any]


session = UserSession(
    user_id=42,
    username="alice",
    login_time=datetime(2024, 1, 15, 10, 30, 0),
    permissions={"read", "write", "admin"},
    preferences={
        "theme": "dark",
        "font_size": 14,
        "recent_files": [Path("/home/alice/doc.txt"), Path("/tmp/data.csv")],
    },
)

# Serialize to bytes
pickled: bytes = pickle.dumps(session)
print(f"Pickled size: {len(pickled)} bytes")
print(f"Pickled type: {type(pickled).__name__}")
print(f"First 50 bytes: {pickled[:50]!r}")

# Deserialize from bytes
restored_session: UserSession = pickle.loads(pickled)
print(f"\nRestored type: {type(restored_session).__name__}")
print(f"Restored: {restored_session}")
print(f"Permissions type: {type(restored_session.permissions).__name__}")
print(f"Login time type: {type(restored_session.login_time).__name__}")

# Serialize to file
pickle_file = work_dir / "session.pkl"
with open(pickle_file, "wb") as f:
    pickle.dump(session, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(pickle_file, "rb") as f:
    # A new name avoids rebinding from_file, which held a dict in Section 1
    session_from_file: UserSession = pickle.load(f)

print(f"\nFile round-trip: {session == session_from_file}")
print(f"Pickle protocol: {pickle.HIGHEST_PROTOCOL}")

# Pickle preserves object identity and circular references
a_list: list[Any] = [1, 2, 3]
circular: dict[str, Any] = {"data": a_list, "same_data": a_list}
circular["self"] = circular  # Circular reference!

pickled_circular = pickle.dumps(circular)
restored_circular = pickle.loads(pickled_circular)
print(f"\nCircular reference preserved: {restored_circular['self'] is restored_circular}")
print(f"Shared reference preserved: {restored_circular['data'] is restored_circular['same_data']}")

# SECURITY WARNING: a printed reminder (nothing is actually exploited here)
print("\n" + "=" * 50)
print("SECURITY WARNING:")
print("  pickle.loads() can execute arbitrary code.")
print("  NEVER unpickle data from untrusted sources.")
print("  Use JSON for data exchange with external systems.")
print("  Pickle is safe ONLY for data you created and nobody else could modify.")
print("=" * 50)

## Section 5: The `struct` Module - Binary Data Packing

The `struct` module converts between Python values and C-style binary data.
This is essential for reading/writing binary file formats, network protocols,
and hardware interfaces.

Common format characters:

| Format | C Type | Python Type | Standard size (bytes) |
|--------|--------|-------------|-----------------------|
| `b`/`B` | signed/unsigned char | int | 1 |
| `h`/`H` | short/unsigned short | int | 2 |
| `i`/`I` | int/unsigned int | int | 4 |
| `f` | float | float | 4 |
| `d` | double | float | 8 |
| `?` | _Bool | bool | 1 |
| `s` | char[] | bytes | n (set by a count prefix, e.g. `10s`) |

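Byte order prefixes: `<` little-endian, `>` big-endian, `!` network (big-endian), `=` native byte order with standard sizes, `@` native byte order, sizes, and alignment (the default when no prefix is given)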
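
The effect of the prefix can be checked directly. The sketch below packs the same integer with different byte orders, then compares sizes with and without native alignment (the `@` prefix, which is also what you get with no prefix at all). The native size is platform-dependent, so no value is asserted for it.

```python
import struct

value = 1
little = struct.pack("<I", value)
big = struct.pack(">I", value)
network = struct.pack("!I", value)

print(little.hex(" "))  # 01 00 00 00
print(big.hex(" "))     # 00 00 00 01
print(network == big)   # True: network order is big-endian

# Standard sizes, no padding: 1 byte + 4 bytes
print(struct.calcsize("<bi"))  # 5
print(struct.calcsize("=bi"))  # 5

# Native mode may insert padding so the int is aligned (often 8 in total)
native_size = struct.calcsize("@bi")
print(native_size)
```

For file formats and network protocols, an explicit `<`, `>`, or `!` prefix keeps the layout identical across machines.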

In [None]:
import struct

# Basic packing and unpacking
# Format: int (4 bytes) + float (4 bytes) + bool (1 byte)
fmt: str = "<if?"  # Little-endian: int, float, bool
packed: bytes = struct.pack(fmt, 42, 3.14, True)
print(f"Format: {fmt!r}")
print(f"Packed: {packed!r} ({len(packed)} bytes)")
print(f"Calc size: {struct.calcsize(fmt)} bytes")

unpacked: tuple = struct.unpack(fmt, packed)
print(f"Unpacked: {unpacked}")

# Practical example: binary record format
# Sensor data: timestamp (uint32), sensor_id (uint16), value (float32), status (uint8)
SENSOR_FMT: str = "!IHfB"  # Network byte order (big-endian)
SENSOR_SIZE: int = struct.calcsize(SENSOR_FMT)

print(f"\nSensor record format: {SENSOR_FMT!r} ({SENSOR_SIZE} bytes per record)")

# Write sensor data to binary file
sensor_records: list[tuple[int, int, float, int]] = [
    (1705300000, 1, 23.5, 0),   # timestamp, sensor_id, value, status
    (1705300060, 1, 23.7, 0),
    (1705300120, 2, 45.2, 1),   # status=1 means warning
    (1705300180, 1, 23.4, 0),
    (1705300240, 2, 98.6, 2),   # status=2 means error
]

sensor_file = work_dir / "sensors.bin"
with open(sensor_file, "wb") as f:
    for record in sensor_records:
        f.write(struct.pack(SENSOR_FMT, *record))

# Read sensor data back
print(f"\nReading {sensor_file.stat().st_size} bytes ({sensor_file.stat().st_size // SENSOR_SIZE} records):")
status_names = {0: "OK", 1: "WARN", 2: "ERROR"}

with open(sensor_file, "rb") as f:
    while True:
        chunk: bytes = f.read(SENSOR_SIZE)
        if len(chunk) < SENSOR_SIZE:  # end of file, or a truncated last record
            break
        ts, sid, val, status = struct.unpack(SENSOR_FMT, chunk)
        dt = datetime.fromtimestamp(ts)
        print(f"  [{dt:%H:%M:%S}] sensor={sid} value={val:6.1f} status={status_names[status]}")

# Struct objects compile the format string once, which suits repeated pack/unpack calls
sensor_struct = struct.Struct(SENSOR_FMT)
print(f"\nStruct object size: {sensor_struct.size} bytes")
print(f"Struct format: {sensor_struct.format!r}")

packed_one = sensor_struct.pack(1705300300, 3, 72.1, 0)
print(f"Packed with Struct object: {packed_one.hex(' ')}")

## Section 6: Practical Pattern - Multi-Format Configuration Loader

The utility below loads configuration from JSON, CSV, or INI files, choosing the
parser from the file extension. The pattern shows how several serialization
modules can sit behind one cohesive interface.
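
CSV and INI files store every value as text, so a loader has to convert strings back to the types the configuration expects. One possible approach, sketched here under assumptions, is to drive the conversion from the dataclass annotations rather than listing field names by hand. `DemoConfig` and `coerce_strings` are hypothetical names for this illustration, and list-valued fields are assumed to be stored as JSON text.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, get_type_hints


@dataclass
class DemoConfig:  # hypothetical stand-in for an application config
    host: str = "localhost"
    port: int = 8080
    debug: bool = False
    allowed_origins: list[str] = field(default_factory=lambda: ["*"])


def coerce_strings(cls: type, raw: dict[str, str]) -> dict[str, Any]:
    """Convert string values to the types annotated on the dataclass."""
    hints = get_type_hints(cls)
    converted: dict[str, Any] = {}
    for spec in fields(cls):
        if spec.name not in raw:
            continue  # missing keys fall back to the dataclass default
        text = raw[spec.name]
        target = hints[spec.name]
        if target is bool:
            converted[spec.name] = text.strip().lower() in ("true", "1", "yes")
        elif target in (int, float, str):
            converted[spec.name] = target(text)
        else:
            # Assumption: anything else (e.g. list[str]) is stored as JSON
            converted[spec.name] = json.loads(text)
    return converted


raw_values = {
    "port": "9090",
    "debug": "yes",
    "allowed_origins": '["https://example.com"]',
}
demo = DemoConfig(**coerce_strings(DemoConfig, raw_values))
print(demo)
```

The trade-off is a little more machinery in exchange for not having to update the loader each time a field is added.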

In [None]:
import configparser
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    """Application configuration loaded from any supported format."""
    host: str = "localhost"
    port: int = 8080
    debug: bool = False
    database_url: str = "sqlite:///app.db"
    allowed_origins: list[str] = field(default_factory=lambda: ["*"])
    max_connections: int = 100


class ConfigLoader:
    """Load configuration from JSON, CSV, or INI files."""

    @staticmethod
    def load(path: Path) -> AppConfig:
        """Load config from file, auto-detecting format by extension."""
        suffix = path.suffix.lower()
        loaders: dict[str, Any] = {
            ".json": ConfigLoader._load_json,
            ".csv": ConfigLoader._load_csv,
            ".ini": ConfigLoader._load_ini,
        }
        loader = loaders.get(suffix)
        if loader is None:
            raise ValueError(f"Unsupported config format: {suffix}")
        return loader(path)

    @staticmethod
    def save_json(config: AppConfig, path: Path) -> None:
        """Save configuration as JSON."""
        data = {
            "host": config.host,
            "port": config.port,
            "debug": config.debug,
            "database_url": config.database_url,
            "allowed_origins": config.allowed_origins,
            "max_connections": config.max_connections,
        }
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)

    @staticmethod
    def _load_json(path: Path) -> AppConfig:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return AppConfig(**data)

    @staticmethod
    def _load_csv(path: Path) -> AppConfig:
        """Load config from a two-column CSV (key, value)."""
        config_dict: dict[str, Any] = {}
        with open(path, "r", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                key, value = row["key"], row["value"]
                # Convert string values to the expected types (hard-coded by key name)
                if key in ("port", "max_connections"):
                    config_dict[key] = int(value)
                elif key == "debug":
                    config_dict[key] = value.lower() in ("true", "1", "yes")
                elif key == "allowed_origins":
                    config_dict[key] = json.loads(value)
                else:
                    config_dict[key] = value
        return AppConfig(**config_dict)

    @staticmethod
    def _load_ini(path: Path) -> AppConfig:
        parser = configparser.ConfigParser()
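        # read() silently skips files it cannot open, so check what it returns
        if not parser.read(path, encoding="utf-8"):
            raise FileNotFoundError(f"Could not read config file: {path}")
        if not parser.sections():
            raise ValueError(f"No sections found in {path}")
        section = parser["app"] if "app" in parser else parser[parser.sections()[0]]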
        return AppConfig(
            host=section.get("host", "localhost"),
            port=section.getint("port", 8080),
            debug=section.getboolean("debug", False),
            database_url=section.get("database_url", "sqlite:///app.db"),
            allowed_origins=json.loads(section.get("allowed_origins", '["*"]')),
            max_connections=section.getint("max_connections", 100),
        )


# Create configuration files in all three formats
config = AppConfig(
    host="0.0.0.0",
    port=9090,
    debug=True,
    database_url="postgresql://localhost/mydb",
    allowed_origins=["https://example.com", "https://api.example.com"],
    max_connections=50,
)

# Save as JSON
json_path = work_dir / "config.json"
ConfigLoader.save_json(config, json_path)

# Create CSV config
csv_path = work_dir / "config.csv"
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["key", "value"])
    writer.writeheader()
    writer.writerow({"key": "host", "value": "0.0.0.0"})
    writer.writerow({"key": "port", "value": "9090"})
    writer.writerow({"key": "debug", "value": "true"})
    writer.writerow({"key": "database_url", "value": "postgresql://localhost/mydb"})
    writer.writerow({"key": "allowed_origins",
                     "value": json.dumps(config.allowed_origins)})
    writer.writerow({"key": "max_connections", "value": "50"})

# Create INI config
ini_path = work_dir / "config.ini"
ini_parser = configparser.ConfigParser()
ini_parser["app"] = {
    "host": "0.0.0.0",
    "port": "9090",
    "debug": "true",
    "database_url": "postgresql://localhost/mydb",
    "allowed_origins": json.dumps(config.allowed_origins),
    "max_connections": "50",
}
with open(ini_path, "w", encoding="utf-8") as f:
    ini_parser.write(f)

# Load from each format and verify they produce the same result
print("Loading configuration from multiple formats:\n")
for path in [json_path, csv_path, ini_path]:
    loaded = ConfigLoader.load(path)
    print(f"  {path.suffix:5s}: host={loaded.host}, port={loaded.port}, "
          f"debug={loaded.debug}, origins={len(loaded.allowed_origins)}")

# Verify equivalence
from_json = ConfigLoader.load(json_path)
from_csv = ConfigLoader.load(csv_path)
from_ini = ConfigLoader.load(ini_path)
print(f"\nAll formats equivalent: {from_json == from_csv == from_ini}")

## Section 7: Comparing Serialization Formats

Each format has trade-offs. Choosing the right one depends on your requirements.
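
When speed is one of those requirements, a single timed run can be noisy. A more stable measurement, sketched below, repeats each call with `timeit.repeat()` and keeps the best run. The `sample` data and the `best_ms` helper are illustrative, and the numbers depend on the machine and Python version.

```python
import json
import pickle
import timeit

sample = [{"id": i, "name": f"item_{i:04d}", "value": i * 1.5} for i in range(500)]


def best_ms(func, repeat: int = 5, number: int = 20) -> float:
    """Best average time per call, in milliseconds, over several repeats."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number * 1000


json_ms = best_ms(lambda: json.dumps(sample))
pickle_ms = best_ms(lambda: pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL))

print(f"JSON:   {json_ms:.3f} ms per call")
print(f"Pickle: {pickle_ms:.3f} ms per call")
```

Taking the minimum filters out runs that were slowed by other activity on the machine.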

In [None]:
import time

# Compare serialization size and speed for the same data.
# Each timing is a single run, so treat the numbers as rough indications only.
test_data: list[dict[str, Any]] = [
    {
        "id": i,
        "name": f"item_{i:04d}",
        "value": i * 1.5,
        "active": i % 3 != 0,
        "tags": ["alpha", "beta"] if i % 2 == 0 else ["gamma"],
    }
    for i in range(500)
]

# JSON
t0 = time.perf_counter()
json_bytes: bytes = json.dumps(test_data).encode("utf-8")
json_time = time.perf_counter() - t0

# Pickle
t0 = time.perf_counter()
pickle_bytes: bytes = pickle.dumps(test_data, protocol=pickle.HIGHEST_PROTOCOL)
pickle_time = time.perf_counter() - t0

# CSV (flat structure only: the nested "tags" list is dropped, so its size is not directly comparable)
t0 = time.perf_counter()
csv_buffer = StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "name", "value", "active"])
writer.writeheader()
for item in test_data:
    writer.writerow({k: item[k] for k in ["id", "name", "value", "active"]})
csv_bytes: bytes = csv_buffer.getvalue().encode("utf-8")
csv_time = time.perf_counter() - t0

print("Serialization comparison (500 records):")
print(f"{'Format':<10s} {'Size (bytes)':>14s} {'Time (ms)':>12s} {'Human-readable':>16s} {'Cross-language':>16s}")
print(f"{'-'*10} {'-'*14} {'-'*12} {'-'*16} {'-'*16}")
print(f"{'JSON':<10s} {len(json_bytes):>14,} {json_time*1000:>12.2f} {'Yes':>16s} {'Yes':>16s}")
print(f"{'CSV':<10s} {len(csv_bytes):>14,} {csv_time*1000:>12.2f} {'Yes':>16s} {'Yes':>16s}")
print(f"{'Pickle':<10s} {len(pickle_bytes):>14,} {pickle_time*1000:>12.2f} {'No':>16s} {'No':>16s}")

print("\nKey trade-offs:")
print("  JSON:   Universal, human-readable, but limited to basic types")
print("  CSV:    Great for flat tabular data, widely supported by spreadsheets")
print("  Pickle: Handles most Python objects, but Python-only and unsafe with untrusted data")
print("  Struct: Compact binary, ideal for fixed-format records and protocols")

# Cleanup
import shutil
shutil.rmtree(work_dir)
print(f"\nCleaned up {work_dir}")

## Summary

### JSON (`json` module)
- `dumps()`/`loads()` for strings, `dump()`/`load()` for files
- Custom `JSONEncoder` and `object_hook` for non-standard types
- Best for: APIs, config files, cross-language data exchange

### CSV (`csv` module)
- `reader`/`writer` for list-based rows
- `DictReader`/`DictWriter` for named columns
- Always use `newline=''` when opening CSV files
- Best for: tabular data, spreadsheet interop, data science pipelines

### Pickle (`pickle` module)
- Serializes most Python objects (including shared and circular references)
- **Never unpickle untrusted data** - arbitrary code execution risk
- Use `protocol=pickle.HIGHEST_PROTOCOL` for the most efficient encoding, provided every reader runs a Python version that supports that protocol
- Best for: caching, internal Python-to-Python data transfer

### Struct (`struct` module)
- Pack/unpack C-style binary data with format strings
- Use `Struct` objects for repeated operations
- Best for: binary file formats, network protocols, hardware interfaces

### Choosing a Format
1. **Cross-language exchange** -> JSON
2. **Tabular data / spreadsheets** -> CSV
3. **Complex Python objects (trusted source)** -> Pickle
4. **Fixed binary format / protocols** -> Struct
5. **Human-editable configuration** -> JSON or INI