**Pattern 1** — Represent a data pipeline record

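**Problem:** Clean and filter raw batch records from a CSV/stream.
Each row is a tuple (id, amount, ts), and you want readable, type-annotated code instead of positional indexing (row[0], row[1], …). Note that dataclass annotations document the expected types but are not checked at runtime.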

**Task:** Convert raw rows into Record objects and compute the total amount for records above a threshold.

In [0]:
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Record:
    id: int
    amount: float
    ts: str

def load_records(raw_rows: Iterable[tuple]) -> List[Record]:
    # Normalize raw tuples into structured Records
    return [Record(*row) for row in raw_rows]

def total_high_value(records: Iterable[Record], min_amount: float) -> float:
    # Sum only high-value records
    return sum(r.amount for r in records if r.amount >= min_amount)

# Example usage
raw_rows = [
    (1, 100.0, "2025-11-19T10:01:00"),
    (2, 15.5,  "2025-11-19T10:02:00"),
    (3, 250.0, "2025-11-19T10:05:00"),
]

records = load_records(raw_rows)
total = total_high_value(records, min_amount=50.0)
print(total)  # 350.0


**Why dataclass here?**

Clear field names: r.id, r.amount, r.ts instead of indices.

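Easy to extend schema: add fields to Record, giving new fields default values so that existing rows still load.

Still lightweight: a dataclass is an ordinary class with a generated __init__, __repr__, and __eq__. Instances use more memory than tuples by default; slots=True (Python 3.10+) narrows the gap.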
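The points above can be sketched in code. Because annotations are not enforced at runtime, values read from a CSV stay strings unless they are converted explicitly. The sketch below assumes a hypothetical `RecordV2` with an added `currency` field (defaulting to `"USD"`) and a hypothetical `from_row` helper; neither appears in the original example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RecordV2:
    id: int
    amount: float
    ts: str
    currency: str = "USD"  # new field; the default keeps 3-column rows loadable

    @classmethod
    def from_row(cls, row):
        # Annotations are not enforced at runtime, so convert explicitly.
        id_, amount, ts, *rest = row
        return cls(int(id_), float(amount), ts, *rest)


# csv.reader yields lists of strings, never ints or floats.
csv_text = "1,100.0,2025-11-19T10:01:00\n2,15.5,2025-11-19T10:02:00,EUR\n"
records_v2 = [RecordV2.from_row(row) for row in csv.reader(io.StringIO(csv_text))]
print(records_v2[0].amount + 1)  # 101.0
print(records_v2[1].currency)    # EUR
```

Keeping the conversion in one classmethod means the rest of the pipeline can rely on `amount` being a float.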

**Time & Space Complexity**

load_records:

Time: You create one Record per raw row → O(n).

Space: You store n Record instances → O(n).

total_high_value:

Time: Single pass over records → O(n).

Space: Uses a generator inside sum, no extra list → O(1) extra.
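If the full list of records is not needed afterwards, the O(n) space of load_records can be avoided as well. A minimal sketch, assuming a hypothetical generator `iter_records` (not part of the original code); `Record` and `total_high_value` are repeated so the block runs on its own:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Record:
    id: int
    amount: float
    ts: str


def iter_records(raw_rows: Iterable[tuple]) -> Iterator[Record]:
    # Yield one Record at a time instead of building a list.
    for row in raw_rows:
        yield Record(*row)


def total_high_value(records: Iterable[Record], min_amount: float) -> float:
    return sum(r.amount for r in records if r.amount >= min_amount)


raw_rows = [
    (1, 100.0, "2025-11-19T10:01:00"),
    (2, 15.5, "2025-11-19T10:02:00"),
    (3, 250.0, "2025-11-19T10:05:00"),
]
streamed_total = total_high_value(iter_records(raw_rows), min_amount=50.0)
print(streamed_total)  # 350.0
```

The trade-off is that a generator can be consumed only once, so this suits single-pass aggregations rather than repeated queries over the same batch.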

**Pattern 2** — Hashable CDC records

**Problem:** You receive change data capture (CDC) events from two systems (source and target) and want to:

Deduplicate CDC events.

Find records that are in source but missing in target.

You want to treat each record as a set element / dict key, which requires it to be hashable. In practice that also means immutable, so that a record's hash cannot change while it sits in a set or dict.

**Task:** Model CDC rows as frozen dataclasses and detect drift between two CDC batches.

In [0]:
from dataclasses import dataclass
from typing import Iterable, Set

@dataclass(frozen=True)
class CDCRecord:
    id: int
    amount: float
    status: str   # e.g. "INSERT", "UPDATE", "DELETE"

def dedupe_cdc(events: Iterable[CDCRecord]) -> Set[CDCRecord]:
    """
    Remove duplicate CDC events (same id, amount, status).
    """
    return set(events)   # duplicates automatically removed

def find_missing_in_target(
    src_events: Iterable[CDCRecord],
    tgt_events: Iterable[CDCRecord],
) -> Set[CDCRecord]:
    """
    CDC records present in source but not in target (set difference).
    """
    src_set = set(src_events)
    tgt_set = set(tgt_events)
    return src_set - tgt_set

# Example usage
src = [
    CDCRecord(1, 100.0, "INSERT"),
    CDCRecord(1, 100.0, "INSERT"),  # duplicate
    CDCRecord(2, 50.0,  "UPDATE"),
]

tgt = [
    CDCRecord(1, 100.0, "INSERT"),
]

unique_src = dedupe_cdc(src)
print(unique_src)
# Set iteration order is not guaranteed; one possible output:
# {CDCRecord(id=1, amount=100.0, status='INSERT'),
#  CDCRecord(id=2, amount=50.0, status='UPDATE')}

missing = find_missing_in_target(src, tgt)
print(missing)
# {CDCRecord(id=2, amount=50.0, status='UPDATE')}

**Why frozen=True?**

Makes instances immutable → prevents accidental mutation in CDC logic.

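Makes them hashable → with the default eq=True, frozen=True generates __hash__ from the fields, so instances can be used in sets and as dict keys (provided every field value is itself hashable).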

Perfect fit for “row as a fact” patterns: CDC snapshots, dimension versions, etc.
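The behaviour described above can be checked directly. A short sketch, repeating `CDCRecord` so the block is self-contained; the `seen_at` dict and its `"batch-1"` value are purely illustrative:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class CDCRecord:
    id: int
    amount: float
    status: str


rec = CDCRecord(1, 100.0, "INSERT")

# Assignment on a frozen instance raises FrozenInstanceError.
try:
    rec.amount = 200.0
    mutated = True
except FrozenInstanceError:
    mutated = False
print(mutated)  # False

# Equal field values give equal hashes, so records work as dict keys.
seen_at = {rec: "batch-1"}
print(seen_at[CDCRecord(1, 100.0, "INSERT")])  # batch-1

# replace() builds a new instance; the original is left unchanged.
updated = replace(rec, status="UPDATE")
print(rec.status, updated.status)  # INSERT UPDATE
```

When a "changed" version of a frozen record is needed, `dataclasses.replace` is the usual route, since it returns a copy rather than mutating in place.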

**Time & Space Complexity**

dedupe_cdc (build a set):

Time: Insert each event into a set → average O(n).

Space: Store up to n unique events → O(n).

find_missing_in_target:

Time:

Build src_set: O(n),

Build tgt_set: O(m),

Set difference src_set - tgt_set: O(n) average.

Total: O(n + m).

Space:

src_set + tgt_set + result set → O(n + m).
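When the source batch is large, the space bound can be tightened by hashing only the target and streaming the source. A sketch, assuming a hypothetical `find_missing_streaming` (not in the original code); time stays O(n + m), while space drops to O(m) plus the size of the result:

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class CDCRecord:
    id: int
    amount: float
    status: str


def find_missing_streaming(
    src_events: Iterable[CDCRecord],
    tgt_events: Iterable[CDCRecord],
) -> Set[CDCRecord]:
    tgt_set = set(tgt_events)  # O(m) space
    # Stream the source; only records absent from the target are kept.
    return {e for e in src_events if e not in tgt_set}


src = [
    CDCRecord(1, 100.0, "INSERT"),
    CDCRecord(1, 100.0, "INSERT"),  # duplicate
    CDCRecord(2, 50.0, "UPDATE"),
]
tgt = [CDCRecord(1, 100.0, "INSERT")]

missing_streamed = find_missing_streaming(src, tgt)
print(missing_streamed)  # {CDCRecord(id=2, amount=50.0, status='UPDATE')}
```

Both versions compare whole records, so a row whose amount or status differs between source and target is reported as missing.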

**dataclass is used for:**

ETL row models

Config objects

Serializable objects

Clean input data models
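As an illustration of the config and serialization uses listed above, here is a minimal sketch. `JobConfig` and its fields are hypothetical names chosen for the example; `asdict` is the standard-library helper that converts a dataclass instance to a plain dict.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class JobConfig:  # hypothetical config object
    source_path: str
    batch_size: int = 500
    dry_run: bool = False


cfg = JobConfig(source_path="input.csv")
payload = json.dumps(asdict(cfg))  # asdict converts the instance to a plain dict
print(payload)
# {"source_path": "input.csv", "batch_size": 500, "dry_run": false}

# Round trip: rebuild the config from the JSON payload.
restored = JobConfig(**json.loads(payload))
print(restored == cfg)  # True
```

This round trip works here because every field is a JSON-native type; fields such as datetimes would need explicit conversion.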

**Supports:**

defaults

type hints

immutability (frozen=True)

automatic repr/eq, plus hash when frozen=True (or unsafe_hash=True) is set
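The features in this list can be seen side by side in a short sketch. `Batch` and `BatchKey` are hypothetical classes used only for illustration:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Batch:  # hypothetical, mutable (not frozen)
    name: str
    tags: List[str] = field(default_factory=list)  # fresh list per instance


@dataclass(frozen=True)
class BatchKey:  # hypothetical, frozen
    name: str
    day: str = "2025-11-19"


a, b = Batch("orders"), Batch("orders")
a.tags.append("late")
print(a)       # Batch(name='orders', tags=['late'])
print(b.tags)  # []  (default_factory avoids a shared list)
print(Batch("x") == Batch("x"))  # True: generated __eq__ compares fields

# With the default eq=True and no frozen=True, __hash__ is set to None.
try:
    hash(a)
    mutable_hashable = True
except TypeError:
    mutable_hashable = False
print(mutable_hashable)  # False

# frozen=True (with eq=True) generates __hash__ from the fields.
print(hash(BatchKey("orders")) == hash(BatchKey("orders")))  # True
```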

**dataclass is a good fit for:**

Row schemas

Lightweight models

CDC-friendly immutable structures