**Pattern 1** — Group rows by category (defaultdict(list))

**Problem:** Group transactions by category so you can later compute totals per category.

**Task:** Build category → list of rows.

In [0]:
from collections import defaultdict

rows = [
    {"order_id": 1, "category": "electronics", "amount": 120},
    {"order_id": 2, "category": "grocery",     "amount": 40},
    {"order_id": 3, "category": "electronics", "amount": 300},
    {"order_id": 4, "category": "grocery",     "amount": 15},
]

# Group rows by category
groups = defaultdict(list)
for r in rows:
    groups[r["category"]].append(r)

# Example: all electronics orders
electronics_orders = groups["electronics"]

In [0]:
"""
Idea:
- defaultdict(list) auto-creates an empty list on first access.
- One pass over rows, append each row to its category bucket.
"""


**Why this pattern?**
Classic data engineering (DE) pattern: grouping logs by user, events by type, orders by category/date/bucket before further aggregation.

**Time & Space Complexity**

Time: Single pass over rows → O(n)

Space: Buckets hold references to the existing row dicts, not copies → O(n) extra references (plus O(k) keys, k = number of categories)
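The problem statement mentions computing totals per category once the rows are grouped. A minimal sketch of that follow-up step, reusing the sample rows from this pattern; the `totals` name is illustrative and not part of the original notebook:

```python
from collections import defaultdict

rows = [
    {"order_id": 1, "category": "electronics", "amount": 120},
    {"order_id": 2, "category": "grocery",     "amount": 40},
    {"order_id": 3, "category": "electronics", "amount": 300},
    {"order_id": 4, "category": "grocery",     "amount": 15},
]

# Same grouping step as in the pattern
groups = defaultdict(list)
for r in rows:
    groups[r["category"]].append(r)

# Follow-up aggregation: sum the amounts inside each bucket
totals = {
    category: sum(r["amount"] for r in bucket)
    for category, bucket in groups.items()
}
# totals == {"electronics": 420, "grocery": 55}
```

If only the totals are needed and the rows themselves are not, summing directly into a `defaultdict(int)` avoids holding the buckets at all.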

**Pattern 2** — Nested grouping (defaultdict of defaultdict(list))

**Problem:** Group sales by year and month to compute monthly stats per year.

**Task:** Build year → month → list of rows.

In [0]:
from collections import defaultdict

rows = [
    {"order_id": 1, "date": "2025-01-10", "amount": 100},
    {"order_id": 2, "date": "2025-01-15", "amount": 50},
    {"order_id": 3, "date": "2025-02-01", "amount": 200},
    {"order_id": 4, "date": "2024-12-30", "amount": 80},
]

# year -> month -> list[rows]
stats = defaultdict(lambda: defaultdict(list))

for r in rows:
    year, month, _ = r["date"].split("-")  # naive split: "YYYY-MM-DD"
    stats[year][month].append(r)

# Example: all orders in Jan 2025
jan_2025_orders = stats["2025"]["01"]

In [0]:
"""
Idea:
- Use nested defaultdicts:
  stats[year] gives a defaultdict(list)
  stats[year][month] gives a list bucket.
- Clean way to build multi-level groupings (year/month, country/city, etc.).
"""


**Why this pattern?**
Very common in reporting: year→month, country→state, tenant→entity, etc. Keeps the hierarchy explicit and avoids manual "if key not in dict" checks at each level.

**Time & Space Complexity**

Time: One pass over rows → O(n)

Space: Store all rows in nested buckets → O(n), with O(k) keys for all distinct (year, month) combos.
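The stated goal is monthly stats per year. One way to derive them from the nested buckets is sketched below; the `monthly` name and the choice of stats (order count and total amount) are assumptions made for illustration:

```python
from collections import defaultdict

rows = [
    {"order_id": 1, "date": "2025-01-10", "amount": 100},
    {"order_id": 2, "date": "2025-01-15", "amount": 50},
    {"order_id": 3, "date": "2025-02-01", "amount": 200},
    {"order_id": 4, "date": "2024-12-30", "amount": 80},
]

# Same nested grouping as in the pattern: year -> month -> list[rows]
stats = defaultdict(lambda: defaultdict(list))
for r in rows:
    year, month, _ = r["date"].split("-")
    stats[year][month].append(r)

# Walk both levels and reduce each bucket to a small summary dict
monthly = {
    year: {
        month: {"orders": len(bucket), "total": sum(r["amount"] for r in bucket)}
        for month, bucket in months.items()
    }
    for year, months in stats.items()
}
# monthly["2025"]["01"] == {"orders": 2, "total": 150}
```

The result is built from plain dicts, so looking up a missing year or month raises `KeyError` instead of silently creating an empty bucket.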

**Pattern 3** — Aggregation with defaultdict(int)

**Problem:** Count how many records exist for each status (e.g., SUCCESS, FAILED, RETRY).

**Task:** Build status → count.

In [0]:
from collections import defaultdict

rows = [
    {"job_id": 1, "status": "SUCCESS"},
    {"job_id": 2, "status": "FAILED"},
    {"job_id": 3, "status": "SUCCESS"},
    {"job_id": 4, "status": "RETRY"},
    {"job_id": 5, "status": "FAILED"},
]

counts = defaultdict(int)

for r in rows:
    counts[r["status"]] += 1

# Example usage:
success_count = counts["SUCCESS"]   # 2
failed_count  = counts["FAILED"]    # 2
retry_count   = counts["RETRY"]     # 1


In [0]:
"""
Idea:
- defaultdict(int) starts each unseen key at 0.
- Each row just increments the counter for its status.
- No need for .get() or "if key in dict" checks.
"""


**Why this pattern?**
Core DE pattern for quick aggregations: count events per status, records per file, errors per rule, users per segment, etc.

**Time & Space Complexity**

Time: One pass over rows → O(n)

Space: One integer per distinct status → O(k), where k = number of unique statuses (k ≤ n).
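One side effect of the auto-init behavior is worth seeing once: indexing a missing key on a defaultdict inserts it, whereas `.get()` and `in` do not. A small sketch, with a hypothetical `"TIMEOUT"` status that never appears in the data:

```python
from collections import defaultdict

counts = defaultdict(int)
for status in ["SUCCESS", "FAILED", "SUCCESS"]:
    counts[status] += 1

# Membership tests and .get() do not trigger the default factory
timeout_seen = "TIMEOUT" in counts         # False
timeout_count = counts.get("TIMEOUT", 0)   # 0, and the key is still absent
keys_before = sorted(counts)               # ['FAILED', 'SUCCESS']

# Indexing a missing key returns 0 AND inserts the key
_ = counts["TIMEOUT"]
keys_after = sorted(counts)                # ['FAILED', 'SUCCESS', 'TIMEOUT']
```

When the final mapping is reported or serialized, reading with `.get()` (or converting with `dict(counts)` first) keeps zero-count keys from appearing by accident.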

**Pattern 4** — Graph adjacency lists with defaultdict(list)

**Problem:** Model dependencies between ETL jobs (or tables) as a graph to run topological sort, detect cycles, etc.

**Task:** Build job → list of downstream jobs.

In [0]:
from collections import defaultdict

# Each pair (u, v) means: job u must run before job v
edges = [
    ("raw_ingest", "cleaned"),
    ("cleaned",    "enriched"),
    ("enriched",   "aggregated"),
    ("cleaned",    "quality_checks"),
]

graph = defaultdict(list)

for u, v in edges:
    graph[u].append(v)

# Example: jobs that depend on "cleaned"
downstream_of_cleaned = graph["cleaned"]   # ["enriched", "quality_checks"]

In [0]:
"""
Idea:
- Represent a directed graph as adjacency list:
  graph[u] = [v1, v2, ...] where edges are u -> vi.
- defaultdict(list) avoids manual init for each node.
- Foundation for BFS/DFS, topological sort, cycle detection.
"""


**Why this pattern?**
Data platforms are full of graphs: job DAGs, table lineage, microservice calls, feature dependencies. Adjacency lists are the standard in-memory representation.

**Time & Space Complexity**

Let E = number of edges, V = number of vertices (jobs).

Time to build adjacency list: One pass over edges → O(E)

Space: Need a list entry for each edge plus keys for each vertex → O(V + E)
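The adjacency list is described as the foundation for topological sort and cycle detection. A minimal sketch of Kahn's algorithm over the same edges follows; the `topo_order` helper, the `indegree` map, and the `nodes` set are illustrative names, not part of the original notebook:

```python
from collections import defaultdict, deque

edges = [
    ("raw_ingest", "cleaned"),
    ("cleaned",    "enriched"),
    ("enriched",   "aggregated"),
    ("cleaned",    "quality_checks"),
]

graph = defaultdict(list)    # job -> downstream jobs
indegree = defaultdict(int)  # job -> number of upstream jobs
nodes = set()
for u, v in edges:
    graph[u].append(v)
    indegree[v] += 1
    nodes.update((u, v))

def topo_order(nodes, graph, indegree):
    """Return a valid run order, or raise ValueError if there is a cycle."""
    remaining = {n: indegree.get(n, 0) for n in nodes}
    # Start with jobs that have no upstream dependency (sorted for determinism)
    queue = deque(sorted(n for n, d in remaining.items() if d == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        # .get() avoids inserting empty lists for leaf jobs
        for v in graph.get(u, []):
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("cycle detected in job dependencies")
    return order

run_order = topo_order(nodes, graph, indegree)
# ['raw_ingest', 'cleaned', 'enriched', 'quality_checks', 'aggregated']
```

This runs in O(V + E), matching the cost of building the adjacency list. The standard library's `graphlib.TopologicalSorter` (Python 3.9+) offers the same capability without hand-written code.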

In [0]:
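## Group transactions by (cust, date)
from collections import defaultdict

# Hypothetical sample; any dicts with 'cust' and 'date' keys work
rows = [{"cust": "c1", "date": "2025-01-10"}, {"cust": "c2", "date": "2025-01-10"}]

groups = defaultdict(list)
for r in rows:
    groups[(r['cust'], r['date'])].append(r)  # tuple as composite key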

In [0]:
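## Bucket logs by hour
from collections import defaultdict

# Hypothetical sample; assumes ISO-8601 'ts' strings like "2025-01-10T14:05:00"
logs = [{"ts": "2025-01-10T14:05:00"}, {"ts": "2025-01-10T15:30:00"}]

buckets = defaultdict(list)
for log in logs:
    hour = log['ts'][:13]  # first 13 chars: "YYYY-MM-DDTHH"
    buckets[hour].append(log)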

**defaultdict**

- Grouping engine
- Nested grouping
- Auto-init