**Pattern 1** — Hash Join (Dictionary Index)

**Problem:** Enrich fact rows with dimension attributes (e.g., add country_name from a dim table).

**Task:** For each fact row, look up matching dim row by key in O(1) average time.

In [0]:
# fact: web events
fact = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "purchase"},
    {"user_id": 3, "event": "click"},
]

# dim: user attributes
dim = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "CA"},
]

# 1) Build dimension index (hash table).
#    Assumes user_id is unique in dim; with duplicates, the last row wins.
dim_index = {row["user_id"]: row for row in dim}

# 2) Hash join: inner join fact → dim on user_id
enriched = []
for f in fact:
    if f["user_id"] in dim_index:         # inner join
        d = dim_index[f["user_id"]]
        enriched.append({**f, **d})

# enriched:
# [
#   {'user_id': 1, 'event': 'click',    'country': 'US'},
#   {'user_id': 2, 'event': 'purchase', 'country': 'CA'}
# ]


**Variants:**

**Left join:** always append f; merge d into it only when a match exists.

**Semi join:** keep f only if its key exists in dim_index, without merging d.

**Anti join:** keep f only if its key is not in dim_index.
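The three variants above can be sketched against the same dim_index. This is a minimal sketch that repeats the sample data so it runs on its own; the result names (left_joined, semi_joined, anti_joined) are illustrative and not part of the original example.

```python
# Sample data repeated from the example above so the block is self-contained.
fact = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "purchase"},
    {"user_id": 3, "event": "click"},
]
dim = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "CA"},
]
dim_index = {row["user_id"]: row for row in dim}

# Left join: keep every fact row; merge dim columns only when a match exists.
left_joined = [{**f, **dim_index.get(f["user_id"], {})} for f in fact]

# Semi join: keep fact rows that have a match, without adding dim columns.
semi_joined = [f for f in fact if f["user_id"] in dim_index]

# Anti join: keep fact rows that have no match in dim.
anti_joined = [f for f in fact if f["user_id"] not in dim_index]
```

In the left join, unmatched rows (user_id 3 here) simply lack the country field; filling in an explicit default such as None is a design choice left to the caller.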

**Why this is a pattern:** A dictionary gives O(1) average lookup per fact row, versus an O(m) scan of dim for each fact row (O(n * m) overall for a nested-loop join).

**Time & Space Complexity**

Build index: O(m) where m = len(dim).

Join loop: O(n) where n = len(fact).

Total: O(m + n).

Space: O(m) for dim_index (plus O(n) if you store enriched).

**Pattern 2** — Group By with defaultdict(list)

**Problem:** Group records by a key (user, date, category, etc.).

**Task:** Build key -> [rows] mapping in one pass.

In [0]:
from collections import defaultdict

logs = [
    {"user": "alice", "date": "2025-11-20", "page": "/home"},
    {"user": "alice", "date": "2025-11-20", "page": "/products"},
    {"user": "bob",   "date": "2025-11-20", "page": "/home"},
]

# Group logs by user
groups = defaultdict(list)

for row in logs:
    groups[row["user"]].append(row)

# groups["alice"] → 2 rows
# groups["bob"]   → 1 row


**You can group by any combination:** (user, date), status_code, country, etc.
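Grouping by a combination of fields only requires a tuple as the dictionary key. The sketch below uses hypothetical sample rows (slightly extended from the example above) and an illustrative name, by_user_date.

```python
from collections import defaultdict

# Hypothetical sample rows for illustration.
logs = [
    {"user": "alice", "date": "2025-11-20", "page": "/home"},
    {"user": "alice", "date": "2025-11-21", "page": "/products"},
    {"user": "bob",   "date": "2025-11-20", "page": "/home"},
    {"user": "alice", "date": "2025-11-20", "page": "/cart"},
]

# Group by the composite key (user, date); tuples are hashable, so they work as dict keys.
by_user_date = defaultdict(list)
for row in logs:
    key = (row["user"], row["date"])
    by_user_date[key].append(row)

# by_user_date[("alice", "2025-11-20")] holds 2 rows; there are 3 groups in total.
```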

**Why this is a pattern:** This is the in-memory version of GROUP BY in SQL, used everywhere in ETL/map-reduce style problems.

**Time & Space Complexity**

One pass over logs: O(n).

Each dict lookup and list append: O(1) average (append is amortized O(1)).

Total: O(n).

Space: O(n) to store all rows in buckets (plus O(k) keys, k = distinct groups).

**Pattern 3** — Aggregation (Counts / Totals)

**Problem:** Count how many rows per category/status, or sum metrics per key.

**Task:** Build key -> aggregate in one pass.

**Example:** Count events per event_type

In [0]:
rows = [
    {"event_type": "click"},
    {"event_type": "click"},
    {"event_type": "purchase"},
]

counts = {}

for row in rows:
    et = row["event_type"]
    counts[et] = counts.get(et, 0) + 1

# counts → {'click': 2, 'purchase': 1}

**Example:** The same count using collections.Counter

In [0]:
from collections import Counter

counts = Counter(row["event_type"] for row in rows)


**Example:** Sum amounts per date

In [0]:
sales = [
    {"date": "2025-11-20", "amount": 100},
    {"date": "2025-11-20", "amount": 50},
    {"date": "2025-11-21", "amount": 200},
]

totals = {}

for s in sales:
    d = s["date"]
    totals[d] = totals.get(d, 0) + s["amount"]

# totals → {'2025-11-20': 150, '2025-11-21': 200}


**Time & Space Complexity**

Single pass: O(n).

Each update: O(1) average.

Total: O(n).

Space: O(k) for k distinct keys (dates, categories, etc.).

**Pattern 4** — Nested Dictionaries (Hierarchical Grouping)

**Problem:** Group records by multiple levels (e.g., year → month → day).

**Task:** Build nested dicts: year -> month -> [rows].

In [0]:
rows = [
    {"year": 2025, "month": 11, "day": 20, "views": 100},
    {"year": 2025, "month": 11, "day": 21, "views": 200},
    {"year": 2025, "month": 12, "day":  1, "views": 150},
]

stats = {}

for r in rows:
    y = r["year"]
    m = r["month"]
    stats.setdefault(y, {}).setdefault(m, []).append(r)

# Example:
# stats[2025][11] → list of rows for Nov 2025
# stats[2025][12] → list of rows for Dec 2025


You can extend deeper: country → state → city, source → table → partition, etc.
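For deeper hierarchies, chaining setdefault calls gets verbose. One possible alternative, sketched here under the assumption that a self-nesting defaultdict is acceptable, is a small factory function. The names tree, geo_rows, and geo are hypothetical, as is the sample data.

```python
from collections import defaultdict

def tree():
    """Return a defaultdict whose missing keys create another nested tree."""
    return defaultdict(tree)

# Hypothetical sample rows for a country -> state -> city hierarchy.
geo_rows = [
    {"country": "US", "state": "CA", "city": "San Francisco", "views": 100},
    {"country": "US", "state": "CA", "city": "Los Angeles",   "views": 200},
    {"country": "CA", "state": "ON", "city": "Toronto",       "views": 150},
]

geo = tree()
for r in geo_rows:
    # Intermediate levels are created automatically; the leaf level is a plain list.
    geo[r["country"]][r["state"]].setdefault(r["city"], []).append(r)

# geo["US"]["CA"] now has two city keys, each mapping to a list of rows.
```

One caveat with this design: merely reading a missing key (for example geo["FR"]) creates it, so membership checks should use `in` or `.get()` rather than indexing.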

**Why this is a pattern:** Mirrors typical DWH dimensions (date hierarchy, geography, org hierarchy) for quick drill-downs without scanning entire list.

**Time & Space Complexity**

Loop over rows: O(n).

Each setdefault and append: O(1) average.

Total: O(n).

Space: O(n) for stored rows + O(h * k) for nested keys (h levels, k distinct combos).

**Pattern 5** — Frequency Tables (Value Counts)

**Problem:** Compute frequency of values (e.g., error codes, countries, SKUs) to detect heavy hitters.

**Task:** Build value -> frequency table.

In [0]:
items = ["500", "200", "200", "404", "500", "500"]

freq = {}
for code in items:
    freq[code] = freq.get(code, 0) + 1

# freq → {'500': 3, '200': 2, '404': 1}

# Status codes ranked by frequency, most frequent first:
top_errors = sorted(freq.items(), key=lambda x: x[1], reverse=True)
# [('500', 3), ('200', 2), ('404', 1)]


Equivalent to Pattern 3 but often applied to 1D sequences (single column or single field).

**Time & Space Complexity**

Build table: O(n).

Optional sorting by frequency: O(k log k) where k = distinct values.

Space: O(k).
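When only the top few heavy hitters are needed, a full sort can be avoided. The sketch below shows two standard-library options, Counter.most_common and heapq.nlargest, on the same sample data; the names top_two and top_two_heap are illustrative.

```python
import heapq
from collections import Counter

items = ["500", "200", "200", "404", "500", "500"]

# Option 1: Counter builds the frequency table and can return the n most common values.
freq = Counter(items)
top_two = freq.most_common(2)
# top_two → [('500', 3), ('200', 2)]

# Option 2: with a plain dict, heapq.nlargest selects the top n without sorting everything.
freq_dict = {}
for code in items:
    freq_dict[code] = freq_dict.get(code, 0) + 1
top_two_heap = heapq.nlargest(2, freq_dict.items(), key=lambda kv: kv[1])
# top_two_heap → [('500', 3), ('200', 2)]
```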

**Pattern 6** — Update Mapping (“Dim Enrich” / Decode Codes)

**Problem:** Replace codes with human-readable labels (e.g., country code → country name) across records.

**Task:** In-place enrichment via mapping dict.

In [0]:
facts = [
    {"user": "alice", "country_code": "US"},
    {"user": "bob",   "country_code": "CA"},
    {"user": "chen",  "country_code": "CN"},
]

country_map = {
    "US": "United States",
    "CA": "Canada",
    "CN": "China",
}

for r in facts:
    code = r["country_code"]
    r["country"] = country_map.get(code, "UNKNOWN")

# facts →
# [
#   {'user': 'alice', 'country_code': 'US', 'country': 'United States'},
#   {'user': 'bob',   'country_code': 'CA', 'country': 'Canada'},
#   {'user': 'chen',  'country_code': 'CN', 'country': 'China'},
# ]


Variant: Overwrite code field directly: r["country_code"] = country_map.get(code, code).
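The overwrite variant can be sketched as follows. The row with code "XX" and the unmapped set are hypothetical additions for illustration, showing one way to keep track of codes that the mapping does not cover.

```python
# Hypothetical sample data; "XX" stands in for a code missing from the mapping.
facts = [
    {"user": "alice", "country_code": "US"},
    {"user": "dana",  "country_code": "XX"},
]
country_map = {"US": "United States", "CA": "Canada"}

unmapped = set()
for r in facts:
    code = r["country_code"]
    if code not in country_map:
        unmapped.add(code)  # remember codes we could not decode
    # Fall back to the original code so no information is lost.
    r["country_code"] = country_map.get(code, code)

# facts[0]["country_code"] is now "United States"; facts[1] keeps "XX".
# unmapped → {'XX'}
```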

**Time & Space Complexity**

Loop over n rows: O(n).

Each dict lookup: O(1) average.

Total: O(n).

Space: O(m) for mapping (m codes) + O(1) extra; in-place update.

**Pattern 7** — Detect Drift / Mismatched Schema

**Problem:** Detect unexpected columns appearing in data (schema drift).

**Task:** Compare actual row keys vs expected schema.

In [0]:
expected_schema = {"user_id", "event_time", "page", "country"}

row = {
    "user_id": 1,
    "event_time": "2025-11-20T10:00:00",
    "page": "/home",
    "device": "mobile",   # unexpected
}

unexpected = set(row.keys()) - expected_schema

if unexpected:
    print("Schema drift detected:", unexpected)
    # {'device'}


In [0]:
## Batch version: Track unexpected keys across many rows.

unexpected_all = set()

for r in rows:
    unexpected_all |= (set(r.keys()) - expected_schema)

# unexpected_all now contains all extra fields seen


**Time & Space Complexity**

For a single row with f fields:

set(row.keys()): O(f).

Set difference: O(f).

For n rows: O(n * f) worst-case.

Space: O(f) per row; O(u) for union of unexpected keys (u distinct extra fields).
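The same set arithmetic can report both directions of a mismatch at once. This is a minimal sketch; check_schema is a hypothetical helper name, and the sample row repeats the one used above.

```python
def check_schema(row, expected):
    """Return (missing, unexpected) key sets for a single row."""
    actual = set(row)  # iterating a dict yields its keys
    return expected - actual, actual - expected

expected_schema = {"user_id", "event_time", "page", "country"}
row = {
    "user_id": 1,
    "event_time": "2025-11-20T10:00:00",
    "page": "/home",
    "device": "mobile",
}

missing, unexpected = check_schema(row, expected_schema)
# missing → {'country'}, unexpected → {'device'}
```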

**Pattern 8** — Maintain Running Stats (Per Key)

**Problem:** Maintain per-key aggregates (count, sum, avg) in a streaming / single-pass fashion.

**Task:** Build key -> {count, sum} and update on the fly (the example below uses each value as its own key).

In [0]:
values = [10, 20, 10, 30, 20, 10]

stats = {}

for x in values:
    if x not in stats:
        stats[x] = {"count": 0, "sum": 0}
    stats[x]["count"] += 1
    stats[x]["sum"]   += x

# Compute average per value
for val, s in stats.items():
    s["avg"] = s["sum"] / s["count"]

# Example:
# stats[10] → {'count': 3, 'sum': 30, 'avg': 10.0}


In [0]:
## Per-key metrics: Can extend each inner dict to also store min, max, last_seen_ts, etc.

stats = {}  # start fresh: the dicts built above have no "min"/"max" fields

for x in values:
    if x not in stats:
        stats[x] = {"count": 0, "sum": 0, "min": x, "max": x}
    s = stats[x]
    s["count"] += 1
    s["sum"]   += x
    s["min"]   = min(s["min"], x)
    s["max"]   = max(s["max"], x)

**Time & Space Complexity**

Loop over n values: O(n).

Each dict lookup and update: O(1) average.

Total: O(n).

Space: O(k) for k distinct keys (each with small fixed-size stat dict).
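In practice the key and the measured value are usually different fields. The sketch below applies the same update logic to hypothetical (key, value) pairs; the names events, key_stats, and averages are illustrative.

```python
# Hypothetical stream of (key, value) pairs, e.g. (user, latency_ms).
events = [("alice", 10), ("bob", 20), ("alice", 30), ("bob", 40), ("alice", 20)]

key_stats = {}
for key, value in events:
    s = key_stats.get(key)
    if s is None:
        # First time we see this key: seed min and max with the first value.
        s = key_stats[key] = {"count": 0, "sum": 0, "min": value, "max": value}
    s["count"] += 1
    s["sum"] += value
    s["min"] = min(s["min"], value)
    s["max"] = max(s["max"], value)

# Averages are derived on demand from the running sum and count.
averages = {key: s["sum"] / s["count"] for key, s in key_stats.items()}
# key_stats["alice"] → {'count': 3, 'sum': 60, 'min': 10, 'max': 30}
# averages → {'alice': 20.0, 'bob': 30.0}
```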

In [0]:
## Problem 1 — Implement Hash Join

def hash_join(left, right, key):
    index = {r[key]: r for r in right}
    out = []
    for row in left:
        k = row[key]
        if k in index:
            out.append((row, index[k]))
    return out

In [0]:
## Problem 2 — Group Orders by Customer

from collections import defaultdict

def group_by_customer(rows):
    grp = defaultdict(list)
    for r in rows:
        grp[r['customer']].append(r)
    return grp

In [0]:
## Problem 3 — Count Events per Category

cnt = {}
for r in rows:
    cat = r['category']
    cnt[cat] = cnt.get(cat, 0) + 1

In [0]:
## Problem 4 — Convert list of dicts to dict of lists

out = {}
for row in rows:
    for k, v in row.items():
        out.setdefault(k, []).append(v)

In [0]:
## Problem 5 — Find Missing Schema Columns

missing = expected_schema - row.keys()

In [0]:
## Problem 6 — Reverse Lookup Table

reverse = {}
for k, v in mapping.items():
    reverse.setdefault(v, []).append(k)

In [0]:
## Problem 7 — Build Adjacency List

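from collections import defaultdict

graph = defaultdict(list)

for u, v in edges:        # edges: iterable of (u, v) pairs
    graph[u].append(v)    # directed edge; also append u to graph[v] for an undirected graph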

In [0]:
## Problem 8 — Sort Employees by Salary then Name

# Ascending by salary, ties broken by name; use -r['salary'] for highest salary first.
sorted(rows, key=lambda r: (r['salary'], r['name']))

In [0]:
## Problem 9 — Detect Duplicate Primary Keys

seen = {}
dups = []

for row in rows:
    pk = row['id']
    if pk in seen:
        dups.append((seen[pk], row))
    seen[pk] = row

In [0]:
## Problem 10 — Compute Sums per Category

totals = {}
for r in rows:
    totals[r['cat']] = totals.get(r['cat'], 0) + r['value']

In [0]:
## Problem 11 — Join on Composite Key

idx = {(r['id'], r['date']): r for r in dim}
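Once the composite index exists, the join loop looks up the same tuple built from each fact row. The sketch below assumes hypothetical dim and fact rows with id and date fields; price and qty are made-up columns for illustration.

```python
# Hypothetical sample data keyed by (id, date).
dim = [
    {"id": 1, "date": "2025-11-20", "price": 9.99},
    {"id": 1, "date": "2025-11-21", "price": 10.99},
]
fact = [
    {"id": 1, "date": "2025-11-21", "qty": 2},
    {"id": 2, "date": "2025-11-21", "qty": 1},  # no matching dim row
]

# Build the index on the composite key.
idx = {(r["id"], r["date"]): r for r in dim}

# Inner join: probe the index with the same tuple shape.
joined = []
for f in fact:
    d = idx.get((f["id"], f["date"]))
    if d is not None:
        joined.append({**f, **d})

# joined → [{'id': 1, 'date': '2025-11-21', 'qty': 2, 'price': 10.99}]
```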

In [0]:
## Problem 12 — Detect CDC Changes

def changed(r1, r2):
    return r1 != r2
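Row-level comparison is only one part of change data capture. A fuller sketch, assuming each snapshot has a unique and sortable primary key, indexes both snapshots by key and classifies rows as inserted, updated, or deleted. The function name diff_snapshots and the sample rows are hypothetical.

```python
def diff_snapshots(old_rows, new_rows, key):
    """Classify rows as inserted, updated, or deleted between two snapshots."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    # dict key views support set operations; sorting keeps the output deterministic.
    inserted = [new[k] for k in sorted(new.keys() - old.keys())]
    deleted = [old[k] for k in sorted(old.keys() - new.keys())]
    updated = [
        (old[k], new[k])
        for k in sorted(old.keys() & new.keys())
        if old[k] != new[k]
    ]
    return inserted, updated, deleted

old_snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
new_snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 4, "name": "d"}]

inserted, updated, deleted = diff_snapshots(old_snapshot, new_snapshot, "id")
# inserted → [{'id': 4, 'name': 'd'}]
# updated  → [({'id': 2, 'name': 'b'}, {'id': 2, 'name': 'B'})]
# deleted  → [{'id': 3, 'name': 'c'}]
```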

In [0]:
## Problem 13 — Bucket logs by hour

from collections import defaultdict
buckets = defaultdict(list)

for r in logs:
    hour = r['ts'][:13]  # assumes an ISO-8601 string; first 13 chars = 'yyyy-mm-ddThh'
    buckets[hour].append(r)

In [0]:
## Problem 14 — Enrich rows with country name

for r in fact:
    # .get() avoids a KeyError when a code is missing from country_dim
    r['country_name'] = country_dim.get(r['country_code'], 'UNKNOWN')

In [0]:
## Problem 15 — Collapse nested dicts

def flatten(d, parent=''):
    out = {}
    for k, v in d.items():
        key = parent + '.' + k if parent else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

**Summary**

Dicts = hash tables → O(1) average lookup

Best tool for handling DE join/group/agg logic

Use defaultdict for grouping

Use composite keys for joins

Use .get() to simplify counters

Keys must be hashable (in practice, immutable values such as str, int, tuple, frozenset)

Dicts preserve insertion order (guaranteed since Python 3.7)

Avoid unnecessary copies
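To make the rule about dict keys concrete: the actual requirement is that a key be hashable, which is why a tuple works as a composite key and a list does not. A minimal sketch with illustrative names:

```python
index = {}

# A tuple is hashable, so it can serve as a composite key.
index[("alice", "2025-11-20")] = 1

# A list is mutable and unhashable, so using it as a key raises TypeError.
error = None
try:
    index[["alice", "2025-11-20"]] = 2
except TypeError as exc:
    error = str(exc)

# Converting the list to a tuple first is the usual fix.
index[tuple(["bob", "2025-11-20"])] = 3
```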