**Pattern A** — Composite-key dict aggregation

**Problem:** Aggregate spend per customer per day from transaction records.
**Task:** Produce {(customer_id, date): total_spend}.

In [0]:
from collections import defaultdict

def agg_spend_per_customer_per_day(transactions):
    # transactions: iterable of dicts with keys: customer_id, ts (ISO), amount
    out = defaultdict(float)
    for t in transactions:
        date = t['ts'][:10]                 # "YYYY-MM-DD" (fast string slice)
        key = (t['customer_id'], date)      # composite key
        out[key] += float(t['amount'])
    return dict(out)

**Idea:** Use a composite tuple key in a dict to aggregate quickly without nested structures. Works well streaming row-by-row.

**Time & Space Complexity**
Time: one pass over rows → O(n).
Space: one entry per (customer, day) → O(k) where k ≤ n.
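
A quick usage sketch for the aggregation above. The function is restated in compact form so the cell runs on its own, and the sample rows are invented for illustration only.

```python
from collections import defaultdict

def agg_spend_per_customer_per_day(transactions):
    out = defaultdict(float)
    for t in transactions:
        # composite key: (customer_id, "YYYY-MM-DD")
        out[(t['customer_id'], t['ts'][:10])] += float(t['amount'])
    return dict(out)

# Hypothetical sample rows (not from any real dataset)
sample = [
    {'customer_id': 'c1', 'ts': '2025-11-19T10:01:00', 'amount': '12.50'},
    {'customer_id': 'c1', 'ts': '2025-11-19T18:30:00', 'amount': 7.5},
    {'customer_id': 'c1', 'ts': '2025-11-20T09:00:00', 'amount': 3},
    {'customer_id': 'c2', 'ts': '2025-11-19T11:15:00', 'amount': 20},
]
totals = agg_spend_per_customer_per_day(sample)
print(totals)
```

The two c1 rows on 2025-11-19 collapse into one entry, so four rows yield three keys. If exact currency arithmetic matters, decimal.Decimal is a common substitute for float.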

**Pattern B** — Hash join pattern

**Problem:** Join transaction logs with user dimension on user_id.
**Task:** Produce joined rows (transaction + user fields).

In [0]:
def hash_join_transactions_with_users(transactions, users):
    # users: iterable of dicts with 'user_id' and user attributes
    user_index = {u['user_id']: u for u in users}   # build hash index
    out = []
    for tx in transactions:
        u = user_index.get(tx['user_id'])
        if u is not None:                   # match on presence, not truthiness
            joined = tx.copy()
            joined.update({k: v for k, v in u.items() if k != 'user_id'})
            out.append(joined)
    return out

**Idea:** Index the (usually smaller) dimension table, then probe for each fact row — O(1) expected probe.

**Time & Space Complexity**
Time: build index O(m) + probe O(n) → O(n + m).
Space: index size O(m).
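
One possible variation, sketched under the assumption that the fact side is too large to hold in a list: a generator that builds the index once and yields joined rows lazily. The name iter_hash_join and the sample data are hypothetical.

```python
def iter_hash_join(transactions, users, key='user_id'):
    # Build the index once from the (usually smaller) dimension side
    user_index = {u[key]: u for u in users}
    for tx in transactions:
        u = user_index.get(tx[key])
        if u is None:
            continue  # inner join: unmatched fact rows are dropped
        # On a field-name collision the user value wins, as in the list version
        yield {**tx, **{k: v for k, v in u.items() if k != key}}

# Hypothetical sample data
users = [{'user_id': 1, 'country': 'DE'}, {'user_id': 2, 'country': 'US'}]
transactions = [
    {'tx_id': 'a', 'user_id': 1, 'amount': 10},
    {'tx_id': 'b', 'user_id': 3, 'amount': 5},   # no matching user
    {'tx_id': 'c', 'user_id': 2, 'amount': 7},
]
joined = list(iter_hash_join(transactions, users))
print(joined)
```

Only the index is held in memory here; the output is produced one row at a time.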

**Pattern C** — Anti-join (detect missing dimension records)

**Problem:** Detect transaction records whose user_id is missing in the user dimension.
**Task:** Return list of transactions with no matching user.

In [0]:
def detect_missing_users(transactions, users):
    user_ids = {u['user_id'] for u in users}
    missing = [tx for tx in transactions if tx['user_id'] not in user_ids]
    return missing

**Idea:** Build a set of known keys and filter facts that are not in that set (anti-join).

**Time & Space Complexity**
Time: O(m + n) to build set and scan transactions.
Space: O(m) for the set.
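
In a data-quality report it is often more useful to know which keys are missing and how often than to list every orphaned row. A minimal sketch of that variation follows; missing_user_id_counts and the sample data are hypothetical.

```python
from collections import Counter

def missing_user_id_counts(transactions, users):
    # Same anti-join idea: a set of known keys, then filter the facts
    user_ids = {u['user_id'] for u in users}
    return Counter(
        tx['user_id'] for tx in transactions if tx['user_id'] not in user_ids
    )

# Hypothetical sample data
users = [{'user_id': 1}, {'user_id': 2}]
transactions = [{'user_id': 1}, {'user_id': 3}, {'user_id': 3}, {'user_id': 4}]
counts = missing_user_id_counts(transactions, users)
print(counts.most_common())
```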

**Pattern D** — Category distribution drift (per-key counts)

**Problem:** Detect categories whose proportion changed beyond a threshold between two windows.
**Task:** Return categories whose absolute drift exceeds a threshold. The threshold is expressed as a proportion, so 0.05 means 5 percentage points.

In [0]:
from collections import Counter

def detect_category_drift(rows_window_a, rows_window_b, category_key='category', threshold_pp=0.05):
    a_counts = Counter(r[category_key] for r in rows_window_a)
    b_counts = Counter(r[category_key] for r in rows_window_b)
    a_total = sum(a_counts.values()) or 1
    b_total = sum(b_counts.values()) or 1

    drift = {}
    all_keys = set(a_counts) | set(b_counts)
    for k in all_keys:
        pa = a_counts.get(k, 0) / a_total
        pb = b_counts.get(k, 0) / b_total
        if abs(pa - pb) > threshold_pp:
            drift[k] = {'p_a': pa, 'p_b': pb, 'delta': pb - pa}
    return drift

**Idea:** Use Counters to get per-key frequencies and compare normalized proportions between windows.

**Time & Space Complexity**
Time: O(n + m) to count both windows (n and m rows), plus O(k) to compare categories.
Space: O(k) where k is number of unique categories.
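
A small worked example of the comparison, with invented windows. The helper category_shares is a hypothetical compact restatement of the counting step, not part of the function above.

```python
from collections import Counter

def category_shares(rows, category_key='category'):
    counts = Counter(r[category_key] for r in rows)
    total = sum(counts.values()) or 1   # guard against an empty window
    return {k: c / total for k, c in counts.items()}

# Window A: a=60%, b=40%.  Window B: a=50%, b=40%, c=10%.
window_a = [{'category': 'a'}] * 6 + [{'category': 'b'}] * 4
window_b = [{'category': 'a'}] * 5 + [{'category': 'b'}] * 4 + [{'category': 'c'}]

shares_a = category_shares(window_a)
shares_b = category_shares(window_b)
threshold_pp = 0.05                      # 5 percentage points
drifted = sorted(
    k for k in set(shares_a) | set(shares_b)
    if abs(shares_a.get(k, 0.0) - shares_b.get(k, 0.0)) > threshold_pp
)
print(drifted)
```

Category a drops by 10 points and c appears with 10 points, so both exceed the threshold; b is unchanged and is not reported.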

**Pattern E** — Build metric tables from event logs

**Problem:** From event logs compute per event_type: count and average latency.
**Task:** Output list of metric rows {event_type, count, avg_latency_ms}.

In [0]:
from collections import defaultdict

def build_metrics_from_events(events):
    stats = defaultdict(lambda: {'count': 0, 'sum_latency': 0.0})
    for e in events:
        et = e['event_type']
        stats[et]['count'] += 1
        stats[et]['sum_latency'] += float(e.get('latency_ms', 0.0))
    # finalize
    metrics = []
    for et, s in stats.items():
        metrics.append({
            'event_type': et,
            'count': s['count'],
            'avg_latency_ms': s['sum_latency'] / s['count'] if s['count'] else 0.0
        })
    return metrics

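**Idea:** Single-pass aggregation: accumulate sum and count per key, then compute averages at the end. This suits streaming input. Events without latency_ms are counted with latency 0.0, which pulls the average down.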

**Time & Space Complexity**
Time: O(n).
Space: O(k) for one entry per event_type.
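
If events without a latency value should not affect the average, one option is to track a separate latency count. This is a sketch of that variation, not the behaviour of the function above; build_metrics_skip_missing and the sample events are hypothetical.

```python
from collections import defaultdict

def build_metrics_skip_missing(events):
    stats = defaultdict(lambda: {'count': 0, 'latency_n': 0, 'latency_sum': 0.0})
    for e in events:
        s = stats[e['event_type']]
        s['count'] += 1
        latency = e.get('latency_ms')
        if latency is not None:          # only measured events feed the average
            s['latency_n'] += 1
            s['latency_sum'] += float(latency)
    return [
        {
            'event_type': et,
            'count': s['count'],
            # None signals "no latency observed" instead of a misleading 0.0
            'avg_latency_ms': s['latency_sum'] / s['latency_n'] if s['latency_n'] else None,
        }
        for et, s in stats.items()
    ]

# Hypothetical sample events
events = [
    {'event_type': 'click', 'latency_ms': 100},
    {'event_type': 'click'},                      # latency missing
    {'event_type': 'click', 'latency_ms': 200},
    {'event_type': 'view'},
]
metrics = build_metrics_skip_missing(events)
print(metrics)
```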

**Pattern F** — Summarize logs by hour (timestamp bucketing)

**Problem:** Summarize number of events per hour bucket.
**Task:** Return {hour_bucket: count} where hour_bucket = "YYYY-MM-DDTHH:00:00".

In [0]:
from collections import defaultdict

def summarize_by_hour(rows, ts_field='ts'):
    out = defaultdict(int)
    for r in rows:
        # assume ISO ts like "2025-11-19T10:01:00"
        hour_bucket = r[ts_field][:13] + ":00:00"   # "YYYY-MM-DDTHH:00:00"
        out[hour_bucket] += 1
    return dict(out)

**Idea:** Bucket by slicing ISO timestamp to hour — cheap and fast without parsing when format is stable.

**Time & Space Complexity**
Time: O(n).
Space: O(h) where h = number of hour buckets (≤ n).
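
String slicing breaks down when timestamps carry different UTC offsets, because the same instant can then fall into different string buckets. A parsing-based sketch is shown below. It assumes that naive timestamps are already in UTC, which may not hold for your data; hour_bucket_utc and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def hour_bucket_utc(ts):
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return dt.strftime('%Y-%m-%dT%H:00:00')

# Hypothetical sample rows
rows = [
    {'ts': '2025-11-19T10:01:00'},
    {'ts': '2025-11-19T12:30:00+02:00'},  # 10:30 in UTC
    {'ts': '2025-11-19T11:05:00'},
]
counts = Counter(hour_bucket_utc(r['ts']) for r in rows)
print(dict(counts))
```

Parsing costs more per row than slicing, so it is worth it mainly when the format is not uniform. Note that datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 onward.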

**Pattern G** — Extract unique users per hour

**Problem:** Count unique users seen in each hour.
**Task:** Return {hour_bucket: unique_user_count}.

In [0]:
from collections import defaultdict

def unique_users_per_hour(rows, user_field='user_id', ts_field='ts'):
    sets = defaultdict(set)
    for r in rows:
        hour = r[ts_field][:13] + ":00:00"
        sets[hour].add(r[user_field])
    return {hour: len(s) for hour, s in sets.items()}

**Idea:** Maintain a set per bucket to deduplicate users, then take lengths for unique counts.

**Time & Space Complexity**
Time: O(n) expected, since each set insertion is O(1) on average.
Space: O(u) where u = number of distinct (hour, user) pairs (O(n) in the worst case).
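
A short usage sketch with invented rows. The function is restated compactly so the cell is self-contained.

```python
from collections import defaultdict

def unique_users_per_hour(rows, user_field='user_id', ts_field='ts'):
    sets = defaultdict(set)
    for r in rows:
        sets[r[ts_field][:13] + ":00:00"].add(r[user_field])
    return {hour: len(s) for hour, s in sets.items()}

# Hypothetical sample rows: u1 appears twice in the 10:00 bucket
rows = [
    {'user_id': 'u1', 'ts': '2025-11-19T10:01:00'},
    {'user_id': 'u1', 'ts': '2025-11-19T10:59:00'},
    {'user_id': 'u2', 'ts': '2025-11-19T10:30:00'},
    {'user_id': 'u1', 'ts': '2025-11-19T11:00:00'},
]
result = unique_users_per_hour(rows)
print(result)
```

The repeated u1 event in the 10:00 bucket is counted once, while the same user in the 11:00 bucket counts again because deduplication is per bucket.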

**✔ Grouping**

defaultdict(list)

tuple keys for multi-column grouping
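
The two grouping items can be combined in one helper. This is a minimal sketch; group_rows and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_rows(rows, *fields):
    # Tuple key built from the requested columns; rows keep input order per group
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[f] for f in fields)].append(r)
    return dict(groups)

# Hypothetical sample rows
rows = [
    {'region': 'EU', 'tier': 'free', 'user_id': 1},
    {'region': 'EU', 'tier': 'paid', 'user_id': 2},
    {'region': 'EU', 'tier': 'free', 'user_id': 3},
    {'region': 'US', 'tier': 'free', 'user_id': 4},
]
groups = group_rows(rows, 'region', 'tier')
sizes = {k: len(v) for k, v in groups.items()}
print(sizes)
```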

**✔ Aggregation**

counters

sum / count / avg

per-key accumulate patterns

**✔ Joins**

hash joins

left join

semi join

anti join

composite key join

JSON-enriched join
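
Left and semi joins reuse the same index-then-probe structure as the hash join. The sketch below assumes dims is a list (it is iterated twice) and that fact and dimension field names do not collide apart from the join key; left_join, semi_join, and the sample data are hypothetical.

```python
def left_join(facts, dims, key):
    index = {d[key]: d for d in dims}
    dim_fields = {k for d in dims for k in d if k != key}
    out = []
    for f in facts:
        d = index.get(f[key])
        row = dict(f)
        for k in dim_fields:
            # Unmatched facts are kept, with None for every dimension field
            row[k] = d[k] if d is not None and k in d else None
        out.append(row)
    return out

def semi_join(facts, dims, key):
    # Keep facts that have a match, without adding any dimension fields
    keys = {d[key] for d in dims}
    return [f for f in facts if f[key] in keys]

# Hypothetical sample data
facts = [{'tx_id': 'a', 'user_id': 1}, {'tx_id': 'b', 'user_id': 3}]
dims = [{'user_id': 1, 'country': 'DE'}]
left = left_join(facts, dims, 'user_id')
semi = semi_join(facts, dims, 'user_id')
print(left)
print(semi)
```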

**✔ Dedup**

by ID

by row

using sets
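
A sketch of the two dedup flavours. dedup_by_id keeps the last row seen per ID, which is one possible policy among several; dedup_exact_rows assumes string keys and hashable values. Both names and the sample rows are hypothetical.

```python
def dedup_by_id(rows, id_field='id'):
    latest = {}
    for r in rows:
        latest[r[id_field]] = r          # later rows overwrite earlier ones
    return list(latest.values())

def dedup_exact_rows(rows):
    seen = set()
    out = []
    for r in rows:
        # Dicts are unhashable, so use a sorted tuple of items as a fingerprint
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(r)                # first occurrence is kept
    return out

# Hypothetical sample rows
rows = [
    {'id': 1, 'v': 'a'},
    {'id': 2, 'v': 'b'},
    {'id': 1, 'v': 'c'},
    {'id': 2, 'v': 'b'},                 # exact duplicate of the second row
]
by_id = dedup_by_id(rows)
by_row = dedup_exact_rows(rows)
print(by_id)
print(by_row)
```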

**✔ Common DE Scenarios**

building metric tables

log summarization

DQ checks

schema drift

fact-to-dimension enrichment
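
For the schema-drift and DQ items, one simple approach is to compare each row's keys against an expected field set and count the deviations. This is only a sketch of that idea; schema_drift_report and the sample rows are hypothetical.

```python
from collections import Counter

def schema_drift_report(rows, expected_fields):
    expected = set(expected_fields)
    missing = Counter()      # expected fields absent from a row
    unexpected = Counter()   # fields present but not expected
    for r in rows:
        keys = set(r)
        missing.update(expected - keys)
        unexpected.update(keys - expected)
    return {'missing': dict(missing), 'unexpected': dict(unexpected)}

# Hypothetical sample rows
rows = [
    {'user_id': 1, 'ts': '2025-11-19T10:01:00'},
    {'user_id': 2},                                          # ts missing
    {'user_id': 3, 'ts': '2025-11-19T10:05:00', 'region': 'EU'},  # extra field
]
report = schema_drift_report(rows, ['user_id', 'ts'])
print(report)
```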