**1. What This Pattern Solves**

Aggregate numeric metrics by a key during ETL

Compute sums, counts, mins, maxes, averages

Build fact tables from event-level or transaction-level data

Replace inefficient nested loops or repeated scans
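
As a rough illustration of the metrics listed above, the sketch below tracks sum, count, min, and max per key in a single pass and derives the average at the end. The sample `data` and the metric names in the inner dictionary are assumptions made for the example, not part of the original notes.

```python
# Hypothetical sample data: (key, value) pairs.
data = [("a", 10), ("b", 20), ("a", 5), ("b", 2)]

stats = {}
for key, value in data:
    s = stats.get(key)
    if s is None:
        # First time this key is seen: seed every metric with the value.
        stats[key] = {"sum": value, "count": 1, "min": value, "max": value}
    else:
        # Key already present: update each running metric in place.
        s["sum"] += value
        s["count"] += 1
        s["min"] = min(s["min"], value)
        s["max"] = max(s["max"], value)

# Averages are derived after the loop from the stored sum and count.
averages = {key: s["sum"] / s["count"] for key, s in stats.items()}
```

Storing sum and count rather than a running average keeps each update a simple addition and avoids dividing on every row.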

**2. SQL Equivalent**

In [0]:
%sql
SELECT key, SUM(value) AS total, COUNT(*) AS cnt
FROM table
GROUP BY key;
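
One way to convince yourself that the dictionary pattern matches the query is to run both on the same rows. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `events` and the sample rows are assumptions chosen for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (key, value) pairs.
rows = [("a", 10), ("b", 20), ("a", 5)]

# Dictionary version: one running total and one running count per key.
total = defaultdict(int)
cnt = defaultdict(int)
for key, value in rows:
    total[key] += value
    cnt[key] += 1
py_result = sorted((k, total[k], cnt[k]) for k in total)

# SQL version, run against an in-memory SQLite database for comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (key TEXT, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT key, SUM(value) AS total, COUNT(*) AS cnt "
    "FROM events GROUP BY key ORDER BY key"
).fetchall()
conn.close()

# Both approaches should produce the same (key, total, cnt) rows.
print(py_result == sql_result)
```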

**3. Core Idea**

Use a dictionary keyed by group

Store aggregated values incrementally

One pass, no re-scans

**4. Template Code (MEMORIZE THIS)**

In [0]:
from collections import defaultdict

agg = defaultdict(int)

for key, value in data:
    agg[key] += value


In [0]:
agg = {}
for row in rows:
    key = row[key_field]
    val = row[value_field]
    agg[key] = agg.get(key, 0) + val

In [0]:
# Minimal form: each item in data is a (key, value) pair
agg = {}
for k, x in data:
    agg[k] = agg.get(k, 0) + x

**5. Detailed Example**

In [0]:
## You have transactions

rows = [
  {"cust":"a", "amt": 10},
  {"cust":"b", "amt": 20},
  {"cust":"a", "amt": 5}
]


In [0]:
## Apply pattern
agg = {}
for r in rows:
    cust = r["cust"]
    amt = r["amt"]
    agg[cust] = agg.get(cust, 0) + amt

# Output : {"a": 15, "b": 20}

In [0]:
transactions = [
    ("A", 100),
    ("B", 50),
    ("A", 25),
    ("B", 75),
    ("C", 20),
]

from collections import defaultdict

revenue = defaultdict(int)

for user, amount in transactions:
    revenue[user] += amount

# Output : {"A": 125, "B": 125, "C": 20}

**6. Mini Practice Problems**

Aggregate total bytes transferred per IP from network logs

Compute total sales per product per day

Count events per user from clickstream data
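
The "per product per day" problem groups by two fields at once. A common approach, sketched here, is to use a tuple as the dictionary key. The `sales` records and their field names (`product`, `day`, `amt`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records; the field names are assumptions.
sales = [
    {"product": "pen", "day": "2025-01-01", "amt": 3},
    {"product": "pen", "day": "2025-01-01", "amt": 4},
    {"product": "pen", "day": "2025-01-02", "amt": 5},
    {"product": "ink", "day": "2025-01-01", "amt": 9},
]

totals = defaultdict(int)
for s in sales:
    # A (product, day) tuple is hashable, so it works as a composite key.
    totals[(s["product"], s["day"])] += s["amt"]
```

This mirrors `GROUP BY product, day` in SQL: each distinct combination of the two fields gets its own running total.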

In [0]:
## Problem 1 — Total spend per user
rows = [ {"u":"a","amt":5}, {"u":"b","amt":10}, {"u":"a","amt":7} ]

In [0]:
## Problem 2 — Count events by type : Same as frequency map, but using this pattern.

In [0]:
## Problem 3 — Sum of durations per day : Logs like:
{"day":"2025-01-01", "dur":120}

**7. Full Data Engineering Problem**

In [0]:
## Problem: Given API logs:
logs = [
  {"url":"/home", "ms":120},
  {"url":"/login", "ms":80},
  {"url":"/home", "ms":200},
  {"url":"/products", "ms":350}
]

In [0]:
## Compute the total response time per URL. Solution Skeleton:
agg = {}
for r in logs:
    url = r["url"]
    ms  = r["ms"]
    agg[url] = agg.get(url, 0) + ms

**Problem**

You ingest millions of payment events daily

Each event has (merchant_id, amount)

Need daily total revenue per merchant

In [0]:
# Expected output shape: {merchant_id: total_revenue}
from collections import defaultdict

# payment_events: iterable of (merchant_id, amount) pairs for one day
daily_revenue = defaultdict(float)

for merchant_id, amount in payment_events:
    daily_revenue[merchant_id] += amount
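
Summing money as `float` can accumulate small rounding errors over millions of events. A possible variant, sketched below, accumulates `decimal.Decimal` values instead. It assumes amounts arrive as strings such as `"19.99"`, and the sample `payment_events` is made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical events: (merchant_id, amount-as-string) pairs.
payment_events = [("m1", "19.99"), ("m2", "5.00"), ("m1", "0.01")]

# Decimal() with no arguments is Decimal("0"), so it works as the default.
daily_revenue = defaultdict(Decimal)

for merchant_id, amount in payment_events:
    # Parse from the string form to avoid binary float rounding.
    daily_revenue[merchant_id] += Decimal(amount)
```

Whether the extra cost of `Decimal` is worth it depends on the accuracy the downstream reports need.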


**8. Time & Space Complexity**

Time: O(n) — single pass over records

Space: O(k) — number of unique keys
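
To make the complexity claim concrete, the sketch below contrasts a re-scan approach, which walks the whole input once per distinct key, with the one-pass dictionary. The sample `data` is an assumption.

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) pairs.
data = [("a", 10), ("b", 20), ("a", 5)]

# Re-scan version: one full pass over data for every distinct key, O(n * k).
keys = {key for key, _ in data}
rescanned = {k: sum(v for key, v in data if key == k) for k in keys}

# One-pass version: each record is touched exactly once, O(n).
one_pass = defaultdict(int)
for key, value in data:
    one_pass[key] += value

# Both produce the same totals; only the amount of work differs.
print(rescanned == dict(one_pass))
```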

**9. Common Pitfalls**

❌ Re-scanning data for each key
✔ Aggregate in one pass

❌ Using a plain dict with manual "if key in agg" checks
✔ Use defaultdict, or agg.get(key, 0) as in the template

❌ Mixing aggregation logic with filtering
✔ Filter first, aggregate second

❌ Storing raw lists when only sums/counts needed
✔ Store only aggregated values
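
The "filter first, aggregate second" advice can be followed without a second pass by putting the filter in a generator that feeds the aggregation loop. In this sketch the filter rule (keep only positive amounts) and the helper name `valid` are assumptions for illustration.

```python
from collections import defaultdict

rows = [
    {"cust": "a", "amt": 10},
    {"cust": "b", "amt": -20},  # hypothetical refund row to be excluded
    {"cust": "a", "amt": 5},
]

def valid(rows):
    """Yield only rows with a positive amount (the rule is an assumption)."""
    for r in rows:
        if r["amt"] > 0:
            yield r

# The aggregation loop stays free of filtering logic.
agg = defaultdict(int)
for r in valid(rows):
    agg[r["cust"]] += r["amt"]
```

Because the generator is lazy, the data is still read only once, and the filter rule can change without touching the aggregation code.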