**Pattern 1** — Group by Single Key (customer_id)

**Problem:** Group all transactions by customer.

**Task:** Given a list of transaction rows, build a mapping customer_id → list of rows so we can later aggregate per customer (total spend, counts, etc.).

In [0]:
from collections import defaultdict

rows = [
    {"customer_id": 101, "amount": 50},
    {"customer_id": 102, "amount": 30},
    {"customer_id": 101, "amount": 20},
    {"customer_id": 103, "amount": 70},
    {"customer_id": 102, "amount": 10},
]

In [0]:
"""
Idea (Group by a single key)

Walk the list once:
- Use a dict: key = customer_id, value = list of all that customer's rows.
- For each row, compute key = row['customer_id'] and append row to groups[key].

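This mirrors the bucketing step behind SQL's GROUP BY customer_id.
(A literal SELECT * ... GROUP BY customer_id is rejected by most SQL engines,
because GROUP BY returns one row per group and needs aggregates for other columns.)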
"""

from collections import defaultdict

def group_by_customer(rows):
    groups = defaultdict(list)
    for row in rows:
        cid = row["customer_id"]
        groups[cid].append(row)
    return groups

groups = group_by_customer(rows)
# Example: groups[101] → list of all rows for customer 101

Why this is “Group by Single Key”
There is one dictionary key and one grouping dimension (customer). This pattern underlies most “per-user”, “per-account”, and “per-session” logic.

**Time & Space Complexity**

Time: one pass over rows → O(n)

Space: store all rows in the dict → O(n)
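
The task mentions aggregating per customer once the rows are grouped. A minimal sketch of that follow-up step, assuming the same row shape as above; `summarize_groups` and `sample_rows` are hypothetical names used only for illustration:

```python
from collections import defaultdict

def group_by_customer(rows):
    # Same single-key grouping as above: customer_id -> list of rows.
    groups = defaultdict(list)
    for row in rows:
        groups[row["customer_id"]].append(row)
    return groups

def summarize_groups(groups):
    # Hypothetical helper: total spend and row count for each customer.
    return {
        cid: {"total": sum(r["amount"] for r in cust_rows), "count": len(cust_rows)}
        for cid, cust_rows in groups.items()
    }

sample_rows = [
    {"customer_id": 101, "amount": 50},
    {"customer_id": 102, "amount": 30},
    {"customer_id": 101, "amount": 20},
    {"customer_id": 103, "amount": 70},
    {"customer_id": 102, "amount": 10},
]

summary = summarize_groups(group_by_customer(sample_rows))
# summary[101] → {"total": 70, "count": 2}
```

The aggregation is a second pass over the grouped rows, so the overall cost stays O(n).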

**Pattern 2** — Group by Composite Key (customer_id, order_date)

**Problem:** Group orders by (customer_id, order_date) to match the grain of a fact table or to compute per-day metrics for each customer.

**Task:** Given order lines, build mapping (customer_id, order_date) → list of rows.

In [0]:
rows = [
    {"customer_id": 101, "order_date": "2025-11-20", "order_id": 1, "amount": 50},
    {"customer_id": 101, "order_date": "2025-11-20", "order_id": 2, "amount": 30},
    {"customer_id": 101, "order_date": "2025-11-21", "order_id": 3, "amount": 40},
    {"customer_id": 102, "order_date": "2025-11-20", "order_id": 4, "amount": 25},
]

In [0]:
"""
Idea (Composite key = tuple)

Fact tables often use multiple columns as grain (customer_id + order_date).
We mimic that in Python by using a tuple key:

key = (row['customer_id'], row['order_date'])

All rows with the same key land in the same bucket.
"""

from collections import defaultdict

def group_by_customer_date(rows):
    groups = defaultdict(list)
    for row in rows:
        key = (row["customer_id"], row["order_date"])
        groups[key].append(row)
    return groups

groups = group_by_customer_date(rows)
# Example: groups[(101, "2025-11-20")] → 2 rows (order_id 1, 2)

Why this is “Group by Composite Key”
The grouping dimension is the combination of fields, which matches SQL's GROUP BY customer_id, order_date. It comes up often in data engineering (DE) interviews.

**Time & Space Complexity**

Time: one pass → O(n)

Space: store all rows in dict buckets → O(n)
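
If only a per-day metric is needed, the rows themselves do not have to be stored. A minimal sketch, assuming the same order-line shape as above, that accumulates the amount directly under the tuple key; `daily_totals` and `sample_orders` are hypothetical names:

```python
from collections import defaultdict

def daily_totals(rows):
    # Hypothetical helper: (customer_id, order_date) -> summed amount.
    totals = defaultdict(int)
    for row in rows:
        key = (row["customer_id"], row["order_date"])
        totals[key] += row["amount"]
    return dict(totals)

sample_orders = [
    {"customer_id": 101, "order_date": "2025-11-20", "order_id": 1, "amount": 50},
    {"customer_id": 101, "order_date": "2025-11-20", "order_id": 2, "amount": 30},
    {"customer_id": 101, "order_date": "2025-11-21", "order_id": 3, "amount": 40},
    {"customer_id": 102, "order_date": "2025-11-20", "order_id": 4, "amount": 25},
]

totals = daily_totals(sample_orders)
# totals[(101, "2025-11-20")] → 80
```

Because each key holds a single number instead of a list, space drops from O(n) to O(k), where k is the number of distinct (customer_id, order_date) pairs.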

**Pattern 3** — Group by Category → List Extraction

**Problem:** For each product category, collect the list of SKUs in that category.

**Task:** Build category → [sku1, sku2, ...] for fast lookups or downstream export.

In [0]:
items = [
    {"sku": "A100", "cat": "electronics"},
    {"sku": "A101", "cat": "electronics"},
    {"sku": "B200", "cat": "books"},
    {"sku": "B201", "cat": "books"},
    {"sku": "H300", "cat": "home"},
]

In [0]:
"""
Idea (Group by key, store only selected field)

We don't always need entire rows in each group.
Here:
- Group key = item["cat"]
- Stored value = item["sku"]

So each category maps to just the SKUs (lighter than full rows).
"""

from collections import defaultdict

def skus_by_category(items):
    items_by_cat = defaultdict(list)
    for item in items:
        cat = item["cat"]
        items_by_cat[cat].append(item["sku"])
    return items_by_cat

items_by_cat = skus_by_category(items)
# Example: items_by_cat["books"] → ["B200", "B201"]


Why this is “Group → List Extraction”
You group by one column but keep only one field (the SKU) in each group. This is a common way to build ID lists, email lists, and similar lookups.

**Time & Space Complexity**

Time: single pass over items → O(n)

Space: store every SKU in some list → O(n)
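
The task mentions fast lookups. If the main question is "does this SKU belong to this category?", one possible variant stores a set per category instead of a list. This is a sketch under that assumption; `sku_sets_by_category` and `sample_items` are hypothetical names:

```python
from collections import defaultdict

def sku_sets_by_category(items):
    # Hypothetical variant: category -> set of SKUs.
    # Sets give average O(1) membership checks and drop duplicate SKUs,
    # at the cost of losing the original insertion order.
    skus = defaultdict(set)
    for item in items:
        skus[item["cat"]].add(item["sku"])
    return skus

sample_items = [
    {"sku": "A100", "cat": "electronics"},
    {"sku": "A101", "cat": "electronics"},
    {"sku": "B200", "cat": "books"},
    {"sku": "B201", "cat": "books"},
    {"sku": "H300", "cat": "home"},
]

sku_sets = sku_sets_by_category(sample_items)
is_book = "B200" in sku_sets["books"]        # True
is_book_wrong = "A100" in sku_sets["books"]  # False
```

Keep the list version when order or duplicates matter, for example when exporting SKUs in the order they were read.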

**Pattern 4** — Group by Hour (Time Bucketing)

**Problem:** Group logs into hourly buckets to compute per-hour metrics.

**Task:** Given logs with timestamp strings, group them by hour, using keys of the form YYYY-MM-DD HH.

In [0]:
logs = [
    {"ts": "2025-11-20 10:01:05", "status": 200},
    {"ts": "2025-11-20 10:15:30", "status": 500},
    {"ts": "2025-11-20 11:00:00", "status": 200},
    {"ts": "2025-11-20 11:45:10", "status": 404},
]

In [0]:
"""
Idea (Truncate timestamp to bucket key)

For many real-time / monitoring tasks, we aggregate by hour or minute.
If ts is "YYYY-MM-DD HH:MM:SS", then:

hour_key = ts[:13]  # 'YYYY-MM-DD HH'

All logs that share the same hour_key go into the same group.
"""

from collections import defaultdict

def group_logs_by_hour(logs):
    groups = defaultdict(list)
    for log in logs:
        hour = log["ts"][:13]  # '2025-11-20 10'
        groups[hour].append(log)
    return groups

groups = group_logs_by_hour(logs)
# Example: groups["2025-11-20 10"] → 2 logs


Why this is “Group by Hour (Time Bucketing)”
We group by time bucket rather than by exact timestamp. The same slicing gives minute buckets with ts[:16]. This is standard for throughput, error-rate, and alerting pipelines.

**Time & Space Complexity**

Time: one pass over logs → O(n)

Space: store all logs in hour-buckets → O(n)
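
Slicing `ts[:13]` silently accepts malformed timestamps. A minimal sketch of a stricter variant that parses the timestamp first and then computes a per-hour error rate. It assumes that "error" means a status of 500 or higher, which the original does not define; `hour_key`, `error_rate_by_hour`, and `sample_logs` are hypothetical names:

```python
from collections import defaultdict
from datetime import datetime

def hour_key(ts):
    # Parsing raises ValueError on input that is not 'YYYY-MM-DD HH:MM:SS'.
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H")

def error_rate_by_hour(logs):
    # Assumption: status >= 500 counts as an error.
    total = defaultdict(int)
    errors = defaultdict(int)
    for log in logs:
        key = hour_key(log["ts"])
        total[key] += 1
        if log["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / total[key] for key in total}

sample_logs = [
    {"ts": "2025-11-20 10:01:05", "status": 200},
    {"ts": "2025-11-20 10:15:30", "status": 500},
    {"ts": "2025-11-20 11:00:00", "status": 200},
    {"ts": "2025-11-20 11:45:10", "status": 404},
]

rates = error_rate_by_hour(sample_logs)
# rates → {"2025-11-20 10": 0.5, "2025-11-20 11": 0.0}
```

Parsing is slower than slicing, so the slice remains a reasonable choice when the timestamp format is already guaranteed upstream.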

**Pattern 5** — Nested Grouping (Year → Month → Rows)

**Problem:** Group events hierarchically by year and then by month.

**Task:** Build year → month → list of rows, to later run year-level and month-level aggregations.

In [0]:
rows = [
    {"year": 2025, "month": 11, "views": 100},
    {"year": 2025, "month": 11, "views": 200},
    {"year": 2025, "month": 12, "views": 150},
    {"year": 2024, "month": 12, "views": 90},
]

In [0]:
"""
Idea (Nested defaultdicts)

We need a 2-level grouping: first by year, then by month.

Use:
groups = defaultdict(lambda: defaultdict(list))

Then for each row:
- groups[year][month].append(row)

This mirrors a drill-down hierarchy: year → month → facts.
"""

from collections import defaultdict

def nested_group_year_month(rows):
    groups = defaultdict(lambda: defaultdict(list))
    for r in rows:
        y = r["year"]
        m = r["month"]
        groups[y][m].append(r)
    return groups

groups = nested_group_year_month(rows)
# Example: groups[2025][11] → 2 rows, groups[2025][12] → 1 row


Why this is “Nested Grouping”
Instead of a single dict keyed by a composite tuple, we model the hierarchy explicitly: groups[year][month]. This closely resembles Hive-style partition folders such as /year=2025/month=11/.

**Time & Space Complexity**

Time: one pass over rows → O(n)

Space: every row stored once in a nested structure → O(n)
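
The task mentions running year-level and month-level aggregations on the nested structure. A minimal sketch of both roll-ups, assuming the same row shape as above; `views_by_month`, `views_by_year`, and `sample_events` are hypothetical names:

```python
from collections import defaultdict

def nested_group_year_month(rows):
    # Same nested grouping as above: year -> month -> list of rows.
    groups = defaultdict(lambda: defaultdict(list))
    for r in rows:
        groups[r["year"]][r["month"]].append(r)
    return groups

def views_by_month(groups):
    # Hypothetical helper: year -> month -> total views.
    return {
        year: {month: sum(r["views"] for r in month_rows)
               for month, month_rows in months.items()}
        for year, months in groups.items()
    }

def views_by_year(groups):
    # Hypothetical helper: year -> total views across all of its months.
    return {
        year: sum(r["views"] for month_rows in months.values() for r in month_rows)
        for year, months in groups.items()
    }

sample_events = [
    {"year": 2025, "month": 11, "views": 100},
    {"year": 2025, "month": 11, "views": 200},
    {"year": 2025, "month": 12, "views": 150},
    {"year": 2024, "month": 12, "views": 90},
]

nested = nested_group_year_month(sample_events)
monthly = views_by_month(nested)   # monthly[2025][11] → 300
yearly = views_by_year(nested)     # yearly[2025] → 450
```

One caveat with nested defaultdicts: indexing a missing key, such as `nested[2023]`, inserts an empty entry. Use `in` or `.get()` for read-only checks.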