**1. What This Pattern Solves**

This pattern buckets rows into groups based on a key.

Used for:

Grouping transactions by customer

Grouping logs by date or hour

Building Bronze → Silver aggregations

Partition-like transformations

Preparing for a reduce/aggregation step

Constructing time-series buckets

Grouping events by device/user/session

Python version of SQL GROUP BY

Among data engineering (DE) Python patterns, this is one of the most frequently used, second only to the Frequency Map.

**2. SQL Equivalent**
SELECT key, ARRAY_AGG(row)
FROM table
GROUP BY key;

**3. Core Idea**

Use defaultdict(list) so each key automatically starts with an empty list.
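
A minimal sketch of why this helps, using made-up keys "a" and "b" purely for illustration. A plain dict fails on the first append to an unseen key, while defaultdict(list) creates the empty list on demand:

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is read.
plain = {}
try:
    plain["a"].append(1)
    plain_raised = False
except KeyError:
    plain_raised = True

# defaultdict(list) calls list() for a missing key, stores the new
# empty list under that key, and returns it, so append works at once.
groups = defaultdict(list)
groups["a"].append(1)
groups["a"].append(2)
groups["b"].append(3)

print(plain_raised)  # True
print(dict(groups))  # {'a': [1, 2], 'b': [3]}
```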

**4. Template Code** (MEMORIZE THIS)

In [0]:
from collections import defaultdict

groups = defaultdict(list)

for row in rows:
    key = row[key_field]
    groups[key].append(row)

In [0]:
## Compact form of the same template (reuses the import above):
groups = defaultdict(list)
for r in rows:
    groups[r[key_field]].append(r)

**5. Detailed Example**

In [0]:
## Given:
rows = [
    {"cust": "a", "amt": 10},
    {"cust": "b", "amt": 20},
    {"cust": "a", "amt": 15}
]

In [0]:
## Apply pattern:

from collections import defaultdict

groups = defaultdict(list)

for r in rows:
    groups[r["cust"]].append(r)

## Result:
{
  "a": [ {"cust":"a","amt":10}, {"cust":"a","amt":15} ],
  "b": [ {"cust":"b","amt":20} ]
}
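
Once rows are grouped, a reduce/aggregation step is usually a short follow-up. The sketch below repeats the example data and sums `amt` per customer; the name `totals` is an assumption chosen for illustration, not part of the pattern itself:

```python
from collections import defaultdict

rows = [
    {"cust": "a", "amt": 10},
    {"cust": "b", "amt": 20},
    {"cust": "a", "amt": 15},
]

# Grouping step: same pattern as above.
groups = defaultdict(list)
for r in rows:
    groups[r["cust"]].append(r)

# Reduce step: collapse each customer's rows into a single total.
totals = {
    cust: sum(r["amt"] for r in cust_rows)
    for cust, cust_rows in groups.items()
}

print(totals)  # {'a': 25, 'b': 20}
```

If only the totals are needed, the rows do not have to be stored at all; grouping first is useful when later steps still need the raw rows.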


**6. Mini Practice Problems**

In [0]:
## Problem 1: Group orders by user (the "user" field):

[
  {"user":1,"order":100},
  {"user":2,"order":200},
  {"user":1,"order":300}
]

In [0]:
## Problem 2: Group events by event_type:
## Input
["click","open","click","scroll","open"]
## Output
{
  "click": ["click","click"],
  "open": ["open","open"],
  "scroll": ["scroll"]
}


In [0]:
## Problem 3: Group logs by date extracted from timestamp:
"2025-01-01 10:00:00"
"2025-01-01 11:00:00"
"2025-01-02 09:00:00"

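One possible approach to Problem 3, offered as a sketch rather than the reference answer. It assumes every timestamp has the form "YYYY-MM-DD HH:MM:SS", so the date is the text before the first space; the names `timestamps` and `logs_by_date` are illustrative:

```python
from collections import defaultdict

timestamps = [
    "2025-01-01 10:00:00",
    "2025-01-01 11:00:00",
    "2025-01-02 09:00:00",
]

logs_by_date = defaultdict(list)
for ts in timestamps:
    # The grouping key is derived from the row, not read from a field.
    date = ts.split(" ", 1)[0]
    logs_by_date[date].append(ts)

print(dict(logs_by_date))
```

The only change from the template is that the key is computed. For less regular input, parsing with `datetime.strptime` would be a safer way to derive the date.
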
**7. Full Data Engineering Problem**

In [0]:
## Problem: You receive login logs as dictionaries:

logs = [
  {"user":"alice", "ts":"2025-01-01 10:00"},
  {"user":"bob",   "ts":"2025-01-01 10:05"},
  {"user":"alice", "ts":"2025-01-01 11:00"},
  {"user":"alice", "ts":"2025-01-02 09:00"}
]

In [0]:
## Task: Group logs by user, so you get:

{
  "alice": [
    {...}, {...}, {...}
  ],
  "bob": [
    {...}
  ]
}

In [0]:
## Solution Pattern (Skeleton only):

from collections import defaultdict

groups = defaultdict(list)

for r in logs:
    groups[r["user"]].append(r)
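
For reference, here is the skeleton run end to end on the sample logs. The conversion to a plain dict and the `login_counts` summary are additions for illustration and are not required by the task:

```python
from collections import defaultdict

logs = [
    {"user": "alice", "ts": "2025-01-01 10:00"},
    {"user": "bob",   "ts": "2025-01-01 10:05"},
    {"user": "alice", "ts": "2025-01-01 11:00"},
    {"user": "alice", "ts": "2025-01-02 09:00"},
]

groups = defaultdict(list)
for r in logs:
    groups[r["user"]].append(r)

# A plain dict no longer creates entries when an unknown user is looked up.
result = dict(groups)

# Quick check on the grouping: number of logins per user.
login_counts = {user: len(user_logs) for user, user_logs in result.items()}

print(login_counts)  # {'alice': 3, 'bob': 1}
```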

**8. Time & Space Complexity**

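Time: O(n), one pass over the rows with an average O(1) dict lookup and list append per row

Space: O(n) for storing all rows, plus one dict entry per distinct key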

**9. Common Pitfalls & Mistakes**

In [0]:
## ❌ Using a normal dict and writing:

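if key not in groups:
    groups[key] = []
groups[key].append(row)

**→ Verbose and error-prone: leaving out the membership check raises KeyError on the first row of each new key.**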

❌ Appending the key instead of the raw row, which throws away the row data.

❌ Trying to group with a single list comprehension. It produces one flat list, not a mapping from key to rows.

✔ Correct: make defaultdict(list) the default choice for grouping.
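
Two related details, sketched under the assumption that the rows look like the earlier customer example. `dict.setdefault` gives the same one-line grouping with a plain dict, and a defaultdict inserts a key even when that key is only read, which can surprise later lookups:

```python
from collections import defaultdict

rows = [
    {"cust": "a", "amt": 10},
    {"cust": "b", "amt": 20},
    {"cust": "a", "amt": 15},
]

# Plain-dict alternative: setdefault stores [] for a new key and returns the list.
plain_groups = {}
for r in rows:
    plain_groups.setdefault(r["cust"], []).append(r)

groups = defaultdict(list)
for r in rows:
    groups[r["cust"]].append(r)

# Side effect to watch for: reading a missing key on a defaultdict inserts it.
_ = groups["zzz"]
print(sorted(groups))         # ['a', 'b', 'zzz']
print("zzz" in plain_groups)  # False

# .get() does not trigger the default factory, so nothing is inserted.
print(groups.get("yyy"))      # None
print("yyy" in groups)        # False
```

Converting the finished result with `dict(groups)` is one way to avoid the accidental inserts once grouping is done.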