**Pattern 1** — Single-Key Hash Join (Classic)

**Problem:** Enrich a fact table (events) with attributes from a small dimension (users) by user_id.

**Task:** For each event in events, attach the matching user record from users on user_id (inner join).

In [0]:
events = [
    {"event_id": 1, "user_id": 101, "action": "click"},
    {"event_id": 2, "user_id": 102, "action": "view"},
    {"event_id": 3, "user_id": 103, "action": "purchase"},
]

users = [
    {"user_id": 101, "country": "US"},
    {"user_id": 102, "country": "IN"},
]

In [0]:
"""
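Idea (Build hash index on right, scan left)
Build a dict keyed by the join key from the right side, which should be the
smaller input. This assumes the key is unique on the right; if it repeats,
only the last row per key is kept.
Then scan the left side and probe the dict in O(1) average time per row.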
"""

def hash_join(left, right, key):
    index = {r[key]: r for r in right}   # build hash index on right

    out = []
    for l in left:                       # scan left, probe right index
        if l[key] in index:
            out.append((l, index[l[key]]))
    return out


**Why this is Hash Join**
We materialize the right side into an in-memory hash table keyed by the join key (the build phase), then scan the left side and look up each row's key in average constant time (the probe phase). Build then probe is the classic hash join.

**Time & Space Complexity**

Let n = len(left), m = len(right)

Time: build index O(m) + probe O(n) → O(n + m)

Space: hash index stores all right rows → O(m)
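
The dict comprehension `{r[key]: r for r in right}` keeps one right row per key, so it silently drops earlier rows when a key repeats. Below is a minimal sketch of a multi-match variant, assuming the right side may contain duplicate keys. The name `hash_join_multi` and the duplicated user row are hypothetical, added only for illustration.

```python
from collections import defaultdict

def hash_join_multi(left, right, key):
    """Inner hash join that keeps every match when right keys repeat."""
    index = defaultdict(list)            # key -> list of right rows
    for r in right:
        index[r[key]].append(r)

    out = []
    for l in left:
        # .get() does not create an empty entry in the defaultdict on a miss
        for r in index.get(l[key], []):
            out.append((l, r))
    return out

events = [
    {"event_id": 1, "user_id": 101, "action": "click"},
    {"event_id": 2, "user_id": 103, "action": "purchase"},
]
users = [
    {"user_id": 101, "country": "US"},
    {"user_id": 101, "country": "CA"},   # hypothetical duplicate key
]

pairs = hash_join_multi(events, users, "user_id")
print([(l["event_id"], r["country"]) for l, r in pairs])
# prints [(1, 'US'), (1, 'CA')]
```

With duplicates the probe cost also depends on the number of emitted pairs, so the running time becomes O(n + m + number of output pairs).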

**Pattern 2** — Left Join

**Problem:** Keep all fact rows even if there is no matching dimension row.

**Task:** For each row in left (fact), attach the matching row from right (dimension) or None if missing (left outer join).

In [0]:
orders = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 2, "cust_id": 20},
    {"order_id": 3, "cust_id": 30},  # customer missing in dim
]

customers = [
    {"cust_id": 10, "segment": "GOLD"},
    {"cust_id": 20, "segment": "SILVER"},
]

In [0]:
"""
Idea (Hash index + default on miss)
Same as inner hash join, but instead of skipping missing matches,
we still return the left row and use dict.get() to attach None when
the key isn’t present on the right side.
"""

def left_join(left, right, key):
    idx = {r[key]: r for r in right}
    out = []
    for l in left:
        out.append((l, idx.get(l[key])))   # None if not found
    return out

**Why this is Hash Join**
We still use an in-memory hash index for O(1) probes; the only difference is we do not filter unmatched rows.

**Time & Space Complexity**

n = len(left), m = len(right)

Time: build index O(m) + probe all left rows O(n) → O(n + m)

Space: hash index on right → O(m)
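
Downstream code often wants one flat record per left row rather than a `(left, right_or_None)` pair. The sketch below shows one possible way to do that, assuming the caller knows which right-side columns to attach. The name `left_join_flat` and its `right_cols` and `default` parameters are hypothetical.

```python
def left_join_flat(left, right, key, right_cols, default=None):
    """Left outer join returning one merged dict per left row."""
    idx = {r[key]: r for r in right}   # assumes key is unique on the right
    out = []
    for l in left:
        match = idx.get(l[key], {})    # {} on a miss, so .get() below falls back
        row = dict(l)                  # copy so the input row is not mutated
        for col in right_cols:
            row[col] = match.get(col, default)
        out.append(row)
    return out

orders = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 3, "cust_id": 30},    # customer missing in dim
]
customers = [{"cust_id": 10, "segment": "GOLD"}]

result = left_join_flat(orders, customers, "cust_id", ["segment"])
print(result[0])   # prints {'order_id': 1, 'cust_id': 10, 'segment': 'GOLD'}
print(result[1])   # prints {'order_id': 3, 'cust_id': 30, 'segment': None}
```

Every output row has the same columns, which keeps unmatched rows easy to spot by filtering on the default value.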

**Pattern 3** — Semi Join (“Exists Join”)

**Problem:** Filter fact rows to only those with a matching dimension key.

**Task:** From transactions, return only rows whose merchant_id exists in the merchants dimension.

In [0]:
transactions = [
    {"txn_id": 1, "merchant_id": "A"},
    {"txn_id": 2, "merchant_id": "B"},
    {"txn_id": 3, "merchant_id": "C"},
]

merchants = [
    {"merchant_id": "A"},
    {"merchant_id": "C"},
]


In [0]:
"""
Idea (Hash set of right keys)
We don’t need right-side values, only existence.
Build a set of join keys from right, then keep left rows whose key
is in that set.
"""

def semi_join(left, right, key):
    right_keys = {r[key] for r in right}       # hash set of keys
    return [l for l in left if l[key] in right_keys]

**Why this is Hash Join**
This is a hash-based “exists” check: the hash table is a set of keys instead of full rows. It is commonly used to filter a fact table down to rows with valid keys.

**Time & Space Complexity**

n = len(left), m = len(right)

Time: build set O(m) + scan/probe O(n) → O(n + m)

Space: set of right keys → O(m)
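
One property worth seeing in action: because the right side is reduced to a set of keys, a semi join returns each left row at most once, even when the key repeats on the right. The sketch below illustrates this with a hypothetical `version` column on the dimension.

```python
def semi_join(left, right, key):
    right_keys = {r[key] for r in right}       # duplicates collapse in the set
    return [l for l in left if l[key] in right_keys]

transactions = [
    {"txn_id": 1, "merchant_id": "A"},
    {"txn_id": 2, "merchant_id": "B"},
]
# Hypothetical dimension where merchant A appears twice
merchants = [
    {"merchant_id": "A", "version": 1},
    {"merchant_id": "A", "version": 2},
]

kept = semi_join(transactions, merchants, "merchant_id")
print([t["txn_id"] for t in kept])
# prints [1]
```

A join that returned every matching pair would list transaction 1 twice here, which is why the semi join is the safer choice when only existence matters.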

**Pattern 4** — Anti Join (“Not Exists”)

**Problem:** Find fact records whose key has no matching row in the dimension (a data-quality check for missing keys).

**Task:** From claims, return only rows whose provider_id is not present in providers.

In [0]:
claims = [
    {"claim_id": 1, "provider_id": "P1"},
    {"claim_id": 2, "provider_id": "P2"},
    {"claim_id": 3, "provider_id": "P3"},
]

providers = [
    {"provider_id": "P1"},
    {"provider_id": "P3"},
]

In [0]:
"""
Idea (Hash set + negative filter)
Same set of right keys as semi join, but we keep only the left rows
whose key is NOT in that set.
"""

def anti_join(left, right, key):
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]

**Why this is Hash Join**
We still hash the right side, but we test “not exists” instead of “exists”. This is a classic pattern for missing foreign-key checks and data-quality validation.

**Time & Space Complexity**

n = len(left), m = len(right)

Time: build set O(m) + scan O(n) → O(n + m)

Space: set of right keys → O(m)
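
For a data-quality report, the list of orphan rows is often less useful than a summary of which keys are missing and how often. A minimal sketch of that idea follows, assuming a per-key count is what the report needs. The name `orphan_key_counts` and the sample rows are hypothetical.

```python
from collections import Counter

def orphan_key_counts(left, right, key):
    """Count left rows per key value that has no match on the right."""
    right_keys = {r[key] for r in right}
    return Counter(l[key] for l in left if l[key] not in right_keys)

claims = [
    {"claim_id": 1, "provider_id": "P1"},
    {"claim_id": 2, "provider_id": "P2"},
    {"claim_id": 3, "provider_id": "P2"},
    {"claim_id": 4, "provider_id": "P9"},
]
providers = [{"provider_id": "P1"}]

report = orphan_key_counts(claims, providers, "provider_id")
print(report.most_common())
# prints [('P2', 2), ('P9', 1)]
```

The cost is the same O(n + m) as the anti join itself, plus O(d) extra space for the d distinct orphan keys.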

**Pattern 5** — Composite-Key Join

**Problem:** Join fact and dimension on multiple keys (e.g., id + date), where either alone is not unique.

**Task:** Enrich daily balance facts with dimension data keyed by (account_id, as_of_date).

In [0]:
facts = [
    {"account_id": 1, "as_of_date": "2025-11-01", "balance": 100},
    {"account_id": 1, "as_of_date": "2025-11-02", "balance": 120},
]

dims = [
    {"account_id": 1, "as_of_date": "2025-11-01", "status": "ACTIVE"},
    {"account_id": 1, "as_of_date": "2025-11-02", "status": "ACTIVE"},
]

In [0]:
"""
Idea (Tuple as hash key)
Build the hash index on a tuple of columns that together form a unique
key. Use the same tuple from each left row to probe the index.
"""

def composite_hash_join(left, right, keys):
    # keys is a list like ["account_id", "as_of_date"]
    def make_key(row):
        return tuple(row[k] for k in keys)

    idx = {make_key(r): r for r in right}

    out = []
    for l in left:
        k = make_key(l)
        if k in idx:
            out.append((l, idx[k]))
    return out

**Why this is Hash Join**
A hash map accepts any hashable object as a key. Using a tuple of several fields gives a hash join on a composite key, which is very common for change data capture (CDC), slowly changing dimensions (SCD), and daily snapshots.

**Time & Space Complexity**

n = len(left), m = len(right)

Time: index build O(m) + probe O(n), where each key computation costs O(k) for k join columns → O(k·(n + m)). Since k is usually a small constant, this is O(n + m) in practice.

Space: index holds right rows keyed by tuple → O(m)
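
A tuple is preferable to gluing the columns into one string because concatenation can make distinct key combinations collide. The sketch below contrasts the two approaches. `concat_key` is a deliberately naive helper and `branch_id` is a hypothetical column, both used only for illustration.

```python
def tuple_key(row, keys):
    # Same key construction as make_key in composite_hash_join
    return tuple(row[k] for k in keys)

def concat_key(row, keys):
    # Naive alternative shown only for contrast: field boundaries are lost
    return "".join(str(row[k]) for k in keys)

a = {"account_id": 1, "branch_id": 23}
b = {"account_id": 12, "branch_id": 3}
keys = ["account_id", "branch_id"]

print(concat_key(a, keys) == concat_key(b, keys))   # prints True  ("123" == "123")
print(tuple_key(a, keys) == tuple_key(b, keys))     # prints False ((1, 23) != (12, 3))
```

A tuple also preserves each field's type, so the integer 1 and the string "1" remain different keys.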

**Pattern 6** — Join JSON Events to Dimension Data

**Problem:** Enrich streaming/JSON events with user attributes stored in a dimension table.

**Task:** For each event (JSON dict) with user_id, add country from user_dim keyed by user_id. Use a safe default if user is missing.

In [0]:
events = [
    {"event_id": 1, "user_id": 101, "action": "login"},
    {"event_id": 2, "user_id": 999, "action": "click"},  # unknown user
]

user_dim = [
    {"user_id": 101, "country": "US"},
    {"user_id": 102, "country": "IN"},
]

In [0]:
"""
Idea (Hash join + safe lookup)
Build a hash index on dimension keyed by user_id.
For each JSON event, do a dict.get on the index, then safely extract
the field you want with another get, providing defaults to avoid KeyError.
"""

def enrich_events_with_country(events, dim, user_key="user_id"):
    idx = {d[user_key]: d for d in dim}

    for e in events:                             # note: mutates each event in place
        user = idx.get(e.get(user_key), {})      # {} if user_id is absent or unknown
        e["country"] = user.get("country", "UNKNOWN")
    return events

**Why this is Hash Join**
This is the same hash-join idea applied to semi-structured JSON: hash the dimension, then probe per event and merge fields.

**Time & Space Complexity**

n = len(events), m = len(dim)

Time: build index O(m) + enrich events O(n) → O(n + m)

Space: hash index on the dimension → O(m)
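
`enrich_events_with_country` writes `country` into the caller's dicts. When the original events must stay untouched (for example, if they are replayed or logged elsewhere), a non-mutating variant may be safer. The following is a sketch under that assumption. The name `enriched_copy`, its `default` parameter, and the event without a `user_id` are hypothetical.

```python
def enriched_copy(events, dim, user_key="user_id", default="UNKNOWN"):
    """Return new event dicts with a country field; inputs are not modified."""
    idx = {d[user_key]: d for d in dim}
    out = []
    for e in events:
        # e.get() tolerates events that have no user_id field at all
        user = idx.get(e.get(user_key), {})
        out.append({**e, "country": user.get("country", default)})
    return out

events = [
    {"event_id": 1, "user_id": 101, "action": "login"},
    {"event_id": 2, "user_id": 999, "action": "click"},   # unknown user
    {"event_id": 3, "action": "view"},                     # hypothetical: no user_id
]
user_dim = [{"user_id": 101, "country": "US"}]

result = enriched_copy(events, user_dim)
print([e["country"] for e in result])   # prints ['US', 'UNKNOWN', 'UNKNOWN']
print("country" in events[0])           # prints False
```

The trade-off is one extra shallow copy per event, which adds O(n) space for the output on top of the O(m) index.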