**Pattern 1** — Fast Deduplication

**Problem:** You have a list of user IDs with duplicates (e.g., clickstream log).

**Task:** Return unique user IDs. Sometimes order matters, sometimes it doesn’t.

In [0]:
# Case 1: Order DOES NOT matter (fastest)
user_ids = [101, 102, 101, 103, 102, 104]
unique_users = set(user_ids)  # {101, 102, 103, 104}

# Case 2: Order matters (keep first occurrence)
def dedupe_preserve_order(seq):
    seen = set()
    result = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

print(dedupe_preserve_order(user_ids))  # [101, 102, 103, 104]


**Why this pattern**

A set gives O(1) average-case membership tests, whereas `x in some_list` scans the list in O(n).

Classic DE use: distinct customer IDs, distinct file names, distinct partition keys.
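To make the membership claim concrete, here is a rough timing sketch. The collection size, the probed value, and the repeat count are arbitrary choices for illustration, and absolute numbers will vary by machine; only the gap between the two matters.

```python
import timeit

ids_list = list(range(100_000))
ids_set = set(ids_list)
missing = -1  # not present, so the list lookup must scan every element

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: missing in ids_list, number=100)
set_time = timeit.timeit(lambda: missing in ids_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

On typical hardware the set lookup should be several orders of magnitude faster, and the gap widens as the collection grows.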

**Time & Space Complexity**

set(seq): Time O(n), Space O(n).

dedupe_preserve_order: Time O(n), Space O(n) for seen + result.
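A shorter alternative for the order-preserving case, sketched here as an option rather than a replacement for the explicit loop: `dict.fromkeys` keeps one key per distinct value, and dict insertion order is guaranteed from Python 3.7 onward.

```python
user_ids = [101, 102, 101, 103, 102, 104]

# dict keys are unique and keep insertion order (Python 3.7+),
# so the first occurrence of each ID survives.
unique_in_order = list(dict.fromkeys(user_ids))
print(unique_in_order)  # [101, 102, 103, 104]
```

It has the same O(n) time and space as `dedupe_preserve_order`; the explicit loop is still the better fit when extra logic is needed per element.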

**Pattern 2** — Reference Integrity Check

**Problem:** Ensure every foreign key in the fact table exists in the dimension table (missing dimension rows = bad data).

**Task:** Find all fact IDs that don’t exist in the dimension key set.

In [0]:
fact_fk = [101, 102, 103, 104, 105]    # customer_ids in fact table
dim_keys = [101, 102, 104]             # customer_ids in dim_customer

missing_in_dim = set(fact_fk) - set(dim_keys)
print(missing_in_dim)  # {103, 105}


**Why this pattern**

This is the core “orphan record” / FK integrity check in DQ frameworks.

Set difference is exactly “in fact but not in dim”.

**Time & Space Complexity**

Build sets: O(F + D) time, O(F + D) space.

Set difference: O(F) expected, since it walks the left operand and probes the right one.

Overall: Time O(F + D), Space O(F + D).
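The set difference tells you which keys are orphaned, but not how many fact rows each one affects. A minimal sketch of that follow-up, assuming a hypothetical fact key list with repeats (the values below are made up for illustration):

```python
from collections import Counter

fact_fk = [101, 103, 103, 105, 102, 103]  # hypothetical fact keys, with repeats
dim_keys = {101, 102, 104}                # dimension keys, already a set

# Count fact rows per missing key; membership test is O(1) average
orphan_counts = Counter(fk for fk in fact_fk if fk not in dim_keys)
print(orphan_counts.most_common())  # [(103, 3), (105, 1)]
```

This is still one pass over the fact keys, so the overall cost stays O(F + D).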

**Pattern 3** — Semi Join (Exists Join)

**Problem:** Keep only fact rows whose key exists in dimension.

**Task:** Filter a list of fact rows to include only those with IDs in dim_ids.

In [0]:
from dataclasses import dataclass

@dataclass
class FactRow:
    id: int
    amount: float

fact_rows = [
    FactRow(101, 10.0),
    FactRow(102, 20.0),
    FactRow(103, 30.0),
]

dim_ids = [101, 103]  # valid dimension keys

dim_id_set = set(dim_ids)
filtered_fact = [r for r in fact_rows if r.id in dim_id_set]

# Keeps only ids 101 and 103
for r in filtered_fact:
    print(r.id, r.amount)


**Why this pattern**

“Semi join” = keep matching rows only, no dim payload needed.

Equivalent of SQL WHERE fact.id IN (SELECT id FROM dim) but done in Python.

**Time & Space Complexity**

Build dim_id_set: O(D) time, O(D) space.

Filter fact: O(F) time (O(1) membership each).

Overall: Time O(F + D), Space O(D + F) for output.
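When the filtered rows are only consumed once, the O(F) output list can be avoided. The sketch below is one way to do that with a generator; the function name `semi_join` is hypothetical, not something defined elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class FactRow:
    id: int
    amount: float

def semi_join(rows: Iterable[FactRow], dim_ids: Iterable[int]) -> Iterator[FactRow]:
    """Yield rows whose id is in dim_ids, one at a time."""
    dim_id_set = set(dim_ids)  # built once, O(D)
    for row in rows:
        if row.id in dim_id_set:
            yield row

fact_rows = [FactRow(101, 10.0), FactRow(102, 20.0), FactRow(103, 30.0)]
total = sum(r.amount for r in semi_join(fact_rows, [101, 103]))
print(total)  # 40.0
```

Extra space drops to O(D) for the key set, at the cost of the result being single-use.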

**Pattern 4** — Anti Join (Not Exists)

**Problem:** Find “bad” fact rows whose keys are missing in dimension (DQ failure records).

**Task:** Filter a list of fact rows to only those whose IDs are not in dim_ids.

In [0]:
from dataclasses import dataclass

@dataclass
class FactRow:
    id: int
    amount: float

fact_rows = [
    FactRow(101, 10.0),
    FactRow(102, 20.0),
    FactRow(103, 30.0),
]

dim_ids = [101, 103]
dim_id_set = set(dim_ids)

bad_rows = [r for r in fact_rows if r.id not in dim_id_set]

for r in bad_rows:
    print(r.id, r.amount)  # 102 20.0


**Why this pattern**

“Anti join” = rows that fail the EXISTS condition.

Used in DQ checks, reconciliation, “find all events that didn’t land in warehouse”, etc.

**Time & Space Complexity**

Same as semi-join: Time O(F + D), Space O(D + B) where B = number of bad rows.
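Semi join and anti join are often needed together, for example to load the good rows and quarantine the bad ones. A possible single-pass sketch follows; `split_by_dim` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class FactRow:
    id: int
    amount: float

def split_by_dim(rows, dim_ids):
    """Return (matched, orphaned) lists in a single pass over rows."""
    dim_id_set = set(dim_ids)  # built once, outside the loop
    matched, orphaned = [], []
    for row in rows:
        if row.id in dim_id_set:
            matched.append(row)
        else:
            orphaned.append(row)
    return matched, orphaned

fact_rows = [FactRow(101, 10.0), FactRow(102, 20.0), FactRow(103, 30.0)]
good_rows, bad_rows = split_by_dim(fact_rows, [101, 103])
print([r.id for r in good_rows])  # [101, 103]
print([r.id for r in bad_rows])   # [102]
```

This scans the fact rows once instead of twice, with the same O(F + D) bound.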

**Pattern 5** — Detecting Drift (Schema / Category Drift)

**Problem:** Yesterday’s schema/categories vs today’s. Find what changed.

**Task:** Compute added and removed columns (or categories) between two snapshots.

In [0]:
old_cols = {"id", "name", "email", "created_at"}
new_cols = {"id", "full_name", "email", "created_at", "status"}

# What changed in either direction (symmetric difference)
drift = new_cols ^ old_cols
print(drift)  # {'name', 'full_name', 'status'}

# Often you want more detail:
added = new_cols - old_cols      # {'full_name', 'status'}
removed = old_cols - new_cols    # {'name'}


**Why this pattern**

Schema drift detection = monitor added/removed columns.

Category drift: old set of statuses vs new set of statuses, etc.

**Time & Space Complexity**

Creating sets: O(n + m) where n, m are sizes.

^, -: each O(n + m) in worst case.

Overall: Time O(n + m), Space O(n + m).
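For monitoring, it can help to wrap the two differences in a small report object. This is a sketch only; `SchemaDrift` and `schema_drift` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaDrift:
    added: frozenset
    removed: frozenset

    @property
    def has_drift(self) -> bool:
        # True when at least one column was added or removed
        return bool(self.added or self.removed)

def schema_drift(old_cols, new_cols) -> SchemaDrift:
    old, new = frozenset(old_cols), frozenset(new_cols)
    return SchemaDrift(added=new - old, removed=old - new)

report = schema_drift(
    ["id", "name", "email", "created_at"],
    ["id", "full_name", "email", "created_at", "status"],
)
print(sorted(report.added))    # ['full_name', 'status']
print(sorted(report.removed))  # ['name']
print(report.has_drift)        # True
```

Note that a rename (`name` to `full_name` here) appears as one removal plus one addition; set operations alone cannot tell a rename from an unrelated drop and add.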

**Pattern 6** — Comparing Two Large Datasets Efficiently

**Problem:** You have two large ID lists from two systems (e.g., source vs warehouse) and want to reconcile.

**Task:** Find IDs only in A, only in B, and optionally in both.

In [0]:
a_ids = [1, 2, 3, 4, 5]          # IDs in source system
b_ids = [4, 5, 6, 7]             # IDs in warehouse

a_set = set(a_ids)
b_set = set(b_ids)

only_in_a = a_set - b_set        # {1, 2, 3}
only_in_b = b_set - a_set        # {6, 7}
in_both   = a_set & b_set        # {4, 5}


**Why this pattern**

Classic reconciliation: “Which records never loaded?”, “Which records exist only in downstream system?”.

Set operations give you all three segments in linear time.

**Time & Space Complexity**

Build sets: O(n + m) time, O(n + m) space.

Each of - and &: O(n + m) worst case.

Overall: Time O(n + m), Space O(n + m).
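A cheap sanity check on a reconciliation is that the three segments never overlap and together cover every ID seen in either system. The sketch below restates the example above with those invariants as assertions.

```python
a_set = {1, 2, 3, 4, 5}  # IDs in source system
b_set = {4, 5, 6, 7}     # IDs in warehouse

only_in_a = a_set - b_set
only_in_b = b_set - a_set
in_both = a_set & b_set

# The segments are pairwise disjoint and rebuild the original sets
assert only_in_a.isdisjoint(only_in_b)
assert only_in_a | in_both == a_set
assert only_in_b | in_both == b_set
assert only_in_a | only_in_b | in_both == a_set | b_set

print(len(only_in_a), len(only_in_b), len(in_both))  # 3 2 2
```

The segment sizes are often all a reconciliation report needs, so the sets themselves can be discarded after counting.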

**Pattern 7** — Exploding Multi-Valued Columns (Tags / Roles / Categories)

**Problem:** Each record has a list of tags/roles. You want unique tags per user or globally.

**Task:** Use sets to merge and dedupe tags when aggregating.

In [0]:
items = [
    {"user": "alice", "tags": ["premium", "sports", "music"]},
    {"user": "alice", "tags": ["music", "news"]},
    {"user": "bob",   "tags": ["news", "finance"]},
]

# 1) Unique tags per user
from collections import defaultdict

user_tags = defaultdict(set)

for item in items:
    user = item["user"]
    for tag in item["tags"]:
        user_tags[user].add(tag)

print(user_tags["alice"])  # {'premium', 'sports', 'music', 'news'}
print(user_tags["bob"])    # {'news', 'finance'}

# 2) Global unique tag universe
all_tags = set()
for item in items:
    all_tags.update(item["tags"])

print(all_tags)  # {'premium', 'sports', 'music', 'news', 'finance'}


**Why this pattern**

Multi-valued columns appear everywhere: product categories, user interests, permissions.

Sets naturally dedupe while you aggregate, no need for extra checks.

**Time & Space Complexity**

Let total number of tag entries = T.

Building user_tags: Time O(T) (each add is O(1) average), Space O(T) in worst case.

Building all_tags: Time O(T), Space O(U) where U = number of unique tags.
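The global tag universe can also be written without an explicit loop. Two equivalent sketches, using the same sample records as above: a nested set comprehension, and `set.union`, which accepts any number of iterables.

```python
items = [
    {"user": "alice", "tags": ["premium", "sports", "music"]},
    {"user": "alice", "tags": ["music", "news"]},
    {"user": "bob",   "tags": ["news", "finance"]},
]

# Option 1: nested set comprehension
all_tags = {tag for item in items for tag in item["tags"]}

# Option 2: union of all tag lists in one call
all_tags_union = set().union(*(item["tags"] for item in items))

print(sorted(all_tags))  # ['finance', 'music', 'news', 'premium', 'sports']
```

Both are O(T) like the loop version; the choice is mostly a matter of readability.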

In [0]:
## Problem 1 — Find missing foreign keys

def missing_fk(fact_ids, dim_ids):
    return set(fact_ids) - set(dim_ids)

In [0]:
## Problem 2 — Find new categories introduced in production

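train = ["red", "green", "blue"]           # hypothetical categories seen in training
prod = ["red", "green", "blue", "purple"]  # hypothetical categories seen in production

new = set(prod) - set(train)  # {'purple'}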

In [0]:
## Problem 3 — Check if ID is unique

def has_unique(ids):
    return len(ids) == len(set(ids))

In [0]:
## Problem 4 — Intersection of multiple customer datasets

a, b, c = [1, 2, 3], [2, 3, 4], [3, 4, 5]  # hypothetical customer ID lists
common = set(a) & set(b) & set(c)  # {3}

In [0]:
## Problem 5 — Detect schema changes

def diff_schema(a, b):
    a, b = set(a), set(b)
    return a ^ b    # symmetric diff

In [0]:
## Problem 6 — Dedup while retaining order

def dedup(seq):
    seen = set()
    out = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

In [0]:
## Problem 7 — Validate primary keys don't overlap

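table1_ids, table2_ids = [1, 2, 3], [3, 4, 5]  # hypothetical key lists
dup = set(table1_ids) & set(table2_ids)  # {3}; a non-empty result means the keys overlap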

In [0]:
## Problem 8 — Find invalid status codes

records = ["OK", "FAIL", "RETRY", "PENDING"]  # hypothetical list of status strings
invalid = set(records) - {'OK', 'FAIL', 'PENDING'}  # {'RETRY'}

In [0]:
## Problem 9 — Remove stopwords

stop = {"the", "and", "of", "in"}
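words = "the cost of storage in the cloud".split()  # hypothetical input
result = [w for w in words if w not in stop]  # ['cost', 'storage', 'cloud']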

In [0]:
## Problem 10 — Determine if two datasets are identical

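a, b = [1, 2, 3], [3, 2, 1]  # hypothetical ID lists
set(a) == set(b)  # True; compares distinct values only, ignoring order and duplicates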
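Set equality only says the two datasets contain the same distinct values. If duplicate counts matter, one option is to compare multisets with `collections.Counter`; the sample lists below are made up for illustration.

```python
from collections import Counter

a = [1, 2, 2, 3]  # hypothetical dataset with a duplicated ID
b = [3, 2, 1]

same_values = set(a) == set(b)          # ignores duplicates and order
same_rows = Counter(a) == Counter(b)    # ignores order, respects counts

print(same_values)  # True
print(same_rows)    # False
```

`Counter` costs the same O(n + m) time as building the sets, so the stricter check is usually affordable.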

**Summary**

Sets = hash tables → O(1) average-case membership

Best for:

dedup

membership tests

drift detection

referential integrity checks

CDC diffs

category validation

Not ordered

Require hashable elements

Extremely important for DE DQ & join-style logic

Use tuple keys when needing multi-column uniqueness

Avoid repeated set construction inside loops

Symmetric difference (^) returns the elements present in exactly one of the two sets, which makes it a quick one-line drift check
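Two of the points above, tuple keys for multi-column uniqueness and building the set once outside the loop, can be sketched together. The column names `order_id` and `line_no` and the sample rows are hypothetical.

```python
fact_rows = [
    {"order_id": 1, "line_no": 1, "amount": 10.0},
    {"order_id": 1, "line_no": 2, "amount": 5.0},
    {"order_id": 2, "line_no": 1, "amount": 7.5},
]
dim_keys = [(1, 1), (2, 1)]  # valid (order_id, line_no) pairs

# Build the lookup set once, not inside the comprehension
dim_key_set = set(dim_keys)

# Tuples are hashable, so a composite key works as a set element
orphans = [
    r for r in fact_rows
    if (r["order_id"], r["line_no"]) not in dim_key_set
]
print(orphans)  # [{'order_id': 1, 'line_no': 2, 'amount': 5.0}]
```

Writing `not in set(dim_keys)` inside the comprehension would rebuild the set for every row and turn an O(F + D) check into O(F × D).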