**⭐ 1. What This Pattern Solves**

Filter one dataset using the existence (or non-existence) of keys in another

Common in ETL validation, exclusion lists, and CDC pipelines

Avoids the row multiplication a regular join produces when the right side has duplicate keys

Used for “keep only matching keys” or “drop matching keys”
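To make the row-explosion point concrete, here is a minimal sketch comparing a hand-written inner join with a semi join. The sample rows (including the duplicated customer "A") are hypothetical and chosen only to trigger the duplication.

```python
# Hypothetical sample data: customer "A" appears twice on the right side.
orders = [
    {"order_id": 1, "customer_id": "A"},
    {"order_id": 2, "customer_id": "B"},
]
customers = [
    {"id": "A", "segment": "retail"},
    {"id": "A", "segment": "wholesale"},  # duplicate key
]

# Inner join: one output row per matching (left, right) pair.
inner = [
    {**o, **c}
    for o in orders
    for c in customers
    if o["customer_id"] == c["id"]
]

# Semi join: each left row is kept at most once, with no right-side columns.
customer_ids = {c["id"] for c in customers}
semi = [o for o in orders if o["customer_id"] in customer_ids]

print(len(inner))  # 2 -> order 1 shows up twice
print(len(semi))   # 1 -> order 1 shows up once
```

The inner join returns order 1 twice because it matched two customer rows; the semi join only asks whether a match exists.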

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Semi Join
SELECT *
FROM orders o
WHERE EXISTS (
  SELECT 1 FROM customers c
  WHERE o.customer_id = c.id
);

-- Anti Join
SELECT *
FROM orders o
WHERE NOT EXISTS (
  SELECT 1 FROM fraud_users f
  WHERE o.user_id = f.user_id
);


**⭐ 3. Core Idea**

Convert the right-side dataset into a set of keys

Iterate left-side rows and check membership only

No column merge, no duplication — just filtering

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
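# Assumes: left and right are lists of dict rows, and key is the
# join column name present on both sides.
right_keys = {r[key] for r in right}  # built once, O(m)

# Semi Join: keep left rows whose key exists on the right
semi = [l for l in left if l[key] in right_keys]

# Anti Join: keep left rows whose key does NOT exist on the right
anti = [l for l in left if l[key] not in right_keys]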
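The template can also be wrapped in two small helpers. This is one possible sketch, not a standard API: the function names `semi_join` and `anti_join`, and the separate `left_key` / `right_key` parameters (useful when the column names differ, as in `o.customer_id = c.id`), are assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def semi_join(left: list[Row], right: list[Row], left_key: str, right_key: str) -> list[Row]:
    """Keep left rows whose key value appears on the right side."""
    right_keys = {r[right_key] for r in right}
    return [row for row in left if row[left_key] in right_keys]


def anti_join(left: list[Row], right: list[Row], left_key: str, right_key: str) -> list[Row]:
    """Keep left rows whose key value does not appear on the right side."""
    right_keys = {r[right_key] for r in right}
    return [row for row in left if row[left_key] not in right_keys]


# Hypothetical sample data
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 20},
]
customers = [{"id": 10}]

matched = semi_join(orders, customers, "customer_id", "id")
unmatched = anti_join(orders, customers, "customer_id", "id")
print(matched)
print(unmatched)
```

Both helpers return the original left-side row objects unchanged, so the output schema always equals the left-side schema.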


**⭐ 5. Detailed Example**

In [0]:
orders = [
    {"order_id": 1, "user_id": "A"},
    {"order_id": 2, "user_id": "B"},
    {"order_id": 3, "user_id": "C"},
]

blocked_users = [
    {"user_id": "B"},
]

blocked = {u["user_id"] for u in blocked_users}

anti_join = [
    o for o in orders
    if o["user_id"] not in blocked
]

print(anti_join)

# Output
# [{'order_id': 1, 'user_id': 'A'}, {'order_id': 3, 'user_id': 'C'}]

**⭐ 6. Mini Practice Problems**

From page views, keep only users who made a purchase

Remove transactions linked to blacklisted accounts

Drop IoT readings whose device ID is not in the device registry
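As a starting point, the first practice problem could be approached along these lines. The field names (`user_id`, `page`, `amount`) and the sample rows are hypothetical.

```python
# Hypothetical page-view and purchase records
page_views = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u2", "page": "/pricing"},
    {"user_id": "u1", "page": "/checkout"},
    {"user_id": "u3", "page": "/home"},
]
purchases = [
    {"user_id": "u1", "amount": 30},
    {"user_id": "u1", "amount": 12},  # repeat buyer: still one key in the set
]

# Semi join: keep page views only for users who appear in purchases.
buyers = {p["user_id"] for p in purchases}
views_by_buyers = [v for v in page_views if v["user_id"] in buyers]

print(views_by_buyers)
```

User u1 bought twice, yet each of their page views is kept exactly once. The other two problems follow the same shape, with `in` swapped for `not in` where rows must be dropped.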

**⭐ 7. Full Data Engineering Scenario**

Problem
You receive daily transactions and a fraud-user list.
Only non-fraud transactions should move to the gold layer.

Expected Output
Clean transactions excluding all fraud users

Skeleton Solution

In [0]:
fraud_ids = {f["user_id"] for f in fraud_table}

clean_txns = [
    t for t in transactions
    if t["user_id"] not in fraud_ids
]
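In a pipeline it is often useful to keep the excluded rows as well, for auditing. The sketch below splits transactions into both sides in a single pass; the sample data and the `rejected_txns` name are assumptions, not part of the original scenario.

```python
# Hypothetical inputs
transactions = [
    {"txn_id": 101, "user_id": "A", "amount": 50},
    {"txn_id": 102, "user_id": "B", "amount": 75},
    {"txn_id": 103, "user_id": "A", "amount": 20},
]
fraud_table = [{"user_id": "B"}, {"user_id": "B"}]  # duplicates collapse in the set

fraud_ids = {f["user_id"] for f in fraud_table}

clean_txns = []     # anti join result: goes to the gold layer
rejected_txns = []  # semi join result: kept aside for review
for t in transactions:
    if t["user_id"] in fraud_ids:
        rejected_txns.append(t)
    else:
        clean_txns.append(t)

# Sanity check: the two sides partition the input.
print(len(clean_txns), len(rejected_txns), len(transactions))
```

Because the semi and anti joins are complements, their row counts must add up to the input count, which makes a cheap validation check.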


**⭐ 8. Time & Space Complexity**

Time: O(n + m) on average, where n = left-side rows and m = right-side rows (set lookups are O(1) on average)

Build set: O(m)

Filter left: O(n)

Space: O(m) for the key set
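The gap between O(n × m) and O(n + m) can be illustrated by counting operations instead of timing them. This is a rough sketch: the helper names and the synthetic data are made up, and the set version counts one operation per insertion and per lookup as a simplification.

```python
def anti_nested(left, right, key):
    """Anti join with nested loops; returns (rows, number of key comparisons)."""
    comparisons = 0
    result = []
    for l_row in left:
        found = False
        for r_row in right:
            comparisons += 1
            if l_row[key] == r_row[key]:
                found = True
                break
        if not found:
            result.append(l_row)
    return result, comparisons


def anti_set(left, right, key):
    """Anti join with a key set; returns (rows, set insertions + lookups)."""
    right_keys = {r[key] for r in right}                          # m insertions
    result = [row for row in left if row[key] not in right_keys]  # n lookups
    return result, len(right) + len(left)


# Synthetic data: 1000 left rows, 1000 right rows, half of the keys overlap.
left = [{"id": i} for i in range(1000)]
right = [{"id": i} for i in range(500, 1500)]

nested_rows, nested_ops = anti_nested(left, right, "id")
set_rows, set_ops = anti_set(left, right, "id")

print(nested_ops)  # 625250 comparisons
print(set_ops)     # 2000 set operations
```

Both versions return the same 500 rows; only the amount of work differs.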

**⭐ 9. Common Pitfalls & Mistakes**

❌ Using nested loops → O(n × m)
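❌ Running a regular join and then filtering or deduplicating the result
❌ Forgetting to deduplicate right-side keys (a list of keys keeps duplicates and makes every lookup O(m))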

✔ Always convert right-side keys to a set
✔ Use semi/anti joins when no column merge is needed
✔ Prefer this over a regular join when you only need to filter: less work and clearer intent