**⭐ 1. What This Pattern Solves**

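Semi join: keep each left row that has at least one match in the right table (like SQL EXISTS). Left rows are never duplicated and no right-side columns are added.
Anti join: keep each left row that has no match in the right table (like SQL NOT EXISTS). Typical uses are filtering out known records and deduplicating incoming data against an existing table.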
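
The two definitions above can be modeled in plain Python before touching Spark. This is only an illustration of the semantics, assuming small in-memory lists; the names `left`, `right_keys`, and `right_set` are made up for the example.

```python
# Plain-Python model of the two join types (illustration only, not Spark code).
left = [(1, "A"), (2, "B"), (3, "C")]   # (key, payload) rows
right_keys = [1, 3, 3]                  # right side; key 3 appears twice

# An existence check only cares whether a key is present, not how often.
right_set = set(right_keys)

semi = [row for row in left if row[0] in right_set]      # rows WITH a match
anti = [row for row in left if row[0] not in right_set]  # rows WITHOUT a match

print(semi)  # [(1, 'A'), (3, 'C')]
print(anti)  # [(2, 'B')]
```

Every left row lands in exactly one of the two results, and the duplicate key 3 on the right does not produce a duplicate left row.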

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- semi
SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.k = A.k);

In [0]:
%sql
-- anti
SELECT * FROM A WHERE NOT EXISTS (...);

**⭐ 3. Core Idea**

Use DataFrame.join(..., how='left_semi') for a semi join and how='left_anti' for an anti join; PySpark supports both. The result contains only the left side's columns, so no wide rows are materialized from the right side.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
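# join_keys: a column name (or list of names) present in both DataFrames
semi = df_left.join(df_right, on=join_keys, how='left_semi')  # left rows WITH a match
anti = df_left.join(df_right, on=join_keys, how='left_anti')  # left rows WITHOUT a match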

**⭐ 5. Detailed Example**

In [0]:
users = spark.createDataFrame([(1,'A'),(2,'B'),(3,'C')], ['id','name'])
purchased = spark.createDataFrame([(1,),(3,)], ['id'])

buyers = users.join(purchased, on='id', how='left_semi')  # ids 1 and 3
non_buyers = users.join(purchased, on='id', how='left_anti')  # id 2

**⭐ 6. Mini Practice Problems**

From sessions, keep only rows where user_id exists in active_users (semi).

From staging_emails, keep rows not present in dim_emails (anti).

Use anti join to find orphaned order_items without orders.

**⭐ 7. Full Data Engineering Problem**

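You run a daily ingestion job. Anti-join incoming_customers against dim_customers to find new customers and insert them. Semi-join incoming events against dim_customers to select the events that belong to known customers, ready for enrichment. Note that a semi join filters rows rather than tagging them; if unmatched events must be kept with a flag, use a left join instead.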

**⭐ 8. Time & Space Complexity**

Cheaper than a regular inner or outer join when you only need an existence check. Spark may still shuffle both sides by the join columns, but only the right side's key columns are needed, so right-side payload columns are not carried into the result.

**⭐ 9. Common Pitfalls**

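Broadcasting the right side of a left_semi/left_anti join without checking its size. A broadcast hint is valid for these joins, but the right side must fit comfortably in executor memory.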

Expecting semi join to return right columns — it doesn't.

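Worrying that duplicates on the right side change the result. They do not: a semi join returns each matching left row once, and an anti join is unaffected. Duplicate right rows can still hurt performance, so consider reducing the right side to distinct keys first.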