**⭐ 1. What This Pattern Solves**

Returns only the rows whose join keys match in both tables. This is the typical choice when you need records that exist on both sides (e.g., orders with known customers).

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT *
FROM orders o
JOIN customers c
  ON o.customer_id = c.id;

**⭐ 3. Core Idea**

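Use DataFrame .join() with how="inner" (the default). Only rows whose join key exists on both sides are kept. Unless one side is small enough to be broadcast, Spark shuffles both DataFrames by the join column(s) so that rows with the same key land in the same partition.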

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql import functions as F

# df_left and df_right are DataFrames; join_keys can be string or list
joined = df_left.join(df_right, on=join_keys, how="inner")

**⭐ 5. Detailed Example**

In [0]:
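# Each order carries the customer_id it belongs to (illustrative values)
data_orders = [(1, 1, '2025-01-01', 100), (2, 2, '2025-01-02', 50), (3, 3, '2025-01-03', 75)]
orders = spark.createDataFrame(data_orders, ['order_id', 'customer_id', 'order_date', 'amount'])

data_customers = [(1, 'Alice'), (2, 'Bob')]
customers = spark.createDataFrame(data_customers, ['id', 'name'])

# Inner join orders to customers on the customer key.
# Order 3 references customer 3, which is not in customers, so it is dropped.
res = orders.join(customers, orders.customer_id == customers.id, how='inner') \
            .select('order_id', 'order_date', 'amount', 'name')
res.show()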

**⭐ 6. Mini Practice Problems**

Inner join events and users on user_id, then compute event count per user.

Join transactions and accounts on acct_id and filter amount > 1000.

Given A(id, v) and B(id, w), produce id, v, w only where both sides exist.

**⭐ 7. Full Data Engineering Problem**

You have a streaming bronze_events topic and a dim_users slowly changing dimension. Build a daily batch job that inner joins events to active users and writes silver_events only for active users. Make sure the join keys have matching types, avoid an accidental Cartesian product, and use appropriate partitioning on event_date.

**⭐ 8. Time & Space Complexity**

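The cost of a distributed join is dominated by the shuffle: roughly O(N + M) network I/O to repartition both sides, plus the local sort or hash work inside each partition. Memory use depends on the join strategy: a broadcast join must hold the entire small side in memory on every executor, while a shuffle-based join works partition by partition. Expect heavy network I/O when both sides are large.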
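
To make the O(N + M) term concrete, here is a minimal single-machine sketch of a hash join in plain Python: build a hash table on one side, then probe it with the other. This is not Spark's implementation; the function name hash_inner_join, the dict-based row layout, and the sample rows are assumptions chosen for illustration. It roughly mirrors the per-partition work after a shuffle, or what a broadcast join does with the small side.

```python
from collections import defaultdict
from typing import Any, Dict, List


def hash_inner_join(
    build_rows: List[Dict[str, Any]],
    probe_rows: List[Dict[str, Any]],
    key: str,
) -> List[Dict[str, Any]]:
    """Inner-join two lists of dict rows on `key` using build + probe."""
    # Build phase: one pass over the (ideally smaller) side, O(M) time and memory.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Probe phase: one pass over the other side, O(N) plus the number of output rows.
    out = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


# Hypothetical sample rows, for illustration only.
customers = [{"customer_id": 1, "name": "Alice"}, {"customer_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 100},
    {"order_id": 11, "customer_id": 3, "amount": 75},   # no matching customer
    {"order_id": 12, "customer_id": 1, "amount": 50},
]

joined = hash_inner_join(customers, orders, "customer_id")
```

Building on the smaller side keeps the hash table small, which is the same reasoning behind broadcasting a small dimension table instead of shuffling both inputs.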

**⭐ 9. Common Pitfalls**

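Forgetting to align join key types (e.g., string vs. int), which can lead to empty or incorrect results depending on how the values are implicitly cast.

Accidentally creating a Cartesian join by omitting the join condition.

Leaving same-named columns from both sides in the result, which makes later column references ambiguous (use .select with aliases, or rename columns before joining).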