**⭐ 1. What This Pattern Solves**

Joins a large table with a much smaller reference table by broadcasting the small table to every executor, which avoids shuffling the large table.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT /*+ BROADCAST(customers) */ *
FROM orders
JOIN customers ON orders.customer_id = customers.id;

**⭐ 3. Core Idea**

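Wrap the small DataFrame in pyspark.sql.functions.broadcast(small_df) inside .join(), or let Spark choose a broadcast automatically for tables whose estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default; -1 disables it). Spark copies the small table to every executor and performs a map-side hash join, which avoids the expensive shuffle of the large table.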
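To make the map-side idea concrete, here is a minimal pure-Python sketch of what a broadcast hash join does conceptually. It is not Spark's implementation: the function name `broadcast_hash_join`, the list-of-lists "partitions", and the sample rows are all illustrative assumptions.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows, big_key, small_key):
    """Inner-join every partition of the big side against one shared hash table."""
    # Build phase: hash the small side once. In Spark, this table is what
    # gets copied ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[small_key]].append(row)

    # Probe phase: each partition is joined where it already lives,
    # so rows of the big side never move between partitions (no shuffle).
    joined = []
    for partition in big_partitions:
        out = []
        for row in partition:
            for match in lookup.get(row[big_key], ()):
                out.append({**row, **match})
        joined.append(out)
    return joined


# Two hypothetical partitions of an orders table and a tiny customers table.
orders = [
    [{"order_id": 1, "customer_id": 1}, {"order_id": 2, "customer_id": 3}],
    [{"order_id": 3, "customer_id": 2}],
]
customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]

result = broadcast_hash_join(orders, customers, "customer_id", "id")
```

The order with customer_id 3 has no match and is dropped, as in any inner join. The output keeps the same partitioning as the input, which is the property that makes the join shuffle-free.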

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import broadcast

joined = big_df.join(broadcast(small_df), on=join_keys, how='inner')

**⭐ 5. Detailed Example**

In [0]:
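from pyspark.sql.functions import broadcast

# Large side: 10 million rows with a single order_id column
big = spark.range(0, 10_000_000).withColumnRenamed('id', 'order_id')
# Small side: a two-row lookup table
small = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'name'])

res = big.join(broadcast(small), big.order_id == small.id, how='inner')
# The explicit hint requests a broadcast join regardless of
# spark.sql.autoBroadcastJoinThreshold, so make sure small really fits in memory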

**⭐ 6. Mini Practice Problems**

Broadcast dim_country (200 rows) to join with large events (tens of millions).

Force a broadcast of currency_codes in a join; inspect the explain() output and spot BroadcastHashJoin.

Try joining users with small_lookup both with and without broadcast and compare stages.
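For the plan-inspection exercises, a small helper can turn "spot BroadcastHashJoin" into a yes/no check. The sketch below assumes that explain() prints the plan through Python's standard output (which is how PySpark's DataFrame.explain() behaves, to my knowledge) and accepts any object with an explain() method. The helper names `plan_text` and `uses_broadcast_join` are hypothetical.

```python
import contextlib
import io

# Physical operators that indicate the small side was broadcast.
BROADCAST_OPERATORS = ("BroadcastHashJoin", "BroadcastNestedLoopJoin")


def plan_text(df):
    """Capture whatever df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def uses_broadcast_join(df):
    """Return True if the printed plan mentions a broadcast join operator."""
    text = plan_text(df)
    return any(op in text for op in BROADCAST_OPERATORS)
```

In a notebook this would be called on both variants of the join, for example `uses_broadcast_join(users.join(broadcast(small_lookup), "key"))` versus the same join without the hint, where "key" is a placeholder column name. Because Spark may broadcast a small table on its own, setting spark.sql.autoBroadcastJoinThreshold to -1 before the comparison makes the difference between the two plans visible.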

**⭐ 7. Full Data Engineering Problem**

You need to enrich streaming click events with a geo_ip mapping (a small static table of ~50k rows). Implement the enrichment as a broadcast join, in streaming or batch, so that the click events are not shuffled. Validate that the small table fits in memory; if it does not, fall back to partitioning or a range join.

**⭐ 8. Time & Space Complexity**

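Broadcast join: time is O(N) for the map-side probe of the N large-table rows, plus O(M) to build the hash table from the M small-table rows. Memory cost is size(small_df) on every executor, and the table is also collected on the driver before it is broadcast. Avoid it when the small table approaches executor or driver memory; it is ideal when the small table is far smaller than available executor RAM (roughly tens of MB).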
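The per-executor memory cost adds up across the cluster, which is easy to underestimate. A back-of-the-envelope sketch, assuming one full copy of the table per executor; the 50 MiB table size and the 100 executors are made-up numbers for illustration:

```python
MIB = 1024 ** 2


def broadcast_footprint(small_table_bytes, num_executors):
    """Return (bytes held per executor, bytes held across the whole cluster)."""
    per_executor = small_table_bytes
    cluster_total = small_table_bytes * num_executors
    return per_executor, cluster_total


per_executor, cluster_total = broadcast_footprint(50 * MIB, 100)
print(f"per executor:  {per_executor / MIB:.0f} MiB")   # 50 MiB
print(f"cluster total: {cluster_total / MIB:.0f} MiB")  # 5000 MiB
```

The per-executor figure is the one to compare against executor memory; the cluster total shows how much RAM the join trades for the shuffle it avoids.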

**⭐ 9. Common Pitfalls**

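Broadcasting a table that does not fit in executor memory → OOM.

Relying on the auto-broadcast threshold without checking what Spark actually chose (read the setting via spark.conf, confirm the plan with explain(), or use an explicit broadcast).

Not accounting for serialized size: a table with few rows can still be much bigger than expected if it has many columns or wide rows.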
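The last pitfall is mostly arithmetic: row count alone says little about broadcast size. The sketch below compares a rough size estimate against the 10 MB default of spark.sql.autoBroadcastJoinThreshold. The average row widths (40 and 2,000 bytes) are hypothetical, and the real in-memory size in Spark can differ from this kind of estimate.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold, in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def estimated_size_bytes(row_count, avg_row_bytes):
    """Rough table size: number of rows times average row width."""
    return row_count * avg_row_bytes


def fits_threshold(row_count, avg_row_bytes,
                   threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """True if the rough estimate is at or below the broadcast threshold."""
    return estimated_size_bytes(row_count, avg_row_bytes) <= threshold_bytes


# Same 50k rows, very different outcomes depending on row width.
narrow = fits_threshold(50_000, 40)      # about 2 MB of data
wide = fits_threshold(50_000, 2_000)     # about 100 MB of data
print(narrow, wide)
```

With these assumed widths the narrow table stays under the threshold and the wide one does not, even though both have the same number of rows.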