**⭐ 1. What This Pattern Solves**

Guides Spark’s query planner to make better optimization decisions:

Broadcast hints → ask Spark to copy a small table to every executor, so the large table is joined in place instead of being shuffled.

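Delta MERGE → upserts into a Delta table in a single operation. Strictly speaking this is a statement, not a hint; Spark's `MERGE` join hint is a separate feature that requests a shuffle sort-merge join.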

Use cases:

Joining a huge fact table with a small dimension table.

Performing incremental updates with MERGE INTO on Delta tables.

Reducing shuffle in large joins or merges.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Broadcast hint: reference the alias (s), because the table is aliased in this query
SELECT /*+ BROADCAST(s) */ *
FROM big_table b
JOIN small_table s
ON b.id = s.id;

-- Delta MERGE (an upsert statement, not a hint)
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT *

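**The broadcast hint avoids shuffling the large table:** each executor joins its own partitions against a local copy of the small table.

Delta MERGE applies updates and inserts in one atomic commit and rewrites only the affected data files, not the whole table.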
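To make the "no shuffle" point concrete, here is a toy model in plain Python; it is not Spark code. The function name `broadcast_hash_join` and the list-of-partitions layout are illustrative assumptions. The sketch only shows that once every worker holds a full copy of the small side, each partition of the large side can be joined where it already sits.

```python
from typing import Dict, List, Tuple


def broadcast_hash_join(
    large_partitions: List[List[Tuple[int, int]]],
    small_rows: List[Tuple[int, str]],
) -> List[Tuple[int, int, str]]:
    """Toy inner join on the first field; the large side never moves."""
    # Build the lookup once. In Spark, this copy is what gets shipped to every executor.
    # Assumption: join keys are unique on the small side.
    lookup: Dict[int, str] = {key: state for key, state in small_rows}

    joined = []
    for partition in large_partitions:  # each partition is processed in place
        for key, value in partition:
            if key in lookup:  # local probe, nothing exchanged between partitions
                joined.append((key, value, lookup[key]))
    return joined


# Hypothetical data: two partitions of a "big" table and a tiny dimension table.
big = [[(1, 10), (2, 20)], [(3, 30), (4, 40)]]
small = [(1, "NY"), (3, "CA")]
result = broadcast_hash_join(big, small)
print(result)
```

A shuffle join would instead have to repartition both inputs by `id` before any rows could be matched.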

**⭐ 3. Core Idea**

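Spark’s optimizer picks a join strategy from size estimates (it broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`, 10 MB by default); a hint overrides that choice.

Use broadcast when the small table fits comfortably in driver and executor memory.

Use merge for upserts in Delta → changes only the affected files and avoids a full overwrite.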
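The decision described above can be approximated with a few lines of plain Python. This is a deliberately simplified model, not the planner's actual logic: the function `choose_join_strategy` is hypothetical, and real Spark also weighs join type, statistics quality, and adaptive execution. The 10 MB default mirrors `spark.sql.autoBroadcastJoinThreshold`, and a threshold of -1 disables automatic broadcasting.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # default of spark.sql.autoBroadcastJoinThreshold


def choose_join_strategy(
    small_side_bytes: int,
    hinted: bool = False,
    threshold_bytes: int = DEFAULT_THRESHOLD_BYTES,
) -> str:
    """Simplified model: a hint wins, otherwise the size estimate decides."""
    if hinted:
        return "broadcast"  # the hint overrides the size-based default
    if 0 <= small_side_bytes <= threshold_bytes:
        return "broadcast"  # small enough to replicate automatically
    return "shuffle"        # fall back to a shuffle-based join


print(choose_join_strategy(5 * 1024 * 1024))                # under the threshold
print(choose_join_strategy(50 * 1024 * 1024))               # over the threshold
print(choose_join_strategy(50 * 1024 * 1024, hinted=True))  # hint forces broadcast
```

The third call is the risky one in practice: the hint forces a broadcast regardless of the size estimate, so checking the table size is left to you.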

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import broadcast

# Broadcast small DF in join
df_joined = large_df.join(broadcast(small_df), "id")

# Delta Merge
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/mnt/delta/target")
delta_table.alias("t").merge(
    source=source_df.alias("s"),
    condition="t.id = s.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("PerfPatterns").getOrCreate()

# Example DataFrames, generated on the cluster instead of from large Python lists
large_df = spark.range(1_000_000).selectExpr("id", "id * 2 AS value")
small_df = spark.range(1_000).selectExpr("id", "'NY' AS state")

# Broadcast join; explain() should show BroadcastHashJoin in the physical plan
joined_df = large_df.join(broadcast(small_df), "id")
joined_df.explain()
joined_df.show(5)

# Delta merge example (assumes a Delta table with columns id, name already exists at this path)
delta_table = DeltaTable.forPath(spark, "/mnt/delta/customers")
source_df = spark.createDataFrame([(1, "Alice_new")], ["id", "name"])
delta_table.alias("t").merge(
    source=source_df.alias("s"),
    condition="t.id = s.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()


**Explanation:**

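broadcast(small_df) → ships small_df to every executor, so large_df is joined without being shuffled.

merge → updates rows whose id already exists in the target and inserts the rest, in one atomic Delta commit.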
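The update-or-insert behaviour of the merge can be sketched with plain dictionaries. This is only a model of the semantics of `whenMatchedUpdateAll()` plus `whenNotMatchedInsertAll()`; the `upsert` helper and the sample rows are hypothetical and say nothing about how Delta executes the operation.

```python
from typing import Dict, Tuple


def upsert(
    target: Dict[int, dict], source: Dict[int, dict]
) -> Tuple[Dict[int, dict], Dict[str, int]]:
    """Merge source rows into a copy of target, keyed by id."""
    merged = {key: dict(row) for key, row in target.items()}  # leave the input untouched
    stats = {"updated": 0, "inserted": 0}
    for key, row in source.items():
        # Matched key: replace every column. Unmatched key: insert the row.
        action = "updated" if key in merged else "inserted"
        merged[key] = dict(row)
        stats[action] += 1
    return merged, stats


# Hypothetical rows: id 1 already exists, id 3 is new.
customers = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}
changes = {1: {"id": 1, "name": "Alice_new"}, 3: {"id": 3, "name": "Carol"}}
merged, stats = upsert(customers, changes)
print(merged)
print(stats)
```

Rows absent from the source (id 2 here) are left untouched, which is what distinguishes an upsert from an overwrite.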

**⭐ 6. Mini Practice Problems**

When would you use a broadcast hint in a join?

Write a Delta merge to update a record if it exists and insert if it doesn’t.

Why is broadcasting too large a table dangerous?

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a 1 TB sales fact table and a 10,000-row customer dimension table. You need to join the two for analytics and apply customer updates to a Delta table efficiently.

In [0]:
from pyspark.sql.functions import broadcast
from delta.tables import DeltaTable

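# Assumes sales, customers and updates are DataFrames that have already been loaded
# Broadcast join for analytics: only the small customers table is copied to executors
df_analytics = sales.join(broadcast(customers), "customer_id")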

# Delta merge for updates
delta_customers = DeltaTable.forPath(spark, "/mnt/delta/customers")
delta_customers.alias("t").merge(
    source=updates.alias("s"),
    condition="t.customer_id = s.customer_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

This approach avoids shuffling the 1 TB fact table and applies customer changes as a transactional upsert instead of a full rewrite.

**⭐ 8. Time & Space Complexity**

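| Operation      | Time Complexity                                                                                   | Space Complexity                                            |
| -------------- | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| Broadcast join | O(n + m): one scan of the large table (n rows) plus a hash table built from the small table (m rows) | O(m) per executor, O(m × executors) across the cluster      |
| Delta merge    | O(n + m) to scan and join target (n rows) and source (m rows), plus a shuffle if needed           | Rewritten data files on disk, up to O(n + m)                |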
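The memory column can be turned into a back-of-the-envelope calculation. The sizes below are hypothetical and both helper functions are illustrative. The sketch also ignores that a broadcast table usually takes more space in memory than on disk.

```python
MIB = 1024 * 1024


def broadcast_memory_bytes(small_table_bytes: int, num_executors: int) -> int:
    """Cluster-wide memory for a broadcast: one full copy per executor."""
    return small_table_bytes * num_executors


def shuffle_join_bytes_moved(large_table_bytes: int, small_table_bytes: int) -> int:
    """Rough upper bound on data repartitioned by a shuffle join (both sides move)."""
    return large_table_bytes + small_table_bytes


# Hypothetical sizes: a 5 MiB dimension table, a 1 TiB fact table, 100 executors.
small_bytes = 5 * MIB
large_bytes = 1024 * 1024 * MIB
executors = 100

broadcast_total = broadcast_memory_bytes(small_bytes, executors)
shuffle_total = shuffle_join_bytes_moved(large_bytes, small_bytes)
print(f"broadcast copies: {broadcast_total / MIB:.0f} MiB across the cluster")
print(f"shuffle join moves up to: {shuffle_total / MIB:.0f} MiB")
```

The broadcast cost grows with the number of executors, so the same table that is cheap to broadcast on a small cluster can become expensive on a large one.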


**⭐ 9. Common Pitfalls**

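Broadcasting a table that is too large → driver or executor OOM, because the whole table is collected and then copied to every executor.

Forgetting to alias tables in merge → the condition (for example `t.id = s.id`) cannot be resolved and the merge fails.

Using merge on huge tables without filtering → full scan of the target → slow. Add a partition or date predicate to the merge condition where possible.

Assuming a hint took effect without checking → inspect the plan with `explain()`; if the shuffle is still there, performance suffers despite the hint.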
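The effect of filtering a merge can be illustrated with a toy file listing. This is a plain-Python model, not Delta's implementation: `files_to_scan`, the partition values, and the file names are all made up. It shows only why a partition predicate in the merge condition shrinks the set of files that must be read.

```python
from typing import Dict, List, Optional, Set


def files_to_scan(
    files_by_partition: Dict[str, List[str]],
    touched_partitions: Optional[Set[str]] = None,
) -> List[str]:
    """Return the files a merge must read; None means no partition predicate."""
    selected = []
    for partition, files in files_by_partition.items():
        # Without a predicate every partition is a candidate for matches.
        if touched_partitions is None or partition in touched_partitions:
            selected.extend(files)
    return selected


# Hypothetical table layout: one partition per day, a few files each.
table_files = {
    "2024-01-01": ["part-0.parquet", "part-1.parquet"],
    "2024-01-02": ["part-2.parquet"],
    "2024-01-03": ["part-3.parquet"],
}

print(len(files_to_scan(table_files)))           # unfiltered merge reads everything
print(files_to_scan(table_files, {"2024-01-02"}))  # filtered merge reads one partition
```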