**⭐ 1. What This Pattern Solves**

Shuffles in Spark (triggered by wide transformations such as groupBy, join, distinct, and repartition) are expensive: rows are serialized, written to local disk, and sent across the network.
Reducing the amount of shuffled data shortens job runtime, lowers memory pressure, and makes better use of the cluster.

Use cases:

- Aggregations over massive datasets.
- Joins where one table is much smaller than the other.
- Optimizing ETL pipelines for speed and stability.

**⭐ 2. SQL Equivalent**

```sql
-- Standard join of two large tables (both sides may be shuffled in full)
SELECT *
FROM big_table b
JOIN big_table2 b2
ON b.id = b2.id;

-- Reduce shuffle: filter and pre-aggregate before joining
WITH filtered_b AS (
  SELECT id, SUM(amount) AS total
  FROM big_table
  WHERE date = '2025-12-01'
  GROUP BY id
)
SELECT *
FROM filtered_b fb
JOIN small_table st
ON fb.id = st.id;
```

Pre-aggregating or filtering first reduces the amount of data that moves during the shuffle.


**⭐ 3. Core Idea**

Shuffle = redistributing rows across partitions (and usually across executors, over the network) → expensive.

Reduce shuffle by:

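- Pre-aggregate or filter large datasets before the wide transformation.
- Broadcast small tables in joins.
- Partition (or bucket) data on the join or grouping key so matching rows are colocated.
- Avoid unnecessary repartition calls and other wide transformations.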
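
To make the first point concrete, here is a minimal pure-Python sketch (no Spark involved) that counts how many records cross partitions with and without a per-partition pre-aggregation step. The function names and the toy data are made up for illustration. Spark performs a comparable partial aggregation on its own for many DataFrame aggregations, so this is a model of the idea rather than of Spark's internals.

```python
from collections import defaultdict

def shuffle_by_key(partitions, num_reducers):
    """Route every (key, value) record to reducer hash(key) % num_reducers.

    Returns the reducer buckets and the number of records that were sent.
    """
    reducers = [[] for _ in range(num_reducers)]
    records_sent = 0
    for partition in partitions:
        for key, value in partition:
            reducers[hash(key) % num_reducers].append((key, value))
            records_sent += 1
    return reducers, records_sent

def pre_aggregate(partition):
    """Sum values per key inside one partition (no data movement needed)."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return list(totals.items())

def final_totals(reducers):
    """Finish the aggregation on the reducer side."""
    totals = defaultdict(int)
    for bucket in reducers:
        for key, value in bucket:
            totals[key] += value
    return dict(totals)

# Toy input: 4 partitions, 1,000 rows each, only 10 distinct ids
partitions = [[(i % 10, 100) for i in range(1000)] for _ in range(4)]

naive_reducers, naive_sent = shuffle_by_key(partitions, num_reducers=3)
combined_reducers, combined_sent = shuffle_by_key(
    [pre_aggregate(p) for p in partitions], num_reducers=3
)

print(naive_sent)     # 4000 records shuffled
print(combined_sent)  # 40 records shuffled (10 keys x 4 partitions)
print(final_totals(naive_reducers) == final_totals(combined_reducers))  # True
```

Both paths produce the same totals; only the number of records that had to move differs.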

**⭐ 4. Template Code (MEMORIZE THIS)**

```python
from pyspark.sql.functions import col, sum as spark_sum, broadcast

# df (large) and small_table (small) are placeholders for existing DataFrames

# Pre-filter, then pre-aggregate, so fewer rows reach the shuffle
df_agg = (
    df.filter(col("date") == "2025-12-01")
      .groupBy("id")
      .agg(spark_sum("amount").alias("total"))
)

# Broadcast the small table so the join itself needs no shuffle
df_joined = df_agg.join(broadcast(small_table), "id")
```

**⭐ 5. Detailed Example**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, broadcast

spark = SparkSession.builder.appName("PerfPatterns").getOrCreate()

# 1M rows over 1,000 distinct ids, with amounts cycling through 0..99,
# so that the filter and the groupBy below both have a visible effect
big_data = [(i % 1000, (i // 1000) % 100) for i in range(1000000)]
small_data = [(i, "NY") for i in range(1000)]

df_big = spark.createDataFrame(big_data, ["id", "amount"])
df_small = spark.createDataFrame(small_data, ["id", "state"])

# Pre-filter (drops roughly half the rows), then aggregate per id
df_agg = (
    df_big.filter(col("amount") > 50)
          .groupBy("id")
          .agg(spark_sum("amount").alias("total"))
)

# Broadcast join: the small table is copied to every executor
df_joined = df_agg.join(broadcast(df_small), "id")
df_joined.show()
```


**Explanation:**

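- Filtering before groupBy reduces the number of rows that enter the shuffle. The groupBy still shuffles, but only the rows that survive the filter.
- Broadcasting the small table lets every executor join against its own local copy, so the join needs no shuffle of the larger side. Spark does this automatically when it estimates a table to be below spark.sql.autoBroadcastJoinThreshold (10 MB by default); broadcast() requests it explicitly.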
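
The second point can be sketched in plain Python. This is a simplified model of a broadcast hash join, not Spark's implementation; the function name and the sample rows are invented for the example. The small side is turned into a lookup table once, and each partition of the big side is joined in place.

```python
def broadcast_hash_join(big_partitions, small_rows):
    """Inner-join each big partition locally against a full copy of the small table."""
    # Build the lookup once; in a cluster this copy would be sent to every executor.
    lookup = {key: value for key, value in small_rows}
    joined_partitions = []
    for partition in big_partitions:
        # Rows of the big side never leave their partition.
        joined = [
            (key, total, lookup[key])
            for key, total in partition
            if key in lookup
        ]
        joined_partitions.append(joined)
    return joined_partitions

big_partitions = [
    [(1, 300), (2, 150)],
    [(3, 500), (1, 50)],
]
small_rows = [(1, "NY"), (2, "CA")]

result = broadcast_hash_join(big_partitions, small_rows)
print(result)
# [[(1, 300, 'NY'), (2, 150, 'CA')], [(1, 50, 'NY')]]
```

The output keeps the partitioning of the big side, which is why no shuffle is needed. The cost is one copy of the small table per executor, so this only pays off while that table stays small.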

**⭐ 6. Mini Practice Problems**

- Reduce shuffle when joining a 500M-row table with a 1K-row table.
- Explain why filtering early can improve shuffle performance.
- How would you partition a dataset to reduce shuffle in multiple joins?
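
As a hint for the last problem, here is one possible line of thinking, sketched in plain Python with made-up data and helper names: if two datasets are hash-partitioned on the join key with the same number of partitions, matching keys land in the same partition index and the join can run partition by partition. In PySpark the analogous tools would be repartitioning both sides on the key or writing bucketed tables; whether the shuffle is actually skipped depends on the physical plan, so treat this as a model only.

```python
def hash_partition(rows, num_partitions):
    """Place each (key, value) row in partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def join_copartitioned(left_parts, right_parts):
    """Join partition i of the left side with partition i of the right side only."""
    output = []
    for left, right in zip(left_parts, right_parts):
        right_index = {}
        for key, value in right:
            right_index.setdefault(key, []).append(value)
        for key, value in left:
            for other in right_index.get(key, []):
                output.append((key, value, other))
    return output

orders = [(1, "o-1"), (2, "o-2"), (1, "o-3"), (5, "o-4")]
payments = [(1, "p-1"), (5, "p-2"), (7, "p-3")]

NUM_PARTITIONS = 4
orders_parts = hash_partition(orders, NUM_PARTITIONS)      # pay the shuffle once
payments_parts = hash_partition(payments, NUM_PARTITIONS)  # same key, same count

joined = join_copartitioned(orders_parts, payments_parts)
print(sorted(joined))
# [(1, 'o-1', 'p-1'), (1, 'o-3', 'p-1'), (5, 'o-4', 'p-2')]
```

Once `orders_parts` exists, it can be reused for further joins on the same key without moving its rows again.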

**⭐ 7. Full Data Engineering Problem**

Scenario: You need to compute monthly revenue per customer from 1 TB of transactional data and join it with a 10K-row customer profile table.

```python
from pyspark.sql.functions import col, sum as spark_sum, broadcast

# transactions and customer_profile are assumed to be existing DataFrames

# Step 1: Pre-filter data for December
# Note: if the data spans several years, also filter on the year,
# otherwise every December is summed together.
df_dec = transactions.filter(col("month") == 12)

# Step 2: Pre-aggregate revenue per customer
df_revenue = df_dec.groupBy("customer_id").agg(spark_sum("amount").alias("monthly_revenue"))

# Step 3: Broadcast join with the small customer profile table
df_final = df_revenue.join(broadcast(customer_profile), "customer_id")
df_final.write.format("delta").mode("overwrite").save("/mnt/silver/revenue")
```
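
Before broadcasting, it is worth a quick back-of-the-envelope check that the profile table is small enough. The sketch below compares a rough size estimate with Spark's default spark.sql.autoBroadcastJoinThreshold of 10 MB. The 10K row count comes from the scenario; the 200-byte average row width is purely an assumption for illustration and should be replaced with a measured value.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024

def estimated_table_bytes(row_count, avg_row_bytes):
    """Rough size estimate: number of rows times average row width."""
    return row_count * avg_row_bytes

# 10K rows comes from the scenario; 200 bytes per row is an assumed figure.
profile_bytes = estimated_table_bytes(row_count=10_000, avg_row_bytes=200)

print(profile_bytes)                                      # 2000000
print(profile_bytes < DEFAULT_BROADCAST_THRESHOLD_BYTES)  # True
```

Under that assumption the table is about 2 MB, well below the threshold, so a broadcast join is a reasonable choice here.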


**⭐ 8. Time & Space Complexity**

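Rough cost estimates, where n is the number of rows entering the wide transformation, n_filtered is the number of rows left after filtering or pre-aggregation, and p is the number of shuffle partitions:

| Operation                              | Time Complexity                     | Space Complexity               |
| -------------------------------------- | ----------------------------------- | ------------------------------ |
| Wide transformation w/ shuffle         | O(n log p + shuffle)                | O(n + shuffle buffer)          |
| Wide transformation w/ reduced shuffle | O(n_filtered log p + shuffle_small) | O(n_filtered + shuffle buffer) |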


**⭐ 9. Common Pitfalls**

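- Shuffling the full dataset unnecessarily (e.g., joining before filtering).
- Not broadcasting small tables → unnecessary data movement.
- Ignoring partitioning and key distribution → skewed, uneven shuffle partitions.
- Pre-aggregating incorrectly → inaccurate results (e.g., averaging per-partition averages instead of combining sums and counts).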
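
The last pitfall is easy to reproduce in a few lines of plain Python. The numbers are made up; the point is that a mean cannot be pre-aggregated as a mean, whereas a (sum, count) pair can be combined safely.

```python
partitions = [
    [10, 20, 30],  # partition 0: three amounts
    [100],         # partition 1: one amount
]

# Incorrect: pre-aggregate to a mean per partition, then average the means.
partition_means = [sum(p) / len(p) for p in partitions]
mean_of_means = sum(partition_means) / len(partition_means)

# Correct: pre-aggregate to (sum, count) per partition, then combine.
partials = [(sum(p), len(p)) for p in partitions]
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
true_mean = total_sum / total_count

print(mean_of_means)  # 60.0
print(true_mean)      # 40.0
```

Sums, counts, minimums, and maximums combine correctly across partial results; averages, medians, and distinct counts need extra care.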