**⭐ 1. What This Pattern Solves**

Reduces expensive shuffles in Spark jobs by minimizing wide transformations (like groupBy, join, distinct, repartition).

Wide transformations require data movement across partitions → slow and resource-heavy.

Narrow transformations (like map, filter, select) operate within partitions → fast.
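To make the distinction concrete, here is a minimal plain-Python sketch (no Spark required) that models a dataset as a list of partitions. The names `narrow_filter` and `wide_group_sum` are hypothetical helpers invented for this illustration, and counting "moved" rows is a simplification of what a real shuffle does.

```python
from collections import defaultdict

# Toy model: a dataset is a list of partitions, each a list of (user_id, amount) rows.
partitions = [
    [(1, 50), (2, 150), (3, 300)],
    [(1, 250), (2, 20), (4, 120)],
]

def narrow_filter(parts, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]

def wide_group_sum(parts, num_partitions):
    """Wide: rows are re-routed by key so that equal keys meet in one partition."""
    shuffled = [defaultdict(int) for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(parts):
        for user_id, amount in part:
            dst = user_id % num_partitions  # hash partitioning on the key
            if dst != src:
                moved += 1  # this row would have to leave its partition
            shuffled[dst][user_id] += amount
    return [dict(p) for p in shuffled], moved

filtered = narrow_filter(partitions, lambda row: row[1] > 100)
totals, rows_moved = wide_group_sum(filtered, num_partitions=2)
```

The filter never looks outside its own partition, while the grouped sum has to send rows to whichever partition owns their key. In Spark itself, a shuffle shows up as an `Exchange` operator in the physical plan printed by `df.explain()`.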

Use cases:

Large joins in ETL pipelines.

Aggregations over billions of rows.

Improving streaming pipeline latency.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Wide transformation (shuffle-heavy)
SELECT user_id, SUM(amount)
FROM transactions
GROUP BY user_id;

-- Narrow transformations (no shuffle)
SELECT user_id, amount * 2 AS double_amount
FROM transactions
WHERE amount > 100;

-- Wide = requires redistribution.
-- Narrow = per-row operations.

**⭐ 3. Core Idea**

Narrow transformations → partition-local → low cost.

Wide transformations → require shuffle → optimize by:

Reducing data before shuffle (filter, select).

Choosing the partitioning and join strategy deliberately (repartition on the key, or broadcast the small side of a join).

Avoiding unnecessary wide operations.
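The effect of reducing data before a shuffle can be sketched in plain Python. The `shuffle_cost` function below is a hypothetical proxy (record count plus total field length), not a Spark metric, and the `payload` column is an assumed stand-in for wide columns the aggregation does not need.

```python
# Toy rows: (user_id, amount, payload). Payload stands in for wide, unneeded columns.
rows = [(i % 5, i * 10, "x" * 100) for i in range(1, 21)]

def shuffle_cost(records):
    """Rough proxy for shuffle volume: (number of records, total field length)."""
    size = sum(len(str(field)) for rec in records for field in rec)
    return len(records), size

# Shuffle first, reduce later: every row and every column is redistributed.
late = shuffle_cost(rows)

# Filter and project first (both narrow), then shuffle only what is needed.
reduced = [(user_id, amount) for user_id, amount, _ in rows if amount > 100]
early = shuffle_cost(reduced)
```

Here `early` covers half the records of `late` and a small fraction of the bytes, which is the whole argument for placing `filter` and `select` ahead of `groupBy` or `join`.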

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import broadcast

# Narrow transformations: filter rows and prune columns early
df2 = df.filter("amount > 100").select("user_id", "amount")

# Wide transformation
agg_df = df2.groupBy("user_id").sum("amount")  # causes shuffle

# Optimization: keep only the needed columns of the small table,
# then broadcast it so the large side is not shuffled
small_df = small_df.select("user_id", "value")
joined_df = df2.join(broadcast(small_df), "user_id")

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, broadcast

spark = SparkSession.builder.appName("PerfPatterns").getOrCreate()
data = [("Alice", 100), ("Bob", 200), ("Charlie", 300)]
df = spark.createDataFrame(data, ["user", "amount"])

# Narrow transformation
df_narrow = df.filter(col("amount") > 100).select("user", "amount")

# Wide transformation (groupBy triggers shuffle)
agg_df = df_narrow.groupBy("user").agg(spark_sum("amount").alias("total"))

# Optimize join by broadcasting small DF
small_df = spark.createDataFrame([("Alice", "NY")], ["user", "state"])
joined_df = df_narrow.join(broadcast(small_df), "user")
joined_df.show()

**Explanation:**

filter + select = narrow → fast.

groupBy = wide → triggers shuffle.

broadcast ships the small table to every executor, so the large side of a small-large join is not shuffled.
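The mechanism behind a broadcast join can be modeled in a few lines of plain Python. This is a sketch under assumptions: `broadcast_join` is a hypothetical helper, the small table is assumed to have unique keys, and the second small-table row ("Bob", "CA") is made up for the illustration.

```python
# Large side: partitions of (user, amount). Small side: a lookup table that fits in memory.
large_partitions = [
    [("Alice", 100), ("Bob", 200)],
    [("Charlie", 300), ("Alice", 50)],
]
small_table = [("Alice", "NY"), ("Bob", "CA")]  # assumed: one row per user

def broadcast_join(parts, small):
    """Inner join in which every partition probes its own copy of the small table."""
    lookup = dict(small)  # built once, then copied to every task
    return [
        [(user, amount, lookup[user]) for user, amount in part if user in lookup]
        for part in parts
    ]

joined = broadcast_join(large_partitions, small_table)
```

Each large partition is joined where it already lives; only the small table travels. Spark can also choose this strategy on its own when the small side is below `spark.sql.autoBroadcastJoinThreshold`, so the explicit `broadcast()` hint matters mostly when the optimizer cannot estimate the table size.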

**⭐ 6. Mini Practice Problems**

Identify wide vs narrow transformations in this code: df.filter(...).join(...).select(...).

How would you optimize a groupBy on a massive DataFrame?

When is broadcasting a small table helpful?

**⭐ 7. Full Data Engineering Problem**

Scenario: A 500M-row clickstream dataset is joined with a 1,000-row user profile table. After filtering to clicks from the last week, we want aggregated metrics per user.

In [0]:
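from pyspark.sql.functions import col, sum as spark_sum, broadcast

# Assumes `clicks` and `user_profile` are DataFrames that are already loaded.

# Step 1: Narrow transformation: filter (cutoff date hard-coded for illustration)
df_filtered = clicks.filter(col("event_date") >= "2025-12-01")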

# Step 2: Optimize join with broadcast
df_joined = df_filtered.join(broadcast(user_profile), "user_id")

# Step 3: Wide transformation: groupBy
agg_df = df_joined.groupBy("user_id").agg(spark_sum("clicks").alias("total_clicks"))
agg_df.show()

This minimizes shuffle: the filter runs first, the broadcast join moves only the small table, and the single wide operation (groupBy) runs on data that has already been reduced.
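The scenario asks for "the last week", while the cell uses a literal date. One possible way to derive the cutoff is sketched below; `last_week_cutoff` is a hypothetical helper, and it assumes `event_date` is stored in a form that compares correctly against an ISO `YYYY-MM-DD` string.

```python
from datetime import date, timedelta

def last_week_cutoff(today=None):
    """Return the ISO date string seven days before `today` (default: current date)."""
    today = today or date.today()
    return (today - timedelta(days=7)).isoformat()

cutoff = last_week_cutoff(date(2025, 12, 8))
# In the pipeline this value would replace the literal, for example:
#   clicks.filter(col("event_date") >= cutoff)
```

Passing the date in explicitly keeps the function deterministic for testing and for reruns of a past pipeline window.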


**⭐ 8. Time & Space Complexity**

| Operation               | Time Complexity      | Space Complexity      |
| ----------------------- | -------------------- | --------------------- |
| Narrow (filter, select) | O(n)                 | O(n)                  |
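| Wide (groupBy, join)    | O(n log p + shuffle) | O(n + shuffle buffer) |

Here n is the number of rows and p is the number of partitions; "shuffle" stands for the network and disk I/O of redistributing rows, which typically dominates the cost of a wide operation.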

**⭐ 9. Common Pitfalls**

Applying wide transformations too early → massive shuffle.

Joining a large table to a small one without broadcasting the small side → unnecessary data movement.

Forgetting to pre-filter → shuffling unnecessary data.

Ignoring key distribution when partitioning → skew (uneven partition sizes) → a few straggler tasks hold up the whole stage.
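The skew pitfall is easy to see with a toy partitioner. The sketch below is plain Python with integer keys and key-modulo partitioning, a deliberate simplification of Spark's hash partitioner; `partition_sizes` and the 70% "hot user" are assumptions made for the illustration.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under key-modulo partitioning."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# 1,000 rows spread evenly over user ids 0..99.
even_keys = [i % 100 for i in range(1000)]
# Same row count, but one hot user (id 0) owns 70% of the rows.
hot_keys = [0] * 700 + [i % 100 for i in range(300)]

even = partition_sizes(even_keys, 4)    # [250, 250, 250, 250]
skewed = partition_sizes(hot_keys, 4)   # [775, 75, 75, 75]
```

With the hot key, one partition holds more than ten times as many rows as the others, and the stage can only finish when that one task does. Adding more partitions does not help, because all rows for a single key still land together.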