# PySpark: Zero to Hero
## Module 20: Optimizing Shuffle Operations

Shuffle is one of the most expensive operations in Spark because it involves:
1.  **Disk I/O:** Writing intermediate data to disk.
2.  **Network I/O:** Transferring data between executor nodes.
3.  **Serialization/Deserialization:** Converting data for transfer.

In this module, we will learn how to optimize shuffle by tuning the partition count and understanding how to minimize data movement.

### Agenda:
1.  **The Shuffle Problem:** Why 200 partitions is the default and often wrong.
2.  **Tuning Partitions:** How to calculate the optimal shuffle partition number.
3.  **Coalesce vs. Repartition:** Efficiently changing partition counts.
4.  **Best Practices:** Filtering early to reduce shuffle size.

In [None]:
from pyspark.sql import SparkSession
# Namespaced import avoids shadowing Python's built-in sum()
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("Shuffle_Optimization") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

## 1. Default Shuffle Partitions

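By default, a shuffle triggered by a wide DataFrame transformation (Join/GroupBy) produces **200 partitions**, the default value of `spark.sql.shuffle.partitions`. On Spark 3.2+, where Adaptive Query Execution (AQE) is enabled by default, Spark may coalesce small shuffle partitions at runtime, so the count you observe can be lower.

*   **Too Many Partitions (Overkill):** For small data, you get many tiny tasks (and tiny output files if you write the result), so scheduling overhead dominates the actual work.
*   **Too Few Partitions (Bottleneck):** For huge data, each partition becomes too large, causing spills to disk or Out Of Memory (OOM) errors.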

In [None]:
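# Demo only: disable Adaptive Query Execution (on by default since Spark 3.2)
# so it does not coalesce the shuffle partitions we want to observe
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Create a small DataFrame (ids 1..999)
df = spark.range(1, 1000)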

# Perform a wide transformation (GroupBy) -> Triggers Shuffle
df_grouped = df.groupBy("id").count()

# Check the number of partitions AFTER the shuffle
print(f"Default Shuffle Partitions: {df_grouped.rdd.getNumPartitions()}")

# Check the config value
print(f"Config Value: {spark.conf.get('spark.sql.shuffle.partitions')}")

## 2. Tuning the Partition Count

For smaller datasets (like our local examples), 200 is too high. Lower it to roughly match the core count or the data size (e.g., 8 or 16).

For larger datasets (TB scale), you might need to increase this to 1000 or more so that each shuffle partition stays small enough to process in executor memory.
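
The sizing advice above can be turned into a rough calculation. The sketch below is a rule-of-thumb estimate, not a Spark API: the helper name `estimate_shuffle_partitions` is hypothetical, and the 128 MiB target partition size is an assumed, commonly quoted starting point that you should tune for your own workload. It sizes partitions by data volume, never goes below the core count, and rounds up to a multiple of the cores so tasks run in full waves.

```python
def estimate_shuffle_partitions(shuffle_bytes, total_cores,
                                target_partition_bytes=128 * 1024 * 1024):
    """Rule-of-thumb shuffle partition count (hypothetical helper).

    shuffle_bytes: estimated size of the data being shuffled
    total_cores: executor cores available to the job
    target_partition_bytes: assumed target size per partition (128 MiB)
    """
    if shuffle_bytes <= 0 or total_cores <= 0 or target_partition_bytes <= 0:
        raise ValueError("all arguments must be positive")

    def ceil_div(a, b):
        # Integer ceiling division, avoids floating-point rounding
        return -(-a // b)

    # Partitions needed so each one is at most the target size
    by_size = ceil_div(shuffle_bytes, target_partition_bytes)
    # Use at least one partition per core so no core sits idle
    needed = max(by_size, total_cores)
    # Round up to a multiple of the core count (full task waves)
    return ceil_div(needed, total_cores) * total_cores


GIB = 1024 ** 3
MIB = 1024 ** 2

print(estimate_shuffle_partitions(50 * GIB, total_cores=16))   # 400
print(estimate_shuffle_partitions(10 * MIB, total_cores=8))    # 8
print(estimate_shuffle_partitions(100 * GIB, total_cores=48))  # 816
```

The result would then be passed to `spark.conf.set("spark.sql.shuffle.partitions", ...)` as a string or integer. Treat it as a first guess and confirm against the shuffle read sizes shown in the Spark UI.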

In [None]:
# Set shuffle partitions to 8 (assumes about 8 local cores; adjust to your machine)
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Perform the same operation again
df_optimized = df.groupBy("id").count()

print(f"Optimized Shuffle Partitions: {df_optimized.rdd.getNumPartitions()}")
df_optimized.explain()

## 3. Coalesce vs. Repartition

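Sometimes you want to change the partition count manually, without a GroupBy/Join.

*   **`repartition(n)`:** Full shuffle. Redistributes data evenly across `n` partitions. **Expensive**. Use when increasing partitions or rebalancing skewed data.
*   **`coalesce(n)`:** No full shuffle. Merges existing partitions, so it can only reduce the count. **Efficient**. Use when decreasing partitions (e.g., before writing to disk), but note that a drastic coalesce such as `coalesce(1)` can also reduce the parallelism of the upstream stage.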

In [None]:
# Repartition: sets the partition count to 10 (Triggers a full Shuffle)
df_repartitioned = df.repartition(10)
print(f"Repartitioned Count: {df_repartitioned.rdd.getNumPartitions()}")

# Coalesce: Decreases partitions (No Full Shuffle)
df_coalesced = df_repartitioned.coalesce(2)
print(f"Coalesced Count: {df_coalesced.rdd.getNumPartitions()}")

## 4. Filter Early (Predicate Pushdown)

The best way to optimize shuffle is to **shuffle less data**.
Always filter your DataFrames **BEFORE** joining or grouping. Spark's Catalyst optimizer pushes filters down automatically in many cases, but filtering explicitly makes the intent clear and does not depend on the optimizer recognizing the pattern.

In [None]:
# Bad Practice: Join huge tables, then filter
# Good Practice: Filter first, then join

df1 = spark.range(1000).withColumnRenamed("id", "id1")
df2 = spark.range(1000).withColumnRenamed("id", "id2")

# Filter BEFORE Join
df1_filtered = df1.filter("id1 > 500")

# Join smaller dataset
df_joined = df1_filtered.join(df2, df1_filtered.id1 == df2.id2)

df_joined.explain()

## Summary

1.  **Default Partitions:** 200 is the default. Change it using `spark.sql.shuffle.partitions`.
2.  **Optimization Rule:**
    *   Small Data: Reduce partitions (e.g., 10-50).
    *   Large Data: Increase partitions (e.g., 500-2000) to avoid disk spills and OOM.
3.  **Coalesce:** Use `coalesce` to reduce partition count efficiently before writing files.
4.  **Filter Early:** Reduce the amount of data entering the shuffle stage.

**Next Steps:**
In the next module, we will explore **Caching and Persistence**, techniques to speed up iterative algorithms by storing data in memory.