# PySpark: Zero to Hero
## Module 26: Adaptive Query Execution (AQE)

Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that uses **runtime statistics** to re-optimize the query execution plan while the query is running.

In Spark 2.x, the query plan was fixed once generated. In Spark 3.x with AQE, the plan can change during execution: each time a shuffle stage finishes, Spark can use the actual data sizes observed at that stage to re-plan the stages that have not run yet.

### Agenda:
1.  **Baseline:** Running a Skewed Join *without* AQE (Standard SortMergeJoin).
2.  **Enable AQE:** Configuring Spark to use Adaptive Query Execution.
3.  **Feature 1: Coalescing Post-Shuffle Partitions:** Dynamically reducing the number of shuffle partitions.
4.  **Feature 2: Optimizing Skew Joins:** Automatically splitting skewed partitions.
5.  **Feature 3: Dynamic Join Selection:** Switching from SortMergeJoin to BroadcastJoin at runtime.

In [None]:
from pyspark.sql import SparkSession
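# Explicit import: a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import spark_partition_id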
import random
import time

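# Initialize Spark Session
# We set shuffle partitions to 200 (also the default) to demonstrate the "too many partitions" problem.
# Note: with master local[*] everything runs inside the driver JVM, so spark.executor.memory has little practical effect here.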
spark = SparkSession.builder \
    .appName("AQE_Demo") \
    .master("local[*]") \
    .config("spark.executor.memory", "512m") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# 1. Create Department Data
dept_data = [(i, f"Dept_{i}") for i in range(0, 10)]
dept_df = spark.createDataFrame(dept_data, ["dept_id", "dept_name"])

# 2. Create Skewed Employee Data (1 million rows; about 90% of them land in dept 8 or 9)
# This is the same data generation logic as in the previous module
def generate_skewed_data():
    data = []
    for _ in range(1000000):
        if random.random() > 0.1:
            dept = random.choice([8, 9])
        else:
            dept = random.choice(range(0, 8))
        data.append((dept, "Emp_Name"))
    return data

emp_rdd = spark.sparkContext.parallelize(generate_skewed_data())
emp_df = spark.createDataFrame(emp_rdd, ["dept_id", "emp_name"])

print("Data Generation Complete.")

In [None]:
# To see the impact of AQE, we first disable it.
# We also disable autoBroadcastJoin to force a SortMergeJoin/Shuffle.

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") # Disable Broadcast

print("AQE Disabled. Running Baseline Join...")

start_time = time.time()
joined_df = emp_df.join(dept_df, "dept_id", "left_outer")
# Trigger action
joined_df.write.format("noop").mode("overwrite").save()
print(f"Baseline Join Duration: {time.time() - start_time:.2f} seconds")

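# Observation in Spark UI:
# 1. Check Stages: The join stage runs 200 tasks, one per shuffle partition.
# 2. There are only 10 distinct dept_id values, so at most 10 partitions receive rows. The rest are empty (0 records) and only add scheduling overhead.
# 3. The partitions holding dept 8 and 9 process most of the data (skew), which can cause spill to disk.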
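
To see why so few of the 200 partitions receive data, the sketch below mimics hash partitioning in plain Python. It is a simplified model, not Spark's code: Spark hashes the join key with Murmur3 before taking the modulo, whereas this stand-in applies the modulo directly to `dept_id`. The helper `pick_dept` and the 100,000-row sample size are choices made here for illustration.

```python
import random
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 200
NUM_ROWS = 100_000  # smaller than the notebook's 1M rows, for speed

random.seed(42)  # fixed seed so reruns produce the same sample


def pick_dept() -> int:
    """Same skew rule as the notebook: about 90% of rows get dept 8 or 9."""
    if random.random() > 0.1:
        return random.choice([8, 9])
    return random.choice(range(0, 8))


# Stand-in partitioner (assumption): key modulo partition count.
rows_per_partition = Counter(
    pick_dept() % NUM_SHUFFLE_PARTITIONS for _ in range(NUM_ROWS)
)

non_empty = len(rows_per_partition)
top_two = rows_per_partition.most_common(2)
top_two_share = (top_two[0][1] + top_two[1][1]) / NUM_ROWS

print(f"Non-empty partitions: {non_empty} of {NUM_SHUFFLE_PARTITIONS}")
print(f"Share of rows in the two largest partitions: {top_two_share:.1%}")
```

With only 10 distinct keys, no more than 10 partitions can ever be non-empty, no matter how high `spark.sql.shuffle.partitions` is set.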

## AQE Optimization Features

When we enable `spark.sql.adaptive.enabled`, Spark unlocks three capabilities:

1.  **Coalescing Shuffle Partitions:**
    *   Instead of always reading 200 fixed partitions, AQE looks at the actual partition sizes after the shuffle map stage.
    *   If partitions are small, it merges (coalesces) adjacent ones into fewer, larger partitions, aiming for the advisory partition size.
    
2.  **Optimizing Skew Joins:**
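    *   If a partition is larger than `spark.sql.adaptive.skewJoin.skewedPartitionFactor` times the median partition size and also larger than `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`, AQE splits it into smaller sub-partitions.
    *   This happens automatically, without manual "salting".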

3.  **Converting Sort-Merge Join to Broadcast Join:**
    *   If a table size is estimated to be large initially but turns out to be small after filtering (at runtime), AQE can switch the strategy to Broadcast Join dynamically.
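
The coalescing idea from item 1 can be sketched in plain Python. This is a simplified greedy model for illustration, not Spark's actual implementation; the function name `coalesce_partitions` and the partition sizes are made up here.

```python
from typing import List

MB = 1024 * 1024


def coalesce_partitions(sizes_bytes: List[int], advisory_bytes: int) -> List[List[int]]:
    """Greedily group adjacent shuffle partitions up to an advisory size.

    Returns one list of original partition indices per coalesced partition.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes_bytes):
        # Close the current group if adding this partition would overshoot the target.
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Ten shuffle partitions: mostly tiny or empty, a few larger (made-up sizes).
sizes = [0, 0, 1 * MB, 0, 6 * MB, 0, 0, 2 * MB, 7 * MB, 0]
groups = coalesce_partitions(sizes, advisory_bytes=8 * MB)
print(len(groups), groups)
```

Here ten original partitions collapse into three, so the next stage schedules three tasks instead of ten.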

In [None]:
# Enable AQE
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Feature 1: Coalesce Partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Set the advisory size small (8MB; the default is 64MB) to trigger coalescing behavior on our small local data
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8m")

# Feature 2: Skew Join Optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
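# Lower the size threshold (10MB; the default is 256MB) so our small local data can qualify as "skewed".
# A partition must also exceed skewedPartitionFactor (default 5) times the median partition size.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "10m")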

# Keep Broadcast disabled for now to verify Skew handling in SortMergeJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

print("AQE Enabled (Coalesce + Skew). Running Join...")

start_time = time.time()
joined_df_aqe = emp_df.join(dept_df, "dept_id", "left_outer")
joined_df_aqe.write.format("noop").mode("overwrite").save()
print(f"AQE Join Duration: {time.time() - start_time:.2f} seconds")

# Observation in Spark UI:
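# 1. Shuffle Partitions: Instead of 200, you might see a number like 10 or 17 (coalesced).
# 2. Skew: The task timeline should look more balanced if the skewed partition crossed both thresholds and was split.
# 3. Plan: In the SQL tab, the final plan contains 'AQEShuffleRead' (named 'CustomShuffleReader' in Spark 3.0/3.1).
#    joined_df_aqe.explain() may still show the initial plan (isFinalPlan=false).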
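
The skew check that AQE applies is a two-part rule: a partition counts as skewed only if it is larger than the skew factor times the median partition size and also larger than the byte threshold. The plain-Python sketch below illustrates that rule on made-up sizes. The function name `find_skewed` is hypothetical, and the split estimate at the end is only a rough guess (size divided by the advisory size); Spark splits along map output boundaries, so real split counts can differ.

```python
import math
import statistics
from typing import List

MB = 1024 * 1024


def find_skewed(sizes_bytes: List[int], factor: float, threshold_bytes: int) -> List[int]:
    """Return indices of partitions that satisfy both skew conditions."""
    median = statistics.median(sizes_bytes)
    return [
        idx
        for idx, size in enumerate(sizes_bytes)
        if size > factor * median and size > threshold_bytes
    ]


# Made-up shuffle partition sizes for one side of a join.
sizes = [1 * MB, 1 * MB, 2 * MB, 1 * MB, 40 * MB, 1 * MB, 1 * MB, 35 * MB]
skewed = find_skewed(sizes, factor=5.0, threshold_bytes=10 * MB)

# Rough estimate only: advisory-sized chunks each skewed partition would need.
advisory_bytes = 8 * MB
estimated_splits = {idx: math.ceil(sizes[idx] / advisory_bytes) for idx in skewed}

print(skewed, estimated_splits)
```

Because both conditions must hold, lowering only the byte threshold is not enough when the median partition size is itself large.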

In [None]:
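# Let's verify that partitions were actually coalesced.
# Note: spark_partition_id() tags each row with the partition it lives in, so the
# distinct count below reports only partitions that hold at least one row.

from pyspark.sql.functions import spark_partition_id

part_count = joined_df_aqe.withColumn("pid", spark_partition_id()).select("pid").distinct().count()
print(f"Non-empty partitions after the AQE join: {part_count}")
# The total count, including empty partitions, should come from the underlying RDD
# (accessing it runs the shuffle stages again):
print(f"Total partitions after the AQE join: {joined_df_aqe.rdd.getNumPartitions()}")
# Expectation: Much lower than 200.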

In [None]:
# Feature 3: Converting SortMergeJoin to BroadcastJoin at Runtime
# We re-enable the broadcast threshold (e.g., 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") # 10MB

print("AQE Enabled (Full Features). Running Join...")

start_time = time.time()
joined_df_broadcast = emp_df.join(dept_df, "dept_id", "left_outer")
joined_df_broadcast.write.format("noop").mode("overwrite").save()
print(f"AQE + Dynamic Broadcast Duration: {time.time() - start_time:.2f} seconds")

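# Observation:
# Even if we initially planned for a SortMergeJoin, AQE realizes 'dept_df' is tiny
# at runtime and switches to a BroadcastHashJoin.
# The shuffle map stages have already run by then, so the shuffle is not avoided entirely.
# What AQE saves is the sort on both sides and, with local shuffle reads, most of the network transfer.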
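
The runtime decision can be pictured as a size check against the broadcast threshold. The sketch below is a simplified model, not Spark's planner: the function name `choose_join_strategy`, the returned labels, and the byte sizes are all made up for illustration. It does reflect two facts about Spark: a negative threshold disables broadcasting, and an outer join can only broadcast the side that is not preserved (the right side for a left outer join).

```python
MB = 1024 * 1024


def choose_join_strategy(left_bytes: int, right_bytes: int,
                         threshold_bytes: int, join_type: str = "inner") -> str:
    """Pick a join strategy from observed side sizes (simplified model)."""
    if threshold_bytes < 0:
        return "SortMergeJoin"  # broadcasting disabled
    right_ok = join_type in ("inner", "left_outer") and right_bytes <= threshold_bytes
    left_ok = join_type in ("inner", "right_outer") and left_bytes <= threshold_bytes
    if right_ok and left_ok:
        # Both sides qualify: broadcast the smaller one.
        side = "right" if right_bytes <= left_bytes else "left"
        return f"BroadcastHashJoin(build={side})"
    if right_ok:
        return "BroadcastHashJoin(build=right)"
    if left_ok:
        return "BroadcastHashJoin(build=left)"
    return "SortMergeJoin"


# Made-up runtime sizes: a large employee side and a tiny department side.
emp_bytes, dept_bytes = 30 * MB, 1024
print(choose_join_strategy(emp_bytes, dept_bytes, 10 * MB, "left_outer"))
print(choose_join_strategy(emp_bytes, dept_bytes, -1, "left_outer"))
```

The first call models this cell (threshold 10MB), the second models the earlier cells where the threshold was set to -1.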

## Summary: Why use AQE?

1.  **Simplicity:** You don't need to tune `spark.sql.shuffle.partitions` precisely for every job. Set it high enough and AQE merges small partitions at runtime. Coalescing only merges partitions, so the configured value still acts as the upper bound.
2.  **Robustness:** It handles join skew automatically, which in many cases removes the need for manual workarounds like salting.
3.  **Performance:** Switching to a broadcast join at runtime skips the sort on both sides and, with local shuffle reads, cuts network I/O.

**Note:** AQE is enabled by default in Spark 3.2+, but understanding these configs helps when debugging specific edge cases.
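
When debugging, it can help to keep the AQE settings from this module in one place and apply them together. The helper below is a small sketch: `AQE_DEMO_CONFS` and `apply_confs` are names invented here, and the values are the demo values used in this notebook, not recommended production settings. It relies only on `spark.conf.set` and `spark.conf.get`.

```python
from typing import Dict

# Demo values from this module (deliberately small for local data).
AQE_DEMO_CONFS: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "8m",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "10m",
}


def apply_confs(session, confs: Dict[str, str]) -> Dict[str, str]:
    """Set each key on session.conf and return the values read back."""
    applied = {}
    for key, value in confs.items():
        session.conf.set(key, value)
        applied[key] = session.conf.get(key)
    return applied


# Usage in the notebook (requires the `spark` session created earlier):
# for key, value in apply_confs(spark, AQE_DEMO_CONFS).items():
#     print(f"{key} = {value}")
```

Reading the values back after setting them makes it easy to confirm what the session is actually using before comparing runs.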