# PySpark: Zero to Hero
## Module 19: Understanding Spark Execution Plan, DAG, and Shuffle

To write optimized Spark applications, you must understand what happens under the hood. This module dives into the **DAG (Directed Acyclic Graph)**, how Spark breaks down jobs into **Stages** and **Tasks**, and the performance-critical concept of **Shuffle**.

### Agenda:
1.  **Lazy Evaluation & Actions:** When does Spark actually run?
2.  **The DAG:** Visualizing the execution graph.
3.  **Jobs, Stages, and Tasks:** The hierarchy of execution.
4.  **Shuffle:** What is it, and why is it expensive?
5.  **Spark UI:** Analyzing the execution plan.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

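# We disable Adaptive Query Execution (AQE) temporarily to see the raw DAG
# and full shuffle impact for learning purposes.
# We also disable automatic broadcast joins: with data this small, Spark would
# otherwise broadcast one side and skip the join shuffle entirely.
spark = SparkSession.builder \
    .appName("Spark_Internals_DAG") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "false") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .getOrCreate()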

print(f"Spark UI Link: {spark.sparkContext.uiWebUrl}")

In [None]:
# We create two DataFrames to simulate a Join operation (which triggers a Shuffle)

# DataFrame 1: Even numbers
df1 = spark.range(0, 200, 2).withColumnRenamed("id", "id1")
# Repartition to simulate distributed data (note: repartition is itself a shuffle)
df1 = df1.repartition(5)

# DataFrame 2: Multiples of 4
df2 = spark.range(0, 200, 4).withColumnRenamed("id", "id2")
df2 = df2.repartition(7)

print("DataFrames Created (Lazy Evaluation - Nothing executed yet)")

## 1. Triggering Execution

Spark is lazy. The transformations above (`range`, `withColumnRenamed`, `repartition`) only built a logical plan; no data has been generated or moved yet.
Execution starts only when we call an **Action** (like `.count()`, `.show()`, `.collect()`).

Let's perform a **Join**, which forces Spark to move data around (Shuffle).
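
To make the transformation/action split concrete, here is a plain-Python toy model. It is not Spark's implementation: the `LazyPlan` class and its methods are hypothetical names invented for this sketch. Transformations only record a step, and the recorded steps run when the action `collect()` is called.

```python
class LazyPlan:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self.source = source        # any iterable of input values
        self.steps = tuple(steps)   # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new plan and touches no data.
        return LazyPlan(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also just recorded.
        return LazyPlan(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        rows = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyPlan(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has executed yet
result = plan.collect()           # the action triggers execution
```

`calls` stays empty until `collect()` runs, which mirrors how the DataFrames above exist only as plans until an action is called.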

In [None]:
# Joining the two DataFrames
# This requires data with the same Key to be on the same Partition.
# Spark must perform a SHUFFLE to achieve this.

df_joined = df1.join(df2, df1["id1"] == df2["id2"], "inner")

# Action to trigger the job
df_joined.show(5)
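
Why must matching keys share a partition? The plain-Python sketch below mimics the idea with the same numbers as the notebook. It is a simplification: Spark hashes the join key with its own hash function before taking the modulo, whereas this toy uses `key % num_partitions` directly, and `shuffle_by_key` is a hypothetical helper name.

```python
from collections import defaultdict


def shuffle_by_key(keys, num_partitions):
    """Route each key to partition key % num_partitions (toy hash partitioner)."""
    partitions = defaultdict(list)
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


# Same data as the notebook: even numbers and multiples of 4 below 200.
left = list(range(0, 200, 2))
right = list(range(0, 200, 4))

num_partitions = 3
left_parts = shuffle_by_key(left, num_partitions)
right_parts = shuffle_by_key(right, num_partitions)

# After the shuffle, every partition can be joined on its own,
# because equal keys were routed to the same partition index.
joined = []
for p in range(num_partitions):
    right_keys = set(right_parts[p])
    joined.extend(k for k in left_parts[p] if k in right_keys)

joined.sort()
```

Joining partition by partition gives the same rows as joining the full lists, which is what lets each join task work independently once the shuffle has finished.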

## 2. The Explain Plan

We can see how Spark intends to execute the query using `.explain()`.

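Look for:
*   **Scan / Range:** Reading or generating the input data (our `spark.range` DataFrames appear as `Range`).
*   **Exchange:** This indicates a **Shuffle** (data redistributed across partitions, usually over the network). The partitioning shown next to it, such as `hashpartitioning` or `RoundRobinPartitioning`, tells you how rows are redistributed. A `BroadcastExchange` is a broadcast, not a shuffle.
*   **HashAggregate / Sort / SortMergeJoin:** The actual computation (aggregation, sorting, joining).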

In [None]:
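print("--- Logical and Physical Plans ---")
# Passing True (extended) prints the logical plans as well as the physical plan;
# a bare .explain() prints only the physical plan.
df_joined.explain(True)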

## 3. Jobs, Stages, and Tasks

*   **Job:** Triggered by an Action (e.g., `show()`).
*   **Stage:** Spark breaks a Job into Stages at **Shuffle boundaries**.
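    *   If you have a shuffle-inducing operation such as `repartition`, `groupBy`, or a non-broadcast `join`, Spark must end the current stage, write its output to local disk (shuffle write), and start a new stage that fetches that output (shuffle read).
*   **Task:** The unit of work. Each stage runs one task per partition.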

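**In our example:**
1.  **Stage 0 & 1:** Generate `df1` and `df2` independently. `range` and `withColumnRenamed` are narrow transformations, so they run within a single stage.
2.  **Exchange (Shuffle):** Redistribute both sides by join key so that matching keys land in the same partition.
3.  **Stage 2:** Perform the Join on the shuffled data.

This is a simplified picture. Each `repartition` call is a shuffle of its own, so the Spark UI will show extra stages, and the stage IDs may differ from the ones listed here.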
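
The rule "a new stage starts at every shuffle" can be sketched in plain Python for one branch of the job. This is a toy model, not Spark's scheduler: `split_into_stages` is a hypothetical helper, and the operation list assumes a sort-merge join (not a broadcast join), with `repartition` counted as a shuffle.

```python
def split_into_stages(operations):
    """Split a linear list of (name, is_shuffle) steps at shuffle boundaries."""
    stages = [[]]
    for name, is_shuffle in operations:
        if is_shuffle:
            # The shuffle write closes the current stage...
            stages[-1].append(f"{name} (shuffle write)")
            # ...and the shuffle read opens the next one.
            stages.append([f"{name} (shuffle read)"])
        else:
            stages[-1].append(name)
    return stages


# One side of the join, in the order the notebook builds it.
df1_branch = [
    ("range", False),
    ("withColumnRenamed", False),
    ("repartition", True),
    ("join key exchange", True),
    ("sort-merge join", False),
]

stages = split_into_stages(df1_branch)
for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

For a linear chain like this, the number of stages is the number of shuffles plus one. The real job has two such branches that meet at the join, so the Spark UI shows more stages than this single-branch sketch.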

In [None]:
# By default, Spark uses 200 shuffle partitions for joins/aggregations.
# This is often too high for small data (many tiny tasks, scheduling overhead)
# or too low for huge data (oversized partitions that spill to disk or cause OOM).

current_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"Default Shuffle Partitions: {current_partitions}")

# Let's change it to something smaller for our small dataset
spark.conf.set("spark.sql.shuffle.partitions", 4)

# Re-run the join to see the difference in task count in Spark UI
df_joined.count()
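
One rough way to pick a value is to work backwards from the amount of data being shuffled. The sketch below is only a heuristic: `suggest_shuffle_partitions` is a hypothetical helper, and the 128 MB target per partition is an assumed rule of thumb, not a Spark setting or an official recommendation.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MB,
                               min_partitions=1):
    """Hypothetical helper: partitions needed so each holds about the target size."""
    needed = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, needed)


small = suggest_shuffle_partitions(10 * MB)   # about 10 MB of shuffle data
large = suggest_shuffle_partitions(500 * GB)  # about 500 GB of shuffle data

print(f"small dataset: {small} partition(s)")
print(f"large dataset: {large} partition(s)")
```

Because each stage runs one task per partition, the chosen value is also the task count of the post-shuffle stage. The result could be passed to `spark.conf.set("spark.sql.shuffle.partitions", ...)`, ideally after checking actual shuffle sizes in the Spark UI.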

## Summary

1.  **DAG:** Spark builds a graph of transformations and only runs it when an Action is called.
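2.  **Shuffle:** The process of redistributing data across partitions, and usually across executors (shown as `Exchange` in the plan). It is expensive (Disk I/O + Network).
3.  **Stages:** Created whenever a Shuffle occurs. Minimizing shuffles improves performance.
4.  **Tuning:** The configuration `spark.sql.shuffle.partitions` controls how many partitions (and therefore tasks) follow a shuffle, which makes it critical for tuning join and aggregation performance.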

**Next Steps:**
In the next module, we will look at **Spark Memory Management** and how to deal with Out Of Memory (OOM) errors.