# PySpark: Zero to Hero
## Module 4: DataFrames and Execution Plans

In this module, we look at the core data structure of PySpark: the **DataFrame**. We also explore the "magic" that happens inside Spark when you run a query, specifically how the **Catalyst Optimizer** turns your DataFrame operations into an efficient physical execution plan.

### Agenda:
1.  **Structured APIs:** What is a DataFrame?
2.  **Immutability:** Why can't we change data?
3.  **The Life Cycle of a Spark Job:**
    *   Logical Planning (The "What")
    *   Physical Planning (The "How")
4.  **The DAG:** Directed Acyclic Graph.

## 1. What is a DataFrame?

A DataFrame is the most common Structured API in Spark.
*   **Structure:** It looks like a table in a database or a spreadsheet. It has **Rows** and **Columns**.
*   **Schema:** It has a defined schema (column names and data types).
*   **Distributed:** Unlike a Pandas DataFrame (which sits on one computer), a Spark DataFrame is split into **Partitions** and spread across the cluster.

> **Note:** DataFrames are built on top of RDDs. They are easier to use and usually faster than equivalent hand-written RDD code, because the optimization engine can rewrite the query before it runs.

## 2. Immutability

DataFrames are **Immutable**. This means once you create a DataFrame, **you cannot change it**.

*   **Scenario:** You have `df1` and you want to filter it.
*   **Action:** Spark does *not* modify `df1`. Instead, it creates a new DataFrame `df2` that contains the filtered results.
*   **Why?** This is crucial for fault tolerance in distributed computing. Every DataFrame is defined by its source plus an unchanging chain of transformations (its lineage), so if a node crashes, Spark knows exactly how to recompute the lost data.

## 3. How Spark Compiles Code: The Execution Plan

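When you write `df.select(...).filter(...)`, Spark does not run it immediately. It builds a plan, and nothing executes until you call an action such as `show()` or `count()`. Before execution, the plan goes through four distinct phases that optimize your code. This process is handled by the **Catalyst Optimizer**.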

### Phase 1: Unresolved Logical Plan
*   Spark checks your syntax.
*   It knows you want to query a table named "Sales", but it doesn't know if that table actually exists or if the columns are correct.

### Phase 2: Resolved Logical Plan
*   Spark looks up the **Catalog** (metadata repository).
*   It verifies: *"Does the 'Sales' table exist? Does column 'Amount' exist?"*
*   If yes, the plan is "Resolved". If no, it throws an AnalysisException.

### Phase 3: Optimized Logical Plan
*   The Catalyst Optimizer applies rules to make queries faster.
*   *Example:* If you `select` columns and then `filter` rows, Spark can reorder the plan so that the `filter` runs first, as close to the data source as possible (Predicate Pushdown), to reduce the data volume as early as possible.

### Phase 4: Physical Plan
*   Spark generates multiple physical plans (different ways to actually do the job on the hardware).
*   **Cost Model:** It calculates the "cost" (CPU/RAM usage) of each plan.
*   It selects the **Best Physical Plan**. The driver then breaks that plan into tasks and sends them to the Executors.

## 4. The DAG (Directed Acyclic Graph)

The final plan is visualized as a **DAG**.
*   **Directed:** It flows in one direction (Input -> Output).
*   **Acyclic:** It does not loop back on itself.
*   **Graph:** A visual representation of steps.
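
To make "directed" and "acyclic" concrete, here is a toy model in plain Python. It is not Spark's scheduler: the step names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the real machinery.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan steps: each key depends on the steps in its value set.
plan_steps = {
    "scan": set(),
    "filter": {"scan"},
    "select": {"filter"},
    "output": {"select"},
}

# Directed: the dependencies give a valid order from input to output.
execution_order = list(TopologicalSorter(plan_steps).static_order())
print(execution_order)  # ['scan', 'filter', 'select', 'output']

# Acyclic: if "scan" also depended on "output", no valid order would exist.
cyclic_steps = dict(plan_steps, scan={"output"})
try:
    list(TopologicalSorter(cyclic_steps).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

Because the graph has no cycles, every step can be scheduled once all of its inputs are ready, which is what lets an engine plan the whole job up front.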

We will see this in the Spark UI (by default at `localhost:4040`) once a job has run.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Execution_Plan_Deep_Dive").master("local[*]").getOrCreate()

# 1. Create a simple DataFrame
data = [("A", 10), ("B", 20), ("C", 30), ("A", 40)]
df = spark.createDataFrame(data, ["Id", "Value"])

# 2. Define Transformations (Logical Plan builds here)
df_filtered = df.filter(df["Value"] > 15)
df_result = df_filtered.select("Id", "Value")

# 3. Explain the Plan
# This command shows the Physical Plan (and optionally the Logical Plans)
print("--- Physical Plan ---")
df_result.explain()

# To see the full details (Logical + Physical), uncomment below:
# print("\n--- Extended Plan (Logical + Physical) ---")
# df_result.explain(extended=True)
```

## Summary

1.  **DataFrames** are distributed tables with a schema.
2.  You don't modify DataFrames; you transform them into **new** DataFrames (Immutability).
3.  **Catalyst Optimizer** is the brain of Spark SQL. It optimizes your code via:
    *   Unresolved -> Resolved -> Optimized -> Physical Plan.
4.  **Physical Plan** is what actually runs on the cluster.

**Next Steps:**
Now that we understand the theory of how Spark processes data, we are ready to set up our environment. Please proceed to **Notebook 5: Environment Setup**.