# PySpark: Zero to Hero
## Module 3: Core Spark Concepts

Before we start writing complex code, we need to understand the vocabulary of Spark. If you don't understand these five concepts, you cannot write efficient Spark applications.

### Agenda:
1.  **Partitions:** How data is split.
2.  **Transformations:** The instructions (Logic).
3.  **Actions:** The triggers (Execution).
4.  **Lazy Evaluation:** Why Spark waits to run.
5.  **SparkSession:** The entry point.

## 1. What is a Partition?

To allow Executors to work in parallel, Spark breaks the data down into chunks called **Partitions**.

*   **Analogy:** Recall the "Bag of Marbles" from the previous lecture. The marbles were inside small **pouches**.
*   **The Logic:**
    *   1 Pouch = 1 Partition.
    *   If you have 4 Pouches (Partitions), Spark can assign them to 4 Tasks running on 4 Cores simultaneously.
    
> **Key Takeaway:** The number of partitions sets the upper limit on parallelism, because each task processes exactly one partition. If you have a huge cluster but only 1 partition, only 1 core will work while the rest sit idle.
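The takeaway can be illustrated with a plain-Python model. This is not Spark code: a list stands in for the dataset, a thread pool stands in for the executor cores, and `split_into_partitions` and `run_job` are hypothetical helpers invented for this sketch. One task is created per partition, so the number of tasks that can run at once is bounded by both the partition count and the core count.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Hypothetical helper: deal the records into num_partitions chunks, round-robin."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_job(data, num_partitions, num_cores):
    """Run one task per partition on a pool of num_cores workers."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # Each task sums its own partition, independently of the others.
        partial_sums = list(pool.map(sum, partitions))
    # At most min(partitions, cores) tasks can be in flight at the same time.
    max_parallel_tasks = min(len(partitions), num_cores)
    return sum(partial_sums), len(partitions), max_parallel_tasks


marbles = list(range(1, 101))  # 100 "marbles"

# 4 pouches on 4 cores: every core has a pouch to work on.
total, tasks, parallel = run_job(marbles, num_partitions=4, num_cores=4)
print(total, tasks, parallel)

# Same 4 cores, but a single pouch: only one core has work to do.
total_1, tasks_1, parallel_1 = run_job(marbles, num_partitions=1, num_cores=4)
print(total_1, tasks_1, parallel_1)
```

The result is the same in both runs; only the amount of work that can happen simultaneously changes.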

## 2. What are Transformations?

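Transformations are the instructions that describe how to derive a new DataFrame from an existing one. DataFrames are immutable, so nothing is modified in place. Transformations build the **Logical Plan**.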
*   *Examples:* `select()`, `filter()`, `groupBy()`, `withColumn()`.

There are two types of Transformations:

### A. Narrow Transformation
*   **Definition:** Each input partition contributes to only **one** output partition.
*   **Data Movement:** No data shuffling is required between nodes.
*   *Examples:* `filter()`, `select()`, `withColumn()`, and `map()` (on the RDD API).
*   *Scenario:* If you filter for `Salary > 10000`, the Executor can check its own partition without talking to other Executors.

### B. Wide Transformation
*   **Definition:** Each input partition may contribute to **many** output partitions.
*   **Data Movement:** This triggers a **Shuffle**. Data must move across the network to group related data together.
*   *Examples:* `groupBy()`, `join()` (unless Spark can broadcast the smaller side), `distinct()`, `orderBy()`.
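The difference between the two types can be sketched with a plain-Python model in which each inner list stands in for a partition. This is an illustration, not Spark's implementation: `target_partition` is a hypothetical stand-in for a hash partitioner, and the sample rows are made up for the sketch.

```python
import zlib
from collections import Counter

# Two input partitions of (name, department, salary) records.
input_partitions = [
    [("James", "Sales", 3000), ("Michael", "Sales", 4600), ("Maria", "Finance", 3000)],
    [("Scott", "Finance", 3300), ("Jen", "Finance", 3900), ("Saif", "Sales", 4100)],
]

# Narrow: each output partition is computed from exactly one input partition.
filtered = [[row for row in part if row[2] > 3000] for part in input_partitions]


def target_partition(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Wide: rows are redistributed by key, so rows from any input partition
# may land in any output partition. This redistribution is the shuffle.
num_output_partitions = 2
shuffled = [[] for _ in range(num_output_partitions)]
for part in filtered:
    for row in part:
        shuffled[target_partition(row[1], num_output_partitions)].append(row)

# After the shuffle, all rows for a department sit in the same partition,
# so each partition can count its own departments locally.
counts = Counter()
for part in shuffled:
    counts.update(row[1] for row in part)

print(filtered)
print(dict(counts))
```

The filter step never looks outside its own partition, while the grouping step has to move the "Sales" and "Finance" rows together before it can count them.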

## 3. Lazy Evaluation & Actions

Spark is **Lazy**. When you tell Spark to `filter` or `groupBy`, it does **not** execute the code immediately. Instead, it records the instructions in a **DAG (Directed Acyclic Graph)**.

### The Sandwich Shop Analogy
Imagine you go to a sandwich shop:
1.  **Instruction 1:** "I want a sandwich."
2.  **Instruction 2:** "Use white bread."
3.  **Instruction 3 (Change of mind):** "Actually, change that to brown bread."
4.  **Action:** You pay for the order.

*   **If the chef wasn't lazy:** They would have started making the white bread sandwich immediately (Step 2), then thrown it away when you changed your mind (Step 3). This wastes resources.
*   **Because the chef is lazy (Spark):** They wait until you pay (**Action**). They see the full list of instructions, realize you ultimately want brown bread, and make the sandwich correctly the first time.
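The analogy can be sketched as a toy Python class. `LazyOrder`, `instruct`, `pay`, and `kitchen_log` are hypothetical names invented for this illustration and are not part of Spark. Each instruction only records what was asked for, and nothing is "made" until the action (`pay`) is called.

```python
kitchen_log = []  # records every sandwich that is actually made


class LazyOrder:
    """Toy stand-in for a lazily evaluated plan (not a Spark class)."""

    def __init__(self, instructions=()):
        self.instructions = tuple(instructions)

    def instruct(self, key, value):
        # "Transformation": record the instruction and return a new order.
        return LazyOrder(self.instructions + ((key, value),))

    def pay(self):
        # "Action": only now is the full list of instructions examined.
        # Building a dict keeps the last value given for each key.
        final_order = dict(self.instructions)
        kitchen_log.append(final_order)
        return final_order


order = LazyOrder().instruct("item", "sandwich").instruct("bread", "white")
order = order.instruct("bread", "brown")  # change of mind

made_before_payment = len(kitchen_log)
print(made_before_payment)  # no sandwich has been made yet

result = order.pay()
print(result)
print(len(kitchen_log))
```

Because the chef sees all three instructions at once, the white-bread sandwich is never made.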

### What is an Action?
An Action is the command that forces Spark to execute the plan (the "Payment").
*   *Examples:*
    1.  **View Data:** `.show()`, `.take(n)`
    2.  **Collect Data:** `.collect()` (brings all the data to the Driver, so be careful with large datasets!)
    3.  **Write Data:** `.write.csv()`, `.write.saveAsTable()`

## 4. What is SparkSession?

*   The **SparkSession** is the entry point for writing Spark applications.
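*   It lives inside the **Driver** process and wraps the underlying `SparkContext`.
*   **One-to-One Relationship:** A Spark Application has 1 `SparkContext` and, in typical use, 1 SparkSession. Calling `getOrCreate()` again returns the existing session instead of starting a second one.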
*   It coordinates the execution of code on the cluster.

```python
from pyspark.sql import SparkSession

# 1. Initialize SparkSession (the entry point, created in the Driver)
spark = SparkSession.builder \
    .appName("Lazy_Evaluation_Demo") \
    .master("local[*]") \
    .getOrCreate()

# 2. Create Data (Partitions)
# We create a list of employee records: (name, department, salary)
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

columns = ["Employee_Name", "Department", "Salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Check Partitions (accessing the underlying RDD)
print(f"Number of Partitions: {df.rdd.getNumPartitions()}")

# 3. Transformations (LAZY - Nothing happens here yet!)
# Step A: Filter (Narrow Transformation)
df_filtered = df.filter(df["Salary"] > 3000)

# Step B: GroupBy (Wide Transformation - adds a Shuffle to the plan)
df_grouped = df_filtered.groupBy("Department").count()

print("Transformations defined. Logical plan created. No execution yet.")

# 4. Action (The Trigger)
# This forces Spark to look at the plan and execute it.
print("--- Triggering Action (.show) ---")
df_grouped.show()
```

## Summary

*   **Partitions** are the units of parallelism.
*   **Transformations** build the plan but don't run it (Lazy).
*   **Narrow Transformations** are comparatively cheap (no shuffle, so no data moves between Executors).
*   **Wide Transformations** are comparatively expensive (a Shuffle moves data across the network).
*   **Actions** trigger the actual execution.

**Coming Up Next:**
In the next notebook, we will visualize the **DAG (Directed Acyclic Graph)**, understand **Structured APIs**, and see how Spark designs the **Execution Plan**.