# 01 - Spark Architecture & Execution Model

This notebook covers four foundational topics:

1. **End-to-end query execution** - Job -> Stage -> Task breakdown
2. **Narrow vs. wide transformations** - the shuffle line
3. **Lazy evaluation** - when Spark actually does work
4. **Client mode vs. Cluster mode** - driver placement and trade-offs

Every concept is demonstrated on real data with
a running standalone cluster and Spark UI.

Sources:
- https://spark.apache.org/docs/latest/cluster-overview.html


## Setup

This notebook assumes a standalone Spark cluster is already running and that a `SparkSession` named `spark` is connected to it:
- Spark UI: http://localhost:4040

Dataset: a transaction-categorization Parquet file.

In [1]:
from pyspark.sql import functions as F

In [1]:
# Load parquet
txn = spark.read.parquet(r"C:\code\spark-tuning-handbook\data\transaction_cat.parquet")
txn.printSchema()
print("rows:", txn.count())
print("partitions:", txn.rdd.getNumPartitions())

root
 |-- transaction_description: string (nullable = true)
 |-- category: string (nullable = true)
 |-- country: string (nullable = true)
 |-- currency: string (nullable = true)

rows: 4501043
partitions: 4


Spark UI: http://localhost:4040/jobs/
You should see a few jobs from setup.
We will use the UI to confirm:
- job count and durations
- number of stages
- tasks per stage
- shuffle read/write


## Topic 1 - Architecture: Job -> Stage -> Task


### Execution: from action to tasks

When you call an action (`count()`, `collect()`, `show()`, `write`, `save`, etc.), Spark runs work like this:

Driver -> DAGScheduler -> Stages (split at shuffle boundaries) -> Tasks (1 task per partition per stage) -> Executors (running on worker nodes)

**Rules to remember**

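- **1 action = 1 job** - as a rule of thumb, each `count()`, `show()`, `collect()`, `write.*` submits one job (a few actions can submit more than one, e.g. `show()` may scan partitions in several rounds)  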
- **Shuffle = stage boundary** - wide transformations (`groupBy`, `join`, `repartition`, `orderBy`, `distinct`) create a new stage  
- **1 partition = 1 task** - within a stage, Spark launches one task per Spark partition 

So if you have 8 input partitions, 1 shuffle, and `spark.sql.shuffle.partitions` also set to 8, you get **2 stages × 8 tasks = 16 tasks**.
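
The arithmetic above can be sketched in plain Python. This is a simplified model, not a Spark API: the helper name `expected_task_count` and its parameters are hypothetical. It assumes the first stage runs one task per input partition, every stage after a shuffle runs `spark.sql.shuffle.partitions` tasks, and adaptive query execution is disabled so shuffle partitions are not coalesced. Operations that set their own partition count (for example `repartition(n, ...)`) fall outside this model.

```python
def expected_task_count(input_partitions: int, shuffle_partitions: int, num_shuffles: int) -> int:
    """Estimate the total number of tasks in one job (simplified model).

    Assumptions (hypothetical helper, not part of Spark):
    - the first stage runs one task per input partition
    - each shuffle starts a new stage with `shuffle_partitions` tasks
    """
    if input_partitions < 1 or shuffle_partitions < 1 or num_shuffles < 0:
        raise ValueError("partition counts must be >= 1 and num_shuffles >= 0")
    first_stage_tasks = input_partitions
    later_stage_tasks = shuffle_partitions * num_shuffles
    return first_stage_tasks + later_stage_tasks


# 8 input partitions, 8 shuffle partitions, 1 shuffle: 2 stages x 8 tasks
print(expected_task_count(8, 8, 1))  # 16

# 4 input partitions (as printed in Setup), 8 shuffle partitions, 1 shuffle
print(expected_task_count(4, 8, 1))  # 12
```

The second call shows why the two stages of a job often have different task counts: the read stage follows the file layout, while later stages follow the shuffle setting.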

### Demo: 1 action -> 1 job


In [30]:
spark.conf.set("spark.sql.adaptive.enabled", "false") # for clarity and simplicity in Spark UI
spark.conf.set("spark.sql.shuffle.partitions", "8") # we don't need default 200

In [21]:
txn.unpersist(blocking=True) # clear cache, don't proceed until done (blocking=True)

DataFrame[transaction_description: string, category: string, country: string, currency: string]

In [11]:
txn.count()

4501043

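Spark UI -> Jobs -> latest job
Expect:
- exactly 1 new job for the single `count()` call
- a scan stage with 4 tasks (1 per input partition, matching `partitions: 4` from Setup; `spark.sql.shuffle.partitions` does not affect the read)
- a final 1-task stage that merges the per-partition counts, because DataFrame `count()` is planned as a small aggregation
- negligible Shuffle Write (one tiny row per scan task)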


## Topic 2 - Narrow vs. Wide


### Narrow transformations

Narrow transformations are partition-local: they do not move data across the network.
Examples: `filter`, `select`, `withColumn`.

Multiple narrow transformations can run in one stage.


In [15]:
# narrow -> narrow -> narrow
narrow = (
    txn
    .filter(F.col("country") == "USA")
    .withColumn("desc_len", F.length("transaction_description"))
    .select("country", "currency", "desc_len")
)

narrow.count()

899163

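Spark UI -> Jobs -> latest job
Expect:
- the filter, `withColumn`, and `select` run together in 1 stage
- 4 tasks in that stage (1 per input partition)
- no shuffle between the narrow steps; `count()` itself adds a final 1-task stage that merges the per-partition counts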

### Wide transformations

Wide transformations require data movement (shuffle) across partitions.
Examples: `groupBy`, `join`, `repartition`, `orderBy`, `distinct`.

A shuffle creates a stage boundary.

In [32]:
# narrow -> narrow -> WIDE
wide = (
    txn
    # After this filter only one group key (USA) remains, so only one reduce partition
    # receives data and reports shuffle read; the other reduce partitions stay empty.
    .filter(F.col("country") == "USA")
    .withColumn("desc_len", F.length("transaction_description"))
    .groupBy("country")
    .agg(F.count("*").alias("txn_count"), F.avg("desc_len").alias("avg_desc_len"))
)

result = wide.collect()

In [41]:
print(*result, sep = '\n')

Row(country='USA', txn_count=899163, avg_desc_len=18.633907311577545)


Spark UI -> Jobs -> latest job
Expect:
- 2 stages
- First stage: partial aggregation + shuffle write
- Second stage: shuffle read + final aggregation (stage IDs in the UI keep counting across jobs, so they will not literally be 0 and 1)
- Shuffle Read/Write > 0

Shuffle Write - bytes that a stage's tasks write to local shuffle files so that the next stage can fetch them.
Shuffle Read - bytes that the next stage's tasks fetch from those shuffle files, either locally or over the network.
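
Stage boundaries can also be checked without the UI, because each shuffle appears as an `Exchange` operator in the physical plan. The sketch below is one possible way to count them. The helper names `plan_text` and `count_shuffle_exchanges` are hypothetical, and it assumes PySpark's `DataFrame.explain()` prints the plan to standard output. `BroadcastExchange` and `ReusedExchange` nodes are deliberately not counted, since they do not add a new shuffle.

```python
import contextlib
import io
import re


def plan_text(df) -> str:
    """Capture the text that `df.explain()` prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_shuffle_exchanges(text: str) -> int:
    """Count standalone `Exchange` operators in a physical plan string.

    The lookbehind skips names that merely end in "Exchange",
    such as BroadcastExchange and ReusedExchange.
    """
    return len(re.findall(r"(?<![A-Za-z])Exchange\b", text))


# Usage in this notebook (assumes the `wide` DataFrame from the cell above):
# n_shuffles = count_shuffle_exchanges(plan_text(wide))
# print("shuffles:", n_shuffles, "-> expected stages:", n_shuffles + 1)
```

Comparing the printed stage estimate with the stage count in the Jobs tab is a quick sanity check on the "shuffle = stage boundary" rule.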

### Demo: multiple shuffles -> multiple stages


In [45]:
multi = (
    txn
    .groupBy("country", "category").count()
    .repartition(8, F.col("country"))
)

demo = multi.collect()

Spark UI -> Jobs -> latest job
Expect:
- 3 stages (2 shuffles -> 3 stages)
- Shuffle Read/Write visible in stages twice

In [52]:
print(*demo[:5], sep = '\n')

Row(country='AUSTRALIA', category='Transportation', count=89996)
Row(country='AUSTRALIA', category='Income', count=91072)
Row(country='AUSTRALIA', category='Utilities & Services', count=90891)
Row(country='AUSTRALIA', category='Shopping & Retail', count=90110)
Row(country='AUSTRALIA', category='Financial Services', count=90377)


## Topic 3 - Lazy Evaluation

### Theory

Spark transformations are **lazy** - they don't execute when you call them.
Instead, each transformation appends a node to an internal **DAG** (Directed Acyclic Graph)
that describes the computation.

Execution happens **only** when you call an **action**:

**Transformations (lazy):** `filter`, `select`, `groupBy`, `join`, `withColumn`, `orderBy`, `repartition`, ... 
**Actions (trigger execution):** `count`, `collect`, `show`, `take`, `first`, `write.*`, `foreach`, ...

**Why laziness is powerful:**

1. **Whole-plan optimization.** Spark can see the full computation before it touches data, so it can
   reduce work (e.g., filter early, avoid reading unused columns, pick a better join strategy).

2. **Pipelining.** Multiple narrow transformations can be fused into a single pass over each partition
   (no intermediate shuffle / no extra stages).
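
The idea of recording steps first and running them later can be illustrated with a toy model in plain Python. This is not Spark code and not how Spark is implemented: the class `LazyPlan` and the function `traced_double` are hypothetical names used only for this sketch. Transformations return a new plan with one more recorded step, and only the action `collect()` touches the data, applying all steps in a single pass per row (a small-scale analogue of pipelining).

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyPlan:
    """Toy stand-in for a lazily evaluated dataset (not Spark)."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyPlan":
        # Transformation: only records the step, does no work.
        return LazyPlan(self._source, self._steps + (("filter", predicate),))

    def map(self, func: Callable[[Any], Any]) -> "LazyPlan":
        # Transformation: only records the step, does no work.
        return LazyPlan(self._source, self._steps + (("map", func),))

    def describe(self) -> List[str]:
        """List the recorded steps without running them."""
        return [kind for kind, _ in self._steps]

    def collect(self) -> List[Any]:
        """Action: run every recorded step in one pass over the source."""
        out = []
        for row in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                out.append(row)
        return out


calls = []


def traced_double(x: int) -> int:
    calls.append(x)  # record that real work happened
    return x * 2


plan = LazyPlan(range(6)).filter(lambda x: x % 2 == 0).map(traced_double)
print(plan.describe())  # ['filter', 'map']
print(calls)            # [] - nothing has run yet
print(plan.collect())   # [0, 4, 8]
print(calls)            # [0, 2, 4]
```

Spark goes further than this sketch by optimizing the recorded plan before running it, but the trigger is the same: nothing executes until an action is called.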

### Demo: transformations are instant, actions are not

The cells below define a pipeline (no action yet), then trigger it with `show()`.


In [53]:
%%time

# Define a complex pipeline - nothing executes here
pipeline = (
    txn
    .filter(F.col("country") == "USA") # narrow transformation
    .withColumn("desc_upper", F.upper(F.col("transaction_description"))) # narrow transformation
    .withColumn("desc_len", F.length(F.col("transaction_description"))) # narrow transformation
    .groupBy("currency") # wide transformation
    .agg(F.count("*").alias("txn_count"), F.avg("desc_len").alias("avg_len"))
    .orderBy("txn_count", ascending=False) # wide transformation 
)

# no action
pipeline

CPU times: total: 31.2 ms
Wall time: 63.2 ms


DataFrame[currency: string, txn_count: bigint, avg_len: double]

Execution time should be small (about 63 ms of wall time above), because Spark only records the plan.
Spark UI -> Jobs tab
Expect: no new job yet.


In [55]:
%%time

# Action triggers execution
pipeline.show()

+--------+---------+------------------+
|currency|txn_count|           avg_len|
+--------+---------+------------------+
|     USD|   899163|18.633907311577545|
+--------+---------+------------------+

CPU times: total: 0 ns
Wall time: 979 ms


Execution time will be higher (about 1 s of wall time above), because the action runs the whole plan.
Spark UI -> Jobs tab
Expect: a new job appears with multiple stages.

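NB: `groupBy` + `agg` shuffles once. A full `orderBy` normally needs a second, range-partitioned shuffle, but `show()` only asks for the first rows, so Spark typically plans the sort as a top-N operation (`TakeOrderedAndProject` in the physical plan) and no extra shuffle appears here.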

## Topic 4 - Client Mode vs. Cluster Mode

### What you are running right now

**Client mode**

- The driver runs on your local machine (your Jupyter kernel process)
- Executors run on the cluster worker nodes
- Spark UI is exposed on your machine (http://localhost:4040)
- If you stop your notebook or kill the process, the job stops (driver dies)

Used for interactive work, notebooks, debugging.

**Cluster mode**

- The driver runs inside the cluster (on one of the nodes)
- Executors run on the cluster worker nodes 
- Spark UI is hosted on the driver node in the cluster
- The job continues even if your local machine disconnects

Used for production workloads, typically submitted with `spark-submit --deploy-mode cluster`.

## Key Takeaways

- **Job** - Created by each action. One `count()` / `show()` / `write` = one job.
- **Stage** - A chunk of work between shuffle boundaries. N shuffles -> N+1 stages.
- **Task** - Smallest unit of work - one task per partition per stage.
- **Narrow transformation** - Partition-local (filter, select, map). Pipelined within a stage, no shuffle.
- **Wide transformation** - Requires shuffle (groupBy, join, repartition). Creates a new stage.
- **Lazy evaluation** - Transformations build a DAG. Only actions trigger execution. Enables whole-plan optimization.
- **Client mode** - Driver on your machine. Use for notebooks / interactive work.
- **Cluster mode** - Driver on a cluster node. Use for production `spark-submit` jobs.

Spark UI is the fastest way to understand what happened after an action:
- Jobs tab -> jobs and durations
- Stages tab -> stages per job, shuffle read/write
- Stage detail -> task count, duration spread, skew signals
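
One rough way to turn "duration spread" into a number is to compare the slowest task with the median task, both of which appear in the summary metrics on the Stage detail page. The sketch below is illustrative only: the helper name `skew_ratio` is hypothetical, and the durations are made-up values, not measurements from this notebook. What counts as "too high" depends on the workload.

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_durations_s: Sequence[float]) -> float:
    """Return the max task duration divided by the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    mid = median(task_durations_s)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / mid


# Hypothetical task durations in seconds (not taken from a real run)
balanced = [1.0, 1.1, 0.9, 1.0]
skewed = [1.0, 1.1, 0.9, 9.0]

print(round(skew_ratio(balanced), 2))  # 1.1
print(round(skew_ratio(skewed), 2))    # 8.57
```

A ratio close to 1 means tasks finished at about the same time; a large ratio suggests one partition holds far more data than the rest.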