# Cache, Persist, and Lazy Evaluation in PySpark

This notebook introduces students to five **fundamental Spark concepts**:
1. Transformations vs Actions (lazy evaluation)
2. `cache()` and `persist()`
3. When caching helps
4. `unpersist()`
5. Viewing the execution plan with `explain()`

Dataset used: `samples.tpch.orders`


In [None]:
from pyspark.sql import functions as F
import time

orders_df = spark.read.table("samples.tpch.orders")
print("Orders count:", orders_df.count())
display(orders_df.limit(5))


## 1. Transformations vs Actions

- **Transformations**: define a recipe (no work is done yet).  
  Examples: `select`, `filter`, `withColumn`, `groupBy`, `repartition`.
- **Actions**: actually run the recipe and return a result.  
  Examples: `count`, `show`, `collect`, `take`. The notebook helper `display` also triggers execution.

Spark uses **lazy evaluation**:
- Each transformation only extends a logical plan. Spark optimizes and executes that plan when an action is called.
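
To make the idea concrete, here is a toy model in plain Python. It is not how Spark is implemented; the hypothetical `LazyPipeline` class only mimics the behavior: `filter` and `select` record a step, and nothing touches the rows until the `count()` action runs. The sample rows are made up for illustration.

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)  # the "plan": pending steps in order

    def filter(self, predicate):
        # Transformation: return a new pipeline with one more step; no data is read.
        return LazyPipeline(self.rows, self.steps + (("filter", predicate),))

    def select(self, *keys):
        # Transformation: record a projection onto the requested keys.
        def project(row):
            return {key: row[key] for key in keys}
        return LazyPipeline(self.rows, self.steps + (("map", project),))

    def count(self):
        # Action: only now are the recorded steps applied to the rows.
        rows = list(self.rows)
        for kind, fn in self.steps:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        return len(rows)


calls = []  # records every row the predicate actually inspects

def is_large_order(row):
    calls.append(row["o_orderkey"])
    return row["o_totalprice"] > 10000

# Made-up rows that borrow the column names of the orders table.
orders = [
    {"o_orderkey": 1, "o_totalprice": 5000.0},
    {"o_orderkey": 2, "o_totalprice": 25000.0},
    {"o_orderkey": 3, "o_totalprice": 12000.0},
]

plan = LazyPipeline(orders).filter(is_large_order).select("o_orderkey")
calls_before_action = len(calls)   # the predicate has not run yet
matching = plan.count()            # the action triggers the work
calls_after_action = len(calls)

print("Predicate calls before action:", calls_before_action)
print("Matching rows:", matching)
print("Predicate calls after action:", calls_after_action)
```

The predicate is called only once `count()` runs, which mirrors why defining transformations in Spark returns almost instantly.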


In [None]:
# Only transformations (no action yet)
filtered_orders = orders_df.filter(F.col("o_totalprice") > 10000)
projected_orders = filtered_orders.select("o_orderkey", "o_custkey", "o_totalprice")

# No job has run yet. We only defined a *plan*.
print("We defined transformations but haven't triggered any action yet.")


In [None]:
# Now an action: count()
start = time.time()
count_val = projected_orders.count()
end = time.time()

print("Count:", count_val)
print("Time for first count (no cache):", round(end - start, 3), "seconds")


## 2. Using `cache()`

If we know we'll reuse the same DataFrame multiple times, we can **cache** it.

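- `df.cache()` marks the DataFrame for caching. It is lazy too, so nothing is stored until an action runs.
- The **first** action still does the full work and stores the computed partitions as a side effect.
- Subsequent actions are usually faster because they read the cached partitions instead of recomputing from the source table.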


In [None]:
# cache() only marks the DataFrame for caching; no data is stored yet
cached_orders = projected_orders.cache()

# First action (materializes the cache)
start = time.time()
_ = cached_orders.count()
end = time.time()

print("Time for count with cache (first time):", round(end - start, 3), "seconds")


In [None]:
# Second action on the same cached DataFrame
start = time.time()
max_price = cached_orders.agg(F.max("o_totalprice").alias("max_price")).collect()[0]["max_price"]
end = time.time()

print("Max price:", max_price)
print("Time for second action on cached data:", round(end - start, 3), "seconds")
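
The start/end bookkeeping repeated in the cells above can be wrapped in a small helper. This is a minimal sketch: the name `timed` is hypothetical, and the workload is a plain Python stand-in so the block runs without a cluster.

```python
import time

def timed(label, action):
    """Run a zero-argument callable, print its duration, return (result, seconds)."""
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} seconds")
    return result, elapsed

# Stand-in workload; any zero-argument callable works.
total, seconds = timed("sum of 0..999", lambda: sum(range(1000)))
```

In this notebook the same helper could be called as `timed("count (cached)", cached_orders.count)`, passing the bound method without parentheses so that the action runs inside the timer.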


## 3. `persist()` with Storage Levels

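For DataFrames, `cache()` is shorthand for `persist()` with the default storage level, which keeps data in memory and spills to disk when memory runs out (`MEMORY_AND_DISK`; recent PySpark versions report the default as `MEMORY_AND_DISK_DESER`).

We can choose other levels:
- `MEMORY_ONLY`: keep partitions in memory and recompute any that do not fit
- `MEMORY_AND_DISK`: keep partitions in memory and spill the rest to disk
- `DISK_ONLY`: store partitions on disk only
- Replicated variants such as `MEMORY_ONLY_2` and `MEMORY_AND_DISK_2`, plus `OFF_HEAP`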

For teaching purposes, we’ll show syntax only (behavior may depend on cluster size).


In [None]:
from pyspark import StorageLevel

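# projected_orders is still cached from section 2. Spark keeps the existing
# storage level when persist() is called on already-cached data, so clear it first.
projected_orders.unpersist()

# Example: force DISK_ONLY (just for demonstration)
disk_persisted = projected_orders.persist(StorageLevel.DISK_ONLY)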

print("Storage level:", disk_persisted.storageLevel)

# Trigger materialization
_ = disk_persisted.count()
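
The `storageLevel` value printed above is a bundle of flags. The sketch below turns those flags into a readable sentence. It relies on the `useMemory`, `useOffHeap`, `useDisk`, and `replication` attributes of PySpark's `StorageLevel`; the function name `describe_storage` is hypothetical, and the `SimpleNamespace` objects are stand-ins that mirror those attributes so the block runs without Spark.

```python
from types import SimpleNamespace

def describe_storage(level):
    """Summarize where a storage level keeps data, based on its flag attributes."""
    places = []
    if level.useOffHeap:
        places.append("off-heap memory")
    elif level.useMemory:
        places.append("memory")
    if level.useDisk:
        places.append("disk")
    where = " + ".join(places) if places else "not persisted"
    return f"{where}, {level.replication}x replicated"

# Stand-ins that copy the flags of StorageLevel.DISK_ONLY and StorageLevel.MEMORY_AND_DISK.
disk_only_like = SimpleNamespace(
    useDisk=True, useMemory=False, useOffHeap=False, deserialized=False, replication=1
)
memory_and_disk_like = SimpleNamespace(
    useDisk=True, useMemory=True, useOffHeap=False, deserialized=False, replication=1
)

print(describe_storage(disk_only_like))
print(describe_storage(memory_and_disk_like))
```

In the notebook, `describe_storage(disk_persisted.storageLevel)` would report on the real DataFrame.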


## 4. `unpersist()`

When you are done with a cached/persisted DataFrame, you should `unpersist()` it:

- Frees up memory / disk used by cache
- Good practice in long-running notebooks
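
One way to make sure `unpersist()` is never forgotten is to tie the cache to a `with` block. The following is a sketch under stated assumptions, not a Spark API: `persisted` is a hypothetical helper, and `FakeDataFrame` is a stand-in that only records calls so the example runs without a cluster.

```python
from contextlib import contextmanager

@contextmanager
def persisted(df, level=None):
    """Persist df for the duration of a with-block, then always unpersist it."""
    if level is None:
        df.persist()        # default storage level
    else:
        df.persist(level)   # e.g. StorageLevel.DISK_ONLY
    try:
        yield df
    finally:
        df.unpersist()      # runs even if the block raises


class FakeDataFrame:
    """Stand-in that records method calls instead of talking to Spark."""

    def __init__(self):
        self.calls = []

    def persist(self, *args):
        self.calls.append("persist")
        return self

    def unpersist(self):
        self.calls.append("unpersist")
        return self

    def count(self):
        self.calls.append("count")
        return 0


fake = FakeDataFrame()
with persisted(fake) as df:
    df.count()

print(fake.calls)
```

With a real DataFrame the pattern would read `with persisted(projected_orders) as df: ...`. Note that the helper also unpersists a DataFrame that was already cached before the block, so it fits best when the block owns the cache.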


In [None]:
# cached_orders and disk_persisted both refer to projected_orders,
# so the second call is a harmless no-op
cached_orders.unpersist()
disk_persisted.unpersist()

print("Unpersisted cached DataFrames.")


## 5. Viewing the Execution Plan with `explain()`

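`explain()` prints the plan Spark will run:
- By default, only the physical plan; with `mode="extended"`, the parsed, analyzed, and optimized logical plans as well
- Where scans, filters, and shuffles (`Exchange` operators) happen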

Let’s inspect the plan for a small aggregation.


In [None]:
agg_df = (
    orders_df
    .filter(F.col("o_orderstatus") == "F")
    .groupBy("o_orderpriority")
    .agg(F.avg("o_totalprice").alias("avg_price"))
)

agg_df.explain(mode="extended")
display(agg_df)
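
Because `explain()` prints its output instead of returning it, searching a plan programmatically means capturing standard output first. The sketch below does that and then checks whether the plan reads from a cache. Two assumptions are worth flagging: the helper names `plan_text` and `reads_from_cache` are hypothetical, and the check relies on cached data appearing as an `InMemoryTableScan` operator in the physical plan, which may be worded differently on some runtimes. `FakePlan` is a stand-in so the block runs without Spark.

```python
import io
from contextlib import redirect_stdout

def plan_text(df, mode="simple"):
    """Capture what df.explain(mode=...) prints and return it as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(mode=mode)
    return buffer.getvalue()

def reads_from_cache(df):
    # Assumption: a plan that uses cached data contains an InMemoryTableScan operator.
    return "InMemoryTableScan" in plan_text(df)


class FakePlan:
    """Stand-in whose explain() prints a fixed, made-up plan string."""

    def __init__(self, text):
        self.text = text

    def explain(self, mode="simple"):
        print(self.text)


cached_like = FakePlan("== Physical Plan ==\nInMemoryTableScan [o_orderkey, o_totalprice]")
uncached_like = FakePlan("== Physical Plan ==\nFileScan parquet [o_orderkey, o_totalprice]")

print(reads_from_cache(cached_like))
print(reads_from_cache(uncached_like))
```

A match only means the plan intends to read from the cache; the cached partitions are still filled lazily by the first action.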


## Summary for Students

- Spark is **lazy**: transformations build a plan; actions trigger execution.
- Use `cache()`/`persist()` when:
  - The same DataFrame is used in **multiple actions**.
  - The computation is **expensive**.
- Always `unpersist()` when cached data is no longer needed.
- Use `explain()` to understand what Spark is doing under the hood.
