# PySpark: Zero to Hero
## Module 21: Caching and Persistence

Transformations in Spark are lazy, so a DataFrame is only a plan until an action runs. When you reuse a DataFrame across several actions (e.g., two `count()` calls, or a `count()` followed by a write), Spark re-computes its lineage from the source for each action by default. For expensive lineages this can be extremely slow.

**Caching** allows you to save the intermediate DataFrame in memory (or on disk) so that subsequent actions can read from the cache instead of re-computing from the source.

### Agenda:
1.  **Why Cache?** The problem of re-computation.
2.  **`cache()` vs. `persist()`:** Understanding the difference.
3.  **Storage Levels:** Memory, Disk, and Serialization options.
4.  **Unpersisting:** Clearing the cache to free up resources.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark import StorageLevel
import time

spark = SparkSession.builder \
    .appName("Caching_and_Persistence") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

In [None]:
# We create a larger dataset to make the time difference noticeable.
# spark.range(1, 10000000) excludes the upper bound, so this yields 9,999,999 rows (roughly 10 million).

df = spark.range(1, 10000000).toDF("id")
df = df.withColumn("square", col("id") * col("id"))

# Run an action so Spark executes the plan once (nothing is cached yet, so the result is not kept)
print(f"Count: {df.count()}")

## 1. The Problem: Re-computation

If we run two actions on the same DataFrame `df`, Spark computes it **twice**.
Let's measure the time taken for two subsequent counts without caching.
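
The start/stop timing pattern used throughout this notebook can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `timed` is a hypothetical name, and the stand-in workload is plain Python so the snippet runs without a Spark session.

```python
import time

def timed(label, action):
    """Run a zero-argument callable, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()  # perf_counter is a monotonic clock, suited to measuring intervals
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {result} ({elapsed:.2f} seconds)")
    return result, elapsed

# Stand-in workload (an assumption for illustration) so the sketch runs on its own.
result, seconds = timed("Sum", lambda: sum(range(1_000_000)))
```

With the notebook's DataFrame, the same helper could be called as `timed("Count 1", df.count)`; note that the method is passed without parentheses so that the helper, not the caller, triggers the action.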

In [None]:
start_time = time.time()
print(f"Count 1: {df.count()}")
print(f"Time taken for Count 1: {time.time() - start_time:.2f} seconds")

start_time = time.time()
print(f"Count 2: {df.count()}")
print(f"Time taken for Count 2: {time.time() - start_time:.2f} seconds")

# Both should take roughly the same amount of time because the work is repeated.

## 2. Using `cache()`

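For DataFrames, `cache()` is shorthand for `persist()` with its default, memory-and-disk storage level.
It stores the data in memory; partitions that do not fit are spilled to disk. (For RDDs, `cache()` defaults to `MEMORY_ONLY` instead.)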

**Note:** Caching is **lazy**. The data is not actually cached until the *first* action is called.

In [None]:
# Mark the DataFrame for caching
df.cache()

print("DataFrame marked for caching (Lazy).")

# First Action: This will be slow because it has to compute AND cache the data.
start_time = time.time()
print(f"Count 1 (Caching...): {df.count()}")
print(f"Time taken (Build Cache): {time.time() - start_time:.2f} seconds")

# Second Action: This should be much faster because it reads the cached data instead of recomputing it.
start_time = time.time()
print(f"Count 2 (Read Cache): {df.count()}")
print(f"Time taken (From Cache): {time.time() - start_time:.2f} seconds")

## 3. Storage Levels with `persist()`

If you want more control (e.g., Memory Only, Disk Only), use `persist()`.

Common Storage Levels:
*   **MEMORY_ONLY:** Fast, but partitions that do not fit in RAM are not cached and are recomputed each time they are needed.
*   **MEMORY_AND_DISK (default for DataFrame `cache()`):** Spills partitions to disk if RAM is full.
*   **DISK_ONLY:** Good for huge datasets that don't fit in RAM but are expensive to recompute.

In [None]:
# Unpersist first: calling persist() on an already-cached DataFrame keeps the existing storage level
df.unpersist()

# Persist with specific level (e.g., DISK_ONLY)
df.persist(StorageLevel.DISK_ONLY)

# First action caches to Disk
df.count()

# Second action reads from Disk (slower than memory, and usually faster than recomputing an expensive lineage)
start_time = time.time()
print(f"Count (From Disk Cache): {df.count()}")
print(f"Time taken: {time.time() - start_time:.2f} seconds")

In [None]:
# Always unpersist when done to free up cluster memory for other jobs.
df.unpersist()
print("Cache cleared.")
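
To make the cleanup hard to forget, the persist/unpersist pair can be wrapped in a context manager. This is a sketch under stated assumptions rather than a Spark feature: `cached` is a hypothetical helper, and it relies only on the DataFrame's `cache()`, `persist()`, and `unpersist()` methods.

```python
from contextlib import contextmanager

@contextmanager
def cached(df, storage_level=None):
    """Persist `df` for the duration of a `with` block, then release it.

    `cached` is a hypothetical helper, not part of PySpark.
    """
    # cache()/persist() only mark the DataFrame; the first action
    # inside the `with` block is what actually fills the cache.
    persisted = df.cache() if storage_level is None else df.persist(storage_level)
    try:
        yield persisted
    finally:
        # Runs even if the block raises. blocking=True waits until
        # the cached blocks have actually been removed.
        persisted.unpersist(blocking=True)
```

With the notebook's DataFrame this could be used as `with cached(df, StorageLevel.DISK_ONLY) as c:` followed by the actions on `c`. To drop every cached table and DataFrame in the session at once, `spark.catalog.clearCache()` is also available.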

## Summary

1.  **Re-computation:** Spark re-runs a DataFrame's lineage for every action unless the DataFrame is cached.
2.  **`cache()`:** Stores data in Memory (and Disk if needed). Best for iterative algorithms.
3.  **`persist()`:** Allows custom storage levels.
4.  **Lazy Caching:** Data is only cached during the first Action, not when `.cache()` is called.
5.  **Unpersist:** Always clean up your cache.

**Next Steps:**
In the final notebook of this series, we will explore **Spark SQL**, running SQL queries directly on DataFrames, and integrating with the Hive Metastore.