**⭐ 1. What This Pattern Solves**

Optimizes iterative and long lineage computations in Spark.

Caching: keeps intermediate DataFrames in memory (by default spilling to disk when memory runs short) for repeated access → avoids recomputation.

Checkpointing: saves the DataFrame to reliable storage (HDFS/S3/DBFS) and truncates its lineage → avoids very long DAGs that can fail.

Use cases:

Machine learning pipelines with multiple passes over the same data.

Iterative transformations on large DataFrames.

Preventing stack overflow / long lineage errors in long ETL DAGs.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- A temp view alone is only a named query; CACHE TABLE is what stores its rows
CREATE OR REPLACE TEMP VIEW temp AS
SELECT * FROM big_table;
CACHE TABLE temp;
-- Later queries on temp read the cached rows instead of re-scanning the source

-- Checkpointing is similar to persisting intermediate results
CREATE TABLE checkpointed AS
SELECT * FROM big_table;
-- Use this table as a new starting point

**⭐ 3. Core Idea**

Cache → keep in memory (DataFrames spill to disk by default), fast, volatile.

Checkpoint → write to reliable storage, truncates lineage, useful for recovery.

Use cache for speed; checkpoint for stability in long pipelines.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
# Caching in memory
df_cached = df.cache()   # same as df.persist() with the default level (memory, spilling to disk)

# Checkpointing to storage
spark.sparkContext.setCheckpointDir("/checkpoint/path")
df_checkpointed = df.checkpoint()

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PerfPatterns").getOrCreate()

data = [(i, i*2) for i in range(1000)]
df = spark.createDataFrame(data, ["id", "value"])

# Cache: repeated use
df_cached = df.filter("value % 2 = 0").cache()
df_cached.count()  # triggers caching
df_cached.show()   # uses cache

# Checkpoint: long lineage protection
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
df_long = df_cached.withColumn("double_value", df_cached.value*2)
df_checkpointed = df_long.checkpoint()
df_checkpointed.show()

**Explanation:**

.cache() → marks df_cached for caching; the first action (count()) fills the cache and later actions reuse it.

.checkpoint() → materializes df_long to the checkpoint directory, truncates lineage, safer for long ETL DAGs.

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a 500GB customer transaction dataset. Your ETL pipeline:

Filters by country

Aggregates by month

Joins with reference data

Writes to Delta

In [0]:
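from pyspark.sql import functions as F

df_filtered = df.filter("country = 'US'").cache()  # step 1: cached because later steps reuse it
# step 2: alias the sum; Delta rejects column names with parentheses such as "sum(amount)"
# unless column mapping is enabled
agg_df = df_filtered.groupBy("month").agg(F.sum("amount").alias("total_amount"))
spark.sparkContext.setCheckpointDir("/mnt/checkpoint")
agg_checkpointed = agg_df.checkpoint()  # truncate lineage before the final write
agg_checkpointed.write.format("delta").mode("overwrite").save("/mnt/silver/agg")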


**⭐ 8. Time & Space Complexity**

| Operation    | Time Complexity                                        | Space Complexity                         |
| ------------ | ------------------------------------------------------ | ---------------------------------------- |
| cache()      | O(n) on first action; later actions skip recomputation | O(n) in memory (spills to disk if needed) |
| checkpoint() | O(n) write to storage                                  | O(n) on disk                             |


**⭐ 9. Common Pitfalls**

Caching too many large DataFrames → memory pressure → OOM.

Forgetting to perform an action after .cache() → nothing is cached.

Checkpointing without setting checkpoint directory → error.

Overusing checkpoint → unnecessary disk I/O.