# Spark Streaming with PySpark
## Module 8: Checkpointing Deep Dive & Fault Tolerance

In the previous module, we enabled **Checkpointing** to maintain state. Today, we will look "under the hood" of the Checkpoint directory to understand how Spark achieves **Fault Tolerance** and **Exactly-Once Semantics**.

### Objectives:
1.  **Analyze the Checkpoint Directory:** Understand `commits`, `offsets`, `sources`, and `metadata`.
2.  **Idempotency Test:** Why is a file that is dropped into the input folder a second time not processed again?
3.  **Fault Tolerance Experiment:** Simulate a crash (delete a commit) and watch Spark recover.
4.  **Production Best Practices:** How to handle re-processing safely.

In [None]:
import os
import shutil
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, current_timestamp

# 1. Initialize Spark
spark = SparkSession.builder \
    .appName("Checkpoint_Deep_Dive") \
    .master("local[*]") \
    .config("spark.sql.streaming.schemaInference", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

# 2. Define Paths
base_dir = "data"
input_dir = f"{base_dir}/input"
checkpoint_dir = f"{base_dir}/checkpoint"

# 3. Clean Start (Optional: Run this to reset experiments)
# if os.path.exists(checkpoint_dir):
#     shutil.rmtree(checkpoint_dir)

# 4. Define Processing Logic (Same as Module 7)
def get_streaming_df():
    raw_df = spark.readStream \
        .format("json") \
        .option("maxFilesPerTrigger", 1) \
        .load(input_dir)

    exploded_df = raw_df.select(
        col("eventId"),
        explode(col("data.devices")).alias("device_data")
    )

    return exploded_df.select(
        col("eventId"),
        col("device_data.deviceId").alias("device_id"),
        col("device_data.temperature").alias("temp"),
        current_timestamp().alias("processed_time")
    )

print("Setup Complete. Processing logic defined.")

## 1. The Checkpoint Structure

When a query starts, Spark creates a folder structure at the `checkpointLocation`.

*   **`metadata`**: A small file (not a directory) that stores the unique ID of the streaming query. If you delete the checkpoint folder, a new ID is generated, and Spark treats it as a brand new query.
*   **`sources`**: One sub-directory per source (here `sources/0`). For a file source, it logs which files were assigned to each batch, which prevents the same file from being processed twice.
*   **`offsets`**: A write-ahead log. Before a batch runs, Spark records here the range of data (offset range) that the batch ID will cover.
*   **`commits`**: The "Stamp of Approval". A file appears here ONLY when a batch is successfully processed and written to the sink.

**The Flow:**
1.  Spark reads new data -> Writes to **`offsets`** (Batch N).
2.  Spark processes data -> Writes to Sink.
3.  Spark finishes batch -> Writes to **`commits`** (Batch N).
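
The flow above implies a simple check: a batch ID that appears in `offsets` but not in `commits` was started and never finished. Below is a minimal sketch of that comparison using only the standard library. The name `find_uncommitted_batches` is hypothetical (it is not a Spark API), and the sketch assumes the checkpoint lives on a local filesystem where each batch is a file named by its integer ID.

```python
import os

def _batch_ids(path):
    """Return the set of integer batch IDs found as file names in `path`."""
    if not os.path.isdir(path):
        return set()
    # Hidden checksum files such as `.0.crc` are not purely numeric, so they are skipped
    return {int(name) for name in os.listdir(path) if name.isdigit()}

def find_uncommitted_batches(checkpoint_dir):
    """Batch IDs that have an offsets entry but no matching commits entry."""
    planned = _batch_ids(os.path.join(checkpoint_dir, "offsets"))
    committed = _batch_ids(os.path.join(checkpoint_dir, "commits"))
    return sorted(planned - committed)
```

After a clean run, `find_uncommitted_batches(checkpoint_dir)` should return an empty list. A non-empty result points at the batch Spark would re-run on the next start.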

In [None]:
# Start the stream to generate some checkpoint data
# Make sure you have at least one JSON file in 'data/input'
df = get_streaming_df()

query = df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", checkpoint_dir) \
    .trigger(availableNow=True) \
    .start()

query.awaitTermination()

print("Batch processed. Checkpoint directory updated.")

In [None]:
# Helper function to list a checkpoint entry and preview its latest log file
def inspect_checkpoint(sub_dir):
    path = f"{checkpoint_dir}/{sub_dir}"
    if not os.path.exists(path):
        print(f"{sub_dir} does not exist yet.")
        return
    print(f"--- Content of {sub_dir} ---")
    # `metadata` is a single file, not a directory, so read it directly
    if os.path.isfile(path):
        with open(path, 'r') as f:
            print(f.read()[:200])
        return
    files = os.listdir(path)
    print(files)
    # Ignore hidden checksum files (e.g. .0.crc); keep only numeric batch files
    batch_files = [f for f in files if f.isdigit()]
    if batch_files:
        latest_file = max(batch_files, key=int)
        with open(f"{path}/{latest_file}", 'r') as f:
            print(f"\n[Content of {latest_file}]:\n{f.read()[:200]}...") # Print first 200 chars

# Inspect directories
inspect_checkpoint("metadata")
inspect_checkpoint("sources/0") # 0 is the source ID
inspect_checkpoint("offsets")
inspect_checkpoint("commits")

## Experiment 1: Re-running the Same File

**Scenario:** You accidentally drop `device_01.json` into the input folder again.
**Result:** Spark ignores it.

**Why?**
Spark looks at `checkpoint/sources/0/`, where the file source logs the path of every file it has already assigned to a batch. Because the path of `device_01.json` is already in that log, the file is not treated as new, even if its contents have changed. This is part of how Spark avoids processing the same input twice.

**How to force re-process?**
1.  **Rename the file:** `device_01_v2.json`. Spark treats it as a new file.
2.  **Delete Checkpoint:** Spark forgets everything and re-processes ALL files in the input directory.
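
To see which paths Spark has already recorded, the source log can be read directly. The sketch below rests on an assumption about the on-disk format: each file in `sources/0` begins with a version line (such as `v1`) followed by one JSON object per line containing a `path` field. That layout is an internal detail and may differ between Spark versions. `list_seen_paths` is a hypothetical read-only helper, not a Spark API.

```python
import json
import os

def list_seen_paths(checkpoint_dir, source_id=0):
    """Collect the `path` values recorded in the file-source log (read-only)."""
    log_dir = os.path.join(checkpoint_dir, "sources", str(source_id))
    seen = set()
    if not os.path.isdir(log_dir):
        return seen
    for name in os.listdir(log_dir):
        if name.startswith("."):  # skip hidden checksum files such as .0.crc
            continue
        with open(os.path.join(log_dir, name), "r") as fh:
            for line in fh:
                line = line.strip()
                if not line.startswith("{"):  # skip the version header line
                    continue
                entry = json.loads(line)
                if "path" in entry:
                    seen.add(entry["path"])
    return seen
```

If `device_01.json` shows up in `list_seen_paths(checkpoint_dir)`, dropping a file with that same name into the input folder again should have no effect.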

## Experiment 2: Simulating a Failure

Let's simulate a crash in which Spark **read** the data but **failed to commit** the batch (e.g., a power failure before the entry in `commits` was written).

**Steps to Simulate:**
1.  Place a NEW file (`device_02.json`) in `data/input`.
2.  Run the stream (Cell 4).
3.  Look at `checkpoint/commits` and identify the latest batch ID (e.g., `1`).
4.  **Manually Delete** that file (`checkpoint/commits/1`).
    *   *State:* Spark has `offsets/1` (knows data exists) but no `commits/1` (thinks job failed).
5.  **Re-run the Stream** (Cell 4).

**Observation:**
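Spark sees that `offsets/1` exists but `commits/1` is missing, so it infers that Batch 1 never completed. On restart it automatically re-runs Batch 1, processing `device_02.json` again to ensure data integrity. Note that the console sink simply prints the rows a second time; end-to-end exactly-once output also requires a sink that handles a replayed batch idempotently.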
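
Step 4 can also be done from a notebook cell instead of a file browser. The following is a sketch for this local experiment only; `delete_latest_commit` is a hypothetical helper, and the hidden `.N.crc` checksum file it cleans up is an assumption about how the local filesystem checkpoint is laid out.

```python
import os

def delete_latest_commit(checkpoint_dir):
    """Remove the highest-numbered commit file; return its batch ID or None."""
    commits_dir = os.path.join(checkpoint_dir, "commits")
    if not os.path.isdir(commits_dir):
        return None
    batch_ids = [int(name) for name in os.listdir(commits_dir) if name.isdigit()]
    if not batch_ids:
        return None
    latest = max(batch_ids)
    os.remove(os.path.join(commits_dir, str(latest)))
    # Also drop the matching hidden checksum file, if one exists
    crc_path = os.path.join(commits_dir, f".{latest}.crc")
    if os.path.exists(crc_path):
        os.remove(crc_path)
    return latest
```

Calling `delete_latest_commit(checkpoint_dir)` and then re-running the stream cell should reproduce the observation above. This is for experiments only and should never be run against a production checkpoint.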

## Production Best Practices

1.  **Do NOT Touch Manually:** Never manually edit or delete files in the checkpoint directory in production. It can corrupt the stream state permanently.
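2.  **Changing Logic:** Some changes to the query are incompatible with an existing checkpoint, for example changing the number or type of input sources or altering a stateful operation such as an aggregation. Simple projection or filter changes are often tolerated. When a change is incompatible, you must restart with a fresh checkpoint directory.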
3.  **Re-processing:** To re-process data, it is safer to rename the input file than to tamper with the checkpoint sources.
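
The re-processing advice can be sketched as a small helper that copies an input file under a new name, so the stream sees a path it has not recorded yet. `resubmit_as_new_file` and its `suffix` parameter are hypothetical names chosen for illustration. The sketch assumes a local input directory and copies via a temporary hidden file before renaming, so that a running stream is less likely to read a half-written file.

```python
import os
import shutil

def resubmit_as_new_file(input_dir, file_name, suffix="_v2"):
    """Copy an input file under a new name so the stream sees a new path."""
    stem, ext = os.path.splitext(file_name)
    new_name = f"{stem}{suffix}{ext}"
    src = os.path.join(input_dir, file_name)
    dst = os.path.join(input_dir, new_name)
    if os.path.exists(dst):
        raise FileExistsError(f"{dst} already exists; pick another suffix")
    # Write the copy under a temporary name first, then rename it into place
    tmp = os.path.join(input_dir, f".{new_name}.tmp")
    shutil.copyfile(src, tmp)
    os.replace(tmp, dst)
    return new_name
```

For example, `resubmit_as_new_file(input_dir, "device_01.json")` creates `device_01_v2.json` next to the original and leaves the checkpoint untouched.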