# Spark Streaming with PySpark
## Module 7: File Sources & Checkpointing

In the previous modules, we used Sockets (Netcat) to simulate data. In the real world, data often arrives as files (JSON, CSV, Parquet) in a "Landing Zone" or "Stage" bucket (S3, ADLS).

### Objectives:
1.  **Read Stream from Files:** Monitor a directory for new JSON files.
2.  **Schema Inference:** Configure Spark to automatically detect file structure.
3.  **Complex Data Processing:** Flatten nested JSON arrays and structs into a tabular format (CSV).
4.  **Source Cleaning:** Archive processed files automatically.
5.  **Checkpointing:** Understand how Spark remembers what it has processed.

### The Scenario
We are receiving IoT device data in JSON format. Each file contains a batch of readings.
*   **Input:** Nested JSON with arrays.
*   **Output:** Flattened CSV files.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, current_timestamp

# 1. Initialize Spark Session
spark = SparkSession.builder \
    .appName("File_Streaming_Demo") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

# 2. Enable Schema Inference
# By default, Structured Streaming requires an explicit schema for file sources.
# For this demo, we enable inference so Spark derives the structure from the JSON files
# that are already present in the input directory when the stream is defined.
spark.conf.set("spark.sql.streaming.schemaInference", "true")

print("Spark Session Created with Schema Inference Enabled.")

## The Input Data (JSON)

We expect JSON files to be dropped into `data/input`. Here is the structure of our sample data (`device_01.json`):

```json
{
  "eventId": "e1",
  "data": {
    "devices": [
      {
        "deviceId": "d1",
        "temperature": 25,
        "measure": "C"
      },
      {
        "deviceId": "d2",
        "temperature": 78,
        "measure": "F"
      }
    ]
  },
  "eventTime": "2024-01-01 10:00:00"
}
```

In [None]:
# Challenge: The core data is inside a nested array data.devices. We need to explode this array to get one row per device.

# Step 1 - Read Stream


# Define paths
input_dir = "data/input"
archive_dir = "data/archive"

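# Read Stream
# multiLine: The sample file is a single pretty-printed JSON object that spans several lines.
#            Without this option, Spark expects one complete JSON object per line (JSON Lines).
# maxFilesPerTrigger: Limits how many files are processed per batch (simulates flow).
# cleanSource: "archive" moves processed files to sourceArchiveDir so the input folder stays clean.
raw_df = spark.readStream \
    .format("json") \
    .option("multiLine", "true") \
    .option("maxFilesPerTrigger", 1) \
    .option("cleanSource", "archive") \
    .option("sourceArchiveDir", archive_dir) \
    .load(input_dir)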

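# Note: If you get a "Path does not exist" error, manually create the 'data/input' folder.
# With schema inference enabled, the folder must also contain at least one JSON file
# before this cell runs; otherwise Spark has nothing to infer the schema from.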
print("Read Stream Initialized.")

In [None]:
# Step 2 - Flatten the Nested Data

# 1. Explode the array to create one row per device
exploded_df = raw_df.select(
    col("eventId"),
    col("eventTime"),
    explode(col("data.devices")).alias("device_data")
)

# 2. Flatten the struct columns using Dot Notation
flattened_df = exploded_df.select(
    col("eventId"),
    col("eventTime"),
    col("device_data.deviceId").alias("device_id"),
    col("device_data.temperature").alias("temp"),
    col("device_data.measure").alias("unit"),
    current_timestamp().alias("processed_time")
)
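
To see what the two `select` steps produce, here is a plain-Python equivalent of the explode-and-flatten logic applied to the sample record. It is illustrative only: it does not use Spark, the helper name `flatten_event` is hypothetical, and the `processed_time` column is left out because it depends on when the batch runs.

```python
# The sample record from this module, as a Python dict.
sample_event = {
    "eventId": "e1",
    "data": {
        "devices": [
            {"deviceId": "d1", "temperature": 25, "measure": "C"},
            {"deviceId": "d2", "temperature": 78, "measure": "F"},
        ]
    },
    "eventTime": "2024-01-01 10:00:00",
}


def flatten_event(event):
    """Return one flat dict per element of data.devices."""
    rows = []
    # Looping over the array plays the role of explode(): one row per element.
    for device in event["data"]["devices"]:
        rows.append({
            "eventId": event["eventId"],
            "eventTime": event["eventTime"],
            # Picking fields out of the struct plays the role of dot notation.
            "device_id": device["deviceId"],
            "temp": device["temperature"],
            "unit": device["measure"],
        })
    return rows


flat_rows = flatten_event(sample_event)
for row in flat_rows:
    print(row)
```

One event with two devices becomes two rows that share the same `eventId` and `eventTime`. As with `explode`, an event whose `devices` array is empty yields no rows.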

## Step 3: Checkpointing & Writing

Before we write, we must define a **Checkpoint Location**.

### What is Checkpointing?
It is a directory where Spark saves the **state** of the stream.
1.  **Offsets:** Which files have I already processed?
2.  **State:** (For aggregations) What are the current counts?

**Why is it critical?**
If your application crashes, Spark reads the checkpoint directory on restart. It sees, *"Ah, I already processed `device_01.json`, so I will ignore it and start looking for `device_02.json`."* That way, the stream resumes where it left off without processing duplicates.
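
Once the query has run, you can look inside the checkpoint directory to see this bookkeeping. The sketch below counts the files under each top-level entry. The helper name `summarize_checkpoint` is hypothetical. Subdirectories such as `offsets`, `commits`, and `sources` typically appear, but the layout is an internal detail of Spark and may differ between versions.

```python
from pathlib import Path


def summarize_checkpoint(checkpoint_dir):
    """Count the files under each top-level entry of a checkpoint directory."""
    root = Path(checkpoint_dir)
    summary = {}
    if not root.is_dir():
        # The query has not written a checkpoint yet.
        return summary
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            # Count regular files at any depth (this includes hidden checksum files).
            summary[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            summary[entry.name] = 1
    return summary


# Prints an empty dict until the streaming query has started at least once.
print(summarize_checkpoint("data/checkpoint"))
```

Do not edit or delete these files by hand while you still want the query to resume; removing the checkpoint makes Spark treat every file in the input folder as new.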

In [None]:
output_dir = "data/output"
checkpoint_dir = "data/checkpoint"

# Write Stream
query = flattened_df.writeStream \
    .format("csv") \
    .outputMode("append") \
    .option("header", "true") \
    .option("path", output_dir) \
    .option("checkpointLocation", checkpoint_dir) \
    .start()

print(f"Streaming to {output_dir}...")
print(f"Tracking state in {checkpoint_dir}...")

# Keep the cell running
query.awaitTermination()

## How to Test This?

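Since this is a file-based source, nothing is processed until a file lands in `data/input`. Because schema inference is enabled, drop the first file before running the read cell; later files can arrive while the query is running.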

1.  **Create** a file named `device_01.json` with the JSON content shown in Cell 3.
2.  **Paste** it into the `data/input` folder.
3.  **Observe:**
    *   Spark will detect the file.
    *   It will process it and write a CSV to `data/output`.
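    *   It will **move** the JSON file from `data/input` to `data/archive`. Archiving is best effort, so the move may lag behind processing, and the archived file keeps its original directory path underneath `data/archive`.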
4.  **Test Checkpoint:** Paste the *exact same file* into `data/input` again.
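    *   **Result:** Nothing happens! Spark checks the `checkpoint` folder, sees that a file with this path was already processed, and skips it. Spark tracks input files by path, not by content, so a copy saved under a new name would be processed again.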
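
The manual steps above can also be scripted. The sketch below uses only the standard library: `drop_sample_file` writes the sample event into the input folder, and `read_output_rows` collects whatever CSV rows Spark has written so far. Both helper names are hypothetical. The file is written on a single line, which Spark's JSON reader accepts whether or not the `multiLine` option is set, and it is moved into place in one step so that Spark never picks up a half-written file.

```python
import csv
import json
import os
import tempfile
from pathlib import Path

# The sample event from this module.
SAMPLE_EVENT = {
    "eventId": "e1",
    "data": {
        "devices": [
            {"deviceId": "d1", "temperature": 25, "measure": "C"},
            {"deviceId": "d2", "temperature": 78, "measure": "F"},
        ]
    },
    "eventTime": "2024-01-01 10:00:00",
}


def drop_sample_file(input_dir, name="device_01.json"):
    """Write the sample event into input_dir and return the final path."""
    target_dir = Path(input_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Write next to (not inside) the watched folder, then move the file in.
    with tempfile.NamedTemporaryFile(
        "w", dir=target_dir.parent, suffix=".tmp", delete=False, encoding="utf-8"
    ) as tmp:
        json.dump(SAMPLE_EVENT, tmp)  # compact, single-line JSON
    final_path = target_dir / name
    os.replace(tmp.name, final_path)
    return final_path


def read_output_rows(output_dir):
    """Collect the rows from every CSV part file in output_dir."""
    rows = []
    for part in sorted(Path(output_dir).glob("part-*.csv")):
        with part.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Example (run from a second notebook or terminal while the stream is active):
# drop_sample_file("data/input")
# print(read_output_rows("data/output"))
```

Because `query.awaitTermination()` blocks the notebook cell, call these helpers from a separate process. To repeat the checkpoint test, call `drop_sample_file` again with the same `name`; pass a different `name` to see a new file being processed.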