# Spark Streaming with PySpark
## Module 15: Event Time, Processing Time & Stateful Processing

In streaming, "time" is not a single dimension. A record generated at 12:00 might not reach Spark until 12:10 because of network lag, so we have to decide which of the two timestamps drives our calculations.

### Objectives:
1.  **Event Time vs. Processing Time:** Understanding the difference.
2.  **The Late Data Problem:** What happens when data arrives out of order?
3.  **Stateful Processing:** How Spark calculates aggregations (like averages) over time windows.
4.  **The Memory Challenge:** Why we cannot keep state forever.

## 1. Event Time vs. Processing Time

| Type | Definition | Example |
| :--- | :--- | :--- |
| **Event Time** | The time the event actually occurred at the source. This is embedded in the data itself. | A sensor records temperature at **12:04 PM**. |
| **Processing Time** | The time the system (Spark) receives and processes the data. | Spark reads that sensor record at **12:15 PM**. |

### The Conflict
If we calculate the "Average Temperature for 12:00 - 12:10", should we include the record that arrived at 12:15?
*   **Yes**, because the *Event Time* (12:04) falls in that window.
*   **Problem:** To accept that late record, Spark has to remember (maintain state) that the 12:00 window is still "open" after 12:10 has passed.

## 2. The Late Data Scenario

Imagine two devices sending data to a server in Bangalore:
1.  **Device D1 (Delhi):** Fast network. Data generated at 12:04, arrives at 12:04.
2.  **Device D2 (Sydney):** Slow network. Data generated at 12:04, arrives at 12:14.

If Spark processes data in **10-minute windows** based on **Processing Time**:
*   **Window 1 (12:00-12:10):** Includes D1.
*   **Window 2 (12:10-12:20):** Includes D2 (incorrectly, because its reading was taken at 12:04).

**Solution:** We must process based on **Event Time**.
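
The D1/D2 scenario can be reproduced without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: the helper `tumbling_window`, the record layout, and the calendar date are all hypothetical and chosen only for illustration. It assigns each record to a 10-minute window twice, once by arrival time and once by event time.

```python
from datetime import datetime, timedelta
from typing import Tuple

EPOCH = datetime(1970, 1, 1)  # windows are aligned to this reference point


def tumbling_window(ts: datetime, minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return the [start, end) tumbling window that contains ts."""
    size = timedelta(minutes=minutes)
    start = ts - (ts - EPOCH) % size  # floor ts to a multiple of the window size
    return start, start + size


# Hypothetical records for the two devices (the date itself is arbitrary).
records = [
    {"device": "D1", "event_time": datetime(2024, 1, 1, 12, 4),
     "arrival_time": datetime(2024, 1, 1, 12, 4)},
    {"device": "D2", "event_time": datetime(2024, 1, 1, 12, 4),
     "arrival_time": datetime(2024, 1, 1, 12, 14)},
]

# Window start per device when grouping by arrival (processing) time.
by_processing = {
    r["device"]: tumbling_window(r["arrival_time"])[0].strftime("%H:%M")
    for r in records
}
# Window start per device when grouping by event time.
by_event = {
    r["device"]: tumbling_window(r["event_time"])[0].strftime("%H:%M")
    for r in records
}

print(by_processing)  # {'D1': '12:00', 'D2': '12:10'}
print(by_event)       # {'D1': '12:00', 'D2': '12:00'}
```

Grouping by arrival time splits two simultaneous readings across two windows, while grouping by event time puts both in the 12:00 window.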

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, avg

# Initialize Spark
spark = SparkSession.builder \
    .appName("EventTime_Window_Demo") \
    .master("local[*]") \
    .getOrCreate()

# 1. Create dummy streaming data (representing flattened device data)
# We assume schema: [eventTime, deviceId, temperature]
# The built-in "rate" source emits (timestamp, value) rows.
# In a real scenario, this comes from Kafka.
device_data_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 1) \
    .load() \
    .selectExpr(
        "timestamp as eventTime",
        "'device_1' as deviceId",
        "cast(value as int) as temperature"
    )

# 2. Define the windowed aggregation
# We group by a 10-minute window based on 'eventTime'
# Logic: "Calculate average temperature for every 10 minutes per device"
windowed_avg = device_data_df \
    .groupBy(
        window(col("eventTime"), "10 minutes"),  # The window
        col("deviceId")
    ) \
    .agg(avg("temperature").alias("avg_temp"))

# 3. Print the schema to understand the structure
# Note: The 'window' column is a struct containing {start, end}
windowed_avg.printSchema()
```

## 3. Stateful Processing & The Problem

The code above only defines the query. Once it is started as a stream:
1.  Spark creates a "Bucket" (State) in memory for the window `12:00 - 12:10`.
2.  As data arrives, it updates the average in that bucket.
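
To make the "bucket" idea concrete, here is a rough pure-Python sketch of the bookkeeping, assuming a plain dictionary as the state and a running sum and count per bucket. The names `state` and `update` are hypothetical, and Spark's actual state store is more involved than this.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)

# Hypothetical state: (window_start, device_id) -> [running_sum, count]
state = {}


def update(event_time, device_id, temperature):
    """Fold one record into its event-time bucket and return the new average."""
    window_start = event_time - (event_time - EPOCH) % WINDOW
    bucket = state.setdefault((window_start, device_id), [0.0, 0])
    bucket[0] += temperature
    bucket[1] += 1
    return bucket[0] / bucket[1]


day = datetime(2024, 1, 1)  # arbitrary date for illustration
avg_1 = update(day.replace(hour=12, minute=4), "device_1", 20.0)   # opens the 12:00 bucket
avg_2 = update(day.replace(hour=12, minute=12), "device_1", 30.0)  # opens the 12:10 bucket
avg_3 = update(day.replace(hour=12, minute=6), "device_1", 24.0)   # late record, 12:00 bucket

print(avg_1, avg_2, avg_3)  # 20.0 30.0 22.0
print(len(state))           # 2
```

The third record arrives after the 12:10 bucket has been opened, yet it still updates the 12:00 bucket. That only works because the older bucket was kept in the state.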

### The Critical Question
**When does Spark drop the bucket?**
If data can arrive late, Spark theoretically has to keep the `12:00 - 12:10` bucket open **forever** just in case a record from 12:04 arrives 5 years later.

**The Consequence:**
*   State grows without bound.
*   Out-of-memory (OOM) errors.
*   System crash.

### The Solution: Watermarking
To solve this, we need a mechanism to tell Spark: *"Hey, if a record's event time is more than 30 minutes behind the latest event time you have seen, just ignore it and drop the old state."*

This mechanism is called **Watermarking**, which we will implement in the next module.
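
As a preview of the idea only, the sketch below compares the number of buckets held with and without a lateness threshold. It is a simplified per-record simulation in plain Python, assuming one event per minute for a day; the function `run` and its `max_lateness` parameter are hypothetical, and Spark's real watermark is evaluated per micro-batch rather than per record.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)


def run(event_times, max_lateness=None):
    """Return how many window buckets are still held after all events.

    max_lateness=None keeps every bucket forever; otherwise buckets whose
    window ended at or before (latest event time - max_lateness) are dropped.
    """
    open_windows = set()  # start times of the buckets currently held
    max_seen = None       # latest event time observed so far
    for ts in event_times:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        start = ts - (ts - EPOCH) % WINDOW
        if max_lateness is not None:
            threshold = max_seen - max_lateness
            if start + WINDOW <= threshold:
                continue  # too late: its window was already dropped
            # Evict every bucket that can no longer receive accepted data.
            open_windows = {w for w in open_windows if w + WINDOW > threshold}
        open_windows.add(start)
    return len(open_windows)


day = datetime(2024, 1, 1)  # arbitrary date for illustration
events = [day + timedelta(minutes=i) for i in range(24 * 60)]

unbounded = run(events)                      # every bucket is kept
bounded = run(events, timedelta(minutes=30)) # only recent buckets are kept

print(unbounded)  # 144
print(bounded)    # 4
```

Without a threshold the state grows by one bucket per window indefinitely. With a 30-minute threshold it stays at a handful of buckets no matter how long the stream runs, at the cost of ignoring records that arrive too late.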