# Spark Streaming with PySpark
## Module 17: Window Operations & Late Data Handling

In this module, we put theory into practice. We will process streaming data using:
1.  **Tumbling Windows:** Fixed-size, non-overlapping windows (e.g., one window every 10 minutes).
2.  **Sliding Windows:** Fixed-size, overlapping windows (e.g., a 10-minute window that slides every 5 minutes).
3.  **Watermarking:** Handling late data by defining how far behind the latest event time a record may arrive before it is dropped.
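
To make the difference between the two window types concrete, here is a minimal plain-Python sketch of how a single timestamp maps to windows. It is a model for illustration, not Spark's implementation: the helper `assign_windows` is hypothetical, and it assumes naive timestamps, half-open `[start, end)` windows, and alignment to the Unix epoch with no start offset.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def assign_windows(event_time, size, slide=None):
    """Return the (start, end) windows that contain event_time.

    Windows are half-open [start, end) and aligned to the Unix epoch.
    With slide omitted this models a tumbling window (slide == size).
    """
    slide = slide or size
    # Start of the latest window that begins at or before event_time
    last_start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    start = last_start
    # Walk backwards while the window still covers event_time
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

ts = datetime(2024, 1, 1, 12, 7)
ten, five = timedelta(minutes=10), timedelta(minutes=5)

tumbling = assign_windows(ts, ten)       # one window: 12:00-12:10
sliding = assign_windows(ts, ten, five)  # two windows: 12:00-12:10 and 12:05-12:15

for start, end in sliding:
    print(start.time(), "->", end.time())
```

An event at 12:07 belongs to exactly one tumbling window but to two sliding windows, which is why a sliding aggregation counts the same record more than once.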

### The Scenario
Each Kafka message carries an `event_time` and a space-separated string of words (animal and bird names). We want to count the occurrences of each word within specific event-time windows.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, from_json, explode, expr
from pyspark.sql.types import StructType, StructField, StringType

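# The connector coordinates must match the Spark version and Scala build in use
# (here Spark 3.3.0, Scala 2.12).
kafka_jar_package = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"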

spark = SparkSession.builder \
    .appName("Window_Operations_Lab") \
    .master("local[*]") \
    .config("spark.jars.packages", kafka_jar_package) \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

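# --- Schema for Input Data ---
# Example message: {"event_time": "2024-01-01 12:05:00", "data": "owl dog cat"}
# event_time is read as a string here and cast to a timestamp after parsing.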
json_schema = StructType([
    StructField("event_time", StringType(), True),
    StructField("data", StringType(), True)
])

In [None]:
# Read from Kafka (Topic: wildlife)
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:29092") \
    .option("subscribe", "wildlife") \
    .option("startingOffsets", "latest") \
    .load()

# Parse JSON & Explode Words
# Note: We must cast event_time string to TimestampType for windowing!
words_df = kafka_df.select(col("value").cast("string").alias("json_string")) \
    .select(from_json(col("json_string"), json_schema).alias("payload")) \
    .select(
        col("payload.event_time").cast("timestamp").alias("eventTime"),
        explode(expr("split(payload.data, ' ')")).alias("word")
    )

# Check Schema (Important: eventTime must be Timestamp)
words_df.printSchema()
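
To see what the parsing step produces for one message, the same transformation can be sketched in plain Python without Spark or Kafka. The function `parse_message` is a hypothetical stand-in for the `select` chain above, and the timestamp format is assumed to be `YYYY-MM-DD HH:MM:SS`, as in the test messages used later in this module.

```python
import json
from datetime import datetime

def parse_message(raw_value):
    """Mimic the select chain: decode bytes, parse JSON, cast the timestamp, explode words."""
    payload = json.loads(raw_value.decode("utf-8"))
    event_time = datetime.strptime(payload["event_time"], "%Y-%m-%d %H:%M:%S")
    # One output row per word, like explode(split(data, ' '))
    return [(event_time, word) for word in payload["data"].split(" ")]

sample = b'{"event_time": "2024-01-01 12:05:00", "data": "owl dog cat"}'
rows = parse_message(sample)
for row in rows:
    print(row)
```

One difference worth keeping in mind: this sketch raises an exception on malformed JSON, whereas `from_json` in Spark returns a null struct for records it cannot parse.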

In [None]:
# Scenario: Count words per 10-minute tumbling window, based on event time.
# Watermark: 10 minutes. A record is accepted as long as its window has not
# fallen behind (max eventTime seen so far - 10 minutes).

windowed_counts = words_df \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy(
        window(col("eventTime"), "10 minutes"),
        col("word")
    ) \
    .count()

# Flatten output for readable console print
final_df = windowed_counts.select(
    col("window.start").alias("start_time"),
    col("window.end").alias("end_time"),
    col("word"),
    col("count")
)

In [None]:
# We use update mode so that a window's count is re-emitted when late data changes it.
# Only the rows that changed in the current trigger are printed.
query = final_df.writeStream \
    .format("console") \
    .outputMode("update") \
    .option("truncate", "false") \
    .trigger(processingTime="10 seconds") \
    .start()

query.awaitTermination()

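## How to Test Late Data

Send the messages one at a time and wait for a trigger (10 seconds) between them, so that each lands in its own micro-batch. The watermark is recomputed after every batch as the maximum `eventTime` seen so far minus 10 minutes. It depends only on event time, not on the wall-clock time at which a message arrives. In update mode, the console shows only the rows whose count changed in that batch.

1.  **Start Producer:** Use `kafka-console-producer` for topic `wildlife`.
2.  **Send On-Time Data:**
    `{"event_time": "2024-01-01 12:05:00", "data": "owl"}`
    *   *Result:* Window 12:00-12:10 -> owl: 1
3.  **Send Another On-Time (Same Window):**
    `{"event_time": "2024-01-01 12:08:00", "data": "dog"}`
    *   *Result:* Window 12:00-12:10 -> dog: 1 is printed (owl: 1 is unchanged, so it is not repeated). The watermark moves to 11:58.
4.  **Send Late Data (Within Watermark):**
    `{"event_time": "2024-01-01 12:04:00", "data": "cat"}` (sent after the 12:08 message, so it arrives out of order)
    *   *Result:* Window 12:00-12:10 updates! -> cat: 1 is printed. The window ends at 12:10, which is still ahead of the 11:58 watermark. The window now holds owl: 1, dog: 1, cat: 1.
5.  **Send Very Late Data (Outside Watermark):**
    `{"event_time": "2024-01-01 11:00:00", "data": "shark"}`
    *   *Result:* Ignored. No update. Its window (11:00-11:10) ended before the 11:58 watermark, so the record is dropped.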
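
The walkthrough above can be replayed offline with a small plain-Python simulation. This is a simplified model under stated assumptions, not Spark's engine: the functions `window_of` and `run_batches` are hypothetical, each message is treated as its own micro-batch, the watermark is advanced only after a batch completes, and a record is dropped when its window ended at or before the current watermark.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
DELAY = timedelta(minutes=10)   # same value as withWatermark("eventTime", "10 minutes")
EPOCH = datetime(1970, 1, 1)

def window_of(ts):
    """Tumbling window [start, end) that contains ts, aligned to the epoch."""
    start = ts - ((ts - EPOCH) % WINDOW)
    return (start, start + WINDOW)

def run_batches(batches):
    """Replay micro-batches; return the rows an update-mode sink would show per batch."""
    counts = {}            # (window, word) -> running count (the aggregation state)
    watermark = None       # no watermark until the first batch has been processed
    max_event_time = None
    emitted = []
    for batch in batches:
        changed = {}
        for ts, word in batch:
            win = window_of(ts)
            # Drop the record if its window already ended at or before the watermark
            if watermark is not None and win[1] <= watermark:
                continue
            key = (win, word)
            counts[key] = counts.get(key, 0) + 1
            changed[key] = counts[key]
        emitted.append(changed)
        # Advance the watermark after the batch, so it applies to the next one
        if batch:
            newest = max(ts for ts, _ in batch)
            if max_event_time is None or newest > max_event_time:
                max_event_time = newest
            watermark = max_event_time - DELAY
    return emitted, counts

def t(hour, minute):
    return datetime(2024, 1, 1, hour, minute)

batches = [
    [(t(12, 5), "owl")],
    [(t(12, 8), "dog")],
    [(t(12, 4), "cat")],    # late, but its window is still open
    [(t(11, 0), "shark")],  # too late, its window is behind the watermark
]
emitted, counts = run_batches(batches)

for i, changed in enumerate(emitted, 1):
    words = {word: n for (_, word), n in changed.items()}
    print(f"batch {i}: {words}")
```

Each batch prints only the word whose count changed, and the last batch prints an empty result because `shark` is discarded before it reaches the aggregation state.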