# Spark Streaming with PySpark
## Module 11: Triggers, Automation & Performance Tuning

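In the previous module, we pushed JSON messages to Kafka by hand. That approach cannot produce enough volume to test performance, so this module automates data production and then looks at how triggers and a few settings affect latency and throughput.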

### Objectives:
1.  **Automate Data Production:** Use a Python script to generate thousands of fake IoT events and push them to Kafka automatically.
2.  **Explore Triggers:** Control *when* Spark processes a batch.
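    *   **Default (Unspecified):** Start the next micro-batch as soon as the previous one finishes.
    *   **ProcessingTime:** Start micro-batches at fixed intervals (e.g., every 10 seconds).
    *   **AvailableNow (Spark 3.3+):** Process all data available at start, possibly across several micro-batches, then stop (great for cost saving/periodic jobs). It supersedes the older `once=True` trigger, which processes everything in a single batch.
    *   **Continuous (Experimental):** Low-latency processing (millisecond level) without micro-batches.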
3.  **Performance Tuning:** Adjust shuffle partitions to speed up small-data processing.
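
Each of these modes corresponds to one keyword argument of `DataStreamWriter.trigger()`. The sketch below shows that mapping. The helper names and the mode labels (`"default"`, `"processing_time"`, and so on) are invented for this illustration and are not part of PySpark; only the keyword names `processingTime`, `availableNow`, and `continuous` come from the API.

```python
def trigger_kwargs(mode, interval=None):
    """Return keyword arguments for DataStreamWriter.trigger() for a named mode.

    The mode names are labels made up for this sketch, not PySpark identifiers.
    """
    if mode == "default":
        # No .trigger() call at all: the next micro-batch starts as soon as
        # the previous one finishes.
        return {}
    if mode == "processing_time":
        return {"processingTime": interval or "10 seconds"}
    if mode == "available_now":
        return {"availableNow": True}
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval.
        return {"continuous": interval or "1 second"}
    raise ValueError(f"unknown trigger mode: {mode!r}")


def apply_trigger(writer, mode, interval=None):
    """Apply the chosen trigger to a DataStreamWriter-like object."""
    kwargs = trigger_kwargs(mode, interval)
    # .trigger() expects exactly one argument, so skip the call for "default".
    return writer.trigger(**kwargs) if kwargs else writer
```

Given any streaming DataFrame `df`, a call such as `apply_trigger(df.writeStream.format("console"), "processing_time", "10 seconds").start()` would then start a query with a 10-second interval.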

## Step 1: Automate Data Generation

Instead of typing JSON manually, we will use a Python script to generate random data.

**Action:**
1.  Open your terminal.
2.  Ensure you have the `kafka-python` library installed:
    `pip install kafka-python`
3.  Run the provided Python generator script. This assumes you have `device_events.py` and `post_to_kafka.py` from the repo:
    `python post_to_kafka.py`

*This script will start flooding your 'device-data' topic with random events.*
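
The repo's `post_to_kafka.py` is not reproduced here. As a rough idea of what such a generator could look like, the sketch below builds events shaped like the schema used later in this module (`eventId`, `eventTime`, and a `data.devices` array) and sends them with `kafka-python`. The function names, device ID format, value ranges, status values, and timestamp format are assumptions for illustration. The topic `device-data` and broker `localhost:29092` are the ones this module's Spark reader uses.

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone


def make_event(max_devices=3):
    """Build one fake IoT event shaped like this module's json_schema."""
    devices = [
        {
            "deviceId": f"D{random.randint(1, 50):03d}",  # hypothetical ID format
            "temperature": random.randint(-10, 45),      # integer, as in the schema
            "measure": "C",                               # assumed unit label
            "status": random.choice(["SUCCESS", "ERROR", "STANDBY"]),  # assumed values
        }
        for _ in range(random.randint(1, max_devices))
    ]
    return {
        "eventId": str(uuid.uuid4()),
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "data": {"devices": devices},
    }


def publish_events(count=1000, topic="device-data",
                   bootstrap_servers="localhost:29092", delay_seconds=0.01):
    """Send `count` random events to Kafka as UTF-8 encoded JSON."""
    # Imported here so make_event() stays usable without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    try:
        for _ in range(count):
            producer.send(topic, value=make_event())
            time.sleep(delay_seconds)  # throttle; set to 0 for maximum throughput
    finally:
        producer.flush()  # block until buffered messages are delivered
        producer.close()
```

Calling `publish_events(count=5000)` with the broker running would push 5000 events. Lowering `delay_seconds` raises the event rate, which is useful when comparing triggers.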

In [None]:
# We reuse the exact same logic from Module 10, but we will change the .trigger() part.

import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, explode
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# The connector coordinates must match your Spark (here 3.3.0) and Scala (here 2.12) versions.
kafka_jar_package = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"

# TUNING: use 4 shuffle partitions instead of the default 200 for faster local processing.
spark = SparkSession.builder \
    .appName("Kafka_Triggers_Demo") \
    .master("local[*]") \
    .config("spark.jars.packages", kafka_jar_package) \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

# --- Schema Definition (Same as before) ---
device_schema = StructType([
    StructField("deviceId", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("measure", StringType(), True),
    StructField("status", StringType(), True)
])
json_schema = StructType([
    StructField("eventId", StringType(), True),
    StructField("eventTime", StringType(), True),
    StructField("data", StructType([
        StructField("devices", ArrayType(device_schema), True)
    ]), True)
])

# --- Read Stream ---
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:29092") \
    .option("subscribe", "device-data") \
    .option("startingOffsets", "latest") \
    .load()

# --- Transformation Logic ---
json_df = kafka_df.select(col("value").cast("string").alias("json_string"))
parsed_df = json_df.select(from_json(col("json_string"), json_schema).alias("payload"))
flattened_df = parsed_df.select(
    col("payload.eventId"),
    col("payload.eventTime"),
    explode(col("payload.data.devices")).alias("device")
).select(
    "eventId", "eventTime", "device.deviceId", "device.temperature", "device.status"
)

## Trigger 1: ProcessingTime (Fixed Interval)

In [None]:
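# Scenario: Run a micro-batch every 10 seconds.
# Data that arrives at t=1s waits for the next trigger (t=10s) before it is processed.
# This increases latency but reduces per-batch overhead when batches are small.
# If a batch takes longer than 10 seconds, the next one starts as soon as it finishes.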

print("Starting Stream with ProcessingTime='10 seconds'...")

query_processing_time = flattened_df.writeStream \
    .format("console") \
    .trigger(processingTime='10 seconds') \
    .start()

# Let it run for 30 seconds, then stop it so we can try the next trigger.
query_processing_time.awaitTermination(30)
query_processing_time.stop()
print("Stopped ProcessingTime Query.")

## Trigger 2: AvailableNow

In [None]:
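# Scenario: "I want to run this as a nightly job, process everything since the last run, and shut down."
# This mimics batch processing but uses a streaming architecture (Kappa).
# "Since the last run" only works with a checkpoint, which stores the offsets already processed.
# startingOffsets applies only to the first run; because the reader uses "latest",
# that first run may find nothing to process.
# The checkpoint path below is only an example; use any durable location.

print("Starting Stream with AvailableNow=True...")

query_once = flattened_df.writeStream \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/available_now_demo") \
    .trigger(availableNow=True) \
    .start()

query_once.awaitTermination()
print("Job Finished! (It stopped automatically because availableNow=True)")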

## Trigger 3: Continuous Processing

*   **Concept:** Instead of micro-batches, Spark launches long-running tasks that process each row as it arrives.
*   **Latency:** Milliseconds (vs. seconds for micro-batch), with at-least-once delivery guarantees.
*   **Limitation:** Only map-like operations are supported in this mode: projections such as `select` and filters such as `where`. Aggregations are not. Our pipeline only parses and flattens, so it *might* work here, but `explode` is a generator rather than a plain projection, so verify it against your Spark version before relying on it.

*For this course, we stick to micro-batch modes, which are the default and support the full range of streaming operations.*
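
For reference, switching a query to continuous mode only changes the trigger argument. The sketch below assumes a streaming DataFrame built solely from projections and filters (for example the `json_df` defined earlier). The helper name and its parameters are hypothetical, and whether Spark accepts the query depends on your Spark version and the operations in the plan.

```python
def start_continuous_console_query(stream_df, checkpoint_interval="1 second",
                                   checkpoint_dir=None):
    """Start a console query in continuous mode and return the StreamingQuery.

    `stream_df` must be a streaming DataFrame that uses only map-like
    operations; Spark rejects unsupported plans when .start() is called.
    """
    writer = stream_df.writeStream.format("console")
    if checkpoint_dir is not None:
        writer = writer.option("checkpointLocation", checkpoint_dir)
    # The interval sets how often progress is checkpointed; it is not a batch interval.
    return writer.trigger(continuous=checkpoint_interval).start()
```

A trial run could look like `query = start_continuous_console_query(json_df)`, followed by `query.awaitTermination(30)` and `query.stop()`, mirroring the ProcessingTime cell.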

## Tuning Summary

1.  **Shuffle Partitions:**
    *   `spark.sql.shuffle.partitions` defaults to 200.
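    *   In streaming, every micro-batch that shuffles (aggregations, joins) then schedules 200 tasks, most of them tiny when the data volume is low.
    *   **Action:** Reduce it to 2, 4, or 8 (matching your core count) for low-volume streams. The parse-and-flatten query in this module does not shuffle, so the setting pays off once you add aggregations or joins.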

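2.  **Max Offsets Per Trigger:**
    *   `.option("maxOffsetsPerTrigger", 1000)`, set on the Kafka `readStream`.
    *   Protects the stream from being overwhelmed when a huge burst of data arrives. Each micro-batch reads at most 1000 messages in total across all topic partitions, and the rest waits for later batches.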
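
To make the rate limit concrete, the sketch below places the option on the Kafka reader and estimates how many micro-batches a backlog needs under a given cap. It reuses this module's broker and topic. The function names and the backlog figure in the usage note are illustrative assumptions.

```python
import math


def batches_to_drain(backlog_messages, max_offsets_per_trigger):
    """Lower bound on the micro-batches needed to read a backlog under the cap."""
    # Each batch reads at most the cap; batches may read slightly less because
    # the cap is split proportionally across topic partitions.
    return math.ceil(backlog_messages / max_offsets_per_trigger)


def build_rate_limited_reader(spark, max_offsets_per_trigger=1000,
                              bootstrap_servers="localhost:29092",
                              topic="device-data"):
    """Return a streaming Kafka DataFrame whose micro-batches are size-capped."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .option("maxOffsetsPerTrigger", max_offsets_per_trigger)
        .load()
    )
```

For example, `batches_to_drain(50_000, 1000)` gives 50, so with a 10-second ProcessingTime trigger a 50,000-message backlog would take on the order of 8 minutes to clear. Raising the cap shortens the catch-up time at the cost of larger batches.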