# Spark Streaming with PySpark
## Module 13: Error Handling & Exception Management

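In a real-time pipeline, bad data is inevitable. Left unhandled, a single corrupt JSON record can silently produce nulls downstream, and an unhandled exception in your batch logic can fail the query and stop the stream. We need a strategy that isolates bad records instead of letting them halt processing.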

### Objectives:
1.  **Malformed Records:** Identify data that doesn't match our schema.
2.  **Data Quality Checks:** Filter out records with missing critical fields (e.g., null `customerId`).
3.  **Segregation:** Split the stream into two paths:
    *   **Valid Data:** Process and write to the main table.
    *   **Error Data:** Write to an "Error Table" for debugging.
4.  **Exception Handling:** Wrap custom logic in `try-except` blocks to prevent driver crashes.
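
To make the routing rule concrete before bringing in Spark, the checks from objectives 1 to 3 can be sketched in plain Python. This is only an illustration: the function names `classify_record` and `split_records` and the reason labels are hypothetical, and the field names (`customerId`, `data.devices`) are taken from the checks used later in this module.

```python
import json

def classify_record(raw):
    """Return ("valid", payload) or ("error", reason) for one raw message."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        return ("error", "malformed_json")
    # Valid JSON that is not an object (e.g. a bare string) cannot match the schema.
    if not isinstance(payload, dict):
        return ("error", "malformed_json")
    # Data quality check: the business key must be present and non-null.
    if payload.get("customerId") is None:
        return ("error", "missing_customer_id")
    data = payload.get("data")
    devices = data.get("devices") if isinstance(data, dict) else None
    # A missing, non-list, or empty device array is treated as an error.
    if not isinstance(devices, list) or not devices:
        return ("error", "no_devices")
    return ("valid", payload)

def split_records(raw_records):
    """Segregate raw messages into a valid list and an error list."""
    valid, errors = [], []
    for raw in raw_records:
        status, detail = classify_record(raw)
        if status == "valid":
            valid.append(detail)
        else:
            # Keep the original string so the failure can be debugged later.
            errors.append({"raw_data": raw, "reason": detail})
    return valid, errors
```

The Spark version in this module follows the same shape, except that the checks become column expressions and the two lists become two DataFrames.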

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, explode, current_timestamp, size, lit

# Define packages (Kafka + Postgres)
# The Kafka connector version (3.3.0, built for Scala 2.12) must match your Spark installation.
packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    "org.postgresql:postgresql:42.2.18"
]

spark = SparkSession.builder \
    .appName("Error_Handling_Demo") \
    .master("local[*]") \
    .config("spark.jars.packages", ",".join(packages)) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

# --- Helper Function for Writing to Postgres ---
def write_to_postgres(df, table_name):
    jdbc_url = "jdbc:postgresql://localhost:5432/postgres"
    jdbc_properties = {"user": "postgres", "password": "password", "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=jdbc_url, table=table_name, mode="append", properties=jdbc_properties)

In [None]:
# This function handles the micro-batch logic
def process_microbatch(batch_df, batch_id):
    print(f"Processing Batch: {batch_id}")
    
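    # --- 1. Parse JSON & Handle Malformed Data ---
    # We parse the raw string. from_json does not raise on bad input by default: depending on
    # the Spark version, the payload is either null or a struct whose fields are all null.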
    parsed_df = batch_df.select(
        col("value").cast("string").alias("json_string")
    ).select(
        from_json(col("json_string"), json_schema).alias("payload"), # using json_schema from previous modules
        col("json_string") # Keep original string for error logging
    )
    
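    # --- 2. Define Error Conditions ---
    # A. Malformed JSON (payload is null, or all of its fields are null)
    # B. Missing Business Key (customerId is null; this also catches case A)
    # C. Missing or Empty Device Array
    
    # We add a column to flag errors. The explicit isNull() check on devices keeps
    # is_error from evaluating to null, which would drop the record from both paths.
    devices = col("payload.data.devices")
    checked_df = parsed_df.withColumn("is_error", 
        (col("payload").isNull()) | 
        (col("payload.customerId").isNull()) | 
        (devices.isNull()) | 
        (size(devices) == 0)
    )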
    
    # --- 3. Split the Stream ---
    # VALID DATA
    valid_df = checked_df.filter("is_error == false").select(
        col("payload.eventId"),
        col("payload.eventTime"),
        explode(col("payload.data.devices")).alias("device")
    ).select(
        "eventId", "eventTime", "device.deviceId", "device.temperature", "device.status"
    )
    
    # ERROR DATA
    error_df = checked_df.filter("is_error == true").select(
        col("json_string").alias("raw_data"),
        current_timestamp().alias("error_time"),
        lit(batch_id).alias("batch_id")
    )
    
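    # --- 4. Write to Sinks ---
    try:
        # Write Valid Data (count once and reuse the result; each count() is a full Spark job)
        valid_count = valid_df.count()
        if valid_count > 0:
            write_to_postgres(valid_df, "device_data")
            print(f"Written {valid_count} valid records.")
            
        # Write Error Data
        error_count = error_df.count()
        if error_count > 0:
            write_to_postgres(error_df, "device_data_error")
            print(f"Captured {error_count} bad records.")
            
    except Exception as e:
        # Catching here keeps the query running, but Spark then treats this batch as
        # complete and will not retry it, so log enough detail to replay it manually.
        print(f"CRITICAL ERROR in Batch {batch_id}: {str(e)}")
        # In production, you might send an alert here (Slack/Email)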
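
One limitation of the single `try` block in `process_microbatch` is that a failure while writing valid data also skips the error-table write. A possible refinement, sketched here under assumptions, is to run each sink write in its own `try-except`. The helper `run_sink_writes` and its `on_failure` callback are hypothetical names introduced for illustration, not part of Spark or of the earlier modules.

```python
def run_sink_writes(writers, batch_id, on_failure=None):
    """Run each (name, write_fn) pair in its own try/except.

    Returns a dict mapping each sink name to None on success,
    or to the caught exception on failure.
    """
    results = {}
    for name, write_fn in writers:
        try:
            write_fn()
            results[name] = None
        except Exception as exc:  # deliberately broad: any sink failure is recorded
            results[name] = exc
            if on_failure is not None:
                # Hook for alerting (e.g. Slack/Email); receives batch id, sink name, error.
                on_failure(batch_id, name, exc)
    return results
```

Inside `process_microbatch`, the two writes could then be passed as `("device_data", lambda: write_to_postgres(valid_df, "device_data"))` and `("device_data_error", lambda: write_to_postgres(error_df, "device_data_error"))`, so that a Postgres failure on one table does not prevent the attempt on the other.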

In [None]:
# Schema must be defined (copy from Module 10)
# ... (Assume json_schema is defined) ...

print("Starting Robust Stream...")

# Read from Kafka
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:29092") \
    .option("subscribe", "device-data") \
    .load()

# Start
query = kafka_df.writeStream \
    .foreachBatch(process_microbatch) \
    .option("checkpointLocation", "data/checkpoint_error_handling") \
    .start()

query.awaitTermination()

## How to Test Failures

1.  **Happy Path:** Send valid JSON via your Python generator.
    *   Check `device_data` table in Postgres.
2.  **Bad Data:** Use `kafka-console-producer` to send:
    *   `"This is just a string"` (Not a JSON object, so it cannot match the schema)
    *   `{"eventId": "e1", "customerId": null}` (Missing Key)
    *   `{"eventId": "e2", "customerId": "c1", "data": {"devices": []}}` (Empty Array; `customerId` is set so that only the empty-array check fires)
3.  **Verify:** Check the `device_data_error` table in Postgres. Each of these raw strings should appear in `raw_data`, along with its `error_time` and `batch_id`.
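
To avoid typing the test messages by hand, a small script can print one message per line, covering the cases listed above. This is a minimal sketch: the function name `build_test_messages` is hypothetical, and the field values and types in the valid record (for example a string `customerId` and a numeric `temperature`) are assumptions, since the real schema lives in Module 10. Adjust them to match your `json_schema`.

```python
import json

def build_test_messages():
    """Return one valid message followed by three bad ones, each a single line."""
    # Assumed shape of a valid event; align the types with your json_schema.
    valid = {
        "eventId": "e0",
        "eventTime": "2024-01-01T00:00:00Z",
        "customerId": "c1",
        "data": {"devices": [{"deviceId": "d1", "temperature": 21.5, "status": "OK"}]},
    }
    return [
        json.dumps(valid),                                  # happy path
        "This is just a string",                            # not parseable as JSON
        json.dumps({"eventId": "e1", "customerId": None}),  # missing business key
        json.dumps({"eventId": "e2", "customerId": "c1",
                    "data": {"devices": []}}),              # empty device array
    ]

if __name__ == "__main__":
    for line in build_test_messages():
        print(line)
```

Because `kafka-console-producer` reads one message per line from standard input, the output can be piped into it, for example with `--bootstrap-server localhost:29092 --topic device-data` (the broker address and topic used in this module). Depending on your Kafka distribution, the command may be named `kafka-console-producer.sh`.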