# Spark Streaming with PySpark
## Module 10: Reading from Kafka & JSON Parsing

This module connects Kafka to Spark Structured Streaming. We will build a real-time pipeline that reads IoT device data from a Kafka topic, parses the JSON payload, and flattens the nested device array into tabular rows.

### Objectives:
1.  **Kafka Source:** Configure Spark to read from a Kafka topic.
2.  **Jar Dependencies:** Load the `spark-sql-kafka` library.
3.  **Data Extraction:** Convert binary Kafka values to Strings.
4.  **JSON Parsing:** Use `from_json` with a defined schema.
5.  **Flattening:** Explode nested arrays and promote struct fields to columns.

### Prerequisites
*   Kafka Cluster running (Module 3 Setup).
*   `device-data` topic created.
*   Data being produced to the topic (we will do this via terminal).

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, explode, expr

# The Kafka connector coordinate must match your Spark and Scala versions:
# "_2.12" is the Scala version and the trailing "3.3.0" is the Spark version.
kafka_jar_package = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"

spark = SparkSession.builder \
    .appName("Kafka_Streaming_Demo") \
    .master("local[*]") \
    .config("spark.jars.packages", kafka_jar_package) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print("Spark Session with Kafka support created successfully!")
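
The connector coordinate encodes both the Scala and the Spark version, so a mismatch is a common reason the `kafka` format fails to load. Below is a minimal sketch of deriving the coordinate instead of hard-coding it. The helper name `kafka_package` is hypothetical, and the Scala default of `2.12` is an assumption that should be checked against your Spark build:

```python
def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for the Spark Kafka connector.

    Hypothetical helper: the Scala version is passed in explicitly,
    and 2.12 is only an assumed default.
    """
    return (
        "org.apache.spark:"
        f"spark-sql-kafka-0-10_{scala_version}:{spark_version}"
    )


# Reproduces the coordinate used in the cell above.
print(kafka_package("3.3.0"))
# org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
```

Passing `pyspark.__version__` as `spark_version` is one way to keep the connector aligned with the installed PySpark release.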

## Defining the Schema

`from_json` needs an explicit schema, because Structured Streaming does not infer one from the incoming messages. Our input data looks like this:

```json
{
  "eventId": "e1",
  "eventTime": "2024-01-01...",
  "data": {
    "devices": [
      {"deviceId": "d1", "temperature": 25, "measure": "C", "status": "SUCCESS"},
      {"deviceId": "d2", "temperature": 80, "measure": "F", "status": "ERROR"}
    ]
  }
}
```

In [None]:
# Schema Definition

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema for the inner device object
device_schema = StructType([
    StructField("deviceId", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("measure", StringType(), True),
    StructField("status", StringType(), True)
])

# Define schema for the main payload
json_schema = StructType([
    StructField("eventId", StringType(), True),
    StructField("eventTime", StringType(), True),
    StructField("data", StructType([
        StructField("devices", ArrayType(device_schema), True)
    ]), True)
])

print("Schema Defined.")
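
`from_json` also accepts the schema as a DDL-formatted string, which can be easier to scan for nested payloads. Below is a sketch of the same schema in that form. The variable names `device_ddl` and `json_schema_ddl` are assumptions, and the string has not been run against a live session here:

```python
# DDL equivalent of json_schema. Top-level fields use "name TYPE";
# fields nested inside STRUCT<...> use "name: TYPE".
device_ddl = (
    "STRUCT<deviceId: STRING, temperature: INT, "
    "measure: STRING, status: STRING>"
)
json_schema_ddl = (
    "eventId STRING, eventTime STRING, "
    f"data STRUCT<devices: ARRAY<{device_ddl}>>"
)

print(json_schema_ddl)
```

It would be passed in place of the `StructType`, as in `from_json(col("json_string"), json_schema_ddl)`.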

In [None]:
# Kafka Configuration
kafka_topic = "device-data"
kafka_bootstrap_servers = "localhost:29092" # Use "ed-kafka:9092" if running INSIDE Docker

# Read Stream
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()

# The Kafka source exposes 'key' and 'value' as binary columns, so cast 'value' to a string before parsing.
json_df = kafka_df.select(
    col("value").cast("string").alias("json_string")
)

print("Kafka Stream Initialized.")
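
The bootstrap address depends on where the notebook runs, so it can help to read it from the environment rather than edit the cell each time. This is a minimal sketch. The variable name `KAFKA_BOOTSTRAP_SERVERS` and the helper `resolve_bootstrap_servers` are assumptions, not something the Module 3 setup defines:

```python
import os


def resolve_bootstrap_servers(default: str = "localhost:29092") -> str:
    """Return the Kafka bootstrap address for this environment.

    KAFKA_BOOTSTRAP_SERVERS is a hypothetical variable name; set it to
    "ed-kafka:9092" when the notebook runs inside the Docker network.
    """
    return os.environ.get("KAFKA_BOOTSTRAP_SERVERS", default)


kafka_bootstrap_servers = resolve_bootstrap_servers()
print(kafka_bootstrap_servers)
```

With the variable unset, the function falls back to the host-side address used in the cell above.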

In [None]:
# 1. Parse JSON string into a Struct
parsed_df = json_df.select(
    from_json(col("json_string"), json_schema).alias("payload")
)

# 2. Explode the array (One row per device)
exploded_df = parsed_df.select(
    col("payload.eventId"),
    col("payload.eventTime"),
    explode(col("payload.data.devices")).alias("device")
)

# 3. Flatten the columns
flattened_df = exploded_df.select(
    col("eventId"),
    col("eventTime"),
    col("device.deviceId"),
    col("device.temperature"),
    col("device.measure"),
    col("device.status")
)
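
To make the steps concrete, here is a plain-Python sketch of what the cast, parse, explode, and flatten stages compute for a single Kafka message. It only illustrates the resulting row shape, not how Spark executes the query, and the function name `flatten_event` is hypothetical:

```python
import json


def flatten_event(value: bytes) -> list:
    """Mirror cast -> from_json -> explode -> select for one message."""
    # Cast the binary Kafka value to a string, then parse the JSON.
    payload = json.loads(value.decode("utf-8"))
    # A missing, null, or empty device array yields no rows, like explode().
    devices = (payload.get("data") or {}).get("devices") or []
    # One output row per device, with the event fields repeated.
    return [
        (
            payload.get("eventId"),
            payload.get("eventTime"),
            device.get("deviceId"),
            device.get("temperature"),
            device.get("measure"),
            device.get("status"),
        )
        for device in devices
    ]


message = (
    b'{"eventId": "e100", "eventTime": "2024-01-01 10:00:00", '
    b'"data": {"devices": ['
    b'{"deviceId": "D1", "temperature": 25, "measure": "C", "status": "OK"}, '
    b'{"deviceId": "D2", "temperature": 90, "measure": "F", "status": "WARN"}]}}'
)
for row in flatten_event(message):
    print(row)
# ('e100', '2024-01-01 10:00:00', 'D1', 25, 'C', 'OK')
# ('e100', '2024-01-01 10:00:00', 'D2', 90, 'F', 'WARN')
```

Note that `explode` drops events whose `devices` array is null or empty. If those events should still produce a row, `explode_outer` keeps them with a null `device`.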

In [None]:
# Write the result to the console to verify
query = flattened_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()

print("Streaming query started...")
print("Go to your terminal and produce data to 'device-data' topic.")
# Blocks this cell until the query stops or fails.
query.awaitTermination()
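
`awaitTermination()` with no argument keeps the cell busy until the query ends. For a quick check in a notebook, a bounded wait followed by an explicit stop can be more convenient. This sketch assumes `query` is the `StreamingQuery` returned by `start()`, and the helper name `run_for` is hypothetical:

```python
def run_for(query, seconds: int) -> bool:
    """Wait up to `seconds` for the query, then stop it if still active.

    Returns True if the query terminated on its own within the timeout.
    """
    try:
        # With a timeout, awaitTermination returns whether the query
        # terminated within that many seconds.
        finished = bool(query.awaitTermination(seconds))
    finally:
        if query.isActive:
            query.stop()
    return finished
```

For example, `run_for(query, 60)` would let the console sink print for about a minute before shutting the query down.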

## How to produce data?

1.  Open your terminal.
2.  Connect to the Kafka container:
    `docker exec -it ed-kafka /bin/bash`
3.  Start the console producer:
    `kafka-console-producer --topic device-data --bootstrap-server localhost:9092`
4.  **Paste this JSON as a single line and press Enter** (the console producer sends each line as a separate message):
    ```json
    {"eventId": "e100", "eventTime": "2024-01-01 10:00:00", "data": {"devices": [{"deviceId": "D1", "temperature": 25, "measure": "C", "status": "OK"}, {"deviceId": "D2", "temperature": 90, "measure": "F", "status": "WARN"}]}}
    ```
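5.  Watch the Jupyter output cell update with the flattened data rows. If nothing appears in the cell, check the terminal where Jupyter was started, since the console sink writes to the Spark driver's standard output.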
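
Because the console producer treats every line as one message, a pretty-printed payload would be split into several invalid records. Here is a small sketch for generating single-line test messages to paste into the producer. The helper name `make_event` and the sample values are illustrative assumptions:

```python
import json


def make_event(event_id: str, event_time: str, devices: list) -> str:
    """Serialize one event as a single-line JSON string."""
    event = {
        "eventId": event_id,
        "eventTime": event_time,
        "data": {"devices": devices},
    }
    # json.dumps emits no newlines unless `indent` is set.
    return json.dumps(event)


line = make_event(
    "e101",
    "2024-01-01 10:05:00",
    [{"deviceId": "D1", "temperature": 27, "measure": "C", "status": "OK"}],
)
print(line)
```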