# Spark Streaming with PySpark
## Module 12: Writing to Multiple Sinks

A common requirement in data engineering is to write the same streaming data to multiple destinations. For example:
1.  **Data Lake (Parquet):** For long-term storage and historical analysis.
2.  **Database (Postgres):** For real-time dashboards and low-latency queries.

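A streaming query started with `writeStream` writes to exactly one sink. To send the same data to several sinks from a single query, we use the **`foreachBatch`** API: Spark hands each micro-batch to a Python function as a regular (static) DataFrame, so any batch writer can be used inside that function.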

### Prerequisites
*   **Postgres Database:** We have a Postgres container running in our Docker setup.
*   **JDBC Driver:** We need to load the Postgres JDBC driver jar.

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, explode
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

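# Define packages: Kafka + Postgres JDBC
# Note: The Kafka connector's Scala suffix (_2.12) and version (3.3.0) must match
# your Spark installation. The Postgres JDBC jar must be compatible with your server.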
packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    "org.postgresql:postgresql:42.2.18"
]

spark = SparkSession.builder \
    .appName("Multiple_Sinks_Demo") \
    .master("local[*]") \
    .config("spark.jars.packages", ",".join(packages)) \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print("Spark Session with Kafka & Postgres support created!")
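
The Kafka connector coordinate encodes both the Scala build and the Spark version, so it can be derived instead of hard-coded. A minimal sketch, assuming a hypothetical helper named `kafka_connector_coordinate` and a Scala 2.12 build of Spark (check your own installation before relying on that default):

```python
def kafka_connector_coordinate(spark_version, scala_version="2.12"):
    """Build the Maven coordinate for the Spark Kafka source.

    Hypothetical helper for illustration. scala_version="2.12" is an
    assumption and must match the Scala build of your Spark distribution.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# With the versions used in this module:
packages = [
    kafka_connector_coordinate("3.3.0"),
    "org.postgresql:postgresql:42.2.18",
]
print(",".join(packages))
```

In a live environment, `pyspark.__version__` can be passed as `spark_version` so the connector always follows the installed PySpark release.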

In [None]:
# --- Schema Definition ---
device_schema = StructType([
    StructField("deviceId", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("measure", StringType(), True),
    StructField("status", StringType(), True)
])
json_schema = StructType([
    StructField("eventId", StringType(), True),
    StructField("eventTime", StringType(), True),
    StructField("data", StructType([
        StructField("devices", ArrayType(device_schema), True)
    ]), True)
])

# --- Read Stream ---
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:29092") \
    .option("subscribe", "device-data") \
    .option("startingOffsets", "latest") \
    .load()

# --- Transform ---
flattened_df = kafka_df.select(col("value").cast("string").alias("json_string")) \
    .select(from_json(col("json_string"), json_schema).alias("payload")) \
    .select(
        col("payload.eventId"),
        col("payload.eventTime"),
        explode(col("payload.data.devices")).alias("device")
    ).select(
        "eventId", "eventTime", "device.deviceId", "device.temperature", "device.measure", "device.status"
    )
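
To make the effect of `from_json` plus `explode` concrete, here is a plain-Python sketch (no Spark involved) that flattens one message the same way: one output row per entry in `data.devices`. The sample payload and the helper name `flatten_event` are made up for illustration; only the field names come from the schema above.

```python
import json

# Hypothetical sample message matching json_schema; the values are invented.
sample_message = json.dumps({
    "eventId": "e-1",
    "eventTime": "2024-01-01T00:00:00Z",
    "data": {
        "devices": [
            {"deviceId": "D001", "temperature": 21, "measure": "C", "status": "SUCCESS"},
            {"deviceId": "D002", "temperature": 70, "measure": "F", "status": "ERROR"},
        ]
    },
})


def flatten_event(raw_json):
    """Return one flat dict per device, mirroring the explode + select above."""
    payload = json.loads(raw_json)
    devices = (payload.get("data") or {}).get("devices") or []
    return [
        {
            "eventId": payload.get("eventId"),
            "eventTime": payload.get("eventTime"),
            "deviceId": device.get("deviceId"),
            "temperature": device.get("temperature"),
            "measure": device.get("measure"),
            "status": device.get("status"),
        }
        for device in devices
    ]


rows = flatten_event(sample_message)
for row in rows:
    print(row)
```

As with `explode`, a message whose `devices` array is empty or missing produces no rows at all.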

In [None]:
# This function runs for every micro-batch.
# 'batch_df' is a standard static DataFrame containing data for this specific batch.
# 'batch_id' is a unique ID for the batch.

def write_to_multiple_sinks(batch_df, batch_id):
    # Cache the batch: count() and each write below is a separate Spark action,
    # and without caching each one would re-read and re-transform the batch.
    batch_df.persist()
    try:
        print(f"Processing Batch ID: {batch_id} with {batch_df.count()} records")

        # 1. Write to PARQUET (Data Lake)
        # We use standard batch write API here (.write, NOT .writeStream)
        batch_df.write \
            .format("parquet") \
            .mode("append") \
            .save("data/output/device_data_parquet")

        # 2. Write to POSTGRES (Database)
        # In append mode Spark creates the table 'device_data' if it does not exist yet.
        # JDBC URL: jdbc:postgresql://localhost:5432/postgres (mapped port from Docker)

        # Note: Replace with your actual Postgres credentials/details from Docker Compose
        jdbc_url = "jdbc:postgresql://localhost:5432/postgres"
        jdbc_properties = {
            "user": "postgres",
            "password": "password", # Default password in many docker images
            "driver": "org.postgresql.Driver"
        }

        batch_df.write \
            .jdbc(url=jdbc_url, table="device_data", mode="append", properties=jdbc_properties)

        print("Batch written to Parquet and Postgres successfully.")
    finally:
        # Release the cached batch whether or not the writes succeeded.
        batch_df.unpersist()
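
`foreachBatch` gives at-least-once guarantees by default: if a batch fails partway (say, after the Parquet write but before the Postgres write), Spark reruns the whole function with the same `batch_id`, and plain `append` writes would then duplicate rows. One common way to make the Parquet side safe to retry is to give each batch its own directory and overwrite it. The sketch below assumes a hypothetical `batch_id=<n>` directory layout and hypothetical helper names; it is one option among several, not the only approach.

```python
def batch_output_path(base_path, batch_id):
    """Directory for one micro-batch (hypothetical layout: <base>/batch_id=<n>)."""
    return f"{base_path.rstrip('/')}/batch_id={batch_id}"


def write_parquet_idempotently(batch_df, batch_id,
                               base_path="data/output/device_data_parquet"):
    """Overwrite this batch's own directory, so a retried batch replaces
    its earlier output instead of appending duplicates."""
    path = batch_output_path(base_path, batch_id)
    batch_df.write.mode("overwrite").parquet(path)
    return path


print(batch_output_path("data/output/device_data_parquet", 7))
```

Reading the base path back with `spark.read.parquet(...)` then exposes `batch_id` as a partition column. The Postgres side needs its own strategy, for example storing `batch_id` in the table and skipping batches that are already present.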

In [None]:
print("Starting Multiple Sink Stream...")

# Note: We pass the function NAME to foreachBatch (without parentheses)
query = flattened_df.writeStream \
    .foreachBatch(write_to_multiple_sinks) \
    .option("checkpointLocation", "data/checkpoint_multi_sink") \
    .trigger(processingTime="10 seconds") \
    .start()

query.awaitTermination()
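
`query.awaitTermination()` blocks the cell until the query stops or fails. In a notebook it can be convenient to stop the stream cleanly when the cell is interrupted. A small sketch, assuming a hypothetical wrapper named `run_until_interrupted` and that the interrupt reaches Python as a `KeyboardInterrupt`:

```python
def run_until_interrupted(query):
    """Block on a streaming query and stop it cleanly on Ctrl+C or a kernel interrupt.

    Hypothetical wrapper for illustration; `query` is the StreamingQuery
    returned by .start().
    """
    try:
        query.awaitTermination()
    except KeyboardInterrupt:
        print("Interrupt received, stopping the stream...")
    finally:
        # Also runs when the query terminated on its own or raised an error.
        query.stop()
```

It would be called as `run_until_interrupted(query)` in place of the bare `query.awaitTermination()`. On restart, the checkpoint in `data/checkpoint_multi_sink` lets the query resume from where it left off.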

## Verification

1.  **Check Parquet:** Look into the folder `data/output/device_data_parquet`. You should see `.parquet` files appearing.
2.  **Check Postgres:** Connect to your Postgres container/client and query the table:
    ```sql
    SELECT * FROM device_data;
    ```
3.  **Check Console:** The `print` statements inside `write_to_multiple_sinks` run once per micro-batch, so the Jupyter cell output or the terminal log should show a line for each batch as it is processed.
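
The Parquet check can also be scripted. A minimal sketch using only the standard library, with a hypothetical helper name `list_parquet_files`; it only confirms that part files exist, not what they contain:

```python
from pathlib import Path


def list_parquet_files(output_dir):
    """Return the Parquet part files under output_dir, sorted by path."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(root.rglob("*.parquet"))


files = list_parquet_files("data/output/device_data_parquet")
print(f"Found {len(files)} Parquet file(s)")
```

To inspect the rows themselves, the same folder can be read back with `spark.read.parquet("data/output/device_data_parquet").show()`.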