# Spark Streaming with PySpark
## Module 5: Output Modes & Under the Hood

In this module, we dig deeper into how Spark executes a streaming query as a series of micro-batches and explore the output modes that control what gets written to a sink.

### Agenda
1.  **Under the Hood:** How Micro-batches work.
2.  **Optimization:** Tuning Shuffle Partitions for small data.
3.  **Output Modes:**
    *   **Complete:** Output the entire updated result table.
    *   **Update:** Output only the rows that are new or have changed since the last trigger.
    *   **Append:** Output only new rows (and why this is tricky with aggregations).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col

# 1. Initialize Spark Session
spark = SparkSession.builder \
    .appName("OutputModes_Demo") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

# 2. OPTIMIZATION: Shuffle Partitions
# By default, Spark creates 200 partitions when shuffling data (e.g., during groupBy).
# For small streaming data, this is overkill and slows down processing.
# We set it to a smaller number (e.g., 3 or 8) to speed up micro-batches.
spark.conf.set("spark.sql.shuffle.partitions", 3)

print("Spark Session Created with optimized partitions!")

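To see why 200 shuffle partitions is overkill for a tiny stream, here is a plain-Python sketch (not Spark code) that hashes a handful of distinct words into buckets. `zlib.crc32` is only a stand-in for Spark's internal hash function, and the word list and the `non_empty_partitions` helper are made up for illustration. The point is that five distinct keys can never fill more than five partitions, yet every partition still has to be scheduled as a task.

```python
import zlib

def non_empty_partitions(keys, num_partitions):
    """Count how many hash buckets receive at least one key."""
    # crc32 is an assumed stand-in for Spark's own hash function.
    used = {zlib.crc32(key.encode("utf-8")) % num_partitions for key in keys}
    return len(used)

words = ["cat", "dog", "bird", "fish", "cow"]  # hypothetical micro-batch keys

for n in (200, 3):
    used = non_empty_partitions(words, n)
    print(f"{n} partitions: {used} used, {n - used} empty")
```

With 200 partitions, at least 195 of them are empty for this input, which is the overhead the smaller setting avoids.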
## Defining the Streaming Logic
We will use the same **Word Count** logic as the previous module. The transformation logic remains constant; we will only change the **Output Mode** to see how it affects the result.

**Prerequisite:**
Make sure your Netcat server is running:
```bash
ncat -l 9999
```

In [None]:
# Transformation Logic

# 1. Read from Socket
lines_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# 2. Transform: Split and Explode
words_df = lines_df.select(
    explode(
        split(col("value"), " ")
    ).alias("word")
)

# 3. Aggregate: Count words
# Note: Aggregations require maintaining "State" across micro-batches.
word_counts_df = words_df.groupBy("word").count()

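As a rough mental model of the three steps above, the following plain-Python sketch (not Spark code) mimics what happens to each line: `split` turns it into a list, `explode` emits one row per list element, and the grouped count lives in state that survives between micro-batches. The `state` counter and the `process_batch` helper are hypothetical names used only for illustration.

```python
from collections import Counter

state = Counter()  # stands in for the state Spark keeps across micro-batches

def process_batch(lines):
    """Split each line on a single space, 'explode' into words, update counts."""
    words = [word for line in lines for word in line.split(" ")]
    state.update(words)
    return dict(state)

print(process_batch(["cat dog"]))    # first micro-batch
# {'cat': 1, 'dog': 1}
print(process_batch(["cat  bird"]))  # note the double space
# {'cat': 2, 'dog': 1, '': 1, 'bird': 1}
```

The empty-string key in the second result comes from splitting on a single space when the input contains two in a row. Spark's `split` with a `" "` pattern treats consecutive spaces the same way, so it is worth typing single spaces into Netcat.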
## Output Mode 1: Complete

*   **Behavior:** Spark outputs the **entire Result Table** to the console after every trigger.
*   **Use Case:** When you need the total counts of everything calculated so far.
*   **Observation:** If you type "cat" in Batch 1, and "cat" in Batch 2, the output in Batch 2 will show "cat: 2" along with all other previous words.
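The observation above can be reproduced with a small plain-Python model (not Spark code). The hypothetical `complete_mode` generator keeps a running count and, after every micro-batch, emits the whole result table, whether or not a given row changed.

```python
from collections import Counter

def complete_mode(batches):
    """Yield the full result table after each micro-batch."""
    totals = Counter()
    for batch in batches:
        totals.update(batch)
        yield dict(totals)  # every row so far, changed or not

batches = [["cat", "dog"], ["cat"]]
for batch_no, table in enumerate(complete_mode(batches), start=1):
    print(f"Batch {batch_no}: {table}")
# Batch 1: {'cat': 1, 'dog': 1}
# Batch 2: {'cat': 2, 'dog': 1}
```

Because every row is written on every trigger, the output grows with the number of distinct words seen so far.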

In [None]:
# Run this cell, type words in Netcat, and observe the console.
# Before running the next cell, interrupt this one and call query_complete.stop();
# interrupting the cell alone does not stop the streaming query.

query_complete = word_counts_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

query_complete.awaitTermination()

## Output Mode 2: Update

*   **Behavior:** Spark outputs **only the rows that were added or updated** in the current micro-batch.
*   **Use Case:** When you only care about what changed right now (e.g., sending alerts).
*   **Observation:**
    1. Type "cat dog" -> Output: "cat: 1, dog: 1"
    2. Type "cat" -> Output: "cat: 2" (You will NOT see "dog" in this output because "dog" count did not change).
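The same two inputs can be run through a plain-Python model of Update mode (not Spark code). The hypothetical `update_mode` generator still keeps the full running count, but it emits only the words touched by the current micro-batch.

```python
from collections import Counter

def update_mode(batches):
    """Yield only the rows whose count changed in each micro-batch."""
    totals = Counter()
    for batch in batches:
        totals.update(batch)
        changed = set(batch)  # keys touched by this micro-batch
        yield {word: totals[word] for word in sorted(changed)}

batches = [["cat", "dog"], ["cat"]]
for step, rows in enumerate(update_mode(batches), start=1):
    print(f"Step {step}: {rows}")
# Step 1: {'cat': 1, 'dog': 1}
# Step 2: {'cat': 2}
```

Note that the state is the same as in Complete mode; only the amount of output per trigger differs.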

In [None]:
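# Run this cell AFTER stopping the previous query with query_complete.stop().
# If Netcat exited when the previous query disconnected, restart it first.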

query_update = word_counts_df.writeStream \
    .format("console") \
    .outputMode("update") \
    .start()

query_update.awaitTermination()

## Output Mode 3: Append

*   **Behavior:** Spark outputs only **new rows** that are final and will never change.
*   **The Catch:** For aggregations (like Word Count), Spark **cannot** use Append mode without a **Watermark**.
*   **Why?** Spark cannot know when the count for "cat" is final, because more "cat" rows could arrive an hour later. Append mode never revises a row once it has been written, so without a watermark there is no point at which Spark can safely emit the count. Rather than waiting forever, Spark rejects the query with an `AnalysisException` when you call `start()`.
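To make the role of the watermark concrete, here is a simplified plain-Python model (not Spark code, and not the exact algorithm Spark uses). It assumes each word arrives with an event time in seconds, groups counts into 10-second windows, and treats a window as final once the watermark (the latest event time seen minus an allowed delay) has passed the window's end. `WINDOW`, `DELAY`, and `append_mode` are hypothetical names, and the model ignores details such as dropping late events.

```python
from collections import Counter

WINDOW = 10  # window length in seconds (assumed for illustration)
DELAY = 5    # watermark delay in seconds (assumed for illustration)

def append_mode(batches):
    """Yield finalized (window_start, word, count) rows after each micro-batch."""
    open_counts = Counter()  # counts for windows that may still change
    max_event_time = 0
    for batch in batches:
        for event_time, word in batch:
            window_start = event_time - event_time % WINDOW
            open_counts[(window_start, word)] += 1
            max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - DELAY
        # A window is final once the watermark has passed its end.
        done = sorted(key for key in open_counts if key[0] + WINDOW <= watermark)
        yield [(start, word, open_counts.pop((start, word))) for start, word in done]

batches = [
    [(1, "cat"), (4, "dog")],  # window [0, 10) is still open
    [(8, "cat")],              # watermark is 3, still open
    [(17, "cat")],             # watermark is 12, so window [0, 10) is emitted
]
for step, rows in enumerate(append_mode(batches), start=1):
    print(f"Step {step}: {rows}")
# Step 1: []
# Step 2: []
# Step 3: [(0, 'cat', 2), (0, 'dog', 1)]
```

Each row is written exactly once, after its window can no longer change. Without a watermark the `done` list would always be empty, which is why Spark refuses to start the query in the first place. In PySpark the watermark is declared with `DataFrame.withWatermark` on an event-time column.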

*We will utilize Append mode in future modules when we handle non-aggregated data (like ETL) or when we implement Watermarking.*

### Summary of Output Modes

| Mode | Behavior | Best For |
| :--- | :--- | :--- |
| **Complete** | Rewrites the whole table. | Dashboards, Total Aggregates. |
| **Update** | Writes only changed rows. | Alerts, DB Updates, Push Notifications. |
| **Append** | Writes only new rows (immutable). | Storing raw data, Log files. |