# Spark Streaming with PySpark
## Module 2: Streaming Basics & Architecture

In this module, we dive into the fundamental concepts that power Spark Structured Streaming. We will understand how Spark handles real-time data and the core differences between Batch and Streaming architectures.

### Agenda
1.  **Batch vs. Streaming:** Understanding the key differences.
2.  **Architecture:** How Structured Streaming works under the hood (Micro-batches).
3.  **The "Unbounded Table" Concept.**
4.  **The 4 Pillars of Streaming:** What, How, When, and Where.
5.  **Practical Demo:** A real-time Word Count application using Netcat.

## 1. Batch vs. Streaming

Before writing code, it is crucial to understand *when* to use streaming.

| Feature | Batch Processing | Stream Processing |
| :--- | :--- | :--- |
| **Data Size** | Large, bounded dataset. | Unbounded sequence of small records. |
| **Schedule** | Fixed intervals (Daily, Weekly). | Continuous (24/7), Real-time. |
| **Latency** | High (Minutes to Hours). | Low (Seconds to Milliseconds). |
| **Example** | Nightly ETL, Monthly Reporting. | Fraud Detection, Sensor Monitoring. |

### Why use Spark for Streaming?
*   **Unified API:** You use largely the same DataFrame API for streaming as you do for batch (a few operations are restricted on streaming DataFrames). If you know Spark SQL, you already know most of Structured Streaming.
*   **Scalability:** Handles high-volume throughput.
*   **Fault Tolerance:** Built-in recovery mechanisms.

## 2. How Structured Streaming Works

By default, Structured Streaming does not process data one record at a time, as engines such as Storm or Flink do. Instead, it uses a **micro-batch architecture**.

1.  **Input:** Data arrives continuously (e.g., from Kafka or a Socket).
2.  **Micro-Batching:** Spark chops this continuous stream into small chunks called "Micro-batches" (e.g., every 1 second of data).
3.  **Processing:** The Spark Engine processes each small batch using the standard Spark SQL engine.
4.  **Output:** The results are written to the output sink according to the query's output mode.
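
The micro-batching step can be pictured with a few lines of plain Python. This is only a sketch under simplifying assumptions: the `to_micro_batches` helper and the sample `events` are hypothetical, and real Structured Streaming tracks source offsets rather than grouping records by arrival timestamp.

```python
def to_micro_batches(events, interval_seconds):
    """Group (arrival_time, value) pairs by trigger interval.

    Interval k holds every event that arrived in
    [k * interval_seconds, (k + 1) * interval_seconds).
    """
    batches = {}
    for arrival_time, value in sorted(events):
        interval_index = int(arrival_time // interval_seconds)
        batches.setdefault(interval_index, []).append(value)
    return batches


# Hypothetical arrival times (in seconds) and payloads
events = [(0.2, "cat"), (0.7, "dog"), (1.1, "cat"), (2.5, "bird"), (2.9, "dog")]
batches = to_micro_batches(events, interval_seconds=1)

for interval_index, values in batches.items():
    # Each small list is handed to the engine as one ordinary batch job
    print(f"micro-batch {interval_index}: {values}")
```

A longer interval produces fewer, larger batches, which is the usual trade-off between latency and per-batch overhead.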

### The "Unbounded Table"
Think of your data stream not as a queue, but as an **Input Table that never stops growing**.
*   Every new data item is just a new row appended to this table.
*   Spark runs your query on this "Unbounded Table" continuously.
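
To make the idea concrete, here is a small plain-Python sketch (not Spark code). It assumes a word-count query and uses two hypothetical helpers, `full_query` and `incremental_query`, to show that folding only the new rows into the previous result gives the same answer as re-running the query over the whole table. Spark takes the incremental route, so it does not need to keep the entire input table around.

```python
from collections import Counter


def full_query(table):
    """Batch-style query: count every word in the whole table."""
    return Counter(table)


def incremental_query(previous_counts, new_rows):
    """Streaming-style query: fold only the new rows into the prior result."""
    updated = Counter(previous_counts)
    updated.update(new_rows)
    return updated


unbounded_table = []        # the input table that only ever grows
running_counts = Counter()  # the result carried from one arrival to the next

for new_rows in (["cat", "dog"], ["cat"], ["bird", "dog"]):
    unbounded_table.extend(new_rows)
    running_counts = incremental_query(running_counts, new_rows)
    # Both views of the computation agree after every arrival
    assert running_counts == full_query(unbounded_table)

print(dict(running_counts))
```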

## 3. The 4 Pillars of a Streaming Query

Every Structured Streaming application answers four basic questions:

1.  **WHAT (Input Sources):** Where is the data coming from?
    *   *Examples: Kafka, Files (CSV/JSON), Socket (for testing).*
2.  **HOW (Transformations):** What logic are we applying?
    *   *Examples: Filtering, Grouping, Mapping (Standard DataFrame operations).*
3.  **WHEN (Triggers):** How often should we process the data?
    *   *Examples: Every 1 second, "AvailableNow", or continuous.*
4.  **WHERE (Output Sinks):** Where should the results go?
    *   *Examples: Console, File, Kafka, Database.*

## 4. Practical Demo: Socket Word Count

We will build a classic **Word Count** streaming app. We will type words into a terminal, and Spark will count them in real time.

### **Step 1: Open a Terminal**
We need a source to send data. We will use `netcat` (a utility to read/write network connections).
1.  Open your command prompt or terminal.
2.  Run the following command to start a server on port 9999:
    ```bash
    nc -lk 9999
    ```
    *(Note: On Windows, you may need to install Nmap's `ncat` or run the command inside WSL.)*
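
If `netcat` is not available, a few lines of standard-library Python can play the same role. The sketch below is an assumption-laden stand-in, not a full replacement: `serve_lines` is a hypothetical helper that accepts a single client and sends it newline-terminated UTF-8 text, which is the kind of input the socket source reads. The round trip at the bottom uses a throwaway port purely to show that the lines arrive intact.

```python
import socket
import threading


def serve_lines(server, lines):
    """Wait for one client, then send it each line terminated by a newline."""
    connection, _address = server.accept()
    with connection:
        for line in lines:
            payload = line.rstrip("\r\n") + "\n"
            connection.sendall(payload.encode("utf-8"))


# Demo round trip on an ephemeral port (port 0 lets the OS pick a free one)
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]

sender = threading.Thread(target=serve_lines, args=(server, ["cat dog cat", "dog"]))
sender.start()

# The client below stands in for Spark's socket source
with socket.create_connection(("127.0.0.1", port)) as client:
    received = b""
    while chunk := client.recv(1024):
        received += chunk

sender.join()
server.close()
print(received.decode("utf-8"))
```

To feed the demo by typing, one option is to call `serve_lines(socket.create_server(("localhost", 9999)), sys.stdin)` in a separate terminal (after `import sys`). Unlike `nc -lk`, this sketch serves only one connection and then exits.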

### **Step 2: Run the Code Below**
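Once the terminal is listening, run the PySpark code below. Then type words into the terminal (e.g., "cat dog cat") and press Enter. The updated counts are printed after each trigger. Note that the console sink writes to the driver's standard output, so in Jupyter the tables may appear in the terminal that launched the notebook server rather than under the cell.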

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

# 1. Initialize Spark Session
spark = SparkSession.builder \
    .appName("Streaming_Word_Count") \
    .master("local[2]") \
    .getOrCreate()

# 2. WHAT: Read from the Socket (Input Source)
# We subscribe to the localhost port 9999 where we are typing words
lines_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# 3. HOW: Transform the data
# The input comes in a column named "value". We split it by space and explode it into words.
words_df = lines_df.select(
    explode(
        split(col("value"), " ")
    ).alias("word")
)

# Perform the aggregation (Count occurrences)
word_counts_df = words_df.groupBy("word").count()

# 4. WHEN & WHERE: Write to Console (Trigger & Output)
# OutputMode "complete" means we rewrite the entire table of counts every time.
query = word_counts_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime="2 seconds") \
    .start()

# Wait for the stream to finish (or stop it manually)
query.awaitTermination()
```
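
To see what the split, explode, and count steps produce without starting Spark, here is a plain-Python approximation. The `explode_words` helper and the sample `lines` are hypothetical, and `str.split(" ")` is used as a stand-in for splitting on a single-space pattern. It also exposes a small gotcha: two consecutive spaces yield an empty-string token, which then gets counted as a word.

```python
from collections import Counter


def explode_words(lines):
    """Mimic split on a single space followed by explode: one row per token."""
    return [token for line in lines for token in line.split(" ")]


lines = ["cat dog cat", "dog  bird"]  # note the double space in the second line
words = explode_words(lines)
counts = Counter(words)

# Dropping empty tokens is one possible clean-up step before counting
clean_counts = Counter(word for word in words if word != "")

print(words)
print(dict(counts))
print(dict(clean_counts))
```

In the PySpark version, a comparable clean-up would be a filter such as `words_df.filter(col("word") != "")`, or splitting on a pattern like `"\\s+"`, since `split` treats its pattern as a regular expression.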

### Analysis of the Code

In the code above, you saw `.outputMode("complete")`. Structured Streaming supports three output modes, which determine **how** the Result Table is written to the sink:

1.  **Complete Mode:** The *entire* updated Result Table is written to the sink. Useful for aggregations (like our Word Count).
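2.  **Append Mode:** Only *new* rows added to the Result Table since the last trigger are written. Suitable for queries whose existing result rows never change, such as simple transformations; an aggregation can use this mode only when a watermark is defined.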
3.  **Update Mode:** Only the rows that were *updated* in the last trigger are written.
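
The difference between the modes is easiest to see on two consecutive snapshots of the word-count Result Table. The sketch below is plain Python, not Spark: `rows_written` is a hypothetical helper, and the snapshots are made-up values. Append mode is modelled as an error because running counts keep changing, which mirrors Spark rejecting append mode for a streaming aggregation without a watermark.

```python
def rows_written(previous, current, mode):
    """Return the rows a sink would receive for one trigger of a word count."""
    if mode == "complete":
        # The whole Result Table, changed or not
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose count changed since the last trigger
        return {word: n for word, n in current.items() if previous.get(word) != n}
    if mode == "append":
        # Append requires rows that never change; running counts do change
        raise ValueError("append mode does not fit an aggregation without a watermark")
    raise ValueError(f"unknown output mode: {mode}")


previous = {"cat": 1, "dog": 1}            # Result Table after the last trigger
current = {"cat": 2, "dog": 1, "bird": 1}  # Result Table after this trigger

print("complete:", rows_written(previous, current, "complete"))
print("update:  ", rows_written(previous, current, "update"))
```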

In the next notebook, we will explore these modes in depth and start working with file sources.