# Spark Streaming with PySpark
## Module 4: Your First Streaming Application (Word Count)

In this module, we write our first real-time data processing application. To show how little code changes between batch and streaming in Spark Structured Streaming, we take a two-step approach:

1.  **Solve the problem in Batch:** We will write the code to count words from a static text file.
2.  **Convert to Streaming:** We will change just a few lines of code to make it process real-time data from a Socket.

### The Objective: Word Count
We want to read lines of text, split them into individual words, and count the occurrence of each word.

### Prerequisites
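*   **Netcat (`ncat`):** A utility that opens a network socket from your terminal, so the lines you type become a data stream.
    *   *Linux/Mac:* Usually pre-installed; if not, install it (for example, `sudo apt-get install netcat` on Debian/Ubuntu).
    *   *Windows:* Included in the Docker container setup from Module 3.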

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("WordCount_Socket_Stream") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print("Spark Session Created Successfully!")
```

## Part 1: Solving in Batch Mode

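First, let's solve the problem using static data. Imagine we have a file `input.txt` with the sentence:
> *"Hello world Hello Spark"*

To keep the example self-contained, the code in this part builds the same kind of data in memory with `createDataFrame` instead of reading a file. Its single column is named `value`, which is also the column name that `spark.read.text("input.txt")` produces for a real file.

**Logic:**
1.  **Read:** Load the text, one line per row.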
2.  **Split:** Break the sentence into a list of words: `["Hello", "world", "Hello", "Spark"]`.
3.  **Explode:** Convert the list into separate rows.
4.  **Count:** Group by the word and count.
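
The four steps can be traced in plain Python before involving Spark. This is only an illustrative sketch: the names `lines`, `word_lists`, `words`, and `counts` are chosen here, and `collections.Counter` stands in for Spark's `groupBy(...).count()`.

```python
from collections import Counter

# 1. Read: the lines a text file would contain (hard-coded here for illustration)
lines = ["Hello world Hello Spark"]

# 2. Split: each line becomes a list of words
word_lists = [line.split(" ") for line in lines]

# 3. Explode: flatten the lists so that each word is its own "row"
words = [word for word_list in word_lists for word in word_list]

# 4. Count: group identical words and count them
counts = Counter(words)

print(word_lists)    # [['Hello', 'world', 'Hello', 'Spark']]
print(words)         # ['Hello', 'world', 'Hello', 'Spark']
print(dict(counts))  # {'Hello': 2, 'world': 1, 'Spark': 1}
```

The Spark version follows the same shape, except that each step is a DataFrame transformation rather than a Python list operation.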

```python
# --- 1. Create Dummy Data for Batch Testing ---
data = [("Hello world Hello Spark",), ("Spark Streaming is easy",)]
schema = ["value"]
df_batch = spark.createDataFrame(data, schema)

print("Input Batch Data:")
df_batch.show(truncate=False)

# --- 2. Transformation Logic ---

# A. Split the lines into words (creates an array)
df_split = df_batch.withColumn("words", split(col("value"), " "))

# B. Explode the array into rows
df_exploded = df_split.select(explode(col("words")).alias("word"))

# C. Aggregation (Count)
df_count = df_exploded.groupBy("word").count()

print("Batch Word Count Result:")
df_count.show()
```
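
One caveat about step A: splitting on a single space yields empty strings when the input contains consecutive or trailing spaces, and each empty string would then be counted as a "word". The plain-Python sketch below shows the effect and one possible fix. The helper name `tokenize` is made up for this illustration. In Spark, the analogous change would presumably be a whitespace regex such as `"\\s+"` as the `split` pattern plus a filter on empty strings, which you should verify against your own data.

```python
import re

line = "Hello  world "  # note the double space and the trailing space

# Splitting on a single space keeps the empty strings between separators
naive = line.split(" ")
print(naive)  # ['Hello', '', 'world', '']

def tokenize(text):
    """Split on runs of whitespace and drop empty tokens (hypothetical helper)."""
    return [token for token in re.split(r"\s+", text) if token]

print(tokenize(line))  # ['Hello', 'world']
```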

## Part 2: Converting to Streaming

Now, let's convert the code above to handle real-time data.

### The Conversion Steps:
1.  **Read:** Replace the batch source (`spark.read`, or `createDataFrame` in our test data) with `spark.readStream`.
2.  **Source:** Change format to `"socket"` and specify host/port.
3.  **Logic:** **NO CHANGE!** The logic for Split, Explode, and Count remains exactly the same.
4.  **Write:** Change `df.show()` → `df.writeStream`.
5.  **Output:** Specify output mode (`complete`) and sink (`console`).

### Before Running: Start Netcat
Open your terminal (or the terminal inside your Docker container) and run:
```bash
ncat -l 9999
```
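
If Netcat is not available, a few lines of Python can play a similar role. The sketch below is a stand-in written for this explanation, not part of the course setup: the function name `serve_lines` is invented here, it accepts a single client, and it sends a fixed list of lines instead of reading your keyboard. The small client at the end only plays the part that Spark's socket source would play.

```python
import socket
import threading

HOST = "127.0.0.1"

def serve_lines(server, lines):
    """Accept one client on a listening socket and send it each line, newline-terminated."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))

# Port 0 asks the OS for any free port so this demo cannot clash with a running
# Netcat; for the Spark job in this module you would bind to port 9999 instead.
server = socket.create_server((HOST, 0))
port = server.getsockname()[1]

sender = threading.Thread(target=serve_lines, args=(server, ["hello spark", "hello world"]))
sender.start()

# Stand-in for the streaming reader: connect and read until the server closes the connection
with socket.create_connection((HOST, port)) as client:
    received = b""
    while chunk := client.recv(1024):
        received += chunk

sender.join()
server.close()
print(received.decode("utf-8").splitlines())  # ['hello spark', 'hello world']
```

Each line is terminated with a newline because the socket is read as a stream of text lines, one row per line.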

### Streaming Implementation

```python
# --- 1. Read Stream from Socket ---
# We connect to localhost:9999
lines_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# --- 2. Transformation Logic (EXACTLY SAME AS BATCH) ---

# A. Split lines into words
words_df = lines_df.select(
    explode(
        split(col("value"), " ")
    ).alias("word")
)

# B. Aggregation
word_counts_df = words_df.groupBy("word").count()

# --- 3. Write Stream to Console ---
# We use OutputMode "complete" to see the total count updated every time
query = word_counts_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

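# Block until the query is stopped manually (e.g., by interrupting the kernel).
# In the Netcat terminal, type words like "hello spark" and press Enter, then check the console output.
# Note: in a notebook, console-sink output may appear in the terminal running Jupyter rather than in the cell.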
query.awaitTermination()
```

## Understanding Output Mode: "Complete"

In the code above, we used `.outputMode("complete")`.

*   **Scenario:** We are doing an aggregation (`count`).
*   **Behavior:** Every time new data arrives (e.g., you type "hello"), Spark updates its running counts and prints the *entire* result table, covering *all* words seen so far, to the console.
*   **Result:** You will see the count for "hello" increase every time you type it.
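
The behavior described in this list can be imitated in plain Python. This is a simplified model, not Spark's implementation: `running_counts`, `process_batch`, and the two sample batches are invented for illustration, and each call stands for one trigger of the streaming query.

```python
from collections import Counter

running_counts = Counter()  # state kept across batches, like the streaming aggregation

def process_batch(lines):
    """Update the running counts with new lines and return the full result table."""
    for line in lines:
        running_counts.update(line.split(" "))
    # "complete" mode: emit every word seen so far, not only the ones that changed
    return dict(running_counts)

first = process_batch(["hello spark"])
second = process_batch(["hello"])

print(first)   # {'hello': 1, 'spark': 1}
print(second)  # {'hello': 2, 'spark': 1}
```

Note that the second table still contains `spark`, even though the second batch did not mention it. That is the defining trait of `complete` mode.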

*We will explore other modes like `append` and `update` in future modules.*