# Spark Streaming with PySpark
## Module 6: Data Processing Architectures

Before building complex pipelines, it is essential to understand the architectural patterns used in the industry. Today, we look at the two giants of Big Data Architecture: **Lambda** and **Kappa**.

### Why does this matter?
How you structure a data pipeline determines:
1.  **Latency:** How quickly new data becomes available for queries.
2.  **Accuracy:** How closely the results match the complete, correct data.
3.  **Maintenance:** How hard it is to change and operate the code.

## 1. Lambda Architecture
The "Classic" approach to Big Data.

### Concept
Lambda architecture divides the system into two distinct processing layers, a batch layer and a speed layer, whose outputs are combined by a serving layer:
1.  **Batch Layer (Cold Path):**
    *   Processes all historical data.
    *   High latency, high accuracy.
    *   *Engine Example:* Apache Spark (Batch), Hadoop MapReduce.
2.  **Speed Layer (Hot Path):**
    *   Processes real-time data only.
    *   Low latency, potentially lower accuracy (approximation).
    *   *Engine Example:* Spark Streaming, Storm.
3.  **Serving Layer:** Merges views from both layers for the final query.
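
The three layers above can be sketched in plain Python. This is only an illustration of the idea, not Spark code: the event log, the `BATCH_CUTOFF` value, and the function names are all hypothetical, and a real system would run each layer on a separate engine.

```python
from collections import Counter

# Hypothetical event log of (event_id, page) pairs; the data is illustrative only.
events = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]

# Assumption: the most recent batch run covered events with id <= 3.
BATCH_CUTOFF = 3


def batch_view(log, cutoff):
    """Batch layer: recompute counts from all history up to the cutoff."""
    return Counter(page for event_id, page in log if event_id <= cutoff)


def speed_view(log, cutoff):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(page for event_id, page in log if event_id > cutoff)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed


merged = serve(batch_view(events, BATCH_CUTOFF), speed_view(events, BATCH_CUTOFF))
print(dict(merged))
```

Note that `batch_view` and `speed_view` express the same counting logic twice. In a real Lambda system these would typically live in two separate codebases that have to be kept in sync.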

### Pros & Cons
| Pros | Cons |
| :--- | :--- |
| **Robust:** The batch layer eventually corrects errors made by the speed layer. | **Complexity:** You maintain **two** codebases (batch logic + streaming logic). |
| **Reliable:** Fits well alongside existing batch-oriented (legacy) systems. | **Data Discrepancy:** The logic might differ slightly between layers, so their results can disagree. |

## 2. Kappa Architecture
The "Modern" approach (and the focus of this course).

### Concept
Kappa architecture simplifies the system by treating **everything as a stream**.
1.  **Single Layer (Speed Layer):**
    *   There is **no** separate batch layer.
    *   Real-time data is processed as it arrives.
    *   Historical data is treated as a "bounded stream" and processed through the *same* engine.
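
As a rough sketch of this idea in plain Python (not Spark code), the same function below processes both a full replay of history and only the most recent entries. The `LOG` list, the offsets, and the function names are hypothetical stand-ins for a replayable source.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical append-only log standing in for a replayable source;
# each entry is (offset, value).
LOG: List[Tuple[int, str]] = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a")]


def read_from(log: List[Tuple[int, str]], start_offset: int) -> Iterator[str]:
    """Yield values from the log, starting at the given offset."""
    for offset, value in log:
        if offset >= start_offset:
            yield value


def running_counts(stream: Iterable[str]) -> Counter:
    """The single piece of processing logic, used for history and new data alike."""
    counts: Counter = Counter()
    for value in stream:
        counts[value] += 1
    return counts


# Reprocessing history: replay the whole log from offset 0.
full_replay = running_counts(read_from(LOG, 0))

# Processing only newer entries: the same function, a later starting offset.
recent_only = running_counts(read_from(LOG, 3))
```

The only thing that changes between the two calls is the starting offset. There is no second implementation of `running_counts` to keep consistent.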

### How Spark Enables Kappa
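Spark Structured Streaming is built on the same DataFrame API as batch Spark, so the two share a **Unified API**. The transformation code for a batch file is almost identical to the code for a real-time stream; the main differences are the entry point (`spark.read` versus `spark.readStream`) and how results are written out. This makes implementing Kappa architecture very natural in Spark.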

### Pros & Cons
| Pros | Cons |
| :--- | :--- |
| **Simplicity:** One codebase to maintain. | **Complexity in Reprocessing:** Replaying history requires resetting the stream offset, and the source must retain enough history to replay. |
| **Consistency:** Same logic applied to historic and real-time data. | **Out-of-Order Data:** Requires handling "Late Data" (solved by Watermarking). |
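
To make the "Late Data" problem concrete, here is a simplified plain-Python sketch of the watermarking idea. It is an assumption-laden illustration: the window size, the allowed lateness, and the per-event watermark update are chosen for clarity and do not reproduce how Spark computes watermarks internally.

```python
from collections import defaultdict

WINDOW_SECONDS = 10    # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 5   # how far behind the latest event time we still accept data


def count_with_watermark(events):
    """Count events per tumbling window, dropping events behind the watermark.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    Returns (window_counts, dropped_events).
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # The watermark trails the largest event time seen so far.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end <= watermark:
            # The window this event belongs to is already considered final.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Events arrive out of order: 8 is late but tolerated, 3 is too late.
arrivals = [(1, "a"), (12, "a"), (8, "a"), (21, "a"), (3, "a")]
window_counts, dropped_events = count_with_watermark(arrivals)
```

Here the event at time 8 arrives after the event at time 12 but is still counted, because its window (0 to 10) has not yet fallen behind the watermark. The event at time 3 arrives after time 21 has been seen, so its window is closed and the event is dropped.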

## 3. Comparison & Strategy

| Feature | Lambda | Kappa |
| :--- | :--- | :--- |
| **Pipelines** | Two (Batch + Stream) | One (Stream only) |
| **Code Duplication** | High | None |
| **Use Case** | Complex historical corrections | Modern streaming ETL |

### Our Approach
In this course, we lean towards **Kappa Architecture**.
*   We write code that handles data as a stream.
*   We will learn to handle the challenges of Kappa (like **Out-of-Order data**) using **Watermarking** in upcoming modules.

In [None]:
from pyspark.sql import SparkSession

# This cell demonstrates how Spark supports Kappa Architecture through its unified API.
# The transformation logic stays the same; only the way the input is read changes.

spark = SparkSession.builder.appName("Architecture_Demo").getOrCreate()

# --- Hypothetical Code Structure ---

def process_data(df):
    """
    This transformation logic applies to BOTH Batch and Stream.
    This is the core of Kappa Architecture: Write Once, Run Anywhere.
    """
    return df.groupBy("value").count()

# 1. Batch Read (Lambda Cold Path)
# df_batch = spark.read.text("path/to/history")
# final_batch = process_data(df_batch)

# 2. Streaming Read (Kappa / Lambda Hot Path)
# df_stream = spark.readStream.format("socket")...
# final_stream = process_data(df_stream)

print("Spark's DataFrame API unifies Batch and Streaming, enabling Kappa Architecture.")