# Spark Streaming

Spark Streaming is an extension of the core Spark API that provides fault-tolerant, high-throughput processing of real-time data. Its APIs support scalable processing of data streams from a variety of sources.



# Understanding Batch and Stream Processing

## 1. Batch Processing
Batch processing involves applying computational logic to fixed, static datasets and producing results after processing the entire dataset. It is reliable for large-scale data but can be slow, often taking hours to complete complex jobs.

## 2. Stream Processing
Stream processing handles continuous, unbounded streams of data in real time. It requires low-latency, fault-tolerant systems to manage challenges such as varying data arrival rates, maintaining correctness during failures, and processing data efficiently.

## Integration of Batch and Stream Processing
Stream processing often integrates with batch processing to enrich real-time insights. For instance, live user activity streams can be joined with static profiles or historical data to enable dynamic, context-rich analytics for timely decision-making.
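
As a rough illustration of joining a live stream with static data, the plain-Python sketch below enriches each incoming event with a profile lookup. It does not use Spark; the `PROFILES` table, the field names, and the `enrich` helper are hypothetical and exist only for this example.

```python
from typing import Dict, Iterable, Iterator

# Static (batch) side: user profiles loaded once. Hypothetical sample data.
PROFILES: Dict[str, Dict[str, str]] = {
    "u1": {"tier": "gold", "country": "DE"},
    "u2": {"tier": "free", "country": "US"},
}


def enrich(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Join each live event with its static profile as it arrives."""
    for event in events:
        # Left join: events from unknown users pass through unchanged.
        profile = PROFILES.get(event["user_id"], {})
        yield {**event, **profile}


live_events = [
    {"user_id": "u1", "action": "view"},
    {"user_id": "u3", "action": "click"},  # no profile on record
]
enriched = list(enrich(live_events))
```

Because `enrich` is a generator, it processes one event at a time and never needs the whole stream in memory, which is the property that matters for unbounded input.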


# Use Cases of Spark Streaming

Spark Streaming delivers significant value across various domains by enhancing customer experience or proactively monitoring data for actionable insights. Below are some of the prominent use cases:

## 1. Fraud Detection
- **Domain**: Financial Services  
- **Use Case**: Detect fraudulent transactions in real time.  
- **Value**: Enables organizations to take immediate action on suspicious transactions, minimizing potential damage.

## 2. Recommendations
- **Domain**: E-commerce, Media, and others  
- **Use Case**: Recommend products or content based on current user activities on a platform.  
- **Value**: Improves customer engagement and increases revenue through personalized experiences.

## 3. Risk Avoidance
- **Domain**: Service Providers  
- **Use Case**: Extend due diligence checks to evaluate a consumer's eligibility for a service in real time.  
- **Value**: Reduces the risk of providing services or products to ineligible consumers.

Spark Streaming is widely adopted across industries for its ability to process real-time data efficiently and make critical decisions on the fly.


# Core Concepts in Stream Processing

## 1. Data Delivery Semantics
Delivery semantics describe how many times a stream processing engine delivers each piece of data when failures occur:

- **At Most Once**: Data may be delivered zero or one time. Risk of data loss; suitable for non-critical use cases.
- **At Least Once**: Data is delivered one or more times. Ensures no data loss but may lead to duplication, which is a problem for workloads such as financial transactions unless processing is idempotent.
- **Exactly Once**: Data is delivered exactly one time. Prevents loss and duplication, ideal for critical business applications.

### Key Insight
Delivery semantics range from weakest `(at most once)` to strongest `(exactly once)`, with modern engines favoring **exactly once** for reliability.
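
To make the difference concrete, the following plain-Python sketch simulates at-least-once delivery, where one message is redelivered after a failure, and shows how deduplicating on a message ID yields an effectively exactly-once result. The message IDs, amounts, and function names are hypothetical, and real engines implement this with checkpoints and transactional or idempotent sinks rather than an in-memory set.

```python
from typing import List, Set, Tuple

Message = Tuple[str, int]  # (message_id, amount)


def apply_naively(deliveries: List[Message]) -> int:
    """Sum every delivery, so a redelivered message is counted twice."""
    return sum(amount for _, amount in deliveries)


def apply_idempotently(deliveries: List[Message]) -> int:
    """Skip message IDs that were already applied."""
    seen: Set[str] = set()
    total = 0
    for message_id, amount in deliveries:
        if message_id in seen:
            continue  # duplicate caused by redelivery
        seen.add(message_id)
        total += amount
    return total


# "m2" is redelivered after a simulated failure (at-least-once delivery).
deliveries = [("m1", 100), ("m2", 50), ("m2", 50), ("m3", 25)]
naive_total = apply_naively(deliveries)
deduplicated_total = apply_idempotently(deliveries)
```

Here the naive total overcounts by the duplicated amount, while the idempotent version applies each message once.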


## 2. Notion of Time in Stream Processing

In stream processing, the notion of time matters because it determines how events are grouped and interpreted. For example, in a real-time anomaly detection application, it lets you count the suspicious transactions that occurred in the last 5 minutes or during a certain part of the day.

- **Event Time**: Timestamp when data is created (e.g., IoT device logs temperature).
- **Processing Time**: Timestamp when the stream engine processes the data.

### Key Insight
- Event time reflects when something actually happened, so results are not skewed by transmission or processing lag.
- Processing time depends on system clocks and may vary due to delays.
- Use **event time** for accurate temporal analysis, especially with unbounded data streams.

To deal with unbounded incoming streams, stream processing engines commonly divide the data into chunks, using a start and end time as the boundaries. Event time usually makes the more meaningful temporal boundary, because it reflects when the data was actually produced.
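
The effect of choosing one clock over the other can be sketched in plain Python. The example below groups the same three readings into 10-second chunks, once by event time and once by processing time. The `Reading` type, the timestamps, and the `bucket` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Reading(NamedTuple):
    event_time: int       # seconds, when the device took the reading
    processing_time: int  # seconds, when the engine saw it
    value: float


def bucket(readings: List[Reading], field: str, size: int) -> Dict[int, List[float]]:
    """Group values into chunks of `size` seconds, keyed by chunk start."""
    windows: Dict[int, List[float]] = defaultdict(list)
    for r in readings:
        start = (getattr(r, field) // size) * size
        windows[start].append(r.value)
    return dict(windows)


readings = [
    Reading(event_time=2, processing_time=3, value=20.0),
    Reading(event_time=8, processing_time=9, value=21.0),
    Reading(event_time=9, processing_time=14, value=35.0),  # delayed 5 s in transit
]
by_event_time = bucket(readings, "event_time", 10)
by_processing_time = bucket(readings, "processing_time", 10)
```

Grouped by event time, all three readings land in the chunk starting at second 0. Grouped by processing time, the delayed reading drifts into the next chunk even though it was produced in the first one.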


## 3. Windowing in Stream Processing

### Why Windowing?
- Streaming data is unbounded; processing it in chunks is essential.
- Example: Traffic sensor data analyzed in 1-minute or 5-minute intervals.

### Windowing Patterns
1. **Fixed/Tumbling Window**: 
   - Divides data into non-overlapping, fixed-size windows (e.g., 1 min).
   - Ideal for straightforward aggregations like sum or average.

2. **Sliding Window**: 
   - Overlapping windows with a defined slide interval.
   - Produces smoother aggregations due to data overlap.

3. **Session Window**: 
   - Dynamically determined by periods of user inactivity.
   - Useful for analyzing user behavior (e.g., website sessions).
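
Session windows are the least obvious of the three patterns, so here is a minimal plain-Python sketch of the idea: a new session starts whenever the gap since the previous event exceeds an inactivity threshold. The `sessionize` helper, the click timestamps, and the 30-second gap are hypothetical choices for illustration.

```python
from typing import List


def sessionize(timestamps: List[int], gap: int) -> List[List[int]]:
    """Split event timestamps into sessions separated by more than `gap` seconds of inactivity."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity exceeded the gap: new session
    return sessions


clicks = [0, 5, 12, 60, 65, 200]
sessions = sessionize(clicks, gap=30)
```

Unlike tumbling and sliding windows, the resulting windows have no fixed length: each one is as long as the burst of activity it covers.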
   
### Sliding Window Example: Producing Smoother Aggregations

Consider a sensor that records temperature every second. Using a **fixed window** of 5 seconds, the aggregation might look like this:

| Time (s) | Data                    | Fixed Window Average |
|----------|-------------------------|----------------------|
| 1-5      | 20, 22, 23, 21, 24       | 22                   |
| 6-10     | 25, 27, 26, 28, 30       | 27.2                 |

In this case, the averages abruptly change between windows.

Now, using a **sliding window** with a length of 5 seconds and a slide interval of 2 seconds, the overlapping windows allow a smoother transition between averages:

| Time (s)   | Data                      | Sliding Window Average |
|------------|---------------------------|------------------------|
| 1-5        | 20, 22, 23, 21, 24         | 22                     |
| 3-7        | 23, 21, 24, 25, 27         | 24                     |
| 5-9        | 24, 25, 27, 26, 28         | 26                     |
| 7-11       | 27, 26, 28, 30, 32         | 28.6                   |

### Key Difference:
- Fixed windows aggregate only distinct time chunks, causing sharp transitions.
- Sliding windows overlap, capturing shared data points, resulting in smoother transitions between aggregates.
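
The averages in both tables can be reproduced with a short plain-Python sketch. It treats a fixed window as the special case of a sliding window whose slide equals its length. The `window_averages` helper is hypothetical, and the readings are the per-second values from the tables above.

```python
from typing import Dict, List, Tuple

# One reading per second for seconds 1..11, taken from the tables above.
readings: List[float] = [20, 22, 23, 21, 24, 25, 27, 26, 28, 30, 32]


def window_averages(values: List[float], length: int, slide: int) -> Dict[Tuple[int, int], float]:
    """Average every full window of `length` seconds, advancing by `slide` seconds."""
    averages: Dict[Tuple[int, int], float] = {}
    for start in range(0, len(values) - length + 1, slide):
        chunk = values[start:start + length]
        # Keys are 1-based (first_second, last_second), matching the tables.
        averages[(start + 1, start + length)] = round(sum(chunk) / length, 1)
    return averages


fixed = window_averages(readings, length=5, slide=5)    # tumbling: slide == length
sliding = window_averages(readings, length=5, slide=2)  # overlapping windows
```

Incomplete trailing windows are dropped in this sketch, which is why the fixed case ignores second 11.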


### Key Insight
- Temporal boundaries (event or processing time) define windows for meaningful analysis.


# Stream Processing Engine Landscape

- **Apache Storm**: An early pioneer in stream processing. Twitter later replaced it internally with Heron for better resource efficiency.
- **Apache Samza**: Built by LinkedIn, tightly integrated with Kafka for fault-tolerant stream processing.
- **Apache Flink**: Supports both stream and batch processing, known for high throughput and low latency.
- **Apache Kafka Streams**: A lightweight stream processing library built on top of Kafka that makes it easy to write real-time applications.
- **Apache Apex**: Native Hadoop YARN platform, unifies stream and batch processing.
- **Apache Beam**: Unified API for both stream and batch processing, portable across runtimes (Flink, Spark, Google Cloud Dataflow).

**Processing Models:**
- **Record-at-a-Time**: Low latency, processes each piece of data as it arrives (e.g., Apache Flink).
- **Micro-Batching**: Higher throughput, processes data in small batches (e.g., Apache Spark).
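
The trade-off between the two models can be illustrated with a plain-Python sketch that counts how often a handler is invoked. This is a simplification: the function names are hypothetical, and batches are cut by record count here, whereas a micro-batching engine typically cuts them by time interval.

```python
from typing import Callable, List


def record_at_a_time(records: List[int], handler: Callable[[List[int]], None]) -> int:
    """Invoke the handler once per record; returns the number of invocations."""
    for record in records:
        handler([record])
    return len(records)


def micro_batched(records: List[int], batch_size: int,
                  handler: Callable[[List[int]], None]) -> int:
    """Invoke the handler once per batch of up to `batch_size` records."""
    calls = 0
    for start in range(0, len(records), batch_size):
        handler(records[start:start + batch_size])
        calls += 1
    return calls


seen: List[List[int]] = []
records = list(range(10))
per_record_calls = record_at_a_time(records, seen.append)
per_batch_calls = micro_batched(records, batch_size=4, handler=seen.append)
```

Fewer invocations amortize per-call overhead, which is where the throughput gain comes from. The cost is that a record waits until its batch is complete, which adds latency.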


# Data Sources for Spark Streaming

Spark Streaming supports the analysis of real-time data from a variety of sources. It provides APIs to connect to several messaging queues and data streams. Connector availability depends on the Spark version: some connectors, such as those for ZeroMQ and Twitter, are maintained outside the core project in Apache Bahir. Below are some of the supported data sources:

## Supported Data Sources
1. **Kafka**  
   - Widely used distributed messaging system.
   - Ideal for high-throughput, fault-tolerant event streaming.

2. **Flume**  
   - Designed for collecting, aggregating, and transporting large amounts of log data.

3. **Kinesis**  
   - Amazon's real-time streaming data platform, enabling fast ingestion and processing.

4. **ZeroMQ**  
   - Lightweight messaging library designed for high-performance and low-latency messaging.

5. **Twitter**  
   - Direct integration to analyze real-time data streams from Twitter's public API.

These integrations make Spark Streaming a versatile and powerful tool for real-time data processing across multiple platforms.


# Stream Processing in Spark Streaming

Stream processing in Spark Streaming handles real-time data streams efficiently by breaking them into **micro-batches**. These micro-batches are small chunks of data that are processed sequentially, allowing for scalable and fault-tolerant stream processing.


## DStreams (Discretized Streams)
A **DStream** is the representation of a data stream in Spark Streaming. It is a continuous series of **RDDs** (Resilient Distributed Datasets) where:
- Each RDD contains data from a specific time interval.
- Operations on DStreams are automatically translated into operations on the underlying RDDs.

### Key Features of DStreams
- **Receiver Object**:  
   Most input DStreams are linked to a receiver object (file-based streams are an exception) that:
   - Collects data from the source.
   - Stores the data in Spark's memory for further processing.

This design makes DStreams a core abstraction for stream processing in Spark.
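
The idea of a stream as a series of per-interval batches can be modeled in plain Python. The sketch below is not the DStream API: `discretize` and `map_batches` are hypothetical helpers, and Python lists stand in for RDDs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Tuple[float, str]  # (arrival time in seconds, payload)


def discretize(records: List[Record], batch_interval: float) -> Dict[int, List[str]]:
    """Group records into consecutive batches, one per batch interval."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival, payload in records:
        batches[int(arrival // batch_interval)].append(payload)
    return dict(batches)


def map_batches(batches: Dict[int, List[str]], fn: Callable[[str], str]) -> Dict[int, List[str]]:
    """Apply `fn` to every element of every batch, one batch at a time."""
    return {index: [fn(item) for item in items] for index, items in batches.items()}


arrivals = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = discretize(arrivals, batch_interval=1.0)
upper = map_batches(batches, str.upper)
```

A single stream-level operation (`map_batches`) turns into the same operation on each underlying batch, which mirrors how DStream operations translate into RDD operations.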


## Structured Streaming at a Glance

- **Second-Generation Streaming Engine**: Built on Spark SQL for better scalability, fault tolerance, and performance.

- **Key Features**:
  - **End-to-End Reliability**: Guarantees correctness and handles complex transformations.
  - **Event-Time Processing**: Handles out-of-order data effectively.
  - **Data Integration**: Supports a variety of data sources and sinks.

- **Core Ideas**:
  1. **Stream as a Table**: Treats incoming data as rows appended to a table, leveraging structured APIs (DataFrame/Dataset).
  2. **Transactional Guarantees**: Provides end-to-end exactly-once processing and integrates with storage systems for consistent snapshots.

- **Processing Models**:
  - **Micro-Batching** (default): Low latency (as low as ~100 ms), suitable for many use cases.
  - **Continuous Processing** (experimental, Spark 2.3+): Ultra-low latency (as low as ~1 ms), but with at-least-once guarantees and a restricted set of supported operations.

- **Developer Benefits**:
  - Easy transition from batch to streaming via structured APIs.
  - Optimized using the Catalyst engine.
  - Simplifies stream processing complexities like state maintenance and event-time handling.


# Spark Structured Streaming

### Overview
Spark Structured Streaming is Spark’s second-generation streaming engine, designed to handle real-time data processing with the following goals:
- **End-to-end reliability** and **guaranteeing correctness**
- **Complex transformations** on incoming data
- Processing based on **event time** and handling **out-of-order data**
- Integration with various **data sources** and **data sinks**

### Core Concepts
The core concepts of building a streaming application in Spark include:
- **Data Sources**: Input streams of data.
- **Data Transformations**: Applying structured APIs to incoming data streams.
- **Output Mode**: Defines how data is written to a sink.
- **Trigger**: Determines when streaming computations are executed.
- **Data Sink**: Destination for the output of streaming applications.
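
A minimal PySpark sketch shows how these five concepts fit together in one query. It assumes PySpark is installed and that a `SparkSession` is passed in. The function name `build_windowed_count_query`, the window size, the watermark, and the trigger interval are arbitrary choices for illustration, not recommendations.

```python
def build_windowed_count_query(spark):
    """Wire the five concepts together; returns a started StreamingQuery."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import col, window

    # Data source: the built-in rate source emits (timestamp, value) rows.
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Data transformation: count rows per 10-second event-time window.
    counts = (
        events
        .withWatermark("timestamp", "30 seconds")
        .groupBy(window(col("timestamp"), "10 seconds"))
        .count()
    )

    return (
        counts.writeStream
        .outputMode("update")                 # output mode
        .trigger(processingTime="5 seconds")  # trigger
        .format("console")                    # data sink
        .option("truncate", "false")
        .start()
    )
```

To try it, create a session with `SparkSession.builder.appName("demo").getOrCreate()`, pass it to the function, and call `awaitTermination()` on the returned query. The rate source and console sink are intended for testing only.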

---

## Data Sources

In Spark Structured Streaming, data sources differ from those used in batch processing: they produce data continuously, and at a rate that may vary over time. Spark provides native support for the following data sources:

| Data Source Type | Description | Fault Tolerant |
|------------------|-------------|----------------|
| **Kafka** | Reads from Kafka topics (requires Kafka version 0.10+). Popular in production environments. | Yes (using Kafka offsets) |
| **File** | Reads new files dropped into directories (local file system, HDFS, or S3). Supports formats like text, CSV, JSON, Parquet, ORC. | Yes |
| **Socket** | Reads UTF-8 data from a socket (used for testing only). | No |
| **Rate** | Generates events with timestamps and a monotonically increasing value (used for testing and benchmarking). | Yes |

---

## Output Modes

Output modes in Structured Streaming define how data is written to a sink:

| Output Mode  | Description                                                                 |
|--------------|-----------------------------------------------------------------------------|
| **Append**   | Only new rows appended to the result table are written to the output sink.  |
| **Complete** | The entire result table is written to the output sink.                      |
| **Update**   | Only updated rows are written to the output sink. Unchanged rows are ignored. |
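
The difference between the three modes can be sketched in plain Python by comparing the result table before and after one trigger. This is a simplified model, not Spark's implementation: `rows_to_write` is a hypothetical helper, and the append branch assumes that rows already in the result table never change.

```python
from typing import Dict

ResultTable = Dict[str, int]


def rows_to_write(previous: ResultTable, current: ResultTable, mode: str) -> ResultTable:
    """Select the rows of the result table a sink would receive for one trigger."""
    if mode == "complete":
        return dict(current)
    if mode == "update":
        # New rows and rows whose value changed since the last trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Assumes existing rows never change, so only unseen keys are new.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")


# A running count per word, before and after one trigger.
previous = {"cat": 2, "dog": 1}
current = {"cat": 3, "dog": 1, "owl": 1}
complete_rows = rows_to_write(previous, current, "complete")
update_rows = rows_to_write(previous, current, "update")

# Append is shown on a table whose existing rows stay unchanged.
append_rows = rows_to_write({"cat": 2}, {"cat": 2, "owl": 1}, "append")
```

Complete mode rewrites everything on each trigger, so it suits small result tables. Update and append keep the written volume proportional to what changed.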

---

## Trigger Types

The trigger defines when the next batch of data will be processed in a streaming job:

| Trigger Type     | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| **Not Specified** | Default (micro-batch mode). Spark processes the next batch as soon as the previous one completes. |
| **Fixed Interval** | Starts micro-batches at a fixed interval. If a batch finishes early, Spark waits for the interval to elapse; if it overruns, the next batch starts as soon as the previous one completes. |
| **One-time**      | Processes all available data in a single batch and then stops the application. Useful for low-volume or periodically scheduled jobs. |
| **Continuous**    | Low-latency, continuous processing (experimental, available since Spark 2.3). |
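
The timing of a fixed-interval trigger can be sketched in plain Python. The sketch assumes the scheduling rule described in the Spark documentation: a batch that finishes early waits for the next interval boundary, and a batch that overruns is followed immediately by the next one. The `fixed_interval_starts` helper and the durations are hypothetical.

```python
from typing import List


def fixed_interval_starts(durations: List[int], interval: int) -> List[int]:
    """Return the start time of each micro-batch, given how long each one runs."""
    starts: List[int] = []
    start = 0
    for duration in durations:
        starts.append(start)
        end = start + duration
        next_boundary = (start // interval + 1) * interval
        # Wait for the boundary if the batch finished early;
        # otherwise start right after the previous batch ends.
        start = max(end, next_boundary)
    return starts


starts = fixed_interval_starts(durations=[4, 13, 3], interval=10)
```

With a 10-second interval, the first batch finishes early and the second starts on the boundary at second 10. The second batch overruns its interval, so the third starts immediately at second 23 rather than waiting for second 30.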

---

## Data Sinks

Data sinks are where the processed data is written. Different types of sinks support different output modes:

| Sink Type        | Description                                                                 | Output Modes             | Fault Tolerance                |
|------------------|-----------------------------------------------------------------------------|--------------------------|--------------------------------|
| **Kafka Sink**   | Writes data to one or more Kafka topics (Kafka 0.10+). Supports high throughput. | Append, Update, Complete | At-least-once |
| **File Sink**    | Writes data to a directory on a file system (HDFS, local, or S3). Supports various formats. | Append | Exactly-once |
| **Foreach Sink** | Allows custom row-by-row processing, typically for external systems.        | Append, Update, Complete | At-least-once; stronger guarantees depend on the custom implementation |
| **Console Sink** | Displays output to the console. Primarily for testing and debugging.        | Append, Update, Complete | None (for debugging only)      |
| **Memory Sink**  | Stores data in an in-memory table on the driver node. Suitable for debugging and small jobs. | Append, Complete | None (data is lost when the application stops) |

---

## Sink Configurations

Each sink type may have its own configuration parameters. Below is a table describing key configuration options for each sink:

| Sink Type        | Key Configuration Parameter(s)                    | Example Configuration                        | Notes                                      |
|------------------|----------------------------------------------------|----------------------------------------------|--------------------------------------------|
| **Kafka Sink**   | `kafka.bootstrap.servers`, `topic`                 | `kafka.bootstrap.servers: <broker-list>`, `topic: <topic-name>` | For connecting to Kafka brokers and topics. |
| **File Sink**    | `path`, `format`                                   | `path: /output/directory`, `format: Parquet` | Supports formats like CSV, JSON, Parquet, Avro. |
| **Foreach Sink** | Custom logic (depends on the use case)             | Custom row processing logic                  | Allows writing custom code for each row.  |
| **Console Sink** | `numRows`                                          | `numRows: 10`                               | Limits the number of rows printed to the console. |
| **Memory Sink**  | `queryName`                                        | `queryName: <table-name>`                    | Names the in-memory table that holds the results so it can be queried with Spark SQL. |

---



# Lambda and Kappa Architecture

When dealing with big data processing and analytics, architectural patterns play a crucial role in designing systems that can handle large-scale data efficiently. Two commonly used patterns are **Lambda Architecture** and **Kappa Architecture**.

---

## **Lambda Architecture**

Lambda Architecture is designed to handle massive quantities of data by utilizing both real-time and batch processing. It is particularly suitable for systems that require low-latency reads and updates.

### Key Components:
1. **Batch Layer**: Processes and stores data in a fault-tolerant manner. It computes results over the full dataset and is typically slow but highly accurate.
2. **Speed Layer**: Handles real-time data streams for low-latency processing, typically less accurate due to approximations.
3. **Serving Layer**: Combines outputs from both the batch and speed layers to provide a unified view of the data.
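
How the three layers cooperate can be sketched in plain Python with a page-view count. The batch view covers everything before a cutoff, the speed view covers what has arrived since, and the serving layer merges both. The function names, the events, and the cutoff are hypothetical. For simplicity both views are exact here, although a real speed layer may use approximations.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, page)


def batch_view(events: List[Event], cutoff: int) -> Counter:
    """Batch layer: exact counts over all events before the cutoff."""
    return Counter(page for ts, page in events if ts < cutoff)


def speed_view(events: List[Event], cutoff: int) -> Counter:
    """Speed layer: counts for events the last batch run has not covered yet."""
    return Counter(page for ts, page in events if ts >= cutoff)


def serve(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: merge both views into one answer."""
    return batch + speed


events = [(1, "home"), (2, "cart"), (3, "home"), (11, "home"), (12, "cart")]
merged = serve(batch_view(events, cutoff=10), speed_view(events, cutoff=10))
```

Note that the counting logic appears twice, once per layer. That duplication is the maintenance cost usually attributed to this architecture.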

### Pros:
- Combines accuracy and low-latency processing.
- Fault-tolerant due to the separation of batch and real-time layers.

### Cons:
- Complex to implement and maintain due to the dual-layer approach.
- Potential duplication of logic in batch and speed layers.

---

## **Kappa Architecture**

Kappa Architecture was introduced as a simpler alternative to Lambda Architecture, focusing solely on streaming data processing. It eliminates the batch layer entirely and relies on a unified stream processing engine.

### Key Components:
1. **Stream Processing**: All data is processed as a continuous stream, ensuring near real-time analytics.
2. **Data Store**: Stores the processed results for querying and analysis.

### Pros:
- Simpler and easier to maintain compared to Lambda Architecture.
- Ideal for use cases where reprocessing can be achieved by replaying data streams.

### Cons:
- Less suitable for workloads that depend on batch-style analysis of large historical datasets, since all reprocessing has to go through stream replay.
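
Reprocessing by replay can be sketched in plain Python: the same append-only log is run through a new version of the streaming logic to rebuild the result. The log contents, `run_job`, and both versions of the logic are hypothetical. In practice the log would live in a replayable system such as Kafka.

```python
from typing import Callable, Dict, List

Log = List[Dict[str, int]]  # append-only, replayable event log


def run_job(log: Log, logic: Callable[[Dict[str, int]], int]) -> int:
    """Process the log from the first offset and return the materialized result."""
    return sum(logic(event) for event in log)


log: Log = [{"amount": 10}, {"amount": -3}, {"amount": 7}]

# Version 1 of the streaming logic sums all amounts.
v1_result = run_job(log, lambda e: e["amount"])

# Version 2 ignores refunds (negative amounts). Replaying the same log
# rebuilds the result without a separate batch code path.
v2_result = run_job(log, lambda e: max(e["amount"], 0))
```

Only one code path exists, so a change in logic is rolled out by replaying the log rather than by maintaining a parallel batch job.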

---

## **Comparison**

| Feature              | Lambda Architecture            | Kappa Architecture             |
|----------------------|--------------------------------|--------------------------------|
| **Complexity**       | High due to dual layers       | Low due to single layer       |
| **Latency**          | Low (real-time layer)         | Very low (stream processing)  |
| **Reprocessing**     | Separate batch jobs           | Stream replay                 |
| **Use Case**         | Mixed batch and real-time     | Pure streaming use cases      |

---

## **Further Reading**

For an in-depth comparison and practical examples, check out this detailed blog post:  
[Lambda vs. Kappa Architecture - Nexocode Blog](https://nexocode.com/blog/posts/lambda-vs-kappa-architecture/)
