# Streaming Data with Apache Spark

Apache Spark can process streaming data for near-real-time analytics. The core unit of processing is the data stream, so before working with streaming data in Spark, it helps to understand what a data stream is and how it differs from a static dataset.

### What Is a Data Stream?
A data stream is an unbounded sequence of data that continuously flows from different sources such as sensors, log files, or social media feeds. As fresh data is produced, it gets appended to the stream, creating a dynamic and ever-evolving dataset. Some examples of data streams include:

- **Social media feeds**
  A continuous flow of posts containing text, user details, and timestamps, which can be processed to analyze trends, sentiments, or user behavior.

- **Sensor readings**
  Data such as temperature and humidity measurements from a network of sensors in a smart building, used to optimize energy usage.

- **Log data**
  Streams of log messages generated by servers that capture system events and error information to help monitor performance or detect security issues.

Processing these data streams introduces unique challenges because of their constantly growing and changing nature. To manage continuous flows of data, there are generally two main strategies:

- **Recompute**
  This traditional method reprocesses the entire dataset whenever new data arrives to ensure accuracy. Although reliable, it can become resource-intensive and slow when dealing with large volumes of data.

- **Incremental processing**
  This approach requires custom logic to track and process only the data added since the previous update. Because it touches only the new records, incremental processing reduces computational overhead compared with a full recompute.
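
The difference between the two strategies can be illustrated with a small, Spark-free sketch. The class names `RecomputeTotal` and `IncrementalTotal` are hypothetical and only model a running sum; they are not part of any Spark API.

```python
class RecomputeTotal:
    """Recompute strategy: rescan every record on each update."""

    def __init__(self):
        self.records = []
        self.rows_scanned = 0  # total work done so far

    def update(self, new_records):
        self.records.extend(new_records)
        self.rows_scanned += len(self.records)  # full rescan
        return sum(self.records)


class IncrementalTotal:
    """Incremental strategy: keep state, touch only the new records."""

    def __init__(self):
        self.total = 0
        self.rows_scanned = 0

    def update(self, new_records):
        self.total += sum(new_records)
        self.rows_scanned += len(new_records)  # only the new rows
        return self.total


batches = [[1, 2, 3], [4, 5], [6]]
recompute, incremental = RecomputeTotal(), IncrementalTotal()
for batch in batches:
    # Both strategies must agree on the answer after every update.
    assert recompute.update(batch) == incremental.update(batch)

print(recompute.rows_scanned, incremental.rows_scanned)  # prints: 14 6
```

Both strategies produce the same totals, but the recompute version scans more rows with every update, which is the cost that incremental processing avoids.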

A key tool that enables incremental processing in Apache Spark is Spark Structured Streaming, which streamlines the development of scalable, real-time data pipelines.

### Spark Structured Streaming Overview

* A scalable stream processing engine in Apache Spark.
* Transforms how data streams are processed and queried.
* Automatically detects new data as it arrives.
* Incrementally persists results to target sinks (e.g., durable storage like files or tables).

**Core Concept**

* Treats live data streams as unbounded, continuously growing tables.
* Each new record is appended as a new row.
* Enables the use of familiar SQL and DataFrame operations on streaming data.
* Unifies batch and streaming processing, so there is no need for separate stacks.
* Simplifies migration from batch Spark jobs to streaming jobs.
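
As a rough illustration of the unified model, the sketch below applies one transformation to both a batch and a streaming DataFrame. The function names, the table name `sensor_readings`, and the columns `device_id` and `temperature` are hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
def keep_valid_readings(df):
    """Works unchanged on batch and streaming DataFrames."""
    return (
        df.filter("temperature IS NOT NULL")
          .select("device_id", "temperature")
    )


def batch_and_stream_views(spark, table_name="sensor_readings"):
    """Build the same query twice: once bounded, once unbounded."""
    # spark.read returns a static snapshot of the table.
    batch_df = keep_valid_readings(spark.read.table(table_name))
    # spark.readStream treats the same table as an unbounded source.
    stream_df = keep_valid_readings(spark.readStream.table(table_name))
    return batch_df, stream_df
```

The only difference between the two code paths is `read` versus `readStream`; the returned streaming DataFrame reports `isStreaming` as `True`.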

**Append-Only Requirement**

* Streaming sources must be *append-only*.
* Data can only be added—no updates, deletions, or overwrites.
* If a source allows changes to existing data, it is not suitable for Structured Streaming.
* Ensuring compliance with this requirement is essential for streaming workflows.

**Delta Lake Integration**

* Spark Structured Streaming supports integration with:

  * File directories
  * Messaging systems like Kafka
  * Delta Lake tables

### DataStreamReader in PySpark

* Use `spark.readStream` to read a Delta Lake table as a streaming source.
* Enables processing of both existing and new data.
* Returns a *streaming DataFrame* for transformations.

  `streamDF = spark.readStream.table("source_table")`

### DataStreamWriter in PySpark

* After transformations, persist results with `writeStream`.
* Allows configuration of output options and durable storage targets.

  `streamDF.writeStream.option("checkpointLocation", "/path/to/checkpoint").toTable("target_table")`

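```python
# Assumes an active SparkSession named `spark` and an existing Delta table `source_table`.

# Read streaming data from a Delta table
streamDF = spark.readStream.table("source_table")

# Write the streaming data to another Delta table.
# The parentheses let the method chain span several lines without backslashes.
query = (
    streamDF.writeStream
    .trigger(processingTime="10 seconds")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .toTable("target_table")
)
```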

## Streaming Query Configurations
When configuring `DataStreamWriter`, several important settings control how streaming queries behave:

### Trigger Intervals

* The `trigger` method determines **how frequently** the system processes new data.
* This timing mechanism is called the **trigger interval**.
* There are two main trigger modes:

  * **Continuous (fixed-interval) trigger**

    * The query keeps running and processes new data in micro-batches at a fixed interval.
    * Suitable for near-real-time use cases.

  * **Triggered (batch) trigger**

    * Processes the data that is available when the query starts, then stops.
    * Helps balance resource usage and latency.


| **Mode** | **Usage** | **Behavior** |
| --- | --- | --- |
| **Continuous (fixed interval)** | `.trigger(processingTime="5 minutes")` | - Keeps the query running and processes data **in micro-batches at a fixed interval**.<br>- Default interval: **500 ms** (`processingTime="500ms"`).<br>- Enables near-real-time processing.<br>- Distinct from Spark's experimental continuous processing mode, which is requested with `.trigger(continuous=...)`. |
| **Triggered Once** *(deprecated)* | `.trigger(once=True)` | - Processes **all available data in a single micro-batch**, then stops automatically.<br>- May cause out-of-memory errors if data volume is large.<br>- Deprecated since Databricks Runtime 11.3 LTS. |
| **Triggered AvailableNow** | `.trigger(availableNow=True)` | - Processes **all available data in multiple micro-batches** until done, then stops.<br>- More scalable for large datasets.<br>- Preferred over `once=True` for incremental batch workloads. |
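
A minimal sketch of choosing between these triggers is shown here. The function name `start_copy` and its parameters are hypothetical; `stream_df` is assumed to be a streaming DataFrame, and `availableNow` is assumed to be supported by the installed PySpark version (3.3 or later).

```python
def start_copy(stream_df, target_table, checkpoint_dir, run_until_caught_up=False):
    """Start a streaming write with one of the triggers from the table above."""
    writer = stream_df.writeStream.option("checkpointLocation", checkpoint_dir)
    if run_until_caught_up:
        # Process everything currently available in micro-batches, then stop.
        writer = writer.trigger(availableNow=True)
    else:
        # Keep running and start a micro-batch every 5 minutes.
        writer = writer.trigger(processingTime="5 minutes")
    # toTable() starts the query and returns a StreamingQuery handle.
    return writer.toTable(target_table)
```

With `run_until_caught_up=True`, calling `awaitTermination()` on the returned query blocks until the backlog has been processed and the query has stopped.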


### Output Modes

**Append Mode:** `.outputMode("append")`

- Default output mode.
- Each trigger writes only the rows added since the previous trigger.
- Suitable when you need a growing dataset.

  

**Complete Mode:** `.outputMode("complete")`

- Writes the entire updated result table on every trigger.
- Overwrites the entire target (e.g., updating aggregates).
- Supported only for queries that contain an aggregation.
- Useful for maintaining up-to-date summary tables.
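
As an illustration of complete mode, the following sketch maintains a per-device row count. The function name `start_device_counts` and the column `device_id` are hypothetical, and `stream_df` is assumed to be a streaming DataFrame that contains that column.

```python
def start_device_counts(stream_df, target_table, checkpoint_dir):
    """Maintain a per-device row count that is rewritten on every trigger."""
    # The aggregation is what makes complete mode applicable here.
    counts = stream_df.groupBy("device_id").count()
    return (
        counts.writeStream
        .outputMode("complete")  # write the full aggregate on each trigger
        .option("checkpointLocation", checkpoint_dir)
        .toTable(target_table)
    )
```

Switching this query to append mode would need additional setup, because an aggregate row can still change as new data arrives.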


### Checkpointing

`.option("checkpointLocation", "/path/to/checkpoint")`

**Purpose:**

- Stores progress and metadata about the streaming query.
- Ensures recovery after failures—processing resumes from the last checkpoint, not from scratch.

**Key Points:**

- Checkpoints are saved to reliable storage (e.g., DBFS, Amazon S3, Azure Storage).
- Cannot be shared across multiple streaming queries.
- Every streaming write operation requires its own checkpoint location to maintain separate state and guarantees.
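
The one-checkpoint-per-query rule might look like this in practice. The function name `start_two_sinks` and the target table names are hypothetical; the sketch only shows that each write derives its own checkpoint path.

```python
def start_two_sinks(stream_df, checkpoint_root):
    """Write one streaming DataFrame to two tables, each with its own checkpoint."""
    queries = {}
    for target in ("events_raw_copy", "events_audit_copy"):
        queries[target] = (
            stream_df.writeStream
            # A distinct directory per query keeps their progress separate.
            .option("checkpointLocation", f"{checkpoint_root}/{target}")
            .toTable(target)
        )
    return queries
```

Deriving the checkpoint path from the target name is one simple convention for making accidental sharing unlikely.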

## Structured Streaming Guarantees
Spark Structured Streaming provides two main guarantees to ensure reliable, fault-tolerant processing:

### Fault Recovery

* If failures occur (e.g., node crashes or network issues), processing can **resume from the last successful point**.
* This recovery relies on:

  * **Checkpointing** - saves the progress and state of the stream.
  * **Write-Ahead Logs (WALs)** - capture the **offset range** for each trigger, allowing replay of unprocessed data.

* **Repeatable Data Sources** are critical:

  * Sources like cloud object storage and pub/sub messaging systems (Kafka, Event Hubs) allow the same data to be read repeatedly.
  * This ensures data can be **safely reprocessed after a failure** without loss.
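
The interplay of offset logging and replay can be sketched with a toy model. This is a simplified simulation, not Spark's actual checkpoint format: the `Checkpoint` class and `run_batch` function are hypothetical, and a Python list stands in for a replayable source.

```python
class Checkpoint:
    """Stand-in for durable checkpoint storage."""

    def __init__(self):
        self.planned = {}    # batch_id -> (start, end), logged before processing
        self.committed = -1  # id of the last batch that finished


def run_batch(source, sink, checkpoint, batch_size, crash=False):
    """Process one micro-batch; return False when there is nothing left to do."""
    batch_id = checkpoint.committed + 1
    if batch_id not in checkpoint.planned:
        start = checkpoint.planned[batch_id - 1][1] if batch_id > 0 else 0
        end = min(start + batch_size, len(source))
        if start == end:
            return False
        checkpoint.planned[batch_id] = (start, end)  # log the offset range first
    start, end = checkpoint.planned[batch_id]
    if crash:
        raise RuntimeError("simulated failure before commit")
    sink[batch_id] = source[start:end]  # re-read the logged range from the source
    checkpoint.committed = batch_id
    return True


source = list(range(10))  # replayable: any offset range can be read again
sink, checkpoint = {}, Checkpoint()

run_batch(source, sink, checkpoint, batch_size=4)  # batch 0 covers offsets 0-3
try:
    run_batch(source, sink, checkpoint, batch_size=4, crash=True)  # batch 1 fails
except RuntimeError:
    pass

# "Restart" with the same checkpoint: batch 1 is replayed with its logged range.
while run_batch(source, sink, checkpoint, batch_size=4):
    pass

print(sink)  # prints: {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
```

Because the offset range of the failed batch was logged before processing started, the restart reads exactly the same records again, so nothing is skipped.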

### Exactly-Once Semantics

* The effect of every record appears in the sink **exactly once**, even if failures and retries cause data to be reprocessed.
* This relies on replayable sources and checkpointing together with **idempotent sinks**, which:

  * Allow multiple writes for the same records without creating duplicates.
  * Use the **offsets as unique identifiers** to detect and ignore duplicate writes.
* **Key Benefits:**

  * No data loss.
  * No duplicate entries.
  * Guaranteed consistent output in the sink.
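
A toy model of an idempotent sink is sketched below. The `IdempotentSink` class is hypothetical, and it uses a batch id as a stand-in for the offset-based identifier described above; real sinks implement this bookkeeping in their own way.

```python
class IdempotentSink:
    """Remembers which batch ids it has applied and ignores repeats."""

    def __init__(self):
        self.rows = []
        self.applied_batches = set()

    def write(self, batch_id, rows):
        if batch_id in self.applied_batches:
            return False  # duplicate delivery: skip it
        self.rows.extend(rows)
        self.applied_batches.add(batch_id)
        return True


sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(1, ["c"])
# A retry after a failure delivers batch 1 again with the same id.
retried = sink.write(1, ["c"])

print(retried, sink.rows)  # prints: False ['a', 'b', 'c']
```

The retried write is recognized by its id and dropped, so the sink contents are the same as if the failure had never happened.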

## Unsupported Operations

* Since streaming data is **infinite/unbounded**, some operations common in batch processing are **not supported** or are limited in streaming:

  * **Sorting** the entire dataset (cannot fully sort infinite data).
  * **Global deduplication** across all time.
* **Alternatives:**

  * Use **windowing** (e.g., tumbling, sliding windows) to group data into bounded chunks for aggregation or deduplication.
  * Use **watermarking** to manage late-arriving data.
* **Note:**

  * Deep knowledge of these techniques is generally required for **Databricks Data Engineer Professional certification**, not for Associate-level.
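
To make windowing and watermarking more concrete, here is a Spark-free simulation of tumbling-window counts with a watermark. The function `tumbling_window_counts` and its parameters are hypothetical, and the watermark is updated per record for simplicity, whereas Spark advances it between micro-batches. In PySpark, the corresponding building blocks are `DataFrame.withWatermark` and `pyspark.sql.functions.window`.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds, max_delay_seconds):
    """Count events per tumbling window, dropping events behind the watermark.

    events: iterable of (event_time_seconds, key) pairs in arrival order.
    """
    counts = Counter()
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # The watermark trails the newest event time seen so far.
        watermark = max_event_time - max_delay_seconds
        if event_time < watermark:
            dropped.append((event_time, key))  # too late: ignore
            continue
        # Each event falls into exactly one fixed-size window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


events = [(1, "a"), (4, "a"), (12, "b"), (9, "a"), (31, "a"), (3, "b")]
counts, dropped = tumbling_window_counts(
    events, window_seconds=10, max_delay_seconds=20
)

print(counts)   # prints: {(0, 'a'): 3, (10, 'b'): 1, (30, 'a'): 1}
print(dropped)  # prints: [(3, 'b')]
```

The event at time 9 arrives out of order but is still counted because it is within the allowed delay, while the event at time 3 arrives after the watermark has moved to 11 and is discarded.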