# Unit 8 Spark Streaming

## Contents
```
8.1. Introduction to Stream Processing with Spark
  8.1.1. Spark Streaming API (DStream)
  8.1.2. Structured Streaming API
  8.1.3. Stream Processing Model
  8.1.4. Streaming Sources, Sinks and Output Mode
  8.1.5. Fault Tolerance and Restarts
  
8.2. Windowing and Aggregates
  8.2.1. Stateless vs Stateful Transformations
  8.2.2. Event Time and Windowing    
  8.2.3. Tumbling Window
  8.2.4. Sliding Window
  8.2.5. Watermarking
  
8.3. Joins
  8.3.1. Joining to a Static Source
  8.3.2. Joining to Another Stream
  8.3.3. Watermark
  8.3.4. Outer Joins
  
```

## Introduction to Stream Processing with Spark

Let's start by reviewing how Spark operates in standard batch processing mode:
![Standard batch processing operation](https://bigdata.cesga.es/img/spark_streaming-non_streaming_operation.png)
In batch mode, we have an input data source, apply some transformations, and write the output to the given storage.

When processing a streaming data source we have to introduce a new axis, **time**, because the input source keeps generating new data as time passes.

![Microbatches](https://bigdata.cesga.es/img/spark_streaming-microbatch_generation.png)
In stream processing mode, Spark divides the input data stream into micro-batches, and each micro-batch is processed as a series of small jobs.
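
As a rough illustration of the micro-batch idea, the sketch below groups timestamped events into batches using a fixed trigger interval. It is plain Python, not Spark code: the function name `to_micro_batches`, the event tuples, and the 1-second interval are all assumptions made for this example.

```python
from collections import defaultdict

def to_micro_batches(events, interval):
    """Group (arrival_time, value) events into micro-batches.

    Batch n holds every event that arrived in [n * interval, (n + 1) * interval).
    Returns a list of (batch_id, values) pairs in processing order.
    """
    batches = defaultdict(list)
    for arrival_time, value in events:
        batch_id = int(arrival_time // interval)
        batches[batch_id].append(value)
    return sorted(batches.items())

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
micro_batches = to_micro_batches(events, interval=1.0)

for batch_id, values in micro_batches:
    # In Spark, each micro-batch would be processed as a series of small jobs
    print(batch_id, values)
```

Spark decides the batch boundaries itself, based on the trigger and the data available at that moment; the fixed time buckets here are only a simplification.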

### Spark Streaming API (DStream)

The Spark Streaming API, aka DStream, is the implementation of Spark Streaming based on RDDs. You can find it in legacy projects but for new projects the newer Structured Streaming API is recommended.

NOTE: Spark Streaming (DStream) no longer receives updates.

## Structured Streaming

The Structured Streaming API is the new streaming API that uses the Spark SQL engine, i.e. the DataFrame API.

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

The idea behind both Spark Streaming and Structured Streaming is to divide the stream of data into **micro-batches** and process each micro-batch as a small job, achieving end-to-end latencies as low as 100 milliseconds.

To achieve lower latencies, there is also a processing mode called **Continuous Processing**, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.

The Spark SQL engine (Catalyst) takes care of running the series of jobs incrementally and continuously updating the final result as streaming data continues to arrive.

The system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs, provided that the sources are replayable and the sinks are idempotent.
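
To make the role of checkpointing and write-ahead logs more concrete, here is a simplified plain-Python model (not Spark code). It assumes a replayable source addressed by offsets and an idempotent sink keyed by batch id; the names `checkpoint`, `sink`, `run_batch`, and `next_batch_id` are hypothetical.

```python
source = ["a", "b", "c", "d", "e"]  # replayable source: rows addressable by offset

checkpoint = {"offsets": {}, "commits": set()}  # survives restarts (a directory in Spark)
sink = {}                                       # idempotent sink: batch_id -> rows

def run_batch(batch_id, batch_size, crash_before_commit=False):
    # 1. Write-ahead log: record the offset range BEFORE processing it
    if batch_id not in checkpoint["offsets"]:
        start = max((end for _, end in checkpoint["offsets"].values()), default=0)
        checkpoint["offsets"][batch_id] = (start, min(start + batch_size, len(source)))
    start, end = checkpoint["offsets"][batch_id]

    # 2. Process the range and write it; rewriting the same batch_id overwrites
    sink[batch_id] = [row.upper() for row in source[start:end]]

    if crash_before_commit:
        raise RuntimeError("simulated failure")

    # 3. Mark the batch as committed
    checkpoint["commits"].add(batch_id)

def next_batch_id():
    """Return the first batch id that has not been committed yet."""
    batch_id = 0
    while batch_id in checkpoint["commits"]:
        batch_id += 1
    return batch_id

run_batch(next_batch_id(), batch_size=2)  # batch 0: a, b
try:
    # Batch 1 writes c, d to the sink but fails before committing
    run_batch(next_batch_id(), batch_size=2, crash_before_commit=True)
except RuntimeError:
    pass
# After the restart, batch 1 is replayed with the logged offsets (batch_size is ignored)
run_batch(next_batch_id(), batch_size=3)
run_batch(next_batch_id(), batch_size=2)  # batch 2: e

output = [row for batch_id in sorted(sink) for row in sink[batch_id]]
print(output)
```

Because the offset range is logged before processing and the sink overwrites by batch id, the replayed batch neither loses nor duplicates rows. With a sink that is not idempotent, the same replay would produce duplicates.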

## Stream Processing Model

The key idea behind Spark Structured Streaming is to treat the live data stream as a table that is being continuously appended.
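
The following plain-Python sketch (not Spark code) illustrates this model with a running word count: every micro-batch appends rows to the input table and the result is updated incrementally. The names `input_table`, `result_table`, and `process_micro_batch` are assumptions made for this example.

```python
from collections import Counter

input_table = []          # the unbounded input table: one row per received line
result_table = Counter()  # the result of the query: word -> count

def process_micro_batch(new_rows):
    """Append the new rows to the input table and update the result incrementally."""
    input_table.extend(new_rows)
    for line in new_rows:
        # Only the new rows are read; earlier rows are never reprocessed
        result_table.update(line.split())

process_micro_batch(["cat dog", "dog"])
process_micro_batch(["owl cat"])

print(dict(result_table))
```

The sketch keeps the whole input table only to make the model visible. Spark does not materialize the full table; it reads the newly arrived data, updates the result, and keeps only the state it needs for the next update.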

## Input Sources

- Socket source (for testing)

In [None]:
spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

- File source

In [None]:
# File sources need an explicit schema; `schema` is assumed to be defined earlier
spark.readStream \
  .format("json") \
  .schema(schema) \
  .option("path", "path/to/source/dir") \
  .load()

format can be: parquet, json, csv, orc, etc.

- Kafka source

In [None]:
spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()

## Exercises

Lab: Unit_8_structured_streaming-dataframe_schema.py
- Review the code
- Run it
- What is the schema of the DataFrame generated from the stream?

## Output Sinks

- File sink: stores the output to a directory

In [None]:
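# File sinks need a checkpoint directory; the path below is a placeholder
df.writeStream \
    .format("parquet") \
    .option("path", "path/to/destination/dir") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .start()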

format can be parquet, json, csv, orc, etc.

- Kafka sink: stores the output to one or more topics in Kafka

In [None]:
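# df is expected to have a "value" column; the checkpoint path is a placeholder
df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("topic", "updates") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .start()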

- Console sink (for debugging): prints the output to stdout every time there is a trigger

In [None]:
df.writeStream \
    .format("console") \
    .start()

- ForeachBatch: runs custom write logic on every micro-batch of the output

In [None]:
def foreach_batch_function(df, epoch_id):
    # Custom function that transforms and writes df to storage
    pass
  
df.writeStream \
    .foreachBatch(foreach_batch_function) \
    .start()

- Foreach sink: runs custom write logic on every row of the output

In [None]:
def process_row(row):
    # Custom function that writes row to storage
    pass
    
df.writeStream \
    .foreach(process_row) \
    .start()
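
Besides a plain function, `foreach` also accepts an object with a `process(row)` method and optional `open(partition_id, epoch_id)` and `close(error)` methods. The sketch below shows the shape of such an object and simulates, in plain Python, the calls Spark would make for one partition. The class name `ListWriter` and its in-memory `written` list are hypothetical stand-ins for a real storage client.

```python
class ListWriter:
    """Hypothetical writer that collects rows in memory instead of a real store."""

    written = []  # stands in for external storage

    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch; return True to process its rows
        self.buffer = []
        return True

    def process(self, row):
        # Called for every row of the partition
        self.buffer.append(row)

    def close(self, error):
        # Called at the end; flush only if no error occurred
        if error is None:
            ListWriter.written.extend(self.buffer)

# Stand-in for what Spark does with one partition of one micro-batch
writer = ListWriter()
if writer.open(partition_id=0, epoch_id=0):
    for row in [{"word": "cat"}, {"word": "dog"}]:
        writer.process(row)
writer.close(None)

print(ListWriter.written)
```

In a real query the object would be passed as `df.writeStream.foreach(ListWriter()).start()`. There the writer runs on the executors, so a class-level list would not be visible on the driver; `open` is the natural place to create a connection and `close` the place to release it.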

## Output Mode

- *Complete Mode*: the entire updated result table is written to the sink after every trigger.

In [None]:
df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

- *Append Mode*: only the rows added to the result table since the last trigger are written.

In [None]:
df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

- *Update Mode*: only the rows that were added or updated since the last trigger are written.

In [None]:
df.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
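
The difference between the three modes can be seen by comparing the result table before and after a trigger. The sketch below is a plain-Python simplification, not Spark code; the function `emit` and the sample word counts are assumptions made for this example.

```python
def emit(previous, current, mode):
    """Return the rows of the result table that a sink would receive after one trigger.

    previous and current map a key to its aggregated value before and after the trigger.
    """
    if mode == "complete":
        # Every row of the result table
        return dict(current)
    if mode == "update":
        # Rows that are new or whose value changed
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Rows that did not exist before
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")

# Hypothetical word counts before and after a micro-batch containing "dog owl"
previous = {"cat": 1, "dog": 2}
current = {"cat": 1, "dog": 3, "owl": 1}

for mode in ("complete", "append", "update"):
    print(mode, emit(previous, current, mode))
```

The sketch ignores one restriction: Spark allows append mode only for queries whose emitted rows can never change later, such as queries without aggregation or aggregations bounded by a watermark. A running word count like the one above would therefore be run in complete or update mode.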

## Exercises

Lab Unit_8_socket_wordcount:
- Review the code of the app
- Run the app
- Test different output modes

## Learning More

- DStream: [Spark Streaming Programming Guide (legacy)](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)