# 3. Structured Streaming

Scalable and fault-tolerant streaming engine built on the Spark SQL engine.

__Features__

* Express streaming computation the same way as a batch computation on static data - the engine takes care of running it incrementally and continuously
* Use the Dataset/DataFrame API to express streaming computations (aggregations, event-time windows, joins)
* Computation is executed on the same optimized Spark SQL engine
* The system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs

__How Structured Queries Work__

1. Default - queries are processed using a _micro-batch processing engine_, which processes data streams as a series of small batch jobs. This achieves end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees
2. Continuous Processing - a low-latency mode (introduced in Spark 2.3) that can achieve end-to-end latencies as low as 1 millisecond, but with at-least-once guarantees (faster, but weaker delivery guarantees)
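
As a rough illustration, the processing mode is chosen through the `trigger` setting on the stream writer. The helper `trigger_kwargs` below is hypothetical (it is not part of PySpark); it only builds the keyword arguments that `DataStreamWriter.trigger()` accepts, and the one-second interval is an arbitrary assumption.

```python
def trigger_kwargs(mode, interval="1 second"):
    """Build keyword arguments for DataStreamWriter.trigger() (hypothetical helper)."""
    if mode == "micro-batch":
        # Start a new micro-batch every `interval`
        return {"processingTime": interval}
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval
        return {"continuous": interval}
    raise ValueError(f"unknown mode: {mode!r}")


print(trigger_kwargs("micro-batch"))  # {'processingTime': '1 second'}
print(trigger_kwargs("continuous"))   # {'continuous': '1 second'}

# Usage with a streaming DataFrame `df` (not executed here):
# df.writeStream.format("console").trigger(**trigger_kwargs("micro-batch")).start()
```

Continuous Processing supports only a restricted set of queries (map-like operations such as projections and filters), so an aggregating query such as a word count would still use the default micro-batch mode.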

The sections below explain the concepts behind the micro-batch and Continuous Processing models and the APIs used.

# Example

Maintain a running word count of text data received from a data server listening on a TCP socket.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create streaming DataFrame that represents text data received from stream hosted at localhost, and
# transforms the DataFrame to calculate word count

lines = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()
words = lines.select(
	explode(
		split(lines.value, " ")
	).alias('word')
)
word_counts = words.groupBy('word').count()


__`lines` DataFrame explained__

`lines` represents an unbounded table containing the streaming text data. The table has a single string column, `value`, and each row is one line of the streaming text. `split` breaks each line into individual words and `explode` creates a new row for every word (not only unique ones), in a column aliased as `word`.

`word_counts` then groups the rows by the unique values of `word` and counts them. It is a streaming DataFrame which represents the running word counts of the stream.
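
To make the idea of a running count concrete, here is a minimal plain-Python sketch of what the query computes, with no Spark involved. The function `update_counts` and the sample lines are assumptions made up for illustration; Spark performs the equivalent work incrementally on each batch of new rows.

```python
from collections import Counter


def update_counts(running, batch_lines):
    """Fold one batch of text lines into the running word counts."""
    for line in batch_lines:
        # split + explode: one entry per word in the line
        for word in line.split(" "):
            # groupBy('word').count(): bump the count for that word
            running[word] += 1
    return running


running = Counter()
update_counts(running, ["apache spark", "apache hadoop"])  # first batch
update_counts(running, ["spark streaming"])                # second batch

print(dict(running))
# {'apache': 2, 'spark': 2, 'hadoop': 1, 'streaming': 1}
```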

In [None]:
# Print the complete set of counts to the console every time they are updated
query = word_counts \
	.writeStream \
	.outputMode('complete') \
	.format('console') \
	.start()

query.awaitTermination()

The code above waits for lines of text from the source (localhost:9999) and processes them as they arrive. The command below starts a Netcat data server to send that text; Netcat is a small utility found in most Unix-like systems. Start it before the query (typically in a separate terminal), because the socket source needs a server to connect to.

In [None]:
!nc -lk 9999

# Programming Model

The key idea is to treat a live data stream as a table that is being continuously appended to. This makes the stream processing model very similar to a batch processing model. Computations are expressed as a batch-like query, and Spark runs it as an _incremental_ query on the _unbounded_ input table.

## Basic Concepts

Every data item that arrives on the stream is like a new row being appended to the unbounded Input Table. A query on the input generates the Results Table, which reflects the result of the query over all input received so far. At every trigger interval (say, every second), new rows are appended to the Input Table, which in turn updates the Results Table. Whenever the Results Table is updated, the changed rows should be written to an external sink.

![Updating the Results Table](https://spark.apache.org/docs/latest/img/structured-streaming-model.png)

The "Output" is defined as what gets written out to the external storage. The output can be defined in different modes:

1. __Complete Mode__ - The entire updated Results Table is written to the external storage. It is up to the storage connector to decide how to handle writing the entire table.
2. __Append Mode__ - Only the new rows appended to the Results Table since the last trigger are written. This is applicable only to queries where existing rows in the Results Table are not expected to change.
3. __Update Mode__ - Only rows that were updated in the Results Table since the previous trigger are written. This differs from Complete Mode in that only the changed rows are written, not all rows. If the query contains no aggregations, this is equivalent to Append Mode.

_Note: Each mode is applicable on certain types of queries_
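
The difference between the three modes can be sketched in plain Python by comparing the Results Table before and after one trigger. This is a simplified model, not Spark's implementation; the function `emit` and the sample tables are assumptions for illustration.

```python
def emit(previous, current, mode):
    """Return the rows written to the sink after one trigger (simplified model)."""
    if mode == "complete":
        # Every row of the updated Results Table
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that did not exist before
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown mode: {mode!r}")


previous = {"apache": 1, "spark": 1}
current = {"apache": 2, "spark": 1, "hadoop": 1}

print(emit(previous, current, "complete"))  # {'apache': 2, 'spark': 1, 'hadoop': 1}
print(emit(previous, current, "update"))    # {'apache': 2, 'hadoop': 1}
print(emit(previous, current, "append"))    # {'hadoop': 1}
```

Note that in this example Append Mode would miss the changed `apache` row, which illustrates why it suits only queries whose existing result rows never change.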

In the Example above, `lines` is the Input Table and the final `word_counts` is the Results Table. Spark continuously checks for new data from the socket connection and runs an "incremental" query that combines the previous running counts with the new data. The query that produces `word_counts` from `lines` is written exactly as it would be on a static DataFrame.


## Handling Event-time and Late Data

In this model, Spark is responsible for updating the Results Table when there is new data, thus relieving users from reasoning about fault-tolerance and data consistency.

Event-time is the time embedded in the data itself (usually when the data was generated). In this model, each event is a row in the table and event-time is a column value in that row. Window-based aggregation is then just a grouping and aggregation on the event-time column: each time window is a group, and each row can belong to multiple windows. Such queries can therefore be defined consistently on both a static dataset and a data stream.
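
To show how one row can belong to multiple windows, the following plain-Python sketch lists every sliding window that contains a given event-time. The function `windows_for` and the 10-minute window sliding every 5 minutes are assumptions for illustration; in PySpark the grouping would typically be expressed with `window(df.timestamp, "10 minutes", "5 minutes")` from `pyspark.sql.functions`, where `timestamp` is an assumed column name.

```python
from datetime import datetime, timedelta


def windows_for(event_time, size, slide):
    """Return (start, end) of every sliding window containing event_time.

    Windows are aligned to the Unix epoch and are half-open: [start, end).
    """
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that begins at or before event_time
    start = event_time - (event_time - epoch) % slide
    windows = []
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event = datetime(2024, 1, 1, 12, 7)
for start, end in windows_for(event, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "-", end.time())
# 12:00:00 - 12:10:00
# 12:05:00 - 12:15:00
```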

Also, since Spark is updating the Results Table on a regular basis, it can update old aggregates whenever "late" data arrives. _Watermarking_ lets the user specify the threshold for late data, which allows the engine to clean up old state accordingly.
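
The effect of a watermark can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Spark's implementation: times are integer seconds, windows are tumbling, and the class `WatermarkedWindowCount` is hypothetical. In PySpark the threshold would be declared with `DataFrame.withWatermark`, for example `df.withWatermark("timestamp", "10 minutes")`, where the column name is assumed.

```python
class WatermarkedWindowCount:
    """Tumbling-window counts with a watermark; times are integer seconds."""

    def __init__(self, window_size, delay_threshold):
        self.window_size = window_size
        self.delay_threshold = delay_threshold
        self.max_event_time = 0
        self.state = {}  # window start -> running count

    def process_batch(self, event_times):
        # The watermark used for this batch is based on data seen so far
        watermark = self.max_event_time - self.delay_threshold
        dropped = []
        for t in event_times:
            start = t - t % self.window_size
            if start + self.window_size <= watermark:
                dropped.append(t)  # too late: its window is already closed
                continue
            self.state[start] = self.state.get(start, 0) + 1

        # Advance the watermark, then evict windows that can no longer change
        self.max_event_time = max([self.max_event_time, *event_times])
        watermark = self.max_event_time - self.delay_threshold
        finalized = {s: c for s, c in self.state.items()
                     if s + self.window_size <= watermark}
        for s in finalized:
            del self.state[s]
        return finalized, dropped


counter = WatermarkedWindowCount(window_size=10, delay_threshold=10)
print(counter.process_batch([12, 14, 27]))  # ({}, [])
print(counter.process_batch([15, 41]))      # ({10: 3, 20: 1}, [])
print(counter.process_batch([18]))          # ({}, [18])
```

The event at time 15 arrives late but within the threshold, so it still updates its window. The event at time 18 arrives after its window has been finalized and its state removed, so it is dropped.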

## Fault Tolerance Semantics

* Streaming sources, sinks and execution engine track the exact progress of the processing to handle any kind of failure by restarting or reprocessing
* Every streaming source is assumed to have an offset to track the read position in the stream
* Checkpointing and write-ahead logs are used to track the offset range of the data processed
* Streaming sinks are designed to be idempotent, so that reprocessing the same data after a failure does not produce duplicate output
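
How replayable offsets and an idempotent sink combine into exactly-once output can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `IdempotentSink`, `run`, and the dict used as a checkpoint are hypothetical names. In PySpark the checkpoint directory is set on the writer with `.option("checkpointLocation", ...)`.

```python
class IdempotentSink:
    """Sink keyed by batch id: rewriting a batch replaces it, never duplicates it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)


def run(source, sink, checkpoint, batch_size, fail_on_batch=None):
    """Process `source` from the checkpointed offset; optionally simulate a crash."""
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        end = min(start + batch_size, len(source))
        batch_id = checkpoint["batch_id"]
        sink.write(batch_id, source[start:end])
        if batch_id == fail_on_batch:
            # Crash after the write but before the progress is committed
            raise RuntimeError("simulated failure")
        # Commit progress only after the batch was written
        checkpoint["offset"], checkpoint["batch_id"] = end, batch_id + 1


source = list("abcdef")
sink = IdempotentSink()
checkpoint = {"offset": 0, "batch_id": 0}  # stands in for durable storage

try:
    run(source, sink, checkpoint, batch_size=2, fail_on_batch=1)
except RuntimeError:
    pass  # the "restart" below resumes from the last committed offset

run(source, sink, checkpoint, batch_size=2)
delivered = [row for b in sorted(sink.batches) for row in sink.batches[b]]
print(delivered)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Batch 1 is written twice (once before the crash, once after the restart), but because the source can replay the same offset range and the sink overwrites by batch id, each row appears exactly once in the output.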
