# 3. Structured Streaming

Scalable and fault-tolerant streaming engine built on the Spark SQL engine.

__Features__

* Express streaming computations the same way as batch computations on static data - the engine takes care of running them incrementally and continuously as new data arrives
* Use the Dataset/DataFrame API to express streaming operations such as aggregations and joins
* Computation is executed on the same Spark SQL engine
* The system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs

__How Structured Queries Work__

1. Default - queries are processed using a _micro-batch processing engine_, which processes data streams as a series of small batch jobs, achieving end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees
2. Continuous Processing - achieves even lower latencies (as low as 1 millisecond) with at-least-once guarantees (faster, but with a weaker delivery guarantee)
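
The two modes are selected through the trigger set on the stream writer. The following is a minimal sketch rather than a definitive recipe: `trigger_kwargs` and `start_console_query` are hypothetical helper names introduced here, and the `"1 second"` interval is an arbitrary choice. It assumes a streaming DataFrame is passed in and relies on the `processingTime` and `continuous` keyword arguments of `DataStreamWriter.trigger`.

```python
def trigger_kwargs(mode, interval="1 second"):
    """Return keyword arguments for DataStreamWriter.trigger().

    Hypothetical helper: maps a mode name to the matching trigger keyword.
    """
    if mode == "micro-batch":
        # Start a new micro-batch at most once per interval.
        return {"processingTime": interval}
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown processing mode: {mode!r}")


def start_console_query(streaming_df, mode):
    """Start a console query on a streaming DataFrame in the given mode."""
    return (
        streaming_df.writeStream
        .format("console")
        .trigger(**trigger_kwargs(mode))
        .start()
    )
```

Continuous mode only supports map-like operations such as projections and selections, so an aggregation like a word count would have to run in the default micro-batch mode.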

The following sections explain the concepts behind the micro-batch and Continuous Processing models and the APIs used.

# Example

Maintain a running word count of text data received from a data server listening on a TCP socket.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create a streaming DataFrame that represents text data received from a server
# listening on localhost:9999, then transform it to calculate word counts

lines = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()
words = lines.select(
	explode(
		split(lines.value, " ")
	).alias('word')
)
word_counts = words.groupBy('word').count()


__`lines`, `words` and `word_counts` DataFrames explained__

`lines` represents an unbounded table containing the streaming text data. It has a single string column named `value`, and each row is one line of the streaming text. `split` breaks each line into an array of words and `explode` creates a new row for each word (duplicates included) in a column aliased as `word`. The result is the `words` DataFrame.

`word_counts` groups `words` by the unique values in the `word` column and counts the rows in each group. This is a streaming DataFrame which represents the running word counts of the stream.
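
To make the row-level effect of `split`, `explode` and `groupBy(...).count()` concrete, here is a plain-Python analogy on two made-up input lines. It does not use Spark; the names `line_rows`, `word_rows` and `counts` are illustrative stand-ins for the three DataFrames, and the sample sentences are invented.

```python
from collections import Counter

# Stand-in for `lines`: one row per line of text, in a column named "value".
line_rows = [{"value": "apache spark"}, {"value": "apache hadoop"}]

# Stand-in for split + explode: one output row per word, duplicates kept.
word_rows = [
    {"word": word}
    for row in line_rows
    for word in row["value"].split(" ")
]

# Stand-in for groupBy('word').count(): one count per distinct word.
counts = Counter(row["word"] for row in word_rows)

print(word_rows)
print(dict(counts))
```

Note that `word_rows` holds four rows, one per word occurrence, while `counts` holds three entries, one per distinct word.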

In [None]:
# Print the complete set of word counts to the console every time they are updated
query = word_counts \
	.writeStream \
	.outputMode('complete') \
	.format('console') \
	.start()

query.awaitTermination()

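The above code waits for sentence data from the source (localhost:9999) and processes it as it arrives. The socket source connects when the query starts, so the data server must already be listening at that point. The command below starts a Netcat data server to send the sentence data. Netcat is a small utility found in most Unix-like systems. It blocks while waiting for input, so it is usually run in a separate terminal (without the leading `!`).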

In [None]:
!nc -lk 9999
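
Where Netcat is not available, a small Python TCP server could stand in for it. This is only a sketch under assumptions: `open_server` and `send_lines` are hypothetical helper names, the server handles a single client, and it sends a fixed list of lines instead of reading from the keyboard as `nc` does. It sends newline-terminated UTF-8 text, which is what the socket source reads.

```python
import socket


def open_server(host="localhost", port=9999):
    """Bind and listen on host:port, returning the listening socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client, send each line newline-terminated, then close."""
    with server:
        conn, _address = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
```

Calling `send_lines(open_server(), ["apache spark", "apache hadoop"])` in a separate process before starting the query would block until the query connects, send the two lines, and then close the connection.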

# Programming Model

The key idea is to treat a live data stream as a table that is being continuously appended to.
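
The idea can be illustrated without Spark. The sketch below is a plain-Python analogy, not the engine's actual implementation: each arriving batch of lines is treated as new rows appended to an unbounded table, and the running result is updated incrementally. `append_batch` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def append_batch(running_counts, new_lines):
    """Update the running word counts with newly appended rows only."""
    for line in new_lines:
        running_counts.update(line.split(" "))
    return running_counts


# Rows arrive over time in small batches.
batches = [["apache spark", "apache hadoop"], ["spark streaming"]]

# Incremental view: fold each batch into the running result as it arrives.
running = Counter()
for batch in batches:
    append_batch(running, batch)

# Static view: the same query run once over every row appended so far.
all_rows = [line for batch in batches for line in batch]
static_result = append_batch(Counter(), all_rows)

print(running == static_result)
```

The incremental result matches what a batch query over the full table would return, which is why the same query can be written for static and streaming data.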