# 3. API using Datasets and DataFrames

`SparkSession` is the entry point for creating streaming DataFrames/Datasets.

## Creating streaming Datasets and DataFrames

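* `DataStreamReader` -- configures the source (format, schema, options) and creates a streaming DataFrame when `load()` is called
* `SparkSession.readStream` -- returns a `DataStreamReader` (a property in PySpark, the method `readStream()` in Scala/Java)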

### Input Sources

* __File Source__ - Reads files written to a directory as a stream of data. Files are processed in order of _modification time_. Files must be placed in the given directory atomically, which in most file systems can be achieved with a file move operation
	* supports glob paths, but not multiple comma-separated paths/globs
	* archiving or deleting completed files (the `cleanSource` option) introduces overhead that slows down micro-batch processing. On the other hand, enabling this option reduces the cost of listing source files, which can be expensive
	* the source path should not be used by multiple sources or queries, and it should not match any files in the output directory of a file stream sink
	* delete and move actions are best effort. Failing to delete or move a file will not fail the streaming query
* __Kafka Source__ - Reads data from a Kafka source
* __Socket Source (for testing)__ - Reads UTF8 text data from a socket connection. The listening socket is at the driver. This source should only be used for testing, because exactly-once fault tolerance is not guaranteed.
	* host and port numbers must be provided
* __Rate Source (for testing)__ - Generates data at a specified number of rows per second. Each output row contains a `timestamp` (Timestamp type) holding the time of message dispatch and a `value` (Long type) holding the message count. The source is intended for testing and benchmarking
	* the source will try to reach the `rowsPerSecond` parameter
	* `numPartitions` can be tweaked to help reach the desired speed
* __Rate Per Micro-Batch Source (for testing)__ - Provides a specified number of rows per micro-batch. Unlike the rate source, it produces a consistent set of input rows for each batch regardless of query execution (e.g. batch 0 gets values 0-99, batch 1 gets values 100-199, ...). Both `timestamp` and `value` are provided. The source is intended for testing and benchmarking
	* similar comments to the rate source apply
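
The batch-to-value mapping described for the rate per micro-batch source can be sketched in plain Python. This only illustrates the arithmetic and is not Spark's implementation; `values_for_batch` and `rows_per_batch` are hypothetical names, and values are assumed to start at 0.

```python
def values_for_batch(batch_id: int, rows_per_batch: int) -> range:
    """Values assigned to one micro-batch, assuming values start at 0."""
    start = batch_id * rows_per_batch
    return range(start, start + rows_per_batch)


# With 100 rows per batch, every run yields the same values for a given batch id.
batch_0 = values_for_batch(0, 100)
batch_1 = values_for_batch(1, 100)
print(batch_0[0], batch_0[-1])
print(batch_1[0], batch_1[-1])
```

Because the values depend only on the batch id, re-running a batch reproduces exactly the same input, which is what makes this source convenient for tests.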

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("API using Datasets and DataFrames").getOrCreate()

# Example 1 - socket stream
df_socket = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()

df_socket.isStreaming # Property (not a method) in PySpark: True for DataFrames that have streaming sources

df_socket.printSchema()

# Example 2 - Read all the CSV files written to a directory
userSchema = StructType().add('name', 'string').add('age', 'integer')
df_csv = spark \
	.readStream \
	.option('sep', ';') \
	.schema(userSchema) \
	.csv('path/to/directory')

Some operations (e.g. `map`, `flatMap`) require the type to be known at compile time. The streaming DataFrames above are untyped, meaning their schema is checked only at runtime, not at compile time. Untyped streaming DataFrames can be converted to typed streaming Datasets using the same methods as static DataFrames.

_Refer to 2a. Getting Started - Programmatically specifying the schema to know how to do this_

## Schema inference and partitioning of streaming DataFrames/Datasets

By default, file-based sources require the schema to be specified in Structured Streaming. For ad-hoc use cases, schema inference can be re-enabled by setting `spark.sql.streaming.schemaInference` to `true`.

Partition discovery occurs when subdirectories named `/key=value/` are present, and listing automatically recurses into these directories. The directories that make up the partitioning scheme must be present when the query starts and must remain static, i.e. the set of partition columns cannot change while the query runs.

For example, it is okay to add `/data/year=2016/` when `/data/year=2015/` was present, but it is invalid to change the partitioning column (i.e. by creating the directory `/data/date=2016-04-17/`).
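
The rule above can be checked with a small plain-Python sketch. It is not how Spark performs partition discovery; `partition_columns` and `same_partitioning` are hypothetical helpers that only extract the `key` part of `key=value` directory names and compare them.

```python
from pathlib import PurePosixPath
from typing import List


def partition_columns(path: str) -> List[str]:
    """Return the partition column names encoded as key=value directories."""
    return [part.split("=", 1)[0] for part in PurePosixPath(path).parts if "=" in part]


def same_partitioning(existing: str, new: str) -> bool:
    """True if the new directory uses the same partition columns as the existing one."""
    return partition_columns(existing) == partition_columns(new)


# Adding another value of the same partition column keeps the scheme static.
print(same_partitioning("/data/year=2015/", "/data/year=2016/"))
# Introducing a different partition column changes the scheme.
print(same_partitioning("/data/year=2015/", "/data/date=2016-04-17/"))
```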

---

# Operations on streaming DataFrames/Datasets

Both untyped, SQL-like operations (e.g. SELECT, WHERE, GROUP BY) and typed, RDD-like operations (e.g. map, filter, flatMap) can be applied to streaming DataFrames/Datasets.

## Basic Operations - Selection, Projection, Aggregation

Most of the common operations on DataFrames/Datasets are also supported for their streaming versions.

In [1]:
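# Streaming DataFrame with IoT device data with schema
# {device: string, deviceType: string, signal: double, time: Timestamp}
# NOTE: the socket source itself only yields a string column `value`;
# it stands in here for a source that provides the schema above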
df = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()

# Select the devices which have a signal of more than 10
df.select('device').where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()

# Register streaming DataFrame as a temporary view and apply SQL commands
df.createOrReplaceTempView('updates')
spark.sql("SELECT count(*) FROM updates")

# Identify whether DataFrame has streaming data (property, not a method)
df.isStreaming


## Window Operations on Event Time

Aggregations over an event-time window work similarly to grouped aggregations: aggregate values are maintained for each window that the event time of a row falls within.

Example: we want to count the words within a 10-minute window that updates every 5 minutes (e.g. 12:00-12:10, 12:05-12:15, 12:10-12:20, ...). A word that arrives at 12:07 should update the counts of 2 windows, 12:00-12:10 and 12:05-12:15. Counts are indexed by both the grouping key and the window, as illustrated below:

![streaming window example](https://spark.apache.org/docs/latest/img/structured-streaming-window.png)

* `window(<column>, <window_interval>, <sliding_interval>)` - assigns each row to one or more time windows based on the given timestamp column

In [None]:
from pyspark.sql.functions import window
# schema - {timestamp: Timestamp, word: String}
words = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()

# Group the data by window and word and compute the count of each group
windowed_count = words.groupBy(
	window(words.timestamp, '10 minutes', '5 minutes'),
	words.word
).count()
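
To make the 12:07 example concrete, the following plain-Python sketch lists the windows a single event time falls into. It is an illustration of the window arithmetic, not Spark's implementation; `windows_for` is a hypothetical helper, the date is arbitrary, and windows are assumed to be aligned to the Unix epoch.

```python
from datetime import datetime, timedelta

# Assumption: window boundaries are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1)


def windows_for(event_time, window, slide):
    """Return every [start, end) window that contains event_time."""
    # Start of the most recent window that begins at or before the event.
    latest_start = event_time - (event_time - EPOCH) % slide
    result = []
    start = latest_start
    # Walk backwards while the window still covers the event.
    while start + window > event_time:
        result.append((start, start + window))
        start -= slide
    return sorted(result)


event = datetime(2024, 1, 1, 12, 7)
for start, end in windows_for(event, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "-", end.time())
```

Passing the same duration for `window` and `slide` yields exactly one window per event, which corresponds to a tumbling window.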

## Handling Late Data and Watermarking

Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time, so that late data can correctly update the aggregates of old windows.

However, the system needs to bound the amount of intermediate in-memory state it accumulates, i.e. it needs to know when an old aggregate can be dropped from the in-memory state.

_Watermarking_ - specify the event-time column and a threshold on how late the data is expected to be in terms of event time. The watermark is the maximum event time seen by the engine minus the late threshold. For a specific window ending at time $T$, the engine maintains the state and allows late data to update it until $(max\ event\ time\ seen\ by\ engine - late\ threshold > T)$. Late data within the threshold is aggregated, while data later than the threshold starts getting dropped.

* `.withWatermark(<column>, <threshold>)` - defines the watermark: the event-time column and how late data is allowed to be

In [None]:
# schema - {timestamp: Timestamp, word: String}
words = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()

windowed_counts = words \
	.withWatermark('timestamp', '10 minutes') \
	.groupBy(
		window(words.timestamp, '10 minutes', '5 minutes'),
		words.word
	) \
	.count()
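
The watermark condition can be traced by hand with a small plain-Python sketch. It is a simplification under stated assumptions, not Spark's implementation: `WatermarkTracker` is a hypothetical class, and it updates the watermark on every event, whereas the engine advances it between micro-batches.

```python
from datetime import datetime, timedelta


class WatermarkTracker:
    """Tracks watermark = max event time seen - late threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.max_event_time = None

    def observe(self, event_time):
        # Only a newer event time can advance the watermark.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.threshold

    def window_is_open(self, window_end):
        # State for a window ending at T is kept until watermark > T.
        return self.watermark is None or self.watermark <= window_end


tracker = WatermarkTracker(timedelta(minutes=10))
window_end = datetime(2024, 1, 1, 12, 10)

tracker.observe(datetime(2024, 1, 1, 12, 14))   # watermark becomes 12:04
open_after_first = tracker.window_is_open(window_end)

tracker.observe(datetime(2024, 1, 1, 12, 21))   # watermark becomes 12:11
open_after_second = tracker.window_is_open(window_end)

print(open_after_first, open_after_second)
```

Once the watermark passes 12:10, a late word with event time 12:07 can no longer update the 12:00-12:10 window, although it may still be counted in 12:05-12:15.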

Example of how watermarking will work

![watermark example](https://spark.apache.org/docs/latest/img/structured-streaming-watermark-update-mode.png)

## Types of time windows

1. __Tumbling Windows__ - a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window
2. __Sliding Windows__ - also fixed-sized, but windows can overlap if the duration of the slide is smaller than the duration of the window. Inputs can be bound to multiple windows
3. __Session Windows__ - a session window starts when an input is received and is extended if a new input is received within the specified gap duration. If no input is received within the gap duration, the window closes, and the next input starts a new window. The window length is therefore dynamic
	* functions/expressions can be used to specify the gap duration dynamically based on the input row. With dynamic gap durations, the closing of a session window no longer depends on the latest input; its range is the union of all events' ranges

1 and 2 use `window()`, 3 uses `session_window()`

In [None]:
from pyspark.sql.functions import session_window

events = spark \
	.readStream \
	.format('socket') \
	.option('host', 'localhost') \
	.option('port', 9999) \
	.load()

# Static session_window
static_sessionized_counts = events \
	.withWatermark('timestamp', '10 minutes') \
	.groupBy(
		session_window(events.timestamp, '5 minutes'),
		events.userId
	).count()

# Dynamic session_window: the gap duration depends on the input row
from pyspark.sql import functions as F

# Use a distinct name so the imported session_window function is not shadowed
dynamic_window = session_window(
	events.timestamp,
	F.when(events.userId == 'user1', '5 seconds')
	.when(events.userId == 'user2', '20 seconds')
	.otherwise('5 minutes')
)

dynamic_sessionized_counts = events \
	.withWatermark('timestamp', '10 minutes') \
	.groupBy(
		dynamic_window,
		events.userId
	).count()
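
How a session window grows and closes can be illustrated in plain Python. This is a minimal sketch, not Spark's implementation: `sessionize` is a hypothetical helper, it uses a single static gap, and it assumes an event that lands exactly on the end of a session starts a new one.

```python
from datetime import datetime, timedelta


def sessionize(event_times, gap):
    """Group event times into (start, end, count) sessions.

    Each event opens a range [t, t + gap); overlapping ranges are merged.
    """
    sessions = []
    for t in sorted(event_times):
        if sessions and t < sessions[-1][1]:
            # The event falls inside the current session, so extend it.
            start, end, count = sessions[-1]
            sessions[-1] = (start, max(end, t + gap), count + 1)
        else:
            # No open session covers this event, so start a new one.
            sessions.append((t, t + gap, 1))
    return sessions


day = datetime(2024, 1, 1)
events_for_user = [
    day.replace(hour=12, minute=0),
    day.replace(hour=12, minute=3),
    day.replace(hour=12, minute=7),
    day.replace(hour=12, minute=20),
]
for start, end, count in sessionize(events_for_user, timedelta(minutes=5)):
    print(start.time(), "-", end.time(), count)
```

The first three events chain into one session because each arrives before the previous range ends, while the event at 12:20 arrives after the gap has elapsed and opens a second session.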

## Conditions for watermarking to clean aggregation state