### Introduction to Structured Streaming
Structured Streaming is an efficient way to ingest large quantities of data from a variety of sources. This course teaches you how to use Structured Streaming to ingest data from files and publish-subscribe systems. Starting with the fundamentals of streaming systems, we introduce concepts such as reading streaming data, writing streaming data out to directories, displaying streaming data, and triggers. We discuss the problems associated with aggregating streaming data, then show how to solve them using structures called windows and how to expire old data using watermarking. Finally, we examine how to connect Structured Streaming with popular publish-subscribe systems to stream data from Wikipedia.

#### Objectives
* **Read, write and display streaming data.**	
* Apply time windows and watermarking to aggregate streaming data.
* Use a publish-subscribe system to stream Wikipedia data in order to visualize meaningful analytics.

First, run the following cell to import the data and make various utilities available for our experimentation.

In [0]:
%run "./Includes/Classroom-Setup"

### The Problem
A stream of data is coming in from a TCP/IP socket, Kafka, EventHub, Kinesis or other sources, but... the data is coming in faster than it can be consumed.
**How can this problem be solved?**

#### The Answer... Implement the Micro-Batch Model
Many APIs solve this problem by employing a Micro-Batch model, wherein a *firehose* of data is collected for a predetermined interval of time (the **Trigger Interval**).<br>
In the following illustration, the **Trigger Interval** is two seconds.

<img src="https://files.training.databricks.com/images/streaming-timeline.png" style="height: 120px;"/>
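As a rough illustration of the micro-batch idea, here is a minimal plain-Python sketch (not Spark's implementation) that groups timestamped events into fixed trigger intervals. The helper name `to_micro_batches` and the sample events are assumptions made up for this example.

```python
from collections import defaultdict

def to_micro_batches(events, trigger_interval):
    """Group (timestamp_seconds, payload) events into fixed-length intervals.

    Hypothetical helper for illustration only; Spark handles this internally.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batch_id = int(ts // trigger_interval)  # index of the interval the event falls in
        batches[batch_id].append(payload)
    return dict(sorted(batches.items()))

# Made-up events: (arrival time in seconds, payload)
events = [(0.2, "a"), (1.7, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = to_micro_batches(events, trigger_interval=2)

for batch_id, payloads in batches.items():
    print(batch_id, payloads)
# 0 ['a', 'b']
# 1 ['c', 'd']
# 2 ['e']
```

With a two-second trigger interval, everything that arrives in seconds 0 to 2 lands in the first micro-batch, and so on.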

<br>

#### Processing the Micro-Batch
For each interval, the data from the previous [two-second] interval must be processed... even as the next micro-batch of data is being collected.<br>
In the following illustration, two seconds' worth of data is being processed in about one second.

<img src="https://files.training.databricks.com/images/streaming-timeline-1-sec.png" style="height: 120px;">

<br>

**But what happens if the data isn't processed quickly enough while reading from the streaming source?** <br>
- If the source were a TCP/IP stream... then data packets would be dropped; i.e., data would be lost.
- If the source were an IoT device measuring the outside temperature every 15 seconds... then this might be OK.<br>
  ...But, if the incoming data represented continuously shifting stock prices... the business impact could be thousands of lost dollars.
- If the source were a Pub/Sub system like Apache Kafka, Azure EventHub or AWS Kinesis... we would simply fall further behind.<br>
  Eventually, the pub/sub system would reach its resource limits, inducing other problems; however, we could relaunch the cluster with enough CPU cores to catch up and remain current.
- **Ultimately, the data for the previous interval must be processed before data from the next interval arrives!**
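The "falling further behind" case can be made concrete with a small simulation. This is a sketch under simplified assumptions (a constant arrival rate and a constant processing capacity per interval); the function name and the numbers are invented for illustration.

```python
def backlog_after(num_intervals, arrival_per_interval, capacity_per_interval):
    """Return the count of unprocessed records after each trigger interval."""
    backlog = 0
    history = []
    for _ in range(num_intervals):
        backlog += arrival_per_interval                 # new data collected this interval
        backlog -= min(backlog, capacity_per_interval)  # what we manage to process
        history.append(backlog)
    return history

keeping_up = backlog_after(5, arrival_per_interval=100, capacity_per_interval=120)
falling_behind = backlog_after(5, arrival_per_interval=100, capacity_per_interval=80)

print(keeping_up)      # [0, 0, 0, 0, 0]
print(falling_behind)  # [20, 40, 60, 80, 100]
```

When capacity is below the arrival rate, the backlog grows without bound, which is why the previous interval's data must be processed before the next interval's data arrives.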

<br>

#### From Micro-Batch to Table
Apache Spark treats a stream of **micro-batches** as if they were a series of continuous updates to a database table.
- This enables a query to be executed against this **input table**... just as if it were a static database table.
- The computation on the **input table** can then be pushed to a **results table**.
- And finally, the **results table** can be written to an output **sink**. 

<img src="https://files.training.databricks.com/images/eLearning/Delta/stream2rows.png" style="height: 200px"/>

##### Spark Structured Streams consist of two parts:
The **Input source** such as 
* Kafka
* Azure Event Hub
* Files on a distributed system
* TCP/IP sockets
  
And the **Sinks** such as
* Kafka
* Azure Event Hub
* Various file formats
* The system console
* Apache Spark tables (memory sinks)
* The completely custom `foreach()` iterator

<br>

##### Update Triggers:
**Triggers** can be defined to control how frequently the **input table** is updated. Each time a trigger fires, Spark checks for new data (new rows for the input table), and updates the result.
*The default value for `DataStreamWriter.trigger(Trigger)` is ProcessingTime(0), and it will run the query as fast as possible.*  This process repeats in perpetuity.

<br>

##### End-to-End Fault Tolerance:
Structured Streaming ensures end-to-end exactly-once fault-tolerance guarantees through _checkpointing_ and <a href="https://en.wikipedia.org/wiki/Write-ahead_logging" target="_blank">Write Ahead Logs</a>.  Structured Streaming sources, sinks, and the underlying execution engine work together to track the progress of stream processing. If a failure occurs, the streaming engine attempts to restart and/or reprocess the data. This approach _only_ works if the streaming source is replayable. To ensure fault-tolerance, Structured Streaming assumes that every streaming source has offsets, akin to:
* <a target="_blank" href="https://kafka.apache.org/documentation/#intro_topics">Kafka message offsets</a>
* <a target="_blank" href="http://docs.aws.amazon.com/streams/latest/dev/key-concepts.html#sequence-number">Kinesis sequence numbers</a>

At a high level, the underlying streaming mechanism relies on a couple of approaches:
* First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
* Next, the streaming sinks are designed to be _idempotent_—that is, multiple writes of the same data (as identified by the offset) do _not_ result in duplicates being written to the sink.

Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure **end-to-end, exactly-once semantics** under any failure condition.
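To see why the combination of a replayable source and an idempotent sink matters, consider the following toy model. It is a plain-Python sketch, not Spark's engine; the classes `ReplayableSource` and `IdempotentSink` and the list used as a write-ahead log are hypothetical stand-ins.

```python
class ReplayableSource:
    """A source whose records can be re-read by offset (similar in spirit to a Kafka topic)."""
    def __init__(self, records):
        self._records = list(records)

    def read(self, start, end):
        # Return [(offset, value), ...] for offsets in [start, end)
        return list(enumerate(self._records))[start:end]


class IdempotentSink:
    """A sink keyed by offset, so writing the same offset twice has no extra effect."""
    def __init__(self):
        self.rows = {}

    def write(self, batch):
        for offset, value in batch:
            self.rows[offset] = value


source = ReplayableSource(["a", "b", "c", "d"])
sink = IdempotentSink()
wal = []                            # write-ahead log of planned offset ranges

# Batch 0: record the intended range first, then write it.
wal.append((0, 2))
sink.write(source.read(0, 2))

# Batch 1: the range is logged and partially written, then the job "crashes".
wal.append((2, 4))
sink.write(source.read(2, 3))       # only offset 2 reached the sink

# Recovery: replay the last logged range in full.
start, end = wal[-1]
sink.write(source.read(start, end))

print(sorted(sink.rows.items()))
# [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
```

Offset 2 is written twice during recovery, yet the sink ends up with exactly one copy of each record.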

#### 1.0 Reading a Stream
The method `SparkSession.readStream` returns a `DataStreamReader` used to configure the stream.
There are a number of key points to the configuration of a `DataStreamReader`:
* The schema
* The type of stream: Files, Kafka, TCP/IP, etc
* Configuration specific to the type of stream
  * For files, the file type, the path to the files, max files, etc...
  * For TCP/IP the server's address, port number, etc...
  * For Kafka the server's address, port, topics, partitions, etc...
  
  
##### 1.1. Defining the Schema:
Every streaming DataFrame must have a schema - the definition of column names and data types.
- Some sources such as Kafka define the schema for you.
- In file-based streaming sources, for example, the schema is user-defined.

In [0]:
# Define the schema using a DDL-formatted string.
dataSchema = "Recorded_At timestamp, Device string, Index long, Model string, User string, _corrupt_record string, gt string, x double, y double, z double"

##### 1.2. Configuring a File Stream
In our example below, we will be consuming files written continuously to a pre-defined directory. 
To control how much data is pulled into Spark at once, we can specify the option `maxFilesPerTrigger`.
In the example below, only one file is read in for every trigger interval: `dsr.option("maxFilesPerTrigger", 1)` <br>
Both the location and file type are specified with the following call, which itself returns a `DataFrame`: `df = dsr.json(dataPath)`

In [0]:
dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data-stream.json"

initialDF = (spark
  .readStream                            # Returns DataStreamReader
  .option("maxFilesPerTrigger", 1)       # Force processing of only 1 file per trigger 
  .schema(dataSchema)                    # Required for all streaming DataFrames
  .json(dataPath)                        # The stream's source directory and file type
)

Given an initial `DataFrame`, transformations can then be applied. The example below illustrates renaming a column and removing an unnecessary column from the schema:

In [0]:
streamingDF = (initialDF
  .withColumnRenamed("Index", "User_ID")  # Pick a "better" column name
  .drop("_corrupt_record")                # Remove an unnecessary column
)

##### 1.3. Streaming DataFrames
Other than the call to `spark.readStream`, it looks just like any other `DataFrame`... But is it a "streaming" `DataFrame`?
You can differentiate between a "static" and "streaming" `DataFrame` with the following call:

In [0]:
streamingDF.isStreaming

Out[16]: True

##### 1.4. Unsupported Operations
Most operations on a "streaming" DataFrame are identical to a "static" DataFrame; however, there are some exceptions:
* Sorting a never-ending stream by `Recorded_At`.
* Aggregating a stream by some criterion.

The cell below shows the first of these operations failing. Aggregation is addressed later in the course using windows and watermarking.

In [0]:
from pyspark.sql.functions import col

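try:
  sortedDF = streamingDF.orderBy(col("Recorded_At").desc())  # Attempt to sort the unaggregated stream
  display(sortedDF)
except Exception:                                            # Avoid a bare except, which would also trap KeyboardInterrupt
  print("Sorting is not supported on an unaggregated stream")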

Sorting is not supported on an unaggregated stream


#### 2.0. Writing a Stream
The method `DataFrame.writeStream` returns a `DataStreamWriter` that is used to configure the output of a stream.

There are a number of parameters to the `DataStreamWriter` configuration:
* Query's name (optional) - This name must be unique among all the currently active queries in the associated SQLContext.
* Trigger (optional) - Default value is `ProcessingTime(0)` and it will run the query as fast as possible.
* Checkpointing directory (optional)
* Output mode
* Output sink
* Configuration specific to the output sink, such as:
  * The host, port and topic of the receiving Kafka server
  * The file format and final destination of files
  * A custom sink via `dsw.foreach(...)`

Once the configuration is completed, the job can be triggered by calling `dsw.start()`

##### 2.1. Triggers
The trigger specifies when the system should process the next set of data.

| Trigger Type                           | Example | Notes |
|----------------------------------------|-----------|-------------|
| Unspecified                            |  | _DEFAULT_ - The next micro-batch will be executed as soon as the system has completed processing the previous micro-batch |
| Fixed interval micro-batches           | `dsw.trigger(Trigger.ProcessingTime("6 hours"))` | The query will be executed in micro-batches and kicked off at the user-specified intervals |
| One-time micro-batch                   | `dsw.trigger(Trigger.Once())` | The query will execute _only one_ micro-batch to process all the available data and then stop on its own |
| Continuous w/fixed checkpoint interval | `dsw.trigger(Trigger.Continuous("1 second"))` | The query will be executed in a low-latency, continuous processing mode. _EXPERIMENTAL_ in 2.3.2 |

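The following illustrates configuring a fixed interval of 3 seconds (the `Trigger.*` examples in this section use the Scala syntax):<br>
`dsw.trigger(Trigger.ProcessingTime("3 seconds"))`

In Python, the equivalent call takes a keyword argument: `dsw.trigger(processingTime="3 seconds")`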
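To build intuition for a fixed-interval trigger, the sketch below computes when each micro-batch would start, given how long each one takes to process. It models the documented behavior only loosely: a batch that finishes early waits for the next interval boundary, while a batch that overruns is followed immediately by the next one. The function name and the durations are assumptions for illustration.

```python
def batch_start_times(durations, interval):
    """Start time (in seconds) of each micro-batch under a fixed-interval trigger.

    durations: made-up processing times, one per micro-batch.
    """
    starts = []
    now = 0
    for duration in durations:
        starts.append(now)
        finished = now + duration
        next_boundary = (now // interval + 1) * interval  # next multiple of the interval
        now = max(finished, next_boundary)                # wait, or start right away if late
    return starts

starts = batch_start_times([1, 5, 1, 2], interval=3)
print(starts)  # [0, 3, 8, 9]
```

The second batch takes 5 seconds, which is longer than the 3-second interval, so the third batch starts at second 8 instead of waiting for a boundary.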


##### 2.2. Checkpointing
A **checkpoint** stores the current state of a streaming job to a reliable storage system such as Amazon S3, Azure Data Lake Storage (ADLS), or Hadoop Distributed File System (HDFS). It does not store the state of your streaming job to the local file system of any node in your cluster. Together with write ahead logs, a terminated stream can be restarted, and it will continue from where it left off.
To enable this feature, you only need to specify the location of a checkpoint directory: `dsw.option("checkpointLocation", checkpointPath)`

Points to consider:
* If you do not have a checkpoint directory, when the streaming job stops, you will lose all state regarding the streaming job... and upon restart, you start from scratch.
* For some sinks, you will get an error if you do not specify a checkpoint directory (e.g., `AnalysisException: 'checkpointLocation must be specified either through option("checkpointLocation", ...)..`)
* Each streaming job requires a dedicated checkpoint directory (i.e., streaming jobs cannot share checkpoint directories).
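The restart behavior described above can be sketched in plain Python. This is not how Spark lays out its checkpoint directory; it is a simplified model in which a JSON file (a hypothetical `checkpoint.json`) stores the next offset to process, and `run_job` is an invented helper.

```python
import json
import os
import tempfile

def run_job(records, checkpoint_path, sink, max_records=None):
    """Process records, resuming from the offset stored in checkpoint_path (if any)."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["next_offset"]   # resume where the last run stopped

    processed = 0
    while offset < len(records) and (max_records is None or processed < max_records):
        sink.append(records[offset])
        offset += 1
        processed += 1
        with open(checkpoint_path, "w") as f:      # persist progress after each record
            json.dump({"next_offset": offset}, f)
    return offset

records = ["r0", "r1", "r2", "r3", "r4"]
sink = []
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    run_job(records, ckpt, sink, max_records=2)    # the job stops after two records
    run_job(records, ckpt, sink)                   # the restart resumes at offset 2

print(sink)  # ['r0', 'r1', 'r2', 'r3', 'r4']
```

Without the checkpoint file, the second call would start again at offset 0 and reprocess `r0` and `r1`.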


##### 2.3. Output Modes


| Mode   | Example | Notes |
| ------------- | ----------- | ------------ |
| **Complete** | `dsw.outputMode("complete")` | The entire updated Result Table is written to the sink. The individual sink implementation decides how to handle writing the entire table. |
| **Append** | `dsw.outputMode("append")`     | Only the new rows appended to the Result Table since the last trigger are written to the sink. |
| **Update** | `dsw.outputMode("update")`     | Only the rows in the Result Table that were updated since the last trigger are written to the sink. Available since Spark 2.1.1. |


The following example illustrates writing to a Parquet directory that only supports the `append` mode: `dsw.outputMode("append")`
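The difference between `complete` and `update` is easiest to see on an aggregation. The following plain-Python sketch keeps a running count per key as the Result Table and yields what a sink would receive at each trigger. It is an illustration only: `run_counts` and the sample batches are made up, and `append` is left out because it applies to rows that are never revised once written, which is not the case for a running count.

```python
from collections import Counter

def run_counts(micro_batches, mode):
    """Yield the rows a sink would receive per trigger for a running count by key."""
    result_table = Counter()
    for batch in micro_batches:
        changed = set(batch)              # keys touched by this micro-batch
        result_table.update(batch)        # fold the new rows into the Result Table
        if mode == "complete":
            yield dict(result_table)      # the whole table, every trigger
        elif mode == "update":
            yield {key: result_table[key] for key in sorted(changed)}  # changed rows only
        else:
            raise ValueError("mode not covered by this sketch: " + mode)

micro_batches = [["a", "b", "a"], ["b", "c"]]
complete_out = list(run_counts(micro_batches, "complete"))
update_out = list(run_counts(micro_batches, "update"))

print(complete_out)  # [{'a': 2, 'b': 1}, {'a': 2, 'b': 2, 'c': 1}]
print(update_out)    # [{'a': 2, 'b': 1}, {'b': 2, 'c': 1}]
```

In the second trigger, `complete` re-emits the unchanged row for `a`, while `update` emits only `b` and `c`.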


##### 2.4. Output Sinks

`DataStreamWriter.format` accepts the following values, among others:

| Output Sink | Example                                          | Notes |
| ----------- | ------------------------------------------------ | ----- |
| **File**    | `dsw.format("parquet")`, `dsw.format("csv")`...  | Dumps the Result Table to a file. Supports Parquet, JSON, CSV, etc.|
| **Kafka**   | `dsw.format("kafka")`      | Writes the output to one or more topics in Kafka |
| **Console** | `dsw.format("console")`    | Prints data to the console (useful for debugging) |
| **Memory**  | `dsw.format("memory")`     | Updates an in-memory table, which can be queried through Spark SQL or the DataFrame API |
| **foreach** | `dsw.foreach(writer: ForeachWriter)` | This is your "escape hatch", allowing you to write your own type of sink. |
| **Delta**    | `dsw.format("delta")`     | Writes the output to a Delta Lake table |


The following example illustrates appending files to a Parquet directory while specifying its location: `dsw.format("parquet").start(outputPathDir)`

<br>

##### 2.5. Starting a Stream
The cells below demonstrate writing data from a streaming query to `outputPathDir`. 

A couple of things to note below:
- The query is being named via the call to `queryName`... which can be used later to reference the query by name.
- Spark begins running jobs once the call to `start` is made... which is the equivalent of calling an action on a "static" DataFrame.
- The call to `start` returns a `StreamingQuery` object... which can be used to interact with the running query.

In [0]:
outputPathDir = workingDir + "/output.parquet" # A subdirectory for our output
checkpointPath = workingDir + "/checkpoint"    # A subdirectory for our checkpoint & W-A logs
streamName = "lesson01_ps"                     # An arbitrary name for the stream

In [0]:
streamingQuery = (streamingDF                   # Start with our "streaming" DataFrame
  .writeStream                                  # Get the DataStreamWriter
  .queryName(streamName)                        # Name the query
  .trigger(processingTime="3 seconds")          # Configure for a 3-second micro-batch
  .format("parquet")                            # Specify the sink type, a Parquet file
  .option("checkpointLocation", checkpointPath) # Specify the location of checkpoint files & W-A logs
  .outputMode("append")                         # Write only new data to the "file"
  .start(outputPathDir)                         # Start the job, writing to the specified directory
)

In [0]:
untilStreamIsReady(streamName)                  # Wait until stream is done initializing...

The stream lesson01_ps is active and ready.


#### 3.0. Managing Streaming Queries

When a query is started, the `StreamingQuery` object can be used to monitor and manage the query.

| Method    |  Description |
| ----------- | ------------------------------- |
|`id`| get unique identifier of the running query that persists across restarts from checkpoint data |
|`runId`| get unique id of this run of the query, which will be generated at every start/restart |
|`name`| get the auto-generated or user-specified name of the query |
|`explain()`| print detailed explanations of the query |
|`stop()`| stop query |
|`awaitTermination()`| block until query is terminated, with stop() or with error |
|`exception`| exception if query terminated with error |
|`recentProgress`| array of most recent progress updates for this query |
|`lastProgress`| most recent progress update of this streaming query |

For example, the following cell demonstrates polling the streaming query for its most recent progress!

In [0]:
streamingQuery.recentProgress                   # Poll the streaming query for its most recent progress

Out[21]: [{'id': '1ec79780-3d27-4663-8828-23e8abbc1314',
  'runId': '1fa231e0-52bc-4a68-92a4-55e32f8ddd32',
  'name': 'lesson01_ps',
  'timestamp': '2022-04-14T18:22:29.491Z',
  'batchId': 0,
  'numInputRows': 80096,
  'inputRowsPerSecond': 0.0,
  'processedRowsPerSecond': 5142.6003210272875,
  'durationMs': {'addBatch': 11901,
   'getBatch': 561,
   'latestOffset': 1301,
   'queryPlanning': 179,
   'triggerExecution': 15565,
   'walCommit': 621},
  'stateOperators': [],
  'sources': [{'description': 'FileStreamSource[dbfs:/mnt/training/definitive-guide/data/activity-data-stream.json]',
    'startOffset': None,
    'endOffset': {'logOffset': 0},
    'latestOffset': None,
    'numInputRows': 80096,
    'inputRowsPerSecond': 0.0,
    'processedRowsPerSecond': 5142.6003210272875}],
  'sink': {'description': 'FileSink[dbfs:/user/wna8fw@virginia.edu/structured_streaming/01_structured_streaming_introduction_psp/output.parquet]',
   'numOutputRows': -1}},
 {'id': '1ec79780-3d27-4663-8828-23e8
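Progress entries like the one above are ordinary Python dictionaries, so they can be reduced to a few headline numbers. The helper below is a hypothetical convenience function, not part of the Spark API; the sample entry is an abridged copy of the first entry in the output above.

```python
def summarize_progress(progress_entries):
    """Reduce progress dictionaries to a few headline numbers per micro-batch."""
    summary = []
    for entry in progress_entries:
        summary.append({
            "batchId": entry["batchId"],
            "numInputRows": entry["numInputRows"],
            "triggerSeconds": entry["durationMs"]["triggerExecution"] / 1000.0,
        })
    return summary

# Abridged copy of the first progress entry shown above
sample = [{
    "batchId": 0,
    "numInputRows": 80096,
    "durationMs": {"addBatch": 11901, "triggerExecution": 15565},
}]

summary = summarize_progress(sample)
print(summary)
# [{'batchId': 0, 'numInputRows': 80096, 'triggerSeconds': 15.565}]
```

In the notebook, `streamingQuery.recentProgress` could be passed in place of `sample`.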

The following example demonstrates how to discover all actively running streams:

In [0]:
for s in spark.streams.active:                  # Iterate over all streams
  print("{}: {}".format(s.id, s.name))          # Print the stream's id and name

1ec79780-3d27-4663-8828-23e8abbc1314: lesson01_ps


The following example demonstrates **`awaitTermination()`** which blocks the current thread:
* Until the stream stops naturally, or... 
* Until the specified timeout elapses (if specified)

If the stream was "canceled", or otherwise terminated abnormally, any resulting exceptions will be thrown by **`awaitTermination()`** as well.

In [0]:
try:
  streamingQuery.awaitTermination(10)           # Stream for up to 10 seconds while the current thread blocks
  print("Awaiting Termination...")

except Exception as e:
  print(e)

Awaiting Termination...


... And once the 10 seconds have elapsed without any error, we can explicitly stop the stream.

In [0]:
try:
  streamingQuery.stop()                         # Issue the command to stop the stream
  print("The stream has been stopped!")

except Exception as e:                          # Bind the exception so that it can be printed
  print(e)

The stream has been stopped!
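The await-then-stop pattern in the two cells above can be mimicked with an ordinary Python thread. This is only an analogy, assuming a made-up `fake_stream` worker: `join(timeout=...)` plays the role of `awaitTermination(timeout)`, and setting the event plays the role of `stop()`.

```python
import threading
import time

stop_requested = threading.Event()

def fake_stream():
    """Stand-in for a streaming query: loops until asked to stop."""
    while not stop_requested.is_set():
        time.sleep(0.01)

worker = threading.Thread(target=fake_stream)
worker.start()

worker.join(timeout=0.1)            # block for at most 0.1 s, then give control back
still_running = worker.is_alive()   # the timeout elapsed, but the "stream" is still going

stop_requested.set()                # analogous to asking the query to stop
worker.join()                       # wait for the worker to finish
stopped = not worker.is_alive()

print(still_running, stopped)       # True True
```

As in the notebook, the timeout returning does not end the background work; an explicit stop is still required.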


When working with streams, in reality, we are working with a separate *thread* of execution. As a result, different exceptions may arise as streams are terminated and/or queried.

For this reason, we have developed a number of utility methods to help with these operations:
* **`untilStreamIsReady(name)`** to wait until a stream is fully initialized before resuming execution.
* **`stopAllStreams()`** to stop all active streams in a fail-safe manner.

The implementation of each of these can be found in the notebook **`./Includes/Common-Notebooks/Utility-Methods`**
<br><br>
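The course's own utility methods live in the notebook named above and are not reproduced here. As a rough idea of the general pattern such helpers tend to follow, here is a generic polling function. `wait_until` and `is_ready` are hypothetical names; in a notebook, the predicate might check whether a named query appears in `spark.streams.active`, which is an assumption about usage rather than the course's implementation.

```python
import time

def wait_until(predicate, timeout_seconds=5.0, poll_seconds=0.05):
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_seconds)
    return predicate()                 # one final check at the deadline

# Simulated readiness: the "stream" reports ready on the third check.
checks = {"count": 0}

def is_ready():
    checks["count"] += 1
    return checks["count"] >= 3

ready = wait_until(is_ready, timeout_seconds=1.0, poll_seconds=0.01)
print(ready)  # True
```

Polling with a deadline avoids blocking forever if the stream never finishes initializing.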

#### 4.0. The Display Function
Within Databricks notebooks we can use the `display()` function to render a live plot; **however...**

When you pass a "streaming" `DataFrame` to `display`:
* A "memory" sink is being used
* The output mode is complete
* The query name is specified with the `streamName` parameter
* The trigger is specified with the `trigger` parameter
* The checkpointing location is specified with the `checkpointLocation` parameter

`display(myDF, streamName = "myQuery")`

Because the only active streaming query was programmatically stopped in the previous cell, we will now call `display` in the following cell to *automatically* start the streaming DataFrame, `streamingDF`. This also affords the opportunity for passing `stream_2p` as a new **name** for this newly started stream.

In [0]:
newStreamName = 'stream_2p'
display(streamingDF, streamName = newStreamName)

Recorded_At,Device,User_ID,Model,User,gt,x,y,z
2015-02-23T10:18:54.958+0000,nexus4_2,0,nexus4,g,stand,-0.0003814697,0.03656006,0.030136108
2015-02-23T10:18:54.964+0000,nexus4_2,1,nexus4,g,stand,-0.001449585,0.035491943,0.027999878
2015-02-23T10:18:54.968+0000,nexus4_2,2,nexus4,g,stand,0.0006866455,0.033355713,0.030136108
2015-02-23T10:18:54.976+0000,nexus4_2,3,nexus4,g,stand,0.0006866455,0.033355713,0.022659302
2015-02-23T10:18:54.978+0000,nexus4_2,4,nexus4,g,stand,-0.0003814697,0.030151367,0.025863647
2015-02-23T10:18:54.992+0000,nexus4_2,5,nexus4,g,stand,-0.0003814697,0.025878906,0.023727417
2015-02-23T10:18:54.992+0000,nexus4_2,6,nexus4,g,stand,-0.001449585,0.02267456,0.023727417
2015-02-23T10:18:54.996+0000,nexus4_2,7,nexus4,g,stand,-0.0003814697,0.019470215,0.024795532
2015-02-23T10:18:54.999+0000,nexus4_2,8,nexus4,g,stand,0.0006866455,0.01626587,0.021591187
2015-02-23T10:18:55.000+0000,nexus4_1,0,nexus4,g,stand,0.80996704,-0.34129333,-0.1672821


In [0]:
untilStreamIsReady(newStreamName)               # Wait until stream is done initializing...

The cell below demonstrates how the value passed to `streamName` in the call to `display` enables programmatic access to the specific stream:

In [0]:
print("Looking for {}".format(newStreamName))

for stream in spark.streams.active:             # Loop over all active streams
  if stream.name == newStreamName:              # Single out "stream_2p"
    print("Found {} ({})".format(stream.name, stream.id)) 

Looking for stream_2p
Found stream_2p (f80be3d6-bdb6-4a99-8889-80664562fa43)


...And finally, all streams can be stopped by calling the `stopAllStreams()` function:

In [0]:
stopAllStreams()

Stopping the stream stream_2p.
The stream stream_2p was stopped.


#### 5.0. Summary
Use cases for streaming include bank card transactions, log files, Internet of Things (IoT) device data, video game play events and countless others.

Some key properties of streaming data include:
* Data coming from a stream is typically not ordered in any way
* The data is streamed into a **data lake**
* The data is coming in faster than it can be consumed
* Streams are often chained together to form a data pipeline
* Streams don't have to run 24/7:
  * Consider the new log files that are processed once an hour
  * Or the financial statement that is processed once a month
  
`readStream` is used to read streaming input from a variety of input sources, and to create a DataFrame.
Nothing happens until either `writeStream` or `display` is invoked. 
`writeStream` can be used to write data to a variety of output sinks.
`display` can be used to draw LIVE bar graphs, charts and other plot types in the notebook.

<br>

**Run the following cell to delete the tables, files, and any other artifacts associated with this lesson.**

In [0]:
%run ./Includes/Classroom-Cleanup

#### 6.0. Review Questions

**Question:** What is Structured Streaming?<br>
**Answer:** A stream is a sequence of data that is made available over time.<br>
In Structured Streaming, we treat a <b>stream</b> of data as a table to which data is continuously appended.<br>
The developer then defines a query on this input table, as if it were a static table, to compute a final result table that will be written to an output <b>sink</b>. 

**Question:** What purpose do triggers serve?<br>
**Answer:** Developers define triggers to control how frequently the input table is updated.

**Question:** How does the micro-batch model work?<br>
**Answer:** We take our firehose of data and collect data for a set interval of time (the Trigger Interval).<br>
For each interval, our job is to process the data from the previous time interval.<br>
As we are processing data, the next batch of data is being collected for us.

**Q:** What do `readStream` and `writeStream` do?<br>
**A:** `readStream` creates a streaming DataFrame.  `writeStream` sends streaming data to a directory or other type of output sink.

**Q:** What does `display` output if it is applied to a DataFrame created via `readStream`?<br>
**A:** `display` sends streaming data to a LIVE graph!

**Q:** When you call `writeStream`, what does the option `outputMode("append")` do?<br>
**A:** It selects the output mode, which can take the following values:
* <b>append</b>: add only new records to the output sink
* <b>complete</b>: rewrite the full output - applicable to aggregation operations
* <b>update</b>: update changed records in place

**Q:** What happens if you do not specify `option("checkpointLocation", pointer-to-checkpoint directory)`?<br>
**A:** When the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

**Q:** How do you view the list of active streams?<br>
**A:** Invoke `spark.streams.active`.

**Q:** How do you verify whether `streamingQuery` is running (boolean output)?<br>
**A:** Invoke `spark.streams.get(streamingQuery.id).isActive`.