# Spark Streaming

> Streaming is an extension of Spark's core API which enables __real-time processing of live data streams__

> __Not all functionality is currently available in Python; to work with the full-powered framework one should use `scala` or `java`__

![](./images/streaming-arch.png)

The flow looks as follows:
1. We hook up to a data stream source (e.g. `Apache Kafka`)
2. Incoming data is divided into batches
3. Batches are processed by the `spark` engine
4. The results are returned as batches as well (e.g. after `MapReduce`-style transformations)

![](./images/streaming-flow.png)
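
Step 2 of the flow (dividing incoming data into batches) can be sketched in plain Python. This is only an illustration of the idea, not how Spark implements it: the `discretize` helper, the `(timestamp, value)` event format and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def discretize(events, batch_duration):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Batch k holds the events with
    k * batch_duration <= timestamp < (k + 1) * batch_duration.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_duration)].append(value)
    if not batches:
        return []
    # Emit every interval up to the last one seen, including empty ones
    return [batches.get(k, []) for k in range(max(batches) + 1)]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_duration=1.0)
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

Note that an interval with no incoming data still produces an (empty) batch, and every later step works on one batch at a time.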

## DStreams (Discretized Streams)

> A high-level abstraction over a stream of data which allows us to work with it easily

These can be created either by:
- Reading directly from our data source
- Transforming existing `DStreams` (which creates new ones at the same time)

> __`DStreams` internally are represented as a sequence of `RDD`s__

Let's start by creating a `SparkSession` (__you should always start like that!__):

In [None]:
import findspark

findspark.init()

In [None]:
import multiprocessing

import pyspark

# We should always start with a session, so that both the session
# and its SparkContext are available when needed
session = pyspark.sql.SparkSession.builder.config(
    conf=pyspark.SparkConf()
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setAppName("TestApp")
).getOrCreate()

## StreamingContext

After that we can create a `pyspark.streaming.StreamingContext` ([docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.html)) object which:
- takes a `SparkContext` as input
- can reuse the context we're using for other things (SQL or core PySpark functionality)
- __does not accept a `SparkSession` object directly though; pass `session.sparkContext` instead!__

In [None]:
from pyspark.streaming import StreamingContext

# This context can be used with PySpark streaming
# batchDuration is the batch interval in seconds, i.e. how often the
# gathered data is turned into a batch and processed (here: every 30 seconds)
ssc = StreamingContext(session.sparkContext, batchDuration=30)

Important things to notice:
- __We set up the whole computation pipeline first__
- __NOTHING__ is started until we call `ssc.start()`; we finish with `ssc.stop()`

Methods of interest which allow us to work with streams:
- [`ssc.awaitTermination([timeout])`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.awaitTermination.html#pyspark.streaming.StreamingContext.awaitTermination) -
    waits for the execution to stop, either indefinitely or until `timeout` (in seconds) elapses
- [`ssc.checkpoint(directory)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.checkpoint.html#pyspark.streaming.StreamingContext.checkpoint) -
    sets the directory in which data is periodically checkpointed for increased fault tolerance
- [`StreamingContext.getActiveOrCreate(checkpointPath, setupFunc)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.getActiveOrCreate.html#pyspark.streaming.StreamingContext.getActiveOrCreate) -
    if there is an active context (`start`ed and not `stop`ped), returns it; otherwise recreates it from the checkpoint, or creates a new one with `setupFunc` if no checkpoint data exists

Methods we will run each time:
- [`ssc.start()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.start.html#pyspark.streaming.StreamingContext.start) - starts execution of streams
- [`ssc.stop()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.stop.html#pyspark.streaming.StreamingContext.stop) - stops execution of streams

Creating streams from context:
- [`ssc.socketTextStream(hostname, port [,storageLevel])`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.socketTextStream.html#pyspark.streaming.StreamingContext.socketTextStream) - create an input stream by connecting to a TCP source at the specified `hostname` and `port`
- [`ssc.textFileStream(directory)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.textFileStream.html#pyspark.streaming.StreamingContext.textFileStream) - watch for new files created on a Hadoop-compatible file system (e.g. HDFS) in the specified directory and read them as text files
- [`ssc.binaryRecordsStream(directory, recordLength)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.binaryRecordsStream.html#pyspark.streaming.StreamingContext.binaryRecordsStream) - as above, but reads the files as binary data made of fixed-length records of `recordLength` bytes

Given all of that, let's create a `DStream` via `socketTextStream` which will connect to `localhost:9999` and read the text data sent from there:

In [None]:
# We will send lines of data to this socketTextStream
lines = ssc.socketTextStream("localhost", 9999)

Let's also apply the same transformations as previously, which:
- Split the input text into `words` and flatten the result
- Count how many times each distinct word occurs in a batch

In [None]:
unique_words = lines.flatMap(lambda text: text.split()).countByValue()
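
To make the result of this pipeline concrete, here is a plain-Python sketch of what `flatMap` followed by `countByValue` computes for a single batch. The helper `count_words_in_batch` and the sample batches are hypothetical and only mimic the behaviour; they are not part of the Spark API.

```python
from collections import Counter
from itertools import chain


def count_words_in_batch(lines):
    """Mimic lines.flatMap(lambda text: text.split()).countByValue() for ONE batch."""
    # flatMap: split every line and flatten the results into one sequence of words
    words = chain.from_iterable(line.split() for line in lines)
    # countByValue: number of occurrences of each distinct word
    return Counter(words)


# Two hypothetical batches of incoming lines
batch_1 = ["spark streaming", "spark core"]
batch_2 = ["streaming"]

counts_1 = count_words_in_batch(batch_1)
counts_2 = count_words_in_batch(batch_2)
```

Here `counts_1["spark"]` is 2, while `counts_2` only knows about `"streaming"`: each batch is counted independently of the others.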

And print the resulting word counts (`pprint` prints the first elements of each batch):

In [None]:
unique_words.pprint()

__Notice nothing happened yet!__

This is due to our `computation` not starting yet. In order to do that, we have to run `ssc.start()`.

Before we do that though, let's run `netcat` to push some data to our socket.

Please notice that:
- The `--listen` flag is specified, __which means we create a server listening on the specified port__ (the exact flags differ between `netcat` variants; some only accept `-l`)
- PySpark's `socketTextStream` __expects to find a server which it can connect to under the specified address!__

Let's run this interactive command. It will allow us to send text data to `localhost:9999` (make sure you have `nc` or `netcat` available on your system!).

> __Run it in a separate terminal: when run in a notebook cell, it blocks the execution of the notebook until the command exits!__

After you've run it, you can send textual data through the server set up by `netcat`

In [None]:
# !nc --listen -p 9999
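
If `netcat` is not available, a small server can be sketched with Python's standard library instead. This is an assumption-laden alternative, not part of the original setup: `serve_lines` is a hypothetical helper which listens on a port and sends newline-terminated UTF-8 lines to the first client that connects (which is the kind of input `socketTextStream` reads).

```python
import socket
import threading


def serve_lines(lines, host="localhost", port=9999):
    """Listen on host:port and send `lines` to the first client that connects.

    Returns the background thread and the port that was actually bound
    (useful when port=0 is passed to let the OS pick a free port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        # Close the listening socket once the single client has been served
        with server:
            conn, _ = server.accept()
            with conn:
                for line in lines:
                    conn.sendall((line + "\n").encode("utf-8"))

    # A daemon thread does not block the notebook or keep the process alive
    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return thread, bound_port


# Example usage (run before ssc.start()):
# thread, _ = serve_lines(["hello world", "hello spark"])
```

Unlike the interactive `netcat` session, this sketch sends a fixed list of lines and then closes the connection.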

Now, you can run `start` which:
- __will run indefinitely__, BUT
- __will NOT block the execution of the Python program__

Because of that, we will be able to run the next cell (in a regular program it would be the next line)

In [None]:
ssc.start()

Now we can run the cell below to wait for up to `seconds` seconds while the stream is being processed (`awaitTermination` only waits; it does not stop the stream by itself):

In [None]:
seconds = 180

ssc.awaitTermination(seconds)

Finally, we can stop the `pyspark` client handling incoming data.

Specifying `stopGraceFully=True` (the parameter really is spelled this way in PySpark) makes it stop only after all the data received so far from port `9999` has been processed:

In [None]:
ssc.stop(stopGraceFully=True)

## Analyze

Let's analyze our results. These diagrams will help us clear up the confusion:

![](./images/streaming-dstream.png)

![](./images/streaming-dstream-ops.png)

As one can see:

- __Operations are performed on each batch of gathered data, NOT ON THE WHOLE DATASET__
- If we want results for the whole dataset, we should `persist` the per-batch results to disk periodically
- The data of batches which were already processed is cleared automatically
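
Besides persisting per-batch results, one option for keeping totals across batches is to carry state from batch to batch. The sketch below simulates this in plain Python; the sample batches and the simulation loop are assumptions for illustration. In PySpark, a function shaped like `update_count` can be passed to `DStream.updateStateByKey` on a stream of `(word, 1)` pairs, which requires checkpointing to be enabled with `ssc.checkpoint(directory)`.

```python
def update_count(new_values, running_count):
    """Combine this batch's values for a key with the state from earlier batches.

    `running_count` is None the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


# Hypothetical batches of (word, 1) pairs, one inner list per batch
batches = [
    [("spark", 1), ("streaming", 1), ("spark", 1)],
    [("streaming", 1)],
]

# Simplified simulation: only keys present in the batch are updated
state = {}
for batch in batches:
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    for key, values in grouped.items():
        state[key] = update_count(values, state.get(key))

print(state)  # {'spark': 2, 'streaming': 2}
```

With per-batch counting alone, the second batch would report `streaming` only once; the carried state is what turns it into a running total.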

## Procedure to follow when working with Streams

Here are the steps one usually employs when working with streams:

1. Define a `StreamingContext` from the session's `SparkContext`.
2. Define the input sources by creating input `DStreams`.
3. Define the streaming computations by applying transformation and output operations to `DStreams`.
4. Start receiving data and processing it using `streamingContext.start()`.
5. Wait for the processing to be stopped (manually or due to any error) using `streamingContext.awaitTermination()`.
6. The processing can be manually stopped using `streamingContext.stop()`.

In addition we should remember (especially when working with multiple streams):

1. __Once a context has been started, no new streaming computations can be set up or added to it__.
2. Once a context has been stopped, it cannot be restarted.
3. Only one `StreamingContext` can be active in a JVM at the same time.
4. `stop()` on `StreamingContext` also stops the `SparkContext`.
    To stop only the `StreamingContext`, set the optional `stopSparkContext` parameter of `stop()` to `False`.
5. A `SparkContext` can be re-used to create multiple `StreamingContext`s, as long as the previous `StreamingContext` is stopped
    (without stopping the `SparkContext`) before the next `StreamingContext` is created.
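
The first two rules can be pictured as a small state machine. The class below is a toy model written only for illustration: `ToyStreamingContext`, `State` and `add_computation` are hypothetical names, the class does not use Spark at all, and it only mimics the lifecycle described above.

```python
from enum import Enum


class State(Enum):
    INITIALIZED = "initialized"
    ACTIVE = "active"
    STOPPED = "stopped"


class ToyStreamingContext:
    """Hypothetical stand-in which models the lifecycle rules only (not a Spark class)."""

    def __init__(self):
        self.state = State.INITIALIZED
        self.computations = []

    def add_computation(self, name):
        # Rule 1: the pipeline has to be fully defined before start()
        if self.state is not State.INITIALIZED:
            raise RuntimeError("computations must be defined before start()")
        self.computations.append(name)

    def start(self):
        # Rule 2: a stopped context cannot be restarted
        if self.state is State.STOPPED:
            raise RuntimeError("a stopped context cannot be restarted")
        self.state = State.ACTIVE

    def stop(self):
        self.state = State.STOPPED


ctx = ToyStreamingContext()
ctx.add_computation("word_count")
ctx.start()
ctx.stop()
```

In this model, calling `ctx.add_computation(...)` after `ctx.start()`, or `ctx.start()` after `ctx.stop()`, raises a `RuntimeError`; a new context has to be created instead.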


# Challenges

## Assessment

- Check out [linking with Kafka](https://spark.apache.org/docs/latest/streaming-programming-guide.html#linking) in order to set up a real data streaming procedure
- Check out [additional receivers](https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers) and see what one has to do in order to set them up

## Non-assessment

- Check how Spark monitors directories for new files [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#how-directories-are-monitored)