# Checkpointing Demo

A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures or JVM crashes). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system that it can recover from failures. Two types of data are checkpointed:
* _Metadata checkpointing_ - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application. Metadata includes:
    * _Configuration_ - The configuration that was used to create the streaming application.
    * _DStream operations_ - The set of DStream operations that define the streaming application.
    * _Incomplete batches_ - Batches whose jobs are queued but have not completed yet.
* _Data checkpointing_ - Saving of the generated RDDs to reliable storage. This is necessary in some _stateful_ transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (which is proportional to the length of the dependency chain), intermediate RDDs of stateful transformations are periodically _checkpointed_ to reliable storage (e.g., HDFS) to cut off the dependency chains.

To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data or RDD checkpointing is necessary even for basic functioning if stateful transformations are used.
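
To make the effect of data checkpointing on the dependency chain concrete, here is a minimal toy model in plain Python. It does not use Spark: the function name `lineage_lengths` and the assumption that a checkpoint resets the chain length to zero are simplifications made for illustration only.

```python
def lineage_lengths(num_batches, checkpoint_every=None):
    """Return the dependency-chain length after each batch in a toy model.

    Each batch's state depends on the previous batch's state, so the chain
    grows by one per batch. A checkpoint saves the state, so the chain is
    cut and its length drops back to zero.
    """
    lengths = []
    chain = 0
    for batch in range(1, num_batches + 1):
        chain += 1  # this batch depends on the previous one
        if checkpoint_every and batch % checkpoint_every == 0:
            chain = 0  # checkpoint: recovery can restart from the saved copy
        lengths.append(chain)
    return lengths


# Without checkpointing, the chain grows with the number of batches.
print(max(lineage_lengths(100)))
# With a checkpoint every 10 batches, the chain stays bounded.
print(max(lineage_lengths(100, checkpoint_every=10)))
```

In this model the longest chain is bounded by the checkpoint interval rather than by how long the application has been running, which is the property the paragraph above describes.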

### When to enable Checkpointing

Checkpointing must be enabled for applications with any of the following requirements:
* _Usage of stateful transformations_ - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
* _Recovering from failures of the driver running the application_ - Metadata checkpoints are used to recover with progress information.

Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. In that case, recovery from driver failures will also be partial (some received but unprocessed data may be lost). This is often acceptable, and many users run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
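
As an illustration of the first requirement, the sketch below shows what a stateful running word count could look like with updateStateByKey. The function names `update_count` and `build_stateful_context`, the application name, and the arguments are hypothetical and not part of the demo in this notebook. The PySpark imports sit inside the builder function so that the update function can be tried without a Spark installation.

```python
def update_count(new_values, running_count):
    """State update function: add this batch's counts to the running total.

    new_values is the list of values seen for a key in the current batch;
    running_count is the previous state, or None the first time a key appears.
    """
    return sum(new_values) + (running_count or 0)


def build_stateful_context(checkpoint_dir, host, port):
    # Imported lazily; calling this function requires PySpark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatefulWordCountSketch")  # hypothetical name
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # updateStateByKey is a stateful transformation, so a checkpoint
    # directory has to be set for periodic RDD checkpointing.
    ssc.checkpoint(checkpoint_dir)

    lines = ssc.socketTextStream(host, port)
    pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))
    running_counts = pairs.updateStateByKey(update_count)
    running_counts.pprint()
    return ssc
```

Unlike the per-batch reduceByKey used in the demo below, the counts here accumulate across batches, which is why the state RDDs need to be checkpointed.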

### How to configure Checkpointing

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved. This is done by using streamingContext.checkpoint(checkpointDirectory). This will allow you to use the aforementioned stateful transformations. Additionally, if you want to make the application recover from driver failures, you should rewrite your streaming application to have the following behavior.

* When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
* When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory.

This behavior is made simple by using StreamingContext.getOrCreate. This is used as follows.

```python
# Function to create and setup a new StreamingContext
def functionToCreateContext():
    sc = SparkContext(...)  # new context
    ssc = StreamingContext(...)
    lines = ssc.socketTextStream(...)  # create DStreams
    ...
    ssc.checkpoint(checkpointDirectory)  # set checkpoint directory
    return ssc

# Get StreamingContext from checkpoint data or create a new one
context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)

# Do additional setup on context that needs to be done,
# irrespective of whether it is being started or restarted
context. ...

# Start the context
context.start()
context.awaitTermination()
```
If the checkpointDirectory exists, then the context will be recreated from the checkpoint data. If the directory does not exist (i.e., running for the first time), then the function functionToCreateContext will be called to create a new context and set up the DStreams. See the Python example recoverable_network_wordcount.py. This example appends the word counts of network data into a file.
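
The control flow described above can be mimicked in plain Python. The following is only a toy analogue, not Spark's implementation: `get_or_create`, the `state.json` file name, and the returned status strings are all made up for illustration.

```python
import json
import os
import tempfile


def get_or_create(checkpoint_dir, create_func):
    """Toy get-or-create: recover saved state if present, else build it."""
    state_file = os.path.join(checkpoint_dir, "state.json")  # hypothetical name
    if os.path.exists(state_file):
        # Checkpoint data exists: recover, and do NOT call create_func.
        with open(state_file) as f:
            return json.load(f), "recovered"
    # No checkpoint data: build fresh state and save it for the next start.
    state = create_func()
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(state_file, "w") as f:
        json.dump(state, f)
    return state, "created"


calls = []


def create_state():
    calls.append(1)  # record that the setup function actually ran
    return {"app": "demo", "batch_seconds": 1}


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint")
    first = get_or_create(ckpt, create_state)   # first start: creates
    second = get_or_create(ckpt, create_state)  # restart: recovers
    print(first[1], second[1], len(calls))
```

The setup function runs only on the first start. The same logic explains why a message printed inside the setup function appears only when no checkpoint data exists yet.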

You can also explicitly create a StreamingContext from the checkpoint data and start the computation by using StreamingContext.getOrCreate(checkpointDirectory, None).

In addition to using getOrCreate, you also need to ensure that the driver process is restarted automatically on failure. Only the deployment infrastructure that runs the application can do this. It is discussed further in the deployment section of the Spark Streaming programming guide.

Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may increase the processing time of those batches where RDDs get checkpointed. Hence, the checkpoint interval needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 to 10 sliding intervals of a DStream is a good setting to try.
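
The arithmetic behind these two rules of thumb can be sketched in a few lines. This assumes that "a multiple of the batch interval that is at least 10 seconds" means the smallest such multiple; the helper names are hypothetical, and intervals are given in milliseconds to keep the arithmetic in integers.

```python
def default_checkpoint_interval_ms(batch_interval_ms, minimum_ms=10_000):
    """Smallest multiple of the batch interval that is >= minimum_ms.

    Assumption: the default picks the smallest qualifying multiple.
    """
    multiples = -(-minimum_ms // batch_interval_ms)  # integer ceiling division
    return multiples * batch_interval_ms


def suggested_interval_range_ms(slide_interval_ms):
    """The 5 to 10 sliding intervals rule of thumb, as a (low, high) pair."""
    return 5 * slide_interval_ms, 10 * slide_interval_ms


print(default_checkpoint_interval_ms(1_000))   # 1 s batches  -> 10000
print(default_checkpoint_interval_ms(3_000))   # 3 s batches  -> 12000
print(default_checkpoint_interval_ms(15_000))  # 15 s batches -> 15000
print(suggested_interval_range_ms(2_000))      # 2 s slide    -> (10000, 20000)
```

With the 1-second batches used in the demo below, both rules point to checkpointing roughly every 5 to 10 seconds rather than every batch.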




### Demo

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
import os
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [3]:
def createContext(host, port, outputPath):
    # If you do not see this printed, the StreamingContext has been loaded
    # from an existing checkpoint instead of being created here
    print("Creating new context")
    if os.path.exists(outputPath):
        os.remove(outputPath)
    sc = SparkContext(appName="PythonStreamingCheckpointedWordCount")
    ssc = StreamingContext(sc, 1)

    # Create a socket stream on target ip:port and count the
    # words in input stream of \n delimited text (e.g., generated by 'nc')
    lines = ssc.socketTextStream(host, port)
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

    def echo(time, rdd):
        counts = "Counts at time %s %s" % (time, rdd.collect())
        print(counts)
        print("Appending to " + os.path.abspath(outputPath))
        with open(outputPath, 'a') as f:
            f.write(counts + "\n")

    wordCounts.foreachRDD(echo)
    return ssc

In [4]:
host = 'localhost'
port = 8880
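# TODO: Change these paths to reflect your own directories if you want to run this demo.
# The parent directory of the output file ("output/") must already exist,
# because open() in echo() does not create missing directories.
checkpoint = "checkpoint"
output = "output/checkdemo.txt"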

ssc = StreamingContext.getOrCreate(checkpoint, lambda: createContext(host, int(port), output))

Creating new context


In [5]:
ssc.start()

Counts at time 2018-03-02 03:29:16 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:17 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:18 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:19 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:20 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:21 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:22 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:23 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:24 []
Appending to /home/matthew/pyspark-streami

In [6]:
ssc.stop()

Counts at time 2018-03-02 03:29:47 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
Counts at time 2018-03-02 03:29:48 []
Appending to /home/matthew/pyspark-streaming/3_advanced/output/checkdemo.txt
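
The counts above are empty (`[]`) because nothing was sending text to localhost:8880 while the context was running. socketTextStream connects to that address as a client, so some process has to be listening there, for example 'nc' as mentioned in the code comment. As an alternative, the following is a minimal sketch of a stand-in written with Python's standard socket module. The helper name `start_line_server` and the sample lines are hypothetical, and it serves only the first client that connects.

```python
import socket
import threading


def start_line_server(lines, host="localhost", port=0):
    """Listen on host:port and send `lines` to the first client that connects.

    Returns (thread, bound_port). Passing port=0 lets the OS pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                # One newline-terminated line per record, like 'nc' input
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return thread, bound_port
```

To try it with the demo, one option is to call `start_line_server(["hello world", "hello spark"], port=8880)` before `ssc.start()`, so that the word counts for those lines show up in the first batches.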


## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing