# updateStateByKey Exercise

### updateStateByKey
The `updateStateByKey` operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps:

1. Define the state. The state can be an arbitrary data type.
2. Define the state update function. Specify with a function how to update the state using the previous state and the new values from an input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in that batch. If the update function returns `None`, the key-value pair is eliminated.
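
As a minimal sketch of the last point, the hypothetical `update_or_drop` function below (not part of Spark or of this exercise) keeps a running count but returns `None` whenever a key receives no new values in a batch, which removes that key from the state. Dropping a key after a single idle batch is an illustrative assumption, not a recommended policy.

```python
def update_or_drop(new_values, state):
    """Hypothetical update function: count values, drop keys that go idle."""
    if not new_values:
        # No new data for this key in the current batch:
        # returning None tells updateStateByKey to remove the key.
        return None
    # Treat a missing previous state (None) as a count of zero.
    return sum(new_values, state or 0)
```

It would be passed to `updateStateByKey` in the same way as any other update function.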

Let’s illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as:
```python
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  # add the new values with the previous running count to get the new count
```
This is applied on a DStream containing words (say, the `pairs` DStream containing `(word, 1)` pairs in the earlier [example](https://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example)).
```python
runningCounts = pairs.updateStateByKey(updateFunction)
```
The update function is called for each word, with `newValues` holding a sequence of 1s (from the `(word, 1)` pairs) and `runningCount` holding the previous count. For the complete Python code, take a look at the example [stateful_network_wordcount.py](https://github.com/apache/spark/blob/v2.2.0/examples/src/main/python/streaming/stateful_network_wordcount.py).
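
To make the per-batch behavior concrete without starting Spark, here is a rough pure-Python sketch. The helper `simulate_batch` is a hypothetical stand-in written for this illustration and is not a Spark API; it only mimics the semantics described above on plain lists and dictionaries.

```python
from collections import defaultdict

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

def simulate_batch(state, batch, update_func):
    """Hypothetical stand-in for one updateStateByKey step on plain Python data."""
    # Group the batch's values by key, e.g. {"a": [1, 1], "b": [1]}.
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    new_state = {}
    # Visit every key that has previous state or new data in this batch.
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:  # a None result drops the key
            new_state[key] = updated
    return new_state

state = {}
for batch in [[("a", 1), ("b", 1), ("a", 1)], [("b", 1)], []]:
    state = simulate_batch(state, batch, updateFunction)
    print(sorted(state.items()))
# [('a', 2), ('b', 1)]
# [('a', 2), ('b', 2)]
# [('a', 2), ('b', 2)]
```

Note that `"a"` keeps its count in the second and third batches even though no new `"a"` values arrive: the update function still runs for it, with an empty `newValues`.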

Note that using `updateStateByKey` requires the checkpoint directory to be configured, which is discussed in detail in the [checkpointing](https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) section.

### mapWithState


Unlike the Java and Scala APIs, the Python API for Spark does not provide `mapWithState`, so this exercise focuses on `updateStateByKey`.

### Exercise

In [None]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [None]:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [None]:
sc = SparkContext()
ssc = StreamingContext(sc, 1)
ssc.checkpoint('checkpoint')

In [None]:
ds = ssc.socketTextStream("localhost", 9997)
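
`socketTextStream` connects to a TCP server and reads newline-delimited text, so something has to be listening on port 9997 and sending numbers. One possible way to do that is sketched below; `number_lines` and `serve_numbers` are hypothetical helpers written for this note, not part of Spark or of the original exercise, and the value range and line count are arbitrary assumptions.

```python
import random
import socket

def number_lines(count, seed=0):
    """Yield `count` random non-negative integers, one per newline-terminated line."""
    rng = random.Random(seed)
    for _ in range(count):
        yield f"{rng.randint(0, 999)}\n"

def serve_numbers(port=9997, count=100, host="localhost"):
    """Hypothetical feeder: wait for one client, send it the numbers, then close."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()  # blocks until the stream connects
        with conn:
            for line in number_lines(count):
                conn.sendall(line.encode("utf-8"))
```

Because `serve_numbers()` blocks while waiting for a connection, it is best run from a separate terminal or process rather than from a notebook cell.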

In [None]:
# Function adds new values with previous running count to get new count
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  

In [None]:
# TODO: Using the input DStream `ds` created with socketTextStream above, build a new DStream named `dst` that does the following:
# - Takes in the input DStream (`ds`).
# - Takes each incoming number mod 10.
# - Counts how many times each number between 0 and 9 is seen.
# - Updates the state with `updateFunction` using updateStateByKey.



##======================= END OF EXERCISE SECTION =======================

dst.pprint()
dst.count().pprint()

In [None]:
ssc.start()

## References
1. https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html