# updateStateByKey Demo

### updateStateByKey
The `updateStateByKey` operation lets you maintain arbitrary per-key state while continuously updating it with new information. Using it takes two steps:
1. Define the state - The state can be an arbitrary data type.
2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in that batch. If the update function returns `None`, the key-value pair is eliminated.
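
The per-batch behaviour described above can be sketched in plain Python, without Spark. This is only an illustration of the semantics, not Spark's implementation: `apply_batch` and `count_and_expire` are hypothetical names introduced here, and the rule "drop a key after one idle batch" is an assumption chosen to show what returning `None` does.

```python
def apply_batch(state, batch, update_func):
    """Simulate one micro-batch: call update_func for every key in state or batch."""
    # Group the new values of this batch by key.
    new_values_by_key = {}
    for key, value in batch:
        new_values_by_key.setdefault(key, []).append(value)

    next_state = {}
    # Existing keys are visited even when they received no new data.
    for key in set(state) | set(new_values_by_key):
        updated = update_func(new_values_by_key.get(key, []), state.get(key))
        if updated is not None:  # returning None removes the key
            next_state[key] = updated
    return next_state


def count_and_expire(new_values, previous):
    """Hypothetical update function: running count, dropped after an idle batch."""
    if not new_values and previous is not None:
        return None
    return sum(new_values) + (previous or 0)


state = {}
state = apply_batch(state, [("a", 1), ("b", 1), ("a", 1)], count_and_expire)
print(sorted(state.items()))  # [('a', 2), ('b', 1)]
state = apply_batch(state, [("a", 1)], count_and_expire)
print(sorted(state.items()))  # [('a', 3)]  ('b' had no new data and was dropped)
```

In the second batch `b` receives no new values, yet the update function is still called for it; because it returns `None`, the key disappears from the state.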

Note that using `updateStateByKey` requires the checkpoint directory to be configured.


### mapWithState
`mapWithState` is another stateful transformation. Unlike the Java and Scala APIs, the Python API for Spark does not provide `mapWithState`, so this demo focuses on `updateStateByKey`.

### Demo

In [1]:
import findspark
# TODO: change the path below to the location of your own Spark installation.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [3]:
sc = SparkContext()
# Batch interval of 5 seconds.
ssc = StreamingContext(sc, 5)
# updateStateByKey requires a checkpoint directory.
ssc.checkpoint("checkpoint")

In [4]:
lines = ssc.socketTextStream("localhost", 9999)

In [5]:
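def updateFunc(new_values, last_sum):
    # new_values: counts that arrived for this key in the current batch (may be empty)
    # last_sum: the previous running total, or None the first time the key is seen
    return sum(new_values) + (last_sum or 0)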
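
As a quick check that needs no running stream, the update function can be called by hand. The sketch below repeats the definition so it runs on its own; the input values are made up for illustration.

```python
def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)


# First batch for a key: there is no previous state, so last_sum is None.
first = updateFunc([1, 1], None)
# Next batch: one more occurrence is added to the running total.
second = updateFunc([1], first)
# Idle batch: no new values, so the total is carried forward unchanged.
idle = updateFunc([], second)

print(first, second, idle)  # 2 3 3
```

The `last_sum or 0` expression covers the first-batch case, and the idle-batch case suggests why the printed counts stay the same from one batch to the next once no new lines arrive.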

In [6]:
running_counts = (
    lines.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .updateStateByKey(updateFunc)
)
running_counts.pprint()

In [None]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2018-03-01 23:15:50
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 23:15:55
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 23:16:00
-------------------------------------------
('possession', 1)
('was', 2)
('spot,', 1)
('these', 1)
('close', 1)
('but', 1)
('gleams', 1)
('than', 1)
('existence', 1)
('of', 8)
...

-------------------------------------------
Time: 2018-03-01 23:16:05
-------------------------------------------
('possession', 1)
('was', 2)
('spot,', 1)
('these', 1)
('close', 1)
('but', 1)
('gleams', 1)
('than', 1)
('existence', 1)
('of', 8)
...

-------------------------------------------
Time: 2018-03-01 23:16:10
-------------------------------------------
('possession', 1)
('was', 2)
('spot,', 1)
('these', 1)
('close', 1)
('but', 1)
('gleams', 1)
('than', 1)
('existence', 1)
('of', 8)
...

-----------------

## References
1. https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html