---
Spark Streaming
====

![](images/streaming_joke.png)

----
By the end of this session, you should be able to:
----

- Describe how batch, slide, and window durations impact processing
- Join streaming data
- Handle fault tolerance with checkpointing

Optimization - Windowing Operations With Inverse
---------------------------------

How can I avoid the overhead of adding or averaging over the same
values in a window?

```python
windows_word_counts = pair_ds.reduceByKeyAndWindow(
    func=lambda x, y: x + y,
    invFunc=lambda x, y: x - y, 
    windowDuration=30,
    slideDuration=10)
```

- Creates window of length `windowDuration` (30 seconds)

- Moves window every `slideDuration` (10 seconds)

- Merges incoming values using `func`

- Eliminates outgoing values using `invFunc`

- `windowDuration` and `slideDuration` are in seconds

- These must be multiples of the `batchDuration` of the DStream

- This requires that *checkpointing* is enabled on the StreamingContext.
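
To make the role of the inverse function concrete, here is a minimal plain-Python sketch (no Spark involved) that compares recomputing every window from scratch with updating the previous window. The names `naive_window_sums` and `incremental_window_sums` are made up for this illustration, each list entry stands for one batch, and the sketch assumes the slide is no longer than the window.

```python
# Plain-Python model of a sliding window sum; this is not Spark code.
# Each list entry is the value of one batch. window_len and slide are
# counted in batches, and the sketch assumes slide <= window_len.

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def naive_window_sums(batches, window_len, slide):
    """Recompute every window from scratch."""
    sums = []
    for end in range(window_len, len(batches) + 1, slide):
        total = 0
        for value in batches[end - window_len:end]:
            total = add(total, value)
        sums.append(total)
    return sums

def incremental_window_sums(batches, window_len, slide):
    """Update the previous window instead of recomputing it."""
    sums = []
    total = None
    for end in range(window_len, len(batches) + 1, slide):
        if total is None:
            # First window: nothing to reuse yet.
            total = 0
            for value in batches[:window_len]:
                total = add(total, value)
        else:
            # Batches that slid out of the window (the invFunc role).
            for value in batches[end - window_len - slide:end - window_len]:
                total = subtract(total, value)
            # Batches that slid into the window (the func role).
            for value in batches[end - slide:end]:
                total = add(total, value)
        sums.append(total)
    return sums

batch_counts = [3, 1, 4, 1, 5, 9, 2, 6]
print(naive_window_sums(batch_counts, 3, 1))
print(incremental_window_sums(batch_counts, 3, 1))
```

Both functions return the same sums, but the incremental version touches only the batches that enter or leave the window, which is why it needs a function that can undo `func`.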

<img src="images/streaming-windowed-stream.png">

<img src="images/streaming-windowed-stream-with-inv.png">

Streaming Durations
-------------------

What are the different durations in a DStream and which one should
I use?

Type               |Meaning
----               |-------
Batch Duration     |How many seconds until next incoming RDD
Slide Duration     |How many seconds until next window RDD
Window Duration    |How many seconds to include in window RDD
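
As a rough illustration of how the three durations interact, the plain-Python sketch below lists which batches land in each window. The function `window_batches` and its parameters are hypothetical names for this example, and the timeline is a simplified model rather than Spark's actual scheduler.

```python
# Simplified timeline model; this is not Spark code.
# All durations are in seconds. Batches are labelled by their end time.

def window_batches(batch_duration, window_duration, slide_duration, total_seconds):
    """Return a list of (window_end, [batch end times inside that window])."""
    if window_duration % batch_duration or slide_duration % batch_duration:
        raise ValueError(
            "window and slide durations must be multiples of the batch duration")
    batch_ends = list(range(batch_duration, total_seconds + 1, batch_duration))
    windows = []
    for window_end in range(slide_duration, total_seconds + 1, slide_duration):
        window_start = window_end - window_duration
        members = [t for t in batch_ends if window_start < t <= window_end]
        windows.append((window_end, members))
    return windows

# Batch every 10 s, window of 30 s, slide of 20 s, observed for one minute.
for window_end, members in window_batches(10, 30, 20, 60):
    print("window ending at %ds contains batches %s" % (window_end, members))
```

In this model a new window appears every 20 seconds and covers the last three batches, so consecutive windows share one batch. The first window is shorter because only two batches exist when it is emitted.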

![](http://4.bp.blogspot.com/-_HgluMg7Aa0/VHlgpJ99-0I/AAAAAAAABJc/adgprRDim-E/s1600/p1.png)

Duration Impact
---------------

What is the impact of increasing these durations?

Type                 |Increase                                   |Effect 
----                 |--------                                   |------ 
Batch Duration       |Larger but less frequent incoming RDDs     |Less Processing 
Slide Duration       |Less frequent window RDDs                  |Less Processing
Window Duration      |Larger window RDDs                         |More Processing

[Source](http://horicky.blogspot.com/2014/11/spark-streaming.html)

Duration Summary
----------------

- Batch and window duration control RDD size

- Batch and slide duration control RDD frequency

- Larger RDDs carry more context, which can lead to better insights.

- Larger RDDs might require more processing.

- Bundling frequent small RDDs into infrequent larger ones can reduce processing.
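
The trade-offs above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it counts how many batches get re-read per minute when every window is recomputed from scratch. The helper names `batches_per_window` and `batch_reads_per_minute` are assumptions made for this example.

```python
# Back-of-the-envelope cost model; this is not Spark code.
# All durations are in seconds.

def batches_per_window(batch_duration, window_duration):
    """Number of batches that make up one window."""
    return window_duration // batch_duration

def batch_reads_per_minute(batch_duration, window_duration, slide_duration):
    """Batches re-read per minute if each window is recomputed from scratch."""
    windows_per_minute = 60.0 / slide_duration
    return windows_per_minute * batches_per_window(batch_duration, window_duration)

print(batch_reads_per_minute(10, 30, 10))  # baseline
print(batch_reads_per_minute(10, 30, 30))  # longer slide: fewer windows
print(batch_reads_per_minute(10, 60, 10))  # longer window: bigger windows
```

Under this model, lengthening the slide lowers the count while lengthening the window raises it, which matches the direction of the effects in the table.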

State DStreams
--------------

How can I aggregate a value over the lifetime of a streaming
application?

- You can do this with the `updateStateByKey` transform.

```python
# add new values with previous running count to get new count
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  

runningCounts = pairs.updateStateByKey(updateFunction)
```

- This takes a DStream made up of key-value RDDs.

- For each incoming RDD, it combines each key's new values with the
  state previously stored for that key.

- Like `reduceByKeyAndWindow` with an inverse function, this requires
  that checkpointing be enabled on the StreamingContext.
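
To see what happens to the state from batch to batch, the following plain-Python sketch mimics the behavior described above without using Spark. The helper `simulate_update_state_by_key` is a made-up name for this illustration. It assumes the update function is called for every key that has new values or existing state.

```python
# Plain-Python model of per-key state updates; this is not Spark code.

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

def simulate_update_state_by_key(batches, update_func):
    """Apply update_func batch by batch and record the state after each batch."""
    state = {}
    history = []
    for batch in batches:
        # Group this batch's values by key.
        new_values = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        # Assumption: every key with new values or old state gets updated.
        for key in set(state) | set(new_values):
            state[key] = update_func(new_values.get(key, []), state.get(key))
        history.append(dict(state))
    return history

batches = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("b", 1), ("b", 1)],
]
for snapshot in simulate_update_state_by_key(batches, updateFunction):
    print(sorted(snapshot.items()))
```

A key that receives no new values in a batch keeps its previous count, because the update function is called with an empty list for it.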

Testing Streaming Apps Using TextFileStream
-------------------------------------------

A `queueStream` does not support checkpointing, so it cannot be used
with operations that require it. How can code that uses
`updateStateByKey` be tested?

- We can use `textFileStream` instead.
- Let's define a function `xrange_write`, which we will use for the following examples.
- It writes the numbers 0, 1, 2, ... to the directory `input`.
- By default it writes one file per second containing 5 numbers, one per line.

In [1]:
%%file text_file_util.py
import itertools
import time
import random
import uuid

from distutils import dir_util 

# Every batch_duration write a file with batch_size numbers, forever.
# Start at 0 and keep incrementing. (For testing.)

def xrange_write(
        batch_size = 5,
        batch_dir = 'input',
        batch_duration = 1):
    dir_util.mkpath(batch_dir)
    
    # Repeat forever
    for i in itertools.count():
        # Generate data
        # Avoid shadowing the built-ins min and max
        start = batch_size * i
        stop = batch_size * (i + 1)
        batch_data = xrange(start, stop)
      
        # Write to the file
        unique_file_name = str(uuid.uuid4())
        file_path = batch_dir + '/' + unique_file_name
        with open(file_path,'w') as batch_file: 
            for element in batch_data:
                line = str(element) + "\n"
                batch_file.write(line)
    
        # Give streaming app time to catch up
        time.sleep(batch_duration)

Overwriting text_file_util.py


Counting Events
---------------

How can I count a certain type of event in incoming data?

- You can use state DStreams.

- This code takes each incoming number modulo 10.

- Then it counts how many times each number between 0 and 9 is seen.

In [2]:
%%file test_count.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from text_file_util import xrange_write

from pprint import pprint

# add new values with previous running count to get new count
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  

print 'Initializing ssc'
ssc = StreamingContext(SparkContext(), batchDuration=1)
ssc.checkpoint('ckpt')

ds = ssc.textFileStream('input') \
    .map(lambda x: int(x) % 10) \
    .map(lambda x: (x,1)) \
    .updateStateByKey(updateFunction)

ds.pprint()
ds.count().pprint()

print 'Starting ssc'
ssc.start()

# Write data to textFileStream
xrange_write()

Overwriting test_count.py


- Let's run this and see what happens.

In [3]:
%%sh
#$SPARK_HOME/bin/spark-submit test_count.py
echo $SPARK_HOME/bin/spark-submit test_count.py

/Users/asimjalis/d/spark-1.6.0-bin-hadoop2.6/bin/spark-submit test_count.py


- The program will run forever. To terminate it, press `Ctrl-C`.

Challenge Question
--------

<details><summary>
How can you calculate a running average using a state DStream?
</summary>
1. In the above example, replace the value in each key-value pair
with `(sum, count)`. <br>
2. In the function passed to `updateStateByKey`, add the new values to
`sum` and the number of new values to `count`.<br>
3. Use `map` to calculate `sum/count`, which is the average.<br>
</details>
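
One way the steps in the answer could look is sketched below in plain Python. This is a variation rather than a definitive solution: it keeps `(sum, count)` as the state and leaves the incoming values as plain numbers. The function names `update_sum_and_count` and `average` are made up for this example.

```python
# Sketch of an update function for a running average.
# The state for each key is a (sum, count) pair.

def update_sum_and_count(new_values, state):
    """Fold this batch's numbers into the running (sum, count) state."""
    if state is None:
        state = (0, 0)
    total, count = state
    return (total + sum(new_values), count + len(new_values))

def average(state):
    """Turn a (sum, count) pair into an average, or None if nothing was seen."""
    total, count = state
    if count == 0:
        return None
    return float(total) / count

# Feed two batches for a single key by hand.
state = update_sum_and_count([2, 4], None)
state = update_sum_and_count([6], state)
print(state)
print(average(state))
```

With a DStream of `(key, number)` pairs, the same two functions would presumably plug in as `pairs.updateStateByKey(update_sum_and_count).mapValues(average)`. Treat that wiring as an untested assumption.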

<!--
Challenge Question
--------

<details><summary>
How can you calculate a running standard deviation using a state DStream?
</summary>
1. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm<br>  <-- Wrong!
</details>
-->



Join 
----

How can I detect if an incoming credit card transaction is from a
canceled card?

- You can join DStreams against a batch RDD.
- Store the historical data in the batch RDD.
- Join it with the incoming DStream RDDs to determine next action.
- Note: You must create the batch RDD using `ssc.sparkContext`.

```python
dataset = ... # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
```
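
To show the shape of the joined records, here is a small plain-Python sketch of an inner join on key-value pairs, applied to the canceled-card question. It does not use Spark. The card IDs, amounts, and the helper `inner_join` are invented for this illustration.

```python
# Plain-Python model of an inner join on key-value pairs; this is not Spark code.

def inner_join(left_pairs, right_pairs):
    """Return (key, (left_value, right_value)) for every matching key."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    joined = []
    for key, left_value in left_pairs:
        for right_value in right_by_key.get(key, []):
            joined.append((key, (left_value, right_value)))
    return joined

# Hypothetical data: incoming (card_id, amount) and stored (card_id, is_canceled).
transactions = [("card-1", 25.0), ("card-2", 310.0), ("card-3", 12.5)]
card_status = [("card-1", False), ("card-2", True)]

joined = inner_join(transactions, card_status)
flagged = [(card, amount) for card, (amount, is_canceled) in joined if is_canceled]
print(flagged)
```

Note that `card-3` disappears from the result because an inner join keeps only keys present on both sides. If unknown cards should be kept, a left outer join would be the better fit.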

Detecting Bad Customers
-----------------------
Create a streaming app that can join the incoming orders with our
previous knowledge of whether this customer is good or bad.

- Create the streaming app.

In [4]:
%%file test_join.py
# Import modules.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

from pprint import pprint

import time

# Create the StreamingContext.

print 'Initializing ssc'
ssc = StreamingContext(SparkContext(), batchDuration=1)


# For testing create prepopulated QueueStream of streaming customer orders. 

print 'Initializing queue of customer transactions'
transaction_rdd_queue = []
for i in xrange(5): 
    transactions = [(customer_id, None) for customer_id in xrange(10)]
    transaction_rdd = ssc.sparkContext.parallelize(transactions)
    transaction_rdd_queue.append(transaction_rdd)
pprint(transaction_rdd_queue)

# Batch RDD of whether customers are good or bad. 

print 'Initializing bad customer rdd from batch sources'
# (customer_id, is_good_customer)
customers = [
        (0,True),
        (1,False),
        (2,True),
        (3,False),
        (4,True),
        (5,False),
        (6,True),
        (7,False),
        (8,True),
        (9,False) ]
customer_rdd = ssc.sparkContext.parallelize(customers)

# Join the streaming RDD and batch RDDs to filter out bad customers.
print 'Creating queue stream'
ds = ssc\
    .queueStream(transaction_rdd_queue)\
    .transform(lambda rdd: rdd.join(customer_rdd))\
    .filter(lambda (customer_id, (customer_data, is_good_customer)): is_good_customer)

ds.pprint()

ssc.start()
time.sleep(6)
ssc.stop()

Overwriting test_join.py


- Let's run this and see what happens.

In [5]:
%%sh
# $SPARK_HOME/bin/spark-submit 
python test_join.py

Initializing ssc
Initializing queue of customer transactions
[ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[3] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:423]
Initializing bad customer rdd from batch sources
Creating queue stream
-------------------------------------------
Time: 2016-06-23 06:58:23
-------------------------------------------
(0, (None, True))
(8, (None, True))
(4, (None, True))
(2, (None, True))
(6, (None, True))

-------------------------------------------
Time: 2016-06-23 06:58:24
-------------------------------------------
(0, (None, True))
(8, (None, True))
(4, (None, True))
(2, (None, True))
(6, (None, True))

-------------------------------------------
Time: 2016-06-23 06:58:25
-------------------------------------------
(0, (None, True))
(

Challenge Question
--------

<details><summary>
If you are joining with a large batch RDD how can you minimize the
shuffling of the records?
</summary>
1. Use `partitionBy` with the same number of partitions on the incoming
RDDs as well as on the batch RDD.<br>
2. This ensures that records with the same key land in the same
partition on both sides.<br>
3. The join then has less data to move, which matters most when the
batch RDD is large.<br>
</details>
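
The idea behind the answer can be illustrated without Spark. In the plain-Python sketch below, both datasets are split into buckets by the same hash of the key, so each bucket can be joined on its own and no record needs to move between buckets. The helpers `partition_by_key` and `join_copartitioned` are hypothetical names, and the sketch assumes keys are unique on the batch side.

```python
# Plain-Python model of co-partitioned data; this is not Spark code.

def partition_by_key(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hashing its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def join_copartitioned(left_buckets, right_buckets):
    """Join bucket i of the left side only with bucket i of the right side."""
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        right_by_key = dict(right_bucket)  # assumes unique keys on the right
        for key, left_value in left_bucket:
            if key in right_by_key:
                joined.append((key, (left_value, right_by_key[key])))
    return joined

# Same shape as the customer example: (customer_id, is_good_customer).
customers = [(i, i % 2 == 0) for i in range(10)]
transactions = [(i, None) for i in range(10)]

customer_buckets = partition_by_key(customers, 4)     # done once, then reused
transaction_buckets = partition_by_key(transactions, 4)
result = join_copartitioned(transaction_buckets, customer_buckets)
print(sorted(result))
```

The batch side only has to be partitioned once and can then be reused for every incoming batch, which is where the saving comes from.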

Cluster View
------------

<img src="images/streaming-daemons.png">

Checkpointing
-------------

How can I protect my streaming app against failure?

- Streaming apps run for much longer than batch apps.

- They can run for days and weeks.

- So fault-tolerance is important for them.

- To enable recovery from failure you must enable checkpointing.

- If a checkpointed application crashes, you restart it and it
  recovers the state it had saved at its last checkpoint.
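
The recover-or-create pattern can be sketched in plain Python before looking at the Spark version. The sketch below is only an analogy: it saves a small dictionary to a JSON file after every batch and resumes from it after a simulated crash. The names `get_or_create`, `save_checkpoint`, and `process_batches` are made up for this illustration and are not Spark APIs.

```python
# Plain-Python analogy for checkpoint recovery; this is not Spark code.
import json
import os
import tempfile

def get_or_create(checkpoint_path, create_state):
    """Load saved state if a checkpoint exists; otherwise build fresh state."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as checkpoint_file:
            return json.load(checkpoint_file)
    return create_state()

def save_checkpoint(checkpoint_path, state):
    """Persist the state so a restarted run can pick it up."""
    with open(checkpoint_path, "w") as checkpoint_file:
        json.dump(state, checkpoint_file)

def process_batches(checkpoint_path, batches, crash_after=None):
    """Sum all batches, checkpointing after each one."""
    state = get_or_create(checkpoint_path, lambda: {"next_batch": 0, "total": 0})
    for index in range(state["next_batch"], len(batches)):
        if crash_after is not None and index == crash_after:
            raise RuntimeError("simulated crash")
        state["total"] += sum(batches[index])
        state["next_batch"] = index + 1
        save_checkpoint(checkpoint_path, state)
    return state["total"]

checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")
batches = [[1, 2], [3], [4, 5]]

try:
    process_batches(checkpoint_path, batches, crash_after=2)
except RuntimeError:
    print("crashed before the third batch")

# The restarted run resumes from the checkpoint instead of starting over.
print(process_batches(checkpoint_path, batches))
```

The restarted run skips the two batches that were already processed and still arrives at the full total. A restarted streaming app gets the same kind of benefit from its checkpoint directory.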

In [6]:
%%file test_checkpointing.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from text_file_util import xrange_write
    
from pprint import pprint
    
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  
    
checkpointDir = 'ckpt'
    
def functionToCreateContext():
    ssc = StreamingContext(SparkContext(), batchDuration=2)
    
    # Add new values with previous running count to get new count
    ds = ssc.textFileStream('input') \
        .map(lambda x: int(x) % 10) \
        .map(lambda x: (x,1)) \
        .updateStateByKey(updateFunction)
    ds.pprint()
    ds.count().pprint()
    
    # Set up checkpoint
    ssc.checkpoint(checkpointDir)
    return ssc
    
print 'Initializing ssc'
ssc = StreamingContext.getOrCreate(
    checkpointDir, functionToCreateContext)
    
print 'Starting ssc'
ssc.start()
    
# Write data to textFileStream
xrange_write()

Writing test_checkpointing.py


- Let's run this and see what happens.

In [7]:
%%sh
$SPARK_HOME/bin/spark-submit test_checkpointing.py
#echo $SPARK_HOME/bin/spark-submit test_checkpointing.py

Process is terminated.


- The program will run forever. To terminate it, press `Ctrl-C`.

<br>
<br> 
<br>

----