![](https://www.mapr.com/sites/default/files/otherpageimages/spark-streaming.png)

Why Streaming?
-------------------

The world is expecting faster answers...
<img src="images/realtime.jpg">

By the end of this session, you should be able to:
-----

- Compare and contrast batch and streaming processing
- Describe and list common stream processing frameworks
- Draw the Spark Streaming framework
- Define a DStream and perform common operations on it
- Write PySpark code for stream processing

---
From Batch to "Real-time"
---

Spark, like MapReduce, was designed to process data as a batch job.

Nightly batch jobs process large amounts of data and generate insights.

What if we want to react immediately instead of waiting 24 hours?

Streaming solves this problem: it lets you process data in near real time, as it arrives.

---
What is an example streaming application?
-----

![](http://www.h2htech.com/wp-content/uploads/2015/09/CyberSecurity.jpg)

Suppose you have an intrusion detection system.

You process log files to determine if the system is under attack.

With a nightly batch job, it can take up to 24 hours to raise an intrusion alert.

Spark Streaming can detect an intrusion in minutes or even seconds.

----
Streaming Frameworks
----

1. Minibatch - Same as Batch, only very small and hopefully fast
2. "True" Streaming - Tuple-by-tuple

----
Stream Processing Frameworks
-----

1. [Apache Storm](http://storm.apache.org/)
2. Twitter's Heron
3. Spark Streaming

How "Realtime" is Spark?  Spark Streaming vs Storm
------------------------

How does Spark Streaming compare with Storm?

- Storm is another system for realtime processing of events.

- Here is a comparison of Storm and Spark Streaming.

Comparison           |Winner     |Spark Streaming      |Storm
----------           |------     |---------------      |-----
Processing Model     |  -        |Mini batches         |Record-at-a-time
Latency              |Storm      |Few seconds          |Sub-second
Fault tolerance      |Spark      |Exactly once         |At least once (may be duplicates)
Batch integration    |Spark      |Spark                |Requires different framework
API                  |Spark      |Simpler              |Complex
In production since  |Storm      |2013                 |2011

----
Spark Streaming
----

1. Features
2. Framework
3. Demo

Micro-Batch Concept
-------------------

How does Spark Streaming work?

- Events are grouped into micro-batched RDDs.

- Each RDD contains the events from the last few seconds.

- The incoming event stream is turned into a stream of RDDs.

- These micro-batched RDDs are joined with existing data to raise alerts.
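
The grouping step can be sketched in plain Python, without Spark. This is only an illustration of the idea, not how Spark implements it: the function name `micro_batches`, the `(timestamp, payload)` event format, and the sample events are all assumptions made for this example. Unlike Spark, the sketch does not emit empty batches for intervals without events.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into micro-batches.

    An event with timestamp t lands in batch number int(t // batch_interval),
    so every event belongs to exactly one batch.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(payload)
    # Return the batches in time order as (batch_index, payloads) pairs.
    return sorted(batches.items())


# Hypothetical log events: (seconds since start, message).
events = [(0.2, "login ok"), (0.9, "login failed"),
          (1.4, "login failed"), (3.1, "port scan")]

for batch_index, payloads in micro_batches(events, batch_interval=1):
    print(batch_index, payloads)
```

With a 1-second interval, the first two events share batch 0, the third lands in batch 1, and the last one in batch 3.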

Spark Streaming RDDs
--------------------

How does Spark Streaming integrate with Spark?

- Spark Streaming converts incoming events into micro-batched RDDs.

- These are then processed by the regular Spark APIs.

<img src="images/streaming-arch.png">

<img src="images/streaming-flow-micro-batches.png">

---
Check for understanding
---

<details><summary>
What could be a downside of minibatch?
</summary>
If processing a batch regularly takes longer than the batch interval, batches queue up faster than they are processed and the system falls further and further behind.
</details>
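
The arithmetic behind this downside can be sketched in a few lines of plain Python. The model is an assumption made for illustration: batches become ready at fixed intervals and are processed strictly one after another, and the function name `scheduling_delays` is hypothetical.

```python
def scheduling_delays(batch_interval, processing_time, num_batches):
    """Return how long each batch waits before its processing starts.

    Batch i becomes ready at time (i + 1) * batch_interval, and batches
    are processed one at a time, in order.
    """
    delays = []
    worker_free_at = 0.0
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval
        start_at = max(ready_at, worker_free_at)
        delays.append(start_at - ready_at)
        worker_free_at = start_at + processing_time
    return delays


# Processing keeps up: every batch starts as soon as it is ready.
print(scheduling_delays(batch_interval=1.0, processing_time=0.5, num_batches=5))

# Processing is too slow: each batch waits longer than the one before it.
print(scheduling_delays(batch_interval=1.0, processing_time=1.5, num_batches=5))
```

In the second case the delay grows by 0.5 seconds with every batch, so it never recovers unless the load drops or the batch interval is increased.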

Spark Stack
-----------

How does Spark Streaming fit into the rest of Spark?

- Spark Streaming is a subsystem of Spark.

- Spark Streaming enables handling realtime events.

<img src="images/spark-stack.png">


Spark Streaming Big Picture
---------------------------

- Spark Streaming can consume events from multiple sources.

- These are processed and written out to HDFS, databases, and other
  systems.

<img src="images/streaming-input-output-components.png">


DStream Concept
---------------

- A DStream is a stream of RDDs.

- Think of a DStream as an infinite sequence of RDDs.

<img src="images/streaming-dstream-as-rdds.png">

- The incoming events are batched together into RDDs.

<img src="images/streaming-dstream-time-i.png">

Challenge Question
--------

<details><summary>
What happens to an event that is half in batch `time=1` and half in
batch `time=2`? Which batch does it go to?
</summary>
1. It goes to batch `time=2`.<br>
2. Incomplete events are meaningless.<br>
3. RDDs are formed from fully-formed events.
</details>

------
Spark Streaming Demo 
-----

In [2]:
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

In [3]:
# Stop any already running context
# sc.stop() 

# Start a new Spark context
sc = SparkContext("local[*]", "MyFirstStream") 

# Create a Spark Streaming Context with batch interval of 1 second
ssc = StreamingContext(sc, batchDuration=1) 

In [4]:
# Create sample data
data = [[5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0]] 

# Put sample data in an RDD queue (one single-partition RDD per batch)
rdd_queue = [sc.parallelize(batch, 1) for batch in data]

In [5]:
rdd_queue

[ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[3] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:423]

In [6]:
# In Python 2, print is a statement; this import turns it into a function.
# (The import must come first in the cell and has no effect on Python 3.)
from __future__ import print_function

# Add our queue to the stream
input_stream = ssc.queueStream(rdd_queue)

# Alternative: print the full contents of each RDD as it streams by
# input_stream.foreachRDD(lambda rdd: print(rdd.collect()))

# Add 1 to every element and print the first elements of each batch
input_stream.map(lambda x: x + 1).pprint()

In [8]:
# Start stream processing
ssc.start()

-------------------------------------------
Time: 2016-06-23 15:00:39
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:40
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:41
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:42
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:43
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:44
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:45
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:46
-------------------------------------------

-------------------------------------------
Time: 2016-06-23 15:00:47
----------

In [10]:
# Stop stream processing
ssc.stop(stopSparkContext=False)

----
Common Spark Streaming data sources
----




Notes
-----

- The `StreamingContext` is stored in `ssc`.

- `ssc.socketTextStream` creates a `DStream`.

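- DStream transformations like `flatMap`, `map`, and `reduceByKey`
  create new DStreams.

- DStream output operations like `pprint` play the role of RDD actions.

- Unlike RDD actions, calling a DStream output operation does not trigger
  execution right away. It only registers the operation, which runs once
  the context is started and data arrives.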

Challenge Question
--------

<details><summary>
When you execute `pprint` on a DStream will anything be printed?
</summary>
1. Nothing is printed.<br>
2. The printing happens when we call `ssc.start()` and when data flows in.
</details>

RDDs vs DStreams
----------------

How are DStreams different from RDDs?

- DStream transformations and output operations define an assembly line.
  
- Nothing happens until data comes in.

- When data comes in, DStream output operations trigger the execution
  of DStream transformations.

<img src="images/donuts.jpg">
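
The assembly-line idea can be mimicked with a toy class in plain Python. This is a sketch for intuition only: `LazyStream`, `foreach_batch`, and `push_batch` are hypothetical names, not part of the Spark API, and unlike a real DStream (whose transformations return new DStreams) this toy simply mutates itself.

```python
class LazyStream(object):
    """Toy stand-in for a DStream: it records steps instead of running them."""

    def __init__(self):
        self.transformations = []   # functions applied to each element
        self.outputs = []           # functions called with each finished batch

    def map(self, func):
        # Only remember the function; no data has been touched yet.
        self.transformations.append(func)
        return self

    def foreach_batch(self, func):
        # Register an output operation; it also runs later, once per batch.
        self.outputs.append(func)
        return self

    def push_batch(self, batch):
        # A batch arrived: now the whole assembly line runs.
        for func in self.transformations:
            batch = [func(element) for element in batch]
        for output in self.outputs:
            output(batch)


results = []
stream = LazyStream().map(lambda x: x + 1).foreach_batch(results.append)
print(results)   # still empty: defining the pipeline executed nothing

stream.push_batch([5, 5])
stream.push_batch([4, 4])
print(results)   # one transformed batch per pushed batch
```

Defining the pipeline only builds the assembly line. Work happens each time a batch is pushed through it.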

Transformations and Output Operations
=====================================

DStream Transformations
-----------------------

How are DStream transformations different from RDD transformations?

- DStream transformations define what will happen to RDDs when they
  arrive.
  
- DStream transformations produce new DStreams that will contain 
  transformed RDDs.

- Nothing happens until data arrives.

<img src="images/streaming-dstream-ops.png">

Transforming DStreams
---------------------

Transformation                                 |For Each Incoming RDD
--------------                                 |---------------------
`ds.map(lambda line: line.upper())`            |Uppercase `line` 
`ds.flatMap(lambda line: line.split())`        |Split `line` into words
`ds.filter(lambda line: line.strip() != '')`   |Exclude `line` if it is empty
`ds.repartition(10)`                           |Repartition RDD into 10 partitions
`ds.reduceByKey(lambda v1,v2: v1+v2)`          |For each key sum values 
`ds.groupByKey()`                              |For each key group values into iterable

Generic Transformations
-----------------------

How can I apply an arbitrary transformation on the incoming RDDs?

- DStreams support some, but not all, of the RDD transformations.

- For example, `sortByKey()` is not supported on DStreams.

- Instead, DStreams provide `transform()`.

- `transform()` lets you apply any RDD transformation to each RDD in a DStream.

- These two have the same effect:

```python
ds.transform(lambda rdd: rdd.flatMap(lambda line: line.split()))
```

```python
ds.flatMap(lambda line: line.split())
```
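
The same equivalence can be checked without Spark by modeling a stream as a list of batches. This is a plain-Python sketch under that assumption: the helper names `transform`, `flat_map`, and `split_words` are hypothetical stand-ins for the DStream methods, not Spark code.

```python
def transform(batches, batch_func):
    """Apply batch_func to each whole batch (the role of ds.transform)."""
    return [batch_func(batch) for batch in batches]


def flat_map(batches, func):
    """Apply func to each element and flatten (the role of ds.flatMap)."""
    return [[out for element in batch for out in func(element)]
            for batch in batches]


def split_words(line):
    return line.split()


# Two hypothetical micro-batches of text lines.
batches = [["to be or", "not to be"], ["that is the question"]]

via_transform = transform(
    batches,
    lambda batch: [word for line in batch for word in split_words(line)])
via_flat_map = flat_map(batches, split_words)
print(via_transform == via_flat_map)

# A whole-batch operation that has no per-element equivalent: sorting.
print(transform(via_flat_map, sorted))
```

The per-element version is shorter, but the whole-batch version is more general because its function sees the entire batch at once.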

Challenge Question
--------

<details><summary>
How can you write `sortByKey()` for DStreams?
</summary>
```python
ds.transform(lambda rdd: rdd.sortByKey())
```
</details>

Challenge Question
--------

Consider this code:

```python
ds.transform(lambda rdd: rdd.flatMap(lambda line: line.split()))
```

<details><summary>
Where does `lambda line: ...` execute? 
</summary>
On the executors.
</details>


<details><summary>
Where does `lambda rdd: ...` execute? 
</summary>
On the driver.
</details>


DStream Output Operations
-------------------------

Expression                                     |Meaning
----------                                     |-------
`ds.foreachRDD(lambda rdd: func(rdd.first()))` |Call `func()` on `first()` of each incoming RDD
`ds.pprint(num=10)`                            |Print first 10 elements of each incoming RDD
`ds.saveAsTextFiles('foo',suffix=None)`        |Save each incoming RDD's partitions to disk

Notes
-----

- These output operations only execute when RDDs start arriving.

- `foreachRDD` is a generic output operation.

- `foreachRDD` lets you define arbitrary output operations on incoming RDDs.


Challenge Question
--------

<details><summary>
Print the number of elements in each incoming RDD.
</summary>
```python
# Enable print as a function
from __future__ import print_function

# Define the output operation
ds.foreachRDD(lambda rdd: print(rdd.count()))
```
</details>

<details><summary>
Where will the lambda inside the `foreachRDD` execute?
</summary>
1. The lambda itself executes on the driver.<br>
2. This is because RDD objects exist only on the driver; the `rdd.count()` call inside the lambda then launches a job whose tasks run on the executors.<br>
</details>

Testing Streaming Apps Using QueueStream
----------------------------------------

Manually testing apps using `nc` is quite tedious. Is there an
easier, more automatable way to do this?

- *Queue streams* enable you to create preprogrammed streams perfect
  for automated testing and test-driven development.

Counting Event Types
--------------------

Count how many events of each type arrive in each micro-batch of the
incoming stream.

- Here is the code.

In [13]:
list(range(3)) * 2 + list(range(5)) * 2

[0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

In [14]:
%%file test_queue_stream.py
from __future__ import print_function

from pprint import pprint
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

print('Initializing ssc')
ssc = StreamingContext(SparkContext(), batchDuration=1)

print('Initializing event_rdd_queue')
# Test data: five RDDs, each holding events 0-4 twenty times
# and events 5-9 ten times
event_rdd_queue = []
for i in range(5):
    events = list(range(5)) * 10 + list(range(10)) * 10
    event_rdd = ssc.sparkContext.parallelize(events)
    event_rdd_queue.append(event_rdd)
pprint(event_rdd_queue)

print('Building DStream pipeline')
# Count the events of each type within every micro-batch
ds = ssc \
    .queueStream(event_rdd_queue) \
    .map(lambda event: (event, 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2)
ds.pprint()

print('Starting ssc')
ssc.start()
time.sleep(6)

print('Stopping ssc')
# Note: `stopGraceFully` (with this capitalization) is the PySpark parameter name
ssc.stop(stopSparkContext=True, stopGraceFully=True)

Writing test_queue_stream.py


- Let's run this and see what happens.

In [15]:
%%sh
$SPARK_HOME/bin/spark-submit test_queue_stream.py

Initializing ssc
Initializing event_rdd_queue
[ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[3] at parallelize at PythonRDD.scala:423,
 ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:423]
Building DStream pipeline
Starting ssc
-------------------------------------------
Time: 2016-06-23 15:16:16
-------------------------------------------
(0, 20)
(8, 10)
(4, 20)
(1, 20)
(5, 10)
(9, 10)
(2, 20)
(6, 10)
(3, 20)
(7, 10)

-------------------------------------------
Time: 2016-06-23 15:16:17
-------------------------------------------
(0, 20)
(8, 10)
(4, 20)
(1, 20)
(5, 10)
(9, 10)
(2, 20)
(6, 10)
(3, 20)
(7, 10)

-------------------------------------------
Time: 2016-06-23 15:16:18
-------------------------------------------
(0, 20)
(8, 10)
(4, 20)
(1, 20)
(5, 10)
(9, 10)
(2, 20)
(6, 10)
(3, 20)
(7, 10)

-----
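
The per-batch counts can be sanity-checked without Spark. The sketch below rebuilds the data of one test RDD and counts it with `collections.Counter`, which here plays the role of the `map` plus `reduceByKey` steps. It is a plain-Python cross-check, not part of the streaming job.

```python
from collections import Counter

# Same test data as one RDD in test_queue_stream.py.
events = list(range(5)) * 10 + list(range(10)) * 10

# Counter stands in for map(lambda e: (e, 1)).reduceByKey(lambda a, b: a + b).
counts = Counter(events)
for event_type, count in sorted(counts.items()):
    print(event_type, count)
```

Events 0 to 4 appear in both halves of the data and therefore twice as often as events 5 to 9, which matches the `(key, count)` pairs printed for each batch.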


Aggregating RDDs
================

Merging DStreams
----------------

Transformation      |Effect
--------------      |------
`ds1.union(ds2)`    |Combine RDD in `ds1` with RDD in same batch in `ds2`
`ds1.join(ds2)`     |Join RDD in `ds1` with RDD in same batch in `ds2`

Note
----

- For `union` or `join` the DStreams must have identical batch
  durations.

- The batches are matched up based on timestamps.
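
The matching of batches by timestamp can be sketched in plain Python by modeling each stream as a dictionary from batch time to a list of `(key, value)` pairs. This is an illustration under that assumption: the function names `union_batches` and `join_batches` and the sample data are hypothetical, and the sketch assumes both streams produce a batch at every batch time.

```python
def union_batches(ds1, ds2):
    """Concatenate the batches that share a batch time."""
    return {t: ds1[t] + ds2[t] for t in sorted(set(ds1) & set(ds2))}


def join_batches(ds1, ds2):
    """Inner-join (key, value) pairs of the batches that share a batch time."""
    joined = {}
    for t in sorted(set(ds1) & set(ds2)):
        joined[t] = [(k1, (v1, v2))
                     for k1, v1 in ds1[t]
                     for k2, v2 in ds2[t]
                     if k1 == k2]
    return joined


# Hypothetical streams, modeled as {batch_time: [(key, value), ...]}.
logins = {1: [("alice", "login")], 2: [("bob", "login")]}
alerts = {1: [("alice", "bad password")], 2: [("carol", "port scan")]}

print(union_batches(logins, alerts))
print(join_batches(logins, alerts))
```

Only pairs from the same batch time are combined: `alice` appears in both streams at time 1 and is joined, while `bob` and `carol` at time 2 have no matching key and produce an empty joined batch.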


Windowing Operations
--------------------

How can I process multiple RDDs within a window of time?

```python
ds2 = ds1.window(windowDuration=30, slideDuration=10)
```

- Groups the RDDs from the last 30 seconds into one window

- Produces a new window every 10 seconds

<img src="images/streaming-dstream-window.png">

Windowing Operations
--------------------

Count how many values fall into each of ten buckets (`x % 10`) over a
sliding window.

In [5]:
%%file test_window.py

from __future__ import print_function

from pprint import pprint
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

print('Initializing ssc')
ssc = StreamingContext(SparkContext(), batchDuration=1)

print('Initializing rdd_queue')
# Test data: five RDDs, each holding the integers 0-999
rdd_queue = []
for i in range(5):
    rdd_data = range(1000)
    rdd = ssc.sparkContext.parallelize(rdd_data)
    rdd_queue.append(rdd)
pprint(rdd_queue)

print('Creating queue stream')
# Bucket values by last digit, then count per bucket over a
# 4-second window that slides every 2 seconds
ds = ssc \
    .queueStream(rdd_queue) \
    .map(lambda x: (x % 10, 1)) \
    .window(windowDuration=4, slideDuration=2) \
    .reduceByKey(lambda v1, v2: v1 + v2)
ds.pprint()

print('Starting ssc')
ssc.start()
time.sleep(20)

print('Stopping ssc')
ssc.stop(stopSparkContext=True, stopGraceFully=True)

Overwriting test_window.py


- Let's run this and see what happens.

In [6]:
%%sh
$SPARK_HOME/bin/spark-submit test_window.py

Initializing ssc
Initializing rdd_queue
[PythonRDD[5] at RDD at PythonRDD.scala:43,
 PythonRDD[6] at RDD at PythonRDD.scala:43,
 PythonRDD[7] at RDD at PythonRDD.scala:43,
 PythonRDD[8] at RDD at PythonRDD.scala:43,
 PythonRDD[9] at RDD at PythonRDD.scala:43]
Creating queue stream
Starting ssc
-------------------------------------------
Time: 2016-02-18 16:29:03
-------------------------------------------
(0, 200)
(8, 200)
(4, 200)
(1, 200)
(5, 200)
(9, 200)
(2, 200)
(6, 200)
(3, 200)
(7, 200)

-------------------------------------------
Time: 2016-02-18 16:29:05
-------------------------------------------
(0, 400)
(8, 400)
(4, 400)
(1, 400)
(5, 400)
(9, 400)
(2, 400)
(6, 400)
(3, 400)
(7, 400)

-------------------------------------------
Time: 2016-02-18 16:29:07
-------------------------------------------
(0, 300)
(8, 300)
(4, 300)
(1, 300)
(5, 300)
(9, 300)
(2, 300)
(6, 300)
(3, 300)
(7, 300)

-------------------------------------------
Time: 2016-02-18 16:29:09
--------------------


----
Summary
----
- Streaming is the new black for data processing
- It is a different way of processing that requires its own idioms and logic
- Spark Streaming is a mini-batch system based on the DStream abstraction
- DStreams have 

<br>
<br> 
<br>

----