# Spark Streaming from Kafka

This notebook connects Spark to Kafka and uses Spark Streaming to process a Kafka stream whose messages are the string form of Python 'dict' objects.

## Notes

Useful references:

https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html


## Prep environment

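Spark needs some extra packages to talk to Kafka. PYSPARK_SUBMIT_ARGS must be set before the Spark context is created so that the packages are loaded at startup.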

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
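The value assigned to PYSPARK_SUBMIT_ARGS is a comma-separated list of Maven coordinates after --packages, followed by pyspark-shell. A minimal sketch of building the same string from a list, which may be easier to edit than one long literal. The helper name build_submit_args is hypothetical and not part of PySpark.

```python
# Maven coordinates copied from the cell above.
PACKAGES = [
    "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0",
    "com.databricks:spark-avro_2.11:3.2.0",
]


def build_submit_args(packages):
    """Hypothetical helper: join coordinates into a PYSPARK_SUBMIT_ARGS value."""
    return "--packages " + ",".join(packages) + " pyspark-shell"


submit_args = build_submit_args(PACKAGES)
print(submit_args)
```

The result can then be assigned to os.environ['PYSPARK_SUBMIT_ARGS'] exactly as in the cell above.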


In [2]:
from ast import literal_eval

In [3]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# define the update function for stateful counting:
# add the values from the current batch to the running total
def updateTotalCount(currentCount, countState):
    if countState is None:
        countState = 0
    return sum(currentCount, countState)

# create spark and streaming contexts
sc = SparkContext("local[*]", "KafkaDirectStream")
ssc = StreamingContext(sc, 10)

# defining the checkpoint directory
ssc.checkpoint("/tmp")
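updateTotalCount takes the (new values, previous state) arguments that DStream.updateStateByKey passes to its update function. Stateful operations of that kind are what need a checkpoint directory. The streams defined later in this notebook do not call it, so the following is only a sketch of how it behaves, run on plain Python lists with no Spark involved. The batch contents are made up for illustration.

```python
def updateTotalCount(currentCount, countState):
    # countState is None the first time a key is seen
    if countState is None:
        countState = 0
    # sum(iterable, start) adds the new values onto the previous total
    return sum(currentCount, countState)


# Made-up values for one key across three consecutive batches.
batches = [[1, 1, 1], [], [1, 1]]

state = None
history = []
for new_values in batches:
    state = updateTotalCount(new_values, state)
    history.append(state)

print(history)  # [3, 3, 5]
```

The total carries over an empty batch unchanged, which is the difference between a running count and a per-batch count.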

## Start a Kafka stream for Spark to subscribe

With lsst-dm/alert_stream, in an external shell:

docker run -it --network=alertstream_default alert_stream python bin/sendAlertStream.py my-stream 10 --no-stamps --encode-off


## Create output for Spark to print

kafkaStream is the raw DStream of (key, value) tuples read from Kafka.

alert_dstream applies a 'map' that parses each message value into an alert dict.

alertId_dstream applies a 'map' that keeps only the alertId of each alert.

alertId_counts counts how many times each alertId appears in alertId_dstream within each batch (the 10-second batch interval configured above).
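The chain of streams described above can be sketched in plain Python on a single batch, without Spark or Kafka. The records below are made up and far smaller than real alerts. The list comprehensions and Counter only imitate what map and countByValue do to each batch.

```python
from ast import literal_eval
from collections import Counter

# Made-up batch shaped like the Kafka records: (key, value) tuples
# where the value is the string form of a Python dict.
batch = [
    (None, "{'alertId': 1, 'l1dbId': 10}"),
    (None, "{'alertId': 2, 'l1dbId': 20}"),
    (None, "{'alertId': 1, 'l1dbId': 30}"),
]

# Like alert_dstream: parse the value of each record into a dict.
alerts = [literal_eval(value) for _key, value in batch]

# Like alertId_dstream: keep only the alertId.
alert_ids = [alert['alertId'] for alert in alerts]

# Like alertId_counts: count occurrences of each distinct alertId.
counts = sorted(Counter(alert_ids).items())

print(counts)  # [(1, 2), (2, 1)]
```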


In [4]:
kafkaStream = KafkaUtils.createDirectStream(
    ssc, ['my-stream'],
    {'bootstrap.servers': 'kafka:9092', 'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})

kafkaStream.pprint()

In [5]:
alert_dstream = kafkaStream.map(lambda alert: literal_eval(alert[1]))
alert_dstream.pprint()
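The message values are the string form of a Python dict, written with single quotes, so they are not valid JSON. That is likely why this notebook parses them with literal_eval rather than json.loads. A small sketch of the difference, using two fields taken from the sample output in this notebook:

```python
import json
from ast import literal_eval

# A shortened message value, in the same single-quoted form as the stream.
value = "{'alertId': 1231321321, 'l1dbId': 222222222}"

# literal_eval safely evaluates Python literals, including single-quoted dicts.
alert = literal_eval(value)

# json.loads rejects the same string because JSON requires double quotes.
try:
    json.loads(value)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(alert['alertId'], json_ok)  # 1231321321 False
```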

In [6]:
# despite the name, this is used with map: it extracts one field from each alert
def filter_alertId(alert):
    return alert['alertId']

In [7]:
alertId_dstream = alert_dstream.map(filter_alertId)
alertId_dstream.pprint()

In [8]:
alertId_counts = alertId_dstream.countByValue()
alertId_counts.pprint()

## Start the streaming context

Once the context starts, the pprint output of each stream defined above appears after every batch. Interrupt the kernel to stop waiting.

In [9]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2017-04-20 23:59:10
-------------------------------------------
(None, "{'alertId': 1231321321, 'l1dbId': 222222222, 'diaSource': {'diaSourceId': 281323062375219200, 'ccdVisitId': 111111, 'midPointTai': 1480360995, 'filterName': 'r', 'ra': 351.570546978, 'decl': 0.126243049656, 'ra_decl_Cov': {'raSigma': 0.00028, 'declSigma': 0.00028, 'ra_decl_Cov': 0.00029}, 'x': 112.1, 'y': 121.1, 'x_y_Cov': {'xSigma': 1.2, 'ySigma': 1.1, 'x_y_Cov': 1.2}, 'snr': 41.1, 'psFlux': 1241.0, 'flags': 0}, 'prv_diaSources': [{'diaSourceId': 281323062375219198, 'ccdVisitId': 111111, 'midPointTai': 1480360995, 'filterName': 'r', 'ra': 351.570546978, 'decl': 0.126243049656, 'ra_decl_Cov': {'raSigma': 0.00028, 'declSigma': 0.00028, 'ra_decl_Cov': 0.00029}, 'x': 112.1, 'y': 121.1, 'x_y_Cov': {'xSigma': 1.2, 'ySigma': 1.1, 'x_y_Cov': 1.2}, 'snr': 41.1, 'psFlux': 1241.0, 'flags': 0}, {'diaSourceId': 281323062375219199, 'ccdVisitId': 111111, 'midPointTai': 1480360995

KeyboardInterrupt: 

In [10]:
# stop the streaming context first, keeping the Spark context alive
ssc.stop(stopSparkContext=False)

In [11]:
sc.stop()