# Spark Streaming from Kafka

This notebook shows how to connect Spark to Kafka and use Spark Streaming to process a Kafka stream of alerts sent as plain Python 'str' payloads rather than Avro.

## Notes

Useful references.

https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html

https://www.hugopicado.com/2016/09/17/spark-stateful-streaming-with-python.html

## Prep environment

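Spark needs the Kafka connector packages to talk to Kafka. PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created so that PySpark fetches them at startup.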

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
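
The long argument string above is easier to check when assembled from a list. The following is a minimal sketch of that idea, not part of the original notebook; the helper name `build_submit_args` is hypothetical, and the Maven coordinates are copied from the cell above.

```python
import os

# Maven coordinates copied from the cell above.
PACKAGES = [
    "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0",
    "com.databricks:spark-avro_2.11:3.2.0",
]


def build_submit_args(packages):
    """Hypothetical helper: join coordinates into a PYSPARK_SUBMIT_ARGS value."""
    # --packages takes a comma-separated list; 'pyspark-shell' must come last.
    return "--packages " + ",".join(packages) + " pyspark-shell"


submit_args = build_submit_args(PACKAGES)
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

As with the original cell, this only takes effect if it runs before the SparkContext is created.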


In [2]:
from ast import literal_eval

In [3]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create the Spark context and a streaming context with a 10-second batch interval
sc = SparkContext("local[*]", "KafkaDirectStream")
ssc = StreamingContext(sc, 10)

# Set the checkpoint directory
ssc.checkpoint("/tmp")

## Start a Kafka stream for Spark to subscribe to

With lsst-dm/alert_stream, in an external shell:

docker run -it --network=alertstream_default alert_stream python bin/sendAlertStream.py my-stream 10 --no-stamps --encode-off --repeat --max-repeats 3


## Create output for Spark to print

kafkaStream is the raw Spark DStream of Kafka records, each a (key, value) tuple.

alert_dstream applies a map that parses the str value of each record back into an alert dict.

alertId_dstream applies a map that extracts only the alertId from each alert.

filter_all demonstrates a filtered stream (ra > 350) that should catch all the alerts.

filter_empty demonstrates a filtered stream (ra < 350) that should be empty.
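
The parsing and alertId steps can be tried without Spark on a single record. This is a minimal sketch: the record below is a hypothetical, heavily trimmed version of the (key, value) tuple that pprint shows, and `parse_alert` is an illustrative name rather than something defined in this notebook.

```python
from ast import literal_eval

# Hypothetical trimmed record: Kafka key is None, value is the str form of a dict.
record = (
    None,
    "{'alertId': 1231321321, 'diaSource': {'ra': 351.570546978, 'decl': 0.126243049656}}",
)


def parse_alert(record):
    """Turn the str payload (element 1 of the tuple) back into a dict."""
    # literal_eval only evaluates Python literals, so it does not run arbitrary code.
    return literal_eval(record[1])


alert = parse_alert(record)
alert_id = alert['alertId']
ra = alert['diaSource']['ra']
```

The same function could be passed to kafkaStream.map in place of the lambda used in the next cell.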

In [4]:
kafkaStream = KafkaUtils.createDirectStream(
    ssc,
    ['my-stream'],
    {'bootstrap.servers': 'kafka:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})

kafkaStream.pprint()

In [5]:
# Each Kafka record is a (key, value) tuple; the value is the str form of the alert dict
alert_dstream = kafkaStream.map(lambda alert: literal_eval(alert[1]))
alert_dstream.count().map(lambda x: 'Alerts in this window: %s' % x).pprint()
alert_dstream.pprint()

In [6]:
def map_alertId(alert):
    return alert['alertId']

In [7]:
alertId_dstream = alert_dstream.map(map_alertId)
alertId_dstream.count().map(lambda x:'AlertId alerts in this window: %s' % x).pprint()  
alertId_dstream.pprint()

In [8]:
def filter_allRa(alert):
    # Keep alerts whose right ascension is above 350
    return alert['diaSource']['ra'] > 350

In [9]:
filter_all = alert_dstream.filter(filter_allRa)
filter_all.count().map(lambda x:'Filter_all alerts in this window: %s' % x).pprint()  
filter_all.pprint()

In [10]:
def filter_emptyRa(alert):
    # Keep alerts whose right ascension is below 350
    return alert['diaSource']['ra'] < 350

In [11]:
filter_empty = alert_dstream.filter(filter_emptyRa)
filter_empty.count().map(lambda x:'Filter_empty alerts in this window: %s' % x).pprint()  
filter_empty.pprint()
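
The behavior of the two filters can be checked on a plain Python list before starting the stream. This is a sketch under stated assumptions: the sample alerts are hypothetical and carry only the fields the predicates read, with the ra value taken from the sample output in this notebook, and the snake_case function names are illustrative.

```python
def filter_all_ra(alert):
    # Same predicate as filter_allRa: right ascension above 350.
    return alert['diaSource']['ra'] > 350


def filter_empty_ra(alert):
    # Same predicate as filter_emptyRa: right ascension below 350.
    return alert['diaSource']['ra'] < 350


# Hypothetical alerts; the alertId values are made up for illustration.
alerts = [
    {'alertId': 1, 'diaSource': {'ra': 351.570546978}},
    {'alertId': 2, 'diaSource': {'ra': 351.570546978}},
]

kept_all = list(filter(filter_all_ra, alerts))
kept_empty = list(filter(filter_empty_ra, alerts))
```

Because both comparisons are strict, an alert with ra exactly equal to 350 would appear in neither stream.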

## Start the streaming context

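Once the context starts, the pprint output of the streams defined above appears below, one batch every 10 seconds. awaitTermination blocks until the job is interrupted.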

In [12]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2017-04-27 18:40:50
-------------------------------------------
(None, "{'alertId': 1231321321, 'l1dbId': 222222222, 'diaSource': {'diaSourceId': 281323062375219200, 'ccdVisitId': 111111, 'midPointTai': 1480360995, 'filterName': 'r', 'ra': 351.570546978, 'decl': 0.126243049656, 'ra_decl_Cov': {'raSigma': 0.00028, 'declSigma': 0.00028, 'ra_decl_Cov': 0.00029}, 'x': 112.1, 'y': 121.1, 'x_y_Cov': {'xSigma': 1.2, 'ySigma': 1.1, 'x_y_Cov': 1.2}, 'snr': 41.1, 'psFlux': 1241.0, 'flags': 0}, 'prv_diaSources': [{'diaSourceId': 281323062375219198, 'ccdVisitId': 111111, 'midPointTai': 1480360995, 'filterName': 'r', 'ra': 351.570546978, 'decl': 0.126243049656, 'ra_decl_Cov': {'raSigma': 0.00028, 'declSigma': 0.00028, 'ra_decl_Cov': 0.00029}, 'x': 112.1, 'y': 121.1, 'x_y_Cov': {'xSigma': 1.2, 'ySigma': 1.1, 'x_y_Cov': 1.2}, 'snr': 41.1, 'psFlux': 1241.0, 'flags': 0}, {'diaSourceId': 281323062375219199, 'ccdVisitId': 111111, 'midPointTai': 1480360995

KeyboardInterrupt: 

-------------------------------------------
Time: 2017-04-27 18:42:10
-------------------------------------------



In [13]:
ssc.stop()

In [14]:
sc.stop()