# Spark Streaming from Kafka with Avro-formatted data

This notebook shows how to connect Spark to Kafka and use Spark Streaming to process a Kafka stream of Avro objects written in schemaless encoding.

## Set up schemas for decoding

Schemas are pulled automatically from GitHub during the Docker build.

In [1]:
!ls ../../sample-avro-alert/schema

alert.avsc   diaobject.avsc  simple.avsc
cutout.avsc  diasource.avsc  ssobject.avsc


In [2]:
schema_files = ["../../sample-avro-alert/schema/diasource.avsc",
                "../../sample-avro-alert/schema/diaobject.avsc",
                "../../sample-avro-alert/schema/ssobject.avsc",
                "../../sample-avro-alert/schema/cutout.avsc",
                "../../sample-avro-alert/schema/alert.avsc"]

In [3]:
import fastavro
import avro.schema
import json

In [4]:
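def loadSingleAvsc(file_path, names):
    """Load a single .avsc file, resolving named types against `names`.
    """
    with open(file_path) as file_text:
        json_data = json.load(file_text)
    # `names` carries the named types seen so far, so later files can refer to earlier ones.
    schema = avro.schema.SchemaFromJSONData(json_data, names)
    return schema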

In [5]:
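def combineSchemas(schema_files):
    """Combine multiple nested schemas into a single schema.

    Files must be listed so that each schema comes after the schemas it
    references; the last file is the top-level schema that is returned.
    """
    known_schemas = avro.schema.Names()

    # Every file is parsed against the shared registry; only the last result is kept.
    for s in schema_files:
        schema = loadSingleAvsc(s, known_schemas)
    return schema.to_json()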
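
The loop in combineSchemas depends on the order of schema_files: a schema can only be parsed once the types it refers to are already known, which is presumably why alert.avsc comes last. The sketch below is not part of the original notebook. It checks that ordering on plain parsed-JSON schemas with the standard library only. The helper names (named_types, referenced_types, first_unresolved) and the two toy schemas are hypothetical, and the check is a simplification that compares names literally and ignores Avro namespace resolution.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def named_types(node, found=None):
    """Collect names of record/enum/fixed types defined anywhere in a schema."""
    found = set() if found is None else found
    if isinstance(node, dict):
        if node.get("type") in ("record", "enum", "fixed") and "name" in node:
            found.add(node["name"])
        for value in node.values():
            named_types(value, found)
    elif isinstance(node, list):
        for item in node:
            named_types(item, found)
    return found


def referenced_types(node, found=None):
    """Collect type names used by reference (bare strings that are not primitives)."""
    found = set() if found is None else found
    if isinstance(node, str):
        if node not in PRIMITIVES:
            found.add(node)
    elif isinstance(node, list):  # a union
        for item in node:
            referenced_types(item, found)
    elif isinstance(node, dict):
        kind = node.get("type")
        if kind == "record":
            for field in node.get("fields", []):
                referenced_types(field["type"], found)
        elif kind == "array":
            referenced_types(node["items"], found)
        elif kind == "map":
            referenced_types(node["values"], found)
        elif kind not in ("enum", "fixed"):
            referenced_types(kind, found)
    return found


def first_unresolved(schemas):
    """Return (index, missing names) for the first schema with unknown references, else None."""
    known = set()
    for index, schema in enumerate(schemas):
        known |= named_types(schema)
        missing = referenced_types(schema) - known
        if missing:
            return index, sorted(missing)
    return None


# Hypothetical toy schemas, much smaller than the real .avsc files.
source = {"type": "record", "name": "diaSource",
          "fields": [{"name": "ra", "type": "double"}]}
alert = {"type": "record", "name": "alert",
         "fields": [{"name": "alertId", "type": "long"},
                    {"name": "diaSource", "type": "diaSource"}]}

good_order = first_unresolved([source, alert])
bad_order = first_unresolved([alert, source])
print(good_order)
print(bad_order)
```

With the referenced schema first, nothing is unresolved; with the order reversed, the first schema reports diaSource as missing.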

In [6]:
schema = combineSchemas(schema_files)

In [7]:
schema

{'doc': 'sample avro alert schema v1.0',
 'fields': [{'doc': 'add descriptions like this',
   'name': 'alertId',
   'type': 'long'},
  {'name': 'l1dbId', 'type': 'long'},
  {'name': 'diaSource',
   'type': {'fields': [{'name': 'diaSourceId', 'type': 'long'},
     {'name': 'ccdVisitId', 'type': 'long'},
     {'default': None, 'name': 'diaObjectId', 'type': ['long', 'null']},
     {'default': None, 'name': 'ssObjectId', 'type': ['long', 'null']},
     {'default': None, 'name': 'parentDiaSourceId', 'type': ['long', 'null']},
     {'name': 'midPointTai', 'type': 'double'},
     {'name': 'filterName', 'type': 'string'},
     {'name': 'ra', 'type': 'double'},
     {'name': 'decl', 'type': 'double'},
     {'name': 'ra_decl_Cov',
      'type': [{'fields': [{'name': 'raSigma', 'type': 'float'},
         {'name': 'declSigma', 'type': 'float'},
         {'name': 'ra_decl_Cov', 'type': 'float'}],
        'name': 'ra_decl_Cov',
        'namespace': 'lsst.alert',
        'type': 'record'}]},
     {'
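
The combined schema is a plain nested dict, so it can be summarised instead of printed in full. The sketch below is an illustration only, not part of the original notebook. The helpers type_label and field_summary are hypothetical names, and excerpt is a hand-copied subset of the nested fields shown in the output above rather than the full schema.

```python
def type_label(avro_type):
    """Render an Avro type (string, union list, or dict) as a short label."""
    if isinstance(avro_type, str):
        return avro_type
    if isinstance(avro_type, list):  # union: join the branches
        return " | ".join(type_label(t) for t in avro_type)
    if isinstance(avro_type, dict):
        kind = avro_type.get("type")
        if kind in ("record", "enum", "fixed"):
            return "%s %s" % (kind, avro_type.get("name", "?"))
        if kind == "array":
            return "array<%s>" % type_label(avro_type["items"])
        if kind == "map":
            return "map<%s>" % type_label(avro_type["values"])
        return type_label(kind)
    return repr(avro_type)


def field_summary(record_schema):
    """Return (field name, type label) pairs for the direct fields of a record."""
    return [(field["name"], type_label(field["type"]))
            for field in record_schema.get("fields", [])]


# Hand-copied excerpt of the nested fields printed above (not the full schema).
excerpt = {"fields": [
    {"name": "diaSourceId", "type": "long"},
    {"default": None, "name": "diaObjectId", "type": ["long", "null"]},
    {"name": "ra", "type": "double"},
    {"name": "ra_decl_Cov",
     "type": [{"fields": [{"name": "raSigma", "type": "float"},
                          {"name": "declSigma", "type": "float"},
                          {"name": "ra_decl_Cov", "type": "float"}],
               "name": "ra_decl_Cov",
               "namespace": "lsst.alert",
               "type": "record"}]},
]}

summary = field_summary(excerpt)
for name, label in summary:
    print("%-12s %s" % (name, label))
```

In the notebook, the same helper could be called on the schema dict returned by combineSchemas.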

## Prep Spark environment

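Spark needs extra packages to talk to Kafka and to handle Avro. They are requested through PYSPARK_SUBMIT_ARGS, which must be set before the SparkContext is created.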

In [8]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
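
The long argument string above is easier to maintain when it is assembled from a list of Maven coordinates. The following is a minimal sketch of that idea and not a change to what the notebook runs: it produces the same string as the cell above. The variable names packages and submit_args are hypothetical. In each coordinate, the _2.11 suffix is the Scala version the artifact was built for, followed by the artifact version.

```python
import os

# One Maven coordinate per line: group:artifact_scalaVersion:version
packages = [
    "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0",
    "com.databricks:spark-avro_2.11:3.2.0",
]

# --packages takes a comma-separated list; pyspark-shell must come last.
submit_args = "--packages %s pyspark-shell" % ",".join(packages)
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```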

In [9]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# create spark and streaming contexts
sc = SparkContext("local[*]", "KafkaDirectStreamAvro")
ssc = StreamingContext(sc, 10)

# defining the checkpoint directory
ssc.checkpoint("/tmp")

## Start a Kafka stream for Spark to subscribe to

The stream is sent without stamps (the --no-stamps flag) and uses Avro schemaless encoding.

Using lsst-dm/alert_stream, run the following in an external shell:

docker run -it --network=alertstream_default alert_stream python bin/sendAlertStream.py my-stream 10 --no-stamps --repeat --max-repeats 3

## Create output for Spark to print

kafkaStream is configured so that the decoder is applied directly to each message value.

alerts extracts the decoded alert message from each (key, value) record.

alertIds applies a map that extracts each alert's alertId.

filter_all demonstrates a filtered stream (ra > 350) that should catch all the alerts.

filter_empty demonstrates a filtered stream (ra < 350) that should be empty.

In [10]:
import io

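def decoder(msg):
    """Decode one schemaless Avro message using the combined alert schema."""
    # A freshly created BytesIO is already positioned at 0, so no seek is needed.
    bytes_io = io.BytesIO(msg)
    alert = fastavro.schemaless_reader(bytes_io, schema)
    return alert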
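
"Schemaless" means the message bytes carry no schema and no field names, so the reader has to supply the schema, as decoder does above. The sketch below illustrates this for the simplest case and is not how the notebook decodes alerts (fastavro does the real work). It hand-encodes two Avro long values, which the Avro binary encoding writes as zigzag variable-length integers, and reads them back in order. The function names and the two sample values are hypothetical.

```python
import io


def encode_long(value):
    """Encode a signed 64-bit integer as an Avro long (zigzag + varint)."""
    zigzag = (value << 1) ^ (value >> 63)  # map signed to unsigned
    out = bytearray()
    while zigzag & ~0x7F:                  # more than 7 bits left
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def decode_long(stream):
    """Read one Avro long from a binary stream."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a long")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:                # continuation bit clear: last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)     # undo zigzag


# Hypothetical values for the first two fields of the alert record.
alert_id, l1db_id = 1231321321, 222222222

# Record fields are written back to back, in schema order, with no names or tags.
payload = encode_long(alert_id) + encode_long(l1db_id)

stream = io.BytesIO(payload)
decoded = (decode_long(stream), decode_long(stream))
print(payload)
print(decoded)
```

Without the schema there is no way to tell from the payload that it holds two longs, or what they are called, which is why the schema built earlier must match the one the producer used.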

In [11]:
kafkaStream = KafkaUtils.createDirectStream(
    ssc, ['my-stream'],
    {'bootstrap.servers': 'kafka:9092', 'auto.offset.reset': 'smallest', 'group.id': 'spark-group'},
    valueDecoder=decoder)

# Each record is a (key, value) pair; keep only the decoded value.
alerts = kafkaStream.map(lambda x: x[1])
alerts.pprint()

In [12]:
def map_alertId(alert):
    return alert['alertId']

In [13]:
alertIds = alerts.map(map_alertId)
alertIds.count().map(lambda x: 'AlertId alerts in this window: %s' % x).pprint()
alertIds.pprint()

In [14]:
def filter_allRa(alert):
    return alert['diaSource']['ra'] > 350

In [15]:
filter_all = alerts.filter(filter_allRa)
filter_all.count().map(lambda x: 'Filter_all alerts in this window: %s' % x).pprint()
filter_all.pprint()

In [16]:
def filter_emptyRa(alert):
    return alert['diaSource']['ra'] < 350

In [17]:
filter_empty = alerts.filter(filter_emptyRa)
filter_empty.count().map(lambda x: 'Filter_empty alerts in this window: %s' % x).pprint()
filter_empty.pprint()
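
The map and filter steps above can be tried on an ordinary Python list before any stream is running. The following sketch is an illustration under stated assumptions, not part of the original notebook: sample_batch is a hypothetical pair of heavily trimmed alerts whose ra is borrowed from a value that appears in the sample output, and the three functions mirror map_alertId, filter_allRa, and filter_emptyRa.

```python
def map_alert_id(alert):
    """Mirror of map_alertId: pull the alertId out of a decoded alert."""
    return alert['alertId']


def ra_above_350(alert):
    """Mirror of filter_allRa."""
    return alert['diaSource']['ra'] > 350


def ra_below_350(alert):
    """Mirror of filter_emptyRa."""
    return alert['diaSource']['ra'] < 350


# Hypothetical, heavily trimmed alerts standing in for one batch.
sample_batch = [
    {'alertId': 1, 'diaSource': {'ra': 351.570546978}},
    {'alertId': 2, 'diaSource': {'ra': 351.570546978}},
]

alert_ids = list(map(map_alert_id, sample_batch))
kept_all = list(filter(ra_above_350, sample_batch))
kept_empty = list(filter(ra_below_350, sample_batch))

print('AlertId alerts in this window: %s' % len(alert_ids))
print('Filter_all alerts in this window: %s' % len(kept_all))
print('Filter_empty alerts in this window: %s' % len(kept_empty))
```

Because both predicates use strict comparisons, an alert with ra exactly equal to 350 would pass neither filter.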

## Start the streaming context

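Once the context starts, the pprint output of the streams defined above appears below, one block per stream for each 10-second batch. awaitTermination blocks until the kernel is interrupted.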

In [18]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2017-04-27 19:06:50
-------------------------------------------
{'prv_diaSources': [{'dipLnL': None, 'dipNdata': None, 'decl': 0.126243049656, 'dipMeanFlux': None, 'ra': 351.570546978, 'extendedness': None, 'dipLength': None, 'psLnL': None, 'iyyPSF': None, 'fpBkgdErr': None, 'dipDecl': None, 'dip_Cov': None, 'ixxPSF': None, 'ccdVisitId': 111111, 'psDecl': None, 'trailLength': None, 'trailLnL': None, 'trailAngle': None, 'totFlux': None, 'dipChi2': None, 'ssObjectId': None, 'ra_decl_Cov': {'raSigma': 0.0002800000074785203, 'declSigma': 0.0002800000074785203, 'ra_decl_Cov': 0.0002899999963119626}, 'spuriousness': None, 'apFlux': None, 'midPointTai': 1480360995.0, 'totFluxErr': None, 'ps_Cov': None, 'diffFluxErr': None, 'dipAngle': None, 'psChi2': None, 'parentDiaSourceId': None, 'ixy': None, 'fpBkgd': None, 'trailFlux': None, 'ixx': None, 'trailNdata': None, 'snr': 41.099998474121094, 'iyy': None, 'dipFluxDiff': None, 'apFluxErr': None, 'd

KeyboardInterrupt: 

In [19]:
ssc.stop()

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------
AlertId alerts in this window: 0

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------
Filter_all alerts in this window: 0

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------
Filter_empty alerts in this window: 0

-------------------------------------------
Time: 2017-04-27 19:08:30
-------------------------------------------



In [20]:
sc.stop()