# Processing Sensor Data from Kafka with Structured Streaming

This example explores how to consume and produce data with the Structured Streaming API.

Using the data that the Akka ingestion microservice publishes to Kafka, we will:
 - use the Kafka `source` to consume events from the `sensor-raw` topic in Kafka
 - implement the application logic using the Dataset API
 - use the `memory` sink to visualize the data
 - use the `kafka` sink to publish our results to a different topic and make them available downstream
 - have some fun!  

## Common Definitions
We define a set of parameters that describe our current environment.

In [ ]:
val sourceTopic = "sensor-raw"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = "172.17.0.2:9092" // local
// val kafkaBootstrapServer = "10.2.2.191:1025" // fast-data-ec2

sourceTopic: String = sensor-raw
targetTopic: String = sensor-processed
kafkaBootstrapServer: String = 172.17.0.2:9092


# PART I: Read and Visualize a Stream from Kafka

## Read a stream from Kafka
We use the `kafka` source to subscribe to the `sourceTopic` that contains the raw sensor data.
This results in a streaming `DataFrame` that we use to operate on the underlying data.

In [ ]:
val rawData = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBootstrapServer)
      .option("subscribe", sourceTopic)
      .option("startingOffsets", "latest")
      .load()

rawData: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]


In [ ]:
rawData.isStreaming

res3: Boolean = true


In [ ]:
rawData.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



## Extract the payload from the `value` field

- The data in Kafka is contained in the `value` field of the message envelope.
- To parse that payload, we need to know the schema of the data in the stream.
- We use a `case class` to define that schema. 

_Tip: This method is more convenient than writing the Spark SQL schema (`StructType`) by hand._

### Declare the Schema

In [ ]:
case class SensorData(id: String, ts: Long, value: Double)

defined class SensorData


In [ ]:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema

import org.apache.spark.sql.Encoders
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,StringType,true), StructField(ts,LongType,false), StructField(value,DoubleType,false))


### Parse the Data
To transform the data in the payload, we need to:
- convert the binary `value` field to a string
- use the `json` support in Spark to transform our incoming data into a structured streaming `Dataset`
- select the fields from the record to facilitate further processing


In [ ]:
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]

rawValues: org.apache.spark.sql.Dataset[String] = [value: string]
jsonValues: org.apache.spark.sql.DataFrame = [record: struct<id: string, ts: bigint ... 1 more field>]
sensorData: org.apache.spark.sql.Dataset[SensorData] = [id: string, ts: bigint ... 1 more field]


In [ ]:
sensorData.printSchema()

root
 |-- id: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- value: double (nullable = true)
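
The parsing steps above can be tried on a static `Dataset` before attaching them to the stream. The following is a minimal sketch under some assumptions: `SampleReading` is a hypothetical stand-in for `SensorData` (renamed to avoid shadowing it), `parseSession` is a hypothetical session name, and the JSON payload is made up to match the assumed shape of the messages in the topic.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.from_json

// Hypothetical stand-in for the notebook's SensorData case class
case class SampleReading(id: String, ts: Long, value: Double)

// Reuses the active session if one exists (as in a notebook); otherwise starts a local one
val parseSession = SparkSession.builder().master("local[*]").appName("parse-sketch").getOrCreate()
import parseSession.implicits._

// Derive the schema from the case class, as done above for SensorData
val sampleSchema = Encoders.product[SampleReading].schema

// Made-up payload, assumed to have the same shape as the messages in the topic
val payloads = Seq("""{"id":"office","ts":1544049205981,"value":21.5}""").toDS()

// Same three steps as the streaming version: string -> struct -> typed record
val parsed = payloads
  .select(from_json($"value", sampleSchema) as "record")
  .select("record.*")
  .as[SampleReading]
  .collect()
  .toList
```

Because the same `select` expressions work on static and streaming Datasets, a quick check like this can catch schema mismatches without waiting for live data.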



## Visualize the Stream
To view the streaming data, we will use the `memory` sink and query the resulting table to get samples of the data.

In [ ]:
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")    // this query name will be the SQL table name
  .outputMode("append")
  .format("memory")
  .start()

visualizationQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@4f153d5a


## Explore the Data
The `memory` sink creates an in-memory SQL table (like a `tempTable`) that we can query using Spark SQL.
The result of the query is a static `DataFrame` that contains a snapshot of the data.

In [ ]:
val sampleDataset = sparkSession.sql("select * from visualization")


sampleDataset: org.apache.spark.sql.DataFrame = [id: string, ts: bigint ... 1 more field]


In [ ]:
sampleDataset

res19: org.apache.spark.sql.DataFrame = [id: string, ts: bigint ... 1 more field]


In [ ]:
sampleDataset.count

res21: Long = 70394


In [ ]:
sampleDataset.count

res29: Long = 118065


## Visualize the Data
We build a custom live update by periodically querying the in-memory table for the latest data.

In [ ]:
val dummy = Seq((System.currentTimeMillis, 0.1), (System.currentTimeMillis, 0.1))

val chart = CustomPlotlyChart(dummy,
                  layout="{title: 'sensor data sample', xaxis: {title: 'time(seconds)'}, yaxis: {title: 'value'}}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2' }")

dummy: Seq[(Long, Double)] = List((1544049205981,0.1), (1544049205981,0.1))
chart: notebook.front.widgets.charts.CustomPlotlyChart[Seq[(Long, Double)]] = <CustomPlotlyChart widget>


### Some Thread Magic ahead: Async update of our visualization
We will use a plain old Thread to run a recurrent query on our in-memory table and update the chart accordingly.


In [ ]:
@volatile var running = true

running: Boolean = true


In [ ]:
import scala.concurrent.duration._
import scala.annotation.tailrec

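// Background thread that re-queries the in-memory table once per second and refreshes the chart.
// Note: `.start()` returns Unit, so `updater` does not hold a reference to the thread;
// the loop is stopped by setting `running = false`.
val updater = new Thread() {
  @tailrec
  def visualize(): Unit = {
    val lastMinute = System.currentTimeMillis - 1.minute.toMillis
    // Keep only the last minute of readings for the "office" sensor;
    // the x value is the number of seconds elapsed within the current hour
    val data = sampleDataset.where($"ts" > lastMinute and $"id" === "office").as[SensorData]
                            .map { case SensorData(id, ts, value) => (ts / 1000 % 3600, value) }.collect
    if (data.nonEmpty) chart.applyOn(data)
    if (running) {
      Thread.sleep(1.second.toMillis)
      visualize()
    } else ()
  }

  override def run(): Unit = {
    visualize()
  }
}.start()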


import scala.concurrent.duration._
import scala.annotation.tailrec
updater: Unit = ()
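
As an alternative to the hand-rolled thread and the `running` flag, a periodic refresh can also be driven by a `ScheduledExecutorService` from the JDK. This is only a sketch of that alternative, not what the notebook uses: `ticks` and `refresh` are hypothetical names, and the body of `refresh` stands in for "query the in-memory table and call `chart.applyOn(data)`".

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Counts how many times the refresh ran (for illustration only)
val ticks = new AtomicInteger(0)
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Hypothetical stand-in for the chart update logic.
// If this task throws, the scheduler suppresses all later executions,
// so real code would usually catch and log exceptions here.
val refresh = new Runnable {
  override def run(): Unit = { ticks.incrementAndGet(); () }
}

// Run immediately, then every 100 ms; the returned handle lets us cancel later
val handle = scheduler.scheduleAtFixedRate(refresh, 0L, 100L, TimeUnit.MILLISECONDS)

Thread.sleep(350)

// Stop the periodic task and release the worker thread
handle.cancel(false)
scheduler.shutdown()
val finished = scheduler.awaitTermination(1, TimeUnit.SECONDS)
```

The design difference is that the handle returned by `scheduleAtFixedRate` gives an explicit way to cancel the update, whereas the thread version relies on the shared `running` variable.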


In [ ]:
chart

res47: notebook.front.widgets.charts.CustomPlotlyChart[Seq[(Long, Double)]] = <CustomPlotlyChart widget>
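
The x values sent to the chart come from the expression `ts/1000%3600` in the updater. A small plain-Scala sketch of what that expression computes follows; `secondOfHour` is a hypothetical helper name introduced only for this illustration.

```scala
// Maps an epoch timestamp in milliseconds to the second within its hour (0 to 3599),
// mirroring the `ts/1000%3600` expression used by the updater
def secondOfHour(tsMillis: Long): Long = tsMillis / 1000 % 3600

// Timestamp taken from the dummy sample shown earlier in this notebook
val x = secondOfHour(1544049205981L)
```

Since the value wraps back to 0 at every full hour, a one-minute sample that spans an hour boundary would show a jump on the x axis.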


In [ ]:
visualizationQuery.stop()


In [ ]:
running = false

running: Boolean = false


# PART II: Improve Data Quality

## Reduce noise with a moving average, using sliding windows


In [ ]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


In [ ]:
val toSeconds = udf((ts:Long) => ts/1000)
val sensorMovingAverage = sensorData.withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
                                          .withWatermark("timestamp", "30 seconds")
                                          .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
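                                          // SensorData exposes the reading as `value`; the alias keeps the `avg(temp)` column name used below
                                          .agg(avg($"value") as "avg(temp)")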

toSeconds: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(LongType)))
sensorMovingAverage: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


In [ ]:
sensorMovingAverage.printSchema

root
 |-- id: string (nullable = true)
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- avg(temp): double (nullable = true)
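
With a 30-second window that slides every 10 seconds, each reading contributes to three overlapping windows, which is what smooths the signal. The plain-Scala sketch below illustrates that assignment without Spark. It assumes windows aligned to the epoch, with an inclusive start and an exclusive end, and non-negative timestamps; `Window` and `windowsFor` are hypothetical names for this illustration, not Spark APIs.

```scala
// A window covers [startSec, endSec): start inclusive, end exclusive
final case class Window(startSec: Long, endSec: Long)

// Lists every sliding window that contains the given event time (in epoch seconds)
def windowsFor(eventSec: Long, sizeSec: Long = 30L, slideSec: Long = 10L): List[Window] = {
  // Start of the latest window that begins at or before the event
  val lastStart = eventSec - (eventSec % slideSec)
  // Walk back one slide at a time while the window still covers the event
  Iterator.iterate(lastStart)(_ - slideSec)
    .takeWhile(start => start > eventSec - sizeSec)
    .map(start => Window(start, start + sizeSec))
    .toList
    .reverse
}

// An event at second 1544023425 falls into three overlapping 30-second windows
val ws = windowsFor(1544023425L)
```

The number of windows per event equals the window size divided by the slide (here 30 / 10 = 3), so a shorter slide gives a smoother curve at the cost of more aggregation state.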



In [ ]:
val windowedSensorQuery = sensorMovingAverage.writeStream
  .queryName("movingAverage")    // this query name will be the table name
  .outputMode("append")  
  .format("memory")
  .start()

windowedSensorQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@34822989


### Get the data from the in-memory table

In [ ]:
val movingAvgDF = sparkSession.sql("select * from movingAverage")

movingAvgDF: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


In [ ]:
movingAvgDF

res39: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


### Chart the Moving Average Data

In [ ]:
import org.apache.spark.sql.functions._
val lastMinute: Long = System.currentTimeMillis/1000 - 5.minute.toSeconds
val mAvgSample = movingAvgDF.select($"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
                   .where($"timestamp" > lastMinute and $"id" === "office")
                   .orderBy($"timestamp")
                   .as[(Long, Double)]
                   .collect().map{case (ts, v) => (ts  % 3600,v)}


CustomPlotlyChart(mAvgSample,
                  layout=s"{title: 'moving average sensor data'}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2'}")

import org.apache.spark.sql.functions._
lastMinute: Long = 1544023424
mAvgSample: Array[(Long, Double)] = Array((1600,34.5), (1610,33.833333333333336), (1620,32.80952380952381), (1630,32.793103448275865), (1640,32.55172413793103), (1650,37.7), (1660,39.41379310344828))
res41: notebook.front.widgets.charts.CustomPlotlyChart[Array[(Long, Double)]] = <CustomPlotlyChart widget>


In [ ]:
// stop the ancillary visualization queries
windowedSensorQuery.stop()
visualizationQuery.stop()
running = false

running: Boolean = false


## Write our moving average data to our `sensor-processed` topic

In [ ]:
// First we prepare the schema to comply with the (key, value) model of Kafka
val kafkaFormat = sensorMovingAverage
.select($"id", $"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
.select($"id" as "key", to_json(struct($"id", $"timestamp", $"temp")) as "value")
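
To see what the Kafka sink will receive, the same `select` can be applied to a small static `DataFrame`. This is a sketch under assumptions: `formatSession` is a hypothetical session name and the sample row is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

// Reuses the active session if one exists (as in a notebook); otherwise starts a local one
val formatSession = SparkSession.builder().master("local[*]").appName("kafka-format-sketch").getOrCreate()
import formatSession.implicits._

// Made-up aggregate row: (sensor id, window start in epoch seconds, average)
val sampleAverages = Seq(("office", 1544023400L, 34.5)).toDF("id", "timestamp", "temp")

// Same shaping as above: the sensor id becomes the key, the whole record becomes a JSON value
val sampleRecords = sampleAverages
  .select($"id" as "key", to_json(struct($"id", $"timestamp", $"temp")) as "value")
  .as[(String, String)]
  .collect()
  .toList
```

Using the sensor id as the key means that, with Kafka's default partitioning, records for the same sensor land in the same partition and keep their relative order.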

In [ ]:
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter") 
  .outputMode("append") 
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint6")
  .option("failOnDataLoss", "false")
  .start()

## View Progress

In [ ]:
val progress = kafkaWriterQuery.recentProgress

In [ ]:
progress.map(entry  => (entry.inputRowsPerSecond, entry.processedRowsPerSecond))