# Reading Sensor Data from Kafka with Structured Streaming

This example explores the main aspects of the Structured Streaming API.

We will:
 - use the `kafka` source to consume events from the `sensor-raw` topic in Kafka
 - implement the application logic using the Dataset API
 - use the `memory` sink to visualize the data
 - use the `kafka` sink to publish our results to a different topic and make them available downstream
 - have some fun!

## Common Definitions
We define a series of parameters that are shared across the notebook.

In [ ]:
val sourceTopic = "sensor-raw"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = "172.17.0.2:9092" // local
//val kafkaBootstrapServer = "10.2.2.191:1025" // fast-data-ec2

sourceTopic: String = sensor-raw
targetTopic: String = sensor-processed
kafkaBootstrapServer: String = 172.17.0.2:9092


## Read a stream from Kafka
We use the `kafka` source to subscribe to the `sourceTopic` that contains the raw sensor data.
This results in a streaming `DataFrame` that we use to operate on the underlying data.

In [ ]:
val rawData = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBootstrapServer)
      .option("subscribe", sourceTopic)
      .option("startingOffsets", "latest")
      .load()

rawData: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]


In [ ]:
rawData.isStreaming

res4: Boolean = true


In [ ]:
rawData.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [ ]:
// Some DataFrame operations, such as count(), cannot run directly on a streaming DataFrame
rawData.count()

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.sc

## Explore the data stream
To view the streaming data, we will use the `memory` sink and query the resulting table to get samples of the data.

In [ ]:
val rawDataStream = rawData.writeStream
  .queryName("raw_data")    // this query name will be the SQL table name
  .outputMode("append")
  .format("memory")
  .start()

rawDataStream: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ce5babf


In [ ]:
val sampleDataset = sparkSession.sql("select * from raw_data")

sampleDataset: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]


In [ ]:
sampleDataset.isStreaming

res12: Boolean = false


In [ ]:
sampleDataset.show()

+----+--------------------+----------+---------+------+--------------------+-------------+
| key|               value|     topic|partition|offset|           timestamp|timestampType|
+----+--------------------+----------+---------+------+--------------------+-------------+
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25229|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25230|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25231|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25232|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25233|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25234|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25235|1970-01-01 00:59:...|            0|
|null|[7B 22 69 64 22 3...|sensor-raw|        0| 25236|1970-01-01 00:59:...|            0|

In [ ]:
rawDataStream.stop()

# Parse the Data
The actual payload is contained in the `value` field of the records that we get from the Kafka topic (see above).
We first need to convert that binary `value` field to a string and then use the JSON support in Spark to transform our incoming data into a typed streaming `Dataset`.
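
As a minimal sketch of what the conversion to string does, the snippet below decodes the first five bytes shown in the `value` column of the sample output above (`7B 22 69 64 22`) as UTF-8, which is the encoding Spark applies when casting binary to string. The names `payloadPrefix` and `decodedPrefix` are only illustrative and do not appear elsewhere in the notebook.

```scala
import java.nio.charset.StandardCharsets

// First five bytes of the `value` column, as printed by sampleDataset.show()
val payloadPrefix: Array[Byte] = Array(0x7B, 0x22, 0x69, 0x64, 0x22).map(_.toByte)

// Decoding the bytes as UTF-8 reveals the start of a JSON document
val decodedPrefix = new String(payloadPrefix, StandardCharsets.UTF_8)
println(decodedPrefix)
```

The decoded prefix is the opening of a JSON object with an `id` field, which is why the next step parses the string as JSON.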


## Declare the schema of the data in the stream 
We need to declare the schema of the data in the stream in order to parse it.

We use a case class to define the schema. It's much more convenient than using the SQL types directly.
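
For comparison, here is a sketch of the same schema spelled out with the SQL types. The field names, types and nullability mirror what `Encoders.product[SensorData].schema` reports for the case class used in this notebook. The name `explicitSchema` is hypothetical and is not used elsewhere.

```scala
import org.apache.spark.sql.types._

// Hand-written equivalent of the schema derived from the SensorData case class
val explicitSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("ts", LongType, nullable = false),
  StructField("temp", DoubleType, nullable = false),
  StructField("hum", DoubleType, nullable = false)
))
```

With the case class, the field list is written once and also gives us the type for `.as[SensorData]`.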

In [ ]:
case class SensorData(id: String, ts: Long, temp: Double, hum: Double)

defined class SensorData


In [ ]:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema

import org.apache.spark.sql.Encoders
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,StringType,true), StructField(ts,LongType,false), StructField(temp,DoubleType,false), StructField(hum,DoubleType,false))


In [ ]:
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]

rawValues: org.apache.spark.sql.Dataset[String] = [value: string]
jsonValues: org.apache.spark.sql.DataFrame = [record: struct<id: string, ts: bigint ... 2 more fields>]
sensorData: org.apache.spark.sql.Dataset[SensorData] = [id: string, ts: bigint ... 2 more fields]


In [ ]:
sensorData.printSchema()

root
 |-- id: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- temp: double (nullable = true)
 |-- hum: double (nullable = true)



## Explore the Data
The `memory` sink creates an in-memory SQL table (like a temporary table) that we can query using Spark SQL.
The result of the query is a static `DataFrame` that contains a snapshot of the data.

In [ ]:
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")    // this query name will be the SQL table name
  .outputMode("append")
  .format("memory")
  .start()

visualizationQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@20f95bc4


In [ ]:
val sampleDataset = sparkSession.sql("select * from visualization")

sampleDataset: org.apache.spark.sql.DataFrame = [id: string, ts: bigint ... 2 more fields]


In [ ]:
// This is a static Dataset!
sampleDataset.isStreaming

res27: Boolean = false


### Our dataset is backed by the streaming data: it updates each time we execute an action, delivering the latest data.

In [ ]:
sampleDataset.count

res29: Long = 232


In [ ]:
sampleDataset.count

res33: Long = 613


## Visualize the Data
We will build a custom live update by querying the in-memory table at regular intervals for the latest data.

In [ ]:
val zero = Seq((System.currentTimeMillis, 0.1))

val chart = CustomPlotlyChart(zero,
                  layout=s"{title: 'sensor data sample'}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2' }")
chart

zero: Seq[(Long, Double)] = List((1527071737184,0.1))
chart: notebook.front.widgets.charts.CustomPlotlyChart[Seq[(Long, Double)]] = <CustomPlotlyChart widget>
res35: notebook.front.widgets.charts.CustomPlotlyChart[Seq[(Long, Double)]] = <CustomPlotlyChart widget>


## Async update of our visualization
We will use a plain old Thread to run a recurrent query on our in-memory table and update the chart accordingly.


In [ ]:
@volatile var running = true

running: Boolean = true


In [ ]:
import scala.concurrent.duration._
import scala.annotation.tailrec

val updater = new Thread() {
  @tailrec
  def visualize(): Unit = {
    val lastMinute = System.currentTimeMillis - 1.minute.toMillis
    val data = sampleDataset.where($"ts" > lastMinute and $"id" === "office").as[SensorData]
                            .map{case SensorData(id, ts, temp, hum) => (ts/1000%3600, temp)}.collect
    chart.applyOn(data)
    if (running) {
      Thread.sleep(1.second.toMillis)
      visualize()
    } else ()
  } 
  
  override def run() {
    visualize()
  }
}.start()


import scala.concurrent.duration._
import scala.annotation.tailrec
updater: Unit = ()
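
Because `Thread.start()` returns `Unit`, `updater` above holds `()` rather than a reference to the thread, so the loop can only be stopped through the `running` flag and cannot be joined. The following is a minimal sketch of an alternative that keeps the thread reference. `Poller` and `refreshCount` are hypothetical names, and the counter stands in for the query and chart update.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Hypothetical helper, not part of the notebook: a loop that can be stopped and joined
final class Poller(intervalMillis: Long) {
  private val running = new AtomicBoolean(true)
  val refreshCount = new AtomicInteger(0)   // stand-in for "query the table, update the chart"

  private val thread = new Thread(new Runnable {
    override def run(): Unit =
      while (running.get) {
        refreshCount.incrementAndGet()
        Thread.sleep(intervalMillis)
      }
  })

  def start(): Unit = thread.start()        // the Thread reference stays in `thread`
  def stop(): Unit = {                      // signal the loop, then wait for it to exit
    running.set(false)
    thread.join()
  }
  def isAlive: Boolean = thread.isAlive
}

val poller = new Poller(intervalMillis = 10)
poller.start()
Thread.sleep(100)                           // let the loop run a few times
poller.stop()
println(s"refreshed ${poller.refreshCount.get} times, alive = ${poller.isAlive}")
```

Joining the thread in `stop()` ensures that no further update is in flight once the call returns.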


In [ ]:
visualizationQuery.stop()

In [ ]:
running = false

# Using Event Time
## Improving the data with sliding windows

In [ ]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


In [ ]:
val toSeconds = udf((ts:Long) => ts/1000)
val tempBySensorMovingAverage = sensorData.withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
                                          .withWatermark("timestamp", "30 seconds")
                                          .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
                                          .agg(avg($"temp"))

toSeconds: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(LongType)))
tempBySensorMovingAverage: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]
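
With a 30-second window that slides every 10 seconds, each reading contributes to three overlapping windows. The plain-Scala sketch below illustrates that assignment for a single timestamp. It assumes that window starts are aligned to multiples of the slide since the epoch. `windowsFor` is a hypothetical helper, not a Spark API.

```scala
// Window parameters taken from the query above, in seconds
val windowLengthSec = 30L
val slideSec = 10L

// Returns the [start, end) bounds of every sliding window that contains the event.
// Assumes window starts fall on multiples of the slide since the epoch.
def windowsFor(tsMillis: Long): Seq[(Long, Long)] = {
  val eventSec = tsMillis / 1000                        // same conversion as the toSeconds UDF
  val latestStart = eventSec - (eventSec % slideSec)    // most recent window start at or before the event
  (latestStart to (latestStart - windowLengthSec + slideSec) by -slideSec)
    .map(start => (start, start + windowLengthSec))
}

val exampleWindows = windowsFor(1527071737184L)
exampleWindows.foreach { case (start, end) => println(s"[$start, $end)") }
```

The average reported for each window therefore covers the last 30 seconds of readings and is refreshed every 10 seconds, which smooths the raw signal.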


In [ ]:
tempBySensorMovingAverage.printSchema

root
 |-- id: string (nullable = true)
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- avg(temp): double (nullable = true)



In [ ]:
val windowedSensorQuery = tempBySensorMovingAverage.writeStream
  .queryName("moving_average")    // this query name will be the table name
  .outputMode("append")  
  .format("memory")
  .start()

windowedSensorQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@7c4d5e31
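
In `append` mode, an aggregated window reaches the sink only after the watermark has moved past the end of that window, so results appear with some delay. The sketch below is a simplified model of that rule using the 30-second watermark configured above. It ignores that Spark only advances the watermark between micro-batches. `watermarkAt` and `isEmitted` are hypothetical helpers.

```scala
// Watermark delay from withWatermark("timestamp", "30 seconds"), in seconds
val watermarkDelaySec = 30L

// Simplified model: the watermark trails the largest event time seen so far
def watermarkAt(maxEventTimeSec: Long): Long = maxEventTimeSec - watermarkDelaySec

// In append mode a window is emitted once the watermark has reached its end
def isEmitted(windowEndSec: Long, maxEventTimeSec: Long): Boolean =
  windowEndSec <= watermarkAt(maxEventTimeSec)

// Window [100, 130): still open while the newest event is at t = 140 ...
println(isEmitted(windowEndSec = 130L, maxEventTimeSec = 140L))
// ... and emitted once an event with t = 170 has been seen
println(isEmitted(windowEndSec = 130L, maxEventTimeSec = 170L))
```

This delay is one reason the `moving_average` table can stay empty for a while after the query starts.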



In [ ]:
val movingAvgDF = sparkSession.sql("select * from moving_average")

movingAvgDF: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


In [ ]:
movingAvgDF

res48: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


In [ ]:
import org.apache.spark.sql.functions._
val lastMinute: Long = System.currentTimeMillis/1000 - 5.minute.toSeconds
val mAvgSample = movingAvgDF.select($"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
                   .where($"timestamp" > lastMinute and $"id" === "office")
                   .orderBy($"timestamp")
                   .as[(Long, Double)]
                   .collect().map{case (ts, v) => (ts  % 3600,v)}


CustomPlotlyChart(mAvgSample,
                  layout=s"{title: 'moving average sensor data'}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2'}")

import org.apache.spark.sql.functions._
lastMinute: Long = 1527071888
mAvgSample: Array[(Long, Double)] = Array((2370,8.629999999999999), (2380,8.81060606060606), (2390,9.027169811320753), (2400,9.251833333333332), (2410,9.523500000000002), (2420,9.758166666666664), (2430,10.04466666666667), (2440,9.993166666666667), (2450,9.71133333333333), (2460,9.426333333333336), (2470,9.329833333333335), (2480,9.61683333333333), (2490,9.872333333333332), (2500,10.294666666666666), (2510,10.378000000000002), (2520,10.229333333333331))
res52: notebook.front.widgets.charts.CustomPlotlyChart[Array[(Long, Double)]] = <CustomPlotlyChart widget>


In [ ]:
// stop the ancillary visualization query
windowedSensorQuery.stop()

## Write our moving average data to our `sensor-processed` topic

In [ ]:
val kafkaFormat = tempBySensorMovingAverage.select(
  $"id", $"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
                                                  .select($"id" as "key", to_json(struct($"id", $"timestamp", $"temp")) as "value")

kafkaFormat: org.apache.spark.sql.DataFrame = [key: string, value: string]


In [ ]:
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")    // identifies this query in logs and monitoring; the kafka sink creates no table
  .outputMode("append") 
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint2")
  .option("failOnDataLoss", "false")
  .start()

kafkaWriterQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@4c91d31


In [ ]:
kafkaWriterQuery.recentProgress

res64: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array()
