# Reading Sensor Data from Kafka with Structured Streaming

This example explores the main aspects of the Structured Streaming API.

We will: 
 - use the Kafka `source` to consume events from the `sensor-raw` topic in Kafka
 - implement the application logic using the Dataset API
 - use the `memory` sink to visualize the data
 - use the `kafka` sink to publish our results to a different topic and make them available downstream
 - have some fun!  

## Common Definitions
We define a set of parameters used throughout the notebook.

In [ ]:
val sourceTopic = "sensor-raw"
val targetTopic = "sensor-processed"
//val kafkaBootstrapServer = "172.17.0.2:9092" // local
val kafkaBootstrapServer = "10.2.2.191:1025" // fast-data-ec2

## Read a stream from Kafka
We use the Kafka source to subscribe to the `sourceTopic` that contains the raw sensor data.
This results in a streaming `DataFrame` that we use to operate on the underlying data.

In [ ]:
val rawData = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBootstrapServer)
      .option("subscribe", sourceTopic)
      .option("startingOffsets", "latest")
      .load()

In [ ]:
rawData.isStreaming

In [ ]:
rawData.printSchema()

## Declare the schema of the data in the stream 
We need to declare the schema of the data in the stream in order to parse it.

We use a case class to define the schema. It's much more convenient than using the SQL types directly.

In [ ]:
case class SensorData(id: String, ts: Long, temp: Double, hum: Double)

In [ ]:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema
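
For comparison, here is a sketch of what the same schema could look like when written with the SQL types directly. The name `explicitSchema` is hypothetical, and the nullability flags are an assumption based on how Spark usually encodes case classes (reference types nullable, primitives not).

```scala
import org.apache.spark.sql.types._

// Hand-written counterpart of the schema derived from the SensorData case class.
// Assumption: String fields are nullable, primitive Long/Double fields are not.
val explicitSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("ts", LongType, nullable = false),
  StructField("temp", DoubleType, nullable = false),
  StructField("hum", DoubleType, nullable = false)
))

explicitSchema.printTreeString()
```

If that assumption holds, comparing `explicitSchema` with `schema` in the notebook should show the same fields and types, at the cost of noticeably more boilerplate.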

## Parse the Data
The actual payload is contained in the `value` field that we get from the Kafka topic (see the schema printed above).
We first convert that binary `value` field to a string and then use the JSON support in Spark to transform the incoming data into a typed streaming `Dataset`.


In [ ]:
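import org.apache.spark.sql.functions._   // from_json; assumes the notebook already provides the Spark implicits
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]          // binary payload to string
val jsonValues = rawValues.select(from_json($"value", schema) as "record")      // JSON string to struct
val sensorData = jsonValues.select("record.*").as[SensorData]                   // flatten and type the result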
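
The same parsing steps can be tried on static data before attaching them to the stream. The sketch below is a batch version under stated assumptions: `SensorSample` is a hypothetical mirror of `SensorData` so the cell stands alone, the JSON payload is an invented sample with the field names implied by the case class, and `SparkSession.builder...getOrCreate()` is expected to return the session already running in the notebook.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

// Hypothetical mirror of SensorData, redefined so the sketch stands alone.
case class SensorSample(id: String, ts: Long, temp: Double, hum: Double)

// In a notebook this should return the session that is already running.
val spark = SparkSession.builder.master("local[*]").appName("parse-check").getOrCreate()

// Assumed payload shape: one JSON object per Kafka message.
val sampleJson = """{"id":"dth-001","ts":1500000000000,"temp":21.5,"hum":40.0}"""

// A Dataset[String] exposes its content in a single column named "value",
// just like the string-cast Kafka payload.
val sampleValues = spark.createDataset(Seq(sampleJson))(Encoders.STRING)

val sampleSchema = Encoders.product[SensorSample].schema
val parsedSample = sampleValues
  .select(from_json(col("value"), sampleSchema) as "record") // JSON string to struct
  .select("record.*")                                        // flatten the struct
  .as[SensorSample](Encoders.product[SensorSample])          // typed view

parsedSample.show()
```

Because the batch and streaming APIs share the same operations, a transformation that behaves as expected here can be applied unchanged to the streaming `rawValues`.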

In [ ]:
sensorData.printSchema()

## Explore the data stream
To view the streaming data, we will use the `memory` sink and query the resulting table to get samples of the data.

In [ ]:
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")    // this query name will be the SQL table name
  .outputMode("append")
  .format("memory")
  .start()

## Explore the Execution
We can use the `StreamingQueryProgress` objects returned by `recentProgress` to examine the performance characteristics of the ongoing query.

In [ ]:
val progress = visualizationQuery.recentProgress

In [ ]:
progress.map(entry  => (entry.inputRowsPerSecond, entry.processedRowsPerSecond))

## Explore the Data
The `memory` sink creates an in-memory SQL table (like a `tempTable`) that we can query using Spark SQL.
The result of the query is a static `DataFrame` that contains a snapshot of the data.

In [ ]:
val sampleDataset = sparkSession.sql("select * from visualization")

In [ ]:
// This is a static Dataset!
sampleDataset.isStreaming

### Our dataset is backed by the streaming data: it updates each time we execute an action, delivering the latest data.

In [ ]:
sampleDataset.count

## Visualize the Data
We will build a custom live chart and refresh it by periodically querying the in-memory table for the latest data.

In [ ]:
val dummy = Seq((System.currentTimeMillis, 0.1))

val chart = CustomPlotlyChart(dummy,
                  layout=s"{title: 'sensor data sample'}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2' }")
chart

## Async update of our visualization
We will use a plain old Thread to run a recurrent query on our in-memory table and update the chart accordingly.


In [ ]:
@volatile var running = true

In [ ]:
import scala.concurrent.duration._
import scala.annotation.tailrec

val updater = new Thread() {
  // final, so that @tailrec can be sure the method is never overridden
  @tailrec
  final def visualize(): Unit = {
    val lastMinute = System.currentTimeMillis - 1.minute.toMillis
    // last minute of readings for one sensor; the x value is the second within the hour
    val data = sampleDataset.where($"ts" > lastMinute and $"id" === "dth-001").as[SensorData]
                            .map{case SensorData(id, ts, temp, hum) => (ts/1000%3600, temp)}.collect
    chart.applyOn(data)
    if (running) {
      Thread.sleep(1.second.toMillis)
      visualize()
    }
  }

  override def run(): Unit = visualize()
}
updater.start()  // start() returns Unit, so the Thread reference is kept separately
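
The x-axis value in the updater is computed as `ts/1000%3600`, that is, the second within the current hour. A small sketch of that arithmetic follows; `secondOfHour` is a hypothetical helper that only mirrors the expression, and the two timestamps are chosen to show the wrap-around at the top of the hour.

```scala
// Hypothetical helper mirroring the expression `ts/1000%3600` used for the x-axis.
def secondOfHour(tsMillis: Long): Long = tsMillis / 1000 % 3600

val justBeforeTheHour = 3599999L // 00:59:59.999 UTC on 1970-01-01
val onTheHour         = 3600000L // 01:00:00.000 UTC on 1970-01-01

println(secondOfHour(justBeforeTheHour)) // 3599
println(secondOfHour(onTheHour))         // 0
```

Since the value resets to 0 every hour, a chart window that spans the top of the hour will show the x values jumping back to the start.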


In [ ]:
visualizationQuery.stop()

In [ ]:
running = false

## Improve the data with sliding windows

In [ ]:
import org.apache.spark.sql.types._

In [ ]:
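// ts holds epoch milliseconds; casting a Long to TimestampType interprets it as seconds
val toSeconds = udf((ts:Long) => ts/1000)
val tempBySensorMovingAverage = sensorData.withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
                                          // accept events that arrive up to 30 seconds late
                                          .withWatermark("timestamp", "30 seconds")
                                          // per sensor: 30-second windows that slide every 10 seconds
                                          .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
                                          .agg(avg($"temp"))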
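
With a 30-second window sliding every 10 seconds, each reading contributes to three overlapping windows. The plain Scala sketch below illustrates that assignment. `windowsFor` is a hypothetical helper, not a Spark API, and it assumes non-negative timestamps in seconds and window starts aligned to the epoch, which is how Spark aligns windows when no start offset is given.

```scala
// Hypothetical helper: all [start, end) windows, in seconds, that contain `ts`.
// Assumes ts >= 0 and window starts aligned to multiples of `slide`.
def windowsFor(ts: Long, size: Long = 30L, slide: Long = 10L): Seq[(Long, Long)] = {
  val lastStart = ts - (ts % slide)        // latest window start at or before ts
  Iterator.iterate(lastStart)(_ - slide)   // walk back one slide at a time
    .takeWhile(start => start > ts - size) // keep windows whose end is still after ts
    .map(start => (start, start + size))
    .toList
    .reverse
}

println(windowsFor(25L)) // List((0,30), (10,40), (20,50))
println(windowsFor(30L)) // List((10,40), (20,50), (30,60))
```

This is why `tempBySensorMovingAverage` produces one row per sensor and window, with consecutive rows sharing two thirds of their input.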

In [ ]:
tempBySensorMovingAverage.printSchema

In [ ]:
val windowedSensorQuery = tempBySensorMovingAverage.writeStream
  .queryName("movingAverage4")    // this query name will be the table name
  .outputMode("append")  
  .format("memory")
  .start()

In [ ]:
val movingAvgDF = sparkSession.sql("select * from movingAverage4")

In [ ]:
movingAvgDF

In [ ]:
import org.apache.spark.sql.functions._
// epoch seconds, five minutes ago
val fiveMinutesAgo: Long = System.currentTimeMillis/1000 - 5.minutes.toSeconds
val mAvgSample = movingAvgDF.where($"id" === "dth-001")   // filter on id before it is projected away
                   .select($"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
                   .where($"timestamp" > fiveMinutesAgo)
                   .orderBy($"timestamp")
                   .as[(Long, Double)]
                   .collect().map{case (ts, v) => (ts % 3600, v)}


CustomPlotlyChart(mAvgSample,
                  layout=s"{title: 'moving average sensor data'}",
                  dataOptions="""{type: 'line'}""",
                  dataSources="{x: '_1', y: '_2'}")

In [ ]:
// stop the ancillary visualization query
windowedSensorQuery.stop()

## Write our moving average data to our `sensor-processed` topic

In [ ]:
val kafkaFormat = tempBySensorMovingAverage.select($"id", $"window.start".cast(LongType) as "timestamp", $"avg(temp)" as "temp")
                                                  .select($"id" as "key", to_json(struct($"id", $"timestamp", $"temp")) as "value")

In [ ]:
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")    // identifies the query; the Kafka sink does not create a table
  .outputMode("append") 
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint2")
  .option("failOnDataLoss", "false")
  .start()
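
One way to check that results are reaching the target topic is to read it back as a batch query. This is a minimal sketch, assuming a Spark version in which the Kafka source supports batch reads (2.2 or later); `readTopicSnapshot` is a hypothetical helper, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: a static snapshot of everything currently in `topic`,
// with key and value decoded as strings.
def readTopicSnapshot(spark: SparkSession, bootstrapServers: String, topic: String): DataFrame =
  spark.read                        // read, not readStream: a one-off batch query
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```

In this notebook it could be called as `readTopicSnapshot(sparkSession, kafkaBootstrapServer, targetTopic).show(5, false)`. With `append` output mode and a watermark, rows should only show up once their window has been closed by the watermark, so an empty result shortly after starting the query is expected.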

In [ ]:
kafkaWriterQuery.recentProgress