# Improve Data Quality
Our raw data, which comes directly from the sensors, contains a lot of noise.
We can reduce that noise by computing a moving average over the data stream.

In this notebook, we explore the event-time capabilities of Structured Streaming.
We will see:
- How to define a field as a `timestamp`
- The use of watermarks to control old and out-of-order data
- How to use SQL operators with time-based aggregations 
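
To make the noise-reduction idea concrete before we bring in streaming, here is a minimal plain-Scala sketch of a moving average over a small in-memory sequence. The helper `movingAverage` and the sample readings are hypothetical and are not part of the notebook; the streaming version later in this notebook groups by time windows rather than by a fixed number of elements.

```scala
// Hypothetical helper (not part of the notebook): a count-based moving average.
def movingAverage(values: Seq[Double], windowSize: Int): Seq[Double] = {
  require(windowSize > 0, "windowSize must be positive")
  // With fewer values than the window size there is no complete window to average.
  if (values.size < windowSize) Vector.empty
  else values.sliding(windowSize).map(w => w.sum / w.size).toVector
}

// Made-up readings: a steady signal at 20.0 with a single noisy spike.
val noisy = Seq(20.0, 20.0, 26.0, 20.0, 20.0)
val smoothed = movingAverage(noisy, 3)
```

In this made-up sample, the spike of 26.0 is spread across three overlapping windows, so every smoothed value stays much closer to the underlying signal.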

## Common Definitions
We define a set of parameters for our current environment.

In [ ]:
val sourceTopic = "sensor-raw"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = "172.17.0.2:9092" // local
// val kafkaBootstrapServer = "10.2.2.191:1025" // fast-data-ec2


sourceTopic: String = sensor-raw
targetTopic: String = sensor-processed
kafkaBootstrapServer: String = 172.17.0.2:9092


In [ ]:
// Clean up the checkpoint directory before we start
:sh rm -rf /tmp/spark/checkpoint


import sys.process._




## Read the data stream from Kafka and Extract + Transform the Payload
We use the Kafka source to subscribe to the `sourceTopic` that contains the raw sensor data.
This results in a streaming dataframe that we use to operate on the underlying data.

_Tip: We already saw how to do this in:_ [Processing Sensor Data from Kafka with Structured Streaming](./raw_sensor_stream_Structured_Streaming.snb.ipynb)

In [ ]:
val rawData = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBootstrapServer)
      .option("subscribe", sourceTopic)
      .option("startingOffsets", "latest")
      .load()

rawData: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]


In [ ]:
import org.apache.spark.sql.Encoders
// Schema definition as case class
case class SensorData(id: String, ts: Long, value: Double)
// Schema definition as a Spark SQL struct
val schema = Encoders.product[SensorData].schema

import org.apache.spark.sql.Encoders
defined class SensorData
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,StringType,true), StructField(ts,LongType,false), StructField(value,DoubleType,false))


In [ ]:
// Payload extraction from the `value` field as a `String`
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
// Parse the `String` data as a JSON object
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
// Create a strongly-typed Dataset using the `case class` definition
val sensorData = jsonValues.select("record.*").as[SensorData]

rawValues: org.apache.spark.sql.Dataset[String] = [value: string]
jsonValues: org.apache.spark.sql.DataFrame = [record: struct<id: string, ts: bigint ... 1 more field>]
sensorData: org.apache.spark.sql.Dataset[SensorData] = [id: string, ts: bigint ... 1 more field]


## Improve the Data with Sliding Windows

In [ ]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


In [ ]:
val time = System.currentTimeMillis
val t = new java.sql.Timestamp(time)

time: Long = 1544050939204
t: java.sql.Timestamp = 2018-12-06 00:02:19.204


In [ ]:
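// Convert the epoch-millisecond `ts` field into a SQL timestamp that can serve as event time
val toTimestamp = udf((ts: Long) => new java.sql.Timestamp(ts))
val sensorMovingAverage = sensorData
  .withColumn("timestamp", toTimestamp($"ts"))
  // Tolerate events that arrive up to 10 seconds late
  .withWatermark("timestamp", "10 seconds")
  // 30-second windows that slide every 10 seconds, computed per sensor id
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"value") as "avg_value")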

toTimestamp: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,TimestampType,Some(List(LongType)))
sensorMovingAverage: org.apache.spark.sql.DataFrame = [id: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]
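
The `withWatermark("timestamp", "10 seconds")` call above can be pictured with a small plain-Scala model. This is a simplified sketch, not Spark's implementation: `WatermarkTracker` and the sample event times are hypothetical, and Spark itself updates the watermark between micro-batches and, for windowed aggregations, may decide lateness per window rather than per event.

```scala
// Hypothetical model of watermark tracking (an illustration, not Spark's implementation).
final case class WatermarkTracker(delayMs: Long, maxEventTimeMs: Option[Long] = None) {

  // Fold a batch of event times into the tracker, keeping the largest one seen so far.
  def observe(eventTimesMs: Seq[Long]): WatermarkTracker =
    if (eventTimesMs.isEmpty) this
    else {
      val batchMax = eventTimesMs.max
      val newMax = maxEventTimeMs.map(m => math.max(m, batchMax)).getOrElse(batchMax)
      copy(maxEventTimeMs = Some(newMax))
    }

  // The watermark trails the largest observed event time by the configured delay.
  def watermarkMs: Option[Long] = maxEventTimeMs.map(_ - delayMs)

  // In this model, an event is too late when it is older than the current watermark.
  def isTooLate(eventTimeMs: Long): Boolean = watermarkMs.exists(eventTimeMs < _)
}

// Made-up event times in epoch milliseconds, arriving out of order.
val tracker = WatermarkTracker(delayMs = 10000L).observe(Seq(100000L, 105000L, 103000L))
val slightlyLate = tracker.isTooLate(96000L) // within the 10-second tolerance
val veryLate = tracker.isTooLate(90000L)     // older than the watermark
```

The trade-off behind the delay value is that a longer delay accepts more out-of-order data but keeps window state in memory for longer.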


In [ ]:
sensorMovingAverage.printSchema

root
 |-- id: string (nullable = true)
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- avg_value: double (nullable = true)
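
The `window` struct in the schema above holds the `start` and `end` of each sliding window. Because the window length (30 seconds) is three times the slide (10 seconds), each event falls into three overlapping windows. The sketch below illustrates that assignment in plain Scala, assuming windows are aligned to the epoch with no start offset and are closed at the start and open at the end; `windowsFor` is a hypothetical helper, not a Spark API.

```scala
// Hypothetical helper: lists the [start, end) windows that contain a given event time.
// Assumes epoch-aligned windows with no start offset.
def windowsFor(
    eventTimeMs: Long,
    windowMs: Long = 30000L,
    slideMs: Long = 10000L): Seq[(Long, Long)] = {
  // Start of the most recent window that begins at or before the event time.
  val lastStart = eventTimeMs - java.lang.Math.floorMod(eventTimeMs, slideMs)
  Iterator
    .iterate(lastStart)(_ - slideMs)                   // walk back one slide at a time
    .takeWhile(start => start + windowMs > eventTimeMs) // keep windows that still cover the event
    .map(start => (start, start + windowMs))
    .toList
    .reverse                                           // oldest window first
}

// An event 25 seconds after the epoch belongs to three overlapping windows.
val windows = windowsFor(25000L)
```

Each of those windows contributes one row per sensor `id` to the aggregation, which is why the averaged output changes smoothly from one window to the next.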



## Write the Moving Average Data to the `sensor-processed` Topic
To write data to Kafka, we need to transform our data into a `(key, value)` pair, where the `key` is used for partitioning in Kafka and the `value` contains our payload.

We will use JSON encoding for our data structure.
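
As a rough picture of the target shape, the plain-Scala sketch below maps one averaged reading to a `(key, value)` pair with a JSON payload. The `AveragedReading` class, the `toKafkaRecord` helper, and the sample values are hypothetical; the hand-built JSON assumes the `id` needs no escaping, and in the notebook itself Spark's `to_json` does this work.

```scala
// Hypothetical record type mirroring the fields we send to Kafka.
final case class AveragedReading(id: String, timestamp: Long, value: Double)

// Hypothetical helper: the sensor id becomes the Kafka key, the JSON document the Kafka value.
// Assumption: `id` contains no characters that need JSON escaping.
def toKafkaRecord(r: AveragedReading): (String, String) = {
  val json = s"""{"id":"${r.id}","timestamp":${r.timestamp},"value":${r.value}}"""
  (r.id, json)
}

// Made-up sample reading.
val (key, value) = toKafkaRecord(AveragedReading("sensor-1", 1544050930L, 21.5))
```

Keying by sensor `id` means that all records for one sensor land in the same Kafka partition, which preserves their relative order for downstream consumers.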

In [ ]:
// First, we shape the data to comply with the (key, value) model of Kafka.
// Don't confuse the two `value` fields: one comes from our data, the other is the Kafka payload.
// Note: casting a timestamp to LongType yields epoch seconds, whereas the input `ts` is in milliseconds.
val kafkaFormat = sensorMovingAverage
  .select($"id", $"window.start".cast(LongType) as "timestamp", $"avg_value" as "value")
  .select($"id" as "key", to_json(struct($"id", $"timestamp", $"value")) as "value")

kafkaFormat: org.apache.spark.sql.DataFrame = [key: string, value: string]


In [ ]:
// We write to the `targetTopic`
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter") 
  .outputMode("append") 
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .option("failOnDataLoss", "false")
  .start()

kafkaWriterQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1f06a70d


## View Progress

In [ ]:
val progress = kafkaWriterQuery.recentProgress

progress: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] =
Array({
  "id" : "19a6d168-d0ba-458e-a12d-cf3dedd8dffd",
  "runId" : "b036974a-f023-4ab2-b904-a6e97bb720ec",
  "name" : "kafkaWriter",
  "timestamp" : "2018-12-05T23:43:00.816Z",
  "batchId" : 1436,
  "numInputRows" : 51,
  "inputRowsPerSecond" : 34.78854024556617,
  "processedRowsPerSecond" : 41.9063270336894,
  "durationMs" : {
    "addBatch" : 1167,
    "getBatch" : 2,
    "getOffset" : 1,
    "queryPlanning" : 30,
    "triggerExecution" : 1217,
    "walCommit" : 16
  },
  "eventTime" : {
    "avg" : "2018-12-05T23:43:00.080Z",
    "max" : "2018-12-05T23:43:00.800Z",
    "min" : "2018-12-05T23:42:59.360Z",
    "watermark" : "2018-12-05T23:42:49.330Z"
  },
  "stateOperators" : [ {
    "numRowsTotal" : 3087,
    "...