# Sensor Anomaly Detection

In this notebook we quickly explore some specific aspects of multi-level state management in Spark Streaming

By combining local and distributed state, we implement a simple sensor trend tracker that can help us identify and report anomalies.

## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.
This component is a participant in a streaming pipeline.

It expects to receive moving averages of sensor data in the form of (id, timestamp, value) 

In [ ]:
import org.apache.spark.streaming.Seconds
val topic = "sensor-processed"
val kafkaBootstrapServer = "172.17.0.2:9092"
val threshold = 4.0
val interval = Seconds(10)
val targetDir = "/tmp/anomaly/model"

import org.apache.spark.streaming.Seconds
topic: String = sensor-processed
kafkaBootstrapServer: String = 172.17.0.2:9092
threshold: Double = 4.0
interval: org.apache.spark.streaming.Duration = 10000 ms
targetDir: String = /tmp/anomaly/model


# Create a Streaming Standard Deviation Model
The classical formula for standard deviation : ![Standard Deviation Equation](https://wikimedia.org/api/rest_v1/media/math/render/svg/00eb0cde84f0a838a2de6db9f382866427aeb3bf) requires that all data is known beforehand.

To obtain the standard deviation of a streaming dataset, we need an algorithm that provides an approximation of the `stdev` value.

Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance  we've chosen to implement `M2`

In [ ]:
case class M2(n:Int, mean: Double, m2:Double) {
  def variance: Option[Double] = {
    if (n<2) None else Some(m2/(n-1))
  }
  def stdev: Option[Double] = variance.map(Math.sqrt)
  }
  object M2 extends Serializable {
    val Zero = M2(0,0.0,0.0)
  }

defined class M2
defined object M2


In [ ]:
// this needs to be outside of the class b/c of Spark Notebook serialization
var entries:Map[String, M2] = Map.empty

entries: Map[String,M2] = Map()


In [ ]:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
  
class M2Model() extends Serializable {
  
  def trainOn(dstream: DStream[(String, Double)]): Unit = {
    dstream.foreachRDD{rdd => 
                       val newEntriesRDD = rdd.map{case (id, x) => 
                                                val current = entries.get(id)
                                                val updated = current.map{case M2(n, mean, m2) => {
                                                  val np = n + 1
                                                  val delta = x - mean
                                                  val meanp = mean + delta/np
                                                  val mp2 = m2 + delta*(x - meanp)
                                                  (id, M2(np, meanp, mp2))
                                                  }
                                                 }.getOrElse(id -> M2.Zero)
                                                 updated
                                               }
                       val newEntries: Array[(String, M2)] = newEntriesRDD.collect
                       entries = entries ++ newEntries
                      }
  }
  def predictOnValues(dstream: DStream[(String, Double)]): DStream[(String, Double, Double, Double)] = {
    for { 
      (id, value) <- dstream
      m2 <- entries.get(id)
      stdev <- m2.stdev
    } yield (id, value, m2.mean, stdev)
  }
}

             (id, value) <- dstream
                            ^
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
defined class M2Model


## We create our Streaming Context

In [ ]:
import org.apache.spark.streaming.StreamingContext
@transient val streamingContext = new StreamingContext(sparkSession.sparkContext, interval)

import org.apache.spark.streaming.StreamingContext
streamingContext: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@6cd18bf1


## Our stream source will be a a Direct Kafka Stream


In [ ]:
import org.apache.kafka.clients.consumer.ConsumerRecord
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest"
)

val topics = Set(topic)
@transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
     streamingContext, kafkaParams, topics)

// kafka_010 APIs don't work on the Spark Notebook

// @transient val stream = KafkaUtils.createDirectStream[String, String](
//   streamingContext,
//   PreferConsistent,
//   Subscribe[String, String](topics, kafkaParams)
// )



import org.apache.kafka.clients.consumer.ConsumerRecord
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._
kafkaParams: scala.collection.immutable.Map[String,String] = Map(metadata.broker.list -> 172.17.0.2:9092, group.id -> sensor-tracker-group, auto.offset.reset -> largest)
topics: scala.collection.immutable.Set[String] = Set(sensor-processed)
stream: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@177fbef8


# Providing Schema information for our streaming data
Now that we have a DStream of fresh data processed in a 2-second interval, we can start focusing on the gist of this example.
First, we want to define and apply a schema to the data we are receiving.
In Scala, we can define a schema with a `case class`

In [ ]:
case class SensorData(id: String, timestamp: Long, temp: Double)

defined class SensorData


# Create our Model
We will train an online standard deviation algorithm and use it to score the incoming data.

In [ ]:
val model = new M2Model()

model: M2Model = M2Model@3807fd3e


# Convert the incoming JSON to `SensorData`
See how we interop with SparkSQL from Spark Streaming to use the JSON parsing facilities.

In [ ]:
val spark = sparkSession
import spark.implicits._
@transient val sensorDataStream = stream.transform{rdd => 
                                        val jsonData = rdd.map{case (k,v)  => v}
                                        val ds = sparkSession.createDataset(jsonData)
                                        val jsonDF = spark.read.json(ds)
                                        val sensorDataDS = jsonDF.as[SensorData]
                                        sensorDataDS.rdd
                                       }

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5f582770
import spark.implicits._
sensorDataStream: org.apache.spark.streaming.dstream.DStream[SensorData] = org.apache.spark.streaming.dstream.TransformedDStream@57927739


# Prepare the stream to train our model

In [ ]:
@transient val inputData = sensorDataStream.transform {sensorDataRDD =>  
                                                       sensorDataRDD.map{case SensorData(id,ts,value) => (id, value)}}                                                            

inputData: org.apache.spark.streaming.dstream.DStream[(String, Double)] = org.apache.spark.streaming.dstream.TransformedDStream@2d934c2c


## Use the data to train the model

In [ ]:
model.trainOn(inputData)

## Score the streaming data using the trained model

In [ ]:
@transient val scored = model.predictOnValues(inputData)

scored: org.apache.spark.streaming.dstream.DStream[(String, Double, Double, Double)] = org.apache.spark.streaming.dstream.FlatMappedDStream@33d992b


### Visualize the relation between the values and their standard deviation

In [ ]:
val scatterChart = new ScatterChart(Seq((0.0,0.0)))
scatterChart


scatterChart: notebook.front.widgets.charts.ScatterChart[Seq[(Double, Double)]] = <ScatterChart widget>
res15: notebook.front.widgets.charts.ScatterChart[Seq[(Double, Double)]] = <ScatterChart widget>


### Ouput Operations give us access to the data

In [ ]:
scored.foreachRDD{rdd =>
  val data = rdd.collect.map{case (id, value, mean, std) => (value, std)}
  scatterChart.applyOn(data)
}

In [ ]:
// Declare UI Widgets to see the data
val outputBox = ul(20)
outputBox.append("---")
val debugBox = ul(15)
debugBox.append("---")

outputBox: notebook.front.widgets.HtmlList = <HtmlList widget>
debugBox: notebook.front.widgets.HtmlList = <HtmlList widget>


## Detected Anomalies

In [ ]:

outputBox

res21: notebook.front.widgets.HtmlList = <HtmlList widget>


### Stream Content log (for mental sanity purposes)

In [ ]:
debugBox

res23: notebook.front.widgets.HtmlList = <HtmlList widget>


## Anomaly Detection using Streaming Standard Deviation Threshold
Values beyond the threshhold-times the standard deviation around the mean are considered irregular and deserve scrutiny  

In [ ]:
@transient val suspects = scored.filter{case (id, value, mean, std) => 
                                        (value > mean + std * threshold) || (value < mean - std * threshold)
                                       }

suspects: org.apache.spark.streaming.dstream.DStream[(String, Double, Double, Double)] = org.apache.spark.streaming.dstream.FilteredDStream@6a92de5e


## `foreachRDD` Stream Output
The output operation lets us materialize the results.

In this notebook, we are going to output the results to the UI widgets we declared earlier.

In [ ]:
suspects.foreachRDD{rdd => 
                      val top20 = rdd.take(20).map(_.toString)
                      val total = s"total anomalies found: ${rdd.count}"
                      outputBox(total +: top20)
                    }                  

In [ ]:
inputData.foreachRDD{rdd => 
                    val sample = rdd.take(20).map(_.toString)
                    debugBox.appendAll(sample)
                   } 

In [ ]:
case class IdM2(id:String, m2: M2)

defined class IdM2


In [ ]:
val spark = sparkSession
import spark.implicits._

inputData.window(Seconds(30)).foreachRDD{ (rdd,time) => 
                                         if (!rdd.isEmpty) {
                                           val modelDF = entries.map{case (id, m2) => IdM2(id, m2)}.toSeq.toDF
                                           modelDF.write.mode("overwrite").json(s"$targetDir/sensors-m2-${time.milliseconds}.json")
                                         }
                                        }

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5f582770
import spark.implicits._


## Starting the Context  initiates the stream processing

In [ ]:
streamingContext.start()

## `stop` destroys the streamingContext and stops the streaming computation  

In [ ]:
// Be careful not to stop the context if you want the streaming process to continue
streamingContext.stop(false)

### We can 'snoop' in the values of our model. 
The values are local to this process. Only computing them is done distributedly. 

In [ ]:
entries("office").stdev

res41: Option[Double] = Some(1.400086241511657)


In [ ]:
entries("office")

res43: M2 = M2(31,20.454768088208496,56.85374395997885)
