
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Lab 9: Drift Monitoring Algorithms

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
 - Create two dummy datasets, one for gradual drift and one for sudden drift
 - Use <a href="https://scikit-multiflow.github.io/scikit-multiflow/skmultiflow.drift_detection.html#module-skmultiflow.drift_detection" target="_blank">the package `skmultiflow`</a> for comparing the DDM and EDDM algorithms

In [3]:
%run "./../Includes/Classroom-Setup"

## Creating the Data

The EDDM algorithm aims to improve detection of gradual drift while retaining DDM's strong performance on abrupt drift.  In this lab, run both algorithms on the same data and compare their results.
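As background for why EDDM targets gradual drift: it monitors the distance between consecutive errors rather than the overall error rate. The sketch below is only an illustration of that signal, not EDDM itself. It uses `numpy` and hypothetical variable names, and treats a value of 1 as an error.

```python
import numpy as np

# Seeded generator so the illustration is reproducible
rng = np.random.default_rng(0)

# Error probability rises linearly from 0 to just under 1 (gradual drift)
p_error = np.arange(1000) / 1000.0
errors = rng.binomial(1, p_error)

# Positions of the errors and the gaps between consecutive errors
error_positions = np.flatnonzero(errors)
gaps = np.diff(error_positions)

# Compare the average gap in the first and second half of the observed errors
half = len(gaps) // 2
early_gap = gaps[:half].mean()
late_gap = gaps[half:].mean()

print("Mean gap between errors, early: {:.2f}".format(early_gap))
print("Mean gap between errors, late:  {:.2f}".format(late_gap))
```

As the error rate climbs, the gaps between errors shrink, which is the kind of change a distance-based detector can react to.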

Create two datasets, one with abrupt drift and one with gradual drift.

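Get a sense for how `numpy` creates a random sample from the binomial distribution.  It takes three parameters: number of trials, probability of success, and size of the output.  With one trial, each output value is 0 or 1, and `p`, the second parameter, sets how often a 1 appears.  Changing `p` over the course of a dataset is how the drift is simulated, so how quickly `p` changes determines how abrupt the drift is.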

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.binomial.html#numpy.random.binomial" target="_blank">See the docs here.</a>

In [6]:
import numpy as np

data_points = 10
np.random.binomial(1, 0, data_points)

In [7]:
np.random.binomial(1, 1, data_points)

In [8]:
np.random.binomial(1, .5, data_points)
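The cells above return different values on every run. As a small sketch, assuming the newer `numpy.random.default_rng` generator is available in your `numpy` version, a seeded generator makes the same experiment reproducible. The variable names here are illustrative only.

```python
import numpy as np

# A seeded generator gives the same draws on every run
rng = np.random.default_rng(seed=42)
data_points = 10

# p = 0 never produces a 1; p = 1 always produces a 1
always_zero = rng.binomial(1, 0.0, data_points)
always_one = rng.binomial(1, 1.0, data_points)

# With p = 0.5 the fraction of 1s approaches 0.5 as the sample grows
mixed = rng.binomial(1, 0.5, 10000)

print(always_zero)
print(always_one)
print("Fraction of 1s with p = 0.5: {:.3f}".format(mixed.mean()))
```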

Now create a dataset with gradual drift.

In [10]:
gradual_drift = []

for i in range(1000):
  gradual_drift.append(np.random.binomial(1, i/1000., 1)[0])
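The loop above draws one value at a time. Because `binomial` also accepts an array of probabilities, the same kind of dataset can be drawn in a single call. This is an optional sketch using a seeded generator and hypothetical variable names, not a replacement for the lab code.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_points = 1000

# One probability per data point, rising linearly from 0 to 0.999
p_error = np.arange(n_points) / n_points

# One Bernoulli draw per probability, returned as a plain Python list
gradual_drift_vectorized = rng.binomial(1, p_error).tolist()

first_half = sum(gradual_drift_vectorized[:500])
second_half = sum(gradual_drift_vectorized[500:])
print("1s in first half: {}, 1s in second half: {}".format(first_half, second_half))
```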

Also create a dataset with sudden drift.

In [12]:
sudden_drift = np.random.binomial(1, .2, 1000).tolist()

for i in range(499, 1000):
    sudden_drift[i] = 1

Visualize the two datasets.

In [14]:
import matplotlib.pyplot as plt

# One figure with two stacked axes, one per dataset
fig, (ax_sudden, ax_gradual) = plt.subplots(2, 1)

ax_sudden.scatter(range(len(sudden_drift)), sudden_drift, alpha=.1)
ax_sudden.set_title("Sudden Drift")

ax_gradual.scatter(range(len(gradual_drift)), gradual_drift, alpha=.1)
ax_gradual.set_title("Gradual Drift")

# Keep the lower title from overlapping the upper axes
fig.tight_layout()

display(fig)
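A scatter of raw 0/1 values can be hard to read, so a rolling mean is one way to make the two drift patterns visible as numbers. The sketch below regenerates both datasets with a seeded generator so that it is self-contained. The helper `rolling_mean` and the window size of 50 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Sudden drift: 20% ones, then all ones from index 499 onward
sudden = rng.binomial(1, 0.2, n)
sudden[499:] = 1

# Gradual drift: probability of a one rises linearly across the dataset
gradual = rng.binomial(1, np.arange(n) / n)

def rolling_mean(values, window=50):
    """Mean of each consecutive block of `window` values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

sudden_rate = rolling_mean(sudden)
gradual_rate = rolling_mean(gradual)

print("Sudden:  start {:.2f}, end {:.2f}".format(sudden_rate[0], sudden_rate[-1]))
print("Gradual: start {:.2f}, end {:.2f}".format(gradual_rate[0], gradual_rate[-1]))
```

The sudden dataset jumps from a low rate to 1.0 within one window around index 499, while the gradual dataset climbs steadily across the whole range.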

## Compare DDM and EDDM

Compare how the two algorithms detect gradual versus sudden drift.

In [16]:
from skmultiflow.drift_detection.ddm import DDM
from skmultiflow.drift_detection.eddm import EDDM

eddm = EDDM()
ddm = DDM()
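For intuition about what a DDM-style detector tracks, here is a simplified sketch of the published DDM rule: it follows the running error rate `p` and its standard deviation `s`, remembers the lowest `p + s` seen so far, and signals a warning or a drift when the current `p + s` exceeds that minimum by two or three standard deviations. This is not `skmultiflow`'s implementation. The class `SimpleDDM`, its parameter names, and the test stream are hypothetical and exist only for illustration.

```python
import math

class SimpleDDM:
    """Simplified DDM-style detector (illustrative; not skmultiflow's code)."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Consume one 0/1 error flag; return 'stable', 'warning', or 'drift'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n              # running error rate
        s = math.sqrt(p * (1 - p) / self.n)   # standard deviation of that rate
        if self.n < self.min_samples:
            return "stable"                   # too few samples to judge
        if p + s <= self.p_min + self.s_min:  # remember the best point so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"

# Hypothetical stream: a steady 20% error rate, then every sample is an error
stream = [1, 0, 0, 0, 0] * 40 + [1] * 100

detector = SimpleDDM()
first_warning = None
first_drift = None
for i, error in enumerate(stream):
    state = detector.update(error)
    if state == "warning" and first_warning is None:
        first_warning = i
    if state == "drift":
        first_drift = i
        break

print("First warning at index {}".format(first_warning))
print("First drift at index {}".format(first_drift))
```

The warning appears shortly after the error rate changes at index 200, and the drift signal follows a few samples later, which mirrors the warning zone and change signals that the `skmultiflow` detectors report.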

Start with the gradual drift dataset.  Both detectors read the stream as 0/1 values, where 1 marks a misclassification.  Add each data point to both `eddm` and `ddm`, then print the index whenever either detector reports a warning zone or a change.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="https://scikit-multiflow.github.io/scikit-multiflow/skmultiflow.drift_detection.html#module-skmultiflow.drift_detection" target="_blank">See the `skmultiflow` docs for details.</a>

In [18]:
# ANSWER
for i, g in enumerate(gradual_drift):
  ddm.add_element(g)
  eddm.add_element(g)
  if ddm.detected_warning_zone():
    print("Warning zone detected by DDM at index {} data {}".format(i, g))
  if ddm.detected_change():
    print("Change detected by DDM at index {}".format(i))

  if eddm.detected_warning_zone():
    print("Warning zone detected by EDDM at index {} data {}".format(i, g))
  if eddm.detected_change():
    print("Change detected by EDDM at index {}".format(i))

Now do the same for sudden drift, starting from fresh detectors so that state from the gradual run does not carry over.

In [20]:
# ANSWER
# Re-create the detectors so statistics from the gradual run are discarded
ddm = DDM()
eddm = EDDM()

for i, s in enumerate(sudden_drift):
  ddm.add_element(s)
  eddm.add_element(s)
  if ddm.detected_warning_zone():
    print("Warning zone detected by DDM at index {} data {}".format(i, s))
  if ddm.detected_change():
    print("Change detected by DDM at index {}".format(i))

  if eddm.detected_warning_zone():
    print("Warning zone detected by EDDM at index {} data {}".format(i, s))
  if eddm.detected_change():
    print("Change detected by EDDM at index {}".format(i))

Which performed better?

Try changing the parameters for the two drift detection algorithms and observe the results.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>