# Azure Machine Learning Responsible AI - Datadrift Detector

## Overview

Over time, models can become less effective at predicting accurately due to changing trends in feature data. This phenomenon is known as data drift, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.

In this notebook, you learn how to monitor for data drift between the training dataset and the inference data of a deployed model. Trained machine learning models may experience degraded prediction performance because of drift. With Azure Machine Learning, you can monitor data drift, and the service can send you an email alert when drift is detected.

In this lab, you'll configure data drift monitoring for datasets.

## What is Data Drift?
In the context of machine learning, data drift is a change in model input data that leads to model performance degradation. It is one of the top reasons why model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
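
As a rough illustration of what a change in input data looks like numerically, the sketch below measures how far the mean of one feature has moved, expressed in baseline standard deviations. This is not the metric that Azure Machine Learning computes; the function name and the sample values are made up for illustration.

```python
from statistics import mean, stdev


def standardized_mean_shift(baseline, target):
    """Shift of the target mean from the baseline mean, in baseline standard deviations."""
    return (mean(target) - mean(baseline)) / stdev(baseline)


# Made-up values for a single numeric feature (illustration only).
baseline_age = [29, 35, 41, 47, 52, 58, 63]
recent_age = [48, 55, 59, 64, 68, 71, 77]

shift = standardized_mean_shift(baseline_age, recent_age)
print(f"age shifted by {shift:.2f} baseline standard deviations")
```

A value near zero suggests the feature looks like the training data; a large absolute value suggests the inputs have moved away from what the model saw during training.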


## Prerequisites and Azure Machine Learning Basics

Before you begin, set up a working config file that contains information about your workspace, subscription ID, and so on, located at:

- './notebooks/notebook-settings/config.json' 
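
For reference, `Workspace.from_config` reads three keys from this file: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below shows the expected shape with placeholder values and checks that nothing is missing; the helper name `missing_config_keys` is hypothetical and not part of the SDK.

```python
import json

# Keys that Workspace.from_config expects to find in config.json.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def missing_config_keys(config_text):
    """Return the required workspace keys that are absent from a config.json document."""
    config = json.loads(config_text)
    return sorted(REQUIRED_KEYS - set(config))


# Placeholder values; replace them with the details of your own workspace.
example_config = """
{
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>"
}
"""

print(missing_config_keys(example_config))
```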

## Imports

In [None]:
import os
import sys
import requests
import azureml.core
import warnings
warnings.filterwarnings('ignore')

from scripts.data_drift import DataDrift
from azureml.datadrift import DataDriftDetector
from scripts.azureml_service import AzureMLService
from azureml.widgets import RunDetails
from azureml.core.compute import ComputeTarget
from azureml.core import Workspace, Run, Experiment, Datastore, Dataset

sys.path.append(os.path.abspath("../utils"))
from attach_compute import get_compute_aml
from workspace import get_workspace

### Initialize Workspace

Now you're ready to connect to your workspace using the Azure ML SDK.

Note: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [None]:
ws = Workspace.from_config("../notebooks-settings/config.json")
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

### Create a Datadrift Pipeline Compute

#### Retrieve or create an Azure Machine Learning compute
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the data drift monitor on this compute target.

If a compute with the given name does not already exist in the workspace, we will create it here. We will create an Azure Machine Learning Compute containing **STANDARD_DS3_V2 CPU VMs**. This process is broken down into the following steps:

1. Create the configuration
2. Create the Azure Machine Learning compute

**This process takes about 3 minutes and provides only sparse output along the way. Please wait until the call returns before moving to the next cell.**
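
The `get_compute_aml` helper used below lives in the `utils` folder and is not shown in this notebook. As a minimal sketch of the two steps above, assuming the Azure ML SDK v1 (`azureml-core`) is installed, a retrieve-or-create helper could look like the following. The function name `get_or_create_compute` and the `max_nodes` default are assumptions for illustration, and the actual helper may differ.

```python
def get_or_create_compute(ws, name, vm_size, max_nodes=2):
    """Return the compute target called `name`, creating it if it does not exist."""
    # Imported inside the function so this sketch can be defined without the SDK present.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Reuse the compute target if one with this name is already in the workspace.
        target = ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        # Step 1: create the configuration.
        config = AmlCompute.provisioning_configuration(vm_size=vm_size, max_nodes=max_nodes)
        # Step 2: create the Azure Machine Learning compute and wait for provisioning.
        target = ComputeTarget.create(ws, name, config)
        target.wait_for_completion(show_output=True)
    return target


# Usage (requires an authenticated workspace object `ws`):
# compute = get_or_create_compute(ws, "df-compute", "STANDARD_DS3_V2")
```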

#### List of Compute Targets on the workspace

In [None]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

In [None]:
aml_compute_datadrift = "df-compute"
vm_size = "STANDARD_DS3_V2"
get_compute_aml(ws, aml_compute_datadrift, vm_size)
print("Azure Machine Learning Compute attached")

## Configure DataDrift Detector

### Create a Baseline Dataset
To monitor a dataset for data drift, you must register a baseline dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future.

Using Azure Machine Learning, data drift will be monitored through our deployment. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A second dataset - usually model input data gathered from a deployment - is tested against the baseline dataset.

In the following cell, the data drift detector is activated, and a DataDriftDetector will run at the specified, scheduled frequency. If the datadrift_coefficient reaches the given drift_threshold (0.1 by default), an alert will be sent through Application Insights.

In [None]:
datadrift = DataDrift(
    workspace=ws,
    service_name="heart-disease-service",
    dataset_name="heart_disease_preprocessed_train",
    model_name="heart_disease_model_automl",
    compute_name=aml_compute_datadrift,
)
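
The alert rule described before this cell comes down to comparing each computed drift coefficient with the threshold. The sketch below illustrates only that comparison and is not the service's implementation. The function name and the coefficient values are made up, and the default of 0.1 simply mirrors the value mentioned above.

```python
def days_exceeding_threshold(daily_coefficients, drift_threshold=0.1):
    """Return the dates whose drift coefficient reaches the threshold."""
    return [
        day
        for day, coefficient in daily_coefficients.items()
        if coefficient >= drift_threshold
    ]


# Made-up coefficients, for illustration only.
daily = {"2020-01-27": 0.03, "2020-01-28": 0.08, "2020-01-29": 0.14}
print(days_exceeding_threshold(daily))
```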

Inside the DataDrift class, a data drift monitor is created for the heart-disease data. The data drift monitor will run periodically or on demand to compare the baseline dataset with the target dataset, to which new data will be added over time.

To run the data drift monitor, you'll need a compute target. In this lab, you'll use the compute cluster you created previously (if it doesn't exist, it will be created). The class already defines the data drift monitor for your data. In it, you can specify the features you want to monitor for data drift, the name of the compute target used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection. See **notebooks\monitoring\scripts\data_drift.py**.

In [None]:
datadrift.main()

To see the DataDrift Detector for the AutoML model:

In [None]:
DataDriftDetector.get(ws, "<model_name>", "<model_version>")

If we view the details of our model in the Models section of Azure Machine Learning, new information about the status of our data drift monitor service will appear.

<img src="images/drift_service.png" alt="Data drift monitor">


## Making predictions

In the next cell we test our deployed service with an inference dataset that contains some examples.

If our data drift service is enabled, both datasets (training and inference) are profiled and input to the data drift monitoring service. A machine learning model is trained to detect differences between the two datasets. The model's performance is converted to the drift coefficient, which measures the magnitude of drift between the two datasets.
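
To make the idea of "a model trained to tell the two datasets apart" concrete, here is a deliberately tiny sketch for a single numeric feature. It tries every cut point, treats "value above the cut" as a prediction of "target dataset", and scores the best rule with the Matthews correlation coefficient (MCC). The choice of MCC and the one-feature stump are assumptions made for illustration; the service's actual model and scoring are not described in this notebook.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when a marginal count is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def stump_drift_score(baseline, target):
    """Best absolute MCC of any single-cut rule separating baseline from target."""
    best = 0.0
    for cut in sorted(set(baseline) | set(target)):
        tp = sum(value > cut for value in target)    # target rows above the cut
        fn = len(target) - tp                        # target rows at or below the cut
        fp = sum(value > cut for value in baseline)  # baseline rows above the cut
        tn = len(baseline) - fp                      # baseline rows at or below the cut
        # abs() also covers drift toward lower values (the flipped rule).
        best = max(best, abs(mcc(tp, tn, fp, fn)))
    return best


same = [1, 2, 3, 4, 5]
shifted = [11, 12, 13, 14, 15]
print(stump_drift_score(same, same), stump_drift_score(same, shifted))
```

When the two samples are indistinguishable, no rule does better than chance and the score stays at zero; when they are fully separable, the score reaches one. A realistic detector would use all features and score the classifier on held-out data.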

To view results in your workspace in Azure Machine Learning studio, navigate to the model page. On the details tab of the model, the data drift configuration is shown. A Data drift tab is now available, visualizing the data drift metrics.

<img src="images/drift-ui.png" alt="Data drift graphs">

In [None]:
azml_service = AzureMLService(ws, 'heart-disease-service')
azml_service.make_request('heart_disease_preprocessed_inference')

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift, as well as a timestamp field that indicates the point in time when the new data was current. This enables you to measure data drift over temporal intervals. The timestamp can either be a field in the dataset itself or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day. Alternatively, you might encode the year, month, and day in the file name, like this: data_2020-01-29.csv. This is the approach taken in the **re-train-pipeline** notebook, which you can find at **notebooks\retrain\retrain_pipeline.ipynb**.
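
As a small illustration of deriving the timestamp from a file name, the sketch below extracts the date from names that follow the `data_YYYY-MM-DD.csv` pattern mentioned above. It uses only the Python standard library. The function name is hypothetical, and this is not how the retrain notebook or the Azure ML dataset API does it.

```python
import re
from datetime import datetime

# Matches file names such as data_2020-01-29.csv, with or without a leading folder path.
FILE_PATTERN = re.compile(r"data_(\d{4}-\d{2}-\d{2})\.csv$")


def timestamp_from_filename(path):
    """Return the date encoded in a 'data_YYYY-MM-DD.csv' file name, or None."""
    match = FILE_PATTERN.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


print(timestamp_from_filename("data_2020-01-29.csv"))
```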

## Visualize Data Drift Alerts

You can also visualize the data drift metrics in Azure Machine Learning studio by following these steps:

1. On the Datasets page, view the Dataset monitors tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click Refresh).
4. Examine the charts in the Drift overview section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the Feature detail section at the bottom, which enable you to see various measures of drift for individual features.

Note: For help understanding the data drift metrics, see "How to monitor datasets" in the Azure Machine Learning documentation.

## Explore Further

This notebook is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also configure data drift monitoring for services deployed in an Azure Kubernetes Service (AKS) cluster. For more information about this, see [Detect data drift on models deployed to Azure Kubernetes Service (AKS)](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-data-drift) in the Azure Machine Learning documentation.

## Delete DataDrift Detector

To delete your Data Drift Detector, execute the following cell:

In [None]:
monitor = DataDriftDetector.get(ws, "<model_name>", "<model_version>")
monitor.delete(wait_for_completion=True)