# Monitoring Data Drift

**Over time, models can become less effective at predicting accurately due to changing trends in feature data. This phenomenon is known as *data drift*, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.**

In this lab, you'll configure data drift monitoring for datasets.

In [5]:
#!pip install azureml-sdk

In [6]:
# !pip install azureml-widgets

## Before you start

In addition to the latest version of the **azureml-sdk** and **azureml-widgets** packages, you'll need the **azureml-datadrift** package to run the code in this notebook. Run the cell below to verify that it is installed.

In [4]:
# !pip install azureml-datadrift

## Connect to your workspace

With the required SDK packages installed, now you're ready to connect to your workspace.

> **Note**: If you haven't already established an authenticated session with your Azure subscription, you'll be prompted to authenticate by clicking a link, entering an authentication code, and signing into Azure.

In [2]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Ready to work with aml-DSTeam-RnD-001


## Create a *baseline* dataset

To monitor a dataset for data drift, you must register a *baseline* dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future. 

In [3]:
# !pip install azureml-dataset-runtime --upgrade

In [4]:
!/anaconda/envs/jupyter_env/bin/python -m pip install azureml-dataset-runtime --upgrade



In [5]:
from azureml.core import Datastore, Dataset


# Upload the baseline data
default_ds = ws.get_default_datastore()
print(default_ds)
default_ds.upload_files(files=['diabetes.csv', 'diabetes2.csv'],
                       target_path='diabetes-baseline',
                       overwrite=True, 
                       show_progress=True)

# Create and register the baseline dataset
print('Registering baseline dataset...')
baseline_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-baseline/*.csv'))
baseline_data_set = baseline_data_set.register(workspace=ws, 
                           name='diabetes baseline',
                           description='diabetes baseline data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline dataset registered!')

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-1d3c75d1-ab20-47c9-9c7e-0b9676d85f97",
  "account_name": "amldsteamrnd006960506712",
  "protocol": "https",
  "endpoint": "core.windows.net"
}
Uploading an estimated of 2 files
Uploading diabetes.csv
Uploaded diabetes.csv, 1 files out of an estimated total of 2
Uploading diabetes2.csv
Uploaded diabetes2.csv, 2 files out of an estimated total of 2
Uploaded 2 files
Registering baseline dataset...
Baseline dataset registered!


## Create a *target* dataset

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift **as well as a timestamp field that indicates the point in time when the new data was current. This enables you to measure data drift over temporal intervals**. The timestamp can either be a field in the dataset itself or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day; or you might encode the year, month, and day in the file name, like this: *data_2020-01-29.csv*. The following code takes the latter approach:

In [6]:
import datetime as dt
import pandas as pd

print('Generating simulated data...')

# Load the smaller of the two data files
data = pd.read_csv('diabetes2.csv')

# We'll generate data for the past 6 weeks
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:
    
    # Get the date X weeks ago
    data_date = dt.date.today() - dt.timedelta(weeks=weekno)
    
    # Modify data to create some drift
    data['Pregnancies'] = data['Pregnancies'] + 1
    data['Age'] = round(data['Age'] * 1.2).astype(int)
    data['BMI'] = data['BMI'] * 1.1
    
    # Save the file with the date encoded in the filename
    file_path = 'diabetes_{}.csv'.format(data_date.strftime("%Y-%m-%d"))
    data.to_csv(file_path)
    file_paths.append(file_path)

# Upload the files
path_on_datastore = 'diabetes-target'
default_ds.upload_files(files=file_paths,
                       target_path=path_on_datastore,
                       overwrite=True,
                       show_progress=True)

# Use the folder partition format to define a dataset with a 'date' timestamp column
partition_format = path_on_datastore + '/diabetes_{date:yyyy-MM-dd}.csv'
target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, path_on_datastore + '/*.csv'),
                                                       partition_format=partition_format)

# Register the target dataset
print('Registering target dataset...')
target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws,
                                                                          name='diabetes target',
                                                                          description='diabetes target data',
                                                                          tags = {'format':'CSV'},
                                                                          create_new_version=True)

print('Target dataset registered!')

Generating simulated data...
Uploading an estimated of 6 files
Uploading diabetes_2023-05-05.csv
Uploaded diabetes_2023-05-05.csv, 1 files out of an estimated total of 6
Uploading diabetes_2023-05-12.csv
Uploaded diabetes_2023-05-12.csv, 2 files out of an estimated total of 6
Uploading diabetes_2023-05-19.csv
Uploaded diabetes_2023-05-19.csv, 3 files out of an estimated total of 6
Uploading diabetes_2023-05-26.csv
Uploaded diabetes_2023-05-26.csv, 4 files out of an estimated total of 6
Uploading diabetes_2023-06-02.csv
Uploaded diabetes_2023-06-02.csv, 5 files out of an estimated total of 6
Uploading diabetes_2023-06-09.csv
Uploaded diabetes_2023-06-09.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Registering target dataset...
Target dataset registered!
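
The `partition_format` above tells the dataset to read the `date` column out of each file name. As a rough local illustration of the same idea, the sketch below parses the date from a path with the standard library. `date_from_filename` and `FILENAME_PATTERN` are hypothetical names used only for this example; they are not part of the Azure ML SDK, and this is not how the SDK implements partition parsing.

```python
import datetime as dt
import re

# Hypothetical pattern mirroring the file names generated above
FILENAME_PATTERN = re.compile(r"diabetes_(\d{4}-\d{2}-\d{2})\.csv$")


def date_from_filename(path):
    """Extract the date encoded in a 'diabetes_YYYY-MM-DD.csv' file name."""
    match = FILENAME_PATTERN.search(path)
    if match is None:
        raise ValueError(f"No date found in file name: {path}")
    return dt.datetime.strptime(match.group(1), "%Y-%m-%d").date()


print(date_from_filename("diabetes-target/diabetes_2023-05-05.csv"))
```

Running a check like this over `file_paths` before uploading is one way to catch file names that would not match the partition format.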


## Create a data drift monitor

Now you're ready to create a data drift monitor for the diabetes data. The data drift monitor will run periodically or on demand to compare the baseline dataset with the target dataset, to which new data will be added over time.

### Create a compute target

To run the data drift monitor, you'll need a compute target. Run the following cell to specify a compute cluster (if it doesn't exist, it will be created).

> **Important**: Change the value of `cluster_name` in the code below to the name of your compute cluster before running it! Cluster names must be globally unique and between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "agcluster"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


> **Note**: Compute instances and clusters are based on standard Azure virtual machine images. For this exercise, the *Standard_DS11_v2* image is recommended to achieve the optimal balance of cost and performance. If your subscription has a quota that does not include this image, choose an alternative image; but bear in mind that a larger image may incur higher cost and a smaller image may not be sufficient to complete the tasks. Alternatively, ask your Azure administrator to extend your quota.

### Define the data drift monitor

Now you're ready to use a **DataDriftDetector** class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [8]:
!/anaconda/envs/jupyter_env/bin/python -m pip install azureml-datadrift



In [10]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['Pregnancies', 'Age', 'BMI']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'mslearn-diabates-drifts', baseline_data_set, target_data_set,
                                                      compute_target=cluster_name, 
                                                      frequency='Week', 
                                                      feature_list=features, 
                                                      drift_threshold=.3, 
                                                      latency=24)
monitor

2023-06-09 09:58:24,363 - azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector - ERROR - DataDriftDetector already exists. Name is mslearn-diabates-drifts, please use get_by_name() to retrieve it. - activity_id:4282f58a-5999-4509-9bba-b805154b9dbe activity_name:constructor activity_type:InternalCall tenant_id:None subscription_id:c59b6c0a-0bc0-4b69-bd03-020b2171f742 resource_group:RG-AmlWS-DSTeam-RnD workspace_id:1d3c75d1-ab20-47c9-9c7e-0b9676d85f97 workspace_location:centralindia compute_type:None compute_size:None compute_nodes_min:None compute_nodes_max:None image_id:None dd_id:None dd_type:DatasetBased freq:Week interval:None scheduling:None threshold:None latency:24 total_features:3 services:None train_dataset_id:None baseline_dataset_id:9a00ccda-4861-4c1d-bc84-baedfc0d0544 target_dataset_id:f12e98e2-95df-4837-849b-2c9eb282eebd log_env:sdk client telemetry_event_id:5cb501a9-3fb9-42b8-ad9d-18398d703f68 sdk_version:1.49.0 telemetry_component_name:azureml.

KeyError: 'DataDriftDetector already exists. Name is mslearn-diabates-drifts, please use get_by_name() to retrieve it.'
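
The error above means a monitor with this name was already created in the workspace, and the message points to `get_by_name()` for retrieving it. A minimal sketch of a get-or-create pattern follows. `get_or_create` is a hypothetical helper, not part of the SDK, and it assumes (as the traceback above suggests) that creation raises `KeyError` when the name is taken. Stand-in functions replace the SDK calls so the pattern can be tried without a workspace.

```python
def get_or_create(create, get):
    """Return create(); if it raises KeyError (name already taken), return get()."""
    try:
        return create()
    except KeyError as err:
        print('Create failed, retrieving existing monitor:', err)
        return get()


# Stand-ins for the SDK calls, for illustration only
_existing = {'mslearn-diabates-drifts': 'existing-monitor'}


def fake_create():
    raise KeyError('DataDriftDetector already exists.')


def fake_get():
    return _existing['mslearn-diabates-drifts']


monitor_stub = get_or_create(fake_create, fake_get)
print(monitor_stub)

# In the notebook, the two callables might look like this (untested here):
# create=lambda: DataDriftDetector.create_from_datasets(
#     ws, 'mslearn-diabates-drifts', baseline_data_set, target_data_set,
#     compute_target=cluster_name, frequency='Week', feature_list=features,
#     drift_threshold=.3, latency=24)
# get=lambda: DataDriftDetector.get_by_name(ws, 'mslearn-diabates-drifts')
```

Catching `KeyError` this broadly can hide unrelated lookup errors raised inside `create`, so inspect the printed message before relying on the retrieved monitor.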

## Backfill the data drift monitor

You have a baseline dataset and a target dataset that includes simulated weekly data collection for six weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

> **Note**: This may take some time to run, as the compute target must be started to run the backfill analysis. The widget may not always update to show the status, so click the link to observe the experiment status in Azure Machine Learning studio!

In [19]:
from azureml.widgets import RunDetails

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'mslearn-diabates-drifts-Monitor-Runs_1683550879660',
 'target': 'agcluster',
 'status': 'Completed',
 'startTimeUtc': '2023-05-08T13:03:24.892737Z',
 'endTimeUtc': '2023-05-08T13:08:01.185431Z',
 'services': {},
   'message': 'target dataset id:94c7d440-4eee-4858-9fea-5d1a614cef97 do not contain sufficient amount of data after timestamp filteringMinimum needed: 50 rows.Skipping calculation for time slice 2023-03-26 00:00:00 to 2023-04-02 00:00:00.'}],
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '2fa9d6e2-7d19-40ce-984d-5d590ac9ca26',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '70b1e199-166e-48bd-87e2-260b504c37f3'}, 'consumptionDetails': {'type': 'Reference'}}, {'dataset': {'id': '94c7d440-4eee-4858-9fea-5d1a614cef97'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': '_generate_s
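
The run message above reports that a time slice was skipped because it held fewer than the 50 rows needed. The sketch below counts rows per weekly slice locally, so sparse slices can be spotted before a backfill. `sparse_slices` is a hypothetical helper, and the slicing (consecutive 7-day windows from a chosen start date) is an assumption; the service may align its slices differently.

```python
import datetime as dt
from collections import Counter

MIN_ROWS = 50  # minimum per time slice, taken from the run message above


def sparse_slices(timestamps, start, n_weeks, min_rows=MIN_ROWS):
    """Return (slice_start, row_count) for each 7-day slice with too few rows.

    Assumption: slices are consecutive 7-day windows beginning at `start`.
    """
    counts = Counter()
    for ts in timestamps:
        offset_days = (ts - start).days
        if 0 <= offset_days < 7 * n_weeks:
            counts[offset_days // 7] += 1
    return [
        (start + dt.timedelta(weeks=week), counts[week])
        for week in range(n_weeks)
        if counts[week] < min_rows
    ]


# Made-up timestamps: 10 rows in the first week, 60 rows in the second
start = dt.date(2023, 3, 26)
timestamps = [dt.date(2023, 3, 27)] * 10 + [dt.date(2023, 4, 3)] * 60
print(sparse_slices(timestamps, start, n_weeks=2))
```

With this made-up data, only the first slice is reported, because it has 10 rows and the second has 60.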

## Analyze data drift

You can use the following code to examine data drift for the points in time collected in the backfill run.

In [20]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

end_date 2023-05-14
frequency Week
start_date 2023-03-26
Datadrift percentage {'days_from_start': [7, 14, 21, 28, 35, 42], 'drift_percentage': [74.19152901127207, 87.23985219136877, 91.74192122865539, 94.96492628559955, 97.58354951107833, 99.23199438682525]}
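
To relate these numbers to the alert threshold, the sketch below lists the time slices whose drift exceeds it. The metrics are copied from the output above. `slices_over_threshold` is a hypothetical helper, and it assumes that `drift_threshold` is a fraction from 0 to 1 while `drift_percentage` is on a 0 to 100 scale.

```python
# Metrics copied from the backfill output above
drift_metrics = {
    'Datadrift percentage': {
        'days_from_start': [7, 14, 21, 28, 35, 42],
        'drift_percentage': [74.19152901127207, 87.23985219136877,
                             91.74192122865539, 94.96492628559955,
                             97.58354951107833, 99.23199438682525],
    }
}


def slices_over_threshold(metrics, drift_threshold):
    """List (days_from_start, drift_percentage) pairs above the threshold.

    Assumption: drift_threshold is a 0-1 fraction, drift_percentage is 0-100.
    """
    series = metrics['Datadrift percentage']
    limit = drift_threshold * 100
    return [
        (days, pct)
        for days, pct in zip(series['days_from_start'], series['drift_percentage'])
        if pct > limit
    ]


for days, pct in slices_over_threshold(drift_metrics, 0.3):
    print(f'Day {days}: drift of {pct:.1f}% exceeds the threshold')
```

Under that assumption, every week in the simulated data exceeds the 0.3 threshold, which is expected because the generating code compounds the changes to `Pregnancies`, `Age`, and `BMI` each week.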


You can also visualize the data drift metrics in [Azure Machine Learning studio](https://ml.azure.com) by following these steps:

1. On the **Datasets** page, view the **Dataset monitors** tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click **Refresh**).
4. Examine the charts in the **Drift overview** section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the **Feature detail** section at the bottom, which enable you to see various measures of drift for individual features.

> **Note**: For help understanding the data drift metrics, see [How to monitor datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets#understanding-data-drift-results) in the Azure Machine Learning documentation.

## Explore further

This lab is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also collect data from published services and use it as a target dataset for data drift monitoring. See [Collect data from models in production](https://docs.microsoft.com/azure/machine-learning/how-to-enable-data-collection) for details.


## My Code on Data Drift Monitor

In [30]:
import pandas as pd
df = pd.read_csv("../data/train.csv") 
df_baseline = df[df.date < "2017-01-01"]

### Baseline dataset
To monitor a dataset for data drift, you must register a baseline dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future.

In [34]:
df_baseline

Unnamed: 0,date,store,item,sales
0,2013-01-01,1,1,13
1,2013-01-02,1,1,11
2,2013-01-03,1,1,14
3,2013-01-04,1,1,13
4,2013-01-05,1,1,10
...,...,...,...,...
912630,2016-12-27,10,50,60
912631,2016-12-28,10,50,43
912632,2016-12-29,10,50,68
912633,2016-12-30,10,50,63


### Target Dataset
Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift as well as a timestamp field that indicates the point in time when the new data was current. This enables you to measure data drift over temporal intervals.

In [11]:
import pandas as pd
df_target = pd.read_csv('../data/target_new.csv')
df_target.head(20)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,date,store,item,sales
0,0,1461,2017-01-01,14,63,211
1,1,1462,2017-01-02,14,43,400
2,2,1463,2017-01-03,23,42,323
3,3,1464,2017-01-04,17,59,387
4,4,1465,2017-01-05,13,40,467
5,5,1466,2017-01-06,15,59,464
6,6,1467,2017-01-07,21,57,265
7,7,1468,2017-01-08,17,64,491
8,8,1469,2017-01-09,18,52,131
9,9,1470,2017-01-10,20,70,224


In [38]:
df_baseline.to_csv("../data/baseline.csv")

### Accessing Workspace

In [1]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Ready to work with aml-DSTeam-RnD-001


### Register both datasets

In [12]:
from azureml.core import Datastore

# Get a named datastore instead of the workspace default
default_ds = Datastore.get(ws, 'my_datastore')

In [13]:
from azureml.core import Datastore, Dataset


# Upload the baseline data
#default_ds = ws.get_default_datastore()
print(default_ds)
default_ds.upload_files(files=['../data/baseline.csv'],
                       target_path='forecast-baseline',
                       overwrite=True, 
                       show_progress=True)

# Create and register the baseline dataset
print('Registering baseline dataset...')
baseline_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'forecast-baseline/*.csv'))
baseline_data_set = baseline_data_set.register(workspace=ws, 
                           name='new forecast baseline',
                           description='forecast baseline data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline dataset registered!')

######################################################


# Upload the target data
#default_ds = ws.get_default_datastore()
print(default_ds)
default_ds.upload_files(files=['../data/target_new.csv'],
                       target_path='forecast-target',
                       overwrite=True, 
                       show_progress=True)

# Create and register the target dataset
print('Registering target dataset...')
target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'forecast-target/*.csv'))
target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws, 
                           name='new forecast target',
                           description='forecast target data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline target registered!')

{
  "name": "my_datastore",
  "container_name": "data-container",
  "account_name": "amldsteamrnd006960506712",
  "protocol": "https",
  "endpoint": "core.windows.net"
}
Uploading an estimated of 1 files
Uploading ../data/baseline.csv
Uploaded ../data/baseline.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Registering baseline dataset...
Baseline dataset registered!
{
  "name": "my_datastore",
  "container_name": "data-container",
  "account_name": "amldsteamrnd006960506712",
  "protocol": "https",
  "endpoint": "core.windows.net"
}
Uploading an estimated of 1 files
Uploading ../data/target_new.csv
Uploaded ../data/target_new.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Registering target dataset...
Baseline target registered!


### Use a compute target
To run the data drift monitor, you'll need a compute target. Run the following cell to specify a compute cluster (if it doesn't exist, it will be created).

In [14]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "agcluster"  ##used to run the monitor

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

Found existing cluster, use it.


### Run Data drift monitor
Now you're ready to use a DataDriftDetector class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [15]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['sales']
# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'demanddriftmonitor1', baseline_data_set, target_data_set,
                                                      compute_target=cluster_name, 
                                                      frequency='Week', ## Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".
                                                      feature_list=features, 
                                                      drift_threshold=.3, ## threshold for alert
                                                      latency=24) ## Delay in hours for data to appear in dataset.
                                                    
monitor

{'_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>, '_workspace': Workspace.create(name='aml-DSTeam-RnD-001', subscription_id='c59b6c0a-0bc0-4b69-bd03-020b2171f742', resource_group='RG-AmlWS-DSTeam-RnD'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': 'fd23646e-1ece-4d82-8372-44b6ea8853df', '_compute_target_name': 'agcluster', '_drift_threshold': 0.3, '_baseline_dataset_id': 'fceb38de-005c-4540-9ce8-3ad6938410c8', '_target_dataset_id': 'c83cd376-69c2-4928-8349-500cac1b6f75', '_feature_list': ['sales'], '_latency': 24, '_name': 'demanddriftmonitor1', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x7fb540538ac0>}

### Sample Data drift monitor is below

In [None]:
# from azureml.datadrift import DataDriftDetector, AlertConfiguration

#    alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to receive alerts from the scheduled pipeline after enabling

#    monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
#                                                          compute_target='cpu-cluster',         # compute target for scheduled pipeline and backfills
#                                                          frequency='Week',                     # how often to analyze target data
#                                                          feature_list=None,                    # list of features to detect drift on
#                                                          drift_threshold=None,                 # threshold from 0 to 1 for email alerting
#                                                          latency=0,                            # SLA in hours for target data to arrive in the dataset
#                                                          alert_config=alert_config)            # email addresses to send alert

### Backfill the data drift monitor
You have a baseline dataset and a target dataset that includes simulated data for eight weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

In [16]:
from azureml.widgets import RunDetails
import datetime as dt

backfill = monitor.backfill(dt.datetime(2017,3,15) - dt.timedelta(weeks=8), dt.datetime(2017,3,15) )

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'demanddriftmonitor1-Monitor-Runs_1686327654534',
 'target': 'agcluster',
 'status': 'Completed',
 'startTimeUtc': '2023-06-09T16:23:27.528584Z',
 'endTimeUtc': '2023-06-09T16:35:08.866428Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': 'bf6f20a4-0b5c-4d1f-99ae-3d0bb82e4173',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': 'fceb38de-005c-4540-9ce8-3ad6938410c8'}, 'consumptionDetails': {'type': 'Reference'}}, {'dataset': {'id': 'c83cd376-69c2-4928-8349-500cac1b6f75'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': '_generate_script_datasets.py',
  'useAbsolutePath': False,
  'arguments': ['--baseline_dataset_id',
   'fceb38de-005c-4540-9ce8-3ad6938410c8',
   '--target_dataset_id',
   'c83cd376-69c2-4928-8349-500cac1b6f75',
   '--workspace_name',
   'aml-DSTeam-RnD-001',

### Analyze Data Drift
You can use the following code to examine data drift for the points in time collected in the backfill run.

In [17]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

start_date 2017-01-15
end_date 2017-03-19
frequency Week
Datadrift percentage {'days_from_start': [0, 7, 14, 21, 28, 35, 42, 49, 56], 'drift_percentage': [89.87316518653998, 89.98313724831581, 89.39593272805344, 88.59031328138313, 89.67290735858239, 88.60276127883476, 89.53615409933195, 89.5423203716511, 89.12407119486292]}

