# Azure ML Model Monitoring Demo - Model Training

This is one of a series of sample notebooks that showcase [AML's continuous model monitoring capabilities](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-model-performance?view=azureml-api-2&tabs=azure-cli). The notebooks in this repo cover model training, deployment, simulated production data scoring, and inference data collection. They are designed to be run in order and include the following steps:

- 00. Data Upload - Load time-series weather data from a local CSV into an AML datastore, and register as training & evaluation datasets
- <b>01. Model Training - Train a custom temperature prediction regression model using MLflow & scikit-learn and register it in your AML workspace</b>
- 02. Model Deployment - Deploy your newly trained model to a Managed Online Endpoint with production data collection configured.
- 03. Production Data Simulation - Send time-series data to your endpoint at a slow rate to simulate production inferencing. All submitted data will be collected automatically.
- 04. Monitoring Configuration - Configure a production model data monitor that looks for drift in inferencing data and scored results, which can indicate that retraining should be performed.
- 05. Offline Monitoring - Sample notebook showcasing how to identify drift in data from datasets scored outside of Azure ML.

<b>This notebook uses the previously registered `weather-training-data` dataset to train a custom regression model that predicts temperature from other environmental attributes. Here, we train and register a model (`Temperature_Prediction_Model`) using MLflow and scikit-learn, with the preprocessing logic incorporated into a scikit-learn pipeline so that the deployed model can score raw input directly. After training our model, we also score and save ALL training and evaluation data for post hoc analyses.</b>
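
To illustrate why bundling preprocessing into the pipeline simplifies inferencing, here is a minimal sketch on a small synthetic frame. The column names (`humidity`, `wind_speed`, `condition`) and all values are made up for illustration and are not taken from the weather dataset. The point is only that a fitted pipeline of this shape accepts raw rows, including missing values and categories it has not seen before.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "humidity": rng.uniform(20, 90, size=50),
    "wind_speed": rng.uniform(0, 15, size=50),
    "condition": rng.choice(["clear", "rain", "fog"], size=50),
})
toy_target = 30 - 0.2 * toy["humidity"] + rng.normal(0, 1, size=50)

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", numeric_steps, ["humidity", "wind_speed"]),
        ("cat", categorical_steps, ["condition"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy, toy_target)

# Raw rows with a missing value and an unseen category go straight into predict()
raw_rows = pd.DataFrame({
    "humidity": [55.0, np.nan],
    "wind_speed": [3.2, 7.5],
    "condition": ["rain", "snow"],
})
toy_predictions = toy_pipeline.predict(raw_rows)
```

Because imputation, scaling, and encoding are fitted and stored inside the pipeline object, the caller of the deployed model does not need to reproduce any of those steps.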

### Import required packages

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from mlflow import set_tracking_uri
import mltable
import mlflow
import pandas as pd
import json
import os
from azure.ai.ml.entities import Data, Model
from azure.ai.ml.constants import AssetTypes
from azureml.fsspec import AzureMachineLearningFileSystem

### Install missing packages/updated versions of mlflow

In [None]:
# !pip install azure-ai-ml mlflow==1.30.0 mlflow-skinny==1.30.0

### Establish connection to AML workspace using the v2 SDK

In [None]:
subscription_id = "<your_subscription_id>"
resource_group = "<your_resource_group>"
workspace_name = "<your_workspace_name>"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
workspace = ml_client.workspaces.get(workspace_name)
tracking_uri = workspace.mlflow_tracking_uri

set_tracking_uri(tracking_uri)


### Retrieve training dataset from AML workspace and load into a Pandas dataframe

In [None]:
dataset_name = 'weather-training-data'

data = ml_client.data.get(dataset_name, version='5')
dataset = mltable.from_delimited_files(paths=[{'pattern': data._referenced_uris[0]}])
df = dataset.to_pandas_dataframe()
df


### Create an experiment and submit a training run

In [None]:
import mlflow
import pandas as pd
mlflow.autolog(log_input_examples=True, log_model_signatures=True)

experiment_name = 'Temperature_Prediction_Model_Training'
run_name = 'Random_Forest_Regressor_Trial'

mlflow.set_experiment(experiment_name)

run_id = None

# Drop the timestamp column (if present) so it is not used as a feature.
# DataFrame.drop returns a new frame, so the result must be assigned.
df = df.drop(columns=['datetime'], errors='ignore')

X = df.drop('temperature', axis=1)
y = df['temperature']

with mlflow.start_run(run_name=run_name) as run:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Dynamically select numerical and categorical features
    numeric_features = X.select_dtypes(include=['int64', 'float64', 'int32']).columns
    categorical_features = X.select_dtypes(include=['object']).columns

    # Preprocessing for numeric columns: impute missing values with the median, then scale
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    # Preprocessing for categorical columns: impute a 'missing' label, then one-hot encode
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    # Combine preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    # Create preprocessing and training pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', RandomForestRegressor())
                              ])

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train the model; autologging records the fitted pipeline with the run
    pipeline.fit(X_train, y_train)

    run_id = run.info.run_id
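
The training cell above creates `X_test` and `y_test` but never scores them. A minimal sketch of a hold-out check follows. The helper name `holdout_metrics` is hypothetical, and the synthetic data only stands in for the weather features so that the snippet runs on its own; in this notebook you would call the helper with `pipeline`, `X_test`, and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def holdout_metrics(model, X_test, y_test):
    """Return MAE and R^2 of a fitted regressor on held-out data (hypothetical helper)."""
    predictions = model.predict(X_test)
    return {
        "mae": float(mean_absolute_error(y_test, predictions)),
        "r2": float(r2_score(y_test, predictions)),
    }


# Synthetic stand-in data so the example is self-contained
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
target = 2.0 * features[:, 0] - features[:, 1] + rng.normal(scale=0.1, size=200)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(f_train, t_train)
demo_metrics = holdout_metrics(demo_model, f_test, t_test)
```

Inside the `mlflow.start_run` block, the resulting dictionary could be passed to `mlflow.log_metrics` so that the values are stored with the run.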


### Load logged model and verify that data scoring works as expected

In [None]:
run_path = f'runs:/{run_id}/model'
loaded_model = mlflow.sklearn.load_model(run_path)

loaded_model.predict(X)

### Register your newly trained model

<i>Note: As part of your standard workflow, you should implement some A/B testing logic to compare the performance of your newly trained model against your previously trained model(s) to ensure performance meets expectations.</i>
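
One simple way to implement such a comparison is to score the candidate and the previously trained model on the same held-out data and register the candidate only if its error is lower. The sketch below assumes mean absolute error as the metric. The function name `should_register`, the `min_improvement` threshold, and the synthetic data are hypothetical and not part of this demo.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def should_register(candidate, incumbent, X_holdout, y_holdout, min_improvement=0.0):
    """Return True if the candidate's MAE beats the incumbent's by more than min_improvement."""
    candidate_mae = mean_absolute_error(y_holdout, candidate.predict(X_holdout))
    incumbent_mae = mean_absolute_error(y_holdout, incumbent.predict(X_holdout))
    return bool((incumbent_mae - candidate_mae) > min_improvement)


# Synthetic stand-in data: the target depends linearly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_fit, y_fit = X_demo[:200], y_demo[:200]
X_holdout, y_holdout = X_demo[200:], y_demo[200:]

# A mean-only baseline plays the role of the previously trained model
incumbent = DummyRegressor(strategy="mean").fit(X_fit, y_fit)
candidate = LinearRegression().fit(X_fit, y_fit)

promote = should_register(candidate, incumbent, X_holdout, y_holdout)
```

In this notebook, the registration call below could be guarded by such a check once a previous model version exists to compare against.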

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path="azureml://jobs/{}/outputs/artifacts/paths/model/".format(run_id),
    name="Temperature_Prediction_Model",
    description="Sample regression model from Azure ML model monitoring demo",
    type=AssetTypes.MLFLOW_MODEL
)

ml_client.models.create_or_update(run_model)

### Score all weather data and register for post-hoc analysis

Here, we score all of our data (training & evaluation) and save the results so that they can be registered in our AML workspace, enabling us to run a simulated drift analysis later on.
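
As a rough illustration of what a drift comparison over saved predictions could look like, the sketch below computes a population stability index (PSI) between two samples. PSI is a commonly used statistic chosen here for illustration, not necessarily the metric that Azure ML monitoring applies. The function name, bin count, and synthetic samples are assumptions made for this example.

```python
import numpy as np


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """Compute PSI between two 1-D samples using quantile bins from the baseline."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges come from baseline quantiles
    interior_edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))[1:-1]

    # searchsorted maps each value to a bin index in [0, n_bins - 1]
    baseline_counts = np.bincount(np.searchsorted(interior_edges, baseline), minlength=n_bins)
    current_counts = np.bincount(np.searchsorted(interior_edges, current), minlength=n_bins)

    # Convert counts to proportions; eps avoids log(0) for empty bins
    baseline_share = np.clip(baseline_counts / baseline.size, eps, None)
    current_share = np.clip(current_counts / current.size, eps, None)

    return float(np.sum((current_share - baseline_share) * np.log(current_share / baseline_share)))


# Synthetic stand-ins for predicted temperatures
rng = np.random.default_rng(0)
baseline_preds = rng.normal(loc=10.0, scale=2.0, size=5000)
similar_preds = rng.normal(loc=10.0, scale=2.0, size=5000)
shifted_preds = rng.normal(loc=14.0, scale=2.0, size=5000)

psi_similar = population_stability_index(baseline_preds, similar_preds)
psi_shifted = population_stability_index(baseline_preds, shifted_preds)
```

With the files produced in this notebook, `baseline` and `current` would be the `Predicted_Temperature` columns of the scored training and evaluation data. A larger PSI indicates a larger shift between the two distributions.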

In [None]:
os.makedirs('./scored_data', exist_ok=True)

# Score each registered dataset with the loaded model and save the results locally.
# Adjust the version to match the data assets registered in your workspace.
for dataset_name in ['weather-training-data', 'weather-evaluation-data', 'weather-full-data']:
    data = ml_client.data.get(dataset_name, version='4')
    dataset = mltable.from_delimited_files(paths=[{'pattern': data._referenced_uris[0]}])
    df = dataset.to_pandas_dataframe()
    preds = loaded_model.predict(df)
    df['Predicted_Temperature'] = preds
    df.to_csv(f'./scored_data/{dataset_name}_scored.csv', index=False)

### Upload & register all scored datasets in AML workspace

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azureml.fsspec import AzureMachineLearningFileSystem

datastore_name = 'workspaceblobstore' # default
path_on_datastore = 'weather_data'

# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{datastore_name}/paths/'
uri

# instantiate file system using following URI
fs = AzureMachineLearningFileSystem(uri)

# you can specify recursive as False to upload a file
fs.upload(lpath='./scored_data/weather-training-data_scored.csv', rpath='weather_data/scored_data', recursive=False, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
fs.upload(lpath='./scored_data/weather-evaluation-data_scored.csv', rpath='weather_data/scored_data', recursive=False, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
fs.upload(lpath='./scored_data/weather-full-data_scored.csv', rpath='weather_data/scored_data', recursive=False, **{'overwrite': 'MERGE_WITH_OVERWRITE'})

tbl = mltable.from_delimited_files([{'pattern': uri + 'weather_data/scored_data/weather-training-data_scored.csv'}])
tbl.save('./training_data_scored')

training_data = Data(
    path = './training_data_scored',
    type = AssetTypes.MLTABLE,
    description = 'January to March 2019 Weather Data',
    name='scored-weather-training-data',
    version="2"
)
ml_client.data.create_or_update(training_data)

tbl = mltable.from_delimited_files([{'pattern': uri + 'weather_data/scored_data/weather-evaluation-data_scored.csv'}])
tbl.save('./eval_data_scored')

eval_data = Data(
    path = './eval_data_scored',
    type = AssetTypes.MLTABLE,
    description = 'April to October 2019 Weather Data',
    name='scored-weather-evaluation-data',
    version="2"
)
ml_client.data.create_or_update(eval_data)

tbl = mltable.from_delimited_files([{'pattern': uri + 'weather_data/scored_data/weather-full-data_scored.csv'}])
tbl.save('./full_data_scored')

full_data = Data(
    path = './full_data_scored',
    type = AssetTypes.MLTABLE,
    description = 'January to October 2019 Weather Data',
    name='scored-weather-full-data',
    version="2"
)
ml_client.data.create_or_update(full_data)