## Simulating live model drift monitoring: can you find which drift has occurred?

**Exercise:** Your model has been in production for 5 years and you would like to assess whether model drift has occurred, and if so, why.

During this exercise you will learn how to work with NannyML.

### Introduction to NannyML

Data drift is mainly a problem when it affects model performance. However, model performance cannot always be calculated directly, because new ground truth labels may arrive late or not at all.

NannyML introduces so-called "nanny models", which **estimate the impact of data changes on "child" model performance without the need for new ground truth labels**. The package also offers prioritization and visualisation tools that help with a root-cause analysis once harmful model drift has occurred.
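
To make the idea of label-free performance estimation more concrete, here is a minimal sketch in plain Python. It assumes the predicted probabilities are well calibrated: a prediction of the positive class with probability `p` is then expected to be correct with probability `p`, and a prediction of the negative class with probability `1 - p`. The function name `expected_accuracy` and the sample values are made up for this illustration. It is a simplification of the confidence-based approach, not NannyML's implementation.

```python
def expected_accuracy(y_pred_proba, y_pred):
    """Expected accuracy without labels, assuming calibrated probabilities."""
    # Probability that each individual prediction is correct
    per_row = [
        p if pred == 1 else 1.0 - p
        for p, pred in zip(y_pred_proba, y_pred)
    ]
    return sum(per_row) / len(per_row)


# Confident predictions lead to a high expected accuracy ...
confident = expected_accuracy([0.95, 0.05, 0.90, 0.10], [1, 0, 1, 0])
# ... while probabilities close to 0.5 lead to a low one
uncertain = expected_accuracy([0.60, 0.45, 0.55, 0.40], [1, 0, 1, 0])

print(f"confident: {confident:.3f}, uncertain: {uncertain:.3f}")
```

If drift pushes the model towards less confident predictions, an estimate of this kind drops even though no new labels have been observed.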


### Importing Modules

In [None]:
# Comment out the next line after the first run
!pip install nannyml

In [None]:
import pandas as pd
import nannyml as nml
import numpy as np
from IPython.display import display
from random import uniform
from configs.paths import FOLDER_CONFIG_FILES, FOLDER_DATA_RAW, FOLDER_MODELS
import warnings
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from src.data.download_from_azure import download_data_from_azure_to_raw, read_config
warnings.filterwarnings("ignore")

### Configurations

In [None]:
target_col = 'Revenue'

num_cols = [
    'Administrative',
    'Administrative_Duration', 
    'Informational',
    'Informational_Duration', 
    'ProductRelated', 
    'ProductRelated_Duration',
    'BounceRates', 
    'ExitRates', 
    'PageValues', 
    'SpecialDay'
]

cat_cols = [
    'Month',
    'OperatingSystems',
    'Browser',
    'Region',
    'TrafficType',
    'VisitorType',
    'Weekend'
]

### Download dataset from Azure Blob storage using config file with credentials  

In [None]:
config_file_name = "config_metyis.yaml" #put the correct company name here
config = read_config(FOLDER_CONFIG_FILES / config_file_name)['azure-ld-best-practices']
AZURE_STORAGE_CONNECTION_STRING = config['AZURE_STORAGE_CONNECTION_STRING']
CONTAINER_NAME = config['CONTAINER_NAME']

for i in range(5):
    download_data_from_azure_to_raw(
        filename=f"online_shoppers_intention_{2018+i}.csv",
        azure_storage_connection_string=AZURE_STORAGE_CONNECTION_STRING,
        container_name=CONTAINER_NAME,
        folder_data_raw = FOLDER_DATA_RAW
    )

### Loading and preparing data for NannyML
Prepare the datasets for the package so that the _analysis_ dataset can be compared with the _reference_ dataset used in training.

NannyML requires the following columns for the reference and analysis datasets:
- y_pred_proba (based on a pre-trained model: use your own if possible)
- y_pred (based on a pre-trained model: use your own if possible)
- identifier (see script below)
- timestamp (see script below)
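
A quick way to verify the list above before handing a dataset to NannyML is to compare the column names against the required ones. The helper `missing_columns` and the example column list are hypothetical and only serve as an illustration:

```python
REQUIRED_COLUMNS = ["y_pred_proba", "y_pred", "identifier", "timestamp"]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are not in `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Example with a made-up list of column names
example_columns = ["Revenue", "timestamp", "identifier", "y_pred"]
missing = missing_columns(example_columns)
print(missing)
```

With a pandas DataFrame you could pass its column index, for example `missing_columns(reference.columns)`, once the predictions have been added.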

In [None]:
# Load the reference (train) dataset
reference = (
    pd.read_csv(FOLDER_DATA_RAW / "online_shoppers_intention.csv")
    .astype({col: "category" for col in cat_cols})
    .assign(
        identifier = lambda d: d.index,
        timestamp = pd.Timestamp("2017-01-01"),
    )
)
display(reference.head())

In [None]:
# Option 1: load your own fitted model (comment this line out if you do not have one)
fitted_model = joblib.load(FOLDER_MODELS / "fitted_model.pkl")

# Option 2: fit a baseline model on the reference data
# Comment out the rest of this cell if you use your own model
X = reference.drop(columns=target_col)
y = reference[target_col]

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
fitted_model = clf.fit(X, y)
fitted_model

In [None]:
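# Load the unseen analysis datasets, one file per year
analysis_datasets = [
    pd.read_csv(FOLDER_DATA_RAW / f"online_shoppers_intention_{2018+i}.csv")
    .assign(timestamp = pd.Timestamp(f"{2018+i}-01-01"))
    for i in range(5)
]
# ignore_index=True gives every row a unique index, so the identifier is unique as well
analysis = (
    pd.concat(analysis_datasets, ignore_index=True)
    .astype({col: "category" for col in cat_cols})  # match the reference dtypes
    .assign(identifier = lambda d: d.index)
)
display(analysis.head())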

In [None]:
# Find which predict_proba column refers to the positive ("True") class
class_index = list(fitted_model.classes_).index(True)

# Features of the reference set (does not depend on the optional fitting cell above)
X_reference = reference.drop(columns=target_col)

# Add y_pred and y_pred_proba to both the reference and the analysis dataset
reference = reference.assign(
    y_pred = fitted_model.predict(X_reference),
    y_pred_proba = fitted_model.predict_proba(X_reference)[:, class_index],
)

analysis = analysis.assign(
    y_pred = fitted_model.predict(analysis),
    y_pred_proba = fitted_model.predict_proba(analysis)[:, class_index],
)

### Analysis using NannyML
The _performance estimator_ and _univariate drift calculator_ have already been implemented for you below.
See, for example, the [quickstart manual](https://nannyml.readthedocs.io/en/stable/quick.html) for other functionality that may be of interest.
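
Both calculators below take a `chunk_period` and report one value per chunk. As a rough illustration in plain Python (not NannyML's own chunker), grouping rows by the calendar year of their timestamp looks like this. The helper name `chunk_by_year` and the sample values are made up for the example:

```python
from collections import defaultdict
from datetime import datetime


def chunk_by_year(rows):
    """Group (timestamp, value) pairs by the calendar year of the timestamp."""
    chunks = defaultdict(list)
    for timestamp, value in rows:
        chunks[timestamp.year].append(value)
    # Return the chunks in chronological order
    return dict(sorted(chunks.items()))


rows = [
    (datetime(2018, 1, 1), 0.91),
    (datetime(2018, 1, 1), 0.12),
    (datetime(2019, 1, 1), 0.47),
]
chunks = chunk_by_year(rows)
for year, values in chunks.items():
    print(year, len(values))
```

Because every row of a yearly file was given the same 1 January timestamp in the loading step, a yearly chunk period yields one chunk per file.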

In [None]:
# Choose a chunker or set a chunk size
chunk_period = "1Y"

# Initialize the estimator with the required data columns, fit it on reference and estimate on analysis
estimator = nml.CBPE(
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='Revenue',
    metrics=['roc_auc'],
    timestamp_column_name='timestamp',
    chunk_period=chunk_period,
    problem_type='classification_binary',
)
estimator = estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Show results
figure = estimated_performance.plot()
figure.show()

In [None]:
# Define feature columns
feature_column_names = num_cols + cat_cols

# Let's initialize the object that will perform the Univariate Drift calculations
univariate_calculator = nml.UnivariateDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_period=chunk_period,
    continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
    categorical_methods=['chi2', 'jensen_shannon'],
)
univariate_calculator = univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)

# Plot the distributions of all continuous columns (Jensen-Shannon method)
figure = (univariate_results
    .filter(
        column_names=univariate_results.continuous_column_names, 
        methods=['jensen_shannon']
    )
    .plot(kind='distribution')
)
figure.show()

# Plot drift results for all categorical columns
figure = (univariate_results
    .filter(
        column_names=univariate_results.categorical_column_names, 
        methods=['chi2']
    )
    .plot(kind='drift')
)
figure.show()
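
To get a feeling for the numbers behind the `jensen_shannon` method used above, the sketch below computes the Jensen-Shannon distance between two discrete distributions in plain Python. It assumes base-2 logarithms, so the distance lies between 0 (identical) and 1 (no overlap). The function names and bin counts are made up for this illustration, and NannyML's own binning and implementation details may differ.

```python
import math


def to_probabilities(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    return [count / total for count in counts]


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl_divergence(a, b):
        # Bins with zero probability in `a` contribute nothing
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl_divergence(p, mixture) + kl_divergence(q, mixture)) / 2
    return math.sqrt(max(divergence, 0.0))


# Made-up bin counts for one feature in the reference and one analysis chunk
reference_bins = to_probabilities([50, 30, 20])
analysis_bins = to_probabilities([20, 30, 50])

same = jensen_shannon_distance(reference_bins, reference_bins)
shifted = jensen_shannon_distance(reference_bins, analysis_bins)
disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
print(f"same: {same:.3f}, shifted: {shifted:.3f}, disjoint: {disjoint:.3f}")
```

The larger the distance between a chunk and the reference distribution, the stronger the indication that the feature has drifted.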