### Model Drift Analysis: load the model from Model Catalog

Model Drift Analysis requires two datasets, each containing not only the features (xi) but also the target.

This means that, to monitor the model's performance and detect Model Drift, we need to collect new data and, in some way, establish the "ground truth" for it before we can analyze the results.

This notebook contains a prototype that can be used to **start working on Model Drift**.

The dataset is again the Employee Attrition data, and the model is based on LightGBM (a GBM) wrapped in a scikit-learn pipeline.

We simulate a Data Drift (by adding a "shift" to some features) to make the model's performance worse.

In the first part of the notebook we train a model on a reference dataset and save the pipeline together with the metrics computed on a reference validation dataset.
In the second part we reload the model (pipeline) and re-evaluate the metrics on a new dataset.
All the results are saved in a CSV file that can easily be loaded into a database.
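
As a minimal sketch of that last step, the snippet below appends one row of metrics per evaluation run to a CSV file. The column names and the helper `append_metrics` are assumptions made for illustration; they are not the notebook's actual output format.

```python
import csv
import os
from datetime import datetime

# Hypothetical column layout; the real CSV schema is not shown in this notebook.
FIELDS = ["timestamp", "dataset", "accuracy", "roc_auc"]


def append_metrics(csv_path, dataset_name, accuracy, roc_auc):
    """Append one evaluation run to csv_path, writing the header on first use."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "dataset": dataset_name,
            "accuracy": accuracy,
            "roc_auc": roc_auc,
        })
```

Calling `append_metrics("drift_metrics.csv", "new dataset", acc, roc_auc)` after each evaluation would build up a history that a database bulk-load can ingest; the file name here is also just a placeholder.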

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import os
import tempfile

import ads
from ads import set_auth

# to save to Model Catalog
from ads.catalog.model import ModelCatalog
from ads.common.model_metadata import UseCaseType, MetadataCustomCategory
from ads.model.framework.sklearn_model import SklearnModel

# used to serialize the pipeline
from pickle import dump, load

import lightgbm as lgb

from sklearn.metrics import classification_report
from sklearn.metrics import get_scorer, make_scorer, f1_score, roc_auc_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# added to work with pipelines
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from ads.dataset.factory import DatasetFactory

import logging
import warnings

import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

%matplotlib inline

In [2]:
# we need ads 2.5.10 or greater
print(ads.__version__)

2.6.2


In [3]:
# authenticate with Resource Principal (RP)
set_auth(auth='resource_principal')



### First Part: Load the data

In [4]:
#
# define the functions that identify the column categories
#
def cat_cols_selector(df, target_name):
    # df: input dataframe; target_name: name of the target column (excluded from the result)
    
    # columns of object dtype, or with fewer than THR distinct values, are treated as categorical
    THR = 10
    
    nunique = df.nunique()
    types = df.dtypes
    
    col_list = []
    
    for col in df.columns:
        if ((types[col] == 'object') or (nunique[col] < THR)):
            # print(col)
            if col != target_name:
                col_list.append(col)
    
    return col_list

def num_cols_selector(df, target_name):
    THR = 10
    
    types = df.dtypes
    nunique = df.nunique()
    
    col_list = []
    
    for col in df.columns:
        if (types[col] != 'object') and (nunique[col] >= THR): 
            # print(col)
            if col != target_name:
                col_list.append(col)
    
    return col_list

def load_as_dataframe(path):
    ds = DatasetFactory.open(path,
                             target="Attrition").set_positive_class('Yes')

    ds_up = ds.up_sample()

    # drop unneeded columns
    cols_to_drop = ['Directs','name', 'Over18','WeeklyWorkedHours','EmployeeNumber']

    ds_used = ds_up.drop(columns=cols_to_drop)
    
    df_used = ds_used.to_pandas_dataframe()
    
    # train/test split (done directly on the dataframes)
    df_train, df_test = train_test_split(df_used, shuffle=True, test_size=0.2, random_state = 1234)

    print("# of samples in train set", df_train.shape[0])
    print("# of samples in test set", df_test.shape[0])
    
    return df_train, df_test

In [5]:
# load the dataset and do upsampling
TARGET = 'Attrition'

attrition_path = "/opt/notebooks/ads-examples/oracle_data/orcl_attrition.csv"

df_train, df_test = load_as_dataframe(attrition_path)

X_train, y_train = df_train.drop([TARGET], axis=1), df_train[TARGET]
X_test, y_test = df_test.drop([TARGET], axis=1), df_test[TARGET]

# the ADS dataset class is still used (inside load_as_dataframe) to do the upsampling

cat_cols = cat_cols_selector(df_train, TARGET)
num_cols = num_cols_selector(df_train, TARGET)

print()
print(f'Numerical columns: {num_cols} ({len(num_cols)})')
print()
print(f'Categorical columns: {cat_cols} ({len(cat_cols)})')

# of samples in train set 1972
# of samples in test set 494

Numerical columns: ['Age', 'SalaryLevel', 'CommuteLength', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'YearsinIndustry', 'YearsOnJob', 'YearsAtCurrentLevel', 'YearsSinceLastPromotion', 'YearsWithCurrManager'] (13)

Categorical columns: ['TravelForWork', 'JobFunction', 'EducationalLevel', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TrainingTimesLastYear', 'WorkLifeBalance'] (17)


### Second Part: analysis on a new dataset

We load the model from the Model Catalog.

In [6]:
# take Model OCID from UI
MODEL_OCID = "ocid1.datasciencemodel.oc1.eu-milan-1.amaaaaaangencdyayr37s6ihur3m7gb2mi2ujl5hfx57rkudm5bzjqy5kcja"

# load ADS model from Model Catalog
ads_model = SklearnModel.from_model_catalog(model_id=MODEL_OCID,
                                            model_file_name="model.joblib",
                                            artifact_dir=tempfile.mkdtemp())

Start loading model.joblib from model directory /tmp/tmp7qp2kift ...
Model is successfully loaded.


In [7]:
# take the inner Sklearn pipeline
pipe = ads_model.estimator

pipe.steps

[('preprocessor',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('imputer', SimpleImputer()),
                                                   ('standard_scaler',
                                                    StandardScaler())]),
                                   ['Age', 'SalaryLevel', 'CommuteLength',
                                    'HourlyRate', 'MonthlyIncome', 'MonthlyRate',
                                    'NumCompaniesWorked', 'PercentSalaryHike',
                                    'YearsinIndustry', 'YearsOnJob',
                                    'YearsAtCurrentLevel',
                                    'YearsSinceLastPromotion',
                                    'YearsWithCurrManager']),
                                  ('c...
                                                    OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                   unknown_value=-

### Simulate some changes in the Dataset

In [8]:
# simulate some changes in the dataset
# we use again the test set, but with a "Data Drift"

print("Simulate a drift in data...")

X_test['SalaryLevel'] = X_test['SalaryLevel'] + 3000
X_test['Age'] = X_test['Age'] + 20

Simulate a drift in data...
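
To get a feel for how large such a shift is, one option is to express the change in a feature's mean in units of the reference standard deviation. The sketch below uses made-up ages rather than the attrition data, and the helper name `standardized_mean_shift` is hypothetical.

```python
from statistics import mean, stdev


def standardized_mean_shift(reference, current):
    """Return (mean(current) - mean(reference)) / stdev(reference)."""
    return (mean(current) - mean(reference)) / stdev(reference)


# Made-up reference ages, and the same values after the +20 shift applied to 'Age' above.
ref_age = [25, 32, 41, 38, 29, 50, 45, 36]
new_age = [a + 20 for a in ref_age]

shift = standardized_mean_shift(ref_age, new_age)
print(f"Age mean moved by {shift:.2f} reference standard deviations")
```

Because the pipeline standardizes numerical columns with the statistics learned on the reference data, a constant offset like this reaches the model as inputs displaced by roughly that many standard deviations.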


In [9]:
# scoring: compute new metrics
test_pred = pipe.predict(X_test)
test_probas = pipe.predict_proba(X_test)

print('Validation set result:')

roc_auc = round(roc_auc_score(y_test, test_probas[:,1]), 4)
acc = round(accuracy_score(y_test, test_pred), 4)

print(f"Acc: {acc}, AUC: {roc_auc}")

Validation set result:
Acc: 0.9069, AUC: 0.9619


In [None]:
# the metrics are worse than those registered in the Model Catalog for the reference set

### Getting the reference dataset and metrics from the Model Catalog

In [10]:
# I can get the custom metadata as a pandas DataFrame
meta_df = ads_model.metadata_custom.to_dataframe()

meta_df.head(10)

Unnamed: 0,Key,Value,Description,Category
0,ClientLibrary,ADS,,Other
1,CondaEnvironment,oci://conda_envs@frqap2zhtzbe/conda_environments/cpu/mygeneralml_p37_cpu_/1.0/mygeneralml_p37_cpu_v1_0,The conda environment where the model was trained.,Training Environment
2,CondaEnvironmentPath,oci://conda_envs@frqap2zhtzbe/conda_environments/cpu/mygeneralml_p37_cpu_/1.0/mygeneralml_p37_cpu_v1_0,The URI of the training conda environment.,Training Environment
3,EnvironmentType,published,"The conda environment type, can be published or datascience.",Training Environment
4,ModelArtifacts,"input_schema.json, test_json_output.json, model.joblib, runtime.yaml, score.py, output_schema.json",The list of files located in artifacts folder.,Training Environment
5,ModelSerializationFormat,joblib,The model serialization format.,Training Profile
6,SlugName,mygeneralml_p37_cpu_v1_0,The slug name of the training conda environment.,Training Environment
7,metrics on reference set,"{'accuracy': 0.9514, 'roc_auc': 0.9939}",Metrics evaluated on reference dataset,Performance
8,reference dataset,oci://drift_input@frqap2zhtzbe/reference.csv,Reference dataset url. From this dataset have been extracted train/validation dataset,Training and Validation Datasets


In [11]:
# metrics
ref_metrics = meta_df[meta_df['Key'] == "metrics on reference set"]['Value'].values[0]

ref_metrics

"{'accuracy': 0.9514, 'roc_auc': 0.9939}"
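
Note that the value comes back as a string, not a dictionary. A minimal sketch of how it could be parsed and compared with the metrics computed on the drifted data follows; the numbers are copied from the cell outputs above, while the tolerance `MAX_DROP` is an arbitrary assumption, not a recommended threshold.

```python
import ast

# Values copied from the cells above: the metadata string and the drifted-set metrics.
ref_metrics_str = "{'accuracy': 0.9514, 'roc_auc': 0.9939}"
new_metrics = {"accuracy": 0.9069, "roc_auc": 0.9619}

# literal_eval safely parses a Python-literal string
# (json.loads would reject the single quotes).
ref_metrics_dict = ast.literal_eval(ref_metrics_str)

# Hypothetical tolerance: flag a metric that drops by more than this absolute amount.
MAX_DROP = 0.02

# Absolute drop of each metric relative to the reference value.
drops = {
    name: round(ref_metrics_dict[name] - new_metrics[name], 4)
    for name in ref_metrics_dict
}
drifted = {name: drop > MAX_DROP for name, drop in drops.items()}

print(drops)
print(drifted)
```

In practice the tolerance would need to be chosen per metric and per use case, for instance from the variability observed across validation folds.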

In [None]:
# the reference dataset:
ref_url = meta_df[meta_df['Key'] == "reference dataset"]['Value'].values[0]

ref_url

In [None]:
# this is how I read the reference dataset, whose URL is taken from the metadata
ref_df = pd.read_csv(ref_url)

ref_df.head()