### Model Drift Analysis: load the model from Model Catalog

Model Drift Analysis requires two datasets, each containing not only the features (x_i) but also the target.

This means that, to monitor the model's performance and detect model drift, we need some way to collect new data and analyze the outcomes in order to establish the "ground truth".

This notebook contains a prototype that can be used to **start working on Model Drift**.

The dataset is again the Employee Attrition data, and the model is a LightGBM (GBM) classifier inside a scikit-learn pipeline.

We simulate Data Drift (by adding a "shift" to some features) in order to degrade the model's performance.
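As a rough illustration of why such a shift is detectable on the features alone, the sketch below computes the Population Stability Index (PSI) between a reference sample and a shifted copy of it. The PSI is not used elsewhere in this notebook: the function name `psi`, the number of bins and the synthetic "Age"-like values are assumptions made only for this example. The +20 shift is the same one applied to `Age` later in the notebook.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of the reference sample;
    values outside that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


random.seed(1234)
# synthetic "Age"-like feature (illustrative values, not the Attrition data)
age_ref = [random.gauss(37, 9) for _ in range(2000)]
# same kind of shift as the one simulated in this notebook: Age + 20
age_drifted = [a + 20 for a in age_ref]

psi_same = psi(age_ref, age_ref)
psi_shift = psi(age_ref, age_drifted)
print(f"PSI (no shift):  {psi_same:.4f}")
print(f"PSI (+20 shift): {psi_shift:.4f}")
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant change in distribution, but the threshold should be tuned on real data.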

In the first part of the notebook we train a model on a reference dataset and save the pipeline together with the metrics computed on a reference validation dataset.
In the second part we reload the model (the pipeline) and re-evaluate the metrics on a new dataset.
All the results are saved to a CSV file that can easily be loaded into a database.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import os
import tempfile

import ads
from ads import set_auth

# to save to Model Catalog
from ads.catalog.model import ModelCatalog
from ads.common.model_metadata import UseCaseType, MetadataCustomCategory
from ads.model.framework.sklearn_model import SklearnModel

# used to serialize the pipeline
from pickle import dump, load

import lightgbm as lgb

from sklearn.metrics import classification_report
from sklearn.metrics import get_scorer, make_scorer, f1_score, roc_auc_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# added to work with pipelines
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from ads.dataset.factory import DatasetFactory

import logging
import warnings

import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

%matplotlib inline

In [2]:
# we need ads 2.5.10 or greater
print(ads.__version__)

2.5.10


In [3]:
# authenticate with Resource Principal (RP)
set_auth(auth='resource_principal')



### First Part: train the model

In [4]:
#
# define the functions that identify the column categories
#
def cat_cols_selector(df, target_name):
    # the input is the dataframe
    
    # columns with fewer than THR distinct values are considered categorical
    THR = 10
    
    nunique = df.nunique()
    types = df.dtypes
    
    col_list = []
    
    for col in df.columns:
        if ((types[col] == 'object') or (nunique[col] < THR)):
            # print(col)
            if col != target_name:
                col_list.append(col)
    
    return col_list

def num_cols_selector(df, target_name):
    THR = 10
    
    types = df.dtypes
    nunique = df.nunique()
    
    col_list = []
    
    for col in df.columns:
        if (types[col] != 'object') and (nunique[col] >= THR): 
            # print(col)
            if col != target_name:
                col_list.append(col)
    
    return col_list

def load_as_dataframe(path):
    ds = DatasetFactory.open(path,
                             target="Attrition").set_positive_class('Yes')

    ds_up = ds.up_sample()

    # drop unneeded columns
    cols_to_drop = ['Directs','name', 'Over18','WeeklyWorkedHours','EmployeeNumber']

    ds_used = ds_up.drop(columns=cols_to_drop)
    
    df_used = ds_used.to_pandas_dataframe()
    
    

    # train/test split (done directly on the dataframes)
    df_train, df_test = train_test_split(df_used, shuffle=True, test_size=0.2, random_state = 1234)

    print("# of samples in train set", df_train.shape[0])
    print("# of samples in test set", df_test.shape[0])
    
    return df_train, df_test

In [5]:
# load the dataset and do upsampling
TARGET = 'Attrition'

attrition_path = "/opt/notebooks/ads-examples/oracle_data/orcl_attrition.csv"

df_train, df_test = load_as_dataframe(attrition_path)

# the Dataset class is still used to do the upsampling

cat_cols = cat_cols_selector(df_train, TARGET)
num_cols = num_cols_selector(df_train, TARGET)

print()
print(f'Numerical columns: {num_cols} ({len(num_cols)})')
print()
print(f'Categorical columns: {cat_cols} ({len(cat_cols)})')


# of samples in train set 1972
# of samples in test set 494

Numerical columns: ['Age', 'SalaryLevel', 'CommuteLength', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'YearsinIndustry', 'YearsOnJob', 'YearsAtCurrentLevel', 'YearsSinceLastPromotion', 'YearsWithCurrManager'] (13)

Categorical columns: ['TravelForWork', 'JobFunction', 'EducationalLevel', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TrainingTimesLastYear', 'WorkLifeBalance'] (17)


In [6]:
X_train, y_train = df_train.drop([TARGET], axis=1), df_train[TARGET]
X_test, y_test = df_test.drop([TARGET], axis=1), df_test[TARGET]

### Second Part: analysis on a new dataset

Here we load the model from the Model Catalog.

In [7]:
# take Model OCID from UI
MODEL_OCID = "ocid1.datasciencemodel.oc1.eu-frankfurt-1.amaaaaaangencdyasojemavtoshdggls4rg27i2qctcin6xz3yi3yevhnaha"

# load ADS model from Model Catalog
ads_model = SklearnModel.from_model_catalog(model_id=MODEL_OCID,
                                            model_file_name="model.pkl",
                                            artifact_dir=tempfile.mkdtemp())

Start loading model.joblib from model directory /tmp/tmpboallk34 ...
Model is successfully loaded.


In [8]:
# take the inner Sklearn pipeline
clf = ads_model.estimator

In [9]:
# restart from the dataset (in reality we'll have a new dataset, here we're simulating the changes)
df_train, df_test = load_as_dataframe(attrition_path)

X_test, y_test = df_test.drop([TARGET], axis=1), df_test[TARGET]


# of samples in train set 1972
# of samples in test set 494


In [10]:
# simulate some changes in the dataset
# we use again the test set, but with a "Data Drift"

X_test['SalaryLevel'] = X_test['SalaryLevel'] - 3000
X_test['Age'] = X_test['Age'] + 20

In [11]:
# scoring: compute new metrics
test_pred = clf.predict(X_test)
test_probas = clf.predict_proba(X_test)

print('Validation set result:')

roc_auc = round(roc_auc_score(y_test, test_probas[:,1]), 4)
acc = round(accuracy_score(y_test, test_pred), 4)

print(f"Acc: {acc}, AUC: {roc_auc}")

Validation set result:
Acc: 0.913, AUC: 0.9651


In [12]:
# we can see that the metrics are worse than those registered in the Model Catalog (accuracy 0.9494, ROC AUC 0.9951)
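
The comparison with the metrics registered in the Model Catalog can be made explicit. The sketch below parses the reference metrics string as it appears in the custom metadata shown later in this notebook (key `metrics on reference set`) and computes the change for each metric. The helper name `metric_deltas` and the threshold `MAX_DROP` are assumptions made for illustration; they are not part of ADS.

```python
import ast

# value of the "metrics on reference set" custom metadata entry
reference_metrics_str = "{'accuracy': 0.9494, 'roc_auc': 0.9951}"
# metrics computed above on the drifted test set
new_metrics = {"accuracy": 0.913, "roc_auc": 0.9651}

# hypothetical threshold: flag a metric that drops by more than 0.02
MAX_DROP = 0.02


def metric_deltas(reference, current):
    """Return {metric: current - reference} for metrics present in both dicts."""
    return {k: round(current[k] - reference[k], 4) for k in reference if k in current}


# the metadata value is a Python dict literal, so literal_eval can parse it safely
reference_metrics = ast.literal_eval(reference_metrics_str)
deltas = metric_deltas(reference_metrics, new_metrics)
degraded = [k for k, d in deltas.items() if d < -MAX_DROP]

print(deltas)
print("Degraded metrics:", degraded)
```

With the values from this notebook both metrics fall by more than the assumed threshold, so both would be flagged.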

In [13]:
# add a second row to the file

# read old file
model_metrics = pd.read_csv("model_metrics.csv")

# compute new row
now = datetime.now().strftime('%Y-%m-%d %H:%M')
dict_ref = [{
    "ts_date": now,
    "model_name": "lgb1",
    "algorithm": "lightgbm",
    "accuracy": acc,
    "roc_auc": roc_auc
}]

# new df
df_new_metrics = pd.DataFrame(dict_ref)

new_model_metrics = pd.concat([model_metrics, df_new_metrics], ignore_index=True)

# save to an updated file
new_model_metrics.to_csv("model_metrics.csv", index=False)
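
One possible way to load the metrics file into a database is sketched below with SQLite from the Python standard library. The column names are those written to `model_metrics.csv` above; the table name, the in-memory database and the timestamps in the sample CSV text are assumptions made for illustration (the first row is assumed to hold the reference metrics).

```python
import csv
import io
import sqlite3

# stand-in for model_metrics.csv: same columns as the rows written above
# (timestamps are made-up placeholders)
csv_text = """ts_date,model_name,algorithm,accuracy,roc_auc
2022-01-01 10:00,lgb1,lightgbm,0.9494,0.9951
2022-01-02 10:00,lgb1,lightgbm,0.913,0.9651
"""

conn = sqlite3.connect(":memory:")  # in-memory DB, only for this sketch
conn.execute(
    """CREATE TABLE model_metrics (
           ts_date TEXT, model_name TEXT, algorithm TEXT,
           accuracy REAL, roc_auc REAL)"""
)

# convert each CSV row to a tuple with numeric metrics
reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    (r["ts_date"], r["model_name"], r["algorithm"],
     float(r["accuracy"]), float(r["roc_auc"]))
    for r in reader
]
conn.executemany("INSERT INTO model_metrics VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# metrics history for one model, most recent first
history = conn.execute(
    "SELECT ts_date, accuracy, roc_auc FROM model_metrics "
    "WHERE model_name = ? ORDER BY ts_date DESC",
    ("lgb1",),
).fetchall()
print(history)
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open("model_metrics.csv", newline="")`.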

### Getting the reference dataset from the Model Catalog

In [20]:
# the custom metadata can be retrieved as a pandas DataFrame
meta_df = ads_model.metadata_custom.to_dataframe()

meta_df.head(10)

| | Key | Value | Description | Category |
|---|---|---|---|---|
| 0 | ClientLibrary | ADS | | Other |
| 1 | CondaEnvironment | oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/General Machine Learning for CPUs on Python 3.7/1.0/generalml_p37_cpu_v1 | The conda environment where the model was trained. | Training Environment |
| 2 | CondaEnvironmentPath | oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/General Machine Learning for CPUs on Python 3.7/1.0/generalml_p37_cpu_v1 | The URI of the training conda environment. | Training Environment |
| 3 | EnvironmentType | data_science | The conda environment type, can be published or datascience. | Training Environment |
| 4 | ModelArtifacts | score.py, model.joblib, runtime.yaml | The list of files located in artifacts folder. | Training Environment |
| 5 | ModelSerializationFormat | joblib | The model serialization format. | Training Profile |
| 6 | SlugName | generalml_p37_cpu_v1 | The slug name of the training conda environment. | Training Environment |
| 7 | metrics on reference set | {'accuracy': 0.9494, 'roc_auc': 0.9951} | Metrics evaluated on reference dataset | Performance |
| 8 | reference dataset | oci://drift_input@frqap2zhtzbe/reference.csv | Reference dataset url. From this dataset have been extracted train/validation dataset | Training and Validation Datasets |


In [36]:
from ads.model.generic_model import GenericModel

def get_reference_dataset_url(gen_model):
    # take the custom metadata as Pandas df
    meta_df = gen_model.metadata_custom.to_dataframe()
    
    # select the row whose key is "reference dataset"
    condition = (meta_df['Key'] == "reference dataset")
    ref_ds_url = meta_df.loc[condition, 'Value']

    # the result is a pandas Series with a single element: return its value
    return ref_ds_url.iloc[0]

In [42]:
MODEL_OCID = "ocid1.datasciencemodel.oc1.eu-frankfurt-1.amaaaaaangencdyasojemavtoshdggls4rg27i2qctcin6xz3yi3yevhnaha"

# load model from Model Catalog
# for reading custom metadata I can use GenericModel
generic_model = GenericModel.from_model_catalog(model_id=MODEL_OCID,
                                                # only for temporary use
                                                model_file_name="gen_model.pkl",
                                                artifact_dir=tempfile.mkdtemp())

ref_url = get_reference_dataset_url(generic_model)

Start loading model.joblib from model directory /tmp/tmp3xwuasve ...
Model is successfully loaded.


In [43]:
print(f"Reference Dataset url: {ref_url}")

Reference Dataset url: oci://drift_input@frqap2zhtzbe/reference.csv


In [45]:
# get the reference Dataset
ref_df = pd.read_csv(ref_url)

ref_df.head()

| | TravelForWork | MonthlyRate | PercentSalaryHike | CommuteLength | SalaryLevel | YearsOnJob | JobInvolvement | PerformanceRating | Gender | TrainingTimesLastYear | ... | HourlyRate | MonthlyIncome | OverTime | JobSatisfaction | EducationField | JobFunction | EducationalLevel | NumCompaniesWorked | StockOptionLevel | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | infrequent | 19146 | 22 | 2 | 5640 | 2 | 2 | 4 | Male | 2 | ... | 33 | 4775 | No | 4 | Life Sciences | Software Developer | L2 | 6 | 2 | 2 |
| 1 | none | 3395 | 23 | 2 | 5678 | 23 | 2 | 4 | Male | 3 | ... | 74 | 10748 | No | 3 | Life Sciences | Software Developer | L1 | 3 | 1 | 4 |
| 2 | infrequent | 4510 | 18 | 15 | 2022 | 5 | 3 | 3 | Female | 2 | ... | 72 | 4963 | Yes | 2 | Life Sciences | Software Developer | L4 | 9 | 3 | 3 |
| 3 | none | 17071 | 16 | 25 | 6782 | 1 | 4 | 3 | Female | 2 | ... | 100 | 13194 | Yes | 2 | Life Sciences | Product Management | L3 | 4 | 0 | 0 |
| 4 | infrequent | 18725 | 23 | 10 | 1980 | 4 | 3 | 4 | Male | 4 | ... | 96 | 2075 | No | 4 | Life Sciences | Software Developer | L4 | 3 | 2 | 3 |
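
With the reference dataset available, each numerical feature of a new dataset can be compared against it. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the maximum distance between the two empirical CDFs) using only the standard library. The function name `ks_statistic` is an assumption made for this example; the `SalaryLevel` values are the five shown in `ref_df.head()` above, and the "new" sample is the same data with the -3000 shift used earlier to simulate drift. In practice `scipy.stats.ks_2samp` provides the same statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_max = 0.0
    for x in a + b:
        # ECDF at x = fraction of values <= x
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max


# SalaryLevel values of the five reference rows displayed above
salary_ref = [5640, 5678, 2022, 6782, 1980]
# the same values with the simulated drift: SalaryLevel - 3000
salary_new = [v - 3000 for v in salary_ref]

d_same = ks_statistic(salary_ref, salary_ref)
d_shift = ks_statistic(salary_ref, salary_new)
print(f"KS (identical samples): {d_same}")
print(f"KS (shifted by -3000):  {d_shift}")
```

Five rows are far too few for a meaningful test; the point here is only the mechanics, which stay the same when the full reference and new columns are passed in.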
