# Performance Monitoring

<img src="img/mlops_prediction_serving_and_monitoring.png" width=1200/>

## Stakeholders
Monitoring ML models involves several interested parties, and the monitoring requirements should be gathered from each of them. A typical set of stakeholders is the following:
- **Data scientists:** evaluate model performance and watch for data drift that might degrade it.
- **Software engineers:** need metrics that show whether their products have reliable and correct access to the APIs serving the models.
- **Data engineers:** ensure that the data pipelines deliver data reliably, at the right velocity, and in line with the correct schemas.
- **Business/product stakeholders:** are interested in the core impact of the overall solution on their customer base.

## Monitoring dimensions
The most widely used dimensions of monitoring in the ML industry are the following:
- **Data drift:** This corresponds to significant changes in the input data used for either training or inference. It might indicate a change of the modeled premise in the real world, which would require the model to be retrained, redeveloped, or even archived if it is no longer suitable. It can be detected by comparing the distribution of the data used to train the model with the distribution of the data used for scoring or inference over time.
- **Target drift:** In line with regime changes in the input data, the distribution of the model's outcomes often shifts over a period of time as well. Common comparison periods are months, weeks, or days. A shift might indicate a significant change in the environment that requires the model to be redeveloped or tweaked.
- **Model (performance) drift:** This involves checking whether performance metrics, such as accuracy for classification problems or root mean square error for regression problems, gradually worsen over time. Such a decline indicates an issue with the model that requires investigation and action from the model developer or maintainer.
- **Platform and infrastructure metrics:** These metrics are not directly related to modeling but to the systems infrastructure that hosts the model. They cover abnormal CPU, memory, network, or disk usage, which is likely to affect the model's ability to deliver value to the business.
- **Business metrics:** In some circumstances, critical business metrics, such as the profitability of the models, should be added to model operations so that the team responsible for the model can monitor its ability to deliver on its business premise.
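
As a rough illustration of the distribution comparison described under data drift, the sketch below computes the Population Stability Index (PSI) for a single numeric feature. It is not how Evidently detects drift. The function name, the equal-width binning, the `eps` smoothing, and the synthetic samples are all assumptions made for this example.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of the reference sample;
    current values outside that range are clipped into the end bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Synthetic data standing in for one feature (hypothetical values)
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # training data
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # scoring data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # scoring data, mean shifted

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with shifted mean: {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and a PSI above 0.25 as a significant shift, but suitable thresholds depend on the feature and the use case.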

An emerging open source tool for monitoring model performance is [Evidently AI](https://evidentlyai.com/).

Reference: https://github.com/evidentlyai/evidently/

In [None]:
# Run in DS-Workbench Jupyter terminal
# https://evidentlyai.com
# pip install evidently

In [None]:
# Run in DS-Workbench Jupyter terminal
# pip install xgboost

In [None]:
import os

import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.model_selection import train_test_split

import mlflow

from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab, NumTargetDriftTab, CatTargetDriftTab, ClassificationPerformanceTab

from data_utils import get_train_test_split_for_stock
from config import *

Get a reference dataset, which is essentially the training dataset. We keep the feature names as columns of the pandas DataFrame so that Evidently can use them in the drift reports.

## Reference dataset (used for model training, golden dataset)

In [None]:
df_ref = pd.read_csv(os.path.join(PATH_TO_DATA_PIPELINE, "training", "data.csv"))
df_ref

## Input data for evaluation
These are the input values for which we want to make predictions. We load the input data for scoring in order to calculate the difference in distribution between the reference training set and the data to be scored.

In [None]:
df_input = pd.read_csv(os.path.join(PATH_TO_DATA, "performance_monitoring/input/input_data_for_evaluation.csv"))
df_input

In [None]:
# There is no TARGET in this dataset
df_input.shape

## Scored input dataset for evaluation
This dataset is the input data for evaluation together with the target values predicted by the model (input data for evaluation + predictions).

In [None]:
df_input_scored = pd.read_csv(os.path.join(PATH_TO_DATA, "performance_monitoring/input/scored_input_data_for_evaluation.csv"))
df_input_scored

## Set Experiment

In [None]:
mlflow.set_experiment('SP_Model_Monitoring')

## Data (input features) drift

Compare recent data with past data. Learn which features changed and whether key model drivers shifted.

In [None]:
with mlflow.start_run(run_name="Data drift") as run:
    
    drift_dashboard = Dashboard(tabs=[DataDriftTab()])
    drift_dashboard.calculate(df_ref, df_input_scored, column_mapping = None)
    
    drift_dashboard.save(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "input_data_drift.html"))
    drift_dashboard._save_to_json(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "input_data_drift.json"))
    
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "input_data_drift.html"))
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "input_data_drift.json"))

## Target and prediction drift
Understand how model predictions and the target change over time. If the ground truth is delayed, this helps catch model decay in advance.

Reference: https://evidentlyai.com/blog/evidently-014-target-and-prediction-drift
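
To make the idea of prediction drift concrete, here is a minimal sketch that compares the class shares of predictions from two periods using the total variation distance. This is an illustrative measure, not the statistic Evidently computes. The helper names, the sample predictions, and `DRIFT_THRESHOLD` are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Map each class label to its share of the sample."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(reference_labels, current_labels):
    """Half the sum of absolute differences between class shares (0 = identical, 1 = disjoint)."""
    ref = class_shares(reference_labels)
    cur = class_shares(current_labels)
    classes = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(c, 0.0) - cur.get(c, 0.0)) for c in classes)


# Hypothetical predictions: 30% positives in the reference period, 45% in the current one
reference_predictions = [1] * 30 + [0] * 70
current_predictions = [1] * 45 + [0] * 55

tvd = total_variation_distance(reference_predictions, current_predictions)
DRIFT_THRESHOLD = 0.1  # assumed tolerance, tune per use case
print(f"Total variation distance: {tvd:.2f}, drift flagged: {tvd > DRIFT_THRESHOLD}")
```

Because this check only needs predictions, it can run before the ground truth arrives, which is the situation the paragraph above describes.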

In [None]:
with mlflow.start_run(run_name="Target drift") as run:
    
    model_target_drift = Dashboard(tabs=[CatTargetDriftTab()])
    model_target_drift.calculate(df_ref, df_input_scored)
    
    model_target_drift.save(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "target_drift.html"))
    model_target_drift._save_to_json(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "target_drift.json"))
    
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "target_drift.html"))
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "target_drift.json"))

## Model drift
Monitoring model drift is extremely important to ensure that your model is still delivering at its optimal performance level. From this analysis, you can make a decision on whether to retrain your model or even develop a new one from scratch.

Reference: https://evidentlyai.com/blog/evidently-018-classification-model-performance
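
Before building the full report, the retrain-or-not decision can be sketched as a simple check: compute accuracy per period and flag periods where it falls too far below the reference accuracy. The function names, the `max_drop` tolerance, and the labels below are assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def degraded_periods(reference_accuracy, period_results, max_drop=0.05):
    """Return (period, accuracy) pairs whose accuracy fell more than max_drop below the reference."""
    flagged = []
    for period, (y_true, y_pred) in period_results.items():
        acc = accuracy(y_true, y_pred)
        if reference_accuracy - acc > max_drop:
            flagged.append((period, acc))
    return flagged


# Hypothetical labels and predictions for the reference set and two production weeks
reference_acc = accuracy(
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
)  # 9 of 10 correct
weekly = {
    "week_1": ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]),  # 9 of 10 correct
    "week_2": ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]),  # 6 of 10 correct
}
print(degraded_periods(reference_acc, weekly))
```

Only `week_2` is flagged here, since its accuracy drops well beyond the assumed tolerance of 0.05.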

In [None]:
# Reload the data: the original DataFrames appear to be overwritten by earlier cells (possibly a bug)
df_ref = pd.read_csv(os.path.join(PATH_TO_DATA_PIPELINE, "training", "data.csv"))
df_input_scored = pd.read_csv(os.path.join(PATH_TO_DATA, "performance_monitoring/input/scored_input_data_for_evaluation.csv"))

In [None]:
# Reference data
X_train = df_ref.loc[:, df_ref.columns != 'target']
y_train = df_ref.loc[:, df_ref.columns == 'target']

X_train.shape, y_train.shape

In [None]:
#X_train

In [None]:
#y_train

In [None]:
# Data from production (recent data)
X_test = df_input_scored.iloc[:, :-1]
y_test = df_input_scored.iloc[:, -1]

X_test.shape, y_test.shape

In [None]:
# Get logistic regression model
logged_model = '/data/artifacts/3/135560bf2a324a609db4b0950d48fd5b/artifacts/model' # logistic reg

# Load model as a PyFuncModel.
model = mlflow.pyfunc.load_model(logged_model)

with mlflow.start_run(run_name="Model drift") as run:
    
    train_proba_predict = model.predict(X_train) # reference
    test_proba_predict = model.predict(X_test) # production
    
    train_predictions = [1. if y_cont > CLASS_THRESHOLD else 0. for y_cont in train_proba_predict]
    test_predictions = [1. if y_cont > CLASS_THRESHOLD else 0. for y_cont in test_proba_predict]
       
    # Add target and prediction columns
    ref_model_results = X_train.copy()
    ref_model_results['target'] = y_train
    ref_model_results['prediction'] = train_predictions
    latest_model_results = X_test.copy()
    latest_model_results['target'] = y_test
    latest_model_results['prediction'] = test_predictions
    
    model_performance = Dashboard(tabs=[ClassificationPerformanceTab()])
    model_performance.calculate(ref_model_results, latest_model_results)#, column_mapping=column_mapping)
    
    model_performance.save(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "model_drift.html"))
    model_performance._save_to_json(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "model_drift.json"))
    
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "model_drift.html"))
    mlflow.log_artifact(os.path.join(PATH_TO_PERFORMANCE_REPORTS, "model_drift.json"))

## Show 
- MLflow HTML report
- /data/reports_data_drift/input_data_drift.json

## Infrastructure monitoring and alerting
We should use infrastructure monitoring tools, such as AWS CloudWatch, and then report the results to MLflow.

At a high level, we can split the infrastructure monitoring and alerting components into the following three items:
- **Resource metrics:** metrics about the hardware infrastructure where the system is deployed (CPU utilization, memory utilization, network data transfer, disk I/O).
- **System metrics:** metrics about the system infrastructure where the system is deployed (request throughput, request latencies, validation metrics).
- **Alerting:** any of these metrics can be used for alerting by setting a threshold that we consider acceptable.
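
The threshold-based alerting described above could look roughly like the following sketch. It does not use any CloudWatch or MLflow API. The `AlertRule` class, the metric names, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float
    direction: str = "above"  # "above": alert when value > threshold; "below": when value < threshold


def evaluate_alerts(metrics, rules):
    """Return a message for every rule whose metric breaches its threshold."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported in this snapshot
        if rule.direction == "above":
            breached = value > rule.threshold
        else:
            breached = value < rule.threshold
        if breached:
            alerts.append(f"{rule.metric}={value} is {rule.direction} threshold {rule.threshold}")
    return alerts


# Hypothetical rules and metric snapshot
rules = [
    AlertRule("cpu_utilization_pct", 85.0),
    AlertRule("p95_latency_ms", 500.0),
    AlertRule("requests_per_second", 1.0, direction="below"),
]
snapshot = {"cpu_utilization_pct": 92.5, "p95_latency_ms": 310.0, "requests_per_second": 0.2}

alerts = evaluate_alerts(snapshot, rules)
for message in alerts:
    print(message)
```

In practice the monitoring tool itself usually evaluates such rules. The sketch only shows the logic of comparing a resource or system metric against an acceptable threshold.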