## Synopsis

This notebook demonstrates how to orchestrate monitoring jobs in the ML lifecycle, one of the core principles of MLOps. It gives an overview of how to set up monitoring mechanisms on ML pipeline code. Monitoring plays an essential role in letting a self-governing ML model decide when to re-train itself.
Kindly note: this notebook is a continuation of the [Pipeline Notebook](https://)


### Dataset
The dataset used in the example below was cloned from the Kaggle platform: [link](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease) 
Description: 2020 annual CDC survey data of 400k adults covering their health status, in the context of having heart disease. (Additional information is available at the Kaggle link.)

Scope of Notebook:

1. **Monitoring**: It covers the following monitoring sub-modules:
* Model Drift: Decay of model performance in the production environment, observed on new monitoring data.
* Data Drift: Deviation in feature distributions between the production data (the data the model was built on) and the new validation data (monitoring data).
* Feature Score: Change in the relationship between the input features (categorical and continuous variables) and the label.


Limitations:

1. This exercise covers the monitoring piece in isolation, so that it can be consumed by plugging it into the respective project. The code is not linked to any project pipeline and is version-controlled through an external Git repository.


#### Data Versioning: (Hugging Face)
`https://huggingface.co/datasets/mozay22/heart_disease/tree/main` 

#### Code Repo: (GitHub)
`https://github.com/mohdtaher2022/ML_Ops_Practices` 


#### Pre-Requisites

Python, SQL, PySpark, ML Lifecycle, Statistics

Kindly note: the notebook was executed on a remote server, so path references may need to be changed on your host server. The installed Python packages are printed below so that the results of the exercise can be reproduced with matching versions.

### Setting up Spark Infrastructure and Pre-requisite Libraries  

In [2]:
%%capture
!apt-get install openjdk-8-jdk


In [3]:
import os
#Set the JAVA_HOME env variable
os.environ["JAVA_HOME"]="/usr/lib/jvm/java-8-openjdk-amd64"


In [4]:
%%capture
!echo $JAVA_HOME
!pip install pyspark==3.0.0
!pip install -q findspark
!pip install datasets


In [1]:
# Avoids scroll-in-the-scroll in the entire Notebook
from IPython.display import Javascript
def resize_colab_cell():
  display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 5000})'))
get_ipython().events.register('pre_run_cell', resize_colab_cell)

### Downloading Dataset

In [5]:
!git lfs install
!rm -rf heart_disease
!git clone https://huggingface.co/datasets/mozay22/heart_disease


Error: Failed to call git rev-parse --git-dir --show-toplevel: "fatal: not a git repository (or any of the parent directories): .git\n"
Git LFS initialized.
Cloning into 'heart_disease'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 30 (delta 10), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (30/30), done.
Filtering content: 100% (4/4), 29.13 MiB | 12.66 MiB/s, done.


In [6]:
os.environ['dir'] = os.getcwd()
os.environ['repo'] = 'heart_disease'
os.environ['file_1'] = 'test_df.zip'
os.environ['file_2'] = 'validation_df.zip'
os.environ['repo_2'] = 'ML_Ops_Practices'
os.environ['file_3'] = 'Models/saved_models_1.zip'


In [7]:
%%capture
!unzip $dir/$repo/$file_2 -d output/
!unzip $dir/$repo/$file_1 -d output/


#### Importing Libraries

In [8]:
# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

#import the necessary dependencies
import sys
import os
import operator
import json

# Importing Specific Dataset of Heart Disease
from datasets import load_dataset


# data wrangling
import numpy as np
import pandas as pd
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pyspark.sql.types  as st
import pyspark.sql.functions  as sf
from pyspark.sql.functions import rand 
from pyspark.sql.window import Window
from pyspark.sql import SparkSession, types as T, functions as F
from pyspark.sql.functions import udf
pd.options.display.html.table_schema = True


# machine learning
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import Transformer
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel \
, GBTClassifier , GBTClassificationModel, LogisticRegression, LogisticRegressionModel
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics ,BinaryClassificationMetrics
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.stat import ChiSquareTest


# Stats Modules
# Stats libs
import scipy
import statsmodels.api as sm
from statsmodels.formula.api import ols

# KS Test
from scipy import stats

# Sklearn Model
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, roc_auc_score
from sklearn.metrics import *

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# misc
import math
from operator import add
from functools import reduce
from datetime import datetime
import operator
import re
import random

# Dropping the display of Scientific Notations.
# for pandas 
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# for Numpys
np.set_printoptions(suppress=True,formatter={'float_kind':'{:16.3f}'.format}, linewidth=130)


#### Loading Dataset

In [None]:
%%time
# Building App using Spark Session
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "8g") \
    .appName('my-cool-app') \
    .getOrCreate()
sc=spark.sparkContext
spark.conf.set("spark.sql.shuffle.partitions", "5")

# To read from Local File
# os.path.join discards os.getcwd() when the second argument is absolute, so relative paths are used here.
load_path = os.path.join(os.getcwd(), "output/content/test_df.parquet")
load_path_val = os.path.join(os.getcwd(), "output/content/validation_df.parquet")

test_ = spark.read.parquet(load_path) # test data, already transformed in the previous exercise
val_ = spark.read.parquet(load_path_val) # New Validation Data (Monitoring Data) 
test_.show(3); val_.show(3);

+-----+--------------+------------+---------+----------------+-----------+-------------------+----------+---------------+----------+--------------------+----------+-----------------+--------------+-----------------------------+----------------------------------+----------+----------+-------------+----------+---------------------+--------------------+-----------------------+
|  BMI|PhysicalHealth|MentalHealth|SleepTime|HeartDisease_Yes|Smoking_Yes|AlcoholDrinking_Yes|Stroke_Yes|DiffWalking_Yes|Sex_Female|PhysicalActivity_Yes|Asthma_Yes|KidneyDisease_Yes|SkinCancer_Yes|Diabetic_Yes_during_pregnancy|Race_American_IndianAlaskan_Native|Race_Asian|Race_Black|Race_Hispanic|Race_White|GenHealth_transformed|Diabetic_transformed|AgeCategory_transformed|
+-----+--------------+------------+---------+----------------+-----------+-------------------+----------+---------------+----------+--------------------+----------+-----------------+--------------+-----------------------------+-------------------

### Loading Code from Git

In [None]:
!rm -rf ML_Ops_Practices
!git clone https://github.com/mohdtaher2022/ML_Ops_Practices.git

Cloning into 'ML_Ops_Practices'...
remote: Enumerating objects: 251, done.

### Loading Modules from Git.

In [None]:
# Functions
helper_func = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Functions/helper_functions.py')).read()
pipeline_func = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Functions/pipeline_func.py')).read()
retrain_n_validation = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Functions/retrain_n_validation.py')).read()
# Configs
feature_eng_configs = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Configs/feature_eng_config.py')).read()
model_params = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Configs/model_params.py')).read()
# Variables
variables = open(os.path.join(os.getcwd(),'ML_Ops_Practices/Utilities/Variables/env_var.py')).read()

# Variables
exec(variables) ;
# Functions
exec(helper_func) ; exec(pipeline_func); exec(retrain_n_validation);  
# Configs
exec(feature_eng_configs) ; exec(model_params) ; 

#### Loading Model Artifacts

In [None]:
# Loading model Artifacts
# Threshold 
train_artifacts_loc = os.path.join(os.getcwd(), 'ML_Ops_Practices/Model_artifacts/recall_pr_threshold.json')
train_model_info =  load_json(train_artifacts_loc)
print(train_model_info)

# Feature Data types
feature_path_ = os.path.join(os.getcwd(), 'ML_Ops_Practices/Model_artifacts/feature_dtypes.json')
model_features = load_json(feature_path_)

{'Model_name': 'LogisticRegression', 'selected_metric': 'precision', 'precision': '0.4', 'recall': '0.27474462839027824', 'cut-off': '0.22'}


### Terminologies

Two datasets are used throughout the exercise:

**Test Data / Prod Data**: The dataset on which the model was evaluated. Model and feature scores are generated from this dataset and compared against the new dataset.

**Monitoring Data / Validation Data**: New inflow data streamed into the business. Scores are evaluated based on the model's performance on this validation dataset.

#### Feature Eng on Monitoring Data (Validation Data)
( Executed Through Pipeline shown in Pipeline 1)

In [None]:

# All feature engineering steps are saved in the module feature_eng_config.py.
# Columns to be excluded can be altered in env_var.py.

# Test Data: test split from the train/test split.

# create_feature_pipeline was not applied before the test data was saved,
# because the added feature vector column almost doubles the data size.

Feature_eng_Pipeline =  Pipeline(stages=[create_feature_pipeline])
Featpip = Feature_eng_Pipeline.fit(test_)
test_transform = Featpip.transform(test_)

# Monitoring Data
# Pipeline Execution for Feature Engineering Steps (Validation Data)
Feature_eng_Pipeline_val =  Pipeline(stages=[step_1_one_hot_enc, step_2_diabetic_enc_pregnancy,
                                             step_3_ordinal_mapping, step_4_regex, create_feature_pipeline])
Featpip_val = Feature_eng_Pipeline_val.fit(val_)
val_transform = Featpip_val.transform(val_)

In [None]:
test_transform.show(3)
val_transform.show(3)

+-----+--------------+------------+---------+----------------+-----------+-------------------+----------+---------------+----------+--------------------+----------+-----------------+--------------+-----------------------------+----------------------------------+----------+----------+-------------+----------+---------------------+--------------------+-----------------------+--------------------+
|  BMI|PhysicalHealth|MentalHealth|SleepTime|HeartDisease_Yes|Smoking_Yes|AlcoholDrinking_Yes|Stroke_Yes|DiffWalking_Yes|Sex_Female|PhysicalActivity_Yes|Asthma_Yes|KidneyDisease_Yes|SkinCancer_Yes|Diabetic_Yes_during_pregnancy|Race_American_IndianAlaskan_Native|Race_Asian|Race_Black|Race_Hispanic|Race_White|GenHealth_transformed|Diabetic_transformed|AgeCategory_transformed|            features|
+-----+--------------+------------+---------+----------------+-----------+-------------------+----------+---------------+----------+--------------------+----------+-----------------+--------------+-------

### Executing Monitoring Pipeline.

1. Model Scores.
2. Feature Scores.
3. Data Drift

### 1. Model Scores

#### Loading Model.

In [None]:
%%capture
!unzip $dir/$repo_2/$file_3 -d output/
# (Selected Model was pushed to Git post retraining as Zip file)
prod_model = loadModel(input_model = 'LogisticRegression')

### Monitoring Pipeline Steps & Execution
The pipeline steps below generate the model score summary on the validation dataset (recall, precision, accuracy, ROC_AUC, avg_pr, etc.).
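
Before the pipeline cells, here is a minimal plain-Python sketch of what applying a cut-off and scoring the positive class amounts to. It is not the code behind `model_inference_pipeline` or `model_scores_pipeline`; the helper names `apply_cutoff` and `precision_recall`, the `>=` comparison, and the sample probabilities are assumptions made for illustration. Only the cut-off value 0.22 comes from `recall_pr_threshold.json`.

```python
def apply_cutoff(probabilities, cutoff):
    """Label a row positive (1.0) when its probability reaches the cut-off (assumed >=)."""
    return [1.0 if p >= cutoff else 0.0 for p in probabilities]


def precision_recall(labels, predictions, positive=1.0):
    """Precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y == positive and y_hat == positive)
    fp = sum(1 for y, y_hat in pairs if y != positive and y_hat == positive)
    fn = sum(1 for y, y_hat in pairs if y == positive and y_hat != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative values only, not taken from the heart disease dataset.
probs = [0.05, 0.30, 0.25, 0.10, 0.60, 0.15]
labels = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]

preds = apply_cutoff(probs, cutoff=0.22)  # 0.22 is the stored 'cut-off'
precision, recall = precision_recall(labels, preds)
print(preds, round(precision, 3), round(recall, 3))
```

Lowering the cut-off below the default 0.5 trades precision for recall on the "Heart Disease Yes" class, which is why the threshold is stored as a model artifact and reused during monitoring.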


In [None]:
# step 1
# Model Cut-off Threshold:
model_threshold_ = float(train_model_info['cut-off'])  # np.float is deprecated and removed in recent NumPy
#  Model Inference
Model_inference_step1 = model_inference_pipeline(model_ = prod_model,apply_cutoff=True, model_threshold = model_threshold_)

# Step 2
# Label Dict
label_dictionary_ = {0.0: "Heart Disease No", 1.0: "Heart Disease Yes"}
# Model Scores
model_scores_step2 =  model_scores_pipeline(prediction_col = 'prediction_with_threshold',
                                        lable_col =  'HeartDisease_Yes',
                                        label_dict =  label_dictionary_, prob_col = 'prob_raw', 
                                        model_verion = 1 , model_name =  train_model_info['Model_name'])

In [None]:
# Pipeline Orchestration: Model Performance 
monitoring_Pipeline =  Pipeline(stages=[Model_inference_step1, model_scores_step2])

# Monitoring Pipeline (Test)
monitoring_pc = monitoring_Pipeline.fit(test_transform)
test_transform_summary = monitoring_pc.transform(test_transform)

# Monitoring Pipeline (val)
monitoring_pc_val = monitoring_Pipeline.fit(val_transform)
val_transform_summary = monitoring_pc_val.transform(val_transform)

In [None]:
# Display of model results on monitoring Data. 
val_transform_summary.show()

+-------------+-----------------+--------------------+------------------+------------------+------------+
|Model_version|   Label_category|         Metric_name|      Metric_value|        classifier|capture_date|
+-------------+-----------------+--------------------+------------------+------------------+------------+
|            1| Heart Disease No|           precision|0.9342530889293479|LogisticRegression|  2022-11-16|
|            1| Heart Disease No|              recall|0.9624601657106437|LogisticRegression|  2022-11-16|
|            1| Heart Disease No|          F1 Measure|0.9481468857634903|LogisticRegression|  2022-11-16|
|            1|Heart Disease Yes|           precision|0.4042768386071377|LogisticRegression|  2022-11-16|
|            1|Heart Disease Yes|              recall|0.2733222623815571|LogisticRegression|  2022-11-16|
|            1|Heart Disease Yes|          F1 Measure|0.3261452383727707|LogisticRegression|  2022-11-16|
|            1| Weighted_Overall|     Weighted

### Feature Scores

Feature scores measure the relationship between the Y label and each input feature, in other words, whether a change in an X variable influences the Y variable in a statistically significant way.

Chi-square test: categorical-to-categorical test, assessed by p-value significance.  
ANOVA test: continuous-to-categorical test, assessed by p-value significance.  

*A lower p-value indicates a more significant relationship.

Both tests help determine the level of significance of each feature.  
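
To make the chi-square score concrete, the sketch below computes Pearson's chi-square statistic and its p-value for a single 2x2 table (one binary feature against the binary label). It is an illustration in plain Python, not the implementation inside `input_feature_pipeline`; the function name `chi_square_2x2` and the counts are made up, and no continuity correction is applied.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and label.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, for which the chi-square
    # survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: feature = 1 / 0, columns: label = 1 / 0 (hypothetical counts).
dependent = [[30, 70], [10, 90]]
independent = [[20, 80], [20, 80]]

stat_dep, p_dep = chi_square_2x2(dependent)
stat_ind, p_ind = chi_square_2x2(independent)
print(stat_dep, p_dep)
print(stat_ind, p_ind)
```

The first table gives a small p-value (the feature and label are related), while the second gives a p-value of 1 (no relationship). For continuous features, `scipy.stats.f_oneway` applied to the feature values grouped by label would play the analogous role for the ANOVA test.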

In [None]:
# Features Score
feature_score_step = input_feature_pipeline(cat_var = model_features['category_col'] ,
                                            cont_var = model_features['Cont_cols'],
                                            label_feature = 'HeartDisease_Yes', model_version = 1)

In [None]:
# Feature Score Pipeline
feature_scores_pipeline =  Pipeline(stages=[feature_score_step])

# Monitoring Pipeline (Test)
feature_score_pc = feature_scores_pipeline.fit(test_transform)
feature_summary_test = feature_score_pc.transform(test_transform)

# Monitoring Pipeline (val)
feature_score_pc_val = feature_scores_pipeline.fit(val_transform)
feature_summary_val = feature_score_pc_val.transform(val_transform)

------------------- Calculating Chi scores-------------------
------------------- Calculating Anova scores-------------------
------------------- Calculating Chi scores-------------------
------------------- Calculating Anova scores-------------------


In [None]:
feature_summary_val.show()

+-------------+--------------+--------------------+--------------------+------------+------------+
|Model_version|Label_category|        Feature_name|         Metric_name|Metric_value|capture_date|
+-------------+--------------+--------------------+--------------------+------------+------------+
|            1|   Categorical|         Smoking_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical| AlcoholDrinking_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical|          Stroke_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical|     DiffWalking_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical|          Sex_Female|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical|PhysicalActivity_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|            1|   Categorical|          Asthma_Yes|Chi-Square test -...|         0.0|  2022-11-16|
|         

### Data Drift


Data drift compares the statistical distributions of two datasets, i.e. production and monitoring data. Two types of test are performed:

Jensen-Shannon divergence test: measures the divergence between the distributions of a categorical feature in the two datasets. The higher the score, the higher the divergence.

E.g. suppose that, in the data inflow, records for asthma patients are no longer captured in the new production environment. With the JS divergence test we can monitor the level of divergence within that categorical variable.
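
The asthma example can be sketched in a few lines of plain Python. This is only an illustration of the Jensen-Shannon calculation, not the code inside `data_drift`; the helper names, the base-2 logarithm (which bounds the score between 0 and 1), and the sample counts are assumptions.

```python
import math
from collections import Counter


def category_distribution(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    return [counts.get(category, 0) / len(values) for category in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


categories = ["Yes", "No"]
prod_asthma = ["Yes"] * 20 + ["No"] * 80       # hypothetical production data
monitor_same = ["Yes"] * 20 + ["No"] * 80      # same distribution
monitor_missing = ["No"] * 100                 # asthma cases no longer recorded

p = category_distribution(prod_asthma, categories)
js_same = js_divergence(p, category_distribution(monitor_same, categories))
js_missing = js_divergence(p, category_distribution(monitor_missing, categories))
print(js_same, js_missing)
```

Identical distributions score 0, and the score rises once the "Yes" category disappears from the monitoring data.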

Kolmogorov–Smirnov test: compares the distributions of a continuous feature in the two datasets. The score is measured through p-value significance: the lower the p-value, the higher the divergence.
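
As a rough illustration of the KS test, the sketch below computes the two-sample statistic (the largest gap between the two empirical CDFs) and an asymptotic p-value in plain Python. It is not the code inside `data_drift`, and for small samples its p-value can differ from the exact methods in `scipy.stats.ks_2samp`; the function name and sample values are assumptions.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Two-sample KS statistic with an asymptotic p-value."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)

    # D: largest absolute gap between the empirical CDFs, checked at every observed value.
    d_stat = max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        for x in a + b
    )

    # Asymptotic Kolmogorov distribution:
    # Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)
    lam = math.sqrt(n * m / (n + m)) * d_stat
    if lam < 0.3:
        return d_stat, 1.0  # the series converges slowly here and Q(lam) is ~1
    p_value = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101)
    )
    return d_stat, min(max(p_value, 0.0), 1.0)


prod_values = list(range(100))                    # hypothetical continuous feature
monitor_same = list(range(100))                   # unchanged distribution
monitor_shifted = [x + 50 for x in prod_values]   # distribution shifted upwards

d_same, p_same = ks_two_sample(prod_values, monitor_same)
d_shift, p_shift = ks_two_sample(prod_values, monitor_shifted)
print(d_same, p_same)
print(d_shift, p_shift)
```

The unchanged sample yields a p-value of 1, while the shifted sample yields a p-value far below 0.05, which would be read as drift in that feature.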



In [None]:
# Data Drift Summary
data_drif_summary = data_drift(df1 = test_transform, df2 = val_transform , cat_feature = model_features['category_col'],
                               cont_feature = model_features['Cont_cols'], dep_var = 'HeartDisease_Yes', model_version =  1)

In [None]:
data_drif_summary.show()

+-------------+--------------+--------------------+--------------------+------------+------------+
|Model_version|Label_category|        Feature_name|         Metric_name|Metric_value|capture_date|
+-------------+--------------+--------------------+--------------------+------------+------------+
|            1|   Categorical|         Smoking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical| AlcoholDrinking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Stroke_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|     DiffWalking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Sex_Female|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|PhysicalActivity_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Asthma_Yes|       JS_Divergence|         0.0|  2022-11-16|
|         

## Log Structure For Monitoring

Log tables are master tables to which new monitoring data is appended over time, whenever new data is generated within the business. A log table records how the performance of the model and the feature scores shift over time. It is referred to by the model re-train modules to determine whether the model has entered model drift or concept drift, since it shows the gap between performance on production data and on monitoring data.




#### Terminologies

**Base score**: Score generated by the model on the base dataset, i.e. the scores generated the first time the model was evaluated. The base score is stored to track how far later scores have moved from the original model.

**Prod score**: Latest score generated by the production model, refreshed after each retrain.

**Monitoring score**: Score generated from the latest monitoring data recorded within the system.
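
The three scores combine into one log row per metric. The sketch below shows that row as a plain Python dictionary, using the column names of the notebook's log table and the `roc_auc` values from its output; the helper names `score_delta` and `build_log_entry` are hypothetical and the notebook itself builds the rows with Spark SQL.

```python
from datetime import date


def score_delta(prod_score, monitor_score, digits=3):
    """Relative change of the monitoring score against the production score."""
    if prod_score == 0:
        # Mirrors the null shown in the notebook's feature log when the prod score is 0.
        return None
    return round(monitor_score / prod_score - 1, digits)


def build_log_entry(metric_name, base_score, prod_score, monitor_score):
    """One row of the monitoring log for a single metric."""
    return {
        "Metric_name": metric_name,
        "Base_Model_Metric_Score": base_score,
        "Prod_Model_Metric_Score": prod_score,
        "Monitor_Metric_Score": monitor_score,
        "Score_delta": score_delta(prod_score, monitor_score),
        "capture_date": date.today().isoformat(),
    }


entry = build_log_entry(
    "roc_auc",
    base_score=0.8318491969096977,
    prod_score=0.8318491969096977,
    monitor_score=0.832689617799035,
)
print(entry)
```

A negative `Score_delta` means the model performs worse on the monitoring data than on the production data; before any retrain the base and prod scores are identical.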

### Model Performance Monitoring Log Structure. 

In [None]:
# Monitoring Log Table Creation
test_transform_summary.createOrReplaceTempView('prod_model_score')  # registerTempTable is deprecated

# Log table Schema.
Log_table =  spark.sql("""SELECT  Model_version AS Model_version, Label_category,  Metric_name, Metric_value as Base_Model_Metric_Score, 
                            Metric_value as Prod_Model_Metric_Score, Metric_value as Monitor_Metric_Score, Metric_value as Score_delta , 
                            capture_date AS capture_date from prod_model_score""")

Log_table_structure_df =  spark.createDataFrame([],Log_table.schema)

In [None]:

# Monitoring Log Table Creation
test_transform_summary.createOrReplaceTempView('prod_model_score')
test_transform_summary.createOrReplaceTempView('base_model_score')
# Registering New Monitoring Info Received.
val_transform_summary.createOrReplaceTempView('Monitoring_score')


new_monitor_log = spark.sql("""SELECT A.Model_version, A.Label_category, A.Metric_name, B.Metric_value AS Base_Model_Metric_Score,  
                               A.Metric_value AS Prod_Model_Metric_Score,  C.Metric_value AS Monitor_Metric_Score, 
                              ROUND((C.Metric_value / A.Metric_value) - 1, 3) AS Score_delta , CURRENT_DATE() AS capture_date FROM 
                               prod_model_score A LEFT JOIN base_model_score B ON A.Metric_name = B.Metric_name AND A.Label_category =B.Label_category
                                                  LEFT JOIN Monitoring_score C ON A.Metric_name = C.Metric_name AND A.Label_category =C.Label_category""")
# Monitoring score captured
new_monitor_log.createOrReplaceTempView('Monitoring_score_new_log_entry')

# Capturing Monitoring info in Log Table with Delta.
Log_table_structure_df = Log_table_structure_df.union(new_monitor_log)
Log_table_structure_df.show()

+-------------+-----------------+--------------------+-----------------------+-----------------------+--------------------+-----------+------------+
|Model_version|   Label_category|         Metric_name|Base_Model_Metric_Score|Prod_Model_Metric_Score|Monitor_Metric_Score|Score_delta|capture_date|
+-------------+-----------------+--------------------+-----------------------+-----------------------+--------------------+-----------+------------+
|            1| Weighted_Overall| Weighted F(1) Score|     0.8954605375421371|     0.8954605375421371|    0.89511457160274|        0.0|  2022-11-16|
|            1| Weighted_Overall|             roc_auc|     0.8318491969096977|     0.8318491969096977|   0.832689617799035|      0.001|  2022-11-16|
|            1|Heart Disease Yes|          F1 Measure|    0.32939189189189183|    0.32939189189189183|  0.3261452383727707|      -0.01|  2022-11-16|
|            1| Heart Disease No|              recall|     0.9631729913290034|     0.9631729913290034|  0.

### Insight
We observe that the delta between performance on the new monitoring data and on the production data is small. Key figures such as avg_pr and ROC_AUC are consistent with the prod model.


### **Sample Snapshot of Model performance observing model recall.** 
<img src="https://raw.githubusercontent.com/mohdtaher2022/ML_Ops_Practices/main/Model_artifacts/Images/Model_Trend_line.PNG" alt="Alternative text" />


The figure plots the capture date on the X-axis and the base, prod and monitor scores on the Y-axis. The monitor and production scores deviate from each other from time to time, while the base score remains constant. Once the re-train module detects a dip, it automatically starts retraining the production model.
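
One possible way to express "detecting a dip" is a rule over the recent `Score_delta` values of a metric. The sketch below is a hypothetical rule, not the logic of the repository's re-train module: the function name `should_retrain`, the -0.05 threshold, and the requirement of two consecutive dips are all assumptions chosen for illustration.

```python
def should_retrain(score_deltas, dip_threshold=-0.05, consecutive=2):
    """True when the most recent `consecutive` deltas all fall below the threshold."""
    # Skip missing deltas (e.g. rows where the prod score was 0).
    recent = [delta for delta in score_deltas if delta is not None][-consecutive:]
    return len(recent) == consecutive and all(delta < dip_threshold for delta in recent)


# Hypothetical recall deltas per capture date, oldest first.
stable_history = [0.001, -0.01, 0.0]
decaying_history = [0.001, -0.03, -0.08, -0.11]

print(should_retrain(stable_history))    # small fluctuations only
print(should_retrain(decaying_history))  # two consecutive dips below -5%
```

Requiring several consecutive dips rather than one avoids retraining on a single noisy batch of monitoring data; the right threshold depends on the metric and the business tolerance.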


### Feature Score Monitoring Log Structure. 

In [None]:
# Monitoring Log Table Creation
feature_summary_test.createOrReplaceTempView('prod_feature_score')  # registerTempTable is deprecated

# Features Log table Schema.
Features_log_table =  spark.sql("""SELECT  Model_version AS Model_version, Label_category,  Feature_name, Metric_name,
                                    Metric_value as Base_feature_Score, Metric_value as Prod_feature_Score, 
                                    Metric_value as Monitor_feature_Score, Metric_value as Score_delta , 
                                    capture_date AS capture_date from prod_feature_score""")
Features_log_table_structure_df =  spark.createDataFrame([],Features_log_table.schema)

In [None]:

# REGISTERING the base and Production table 
feature_summary_test.createOrReplaceTempView('base_feature_score')
feature_summary_test.createOrReplaceTempView('prod_feature_score')

# Registering New Monitoring Info Received.
feature_summary_val.createOrReplaceTempView('Monitoring_feature_score')


new_monitor_feature_log = spark.sql("""SELECT A.Model_version, A.Label_category, A.Feature_name, A.Metric_name, B.Metric_value AS Base_feature_Score,  
                               A.Metric_value AS Prod_feature_Score,  C.Metric_value AS Monitor_feature_Score, 
                              (C.Metric_value / A.Metric_value) - 1 AS Score_delta , CURRENT_DATE() AS capture_date FROM 
                               prod_feature_score A 
                               LEFT JOIN base_feature_score B ON A.Feature_name = B.Feature_name AND A.Label_category =B.Label_category
                               LEFT JOIN Monitoring_feature_score C ON A.Feature_name = C.Feature_name AND A.Label_category =C.Label_category""")

new_monitor_feature_log.createOrReplaceTempView('Monitoring_feature_score_new_log_entry')

# Capturing Monitoring info in Log Table with Delta.
Features_log_table_structure_df = Features_log_table_structure_df.union(new_monitor_feature_log)
Features_log_table_structure_df.show()

+-------------+--------------+--------------------+--------------------+------------------+------------------+---------------------+-----------+------------+
|Model_version|Label_category|        Feature_name|         Metric_name|Base_feature_Score|Prod_feature_Score|Monitor_feature_Score|Score_delta|capture_date|
+-------------+--------------+--------------------+--------------------+------------------+------------------+---------------------+-----------+------------+
|            1|     Continous|AgeCategory_trans...| Anova test - pValue|               0.0|               0.0|                  0.0|       null|  2022-11-16|
|            1|   Categorical|     DiffWalking_Yes|Chi-Square test -...|               0.0|               0.0|                  0.0|       null|  2022-11-16|
|            1|   Categorical|Race_American_Ind...|Chi-Square test -...|             0.004|             0.004|                0.001|      -0.75|  2022-11-16|
|            1|   Categorical|          Sex_Female|C

### Insight

In the table above we see that the feature scores of the monitoring data are largely on par with the production data in terms of p-values.

The p-value for Sleep time increased to 0.519 in the monitoring data, which indicates that Sleep time does not significantly influence the Y variable in the new data, as the p-value is well above the significance threshold of 0.05.
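
The same reading of the log can be automated by flagging features whose p-value crosses the 0.05 threshold between production and monitoring data. The sketch below is a hypothetical helper, not part of the repository. The monitoring p-value of 0.519 for `SleepTime` and the two race p-values come from the notebook, while the production p-value for `SleepTime` is an assumed placeholder.

```python
ALPHA = 0.05  # significance threshold used in the insight above


def significance_changes(prod_p_values, monitor_p_values, alpha=ALPHA):
    """Features whose significance status differs between prod and monitoring data."""
    changes = {}
    for feature, prod_p in prod_p_values.items():
        monitor_p = monitor_p_values.get(feature)
        if monitor_p is None:
            continue  # feature not scored in the monitoring run
        was_significant = prod_p < alpha
        is_significant = monitor_p < alpha
        if was_significant != is_significant:
            changes[feature] = (
                "lost significance" if was_significant else "gained significance"
            )
    return changes


prod_p_values = {
    "SleepTime": 0.01,  # assumed placeholder, not a value from the notebook
    "Race_American_IndianAlaskan_Native": 0.004,
}
monitor_p_values = {
    "SleepTime": 0.519,
    "Race_American_IndianAlaskan_Native": 0.001,
}

changes = significance_changes(prod_p_values, monitor_p_values)
print(changes)
```

Comparing significance status rather than the raw `Score_delta` sidesteps the null deltas that appear when the production p-value is 0.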


### Data Drift Monitoring Log Structure. 

In [None]:
Data_drift_log_table =  spark.createDataFrame([],data_drif_summary.schema)

In [None]:
# Capturing Monitoring info in Log Table with Delta.
Data_drift_log_table = Data_drift_log_table.union(data_drif_summary)
Data_drift_log_table.show()

+-------------+--------------+--------------------+--------------------+------------+------------+
|Model_version|Label_category|        Feature_name|         Metric_name|Metric_value|capture_date|
+-------------+--------------+--------------------+--------------------+------------+------------+
|            1|   Categorical|         Smoking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical| AlcoholDrinking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Stroke_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|     DiffWalking_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Sex_Female|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|PhysicalActivity_Yes|       JS_Divergence|         0.0|  2022-11-16|
|            1|   Categorical|          Asthma_Yes|       JS_Divergence|         0.0|  2022-11-16|
|         

### Insights 

Looking at the JS divergence scores for the categorical data, the distributions look consistent between prod and monitoring data.

Likewise, for the KS test, none of the continuous variables shows a significantly different distribution between prod and the new validation data.

#### Packages and versions installed within the Python environment
(To be used only for cross-checking package versions)

In [9]:
!pip freeze


absl-py==1.3.0
aeppl==0.0.33
aesara==2.7.9
aiohttp==3.8.3
aiosignal==1.3.1
alabaster==0.7.12
albumentations==1.2.1
altair==4.2.0
appdirs==1.4.4
arviz==0.12.1
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
async-timeout==4.0.2
asynctest==0.13.0
atari-py==0.2.9
atomicwrites==1.4.1
attrs==22.1.0
audioread==3.0.0
autograd==1.5
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==5.0.1
blis==0.7.9
bokeh==2.3.3
branca==0.6.0
bs4==0.0.1
CacheControl==0.12.11
cached-property==1.5.2
cachetools==5.2.0
catalogue==2.0.8
certifi==2022.9.24
cffi==1.15.1
cftime==1.6.2
chardet==3.0.4
charset-normalizer==2.1.1
click==7.1.2
clikit==0.6.2
cloudpickle==1.5.0
cmake==3.22.6
cmdstanpy==1.0.8
colorcet==3.0.1
colorlover==0.3.0
community==1.0.0b1
confection==0.0.3
cons==0.4.5
contextlib2==0.5.5
convertdate==2.4.0
crashtest==0.3.1
crcmod==1.7
cufflinks==0.17.3
cvxopt==1.3.0
cvxpy==1.2.2
cycler==0.11.0
cymem==2.0.7
Cython==0.29.32
daft==0.0.4
dask==2022.2.0
datascience==0.17.5
datasets==2.7.0
db-dtypes==1.0