# Data Drift

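==> <b>Definition</b> <br>
Drift is a change in the statistical properties of data between two points in time, for example between the data a model was trained on and the data it sees in production.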

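==> <b>There are mainly two types of drift.</b> <br>
1) Data drift (features) <br>
--> Measures drift in the distribution of the input features. <br>
2) Concept drift (model output) <br>
--> Measures drift in the target / model output. Strictly, concept drift is a change in the relationship between the features and the target; a shift in the target or prediction distribution is the signal monitored for it here.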

==> <b>The statistical methods below are used to calculate drift. They are reactive: they detect drift only after it has happened.</b> <br>
(https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests) <br>

1) Kolmogorov–Smirnov (K-S) test <br>
--> only for numerical features <br>
--> returns p_value <br>
--> drift detected when p_value < threshold <br>
<br>
2) Chi-Square test <br>
--> only for categorical features <br>
--> used when the feature has more than 2 labels <br>
--> returns p_value <br>
--> drift detected when p_value < threshold <br>
<br>
3) Z-test <br>
--> only for categorical features <br>
--> used when the feature has 2 labels or fewer <br>
--> returns p_value <br>
--> drift detected when p_value < threshold <br>
<br>
4) T-Test <br>
--> only for numerical features <br>
--> returns p_value <br>
--> drift detected when p_value < threshold <br>
<br>
5) Wasserstein distance <br>
--> only for numerical features <br>
--> returns distance <br>
--> drift detected when distance >= threshold <br>
<br>
6) Kullback-Leibler divergence (KL-divergence) <br>
--> for numerical and categorical features <br>
--> returns divergence <br>
--> drift detected when divergence >= threshold <br>
<br>
7) Population Stability Index (PSI) <br>
--> for numerical and categorical features <br>
--> returns psi_value <br>
--> drift detected when psi_value >= threshold <br>
<br>
8) Jensen-Shannon distance <br>
--> for numerical and categorical features <br>
--> returns distance <br>
--> drift detected when distance >= threshold <br>
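
The p-value based tests in the list above (K-S, Chi-Square, Z-test, T-Test) can be sketched directly with SciPy. This is a minimal illustration on synthetic data, not necessarily how Evidently computes them: the helper names, the 0.05 threshold, the use of Welch's variant of the t-test, and the synthetic samples are all assumptions made for the example.

```python
import math

import numpy as np
from scipy import stats

P_VALUE_THRESHOLD = 0.05  # assumed significance level, for illustration only


def ks_p_value(ref, cur):
    """Two-sample Kolmogorov-Smirnov test for a numerical feature."""
    return float(stats.ks_2samp(ref, cur).pvalue)


def t_test_p_value(ref, cur):
    """Welch's t-test (no equal-variance assumption) for a numerical feature."""
    return float(stats.ttest_ind(ref, cur, equal_var=False).pvalue)


def chi_square_p_value(ref, cur):
    """Chi-square test on the label counts of a categorical feature."""
    ref, cur = np.asarray(ref), np.asarray(cur)
    labels = sorted(set(ref.tolist()) | set(cur.tolist()))
    # One row of label counts per sample (reference, current).
    table = [[int(np.sum(sample == label)) for label in labels] for sample in (ref, cur)]
    _, p_value, _, _ = stats.chi2_contingency(table)
    return float(p_value)


def z_test_p_value(ref, cur, positive_label):
    """Two-proportion z-test for a categorical feature with at most 2 labels."""
    ref, cur = np.asarray(ref), np.asarray(cur)
    n_ref, n_cur = len(ref), len(cur)
    p_ref = np.mean(ref == positive_label)
    p_cur = np.mean(cur == positive_label)
    pooled = (p_ref * n_ref + p_cur * n_cur) / (n_ref + n_cur)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_cur))
    if std_err == 0:  # both samples contain one and the same label only
        return 1.0
    z = (p_ref - p_cur) / std_err
    return float(2 * stats.norm.sf(abs(z)))


# Synthetic reference/current samples; the current ones are shifted on purpose.
rng = np.random.default_rng(0)
ref_num = rng.normal(0.0, 1.0, size=500)
cur_num = rng.normal(1.0, 1.0, size=500)
ref_cat = rng.choice([0, 1, 2, 3], size=500, p=[0.4, 0.3, 0.2, 0.1])
cur_cat = rng.choice([0, 1, 2, 3], size=500, p=[0.1, 0.2, 0.3, 0.4])
ref_bin = rng.choice([0, 1], size=500, p=[0.7, 0.3])
cur_bin = rng.choice([0, 1], size=500, p=[0.3, 0.7])

p_values = {
    "ks": ks_p_value(ref_num, cur_num),
    "t_test": t_test_p_value(ref_num, cur_num),
    "chisquare": chi_square_p_value(ref_cat, cur_cat),
    "z": z_test_p_value(ref_bin, cur_bin, positive_label=1),
}
# Drift is flagged when the p-value falls below the threshold.
drift = {name: p < P_VALUE_THRESHOLD for name, p in p_values.items()}
print(drift)
```

Comparing a sample with itself gives a p-value of 1.0 for each of these helpers, so no drift is flagged in that case.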

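The distance based measures in the list above (Wasserstein, KL-divergence, PSI, Jensen-Shannon) return a distance rather than a p-value, and drift is flagged when the distance reaches a threshold. The sketch below assumes a simple 10-bin histogram on shared edges, a small `eps` to avoid empty bins, and an illustrative threshold of 0.1; the helper names are hypothetical and Evidently's own binning and normalization may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

DISTANCE_THRESHOLD = 0.1  # assumed threshold, for illustration only


def binned_probabilities(ref, cur, bins=10, eps=1e-6):
    """Bin both samples on shared edges and return two probability vectors."""
    edges = np.histogram_bin_edges(np.concatenate([ref, cur]), bins=bins)
    ref_counts = np.histogram(ref, bins=edges)[0] + eps  # eps avoids log(0)
    cur_counts = np.histogram(cur, bins=edges)[0] + eps
    return ref_counts / ref_counts.sum(), cur_counts / cur_counts.sum()


def kl_divergence(p, q):
    """KL(p || q) for two discrete probability vectors."""
    return float(np.sum(p * np.log(p / q)))


def psi(p, q):
    """Population Stability Index; equals KL(p || q) + KL(q || p)."""
    return float(np.sum((q - p) * np.log(q / p)))


# Synthetic numerical feature; the current sample has a shifted mean.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=500)
cur = rng.normal(1.0, 1.0, size=500)

p, q = binned_probabilities(ref, cur)
distances = {
    "wasserstein": float(wasserstein_distance(ref, cur)),  # raw values, not normalized
    "kl_div": kl_divergence(p, q),
    "psi": psi(p, q),
    "jensenshannon": float(jensenshannon(p, q, base=2)),  # lies in [0, 1] with base 2
}
# Drift is flagged when the distance reaches the threshold.
drift = {name: d >= DISTANCE_THRESHOLD for name, d in distances.items()}
print(distances)
print(drift)
```

For a categorical feature, the label frequencies of the two samples can be used as `p` and `q` in place of the histogram bins.
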
In [None]:
#proactive
# pending

==> <b>Steps after detection of drift.</b> <br>
1) Blindly update the model by retraining <br>
2) Incremental learning
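
As a rough illustration of incremental learning, a model that supports online updates can be refreshed on the drifted batch instead of being retrained from scratch. The sketch below assumes scikit-learn (not used elsewhere in this notebook) and its `SGDClassifier.partial_fit`; the synthetic data and the `make_batch` helper are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def make_batch(n, shift):
    """Synthetic batch; `shift` moves both the features and the decision boundary."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y


# Initial fit on the reference period. partial_fit needs the full label set up front.
model = SGDClassifier(random_state=0)
X_ref, y_ref = make_batch(500, shift=0.0)
model.partial_fit(X_ref, y_ref, classes=np.array([0, 1]))

# After drift is detected, keep updating the same model on the new batch.
X_new, y_new = make_batch(500, shift=3.0)
accuracy_before = model.score(X_new, y_new)
for _ in range(20):
    model.partial_fit(X_new, y_new)
# Scored on the batch it was just updated with, so this is only a sanity check.
accuracy_after = model.score(X_new, y_new)

print(f"accuracy on new batch: before={accuracy_before:.2f}, after={accuracy_after:.2f}")
```

In practice the updated model would be evaluated on held-out data from the new period before it replaces the old one.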

In [14]:
'''
Install and enable the evidently Jupyter nbextension
'''
!jupyter nbextension install --sys-prefix --symlink --overwrite --py evidently

!jupyter nbextension enable evidently --py --sys-prefix

Installing /Users/jaydeepchakraborty/JC/git-projects/py_stat/venv/lib/python3.7/site-packages/evidently/nbextension/static -> evidently
Removing: /Users/jaydeepchakraborty/JC/git-projects/py_stat/venv/share/jupyter/nbextensions/evidently
Symlinking: /Users/jaydeepchakraborty/JC/git-projects/py_stat/venv/share/jupyter/nbextensions/evidently -> /Users/jaydeepchakraborty/JC/git-projects/py_stat/venv/lib/python3.7/site-packages/evidently/nbextension/static
- Validating: OK

    To initialize this nbextension in the browser every time the notebook (or other app) loads:
    
          jupyter nbextension enable evidently --py --sys-prefix
    
Enabling notebook extension evidently/extension...
      - Validating: OK


In [15]:
'''
DATASET:
The dataset is the Cleveland Heart Disease dataset taken from the UCI repository. 
https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland
'''


In [1]:
'''
## import statements
!pip install git+https://github.com/evidentlyai/evidently.git
'''
import pandas as pd
from IPython.display import display
import evidently
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import ColumnDriftMetric

In [2]:
'''
## Populate the config
'''
def get_conf():
    conf = {
        "data_path": "/Users/jaydeepchakraborty/JC/git-projects/model_util/DataSets/HeartDiseaseCleveland/",
        "data_fl": "Heart_disease_cleveland.csv",
        "report_fl": "/Users/jaydeepchakraborty/JC/git-projects/model_util/plots/data_drift/report_{ind}.html"
    }
    return conf

In [3]:
'''
## Load the Dataset


age: Patient's age in years (Numeric)
sex: Gender (Male: 1; Female: 0) (Nominal)
cp: Type of chest pain experienced by the patient, in 4 categories:
0 typical angina, 1 atypical angina, 2 non-anginal pain, 3 asymptomatic (Nominal)
chol: Serum cholesterol in mg/dl (Numeric)
thalach: Maximum heart rate achieved (Numeric)
target: The variable to predict; 1 means the patient has heart disease and 0 means the patient is normal.
'''
def get_data(conf):
    data = pd.read_csv(conf["data_path"]+conf["data_fl"])
    # keep only the columns used here; copy so later dtype changes do not act on a slice
    data = data[["age", "sex", "cp", "chol", "thalach", "target"]].copy()
    return data

In [4]:
'''
## Populating data metric
'''
def populate_data_metric(data):
    display("Data Length:-")
    display(len(data))
    display("Data Description:-")
    display(data.describe())
    display("Data Samples:-")
    display(data.head(3))
    
    display("Data Columns and Types (before):-")
    for col in data.columns:
        print(f"Column Name:- {col}  Column Type:- {data[col].dtype}")
      
    # 'string', 'category', 'int64', 'float64'
    data['age'] = data['age'].astype('int64')
    data['sex'] = data['sex'].astype('category')
    data['cp'] = data['cp'].astype('category')
    data['chol'] = data['chol'].astype('int64')
    data['thalach'] = data['thalach'].astype('int64')
    data['target'] = data['target'].astype('category')
    
    display("Data Columns and Types (after):-")
    for col in data.columns:
        print(f"Column Name:- {col}  Column Type:- {data[col].dtype}")
    

In [5]:
'''
## Splitting data for drift
'''
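def split_data(data):
    # Both samples are drawn independently from the same dataset, so they can
    # share rows and little or no drift is expected between them.
    reference = data.sample(n=150, replace=False)
    current = data.sample(n=150, replace=False)
    return reference, current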
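
`split_data` draws the reference and current samples independently, so the same row can land in both. If non-overlapping samples are wanted, one possible variant is sketched below; the name `split_data_disjoint`, the `seed` parameter and the toy DataFrame are assumptions for illustration, and the input needs a unique index and at least `2 * n` rows.

```python
import pandas as pd


def split_data_disjoint(data, n=150, seed=42):
    """Return two samples of n rows each that share no rows."""
    reference = data.sample(n=n, replace=False, random_state=seed)
    # Draw the current sample only from rows not already used as reference.
    remaining = data.drop(index=reference.index)
    current = remaining.sample(n=n, replace=False, random_state=seed)
    return reference, current


# Toy stand-in with the same number of rows as the heart disease dataset.
toy = pd.DataFrame({"age": range(303), "target": [i % 2 for i in range(303)]})
reference, current = split_data_disjoint(toy)
print(len(reference), len(current), len(reference.index.intersection(current.index)))
```

Fixing `random_state` also makes the split, and therefore the drift report, reproducible between runs.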

In [6]:
'''
## Generating the report
~ https://docs.evidentlyai.com/reference/data-drift-algorithm
~ https://docs.evidentlyai.com/user-guide/input-data/column-mapping
~ https://docs.evidentlyai.com/presets/data-drift
'''
def generate_report(ref, cur):
    
    column_mapping = ColumnMapping()
    
    column_mapping.numerical_features = ['age', 'chol', 'thalach'] # list of numerical features
    column_mapping.categorical_features = ['sex', 'cp'] # list of categorical features
    column_mapping.target = 'target' # name of the column that holds the target variable
    
    data_drift_report = Report(metrics=[
        DataDriftPreset(),
        TargetDriftPreset(),
    ])
    
    data_drift_report.run(reference_data=ref, current_data=cur, column_mapping=column_mapping)
    
    return data_drift_report

In [7]:
'''
## Generating the custom report
~ https://docs.evidentlyai.com/reference/data-drift-algorithm
~ https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests
~ https://github.com/evidentlyai/evidently/tree/b4b1f80b4e2541e6303726d3d4691ca49c85105c/src/evidently/calculations/stattests
# sex := Nominal (0, 1)
# cp := Nominal (0, 1, 2, 3)
# age, chol, thalach := numerical
'''
def generate_custom_report(ref, cur):
    
    column_mapping = ColumnMapping()
    
    column_mapping.numerical_features = ['age', 'chol', 'thalach'] # list of numerical features
    column_mapping.categorical_features = ['sex', 'cp'] # list of categorical features
    column_mapping.target = 'target' # name of the column that holds the target variable
    
    data_drift_report = Report(metrics=[
        ColumnDriftMetric(column_name='sex', stattest='z'),
        ColumnDriftMetric(column_name='sex', stattest='kl_div'),
        ColumnDriftMetric(column_name='sex', stattest='psi'),
        ColumnDriftMetric(column_name='sex', stattest='jensenshannon'),
        ColumnDriftMetric(column_name='cp', stattest='chisquare'),
        ColumnDriftMetric(column_name='cp', stattest='kl_div'),
        ColumnDriftMetric(column_name='cp', stattest='psi'),
        ColumnDriftMetric(column_name='cp', stattest='jensenshannon'),
        ColumnDriftMetric(column_name='age', stattest='kl_div'),
        ColumnDriftMetric(column_name='age', stattest='t_test'),
        ColumnDriftMetric(column_name='age', stattest='wasserstein'),
        ColumnDriftMetric(column_name='age', stattest='jensenshannon'),
        ColumnDriftMetric(column_name='age', stattest='psi'),
        ColumnDriftMetric(column_name='chol', stattest='kl_div'),
        ColumnDriftMetric(column_name='chol', stattest='t_test'),
        ColumnDriftMetric(column_name='chol', stattest='wasserstein'),
        ColumnDriftMetric(column_name='chol', stattest='jensenshannon'),
        ColumnDriftMetric(column_name='chol', stattest='psi'),
        ColumnDriftMetric(column_name='thalach', stattest='kl_div'),
        ColumnDriftMetric(column_name='thalach', stattest='t_test'),
        ColumnDriftMetric(column_name='thalach', stattest='wasserstein'),
        ColumnDriftMetric(column_name='thalach', stattest='jensenshannon'),
        ColumnDriftMetric(column_name='thalach', stattest='psi'),
    ])
    
    data_drift_report.run(reference_data=ref, current_data=cur, column_mapping=column_mapping)
    
    return data_drift_report
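
The metric list in `generate_custom_report` follows one pattern: binary categorical columns get the Z-test, categorical columns with more labels get Chi-Square, and every column also gets the distance based measures. A sketch of generating the same (column, stattest) pairs from that rule is shown below; `categorical_tests` and `build_metric_specs` are hypothetical helpers, and the label counts are passed in by hand.

```python
# Tests applied to every numerical column, in the order used in the report.
NUMERICAL_TESTS = ["kl_div", "t_test", "wasserstein", "jensenshannon", "psi"]


def categorical_tests(n_labels):
    """Z-test for up to 2 labels, Chi-Square otherwise, plus the distance measures."""
    p_value_test = "z" if n_labels <= 2 else "chisquare"
    return [p_value_test, "kl_div", "psi", "jensenshannon"]


def build_metric_specs(numerical, categorical_label_counts):
    """Return (column_name, stattest) pairs, categorical columns first."""
    specs = []
    for column, n_labels in categorical_label_counts.items():
        specs.extend((column, test) for test in categorical_tests(n_labels))
    for column in numerical:
        specs.extend((column, test) for test in NUMERICAL_TESTS)
    return specs


specs = build_metric_specs(
    numerical=["age", "chol", "thalach"],
    categorical_label_counts={"sex": 2, "cp": 4},
)
for column, test in specs:
    print(column, test)

# Each pair could then be turned into a metric, as in the function above:
# metrics = [ColumnDriftMetric(column_name=c, stattest=t) for c, t in specs]
```

This keeps the rule for choosing a test in one place, so adding a column only requires adding it to the inputs.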

In [8]:
'''
## Saving the report
'''
def save_report(report, conf, ind=""):
    report.save_html(conf["report_fl"].replace("{ind}", ind))

In [9]:
if __name__ == "__main__":
    conf = get_conf()
    data = get_data(conf)
    populate_data_metric(data)
    reference_data, current_data = split_data(data)
    default_report = generate_report(reference_data, current_data)
    save_report(default_report, conf, "default")
    custom_report = generate_custom_report(reference_data, current_data)
    save_report(custom_report, conf, "custom")

'Data Length:-'

303

'Data Description:-'

              age         sex          cp        chol     thalach      target
count       303.0       303.0       303.0       303.0       303.0       303.0
mean    54.438944    0.679868    2.158416  246.693069  149.607261    0.458746
std      9.038662    0.467299    0.960126   51.776918   22.875003     0.49912
min          29.0         0.0         0.0       126.0        71.0         0.0
25%          48.0         0.0         2.0       211.0       133.5         0.0
50%          56.0         1.0         2.0       241.0       153.0         0.0
75%          61.0         1.0         3.0       275.0       166.0         1.0
max          77.0         1.0         3.0       564.0       202.0         1.0


'Data Samples:-'

   age  sex  cp  chol  thalach  target
0   63    1   0   233      150       0
1   67    1   3   286      108       1
2   67    1   3   229      129       1


'Data Columns and Types (before):-'

Column Name:- age  Column Type:- int64
Column Name:- sex  Column Type:- int64
Column Name:- cp  Column Type:- int64
Column Name:- chol  Column Type:- int64
Column Name:- thalach  Column Type:- int64
Column Name:- target  Column Type:- int64


'Data Columns and Types (after):-'

Column Name:- age  Column Type:- int64
Column Name:- sex  Column Type:- category
Column Name:- cp  Column Type:- category
Column Name:- chol  Column Type:- int64
Column Name:- thalach  Column Type:- int64
Column Name:- target  Column Type:- category


# Resources
1) https://www.youtube.com/watch?v=tQjRQWfYQ10
2) https://docs.evidentlyai.com/examples
3) https://towardsdatascience.com/why-data-drift-detection-is-important-and-how-do-you-automate-it-in-5-simple-steps-96d611095d93
4) https://www.evidentlyai.com/blog/data-drift-detection-large-datasets
5) https://www.kaggle.com/discussions/general/325253
6) https://towardsdatascience.com/calculating-data-drift-in-machine-learning-53676ff5646b