# Drift detection using alibi-detect package from SELDON.IO

Install the alibi-detect from seldon.io using following techniques

## With pip
pip install alibi-detect

## With conda

### Install mamba package manager
conda install mamba -n base -c conda-forge

### To install alibi-detect with the default TensorFlow backend:

mamba install -c conda-forge alibi-detect

## Drift detection on diabetic data

Here we are applying drift detection in diabetic dataset. The dataset contains both contineous data and categorical data. The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests for the continuous numerical features and Chi-Squared tests for the categorical features.

To add PySpark to sys.path for running the code on the Jupyter IDE we are Using the package findspark

In [3]:
import pyspark
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\spark-3.0.3-bin-hadoop2.7'

To perform any task on spark you need start a spark session, here we are starting a session for our logistic regression

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Logistic App").getOrCreate()

To start, we are loading the diabetes dataset 

In [5]:
!wget https://raw.githubusercontent.com/ismayilsiyad/hpe_ml/main/diabetes.csv
diabetes = spark.read.csv('diabetes.csv',header= True)
diabetes.printSchema()
diabetes.show()

--2022-04-07 19:38:01--  https://raw.githubusercontent.com/ismayilsiyad/hpe_ml/main/diabetes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 776415 (758K) [text/plain]
Saving to: 'diabetes.csv.15'

     0K .......... .......... .......... .......... ..........  6%  767K 1s
    50K .......... .......... .......... .......... .......... 13% 1.12M 1s
   100K .......... .......... .......... .......... .......... 19%  951K 1s
   150K .......... .......... .......... .......... .......... 26% 1.09M 1s
   200K .......... .......... .......... .......... .......... 32%  366K 1s
   250K .......... .......... .......... .......... .......... 39% 42.3M 1s
   300K .......... .......... .......... .......... .......... 46%  529K 1s
   350K .......... .......... ....

root
 |-- PatientID: string (nullable = true)
 |-- Pregnancies: string (nullable = true)
 |-- PlasmaGlucose: string (nullable = true)
 |-- DiastolicBloodPressure: string (nullable = true)
 |-- TricepsThickness: string (nullable = true)
 |-- SerumInsulin: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- DiabetesPedigree: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Diabetic: string (nullable = true)

+---------+-----------+-------------+----------------------+----------------+------------+-----------+----------------+---+--------+
|PatientID|Pregnancies|PlasmaGlucose|DiastolicBloodPressure|TricepsThickness|SerumInsulin|        BMI|DiabetesPedigree|Age|Diabetic|
+---------+-----------+-------------+----------------------+----------------+------------+-----------+----------------+---+--------+
|  1354778|          0|          171|                    80|              34|          23|43.50972593|     1.213191354| 21|       0|
|  1147438|          8|      

# Dropping unwanted columns

We need to drop unwanted columns from the dataset. By looking into the dataset we can see columns 'PatientID' have no relevance in predicting the diabetes. To have this insight in a complex problem. we have to formulate the hypothesis and evaluation of the hypothesis should be done.

In [6]:
colm = 'PatientID'
db_df = diabetes.select([column for column in diabetes.columns if column not in colm])
db_df.printSchema()

root
 |-- Pregnancies: string (nullable = true)
 |-- PlasmaGlucose: string (nullable = true)
 |-- DiastolicBloodPressure: string (nullable = true)
 |-- TricepsThickness: string (nullable = true)
 |-- SerumInsulin: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- DiabetesPedigree: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Diabetic: string (nullable = true)



# Changing the column datatype

We need to change column datatype to float from the initial string datatype

In [7]:
from pyspark.sql.functions import col
db_df = db_df.select(*(col(c).cast('float').alias(c) for c in db_df.columns))
db_df.printSchema()

root
 |-- Pregnancies: float (nullable = true)
 |-- PlasmaGlucose: float (nullable = true)
 |-- DiastolicBloodPressure: float (nullable = true)
 |-- TricepsThickness: float (nullable = true)
 |-- SerumInsulin: float (nullable = true)
 |-- BMI: float (nullable = true)
 |-- DiabetesPedigree: float (nullable = true)
 |-- Age: float (nullable = true)
 |-- Diabetic: float (nullable = true)



# Taking the count of the null and missing values

In [8]:
from pyspark.sql.functions import col, count, isnan, when
db_df.select([count(when(col(c).isNull(), c)).alias(c) for c in db_df.columns]).show()

+-----------+-------------+----------------------+----------------+------------+---+----------------+---+--------+
|Pregnancies|PlasmaGlucose|DiastolicBloodPressure|TricepsThickness|SerumInsulin|BMI|DiabetesPedigree|Age|Diabetic|
+-----------+-------------+----------------------+----------------+------------+---+----------------+---+--------+
|          0|            0|                     0|               0|           0|  0|               0|  0|       0|
+-----------+-------------+----------------------+----------------+------------+---+----------------+---+--------+



## Importing Chi-square drift and Tabular drift modules from alibi-detect library


In [10]:

from alibi_detect.cd import ChiSquareDrift, TabularDrift
from alibi_detect.utils.saving import save_detector, load_detector

In [11]:
db_df.show()

+-----------+-------------+----------------------+----------------+------------+---------+----------------+----+--------+
|Pregnancies|PlasmaGlucose|DiastolicBloodPressure|TricepsThickness|SerumInsulin|      BMI|DiabetesPedigree| Age|Diabetic|
+-----------+-------------+----------------------+----------------+------------+---------+----------------+----+--------+
|        0.0|        171.0|                  80.0|            34.0|        23.0|43.509727|       1.2131914|21.0|     0.0|
|        8.0|         92.0|                  93.0|            47.0|        36.0|21.240576|      0.15836498|23.0|     0.0|
|        7.0|        115.0|                  47.0|            52.0|        35.0|41.511524|      0.07901857|23.0|     0.0|
|        9.0|        103.0|                  78.0|            25.0|       304.0|29.582191|       1.2828698|43.0|     1.0|
|        1.0|         85.0|                  59.0|            27.0|        35.0|42.604534|       0.5495419|22.0|     0.0|
|        0.0|         82

## Converting spark dataframe to numpy array

In [33]:
import numpy as np
db_df_pd = np.array(db_df.select("*").collect())


In [34]:
db_df_pd.shape

(15000, 9)

We split the data in a reference set and 2 test sets on which we test the data drift

In [35]:
n_ref = 4500
n_test = 4500

X_ref, X_t0, X_t1 = db_df_pd[:n_ref], db_df_pd[n_ref:n_ref + n_test], db_df_pd[n_ref + n_test:n_ref + 2 * n_test]
X_ref.shape, X_t0.shape, X_t1.shape

((4500, 9), (4500, 9), (4500, 9))

## Detect drift

We need to provide the drift detector with the columns which contain categorical features so it knows which features require the Chi-Squared and which ones require the K-S univariate test. We can either provide a dict with as keys the column indices and as values the number of possible categories or just set the values to None and let the detector infer the number of categories from the reference data.

In [15]:


feature_names = ['Pregnancies', 'PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age','Diabetic']
category_map = {8: ['0.0','1.0']}
categories_per_feature = {f: None for f in list(category_map.keys())}

Next, we can initialize the detector

In [16]:
cd = TabularDrift(X_ref, p_val=.05)

We can save the initialised detector and load it back:

In [17]:
filepath = 'my_path'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

Directory my_path\model does not exist.


Next we can check whether the first test setis is drifting from the reference data:

In [18]:
preds = cd.predict(X_t0)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? No!


It is possible to check the drift for each feature seperately. The preds dictionary also returns the K-S or Chi-Squared test statistics and p-value for each feature

In [19]:
for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Pregnancies -- K-S 0.005 -- p-value 1.000
PlasmaGlucose -- K-S 0.022 -- p-value 0.243
DiastolicBloodPressure -- K-S 0.028 -- p-value 0.064
TricepsThickness -- K-S 0.021 -- p-value 0.288
SerumInsulin -- K-S 0.017 -- p-value 0.520
BMI -- K-S 0.013 -- p-value 0.844
DiabetesPedigree -- K-S 0.018 -- p-value 0.486
Age -- K-S 0.027 -- p-value 0.080
Diabetic -- Chi2 0.010 -- p-value 0.976


Any of the feature's p-values are below the threshold value of p. The threshold p-value can be visible as

In [20]:
preds['data']['threshold']

0.005555555555555556

Now, Let's check the drift between the the second test set with reference set

In [23]:
preds = cd.predict(X_t1)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? No!


In [24]:
for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Pregnancies -- Drift? No! -- K-S 0.016 -- p-value 0.642
PlasmaGlucose -- Drift? No! -- K-S 0.010 -- p-value 0.985
DiastolicBloodPressure -- Drift? No! -- K-S 0.015 -- p-value 0.677
TricepsThickness -- Drift? No! -- K-S 0.017 -- p-value 0.503
SerumInsulin -- Drift? No! -- K-S 0.016 -- p-value 0.571
BMI -- Drift? No! -- K-S 0.020 -- p-value 0.351
DiabetesPedigree -- Drift? No! -- K-S 0.013 -- p-value 0.844
Age -- Drift? No! -- K-S 0.021 -- p-value 0.288
Diabetic -- Drift? No! -- Chi2 0.002 -- p-value 1.000


# Chi-Square Drift

While the TabularDrift detector works fine with numerical or categorical features only, we can also directly use a categorical drift detector. First we construct a categorical-only dataset and then use the ChiSquareDrift detector

In [30]:
cols = list(category_map.keys())
cat_names = [feature_names[_] for _ in list(category_map.keys())]
X_ref_cat, X_t0_cat = X_ref[:, cols], X_t0[:, cols]
X_ref_cat.shape, X_t0_cat.shape

((4500, 1), (4500, 1))

In [31]:
cd = ChiSquareDrift(X_ref_cat, p_val=.05)
preds = cd.predict(X_t0_cat)
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? No!


In [32]:
print(f"Threshold {preds['data']['threshold']}")
for f in range(cd.n_features):
    fname = cat_names[f]
    is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Threshold 0.05
Diabetic -- Drift? No! -- Chi2 0.963 -- p-value 0.326


In [19]:
spark.stop()