## Categorical and mixed-type data drift detection on the Titanic dataset
### Method
The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests to the continuous numerical features and Chi-Squared tests to the categorical features. For multivariate data, the p-values obtained for each feature are aggregated via either the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls the probability of at least one false positive across all features. The FDR correction, on the other hand, tolerates an expected fraction of false positives among the features flagged as drifting.
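
To make the difference between the two corrections concrete, here is a minimal pure-Python sketch. It is not alibi-detect's own implementation: the function names are hypothetical, the FDR variant shown is the Benjamini-Hochberg step-up rule, and the p-values are made up for illustration.

```python
from typing import List, Tuple


def bonferroni(p_vals: List[float], alpha: float = 0.05) -> Tuple[bool, float]:
    """Flag drift if any p-value is below alpha / n_features."""
    threshold = alpha / len(p_vals)
    return any(p < threshold for p in p_vals), threshold


def benjamini_hochberg(p_vals: List[float], q: float = 0.05) -> Tuple[bool, float]:
    """Flag drift if any sorted p-value p_(i) falls below q * i / n.

    Returns the decision and the largest critical value q * i / n that was met
    (or the smallest critical value, q / n, when none was met).
    """
    n = len(p_vals)
    met = [q * i / n for i, p in enumerate(sorted(p_vals), start=1) if p < q * i / n]
    return bool(met), (max(met) if met else q / n)


# Hypothetical p-values for four features (not taken from the Titanic run).
p_vals = [0.03, 0.001, 0.2, 0.02]

drift_bonf, thr_bonf = bonferroni(p_vals)
drift_fdr, thr_fdr = benjamini_hochberg(p_vals)
print(drift_bonf, thr_bonf)
print(drift_fdr, thr_fdr)
```

Both rules report drift here, but the Bonferroni threshold (0.05 / 4) is three times smaller than the largest critical value met under the FDR rule, which is why Bonferroni is the more conservative choice.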


### Installation

In [2]:
!pip install alibi-detect

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from alibi_detect.cd import ChiSquareDrift, TabularDrift
from alibi_detect.utils.saving import save_detector, load_detector



### Load Titanic Dataset

In [6]:
%ls ..\..\data\raw\train.csv

 Volume in drive C has no label.
 Volume Serial Number is FA41-07A3

 Directory of c:\Projects\bluealtair\testing\data\raw

05/10/2022  11:17 AM            61,194 train.csv
               1 File(s)         61,194 bytes
               0 Dir(s)  393,752,371,200 bytes free


In [8]:
data = pd.read_csv(r"..\..\data\raw\train.csv")

In [9]:
data.shape

(891, 12)

In [10]:
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |


### Preprocessing dataset

In [11]:
# Drop every row that has at least one missing value
data.dropna(axis=0, inplace=True)

In [12]:
data.shape

(183, 12)
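
The drop from 891 to 183 rows is large because a row is removed as soon as any single column is missing. The sketch below mimics that behaviour in plain Python on the five rows shown in the `data.head()` output, reduced to three columns, with `None` standing in for NaN. In pandas itself, `data.isna().sum()` gives the same per-column count of missing values.

```python
# First five rows from the `data.head()` output, reduced to three columns.
# `None` stands in for a missing value (NaN in pandas).
rows = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None},
    {"PassengerId": 2, "Age": 38.0, "Cabin": "C85"},
    {"PassengerId": 3, "Age": 26.0, "Cabin": None},
    {"PassengerId": 4, "Age": 35.0, "Cabin": "C123"},
    {"PassengerId": 5, "Age": 35.0, "Cabin": None},
]

# Equivalent of dropna(axis=0): keep a row only if none of its values is missing.
complete_rows = [row for row in rows if all(v is not None for v in row.values())]

# Count missing values per column to see which column drives the row loss.
missing_per_column = {col: sum(row[col] is None for row in rows) for col in rows[0]}

print([row["PassengerId"] for row in complete_rows])  # [2, 4]
print(missing_per_column)  # {'PassengerId': 0, 'Age': 0, 'Cabin': 3}
```

If keeping more rows matters, dropping or imputing a sparsely populated column before calling `dropna` is one option, at the cost of changing the results reported in this notebook.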

In [13]:
X,y = data.drop(["Survived"],axis = 1),data["Survived"]

In [14]:
X.shape,y.shape

((183, 11), (183,))

In [15]:
X.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [16]:
# Drop the free-text and identifier columns
X.drop(["Name", "Ticket", "PassengerId"], axis=1, inplace=True)

In [17]:
feature_names = list(X.columns)

In [18]:
feature_names

['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']

In [23]:
X.reset_index(inplace=True,drop=True)

For data drift detection we need to explicitly provide the detector with a list of the categorical columns.

In [24]:
# In our case the list of categorical columns is created manually
category_list = ["Pclass","Sex","SibSp","Parch","Cabin","Embarked"]

In [25]:
categories_per_feature = {}
for index,value in enumerate(X.columns):
    if value in category_list:
        categories_per_feature[index]=None



Data provided to the drift detector must be a NumPy array, so we first label encode the categorical variables (the cell below applies the encoder to every column) and then convert the DataFrame to a 2D NumPy array.

In [27]:
from sklearn.preprocessing import LabelEncoder 


In [28]:
X = X.apply(LabelEncoder().fit_transform)


In [29]:
X = X.to_numpy()
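
Because `LabelEncoder` is applied to every column, the continuous `Age` and `Fare` columns are encoded as well. The sketch below is a pure-Python stand-in for the encoder (the helper `label_encode` is hypothetical), assuming codes are assigned in sorted order of the unique values. The `Embarked` values are illustrative; the `Fare` values come from the `data.head()` output.

```python
from typing import Hashable, List, Sequence


def label_encode(values: Sequence[Hashable]) -> List[int]:
    """Map each value to the index of that value among the sorted unique values."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


embarked = ["S", "C", "S", "Q"]  # categorical (illustrative values)
fare = [7.25, 71.2833, 7.925, 53.1, 8.05]  # continuous

print(label_encode(embarked))  # [2, 0, 2, 1]
print(label_encode(fare))  # [0, 4, 1, 3, 2]
```

Under that assumption the encoding keeps the ordering of the continuous values, so a rank-based test such as K-S should give the same statistic on the codes as on the raw values. The spacing between values is lost, however, so encoding only the categorical columns would be the safer choice if the array is reused elsewhere.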

We split the data into a reference set and two test sets on which we test for data drift:

In [30]:
n_ref = 50
n_test = 50

X_ref, X_t0, X_t1 = X[:n_ref], X[n_ref:n_ref + n_test], X[n_ref + n_test:n_ref + 2 * n_test]
X_ref.shape, X_t0.shape, X_t1.shape

((50, 8), (50, 8), (50, 8))

### Detect drift

The drift detector needs to know which columns contain categorical features so that it can apply the Chi-Squared test to those columns and the K-S test to the rest. We can either pass a dict that maps each categorical column index to its number of possible categories, or set the values to *None* and let the detector infer the categories from the reference data, as in the example below:

### Initialize the detector

In [33]:
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
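
For the other option mentioned above, explicit category counts, a small helper can derive the counts from label-encoded data. This is only a sketch: `count_categories` is a hypothetical helper, the rows are toy values, and it assumes that an integer count `n` is interpreted by the detector as the categories `0, ..., n-1`.

```python
from typing import Dict, List, Sequence


def count_categories(rows: Sequence[Sequence[int]], cat_cols: List[int]) -> Dict[int, int]:
    """Return {column index: number of categories}, assuming codes 0..max."""
    return {col: max(row[col] for row in rows) + 1 for col in cat_cols}


# Toy label-encoded rows (hypothetical values): columns 0 and 2 are categorical.
encoded_rows = [
    [0, 17, 1],
    [2, 40, 0],
    [1, 23, 1],
]

explicit_categories = count_categories(encoded_rows, cat_cols=[0, 2])
print(explicit_categories)  # {0: 3, 2: 2}
```

Computing the counts on the full encoded array rather than on `X_ref` alone would also cover categories that happen to be absent from the 50 reference rows. The resulting dict could then be passed as `categories_per_feature` in place of the `None`-valued one.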

We can also save/load an initialised detector:

In [34]:
filepath = 'detector_path'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

Directory detector_path does not exist and is now created.
Directory detector_path\model does not exist.


Now we can check whether the two test sets have drifted from the reference data, starting with the first:

In [35]:
preds = cd.predict(X_t0)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? No!


In [36]:
print(preds)

{'data': {'is_drift': 0, 'distance': array([ 0.8660095 ,  0.64231235,  0.16      ,  1.5334947 ,  6.7868133 ,
        0.24      , 78.333336  ,  2.127603  ], dtype=float32), 'p_val': array([0.6485574 , 0.42287472, 0.4944631 , 0.6745616 , 0.07901227,
       0.09435459, 0.43624088, 0.34514126], dtype=float32), 'threshold': 0.00625}, 'meta': {'name': 'TabularDrift', 'detector_type': 'offline', 'data_type': None, 'version': '0.9.1'}}


In [37]:
categories_per_feature.keys()

dict_keys([0, 1, 3, 4, 6, 7])

In [38]:
for f in range(cd.n_features):
    stat = 'Chi2' if f in categories_per_feature else 'K-S'
    fname = feature_names[f]
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Pclass -- Chi2 0.866 -- p-value 0.649
Sex -- Chi2 0.642 -- p-value 0.423
Age -- K-S 0.160 -- p-value 0.494
SibSp -- Chi2 1.533 -- p-value 0.675
Parch -- Chi2 6.787 -- p-value 0.079
Fare -- K-S 0.240 -- p-value 0.094
Cabin -- Chi2 78.333 -- p-value 0.436
Embarked -- Chi2 2.128 -- p-value 0.345


None of the feature-level p-values is below the threshold, which is the Bonferroni-corrected value `p_val / n_features = 0.05 / 8`:

In [39]:
preds['data']['threshold']

0.00625

Drift can also be reported for each feature individually by passing `drift_type='feature'`:

In [40]:
fpreds = cd.predict(X_t0, drift_type='feature')

In [41]:
for f in range(cd.n_features):
    stat = 'Chi2' if f in categories_per_feature else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Pclass -- Drift? No! -- Chi2 0.866 -- p-value 0.649
Sex -- Drift? No! -- Chi2 0.642 -- p-value 0.423
Age -- Drift? No! -- K-S 0.160 -- p-value 0.494
SibSp -- Drift? No! -- Chi2 1.533 -- p-value 0.675
Parch -- Drift? No! -- Chi2 6.787 -- p-value 0.079
Fare -- Drift? No! -- K-S 0.240 -- p-value 0.094
Cabin -- Drift? No! -- Chi2 78.333 -- p-value 0.436
Embarked -- Drift? No! -- Chi2 2.128 -- p-value 0.345


What about the second test set?

In [42]:
preds = cd.predict(X_t1)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? No!


In [43]:
for f in range(cd.n_features):
    stat = 'Chi2' if f in categories_per_feature else 'K-S'
    fname = feature_names[f]
    # Flag each feature against the batch-level threshold returned by the detector
    is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Pclass -- Drift? No! -- Chi2 3.085 -- p-value 0.214
Sex -- Drift? No! -- Chi2 0.000 -- p-value 1.000
Age -- Drift? No! -- K-S 0.100 -- p-value 0.943
SibSp -- Drift? No! -- Chi2 2.125 -- p-value 0.547
Parch -- Drift? No! -- Chi2 2.272 -- p-value 0.321
Fare -- Drift? No! -- K-S 0.140 -- p-value 0.660
Cabin -- Drift? No! -- Chi2 88.000 -- p-value 0.333
Embarked -- Drift? No! -- Chi2 2.905 -- p-value 0.234


### Logging Drift to file

Since alibi-detect has no built-in logging, we need to log the results ourselves. We use loguru for its ease of use.

In [44]:
!pip install loguru


Collecting loguru
  Using cached loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting win32-setctime>=1.0.0
  Using cached win32_setctime-1.1.0-py3-none-any.whl (3.6 kB)
Installing collected packages: win32-setctime, loguru
Successfully installed loguru-0.6.0 win32-setctime-1.1.0




In [45]:
from loguru import logger

In [46]:
logger.add("drift_log.log", rotation="50 MB")

1

In [47]:
logger.info(preds)


2022-05-27 08:30:50.432 | INFO     | __main__:<cell line: 1>:1 - {'data': {'is_drift': 0, 'distance': array([ 3.0852714,  0.       ,  0.1      ,  2.125    ,  2.2716577,
        0.14     , 88.       ,  2.9049695], dtype=float32), 'p_val': array([0.2138168 , 1.        , 0.9426822 , 0.54687166, 0.32115582,
       0.66033244, 0.3328474 , 0.23398817], dtype=float32), 'threshold': 0.00625}, 'meta': {'name': 'TabularDrift', 'detector_type': 'offline', 'data_type': None, 'version': '0.9.1'}}
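
The log entry above is the string form of a dict holding NumPy arrays, which is awkward to parse back later. One possible refinement is to convert the prediction to plain Python types and log it as JSON. The sketch below assumes only that array-like values expose `.tolist()` (as NumPy arrays do); `to_loggable` is a hypothetical helper and `example_preds` is a stand-in for the detector output.

```python
import json
from typing import Any


def to_loggable(obj: Any) -> Any:
    """Recursively convert array-like values (anything with .tolist()) to plain Python."""
    if isinstance(obj, dict):
        return {key: to_loggable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_loggable(value) for value in obj]
    if hasattr(obj, "tolist"):  # e.g. NumPy arrays and NumPy scalars
        return obj.tolist()
    return obj


# Stand-in for the detector output, using lists in place of NumPy arrays.
example_preds = {
    "data": {"is_drift": 0, "p_val": [0.214, 1.0], "threshold": 0.00625},
    "meta": {"name": "TabularDrift"},
}

record = json.dumps(to_loggable(example_preds))
print(record)
# With loguru this could be written as: logger.info(record)
```

Each line in `drift_log.log` would then carry a JSON payload that can be loaded with `json.loads` for later analysis.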
