# Detecting Dataset Drift 

The basic idea for detecting shift: combine the train and test files, label each row by its origin, and train a classifier to tell the two apart. If the data has shifted, the classifier will be able to assign an instance of the mixed dataset to train or test with reasonable accuracy.

The classifier is scored by ROC AUC. Values close to 0.5 indicate that the classifier cannot discriminate between the two datasets, which can be interpreted as no discernible shift in the data. Values close to 1 indicate that the dataset shift is detectable, and we may need to revisit the modeling.
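To make the two regimes concrete, here is a minimal sketch on synthetic data rather than the notebook's files. The helper name `drift_auc`, the column names, the sample sizes, and the size of the shift are all assumptions chosen for illustration; the classifier settings mirror the ones used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def drift_auc(train_df, test_df, seed=0):
    """Mean cross-validated ROC AUC of a classifier predicting each row's origin."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    # Origin label: 0 for rows from the training frame, 1 for rows from the test frame
    origin = np.r_[np.zeros(len(train_df), dtype=int),
                   np.ones(len(test_df), dtype=int)]
    model = RandomForestClassifier(n_estimators=50, max_depth=5,
                                   min_samples_leaf=5, random_state=seed)
    scores = cross_val_score(model, combined, origin, cv=5, scoring="roc_auc")
    return scores.mean()


# Synthetic data (an assumption for illustration, not the notebook's dataset)
rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
train = pd.DataFrame(rng.normal(0.0, 1.0, size=(500, 3)), columns=cols)
same = pd.DataFrame(rng.normal(0.0, 1.0, size=(500, 3)), columns=cols)
shifted = pd.DataFrame(rng.normal(1.5, 1.0, size=(500, 3)), columns=cols)

auc_same = drift_auc(train, same)        # same distribution: expect roughly 0.5
auc_shifted = drift_auc(train, shifted)  # shifted means: expect close to 1
print(f"no shift:   {auc_same:.3f}")
print(f"with shift: {auc_shifted:.3f}")
```

Scores slightly below 0.5 can also appear when there is no shift; they reflect sampling noise in the cross-validation folds rather than a meaningful signal.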

In [19]:
from google.colab import drive
drive.mount('/content/drive/')

import os
os.chdir('/content/drive/My Drive/Colab Notebooks/dataset-drift/python')  

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [20]:
# Importing the dataset
import pandas as pd 
import numpy as np
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train_ovolvulus = pd.read_csv('../input/ov_training.csv')

train_ovolvulus.replace([np.inf, -np.inf], np.nan, inplace=True)
train_ovolvulus.dropna(inplace=True)


test_ovolvulus = pd.read_csv('../input/ov_holdout.csv')

test_ovolvulus.replace([np.inf, -np.inf], np.nan, inplace=True)
test_ovolvulus.dropna(inplace=True)




In [21]:
feature_columns = [col for col in train_ovolvulus if col.startswith('feat_')]

In [22]:
# Copy the feature subsets so that adding a column does not write to a slice of the original frames
training = train_ovolvulus[feature_columns].copy()
testing = test_ovolvulus[feature_columns].copy()

In [23]:
# Label each row by origin: 0 = training set, 1 = holdout set
training['origin'] = 0
testing['origin'] = 1


In [25]:
## combining the labelled training and holdout rows
combi = pd.concat([training, testing])  # DataFrame.append was removed in pandas 2.0
y = combi['origin']

In [26]:
combi.drop('origin',axis=1,inplace=True)
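
Scoring one feature at a time can be sanity-checked on synthetic data where the drifting column is known in advance. The sketch below is an illustration under stated assumptions: the function `drifting_features`, the column names `stable` and `moved`, and the size of the shift are hypothetical, while the model settings and the 0.6 threshold follow this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def drifting_features(features, origin, threshold=0.6, cv=2, seed=0):
    """Return per-feature mean ROC AUC and the features whose AUC exceeds threshold."""
    model = RandomForestClassifier(n_estimators=50, max_depth=5,
                                   min_samples_leaf=5, random_state=seed)
    scores = {}
    for col in features.columns:
        # Fit on a single column so the score reflects that feature alone
        auc = cross_val_score(model, features[[col]], origin,
                              cv=cv, scoring="roc_auc")
        scores[col] = auc.mean()
    flagged = [col for col, s in scores.items() if s > threshold]
    return scores, flagged


# Synthetic frames (assumed for illustration): only "moved" changes between sets
rng = np.random.default_rng(42)
n = 400
train = pd.DataFrame({"stable": rng.normal(size=n),
                      "moved": rng.normal(size=n)})
test = pd.DataFrame({"stable": rng.normal(size=n),
                     "moved": rng.normal(loc=2.0, size=n)})

features = pd.concat([train, test], ignore_index=True)
origin = np.repeat([0, 1], n)  # 0 = train rows, 1 = test rows

scores, flagged = drifting_features(features, origin)
for col, s in scores.items():
    print(col, round(s, 3))
print("flagged:", flagged)
```

Fixing `random_state` makes the scores repeatable between runs, which helps when comparing a feature that sits close to the threshold.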

In [27]:
## modelling
# Score each feature on its own: how well does it separate training rows from holdout rows?
model = RandomForestClassifier(n_estimators=50, max_depth=5, min_samples_leaf=5)
drop_list = []
for i in combi.columns:
  score = cross_val_score(model, combi[[i]], y, cv=2, scoring='roc_auc')
  # Flag features whose mean AUC exceeds 0.6 as drifting
  if np.mean(score) > 0.6:
    drop_list.append(i)
  print(i, np.mean(score))

feat_seq_entropy 0.5213601767926039
feat_C_atoms 0.4934290417385172
feat_H_atoms 0.497245084084133
feat_N_atoms 0.504810187592698
feat_O_atoms 0.4791842441351505
feat_S_atoms 0.48251156252602534
feat_molecular_weight 0.48452168282710895
feat_Perc_Tiny 0.4925561372996471
feat_Perc_Small 0.4589991764542758
feat_Perc_Aliphatic 0.561918957731016
feat_Perc_Aromatic 0.5243933097351836
feat_Perc_NonPolar 0.5132169040335831
feat_Perc_Polar 0.5124457485485514
feat_Perc_Charged 0.4876408041468788
feat_Perc_Basic 0.5207149266216561
feat_Perc_Acidic 0.4882471764039641
feat_PP1 0.507569143217291
feat_PP2 0.5063654767521806
feat_PP3 0.5035034592475216
feat_KF1 0.5326749962487123
feat_KF2 0.508341930120658
feat_KF3 0.4973599486047243
feat_KF4 0.47162496293682304
feat_KF5 0.5000134726305049
feat_KF6 0.5493783979090379
feat_KF7 0.46588921763634306
feat_KF8 0.4558872681059721
feat_KF9 0.47666028299889907
feat_KF10 0.5023157125811297
feat_Z1 0.46970645142209055
feat_Z2 0.4891255812559514
feat_Z3 0.514788