### Purpose

Estimate how different the training and test datasets are in this competition. We'll create a binary classification target in which the positive label is assigned to the test data, then train a classifier and evaluate it with the ROC-AUC metric. A ROC-AUC near 0.5 means the classifier cannot tell the two sets apart. If the ROC-AUC is high, the training and test datasets are easily distinguishable, and we shouldn't expect models trained on the training set to generalize easily to the test set, because the two come from different distributions.
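To make the idea concrete before touching the competition data, here is a minimal sketch on synthetic data. The helper name `adversarial_auc`, the column names, and the sample sizes are illustrative assumptions and not part of the original notebook. It should show a ROC-AUC close to 0.5 when both sets are drawn from the same distribution and a much higher value when the "test" set is shifted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_df, test_df, seed=0):
    """Cross-validated ROC-AUC of a classifier predicting 'is this row from test?'.

    Hypothetical helper for illustration only.
    """
    X = pd.concat([train_df, test_df], ignore_index=True)
    # Label 0 for training rows, 1 for test rows.
    y = np.r_[np.zeros(len(train_df), dtype=int), np.ones(len(test_df), dtype=int)]
    model = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    return roc_auc_score(y, proba[:, 1])


rng = np.random.default_rng(0)
cols = ["a", "b", "c"]  # made-up feature names
base = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)
same = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)
shifted = pd.DataFrame(rng.normal(loc=2.0, size=(1000, 3)), columns=cols)

auc_same = adversarial_auc(base, same)
auc_shifted = adversarial_auc(base, shifted)
print(f"same distribution:    {auc_same:.3f}")
print(f"shifted distribution: {auc_shifted:.3f}")
```

The exact values depend on the random seed, so treat them as indicative rather than precise.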

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.ensemble import ExtraTreesClassifier # faster than RandomForest for large datasets
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

In [2]:
train = pd.read_csv('/kaggle/input/playground-series-s3e4/train.csv')[24315:192943]
test = pd.read_csv('/kaggle/input/playground-series-s3e4/test.csv')
print(train.shape)
print(test.shape)

(168628, 32)
(146087, 31)


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168628 entries, 24315 to 192942
Data columns (total 32 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      168628 non-null  int64  
 1   Time    168628 non-null  float64
 2   V1      168628 non-null  float64
 3   V2      168628 non-null  float64
 4   V3      168628 non-null  float64
 5   V4      168628 non-null  float64
 6   V5      168628 non-null  float64
 7   V6      168628 non-null  float64
 8   V7      168628 non-null  float64
 9   V8      168628 non-null  float64
 10  V9      168628 non-null  float64
 11  V10     168628 non-null  float64
 12  V11     168628 non-null  float64
 13  V12     168628 non-null  float64
 14  V13     168628 non-null  float64
 15  V14     168628 non-null  float64
 16  V15     168628 non-null  float64
 17  V16     168628 non-null  float64
 18  V17     168628 non-null  float64
 19  V18     168628 non-null  float64
 20  V19     168628 non-null  float64
 21  V20   

All features are numeric, so we don't need to do any categorical encoding. There are no missing values either.
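Rather than reading these two facts off the `info()` printout, they can also be checked programmatically. The sketch below uses a small toy frame and a hypothetical helper, `check_numeric_and_complete`, which is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numeric_and_complete(df):
    """Return (non-numeric column names, {column: missing count} for columns with NaNs).

    Hypothetical helper for illustration only.
    """
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    missing = df.isna().sum()
    return non_numeric, {col: int(n) for col, n in missing.items() if n > 0}


# Toy frame with one incomplete column and one non-numeric column.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [0.5, np.nan, 1.5],
    "z": ["a", "b", "c"],
})
non_numeric, missing_counts = check_numeric_and_complete(toy)
print(non_numeric)      # columns that would need encoding
print(missing_counts)   # columns that would need imputation
```

If the statement above holds, calling the helper on `train` and `test` should return an empty list and an empty dict for each.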

In [4]:
# drop id, Time, and target column
train = train.drop(['id', 'Time', 'Class'], axis=1)
test = test.drop(['id', 'Time'], axis=1)

In [5]:
# Prepare target labels. The test set is assigned the positive label.
X = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0
y = [0] * len(train) + [1] * len(test)

In [6]:
# create a model and make predictions
model = ExtraTreesClassifier()
cv_preds = cross_val_predict(model, X, y, cv=5, n_jobs=-1, method='predict_proba')

In [7]:
# use ROC-AUC to see if the classifier can spot the difference between train and test
roc_auc_score(y, cv_preds[:, 1])

0.9939134087387652

Since the ROC-AUC is high, let's see which features were most important in distinguishing between train and test.

In [8]:
# cross_val_predict fits clones of the estimator, so `model` itself is still unfitted
model.fit(X, y)
ranks = sorted(zip(X.columns, model.feature_importances_),
               key=lambda x: x[1], reverse=True)
for feature, score in ranks:
    print(f"{feature:10} : {score:0.4f}")

V3         : 0.1814
V25        : 0.0833
V1         : 0.0745
V15        : 0.0581
V4         : 0.0519
V22        : 0.0494
V5         : 0.0409
V11        : 0.0382
V9         : 0.0317
V24        : 0.0300
V6         : 0.0294
V26        : 0.0288
V7         : 0.0239
V18        : 0.0236
V14        : 0.0228
V28        : 0.0227
V10        : 0.0225
V17        : 0.0219
V23        : 0.0206
V16        : 0.0201
V2         : 0.0179
V19        : 0.0177
V8         : 0.0161
V21        : 0.0157
V12        : 0.0126
V20        : 0.0122
V27        : 0.0111
Amount     : 0.0111
V13        : 0.0099
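Tree-based importances show which features the classifier relied on, but not how each feature's distribution differs. One possible cross-check, which the original notebook does not perform, is a per-feature two-sample Kolmogorov-Smirnov test with `scipy.stats.ks_2samp`. The sketch below runs on synthetic data. The helper `ks_by_feature` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_by_feature(train_df, test_df):
    """Two-sample KS statistic per shared column, largest difference first.

    Hypothetical helper for illustration only.
    """
    rows = []
    for col in train_df.columns:
        result = ks_2samp(train_df[col], test_df[col])
        rows.append((col, result.statistic, result.pvalue))
    table = pd.DataFrame(rows, columns=["feature", "ks_stat", "p_value"])
    return table.sort_values("ks_stat", ascending=False).reset_index(drop=True)


rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "stable": rng.normal(size=2000),
    "drifted": rng.normal(size=2000),
})
toy_test = pd.DataFrame({
    "stable": rng.normal(size=2000),
    "drifted": rng.normal(loc=1.0, size=2000),  # mean shifted by 1
})

ks_table = ks_by_feature(toy_train, toy_test)
print(ks_table)
```

Applied to the notebook's data, this would be called as `ks_by_feature(train, test)` after the id, Time, and target columns have been dropped. A feature with a large KS statistic differs in its marginal distribution, whereas the classifier can also pick up differences that only appear in combinations of features, so the two rankings need not agree.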
