# Adversarial Validation

**Description**
The objective of `Adversarial Validation` is to train a classifier that predicts whether a given row comes from the training set or from the test set.

If the two datasets came from the same distribution, this should be impossible: a classifier trained to distinguish training examples from test examples would perform no better than random, which corresponds to a ROC AUC of 0.5.

But if there are systematic differences in the feature values of your training and test datasets, then a classifier will be able to successfully learn to distinguish between them.

The better the classifier can tell them apart, the more the two datasets differ, and the bigger the problem you have.
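As a rough illustration of the idea above, here is a minimal sketch on synthetic data (not the notebook's dataset). The helper name `adversarial_auc`, the Gaussian toy features and the shuffled 5-fold split are all assumptions made for this example. It compares a "test" set drawn from the same distribution as the training set with one whose feature means are shifted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(train_X, test_X, seed=0):
    """Mean cross-validated ROC AUC of a train-vs-test classifier (hypothetical helper)."""
    X = np.vstack([train_X, test_X])
    # Label 0 = row came from the training set, 1 = row came from the test set
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(test_X))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="roc_auc")
    return scores.mean()


rng = np.random.default_rng(0)
same_train = rng.normal(0.0, 1.0, size=(1000, 3))
same_test = rng.normal(0.0, 1.0, size=(1000, 3))
# Same shape, but every feature mean is shifted by 1
shifted_test = rng.normal(1.0, 1.0, size=(1000, 3))

auc_same = adversarial_auc(same_train, same_test)
auc_shifted = adversarial_auc(same_train, shifted_test)
print(f"same distribution: AUC = {auc_same:.3f}")
print(f"shifted test set:  AUC = {auc_shifted:.3f}")
```

The first AUC should land close to 0.5 and the second well above it. That gap is the signal adversarial validation looks for.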

**Pre-requisites**

- Modules: see [requirements.txt](./requirements.txt) for the specific versions I used

**Notes**

- To get the CatBoost training GUI, you might need to [install/enable ipywidgets](https://catboost.ai/docs/installation/python-installation-additional-data-visualization-packages.html)

## Sources

In [1]:
# Widen the notebook cells to 90% of the page width
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

### 1. Data Preparation

**Loading parameters & function**

In [2]:
import os
import pandas as pd

# Loads the dataset
data_dir = './data' # location of unzipped CSVs
df_train = pd.read_csv( os.path.join(data_dir, 'train.csv') )
df_test = pd.read_csv( os.path.join(data_dir, 'test.csv') )

**Only keeping numerical columns**

In [3]:
numeric_cols = [
    'payRecMkr', 'payMethod', 'accHolderType', 'amount', 'txMonth', 'txDay', 'txWeekDay', 'logAmount', 'txTypeCode', 'payRecMkrCode', 'payMethodCode', 'accHolderTypeCode'
]

df_train = df_train[numeric_cols]
df_test = df_test[numeric_cols]
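The cell above hard-codes the list of numeric columns. As an alternative sketch, pandas can select them by dtype. The toy frame below is an assumption for illustration only: `amount` and `txMonth` are borrowed from the list above, while `payMethodName` is a made-up text column.

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({
    "amount": [14.89, 21.49],
    "txMonth": [11, 11],
    "payMethodName": ["card", "transfer"],  # hypothetical non-numeric column
})

# Keep only the columns whose dtype is numeric
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))
```

A hand-written list remains the safer choice when some numeric-looking columns (IDs, for instance) must be excluded on purpose.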

**Filling missing values**

In [4]:
# Categoricals with "<UNK>"
#df_train.loc[:,cat_cols] = df_train[cat_cols].fillna('<UNK>')
#df_test.loc[:,cat_cols] = df_test[cat_cols].fillna('<UNK>')

# Numeric with -999
df_train = df_train.fillna(-999)
df_test = df_test.fillna(-999)

**Visualise dataset contents**

In [5]:
df_train.head()

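| | payRecMkr | payMethod | accHolderType | amount | txMonth | txDay | txWeekDay | logAmount | txTypeCode | payRecMkrCode | payMethodCode | accHolderTypeCode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 14.89 | 11 | 5 | 3 | 1.172895 | 1 | 6 | 11 | 1 |
| 1 | 1 | 2 | 1 | 21.49 | 11 | 5 | 3 | 1.332236 | 1 | 3 | 12 | 1 |
| 2 | 1 | 2 | 1 | 49.25 | 11 | 5 | 3 | 1.692406 | 1 | 3 | 12 | 1 |
| 3 | 1 | 4 | 1 | 11.89 | 11 | 5 | 3 | 1.075182 | 1 | 3 | 8 | 1 |
| 4 | 1 | 4 | 1 | 5.67 | 11 | 5 | 3 | 0.753583 | 1 | 3 | 8 | 1 |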


In [6]:
df_test.head()

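| | payRecMkr | payMethod | accHolderType | amount | txMonth | txDay | txWeekDay | logAmount | txTypeCode | payRecMkrCode | payMethodCode | accHolderTypeCode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 4.89 | 10 | 23 | 4 | 0.689309 | 0 | 6 | 11 | 1 |
| 1 | 1 | 2 | 1 | 14.98 | 10 | 23 | 4 | 1.175512 | 0 | 3 | 12 | 1 |
| 2 | 1 | 3 | 1 | 3.89 | 10 | 23 | 4 | 0.58995 | 0 | 3 | 4 | 1 |
| 3 | 0 | 0 | 1 | 9.4 | 10 | 23 | 4 | 0.973128 | 0 | 6 | 11 | 1 |
| 4 | 0 | 0 | 1 | 4.89 | 10 | 23 | 4 | 0.689309 | 0 | 6 | 11 | 1 |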


**Create adversarial label**

The label indicates whether a sample comes from the test set (1) or from the training set (0).

In [7]:
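# Label each row by origin: 0 = training set, 1 = test set
df_train['is_test'] = 0
df_test['is_test'] = 1

orig_df_train = df_train.copy()

# Stack both sets into one frame; from here on df_train holds train + test rows
frames = [orig_df_train, df_test]
df_train = pd.concat(frames)

# Rename the label to 'target' for the classifier
df_train['target'] = df_train.is_test
df_train.drop( 'is_test', axis = 1, inplace = True )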
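The labelling step above can also be wrapped in a small function that leaves the input frames untouched. This is only a sketch: `make_adversarial_frame` and the toy frames are hypothetical names, and `ignore_index=True` is an addition of mine that gives the stacked frame a fresh, unique index.

```python
import pandas as pd


def make_adversarial_frame(train, test, label="target"):
    """Stack train and test rows and add a 0/1 origin label (1 = test)."""
    return pd.concat(
        # assign() returns labelled copies, so the inputs are not modified
        [train.assign(**{label: 0}), test.assign(**{label: 1})],
        ignore_index=True,  # avoid duplicate index values after stacking
    )


toy_train = pd.DataFrame({"amount": [14.89, 21.49, 49.25]})
toy_test = pd.DataFrame({"amount": [4.89, 14.98]})
adv = make_adversarial_frame(toy_train, toy_test)
print(adv)
```

Because the inputs are not modified, the original training frame stays available for fitting the real model later.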

**Cross-validating logistic regression**

In [14]:
import warnings

from sklearn.model_selection import cross_validate as CV
from sklearn.linear_model import LogisticRegression as LR

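# Silence all warnings (this also hides any ConvergenceWarning from LogisticRegression)
warnings.filterwarnings("ignore")

clf = LR()

# 5-fold cross-validation scored on both accuracy and ROC AUC
scores = CV(
    clf, df_train.drop( 'target', axis = 1 ), df_train.target,
    scoring = ('accuracy', 'roc_auc'), cv = 5, verbose = 1, return_estimator = True
)
sorted(scores.keys())
#print("mean AUC: {:.2%}, std: {:.2%} \n".format( scores['test_roc_auc'].mean(), scores['test_roc_auc'].std()))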

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   19.5s finished


['estimator', 'fit_time', 'score_time', 'test_accuracy', 'test_roc_auc']

In [13]:
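# 'test_score' is only present when cross_validate is called with a single scoring metric,
# as in the earlier run (In [13]) that produced the output below. With
# scoring = ('accuracy', 'roc_auc') the keys are 'test_accuracy' and 'test_roc_auc'.
print( scores['test_score'] )
print( scores )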

[0.39999961 0.72435821 0.64084227 0.24910181 0.16254692]
{'fit_time': array([2.89634967, 2.94893408, 2.95575595, 2.78712869, 3.29665637]), 'score_time': array([0.0369997 , 0.03200197, 0.02799845, 0.02899885, 0.03600264]), 'estimator': (LogisticRegression(), LogisticRegression(), LogisticRegression(), LogisticRegression(), LogisticRegression()), 'test_score': array([0.39999961, 0.72435821, 0.64084227, 0.24910181, 0.16254692])}
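To turn the per-fold arrays into a single figure per metric, a sketch along the following lines may help. It runs on a synthetic stand-in dataset from `make_classification`, not the notebook's data. The shuffled `StratifiedKFold` and `max_iter=1000` are assumptions of this example and differ from the `cv = 5` setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the adversarial dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    scoring=("accuracy", "roc_auc"), cv=cv,
)

# With several metrics, results are keyed as test_<metric name>
summary = {}
for metric in ("accuracy", "roc_auc"):
    fold_scores = scores[f"test_{metric}"]
    summary[metric] = (fold_scores.mean(), fold_scores.std())
    print(f"{metric}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A large standard deviation across folds, like the spread in the per-fold scores printed above, suggests the result depends heavily on how the rows were split, so the mean alone should be read with caution.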


In [9]:
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import accuracy_score as accuracy
