### Adversarial Validation

#### How similar are the train and test data?

In this notebook, I'm going to run through a process called adversarial validation.  This will help us understand if our training and testing datasets are similar.  If they are different, we know that it will be very challenging to train a model that can predict outcomes in the testing dataset.  So let's walk through the process using some recent data.  

Part One will test this on a real dataset - in this case the ACLED Data we've been working with.  

In [72]:
import pandas as pd
%matplotlib inline
import os, sys
import itertools
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression as LR
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")

#### Import the data

We will import the data set and drop unnecessary columns.  Shuffling a data set is always good practice before running it through a model.  There could be longitudinal patterns in there that could cause the model not to generalize well with future data.  

In [29]:
df = pd.read_csv('Nigeria_Large.csv')

# drop columns we don't need 
df = df.drop(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'time_precision', 'location', 'fatalities',
              'event_type', 'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2', 
              'interaction', 'region', 'country', 'admin2', 'admin3', 'geo_precision', 'source', 
             'source_scale', 'notes', 'timestamp', 'iso3'], axis=1)


# shuffle dataframe
df = shuffle(df)

# look at the first few rows. 
df.head()

Unnamed: 0,event_date,year,actor1,admin1,latitude,longitude
5058,14-Nov-15,2015,Military Forces of Nigeria (2015-),Borno,11.8464,13.1603
2909,24-Apr-17,2017,Unidentified Communal Militia (Nigeria),Lagos,6.5263,3.3571
8990,23-Feb-13,2013,Rioters (Nigeria),Taraba,7.85,9.7833
7487,25-May-14,2014,Boko Haram - Jamatu Ahli is-Sunnah lid-Dawatai...,Borno,10.6129,12.1946
5898,11-Apr-15,2015,Unidentified Armed Group (Nigeria),Ondo,6.5879,4.7436


In [30]:

new_data = pd.read_csv('Nigeria_12_15_2018_5_15_2019.csv')

# drop columns we don't need 
new_data = new_data.drop(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'time_precision', 'location', 'fatalities',
              'event_type', 'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2', 
              'interaction', 'region', 'country', 'admin2', 'admin3', 'geo_precision', 'source', 
             'source_scale', 'notes', 'timestamp', 'iso3', "sub_event_type"], axis=1)


# shuffle dataframe
new_data = shuffle(new_data)

# look at the first few rows. 
new_data.head()


Unnamed: 0,event_date,year,actor1,admin1,latitude,longitude
133,13 April 2019,2019,Unidentified Armed Group (Nigeria),Benue,7.3667,9.05
614,09 February 2019,2019,Boko Haram - Wilayat Gharb Ifriqiyyah,Adamawa,10.8667,13.5833
488,22 February 2019,2019,Fulani Ethnic Militia (Nigeria),Benue,7.8469,8.8546
334,10 March 2019,2019,Unidentified Armed Group (Nigeria),Imo,5.3321,7.1447
408,04 March 2019,2019,Rioters (Nigeria),Lagos,6.4531,3.3958


In [31]:
def describe_df(df):
    stats_df = df.describe()
    stats_df.append(pd.Series(df.isna().any(), name='nans'))
    return stats_df

print(describe_df(df))
print(describe_df(new_data))

               year      latitude     longitude
count  13261.000000  13261.000000  13261.000000
mean    2012.898349      8.434394      8.142161
std        5.171428      2.543484      3.160536
min     1997.000000      4.286500      2.722200
25%     2012.000000      6.335000      6.000000
50%     2014.000000      7.854500      7.533300
75%     2017.000000     10.799400      9.900000
max     2018.000000     13.711700     14.650000
              year    latitude   longitude
count   938.000000  938.000000  938.000000
mean   2018.906183    8.854114    8.064086
std       0.291729    2.773835    3.045769
min    2018.000000    4.452200    2.750000
25%    2019.000000    6.404175    6.311600
50%    2019.000000    8.916700    7.282150
75%    2019.000000   11.747000    9.011125
max    2019.000000   13.695700   14.472400


In [32]:
train = df
test = new_data
features = train.columns
train['target'] = 0
test['target'] = 1

In [33]:
features

Index(['event_date', 'year', 'actor1', 'admin1', 'latitude', 'longitude'], dtype='object')

In [34]:
train_test = pd.concat([df, new_data], axis = 0)

In [36]:
target = train_test['target'].values

In [37]:
# encode the data using pandas get_dummies
feat = pd.get_dummies(train_test)
# Display the first 5 rows to see how the data has changed. Note how we have wide data now.  
print(feat.iloc[:,5:].head(5))

train_test = feat

features = train_test.columns

       event_date_01 February 2019  event_date_01 January 2019  \
4055                             0                           0   
8972                             0                           0   
10881                            0                           0   
11947                            0                           0   
12890                            0                           0   

       event_date_01 March 2019  event_date_02 April 2019  \
4055                          0                         0   
8972                          0                         0   
10881                         0                         0   
11947                         0                         0   
12890                         0                         0   

       event_date_02 February 2019  event_date_02 January 2019  \
4055                             0                           0   
8972                             0                           0   
10881                            0    

In [38]:
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn import model_selection, preprocessing, metrics

param = {'num_leaves': 50,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 5,
         'learning_rate': 0.006,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 27,
         "metric": 'auc',
         "verbosity": -1}

folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(train_test))


for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_test.values, target)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(train_test.iloc[trn_idx][features], label=target[trn_idx])
    val_data = lgb.Dataset(train_test.iloc[val_idx][features], label=target[val_idx])

    num_round = 30000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 1400)
    oof[val_idx] = clf.predict(train_test.iloc[val_idx][features], num_iteration=clf.best_iteration)

fold n°0
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 1	valid_1's auc: 1
Early stopping, best iteration is:
[1]	training's auc: 1	valid_1's auc: 1
fold n°1
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 1	valid_1's auc: 1
Early stopping, best iteration is:
[1]	training's auc: 1	valid_1's auc: 1
fold n°2
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 1	valid_1's auc: 1
Early stopping, best iteration is:
[1]	training's auc: 1	valid_1's auc: 1
fold n°3
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 1	valid_1's auc: 1
Early stopping, best iteration is:
[1]	training's auc: 1	valid_1's auc: 1
fold n°4
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 1	valid_1's auc: 1
Early stopping, best iteration is:
[1]	training's auc: 1	valid_1's auc: 1


In [39]:
metrics.roc_auc_score(target, oof)

1.0

In this case we would like to see a score of around .50 which let's us know that the two datasets are identitcal.  Here we get a score of 1.0 which leads us to believe there is a huge difference in the training and testing dataset.  It would be very difficult to train a model to predict anything of value in the test dataset.  

In [40]:
x = train_test.drop([ 'target'], axis = 1)
y = train_test.target


In [41]:
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.metrics import roc_auc_score as AUC

clf = RF(n_estimators = 100, verbose = True, n_jobs = -1)
clf.fit(x, y)

p = clf.predict_proba( x )[:,1]
auc = AUC( y, p )
print(auc)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    5.0s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


1.0


### Part II Adversarial Validation on a Currated Data Set

#### Let's run through the same process on a currated dataset from the ML learning repository.  

In this case we'll use the red wine dataset.   

In [100]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url, sep=';')

df.head()



Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [101]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

train = train[train.columns[2:]]
test = test[test.columns[1:]]

features = train.columns
train['target'] = 0
test['target'] = 1

train_test = pd.concat([train, test], axis =0)

target = train_test['target'].values


In [102]:
param = {'num_leaves': 50,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 5,
         'learning_rate': 0.006,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 27,
         "metric": 'auc',
         "verbosity": -1}

folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(train_test))


for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_test.values, target)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(train_test.iloc[trn_idx][features], label=target[trn_idx])
    val_data = lgb.Dataset(train_test.iloc[val_idx][features], label=target[val_idx])

    num_round = 30000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 1400)
    oof[val_idx] = clf.predict(train_test.iloc[val_idx][features], num_iteration=clf.best_iteration)

fold n°0
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 0.94561	valid_1's auc: 0.480286
Early stopping, best iteration is:
[3]	training's auc: 0.696082	valid_1's auc: 0.550201
fold n°1
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 0.94671	valid_1's auc: 0.501313
[2000]	training's auc: 0.978154	valid_1's auc: 0.506002
[3000]	training's auc: 0.989942	valid_1's auc: 0.503438
Early stopping, best iteration is:
[1842]	training's auc: 0.975501	valid_1's auc: 0.508502
fold n°2
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 0.936589	valid_1's auc: 0.489143
Early stopping, best iteration is:
[200]	training's auc: 0.84325	valid_1's auc: 0.513714
fold n°3
Training until validation scores don't improve for 1400 rounds.
[1000]	training's auc: 0.941294	valid_1's auc: 0.492371
Early stopping, best iteration is:
[2]	training's auc: 0.676546	valid_1's auc: 0.537143
fold n°4
Training 

In [103]:
metrics.roc_auc_score(target, oof)

0.5136679046129788

Given that we're nearly at the 0.5 mark we be certain that a training and testing dataset from the red wine source are nearly identical.  This is somewhat expected since it's from a trusted machine learning repository.  In the wild you're not always going to see identical training and testing datasets. Therefore, it's important to know recognize the differences and adversarial validation can be a really helpful technique for those doing machine learning in the wild.  