This notebook will be used to prototype various ideas to be used in the Kaggle IEEE CIS Fraud-Detection competition. It is unlikely to be very well organised or well annotated as I will play around with ideas as I get sparks of inspiration.

# Leaderboard score proxy

The first thing we want is an internal validation protocol that serves as a good proxy for the leaderboard score. There's no point in training and tuning models if the metric we use isn't consistent with the dynamics of the leaderboard score.

## Simple Train-Test split

This is the first idea we will investigate. We will use 80% of the data for training and cross-validating to choose hyperparameters, and the remaining 20% to test our model fit. We will use xgboost as our prototyping model.

sklearn has the train_test_split function to help us. An important question is whether or not we should shuffle the data. Strictly speaking this is time-series data, so shuffling may not be appropriate, but it is also transactional data, so any time dependencies are unlikely to be very strong.
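To make the shuffling question concrete, here is a minimal sketch of what train_test_split does with and without shuffling. The ten-element `rows` array is a hypothetical stand-in for time-ordered transactions, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten rows standing in for transactions ordered by time (hypothetical data).
rows = np.arange(10)

# Default behaviour: rows are shuffled before splitting.
shuffled_train, shuffled_test = train_test_split(rows, test_size=0.2, random_state=0)

# shuffle=False keeps the original order: the last 20% becomes the test set.
ordered_train, ordered_test = train_test_split(rows, test_size=0.2, shuffle=False)

print("shuffled test rows:", shuffled_test)
print("ordered test rows: ", ordered_test)
```

With shuffle=False the test rows are always the most recent ones, which is closer to how a leaderboard built from later transactions would behave.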

In [3]:
import pandas as pd
import numpy as np
import xgboost as xgb
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [4]:
train = pd.read_csv('Data/train_transaction.csv')

We'll also need to do some data pre-processing. Nothing fancy: categorical variables are transformed into dummies, and NAs are filled by randomly shuffling the rows and forward filling.

In [5]:
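#Creating function to deal with NAs by shuffling and forward filling.
#Note: the loop never ends if a column has no observed values at all.

def ffill(df):
    
    t0 = time.time()
    
    na_count = df.isna().sum().sum()
    while na_count>0:
        #Shuffle the rows so each gap is filled from a randomly chosen row
        df = df.sample(frac=1)
        df = df.ffill(limit=10)
        na_count = df.isna().sum().sum()

    #Restore the original row order once every NA has been filled
    df = df.sort_index()

    t1 = time.time()

    print(t1-t0)

    #Return the filled frame; without this the caller receives None
    return df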
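The shuffle-and-forward-fill idea is easier to see on a tiny frame. The following is a minimal sketch only: `shuffle_ffill`, `toy` and `filled` are hypothetical names, and the all-NaN guard is an addition that is not part of the notebook's function.

```python
import numpy as np
import pandas as pd

def shuffle_ffill(df, limit=10, seed=0):
    """Fill NaNs with values borrowed from randomly chosen other rows."""
    rng = np.random.default_rng(seed)
    # Guard (an addition in this sketch): a column with no observed
    # values can never be filled, so the loop below would not terminate.
    if df.isna().all().any():
        raise ValueError("at least one column is entirely NaN")
    while df.isna().any().any():
        # Shuffle the rows, then copy the previous row's value into each gap.
        order = rng.permutation(len(df))
        df = df.iloc[order].ffill(limit=limit)
    # Restore the original row order.
    return df.sort_index()

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [np.nan, 5.0, np.nan, 7.0]})
filled = shuffle_ffill(toy)
print(filled)
```

Observed values are never overwritten; each gap simply takes a value that was observed somewhere else in the same column.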

In [37]:
#Only using the first 12000 rows (10000 for the train/test split, 2000 held back) otherwise my 11 inch 2014 MacBook Air won't be able to handle it :/
train_sub = train.iloc[:12000,:]
fraud = train_sub['isFraud']
train_sub = train_sub.drop('isFraud', axis=1)

#Numerics
numerics = train_sub.select_dtypes(exclude='object')
numerics = ffill(numerics)

#Converting categoricals to dummies
categorical = train_sub.select_dtypes(include='object')
dummies = pd.get_dummies(categorical)

X = pd.concat([numerics, dummies], axis=1)

0.7184858322143555


In [38]:
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:10000,:],fraud.iloc[:10000],test_size=0.2)

In [8]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [39]:
#Setting up model evaluation
model = xgb.XGBClassifier(
    max_depth = 300,
    learning_rate = 0.1,
    objective = 'binary:logistic',    
)

model.fit(X_train, y_train, eval_metric = 'auc')

train_preds = model.predict_proba(X_train)
test_preds = model.predict_proba(X_test)

train_score = roc_auc_score(y_train, train_preds[:,1])
test_score = roc_auc_score(y_test, test_preds[:,1])

print(train_score,test_score)

0.9496994633833445 0.8498410652920962


Ok, so using a train-test split is not too bad, though as expected the test score (0.850) is lower than the training score (0.950). sklearn's train_test_split shuffles the data by default. I'm not sure that is the right way to go, so I'll check with an unshuffled 'final' test set.

In [40]:
final_test_preds = model.predict_proba(X.iloc[10000:12000,:])
final_test_score = roc_auc_score(fraud.iloc[10000:12000], final_test_preds[:,1])
print(train_score, test_score, final_test_score)

0.9496994633833445 0.8498410652920962 0.7357435334065714


Well, that's certainly interesting. The score on the unshuffled holdout set is even worse (0.736)! This may mean that shuffling our data biases our in-sample metrics upwards.

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:10000,:],fraud.iloc[:10000],test_size=0.2, shuffle=False)

model.fit(X_train, y_train, eval_metric = 'auc')

train_preds = model.predict_proba(X_train)
test_preds = model.predict_proba(X_test)
final_test_preds = model.predict_proba(X.iloc[10000:12000,:])

train_score = roc_auc_score(y_train, train_preds[:,1])
test_score = roc_auc_score(y_test, test_preds[:,1])
final_test_score = roc_auc_score(fraud.iloc[10000:12000], final_test_preds[:,1])

print(train_score, test_score, final_test_score)

0.9522543059777102 0.8063564101005598 0.7520473384600019


Nope, doesn't seem to be the case here at all. Shuffling doesn't seem to have much effect on the gap between the train, test and final test scores. Rather, it seems to be the distance in time between the training set and the test set that causes the loss in generalisation. Let's see if we can test this.

In [19]:
train_sub = train.iloc[:100000,:]
fraud = train_sub['isFraud']
train_sub = train_sub.drop('isFraud', axis=1)

#Numerics
numerics = train_sub.select_dtypes(exclude='object')
numerics = ffill(numerics)

#Converting categoricals to dummies
categorical = train_sub.select_dtypes(include='object')
dummies = pd.get_dummies(categorical)

X = pd.concat([numerics, dummies], axis=1)

8.767240047454834


In [35]:
X_train = X.iloc[:10000,:]
y_train = fraud.iloc[:10000]

model.fit(X_train, y_train, eval_metric = 'auc')

scores = []
for index in [1,2,3,4,5,6,7,8,9]:
    start = (index - 1)*1000
    end = index*1000
    preds = model.predict_proba(X.iloc[start:end,:])
    test = fraud.iloc[start:end]
    score = roc_auc_score(test,preds[:,1])
    scores.append(score)

In [36]:
print(scores, np.mean(scores))

[0.9639255499153977, 0.9534038856420786, 0.8998076487893187, 0.9695824795081966, 0.9179829410835572, 0.9711627515365833, 0.95447006791976, 0.9578252032520326, 0.9702886423273208] 0.9509387966638051


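Looks like it was a false alarm: there is inherent variation in the scores from one 1000-row slice to the next because of the randomness of the data, but when you average them out (0.951) the result is pretty close to the training scores we're getting. Bear in mind that the nine slices scored above cover rows 0 to 9000, which all sit inside the 10000 rows the model was fitted on. These are therefore in-sample scores, so a mean close to the training score is what we should expect, and they don't yet tell us how the score changes with distance in time from the training set.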
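One way to probe how the score changes with distance from the training window is to score consecutive blocks of rows that all come after the last training row. The following is a minimal sketch on synthetic data, with LogisticRegression as a lightweight stand-in for the xgboost model. `X_demo`, `y_demo`, `clf` and `block_scores` are hypothetical names, and because the synthetic data has no drift the sketch shows only the mechanics, not the expected result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, time-ordered stand-in for X and fraud (not the competition data).
n_rows, n_train, block = 2000, 1000, 200
X_demo = rng.normal(size=(n_rows, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=n_rows) > 0.5).astype(int)

# Fit on the earliest rows only.
clf = LogisticRegression().fit(X_demo[:n_train], y_demo[:n_train])

# Score each later block separately; every block starts at or after n_train,
# so none of the scored rows were seen during fitting.
block_scores = []
for start in range(n_train, n_rows, block):
    end = start + block
    preds = clf.predict_proba(X_demo[start:end])[:, 1]
    block_scores.append(roc_auc_score(y_demo[start:end], preds))

print(block_scores, np.mean(block_scores))
```

On the real data, a downward trend across successive blocks would support the time-distance explanation, while flat but noisy scores would point to sampling variation.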

In [34]:
preds = model.predict_proba(X.iloc[start:end,:])
print(preds[:,1])

[0.9946369  0.9971597  0.9963649  0.98149884 0.9874941  0.9829064
 0.9992461  0.94470257 0.9970029  0.9927669  0.9967974  0.99951786
 0.99021554 0.9981632  0.99385613 0.99472964 0.9983237  0.9994455
 0.9995038  0.9698555  0.9986267  0.697731   0.9989849  0.9960668
 0.98827666 0.99734145 0.99609506 0.9959356  0.9985062  0.2418536
 0.99712455 0.9990179  0.9993223  0.99455065 0.99712455 0.99745774
 0.9820326  0.9959274  0.99609506 0.95665723 0.6432842  0.99609506
 0.9162802  0.9889068  0.99918956 0.9974291  0.98828125 0.99649346
 0.99574125 0.9994194  0.9959356  0.94470257 0.96324563 0.99927735
 0.9970825  0.98827666 0.72831094 0.9805087  0.9985814  0.99948657
 0.9986864  0.99574125 0.9988583  0.8720083  0.99367344 0.990801
 0.9920441  0.9980923  0.99591863 0.9966662  0.9927669  0.9978156
 0.9944435  0.9874742  0.92351556 0.3865987  0.9993223  0.9792002
 0.98067313 0.3865987  0.9985352  0.99786395 0.82444745 0.9896744
 0.99803275 0.96324563 0.9920441  0.99803275 0.9871452  0.3865987
 0.38

## k-fold cross-validation

The next validation protocol we'll consider using is k-fold cross-validation. 
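As a rough sketch of what this protocol looks like, the loop below splits the rows into five folds, fits on four and scores AUC on the fifth, rotating through all of them. It runs on synthetic data with LogisticRegression standing in for the xgboost model. `X_demo`, `y_demo` and `fold_scores` are hypothetical names, and the choice of five unshuffled folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for X and fraud (not the competition data).
X_demo = rng.normal(size=(1000, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=1000) > 0.5).astype(int)

# shuffle=False keeps each validation fold a contiguous block of rows.
kfold = KFold(n_splits=5, shuffle=False)

fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Refit from scratch on the training folds, score on the held-out fold.
    clf = LogisticRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = clf.predict_proba(X_demo[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[val_idx], preds))

print(fold_scores, np.mean(fold_scores))
```

With a rare positive class such as isFraud, sklearn's StratifiedKFold is a drop-in alternative that keeps the class ratio similar in every fold.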

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 