# 11-10-2016

In this script, I examined the performance of three classifiers (Logistic Regression, Random Forest, and AdaBoost) and an undersampling method on this highly unbalanced dataset. The undersampling technique, [EasyEnsemble][1], proposed by Liu, Wu, and Zhou (2008), samples a subset of the negative cases to create a balanced dataset. A classifier is then trained on this reduced dataset and generates predictions for the test set. This procedure is repeated multiple times and the test predictions are aggregated.

Without any hyperparameter tuning, I found that the undersampling technique provided little to no improvement. I think this is because most classifiers already do a decent job without any modifications. This result is consistent with the ones reported by Liu et al. (2008).


  [1]: http://ieeexplore.ieee.org/document/4717268/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
random.seed(2016)
np.random.seed(2016) # scikit-learn draws from NumPy's global RNG when random_state is not set

In [None]:
df = pd.read_csv('../input/creditcard.csv')
df.head()

I dropped the purchase time and normalized the purchase amount.

In [None]:
# Normalize the purchase amount
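# StandardScaler expects a 2D input, so select the column as a one-column DataFrame
# (Series.as_matrix() has been removed from pandas)
df['normAmount'] = StandardScaler().fit_transform(df[['Amount']]).ravel()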
df = df.drop(['Time', 'Amount'], axis = 1) #Drop the time and amount columns
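
As a quick check of what the scaling step does, the sketch below applies `StandardScaler` to a few made-up amounts (the values are illustrative, not taken from the dataset) and compares the result with the z-score computed by hand, `(x - mean) / std`, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical purchase amounts, for illustration only
amounts = np.array([1.0, 10.0, 100.0, 1000.0])

# StandardScaler expects a 2D array of shape (n_samples, n_features)
scaled = StandardScaler().fit_transform(amounts.reshape(-1, 1)).ravel()

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and unit variance, which keeps the amount on a scale comparable to the other features.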

In [None]:
df['Class'].describe() # 0.173% of positive cases
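
The positive rate can be read off `describe()` because the mean of a 0/1 column is the fraction of ones. A minimal sketch on a toy target column (the counts are made up for illustration):

```python
import pandas as pd

# Toy binary target: 3 positives out of 1,000 rows (illustrative only)
labels = pd.Series([1] * 3 + [0] * 997, name='Class')

# For a 0/1 column, the mean is the fraction of positive cases
positive_rate = labels.mean()
print('%.3f%% positive' % (100 * positive_rate))

# value_counts gives the raw class counts behind that rate
counts = labels.value_counts()
print(counts)
```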

In [None]:
# Save the features and target
features = [col for col in df.columns if col not in ['Class']]
target = 'Class'

In [None]:
n_fold = 5
test_size = 0.1
n_iterations = 40
# Set up stratified shuffle splits (repeated random train/test splits, not a strict K-fold)
sss = StratifiedShuffleSplit(n_splits = n_fold, test_size = test_size)

This function implements the EasyEnsemble algorithm described by Liu et al. (2008) and returns the average predicted probability for each case in the test set.

In [None]:
def EasyEnsemble(train_df, test_df, features, target, clf, n_iterations):
    num_pos = np.sum(train_df[target])
    neg_train_df = train_df[train_df[target] == 0]
    pos_train_df = train_df[train_df[target] == 1]

    # One column of test-set probabilities per iteration
    test_predictions = np.empty((test_df.shape[0], n_iterations))
    
    for i in range(n_iterations):
        classifier = clf()
        # Sample the same number of negative cases as positive cases
        neg_sample = neg_train_df.sample(num_pos, random_state = i)
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        subset = pd.concat([neg_sample, pos_train_df])
        
        # Fit the classifier to the balanced dataset
        classifier.fit(subset[features], np.ravel(subset[target]))
        prediction = classifier.predict_proba(test_df[features])[:,1]
        test_predictions[:, i] = prediction
    
    # Average all the test predictions
    ensemble_predictions = np.mean(test_predictions, axis = 1)
    
    return(ensemble_predictions)
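
To see the procedure run end to end without the credit card data, here is a minimal sketch of the same idea on synthetic data. It mirrors the function above (balanced undersampling, then averaging predicted probabilities) rather than the full algorithm from the paper. The dataset, the choice of `LogisticRegression`, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (roughly 5% positives), for illustration only
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test = X[1500:]

pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)

n_iterations = 10
rng = np.random.RandomState(0)
test_predictions = np.empty((X_test.shape[0], n_iterations))

for i in range(n_iterations):
    # Draw as many negatives as there are positives, without replacement
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    subset = np.concatenate([pos_idx, neg_sample])

    # Fit one model per balanced subset
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[subset], y_train[subset])
    test_predictions[:, i] = clf.predict_proba(X_test)[:, 1]

# Average the predicted probabilities across the balanced models
ensemble_predictions = test_predictions.mean(axis=1)
print(ensemble_predictions.shape)
```

Each model sees every positive case but only a small random slice of the negatives, so averaging over many iterations is what lets the ensemble use more of the negative class.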

Since there are so few positive cases in the dataset, I found it necessary to evaluate over several train/test splits to get more consistent results. I used stratified splits (`StratifiedShuffleSplit`) to make sure the train and test sets have approximately the same proportion of positive cases.
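
The effect of stratification can be checked on a toy target. This is a sketch under assumed values: 1,000 synthetic samples with 2% positives and placeholder features, none of which come from the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 1,000 samples with 2% positives (illustrative only)
y = np.zeros(1000, dtype=int)
y[:20] = 1
X = np.arange(1000).reshape(-1, 1)  # placeholder features

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

test_rates = []
for train_idx, test_idx in splitter.split(X, y):
    # Fraction of positives in this test set; should stay close to 2%
    test_rates.append(y[test_idx].mean())

print(test_rates)
```

With a plain random split, a 100-row test set drawn from data this skewed could easily end up with no positives at all, which would make ROC and precision-recall metrics undefined for that split.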

In [None]:
def KFoldPrediction(df, features, target, kSplit, clf, ensemble = False, n_iterations = 0):
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    
    # Precision is averaged on a common recall grid
    mean_precision = 0.0
    mean_recall = np.linspace(0, 1, 100)
    
    mean_precision_score = 0.0
    
    # Loop through the CV indexes
    for i, (train_index, test_index) in enumerate(kSplit.split(df[features], df[target])):
        
        train_df, test_df = df.iloc[train_index], df.iloc[test_index]
        
        # Fit a single classifier, or call the EasyEnsemble function for the ensemble method
        if not ensemble:
            classifier = clf()
            classifier.fit(train_df[features], np.ravel(train_df[target]))
            proba = classifier.predict_proba(test_df[features])[:, 1]
        else:
            proba = EasyEnsemble(train_df, test_df, features, target, clf, n_iterations)
            
        # Get the FPR, TPR 
        fpr, tpr, thresholds = roc_curve(test_df[target], proba)
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        
        # Get the precision and recall
        precision, recall, thresholds = precision_recall_curve(test_df[target], proba)
        # np.interp needs increasing x values; recall is returned in decreasing order,
        # so reverse both arrays and interpolate precision as a function of recall
        mean_precision += np.interp(mean_recall, recall[::-1], precision[::-1])

        mean_precision_score += average_precision_score(test_df[target], proba)
    
    # Use the splitter passed in as an argument rather than the global sss
    nfold = kSplit.get_n_splits(df[features], df[target])
    
    # Average the totals for TPR, precision, precision score
    mean_tpr /= nfold
    mean_precision /= nfold
    mean_precision_score /= nfold
    
    return(mean_fpr, mean_tpr, mean_precision, mean_recall, mean_precision_score)
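
The averaging in `KFoldPrediction` relies on `np.interp` to put curves from different splits onto a common grid before summing them. A small worked example with two made-up ROC curves (the points are hypothetical) shows the mechanics:

```python
import numpy as np

# Two hypothetical ROC curves sampled at different FPR values
fpr_a, tpr_a = np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.8, 1.0])
fpr_b, tpr_b = np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.5, 1.0])

# Common grid of FPR values on which both curves are evaluated
grid = np.linspace(0, 1, 5)

mean_tpr = np.zeros_like(grid)
for fpr, tpr in [(fpr_a, tpr_a), (fpr_b, tpr_b)]:
    # Linear interpolation; the x values (fpr) must be increasing
    mean_tpr += np.interp(grid, fpr, tpr)
mean_tpr /= 2

print(mean_tpr)
```

Because `np.interp` does not check that its x values are increasing, passing an unsorted or decreasing array returns meaningless numbers without raising an error.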

# Graph all the ROC and Precision-Recall Curves

In [None]:
plt.style.use('fivethirtyeight')
classifiers = {'Logistic Regression': LogisticRegression,
               'Random Forest': RandomForestClassifier,
               'AdaBoost': AdaBoostClassifier,
               'Ensemble_Random Forest': RandomForestClassifier,}

# Not the most elegant implementation, need to rewrite this when I have time
for key in classifiers:
    # Keys prefixed with 'Ensemble' are evaluated with EasyEnsemble
    if key.split('_')[0] == 'Ensemble':
        fpr, tpr, precision, recall, precision_score = KFoldPrediction(df, features, target, sss, classifiers[key], True, n_iterations)
    else:
        fpr, tpr, precision, recall, precision_score = KFoldPrediction(df, features, target, sss, classifiers[key])
    
    roc_auc = auc(fpr, tpr)
    
    # Graph ROC Curve
    plt.figure(1)
    plt.plot(fpr, tpr,
             label='%s ROC (area = %0.2f)' % (key, roc_auc))
    plt.legend(loc=4, fontsize = 8)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    
    # Graph Precision-Recall Curve
    plt.figure(2)
    plt.plot(recall, precision,
             label='%s Precision (area = %0.2f)' % (key, precision_score))
    plt.legend(loc=3, fontsize = 8)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])

As you can see above, the undersampling technique (EasyEnsemble) did not result in a substantial improvement in classification. In fact, compared to the original classifier (Random Forest), the undersampling technique seems to degrade classification performance. It would be interesting to see whether this result holds for other classifiers or ensemble methods.