## Credit Card Fraud Detection
  
#### Author : Rahul Choudhry
  
#### Description:  
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
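To make the point about accuracy concrete, here is a minimal sketch on synthetic data (not the real dataset). The labels, scores and variable names are assumptions made up for illustration; only the scikit-learn metric functions are real.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, auc, average_precision_score,
                             precision_recall_curve)

# Synthetic, highly imbalanced labels (illustrative only): 0.2% positives
rng = np.random.RandomState(0)
y_true = np.zeros(10000, dtype=int)
y_true[:20] = 1

# Hypothetical model scores: positives tend to score higher than negatives
y_score = rng.uniform(0.0, 0.6, size=y_true.size)
y_score[:20] += 0.35

# A "model" that never predicts fraud is still right 99.8% of the time
always_normal = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, always_normal)
print("Accuracy of 'always normal':", baseline_accuracy)

# Area under the precision-recall curve (trapezoidal rule) and the
# closely related average precision score
precision, recall, _ = precision_recall_curve(y_true, y_score)
auprc = auc(recall, precision)
avg_precision = average_precision_score(y_true, y_score)
print("AUPRC:", auprc)
print("Average precision:", avg_precision)
```

The accuracy of the trivial baseline is driven entirely by the class ratio, while the precision-recall summary depends on how well the frauds are ranked.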

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

In [None]:
import pandas as pd
import numpy as np 
#import tensorflow as tf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.gridspec as gridspec
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
%matplotlib inline

In [None]:
df = pd.read_csv("../input/creditcard.csv")

In [None]:
print(df.shape)
print(df.describe())
print(df.isnull().sum())
print (df.info())

In [None]:
df['Class'].value_counts()

## Time:
The Time variable is the time elapsed in seconds between each transaction and the first transaction in the dataset. For now, we keep the field until we are sure whether it has any value. We plot a histogram of Time for fraudulent and normal transactions. There are a couple of peaks in the fraudulent transactions. At the time of the first peak (elapsed time = 40K seconds), there is also a large number of normal transactions. The second peak occurs at about 90K seconds after the first transaction, when the number of normal transactions is very low.

We can also see that the normal transactions show a trend. The first uptrend began at about 25K seconds, and the decline started at about 75K seconds. The delta between the two, ~50K seconds, is approximately 14 hours. This sounds intuitively correct and could correspond to transactions happening during daytime hours. The difference between the two bottoms in the normal transactions is ~82K seconds, or roughly 1 day.
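The seconds-to-hours arithmetic above can be checked with a few lines. This is only a sketch: the figures are the approximate values read off the histogram, and `toy` and `hour_offset` are hypothetical names, not columns of the real dataset.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600

# Approximate figures quoted in the text above
uptrend_start, decline_start = 25_000, 75_000
active_hours = (decline_start - uptrend_start) / SECONDS_PER_HOUR
cycle_hours = 82_000 / SECONDS_PER_HOUR
print(round(active_hours, 1))  # about 13.9 hours
print(round(cycle_hours, 1))   # about 22.8 hours, close to one day

# Hypothetical helper column: whole hours elapsed, wrapped to a 24-hour cycle.
# The dataset does not say at what clock time the first transaction happened,
# so this is an offset from an unknown start, not a true hour of day.
toy = pd.DataFrame({"Time": [0, 40_000, 90_000, 172_000]})
toy["hour_offset"] = (toy["Time"] // SECONDS_PER_HOUR) % 24
print(toy)
```

A wrapped hour offset like this is one possible way to expose the daily cycle to a model, if the Time field turns out to be useful.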

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 50

ax1.hist(df.Time[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

bins = 100
ax2.hist(df.Time[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time - in Seconds')
plt.ylabel('Number of Transactions')

## Amount:

Next we look at the summary stats of the transaction amount field for fraudulent and normal transactions. The IQR of fraudulent transactions runs from $1 to $105 and the median is $9. The mean is $122, and its large difference from the median is due to the outliers on the right side of the distribution. The fraudulent transactions also have a large standard deviation of $256.

For normal transactions, the IQR runs from $5 to $77. The difference between the mean ($88) and the median ($22) is $66, which is tighter than the $113 difference for fraudulent transactions.

The histograms below show the distributions of both the transaction types.

In [None]:
print ("Fraud")
print (df.Amount[df.Class == 1].describe())
print ()
print ("Normal")
print (df.Amount[df.Class == 0].describe())

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 30

ax1.hist(df.Amount[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

bins = 100

ax2.hist(df.Amount[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')

The scatterplots of time elapsed against transaction amount are grouped by transaction type. We do see some extreme outliers in the fraud transactions during periods of low volume for normal transactions.

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,6))

ax1.scatter(df.Time[df.Class == 1], df.Amount[df.Class == 1])
ax1.set_title('Fraud')

ax2.scatter(df.Time[df.Class == 0], df.Amount[df.Class == 0])
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')

## PCA Transformed features: 

As mentioned in the description above, this dataset has 28 numerical features (V1 to V28) that were obtained as a result of PCA. We do not have any business context for what those fields mean. In the series of plots below, we plot histograms overlaid with density estimates for each of these variables. The plots are color coded by the type of transaction. **Green ~ Normal,**  **Blue ~ Fraud.** We will visually inspect these distributions and keep only the variables where we see a clear distinction.

In [None]:
#Select only the anonymized features.
v_features = df.iloc[:,1:29].columns
plt.figure(figsize=(12,28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(df[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(df[cn][df.Class == 1], bins=50)
    sns.distplot(df[cn][df.Class == 0], bins=100)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))

We drop some variables because they have very similar distributions for both types of transactions.

In [None]:
df = df.drop(['V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8'], axis =1)

Performing scaling on Amount and Time field as a necessary data transformation step before modeling. 

In [None]:
from sklearn.preprocessing import StandardScaler

df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1)).ravel()
df = df.drop(['Amount'],axis=1)
df['normTime'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1)).ravel()
df = df.drop(['Time'],axis=1)
df.head(2)

In [None]:
X = df.loc[:, df.columns != 'Class']
y = df.loc[:, df.columns == 'Class']

Undersampling the normal transactions so that the number of normal transactions is 3 times the fraudulent transactions. This is to overcome the extreme imbalance between the two classes as described above.

In [None]:
number_records_fraud = len(df[df.Class == 1])
fraud_indices = np.array(df[df.Class == 1].index)


normal_indices = df[df.Class == 0].index


random_normal_indices = np.random.choice(normal_indices, number_records_fraud*3, replace = False)
random_normal_indices = np.array(random_normal_indices)

#Concatenating the indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Create the undersampled dataset
under_sample_data = df.iloc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Printing info
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])*1.0/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])*1.0/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

## Modeling

Split the data into train - test in the ratio 75:25. The 25% test is our holdout sample that we do not use for training or CV.

In [None]:
from sklearn.model_selection import train_test_split

# Entire dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.25
                                                                                                   ,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

Defining some helper functions for calculating different accuracy metrics that we will use for evaluating model performance

In [None]:
def ROC_curve_data(y_true, y_score):
    y_true  = np.asarray(y_true,  dtype=np.bool_)
    y_score = np.asarray(y_score, dtype=np.float64)
    assert(y_score.size == y_true.size)

    order = np.argsort(y_score) # Just ordering stuffs
    y_true  = y_true[order]
    # The thresholds to consider are just the values of score, and 0 (accept everything)
    thresholds = np.insert(y_score[order],0,0)
    TP = [sum(y_true)] # Number of True Positives (For Threshold = 0 => We accept everything => TP[0] = # of positives in true y)
    FP = [sum(~y_true)] # Number of False Positives (For Threshold = 0 => We accept everything => FP[0] = # of negatives in true y)
    TN = [0] # Number of True Negatives (For Threshold = 0 => We accept everything => we don't predict any negatives !)
    FN = [0] # Number of False Negatives (For Threshold = 0 => We accept everything => we don't predict any negatives !)

    for i in range(1, thresholds.size) : # One step per score, in ascending order
        # At this step, we stop predicting y_score[i-1] as True, but as False.... what y_true value say about it ?
        # if y_true was True, that step was a mistake !
        TP.append(TP[-1] - int(y_true[i-1]))
        FN.append(FN[-1] + int(y_true[i-1]))
        # if y_true was False, that step was good !
        FP.append(FP[-1] - int(~y_true[i-1]))
        TN.append(TN[-1] + int(~y_true[i-1]))

    TP = np.asarray(TP, dtype=np.int_)
    FP = np.asarray(FP, dtype=np.int_)
    TN = np.asarray(TN, dtype=np.int_)
    FN = np.asarray(FN, dtype=np.int_)

    accuracy    = (TP + TN) / (TP + FP + TN + FN)
    sensitivity = TP / (TP + FN)
    specificity = TN / (FP + TN)
    return((thresholds, TP, FP, TN, FN))

We are now ready to start the modeling. In the first cut, we use a logistic regression model and train it on the X variables in the 75% of records from the dataset we created by downsampling the majority class (normal transactions) and combining it with the fraudulent transactions.

Since the number of records in this 75% training sample is not that large (1476 records), I will perform 5-fold CV to choose the value of the C parameter. The metric of interest is the area under the precision-recall curve; with the AUPRC lines commented out, the function below reports the mean recall and mean F1 score for each value of C.

The function below, called **perform\_Kfold\_CV**, performs the cross-validation so that we can then choose the best value of the C parameter.
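As a point of comparison, a compact way to score each value of C on an AUPRC-style metric is scikit-learn's `cross_val_score` with `scoring="average_precision"`. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the undersampled training set, stratified shuffled folds, and the hypothetical names `X_toy`, `y_toy` and `mean_ap`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the undersampled training set (about 25% positives)
X_toy, y_toy = make_classification(n_samples=800, n_features=10,
                                   weights=[0.75, 0.25], random_state=0)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_ap = {}
for c in [0.001, 0.01, 0.1, 1, 10, 100]:
    # liblinear is one of the solvers that supports the L1 penalty
    lr = LogisticRegression(C=c, penalty="l1", solver="liblinear")
    scores = cross_val_score(lr, X_toy, y_toy, cv=cv,
                             scoring="average_precision")
    mean_ap[c] = scores.mean()
    print("C =", c, "mean average precision =", round(mean_ap[c], 3))

best_c = max(mean_ap, key=mean_ap.get)
print("Best C by average precision:", best_c)
```

Average precision is computed from the model's scores rather than its hard predictions, which is why it can be requested directly through the `scoring` argument.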

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report

In [None]:
def perform_Kfold_CV(x_train_data,y_train_data):
    # 5-fold CV without shuffling (sklearn.model_selection API)
    fold = KFold(n_splits=5, shuffle=False)

    # Different C parameters
    c_param_range = [0.001,0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score','Mean_F1'])
    results_table['C_parameter'] = c_param_range

    # fold.split() yields (train_indices, test_indices) pairs: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        auprc_accs = []
        F1_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data),start=1):

            # Call the logistic regression model with a certain C parameter. Using L1 penalty - Lasso
            # (the liblinear solver supports the L1 penalty)
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            
            # Compute confusion matrix
            cnf_matrix = confusion_matrix(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            np.set_printoptions(precision=2)
            recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
            precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
            F1 = 2*recall*precision/(precision+recall)
            
            #prg_curve = prg.create_prg_curve(y_train_data.iloc[indices[1],:].values, y_pred_undersample)
            #auprc_acc = prg.calc_auprg(prg_curve)
            #auprc_accs.append(auprc_acc)
            
            recall_accs.append(recall_acc)
            F1_accs.append(F1)
            
            print('Iteration ', iteration,': recall score = ', recall_acc)
            print('Iteration ', iteration,': F1 score = ', F1)
            #print('Iteration ', iteration,': AUPRC score = ', auprc_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        results_table.loc[j,'Mean_F1'] = np.mean(F1_accs)
        #results_table.loc[j,'Mean_AUPRC'] = np.mean(auprc_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
        print('Mean F1 score ', np.mean(F1_accs))
        #print('Mean AUPRC score ', np.mean(auprc_accs))
        print('')

    # Cast to float before idxmax: the score columns are created with object dtype
    print('Best C param for recall score ', results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter'])
    print('')
    print('Best C param for F1 score ', results_table.loc[results_table['Mean_F1'].astype(float).idxmax()]['C_parameter'])
    print('')
    #best_c = results_table.loc[results_table['Mean_F1'].idxmax()]['C_parameter']
    #print('Best C param for AUPRC ', best_c)
    #print('')
    
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    #print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    
    return results_table

In [None]:
cv_results = perform_Kfold_CV(X_train_undersample,y_train_undersample)

In [None]:
best_c_F1 = cv_results.loc[cv_results['Mean_F1'].astype(float).idxmax()]['C_parameter']
best_c_recall = cv_results.loc[cv_results['Mean recall score'].astype(float).idxmax()]['C_parameter']
print('Best CParameter for optimal F1 score ' , best_c_F1)
print('Best CParameter for optimal recall score ' , best_c_recall)

We see that the best value of the C parameter is 0.001.

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        pass  # Confusion matrix, without normalization

    #print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Use this C_parameter for best F1 score to build the final model with the undersampled training dataset  and predict the classes in the undersampled test
# dataset
lr = LogisticRegression(C = best_c_F1, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

Voilà! Our first model correctly classifies 114 of the 130 fraud transactions and 367 of the 372 normal transactions. Combining precision and recall, we calculate another measure called the F1 score, which seems pretty good so far.
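As a quick check of the arithmetic, the metrics can be recomputed by hand from the counts quoted above (114 of 130 frauds and 367 of 372 normal transactions classified correctly). The variable names are only for illustration.

```python
# Counts taken from the confusion matrix described in the text
tp, fn = 114, 130 - 114   # frauds caught / frauds missed
tn, fp = 367, 372 - 367   # normal kept / normal flagged as fraud

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print("Recall:   ", round(recall, 3))     # 0.877
print("Precision:", round(precision, 3))  # 0.958
print("F1:       ", round(f1, 3))         # 0.916
print("Accuracy: ", round(accuracy, 3))   # 0.958
```

F1 is the harmonic mean of precision and recall, so it sits between the two and is pulled toward the lower one.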

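Next we make predictions with the same model on the overall test set, where normal transactions far outnumber fraud transactions. It will be interesting to see how we perform now. Note that this test set was split from the full data independently of the undersampled training set, so it can include some transactions the model was trained on.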

In [None]:
# Reuse the model trained on the undersampled training dataset to predict the classes in the
# full (skewed) test dataset
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)

#prg_curve = prg.create_prg_curve(y_test.values, y_pred)
#auprc_acc = prg.calc_auprg(prg_curve)
#print("AUPRC metric in the testing dataset: ", auprc_acc)
#prg.plot_prg(prg_curve)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the testing dataset: ", f1)

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

Our recall has improved slightly; however, the precision has gone down from 96% to 11%. The F1 score, which combines precision and recall, has also dropped.
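A back-of-the-envelope calculation helps explain why precision collapses even though the model has not changed. The sketch below assumes that the recall and false positive rate measured on the undersampled test set carry over unchanged to the full data, which is only approximately true; `expected_precision` is a hypothetical helper, and the 0.172% fraud rate is the figure from the dataset description.

```python
def expected_precision(recall, false_positive_rate, prevalence):
    """Precision implied by a fixed recall and false positive rate
    at a given share of positive (fraud) cases."""
    true_positive_share = recall * prevalence
    false_positive_share = false_positive_rate * (1.0 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)

# Rates from the undersampled test set: 114 of 130 frauds caught,
# 5 of 372 normal transactions wrongly flagged
recall = 114 / 130
fpr = 5 / 372

# Fraud share in the undersampled test set (130 of 502 transactions)
p_undersampled = expected_precision(recall, fpr, 130 / 502)
# Fraud share in the full dataset (0.172%)
p_full = expected_precision(recall, fpr, 0.00172)

print(round(p_undersampled, 3))  # 0.958
print(round(p_full, 3))          # roughly 0.10
```

Under these assumptions a false positive rate of about 1.3% is enough to swamp the true positives once frauds make up less than 0.2% of the data, which is in line with the drop observed above.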

## Random Forest

In [None]:
%timeit
from sklearn.ensemble import RandomForestClassifier
from collections import OrderedDict
%pylab inline

After loading the required dependencies, we build our first Forest model. We set n_jobs = 3 to leverage parallelism due to multi cores and also set the number of estimators as a fixed value = 501.
 
Next we do a prediction on the 25% Test set after downsampling and then calculate the performance metrics as calculated above for logistic regression.

In [None]:
model = RandomForestClassifier(n_estimators = 501, oob_score = True,n_jobs = 3, random_state =1)
model.fit(X_train_undersample,y_train_undersample.values.ravel())

In [None]:
y_pred_score_rf_usample = model.predict(X_test_undersample)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_score_rf_usample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

On the downsampled test data, the baseline Random Forest is slightly worse in terms of recall and F1 score but better in terms of precision when compared to logistic regression on the same test data. We can now try different values of max_features, vary the number of trees (estimators), and inspect the OOB error.

In [None]:
RANDOM_STATE = 123
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True,
                               random_state=RANDOM_STATE))
]

error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 51
max_estimators = 301

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train_undersample,y_train_undersample.values.ravel())

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
plt.rcParams['figure.figsize'] = (14, 8)
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")

As seen from the plot above, the red line corresponding to max_features = None has a lower OOB error, and the error is close to its minimum at about 225 trees. We can also perform a randomized search and include other hyper-parameters that control tree depth, while fixing the number of trees and the number of features considered at every split from above.

In [None]:
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

## Setting the hyper-parameter choices and setting the grid.

In [None]:
from scipy.stats import randint as sp_randint
param_dist = {"max_depth": [3, None],
              "max_features": [None],
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "criterion": ["gini"]}

from sklearn.model_selection import RandomizedSearchCV
clf = RandomForestClassifier(n_estimators=201,oob_score=True)
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

from time import time
start = time()
random_search.fit(X_train_undersample,y_train_undersample.values.ravel() )
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

Choosing the best hyper-parameter values from model tuning and building the final random forest model.

In [None]:
model_final_rf = RandomForestClassifier(n_estimators = 201, oob_score = True,n_jobs = 3, random_state =1, min_samples_split = 8, min_samples_leaf = 1, max_features =None)
model_final_rf.fit(X_train_undersample,y_train_undersample.values.ravel())

In [None]:
importances = model_final_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in model_final_rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])

## Calculate the different performance evaluation metrics.

In [None]:
# Compute confusion matrix
y_pred_score_rf_usample = model_final_rf.predict(X_test_undersample)
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_score_rf_usample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

In [None]:
# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test,y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

We see that the recall is almost 0.93, which means that we correctly classify 93 out of 100 fraudulent transactions. This is what is most important. However, the precision is 0.078, which means that out of every 100 transactions flagged as fraud, roughly 8 are actually fraud and the remaining 92 are not, so we raise a large number of false alarms. Let's work on finding a compromise between precision and recall.

Let us now try up-sampling: we create synthetic fraudulent transactions with SMOTE to build a 50:50 Fraud / Normal training dataset, and then test against the test set, which keeps the original skewed class distribution.
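For intuition, SMOTE creates each synthetic point by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea on random toy data; it is not the imblearn implementation, and `smote_like_samples` and `X_fraud_toy` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.RandomState(seed)
    # Ask for k + 1 neighbours: with distinct points, column 0 is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    # Pick random minority rows, then one of the k neighbours of each
    base = rng.randint(0, len(X_minority), size=n_new)
    chosen = neighbour_idx[base, rng.randint(1, k + 1, size=n_new)]

    # Place the new point at a random position on the segment between the two
    gap = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Toy minority class: 30 points with 4 features
rng = np.random.RandomState(1)
X_fraud_toy = rng.normal(size=(30, 4))
synthetic = smote_like_samples(X_fraud_toy, n_new=100)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.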

In [None]:
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=999)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))
columns = X_train.columns
os_data_X,os_data_y=os.fit_resample(X_train,y_train.values.ravel())
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=["Class"])
print("Total number of records in oversampled data is ",len(os_data_X))
print("Number of normal transactions in oversampled data",len(os_data_y[os_data_y["Class"]==0]))
print("Number of fraud transactions in oversampled data",len(os_data_y[os_data_y["Class"]==1]))
print("Proportion of Normal data in oversampled data is ",len(os_data_y[os_data_y["Class"]==0])/len(os_data_X))
print("Proportion of fraud data in oversampled data is ",len(os_data_y[os_data_y["Class"]==1])/len(os_data_X))

In [None]:
from time import time
from sklearn.ensemble import RandomForestClassifier
model_final_rf = RandomForestClassifier(n_estimators = 201, oob_score = True,n_jobs = 8, random_state =1, max_features = 'sqrt')
start = time()
model_final_rf.fit(os_data_X,os_data_y.values.ravel())
print("Model train took %.2f seconds " % ((time() - start)))

# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test,y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the testing dataset: ", f1)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

We see that the recall has gone down from 0.93, when we were downsampling the normal transactions, to 0.84 when up-sampling. However, precision has improved drastically, from 0.078 to 0.88.

We can also try other techniques such as varying the thresholds to find the sweet spot on the Precision Recall curve, changing the cost function to penalize based on the type of error made. 
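As a rough sketch of the threshold idea, one can take the predicted fraud probabilities, trace the precision-recall curve, and pick the threshold that maximizes F1 (or any other trade-off) on a validation set. Everything below is illustrative: the data is synthetic, and `X_toy`, `fraud_prob` and `best_threshold` are hypothetical names rather than part of the analysis above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real transactions
X_toy, y_toy = make_classification(n_samples=5000, n_features=10,
                                   weights=[0.97, 0.03], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25,
                                            stratify=y_toy, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
fraud_prob = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, fraud_prob)

# precision and recall have one more entry than thresholds; drop the last point
f1_scores = (2 * precision[:-1] * recall[:-1]
             / np.maximum(precision[:-1] + recall[:-1], 1e-12))
best = int(np.argmax(f1_scores))
best_threshold = thresholds[best]

# Flag a transaction as fraud when its probability reaches the chosen threshold
y_pred = (fraud_prob >= best_threshold).astype(int)
print("Chosen threshold:", best_threshold)
print("Precision:", precision[best], "Recall:", recall[best])
```

The threshold should be chosen on data that is not used for the final evaluation. For the cost-based alternative, the `class_weight` parameter of `LogisticRegression` and `RandomForestClassifier` is one place to start.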

Please leave suggestions/comments if you liked the analysis.