Many of the feature decisions below are based on the analysis performed earlier, which can be found in the analysis.ipynb file.

In [1]:
import pandas as pd
import numpy as np
import datetime 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
import pickle

In [2]:
creditCardData = pd.read_csv("Data/creditcard.csv")

In [3]:
# Feature engineering

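# 'Time' is given in seconds, so we can treat it as seconds since the epoch:
# only the hour of the day matters here, not the year or date.
# Note that fromtimestamp() converts to the local time zone of the machine running it.
def convert_totime(seconds):
    return datetime.datetime.fromtimestamp(seconds)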

creditCardData['datetime'] = creditCardData.Time.apply(convert_totime)
creditCardData['hour of the day'] = creditCardData.datetime + pd.Timedelta("7 hour")
creditCardData['hour of the day'] = creditCardData['hour of the day'].dt.hour
creditCardData['isNight'] = (creditCardData['hour of the day'] >= 1) & (creditCardData['hour of the day'] <= 7)

# Dropping insignificant columns
creditCardData = creditCardData.drop(['V22','V23','V25','V26','V13','V15', 'datetime'], axis = 1)
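
Because `datetime.fromtimestamp` converts to local time, the hour-of-day feature above can change with the time zone of the machine. A minimal sketch of a time-zone-independent version follows, using plain integer arithmetic on the seconds. The names `hour_of_day`, `is_night` and `offset_hours` are hypothetical, and the right offset depends on the time zone the original run used, which is not stated here.

```python
def hour_of_day(seconds, offset_hours=0):
    """Hour (0-23) for a time given in seconds, shifted by a fixed offset in hours."""
    return int(seconds // 3600 + offset_hours) % 24


def is_night(hour):
    """1 if the hour falls in the 1-7 window used for 'isNight' above, else 0."""
    return int(1 <= hour <= 7)


print(hour_of_day(0, offset_hours=7))   # 0 s shifted by 7 h
print(is_night(hour_of_day(90000)))     # 90000 s is 25 h, i.e. hour 1 of the next day
```

On the DataFrame, the same arithmetic could be written as `((creditCardData.Time // 3600 + offset_hours) % 24).astype(int)`.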

In [4]:
creditCardData['isNight'] = creditCardData['isNight'].map({False:0, True:1})
# Create the train and test datasets. random_state is fixed so that the split
# stays the same every time I run this.
train, test  = train_test_split(creditCardData, test_size=0.33, random_state=42)
train_y = train.Class
train_x = train.drop('Class', axis = 1)

test_y = test.Class
test_x = test.drop('Class', axis = 1)

In [7]:
model = RandomForestClassifier(n_estimators = 500, max_depth = 8, max_features = None)
model.fit(train_x, train_y)

preds = model.predict(test_x)
preds_prob = model.predict_proba(test_x)

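# print() as a function so the cell also runs on Python 3
print('Accuracy:', accuracy_score(test_y, preds))
# Column 1 of predict_proba holds the probability of the positive (fraud) class
print('AUC_ROC:', roc_auc_score(test_y, preds_prob[:, 1]))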

Accuracy: 0.999627608073
AUC_ROC: 0.975602784522


The AUC_ROC looks good, which suggests the model ranks most of the positive class (the fraud cases) ahead of the legitimate transactions. I think we can do better, though, as so far we have not done anything to address the class imbalance problem.
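
Accuracy on its own says little when one class is rare, which is why the AUC_ROC is the number to watch here. As a small illustration, here is a sketch with made-up labels (998 legitimate, 2 fraud; these counts are hypothetical, not taken from the dataset) and a "model" that never predicts fraud:

```python
# Hypothetical labels: 998 legitimate transactions (0) and 2 frauds (1).
y_true = [0] * 998 + [1] * 2
# A "model" that never flags anything as fraud.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of the actual frauds that were caught.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print("accuracy:", accuracy)  # 0.998
print("recall:", recall)      # 0.0
```

The accuracy is very high even though not a single fraud is detected.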

I will be trying out 2 different ways to tackle it:
1. Use the 'class_weight' parameter so that Random Forest applies a higher weight to the positive class
2. Use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the positive class. I did not want to try undersampling, as the overall dataset would then be pretty small
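
To make the first option a bit more concrete: scikit-learn documents the 'balanced' mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. Below is a plain-Python sketch of that formula on hypothetical labels. `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 98 negatives and 2 positives.
y = [0] * 98 + [1] * 2
weights = balanced_class_weights(y)
print(weights)
```

The rare class ends up with a much larger weight, so mistakes on it count for more while the trees are being built.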

#### Using 'class_weight' parameter

In [6]:
model = RandomForestClassifier(n_estimators = 500, max_depth = 8, max_features = None, class_weight = 'balanced')
model.fit(train_x, train_y)

preds = model.predict(test_x)
preds_prob = model.predict_proba(test_x)

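# print() as a function so the cell also runs on Python 3
print('Accuracy:', accuracy_score(test_y, preds))
# Column 1 of predict_proba holds the probability of the positive (fraud) class
print('AUC_ROC:', roc_auc_score(test_y, preds_prob[:, 1]))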

Accuracy: 0.999074340068
AUC_ROC: 0.976120097595


#### Using Synthetic Minority Over-sampling
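
As a rough picture of what this technique does: each synthetic point is placed somewhere on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified toy version in NumPy, not the imblearn implementation. `smote_like` and its parameters are hypothetical names, and the four "fraud" rows are made up.

```python
import numpy as np


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise distances between minority samples; a point is not its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))          # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its k neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


# Made-up minority rows: the four corners of the unit square.
frauds = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(frauds, n_new=5)
print(new_rows.shape)  # (5, 2)
```

Only the training split should be oversampled, so that the test set keeps the original class distribution.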

In [8]:
from imblearn.over_sampling import SMOTE

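oversampler = SMOTE(random_state=0)
# fit_resample replaces the older fit_sample, which was removed in newer imbalanced-learn releases
os_train_x, os_train_y = oversampler.fit_resample(train_x, train_y)

model = RandomForestClassifier(n_estimators = 500, max_depth = 8, max_features = None)
# Fit on the oversampled data. The run recorded below fit on train_x/train_y instead,
# so its output needs to be regenerated.
model.fit(os_train_x, os_train_y)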

preds = model.predict(test_x)
preds_prob = model.predict_proba(test_x)

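# print() as a function so the cell also runs on Python 3
print('Accuracy:', accuracy_score(test_y, preds))
# Column 1 of predict_proba holds the probability of the positive (fraud) class
print('AUC_ROC:', roc_auc_score(test_y, preds_prob[:, 1]))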

Accuracy: 0.999627608073
AUC_ROC: 0.975463890289


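In these runs the 'class_weight' variant scored the highest AUC_ROC (0.9761, against 0.9756 for the baseline), at the cost of slightly lower accuracy (0.99907 against 0.99963). The difference is small, and since no random_state was set on the forests it may be within run-to-run variation. The over-sampling numbers are not a fair comparison: the recorded run fit the model on train_x/train_y rather than on the oversampled data, so it is effectively a second run of the baseline. We can still get better results by fine-tuning the hyperparameters, but that is for next time. Cheers!!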
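
For the hyperparameter tuning, one possible starting point is GridSearchCV (already imported at the top), scored on ROC AUC rather than accuracy. This is only a sketch: it uses synthetic data from `make_classification` so that it runs on its own, and the grid values are arbitrary placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Placeholder grid: two depths, with and without class weighting.
param_grid = {
    "max_depth": [4, 8],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=3,               # stratified 3-fold split for classifiers
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With the real data, `X` and `y` would be `train_x` and `train_y`, and the test set would stay untouched until the final evaluation.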