***Under development***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, precision_score, recall_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

<h2> Introduction </h2>
The data set contains information on ~250,000 credit card transactions that took place in Europe over a certain time period. Each transaction is labeled 0 if it is legitimate and 1 if it is fraudulent. The data set is heavily skewed: legitimate transactions vastly outnumber fraudulent ones.

Other features include the timestamp (time elapsed since the first transaction in the data set), the amount of the purchase (in USD), and the first 28 principal components generated from other data collected for each transaction. For privacy purposes, no further information about the data set is given.
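For readers unfamiliar with principal components, here is a minimal sketch of how such features might be produced with scikit-learn's `PCA`. The raw data, its five columns, and the choice of two components are all made up for illustration; the actual preprocessing behind this data set is not disclosed.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical raw features: 200 rows, 5 correlated columns (not the real data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
raw = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Project the raw features onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(raw)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The resulting columns are uncorrelated with each other and ordered by how much variance they explain, which is why they reveal little about what the original features were.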

<h2> Loading and Visualizing the Data </h2> 

In [None]:
df = pd.read_csv('../input/creditcard.csv')
df.head()

Our goal is to predict whether a transaction is fraudulent given this data. The timestamp, though not unique, serves essentially as an ID for each transaction, and is ultimately unimportant in determining whether a transaction is legitimate. So we'll go ahead and remove it, and then split the data into fraudulent and legitimate transactions.

In [None]:
df = df.drop('Time', axis=1)

In [None]:
fraud = df[df['Class'] == 1]
legit = df[df['Class'] == 0]

Let's also take a look at what fraction of the transactions are actually fraudulent:

In [None]:
# Fraction of all transactions that are fraudulent
len(fraud)/(len(fraud) + len(legit))

So, only around 0.17% of the transactions are actually fraudulent. We need to keep this in mind when choosing a performance metric: classification accuracy alone would not describe how well our learning algorithm is doing. For example, by blindly classifying every transaction as legitimate, we would be right over 99.8% of the time. More on this later.
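To make the point concrete, here is a small sketch with made-up labels (10,000 transactions, 17 of them fraudulent, roughly matching the 0.17% rate). It scores a "classifier" that labels everything as legitimate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10,000 transactions, 17 fraudulent (0.17%)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# A "classifier" that labels every transaction as legitimate
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(accuracy, recall)
```

The accuracy looks excellent, yet the recall is zero: not a single fraudulent transaction is caught. This is why metrics such as precision, recall, F1, and AUC are more informative here.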

In [None]:
# Histogram of purchase amounts for fraudulent transactions
fraud_ax = fraud.hist(column='Amount')
plt.xlabel('Purchase Amount (USD)')
plt.ylabel('Occurrences')
plt.title('Fraudulent Transactions')

# Histogram of purchase amounts for legitimate transactions
legit_ax = legit.hist(column='Amount')
plt.xlabel('Purchase Amount (USD)')
plt.ylabel('Occurrences')
plt.title('Legitimate Transactions')

Judging from the autoscaled axes, most of the legitimate purchases are under $5000, but a relatively small number of them reach as high as $25,000. The fraudulent purchases, interestingly, are all rather small: most fall under $500, though some reach the neighborhood of $2000.

Let's take a closer look at the outliers we're dealing with in each case:

In [None]:
sns.violinplot(x='Class', y='Amount', data=df, inner='quartile')
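The violin plot shows the outliers only qualitatively. To put a number on them, one option is the common 1.5 × IQR rule. The sketch below applies it per class to a made-up `toy` frame: the column names mirror the data set, but the values are synthetic, and `count_upper_outliers` is a hypothetical helper, not something from the analysis itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: skewed amounts, two classes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Class": rng.integers(0, 2, size=1000),
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=1000),
})

def count_upper_outliers(amounts):
    """Count values above the upper fence Q3 + 1.5 * IQR."""
    q1, q3 = amounts.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((amounts > upper_fence).sum())

# One outlier count per class
outlier_counts = toy.groupby("Class")["Amount"].apply(count_upper_outliers)
print(outlier_counts)
```

The same helper could be applied to the real `Amount` column, though whether the 1.5 × IQR fence is a sensible definition of "outlier" for purchase amounts is a judgment call.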

Since we have a lot of outliers, and there's no <i>a priori</i> reason to ignore them, we want to be careful when choosing a learning algorithm. Regression-based learning algorithms can be significantly affected by these outliers.

Here, I'll choose to use an ensemble method, namely a Random Forest Classifier (RFC). An RFC is largely insensitive to these outliers, since each node of each tree simply splits the data into two groups at a threshold, so only the ordering of the values matters and not how extreme they are.
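A rough way to see why threshold splits shrug off outliers: squaring a positive feature makes its large values far more extreme but leaves their ordering intact, and a tree trained on either version groups the training points identically. The sketch below uses a single decision tree (the building block of a random forest) on made-up amounts and labels; none of it comes from the real data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical purchase amounts (not the real data), rounded to cents
rng = np.random.default_rng(0)
amount = np.round(rng.exponential(scale=100.0, size=500), 2) + 1.0
# Toy label: large purchases are "fraud", with roughly 10% label noise
label = ((amount > 150) ^ (rng.random(500) < 0.1)).astype(int)

X_raw = amount.reshape(-1, 1)
X_stretched = X_raw ** 2  # same ordering, far more extreme outliers

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, label)
tree_stretched = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_stretched, label)

# Both trees assign every training point to the same class
same = np.array_equal(tree_raw.predict(X_raw), tree_stretched.predict(X_stretched))
print(same)
```

A linear or logistic regression fitted to `X_raw` and `X_stretched` would generally not agree in this way, because its coefficients depend on the magnitudes of the values and not just their order.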

In [None]:
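# Separate the label from the features without modifying df in place
y = df['Class']
X = df.drop('Class', axis=1)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)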

In [None]:

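RANDOM_STATE = 123
# warm_start=True lets us add more trees later without refitting the existing ones.
# max_features='sqrt' replaces 'auto', which meant the same thing for classifiers
# and has been removed from recent scikit-learn releases.
clf = RandomForestClassifier(n_estimators=10, warm_start=True, max_features='sqrt',
                             random_state=RANDOM_STATE)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)

# ROC curve computed from the predicted fraud probabilities
fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, 'b-', [0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

prec = precision_score(y_test, prediction)
recall = recall_score(y_test, prediction)
print("Number of actual fraud cases in test set: {0}".format(sum(y_test)))
print("Number of predicted fraud cases: {0}".format(sum(prediction)))
print("AUC score: {0}".format(roc_auc))
print("F1 score: {0}".format(2*prec*recall/(prec + recall)))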

Ok, so we've got an AUC of around 0.924 with an F1 score of 0.835 using 10 trees and sqrt(29) ≈ 5 features considered at each split. Not bad, but these parameter choices were rather arbitrary. We can try tuning some parameters to get a better result.
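As a quick sanity check on the F1 expression used above, 2 · precision · recall / (precision + recall), the sketch below compares it with scikit-learn's `f1_score` on a handful of made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

prec = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
manual_f1 = 2 * prec * recall / (prec + recall)
print(manual_f1, f1_score(y_true, y_pred))
```

The two values agree. Note that the manual formula divides by zero when precision and recall are both 0, a case `f1_score` handles on its own.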

<h2> Parameter Tuning </h2>

In [None]:
f1_scores = [2*prec*recall/(prec + recall)]
auc_scores = [roc_auc]

# Grow the forest 10 trees at a time. Because warm_start=True keeps the existing
# trees, the train/test split must stay fixed: re-splitting inside the loop would
# score old trees on samples they were trained on.
for i in range(10):
    clf.n_estimators += 10
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_test)
    fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:,1])
    roc_auc = auc(fpr, tpr)
    prec = precision_score(y_test, prediction)
    recall = recall_score(y_test, prediction)
    F1 = 2*prec*recall/(prec+recall)
    f1_scores.append(F1)
    auc_scores.append(roc_auc)

In [None]:
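n_trees = range(10, 10*len(auc_scores) + 1, 10)  # 10, 20, ... trees
plt.plot(n_trees, auc_scores, 'bo-', label='AUC')
plt.plot(n_trees, f1_scores, 'ro-', label='F1')
plt.legend()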
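Scores from a single train/test split can be noisy, especially with so few fraud cases. One possible alternative, not part of the analysis above, is to average the F1 score over cross-validation folds for each forest size. The sketch below assumes synthetic data from `make_classification` in place of the real transactions, and `StratifiedKFold` is my choice here so that every fold contains some positive cases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the real transactions (about 5% positives)
X_toy, y_toy = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                   weights=[0.95, 0.05], random_state=0)

# Stratified folds keep the rare class present in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

mean_f1 = {}
for n_trees in (10, 30):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='f1')
    mean_f1[n_trees] = scores.mean()

print(mean_f1)
```

On the real data this would be slower than the warm-start loop, since each fold refits the forest from scratch, but each score would reflect every sample being held out once.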