# Detecting Credit Card Fraud with Machine Learning

Let's see if we can predict whether a given transaction is fraudulent using the given dataset.

### Loading and Observing the Data

In [2]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix
import matplotlib.pyplot as plt

np.random.seed(0)

print(os.listdir("../input"))

Let's first load the data into a pandas dataframe and take a peek at its contents.

In [3]:
data = pd.read_csv("../input/creditcard.csv")

In [4]:
data.head()

In [5]:
data.describe()

There is apparently a huge class imbalance in this dataset. We can confirm this by plotting a seaborn countplot of the different class labels.

In [6]:
sns.countplot(x="Class", data=data)
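
A numeric complement to the count plot can make the imbalance easier to quote. The following is a minimal sketch on a hypothetical toy series standing in for `data["Class"]` (the counts are made up for illustration, not taken from the dataset); it uses `value_counts(normalize=True)` to get the share of each class.

```python
import pandas as pd

# Hypothetical stand-in for data["Class"]: 995 legitimate rows, 5 fraudulent ones.
labels = pd.Series([0] * 995 + [1] * 5, name="Class")

# Absolute counts and relative frequencies per class.
class_counts = labels.value_counts()
class_fractions = labels.value_counts(normalize=True)

print(class_counts)
print(class_fractions)

# Share of the positive (fraud) class in this toy series.
fraud_fraction = class_fractions[1]
print(f"Fraud share: {fraud_fraction:.2%}")
```

On the real dataframe, the same two calls on `data["Class"]` should give the exact fraction of fraudulent transactions.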

As we can see from the above plot, fraudulent activities make up a very small fraction of this dataset. Let's further check to see if there are any null values in the data.

In [7]:
data.isnull().any().describe()

As we can see, the dataset is not missing any values.
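
If a more direct readout is preferred, missing values can also be counted per column. This is a small sketch on a hypothetical toy frame (the column values are invented and one value is deliberately missing), not output from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({"Amount": [10.0, np.nan, 3.5], "Class": [0, 0, 1]})

# Number of missing entries in each column, and in the whole frame.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero on the real dataframe would confirm the same conclusion as the `describe()` summary.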

### Preparing a Train and Test Set

Let's partition our dataset into a train and test set, where 90% of the data will be a part of the training set, 5% will be allocated for validation, and 5% will be used for testing.

In [8]:
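# Use positional slicing (.iloc): label-based .loc slices include the end label,
# which would place the boundary rows in two partitions at once.
limit = int(0.9 * len(data))
train = data.iloc[:limit]
dev_test = data.iloc[limit:].reset_index(drop=True)
# Split the remaining 10% evenly into validation and test sets
dev_test_limit = int(0.5 * len(dev_test))
dev = dev_test.iloc[:dev_test_limit]
test = dev_test.iloc[dev_test_limit:]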
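
One detail worth checking when splitting by row position is whether the partitions are disjoint. In pandas, `.loc` slices by label and includes both endpoints, while `.iloc` slices by position and excludes the stop. The sketch below shows the difference on a hypothetical 100-row frame standing in for `data`.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # hypothetical stand-in for `data`

limit = int(0.9 * len(toy))

# .loc slices by label and includes both endpoints, so row `limit` appears twice.
loc_head, loc_tail = toy.loc[:limit], toy.loc[limit:]
loc_overlap = len(loc_head) + len(loc_tail) - len(toy)

# .iloc slices by position and excludes the stop, so the parts are disjoint.
iloc_head, iloc_tail = toy.iloc[:limit], toy.iloc[limit:]
iloc_overlap = len(iloc_head) + len(iloc_tail) - len(toy)

print("Rows shared with .loc: ", loc_overlap)
print("Rows shared with .iloc:", iloc_overlap)
```

A shared row between the training and held-out sets is a small leak, but it is cheap to rule out with a check like this.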

Let's check that the validation and test sets include a fair number of fraudulent activities before going any further.

In [9]:
print("Number of fraudulent transactions in the dev set: {}".format(dev["Class"].value_counts()[1]))
print("Number of fraudulent transactions in the test set: {}".format(test["Class"].value_counts()[1]))

Now we can focus on developing a model to accurately detect fraudulent activity. Due to the huge class imbalance in our dataset, a model that simply identifies all transactions as not fraudulent would score high accuracy. There also would not be many fraudulent samples for the model to learn from, making it hard to learn what a fraudulent transaction looks like. Therefore, we should find a way to balance the number of positive and negative instances in our training set. This can be done by either oversampling the positive instances or undersampling the negative instances. Undersampling would involve reducing the number of non-fraudulent transactions until the ratio between positive and negative instances is approximately 1-to-1. Since we don't have that many data samples, I fear doing so would severely limit our model's performance, since it would have much less data to train on. We shall therefore oversample the positive instances in our dataset.
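
For comparison, the alternative that is rejected here (cutting the negative class down to the size of the positive class) could be sketched as follows. The toy frame and its counts are hypothetical; the point is only to show how many rows survive a 1-to-1 rebalancing.

```python
import pandas as pd

# Hypothetical toy training frame: 1,000 legitimate rows and 10 fraudulent ones.
toy_train = pd.DataFrame({
    "Amount": range(1010),
    "Class": [0] * 1000 + [1] * 10,
})

positives = toy_train[toy_train["Class"] == 1]
negatives = toy_train[toy_train["Class"] == 0]

# Keep only as many negatives as there are positives (roughly 1-to-1).
negatives_sampled = negatives.sample(n=len(positives), random_state=0)
undersampled = pd.concat([positives, negatives_sampled], ignore_index=True)

print("Rows before:", len(toy_train))
print("Rows after: ", len(undersampled))
```

In this toy case only 20 of 1,010 rows remain, which illustrates the concern about having much less data to train on.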

To oversample the positive instances in our training set, we will add copies of them to it, but with their feature values slightly tweaked. This adds more positive instances to the dataset, while keeping the tweaks slight so as not to change the data so much that we end up teaching our model false information. The tweaking is done by multiplying each positive sample copy's feature values by a factor drawn from a uniform distribution between 0.9 and 1.1.

In [10]:
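# Select the fraudulent rows and replicate them until they roughly match the size of the training set
train_positive = train[train["Class"] == 1]
train_positive = pd.concat([train_positive] * int(len(train) / len(train_positive)), ignore_index=True)
# Scale every value by a random factor drawn uniformly from [0.9, 1.1]
noise = np.random.uniform(0.9, 1.1, train_positive.shape)
train_positive = train_positive.multiply(noise)
# The label column was scaled too, so restore it
train_positive["Class"] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
train_extended = pd.concat([train, train_positive], ignore_index=True)
train_shuffled = train_extended.sample(frac=1, random_state=0).reset_index(drop=True)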

The ratio of positive to negative instances in our training set should now be much more balanced.

In [11]:
sns.countplot(x="Class", data=train_shuffled)
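
The effect of this kind of oversampling can also be checked numerically. Below is a minimal sketch on a hypothetical toy frame (column names `V1` and `Amount` and all values are placeholders). It is a variant of the approach described above, not a copy of it: the random factor is applied to the feature columns only, so the label never needs to be restored.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy training frame: 990 legitimate rows and 10 fraudulent ones.
toy_train = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.uniform(1, 100, size=1000),
    "Class": [0] * 990 + [1] * 10,
})

positives = toy_train[toy_train["Class"] == 1]
n_copies = len(toy_train) // len(positives)  # 100 copies of each positive row

# Replicate the positives, then scale only the feature columns by a factor in [0.9, 1.1].
copies = pd.concat([positives] * n_copies, ignore_index=True)
feature_cols = [c for c in copies.columns if c != "Class"]
factors = rng.uniform(0.9, 1.1, size=(len(copies), len(feature_cols)))
copies[feature_cols] = copies[feature_cols] * factors

balanced = pd.concat([toy_train, copies], ignore_index=True)
class_counts = balanced["Class"].value_counts()
print(class_counts)
```

The two classes end up at a similar size, which is the property the count plot is meant to confirm.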

With the class imbalance in our training set dealt with, we can now separate our training, validation, and test sets into their respective features and responses.

In [12]:
X_train = train_shuffled.drop(labels=["Class"], axis=1)
Y_train = train_shuffled["Class"]
X_dev = dev.drop(labels=["Class"], axis=1)
Y_dev = dev["Class"]
X_test = test.drop(labels=["Class"], axis=1)
Y_test = test["Class"]

### Training and Validating a Fraudulent Activity Detector

Let's begin by training a simple logistic regression model on our training set and see how well that does on our validation set.

In [13]:
lr_model = LogisticRegression(random_state=0).fit(X_train, Y_train)
print("Train Accuracy:", lr_model.score(X_train, Y_train))
print("Dev Accuracy:", lr_model.score(X_dev, Y_dev))

Though achieving over 98% accuracy on our validation data is thrilling, we cannot forget about the huge class imbalance still present in the validation set. A model that simply predicted that no transaction was fraudulent would achieve high accuracy as well. The dataset info recommends using the area under the precision-recall curve (AUPRC) as an evaluation metric. We will use the average precision score instead, which is an evaluation metric available in scikit-learn that is sometimes used as an alternative to AUPRC.
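
The point about accuracy can be made concrete with a trivial baseline. The sketch below uses hypothetical validation labels (the counts are invented for illustration) and a "model" that labels every transaction as legitimate.

```python
# Hypothetical validation labels: 9,980 legitimate transactions and 20 fraudulent ones.
y_true = [0] * 9980 + [1] * 20

# A "model" that labels every transaction as legitimate.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: share of fraudulent transactions that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall on fraud: {recall:.3f}")
```

The baseline scores 99.8% accuracy while catching no fraud at all, which is why a metric built on precision and recall is a better fit here.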

Let's take a look and see what the above model achieved on the training and validation set's average precision score.

In [14]:
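# Hard class predictions (reused for the confusion matrix below)
lr_predict_train = lr_model.predict(X_train)
lr_predict_dev = lr_model.predict(X_dev)
# Average precision ranks samples by score, so pass the predicted fraud probabilities
lr_scores_train = lr_model.predict_proba(X_train)[:, 1]
lr_scores_dev = lr_model.predict_proba(X_dev)[:, 1]
print("Train Average Precision:", average_precision_score(Y_train, lr_scores_train))
print("Dev Average Precision:", average_precision_score(Y_dev, lr_scores_dev))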
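
To make the metric less of a black box, here is a minimal pure-Python sketch of average precision, following the definition in the scikit-learn documentation: the sum over thresholds of the change in recall times the precision at that threshold. It is a hand-written illustration on made-up labels and scores, not scikit-learn's implementation. It also shows that passing hard 0/1 predictions instead of scores collapses the ranking to a single threshold and can change the result.

```python
def average_precision(y_true, y_score):
    """Sum of (recall_n - recall_{n-1}) * precision_n over descending score thresholds."""
    pairs = sorted(zip(y_score, y_true), reverse=True)
    total_positives = sum(y_true)
    ap, true_positives, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(pairs):
        # Samples sharing the same score fall under the same threshold.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            true_positives += pairs[j][1]
            j += 1
        recall = true_positives / total_positives
        precision = true_positives / j  # j samples are predicted positive so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_label = [1 if s >= 0.5 else 0 for s in y_score]  # hard predictions at threshold 0.5

ap_scores = average_precision(y_true, y_score)
ap_labels = average_precision(y_true, y_label)
print(f"AP from scores:      {ap_scores:.4f}")
print(f"AP from hard labels: {ap_labels:.4f}")
```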

As we can see, this model didn't perform nearly as well as we initially thought. Viewing a confusion matrix of the predictions made by the model, we can gather a better idea of what type of errors it is making.

In [15]:
lr_confusion_dev = pd.DataFrame(confusion_matrix(Y_dev, lr_predict_dev))
lr_confusion_dev.columns = ["Predicted Negative", "Predicted Positive"]
lr_confusion_dev.index = ["Actual Negative", "Actual Positive"]
sns.heatmap(lr_confusion_dev, annot=True)
plt.yticks(rotation=0)

Viewing the confusion matrix, we can see that the logistic regression model identified all but one of the fraudulent transactions as such. However, it also misclassified 28 non-fraudulent transactions in the validation set as fraudulent. Though I would much rather our model make false positive errors than false negative ones, let's see if we can improve our results by using several other machine learning models.
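
The trade-off between the two error types can be summarized as precision and recall computed from the four cells of the confusion matrix. In the sketch below, the false negative and false positive counts are the ones quoted above, while the true positive and true negative counts are hypothetical placeholders, since the text does not state them.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (fraud) class."""
    precision = tp / (tp + fp)  # share of flagged transactions that are really fraud
    recall = tp / (tp + fn)     # share of fraudulent transactions that were flagged
    return precision, recall


fn = 1       # fraudulent transactions missed (quoted in the text)
fp = 28      # legitimate transactions flagged as fraud (quoted in the text)
tp = 20      # hypothetical placeholder, not taken from the notebook
tn = 14000   # hypothetical placeholder, not taken from the notebook

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

With placeholder counts like these, recall is high while precision is well below one half, which matches the pattern of few missed frauds and many false alarms.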

### To be continued...