# Credit card fraud detection

The goal of this project is to implement a machine learning algorithm to detect credit card fraud, based on a dataset of credit card transactions made by European cardholders. The dataset covers transactions that occurred over two days in September 2013, with 492 fraudulent transactions out of a total of 284,807. The dataset is thus highly unbalanced, with the positive class (frauds) accounting for just 0.17% of all transactions. This is addressed later in the discussion of the best model and sampling strategy. The dataset has 30 input features, 28 of which are anonymized, and 1 target variable.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
sns.set()

import os
print(os.listdir("../input"))

## Data description

The dataset loaded below includes 30 input features, only two of which, `Time` and `Amount`, are labeled. This rules out meaningful EDA on most of the features.

In [None]:
# Load the dataset
data = pd.read_csv('../input/creditcard.csv')

data.head()

In [None]:
# identify the features and the target in the data set
features = data.columns[:-1]
print('The features are as follows: {}'.format(features))
target = data.columns[-1]
print('The target is: ' + target)

In [None]:
# Create an X variable containing the features and a y variable containing only the target variable
X = data[features]
y = data[target]

Once we've created the feature variable (X) and the target variable (y), we view the histograms of each feature below. The unlabeled features have already been transformed: they're the result of a PCA transformation. `Amount` and `Time` have not been transformed. `Time` (the seconds elapsed since the first transaction in the dataset) displays a bimodal distribution, with one mode around 50k seconds (13.89 hours) and the other around 150k seconds (41.67 hours).

In [None]:
# Plot histograms of each parameter 
X.hist(figsize = (20, 20))
plt.show()

Next let's look at the distribution of class types. The dataset is drastically skewed toward non-fraudulent transactions (284,315) compared with fraudulent ones (492).
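
To put a number on the imbalance, here is a minimal sketch that uses only the two class counts quoted above; nothing is read from the dataset itself.

```python
# Class counts quoted in the text above
n_non_fraud = 284_315
n_fraud = 492

n_total = n_non_fraud + n_fraud
fraud_share = n_fraud / n_total

print(f"Total transactions: {n_total:,}")
print(f"Fraud share: {fraud_share:.4%}")
print(f"Non-fraudulent transactions per fraud: {n_non_fraud / n_fraud:.0f}")
```

With fewer than two frauds per thousand transactions, a model can score very high accuracy without detecting any fraud at all, which is why accuracy alone is a poor yardstick here.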

In [None]:
# Count frequency of target class types
count_target = y.value_counts().sort_index()
count_target.plot(kind = 'bar')
plt.xticks(rotation=0)
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

## Data preparation

Next we prepare the features for the machine learning algorithm. Logistic regression does not require normally distributed inputs, but it behaves better when the features are on comparable scales. In the histograms above we saw that the anonymized features (the PCA outputs) are already on similar scales, while `Time` and `Amount` are not. We therefore standardize these two features.
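
To make the scaling step concrete, here is a minimal NumPy sketch of the arithmetic that standardizing a single column amounts to. The amounts are made up for illustration, and the sketch assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np

# Hypothetical transaction amounts, for illustration only
amounts = np.array([1.0, 2.5, 10.0, 49.9, 120.0, 2500.0])

# z-score: subtract the mean, divide by the population standard deviation
mean = amounts.mean()
std = amounts.std()  # ddof=0 by default
scaled = (amounts - mean) / std

print("mean after scaling:", round(float(scaled.mean()), 6))
print("std after scaling: ", round(float(scaled.std()), 6))
print("order preserved:   ", bool(np.all(np.diff(scaled) > 0)))
```

Standardizing shifts and rescales a column but does not change the shape of its distribution: a skewed `Amount` column stays skewed after scaling.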

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Work on a copy so the new columns are not written to a slice of `data`
X = X.copy()

# Scale `Amount` and `Time` column values and assign to new columns
X['normAmount'] = scaler.fit_transform(X['Amount'].values.reshape(-1, 1))
X['normTime'] = scaler.fit_transform(X['Time'].values.reshape(-1, 1))

# Drop the unscaled `Time` and `Amount` columns from the feature dataset
X = X.drop(['Time', 'Amount'], axis=1)

# Plot histograms of each parameter 
X.hist(figsize = (20, 20))
plt.show()

## Supervised machine learning model - Logistic regression

The first model we'll consider is a logistic regression model. We split the dataset into training and test sets and train the model. We then predict the target on the test set and produce a classification report, an accuracy score, and a confusion matrix.

In [None]:
# Split the data set using 'train_test_split' function
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the model
from sklearn.linear_model import LogisticRegression
lg_model = LogisticRegression(solver = 'liblinear')

# Train the model using 'fit' method
lg_model.fit(X_train, y_train)

# Test the model using 'predict' method
y_pred = lg_model.predict(X_test)

# Print the classification report 
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test,y_pred)

# Plot the confusion matrix as a heatmap
ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True to annotate cells

# Labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Logistic regression: Confusion Matrix')
ax.xaxis.set_ticklabels(['Non-fraudulent', 'Fraudulent'])
ax.yaxis.set_ticklabels(['Non-fraudulent', 'Fraudulent'])

lg_model_accuracy = lg_model.score(X_test, y_test)
print("Model accuracy: ", lg_model_accuracy)

Our goal is to detect as many fraudulent transactions as possible. That means maximizing true positives (TPs) and minimizing false negatives (FNs), so that as few fraudulent transactions as possible slip through. The most important score derived from the confusion matrix above is therefore recall, TP/(TP+FN). The recall score is poor, but this is not surprising given that the training set is heavily skewed toward non-fraudulent transactions.
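
A small sketch of why recall matters more than accuracy here. The counts are hypothetical, not taken from the confusion matrix above: they describe a "model" that labels every transaction as non-fraudulent.

```python
def recall(tp, fn):
    """Share of actual frauds that the model catches: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all transactions that are labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 1,000 transactions, 2 of them fraudulent.
# The model flags nothing, so both frauds are false negatives.
tp, fn, fp, tn = 0, 2, 0, 998

print("accuracy:", accuracy(tp, tn, fp, fn))
print("recall:  ", recall(tp, fn))
```

Accuracy comes out at 99.8% while recall is zero, so a high accuracy score says very little about fraud detection on data this skewed.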

The ROC curve below suggests that the model does well at trading off true positives against false positives. On data this imbalanced, however, the false positive rate is computed against a very large number of non-fraudulent transactions, so the curve can look better than the model's performance on the fraudulent class really is. We'll move on to the precision-recall curve to get a better idea of how the model fares on the fraudulent transactions, including those that fly under the radar.
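
To illustrate why it is worth looking beyond the ROC curve on imbalanced data, here is a short sketch with hypothetical confusion-matrix counts (they are not results from this notebook). It computes the two ROC axes alongside precision for the same predictions.

```python
# Hypothetical confusion-matrix counts on a heavily imbalanced test set
tp, fn = 120, 30        # 150 actual frauds
fp, tn = 500, 84_500    # 85,000 actual non-frauds

tpr = tp / (tp + fn)          # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)          # the x-axis of the ROC curve
precision = tp / (tp + fp)    # share of fraud alerts that are real

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.4f}")
print(f"Precision:    {precision:.3f}")
```

The false positive rate stays below 1% because it is divided by the huge number of non-frauds, yet fewer than one in five alerts is an actual fraud. Precision exposes this; the ROC axes do not.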

In [None]:
y_pred_prob = lg_model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Logistic Regression ROC Curve')
plt.show()

print('The ROC AUC score is: {}'.format(roc_auc_score(y_test, y_pred_prob)))

The precision-recall curve shows the trade-off between precision and recall: as recall increases, precision drops sharply. The dashed line marks the no-skill baseline, which for a precision-recall curve is the share of positives in the test set (roughly 0.17% here) rather than 0.5. The area under the curve (AUPRC) summarizes it in a single number.

In [None]:
# Precision-recall curve and the area under it (AUPRC)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# A no-skill model's precision equals the share of positives in the test set
no_skill = y_test.mean()
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='Unskilled model')
# Plot the precision-recall curve for the model
plt.plot(recall, precision, marker='.', label='Logistic Regression')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Logistic Regression Precision-Recall Curve')
plt.legend()
# Show the plot
plt.show()

print('The AUPRC is: {}'.format(auc(recall, precision)))

## Resampling and tuning the classifier's parameters

To address the skewed sample, I've adopted [joparga3's](https://www.kaggle.com/joparga3) [code for under-sampling](https://www.kaggle.com/joparga3/in-depth-skewed-data-classif-93-recall-acc-now). The idea behind undersampling here is to create a 50/50 ratio between class 1 (fraudulent) and class 0 (non-fraudulent) by randomly selecting as many observations from the majority class (class 0) as there are in the minority class (class 1).
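
The idea can be sketched on toy data before applying it to the real dataset. The labels below are hypothetical, and the fixed seed is an arbitrary choice made only so that the draw is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability

# Hypothetical labels: 1,000 transactions, 20 of them fraudulent (class 1)
labels = np.zeros(1000, dtype=int)
labels[rng.choice(1000, size=20, replace=False)] = 1

fraud_idx = np.flatnonzero(labels == 1)
normal_idx = np.flatnonzero(labels == 0)

# Keep every fraud, plus an equally sized random draw (without replacement)
# from the majority class
sampled_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)
balanced_idx = np.concatenate([fraud_idx, sampled_normal_idx])

print("class counts after undersampling:", np.bincount(labels[balanced_idx]))
```

The price of the 50/50 ratio is that most majority-class observations are discarded, so the model sees only a small slice of what normal transactions look like.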

In [None]:
from sklearn.model_selection import cross_val_score

# Number of data points in the minority class
number_records_fraud = len(y[y.values == 1])
fraud_indices = np.array(y[y.values == 1].index)

# Picking the indices of the normal classes
normal_indices = y[y.values == 0].index

# Out of the indices we picked, randomly select an x number (== number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under-sampled dataset (the indices are index labels, so select with .loc)
under_sample_data = data.loc[under_sample_indices, :]
X_undersample = X.loc[under_sample_indices, :]
y_undersample = y.loc[under_sample_indices]

# Show the class shares (as fractions of the resampled data)
print("Share of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Share of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# Split the undersampled data into training and test sets
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=1
)
print("")
print("Number of transactions in training dataset: ", len(X_train_undersample))
print("Number of transactions in test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

## Logistic regression on undersampled data

Below, the logistic regression model is run on the undersampled data, which is smaller but balanced between the two classes. Precision for detecting fraudulent transactions is now 0.93 and recall is 0.94, much better than with the much larger but highly unbalanced set before. Ideally we'd want an even better model, one that also catches the remaining 6% of fraudulent transactions.
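
One thing worth keeping in mind is that the precision quoted above is measured on a balanced test set. The following is a rough back-of-the-envelope sketch, not a result from this notebook. It assumes that the test split is exactly 50/50 and that recall and the false positive rate would carry over unchanged to data with the original class ratio. The helper `precision_at_prevalence` is introduced here purely for illustration.

```python
def precision_at_prevalence(recall, fpr, prevalence):
    """Expected precision when frauds make up `prevalence` of all transactions."""
    true_pos = recall * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Figures quoted above, assumed to come from an exactly balanced test set
precision_balanced, recall_balanced = 0.93, 0.94

# On a 50/50 set, precision = recall / (recall + FPR); solve for FPR
fpr = recall_balanced * (1 / precision_balanced - 1)

# Fraud share in the full dataset (492 frauds, 284,315 non-frauds)
prevalence = 492 / (492 + 284_315)

print(f"Implied false positive rate:  {fpr:.3f}")
print(f"Precision at 50% prevalence:  {precision_at_prevalence(recall_balanced, fpr, 0.5):.2f}")
print(f"Precision at real prevalence: {precision_at_prevalence(recall_balanced, fpr, prevalence):.3f}")
```

Under these assumptions, precision on data with the original class ratio would be far lower than 0.93, so it is worth also scoring the model trained on undersampled data against the full, unbalanced test set.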

In [None]:
# Instantiate the model
from sklearn.linear_model import LogisticRegression
lg_model = LogisticRegression(solver = 'liblinear')

# Train the model using 'fit' method
lg_model.fit(X_train_undersample, y_train_undersample)

# Test the model using 'predict' method
y_pred_undersample = lg_model.predict(X_test_undersample)

# Print the classification report 
print(classification_report(y_test_undersample, y_pred_undersample))

cm_undersample = confusion_matrix(y_test_undersample,y_pred_undersample)

# Plot the confusion matrix as a heatmap
ax = plt.subplot()
sns.heatmap(cm_undersample, annot=True, fmt='d', ax=ax)  # annot=True to annotate cells

# Labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Logistic regression: Confusion Matrix')
ax.xaxis.set_ticklabels(['Non-fraudulent', 'Fraudulent'])
ax.yaxis.set_ticklabels(['Non-fraudulent', 'Fraudulent'])

lg_model_accuracy_undersample = lg_model.score(X_test_undersample, y_test_undersample)
print("Model accuracy: ", lg_model_accuracy_undersample)


#cv_scores = cross_val_score(lg_model, X, y, cv=5, scoring='roc_auc')
#print("The scores of 5-fold cross-validation are: {}".format(cv_scores))
#print("The mean cross-validation score is: {}".format(np.mean(cv_scores)))

In [None]:
y_pred_prob_undersample = lg_model.predict_proba(X_test_undersample)[:,1]
fpr, tpr, thresholds = roc_curve(y_test_undersample, y_pred_prob_undersample)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Logistic Regression ROC Curve')
plt.show()

print('The ROC AUC score is: {}'.format(roc_auc_score(y_test_undersample, y_pred_prob_undersample)))

The model's precision-recall curve is what we're looking for: no sharp trade-off between precision and recall, but similarly high rates for both.
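
For intuition about where the points on a precision-recall curve come from, here is a minimal sketch that sweeps a few decision thresholds by hand. The labels and scores are hypothetical, and `precision_recall_at` is an illustrative helper rather than a scikit-learn function. Returning a precision of 1.0 when nothing is flagged is a convention chosen for this sketch.

```python
import numpy as np


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention: nothing flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)


# Hypothetical labels and predicted fraud probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.20, 0.35, 0.40, 0.60, 0.80, 0.85, 0.95])

for threshold in (0.9, 0.5, 0.3):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold catches more frauds (higher recall) but lets in more false alarms (lower precision). Each threshold contributes one point to the curve.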

In [None]:
# Precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test_undersample, y_pred_prob_undersample)
# Plot comparison to a no-skill model (precision of about 0.5 on balanced classes)
plt.plot([0, 1], [0.5, 0.5], linestyle='--', label='Unskilled model')
# Plot the precision-recall curve for the model
plt.plot(recall, precision, marker='.')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Logistic Regression Precision-Recall Curve')
# show the plot
plt.show()

## Looking ahead

Next I'd like to look at SVM and Decision Tree models, but for now I'm quite happy with the logistic regression model here.