# Fraud Detection in Credit Card Payments

The data used in this notebook was obtained from Kaggle (Dal Pozzolo et al., 2015). Link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

According to the data dictionary, the variables in the dataset are the output of a dimensionality reduction process (PCA), applied to protect identities. The Time variable represents the number of seconds elapsed between each transaction and the first transaction in the dataset.

## Downloading the dataset

If you cloned this repository from Amazon SageMaker, the best way to obtain the data is to run the following lines:

```
!rm -f creditcard.csv
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
!unzip creditcardfraud
!rm creditcardfraud.zip
```

Because creditcard.csv was uploaded using Git LFS (`lfs`), cloning this repository from SageMaker leaves a creditcard.csv file that contains only pointers.

If you clone this repo locally and `lfs` is configured, the dataset will be downloaded correctly.
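
A quick way to tell whether the local creditcard.csv is real data or only a pointer is to inspect its first bytes. The sketch below assumes the standard Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`; the helper name `is_lfs_pointer` is hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of a Git LFS pointer file (standard pointer format, assumed here).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer instead of real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


csv_path = Path("creditcard.csv")
if csv_path.exists() and is_lfs_pointer(csv_path):
    print("creditcard.csv is an LFS pointer; download the dataset with the commands above.")
```

If the message is printed, the file should be replaced with the downloaded copy before calling `pd.read_csv`.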

## Exploring the dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!rm -f creditcard.csv
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
!unzip creditcardfraud
!rm creditcardfraud.zip

In [3]:
data=pd.read_csv('creditcard.csv')

In [None]:
print("The dataset contains {} observations and {} features".format(data.shape[0],data.shape[1]))
data.head()

### Features

* V1-V28: The principal components obtained by PCA. The original features cannot be disclosed for confidentiality reasons.
* Time: Seconds elapsed between the transaction and the first transaction in the dataset.
* Amount: The amount of the transaction.
* Class: Takes the value of **1** if it was a fraudulent transaction and **0** otherwise.

#### Time

The data contains two days of transactions.

In [None]:
data.Time.plot(kind='density');
data.Time.describe()

#### PCA components

As expected, all the PCA components have zero correlation with each other.

In [None]:
sns.heatmap(data.loc[:,data.columns.str.startswith('V')].corr());
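
The zero correlation is a property of PCA itself, not of this particular dataset. The sketch below illustrates this on synthetic, deliberately correlated features (not the credit card data), computing the principal component scores with a plain NumPy SVD. The variable names and the mixing matrix are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features that are correlated by construction (hypothetical data).
latent = rng.normal(size=(1000, 3))
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = latent @ mixing

# PCA via SVD of the centered data: the component scores are the
# projections of the centered data onto the right singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T

corr_original = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

# Compare the largest off-diagonal correlation before and after PCA.
off_diag = ~np.eye(3, dtype=bool)
print("max |corr| between original features:", np.abs(corr_original[off_diag]).max())
print("max |corr| between PCA components:   ", np.abs(corr_scores[off_diag]).max())
```

The second value is zero up to floating point error, which is the pattern the heatmap shows for V1-V28.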

#### Amount

In [None]:
data.Amount.describe()

#### Class

The dataset is extremely imbalanced, with only 0.17% of the observations labelled as fraudulent.

In [None]:
print("Number of fraudulent transactions in the dataset: {}".format(data.Class.value_counts()[1]))
print("Proportion of fraudulent transactions in the dataset: {}%".format((data.Class.value_counts()[1]/len(data.Class))*100))

## Model Estimation

Our task is to detect as many fraudulent transactions as possible. Because the data is extremely imbalanced, with more than 99% of the observations labelled as non-fraudulent, algorithms will generally tend to predict that all transactions belong to the majority class. For this reason, we should focus on minimizing False Negatives and maximizing True Positives (maximizing Recall).

\begin{align}
Recall=\frac{TP}{TP+FN}
\end{align}
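
To see why accuracy is a poor guide here, consider a trivial model that labels every transaction as non-fraudulent. The sketch below is only arithmetic on counts: the 492 fraudulent transactions come from this notebook, while the total of 284,807 rows is an assumption taken from the dataset's public description.

```python
# 492 frauds is stated in this notebook; the total row count is an
# assumption taken from the public description of the dataset.
n_total = 284_807
n_fraud = 492

# A "model" that predicts class 0 for every transaction.
tp, fn = 0, n_fraud
tn, fp = n_total - n_fraud, 0

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print("Accuracy: {:.4f}".format(accuracy))
print("Recall:   {:.4f}".format(recall))
```

The accuracy is above 99% even though no fraudulent transaction is detected, which is why recall is the metric tracked in the rest of the notebook.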

Performing cross validation to estimate the performance (recall) on the test set would not give accurate results. The reason is the low proportion of fraudulent transactions, which causes great variability in the number of class-1 observations used for training and testing in each validation fold.

As we only have 492 fraudulent data points, training an algorithm on 400 of them and testing on 92 may give very different results than training on 250 of them and testing on the rest.

In order to maintain the proportion of class-1 points in the training and test sets, we will use stratified sampling to divide the data into train and test.

In [None]:
data.groupby(by='Class').sample(frac=0.7).Class.value_counts()

In [4]:
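def stratified_sampling(data,fraction):
    """
    Split the dataset into training and test sets, keeping the same
    proportion of each Class value in both.

    Input:
    data: the dataset to split into training and test
    fraction: share of each class assigned to the training set (e.g. 0.7)

    Output:
    The function returns X_train, X_test, y_train, y_test
    """
    # Sample the same fraction within each class; .loc selects by index label
    train=data.loc[data.groupby(by='Class').sample(frac=fraction).index]
    # Every row that was not sampled for training goes to the test set
    test=data.drop(train.index,axis=0)
    
    X_train=train.drop('Class',axis=1)
    X_test=test.drop('Class',axis=1)
    y_train=train.Class
    y_test=test.Class
    return X_train, X_test, y_train, y_test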
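
For comparison, scikit-learn's `train_test_split` can produce the same kind of split through its `stratify` argument. The sketch below is an alternative to the function above, not the method used in this notebook, and it runs on a small hypothetical DataFrame (`toy`) so that the class shares are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 1,000 rows, 2% labelled as class 1.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(scale=50.0, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

X = toy.drop("Class", axis=1)
y = toy["Class"]

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

print("class-1 share in train:", y_train.mean())
print("class-1 share in test: ", y_test.mean())
```

Dropping `random_state` gives a different split on every call, which matches the repeated sampling used in the loops of this notebook.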

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

#### Not accounting for class imbalance

In [None]:
recall_scores=[]
auc_scores=[]
for i in range(20):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    log_reg=LogisticRegression(max_iter=1000).fit(X_train,y_train)
    y_pred=log_reg.predict(X_test)
    recall_scores.append(recall_score(y_test,y_pred))
    auc_scores.append(roc_auc_score(y_test,log_reg.predict_proba(X_test)[:,1]))

print("The Logistic Regression average recall score without accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The Logistic Regression average AUC score without accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))

It is important to highlight that the 20 recall scores found above are not independent.

Some observations in the training (or test) set are shared across repetitions. However, with 20 repetitions and only 492 class-1 observations, almost every one of them is very likely to be used both as a training and as a test data point at least once, because each repetition assigns the same proportion of the 492 observations to training and test.
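
The statement above can be checked with a short calculation. The sketch below assumes that each repetition is an independent random split and that every class-1 observation has a probability of about 0.7 of landing in the training set; treating the 492 observations as independent of each other is a further approximation.

```python
n_fraud = 492        # class-1 observations in the dataset
train_frac = 0.7     # share sent to training in each repetition
n_reps = 20          # independent repetitions

# Probability that one given observation never lands in the test set
# (it is sampled into training every time), and the reverse case.
p_never_test = train_frac ** n_reps
p_never_train = (1 - train_frac) ** n_reps

# Expected number of class-1 observations that are never used for testing.
expected_never_test = n_fraud * p_never_test

# Rough estimate of the chance that every class-1 observation is tested
# at least once, treating observations as independent (an approximation).
p_all_tested = (1 - p_never_test) ** n_fraud

print("P(one observation is never tested):    {:.6f}".format(p_never_test))
print("P(one observation is never trained on): {:.2e}".format(p_never_train))
print("Expected observations never tested:    {:.2f}".format(expected_never_test))
print("Approx. P(all 492 tested at least once): {:.2f}".format(p_all_tested))
```

Under these assumptions, any single observation is almost certain to appear in both roles, while the joint estimate for all 492 observations is noticeably lower.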

An example of a confusion matrix from one of the 20 repetitions estimated above:

|Confusion Matrix|y_true=1|y_true=0|
|----------------|--------|--------|
|**y_predicted=1**|TP=84|FP=16|
|**y_predicted=0**|FN=64|TN=85,279|

#### Accounting for class imbalance

In [None]:
recall_scores=[]
auc_scores=[]
for i in range(20):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    log_reg=LogisticRegression(class_weight='balanced',max_iter=1000).fit(X_train,y_train)
    y_pred=log_reg.predict(X_test)
    recall_scores.append(recall_score(y_test,y_pred))
    auc_scores.append(roc_auc_score(y_test,log_reg.predict_proba(X_test)[:,1]))
print("The Logistic Regression average recall score accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The Logistic Regression average AUC score accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))

An example of a confusion matrix from one of the 20 repetitions estimated above:

|Confusion Matrix|y_true=1|y_true=0|
|----------------|--------|--------|
|**y_predicted=1**|TP=134|FP=3,879|
|**y_predicted=0**|FN=14|TN=81,416|

Accounting for class imbalance improves both the recall score and the AUC.

It is worth highlighting that the number of false positives soared (~240 times the false positives of the model without balanced weights). We can assume that a bank incurs higher costs when it fails to detect a fraudulent transaction than when it labels a legitimate transaction as fraudulent.
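
The trade-off can be made concrete by computing recall and precision from the two example confusion matrices above. The sketch below only does arithmetic on those counts; the helper functions `recall` and `precision` are defined here for illustration.

```python
def recall(tp, fn):
    """Share of actual frauds that were detected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted frauds that were actually fraudulent."""
    return tp / (tp + fp)


# Counts copied from the two example confusion matrices above.
unweighted = {"tp": 84, "fp": 16, "fn": 64, "tn": 85_279}
balanced = {"tp": 134, "fp": 3_879, "fn": 14, "tn": 81_416}

for name, cm in [("unweighted", unweighted), ("balanced", balanced)]:
    print("{:>10}: recall={:.3f} precision={:.3f}".format(
        name, recall(cm["tp"], cm["fn"]), precision(cm["tp"], cm["fp"])))

# How much the false positives grew when balanced weights were used.
fp_ratio = balanced["fp"] / unweighted["fp"]
print("False positives grew by a factor of {:.0f}".format(fp_ratio))
```

In these two examples recall rises while precision falls sharply, so the balanced model is preferable only if a missed fraud costs much more than a false alarm.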

### Support Vector Machines

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC, LinearSVC
import numpy as np

In [6]:
data2=StandardScaler().fit_transform(data.drop('Class',axis=1))
data2=pd.DataFrame(data2,columns=data.drop('Class',axis=1).columns)
data2=pd.concat([data2,data.Class],axis=1)

In [9]:
recall_scores_test=[]
recall_scores_train=[]

for i in range(1):
    X_train, X_test, y_train, y_test=stratified_sampling(data2,0.4)
    svc=SVC(class_weight='balanced',C=0.1).fit(X_train,y_train)
    y_pred_test=svc.predict(X_test)
    y_pred_train=svc.predict(X_train)
    recall_scores_test.append(recall_score(y_test,y_pred_test))
    recall_scores_train.append(recall_score(y_train,y_pred_train))
print("The SVC test recall score accounting for class imbalance is: {}".format(sum(recall_scores_test)/len(recall_scores_test)))
print("The SVC train recall score accounting for class imbalance is: {}".format(sum(recall_scores_train)/len(recall_scores_train)))

The SVC test recall score accounting for class imbalance is: 0.7254237288135593
The SVC train recall score accounting for class imbalance is: 0.9949238578680203


The SVM with an RBF kernel does not perform well compared with the logistic regression results: the test recall (about 0.73) is far below the training recall (about 0.99), which suggests overfitting.

scikit-learn's `SVC` is not efficient on large datasets, therefore I will use Stochastic Gradient Descent with a linear kernel.
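
The notebook does not show the Stochastic Gradient Descent step. A minimal sketch of what it could look like follows, assuming scikit-learn's `SGDClassifier` with `loss="hinge"` (which fits a linear SVM) and a synthetic imbalanced dataset standing in for `data2`. The hyperparameters and the synthetic data are assumptions, not results from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for the scaled dataset).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# loss="hinge" makes SGDClassifier fit a linear SVM; class_weight="balanced"
# reweights the classes in the same way as in the SVC call above.
sgd_svm = SGDClassifier(loss="hinge", class_weight="balanced",
                        max_iter=1000, tol=1e-3, random_state=0)
sgd_svm.fit(X_train, y_train)

train_recall = recall_score(y_train, sgd_svm.predict(X_train))
test_recall = recall_score(y_test, sgd_svm.predict(X_test))
print("train recall:", train_recall)
print("test recall: ", test_recall)
```

Because the model is linear and trained one sample at a time, its training cost grows roughly linearly with the number of rows, which is the motivation for considering it on the full dataset.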

In [None]:
sample_size=list(np.arange(0.1,0.5,0.05))
test_scores=[]
train_scores=[]
for i in sample_size:
    X_train, X_test, y_train, y_test=stratified_sampling(data2,i)
    svc=SVC(class_weight='balanced').fit(X_train,y_train)
    y_pred=svc.predict(X_test)
    y_train_pred=svc.predict(X_train)
    test_scores.append(recall_score(y_test,y_pred))
    train_scores.append(recall_score(y_train,y_train_pred))
    print(i)

In [None]:
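plt.plot(sample_size,train_scores,label='Train recall')
plt.plot(sample_size,test_scores,label='Test recall')
plt.xlabel('Fraction of the data used for training')
plt.legend();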