# Fraud Detection in Credit Card Payments

The data used in the following notebook was obtained from Kaggle (Dal Pozzolo et al 2015). Link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

According to the dictionary, to protect the identity, the variables in the data set are the consequence of a dimensionality reduction process (PCA). The time variable represent the number of seconds elapsed between the transaction and the first transaction in the dataset.

## Downloading the dataset

In case you cloned this repository from Amazon Sagemaker, the best way to obtain the data is to run the following lines:

```
!rm creditcard.csv
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
!unzip creditcardfraud
!rm creditcardfraud.zip
```

Because creditcard.csv was uploaded using `lfs`, When this repository is cloned from Sagemaker, the creditcard.csv file will contain only pointers.

If you clone this repo locally and `lfs` is configured, the dataset will be correcly downloaded.

## Exploring the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!rm creditcard.csv
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c534768_creditcardfraud/creditcardfraud.zip
!unzip creditcardfraud
!rm creditcardfraud.zip

In [2]:
data=pd.read_csv('creditcard.csv')

In [3]:
print("The dataset contains {} observations and {} features".format(data.shape[0],data.shape[1]))
data.head()

The dataset contains 284807 observations and 31 features


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Features

* V1-V28: The principal componentes obtained by PCA. The original features cannot be disclosed for confidentiality reasons.
* Time: Seconds elapsed between the transaction and the first transaction in the dataset.
* Amount: The amount of the transaction.
* Class: Takes the value of **1** if it was a fraudulent transaction and **0** otherwise.

#### Time

The data contains two days of transactions

In [None]:
data.Time.plot(kind='density');
data.Time.describe()

#### PCA components

As expected, all the components of the PCA has zero correlation with each other.

In [None]:
sns.heatmap(data.loc[:,data.columns.str.startswith('V')].corr());

#### Amount

In [None]:
data.Amount.describe()

#### Class

The dataset is extremely imbalanced, with only 0.17% of the observations labelled as fraudulent

In [None]:
print("Number of fraudulent transactions in the dataset: {}".format(data.Class.value_counts()[1]))
print("Proportion of fraudulent transactions in the dataset: {}%".format((data.Class.value_counts()[1]/len(data.Class))*100))

## Model Estimation

Our task is to detect the as many fraudulent transactions as possible. Provided that our data is extremely imbalanced, with more that 99% of the observations labelled as non-fraudulent, algorithms will generally tend to predict all transactions to be on the majority class. For this reason, we should focus on minimizing False Negatives and maximizing True Positives (maximizing Recall).

\begin{align}
Recall=\frac{TP}{TP+FN}
\end{align}

Performing cross validation to estimate the performance (recall) on the test set would not give accurate results. The reason for this is the low proportion of fraudulent transactions, which causes a great variability on the amout of class-1 observations used for training and test in each validation set. 

As we only have 492 fraudulent data points, training an algorithm with 400 of them and testing on 92 may generate very different results as training on 250 of them and testing on the rest.

In order to mantain the proportion of class-1 points in the training and test set, we will perform a stratified sampling to divide the data into train and test.

In [None]:
data.groupby(by='Class').sample(frac=0.7).Class.value_counts()

In [4]:
def stratified_sampling(data,fraction):
    """
    Input: the dataset to split into training and test
    
    Output:
    
    The function returns X_train, X_test, y_train, y_test
    
    """
    train=data.iloc[data.groupby(by='Class').sample(frac=fraction).index]
    test=data.drop(train.index,axis=0)
    
    X_train=train.drop('Class',axis=1)
    X_test=test.drop('Class',axis=1)
    y_train=train.Class
    y_test=test.Class
    return X_train, X_test, y_train, y_test

### Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

#### Not accounting for class imbalance

In [None]:
recall_scores=[]
auc_scores=[]
for i in range(20):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    log_reg=LogisticRegression(max_iter=1000).fit(X_train,y_train)
    y_pred=log_reg.predict(X_test)
    recall_scores.append(recall_score(y_test,y_pred))
    auc_scores.append(roc_auc_score(y_test,log_reg.predict_proba(X_test)[:,1]))

print("The Logistic Regression average recall score without accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The Logistic Regression averace AUC score without accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))

Is important to highlight that the 20 recall scores found above are not independent. 

Some training observations in the training set (or test) are the same among different repetitions. However, using 20 different iterations with only 492 class-1 observations makes highly probable that all the observations are used as training and test data points at least once. The latter occurs because we always select the same proportion of the 492 observations as training and test.

An example of a confusion matrix in one of the 20 repetitions estimated above

|Confusion Matrix|y_true=1|y_true=0|
|----------------|--------|--------|
|**y_predicted=1**|TP=84|FP=16|
|**y_predicted=0**|FN=64|TN=85,279|

#### Accounting for class imbalance

In [None]:
recall_scores=[]
auc_scores=[]
for i in range(20):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    log_reg=LogisticRegression(class_weight='balanced',max_iter=1000).fit(X_train,y_train)
    y_pred=log_reg.predict(X_test)
    recall_scores.append(recall_score(y_test,y_pred))
    auc_scores.append(roc_auc_score(y_test,log_reg.predict_proba(X_test)[:,1]))
print("The Logistic Regression average recall score accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The Logistic Regression average AUC score without accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))

An example of a confusion matrix in one of the 20 repetitions estimated above

|Confusion Matrix|y_true=1|y_true=0|
|----------------|--------|--------|
|**y_predicted=1**|TP=134|FP=3,879|
|**y_predicted=0**|FN=14|TN=81,416|

Accounting for class imbalance improves bot the recall score and the AUC. 

It is worth highlighting that the number of false positives soared (~240 times the false positives found on the model without balanced weights). We can assume that a bank incurrs in more costs when failing to detect fraudulent transactions than when labelling as fraudulent a legal transaction.

### Support Vector Machines

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC, LinearSVC
import numpy as np

In [None]:
data2=StandardScaler().fit_transform(data.drop('Class',axis=1))
data2=pd.DataFrame(data2,columns=data.drop('Class',axis=1).columns)
data2=pd.concat([data2,data.Class],axis=1)

In [None]:
recall_scores_test=[]
recall_scores_train=[]
for i in range(1):
    X_train, X_test, y_train, y_test=stratified_sampling(data2,0.7)
    svc=SVC(class_weight='balanced',C=0.1).fit(X_train,y_train)
    y_pred_test=svc.predict(X_test)
    y_pred_train=svc.predict(X_train)
    recall_scores_test.append(recall_score(y_test,y_pred_test))
    recall_scores_train.append(recall_score(y_train,y_pred_train))
print("The SVC test recall score accounting for class imbalance is: {}".format(sum(recall_scores_test)/len(recall_scores_test)))
print("The SVC train recall score accounting for class imbalance is: {}".format(sum(recall_scores_train)/len(recall_scores_train)))

In [None]:
sample_size=list(np.arange(0.1,0.5,0.05))
test_scores=[]
train_scores=[]
for i in sample_size:
    X_train, X_test, y_train, y_test=stratified_sampling(data2,i)
    svc=SVC(class_weight='balanced').fit(X_train,y_train)
    y_pred=svc.predict(X_test)
    y_train_pred=svc.predict(X_train)
    test_scores.append(recall_score(y_test,y_pred))
    train_scores.append(recall_score(y_train,y_train_pred))
plt.plot(sample_size,train_scores)
plt.plot(sample_size,test_scores)

The learning curve shows that there is room for improvement in the accuracy of our SVM algorithm if we use more training data. Given that our dataset is already large, we can conclude that the gain of having more training observations does not come from non-fraudulent data. In fact the improvement in recall comes entirely from the inclussion of more class-1 observations. The aforementioned problem can be solved with oversampling techniques, which will be tested later in this project.

The SVM performance with RBF kernel is not as accurate as logistic regression and is far more computationally expensive. Using Stochastic Gradient Descent (SGD) is an alternative calculation to SVM.

### XGBoost

In [6]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.3-py3-none-manylinux2010_x86_64.whl (157.5 MB)
[K     |████████████████████████████████| 157.5 MB 17 kB/s s eta 0:00:01    |█████████████▍                  | 65.8 MB 14.8 MB/s eta 0:00:07     |█████████████████▉              | 87.8 MB 70.1 MB/s eta 0:00:01
Installing collected packages: xgboost
Successfully installed xgboost-1.3.3


In [7]:
import xgboost as xgb
from sklearn.metrics import accuracy_score

In [10]:
recall_scores=[]
auc_scores=[]
train_acc_scores=[]
train_recall_scores=[]
for i in range(3):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    xgb_classifier=xgb.XGBClassifier(use_label_encoder=False,objective='binary:logistic',eval_metric='auc',scale_pos_weight=600,eta=0.1,n_estimators=100,max_depth=7, alpha=0.1, n_jobs=-1)
    xgb_classifier.fit(X_train,y_train)
    y_pred=xgb_classifier.predict(X_test)
    y_train_pred=xgb_classifier.predict(X_train)
    recall_scores.append(recall_score(y_test,y_pred))
    train_acc_scores.append(accuracy_score(y_train,y_train_pred))
    train_recall_scores.append(recall_score(y_train,y_train_pred))
    auc_scores.append(roc_auc_score(y_test,xgb_classifier.predict_proba(X_test)[:,1]))

print("The XGBoost average train accuracy score accounting for class imbalance is: {}".format(sum(train_acc_scores)/len(train_acc_scores)))
print("The XGBoost average train recall score accounting for class imbalance is: {}".format(sum(train_recall_scores)/len(train_recall_scores)))
print("The XGBoost average test recall score accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The XGBoost average test AUC score accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
print(tn, fp, fn, tp)

The XGBoost average train accuracy score accounting for class imbalance is: 0.9999197447884273
The XGBoost average train recall score accounting for class imbalance is: 1.0
The XGBoost average test recall score accounting for class imbalance is: 0.8378378378378377
The XGBoost average test AUC score accounting for class imbalance is: 0.9826439004219062
85284 11 21 127


|Confusion Matrix|y_true=1|y_true=0|
|----------------|--------|--------|
|**y_predicted=1**|TP=116|FP=24|
|**y_predicted=0**|FN=32|TN=85,271|

Compared to Logistc Regression, XGBoost takes generates significantly less False Positives with the cost of a lower Recall. This model would be appropriate in case the financial institutions cannot afford to investigate each one of the thousands of potential fraudulent transactions.

### Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [18]:
recall_scores=[]
auc_scores=[]
train_acc_scores=[]
train_recall_scores=[]
for i in range(5):
    X_train, X_test, y_train, y_test=stratified_sampling(data,0.7)
    rf=RandomForestClassifier(n_estimators=170,max_depth=5,class_weight='balanced')
    rf.fit(X_train,y_train)
    y_pred=rf.predict(X_test)
    y_train_pred=rf.predict(X_train)
    recall_scores.append(recall_score(y_test,y_pred))
    train_acc_scores.append(accuracy_score(y_train,y_train_pred))
    train_recall_scores.append(recall_score(y_train,y_train_pred))
    auc_scores.append(roc_auc_score(y_test,xgb_classifier.predict_proba(X_test)[:,1]))

print("The Random forest average train accuracy score accounting for class imbalance is: {}".format(sum(train_acc_scores)/len(train_acc_scores)))
print("The Random forest average train recall score accounting for class imbalance is: {}".format(sum(train_recall_scores)/len(train_recall_scores)))
print("The Random forest average test recall score accounting for class imbalance is: {}".format(sum(recall_scores)/len(recall_scores)))
print("The Random forest average test AUC score accounting for class imbalance is: {}".format(sum(auc_scores)/len(auc_scores)))
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
print(tn, fp, fn, tp)

The Random forest average train accuracy score accounting for class imbalance is: 0.9973114504123111
The Random forest average train recall score accounting for class imbalance is: 0.8976744186046511
The Random forest average test recall score accounting for class imbalance is: 0.8391891891891892
The Random forest average test AUC score accounting for class imbalance is: 0.9928938041740668
85027 268 20 128
