# Credit Card Fraud Detection

# Introduction

The dataset contains 284,807 credit card transactions made by European cardholders over two days in September 2013.<br>
Credit card companies need to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

**Goal**<br>
Build a predictive model to identify fraudulent credit card transactions (supervised binary classification problem).

**Data**<br>
According to the Kaggle 'Overview' tab:
- Features V1, V2, ... V28 are the principal components obtained by PCA
- Feature 'Time' contains the seconds elapsed between a transaction and the first transaction in the dataset
- Feature 'Amount' is the transaction Amount
- Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

**Metric**<br>
The dataset is highly unbalanced:
- 492 fraudulent transactions vs. 284,315 genuine ones
- So the positive class (frauds) accounts for 0.172% of all transactions.
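
To make the imbalance concrete, the short sketch below recomputes the fraud share from the two counts above. It also shows why plain accuracy would be misleading here: a hypothetical baseline that labels every transaction as genuine is almost always right while catching no fraud at all. The variable names are illustrative.

```python
# Class counts taken from the dataset description above
n_frauds = 492
n_genuines = 284_315
n_total = n_frauds + n_genuines

# Share of the positive class (frauds), in percent
fraud_share = 100 * n_frauds / n_total

# Hypothetical baseline: predict "genuine" for every transaction
baseline_accuracy = 100 * n_genuines / n_total  # every genuine is correct
baseline_recall = 0 / n_frauds                  # no fraud is ever caught

print(f"Total transactions: {n_total}")
print(f"Fraud share: {fraud_share:.4f} %")
print(f"All-genuine baseline accuracy: {baseline_accuracy:.2f} %")
print(f"All-genuine baseline recall: {baseline_recall:.0%}")
```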

1. First, we want to catch frauds, which means minimizing the number of False Negatives.
> Recall
2. Then, we want to limit false alarms, which means minimizing the number of False Positives.
> Precision

    **The Area under the Precision-Recall curve is the compromise metric in this Precision-Recall trade-off.**<br>
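
As a reminder of how the two metrics relate to the confusion matrix, here is a minimal sketch. The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual frauds that were caught: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of fraud alerts that were real frauds: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only
tp, fn, fp = 80, 20, 40

print("Recall:", recall(tp, fn))        # 80 frauds caught out of 100
print("Precision:", precision(tp, fp))  # 80 real frauds out of 120 alerts
```

Lowering False Negatives raises Recall, lowering False Positives raises Precision, and a classifier usually trades one against the other.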
   
<br>

**Outline**<br>

1. [Load Dataset](#load dataset)<br>

2. [Summarize Dataset](#summarize dataset)<br>
2.1. [Descriptive Statistics](#descriptive statistics)<br>
2.2. [Data Visualization](#data visualization)<br>
2.2.1. [When do frauds happen?](#time)<br>
2.2.2. [What is the amount of a fraud transaction?](#amount)<br>
2.2.3. [Correlations](#correlations)<br>

3. [Prepare Dataset](#prepare dataset)<br>
3.1. [Feature selection](#feature selection)<br>
3.2. [Data Transform](#data transforms)<br>
3.3. [Subsampling](#subsampling)<br>

4. [50/50 Dataset](#50/50 dataset)<br>
4.1. [Split-out validation Dataset](#50/50 split-out)<br>
4.2. [Set Metrics](#50/50 metrics)<br>
4.3. [Spot Check Algorithms](#50/50 spot check)<br>
4.3.1. [Train and Test Sets](#50/50 train test)<br>
4.3.2. [K-Fold cross validation](#50/50 kfold)<br>
4.4. [Classifiers Tuning](#tuning)<br>
4.4.1. [SVM Tuning](#svm tuning)<br>
4.4.2. [XGBoost Tuning](#xgb tuning)<br>
4.5. [Conclusion](#50/50 conclusion)<br>

# 1. Load dataset
<a id="load dataset"></a>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

seed = 7

%matplotlib inline

dataset = pd.read_csv('../input/creditcard.csv')

print('This dataset contains ',dataset.shape[0],'rows')
print('This dataset contains ',dataset.shape[1],'columns')

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
# Check if NA values are present
dataset.isnull().sum().sum()

In [None]:
# Change the type of the Class column
dataset.Class = dataset.Class.astype('bool')

# Get the count of each Class
dataset.groupby('Class').size()

> There are no 'NA' values.

> The dataset is unbalanced.

# 2. Summarize Data
<a id="summarize dataset"></a>

## 2.1. Descriptive Statistics
<a id="descriptive statistics"></a>

In [None]:
dataset.describe()

## 2.2. Data Visualization
<a id="data visualization"></a>

In [None]:
import seaborn as sns

frauds = dataset[dataset.Class==True]
genuines = dataset[dataset.Class==False]

### 2.2.1. When do frauds happen?
<a id="time"></a>

>Hypothesis: Frauds happen at night.

In [None]:
# 'Time' visualization
sns.distplot(dataset.Time,
             bins=80, color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
plt.title('Density plot and Histogram of Time (in seconds)')
plt.show()

> A day is 86,400 seconds, so we can recognize the 2 days of transactions. Each peak corresponds to a day and each valley corresponds to a night.
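
The conversion behind this reading can be sketched as follows. Note that 'Time' counts seconds since the first transaction, so the "hour" computed here is relative to that first transaction and not necessarily to the wall clock. The function name and the sample values are hypothetical.

```python
from collections import Counter

SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def split_time(seconds):
    """Return (day index, hour within that day) for an elapsed time in seconds."""
    day, remainder = divmod(int(seconds), SECONDS_PER_DAY)
    return day, remainder // SECONDS_PER_HOUR


# Hypothetical 'Time' values, in seconds since the first transaction
times = [0, 3_599, 43_200, 86_400, 100_000, 172_000]

per_day = Counter(split_time(t)[0] for t in times)
for t in times:
    print(t, "->", split_time(t))
print("Transactions per day index:", dict(per_day))
```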

In [None]:
# 'Time' visualization for frauds
sns.distplot(frauds.Time,
             bins=80, color = 'darkgreen', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
plt.title('Density plot and Histogram of Time for Frauds (in seconds)')
plt.show()

>The density is flatter for frauds, so the proportion of frauds may be higher at night.

### 2.2.2. What is the amount of a fraud transaction?
<a id="amount"></a>

>Hypothesis: the higher an amount is, the more likely the transaction is fraudulent.

In [None]:
# 'Amount' visualization
# According to the Introduction, we assume the currency is Euro
sns.distplot(dataset.Amount, 
             bins=80, color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
plt.title('Density plot and Histogram of Amount (in €)')
plt.show()

In [None]:
# Ratio of Frauds vs. Amount
amounts = np.linspace(0,5000,1001)
ratios = np.array([])

for amount in amounts:
    
    nbGenuine = len(genuines[genuines.Amount > amount])
    nbFrauds = len(frauds[frauds.Amount > amount])
    ratio = 100*nbFrauds/nbGenuine
    
    ratios = np.append(ratios,ratio)

plt.plot(amounts,ratios,'r-')
plt.title('Ratio #Fraud/#Genuine vs. Amount')
plt.xlabel('Amount (in €)')
plt.ylabel('Ratio (in %)')
plt.show()

>Contrary to what one might expect, a large transaction amount does not imply that the transaction is fraudulent.

### 2.2.3. Correlations
<a id="correlations"></a>

In [None]:
from pandas.plotting import scatter_matrix

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
cax = ax.matshow(dataset.corr(), vmin=-1, vmax=1, interpolation='none')
fig.colorbar(cax)
plt.show()

>The 'V' predictors are not correlated with each other. Indeed, as principal components, they are orthogonal to each other.

>The 'blurred' border is due to the 'Time', 'Amount' and 'Class' columns: there is no significant correlation between the variables.
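
To illustrate why principal components come out uncorrelated, here is a dependency-free sketch on synthetic two-dimensional data (not the credit card data). In two dimensions, PCA amounts to rotating the axes by the angle that diagonalizes the covariance matrix; with more features the same idea is carried out through an eigendecomposition. All names below are illustrative.

```python
import math
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation coefficient."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Synthetic, deliberately correlated features
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(1_000)]
y = [0.8 * xi + 0.3 * rng.gauss(0, 1) for xi in x]

# 2-D PCA: rotate the axes by the angle that cancels the covariance term
theta = 0.5 * math.atan2(2 * covariance(x, y), covariance(x, x) - covariance(y, y))
pc1 = [math.cos(theta) * xi + math.sin(theta) * yi for xi, yi in zip(x, y)]
pc2 = [-math.sin(theta) * xi + math.cos(theta) * yi for xi, yi in zip(x, y)]

corr_before = correlation(x, y)
corr_after = correlation(pc1, pc2)
print(f"Correlation before rotation: {corr_before:.3f}")
print(f"Correlation after rotation:  {corr_after:.3e}")
```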

# 3. Prepare Dataset
<a id="prepare dataset"></a>

## 3.1. Feature Selection
<a id="feature selection"></a>

>'V1' to 'V28' have already been preprocessed, so we will use them as predictors. We have to decide whether to use 'Time' and 'Amount' as predictors:
- As it is, 'Time' does not seem very relevant.
- As it is, 'Amount' should already bring information.

## 3.2. Data Transforms
<a id="data transforms"></a>

>We will not apply any transformation to variables 'V1', 'V2', ... , 'V28'.
>'Time' and 'Amount' could be transformed:
- 'Time' could be used to set a new boolean variable called 'Day' (_True_ for Day, _False_ for Night, cf. [previously](#time)). So regarding the 'Time' variable, we could either:
    1. Not use it
    2. Or replace it by 'Day' 
- 'Amount' could be used:
    1. As it is
    2. Normalized
    3. Standardized
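
The two rescaling options for 'Amount' can be sketched as below. This is a plain-Python illustration with hypothetical amounts; it computes the same quantities as scikit-learn's `MinMaxScaler` and `StandardScaler` with their default settings, which would be the natural choice inside a pipeline.

```python
import statistics


def normalize(values):
    """Min-max scaling: map values linearly onto [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Z-score scaling: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical amounts, for illustration only
amounts = [0.0, 1.0, 22.0, 77.0, 2_000.0]

print("Normalized:  ", [round(v, 4) for v in normalize(amounts)])
print("Standardized:", [round(v, 4) for v in standardize(amounts)])
```

With a heavily skewed variable such as a transaction amount, a single very large value compresses all the others towards 0 after min-max scaling, which is one reason to compare both options.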
    
## 3.3. Subsampling
<a id="subsampling"></a>

>To continue our analysis of the dataset, we could:
1. Use the whole dataset as it is: so no sampling
2. Create a new dataset composed of the 492 frauds and of 492 random genuine transactions (50/50 sampling). If so, we would have to ensure that the 492 random genuine transactions are representative of the genuine transactions...
3. A way to manage this  would be to resample 492 genuine transactions randomly N times and to compare results.
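
The "N times 50/50" option could look like the sketch below, which only draws the row positions of the genuine transactions. The value of N, the seeds and the function name are assumptions made for illustration.

```python
import random

N_FRAUDS = 492
N_GENUINES = 284_315
N_REPEATS = 5  # hypothetical value for N


def balanced_samples(n_repeats, base_seed=7):
    """Yield one list of genuine row positions per repetition.

    Each list has as many entries as there are frauds, drawn without
    replacement, with a different seed per repetition.
    """
    for i in range(n_repeats):
        rng = random.Random(base_seed + i)
        yield rng.sample(range(N_GENUINES), N_FRAUDS)


samples = list(balanced_samples(N_REPEATS))
for i, picked in enumerate(samples):
    overlap = len(set(picked) & set(samples[0]))
    print(f"Repetition {i}: {len(picked)} genuine rows, {overlap} shared with repetition 0")
```

Each draw would then be combined with the 492 frauds and evaluated in the same way; the spread of the scores across draws indicates how much the results depend on one particular genuine subsample.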

>So different configurations for our dataset are possible regarding 'Time', 'Amount' and subsampling:

>| 'Time'               | 'Amount'        |      Subsampling |
>| :------------        | :-------------: | -------------:   |
>| Not use it           | As it is        |               No |
>| Replace it by 'Day'  | Normalized      |            50/50 |
>|                      | Standardized    |    N times 50/50 |

>If you are curious about which configuration brings the best results, go for it!

>Next in this notebook, we will explore one of those configurations.

# 4. 50/50 Dataset
<a id="50/50 dataset"></a>

>As a first step in this analysis, we will use the following highlighted configuration for our dataset:

>| 'Time'               | 'Amount'        |      Subsampling |
>| :------------        | :-------------: | -------------:   |
>| _**Not use it**_     | _**As it is**_  |             No   |
>| Replace it by 'Day'  | Normalized      |      _**50/50**_ |
>|                      | Standardized    |  N times 50/50   |

>Let's create the new dataset.

In [None]:
# We randomly select 492 genuine transactions
genuines_sub = genuines.sample(492, random_state=seed)

# dataset_sub is the dataset composed of 492 frauds and of 492 genuine transactions
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
dataset_sub = pd.concat([frauds, genuines_sub], ignore_index=True)

# We drop the 'Time' column
dataset_sub = dataset_sub.drop('Time',axis=1)

print('This sub dataset contains ',dataset_sub.shape[0],'rows')
print('This sub dataset contains ',dataset_sub.shape[1],'columns')

In [None]:
dataset_sub.groupby('Class').size()

## 4.1. Split-out validation Dataset
<a id="50/50 split-out"></a>

In [None]:
from sklearn.model_selection import train_test_split

# Predictors
X = dataset_sub.drop('Class',axis=1)

# Response
y = dataset_sub.Class

# Split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,random_state = seed)

# Proportion of fraud in train set and test set
print('Proportion of fraud in train:',y_train[y_train == True].shape[0]/X_train.shape[0])
print('Proportion of fraud in test:',y_test[y_test == True].shape[0]/X_test.shape[0])

>The proportion should be close to 50/50 (the split is random, not stratified, so small deviations are expected).
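
If an exact 50/50 proportion is wanted in both parts, the split can be stratified. The sketch below illustrates the idea in plain Python on the labels only; it is not scikit-learn's implementation, and in practice `train_test_split` accepts a `stratify` argument for this purpose.

```python
import random


def stratified_split(labels, test_size=0.2, seed=7):
    """Split row positions into train/test while keeping the class ratio.

    Each class is shuffled and split separately, so both parts receive
    the same share of frauds (up to rounding).
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        rows = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(rows)
        n_test = round(test_size * len(rows))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Labels of the balanced subsample: 492 frauds followed by 492 genuines
labels = [True] * 492 + [False] * 492
train_idx, test_idx = stratified_split(labels)

fraud_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
fraud_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print("Proportion of fraud in train:", fraud_share_train)
print("Proportion of fraud in test:", fraud_share_test)
```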

## 4.2. Set metrics
<a id="50/50 metrics"></a>

>As mentioned in the Introduction, we will use the following metrics:
- Precision
- Recall
- Area under the Precision-Recall Curve

>The confusion matrix will also be used.

>**We have to take these metrics into account to find the best compromise in the Recall-Precision trade-off. Here are some extreme cases we want to avoid:**
- Recall = 1, Precision equal to the share of frauds in the data (0.5 on a 50/50 dataset, close to 0 on the full dataset): the naive classifier that classifies all transactions as frauds
- Recall = 0, Precision undefined (no transaction is flagged): the naive classifier that classifies all transactions as genuine
- Recall = 0.5, Precision = 0.5 (on a 50/50 dataset): the random classifier that classifies all transactions randomly

>**The Area Under the Precision-Recall Curve (AUPRC) will be our standard metric ([average_precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html)).**
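
To give an intuition of what average precision measures, here is a hand-written sketch of the usual definition, the sum of (R_n - R_{n-1}) * P_n over decreasing score thresholds. It assumes binary labels and no tied scores; the labels and scores are hypothetical. In the notebook itself we rely on scikit-learn's `average_precision_score`.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores.

    Transactions are ranked by decreasing score. Each time a fraud is met,
    the precision at that rank is recorded; the result is the mean of
    those precisions.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)
    n_positives = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, is_fraud) in enumerate(ranked, start=1):
        if is_fraud:
            hits += 1
            total += hits / rank  # precision among the top `rank` transactions
    return total / n_positives


# Hypothetical labels (1 = fraud) and classifier scores
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.1]

ap = average_precision(y_true, y_score)
print(f"Average precision: {ap:.4f}")
```

A classifier that ranks every fraud above every genuine transaction reaches 1.0; each genuine transaction ranked above a fraud lowers the score.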

In [None]:
from sklearn.metrics import recall_score, precision_recall_curve, average_precision_score, confusion_matrix, precision_score

scoring = 'average_precision'

## 4.3. Spot Check Algorithms
<a id="50/50 spot check"></a>

>First, the following classifiers will be tested:
- Logistic Regression
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Support Vector Machine

>Then the following ensemble methods will be tested:
- Random Forest
- Ada Boost
- Stochastic Gradient Boosting
- XGBoost

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
import xgboost

models = []
models.append(('LR',LogisticRegression(random_state=seed)))
models.append(('LDA',LinearDiscriminantAnalysis()))
models.append(('QDA',QuadraticDiscriminantAnalysis()))
models.append(('SVM',SVC(random_state=seed,gamma='scale')))

ensembles = []
ensembles.append(('RF', RandomForestClassifier(random_state=seed,n_estimators=100)))
ensembles.append(('ADA', AdaBoostClassifier(random_state=seed)))
ensembles.append(('GBM', GradientBoostingClassifier(random_state=seed)))
ensembles.append(('XGB', XGBClassifier(random_state=seed)))

### 4.3.1. Train and Test Sets
<a id="50/50 train test"></a>

>First, we evaluate the performance of our classifiers with a direct train-test strategy.

In [None]:
# Models evaluation function
def get_score_models(model,X_train,X_test,y_train,y_test):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    
    # All our models implement the 'decision_function' method
    # It is not the case of all our ensembles
    y_score = model.decision_function(X_test)
    
    compare(y_test,y_pred,y_score)
    
# Ensembles evaluation function
def get_score_ensembles(ensemble,X_train,X_test,y_train,y_test):
    ensemble.fit(X_train,y_train)
    y_pred = ensemble.predict(X_test)
    
    # All our ensembles implement the 'predict_proba' method
    # It is not the case of all our models
    y_score = ensemble.predict_proba(X_test)[:,1]
    
    compare(y_test,y_pred,y_score)

# Print metrics and graph function
def compare(y_test,y_pred,y_score):
    print('Confusion matrix:')
    print(confusion_matrix(y_test,y_pred))
    
    print('Recall:',recall_score(y_test,y_pred))
    print('Precision:',precision_score(y_test,y_pred))
    print('Area under the curve:',average_precision_score(y_test,y_score))
    
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    plt.step(recall, precision, alpha=0.4, color='b', where='post')
    plt.fill_between(recall, precision, alpha=0.2, color='b', step='post')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0, 1.05])
    plt.xlim([0, 1])
    plt.title('Precision-Recall curve')
    plt.show()
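
The two evaluation functions above differ only in how they obtain a continuous score. A possible way to merge them is sketched below: the helper prefers `decision_function` and falls back to `predict_proba`. The helper name and the stub classes are hypothetical stand-ins used so that the example runs without fitting any real model.

```python
def get_scores(estimator, X):
    """Return continuous scores for the positive class.

    Prefers `decision_function`; falls back to the second column of
    `predict_proba`. Either gives a ranking usable for a precision-recall curve.
    """
    if hasattr(estimator, "decision_function"):
        return list(estimator.decision_function(X))
    if hasattr(estimator, "predict_proba"):
        return [row[1] for row in estimator.predict_proba(X)]
    raise TypeError(f"{type(estimator).__name__} exposes no score method")


# Minimal stand-ins for fitted estimators, for demonstration only
class MarginStub:
    def decision_function(self, X):
        return [x - 0.5 for x in X]


class ProbaStub:
    def predict_proba(self, X):
        return [[1 - x, x] for x in X]


X_demo = [0.1, 0.9]
print(get_scores(MarginStub(), X_demo))
print(get_scores(ProbaStub(), X_demo))
```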

In [None]:
# Evaluation of each model
for name,model in models:
    print('----------',name,'----------')
    get_score_models(model,X_train,X_test,y_train,y_test)

In [None]:
# Evaluation of each ensemble method
for name,ensemble in ensembles:
    print('----------',name,'----------')
    get_score_ensembles(ensemble,X_train,X_test,y_train,y_test)

**Train-Test Split conclusion**

>Classifiers results:

>| Model          | Recall (in %)  | _AUPRC (in %)_ | _Precision (in %)_ |
>| :------------  | :------------: | :------------: |   ---------------: |
>| LR             | 92.2           |           96.7 |             _94.3_ |
>| LDA            | 85.6           |           92.6 |             _95.1_ |
>| QDA            | 94.4           |           98.0 |             _92.4_ |
>| SVM            | **97.8**       |           94.7 |             _84.6_ |
>| **RF**         | 94.4           |       **98.4** |             _95.5_ |
>| ADA            | 95.6           |           98.1 |             _91.5_ |
>| GBM            | 92.2           |           94.5 |             _95.4_ |
>| **XGB**        | 94.4           |       **98.4** |             _94.4_ |

>- Focusing on Recall, the SVM classifier gets the best score _(on this particular Train-Test split)_. But at the same time, the SVM classifier has the lowest Precision (84.6 %).

>- Regarding the Area under the Precision-Recall curve, **Random Forest** and **XGBoost** get the best score.

### 4.3.2. K-Fold cross validation
<a id="50/50 kfold"></a>

>Now, we evaluate the performance of our classifiers with a 10-Fold cross validation.
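
As a reminder of what a 10-fold split does, the sketch below generates consecutive, unshuffled folds in plain Python. It assumes 787 training rows (984 rows minus a 20 % test split); the function name is hypothetical and the code only illustrates the idea, scikit-learn's `KFold` being what the notebook actually uses.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, validation) row positions for consecutive, unshuffled folds.

    The first `n_samples % n_splits` folds receive one extra row so that
    every row is used for validation exactly once.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Assumption: 787 training rows
folds = list(kfold_indices(n_samples=787, n_splits=10))
sizes = [len(validation) for _, validation in folds]
print("Validation fold sizes:", sizes)
```

Each classifier is therefore fitted 10 times, each time on about 90 % of the training rows, and the mean and standard deviation of the 10 scores are reported.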

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 10
# No shuffling here: recent scikit-learn versions raise an error
# if random_state is set while shuffle=False
kfold = KFold(n_splits=num_folds)

names = []
results_recall = []
results_aupcr = []
models_score = {}
ensembles_score = {}

# Function cross validating and printing Recall and AUPRC results
def cross_validation(name,classifier,classifiers_score,results_recall,results_aupcr):
    # Evaluate the classifier passed as argument (not the global 'model')
    cv_results_recall = cross_val_score(classifier,X_train,y_train,cv=kfold,scoring='recall')
    cv_results_auprc = cross_val_score(classifier,X_train,y_train,cv=kfold,scoring=scoring)

    # Store [mean Recall, std Recall, mean AUPRC, std AUPRC] in the dictionary passed as argument
    classifiers_score[name] = [cv_results_recall.mean()]
    classifiers_score[name].append(cv_results_recall.std())
    classifiers_score[name].append(cv_results_auprc.mean())
    classifiers_score[name].append(cv_results_auprc.std())
    
    results_recall.append(cv_results_recall)
    results_aupcr.append(cv_results_auprc)
    names.append(name)

    print('----------',name,'----------')
    print('Recall:',classifiers_score[name][0],'(',classifiers_score[name][1],')')
    print('AUPRC:',classifiers_score[name][2],'(',classifiers_score[name][3],')\n')

In [None]:
# 10-Fold cross validation on our models
for name,model in models:
    cross_validation(name,model,models_score,results_recall,results_aupcr)

In [None]:
# 10-Fold cross validation on ensembles
for name,ensemble in ensembles:
    cross_validation(name,ensemble,ensembles_score,results_recall,results_aupcr)

In [None]:
# Compare Classifiers regarding Recall
fig = plt.figure()
fig.suptitle('Classifiers Recall Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_recall)
ax.set_xticklabels(names)
plt.show()

In [None]:
# Compare Classifiers regarding the AUPRC
fig = plt.figure()
fig.suptitle('Classifiers AUPRC Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_aupcr)
ax.set_xticklabels(names)
plt.show()

**10-Fold cross validation conclusion**

>Classifiers results:

>| Model          | Mean Recall (in %) | Std Recall (in %) | Mean AUPRC (in %) | Std AUPRC (in %) |
>| :------------  | :-------------:    |   :-------------: | :---------------: | ---------------: |
>| LR             | 91.3               |              5.74 |              98.3 |             1.14 |
>| LDA            | 85.3               |              5.15 |              97.3 |             1.72 |
>| QDA            | 89.8               |              5.34 |              97.9 |             1.21 |
>| **SVM**        | **92.2**           |              3.94 |              96.4 |             2.61 |
>| RF             | 88.7               |              4.34 |              97.1 |             1.60 |
>| ADA            | 91.7               |              5.48 |              98.3 |             1.33 |
>| GBM            | 90.2               |              5.84 |              98.3 |             1.03 |
>| **XGB**        | 91.6               |              5.71 |          **98.5** |             1.03 |

>- Again, focusing on Recall, the SVM classifier gets the best score _(on this particular genuine subsample)_. But at the same time, the SVM classifier has the poorest AUPRC.

>- Regarding the Area under the Precision-Recall curve, **XGBoost** gets the best score _(note that Random Forest gets poorer results with cross-validation)_.


## 4.4. Classifiers Tuning
<a id="tuning"></a>

>Before switching to a new configuration of our dataset, let's try to get a better AUPRC for the Support Vector Classifier and XGBoost by tuning their parameters.

### 4.4.1. SVM Tuning
<a id="svm tuning"></a>

>For the SVM Classifier, the following parameters will be explored:
- _C_: the penalty of the error term
- _kernel_: the kernel type used by the algorithm.

In [None]:
from sklearn.model_selection import GridSearchCV

# Function executing the Grid Search and printing the result
def search_param(model,X_train,y_train,param_grid,scoring,kfold):
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
    grid_result = grid.fit(X_train, y_train)
    
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
# /!\ May take some time
# SVC Parameters values that will be tested:
C = [0.001,0.01,0.1,1]
kernel_values = ['rbf', 'sigmoid', 'linear']

param_grid = dict(C=C,kernel=kernel_values)

model = SVC(random_state=seed,gamma='scale')

search_param(model,X_train,y_train,param_grid,scoring,kfold)

In [None]:
# kernel = 'linear'
# C contained in [0.001;0.01]
C = np.linspace(0.001,0.01,10)
param_grid = dict(C=C)

model = SVC(kernel='linear',random_state=seed)

search_param(model,X_train,y_train,param_grid,scoring,kfold)

> With (kernel='linear', C = 0.002), the SVM Classifier gets 98.6% AUPRC.

In [None]:
# Check the Recall for the tuned SVM
model = SVC(kernel='linear',C=0.002,random_state=seed)
cv_results = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall')

print('Recall:',cv_results.mean(),'(',cv_results.std(),')')

> Default SVM Classifier scores (C = 1.0, kernel = 'rbf'):
- AUPRC: 96.4 % (_2.61_ %)
- Recall: **92.2** % (_3.94_ %)

> Tuned SVM Classifier scores (C = 0.002, kernel = 'linear'):
- AUPRC: **98.6** % (_1.03_ %)
- Recall: 85.9 % (_4.75_ %)

### 4.4.2. XGBoost Tuning
<a id="xgb tuning"></a>

>For the XGBoost Classifier, the following parameters will be explored:
- *learning_rate*: the Boosting learning rate (xgb’s 'eta')
- *n_estimators*: the Number of boosted trees to fit
- *max_depth*: the maximum tree depth for base learners.

In [None]:
# /!\ May take some time
# XGBoost Parameters values that will be tested:
learning_rate = [0.01,0.1,1]
n_estimators = [10,100,1000]
max_depth = np.linspace(2,5,4).astype('int')

param_grid = dict(learning_rate=learning_rate,n_estimators=n_estimators,max_depth=max_depth)

model = XGBClassifier(random_state=seed)

search_param(model,X_train,y_train,param_grid,scoring,kfold)
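
To see why this search takes time, the sketch below enumerates the XGBoost grid defined above and counts the number of model fits with 10-fold cross-validation. The helper name is hypothetical; `GridSearchCV` performs this enumeration internally and, by default, adds one final refit on the whole training set.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination of a grid as a list of dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


# Same values as the XGBoost grid above
param_grid = {
    "learning_rate": [0.01, 0.1, 1],
    "n_estimators": [10, 100, 1000],
    "max_depth": [2, 3, 4, 5],
}
num_folds = 10

combinations = expand_grid(param_grid)
n_fits = len(combinations) * num_folds
print(len(combinations), "combinations x", num_folds, "folds =", n_fits, "fits")
print("First combination:", combinations[0])
```

Because the number of fits is the product of the grid sizes, adding one more value to any parameter list increases the cost noticeably, which is why the second search narrows the ranges instead of widening the grid.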

In [None]:
# /!\ May take some time
# max_depth = 2

learning_rate = np.linspace(0.1,1,10)
n_estimators = np.linspace(10,100,10).astype('int')

param_grid = dict(learning_rate=learning_rate,n_estimators=n_estimators)

model = XGBClassifier(max_depth=2,random_state=seed)

search_param(model,X_train,y_train,param_grid,scoring,kfold)

In [None]:
# Check the Recall for the tuned XGB
model = XGBClassifier(max_depth=2, learning_rate=0.2,n_estimators=60,random_state=seed)
cv_results = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall')

print('Recall:',cv_results.mean(),'(',cv_results.std(),')')

> Default XGB Classifier scores (max_depth = 3, learning_rate = 0.1, n_estimators = 100):
- AUPRC: 98.5 % (_1.03_ %)
- Recall: **91.6** % (_5.71_ %)

> Tuned XGB Classifier scores (max_depth = 2, learning_rate = 0.2, n_estimators = 60):
- AUPRC: **98.7** % (_0.93_ %)
- Recall: 91.1 % (_5.95_ %)

**Part 4 (50/50 Dataset) Conclusion:**
<a id="50/50 conclusion"></a>

In this part, we built a balanced dataset composed of the 492 frauds and of 492 randomly selected genuine transactions.

We were aware of the importance of the Recall, to identify fraudulent transactions and minimize the number of False Negatives.

But we were also aware of the importance of the Area under the Precision-Recall curve, to keep the number of False Positives from skyrocketing.

(_False Positive: genuine transaction classified as fraud_)
(_False Negative: fraud classified as genuine transaction_)

Eight classifiers have been trained, by direct Train-Test split and then by cross-validation, to get more reliable results.

We shortlisted the one with the best mean Recall (Support Vector Classifier with 92.2 %) and the one with the best mean Area under the Precision-Recall curve (XGBoost Classifier with 98.5 %).

By tuning their parameters, we were able to increase the AUPRC metric, often at the expense of the Recall:
- Standard SVM: Lowest AUPRC (96.4%) but best Recall (92.2%)
- Tuned XGBoost: Best AUPRC (98.7%) with a high Recall (91.1%)
- Standard XGBoost: Both high AUPRC (98.5%) and Recall (91.6%)
- Tuned SVM: Very high AUPRC (98.6%) but lowest Recall (85.9%)

Note: On my side, I also tuned SVM and XGB with the Recall as the scoring metric. This leads to a very high Recall (~98%) but the AUPRC drops (~50%).

Thank you for reading this notebook. Part 5 will come later.