## Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

## Content
The dataset contains transactions made by credit cards in **September 2013** by European cardholders.
This dataset presents transactions that occurred over two days, with **492** frauds out of **284,807** transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for **0.172%** of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used, for example, in example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

## Inspiration
Identify fraudulent credit card transactions.

## Recommendation
Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
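
As a hedged illustration of the recommended metric, the sketch below computes AUPRC with scikit-learn on synthetic imbalanced data (not the credit card dataset). The names `X_demo`, `y_demo`, and the choice of model are assumptions made only for this example. The point it shows is that AUPRC is computed from continuous scores such as `predict_proba` output, not from hard 0/1 predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 2% positives); hypothetical, not the real dataset.
X_demo, y_demo = make_classification(n_samples=5000, n_features=10, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     class_sep=1.5, weights=[0.98, 0.02],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # continuous fraud scores

# Two common AUPRC estimates: average precision and trapezoidal area under the PR curve.
ap = average_precision_score(y_te, scores)
precision, recall, _ = precision_recall_curve(y_te, scores)
auprc_trapz = auc(recall, precision)

baseline = y_te.mean()                          # positive-class fraction in the test split
print("Average precision: %0.3f" % ap)
print("Trapezoidal AUPRC: %0.3f" % auprc_trapz)
print("Positive fraction: %0.3f" % baseline)
```

A classifier that ranks transactions at random scores an average precision close to the positive-class fraction, which is why `baseline` is a useful comparison point.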

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import StackingClassifier

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, accuracy_score, precision_score
from sklearn.metrics import classification_report, recall_score

from timeit import default_timer as timer
import os


Using TensorFlow backend.


In [2]:
# Fixed random seed for reproducibility. Note: `seed` is not referenced by the cells shown here, which set their own random_state values.
seed = 10
plt.close('all')


In [3]:
# Read data
data = pd.read_csv("data/creditcard.csv")
print(data.head())
# data = data[:100000]

   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28 

### Since V1 to V28 are already PCA components, only the Time and Amount columns need scaling. StandardScaler is applied to these two columns (RobustScaler is kept as a commented-out alternative).

In [4]:
# scaler = RobustScaler()
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
data['Time'] = scaler.fit_transform(data['Time'].values.reshape(-1, 1))
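
The cell above fits the scaler on the full dataset before the train/test split. A stricter variant, sketched here under the assumption that avoiding any test-set influence matters, fits the scaler on the training portion only and reuses its statistics for the test portion. The `amount` array is hypothetical synthetic data, not the real 'Amount' column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical skewed "amount" values standing in for the real column.
amount = rng.lognormal(mean=3.0, sigma=1.5, size=(1000, 1))

amount_train, amount_test = train_test_split(amount, test_size=0.2, random_state=0)

scaler = StandardScaler()
amount_train_scaled = scaler.fit_transform(amount_train)  # learn mean/std on train only
amount_test_scaled = scaler.transform(amount_test)        # reuse the train statistics

print("Train mean/std: %0.3f / %0.3f" % (amount_train_scaled.mean(), amount_train_scaled.std()))
print("Test  mean/std: %0.3f / %0.3f" % (amount_test_scaled.mean(), amount_test_scaled.std()))
```

The test split will generally not have exactly zero mean and unit variance after this transform, which is expected: it is scaled with the training statistics.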


In [5]:
print(data.head())

       Time        V1        V2        V3        V4        V5        V6  \
0 -1.996583 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388   
1 -1.996583  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361   
2 -1.996562 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499   
3 -1.996562 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203   
4 -1.996541 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921   

         V7        V8        V9  ...       V21       V22       V23       V24  \
0  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928   
1 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846   
2  0.791461  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281   
3  0.237609  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575   
4  0.592941 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267   

        V25       V26       V27       V28    Amount  Class  
0  0.12

**The data has a total of 31 columns.  
The Class column is used as Y.  
The remaining 30 columns are used as X.   
X includes the normalized Time and Amount columns.**

In [6]:
# Select data
X = data.iloc[:, range(30)]
Y = data.Class

In [7]:
# Check class imbalance
class_count = Y.value_counts()
print(class_count)

0    284315
1       492
Name: Class, dtype: int64


## Dataset is highly imbalanced.

## The dataset is split 80:20 before any class imbalance solution is applied. This prevents data leakage from the test set into the resampled training set.

In [8]:
# Train Test split dataset into 80:20 ratio.
X_train, X_test, Y_train, Y_test = tts(X, Y,
                                       test_size=0.2,
                                       random_state=42,
                                       shuffle=True)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(227845, 30)
(56962, 30)
(227845,)
(56962,)
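
The split above is a plain shuffled split, so the number of fraud cases in each part depends on the random draw. One possible refinement, shown here as a sketch on toy labels (`X_demo` and `y_demo` are hypothetical), is to pass `stratify` so that both parts keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 990 negatives and 10 positives (1% positives).
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          random_state=42,
                                          shuffle=True,
                                          stratify=y_demo)  # keep the class ratio in both parts

print("Positives in train:", y_tr.sum(), "of", len(y_tr))
print("Positives in test: ", y_te.sum(), "of", len(y_te))
```

With very few positive samples, stratification keeps the test-set metrics from depending on how many frauds happened to land in the test part.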


## In class imbalance problems, two common approaches are:
1) **Undersampling**: Remove samples from the majority class.  
2) **Oversampling**: Randomly duplicate samples from the minority class.

We use RandomOverSampler from the imblearn package. Other approaches exist, such as nearest-neighbour-based sampling (for example, SMOTE).
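
To make the two approaches concrete, here is a minimal NumPy sketch of random oversampling and random undersampling on toy data. It is a hand-written illustration of the idea, not imblearn's implementation, and every variable name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 95 + [1] * 5)          # toy labels: 95 majority, 5 minority
X = rng.normal(size=(100, 3))             # toy features

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Oversampling: draw minority rows with replacement until both classes match.
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx),
                   replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Undersampling: keep only as many majority rows as there are minority rows.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print("Oversampled class counts: ", np.bincount(y_over))
print("Undersampled class counts:", np.bincount(y_under))
```

Oversampling keeps every original row but repeats minority rows many times, while undersampling discards most of the majority class. That trade-off is one reason to resample only the training split.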

## Check class imbalance and apply oversampling if count difference is more than 10%


In [9]:
class_count = Y_train.value_counts()
print(class_count)
print()

# Measure the imbalance on the training labels only (class_count comes from Y_train).
if np.abs(class_count[0] - class_count[1]) / class_count.sum() > 0.1:
    print("Class imbalance problem. Using Oversampler.")
    ros = RandomOverSampler(random_state=42)
    X_train, Y_train = ros.fit_resample(X_train, Y_train)

class_count = Y_train.value_counts()
print(class_count)
print()

print("Data loaded and transformed.")


0    227451
1       394
Name: Class, dtype: int64

Class imbalance problem. Using Oversampler.
1    227451
0    227451
Name: Class, dtype: int64

Data loaded and transformed.


## Create some empty lists to store the measured evaluation metrics

In [10]:
# Create some empty lists
recall_list = []
precision_list = []
f1_list = []
accuracy_list = []
auc_list = []
model_list = []

In [11]:
# Print the metrics for one classification model and append them to the result lists
# (exported to a file later). Relies on the global `model_name` set before each call.
def print_parameters(y_real, y_pred):
    cm = confusion_matrix(y_real, y_pred)
    acc = 100 * accuracy_score(y_real, y_pred)
    precision = precision_score(y_real, y_pred)
    recall = recall_score(y_real, y_pred)
    f1 = f1_score(y_real, y_pred)
    # ROC AUC is computed from hard 0/1 predictions (not probabilities),
    # so it equals the balanced accuracy.
    auc_roc = roc_auc_score(y_real, y_pred)

    accuracy_list.append(acc)
    precision_list.append(precision)
    recall_list.append(recall)
    f1_list.append(f1)
    auc_list.append(auc_roc)
    model_list.append(model_name)

    print(cm)
    print("Accuracy: %0.2f" % acc)
    print("Precision score = %0.2f" % precision)
    print("Recall score = %0.2f" % recall)
    print("F1 score = %0.2f" % f1)
    print("Area Under curve %0.2f" % auc_roc)
    print("\n")
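
Because `print_parameters` passes hard predictions to `roc_auc_score`, the reported "Area Under curve" is the mean of recall and specificity. The toy example below (hypothetical labels and scores) sketches the difference between that value and the ROC AUC obtained from continuous scores.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy ground truth and hypothetical model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.40, 0.55, 0.80, 0.90])
y_pred = (y_score >= 0.5).astype(int)     # hard labels at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, y_pred)    # what the notebook function computes
auc_from_scores = roc_auc_score(y_true, y_score)   # threshold-independent ROC AUC
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("AUC from hard labels: %0.4f" % auc_from_labels)
print("Balanced accuracy:    %0.4f" % bal_acc)
print("AUC from scores:      %0.4f" % auc_from_scores)
```

To report a threshold-independent AUC for the models in this notebook, one option would be to pass `predict_proba(X_test)[:, 1]` instead of `predict(X_test)`. That would change the numbers reported here.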

# Apply some Classifier models. 

## Logistic Regression Classifier

In [12]:
# =======================================
# Logistic regression
start_time = timer()
model_name = 'LogisticRegression'
clf_log = LogisticRegression(C=0.01,
                             random_state=1,
                             max_iter=2000,
                             n_jobs=-1)
clf_log.fit(X_train, Y_train)
Y_pred_log = clf_log.predict(X_test)
# plot_confusion_matrix(clf_log, X_test, Y_test)
print("==============================")
print('Time for Logistic Classifier: %f s' % (timer() - start_time))
print_parameters(Y_test, Y_pred_log)
print(classification_report(Y_test, Y_pred_log))

Time for Logistic Classifier: 16.414969 s
[[55543  1321]
 [    8    90]]
Accuracy: 97.67
Precision score = 0.06
Recall score = 0.92
F1 score = 0.12
Area Under curve 0.95


              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.06      0.92      0.12        98

    accuracy                           0.98     56962
   macro avg       0.53      0.95      0.55     56962
weighted avg       1.00      0.98      0.99     56962



**Precision is very poor (0.06): of the 1,411 transactions flagged as fraud, only 90 are actual frauds.**
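
One common way to trade some recall for better precision is to raise the decision threshold on the predicted probability instead of using the default 0.5. The sketch below illustrates this on synthetic data. The dataset, the `class_weight='balanced'` setting (used here in place of oversampling), and the candidate thresholds are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the credit card dataset.
X_demo, y_demo = make_classification(n_samples=5000, n_features=10, n_informative=4,
                                     n_redundant=0, n_clusters_per_class=1,
                                     class_sep=1.5, weights=[0.95, 0.05],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

results = []
for threshold in (0.5, 0.7, 0.9):
    pred = (proba >= threshold).astype(int)   # flag as positive only above the threshold
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    results.append((threshold, p, r))
    print("threshold=%0.1f  precision=%0.2f  recall=%0.2f" % (threshold, p, r))
```

In practice the threshold would be chosen on a separate validation split rather than on the test set, so that the final test metrics stay unbiased.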

## Support Vector Classifier

In [13]:
# # =======================================
# # Using sklearn classifiers
# # Support Vector Classifier
# start_time = timer()
# model_name = 'SupportVectorClassifier'
# ker = 'rbf'
# clf_svc = SVC(gamma='auto', kernel=ker)  # Kernels available: rbf = Gaus, linear, poly,
# clf_svc.fit(X_train, Y_train)
# Y_pred_svm = clf_svc.predict(X_test)
# # plot_confusion_matrix(clf_svc, X_test, y_test)
# end_time = timer()
# time_svm = end_time - start_time
# print("Time for Support Vector Machine: %f s" % time_svm)
# print_parameters(Y_test, Y_pred_svm)

**SVC did not finish: with about 455,000 oversampled training rows, solving the kernel SVM optimization problem takes a very long time.**
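
If a support vector model is still wanted, one possible workaround is a linear SVM, which scales much better with the number of samples than a kernel SVC. This is a different model from the RBF-kernel SVC in the commented-out cell, so results are not interchangeable. The sketch uses synthetic data and hypothetical parameter choices.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 30 features, like the notebook's X; not the real dataset.
X_demo, y_demo = make_classification(n_samples=20000, n_features=30,
                                     weights=[0.9, 0.1], random_state=0)

# dual=False is the usual choice when there are many more samples than features.
linear_svc = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, dual=False, class_weight='balanced', random_state=0),
)
linear_svc.fit(X_demo, y_demo)
train_pred = linear_svc.predict(X_demo)
print("Predicted positives on the training data:", int(train_pred.sum()))
```

Another option, not shown here, is to fit the kernel SVC on an undersampled training set, which keeps the RBF kernel at the cost of discarding most majority-class rows.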

## Random Forest Grid Search

In [14]:
# =======================================
# Random forest classifier
# start_time = timer()
# model_name = 'RandomForestGridSearch'
# n_estimators = [100]  # [20, 100, 500, 1200]
# max_depth = [30]
# min_samples_split = [2]  # [2, 15, 50]
# min_samples_leaf = [1]  # [1, 2, 5]
#
# hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
#               min_samples_split=min_samples_split,
#               min_samples_leaf=min_samples_leaf)
#
# forest = RandomForestClassifier()
# gridF = GridSearchCV(forest, hyperF, cv=3, scoring='precision_macro', verbose=1, n_jobs=-1)
# gridF.fit(X_train, Y_train)
#
# means = gridF.cv_results_['mean_test_score']
# stds = gridF.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, gridF.cv_results_['params']):
#     print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
#
# print()
#
# print("Detailed classification report:")
# print()
# print("The model is trained on the full development set.")
# print("The scores are computed on the full evaluation set.")
# print()
# print(classification_report(Y_test, gridF.predict(X_test)))
# print()
# print(gridF.best_params_)
# print()
#
# max_depth = gridF.best_params_['max_depth']
# min_samples_leaf = gridF.best_params_['min_samples_leaf']
# min_samples_split = gridF.best_params_['min_samples_split']
# n_estimators = gridF.best_params_['n_estimators']
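
The grid search above is disabled. For reference, here is a small runnable sketch of the same idea on synthetic data. It scores with `average_precision` rather than the `precision_macro` used in the commented-out cell, which is an assumption made to match the AUPRC recommendation. The grid values and dataset are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic imbalanced dataset so the search finishes quickly.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)

param_grid = {"n_estimators": [20, 50], "max_depth": [5, 10]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(RandomForestClassifier(class_weight="balanced", random_state=0),
                      param_grid,
                      cv=cv,
                      scoring="average_precision")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV average precision: %0.3f" % search.best_score_)
```

One caveat when cross-validating on randomly oversampled data: duplicated minority rows can land in both the training and validation folds, which tends to make the cross-validated scores look better than they are.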


## Random Forest Classifier

In [15]:
# =======================================
# Random forest classifier
start_time = timer()
model_name = 'RandomForest'

max_depth = 30
n_estimators = 100
min_samples_leaf = 1
min_samples_split = 2

clf_rfc = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth,
                                 min_samples_split=min_samples_split,
                                 min_samples_leaf=min_samples_leaf,
                                 random_state=1,
                                 n_jobs=-1)
clf_rfc.fit(X_train, Y_train)
Y_pred_rfc = clf_rfc.predict(X_test)
# plot_confusion_matrix(clf_rfc, X_test, Y_test)
print("==============================")
print('Time for Random Forest Classifier: %f s' % (timer() - start_time))
print(classification_report(Y_test, Y_pred_rfc))
print_parameters(Y_test, Y_pred_rfc)


Time for Random Forest Classifier: 30.660415 s
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.99      0.80      0.88        98

    accuracy                           1.00     56962
   macro avg       0.99      0.90      0.94     56962
weighted avg       1.00      1.00      1.00     56962

[[56863     1]
 [   20    78]]
Accuracy: 99.96
Precision score = 0.99
Recall score = 0.80
F1 score = 0.88
Area Under curve 0.90




**Precision is excellent (0.99), Recall is moderate (0.80), and the F1 score is good (0.88)**

## Decision Tree Classifier

In [16]:
# =======================================
# Decision Tree Classifier
model_name = 'DecisionTree'
start_time = timer()
clf_dtc = DecisionTreeClassifier(max_depth=30, random_state=0)
clf_dtc.fit(X_train, Y_train)
Y_pred_dtc = clf_dtc.predict(X_test)
# plot_confusion_matrix(clf_dtc, X_test, Y_test)
print('Time for Decision Tree Classifier: %f s' % (timer() - start_time))
print_parameters(Y_test, Y_pred_dtc)

Time for Decision Tree Classifier: 11.391353 s
[[56791    73]
 [   25    73]]
Accuracy: 99.83
Precision score = 0.50
Recall score = 0.74
F1 score = 0.60
Area Under curve 0.87




**Recall is acceptable (0.74), but Precision is poor (0.50)**

## Extreme Gradient Boosting (XGBoost) Classifier


In [17]:
# =======================================
# XGBoost
start_time = timer()
model_name = 'XGBoost'
clf_xgb = XGBClassifier(max_depth=100,
                        learning_rate=0.2,
                        random_state=1,
                        n_jobs=-1)
clf_xgb.fit(X_train, Y_train)
Y_pred_xgb = clf_xgb.predict(X_test)
# plot_confusion_matrix(clf_xgb, X_test, Y_test)
print("==============================")
print('Time for XGBoost Classifier: %f s' % (timer() - start_time))
print_parameters(Y_test, Y_pred_xgb)
print(classification_report(Y_test, Y_pred_xgb))

Time for XGBoost Classifier: 48.670715 s
[[56858     6]
 [   18    80]]
Accuracy: 99.96
Precision score = 0.93
Recall score = 0.82
F1 score = 0.87
Area Under curve 0.91


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.93      0.82      0.87        98

    accuracy                           1.00     56962
   macro avg       0.96      0.91      0.93     56962
weighted avg       1.00      1.00      1.00     56962



**Precision, Recall and F1 are all decent**

Next, we build a stacking classifier that combines several of the models above, to see whether the combination improves on the individual classifiers.

## Stacking Classifier Model
1) Random Forest  
2) Logistic Regression   
3) XGBoost
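
As a rough sketch of what stacking does, the example below builds the meta-features by hand: each base model produces out-of-fold probabilities, and a final logistic regression is trained on those columns. This is a simplified illustration on synthetic data with two small hypothetical base models, not the internals of scikit-learn's `StackingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold positive-class probabilities: every row is predicted by a model
# that did not see that row during fitting.
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The final estimator learns how to weight the base models' predictions.
final_estimator = LogisticRegression().fit(meta_features, y_demo)
print("Meta-feature matrix shape:", meta_features.shape)
print("Final estimator weights:", final_estimator.coef_.ravel())
```

`StackingClassifier` uses 5-fold cross-validation by default for this step, so every base model is fitted several times. That is a large part of why stacking is much slower than any single model.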

In [18]:
# =======================================
# Stacking Classifier
start_time = timer()
model_name = 'StackingClassifier'
estimators = [
    ('RFC', RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   min_samples_split=min_samples_split,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=1,
                                   n_jobs=-1)),
    ('Log', LogisticRegression(C=0.1,
                               random_state=1,
                               max_iter=2000,
                               n_jobs=-1)),
    # ('SVC', SVC(kernel='rbf', gamma='auto', C=2)),
    ('XGB', XGBClassifier(max_depth=100,
                          learning_rate=0.2,
                          random_state=1,
                          n_jobs=-1))
]

stacking_model = StackingClassifier(estimators=estimators,
                                    final_estimator=LogisticRegression(),
                                    verbose=1
                                    )

stacking_model.fit(X_train, Y_train)
Y_pred_stack = stacking_model.predict(X_test)
# plot_confusion_matrix(stacking_model, X_test, Y_test)
print("==============================")
print('Time for Stacking Classifier: %f s' % (timer() - start_time))
print_parameters(Y_test, Y_pred_stack)
print(classification_report(Y_test, Y_pred_stack))


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.1min finished


Time for Stacking Classifier: 480.309735 s
[[56862     2]
 [   19    79]]
Accuracy: 99.96
Precision score = 0.98
Recall score = 0.81
F1 score = 0.88
Area Under curve 0.90


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.98      0.81      0.88        98

    accuracy                           1.00     56962
   macro avg       0.99      0.90      0.94     56962
weighted avg       1.00      1.00      1.00     56962



**Stacking does not improve quality: its scores are close to those of Random Forest, and XGBoost still gives the better Recall (0.82 vs 0.81).**

In [19]:
# Create a dataframe of results
d = {'ModelName': model_list,
     'Accuracy': accuracy_list,
     'Precision': precision_list,
     'Recall': recall_list,
     'F1': f1_list,
     'AreaUnderCurve': auc_list}
measures = pd.DataFrame(d)
measures.to_csv('results.csv', index=False)
measures


| | ModelName | Accuracy | Precision | Recall | F1 | AreaUnderCurve |
|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 97.666866 | 0.063785 | 0.918367 | 0.119284 | 0.947568 |
| 1 | RandomForest | 99.963133 | 0.987342 | 0.795918 | 0.881356 | 0.89795 |
| 2 | DecisionTree | 99.827955 | 0.5 | 0.744898 | 0.598361 | 0.871807 |
| 3 | XGBoost | 99.957867 | 0.930233 | 0.816327 | 0.869565 | 0.908111 |
| 4 | StackingClassifier | 99.963133 | 0.975309 | 0.806122 | 0.882682 | 0.903044 |


# Conclusions

* Five models were applied to this classification problem.
* Since the data is highly imbalanced and positive (fraud) events are rare, **Accuracy** is not a trustworthy metric. In the simplest case, predicting class 0 for every transaction already gives an Accuracy of about 99.8%.     
* **Recall** and **Precision** are the important metrics in this case. 
* A further derived metric, **Area Under Curve**, is also calculated to compare the classifiers.

* Logistic Regression has the highest Recall (0.92) but very poor Precision (0.06). The single Decision Tree has the lowest Recall (0.74) and mediocre Precision (0.50).  
* Random Forest and XGBoost are both good classifiers. **XGBoost** achieves the best Recall (0.82) among the high-precision models.  
* The stacking model did not improve quality, even though it took far longer to train (about 480 s).
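
The accuracy of a classifier that always predicts class 0 follows directly from the class counts printed earlier in the notebook. A short check, using only those counts:

```python
# Class counts from the notebook output (Y.value_counts()).
n_legit, n_fraud = 284315, 492
n_total = n_legit + n_fraud

# An "always predict 0" classifier gets every legitimate transaction right
# and every fraud wrong.
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud

print("Baseline accuracy: %0.2f%%" % (100 * baseline_accuracy))
print("Baseline recall:   %0.2f" % baseline_recall)
```

Such a classifier is almost always "right" while catching no fraud at all, which is why Recall and Precision carry the real information here.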


# Best metrics achieved for XGBoost at Recall=0.816 and AUC=0.908
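
These figures can be re-derived from the XGBoost confusion matrix printed above. The sketch below recomputes them by hand. Since the notebook computes its "Area Under curve" from hard predictions, that value is reproduced here as the mean of recall and specificity.

```python
import numpy as np

# XGBoost confusion matrix from the notebook output (rows: actual, columns: predicted).
cm = np.array([[56858, 6],
               [18, 80]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
auc_from_labels = (recall + specificity) / 2   # AUC of hard 0/1 predictions

print("Precision: %0.3f" % precision)
print("Recall:    %0.3f" % recall)
print("F1:        %0.3f" % f1)
print("AUC:       %0.3f" % auc_from_labels)
```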