# Credit and Debit Card Fraud Detection

##  Modeling

## Table Of Contents: <a class="anchor" id="mtop"></a>

- [Introduction](#intro)
- [Fraud Classification with Under Sampling](#Unclassification)
    * [Under Sampling Data balancing](#UndataBalancing)
    * [Feature Selection](#UnfeatureSelection)
    * [Decision Trees models](#UndecisionTrees)
    * [Random Forests models](#UnrandomForests)
    * [AdaBoost model](#UnAdaBoost)
    * [Support Vector Machines models](#UnSVM)
    * [Artificial Neural Networks](#UnANN)
    * [Results](#UnResults)
- [Fraud Classification with Upper Sampling](#Upclassification)
    * [Upper-sampling data balancing](#UpdataBalancing)
    * [Feature Selection](#UpfeatureSelection)
    * [Decision Trees models](#UpdecisionTrees)
    * [Random Forests models](#UprandomForests)
    * [AdaBoost model](#UpAdaBoost)
    * [Support Vector Machines models](#UpSVM)
    * [Artificial Neural Networks](#UpANN)
    * [Results](#UpResults)
- [Conclusions](#mconclusions)

## Introduction: <a class="anchor" id="intro"></a>

[TOC](#mtop)

This notebook is the third and final part of the project and shows the machine learning models in action. The dataset was prepared and saved in the two prior notebooks (exploratory data analysis and feature engineering). Here, five models (Decision Trees, Random Forests, AdaBoost, Support Vector Machines, and Artificial Neural Networks) are applied to the problem of credit and debit card fraud detection. The performance metrics (accuracy, precision, recall, and F1-score) are collected for every model and tabulated for comparison.

In [3]:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt  # needed later for the confusion-matrix plot
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

The following cell loads the prepared data, which went through exploratory data analysis and feature engineering. These two prior stages transformed, cleaned, factorized, scaled, and selected the data so that, at this point, it can be used by the machine learning models.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [6]:
#Importing the dataset used to train and test the machine learning models
#fraud = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/preparedFraud2.csv')
# Raw string so the backslashes in the Windows path are not read as escape sequences
fraud = pd.read_csv(r'..\CreditCardFraudDetectionData\preparedFraud2.csv')

In [7]:
X=fraud[['TransactionAmt','card1','addr1','addr2',
        'C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13','C14',
        'V95', 'V96', 'V97','V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111',
        'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V126' , 'V127',
        'V128', 'V129', 'V130', 'V131', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137']]

y=fraud['isFraud']

## Fraud Classification with Under Sampling <a class="anchor" id="Unclassification"></a>

[TOC](#mtop)

### Under-sampling data balancing  <a class="anchor" id="UndataBalancing"></a>

[TOC](#mtop)

Imbalanced datasets are characteristic of credit card fraud detection: there are far more legitimate transactions than fraudulent ones. This dataset is no different, so a technique is needed to balance the classes. First, I use under-sampling. After this process, the remaining data has over 10 thousand transactions in each class, legitimate and fraudulent. The plot of the balanced data is presented in the attached feature engineering notebook.
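To make the idea concrete, here is a minimal NumPy sketch of what random under-sampling does. It is only an illustration: the helper `random_under_sample` and the toy arrays are hypothetical names, and the sketch draws rows without replacement, whereas the notebook cell uses `RandomUnderSampler` with `replacement=True`.

```python
import numpy as np

def random_under_sample(X, y, seed=42):
    """Keep, for every class, as many random rows as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data (hypothetical): 95 legitimate (0) and 5 fraudulent (1) rows.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_under_sample(X_toy, y_toy)
print(np.bincount(y_bal))  # [5 5]
```

The price of this balance is that most legitimate rows are thrown away, which is why the over-sampling alternative is tried later in the notebook.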

In [None]:
# Undersample the majority class to balance the data (random sampling with replacement).
rus = RandomUnderSampler(random_state=42, replacement=True)
# Resample the predictors and the target variable.
x_rus, y_rus = rus.fit_resample(X, y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_rus, y_rus, test_size=0.2, random_state=42)

This section evaluates five models: Decision Trees, Random Forests, AdaBoost, Support Vector Machines, and Artificial Neural Networks. Feature selection is the same for all models: recursive feature elimination with cross-validation, based on a Random Forests classifier. Next, a pipeline is designed with these steps: feature selection, feature scaling, hyperparameter optimization, and training. To evaluate each model, a performance report is presented with accuracy, precision, recall, and F1-score.

### Feature Selection  <a class="anchor" id="UnfeatureSelection"></a>

[TOC](#mtop)

Feature selection is the first step of the Pipeline for every model, so it has to be performed here in the modeling notebook, following the procedure presented in the feature engineering notebook.

In [None]:
min_features_to_select = 1  # Minimum number of features to consider
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)

cv = StratifiedKFold(5)

rfecv = RFECV(
    estimator=clf,
    step=1,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=min_features_to_select,
    n_jobs=2,
)
rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {X_train.columns[rfecv.support_]}")

Optimal number of features: 13
Selected features: Index(['TransactionAmt', 'card1', 'addr1', 'C1', 'C2', 'C5', 'C9', 'C13',
       'C14', 'V126', 'V127', 'V128', 'V133'],
      dtype='object')


### Decision trees models <a class="anchor" id="UndecisionTrees"></a>

[TOC](#mtop)

The first model used was the decision tree (DT), a non-parametric supervised learning method for classification and regression. The goal is to create a model that predicts the class of a transaction (legitimate or fraudulent) by learning simple decision rules inferred from the data features.
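The "simple decision rules" can be printed directly. The sketch below is an illustration on synthetic data, not on the fraud dataset: `X_demo`, `y_demo`, and the feature names `f0` to `f3` are hypothetical, and the tree is kept at depth 2 so the rules stay readable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared fraud data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text renders the learned if/else thresholds as indented text.
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Each printed branch is a threshold test on one feature, and each leaf ends in a predicted class.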

In [None]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = DecisionTreeClassifier()


CV_dt = GridSearchCV(clf,
                      param_grid={
                          'criterion': ['gini', 'entropy'],
                          'max_depth': [2,4,6],
                          'min_samples_leaf': [1, 2]},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_dt)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.78      0.75      2056
           1       0.77      0.70      0.73      2106

    accuracy                           0.74      4162
   macro avg       0.74      0.74      0.74      4162
weighted avg       0.74      0.74      0.74      4162



The Decision Trees performance was:
- Accuracy equals 74%.
- Precision equals 72% for the legitimate class and 77% for the fraudulent class.
- Recall equals 78% for the legitimate class and 70% for the fraudulent class.
- F1-score equals 75% for the legitimate class and 73% for the fraudulent class.
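The classification report does not show what the pipeline chose internally. A possible way to inspect that, sketched here on synthetic data, is to read the fitted steps back through `named_steps`. The step names mirror the ones used in this notebook; `demo_pipeline`, `X_demo`, and `y_demo` are hypothetical, and the grid is reduced to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared fraud data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

demo_pipeline = Pipeline([
    ("feature_sele", RFECV(RandomForestClassifier(n_estimators=10, random_state=42),
                           step=1, cv=3, scoring="roc_auc")),
    ("scale", StandardScaler()),
    ("clf_cv", GridSearchCV(DecisionTreeClassifier(random_state=42),
                            param_grid={"max_depth": [2, 4]},
                            cv=3, scoring="roc_auc")),
])
demo_pipeline.fit(X_demo, y_demo)

# Read the fitted steps back from the pipeline.
selector = demo_pipeline.named_steps["feature_sele"]
search = demo_pipeline.named_steps["clf_cv"]
print("features kept:", selector.n_features_)
print("best params:", search.best_params_)
print("best CV ROC AUC:", round(search.best_score_, 3))
```

The same three attributes should be available on the `pipeline` object fitted above, since it uses the same step names.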

### Random forests models<a class="anchor" id="UnrandomForests"></a>

[TOC](#mtop)

A random forest is a meta-estimator that fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Trees in the forest use the best-split strategy. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise, the whole dataset is used to build each tree.
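The averaging step can be checked by hand. The following sketch, on synthetic data with hypothetical names, fits a small forest with `max_samples=0.5` and compares the forest's `predict_proba` with the mean of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared fraud data (hypothetical).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

# Each tree is built on a bootstrap sample of half the rows.
forest = RandomForestClassifier(
    n_estimators=10, max_samples=0.5, bootstrap=True, random_state=42
).fit(X_demo, y_demo)

# Average the per-tree class probabilities for the first five rows.
per_tree = [t.predict_proba(X_demo[:5]) for t in forest.estimators_]
manual_average = np.mean(per_tree, axis=0)

same = np.allclose(manual_average, forest.predict_proba(X_demo[:5]))
print(same)
```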

In [None]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth':[2,3]},
                      cv= 5, scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_rfc)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.79      0.74      2056
           1       0.76      0.65      0.70      2106

    accuracy                           0.72      4162
   macro avg       0.73      0.72      0.72      4162
weighted avg       0.73      0.72      0.72      4162



The Random Forests performance was:
- Accuracy equals 72%.
- Precision equals 69% for the legitimate class and 76% for the fraudulent class.
- Recall equals 79% for the legitimate class and 65% for the fraudulent class.
- F1-score equals 74% for the legitimate class and 70% for the fraudulent class.

### AdaBoost model <a class="anchor" id="UnAdaBoost"></a>

[TOC](#mtop)

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.
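The reweighting can be written out in a few lines. This is a sketch of one round of the textbook discrete AdaBoost update for two classes, not a copy of scikit-learn's internals; the function name `adaboost_reweight` and the toy weights are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import numpy as np

def adaboost_reweight(weights, misclassified, learning_rate=1.0):
    """One round of the discrete (binary) AdaBoost sample-weight update."""
    weights = np.asarray(weights, dtype=float)
    miss = np.asarray(misclassified, dtype=bool)
    err = weights[miss].sum() / weights.sum()          # weighted error of the weak learner
    alpha = learning_rate * np.log((1.0 - err) / err)  # vote strength of this learner
    new_weights = weights * np.exp(alpha * miss)       # boost only the misclassified rows
    return new_weights / new_weights.sum(), alpha

# Toy round (hypothetical): 10 equally weighted samples, 2 of them misclassified.
w0 = np.full(10, 0.1)
miss = np.zeros(10, dtype=bool)
miss[[2, 7]] = True

w1, alpha = adaboost_reweight(w0, miss)
print(w1.round(4), round(alpha, 4))
```

With a learning rate of 1, the misclassified samples carry half of the total weight after the update, which is what pushes the next weak learner toward the cases the previous one got wrong.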

In [None]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
DTC = DecisionTreeClassifier(class_weight = "balanced")
clf = AdaBoostClassifier()

CV_ada = GridSearchCV(clf,
                      param_grid = {
                          'estimator':[DTC],
                          'learning_rate': [0.1, 0.5],
                          'n_estimators': [10, 20]},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_ada)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.79      0.78      2056
           1       0.79      0.77      0.78      2106

    accuracy                           0.78      4162
   macro avg       0.78      0.78      0.78      4162
weighted avg       0.78      0.78      0.78      4162



The AdaBoost performance was:
- Accuracy equals 78%.
- Precision equals 77% for the legitimate class and 79% for the fraudulent class.
- Recall equals 79% for the legitimate class and 77% for the fraudulent class.
- F1-score equals 78% for the legitimate class and 78% for the fraudulent class.

### Support Vector Machines model <a class="anchor" id="UnSVM"></a>

[TOC](#mtop)

Support vector machines (SVMs) are supervised learning methods used for classification, regression, and outlier detection.

The advantages of support vector machines are:

- Effective in high-dimensional spaces.
- It is still effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), making it memory efficient.
- Versatile: Different kernel functions can be specified for the decision function. Standard kernels are provided, but it is also possible to define custom kernels.

The disadvantages of support vector machines include:

- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
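Because of the last point, a ranking metric such as ROC AUC can be computed from the SVM's decision scores instead of probabilities. The sketch below illustrates this on synthetic data; all variable names are hypothetical and the hyperparameters are not the ones tuned in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared fraud data (hypothetical).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# SVMs are sensitive to feature scale, so scaling is part of the model.
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)

# decision_function returns signed distances to the separating surface;
# no probability calibration (probability=True) is needed for ROC AUC.
scores = svc.decision_function(X_te)
auc = roc_auc_score(y_te, scores)
print(f"ROC AUC from decision scores: {auc:.3f}")
```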

In [None]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = svm.SVC()

CV_svm = GridSearchCV(clf,
                      param_grid={'C':[0.1, 1.0, 10],
                                  'kernel':['linear','rbf']},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_svm)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.81      0.74      2056
           1       0.77      0.64      0.70      2106

    accuracy                           0.72      4162
   macro avg       0.73      0.73      0.72      4162
weighted avg       0.73      0.72      0.72      4162



The Support Vector Machines performance was:
- Accuracy equals 72%.
- Precision equals 69% for the legitimate class and 77% for the fraudulent class.
- Recall equals 81% for the legitimate class and 64% for the fraudulent class.
- F1-score equals 74% for the legitimate class and 70% for the fraudulent class.

### Artificial Neural Networks <a class="anchor" id="UnANN"></a>

[TOC](#mtop)

For the problem of fraud detection, a feedforward neural network was designed using the PyTorch Python library. Artificial neural networks are a family of machine learning models built on principles of neuronal organization that connectionism identified in the biological neural networks of animal brains.
An ANN comprises connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. These are connected by edges, which model the synapses in a biological brain. An artificial neuron receives signals from connected neurons, processes them, and sends them to other connected neurons. The "signal" is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs, called the activation function. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection.
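The computation described above can be sketched in plain NumPy before moving to PyTorch. The layer sizes (13 inputs, then 12, 8, and 1 units) follow the network defined in the next cell; the random weights and the input batch are hypothetical and only show the shape of the calculation, not a trained model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in = 13  # number of selected features in the under-sampling case

# Hypothetical, untrained weights: one matrix and one bias vector per layer.
W1, b1 = 0.1 * rng.normal(size=(n_in, 12)), np.zeros(12)
W2, b2 = 0.1 * rng.normal(size=(12, 8)), np.zeros(8)
W3, b3 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)

x = rng.normal(size=(4, n_in))   # a batch of 4 scaled transactions
h1 = relu(x @ W1 + b1)           # weighted sum of inputs, then activation
h2 = relu(h1 @ W2 + b2)
p = sigmoid(h2 @ W3 + b3)        # one fraud score in [0, 1] per transaction
print(p.shape)  # (4, 1)
```

Training consists of adjusting `W1`, `W2`, `W3` and the biases so that these outputs move toward the true labels.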

In [None]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(rfecv.n_features_, 12)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(12, 8)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(8, 1)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x

model = Net()
print(model)

Net(
  (hidden1): Linear(in_features=13, out_features=12, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=12, out_features=8, bias=True)
  (act2): ReLU()
  (output): Linear(in_features=8, out_features=1, bias=True)
  (act_output): Sigmoid()
)


In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cpu


In [None]:
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
scaler = StandardScaler().fit(X_train[X_train.columns[rfecv.support_]])
X_tr_scaled = scaler.transform(X_train[X_train.columns[rfecv.support_]])
X_te_scaled = scaler.transform(X_test[X_test.columns[rfecv.support_]])

In [None]:
X_train_nn = torch.tensor(X_tr_scaled, dtype=torch.float32).to(device)
y_train_nn = torch.tensor(y_train.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)

In [None]:
n_epochs = 25 #50
batch_size = 10

for epoch in range(n_epochs):
    for i in range(0, len(X_train_nn), batch_size):
        Xbatch = X_train_nn[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y_train_nn[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.5931800603866577
Finished epoch 1, latest loss 0.5147647857666016
Finished epoch 2, latest loss 0.4805735647678375
Finished epoch 3, latest loss 0.4662722647190094
Finished epoch 4, latest loss 0.45030370354652405
Finished epoch 5, latest loss 0.4413246214389801
Finished epoch 6, latest loss 0.436825156211853
Finished epoch 7, latest loss 0.4327261447906494
Finished epoch 8, latest loss 0.42343202233314514
Finished epoch 9, latest loss 0.4231400191783905
Finished epoch 10, latest loss 0.4182569682598114
Finished epoch 11, latest loss 0.40914949774742126
Finished epoch 12, latest loss 0.4110928475856781
Finished epoch 13, latest loss 0.41539740562438965
Finished epoch 14, latest loss 0.4185698926448822
Finished epoch 15, latest loss 0.42534133791923523
Finished epoch 16, latest loss 0.41759946942329407
Finished epoch 17, latest loss 0.4302314519882202
Finished epoch 18, latest loss 0.4158015549182892
Finished epoch 19, latest loss 0.42040184140205383
[output truncated]

In [None]:
X_test_nn = torch.tensor(X_te_scaled, dtype=torch.float32).to(device)
y_test_nn = torch.tensor(y_test.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)

In [None]:
# Gradients are not needed for inference
with torch.no_grad():
    y_pred = model(X_test_nn)
accuracy = (y_pred.round() == y_test_nn).float().mean()
print(f"Accuracy {accuracy}")

Accuracy 0.7426717877388


In [None]:
y_pred=pd.DataFrame(y_pred.round().cpu().detach().numpy())
print(f"Precision {precision_score(y_test, y_pred, pos_label=1)}")
print(f"Recall {recall_score(y_test, y_pred, pos_label=1)}")
print(f"F1-score {f1_score(y_test, y_pred, pos_label=1)}")

Precision 0.7841845140032949
Recall 0.6780626780626781
F1-score 0.7272727272727273


In [None]:
print(f"Precision {precision_score(y_test, y_pred, pos_label=0)}")
print(f"Recall {recall_score(y_test, y_pred, pos_label=0)}")
print(f"F1-score {f1_score(y_test, y_pred, pos_label=0)}")

Precision 0.7103801794105084
Recall 0.808852140077821
F1-score 0.7564248351148511


### Results <a class="anchor" id="UnResults"></a>

[TOC](#mtop)

The performance metrics are tabulated below for comparison. With under-sampling, the results are similar for all models. Accuracy ranges from 72% to 78%, with AdaBoost slightly ahead at 78%. The other metrics show a similar balance across models. Decision trees and the ANN come second, with very similar metric values.

<table>
    <thead>
        <tr>
            <th>Models</th>
            <th>Accuracy</th>
            <th colspan=2 align=center>Precision</th>
            <th colspan=2 align=center>Recall</th>
            <th colspan=2 align=center>F1-score</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td></td>       
            <td></td>
            <td>Legitimate</td>
            <td>Fraud</td>
            <td>Legitimate</td>
            <td>Fraud</td>
            <td>Legitimate</td>
            <td>Fraud</td>
        </tr>
        <tr>
            <td align=right>Decision trees</td>
            <td align=center>74%</td>            
            <td align=center>72%</td>
            <td align=center>77%</td>
            <td align=center>78%</td>
            <td align=center>70%</td>
            <td align=center>75%</td>
            <td align=center>73%</td>
        </tr>
        <tr>
            <td align=right>Random forests</td>
            <td align=center>72%</td>
            <td align=center>69%</td>
            <td align=center>76%</td>
            <td align=center>79%</td>
            <td align=center>65%</td>
            <td align=center>74%</td>
            <td align=center>70%</td>
        </tr>
        <tr>
            <td align=right><b>Adaboost</b></td>
            <td align=center><b>78%</b></td>
            <td align=center><b>77%</b></td>
            <td align=center><b>79%</b></td>
            <td align=center><b>79%</b></td>
            <td align=center><b>77%</b></td>
            <td align=center><b>78%</b></td>
            <td align=center><b>78%</b></td>
        </tr>
        <tr>
            <td align=right>Support vector machines</td>
            <td align=center>72%</td>
            <td align=center>69%</td>
            <td align=center>77%</td>
            <td align=center>81%</td>
            <td align=center>64%</td>
            <td align=center>74%</td>
            <td align=center>70%</td>
        </tr>
        <tr>
            <td align=right>Artificial neural networks</td>
            <td align=center>74%</td>
            <td align=center>71%</td>
            <td align=center>78%</td>
            <td align=center>81%</td>
            <td align=center>68%</td>
            <td align=center>76%</td>
            <td align=center>73%</td>
        </tr>
    </tbody>
</table>
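As a reminder of how the tabulated metrics relate to each other, the sketch below derives them from a confusion matrix and checks them against scikit-learn. The labels are a hypothetical toy example, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (hypothetical): 0 = legitimate, 1 = fraud.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Swapping the roles of the two classes (`pos_label=0` in scikit-learn) gives the "Legitimate" columns of the table.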

## Fraud Classification with Upper Sampling  <a class="anchor" id="Upclassification"></a>

[TOC](#mtop)

### Upper-sampling data balancing  <a class="anchor" id="UpdataBalancing"></a>

[TOC](#mtop)

In [8]:
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2,  random_state=42)
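SMOTE balances the classes by creating synthetic minority rows rather than discarding majority rows. A minimal NumPy sketch of the core interpolation step is shown below; the helper `smote_like_sample` and the toy points are hypothetical, and the real `SMOTE` class adds neighbor search and sampling-strategy logic that is left out here.

```python
import numpy as np

def smote_like_sample(X_minority, k=3, seed=42):
    """Create one synthetic point between a minority row and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    dist = np.linalg.norm(X_minority - x, axis=1)
    neighbours = np.argsort(dist)[1:k + 1]   # index 0 is the point itself
    x_nn = X_minority[rng.choice(neighbours)]
    gap = rng.random()                       # position along the segment, in [0, 1)
    return x + gap * (x_nn - x)

# Toy minority class (hypothetical): six fraudulent rows with two features.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                  [1.2, 2.5], [1.8, 2.1], [1.4, 2.3]])

new_point = smote_like_sample(X_min)
print(new_point)
```

Because every synthetic row lies on a segment between two real minority rows, resampling before the train/test split, as done above, can place points in the test set that were interpolated from training rows. This should be kept in mind when reading the upper-sampling scores.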

### Feature Selection  <a class="anchor" id="UpfeatureSelection"></a>

[TOC](#mtop)

Feature selection is the first step of the Pipeline for every model, so it is repeated here in the modeling notebook on the upper-sampled data, following the procedure presented in the feature engineering notebook.

In [10]:
min_features_to_select = 1  # Minimum number of features to consider
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)

cv = StratifiedKFold(5)

rfecv = RFECV(
    estimator=clf,
    step=1,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=min_features_to_select,
    n_jobs=2,
)
rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {X_train.columns[rfecv.support_]}")

Optimal number of features: 16
Selected features: Index(['TransactionAmt', 'card1', 'addr1', 'C1', 'C2', 'C5', 'C6', 'C9', 'C10',
       'C11', 'C13', 'C14', 'V95', 'V97', 'V102', 'V133'],
      dtype='object')

### Decision trees models <a class="anchor" id="UpdecisionTrees"></a>

[TOC](#mtop)

The first model used in the upper-sampling case was again the decision tree (DT), a non-parametric supervised learning method for classification and regression. The goal is to create a model that predicts the class of a transaction (legitimate or fraudulent) by learning simple decision rules inferred from the data features.

In [11]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = DecisionTreeClassifier()


CV_dt = GridSearchCV(clf,
                      param_grid={
                          'criterion': ['gini', 'entropy'],
                          'max_depth': [2,4,6],
                          'min_samples_leaf': [1, 2]},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_dt)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.89      0.86     84160
           1       0.88      0.83      0.85     84195

    accuracy                           0.86    168355
   macro avg       0.86      0.86      0.86    168355
weighted avg       0.86      0.86      0.86    168355



The Decision Trees performance was:
- Accuracy equals 86%.
- Precision equals 84% for the legitimate class and 88% for the fraudulent class.
- Recall equals 89% for the legitimate class and 83% for the fraudulent class.
- F1-score equals 86% for the legitimate class and 85% for the fraudulent class.

### Random forests models<a class="anchor" id="UprandomForests"></a>

[TOC](#mtop)

A random forest was also trained on the upper-sampled data. It is a meta-estimator that fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Trees in the forest use the best-split strategy. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise, the whole dataset is used to build each tree.

In [12]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth':[2,3]},
                      cv= 5, scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_rfc)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.84      0.81     84160
           1       0.83      0.75      0.79     84195

    accuracy                           0.80    168355
   macro avg       0.80      0.80      0.80    168355
weighted avg       0.80      0.80      0.80    168355



The Random Forests performance was:
- Accuracy equals 80%.
- Precision equals 77% for the legitimate class and 83% for the fraudulent class.
- Recall equals 84% for the legitimate class and 75% for the fraudulent class.
- F1-score equals 81% for the legitimate class and 79% for the fraudulent class.

### AdaBoost model <a class="anchor" id="UpAdaBoost"></a>

[TOC](#mtop)

An AdaBoost classifier was also used with the upper-sampling balancing technique. It is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.

In [13]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
DTC = DecisionTreeClassifier(class_weight = "balanced")
clf = AdaBoostClassifier()

CV_ada = GridSearchCV(clf,
                      param_grid = {
                          'estimator':[DTC],
                          'learning_rate': [0.1, 0.5],
                          'n_estimators': [10, 20]},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_ada)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99     84160
           1       0.99      0.98      0.98     84195

    accuracy                           0.99    168355
   macro avg       0.99      0.99      0.99    168355
weighted avg       0.99      0.99      0.99    168355



The AdaBoost performance was:
- Accuracy equals 99%.
- Precision equals 98% for the legitimate class and 99% for the fraudulent class.
- Recall equals 99% for the legitimate class and 98% for the fraudulent class.
- F1-score equals 99% for the legitimate class and 98% for the fraudulent class.

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

### Support Vector Machines model <a class="anchor" id="UpSVM"></a>

[TOC](#mtop)

Support vector machines (SVMs) are supervised learning methods used for classification, regression, and outlier detection. Here, they are applied to the fraud detection problem with the upper-sampling balancing technique.

The advantages of support vector machines are:

- Effective in high-dimensional spaces.
- It is still effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), making it memory efficient.
- Versatile: Different kernel functions can be specified for the decision function. Standard kernels are provided, but it is also possible to define custom kernels.

The disadvantages of support vector machines include:

- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

In [None]:
#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=10,
                             random_state=42)

rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = svm.SVC()

CV_svm = GridSearchCV(clf,
                      param_grid={'C':[0.1, 1.0, 10],
                                  'kernel':['linear','rbf']},
                      cv= 5,
                      scoring = 'roc_auc')


pipeline  = Pipeline([('feature_sele',rfecv),
                      ('scale', StandardScaler()),
                      ('clf_cv',CV_svm)])

pipeline.fit(X_train, y_train)
y_pred=pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

The Support Vector Machines performance was:
- Accuracy equals xx%.
- Precision equals xx% for the legitimate class and xx% for the fraudulent class.
- Recall equals xx% for the legitimate class and xx% for the fraudulent class.
- F1-score equals xx% for the legitimate class and xx% for the fraudulent class.

### Artificial Neural Networks <a class="anchor" id="UpANN"></a>

[TOC](#mtop)

For the fraud detection problem with the upper-sampling balancing technique, a feedforward neural network was designed using the PyTorch Python library. Artificial neural networks are a family of machine learning models built on principles of neuronal organization that connectionism identified in the biological neural networks of animal brains.
An ANN comprises connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. These are connected by edges, which model the synapses in a biological brain. An artificial neuron receives signals from connected neurons, processes them, and sends them to other connected neurons. The "signal" is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs, called the activation function. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection.
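
The computation performed by a single artificial neuron can be sketched in a few lines of plain Python. This is a minimal illustration, not the network trained in this notebook: the function `neuron_output` and the input, weight, and bias values are made up, and a sigmoid is assumed as the activation function.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical signals arriving from three connected neurons.
signal = neuron_output(inputs=[0.5, -1.0, 2.0],
                       weights=[0.4, 0.3, -0.2],
                       bias=0.1)
print(signal)
```

A layer such as `nn.Linear(20, 12)` followed by an activation applies the same idea to many neurons at once: each of the 12 outputs is a weighted sum of the 20 inputs plus a bias, and the weights are what training adjusts.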

In [14]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(rfecv.n_features_, 20)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(20, 12)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x

model = Net()
print(model)

Net(
  (hidden1): Linear(in_features=14, out_features=20, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=20, out_features=12, bias=True)
  (act2): ReLU()
  (output): Linear(in_features=12, out_features=1, bias=True)
  (act_output): Sigmoid()
)


In [15]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

Net(
  (hidden1): Linear(in_features=14, out_features=20, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=20, out_features=12, bias=True)
  (act2): ReLU()
  (output): Linear(in_features=12, out_features=1, bias=True)
  (act_output): Sigmoid()
)

In [16]:
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [17]:
scaler = StandardScaler().fit(X_train[X_train.columns[rfecv.support_]])
X_tr_scaled = scaler.transform(X_train[X_train.columns[rfecv.support_]])
X_te_scaled = scaler.transform(X_test[X_test.columns[rfecv.support_]])

In [18]:
X_train_nn = torch.tensor(X_tr_scaled, dtype=torch.float32).to(device)
y_train_nn = torch.tensor(y_train.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)

In [19]:
n_epochs = 25 #50
batch_size = 10

for epoch in range(n_epochs):
    for i in range(0, len(X_train_nn), batch_size):
        Xbatch = X_train_nn[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y_train_nn[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.8473950028419495
Finished epoch 1, latest loss 0.767339289188385
Finished epoch 2, latest loss 0.8081995844841003
Finished epoch 3, latest loss 0.7584068179130554
Finished epoch 4, latest loss 0.7549437284469604
Finished epoch 5, latest loss 0.6677432656288147
Finished epoch 6, latest loss 0.5764896869659424
Finished epoch 7, latest loss 0.5270395874977112
Finished epoch 8, latest loss 0.5065106749534607
Finished epoch 9, latest loss 0.512797474861145
Finished epoch 10, latest loss 0.4941381812095642
Finished epoch 11, latest loss 0.4938358664512634
Finished epoch 12, latest loss 0.5181329846382141
Finished epoch 13, latest loss 0.5119580030441284
Finished epoch 14, latest loss 0.4945756196975708
Finished epoch 15, latest loss 0.5018322467803955
Finished epoch 16, latest loss 0.4999777376651764
Finished epoch 17, latest loss 0.5062088370323181
Finished epoch 18, latest loss 0.5073754191398621
Finished epoch 19, latest loss 0.49982303380966187
Finished ep

In [20]:
X_test_nn = torch.tensor(X_te_scaled, dtype=torch.float32).to(device)
y_test_nn = torch.tensor(y_test.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)

In [21]:
# disable gradient tracking for inference
with torch.no_grad():
    y_pred = model(X_test_nn)
accuracy = (y_pred.round() == y_test_nn).float().mean()
print(f"Accuracy {accuracy}")

Accuracy 0.8144159913063049


In [22]:
y_pred=pd.DataFrame(y_pred.round().cpu().detach().numpy())
print(f"Precision {precision_score(y_test, y_pred, pos_label=1)}")
print(f"Recall {recall_score(y_test, y_pred, pos_label=1)}")
print(f"F1-score {f1_score(y_test, y_pred, pos_label=1)}")

Precision 0.845502355505096
Recall 0.7695231308272463
F1-score 0.8057255136049346


In [23]:
print(f"Precision {precision_score(y_test, y_pred, pos_label=0)}")
print(f"Recall {recall_score(y_test, y_pred, pos_label=0)}")
print(f"F1-score {f1_score(y_test, y_pred, pos_label=0)}")

Precision 0.7884460240280836
Recall 0.8593274714828897
F1-score 0.8223622118872451


### Results <a class="anchor" id="UpResults"></a>

[TOC](#mtop)

With upper sampling, one can verify a much better performance for all models compared with under sampling, since more training data benefits the models. AdaBoost presents the best performance, with an accuracy of 99%. Its precision is 98% for legitimate and 99% for fraudulent transactions, its recall is 99% for legitimate and 98% for fraudulent transactions, and its f1-score is 99% for legitimate and 98% for fraudulent transactions. AdaBoost therefore adapted well to the prepared dataset obtained from the Kaggle competition.

<table>
    <thead>
        <tr>
            <th>Models</th>
            <th>Accuracy</th>
            <th colspan=2 align=center>Precision</th>
            <th colspan=2 align=center>Recall</th>
            <th colspan=2 align=center>F1-score</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td></td>       
            <td></td>
            <td>Legitimate</td>
            <td>Fraud</td>
            <td>Legitimate</td>
            <td>Fraud</td>
            <td>Legitimate</td>
            <td>Fraud</td>
        </tr>
        <tr>
            <td align=right>Decision trees</td>
            <td align=center>86%</td>
            <td align=center>84%</td>
            <td align=center>88%</td>
            <td align=center>89%</td>
            <td align=center>83%</td>
            <td align=center>86%</td>
            <td align=center>85%</td>
        </tr>
        <tr>
            <td align=right>Random forests</td>
            <td align=center>80%</td>
            <td align=center>77%</td>
            <td align=center>83%</td>
            <td align=center>84%</td>
            <td align=center>75%</td>
            <td align=center>81%</td>
            <td align=center>79%</td>
        </tr>
        <tr>
            <td align=right><b>AdaBoost</b></td>
            <td align=center><b>99%</b></td>
            <td align=center><b>98%</b></td>
            <td align=center><b>99%</b></td>
            <td align=center><b>99%</b></td>
            <td align=center><b>98%</b></td>
            <td align=center><b>99%</b></td>
            <td align=center><b>98%</b></td>
        </tr>
        <tr>
            <td align=right>Support vector machines</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
            <td align=center>xx%</td>
        </tr>
        <tr>
            <td align=right>Artificial neural networks</td>
            <td align=center>81%</td>
            <td align=center>79%</td>
            <td align=center>85%</td>
            <td align=center>86%</td>
            <td align=center>77%</td>
            <td align=center>82%</td>
            <td align=center>81%</td>
        </tr>
    </tbody>
</table>
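
The per-class columns of a table like the one above follow directly from the confusion matrix. The sketch below shows the arithmetic on a small set of made-up labels (0 = legitimate, 1 = fraud) and cross-checks it against scikit-learn's metric functions. The `*_toy` names and values are illustrative assumptions, not results from this project.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels for illustration only (0 = legitimate, 1 = fraud).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)

# Fraud is the positive class here.
precision_fraud = tp / (tp + fp)
recall_fraud = tp / (tp + fn)
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

# For the legitimate class the roles of the two classes swap.
precision_legit = tn / (tn + fn)
recall_legit = tn / (tn + fp)
f1_legit = 2 * precision_legit * recall_legit / (precision_legit + recall_legit)

print(f"accuracy {accuracy_toy:.2f}")
print(f"fraud      P/R/F1 {precision_fraud:.2f} / {recall_fraud:.2f} / {f1_fraud:.2f}")
print(f"legitimate P/R/F1 {precision_legit:.2f} / {recall_legit:.2f} / {f1_legit:.2f}")

# Cross-check against scikit-learn, selecting the class via pos_label.
print(precision_score(y_true_toy, y_pred_toy, pos_label=1),
      recall_score(y_true_toy, y_pred_toy, pos_label=0),
      f1_score(y_true_toy, y_pred_toy, pos_label=1))
```

Reporting both classes matters for fraud detection because a model can score well on the legitimate class while still missing many fraudulent transactions, which shows up as low fraud recall.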

## Conclusions <a class="anchor" id="mconclusions"></a>

[TOC](#mtop)

The data were investigated by checking for class imbalance, visualizing the features, checking for null values, and examining the relationships between features. The data were split into train and test sets and scaled, and a feature selection technique based on feature importance was employed and evaluated for use in further pipelines. The hyperparameters were prepared for optimization, and a pipeline was created for each model with feature selection, scaling, hyperparameter tuning, and classification steps. The performance of Decision Trees, Random Forests, Support Vector Machines, and Artificial Neural Networks is very similar. However, the AdaBoost ensemble model presents the best overall performance among all analyzed models.