## Fraud Detection

This notebook implements a machine learning model for fraud detection on a financial dataset. The main goal is to develop a robust and efficient model for anomaly detection.

We use a synthetic dataset generated with the PaySim simulator. PaySim uses aggregated data from a private dataset to generate synthetic transactions that resemble normal operations, then injects malicious behaviour so that the performance of fraud detection methods can be evaluated.

Content of the notebook:
- 1. Read Data
- 2. Exploratory Data Analysis
- 3. Data Preprocessing
- 4. Model Development
  - 4.1 Random Forest
  - 4.2 Random Forest with SMOTE
- 5. Results
- 6. Discussion

Import the necessary libraries for data analysis and visualization.

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")


## 1. Read Data

In [None]:
paysim = pd.read_csv("PS_20174392719_1491204439457_log.csv")

In [None]:
paysim.head()

- step: maps to a unit of time in the real world. Here 1 step is 1 hour of time, with 744 steps in total (744 hours, about one month of simulated time).

- type: CASH-IN, CASH-OUT, DEBIT, PAYMENT or TRANSFER.

- amount: amount of the transaction in local currency.

- nameOrig: customer who started the transaction.

- oldbalanceOrg: originator's balance before the transaction.

- newbalanceOrig: originator's balance after the transaction.

- nameDest: customer who is the recipient of the transaction.

- oldbalanceDest: recipient's balance before the transaction. Note that there is no information for recipients whose name starts with M (merchants).

- newbalanceDest: recipient's balance after the transaction. Note that there is no information for recipients whose name starts with M (merchants).

- isFraud: marks transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them by transferring the funds to another account and then cashing out of the system.

- isFlaggedFraud: the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.



## 2. Exploratory Data Analysis

In [None]:
paysim.describe()

Check for missing values.

In [None]:
paysim.isnull().sum()

In [None]:
print('Number of unique values/Categories:')
for col in paysim.columns:
    print('- '+col+': ', paysim[col].nunique())

In [None]:
sns.set_style('whitegrid')
sns.set_context('notebook')
plt.figure(figsize=(8, 4))
paysim['isFraud_str'] = paysim['isFraud'].apply(str)
countplot = sns.countplot(data=paysim, x='type', hue='isFraud_str', palette="pastel")
countplot.set_xlabel('Type')
countplot.set_ylabel('Count')
countplot.set_yscale('log')
plt.savefig('figures/payment_count.jpeg', dpi=300)
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
# Shift amounts by a tiny offset so zero-amount rows survive the log scale
paysim['amount+'] = paysim['amount'] + 1e-7
histplot = sns.histplot(ax=ax[0], data=paysim[paysim['isFraud']==False], x='amount+', hue='isFraud', 
                        kde=True, element='step', palette="Set2", log_scale=True)
histplot.set_ylabel('Number of Observations')
histplot.set_xlabel(f'Amount')
histplot1 = sns.histplot(ax=ax[1], data=paysim[paysim['isFraud']==True], x='amount+', hue='isFraud', 
                         kde=True, element='step', palette="Set1", log_scale=True)
histplot1.set_ylabel('Number of Observations')
histplot1.set_xlabel(f'Amount')
mean_value_f = paysim[paysim['isFraud']==False]['amount'].mean()
mean_value_t = paysim[paysim['isFraud']==True]['amount'].mean()
histplot.axvline(x=mean_value_f, color='k', linestyle='dashed')
histplot1.axvline(x=mean_value_t, color='k', linestyle='dashed')
print(f'Mean amount for regular transactions: ${mean_value_f:,.2f}')
print(f'Mean amount for fraudulent transactions: ${mean_value_t:,.2f}')
paysim.drop(columns = ['amount+', 'isFraud_str'], inplace=True)
plt.savefig('figures/amount_hist.jpeg', dpi=300)
plt.show()

In [None]:
paysim.corr(numeric_only=True).style.background_gradient(cmap="crest")

## 3. Data Preprocessing

In [None]:
# Account type of the originator: first letter of nameOrig
paysim['New_TypeOrig'] = paysim['nameOrig'].str[0]

# Account type of the recipient: first letter of nameDest (e.g. 'M' for merchants)
paysim['New_TypeDest'] = paysim['nameDest'].str[0]

In [None]:
paysim.drop(columns = ['nameOrig','nameDest'], inplace=True)

In [None]:
# Convert the hourly step counter into an hour-of-day value (0-23)
paysim['step'] = paysim['step'] % 24

### 3.1 Encoding

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# One-hot encode the categorical columns only; the balance columns stay numeric
# so that they can be standardized in section 3.3
paysim_dummies = pd.get_dummies(paysim, columns=['type', 'New_TypeOrig', 'New_TypeDest'],
                                drop_first=True, dtype=float)
paysim_dummies.head()

### 3.2 Split training and validation set

In [None]:
y = paysim_dummies.isFraud
X = paysim_dummies.drop(['isFraud', 'isFlaggedFraud'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75, random_state=42)

In [None]:
X_train.head()

### 3.3 Normalization

In [None]:
normalize_cols = ["step", "amount", "oldbalanceOrg", "newbalanceOrig", "newbalanceDest", "oldbalanceDest"]

features_train = X_train[normalize_cols]
features_test = X_test[normalize_cols]
scaler = StandardScaler().fit(features_train.values)
features_train = scaler.transform(features_train.values)
features_test = scaler.transform(features_test.values)
X_train[normalize_cols] = features_train
X_test[normalize_cols] =features_test

X_test.head()
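
The scaler above is fitted on the training split only and then applied to both splits, which keeps test-set statistics out of preprocessing. A minimal sketch of that behaviour on made-up numbers (the arrays below are illustrative, not taken from PaySim):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "amount" column, made up for illustration only.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [50.0]])

# The mean (20) and standard deviation (about 8.165) come from the training rows only.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
# The test rows reuse the training mean and standard deviation.
test_scaled = scaler.transform(test)

print("training mean:", scaler.mean_[0])
print("scaled train:", train_scaled.ravel())
print("scaled test:", test_scaled.ravel())
```

The scaled training column has mean 0 by construction, while the scaled test column generally does not, because it is measured against the training distribution.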

## 4. Model Development

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

### 4.1 Random Forest

#### 4.1.1 Model Development

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 20, num = 10)]
# Number of features to consider at every split
max_features = ['log2', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
# Base model to tune; class_weight='balanced' reweights the rare fraud class
rf = RandomForestClassifier(class_weight='balanced')
# Sample 20 parameter combinations, score each with 5-fold CV, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 20, cv = 5, 
                               verbose=0, random_state=42, n_jobs = -1, scoring='roc_auc')
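
To put `n_iter = 20` in perspective, the sketch below counts how many combinations the grid above contains and how many model fits the search performs. It only repeats the candidate lists from the grid cell; the name `grid` is introduced here for illustration.

```python
from math import prod

import numpy as np

# Same candidate values as the random grid defined above.
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=10, stop=20, num=10)],
    "max_features": ["log2", "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Total number of distinct parameter combinations in the grid.
n_combinations = prod(len(values) for values in grid.values())
n_iter, cv = 20, 5  # values passed to RandomizedSearchCV

print(f"Grid size: {n_combinations} combinations")
print(f"Sampled: {n_iter} ({n_iter / n_combinations:.2%} of the grid)")
print(f"Cross-validation fits: {n_iter * cv}")
```

With these lists the grid holds 4,320 combinations, so the search samples under 0.5% of it. The final refit of the best configuration on the full training set adds one more fit.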

#### 4.1.2 Model Training

In [None]:
# Fit the random search model
start_time = time.time()
rf_random.fit(X_train, y_train)
run_time = time.time()-start_time

In [None]:
print(f'Time required to train random forest model: {run_time:.2f} seconds')

In [None]:
best_rf = rf_random.best_estimator_
rf_random.best_estimator_

In [None]:
# Generate predictions with the best model
rf_pred = best_rf.predict(X_test)
rf_prob = best_rf.predict_proba(X_test)

#### 4.1.3 Model Performance 

False negatives (missed fraud) carry a high cost, so recall is the primary metric.

Suppose a customer requires that at least 60% of the transactions our classifier flags as fraud are actually fraudulent, so that legitimate customers are not bothered by false positives.

To meet this requirement, we select the decision threshold that maximizes recall, provided that precision > 0.6.
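
One way to apply this rule, sketched under assumptions: take the precision-recall curve, keep the thresholds whose precision exceeds 0.6, and pick the one with the highest recall. The helper `pick_threshold` and the toy labels and scores are hypothetical, introduced only for illustration; on the notebook's data the inputs would be `y_test` and the positive-class column of `predict_proba`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision=0.6):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision is above min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last (precision=1, recall=0) point has no matching threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision > min_precision
    if not ok.any():
        raise ValueError("no threshold reaches the requested precision")
    # Mask out thresholds that violate the precision constraint, then maximize recall.
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return thresholds[best], precision[best], recall[best]


# Toy labels and scores, made up for illustration only (1 = fraud).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

threshold, p, r = pick_threshold(y_true, scores, min_precision=0.6)
# Apply the chosen threshold instead of the default 0.5 cut-off.
y_pred = (scores >= threshold).astype(int)
print(f"threshold={threshold:.2f}, precision={p:.3f}, recall={r:.3f}")
```

On this toy data the selected threshold is 0.35, which flags six transactions, catches all four frauds, and gives a precision of 4/6.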

- F1 score: the F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0, and precision and recall contribute equally to it.

- PR curve: the precision-recall curve shows the tradeoff between precision and recall at different thresholds. A large area under the curve indicates both high recall and high precision, where high precision corresponds to few false positives and high recall corresponds to few false negatives.

- Confusion matrix: a confusion matrix is a table showing how a classifier's predictions compare with the actual labels for two or more classes. In the plots below, the predictions are on the x-axis and the actual labels are on the y-axis. Each cell holds the number of observations for that combination, and correct predictions lie on the diagonal from top left to bottom right.
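
As a small worked example of these definitions, the sketch below computes precision, recall, and F1 from the four cells of a confusion matrix. The labels are made up for illustration; `confusion_matrix` from scikit-learn returns rows for actual classes and columns for predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels, made up for illustration only (1 = fraud).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud that gets flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

The hand-computed F1 matches `f1_score(y_true, y_pred)` from scikit-learn.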

In [None]:
from sklearn.metrics import auc, ConfusionMatrixDisplay, confusion_matrix, roc_auc_score, precision_score, recall_score, accuracy_score, classification_report, precision_recall_curve

In [None]:
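# Confusion matrix of the tuned random forest on the test set
cm = confusion_matrix(y_test, rf_pred)
fig, ax = plt.subplots(figsize=(4, 4))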
ax.matshow(cm, cmap=plt.cm.Blues, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,s=cm[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
# calculate inputs for the PR curve
precision, recall, thresholds = precision_recall_curve(y_test, rf_prob[:, 1])

# plot PR curve
plt.plot(recall, precision, marker='.', label='Random Forest')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
plt.show()

# calculate and print PR AUC
auc_pr = auc(recall, precision)
print('AUC PR: %.3f' % auc_pr)

In [None]:
# best_rf is a plain RandomForestClassifier, so its importances are read directly
sorted_idx = best_rf.feature_importances_.argsort()
plt.barh(X_train.columns[sorted_idx], best_rf.feature_importances_[sorted_idx])
plt.title("Random Forest")
plt.xlabel("Feature Importance")
plt.show()

### 4.2 Random Forest with SMOTE

In [None]:
import imblearn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [None]:
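model = RandomForestClassifier(class_weight='balanced')
# Oversample the fraud class with SMOTE until minority/majority = 0.3
over = SMOTE(sampling_strategy=0.3, k_neighbors=5)
# Then undersample the majority class until minority/majority = 0.5
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)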
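
The two resampling steps can be read as simple ratio arithmetic: with a float `sampling_strategy`, SMOTE adds synthetic minority samples until minority/majority reaches 0.3, and the undersampler then removes majority samples until minority/majority reaches 0.5. The sketch below works through the resulting class counts. The function `resampled_counts` and the input counts are hypothetical, and the rounding is an approximation of what imbalanced-learn does, not its actual implementation.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.3, under_ratio=0.5):
    """Approximate class counts after SMOTE followed by random undersampling."""
    # Oversampling raises the minority count to over_ratio * majority
    # (it never removes samples, hence the max).
    n_minority_after = max(n_minority, int(n_majority * over_ratio))
    # Undersampling lowers the majority count to minority / under_ratio
    # (it never adds samples, hence the min).
    n_majority_after = min(n_majority, int(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after


# Made-up counts for illustration: 100,000 legitimate and 200 fraudulent rows.
n_majority_after, n_minority_after = resampled_counts(100_000, 200)
fraud_share = n_minority_after / (n_majority_after + n_minority_after)

print(f"legitimate: {n_majority_after}, fraud: {n_minority_after}")
print(f"fraud share after resampling: {fraud_share:.1%}")
```

Inside an imbalanced-learn `Pipeline` the samplers run only during fitting, so the validation folds and the test set keep their original class balance.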

In [None]:
random_grid = {'model__n_estimators': n_estimators,
               'model__max_features': max_features,
               'model__max_depth': max_depth,
               'model__min_samples_split': min_samples_split,
               'model__min_samples_leaf': min_samples_leaf,
               'model__bootstrap': bootstrap}

In [None]:
rf_random = RandomizedSearchCV(pipeline, param_distributions = random_grid, n_iter = 20, cv = 5, 
                               verbose=0, random_state=42, n_jobs = -1, scoring='roc_auc')

In [None]:
start_time = time.time()
rf_random.fit(X_train, y_train)
run_time = time.time() - start_time

In [None]:
print(f'Time required to train random forest with SMOTE technique: {run_time:.2f} seconds')

In [None]:
best_smote_rf = rf_random.best_estimator_
rf_random.best_estimator_

In [None]:
# Generate predictions with the best model
smote_rf_pred = best_smote_rf.predict(X_test)
smote_rf_prob = best_smote_rf.predict_proba(X_test)

In [None]:
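# Confusion matrix of the SMOTE random forest on the test set
cm = confusion_matrix(y_test, smote_rf_pred)
fig, ax = plt.subplots(figsize=(4, 4))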
ax.matshow(cm, cmap=plt.cm.Blues, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,s=cm[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
# calculate inputs for the PR curve
precision, recall, thresholds = precision_recall_curve(y_test, smote_rf_prob[:, 1])

# plot PR curve
plt.plot(recall, precision, marker='.', label='SMOTE RF')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
plt.show()

# calculate and print PR AUC
auc_pr = auc(recall, precision)
print('AUC PR: %.3f' % auc_pr)

In [None]:
sorted_idx = best_smote_rf['model'].feature_importances_.argsort()
plt.barh(X_train.columns[sorted_idx], best_smote_rf['model'].feature_importances_[sorted_idx])
plt.title("Random Forest with SMOTE")
plt.xlabel("Feature Importance")
plt.show()

## 5. Discussion