## Model Training & Testing

Model priorities:
- ACCURACY: share of all transactions classified correctly [(TP+TN)/(TP+TN+FP+FN)]
- PRECISION: minimize false positives [TP/(TP+FP)]

### Table of Contents
> **Data Preprocessing**
>
> **Training & Testing**
>
> **Model Selection**
>
> **Feature Selection**
>
> **Feature Engineering**
>
> **Evaluating Metrics**

### **Data Preprocessing**

In [1]:
# Loading in dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from imblearn.under_sampling import RandomUnderSampler

We load in the cleaned data:

In [2]:
clean_df = pd.read_csv('cleaned_df.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'cleaned_df.csv'

In [None]:
clean_df.head()

In [None]:
# Drop identifier, timestamp, and location columns that are not used as features
# (drop() returns a new DataFrame, so clean_df itself is left untouched)
clean_df_preprocessed = clean_df.drop(columns=['Card Identifier', 'Transaction Date', 'Transaction Time', 'Year', 'Month', 'Day', 'Hour', 'Merchant Location'])

# Label-encode each remaining categorical column with its own encoder
for column in ['Payment Method', 'Card Present Status', 'Chip Usage', 'Cross-border Transaction (Yes/No)', 'Acquiring Institution ID', 'Merchant Identifier', 'Merchant Category']:
    lbl = LabelEncoder()
    clean_df_preprocessed[column] = lbl.fit_transform(clean_df[column])

# Split the data into features and the target
X = clean_df_preprocessed.drop('Fraud Indicator (Yes/No)', axis=1)
y = clean_df_preprocessed['Fraud Indicator (Yes/No)']

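# Convert target variable to binary with a dedicated encoder
# (rather than reusing the encoder left over from the feature loop)
y = LabelEncoder().fit_transform(y)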

rus = RandomUnderSampler(random_state=39)
X_res, y_res = rus.fit_resample(X, y)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=39)

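Random undersampling is applied before the data is split into training and testing sets so that both sets are drawn from the same balanced pool. Because the split is not stratified, the class counts in each set can still differ slightly, and metrics measured on this balanced test set describe an even class mix rather than the original class imbalance.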
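The ordering described above balances the data before splitting. A minimal sketch of the alternative ordering, offered as an illustration and not as the method used in this notebook, splits first with `stratify` and undersamples only the training portion, so the test set keeps the original class ratio. The synthetic `X` and `y` and the helper `undersample_majority` are hypothetical stand-ins. The helper uses NumPy to imitate what `RandomUnderSampler` does by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def undersample_majority(X, y, seed=39):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Synthetic stand-in data (assumption): 1000 rows, 5% positive class
rng = np.random.default_rng(39)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# 1) Split first, keeping the class ratio identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=39, stratify=y
)

# 2) Balance only the training portion; the test set stays untouched
X_train_bal, y_train_bal = undersample_majority(X_train, y_train)

print("train class counts (balanced):", np.bincount(y_train_bal))
print("test class counts (original ratio):", np.bincount(y_test))
```

With this ordering, precision measured on `X_test` reflects the class mix the model would see on unbalanced data, at the cost of having very few positive examples in the test set.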

### **Model Selection**

Because the data contains many categorical variables, and one-hot encoding the high-cardinality ones (such as Merchant Location) is not computationally viable, we choose between the **Random Forest Classification** and **Gradient Boosting** models, as both handle non-linear relationships and interactions between categorical features well.

In [3]:
# Load in dependencies
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Random Forest

In [4]:
# Initialize and train the Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100,
                             random_state=39,
                             class_weight='balanced',
                             max_depth=60,         # Maximum depth of each tree
                             max_features='sqrt',  # Features considered at each split: sqrt(n_features)
                             min_samples_split=2,  # Minimum number of samples required to split an internal node
                             min_samples_leaf=1)   # Minimum number of samples required to be at a leaf node
rfc.fit(X_train, y_train)

NameError: name 'X_train' is not defined

In [None]:
# Make predictions
y_pred = rfc.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[48  5]
 [ 5 48]]
              precision    recall  f1-score   support

           0       0.91      0.91      0.91        53
           1       0.91      0.91      0.91        53

    accuracy                           0.91       106
   macro avg       0.91      0.91      0.91       106
weighted avg       0.91      0.91      0.91       106



In [None]:
rf_accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Model Accuracy: ", rf_accuracy)

Random Forest Model Accuracy:  0.9056603773584906
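As a cross-check, the two priority metrics can be recomputed by hand from the confusion matrix printed above. This is a small sketch that assumes scikit-learn's layout (rows are true classes, columns are predicted classes, class 1 is fraud).

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class
cm = [[48, 5],
      [5, 48]]
tn, fp = cm[0]  # true class 0: correctly passed, wrongly flagged
fn, tp = cm[1]  # true class 1: missed fraud, caught fraud

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # TP / (TP + FP), as defined in the model priorities
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Both accuracy (96/106) and fraud-class precision (48/53) come out at about 0.906, matching the classification report.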


In [None]:
# seed note: 39, 39, 39

### Gradient Boosting Algorithms

Given time constraints, we limit the gradient boosting candidates to two models:
* CatBoost: effective with categorical variables
* LightGBM: more computationally efficient than XGBoost given the size of the dataset

In [None]:
from catboost import CatBoostClassifier

# Initialize the CatBoostClassifier
cat_model = CatBoostClassifier(iterations=110,
                               learning_rate=0.2,
                               depth=3,
                               verbose=0,
                               random_state=42)

# Train the model
cat_model.fit(X_train, y_train)

# Make predictions on the test set
cat_predictions = cat_model.predict(X_test)

# Evaluate the model
print("CatBoost Model Accuracy: ", accuracy_score(y_test, cat_predictions))
print("\nCatBoost Classification Report:\n", classification_report(y_test, cat_predictions))

CatBoost Model Accuracy:  0.8113207547169812

CatBoost Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.80      0.82        56
           1       0.79      0.82      0.80        50

    accuracy                           0.81       106
   macro avg       0.81      0.81      0.81       106
weighted avg       0.81      0.81      0.81       106



### **Evaluation & Performance Metrics**

Comparing the two models, **Random Forest has the higher accuracy at 90.6%**, against 81.1% for CatBoost. Random Forest also leads in precision, recall, and F1-score for both classes (0.91 throughout, versus 0.79 to 0.83 for CatBoost). This suggests that Random Forest is better suited to the task at hand, as it is capable of not only identifying correct instances but also reducing the chances of false alarms and misses.

CatBoost shows a good balance across the two classes, with nearly even recall (0.80 and 0.82). Its performance is consistent across classes, but it lags behind the Random Forest on every reported metric.

### **Implementation Plan**

#### Model Deployment

To enhance Nullfraud Bank's fraud detection capabilities, the refined Random Forest model will be integrated into the bank’s transaction processing system. This deployment will involve setting up a **secure, real-time prediction service** that analyzes each transaction as it occurs. We will ensure that this system is scalable and robust, capable of handling the increasing volume of digital transactions. The model will run alongside the existing infrastructure initially, in a **shadow mode**, to compare its fraud detection performance against current systems without affecting customer transactions. This approach minimizes risk and allows for fine-tuning while maintaining the customer-friendly nature of transaction processes.
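The shadow-mode idea can be sketched as a thin wrapper that always returns the existing system's decision and only records where the new model would have decided differently. Everything here is a hypothetical illustration: `ShadowScorer`, `rule_based`, and `model_stub` are assumed names, and a real deployment would call the trained Random Forest and the bank's current system instead of the stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Transaction = Dict[str, float]


@dataclass
class ShadowScorer:
    """Serve the existing system's decision while recording the candidate model's."""
    current_system: Callable[[Transaction], bool]   # decision actually applied
    candidate_model: Callable[[Transaction], bool]  # scored in shadow only
    disagreements: List[dict] = field(default_factory=list)

    def decide(self, txn: Transaction) -> bool:
        live = self.current_system(txn)
        try:
            shadow = self.candidate_model(txn)
        except Exception:
            # A failure in the shadow path must never affect the live decision.
            return live
        if shadow != live:
            self.disagreements.append({"txn": txn, "live": live, "shadow": shadow})
        return live


# Hypothetical stand-ins for the two systems
def rule_based(txn: Transaction) -> bool:
    return txn["amount"] > 5000


def model_stub(txn: Transaction) -> bool:
    return txn["amount"] > 3000


scorer = ShadowScorer(rule_based, model_stub)
decisions = [scorer.decide({"amount": a}) for a in (100.0, 4000.0, 9000.0)]
print(decisions, len(scorer.disagreements))
```

The logged disagreements are the cases worth reviewing before the model is allowed to affect customer transactions.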

#### Model Monitoring & Maintenance

Post-deployment, continuous **monitoring** will be crucial to maintain the efficacy of the Random Forest model. We will establish a monitoring system to track the model’s performance metrics in real time, especially **accuracy** and **precision** (minimizing the number of false positives). Anomalies or degradations in performance will trigger alerts for immediate review. Additionally, we will periodically retrain the model with new transaction data to adapt to evolving consumer habits and cyber threats. This retraining will take into account the feedback from frontline teams and customers to ensure the model remains relevant and effective in detecting fraud.
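One way to picture the alerting described above is a rolling window over recently reviewed transactions that raises a flag when precision falls below a threshold. The class name, window size, and thresholds below are illustrative assumptions, not agreed operating values.

```python
from collections import deque


class RollingPrecisionMonitor:
    """Track precision over the most recent reviewed transactions."""

    def __init__(self, window=500, min_precision=0.85, min_flagged=20):
        self.outcomes = deque(maxlen=window)  # (predicted_fraud, confirmed_fraud)
        self.min_precision = min_precision
        self.min_flagged = min_flagged        # avoid alerting on tiny samples

    def record(self, predicted_fraud: bool, confirmed_fraud: bool) -> None:
        self.outcomes.append((predicted_fraud, confirmed_fraud))

    def precision(self):
        # Precision only looks at transactions the model flagged as fraud.
        flagged = [actual for pred, actual in self.outcomes if pred]
        if len(flagged) < self.min_flagged:
            return None  # not enough flagged cases to judge yet
        return sum(flagged) / len(flagged)

    def should_alert(self) -> bool:
        p = self.precision()
        return p is not None and p < self.min_precision


monitor = RollingPrecisionMonitor(window=100, min_precision=0.85, min_flagged=10)
for _ in range(8):
    monitor.record(True, True)    # flagged and confirmed as fraud
for _ in range(4):
    monitor.record(True, False)   # flagged but legitimate (false positive)
for _ in range(5):
    monitor.record(False, False)  # not flagged, legitimate

print(monitor.precision(), monitor.should_alert())
```

Here 8 of the 12 flagged transactions were confirmed as fraud, so precision is about 0.67 and the monitor signals an alert.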

#### Feedback Loop

To ensure the continuous improvement of the fraud detection system, Nullfraud Bank will implement a structured feedback loop. This loop will involve gathering feedback from customers who experienced transaction interventions, analyzing false positives and false negatives, and incorporating insights from the bank’s customer service and fraud investigation teams.

This collective feedback will be used to refine the model's parameters and improve its decision-making processes. By fostering collaboration between the analytics teams, customer service, and security departments, Nullfraud Bank can create a dynamic system that evolves in response to new threats and maintains a high level of customer trust and satisfaction.
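As a rough sketch of how reviewed cases could flow back into retraining, the function below joins the model's flags with investigators' verdicts, tallies the outcome types, and emits relabeled rows. The function name, the dictionary layout, and the transaction ids are hypothetical.

```python
from collections import Counter


def apply_feedback(predictions, verdicts):
    """Join model flags with reviewed outcomes.

    predictions: {txn_id: True if the model flagged the transaction}
    verdicts:    {txn_id: True if investigators confirmed fraud}
    Returns outcome counts and (txn_id, label) rows for the next retraining set.
    """
    summary = Counter()
    relabeled = []
    for txn_id, confirmed in verdicts.items():
        if txn_id not in predictions:
            continue  # no model decision on record for this transaction
        flagged = predictions[txn_id]
        if flagged:
            outcome = "TP" if confirmed else "FP"
        else:
            outcome = "FN" if confirmed else "TN"
        summary[outcome] += 1
        relabeled.append((txn_id, int(confirmed)))
    return summary, relabeled


predictions = {"t1": True, "t2": True, "t3": False, "t4": False}
verdicts = {"t1": True, "t2": False, "t3": True, "t5": False}

summary, relabeled = apply_feedback(predictions, verdicts)
print(dict(summary))
print(relabeled)
```

The false-positive and false-negative counts point reviewers to the cases to analyze, while the relabeled rows can be appended to the data used for periodic retraining.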