# Model Training - Fraud Detection with XGBoost

In this notebook, we will train and evaluate **machine learning models** for credit card fraud detection.

## **Steps We Will Cover**
1. **Load Preprocessed Data** → Use **both** the original & SMOTE-balanced datasets.  
2. **Train-Test Split** → Prepare data for model training.  
3. **Train Baseline XGBoost Model** → Evaluate initial performance.  
4. **Optimize Model Performance** → Use **Optuna** for hyperparameter tuning.  
5. **Evaluate Model Performance** → Check metrics like **AUC-ROC, Precision-Recall**.

Our goal is to **build an accurate fraud detection system** while handling the **highly imbalanced dataset** properly.

## Import Necessary Libraries

Before training our models, we first import essential libraries for:
- **Data Handling & Processing** → `pandas`, `numpy`
- **Model Training** → `xgboost`, `scikit-learn`
- **Hyperparameter Optimization** → `optuna`
- **Evaluation Metrics** → `classification_report`, `roc_auc_score`, `precision_recall_curve`, `auc`
- **Visualization** → `matplotlib`, `seaborn`

In [1]:
# Data Handling
import pandas as pd
import numpy as np

# Machine Learning Models
import xgboost as xgb
from xgboost import XGBClassifier

# Model Selection & Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc

# Hyperparameter Optimization
import optuna

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Load Preprocessed Data & Train-Test Split

To train our fraud detection model, we will:
1. **Load the preprocessed datasets**:
   - `X_scaled.csv` → Scaled but imbalanced dataset.
   - `X_smote.csv` → SMOTE-balanced dataset.
2. **Perform Train-Test Split**:
   - **80% training, 20% testing**.
   - Stratify on the label so both splits keep the same fraud ratio (`train_test_split` shuffles the data by default).

In [2]:
# Load preprocessed datasets
X_scaled = pd.read_csv("../datasets/X_scaled.csv")
y = pd.read_csv("../datasets/y.csv")

X_smote = pd.read_csv("../datasets/X_smote.csv")
y_smote = pd.read_csv("../datasets/y_smote.csv")

# Ensure y is a 1D array (avoid shape issues)
y = y.values.ravel()
y_smote = y_smote.values.ravel()

# Train-Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42, stratify=y_smote)

# Display dataset shapes
print("Original Dataset Shape:")
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

print("\nSMOTE Dataset Shape:")
print(f"Train: {X_train_smote.shape}, Test: {X_test_smote.shape}")

Original Dataset Shape:
Train: (227845, 30), Test: (56962, 30)

SMOTE Dataset Shape:
Train: (454904, 30), Test: (113726, 30)
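
As a quick sanity check on a stratified split, the fraud rate should be nearly identical in both partitions. The following is a minimal sketch on synthetic labels; the arrays `X_demo` and `y_demo` and the 0.2% positive rate are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 10,000 rows with a 0.2% positive (fraud) rate
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the same share of positives
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Train fraud rate: {train_rate:.4%}, Test fraud rate: {test_rate:.4%}")
```

Running the same two `.mean()` calls on `y_train` and `y_test` would confirm that the notebook's split preserved the class ratio.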


## Training Three XGBoost Models

To compare different approaches, we will train three XGBoost models:

1. **Base XGBoost Model** → Trained on the **original imbalanced dataset** without any adjustments.  
2. **Weighted XGBoost Model** → Trained on the **original dataset**, but with **higher penalty for misclassifying fraud cases** using `scale_pos_weight`.  
3. **SMOTE XGBoost Model** → Trained on the **SMOTE-balanced dataset**, where fraud cases are oversampled to match non-fraud cases.  

This will help us determine:
- Whether **handling imbalance improves fraud detection**.  
- Whether **penalizing fraud misclassification improves model focus**.  
- Which **approach provides the best overall fraud detection performance**.  

In [11]:
# Initialize Base XGBoost Model
# device="gpu" needs XGBoost >= 2.0 and a CUDA-capable GPU; use device="cpu" otherwise
base_xgb = XGBClassifier(objective="binary:logistic", eval_metric="logloss", random_state=42, device="gpu")

# Train on imbalanced dataset
base_xgb.fit(X_train, y_train)

# Make predictions
y_pred_base = base_xgb.predict(X_test)

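# Evaluate model
# Note: roc_auc_score receives hard 0/1 predictions here, so the value equals the
# mean of the two class recalls rather than a ranking-based AUC over probabilities.
print("Base XGBoost Model Performance:")
print(classification_report(y_test, y_pred_base))
print(f"AUC-ROC Score: {roc_auc_score(y_test, y_pred_base):.4f}")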

Base XGBoost Model Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.89      0.81      0.84        98

    accuracy                           1.00     56962
   macro avg       0.94      0.90      0.92     56962
weighted avg       1.00      1.00      1.00     56962

AUC-ROC Score: 0.9030
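
AUC-ROC is normally computed from predicted probabilities, which capture how well the model ranks fraud above non-fraud across all thresholds. Passing thresholded labels instead collapses the curve to a single operating point. A small sketch with made-up labels and scores (all values hypothetical) shows how much the two can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])  # hypothetical predicted probabilities
labels = (scores >= 0.5).astype(int)               # hard predictions at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)  # ranking-based AUC
auc_from_labels = roc_auc_score(y_true, labels)  # equals (TPR + TNR) / 2

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

For the models in this notebook, the probability-based variant would be along the lines of `roc_auc_score(y_test, base_xgb.predict_proba(X_test)[:, 1])`.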


In [12]:
# Weight for the positive (fraud) class: total samples / fraud samples.
# This is close to the negative/positive ratio commonly used for scale_pos_weight.
fraud_weight = len(y_train) / y_train.sum()

# Initialize Weighted XGBoost Model
weighted_xgb = XGBClassifier(objective="binary:logistic", eval_metric="logloss", scale_pos_weight=fraud_weight, random_state=42, device="gpu")

# Train with class weighting
weighted_xgb.fit(X_train, y_train)

# Make predictions
y_pred_weighted = weighted_xgb.predict(X_test)

# Evaluate model
print("Weighted XGBoost Model Performance (Higher Fraud Penalty):")
print(classification_report(y_test, y_pred_weighted))
print(f"AUC-ROC Score: {roc_auc_score(y_test, y_pred_weighted):.4f}")

Weighted XGBoost Model Performance (Higher Fraud Penalty):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.88      0.83      0.85        98

    accuracy                           1.00     56962
   macro avg       0.94      0.91      0.93     56962
weighted avg       1.00      1.00      1.00     56962

AUC-ROC Score: 0.9132
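
The XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value for `scale_pos_weight`. The cell above divides the total sample count by the positive count, which is that ratio plus one. A minimal sketch on hypothetical labels (the array `y_demo` is an assumption for illustration) makes the difference concrete:

```python
import numpy as np

# Hypothetical labels: 1,000 rows, 10 of them positive (fraud)
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1

n_pos = int(y_demo.sum())
n_neg = int(len(y_demo) - n_pos)

ratio_neg_pos = n_neg / n_pos          # negatives per positive
ratio_total_pos = len(y_demo) / n_pos  # total per positive (the form used above)

print(f"negative/positive: {ratio_neg_pos}, total/positive: {ratio_total_pos}")
```

At the imbalance level in this dataset the two values are nearly identical, so either choice should behave similarly.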


In [13]:
# Initialize SMOTE XGBoost Model
smote_xgb = XGBClassifier(objective="binary:logistic", eval_metric="logloss", random_state=42, device="gpu")

# Train on SMOTE-balanced dataset
smote_xgb.fit(X_train_smote, y_train_smote)

# Make predictions
# Note: X_test_smote is a split of the SMOTE-balanced data, so it contains synthetic fraud samples
y_pred_smote = smote_xgb.predict(X_test_smote)

# Evaluate model
print("SMOTE XGBoost Model Performance:")
print(classification_report(y_test_smote, y_pred_smote))
print(f"AUC-ROC Score: {roc_auc_score(y_test_smote, y_pred_smote):.4f}")


SMOTE XGBoost Model Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56863
           1       1.00      1.00      1.00     56863

    accuracy                           1.00    113726
   macro avg       1.00      1.00      1.00    113726
weighted avg       1.00      1.00      1.00    113726

AUC-ROC Score: 0.9996
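
One way to keep synthetic samples out of the test set is to split first and oversample only the training portion. The sketch below illustrates that order of operations on synthetic data. The helper `interpolate_minority` is hypothetical and is a simplified stand-in for SMOTE: it interpolates between random pairs of minority rows rather than between nearest neighbours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def interpolate_minority(X, y, rng):
    """Simplified SMOTE stand-in: add synthetic minority rows, built by
    interpolating between random pairs of real minority rows, until balanced."""
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    lam = rng.random((n_new, 1))
    X_new = X_min[i] + lam * (X_min[j] - X_min[i])
    y_new = np.ones(n_new, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 5))  # hypothetical features
y_demo = np.zeros(2000, dtype=int)
y_demo[:40] = 1                      # hypothetical 2% fraud rate

# 1) Split the original data first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Oversample the training portion only; the test set stays untouched
X_tr_bal, y_tr_bal = interpolate_minority(X_tr, y_tr, rng)

print("Balanced train class counts:", np.bincount(y_tr_bal))
print("Test class counts (original ratio):", np.bincount(y_te))
```

In practice a library implementation such as imbalanced-learn's `SMOTE` would replace the helper. The model trained on the balanced training data can then be scored on the same imbalanced test set as the other two models.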


## Model Performance Comparison

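After training three different XGBoost models, we can compare their effectiveness in fraud detection. Note that the AUC-ROC column is computed from hard 0/1 predictions, and the SMOTE row is scored on the SMOTE-balanced test split, so it is not directly comparable to the other two rows.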

| Model | Precision (Fraud) | Recall (Fraud) | F1-Score (Fraud) | AUC-ROC Score |
|---|---|---|---|---|
| **Base XGBoost** | **0.89** | **0.81** | **0.84** | **0.9030** |
| **Weighted XGBoost (Higher Fraud Penalty)** | **0.88** | **0.83** | **0.85** | **0.9132** |
| **SMOTE XGBoost** | **1.00** | **1.00** | **1.00** | **0.9996** |
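
The fraud-class columns in a table like this can be computed from predictions rather than copied from the printed reports. A minimal sketch follows; the helper name `fraud_metrics` and the demo labels are hypothetical.

```python
from sklearn.metrics import precision_recall_fscore_support

def fraud_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud) class only."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1, zero_division=0
    )
    return {
        "precision": round(float(precision), 2),
        "recall": round(float(recall), 2),
        "f1": round(float(f1), 2),
    }

# Hypothetical labels and predictions: 3 true positives, 1 false positive, 1 false negative
demo = fraud_metrics([0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1, 0])
print(demo)
```

Calling it as `fraud_metrics(y_test, y_pred_base)` and likewise for the other models would yield one table row per model.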

### **Observations**
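- **Base XGBoost Model** performed well but had the **lowest recall (81%)**, meaning it **missed** 19 of the 98 fraud cases in the test set.
- **Weighted XGBoost Model** (higher fraud misclassification penalty) **slightly improved recall (83%)**, catching two more of the 98 test fraud cases at a small cost in precision (0.88 vs. 0.89).
- **SMOTE XGBoost Model** achieved **near-perfect scores**, but these were measured on a test split of the SMOTE-balanced data (56,863 samples per class), which contains **synthetic fraud samples generated before the split**. The result reflects leakage and an unrealistic class balance rather than real-world performance, so it cannot be compared with the other two models, which were tested on the original imbalanced data.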

### **Key Takeaways**
✔ **Weighted XGBoost improved recall with almost no loss in precision** → A better model for fraud detection.  
✔ **SMOTE Model scores are not trustworthy as reported**: its test split contains synthetic samples, so it should be re-evaluated on the original test set.  
✔ **Base Model is strong but could miss some fraud cases** (lower recall).  
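
With only 98 fraud cases among 56,962 test transactions, a precision-recall summary is often more informative than accuracy or AUC-ROC. The notebook already imports `precision_recall_curve` and `auc`; the sketch below shows how they could be applied, using made-up labels and scores (all values hypothetical).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])  # hypothetical predicted probabilities

# Precision and recall at every score threshold
precision, recall, _ = precision_recall_curve(y_true, scores)

pr_auc = auc(recall, precision)               # trapezoidal area under the PR curve
ap = average_precision_score(y_true, scores)  # step-wise summary of the same curve

print(f"PR-AUC (trapezoidal): {pr_auc:.3f}")
print(f"Average precision:    {ap:.3f}")
```

For the models above, the scores would come from `predict_proba(X_test)[:, 1]` rather than from `predict`.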

### **Next Steps**
1. **Use SHAP for feature importance analysis** → Understand which features impact predictions.  
2. **Hyperparameter tuning with Optuna** → Improve model performance further.  
3. **Final model selection** → Choose the best approach for real-world fraud detection.