# Overfitting Analysis - FraudDetectAI

## Understanding Model Generalization

Overfitting occurs when a model **performs exceptionally well on training data** but fails to generalize to unseen data.  
Because our dataset is highly imbalanced and we applied **SMOTE**, some models may have **memorized synthetic samples** rather than learning genuine fraud patterns.

---

## **Objectives of This Notebook**
- Compare **training vs. test performance** to detect overfitting.  
- Visualize **learning curves** to spot divergence between training and validation performance.  
- Analyze **feature importance stability** (does SMOTE make the model over-rely on certain features?).  
- Decide whether to **adjust the SMOTE ratio** or introduce **undersampling**.
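
As a rough illustration of the learning-curve objective, the sketch below computes training and validation F1 at increasing training-set sizes with scikit-learn's `learning_curve`. The data is a synthetic, imbalanced stand-in from `make_classification`, and `GradientBoostingClassifier` is only a placeholder for the notebook's XGBoost models; both are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic, imbalanced stand-in data (roughly 5% positives); NOT the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Placeholder classifier; the notebook itself loads pre-trained XGBoost models.
clf = GradientBoostingClassifier(n_estimators=50, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    clf,
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",
)

# Average over the CV folds; a gap that stays large as n grows suggests overfitting.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:5d}  train F1={tr:.3f}  validation F1={va:.3f}  gap={tr - va:.3f}")
```

Passing `train_sizes`, `train_mean`, and `val_mean` to `plt.plot` would give the usual two-line learning-curve figure.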

---

## **Steps We Will Cover**
1. **Compare Train vs. Test Performance** → Precision, Recall, F1-score, AUC-ROC  
2. **Plot Learning Curves** → Check whether training and validation scores diverge as training data grows  
3. **Feature Importance Stability Check** → Compare SHAP values across datasets  
4. **Decision on SMOTE Adjustments** → Reduce synthetic samples? Add undersampling?  
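
One possible way to quantify the feature importance stability check in step 3 is sketched below. It assumes each model's SHAP output has already been reduced to a per-sample attribution matrix; the helper names (`mean_abs_importance`, `rank_correlation`, `top_k_overlap`) and the importance values are hypothetical and chosen purely for illustration.

```python
import numpy as np

def mean_abs_importance(attributions):
    """Collapse a (n_samples, n_features) attribution matrix to one score per feature."""
    return np.abs(np.asarray(attributions, dtype=float)).mean(axis=0)

def rank_correlation(a, b):
    """Spearman-style score: Pearson correlation of the ranks (ties are not averaged)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def top_k_overlap(a, b, k=3):
    """Fraction of the k most important features shared by both importance vectors."""
    top_a = set(np.argsort(a)[::-1][:k])
    top_b = set(np.argsort(b)[::-1][:k])
    return len(top_a & top_b) / k

# Hypothetical mean |SHAP| values for five features (made up, not real results).
imp_original = np.array([0.40, 0.25, 0.15, 0.10, 0.05])
imp_smote = np.array([0.20, 0.45, 0.15, 0.05, 0.10])

print("rank correlation:", round(rank_correlation(imp_original, imp_smote), 3))
print("top-3 overlap:", top_k_overlap(imp_original, imp_smote, k=3))
```

A rank correlation near 1 and a high top-k overlap would suggest that SMOTE has not reshuffled which features the model relies on; low values would be a reason to look more closely.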

**Imports**:

In [3]:
# Data Handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model Evaluation
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

# SHAP for Explainability
import shap

# Load Saved Models
import joblib

**Data Preparation**:

In [4]:
# Load normal dataset
X = pd.read_csv("../datasets/X_scaled.csv")
y = pd.read_csv("../datasets/y.csv")

# Load SMOTE dataset
X_smote = pd.read_csv("../datasets/X_smote.csv")
y_smote = pd.read_csv("../datasets/y_smote.csv")

# Perform Train-Test Split (for Base & Weighted models)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

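# Perform Train-Test Split (for SMOTE model)
# Caution: this split is independent of the one above. If X_smote was generated from the
# full X before splitting, X_train_smote may contain rows from X_test (or synthetic
# neighbours of them), which would make test scores look better than they are.
X_train_smote, _, y_train_smote, _ = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42, stratify=y_smote)

# The SMOTE model is evaluated on the original test set (test data is never oversampled)
X_test_smote, y_test_smote = X_test, y_test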
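
Because the SMOTE data is split separately from the original data, it may be worth checking whether any test rows also appear in the SMOTE training set. The helper below is a minimal sketch of such a check; `count_shared_rows` is a hypothetical name, and the two small frames are made-up stand-ins for `X_train_smote` and `X_test`. It only detects exact copies (after rounding), not synthetic points interpolated from test rows.

```python
import pandas as pd

def count_shared_rows(train_df, test_df, decimals=6):
    """Count test rows that also appear, value for value, in the training frame."""
    # Round to avoid false mismatches from CSV float round-tripping.
    train_rows = train_df.round(decimals).drop_duplicates()
    test_rows = test_df.round(decimals)
    merged = test_rows.merge(
        train_rows, how="left", on=list(test_rows.columns), indicator=True
    )
    return int((merged["_merge"] == "both").sum())

# Made-up frames standing in for X_train_smote and X_test.
train_demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 2.0, 3.0]})
test_demo = pd.DataFrame({"f1": [0.2, 0.9], "f2": [2.0, 9.0]})

print("shared rows:", count_shared_rows(train_demo, test_demo))
```

In the notebook this would be called as `count_shared_rows(X_train_smote, X_test)`; a non-zero count would indicate leakage between training and evaluation data.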

**Model Load**:

In [5]:
# Load trained models
base_xgb_opt = joblib.load("../models/optimized_base_xgb.pkl")
weighted_xgb_opt = joblib.load("../models/optimized_weighted_xgb.pkl")
smote_xgb_opt = joblib.load("../models/optimized_smote_xgb.pkl")

print("Models are loaded successfully.")

Models are loaded successfully.
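
With the models loaded, the train vs. test comparison could be sketched as follows. The helper `train_test_report` is hypothetical, and it assumes each model exposes scikit-learn style `predict` and `predict_proba` methods (as `XGBClassifier` does). To stay runnable on its own, the demo uses a `RandomForestClassifier` on synthetic data in place of the saved models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

def train_test_report(model, X_tr, y_tr, X_te, y_te):
    """Return precision, recall, F1 and AUC-ROC on train and test, plus their gap."""
    rows = {}
    for split, X_part, y_part in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X_part)
        proba = model.predict_proba(X_part)[:, 1]  # probability of the positive class
        rows[split] = {
            "precision": precision_score(y_part, pred, zero_division=0),
            "recall": recall_score(y_part, pred, zero_division=0),
            "f1": f1_score(y_part, pred, zero_division=0),
            "auc_roc": roc_auc_score(y_part, proba),
        }
    report = pd.DataFrame(rows).T
    # A large positive gap (train much better than test) is the overfitting signal.
    report.loc["gap"] = report.loc["train"] - report.loc["test"]
    return report

# Synthetic stand-in data and model; NOT the notebook's dataset or saved models.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

report = train_test_report(demo_model, X_tr, y_tr, X_te, y_te)
print(report.round(3))
```

For the notebook's own models, the equivalent call would be `train_test_report(base_xgb_opt, X_train, y_train, X_test, y_test)`, with the SMOTE model given `X_train_smote`, `y_train_smote` and the shared test set.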
