# Model Building & Evaluation

**Objective**:  
Train and evaluate fraud detection models using proper preprocessing, imbalance handling, and metrics.

This notebook uses modular functions from `src/model_preprocessing.py`.
___
## 1. Setup & Load Processed Data

In [1]:
import sys
import os
# Add project root (one directory above "notebooks")
sys.path.append(os.path.abspath(".."))

In [2]:
import pandas as pd
from src.model_preprocessing import prepare_data_for_modeling

In [3]:
# Load engineered datasets
fraud_df = pd.read_csv('../data/processed/fraud_data_engineered.csv')
cc_df = pd.read_csv('../data/processed/creditcard_processed.csv')
print("Fraud_Data loaded:", fraud_df.shape)
print("CreditCard loaded:", cc_df.shape)

Fraud_Data loaded: (151112, 13)
CreditCard loaded: (283726, 31)



## 2. Data Transformation & Handling Class Imbalance

a. `fraud_data_engineered.csv` 


- **Justification**: For the e-commerce dataset (about 9.4% fraud), I chose SMOTE over undersampling. Balancing the classes 1:1 by undersampling would have meant discarding roughly 90% of the legitimate transactions, sharply reducing the model's ability to learn normal behavior. Because the minority class is reasonably well represented rather than extremely rare, SMOTE can balance the training set while retaining every majority-class row.
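
The trade-off can be checked with a little arithmetic. The sketch below is not part of `src/model_preprocessing.py`; both helper names are hypothetical, and it assumes 1:1 balancing with the training-set fraud share of 9.36% that the preprocessing log in this section reports.

```python
def undersampling_discard_fraction(minority_share: float) -> float:
    """Fraction of majority-class rows dropped by 1:1 undersampling."""
    if not 0.0 < minority_share <= 0.5:
        raise ValueError("minority_share must be in (0, 0.5]")
    majority_share = 1.0 - minority_share
    # After 1:1 balancing, only as many majority rows survive as there are minority rows.
    return 1.0 - minority_share / majority_share


def smote_synthetic_per_real(minority_share: float) -> float:
    """Synthetic minority rows per real minority row needed for a 1:1 balance."""
    if not 0.0 < minority_share <= 0.5:
        raise ValueError("minority_share must be in (0, 0.5]")
    majority_share = 1.0 - minority_share
    return majority_share / minority_share - 1.0


fraud_share = 0.0936  # assumed: fraud share of the Fraud_Data training split
discarded = undersampling_discard_fraction(fraud_share)
synthetic_ratio = smote_synthetic_per_real(fraud_share)

print(f"Undersampling would discard {discarded:.1%} of legitimate rows")
print(f"SMOTE would add about {synthetic_ratio:.1f} synthetic rows per real fraud row")
```

With a fraud share near 9.4%, undersampling throws away close to nine tenths of the legitimate rows, while SMOTE needs fewer than ten synthetic rows per real fraud row.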

In [4]:
# Separate features and target
X_fraud = fraud_df.drop('class', axis=1)
y_fraud = fraud_df['class']

In [5]:
# Split and balance — SMOTE is good here (~9.4% fraud)
X_train_bal, y_train_bal, X_test_proc, y_test, preprocessor = prepare_data_for_modeling(
    X_fraud, y_fraud,
    dataset_name="Fraud_Data",
    imbalance_technique="smote",   # Change to "undersample" for creditcard
    test_size=0.2,
    random_state=42
)


=== Preparing Fraud_Data for Modeling ===
Before Handling imbalance:
Train: (120889, 12), Test: (30223, 12)
Fitting preprocessor on training data...
Applying SMOTE...
Class distribution BEFORE balancing:
{0: 0.9064, 1: 0.0936}
Class distribution AFTER balancing:
{0: 0.5, 1: 0.5}
✅ Ready for modeling! Train Shape: (219136, 197)
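
The logged shapes imply that the training set grew from 120,889 to 219,136 rows, so about 98,247 synthetic fraud rows were added. As a rough illustration of where such rows come from, here is a simplified, pure-Python sketch of the SMOTE interpolation idea. It is not the code inside `prepare_data_for_modeling` (whose implementation is not shown here) and not the imbalanced-learn implementation; the function name `smote_sketch` and the toy points are made up for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=42):
    """Create synthetic minority points by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (illustrative values only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE adds no information from outside the region the real fraud cases already occupy.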


b. `creditcard_processed.csv`


- **Justification**: For the credit card dataset, the class imbalance is extreme (0.17% fraud vs. 99.83% legitimate). Unlike the e-commerce data, standard SMOTE would have to generate roughly 600 synthetic samples for every real fraud instance to reach a 1:1 balance. At that ratio, the model risks learning the interpolation artifacts of SMOTE rather than real fraud patterns. I therefore opted for undersampling, which shrinks the majority class to match the minority class. The result is a balanced training set that contains only real transactions, so the model can learn the decision boundary without the risk of overfitting to synthetic data.
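
The "roughly 600" figure can be reproduced from the row counts in the preprocessing log for this dataset. This is a back-of-the-envelope check rather than part of the pipeline; it assumes the undersampled training set is exactly 1:1 and keeps every fraud row, which is what the logged class distribution suggests.

```python
train_rows = 226_980    # training rows before balancing (from the log)
balanced_rows = 756     # training rows after undersampling (from the log)

# Assumption: a 1:1 balance that keeps all fraud rows, so half of the
# balanced set is fraud.
fraud_rows = balanced_rows // 2
legit_rows = train_rows - fraud_rows

# SMOTE would need to add enough synthetic fraud rows to match the legit count.
synthetic_needed = legit_rows - fraud_rows
synthetic_per_real = synthetic_needed / fraud_rows

# Undersampling instead keeps only as many legit rows as there are fraud rows.
legit_kept_fraction = fraud_rows / legit_rows

print(f"Fraud rows in training split: {fraud_rows}")
print(f"Synthetic rows SMOTE would need: {synthetic_needed}")
print(f"Synthetic rows per real fraud row: {synthetic_per_real:.1f}")
print(f"Share of legitimate rows kept by undersampling: {legit_kept_fraction:.3%}")
```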

In [6]:
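# Separate features and target (this dataset uses a capitalized 'Class' column)
X_cc = cc_df.drop('Class', axis=1)
y_cc = cc_df['Class']
# Note: these names match the Fraud_Data cell above, so its results are overwritten here
X_train_bal, y_train_bal, X_test_proc, y_test, preprocessor = prepare_data_for_modeling(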
    X_cc, y_cc,
    dataset_name="CreditCard",
    imbalance_technique="undersample",  # or "smotetomek"
    test_size=0.2
)


=== Preparing CreditCard for Modeling ===
Before Handling imbalance:
Train: (226980, 30), Test: (56746, 30)
Fitting preprocessor on training data...
Applying UNDERSAMPLE...
Class distribution BEFORE balancing:
{0: 0.9983, 1: 0.0017}
Class distribution AFTER balancing:
{0: 0.5, 1: 0.5}
✅ Ready for modeling! Train Shape: (756, 30)
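
The balanced training set of 756 rows corresponds to 378 rows per class, while the 56,746-row test split keeps its original class distribution. For intuition, the following is a minimal pure-Python sketch of random undersampling. Whether `prepare_data_for_modeling` samples the majority class at random is an assumption, and `random_undersample` is a hypothetical helper, not a function from `src/model_preprocessing.py`.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the rarest class
    kept = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        # Sample without replacement; the rarest class keeps all of its rows.
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 995 legitimate rows and 5 fraud rows (illustrative only).
labels = [0] * 995 + [1] * 5
rows = list(range(len(labels)))

X_bal, y_bal = random_undersample(rows, labels)
print(Counter(y_bal))
```

Only the training split should be resampled this way, so that test metrics still reflect the real fraud rate.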
