# Fraud Detection Model Development

This notebook walks through the machine learning pipeline for fraud detection:

1. **Setup and Data Loading**
2. **Advanced Feature Engineering**
3. **Individual Model Training**
4. **Ensemble Model Training**
5. **Model Persistence**
6. **Final Results**

---

**Author**: Sunny Nguyen  
**Date**: September 2025  
**Objective**: Build a production-ready fraud detection system with 95%+ accuracy

## 🚀 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ Libraries imported successfully!")
print(f"📅 Training started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Import custom modules
import sys

# Make the project's src/ directory importable
# (the path is relative to the notebook's working directory)
sys.path.append('../src')

from data_processing.generate_data import create_fraud_dataset
from data_processing.feature_engineering import AdvancedFeatureEngineering
from models.fraud_detector import (
    RandomForestDetector, 
    XGBoostDetector, 
    LogisticRegressionDetector,
    EnsembleFraudDetector
)

print("✅ Custom modules imported successfully!")

In [None]:
# Generate synthetic fraud dataset
print("🔄 Creating synthetic fraud dataset...")

# Create a substantial dataset for training
DATASET_SIZE = 50000  # Increase for more robust training

df = create_fraud_dataset(n_samples=DATASET_SIZE)

print(f"📊 Dataset created with {len(df):,} transactions")
print(f"🎯 Fraud rate: {df['Class'].mean():.4f} ({df['Class'].mean()*100:.2f}%)")
print(f"💰 Average transaction amount: ${df['Amount'].mean():.2f}")

# Display basic statistics
print("\n📈 Dataset Overview:")
display(df.describe())
display(df.head())
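
The fraud rate printed above matters for metric choice: on imbalanced data, a model that never flags fraud can still score a high accuracy. The sketch below illustrates this with hypothetical labels (2% fraud, chosen only for illustration and not taken from `create_fraud_dataset`). The helper names `confusion_counts` and `summarize` are made up for this example.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from raw predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Hypothetical labels: 1,000 transactions, 20 of them fraudulent.
y_true = np.array([1] * 20 + [0] * 980)

# A "model" that labels every transaction as legitimate.
y_always_legit = np.zeros_like(y_true)

baseline = summarize(y_true, y_always_legit)
for metric, value in baseline.items():
    print(f"{metric}: {value:.4f}")
```

The baseline is accurate on almost every row yet catches no fraud at all, which is why recall, F1 and ROC-AUC are more informative targets than accuracy alone for this problem.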

## 🔧 2. Advanced Feature Engineering

In [None]:
# Initialize feature engineering pipeline
print("🔧 Initializing advanced feature engineering pipeline...")

fe_pipeline = AdvancedFeatureEngineering(target_column='Class')

# Apply comprehensive feature engineering
df_engineered = fe_pipeline.fit_transform(df)

print("\n📊 Feature Engineering Results:")
# Column counts include the 'Class' target column in both frames
print(f"Original columns (incl. target): {df.shape[1]}")
print(f"Engineered columns (incl. target): {df_engineered.shape[1]}")
print(f"Columns added: {df_engineered.shape[1] - df.shape[1]}")

# Show feature engineering summary
feature_summary = fe_pipeline.get_feature_importance_summary()
print(f"\n🎯 Total features created: {feature_summary['total_features_created']}")
print(f"📝 Encoders fitted: {len(feature_summary['encoders_fitted'])}")
print(f"📏 Scalers fitted: {len(feature_summary['scalers_fitted'])}")
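
One thing to watch in this step: the pipeline is fitted on the full dataset before any train/test split, so scalers and encoders may see rows that later end up in the test set. Whether `AdvancedFeatureEngineering` exposes a separate `transform` method is not shown in this notebook, so the sketch below uses scikit-learn's `StandardScaler` on a hypothetical `Amount`-like column to show the general pattern of fitting on training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-ins (assumptions): a skewed amount column and ~5% fraud labels.
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))
labels = (rng.random(1000) < 0.05).astype(int)

# Split first, keeping the fraud ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    amounts, labels, test_size=0.2, stratify=labels, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse them; never refit on test rows

print(f"Train mean after scaling: {X_train_scaled.mean():.4f}")
print(f"Test mean after scaling:  {X_test_scaled.mean():.4f}")
```

The test mean is generally not exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.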

## 🤖 3. Individual Model Training

In [None]:
# Prepare features and target
X = df_engineered.drop('Class', axis=1)
y = df_engineered['Class']

print("🎯 Final dataset for training:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")
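
The cell above only prepares `X` and `y`; the imported `RandomForestDetector`, `XGBoostDetector` and `LogisticRegressionDetector` classes are not exercised here, and their interfaces are not shown. As a rough sketch of what training and comparing individual models can look like, the example below uses plain scikit-learn estimators on stand-in data from `make_classification`. The data, model settings and the 0.5 threshold are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): roughly 5% positives to mimic fraud imbalance.
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" reweights the rare class instead of resampling it.
candidates = {
    "logistic_regression": LogisticRegression(
        max_iter=1000, class_weight="balanced"
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    preds = (proba >= 0.5).astype(int)
    results[name] = {
        "f1_score": f1_score(y_te, preds, zero_division=0),
        "roc_auc": roc_auc_score(y_te, proba),
    }
    print(
        f"{name}: F1={results[name]['f1_score']:.4f}, "
        f"ROC-AUC={results[name]['roc_auc']:.4f}"
    )
```

Comparing the per-model scores on the same held-out split gives a reference point for judging whether the ensemble in the next section actually improves on its members.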

## 🎯 4. Ensemble Model Training

In [None]:
# Train ensemble model
print("🎯 Training Ensemble Fraud Detector...")

ensemble = EnsembleFraudDetector(random_state=42)
ensemble.train(X, y, test_size=0.2, balance_data=True)

print("\n✅ Ensemble training completed!")
print(f"🎯 Ensemble F1 Score: {ensemble.ensemble_metrics['f1_score']:.4f}")
print(f"🎯 Ensemble ROC-AUC: {ensemble.ensemble_metrics['roc_auc']:.4f}")
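
How `EnsembleFraudDetector` combines its base models is not shown in this notebook. One common approach is soft voting, where each model's fraud probability is averaged (optionally with weights) and a threshold is applied to the result. The sketch below shows that idea with NumPy; the `soft_vote` helper, the probabilities and the 0.5 threshold are hypothetical.

```python
import numpy as np


def soft_vote(probabilities, weights=None):
    """Weighted average of per-model fraud probabilities.

    probabilities: array-like of shape (n_models, n_samples)
    weights: optional array-like of shape (n_models,)
    """
    probs = np.asarray(probabilities, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    return np.average(probs, axis=0, weights=np.asarray(weights, dtype=float))


# Hypothetical fraud probabilities from three models for three transactions.
model_probs = [
    [0.10, 0.80, 0.30],  # e.g. logistic regression
    [0.20, 0.90, 0.60],  # e.g. random forest
    [0.00, 0.70, 0.45],  # e.g. gradient boosting
]

ensemble_proba = soft_vote(model_probs)
ensemble_pred = (ensemble_proba >= 0.5).astype(int)

print("Ensemble probabilities:", np.round(ensemble_proba, 3))
print("Ensemble predictions:  ", ensemble_pred)
```

Passing `weights` lets stronger base models count for more, for example weights proportional to each model's validation F1 score.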

## 💾 5. Model Persistence

In [None]:
# Save trained models
import os
import joblib

os.makedirs('../models', exist_ok=True)

# Save feature engineering pipeline
joblib.dump(fe_pipeline, '../models/feature_engineering_pipeline.pkl')

# Save ensemble model
ensemble.save_ensemble('../models/fraud_detection_ensemble.pkl')

print("✅ All models saved successfully!")
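
To reuse the saved feature engineering pipeline, `joblib.load` is the counterpart of `joblib.dump`. The sketch below checks a save/load round trip using a fitted `StandardScaler` as a stand-in for `fe_pipeline` and a temporary directory instead of `../models`; both substitutions are assumptions made to keep the example self-contained. The loading counterpart of `ensemble.save_ensemble` is not shown in this notebook, so it is left out.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for fe_pipeline (assumption): any fitted, picklable object works the same way.
fitted = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feature_engineering_pipeline.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)  # the temp file is deleted when the block exits

# The restored object should transform new data exactly like the original.
sample = np.array([[2.5]])
same_output = np.allclose(fitted.transform(sample), restored.transform(sample))
print(f"Round trip preserved behavior: {same_output}")
```

Two practical notes: loading a pickled custom class requires its defining module to be importable in the loading process (here, `../src` on `sys.path`), and pickle-based files should only be loaded from trusted sources.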

## 📊 6. Final Results

In [None]:
# Print final summary
print("🎯 FRAUD DETECTION MODEL TRAINING COMPLETE!")
print(f"📊 Dataset: {len(df):,} transactions")
print("🎯 Final Performance:")
for metric, value in ensemble.ensemble_metrics.items():
    print(f"  • {metric.replace('_', ' ').title()}: {value:.4f}")
print("🚀 Ready for production deployment!")