# Model Training and Evaluation
## Transaction Fraud Detection System

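This notebook covers:
1. Load and prepare data
2. Time-based train/test split
3. Baseline model: Logistic Regression
4. Main model: LightGBM
5. SMOTE comparison (class weights vs SMOTE)
6. Model comparison
7. Feature importance analysis
8. Confusion matrix analysis
9. Precision@K and Recall@K analysis
10. Saving the best model
11. Key findings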

In [None]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

from data_loader import load_raw_data, clean_data
from features import engineer_features, get_feature_columns
from train import (
    time_based_split,
    train_baseline,
    train_lightgbm,
    train_with_smote,
    save_model
)
from evaluate import (
    evaluate_model,
    compare_models,
    plot_pr_curve,
    plot_confusion_matrix,
    plot_precision_recall_at_k
)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 1. Load and Prepare Data

In [None]:
# Load data
# For full dataset: df = load_raw_data()
# For sample: df = load_raw_data(nrows=500000)

df = load_raw_data(nrows=500000)  # Adjust based on your system
print(f"\nLoaded {len(df):,} transactions")

In [None]:
# Clean data
df_clean = clean_data(df)
print(f"\nAfter cleaning: {len(df_clean):,} transactions")

In [None]:
# Engineer features
df_featured, encoders = engineer_features(df_clean)
print(f"\nAfter feature engineering: {len(df_featured.columns)} columns")

In [None]:
# Get feature columns
feature_cols = get_feature_columns(df_featured)
print(f"\nNumber of features for modeling: {len(feature_cols)}")
print("\nFeature columns:")
for i, col in enumerate(feature_cols, 1):
    print(f"{i:2d}. {col}")

## 2. Time-Based Train/Test Split

**Critical:** We use a time-based split, not a random split.
- In production, we train on historical data and predict on future data
- Random splitting would leak future information into training
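
As a rough illustration of the idea above, a time-based split can be sketched as sorting by a time column and holding out the most recent rows. This is not the implementation of `time_based_split` in `train.py`, which is not shown here; the helper name `time_split_sketch` and the time column `step` are assumptions made for the example.

```python
import pandas as pd

def time_split_sketch(df: pd.DataFrame, time_col: str, test_size: float = 0.2):
    """Hold out the most recent `test_size` fraction of rows as the test set."""
    # Stable sort so rows sharing a timestamp keep their original relative order.
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data: `step` is a hypothetical time column used only for this sketch.
toy = pd.DataFrame({
    "step": [3, 1, 2, 5, 4, 7, 6, 9, 8, 10],
    "isFraud": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})
train_part, test_part = time_split_sketch(toy, time_col="step", test_size=0.2)

# Every training row is no later than every test row.
print(train_part["step"].max(), test_part["step"].min())
```

With a row-count cutoff like this, rows that share a timestamp can land on both sides of the boundary; cutting on a timestamp value instead avoids that, at the cost of an inexact test fraction.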

In [None]:
# Perform time-based split
train_df, test_df = time_based_split(df_featured, test_size=0.2)

# Prepare features and labels
X_train = train_df[feature_cols]
y_train = train_df['isFraud']
X_test = test_df[feature_cols]
y_test = test_df['isFraud']

print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

## 3. Baseline Model: Logistic Regression

Start with a simple, interpretable baseline.

In [None]:
# Train baseline model with class weights
baseline_model = train_baseline(X_train, y_train, X_test, y_test, use_class_weights=True)

In [None]:
# Evaluate baseline model
baseline_results = evaluate_model(
    baseline_model, 
    X_test, 
    y_test,
    model_name='Logistic Regression (Baseline)',
    threshold=0.5,
    k_values=[100, 500, 1000, 5000],
    model_type='sklearn'
)

In [None]:
# Plot PR curve for baseline
plot_pr_curve(
    y_test, 
    baseline_results['y_pred_proba'],
    model_name='Logistic Regression',
    save_path='../outputs/baseline_pr_curve.png'
)
plt.show()

## 4. Main Model: LightGBM

We use LightGBM as the main model because it:
- Performs strongly on tabular data
- Supports class weighting for imbalanced labels
- Trains and predicts quickly
- Provides built-in feature importance

In [None]:
# Train LightGBM with class weights
lgb_model = train_lightgbm(X_train, y_train, X_test, y_test, use_class_weights=True)

In [None]:
# Evaluate LightGBM model
lgb_results = evaluate_model(
    lgb_model,
    X_test,
    y_test,
    model_name='LightGBM (Main Model)',
    threshold=0.5,
    k_values=[100, 500, 1000, 5000],
    model_type='lightgbm'
)

In [None]:
# Plot PR curve for LightGBM
plot_pr_curve(
    y_test,
    lgb_results['y_pred_proba'],
    model_name='LightGBM',
    save_path='../outputs/lightgbm_pr_curve.png'
)
plt.show()

## 5. SMOTE Comparison

Compare class weights vs SMOTE oversampling.
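
To make the class-weight side of this comparison concrete, here is a minimal sketch of the common "balanced" heuristic, where each class is weighted by `n_samples / (n_classes * class_count)`. Whether `train_baseline` and `train_lightgbm` use exactly this formula is an assumption, since their source is not shown here; the helper `balanced_class_weights` and the toy label counts are illustrative only.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: 990 legitimate, 10 fraud (illustrative numbers, not from the dataset).
y_toy = np.array([0] * 990 + [1] * 10)
weights = balanced_class_weights(y_toy)

# The ratio of the two weights equals n_negative / n_positive.
pos_to_neg_ratio = weights[1] / weights[0]
print(weights, pos_to_neg_ratio)
```

Class weights leave the training rows untouched and only rescale each row's contribution to the loss. SMOTE instead adds synthetic minority rows, interpolated between a minority sample and one of its nearest minority neighbors, which is why it enlarges the training set and usually takes longer.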

In [None]:
# Train LightGBM with SMOTE
# Note: This may take longer due to oversampling
lgb_smote_model = train_with_smote(
    X_train, y_train, X_test, y_test,
    model_type='lightgbm'
)

In [None]:
# Evaluate SMOTE model
lgb_smote_results = evaluate_model(
    lgb_smote_model,
    X_test,
    y_test,
    model_name='LightGBM + SMOTE',
    threshold=0.5,
    k_values=[100, 500, 1000, 5000],
    model_type='lightgbm'
)

## 6. Model Comparison

In [None]:
# Compare all models
comparison_df = compare_models({
    'Logistic Regression': baseline_results,
    'LightGBM (Class Weights)': lgb_results,
    'LightGBM + SMOTE': lgb_smote_results
})

comparison_df

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['PR-AUC', 'Precision@100', 'Recall@1000', 'F1-Score']
for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    comparison_df.plot(x='Model', y=metric, kind='bar', ax=ax, legend=False)
    ax.set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    ax.set_xlabel('')
    ax.set_ylabel(metric)
    ax.tick_params(axis='x', rotation=45)
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Plot PR curves for all models
fig, ax = plt.subplots(figsize=(12, 8))

from sklearn.metrics import precision_recall_curve, auc

for name, results in [
    ('Logistic Regression', baseline_results),
    ('LightGBM (Class Weights)', lgb_results),
    ('LightGBM + SMOTE', lgb_smote_results)
]:
    precision, recall, _ = precision_recall_curve(y_test, results['y_pred_proba'])
    # Trapezoidal area under the PR curve; this is not average precision (AP)
    pr_auc = auc(recall, precision)
    ax.plot(recall, precision, linewidth=2, label=f'{name} (PR-AUC={pr_auc:.4f})')

ax.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Baseline (Random): {y_test.mean():.4f}')
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curves - All Models', fontsize=14, fontweight='bold')
ax.legend(loc='best', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/all_models_pr_curves.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Feature Importance Analysis

Understanding which features drive predictions.

In [None]:
# Get feature importance from LightGBM
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': lgb_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)

print("\nTop 20 Most Important Features:")
print("="*70)
print(feature_importance.head(20).to_string(index=False))

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(12, 10))

top_20 = feature_importance.head(20)
ax.barh(range(len(top_20)), top_20['importance'], color='steelblue')
ax.set_yticks(range(len(top_20)))
ax.set_yticklabels(top_20['feature'])
ax.set_xlabel('Importance (Gain)', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_title('Top 20 Feature Importance (LightGBM)', fontsize=14, fontweight='bold')
ax.invert_yaxis()
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

## 8. Confusion Matrix Analysis

In [None]:
# Plot confusion matrix for best model (LightGBM)
plot_confusion_matrix(
    y_test,
    lgb_results['y_pred'],
    model_name='LightGBM',
    save_path='../outputs/confusion_matrix.png'
)
plt.show()

## 9. Precision@K and Recall@K Analysis

In [None]:
# Plot Precision@K and Recall@K curves
plot_precision_recall_at_k(
    y_test,
    lgb_results['y_pred_proba'],
    k_values=None,  # Auto-generate K values
    save_path='../outputs/precision_recall_at_k.png'
)
plt.show()

## 10. Save Best Model

In [None]:
# Save LightGBM as the selected model, plus the baseline for reference
# Confirm this choice against Precision@K and PR-AUC in the comparison above

save_model(lgb_model, '../models/lightgbm_best.txt', model_type='lightgbm')
save_model(baseline_model, '../models/baseline_lr.pkl', model_type='sklearn')

print("\nModels saved successfully!")

## 11. Key Findings

### Model Performance
- **LightGBM significantly outperforms** the baseline Logistic Regression
- **Class weights vs SMOTE**: [Compare results from your run]
- **PR-AUC** is the most reliable metric for this imbalanced dataset

### Precision@K Insights
- **Precision@100**: High precision among the 100 highest-scored transactions means the top cases can be flagged with confidence
- **Recall@K**: The share of all frauds captured within the top K predictions
- The right K is a trade-off between precision and recall, set by business constraints
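
The two metrics above can be sketched as follows: rank transactions by score, take the top K, and count how many of them are true frauds. This is a minimal illustration, not the code in `evaluate.py`; the function name `precision_recall_at_k` and the toy labels and scores are assumptions for the example.

```python
import numpy as np

def precision_recall_at_k(y_true, scores, k):
    """Return (precision@k, recall@k) for the k highest-scored items."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = min(k, len(scores))
    # Indices of the k highest scores (stable sort keeps ties in input order).
    top_k = np.argsort(-scores, kind="stable")[:k]
    hits = y_true[top_k].sum()
    total_pos = y_true.sum()
    precision = hits / k
    recall = hits / total_pos if total_pos > 0 else 0.0
    return float(precision), float(recall)

# Toy example: 3 frauds among 8 transactions (made-up scores).
y_toy = [1, 0, 0, 1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4]
p_at_3, r_at_3 = precision_recall_at_k(y_toy, s_toy, k=3)
print(f"Precision@3={p_at_3:.3f}, Recall@3={r_at_3:.3f}")
```

In the toy data, the top 3 scores contain 2 of the 3 frauds, so both Precision@3 and Recall@3 come out to 2/3. Raising K can only keep recall the same or increase it, while precision tends to fall as lower-scored transactions are included.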

### Feature Importance
- **Balance inconsistency features** are most important
- **Transaction type** (TRANSFER/CASH_OUT) is critical
- **Amount-based features** provide additional signal

### Why Accuracy is Misleading
- With a 0.13% fraud rate, predicting "all legitimate" yields 99.87% accuracy but catches zero frauds
- **Precision@K and PR-AUC** are the right metrics
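
The arithmetic behind the accuracy figure above can be checked with a short worked example. The population size of 100,000 is an illustrative assumption; only the 0.13% fraud rate comes from the text.

```python
# Hypothetical population: 100,000 transactions at a 0.13% fraud rate.
n_total = 100_000
n_fraud = 130

# An "all legitimate" predictor never flags anything as fraud.
true_positives = 0
true_negatives = n_total - n_fraud

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / n_fraud

print(f"accuracy={accuracy:.4f}, recall={recall:.1f}")
# prints: accuracy=0.9987, recall=0.0
```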

### Next Steps
1. Threshold tuning for business constraints
2. SHAP explainability analysis
3. Generate fraud score predictions

In [None]:
print("\n" + "="*70)
print("MODEL TRAINING COMPLETE!")
print("="*70)
print("\nBest Model: LightGBM")
print(f"PR-AUC: {lgb_results['pr_auc']:.4f}")
print(f"Precision@100: {lgb_results['precision_at_100']:.4f}")
print(f"Recall@1000: {lgb_results['recall_at_1000']:.4f}")
print("\nNext notebook: 04_threshold_tuning.ipynb")