# 05_Model Comparison — Credit Card Fraud Detection

## Objective
This notebook compares multiple classification models for fraud detection
under the same preprocessing, evaluation metrics, and business-aligned
threshold selection framework.

## Persisting Final Model Artifacts

To enable downstream business evaluation and cost-sensitive analysis,
the trained models' outputs (test labels and predicted probabilities) are
persisted as artifacts and loaded below.

Saving model outputs ensures that:
- Cost evaluation is fully decoupled from model training
- All models are compared under identical data and assumptions
- Results are reproducible and production-oriented

The persisted artifacts will be reused in the cost evaluation stage
without retraining or re-running the full comparison pipeline.
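
A minimal sketch of how such an artifact might be written and read back is shown here. The dictionary keys `y_test` and `y_pred_proba` match the ones read later in this notebook; the temporary directory, the file name, and the toy arrays are assumptions for illustration and are not the project's actual training code.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Toy stand-ins for real test labels and predicted fraud probabilities.
y_test_demo = np.array([0, 0, 1, 0, 1])
y_proba_demo = np.array([0.05, 0.20, 0.90, 0.40, 0.65])

# Bundle everything the evaluation stage needs into one dictionary.
outputs_demo = {"y_test": y_test_demo, "y_pred_proba": y_proba_demo}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen only for this example.
    artifact_path = Path(tmp_dir) / "model_outputs_demo.pkl"
    joblib.dump(outputs_demo, artifact_path)
    restored = joblib.load(artifact_path)

# The restored arrays are identical to what was saved.
assert np.array_equal(restored["y_test"], y_test_demo)
assert np.array_equal(restored["y_pred_proba"], y_proba_demo)
```

Storing labels next to the probabilities keeps each artifact self-describing, so a later stage can compute any metric without access to the original data split.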

In [None]:
import numpy as np
import pandas as pd
import joblib

# Metrics
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    confusion_matrix
)

# Load the persisted outputs of each model
baseline_outputs = joblib.load("../artifacts/model_outputs_baseline.pkl")
rf_outputs = joblib.load("../artifacts/model_outputs_random_forest.pkl")
xgb_outputs = joblib.load("../artifacts/model_outputs_xgboost.pkl")

y_test = baseline_outputs["y_test"]

y_proba_baseline = baseline_outputs["y_pred_proba"]
y_proba_rf = rf_outputs["y_pred_proba"]
y_proba_xgb = xgb_outputs["y_pred_proba"]
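
Because `y_test` is taken from the baseline artifact only, it may be worth confirming that every artifact was produced on the same test labels. The sketch below is one way to do that. It assumes that the Random Forest and XGBoost artifacts also store a `y_test` entry, which this notebook does not show, and the helper name `check_shared_labels` is hypothetical.

```python
import numpy as np


def check_shared_labels(outputs_by_model, key="y_test"):
    """Return the shared label array, or raise ValueError if models disagree."""
    names = list(outputs_by_model)
    reference = np.asarray(outputs_by_model[names[0]][key])
    for name in names[1:]:
        candidate = np.asarray(outputs_by_model[name][key])
        # array_equal is False for different shapes as well as different values.
        if not np.array_equal(reference, candidate):
            raise ValueError(f"{key} of '{name}' differs from '{names[0]}'")
    return reference


# Toy usage with two artifacts that share the same labels.
demo_outputs = {
    "baseline": {"y_test": np.array([0, 1, 0, 1])},
    "random_forest": {"y_test": np.array([0, 1, 0, 1])},
}
shared_labels = check_shared_labels(demo_outputs)
```

With the real artifacts, the call would look like `check_shared_labels({"baseline": baseline_outputs, "rf": rf_outputs, "xgb": xgb_outputs})`, provided the assumed key exists in each file.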

## Unified Evaluation Framework

To ensure a fair and unbiased comparison between different models,
a unified evaluation function is used across all experiments.

This guarantees that:
- All models are evaluated using the same probability threshold logic
- Precision and recall are computed consistently
- Confusion matrices are directly comparable

This approach keeps metric definitions consistent across models and aligns
the comparison process with real-world model selection practices.


In [None]:
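def evaluate_model(y_true, y_proba, threshold):
    """
    Evaluate a model at a given probability threshold.

    Returns precision, recall, and the 2x2 confusion matrix
    (rows: true class, columns: predicted class).
    """
    # Flag a transaction as fraud when its probability reaches the threshold
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)

    # zero_division=0 avoids a warning when no transaction is flagged
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    # Fix the label order so cm[0, 1] is always the false-positive count
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

    return {
        "precision": precision,
        "recall": recall,
        "confusion_matrix": cm
    }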
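
Later cells read the false-positive count as `confusion_matrix[0, 1]`. The toy example below illustrates why that index is the right one: scikit-learn places true classes on rows and predicted classes on columns. The arrays and the 0.5 threshold are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels (1 = fraud) and predicted fraud probabilities.
y_true_cm = np.array([0, 0, 0, 1, 1])
y_proba_cm = np.array([0.1, 0.8, 0.4, 0.9, 0.3])
threshold_cm = 0.5

# Thresholding gives predictions [0, 1, 0, 1, 0].
y_pred_cm = (y_proba_cm >= threshold_cm).astype(int)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm_demo = confusion_matrix(y_true_cm, y_pred_cm, labels=[0, 1])
tn, fp, fn, tp = cm_demo.ravel()

# Precision and recall follow directly from the four counts.
precision_cm = tp / (tp + fp)
recall_cm = tp / (tp + fn)
```

Here one legitimate transaction is flagged (a false positive) and one fraud is missed (a false negative), so both precision and recall come out at 0.5.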

## Logistic Regression — Baseline Reference

The Logistic Regression model serves as a fixed baseline reference.
Its predictions were generated during the modeling stage and are reused
here without retraining to ensure a fair comparison.

### Baseline Performance — Threshold-Independent Evaluation

Before comparing multiple models, we first establish a reference performance
for the baseline Logistic Regression model.

PR-AUC is used as a threshold-independent metric that measures the model’s
ability to rank fraudulent transactions above normal ones. This metric is
particularly suitable for highly imbalanced fraud detection problems and
serves as the primary comparison criterion across all models.

Baseline PR-AUC is reproduced here using stored probabilities
to ensure consistency with other models.

In [None]:
# PR-AUC for baseline Logistic Regression
baseline_pr_auc = average_precision_score(
    y_test,
    y_proba_baseline
)

baseline_pr_auc
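
For intuition about what `average_precision_score` measures, the sketch below recomputes it by hand on a toy example: transactions are ranked by score, and precision is averaged over the ranks where a true fraud appears. This shortcut matches scikit-learn only when no two scores are tied, which is an assumption of the sketch; the helper name `average_precision_manual` and the toy data are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def average_precision_manual(y_true, y_score):
    """Average precision for untied scores: mean precision at each positive."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Rank transactions from highest to lowest score.
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    # Precision after each ranked item: true positives so far / items so far.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Recall only increases at positives, so only those ranks contribute.
    return float(precision_at_k[hits == 1].mean())


y_true_ap = np.array([0, 0, 1, 1])
y_score_ap = np.array([0.1, 0.4, 0.35, 0.8])

ap_manual = average_precision_manual(y_true_ap, y_score_ap)
ap_sklearn = average_precision_score(y_true_ap, y_score_ap)
```

Both values equal 5/6 on this toy input: the top-ranked item is a fraud (precision 1), and the second fraud appears at rank three (precision 2/3).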

### Baseline Performance — Operating Threshold Evaluation

While PR-AUC evaluates ranking quality, real-world fraud detection systems
require a concrete decision threshold.

The baseline model is therefore evaluated at the previously selected operating
threshold (0.7) to quantify the trade-off between fraud recall and false alert
volume. This establishes a practical reference point for comparing operational
performance across models.

In [None]:
baseline_threshold = 0.7

baseline_eval = evaluate_model(
    y_test,
    y_proba_baseline,
    baseline_threshold
)

baseline_eval

- For the baseline Logistic Regression model, the operating threshold was finalized
during the standalone evaluation stage and is reused here.

- For subsequent models, threshold tuning is performed within this notebook
to identify their optimal operating points under the same evaluation framework.

### Random Forest — Threshold-Independent Evaluation (PR-AUC)

In [None]:
rf_pr_auc = average_precision_score(
    y_test,
    y_proba_rf
)

rf_pr_auc

### Random Forest — Operating Threshold Evaluation

In [None]:
rf_threshold = 0.7

rf_eval = evaluate_model(
    y_test,
    y_proba_rf,
    rf_threshold
)

rf_eval

### Random Forest — Threshold Tuning

To better align Random Forest with fraud detection objectives,
multiple probability thresholds are evaluated to explore the
precision–recall trade-off and identify a more suitable operating point.

In [None]:
rf_thresholds = np.round(np.arange(0.3, 0.81, 0.05), 2)

rf_results = []

for t in rf_thresholds:
    eval_res = evaluate_model(y_test, y_proba_rf, t)
    
    rf_results.append({
        "threshold": t,
        "precision": eval_res["precision"],
        "recall": eval_res["recall"]
    })

rf_threshold_df = pd.DataFrame(rf_results)
rf_threshold_df
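
As an alternative to a fixed grid, `precision_recall_curve` returns every distinct threshold at once, which allows a rule such as "highest recall subject to a minimum precision". The sketch below illustrates that idea. The helper name `threshold_for_min_precision` and the 0.6 precision target are hypothetical and are not the selection rule used in this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_min_precision(y_true, y_proba, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # The final (precision=1, recall=0) pair has no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    candidates = np.flatnonzero(precision >= min_precision)
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(recall[candidates])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


y_true_pr = np.array([0, 0, 1, 1])
y_proba_pr = np.array([0.1, 0.4, 0.35, 0.8])

picked = threshold_for_min_precision(y_true_pr, y_proba_pr, min_precision=0.6)
unreachable = threshold_for_min_precision(y_true_pr, y_proba_pr, min_precision=1.1)
```

On the toy data the rule picks the threshold 0.35, where both frauds are caught and one of the three flagged transactions is a false alert. An unreachable precision target returns `None` instead of silently picking a poor threshold.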


### Random Forest — Threshold Selection

Based on the precision–recall trade-off in the table above, a threshold of
0.35 was selected as the operating point for Random Forest.

At this threshold the model produces far fewer false positives than the
baseline Logistic Regression while keeping fraud recall high, which gives
a good balance between fraud coverage and alert precision.

In [None]:
rf_final_threshold = 0.35

# Evaluate at threshold 0.35
rf_final_eval = evaluate_model(
    y_test,
    y_proba_rf,
    rf_final_threshold
)

rf_final_eval


## Interim Model Comparison — Logistic Regression vs Random Forest

At this stage, two models have been evaluated under the same preprocessing
and evaluation framework.

The Logistic Regression model prioritizes fraud recall, successfully detecting
most fraudulent transactions but generating a high volume of false positive alerts.

In contrast, the Random Forest model demonstrates substantially stronger
ranking performance (higher PR-AUC) and dramatically reduces false positives,
at the cost of a moderate reduction in fraud recall.

This comparison highlights the inherent trade-off between fraud detection
coverage and customer experience, and serves as a foundation for evaluating
more advanced models.

In [None]:
comparison_so_far = pd.DataFrame([
    {
        "model": "Logistic Regression",
        "pr_auc": baseline_pr_auc,
        "threshold": baseline_threshold,
        "precision": baseline_eval["precision"],
        "recall": baseline_eval["recall"],
        "false_positives": baseline_eval["confusion_matrix"][0, 1]
    },
    {
        "model": "Random Forest",
        "pr_auc": rf_pr_auc,
        "threshold": rf_final_threshold,
        "precision": rf_final_eval["precision"],
        "recall": rf_final_eval["recall"],
        "false_positives": rf_final_eval["confusion_matrix"][0, 1]
    }
])

comparison_so_far
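
The rows of a comparison table like this one can also be produced by a small helper, so that every column is computed from the same inputs and nothing is typed by hand. The sketch below shows one possible shape for it. The name `summarize_model` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_score,
    recall_score,
)


def summarize_model(name, y_true, y_proba, threshold):
    """Build one comparison-table row from labels, probabilities, threshold."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    y_pred = (y_proba >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return {
        "model": name,
        "pr_auc": average_precision_score(y_true, y_proba),
        "threshold": threshold,
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "false_positives": int(cm[0, 1]),
    }


# Toy data: four legitimate transactions followed by two frauds.
y_true_toy = np.array([0, 0, 0, 0, 1, 1])
proba_a = np.array([0.2, 0.6, 0.8, 0.1, 0.9, 0.75])
proba_b = np.array([0.1, 0.2, 0.4, 0.05, 0.7, 0.3])

summary_df = pd.DataFrame([
    summarize_model("Model A (toy)", y_true_toy, proba_a, 0.70),
    summarize_model("Model B (toy)", y_true_toy, proba_b, 0.35),
])
```

Adding a further model then takes one more `summarize_model` call with that model's probabilities and selected threshold.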

## Interim Conclusion

The Random Forest model substantially outperforms the Logistic Regression
baseline in terms of ranking quality (PR-AUC) and false positive reduction.

However, this improvement comes with a moderate decrease in fraud recall.
As a result, model selection depends on business priorities:
whether maximizing fraud detection coverage or minimizing customer disruption
is the primary objective.

This interim conclusion establishes a clear baseline for evaluating more
advanced models.

## Gradient Boosting Model — XGBoost

XGBoost, a gradient-boosted tree model, is evaluated to determine whether
boosting can achieve a better balance between fraud recall and false
positive reduction than either Logistic Regression or Random Forest.

As with the other models, its stored test-set probabilities are evaluated
under the same preprocessing and evaluation framework to ensure a fair
comparison.

### XGBoost — Threshold-Independent Evaluation (PR-AUC)

In [None]:
xgb_pr_auc = average_precision_score(y_test, y_proba_xgb)
xgb_pr_auc

### XGBoost — Threshold Tuning

In [None]:
xgb_thresholds = np.round(np.arange(0.3, 0.81, 0.05), 2)

xgb_results = []

for t in xgb_thresholds:
    eval_res = evaluate_model(y_test, y_proba_xgb, t)
    xgb_results.append({
        "threshold": t,
        "precision": eval_res["precision"],
        "recall": eval_res["recall"]
    })

xgb_threshold_df = pd.DataFrame(xgb_results)
xgb_threshold_df
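
The sweep above repeats the Random Forest loop line for line. One way to avoid the duplication is a reusable function, sketched below with plain NumPy so that the false-positive count appears next to precision and recall for every threshold. The name `sweep_thresholds` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def sweep_thresholds(y_true, y_proba, thresholds):
    """Return precision, recall, and false positives for each threshold."""
    is_fraud = np.asarray(y_true).astype(bool)
    y_proba = np.asarray(y_proba)
    rows = []
    for t in thresholds:
        flagged = y_proba >= t
        tp = int(np.sum(flagged & is_fraud))
        fp = int(np.sum(flagged & ~is_fraud))
        fn = int(np.sum(~flagged & is_fraud))
        rows.append({
            "threshold": float(t),
            # Report 0.0 instead of dividing by zero when nothing is flagged.
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "false_positives": fp,
        })
    return pd.DataFrame(rows)


# Same grid as in the notebook: 0.30, 0.35, ..., 0.80.
grid = np.round(np.arange(0.3, 0.81, 0.05), 2)

y_true_sweep = np.array([0, 0, 0, 1, 1, 1])
y_proba_sweep = np.array([0.10, 0.47, 0.62, 0.33, 0.58, 0.91])

sweep_df = sweep_thresholds(y_true_sweep, y_proba_sweep, grid)
```

Calling the same function once per model keeps the threshold grid and the metric definitions identical across Random Forest and XGBoost.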

### XGBoost — Selected Operating Threshold

Based on the precision–recall trade-off, a threshold of 0.50 was selected
as the operating point for XGBoost to balance fraud detection coverage
and false alert volume.

In [None]:
xgb_final_threshold = 0.50

xgb_final_eval = evaluate_model(
    y_test,
    y_proba_xgb,
    xgb_final_threshold
)

xgb_final_eval

## Final Model Comparison

The following table summarizes the performance of all evaluated models
using their selected operating thresholds.

Each model is compared in terms of ranking quality (PR-AUC), fraud recall,
precision, and false positive volume to support an informed final model
selection decision.

In [None]:
final_comparison = pd.DataFrame([
    {
        "model": "Logistic Regression",
        "pr_auc": baseline_pr_auc,
        "threshold": baseline_threshold,
        "precision": baseline_eval["precision"],
        "recall": baseline_eval["recall"],
        "false_positives": baseline_eval["confusion_matrix"][0, 1]
    },
    {
        "model": "Random Forest",
        "pr_auc": rf_pr_auc,
        "threshold": rf_final_threshold,
        "precision": rf_final_eval["precision"],
        "recall": rf_final_eval["recall"],
        "false_positives": rf_final_eval["confusion_matrix"][0, 1]
    },
    {
        "model": "XGBoost",
        "pr_auc": xgb_pr_auc,
        "threshold": xgb_final_threshold,
        "precision": xgb_final_eval["precision"],
        "recall": xgb_final_eval["recall"],
        "false_positives": xgb_final_eval["confusion_matrix"][0, 1]
    }
])

final_comparison

## Final Model Selection

Three models were evaluated under a unified preprocessing and evaluation
framework: Logistic Regression, Random Forest, and XGBoost.

Logistic Regression achieved the highest fraud recall but generated an
excessive number of false positive alerts, making it impractical for
real-world deployment.

Random Forest significantly reduced false positives but missed a larger
portion of fraudulent transactions.

XGBoost achieved the best overall balance, delivering the highest ranking
performance (PR-AUC), strong fraud recall, and a substantial reduction in
false positives compared to the baseline.

Based on this trade-off, XGBoost was selected as the preferred production
candidate under a moderate risk tolerance setting.
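
A risk tolerance setting can be made concrete by attaching a cost to each error type and comparing the totals at different operating points. The sketch below shows the arithmetic only. The costs (5 per false alert, 100 per missed fraud), the toy data, and the helper name `expected_cost` are placeholders, not figures or code from this project.

```python
import numpy as np


def expected_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Total cost of false alerts and missed frauds at a given threshold."""
    is_fraud = np.asarray(y_true).astype(bool)
    flagged = np.asarray(y_proba) >= threshold
    false_positives = int(np.sum(flagged & ~is_fraud))
    false_negatives = int(np.sum(~flagged & is_fraud))
    return false_positives * cost_fp + false_negatives * cost_fn


# Toy data: four legitimate transactions followed by two frauds.
y_true_cost = np.array([0, 0, 0, 0, 1, 1])
y_proba_cost = np.array([0.2, 0.6, 0.8, 0.1, 0.9, 0.4])

# Placeholder costs: a false alert costs 5, a missed fraud costs 100.
cost_at_050 = expected_cost(y_true_cost, y_proba_cost, 0.50, cost_fp=5, cost_fn=100)
cost_at_030 = expected_cost(y_true_cost, y_proba_cost, 0.30, cost_fp=5, cost_fn=100)
```

With these placeholder costs, lowering the threshold from 0.50 to 0.30 recovers the missed fraud without adding alerts, so the total drops from 110 to 10. A different cost ratio could reverse the ranking, which is why the preferred model depends on business priorities.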

## Precision–Recall Curve Comparison

The following plot compares the precision–recall trade-offs across all
evaluated models and visually supports the final model selection decision.

In [None]:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": y_proba_baseline,
    "Random Forest": y_proba_rf,
    "XGBoost": y_proba_xgb
}

plt.figure(figsize=(7, 5))

for name, proba in models.items():
    precision, recall, _ = precision_recall_curve(y_test, proba)
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()

## Final Remarks

This project demonstrates an end-to-end fraud detection pipeline,
from baseline modeling to advanced model comparison and business-aware
model selection.

Through systematic evaluation and threshold tuning, XGBoost was selected
as the final model due to its superior balance between fraud detection
coverage and false positive reduction.

The presented approach reflects real-world decision-making practices
in cost-sensitive and highly imbalanced classification problems.