## 04_evaluation — Model Performance Analysis

- This notebook evaluates the baseline fraud detection model using
probability-based metrics suitable for highly imbalanced datasets.
- The goal is to assess the model’s decision quality and determine
appropriate operating thresholds.

In [None]:
# Import libraries and load the model outputs
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    roc_curve,
    auc,
    precision_score,
    recall_score,
    confusion_matrix,
    classification_report
)

model_outputs = joblib.load("../artifacts/model_outputs_baseline.pkl")

y_test = model_outputs["y_test"]
y_pred_proba = model_outputs["y_pred_proba"]

### Why Probability-Based Evaluation?
In fraud detection, hard class predictions are insufficient due to
severe class imbalance. Evaluating predicted probabilities allows
flexible threshold selection and better control over precision–recall
trade-offs.
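
To make the imbalance concrete, the sketch below uses the test-set class counts that appear in the final confusion matrix of this notebook (98 fraud cases among 56,962 transactions). It scores a trivial rule that labels every transaction as normal. The rule is a hypothetical illustration and is not something this notebook trains.

```python
# Test-set class counts taken from the final confusion matrix in this notebook
n_fraud = 89 + 9            # true positives + false negatives
n_normal = 644 + 56_220     # false positives + true negatives
n_total = n_fraud + n_normal

# Hypothetical trivial rule: label every transaction as normal
trivial_accuracy = n_normal / n_total   # every normal transaction is "correct"
trivial_recall = 0 / n_fraud            # no fraud case is ever flagged

print(f"Fraud prevalence:                 {n_fraud / n_total:.4%}")
print(f"Accuracy of 'always normal' rule: {trivial_accuracy:.4%}")
print(f"Fraud recall of that rule:        {trivial_recall:.0%}")
```

Accuracy close to 100% alongside zero recall is why hard predictions scored by accuracy say little here, and why the rest of the notebook works with probabilities and thresholds.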

### Overall Model Discrimination
This section evaluates the model’s ability to discriminate between
fraudulent and normal transactions using probability-based metrics
that are independent of a specific classification threshold.

In [None]:
roc_auc = roc_auc_score(y_test, y_pred_proba)
pr_auc = average_precision_score(y_test, y_pred_proba)

roc_auc, pr_auc

- **ROC-AUC** measures the model’s ability to rank fraudulent transactions
  above normal ones across all thresholds.
- **PR-AUC** focuses on the minority (fraud) class and is more informative
  in imbalanced settings.

These results confirm that the model produces meaningful probability
scores and is suitable for threshold-based decision making.
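
The ranking interpretation of ROC-AUC can be checked by hand: it is the fraction of (fraud, normal) pairs in which the fraud case receives the higher score, with ties counted as half. The sketch below is a minimal illustration on toy data. `pairwise_auc`, `y_toy`, and `s_toy` are hypothetical names introduced here, not part of the notebook or of scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (fraud, normal) pairs ranked correctly.

    Hypothetical helper for illustration. It builds an n_pos x n_neg matrix,
    so it is only practical when one class is small.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]      # every fraud score minus every normal score
    wins = np.sum(diff > 0)                 # fraud ranked above normal
    ties = np.sum(diff == 0)                # ties count as half a win
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy labels and scores (assumed values, not model output)
y_toy = np.array([0, 0, 1, 0, 1, 0])
s_toy = np.array([0.05, 0.20, 0.35, 0.40, 0.90, 0.10])

print(pairwise_auc(y_toy, s_toy))  # 7 of the 8 pairs are ranked correctly
```

Applied to `y_test` and `y_pred_proba`, this should match `roc_auc_score`, since both compute the same rank statistic.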

### Precision–Recall Curve

In [None]:
precision_vals, recall_vals, thresholds = precision_recall_curve(
    y_test, y_pred_proba
)
# Plot PR curve
plt.figure(figsize=(6,4))
plt.plot(recall_vals, precision_vals)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.show()

The Precision–Recall curve illustrates the trade-off between detecting
fraudulent transactions and minimizing false alerts across different
classification thresholds.
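
The arrays returned by `precision_recall_curve` can also be used to read an operating point off the curve. The sketch below picks the highest threshold that still meets a target recall. The helper name, the toy data, and the target value are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def highest_threshold_for_recall(y_true, proba, target_recall):
    """Return (threshold, precision, recall) at the highest threshold whose
    recall still meets target_recall. Hypothetical helper, not a scikit-learn API."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # precision and recall have one more entry than thresholds; dropping the
    # final (precision=1, recall=0) point makes the three arrays line up.
    ok = np.where(recall[:-1] >= target_recall)[0]
    if ok.size == 0:
        raise ValueError("no threshold reaches the requested recall")
    i = ok[-1]  # thresholds are increasing, so the last match is the highest
    return float(thresholds[i]), float(precision[i]), float(recall[i])

# Toy labels and scores (assumed values, not model output)
y_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0])
p_toy = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.05])

t, prec, rec = highest_threshold_for_recall(y_toy, p_toy, target_recall=0.66)
print(f"threshold={t:.2f}, precision={prec:.3f}, recall={rec:.3f}")
```

In this notebook the same call could be made with `y_test` and `y_pred_proba`. The target recall would be a business decision rather than a value derived from the data.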

### ROC Curve

In [None]:
# Compute ROC curve points
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

# Compute Area Under the ROC Curve (AUC)
roc_auc_value = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"ROC AUC = {roc_auc_value:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # Random baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

The ROC curve confirms that the model achieves strong separation between
fraudulent and normal transactions across a wide range of thresholds.
However, precision–recall analysis remains more informative for
threshold selection in highly imbalanced fraud detection scenarios.
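
One way to see why the ROC view can look optimistic under imbalance is to convert a point on the ROC curve into precision, which depends on class prevalence: precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the fraud rate. The sketch below uses a hypothetical operating point (TPR = 0.90, FPR = 0.01) that is not read from this model's curve, together with the test-set prevalence implied by the confusion matrix in this notebook.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a ROC operating point at a given positive-class rate."""
    tp_share = tpr * prevalence            # share of all cases that are true positives
    fp_share = fpr * (1.0 - prevalence)    # share of all cases that are false positives
    return tp_share / (tp_share + fp_share)

TPR, FPR = 0.90, 0.01  # hypothetical operating point, for illustration only

# Prevalence from this notebook's test set: 98 fraud cases in 56,962 transactions
imbalanced = precision_from_rates(TPR, FPR, prevalence=98 / 56_962)
# Same ROC point on a hypothetical balanced dataset
balanced = precision_from_rates(TPR, FPR, prevalence=0.5)

print(f"Precision at fraud-level prevalence: {imbalanced:.3f}")
print(f"Precision at 50% prevalence:         {balanced:.3f}")
```

The same ROC point yields very different precision depending on prevalence, which is why the precision–recall view is used for threshold selection here.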

### Baseline Performance (Threshold = 0.5)

In [None]:
y_pred_05 = (y_pred_proba >= 0.5).astype(int)

# Evaluate model performance using the default threshold (0.5)
precision_05 = precision_score(y_test, y_pred_05)
recall_05 = recall_score(y_test, y_pred_05)
cm_05 = confusion_matrix(y_test, y_pred_05)

precision_05, recall_05, cm_05

### Cost Considerations
- In real-world fraud detection systems, false negatives typically incur
  a higher cost due to direct financial loss, while false positives impact
  customer experience and operational workload. Threshold selection should
  therefore be aligned with business risk tolerance rather than purely
  statistical optimality.

- The selected threshold reflects a conscious trade-off between minimizing
  missed fraud cases and controlling the volume of false alerts.
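
One way to make this trade-off explicit is to attach a cost to each error type and compare candidate thresholds by total cost. The sketch below is a minimal illustration. The cost values, the toy data, and the `total_cost` helper are assumptions, and real costs would have to come from the business.

```python
import numpy as np

def total_cost(y_true, proba, threshold, cost_fn, cost_fp):
    """Total cost of errors at a threshold (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(proba) >= threshold).astype(int)
    n_fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # missed fraud
    n_fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false alerts
    return n_fn * cost_fn + n_fp * cost_fp

# Toy labels and scores (assumed values, not model output)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
p_toy = np.array([0.10, 0.20, 0.35, 0.45, 0.55, 0.80, 0.60, 0.90])

COST_FN = 100.0  # hypothetical cost of one missed fraud case
COST_FP = 5.0    # hypothetical cost of investigating one false alert

candidates = [0.3, 0.5, 0.7]
costs = {t: total_cost(y_toy, p_toy, t, COST_FN, COST_FP) for t in candidates}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t:.1f}  total cost={c:.0f}")
print("lowest-cost threshold:", best_threshold)
```

With a different cost ratio the preferred threshold can shift, which is the sense in which the choice depends on risk tolerance rather than on a single statistical optimum.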

## Threshold Tuning
The default threshold (0.5) resulted in high recall but a large number of false positives.
To reduce unnecessary alerts while maintaining reasonable fraud detection performance, multiple threshold values will be evaluated.

In [None]:
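# Evaluate precision-recall trade-off across multiple thresholds.
# A separate name keeps the `thresholds` array returned by
# precision_recall_curve above from being overwritten.
candidate_thresholds = np.round(np.arange(0.3, 0.71, 0.05), 2)

results = []

for t in candidate_thresholds: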
    y_pred = (y_pred_proba >= t).astype(int)
    
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    results.append({
        "threshold": t,
        "precision": precision,
        "recall": recall
    })

threshold_results = pd.DataFrame(results)
threshold_results

Based on the precision-recall trade-off, a threshold of **0.7**
was selected as the final operating point.

This threshold provides:
- A significant reduction in false positive alerts.
- A high fraud recall (~91%), ensuring most fraud cases are still detected.
- A more practical balance between customer experience and fraud prevention.
  
**Note**
While a threshold of 0.7 was selected as the primary operating point, a slightly
lower threshold (0.65) may also be considered in more conservative scenarios.
On this test set it maintains the same recall level while generating a higher
volume of alerts, so it offers no measured recall gain here. Any benefit would
come from the extra safety margin for borderline fraud cases on new data. It may
suit environments where missing fraud cases is significantly more costly than
investigating additional false positives.

The final threshold choice should therefore be aligned with business risk
tolerance, operational capacity, and customer experience considerations.

In [None]:
# Final selected threshold
final_threshold = 0.7

# Generate final predictions
y_final_pred = (y_pred_proba >= final_threshold).astype(int)

## Confusion Matrix (Final Threshold)
The confusion matrix summarizes the model’s performance
using the selected threshold and highlights the balance
between detected fraud cases and false alerts.

In [None]:
final_cm = confusion_matrix(y_test, y_final_pred)

final_cm

In [None]:
plt.figure(figsize=(6, 4))
sns.heatmap(
    final_cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Normal", "Fraud"],
    yticklabels=["Normal", "Fraud"]
)

plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix (Threshold = 0.7)")
plt.tight_layout()
plt.show()

## Final Confusion Matrix Analysis
At the selected threshold (0.7), the confusion matrix shows:

- True Positives (Fraud detected): 89
- False Negatives (Fraud missed): 9
- False Positives (False alerts): 644
- True Negatives (Normal transactions correctly classified): 56,220

At the selected threshold:
- The number of false positives is significantly reduced compared to the default threshold.
- Only a small number of fraud cases are missed, which may be acceptable
  depending on business risk tolerance.
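
The counts above translate directly into the rates used elsewhere in this notebook. The short calculation below copies the four counts from the list above and derives recall, precision, false positive rate, and the number of alerts raised per fraud case caught. The variable names are introduced here for illustration.

```python
# Counts copied from the confusion matrix at threshold 0.7
tp, fn, fp, tn = 89, 9, 644, 56_220

recall = tp / (tp + fn)                   # share of fraud cases detected
precision = tp / (tp + fp)                # share of alerts that are real fraud
false_positive_rate = fp / (fp + tn)      # share of normal transactions flagged
alerts_per_caught_fraud = (tp + fp) / tp  # alerts reviewed per fraud case caught

print(f"Recall:                  {recall:.3f}")
print(f"Precision:               {precision:.3f}")
print(f"False positive rate:     {false_positive_rate:.4f}")
print(f"Alerts per caught fraud: {alerts_per_caught_fraud:.1f}")
```

A false positive rate near 1% still corresponds to several false alerts for each detected fraud case, because normal transactions vastly outnumber fraudulent ones.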

## Classification Report (Final Model)
The classification report provides detailed performance metrics
for both normal and fraudulent transactions.

In [None]:
print(classification_report(y_test, y_final_pred, digits=3))

## Classification Report Interpretation
The classification report highlights the effectiveness of the selected threshold (0.7)
in detecting fraudulent transactions.

- The model achieves a high recall (~91%) for the fraud class, ensuring that most
  fraudulent transactions are successfully detected.
- Precision for the fraud class remains relatively low (~12%, i.e. 89 true fraud
  cases among 733 alerts), which is expected in highly imbalanced fraud detection
  problems and reflects a trade-off to minimize missed fraud.
- Overall accuracy is high but not considered a primary metric due to class imbalance.
- Threshold tuning proved essential in reducing false alarms
  while maintaining strong fraud detection performance.

This performance aligns with real-world fraud detection requirements, where detecting
fraud is prioritized over minimizing false alerts.

## Evaluation Summary
- The baseline Logistic Regression model demonstrates strong ranking
  capability (ROC-AUC = 0.97).
- Threshold tuning was critical to control false positive rates.
- A threshold of 0.7 provides a practical balance between fraud detection
  and customer experience.
- This baseline establishes a solid reference point for future models.