# Classification - Imbalanced Data

This notebook discusses various strategies for dealing with imbalanced data. It also demonstrates probability calibration and threshold optimization.

## Setup

Install the `imbalanced-learn` package, which implements a number of over- and under-sampling methods with a scikit-learn-compatible API.

In [None]:
!python -m pip install imbalanced-learn

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import brier_score_loss


from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Dataset Generation

First, create a synthetic dataset using SKL's `make_classification` function. This models fraudulent charges as the minority class (5%), with 5x higher amounts.

In [None]:
# Create a synthetic dataset that mimics a fraud detection scenario
def create_fraud_dataset(n_samples=10000, fraud_ratio=0.05, n_features=10):
    """
    Create a synthetic fraud detection dataset with natural cost implications.
    
    Parameters:
    -----------
    n_samples : int
        The total number of samples.
    fraud_ratio : float
        The proportion of fraudulent transactions.
    n_features : int
        The number of features.
        
    Returns:
    --------
    X : DataFrame
        The feature matrix.
    y : Series
        The target vector (0: legitimate, 1: fraudulent).
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=5,
        n_redundant=3,
        n_repeated=0,
        n_classes=2,
        weights=[1-fraud_ratio, fraud_ratio],
        random_state=42
    )
    
    # Convert to pandas for easier handling
    feature_names = [f'feature_{i}' for i in range(n_features)]
    X_df = pd.DataFrame(X, columns=feature_names)
    y_df = pd.Series(y, name='fraud')
    
    # Add a transaction amount feature (higher for fraudulent transactions)
    # This will help illustrate the cost implications
    amounts = np.random.exponential(scale=100, size=n_samples)
    # Make fraudulent transactions have higher amounts on average
    amounts[y == 1] = amounts[y == 1] * 5
    X_df['transaction_amount'] = amounts
    
    return X_df, y_df

# Create the dataset
X, y = create_fraud_dataset(n_samples=10000, fraud_ratio=0.05)

In [None]:
X.head()

Display stats about the dataset.

In [None]:
# Display dataset info
print("Dataset Information:")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print("\nClass distribution:")
print(y.value_counts())
print("\nClass distribution (%):")
print(y.value_counts(normalize=True) * 100)

# Display transaction amount statistics by class
fraud_amounts = X.loc[y == 1, 'transaction_amount']
legit_amounts = X.loc[y == 0, 'transaction_amount']

print("\nTransaction Amount Statistics:")
print(f"Legitimate transactions - Mean: ${legit_amounts.mean():.2f}, Median: ${legit_amounts.median():.2f}")
print(f"Fraudulent transactions - Mean: ${fraud_amounts.mean():.2f}, Median: ${fraud_amounts.median():.2f}")

Look at the costs associated with fraudulent charges, assuming that the cost of a false positive (legitimate charge flagged as fraudulent) is, on average, $20 in customer service.

In [None]:
# Calculate potential cost implications
false_negative_cost = fraud_amounts.sum()
false_positive_cost = 20  # Assumed customer service cost per false alert ($)

print("\nCost Implications:")
print(f"Total cost of missing all fraud (False Negatives): ${false_negative_cost:.2f}")
print(f"Cost per false alert (False Positive): ${false_positive_cost:.2f}")

# Visualize the transaction amounts by class
plt.figure(figsize=(10, 6))
plt.hist([legit_amounts, fraud_amounts], bins=50, alpha=0.6, 
         label=['Legitimate', 'Fraudulent'], color=['blue', 'red'])
plt.title('Transaction Amount Distribution by Class')
plt.xlabel('Transaction Amount ($)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Visualize class imbalance
plt.figure(figsize=(8, 6))
ax = sns.countplot(x=y)
plt.title('Class Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraudulent)')
plt.ylabel('Count')

# Add count labels
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', 
                (p.get_x() + p.get_width()/2., p.get_height()), 
                ha='center', va='bottom')

plt.show()

Split the data 70/30 into training and test sets, using stratified sampling so that both sets preserve the class ratio.

In [None]:
# Split the data for later use
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("\nTraining/Testing Split:")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
print(f"Fraud ratio in training: {(y_train == 1).mean():.2%}")
print(f"Fraud ratio in testing: {(y_test == 1).mean():.2%}")

## Baseline Models

Build and evaluate three baseline models:

1. Majority class predictor
2. Vanilla logistic regression 
3. Vanilla KNN

But first, define a reusable function to evaluate and compare models.

In [None]:
# Function to evaluate and compare models
def evaluate_model(y_true, y_pred, model_name):
    """Evaluate model performance with various metrics"""
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    # Print results
    print(f"\n--- {model_name} ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    
    # Display confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    print("\nConfusion Matrix:")
    print(f"TN: {cm[0,0]}, FP: {cm[0,1]}")
    print(f"FN: {cm[1,0]}, TP: {cm[1,1]}")
    
    return accuracy, precision, recall, f1

For each model we will:
1. instantiate the model
2. fit the instance on training data
3. generate predictions for the test data
4. compare the predictions with the test labels to evaluate the model

First use `DummyClassifier` to train a majority class predictor.

In [None]:
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train, y_train)
dummy_pred = dummy_clf.predict(X_test)
dummy_metrics = evaluate_model(y_test, dummy_pred, "Majority Class Predictor")

Next, fit a `LogisticRegression` model.

In [None]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg_metrics = evaluate_model(y_test, logreg_pred, "Vanilla Logistic Regression")

Now, KNN with 5 neighbors using `KNeighborsClassifier`.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_metrics = evaluate_model(y_test, knn_pred, "Vanilla KNN")

Compare the results in a table.

In [None]:
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
models = ['Majority Class', 'Logistic Regression', 'KNN']
results = pd.DataFrame({
    'Majority Class': dummy_metrics,
    'Logistic Regression': logreg_metrics,
    'KNN': knn_metrics
}, index=metrics)

# Display table
print("\n--- Model Comparison ---")
print(results)

Based on F1-Score, LogReg comes out on top, so far. How does it fare on ROC AUC?

In [None]:
# Get predicted probabilities for the positive class
y_prob = logreg.predict_proba(X_test)[:, 1]

# Calculate ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Calculate ROC AUC
roc_auc = roc_auc_score(y_test, y_prob)

# Plot ROC curve
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Logistic Regression')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)

# Mark some threshold points on the curve
# Choose a few interesting thresholds to highlight
threshold_indices = [
   min(range(len(thresholds)), key=lambda i: abs(thresholds[i] - 0.3)),
   min(range(len(thresholds)), key=lambda i: abs(thresholds[i] - 0.5)),
   min(range(len(thresholds)), key=lambda i: abs(thresholds[i] - 0.7))
]

# Mark those points on the curve
for i in threshold_indices:
   plt.plot(fpr[i], tpr[i], 'ro')
   plt.annotate(f'threshold = {thresholds[i]:.2f}', 
                (fpr[i], tpr[i]), 
                xytext=(10, -10),
                textcoords='offset points')

plt.show()

# Print additional information
print(f"ROC AUC Score: {roc_auc:.4f}")
print("\nThreshold analysis:")
for i in threshold_indices:
   print(f"Threshold: {thresholds[i]:.2f}, FPR: {fpr[i]:.4f}, TPR: {tpr[i]:.4f}")

## Class Weight Adjustment

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, confusion_matrix, roc_auc_score)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Define evaluation function
def evaluate_model(y_true, y_pred, y_prob, model_name):
   """Evaluate model performance with various metrics"""
   # Calculate metrics
   accuracy = accuracy_score(y_true, y_pred)
   precision = precision_score(y_true, y_pred)
   recall = recall_score(y_true, y_pred)
   f1 = f1_score(y_true, y_pred)
   roc_auc = roc_auc_score(y_true, y_prob)
   
   # Print results
   print(f"\n--- {model_name} ---")
   print(f"Accuracy: {accuracy:.4f}")
   print(f"Precision: {precision:.4f}")
   print(f"Recall: {recall:.4f}")
   print(f"F1 Score: {f1:.4f}")
   print(f"ROC AUC: {roc_auc:.4f}")
   
   # Display confusion matrix
   cm = confusion_matrix(y_true, y_pred)
   print("\nConfusion Matrix:")
   print(f"TN: {cm[0,0]}, FP: {cm[0,1]}")
   print(f"FN: {cm[1,0]}, TP: {cm[1,1]}")
   
   return accuracy, precision, recall, f1, roc_auc

Now let's analyze the effect of class weight adjustment on the LogReg model. Not all models (e.g., KNN) support this approach.

It is achieved with the `class_weight` argument, which can be set to several values. First, let's use the `balanced` option, which automatically adjusts weights inversely proportional to class frequencies.

In [None]:
logreg_balanced = LogisticRegression(class_weight='balanced', random_state=42)
logreg_balanced.fit(X_train, y_train)
logreg_balanced_pred = logreg_balanced.predict(X_test)
logreg_balanced_prob = logreg_balanced.predict_proba(X_test)[:, 1]
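As a rough illustration of what `class_weight='balanced'` computes: scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a small made-up label vector; the 95/5 split is an assumption chosen to resemble this dataset, not the notebook's actual counts.

```python
import numpy as np

# Hypothetical labels: 95 legitimate (0) and 5 fraudulent (1) transactions
y_example = np.array([0] * 95 + [1] * 5)

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_example)
n_classes = len(counts)
balanced_weights = len(y_example) / (n_classes * counts)

# Map each class label to its weight
print(dict(enumerate(balanced_weights)))
```

With a 5% minority class the fraud weight works out to 19 times the legitimate weight, so the rarer the class, the more each of its errors counts in the loss.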

Another option is to explicitly set the weights using a dictionary. Here we set the fraud class as 20x more important than non-fraud.

In [None]:
class_weights = {0: 1, 1: 20}
logreg_custom = LogisticRegression(class_weight=class_weights, solver='liblinear', random_state=42)
logreg_custom.fit(X_train, y_train)
logreg_custom_pred = logreg_custom.predict(X_test)
logreg_custom_prob = logreg_custom.predict_proba(X_test)[:, 1]

Summarize the results.

In [None]:
# Evaluate all models
print("=== CLASS WEIGHT ADJUSTMENT ANALYSIS ===")

# Baseline models (from previous cell)
baseline_logreg_metrics = evaluate_model(y_test, logreg_pred, logreg.predict_proba(X_test)[:, 1], "Baseline LogReg")
baseline_knn_metrics = evaluate_model(y_test, knn_pred, knn.predict_proba(X_test)[:, 1], "Baseline KNN")

# Weighted models
balanced_logreg_metrics = evaluate_model(y_test, logreg_balanced_pred, logreg_balanced_prob, "Balanced LogReg")
custom_logreg_metrics = evaluate_model(y_test, logreg_custom_pred, logreg_custom_prob, "Custom Weight LogReg")

# Compare all models in a table
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
results = pd.DataFrame({
   'Baseline LogReg': baseline_logreg_metrics,
   'Balanced LogReg': balanced_logreg_metrics,
   'Custom LogReg': custom_logreg_metrics,
   'Baseline KNN': baseline_knn_metrics
}, index=metrics)

# Display table
print("\n--- Model Comparison ---")
print(results)

From this we can see that the class weight adjustments improved the ROC AUC of our LogReg model, but the F1 Score decreased.

Weighted models catch more of the minority class instances (higher recall) but generate more false alarms (lower precision). This effect grows with the weight on the fraud class: the custom-weighted model has higher recall than the balanced one, etc.

Let's look at the ROC curves...

In [None]:
# Visualize ROC curves for all models
plt.figure(figsize=(7, 5))

# Function to plot ROC curve
def plot_roc_curve(y_true, y_prob, model_name, color):
   fpr, tpr, _ = roc_curve(y_true, y_prob)
   roc_auc = roc_auc_score(y_true, y_prob)
   plt.plot(fpr, tpr, color=color, lw=2, label=f'{model_name} (AUC = {roc_auc:.3f})')

# Plot curves for each model
plot_roc_curve(y_test, logreg.predict_proba(X_test)[:, 1], 'Baseline LogReg', 'blue')
plot_roc_curve(y_test, logreg_balanced_prob, 'Balanced LogReg', 'green')
plot_roc_curve(y_test, logreg_custom_prob, 'Custom LogReg', 'red')
plot_roc_curve(y_test, knn.predict_proba(X_test)[:, 1], 'Baseline KNN', 'purple')

# Add diagonal line (random classifier)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Class Weight Adjustment Comparison')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

Doesn't seem like much of a difference. What does a 0.0037 increase in ROC AUC (balanced - baseline) equate to in dollars? The following code calculates the associated costs and benefits for the 3000 observations in the test data.

In [None]:
# Estimate the financial benefit of the Balanced LogReg model over Baseline LogReg

# Get predictions from both models
baseline_pred = logreg.predict(X_test)
baseline_prob = logreg.predict_proba(X_test)[:, 1]
balanced_pred = logreg_balanced.predict(X_test)
balanced_prob = logreg_balanced.predict_proba(X_test)[:, 1]

# Get transaction amounts from test set
test_amounts = X_test['transaction_amount'].values

# Define a function to calculate financial impact
def calculate_financial_impact(y_true, y_pred, transaction_amounts):
    """
    Calculate financial impact of model predictions
    
    Parameters:
    - y_true: True labels (0 = legitimate, 1 = fraud)
    - y_pred: Predicted labels
    - transaction_amounts: Transaction amounts in dollars
    
    Returns:
    - Dictionary with financial metrics
    """
    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    transaction_amounts = np.array(transaction_amounts)
    
    # Identify different prediction types
    true_positives = (y_true == 1) & (y_pred == 1)  # Correctly identified fraud
    false_negatives = (y_true == 1) & (y_pred == 0)  # Missed fraud
    false_positives = (y_true == 0) & (y_pred == 1)  # False alarms
    
    # Calculate financial impact
    saved_amount = np.sum(transaction_amounts[true_positives])  # Money saved by catching fraud
    lost_amount = np.sum(transaction_amounts[false_negatives])  # Money lost by missing fraud
    investigation_cost = np.sum(false_positives) * 20  # Cost of investigating false alarms ($20 per case)
    
    # Net financial impact
    net_impact = saved_amount - investigation_cost
    
    return {
        'Saved Amount': saved_amount,
        'Lost Amount': lost_amount,
        'Investigation Cost': investigation_cost,
        'Net Financial Impact': net_impact,
        'Total Fraud Amount': saved_amount + lost_amount,
        'Caught Fraud Percentage': saved_amount / (saved_amount + lost_amount) * 100 if (saved_amount + lost_amount) > 0 else 0
    }

# Calculate financial impact for both models
baseline_impact = calculate_financial_impact(y_test, baseline_pred, test_amounts)
balanced_impact = calculate_financial_impact(y_test, balanced_pred, test_amounts)

# Display results
print("=== Financial Impact Analysis ===")
print("\nBaseline LogReg Model:")
for metric, value in baseline_impact.items():
    if 'Percentage' in metric:
        print(f"{metric}: {value:.2f}%")
    else:
        print(f"{metric}: ${value:.2f}")

print("\nBalanced LogReg Model:")
for metric, value in balanced_impact.items():
    if 'Percentage' in metric:
        print(f"{metric}: {value:.2f}%")
    else:
        print(f"{metric}: ${value:.2f}")

# Calculate difference
difference = {key: balanced_impact[key] - baseline_impact[key] for key in baseline_impact}

print("\nDifference (Balanced - Baseline):")
for metric, value in difference.items():
    if 'Percentage' in metric:
        print(f"{metric}: {value:.2f}%")
    else:
        print(f"{metric}: ${value:.2f}")

# Calculate ROI
if difference['Investigation Cost'] > 0:
    roi = (difference['Saved Amount'] - difference['Investigation Cost']) / difference['Investigation Cost'] * 100
    print(f"\nROI: {roi:.2f}%")
else:
    print("\nROI: Cannot calculate (no additional investigation cost)")

# Plot the financial comparison
import matplotlib.pyplot as plt

metrics = ['Saved Amount', 'Lost Amount', 'Investigation Cost', 'Net Financial Impact']
baseline_values = [baseline_impact[m] for m in metrics]
balanced_values = [balanced_impact[m] for m in metrics]

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 7))
rects1 = ax.bar(x - width/2, baseline_values, width, label='Baseline LogReg')
rects2 = ax.bar(x + width/2, balanced_values, width, label='Balanced LogReg')

# Add labels and title
ax.set_title('Financial Impact Comparison')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

# Add value labels on bars
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'${height:.2f}',
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom', rotation=45)

autolabel(rects1)
autolabel(rects2)

plt.tight_layout()
plt.show()

# Create a second plot showing percentage of caught fraud
labels = ['Baseline LogReg', 'Balanced LogReg']
caught_percentages = [baseline_impact['Caught Fraud Percentage'], balanced_impact['Caught Fraud Percentage']]

plt.figure(figsize=(10, 6))
plt.bar(labels, caught_percentages, color=['blue', 'green'])
plt.title('Percentage of Fraud Amount Caught by Model')
plt.ylabel('Percentage of Total Fraud Amount')
plt.ylim(0, 100)

# Add percentage labels above bars
for i, v in enumerate(caught_percentages):
    plt.text(i, v + 1, f"{v:.1f}%", ha='center')

plt.tight_layout()
plt.show()

The balanced model saved about \$10k (net financial impact difference) by catching 20% more fraud than the baseline. That is over only 3k observations. Visa alone processed 720m transactions per **day** in 2023.

## Over / Under Sampling Techniques

We'll use over- and under-sampling techniques with the KNN model, which doesn't support class weight adjustment.

First, refit and evaluate the baseline KNN again to remind us of the results.

In [None]:
# 1. Baseline KNN model (from previous cells)
print("=== BASELINE KNN MODEL ===")
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_prob = knn.predict_proba(X_test)[:, 1]
baseline_metrics = evaluate_model(y_test, knn_pred, knn_prob, "Baseline KNN")

Next, use SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples of the minority class (fraud) by interpolating between existing minority class examples.

By adding new observations based on the existing data (not copies of observations), it creates a balanced dataset.
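The interpolation idea can be sketched in a few lines of NumPy. This is only an illustration of the core step under simplified assumptions (two hand-picked points, one synthetic sample); the `imbalanced-learn` implementation also handles neighbour search and sampling ratios. The points and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class points in a 2-feature space
x_i = np.array([1.0, 2.0])          # a minority sample
x_neighbor = np.array([3.0, 6.0])   # one of its minority-class nearest neighbours

# SMOTE-style interpolation: pick a random point on the segment between them
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(lam, x_synthetic)
```

Because the synthetic point always lies on the segment between two real minority samples, it is new data rather than a copy, but it can also fall in regions where the classes overlap.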

In [None]:
# 2. SMOTE Oversampling
print("\n=== SMOTE OVERSAMPLING ===")
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Display class distribution after SMOTE
print("\nClass distribution after SMOTE:")
print(pd.Series(y_train_smote).value_counts(normalize=True) * 100)

Note that the size of the training data has nearly doubled. This can introduce noise.

In [None]:
y_train_smote.shape

Use the SMOTE'd data to fit a new KNN model.

In [None]:
# Train KNN on SMOTE-resampled data
knn_smote = KNeighborsClassifier(n_neighbors=5)
knn_smote.fit(X_train_smote, y_train_smote)
knn_smote_pred = knn_smote.predict(X_test)
knn_smote_prob = knn_smote.predict_proba(X_test)[:, 1]
smote_metrics = evaluate_model(y_test, knn_smote_pred, knn_smote_prob, "KNN with SMOTE")

Again, ROC AUC has increased but F1 has decreased: the gain in recall comes at the expense of precision.

Next, let's try random undersampling, where random examples from the majority class are simply removed from the dataset. Not surprisingly, this can result in information loss. Yes, it is a real approach.

In [None]:
# 3. Random Undersampling
print("\n=== RANDOM UNDERSAMPLING ===")
# Apply random undersampling to the training data
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Display class distribution after undersampling
print("\nClass distribution after undersampling:")
print(pd.Series(y_train_rus).value_counts(normalize=True) * 100)

In [None]:
y_train_rus.shape

We're now working with 1/10th the original data!

Fit a new model...

In [None]:
# Train KNN on undersampled data
knn_rus = KNeighborsClassifier(n_neighbors=5)
knn_rus.fit(X_train_rus, y_train_rus)
knn_rus_pred = knn_rus.predict(X_test)
knn_rus_prob = knn_rus.predict_proba(X_test)[:, 1]
rus_metrics = evaluate_model(y_test, knn_rus_pred, knn_rus_prob, "KNN with Random Undersampling")

Compare the results.

In [None]:
models = ['Baseline KNN', 'KNN with SMOTE', 'KNN with Undersampling']
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
results = pd.DataFrame({
    models[0]: baseline_metrics,
    models[1]: smote_metrics,
    models[2]: rus_metrics
}, index=metrics)

print("\n--- Model Comparison ---")
print(results)

The baseline model is very cautious about flagging transactions as fraudulent (high precision, low recall). The resampled models are more aggressive, catching more fraud (higher recall) but generating more false alarms (lower precision).

Both resampling techniques show meaningful improvements in ROC AUC. This is significant because ROC AUC measures performance across all possible thresholds.
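One way to see why ROC AUC is threshold-free: it equals the fraction of (fraud, legitimate) pairs in which the fraud case receives the higher score. The sketch below computes it that way on made-up toy data; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy labels and scores (made up for illustration)
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, s_toy))
```

Since only the ordering of scores matters, any strictly increasing rescaling of the scores leaves this value unchanged.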

## Probability Calibration

We'll demonstrate this using the LogReg model we built with custom weights:

```text
                Custom LogReg
Accuracy             0.853667
Precision            0.247253
Recall               0.828221
F1 Score             0.380818
ROC AUC              0.923279
```

It is a good candidate because:
- It shows the highest ROC AUC (0.923279), indicating excellent discriminative ability
- It has the most severe precision-recall trade-off (very low precision of 0.247 with very high recall of 0.828)
- Class weighting directly influences the model's probability estimates by modifying the loss function
- The dramatic gap between accuracy (0.854) and ROC AUC (0.923) suggests probabilities are poorly calibrated with the default threshold
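The F1 value in the table follows directly from the precision and recall shown there. F1 is their harmonic mean, so the low precision pulls it down despite the high recall. A quick check using the reported figures:

```python
# Metrics reported for the custom-weight LogReg model in the table above
precision = 0.247253
recall = 0.828221

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

The result agrees with the reported F1 of 0.380818 to four decimal places.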

First we'll generate a reliability diagram (calibration curve) for the custom LogReg model, to illustrate how far out of calibration its probs are.

In [None]:
# Generate a reliability diagram (calibration curve) for the custom LogReg model

from sklearn.calibration import calibration_curve

# Get predicted probabilities from the custom weighted model
custom_probs = logreg_custom.predict_proba(X_test)[:, 1]

# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_test, custom_probs, n_bins=10)

# Create reliability diagram
plt.figure(figsize=(7, 5))

# Plot perfectly calibrated line
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated', color='gray')

# Plot calibration curve
plt.plot(prob_pred, prob_true, marker='o', linewidth=2, 
        label=f'Custom LogReg (Brier score: {brier_score_loss(y_test, custom_probs):.3f})',
        color='red')

# Add histogram of predicted probabilities
plt.hist(custom_probs, bins=10, range=(0, 1), alpha=0.2, color='red', 
        density=True, align='mid', label='Histogram')

# Add details
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives (Actual Frequency)')
plt.title('Calibration Curve - Custom Weighted Logistic Regression')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)

# Annotate key points where calibration is poor
max_deviation_idx = np.argmax(np.abs(prob_true - prob_pred))
plt.annotate(f'Poor calibration\n({prob_pred[max_deviation_idx]:.2f}, {prob_true[max_deviation_idx]:.2f})',
            xy=(prob_pred[max_deviation_idx], prob_true[max_deviation_idx]),
            xytext=(prob_pred[max_deviation_idx] + 0.1, prob_true[max_deviation_idx] - 0.1),
            arrowprops=dict(facecolor='black', shrink=0.05, width=1.5, headwidth=8),
            fontsize=10)

plt.tight_layout()
plt.show()

In this plot, the x-axis is the mean predicted probability and the y-axis is the observed fraud rate. For example, where the model predicts an 84% chance of fraud, fraud occurs only 24% of the time. The model dramatically overestimates the true probabilities.

Points below the diagonal show where the model overestimates probabilities; points above the diagonal show where it underestimates them. This model dramatically overestimates probabilities: the heavy class weighting pushes its scores upward in order to improve the likelihood of correctly identifying fraud.
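The Brier score shown in the plot legend is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. A minimal sketch on made-up data, where `brier_score` is a hypothetical stand-in for `brier_score_loss` and both score lists are invented for illustration:

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.mean((y_prob - y_true) ** 2)

# Toy data (made up): the same four outcomes scored by two hypothetical models
y_toy = [0, 0, 0, 1]
overconfident = [0.6, 0.7, 0.8, 0.9]   # ranks the positive highest but inflates every score
better_scaled = [0.1, 0.2, 0.3, 0.9]   # same ranking, probabilities closer to the outcomes

print(brier_score(y_toy, overconfident), brier_score(y_toy, better_scaled))
```

Both toy models rank the cases identically, so their ROC AUC would be the same; only the Brier score separates them.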

How Platt Scaling Works:
1. Platt scaling fits a logistic regression model to the original model's predictions
2. This creates a mapping function from original scores to calibrated probabilities
3. The mapping preserves the ranking of predictions (same ROC AUC) but adjusts the scale
4. The calibrated probabilities more accurately reflect the true likelihood of fraud
5. This enables better decision-making when selecting probability thresholds
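The steps above can be sketched by hand. This is an illustrative approximation, assuming plain maximum-likelihood fitting on simulated decision values; it is not scikit-learn's internal routine, and `fit_platt`, the simulated data, and all variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(decision, y, n_iter=25):
    """Fit p = sigmoid(a * decision + b) by Newton's method (plain maximum likelihood)."""
    X = np.column_stack([decision, np.ones_like(decision)])
    w = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)                      # gradient of the negative log-likelihood
        hess = (X * (p * (1 - p))[:, None]).T @ X  # its Hessian
        w -= np.linalg.solve(hess, grad)
    return w  # (a, b)

# Simulated held-out calibration set: ~5% fraud, decision values inflated
# so that the raw probabilities are far too high (as with heavy class weights)
rng = np.random.default_rng(0)
y_calib = rng.binomial(1, 0.05, size=2000)
decision = 2.0 * y_calib + rng.normal(0.0, 1.0, size=2000)
raw_probs = sigmoid(decision)

a, b = fit_platt(decision, y_calib)
calibrated_probs_demo = sigmoid(a * decision + b)

print(f"mean raw probability:        {raw_probs.mean():.3f}")
print(f"mean calibrated probability: {calibrated_probs_demo.mean():.3f}")
print(f"observed fraud rate:         {y_calib.mean():.3f}")
```

Because the fitted slope `a` is positive, the mapping is monotonic: the ranking of cases is unchanged while the average predicted probability is pulled toward the observed fraud rate.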

In [None]:
# Apply probability calibration to improve the Custom LogReg model
from sklearn.calibration import CalibratedClassifierCV
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import brier_score_loss, precision_recall_curve
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split

In [None]:
# First, let's split our training data to have a separate calibration set
# This avoids calibrating on the same data used for model training
X_train_model, X_train_calib, y_train_model, y_train_calib = train_test_split(
   X_train, y_train, test_size=0.3, random_state=42, stratify=y_train
)

# Train base model on the training subset
base_model = LogisticRegression(class_weight={0:1, 1:20}, solver='liblinear', random_state=42)
base_model.fit(X_train_model, y_train_model)

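# Create calibrated model using Platt scaling (sigmoid method)
# Note: cv='prefit' is deprecated as of scikit-learn 1.6; newer versions wrap the
# fitted model in sklearn.frozen.FrozenEstimator instead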
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv='prefit')
calibrated_model.fit(X_train_calib, y_train_calib)

# Get predictions from both models on test set
base_probs = base_model.predict_proba(X_test)[:, 1]
calibrated_probs = calibrated_model.predict_proba(X_test)[:, 1]

# Calculate calibration curves for both models
prob_true_base, prob_pred_base = calibration_curve(y_test, base_probs, n_bins=10)
prob_true_calib, prob_pred_calib = calibration_curve(y_test, calibrated_probs, n_bins=10)

# Calculate Brier scores (lower is better)
brier_base = brier_score_loss(y_test, base_probs)
brier_calib = brier_score_loss(y_test, calibrated_probs)

In [None]:
# Create reliability diagram comparing the two models
plt.figure(figsize=(7, 5))

# Plot perfectly calibrated line
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated', color='gray')

# Plot calibration curves
plt.plot(prob_pred_base, prob_true_base, marker='o', linewidth=2, 
        label=f'Original Custom LogReg (Brier: {brier_base:.3f})',
        color='red')
plt.plot(prob_pred_calib, prob_true_calib, marker='s', linewidth=2, 
        label=f'Platt Calibrated LogReg (Brier: {brier_calib:.3f})',
        color='blue')

# Add histograms of predicted probabilities
plt.hist(base_probs, bins=10, range=(0, 1), alpha=0.2, color='red', 
        density=True, align='mid')
plt.hist(calibrated_probs, bins=10, range=(0, 1), alpha=0.2, color='blue', 
        density=True, align='mid')

plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives (Actual Frequency)')
plt.title('Reliability Diagram - Before and After Platt Scaling')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)

# Add explanation
plt.figtext(0.15, 0.02, 
           "Platt scaling fits a logistic regression to model outputs\n" +
           "to map them to calibrated probabilities", 
           fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Compare metrics for both models
print("Calibration Comparison:")
print(f"Original model Brier score: {brier_base:.4f}")
print(f"Calibrated model Brier score: {brier_calib:.4f}")
print(f"Improvement in Brier score: {(1 - brier_calib/brier_base)*100:.2f}%")

# Calculate mean absolute calibration error
calib_error_base = np.mean(np.abs(prob_true_base - prob_pred_base))
calib_error_calib = np.mean(np.abs(prob_true_calib - prob_pred_calib))
print(f"Original model mean absolute calibration error: {calib_error_base:.4f}")
print(f"Calibrated model mean absolute calibration error: {calib_error_calib:.4f}")
print(f"Improvement in calibration error: {(1 - calib_error_calib/calib_error_base)*100:.2f}%")

## Threshold Optimization

Demonstrate this with the calibrated LogReg model:

- highest ROC AUC (0.923279) implies best performance over all thresholds
- already calibrated!

First, define a function that calculates the cost for a given threshold, true classes, predicted probabilities, and the FP cost.

In [None]:
def calculate_expected_cost(y_true, y_prob, threshold, fp_cost=20):
    """
    Calculate the expected cost given probabilities and a threshold
    
    Parameters:
    - y_true: True labels (0=legitimate, 1=fraud)
    - y_prob: Predicted probabilities of fraud
    - threshold: Probability threshold for classification
    - fp_cost: Cost of investigating a false positive ($)
    
    Returns:
    - Total expected cost
    """
    y_pred = (y_prob >= threshold).astype(int)
    
    # Get transaction amounts from test set
    transaction_amounts = X_test['transaction_amount'].values
    
    # Identify prediction types
    false_negatives = (y_true == 1) & (y_pred == 0)  # Missed fraud
    false_positives = (y_true == 0) & (y_pred == 1)  # False alarms
    
    # Calculate costs
    fn_cost = np.sum(transaction_amounts[false_negatives])  # Lost money
    fp_cost_total = np.sum(false_positives) * fp_cost  # Investigation cost
    
    return fn_cost + fp_cost_total

Use that function to calculate the cost for a range of threshold values and find the minimum.

In [None]:
# Find the cost-optimal threshold by trying a range of thresholds
thresholds_to_try = np.linspace(0.01, 0.99, 99)
costs = []

for threshold in thresholds_to_try:
    cost = calculate_expected_cost(y_test, calibrated_probs, threshold)
    costs.append(cost)

# Find minimum cost threshold
cost_optimal_idx = np.argmin(costs)
cost_optimal_threshold = thresholds_to_try[cost_optimal_idx]
min_cost = costs[cost_optimal_idx]

print(f"\nCost-optimal threshold: {cost_optimal_threshold:.4f}")
print(f"Minimum expected cost: ${min_cost:.2f}")

Plot the cost vs threshold.

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(thresholds_to_try, costs, lw=2, color='blue')
plt.axvline(x=cost_optimal_threshold, color='red', linestyle='--', 
           label=f'Optimal threshold: {cost_optimal_threshold:.3f}')
plt.xlabel('Threshold')
plt.ylabel('Expected Cost ($)')
plt.title('Expected Cost vs. Threshold')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

Observations:

- At near-zero thresholds, costs spike due to excessive false positives (investigation costs)
- At near-one thresholds, costs spike due to excessive false negatives (missed fraud)
- In the "saddle" between them, the FP / FN tradeoff plays out at a rate set by the relative costs: `fp_cost` per false alarm versus the transaction amount per missed fraud

Left of the optimal point, FP costs dominate; right of it, FN costs dominate. Moving left from the optimum adds FP cost faster than it removes FN cost. Moving right, the opposite is true.

This chart is directly actionable: it shows what any given threshold costs and how much is lost by moving away from the optimum.
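
The tradeoff can also be viewed per transaction. Under this notebook's cost model (a missed fraud costs the transaction amount, a false alarm costs `fp_cost`, a caught fraud costs nothing), and assuming the probabilities are well calibrated, flagging a transaction with fraud probability `p` and amount `a` has expected cost `(1 - p) * fp_cost`, while not flagging it has expected cost `p * a`. Flagging is therefore the cheaper action when `p > fp_cost / (fp_cost + a)`. The sketch below computes that break-even probability; `break_even_threshold` and the example amounts are hypothetical.

```python
import numpy as np

def break_even_threshold(amounts, fp_cost=20):
    """Probability above which flagging is cheaper than not flagging.

    Derived from p * amount > (1 - p) * fp_cost (hypothetical helper).
    """
    amounts = np.asarray(amounts, dtype=float)
    return fp_cost / (fp_cost + amounts)

# Illustrative amounts only
example_amounts = np.array([20.0, 180.0, 980.0])
example_thresholds = break_even_threshold(example_amounts)
print(example_thresholds)  # [0.5  0.1  0.02]
```

Larger transactions justify an alert at a lower probability. The single global threshold used in this notebook is a simpler compromise across all transaction amounts.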

Create a summary of financial impact for different threshold values.

First, define a helper function to calculate the values of interest.

- Saved = total cost of fraudulent charges avoided
- Lost = total fraudulent charges missed
- FP Cost = total cost of false positives (investigations)
- Net Impact = Saved - FP Cost
- Alerts = TP + FP
- Fraud Cases Caught = TP
- Fraud Cases Missed = FN

**Why doesn't Lost factor into Net Impact?**

In [None]:
def calculate_financial_metrics(y_true, y_prob, threshold, fp_cost=20):
    """Calculate comprehensive financial metrics for a given threshold"""
    y_pred = (y_prob >= threshold).astype(int)
    
    # Get confusion matrix elements (labels fixed so the matrix is always 2x2)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    
    # Get transaction amounts from test set
    transaction_amounts = X_test['transaction_amount'].values
    
    # Identify different prediction types
    true_positives = (y_true == 1) & (y_pred == 1)  # Correctly identified fraud
    false_negatives = (y_true == 1) & (y_pred == 0)  # Missed fraud
    
    # Calculate financial impact
    saved_amount = np.sum(transaction_amounts[true_positives])
    lost_amount = np.sum(transaction_amounts[false_negatives])
    investigation_cost = fp * fp_cost
    
    # Net financial impact
    net_impact = saved_amount - investigation_cost
    
    return {
        'Thresh': threshold,
        'Saved': saved_amount,
        'Lost': lost_amount,
        'FP Cost': investigation_cost,
        'Net Impact': net_impact,
        'Prec': precision_score(y_true, y_pred) if tp + fp > 0 else 0,
        'Recall': recall_score(y_true, y_pred) if tp + fn > 0 else 0,
        'FPR': fp / (fp + tn) if (fp + tn) > 0 else 0,
        'Alerts': tp + fp,
        'Fraud Cases Caught': tp,
        'Fraud Cases Missed': fn
    }
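
On the question of why Lost does not appear in Net Impact: for a fixed test set, Saved + Lost equals the total amount of fraud, which does not depend on the threshold. Net Impact = Saved - FP Cost therefore equals (total fraud) - (Lost + FP Cost), so maximizing Net Impact and minimizing the expected cost select the same threshold. The sketch below checks this identity on toy data; all names and values are illustrative and not part of the notebook's dataset.

```python
import numpy as np

# Toy data (illustrative values only)
demo_y = np.array([0, 0, 0, 1, 1, 1])
demo_prob = np.array([0.05, 0.30, 0.70, 0.20, 0.60, 0.95])
demo_amounts = np.array([40.0, 60.0, 90.0, 250.0, 400.0, 800.0])
demo_fp_cost = 20

# Total fraud is fixed by the data, not by the threshold
total_fraud = demo_amounts[demo_y == 1].sum()

identity_holds = []
for t in [0.1, 0.5, 0.9]:
    demo_pred = (demo_prob >= t).astype(int)
    saved = demo_amounts[(demo_y == 1) & (demo_pred == 1)].sum()
    lost = demo_amounts[(demo_y == 1) & (demo_pred == 0)].sum()
    investigation_cost = ((demo_y == 0) & (demo_pred == 1)).sum() * demo_fp_cost

    net_impact = saved - investigation_cost
    expected_cost = lost + investigation_cost

    # Net Impact and expected cost differ only by the constant total_fraud
    identity_holds.append(bool(np.isclose(net_impact, total_fraud - expected_cost)))

print(identity_holds)  # [True, True, True]
```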

Calculate metrics at thresholds of interest.

In [None]:
thresholds_to_compare = {
    'Default (0.5)': 0.5,
    'Cost-Optimal': cost_optimal_threshold,
    'High-Recall (0.05)': 0.05,
    'High-Precision (0.95)': 0.95
}

# Calculate metrics for each threshold
financial_results = []
for name, threshold in thresholds_to_compare.items():
    metrics = calculate_financial_metrics(y_test, calibrated_probs, threshold)
    metrics['Name'] = name
    financial_results.append(metrics)

# Convert results to DataFrame for better display
results_df = pd.DataFrame(financial_results)
results_df = results_df[['Name', 'Thresh', 'Net Impact', 
                        'Saved', 'Lost', 'FP Cost',
                        'Prec', 'Recall', 
                        'FPR', 'Alerts']]

# Format as currency where appropriate
for col in ['Net Impact', 'Saved', 'Lost', 'FP Cost']:
    results_df[col] = results_df[col].map('${:,.0f}'.format)

# Format percentages
for col in ['Prec', 'Recall', 'FPR']:
    results_df[col] = results_df[col].map('{:.1%}'.format)

Display results dataframe.

In [None]:
print("\nFinancial Impact at Different Thresholds:\n")
# print(results_df.to_string(index=False))
results_df

Observations:

- Cost-optimal maximizes net impact by catching 71.8% of fraud while keeping investigation costs in check
- Default isn't very sensitive: it misses fraud (37.4%) but also raises few FPs as a result
- High-precision minimizes FPs by alerting only on the most confident predictions
- High-recall catches more fraud (82.8%) than cost-optimal, at the expense of a higher FPR (14.3%) and the associated investigation costs