# Credit Card Fraud Detection with Machine Learning

## Introduction

This analysis explores a dataset of anonymized credit card transactions to build machine learning models that can identify fraudulent activity. The dataset was published by the [Machine Learning Group at ULB (Université Libre de Bruxelles)](http://mlg.ulb.ac.be) and is available on [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).

### About the Dataset

The dataset contains **284,807 credit card transactions** made by European cardholders over a two-day period in September 2013. Of these, only **492 are fraudulent**, just 0.172% of all transactions. This extreme class imbalance is a central challenge of this analysis.

For privacy reasons, the original transaction details have been transformed using **Principal Component Analysis (PCA)**. The resulting features `V1` through `V28` are anonymous numerical values. Only two features remain in their original form:

- **Time**: The number of seconds elapsed since the first transaction in the dataset
- **Amount**: The transaction amount (in Euros)

The target variable is **Class**, where `1` indicates fraud and `0` indicates a legitimate transaction.

The dataset was collected through a research collaboration between Worldline and the Machine Learning Group at ULB, and has been widely used as a benchmark for fraud detection research.

In a typical credit card dataset, the original features behind `V1`–`V28` might include things like the merchant category (grocery store, gas station, online retailer), the geographic distance from the cardholder's home, how frequently the card has been used recently, or whether the purchase was made online or in person. After PCA transformation, each V-feature becomes a mathematical blend of several original details, so `V14` doesn't directly mean "merchant type," but it might capture a combined pattern involving merchant type, location, and other factors.

### Our Approach

We will:
1. Explore the data to understand patterns in fraudulent vs. legitimate transactions
2. Address the severe class imbalance using SMOTE (Synthetic Minority Oversampling)
3. Train and compare three supervised models: Logistic Regression, Random Forest, and XGBoost
4. Try an unsupervised approach using Isolation Forest for anomaly detection
5. Evaluate all models using metrics designed for imbalanced data, primarily **AUPRC** (Area Under the Precision-Recall Curve) and **ROC-AUC**, rather than simple accuracy

> **Note:** The dataset file (`creditcard.csv`) is not included in this repository due to its size (~150MB). You can download it directly from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).

In [None]:
%pip install imbalanced-learn xgboost -q

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve, precision_recall_curve,
                             average_precision_score)
from imblearn.over_sampling import SMOTE
import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
%matplotlib inline

## Loading the Data

Let's start by loading the dataset and taking a first look at what we're working with.

In [None]:
cards = pd.read_csv('creditcard.csv')
print(f"Dataset shape: {cards.shape[0]:,} transactions, {cards.shape[1]} columns")
print(f"Missing values: {cards.isna().sum().sum()}")
cards.head()

The dataset contains **284,807 rows** (one per transaction) and **31 columns**. There are no missing values, so we can proceed directly to our analysis.

Each row represents a single transaction with:
- 28 anonymized features (`V1` through `V28`) derived from PCA
- `Time` and `Amount` — the only features in their original form
- `Class` — our target variable (`0` = legitimate, `1` = fraud)

## Exploratory Data Analysis

Before building any models, let's explore the data to understand the patterns and challenges we're dealing with.

### How Common is Fraud?

In [None]:
class_counts = cards['Class'].value_counts()
class_pct = (class_counts / len(cards) * 100).round(3)

print("Transaction Counts:")
print(f"  Legitimate: {class_counts[0]:,} ({class_pct[0]}%)")
print(f"  Fraudulent: {class_counts[1]:,} ({class_pct[1]}%)")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

bars = axes[0].bar(['Legitimate', 'Fraud'], class_counts.values,
                    color=['steelblue', 'firebrick'])
axes[0].set_title('Number of Transactions by Class')
axes[0].set_ylabel('Count')
for bar, count in zip(bars, class_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2000,
                 f'{count:,}', ha='center', fontsize=11)

axes[1].bar(['Legitimate', 'Fraud'], class_pct.values,
            color=['steelblue', 'firebrick'])
axes[1].set_title('Percentage of Transactions by Class')
axes[1].set_ylabel('Percentage (%)')
axes[1].set_ylim(0, 105)
for i, pct in enumerate(class_pct.values):
    axes[1].text(i, pct + 1.5, f'{pct}%', ha='center', fontsize=11)

plt.tight_layout()
plt.show()

Fraud is extremely rare — only **0.172%** of all transactions. For every fraudulent transaction, there are roughly **578 legitimate ones**.

This is a major challenge for machine learning: a model that simply labels *every* transaction as legitimate would be "99.8% accurate" while catching zero fraud. This is why **accuracy is a misleading metric** for this problem.

Instead, we will use metrics that specifically measure how well the model identifies the rare fraud cases:
- **Precision**: Of the transactions flagged as fraud, how many actually are?
- **Recall**: Of all actual fraud cases, how many did we catch?
- **AUPRC** (Area Under the Precision-Recall Curve): A single number summarizing the precision-recall tradeoff — the higher, the better
- **ROC-AUC**: How well the model separates fraud from legitimate transactions overall
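
To make the accuracy trap concrete, the short sketch below scores a hypothetical classifier that labels every transaction as legitimate. The label arrays are built from the class counts reported above rather than loaded from `creditcard.csv`, so this is an illustration of the arithmetic, not a model result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported above: 284,807 transactions, 492 of them fraud
n_total, n_fraud = 284_807, 492

# True labels: 1 = fraud, 0 = legitimate
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A trivial "model" that never flags anything
y_all_legit = np.zeros(n_total, dtype=int)

trivial_accuracy = accuracy_score(y_true, y_all_legit)
trivial_recall = recall_score(y_true, y_all_legit)

print(f"Accuracy: {trivial_accuracy:.4%}")
print(f"Recall:   {trivial_recall:.4%}")
```

The accuracy looks excellent while recall on the fraud class is zero, which is why the metrics above focus on the minority class.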

### Transaction Amounts

How do fraudulent transactions compare to legitimate ones in terms of amount?

In [None]:
fraud = cards[cards['Class'] == 1]
legit = cards[cards['Class'] == 0]

print("Legitimate Transaction Amounts:")
print(legit['Amount'].describe().round(2).to_string())
print(f"\nFraudulent Transaction Amounts:")
print(fraud['Amount'].describe().round(2).to_string())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(legit['Amount'], bins=50, alpha=0.7, label='Legitimate',
             color='steelblue', density=True)
axes[0].hist(fraud['Amount'], bins=50, alpha=0.7, label='Fraud',
             color='firebrick', density=True)
axes[0].set_title('Distribution of Transaction Amounts')
axes[0].set_xlabel('Amount')
axes[0].set_ylabel('Density')
axes[0].set_xlim(0, 1000)
axes[0].legend()

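bp = axes[1].boxplot([legit['Amount'], fraud['Amount']], patch_artist=True)
# Set tick labels separately: the boxplot `labels` keyword is deprecated in newer Matplotlib
axes[1].set_xticklabels(['Legitimate', 'Fraud'])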
bp['boxes'][0].set_facecolor('steelblue')
bp['boxes'][1].set_facecolor('firebrick')
axes[1].set_title('Transaction Amount by Class')
axes[1].set_ylabel('Amount')
axes[1].set_ylim(0, 500)

plt.tight_layout()
plt.show()

Several patterns stand out:

- **Fraudulent transactions tend to be smaller.** The median fraud amount (~€9.25) is much lower than the median legitimate amount (~€22.00). The majority of fraud occurs under €100.
- **Legitimate transactions have a wider range**, with some reaching over €25,000, while the largest fraud transaction is around €2,125.
- **Both distributions are heavily right-skewed**, with most transactions clustered at lower amounts.

One plausible explanation: fraudsters often test with small amounts first to see if a stolen card works, and smaller transactions are less likely to trigger manual review.

### When Do Transactions Happen?

The `Time` column records seconds since the first transaction. Since the data spans roughly 48 hours, we can convert this to an approximate hour of the day to look for daily patterns.
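
As a small illustration of the conversion, the sketch below maps a few hypothetical `Time` values (made up for this example) to an hour of day. The result is hours elapsed since the first transaction, wrapped to a 24-hour cycle; it matches clock time only if the recording started at midnight, which is an assumption here.

```python
import numpy as np

# Hypothetical Time values in seconds since the first transaction
time_seconds = np.array([0, 3_600, 45_000, 86_400, 90_000, 172_000])

# Hours elapsed, wrapped to a 24-hour cycle
hour_of_day = (time_seconds / 3600) % 24

# Integer hour bins 0..23 (floor, so 12.5 falls in the bin for hour 12)
hour_bin = np.floor(hour_of_day).astype(int)

for t, h, b in zip(time_seconds, hour_of_day, hour_bin):
    print(f"{t:>7} s -> hour {h:5.2f} -> bin {b}")
```

Flooring rather than rounding keeps every bin exactly one hour wide, which matters when comparing fraud rates across hours.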

In [None]:
cards['Hour'] = (cards['Time'] / 3600) % 24

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

legit_hours = cards[cards['Class'] == 0]['Hour']
fraud_hours = cards[cards['Class'] == 1]['Hour']

axes[0].hist(legit_hours, bins=48, alpha=0.7, label='Legitimate',
             color='steelblue', density=True)
axes[0].hist(fraud_hours, bins=48, alpha=0.7, label='Fraud',
             color='firebrick', density=True)
axes[0].set_title('Transaction Distribution by Hour of Day')
axes[0].set_xlabel('Hour of Day')
axes[0].set_ylabel('Density')
axes[0].legend()

# Floor (not round) so each bin covers one full hour, 0 through 23
hourly = cards.groupby(np.floor(cards['Hour']).astype(int))['Class'].mean() * 100
axes[1].bar(hourly.index, hourly.values, color='firebrick', alpha=0.7)
axes[1].set_title('Fraud Rate by Hour of Day')
axes[1].set_xlabel('Hour of Day')
axes[1].set_ylabel('Fraud Rate (%)')

plt.tight_layout()
plt.show()

Two interesting patterns emerge:

- **Legitimate transactions drop off during what appear to be nighttime hours** (roughly 11 PM to 6 AM, assuming the first transaction was recorded around midnight), which reflects normal consumer behavior.
- **The fraud rate is higher during off-peak hours.** When overall transaction volume is low, a larger proportion of transactions turn out to be fraudulent. This suggests that fraudsters may prefer these quieter hours when there is less monitoring and slower response times.

This time-of-day pattern is a useful signal, so we will engineer an `Hour` feature for our models.

### Which Features Correlate with Fraud?

Since most features are anonymized (from PCA), we can't interpret them directly. However, we can still identify which ones have the strongest statistical relationship with fraud.
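
A minimal sketch of how such a ranking could be produced is shown below. It uses a small synthetic DataFrame in place of `cards` (the column names `V_neg`, `V_pos`, and `V_noise` are made up for illustration); on the real data, the same `corrwith` call would be applied to the feature columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Synthetic stand-in for the real data: roughly 2% "fraud" labels
toy = pd.DataFrame({"Class": (rng.random(n) < 0.02).astype(int)})

# One feature shifted down for fraud, one shifted up, one unrelated
toy["V_neg"] = rng.normal(size=n) - 3 * toy["Class"]
toy["V_pos"] = rng.normal(size=n) + 3 * toy["Class"]
toy["V_noise"] = rng.normal(size=n)

# Pearson correlation of every feature with the target, most negative first
class_corr = (
    toy.drop(columns="Class")
       .corrwith(toy["Class"])
       .sort_values()
)
print(class_corr.round(3))
```

Sorting the result puts the strongest negative correlations at the top and the strongest positive ones at the bottom, which makes both ends of the ranking easy to read.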

The features with the strongest relationship to fraud are:

- **V17, V14, V12, V10** (negative correlation): When these values decrease, the likelihood of fraud increases.
- **V11, V4, V2** (positive correlation): When these values increase, fraud becomes more likely.

To make this more concrete: imagine V14 originally captured something related to how closely a transaction matches the cardholder's usual spending pattern. A low V14 value would mean the transaction looks unusual for that cardholder — and unusual transactions are more likely to be fraud. Conversely, if V11 captured something related to geographic distance from the cardholder's home, a high V11 value might mean the purchase happened far from where they normally shop — another red flag.

Most individual correlations are relatively weak (absolute value below 0.3), which means **no single feature is a reliable fraud indicator on its own**. This is exactly why we need machine learning models: they can combine multiple weak signals into a strong overall prediction. It's like diagnosing an illness: no single symptom confirms the diagnosis on its own, but the right combination of several symptoms together paints a clear picture.

The `Amount` and `Time` features have near-zero correlation with fraud, suggesting that on their own they are not strong predictors. However, they may still contribute useful information when combined with other features in a model.

## Preparing the Data for Modeling

Before training our models, we need to:

1. **Engineer a new feature**: Convert `Time` to `Hour` of day (capturing daily patterns better than raw seconds)
2. **Scale features**: The PCA features (`V1`–`V28`) are already centered and on broadly comparable scales, but `Amount` and `Hour` are not, so we standardize those two
3. **Split the data**: Separate into training (80%) and testing (20%) sets, using stratified sampling to preserve the fraud ratio in both sets

In [None]:
X = cards.drop('Class', axis=1).copy()
y = cards['Class'].copy()

# Replace raw Time (seconds) with Hour of day
X['Hour'] = (X['Time'] / 3600) % 24
X = X.drop('Time', axis=1)

# Stratified split preserves the fraud ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

X_train = X_train.copy()
X_test = X_test.copy()

# Scale Amount and Hour (fit ONLY on training data to prevent data leakage)
scaler = StandardScaler()
X_train[['Amount', 'Hour']] = scaler.fit_transform(X_train[['Amount', 'Hour']])
X_test[['Amount', 'Hour']] = scaler.transform(X_test[['Amount', 'Hour']])

print(f"Training set: {X_train.shape[0]:,} transactions ({y_train.sum():,} fraud)")
print(f"Test set:     {X_test.shape[0]:,} transactions ({y_test.sum():,} fraud)")
print(f"Features:     {X_train.shape[1]}")

### Handling Class Imbalance with SMOTE

Our training set has only ~394 fraud cases out of ~227,000 transactions. Models trained directly on data this imbalanced tend to be biased toward the majority class and can miss much of the fraud.

**SMOTE** (Synthetic Minority Oversampling Technique) addresses this by generating new *synthetic* fraud examples. It works by:
1. Picking a real fraud transaction
2. Finding its nearest neighbors among other fraud transactions
3. Creating a new synthetic example somewhere between them

This gives the model enough fraud examples to learn meaningful patterns, without the downsides of simply duplicating existing samples or throwing away legitimate data.

We apply SMOTE **only to the training data**. The test set remains untouched so it reflects the real-world class distribution.
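
The three steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea on made-up 2-D points, not the `imblearn` implementation, which handles neighbor selection and sampling ratios more carefully.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Made-up minority-class points (stand-ins for fraud transactions)
minority = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(20, 2))

# Step 2: nearest neighbors within the minority class
# (k + 1 because each point is its own closest neighbor)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

def synthesize(i):
    """Create one synthetic point between minority[i] and a random neighbor."""
    j = rng.choice(neighbor_idx[i, 1:])   # skip column 0 (the point itself)
    gap = rng.random()                    # position along the segment, in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Steps 1 and 3: pick real points at random and interpolate
synthetic = np.array([synthesize(i) for i in rng.integers(0, 20, size=10)])
print(synthetic.shape)
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region already occupied by the minority class.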

In [None]:
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Before SMOTE: {dict(pd.Series(y_train).value_counts())}")
print(f"After SMOTE:  {dict(pd.Series(y_train_smote).value_counts())}")

After SMOTE, the training set is now balanced. The model will see an equal number of fraud and legitimate examples, allowing it to learn the characteristics of both classes effectively.

## Training Machine Learning Models

We will train three supervised models, each with different strengths:

| Model | How It Works | Why Include It |
|---|---|---|
| **Logistic Regression** | Finds a linear decision boundary between classes | Fast, interpretable baseline |
| **Random Forest** | Combines many decision trees through majority voting | Handles non-linear patterns, resistant to overfitting |
| **XGBoost** | Builds trees sequentially, each correcting the previous one's mistakes | Often a top performer on tabular data; supports class weighting for imbalanced datasets |

In [None]:
models = {}

# 1. Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_smote, y_train_smote)
models['Logistic Regression'] = lr

# 2. Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_smote, y_train_smote)
models['Random Forest'] = rf

# 3. XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=100, max_depth=6, learning_rate=0.1,
    random_state=42, eval_metric='logloss', n_jobs=-1
)
xgb_model.fit(X_train_smote, y_train_smote)
models['XGBoost'] = xgb_model

print("All models trained successfully.")

## Evaluating Model Performance

Now let's see how each model performs on the **test set** — data it has never seen during training. The test set has the original imbalanced class distribution, just like real-world transactions.

### Confusion Matrices

A confusion matrix shows exactly where each model gets things right and wrong:
- **Top-left (True Negatives)**: Legitimate transactions correctly identified
- **Top-right (False Positives)**: Legitimate transactions incorrectly flagged as fraud
- **Bottom-left (False Negatives)**: Fraud that slipped through undetected
- **Bottom-right (True Positives)**: Fraud correctly identified

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (name, model) in zip(axes, models.items()):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt=',d', cmap='Blues', ax=ax,
                xticklabels=['Legitimate', 'Fraud'],
                yticklabels=['Legitimate', 'Fraud'])
    ax.set_title(name, fontsize=13)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.suptitle('Confusion Matrices', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

The confusion matrices reveal the fundamental tradeoff in fraud detection:

- **Catching more fraud** (higher recall) comes at the cost of **more false alarms** (lower precision).
- Each model strikes this balance differently. Compare the bottom-right cell (fraud caught) against the top-right cell (false alarms) across the three models.

In practice, this tradeoff is a business decision: investigating a false alarm costs time and resources, but missing a real fraud case means direct financial loss.

### Detailed Metrics

In [None]:
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    print(f"\n{'='*55}")
    print(f" {name}")
    print(f"{'='*55}")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    print(f"  ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"  AUPRC:   {average_precision_score(y_test, y_prob):.4f}")

### ROC Curves

The ROC (Receiver Operating Characteristic) curve plots the tradeoff between the True Positive Rate (fraud caught) and the False Positive Rate (false alarms) at every possible classification threshold. A perfect model hugs the top-left corner. The area under this curve (**ROC-AUC**) summarizes overall performance: 1.0 is perfect, 0.5 is random guessing.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc_score = roc_auc_score(y_test, y_prob)
    ax.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {auc_score:.4f})')

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Guess (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
ax.set_title('ROC Curve Comparison', fontsize=14)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

All three models achieve strong ROC-AUC scores, well above the 0.5 random baseline. However, **ROC-AUC can be overly optimistic when classes are highly imbalanced** — a model can appear to perform well on this metric while still producing many false positives in absolute terms.

This is why **AUPRC**, the area under the Precision-Recall curve, is the recommended evaluation metric for this dataset.

### Precision-Recall Curves

The Precision-Recall curve focuses specifically on the model's ability to identify the rare fraud class:
- **Precision** (y-axis): "When the model flags a transaction as fraud, how often is it correct?"
- **Recall** (x-axis): "Of all actual fraud cases, how many did the model find?"

The **AUPRC** (Area Under the Precision-Recall Curve) summarizes this in a single number. Higher AUPRC means the model catches more fraud while making fewer false alarms.
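
One detail worth knowing: scikit-learn's `average_precision_score`, which this notebook uses as its AUPRC estimate, summarizes the curve as a step-wise sum (precision at each threshold, weighted by the increase in recall) rather than by trapezoidal interpolation. The sketch below checks this on a handful of made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and scores for illustration
y_true_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score_toy = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.7, 0.05, 0.9, 0.4, 0.15])

precision_toy, recall_toy, _ = precision_recall_curve(y_true_toy, y_score_toy)

# precision_recall_curve returns recall in decreasing order, so the
# recall increments are the negated differences
ap_manual = -np.sum(np.diff(recall_toy) * precision_toy[:-1])
ap_sklearn = average_precision_score(y_true_toy, y_score_toy)

print(f"Manual AP:  {ap_manual:.4f}")
print(f"sklearn AP: {ap_sklearn:.4f}")
```

The two values agree, so the AUPRC figures reported in this notebook can be read as this step-wise average precision.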

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    auprc = average_precision_score(y_test, y_prob)
    ax.plot(recall, precision, linewidth=2, label=f'{name} (AUPRC = {auprc:.4f})')

baseline = y_test.mean()
ax.axhline(y=baseline, color='black', linestyle='--', linewidth=1,
           label=f'Baseline ({baseline:.4f})')
ax.set_xlabel('Recall (Fraud Cases Found)', fontsize=12)
ax.set_ylabel('Precision (Accuracy of Fraud Flags)', fontsize=12)
ax.set_title('Precision-Recall Curve Comparison', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
plt.tight_layout()
plt.show()

The Precision-Recall curves provide a more nuanced and honest picture than the ROC curves. The dashed baseline represents a model that randomly flags transactions — its precision would equal the fraud rate (~0.17%). All three models far exceed this baseline.

The model with the highest AUPRC achieves the best balance between catching fraud and avoiding false alarms, making it the strongest candidate for real-world deployment.

### Model Comparison Summary

In [None]:
summary = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()

    summary.append({
        'Model': name,
        'ROC-AUC': round(roc_auc_score(y_test, y_prob), 4),
        'AUPRC': round(average_precision_score(y_test, y_prob), 4),
        'Fraud Caught': f"{tp}/{tp+fn} ({round(tp/(tp+fn)*100, 1)}%)",
        'False Alarms': f"{fp:,}",
        'Missed Fraud': str(fn)
    })

summary_df = pd.DataFrame(summary).sort_values('AUPRC', ascending=False).reset_index(drop=True)
summary_df

This summary table makes the tradeoffs concrete. For each model we can see:
- How many of the fraud cases in the test set it caught
- How many legitimate transactions it incorrectly flagged
- Its overall ranking by AUPRC (the most reliable metric for this problem)

## Anomaly Detection with Isolation Forest

So far, we've used **supervised learning** — models that learn from labeled examples of both fraud and legitimate transactions. But what if we didn't have those labels?

**Isolation Forest** takes a completely different approach. Instead of learning from labeled fraud examples, it learns what "normal" transactions look like and flags anything unusual as a potential anomaly. This mirrors how fraud detection often works in practice: you learn the pattern of normal behavior and investigate deviations.

The key idea is that anomalies (like fraud) are rare and different from the majority, so they are easier to "isolate." The algorithm builds random decision trees and measures how quickly each transaction can be separated from the rest. Transactions that are isolated quickly are likely anomalies.
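
A minimal sketch of the idea follows, on made-up 2-D data rather than the card transactions: most points come from one tight cluster, a few are scattered far away, and the model is fitted without labels. The `contamination` value is an assumption chosen to roughly match the share of outliers in this toy data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

# Made-up data: 1,000 "normal" points and 20 scattered "anomalies"
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
anomalies = rng.uniform(low=-8.0, high=8.0, size=(20, 2))
X_toy = np.vstack([normal, anomalies])
y_toy = np.r_[np.zeros(1000, dtype=int), np.ones(20, dtype=int)]

# Fit without labels; contamination sets the expected anomaly share
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
iso.fit(X_toy)

# score_samples is higher for normal points, so negate it to get an
# anomaly score that can be ranked like a fraud probability
anomaly_score = -iso.score_samples(X_toy)
flagged = iso.predict(X_toy) == -1   # predict returns -1 for anomalies

toy_auprc = average_precision_score(y_toy, anomaly_score)
print(f"Flagged: {flagged.sum()} points, AUPRC: {toy_auprc:.3f}")
```

The labels in `y_toy` are used only for scoring after the fact; the fit itself never sees them, which is what makes the approach unsupervised.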

The Isolation Forest provides a useful comparison point. As a fully unsupervised method, it doesn't use any fraud labels during training — it simply learns the boundaries of "normal" transactions.

While it generally won't match the precision of the supervised models (which have the advantage of learning directly from labeled fraud examples), it demonstrates that **anomaly detection can identify fraudulent patterns without any labeled training data**. This is valuable in real-world scenarios where labeled fraud data is limited, expensive to obtain, or not yet available for new types of fraud.

Think of an experienced bank teller who has processed thousands of legitimate transactions over the years. Even without ever studying specific fraud cases, they develop an instinct for what "normal" looks like — and would notice something off about a sudden overseas purchase from a customer who has never traveled abroad.

## Choosing the Right Threshold

By default, our models use a 0.5 probability threshold: if the predicted fraud probability exceeds 50%, the transaction is flagged. But this default is rarely optimal.

In fraud detection, the consequences of errors are not equal:
- **Missing a fraud case** (false negative) means direct financial loss
- **Flagging a legitimate transaction** (false positive) means inconvenience and investigation cost

By adjusting the threshold, we can control this tradeoff. A lower threshold catches more fraud but creates more false alarms. A higher threshold reduces false alarms but risks missing fraud.

In [None]:
# Use the best supervised model by AUPRC
best_name = max(models, key=lambda n: average_precision_score(
    y_test, models[n].predict_proba(X_test)[:, 1]
))
best_model = models[best_name]
print(f"Best model by AUPRC: {best_name}\n")

y_prob_best = best_model.predict_proba(X_test)[:, 1]
precision_vals, recall_vals, thresholds = precision_recall_curve(y_test, y_prob_best)

# F1 score at each threshold (harmonic mean of precision and recall)
f1_scores = 2 * (precision_vals[:-1] * recall_vals[:-1]) / (
    precision_vals[:-1] + recall_vals[:-1] + 1e-8)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(thresholds, precision_vals[:-1], linewidth=2, label='Precision', color='steelblue')
ax.plot(thresholds, recall_vals[:-1], linewidth=2, label='Recall', color='firebrick')
ax.plot(thresholds, f1_scores, linewidth=2, label='F1 Score', color='green')
ax.axvline(x=optimal_threshold, color='gray', linestyle='--', linewidth=1.5,
           label=f'Optimal Threshold ({optimal_threshold:.3f})')
ax.axvline(x=0.5, color='orange', linestyle=':', linewidth=1.5,
           label='Default Threshold (0.500)')
ax.set_xlabel('Classification Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title(f'Precision-Recall Tradeoff by Threshold ({best_name})', fontsize=14)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

print(f"Optimal threshold (maximizes F1): {optimal_threshold:.4f}")
print(f"  Precision at optimal: {precision_vals[optimal_idx]:.4f}")
print(f"  Recall at optimal:    {recall_vals[optimal_idx]:.4f}")
print(f"  F1 Score at optimal:  {f1_scores[optimal_idx]:.4f}")

The plot shows how precision and recall move in opposite directions as we adjust the threshold:

- **Lower threshold** (left side): More transactions are flagged as fraud. This catches more actual fraud (higher recall) but also produces more false alarms (lower precision).
- **Higher threshold** (right side): Fewer transactions are flagged. This means fewer false alarms (higher precision) but some fraud cases are missed (lower recall).

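The **F1 score** (green line) balances both metrics and peaks at the optimal threshold. Any gap between the optimal threshold and the default 0.5 shows how much tuning this value can change real-world performance. Because the threshold here was selected on the test set, the precision and recall reported at the optimum are somewhat optimistic; in a production workflow, the threshold would be chosen on a separate validation split.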

In practice, the "right" threshold depends on the business context: how much does a missed fraud case cost compared to the cost of investigating a false alarm? This is a decision best made in collaboration with domain experts and stakeholders.
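
As one way to make this concrete, the sketch below picks a threshold by minimizing total cost on a held-out validation split, rather than maximizing F1 on the test set. The data is synthetic, and the two cost figures are purely hypothetical placeholders that a business team would need to supply.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 3% positives) standing in for the real set
X_syn, y_syn = make_classification(
    n_samples=20_000, n_features=10, weights=[0.97], random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_prob = clf.predict_proba(X_val)[:, 1]

# Hypothetical costs: a missed fraud vs. reviewing one false alarm
COST_FN, COST_FP = 100.0, 5.0

def total_cost(threshold):
    """Total cost on the validation split at a given threshold."""
    pred = val_prob >= threshold
    fn = np.sum(~pred & (y_val == 1))
    fp = np.sum(pred & (y_val == 0))
    return COST_FN * fn + COST_FP * fp

candidates = np.linspace(0.01, 0.99, 99)
costs = np.array([total_cost(t) for t in candidates])
best_threshold = candidates[np.argmin(costs)]
print(f"Lowest-cost threshold on validation data: {best_threshold:.2f}")
```

Changing the ratio of `COST_FN` to `COST_FP` moves the chosen threshold, which is the business tradeoff described above expressed in code.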

## Feature Importance

Even though most features are anonymized, we can still identify which ones the models rely on most. This reveals which patterns in the data are most predictive of fraud.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Random Forest
rf_imp = pd.Series(rf.feature_importances_, index=X_train.columns)
rf_imp.nlargest(15).sort_values().plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('Random Forest — Top 15 Features', fontsize=13)
axes[0].set_xlabel('Importance')

# XGBoost
xgb_imp = pd.Series(xgb_model.feature_importances_, index=X_train.columns)
xgb_imp.nlargest(15).sort_values().plot(kind='barh', ax=axes[1], color='darkorange')
axes[1].set_title('XGBoost — Top 15 Features', fontsize=13)
axes[1].set_xlabel('Importance')

plt.suptitle('Which Features Matter Most for Detecting Fraud?', fontsize=15, y=1.02)
plt.tight_layout()
plt.show()

Both models agree on several key features, particularly **V14, V17, V12, and V10** — the same features that showed the strongest correlations with fraud in our exploratory analysis. This consistency across different model types gives us confidence that these features capture genuine fraud patterns rather than noise.

The `Amount` and `Hour` features also appear in the rankings, which suggests that they contribute some useful information despite their weak individual correlations with fraud.

Since the V-features are PCA-transformed and anonymous, we can't map them back to specific real-world behaviors (e.g., transaction location or merchant type). However, the fact that a small number of features dominate suggests that **fraud has a distinct and recognizable statistical signature** in this transaction data.

## Conclusions

This analysis explored credit card fraud detection using both supervised classification and unsupervised anomaly detection. Here are the key takeaways:

**1. Class imbalance demands careful handling.** With fraud making up only 0.172% of transactions, naive approaches fail. SMOTE oversampling on the training set allowed our models to learn meaningful fraud patterns without discarding the vast majority of the data.

**2. The right evaluation metric matters.** Accuracy is misleading for imbalanced datasets — a model predicting "legitimate" for every transaction achieves 99.8% accuracy while catching zero fraud. **AUPRC** and the precision-recall tradeoff provide a much more honest picture of model performance.

**3. Tree ensembles excel.** XGBoost and Random Forest consistently outperformed Logistic Regression, reflecting their ability to capture complex, non-linear patterns in the PCA-transformed features.

**4. Anomaly detection is a viable alternative.** Isolation Forest can detect fraud without any labeled examples, which is valuable when labeled fraud data is scarce or when dealing with previously unseen fraud patterns.

**5. Threshold tuning is essential for deployment.** The default 0.5 threshold is rarely optimal. The right threshold depends on the relative cost of false positives vs. false negatives — a decision that should be made in collaboration with business stakeholders.

**6. A small number of features drive most of the signal.** Features V14, V17, V12, and V10 consistently emerged as the most important across models, suggesting that fraud has a recognizable statistical signature in this data.

## Previous Analysis Review

An earlier version of this analysis (see `credit-card-fraud-detection-machine-learning-2021.ipynb`) approached the same dataset with a different methodology. Below are the key differences and the improvements made in this updated analysis.

### 1. Data Usage
- **Previous approach**: Created a small balanced subset of just **500 transactions** (250 fraud + 250 legitimate) by random sampling. This discarded 99.8% of the available data, severely limiting what the models could learn from the legitimate transaction patterns.
- **Updated approach**: Used the **full dataset** (284,807 transactions) with SMOTE to generate synthetic fraud examples only in the training set. This preserves all available information while still addressing the class imbalance.

### 2. Evaluation Metrics
- **Previous approach**: Used **accuracy** as the primary metric, reporting 96% accuracy for the best model. However, this was misleading — the model's precision on the fraud class was only 4%, meaning that 96 out of every 100 transactions flagged as fraud were actually legitimate.
- **Updated approach**: Used **AUPRC** (Area Under the Precision-Recall Curve) and **ROC-AUC** as primary metrics, along with detailed precision/recall analysis and threshold optimization. These metrics are specifically recommended by the dataset authors for evaluating performance on imbalanced data.

### 3. Data Leakage
- **Previous approach**: Trained models on a small subset and then evaluated them on the **full dataset**, which included the same samples used for training. This means the model was partially tested on data it had already seen, artificially inflating the reported results.
- **Updated approach**: Used a strict **stratified train/test split** with the scaler fitted only on training data, ensuring the test set is completely unseen during both training and preprocessing.

### 4. Model Selection
- **Previous approach**: Included **K-Means Clustering** as one of the four models. K-Means is an unsupervised clustering algorithm designed to group similar data points — it is not designed for supervised classification tasks and performed accordingly (42% accuracy, below random chance on the balanced subset).
- **Updated approach**: Replaced K-Means with **XGBoost** (a state-of-the-art gradient boosting algorithm) and added **Logistic Regression** as an interpretable baseline. For unsupervised analysis, used **Isolation Forest**, which is specifically designed for anomaly/outlier detection.

### 5. Feature Preprocessing
- **Previous approach**: No feature scaling was applied. The `Amount` and `Time` features were on very different scales compared to the PCA-transformed features (`V1`–`V28`), which can negatively impact distance-based models like KNN and clustering algorithms.
- **Updated approach**: Applied **StandardScaler** to `Amount` and the engineered `Hour` feature, and replaced the raw `Time` column with a more meaningful hour-of-day representation that captures daily transaction patterns.

### 6. Reproducibility
- **Previous approach**: No `random_state` was set for any model or data split, so results would change with every run. The code also used `DataFrame.append`, which was deprecated and then removed in pandas 2.0, so it no longer runs on current versions of the library.
- **Updated approach**: Set `random_state=42` across all random operations (splits, models, SMOTE), so results are reproducible from run to run. All code uses current, supported library APIs.
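
The `DataFrame.append` pattern mentioned above is usually replaced either by collecting rows in a list and building the frame once, or by `pd.concat`. A minimal sketch with made-up rows (the model names and scores are placeholders, not results from this analysis):

```python
import pandas as pd

# Made-up result rows for illustration
rows = [
    {"Model": "A", "AUPRC": 0.80},
    {"Model": "B", "AUPRC": 0.85},
]

# Preferred: collect dicts in a list, then build the DataFrame once
results = pd.DataFrame(rows)

# Adding rows later: concatenate instead of calling DataFrame.append
extra = pd.DataFrame([{"Model": "C", "AUPRC": 0.75}])
results = pd.concat([results, extra], ignore_index=True)
print(results)
```

The summary table in this notebook follows the first pattern: it appends dicts to a plain list and converts the list to a DataFrame at the end.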