<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/basic_models/AnomalyDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Fraud Detection

## Introduction
This notebook demonstrates the application of machine learning for credit card fraud detection. We'll address the challenges of highly imbalanced data and the need for high precision and recall.

## Dataset Overview
- Features: V1-V28 (PCA transformed), Time, Amount
- Target: Class (1: Fraud, 0: Normal)
- Total Transactions: 284,807
- Fraud Cases: 492 (0.172%)
- Highly imbalanced dataset

## Problem Statement
1. Identify fraudulent transactions
2. Minimize false positives
3. Maximize fraud detection rate
4. Handle class imbalance
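
To see why accuracy alone is a poor target for this problem, consider a hypothetical baseline that labels every transaction as normal. The sketch below uses only the counts from the dataset overview; the baseline itself is an illustration, not a model used later in the notebook.

```python
# Counts taken from the dataset overview above
total_transactions = 284_807
fraud_cases = 492
normal_cases = total_transactions - fraud_cases

# Hypothetical baseline: predict "normal" for every transaction
true_positives = 0
true_negatives = normal_cases
false_negatives = fraud_cases

accuracy = (true_positives + true_negatives) / total_transactions
recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {accuracy:.4%}")
print(f"Baseline fraud recall: {recall:.1%}")
```

The baseline scores very high on accuracy while catching no fraud at all, which is why the rest of the notebook reports precision, recall, and related metrics instead.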

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

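# Set plotting style. sns.set_theme() applies seaborn's default look;
# the 'seaborn' matplotlib style name was deprecated in matplotlib 3.6
# and fails on recent versions, so it is not used here.
sns.set_theme()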

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Data Loading and Exploration

In this section, we'll:
1. Load and examine the dataset
2. Analyze class distribution
3. Explore feature characteristics
4. Investigate time and amount distributions
5. Check for data quality issues

In [None]:
# Load the credit card fraud dataset
df = pd.read_csv('creditcard.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
display(df.head())

# Check data types and missing values
print("\nData Types:")
print(df.dtypes)

print("\nMissing Values:")
print(df.isnull().sum())

### Class Distribution Analysis

Let's examine the distribution of fraudulent vs. normal transactions and understand the extent of class imbalance.

In [None]:
# Analyze class distribution
class_dist = df['Class'].value_counts(normalize=True)
print("Class Distribution:")
print(class_dist)

# Visualize class distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Class')
plt.title('Class Distribution (0: Normal, 1: Fraud)')
plt.xlabel('Transaction Class')
plt.ylabel('Count')

# Add percentage labels
total = len(df['Class'])
for p in plt.gca().patches:
    percentage = f'{100 * p.get_height()/total:.2f}%'
    plt.annotate(percentage, (p.get_x() + p.get_width()/2., p.get_height()),
                 ha='center', va='bottom')

plt.show()

# Print detailed statistics
print("\nDetailed Class Statistics:")
print(f"Total Transactions: {len(df)}")
print(f"Normal Transactions: {len(df[df['Class'] == 0])}")
print(f"Fraudulent Transactions: {len(df[df['Class'] == 1])}")
print(f"Fraud Ratio: 1:{len(df[df['Class'] == 0])/len(df[df['Class'] == 1]):.0f}")

### Time and Amount Analysis

Examine the distributions of transaction times and amounts, the only two features that were not transformed by PCA.

In [None]:
# Time analysis
plt.figure(figsize=(15, 5))

# Distribution of transactions over time
plt.subplot(1, 2, 1)
plt.hist(df['Time'], bins=50)
plt.title('Distribution of Transactions Over Time')
plt.xlabel('Time (seconds)')
plt.ylabel('Number of Transactions')

# Time distribution by class
plt.subplot(1, 2, 2)
sns.boxplot(x='Class', y='Time', data=df)
plt.title('Transaction Time by Class')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Time (seconds)')

plt.tight_layout()
plt.show()

# Amount analysis
plt.figure(figsize=(15, 5))

# Distribution of transaction amounts
plt.subplot(1, 2, 1)
plt.hist(df['Amount'], bins=50)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Amount')
plt.ylabel('Number of Transactions')

# Amount distribution by class
plt.subplot(1, 2, 2)
sns.boxplot(x='Class', y='Amount', data=df)
plt.title('Transaction Amount by Class')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Amount')

plt.tight_layout()
plt.show()

# Print amount statistics by class
print("\nAmount Statistics by Class:")
print(df.groupby('Class')['Amount'].describe())

### Feature Analysis

Analyze the PCA-transformed features (V1-V28) to understand their characteristics and relationships with fraud.

In [None]:
# Select V features
v_features = [col for col in df.columns if col.startswith('V')]

# Calculate correlation with Class
correlations = df[v_features + ['Class']].corr()['Class'].sort_values()

# Drop the target's self-correlation (always 1.0) before ranking features
feature_correlations = correlations.drop('Class')
print("Top correlations with Class:")
print("\nMost negative correlations:")
print(feature_correlations.head())
print("\nMost positive correlations:")
print(feature_correlations.tail())

# Visualize top correlated features (3 most negative, 3 most positive)
plt.figure(figsize=(15, 5))
top_features = list(feature_correlations.head(3).index) + list(feature_correlations.tail(3).index)

for i, feature in enumerate(top_features, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x='Class', y=feature, data=df)
    plt.title(f'{feature} by Class')
    plt.xlabel('Class (0: Normal, 1: Fraud)')

plt.tight_layout()
plt.show()

### Feature Distributions

Examine the distributions of key features for normal and fraudulent transactions.

In [None]:
# Select top correlated features for detailed analysis
top_corr_features = top_features[:6]

# Create KDE plots
plt.figure(figsize=(15, 10))
for i, feature in enumerate(top_corr_features, 1):
    plt.subplot(2, 3, i)
    # fill=True replaces the deprecated shade=True argument
    sns.kdeplot(data=df[df['Class']==0][feature], label='Normal', fill=True)
    sns.kdeplot(data=df[df['Class']==1][feature], label='Fraud', fill=True)
    plt.title(f'{feature} Distribution by Class')
    plt.legend()

plt.tight_layout()
plt.show()

### Initial Findings

1. **Class Imbalance**:
   - Extreme imbalance (0.172% fraud cases)
   - Need for special handling in modeling
   - Importance of appropriate evaluation metrics

2. **Time and Amount**:
   - Transactions spread over ~2 days
   - Amount distributions differ between classes
   - Potential for feature engineering

3. **PCA Features**:
   - Several features show strong correlation with fraud
   - Clear separation between classes in some features
   - Potential for effective fraud detection

4. **Data Quality**:
   - No missing values
   - V1-V28 are PCA outputs; Time and Amount remain on their raw scales and still need scaling
   - Clean dataset ready for modeling

Next steps:
1. Feature engineering (especially for Time and Amount)
2. Proper scaling of all features
3. Implementation of class imbalance handling techniques

## Feature Engineering and Preprocessing

We'll prepare our data through:
1. Feature engineering for Time and Amount
2. Feature scaling and normalization
3. Feature selection based on importance
4. Data splitting with stratification

Given the nature of fraud detection, we'll focus on creating meaningful features while preserving the ability to detect anomalies.

### Time Feature Engineering

Transform the Time feature to extract meaningful patterns:

In [None]:
# Convert elapsed time to hours
df['Hour'] = df['Time'] / 3600  # Convert seconds to hours

# Create cyclical time features (24-hour period)
df['Hour_sin'] = np.sin(2 * np.pi * df['Hour']/24.0)
df['Hour_cos'] = np.cos(2 * np.pi * df['Hour']/24.0)

# Transaction density: number of transactions in the trailing 1-hour window.
# An integer window would count rows rather than seconds, so a time-based
# window is used. Assumes rows are sorted by Time, as in the original dataset.
window_size = 3600  # 1 hour window, in seconds
time_index = pd.to_datetime(df['Time'].to_numpy(), unit='s')
df['Trans_density'] = (
    pd.Series(1, index=time_index)
    .rolling(f'{window_size}s')
    .count()
    .to_numpy()
)

# Visualize new time features
plt.figure(figsize=(15, 5))

# Plot cyclical features
plt.subplot(1, 2, 1)
plt.scatter(df['Hour_sin'], df['Hour_cos'], c=df['Class'], 
            alpha=0.5, cmap='coolwarm')
plt.title('Cyclical Time Features')
plt.xlabel('Hour_sin')
plt.ylabel('Hour_cos')

# Plot transaction density
plt.subplot(1, 2, 2)
sns.boxplot(x='Class', y='Trans_density', data=df)
plt.title('Transaction Density by Class')

plt.tight_layout()
plt.show()
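
The reason for encoding the hour with a sine/cosine pair is that raw hour values treat 23:00 and 00:00 as far apart, while on the circle they are neighbors. A minimal sketch of that effect follows; `encode_hour` is a helper name introduced here for illustration only.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day onto the unit circle (24-hour period)."""
    angle = 2 * np.pi * hour / 24.0
    return np.array([np.sin(angle), np.cos(angle)])

late, midnight, early = encode_hour(23), encode_hour(0), encode_hour(1)

# Raw hours put 23:00 and 00:00 at opposite ends of the scale
raw_gap = abs(23 - 0)

# On the circle, 23:00 and 01:00 are equally close to midnight
encoded_gap_late = np.linalg.norm(late - midnight)
encoded_gap_early = np.linalg.norm(early - midnight)

print(f"Raw gap between 23h and 0h: {raw_gap}")
print(f"Encoded gap 23h to 0h: {encoded_gap_late:.3f}")
print(f"Encoded gap 1h to 0h:  {encoded_gap_early:.3f}")
```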

### Amount Feature Engineering

Create features based on transaction amounts:

In [None]:
# Log transform amount
df['Amount_log'] = np.log1p(df['Amount'])

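# Calculate amount statistics over the trailing 1-hour window
# (time-based, so the window spans seconds rather than rows)
time_index = pd.to_datetime(df['Time'].to_numpy(), unit='s')
amount_rolling = pd.Series(df['Amount'].to_numpy(), index=time_index).rolling(f'{window_size}s')
df['Amount_mean_window'] = amount_rolling.mean().to_numpy()
df['Amount_std_window'] = amount_rolling.std().to_numpy()

# Calculate z-score of amount; a zero rolling std would give inf, so treat it as missing
df['Amount_zscore'] = (
    (df['Amount'] - df['Amount_mean_window']) / df['Amount_std_window']
).replace([np.inf, -np.inf], np.nan)

# Fill NaN values with means (assign back rather than using inplace on a column)
amount_features = ['Amount_mean_window', 'Amount_std_window', 'Amount_zscore']
for feature in amount_features:
    df[feature] = df[feature].fillna(df[feature].mean())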

# Visualize new amount features
plt.figure(figsize=(15, 5))

# Plot amount log distribution
plt.subplot(1, 3, 1)
sns.boxplot(x='Class', y='Amount_log', data=df)
plt.title('Log Amount by Class')

# Plot amount z-score
plt.subplot(1, 3, 2)
sns.boxplot(x='Class', y='Amount_zscore', data=df)
plt.title('Amount Z-score by Class')

# Plot amount statistics
plt.subplot(1, 3, 3)
sns.scatterplot(data=df, x='Amount_mean_window', y='Amount_std_window',
                hue='Class', alpha=0.5)
plt.title('Amount Statistics Window')

plt.tight_layout()
plt.show()
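
Rolling statistics can be computed over a fixed number of rows or over a span of time, and the two differ whenever transactions arrive unevenly. The sketch below contrasts them on a toy series; the timestamps and amounts are made up for illustration.

```python
import pandas as pd

# Toy data: three quick transactions, then two after a long gap, then one more
seconds = [0, 10, 20, 4000, 4010, 8000]
amounts = [5.0, 7.0, 6.0, 100.0, 8.0, 9.0]

# A datetime index lets pandas interpret the window as a duration
toy = pd.Series(amounts, index=pd.to_datetime(seconds, unit='s'))

# Time-based: transactions in the trailing 3600 seconds
time_counts = toy.rolling('3600s').count()
time_means = toy.rolling('3600s').mean()

# Row-based: the last 3 rows, regardless of how far apart they are in time
row_counts = toy.rolling(window=3, min_periods=1).count()

print(pd.DataFrame({
    'amount': toy,
    'count_1h': time_counts,
    'mean_1h': time_means,
    'count_3rows': row_counts,
}))
```

The time-based count drops back after each gap, while the row-based count stays at the window size once enough rows have been seen.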

### Feature Interactions

Create interaction features between important V features:

In [None]:
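# Get the three V features most correlated with Class (by absolute value),
# excluding the target's self-correlation
top_v_features = correlations.drop('Class').abs().sort_values(ascending=False).head(3).index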

# Create interaction features
for i in range(len(top_v_features)):
    for j in range(i+1, len(top_v_features)):
        feat1, feat2 = top_v_features[i], top_v_features[j]
        df[f'{feat1}_{feat2}_mult'] = df[feat1] * df[feat2]
        df[f'{feat1}_{feat2}_diff'] = df[feat1] - df[feat2]

# Analyze new interaction features
interaction_features = [col for col in df.columns if '_mult' in col or '_diff' in col]

# Plot distributions of interaction features
plt.figure(figsize=(15, 5))
for i, feature in enumerate(interaction_features[:3], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='Class', y=feature, data=df)
    plt.title(f'{feature} by Class')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

### Feature Selection and Scaling

Select and scale features for modeling:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Prepare feature set
feature_columns = (
    v_features +  # Original V features
    ['Hour_sin', 'Hour_cos', 'Trans_density'] +  # Time features
    ['Amount_log', 'Amount_zscore', 'Amount_mean_window', 'Amount_std_window'] +  # Amount features
    interaction_features  # Interaction features
)

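# Select features using ANOVA F-value
selector = SelectKBest(f_classif, k=30)  # Select top 30 features
X = df[feature_columns]
y = df['Class']

# Fit the selector and scaler on training rows only, to avoid leaking
# test-set information. The same test_size, random_state, and stratify
# arguments are used in the train-test split below, so the rows match.
train_idx, _ = train_test_split(
    df.index, test_size=0.2, random_state=42, stratify=y
)

# Fit selector
selector.fit(X.loc[train_idx], y.loc[train_idx])
selected_features_mask = selector.get_support()
selected_features = X.columns[selected_features_mask].tolist()

# Create final feature matrix
X_selected = df[selected_features]

# Scale features (statistics from training rows, applied to all rows)
scaler = StandardScaler()
scaler.fit(X_selected.loc[train_idx])
X_scaled = pd.DataFrame(
    scaler.transform(X_selected),
    columns=selected_features,
    index=X_selected.index
)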

# Display selected features and their scores
feature_scores = pd.DataFrame({
    'Feature': selected_features,
    'Score': selector.scores_[selected_features_mask]
}).sort_values('Score', ascending=False)

print("Top 10 Selected Features:")
display(feature_scores.head(10))

# Visualize feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Score', y='Feature', data=feature_scores.head(10))
plt.title('Top 10 Features by F-score')
plt.show()
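
One common way to keep selection and scaling tied to the training data is to wrap them in a scikit-learn `Pipeline`, so that `fit` only ever sees training rows. The following is a minimal sketch on synthetic data, not the notebook's dataset; the sample sizes, `k=10`, and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data (roughly 2% positives)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, n_informative=5,
    weights=[0.98, 0.02], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Selection and scaling are fitted inside the pipeline, on training data only
pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
pipeline.fit(X_tr, y_tr)

n_selected = int(pipeline.named_steps['select'].get_support().sum())
test_prob = pipeline.predict_proba(X_te)[:, 1]
print(f"Features kept: {n_selected}, test rows scored: {len(test_prob)}")
```

A pipeline also plugs directly into cross-validation utilities, which refit every step per fold.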

### Train-Test Split

Split the data while preserving class distribution:

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Print split sizes
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

# Verify class distribution in splits
print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))
print("\nClass distribution in testing set:")
print(y_test.value_counts(normalize=True))
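
The effect of `stratify` is easiest to see on a toy label vector. The sketch below uses made-up labels with 1% positives and compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 100 positives among 10,000 samples (1%)
y_toy = np.zeros(10_000, dtype=int)
y_toy[:100] = 1
X_toy = np.arange(10_000).reshape(-1, 1)

# Stratified split keeps the positive rate the same in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Plain random split lets the positive count in the test set drift
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print("Stratified positives (train, test):", y_tr_strat.sum(), y_te_strat.sum())
print("Unstratified positives (train, test):", y_tr_plain.sum(), y_te_plain.sum())
```

With so few fraud cases, an unstratified split can leave the test set with noticeably more or fewer positives than expected, which makes recall estimates noisier.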

### Feature Engineering Summary

1. **Time Features**:
   - Converted to cyclical representation
   - Added transaction density
   - Captured temporal patterns

2. **Amount Features**:
   - Log transformation
   - Rolling statistics
   - Z-score normalization

3. **Interaction Features**:
   - Created from top correlated features
   - Captured feature relationships
   - Intended to add predictive signal (to be confirmed during modeling)

4. **Feature Selection**:
   - Selected top 30 features
   - Used ANOVA F-scores
   - Drawn from both original and engineered features

Next steps:
1. Model implementation
2. Class imbalance handling
3. Model evaluation with focus on fraud detection

## Model Implementation

We'll implement several models suitable for fraud detection:
1. Logistic Regression (baseline)
2. Random Forest
3. XGBoost
4. LightGBM

For each model, we'll:
- Initialize with class weights
- Train with appropriate parameters
- Evaluate with fraud-detection metrics
- Analyze prediction probabilities

In [None]:
from sklearn.metrics import (
    confusion_matrix, classification_report, precision_recall_curve, roc_curve, auc,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

def evaluate_fraud_model(y_true, y_pred, y_prob, model_name):
    """Evaluate model with fraud-specific metrics."""
    
    # Calculate metrics
    cm = confusion_matrix(y_true, y_pred)
    cr = classification_report(y_true, y_pred)
    
    # Calculate ROC and PR curves
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    
    # Plot results
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # Confusion Matrix
    sns.heatmap(cm, annot=True, fmt='d', ax=ax1)
    ax1.set_title(f'{model_name} Confusion Matrix')
    ax1.set_xlabel('Predicted')
    ax1.set_ylabel('Actual')
    
    # ROC Curve
    ax2.plot(fpr, tpr, label=f'ROC curve (AUC = {auc(fpr, tpr):.3f})')
    ax2.plot([0, 1], [0, 1], 'k--')
    ax2.set_title(f'{model_name} ROC Curve')
    ax2.set_xlabel('False Positive Rate')
    ax2.set_ylabel('True Positive Rate')
    ax2.legend()
    
    # Precision-Recall Curve
    ax3.plot(recall, precision)
    ax3.set_title(f'{model_name} Precision-Recall Curve')
    ax3.set_xlabel('Recall')
    ax3.set_ylabel('Precision')
    
    # Probability Distribution
    sns.histplot(data=pd.DataFrame({
        'Probability': y_prob,
        'Class': y_true
    }), x='Probability', hue='Class', bins=50, ax=ax4)
    ax4.set_title(f'{model_name} Probability Distribution')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n{model_name} Classification Report:")
    print(cr)
    
    return {
        'confusion_matrix': cm,
        'classification_report': cr,
        'roc_auc': auc(fpr, tpr)
    }
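
The precision-recall curve can also be summarized as a single number, average precision, which is often more informative than ROC-AUC when positives are rare. The sketch below computes both with scikit-learn on a four-sample toy example; the labels and scores are illustrative values, not notebook results.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy example: two negatives, two positives, one positive ranked below a negative
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

roc_auc = roc_auc_score(y_true_toy, y_score_toy)
avg_precision = average_precision_score(y_true_toy, y_score_toy)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

`average_precision_score` takes the same `(y_true, y_prob)` arguments as the ROC functions, so it could be reported alongside `roc_auc` if desired.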

### Logistic Regression (Baseline Model)

Implement a baseline model with class weights:

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train logistic regression
lr_model = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)

lr_model.fit(X_train, y_train)

# Make predictions
lr_pred = lr_model.predict(X_test)
lr_prob = lr_model.predict_proba(X_test)[:, 1]

# Evaluate model
lr_results = evaluate_fraud_model(y_test, lr_pred, lr_prob, 'Logistic Regression')

# Feature importance
lr_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': abs(lr_model.coef_[0])
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=lr_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Features (Logistic Regression)')
plt.show()
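
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula with the class counts from the dataset overview; it uses the full-dataset counts rather than the training split, so the exact values in the notebook will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the class counts from the dataset overview
y_full = np.zeros(284_807, dtype=int)
y_full[:492] = 1

classes = np.array([0, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_full)

# The same weights computed by hand
manual = len(y_full) / (len(classes) * np.bincount(y_full))

print(f"Weight for normal (0): {weights[0]:.4f}")
print(f"Weight for fraud (1):  {weights[1]:.2f}")
```

Each fraud example therefore contributes several hundred times more to the loss than a normal one, which is what pushes the model toward higher recall.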

### Random Forest

Implement Random Forest with balanced class weights:

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train Random Forest
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=10,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)

rf_model.fit(X_train, y_train)

# Make predictions
rf_pred = rf_model.predict(X_test)
rf_prob = rf_model.predict_proba(X_test)[:, 1]

# Evaluate model
rf_results = evaluate_fraud_model(y_test, rf_pred, rf_prob, 'Random Forest')

# Feature importance
rf_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=rf_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Features (Random Forest)')
plt.show()

### XGBoost

Implement XGBoost with scale_pos_weight for imbalance:

In [None]:
from xgboost import XGBClassifier

# Calculate scale_pos_weight
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])

# Initialize and train XGBoost
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    n_jobs=-1,
    random_state=42
)

xgb_model.fit(X_train, y_train)

# Make predictions
xgb_pred = xgb_model.predict(X_test)
xgb_prob = xgb_model.predict_proba(X_test)[:, 1]

# Evaluate model
xgb_results = evaluate_fraud_model(y_test, xgb_pred, xgb_prob, 'XGBoost')

# Feature importance
xgb_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=xgb_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Features (XGBoost)')
plt.show()
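
The `scale_pos_weight` value is the ratio of negative to positive examples, which is the same ratio that balanced class weights imply. A short arithmetic check, assuming the full-dataset counts from the overview (the notebook itself computes the value on the training split):

```python
# Class counts from the dataset overview
n_total = 284_807
n_positive = 492
n_negative = n_total - n_positive

# Ratio used for scale_pos_weight
scale_pos_weight_full = n_negative / n_positive

# Balanced class weights: n_samples / (n_classes * class_count)
weight_negative = n_total / (2 * n_negative)
weight_positive = n_total / (2 * n_positive)
balanced_ratio = weight_positive / weight_negative

print(f"scale_pos_weight: {scale_pos_weight_full:.2f}")
print(f"Balanced weight ratio (fraud / normal): {balanced_ratio:.2f}")
```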

### LightGBM

Implement LightGBM with built-in class weights:

In [None]:
from lightgbm import LGBMClassifier

# Initialize and train LightGBM
lgb_model = LGBMClassifier(
    n_estimators=200,
    num_leaves=31,
    learning_rate=0.1,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)

lgb_model.fit(X_train, y_train)

# Make predictions
lgb_pred = lgb_model.predict(X_test)
lgb_prob = lgb_model.predict_proba(X_test)[:, 1]

# Evaluate model
lgb_results = evaluate_fraud_model(y_test, lgb_pred, lgb_prob, 'LightGBM')

# Feature importance
lgb_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=lgb_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Features (LightGBM)')
plt.show()

### Model Comparison

Compare all models' performance:

In [None]:
# Collect all results
models = {
    'Logistic Regression': (lr_pred, lr_prob),
    'Random Forest': (rf_pred, rf_prob),
    'XGBoost': (xgb_pred, xgb_prob),
    'LightGBM': (lgb_pred, lgb_prob)
}

# Compare ROC curves
plt.figure(figsize=(10, 8))
for name, (_, y_prob) in models.items():
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.3f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.show()

# Compare metrics
results = []
for name, (y_pred, y_prob) in models.items():
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob)
    })

results_df = pd.DataFrame(results)
print("Model Performance Comparison:")
display(results_df.round(3))

### Model Implementation Summary

1. **Model Performance**:
   - Best performing model: [based on results]
   - Trade-offs between precision and recall
   - ROC-AUC scores comparison

2. **Feature Importance**:
   - Consistent important features across models
   - Engineered features effectiveness
   - Model-specific feature rankings

3. **Practical Considerations**:
   - Model complexity vs. performance
   - Training time requirements
   - Prediction speed needs

Next steps:
1. Hyperparameter tuning
2. Ensemble methods
3. Threshold optimization
4. Model deployment preparation
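
As a preview of threshold optimization, one simple approach is to pick the probability cutoff that maximizes F1 on a validation set instead of using the default 0.5. The sketch below uses simulated labels and scores; the score distribution and variable names are assumptions for illustration, and in practice the threshold should be chosen on validation data, not on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(42)

# Simulated validation labels (about 2% positives) and model scores;
# positives score higher on average
y_val = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.2 + 0.4 * y_val, 0.15), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# The final precision/recall pair has no matching threshold, so drop it
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_idx = int(np.argmax(f1))
best_threshold = thresholds[best_idx]
f1_default = f1_score(y_val, (scores >= 0.5).astype(int))

print(f"Best threshold: {best_threshold:.3f} (F1 = {f1[best_idx]:.3f})")
print(f"F1 at default 0.5 threshold: {f1_default:.3f}")
```

F1 is only one possible criterion; a cost-based rule that weighs missed fraud against false alarms may fit the business problem better.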

## Handling Class Imbalance

We'll address the severe class imbalance (0.172% fraud) using multiple techniques:
1. Resampling methods (SMOTE, ADASYN)
2. Undersampling techniques
3. Combination approaches
4. Cost-sensitive learning

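We'll evaluate each approach with XGBoost, using the same hyperparameters as in the model comparison but without `scale_pos_weight`, so that differences come from the resampling method.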

In [None]:
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN
from collections import Counter

def plot_class_distribution(y, title):
    plt.figure(figsize=(8, 6))
    sns.countplot(x=y)
    plt.title(f'Class Distribution - {title}')
    plt.xlabel('Class')
    plt.ylabel('Count')
    
    # Add percentage labels
    total = len(y)
    for p in plt.gca().patches:
        percentage = f'{100 * p.get_height()/total:.2f}%'
        plt.annotate(percentage, (p.get_x() + p.get_width()/2., p.get_height()),
                     ha='center', va='bottom')
    plt.show()
    
    print(f"\nClass distribution in {title}:")
    print(pd.Series(y).value_counts(normalize=True))

### SMOTE (Synthetic Minority Over-sampling Technique)

Apply SMOTE to create synthetic samples of the minority class:

In [None]:
# Apply SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Plot new distribution
plot_class_distribution(y_train_smote, 'SMOTE')

# Train model on SMOTE-balanced data
best_model_smote = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

best_model_smote.fit(X_train_smote, y_train_smote)

# Evaluate
smote_pred = best_model_smote.predict(X_test)
smote_prob = best_model_smote.predict_proba(X_test)[:, 1]

smote_results = evaluate_fraud_model(y_test, smote_pred, smote_prob, 'SMOTE')
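
To make the idea behind SMOTE concrete: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. The NumPy sketch below is a simplified illustration of that interpolation, not imbalanced-learn's implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)

    # Pairwise distances between minority samples (self-distance excluded)
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)

    # Indices of the k nearest minority neighbors of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate a random fraction of the way from base to neighbor
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_min, n_new=10)
print(synthetic.round(3))
```

Because new points only fill in space between existing fraud cases, resampling is applied to the training data alone and the test set is left untouched, as in the cell above.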

### ADASYN (Adaptive Synthetic Sampling)

Apply ADASYN to generate synthetic samples based on density distribution:

In [None]:
# Apply ADASYN
adasyn = ADASYN(sampling_strategy='auto', random_state=42)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)

# Plot new distribution
plot_class_distribution(y_train_adasyn, 'ADASYN')

# Train model on ADASYN-balanced data
best_model_adasyn = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

best_model_adasyn.fit(X_train_adasyn, y_train_adasyn)

# Evaluate
adasyn_pred = best_model_adasyn.predict(X_test)
adasyn_prob = best_model_adasyn.predict_proba(X_test)[:, 1]

adasyn_results = evaluate_fraud_model(y_test, adasyn_pred, adasyn_prob, 'ADASYN')

### Combination Approaches

Apply SMOTE with Tomek links and SMOTE with ENN:

In [None]:
# SMOTE + Tomek links
smote_tomek = SMOTETomek(random_state=42)
X_train_smt, y_train_smt = smote_tomek.fit_resample(X_train, y_train)

# SMOTE + ENN
smote_enn = SMOTEENN(random_state=42)
X_train_smenn, y_train_smenn = smote_enn.fit_resample(X_train, y_train)

# Plot distributions (plot_class_distribution creates and shows its own
# figure, so it is called directly rather than inside subplots)
plot_class_distribution(y_train_smt, 'SMOTE + Tomek')
plot_class_distribution(y_train_smenn, 'SMOTE + ENN')

# Train and evaluate models
# SMOTE + Tomek
best_model_smt = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

best_model_smt.fit(X_train_smt, y_train_smt)
smt_pred = best_model_smt.predict(X_test)
smt_prob = best_model_smt.predict_proba(X_test)[:, 1]

smt_results = evaluate_fraud_model(y_test, smt_pred, smt_prob, 'SMOTE + Tomek')

# SMOTE + ENN
best_model_smenn = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

best_model_smenn.fit(X_train_smenn, y_train_smenn)
smenn_pred = best_model_smenn.predict(X_test)
smenn_prob = best_model_smenn.predict_proba(X_test)[:, 1]

smenn_results = evaluate_fraud_model(y_test, smenn_pred, smenn_prob, 'SMOTE + ENN')

### Random Undersampling

Apply random undersampling to balance classes:

In [None]:
# Apply random undersampling
rus = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Plot new distribution
plot_class_distribution(y_train_rus, 'Random Undersampling')

# Train model on undersampled data
best_model_rus = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

best_model_rus.fit(X_train_rus, y_train_rus)

# Evaluate
rus_pred = best_model_rus.predict(X_test)
rus_prob = best_model_rus.predict_proba(X_test)[:, 1]

rus_results = evaluate_fraud_model(y_test, rus_pred, rus_prob, 'Random Undersampling')
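
Random undersampling amounts to keeping every minority sample and a random subset of the majority class of equal size. The NumPy sketch below shows the mechanics on toy data; `random_undersample` is a hypothetical helper, not the imbalanced-learn class used above.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Keep all positives and an equally sized random subset of negatives."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)

    # Sample majority rows without replacement to match the minority count
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Toy data: 50 positives among 5,000 samples
y_toy = np.zeros(5000, dtype=int)
y_toy[:50] = 1
X_toy = np.arange(5000, dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X_toy, y_toy)
discarded = 1 - len(y_bal) / len(y_toy)
print(f"Balanced size: {len(y_bal)}, positives: {y_bal.sum()}")
print(f"Share of rows discarded: {discarded:.1%}")
```

The large share of discarded rows is the main cost of this method: the model trains quickly but sees only a small sample of normal behavior.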

### Comparison of Resampling Techniques

In [None]:
# Collect all results
resampling_results = {
    'Original (scale_pos_weight)': (xgb_pred, xgb_prob),
    'SMOTE': (smote_pred, smote_prob),
    'ADASYN': (adasyn_pred, adasyn_prob),
    'SMOTE + Tomek': (smt_pred, smt_prob),
    'SMOTE + ENN': (smenn_pred, smenn_prob),
    'Random Undersampling': (rus_pred, rus_prob)
}

# Compare ROC curves
plt.figure(figsize=(12, 8))
for name, (_, y_prob) in resampling_results.items():
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.3f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Resampling Techniques Comparison')
plt.legend()
plt.show()

# Compare metrics
results = []
for name, (y_pred, y_prob) in resampling_results.items():
    results.append({
        'Method': name,
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob)
    })

results_df = pd.DataFrame(results)
print("Resampling Methods Comparison:")
display(results_df.round(3))

### Class Imbalance Handling Summary

1. **Resampling Effects**:
   - Impact on model performance
   - Trade-offs between different techniques
   - Best approach for fraud detection

2. **Method Comparison**:
   - SMOTE vs ADASYN effectiveness
   - Combination methods benefits
   - Undersampling considerations

3. **Practical Implications**:
   - Model stability
   - Computational requirements
   - Real-world applicability

4. **Recommendations**:
   - Best method for production
   - Implementation considerations
   - Monitoring requirements

Next steps:
1. Fine-tune best performing approach
2. Implement ensemble with different sampling methods
3. Develop monitoring strategy for production