# 🔥 Credit Card Fraud Detection - Interactive Notebook

This notebook provides an interactive exploration of credit card fraud detection using machine learning.

## 📋 Contents
1. Import Libraries
2. Load & Explore Data
3. Data Preprocessing
4. Model Training
5. Model Evaluation
6. Save Model

## 🎯 Goal
Build a machine learning model that accurately detects fraudulent credit card transactions.

## 1️⃣ Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    precision_score, recall_score, f1_score, roc_auc_score,
    roc_curve, precision_recall_curve
)
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
RANDOM_STATE = 42

## 2️⃣ Load & Explore Data

In [None]:
# Load dataset
df = pd.read_csv('data/creditcard.csv')

# Display basic info
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nFirst 5 rows:")
df.head()

In [None]:
# Class distribution
# sort_index() keeps the order 0, 1 so counts line up with the plot labels below
class_counts = df['Class'].value_counts().sort_index()
print("Class Distribution:")
print(class_counts)
print(f"\nFraud Ratio: {class_counts[1]/len(df)*100:.4f}%")

# Visualize
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
class_counts.plot(kind='bar', color=['#2ecc71', '#e74c3c'])
plt.title('Class Distribution (Count)')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
labels = ['Normal', 'Fraud']
sizes = class_counts.values
colors = ['#2ecc71', '#e74c3c']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.2f%%', startangle=90)
plt.title('Class Distribution (Percentage)')

plt.tight_layout()
plt.show()
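
With a fraud ratio this small, plain accuracy says very little: a classifier that labels every transaction as normal already scores close to 100%. The sketch below illustrates this with hypothetical counts (`n_normal` and `n_fraud` are illustrative values, not figures read from `creditcard.csv`).

```python
# Hypothetical class counts, chosen only for illustration.
n_normal = 99_800
n_fraud = 200

# A "model" that always predicts the majority class (0 = normal).
true_negatives = n_normal   # every normal transaction is labelled correctly
false_negatives = n_fraud   # every fraud is missed

accuracy = true_negatives / (n_normal + n_fraud)
fraud_recall = (n_fraud - false_negatives) / n_fraud  # no fraud is ever flagged

print(f"Accuracy of always predicting 'normal': {accuracy:.4f}")
print(f"Recall on the fraud class: {fraud_recall:.4f}")
```

This is why the evaluation later in the notebook reports precision, recall, and F1 for the fraud class instead of accuracy alone.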

In [None]:
# Transaction amount statistics
print("Transaction Amount Statistics:")
print(df['Amount'].describe())

print("\nFraud Transaction Amount Statistics:")
print(df[df['Class']==1]['Amount'].describe())

print("\nNormal Transaction Amount Statistics:")
print(df[df['Class']==0]['Amount'].describe())

## 3️⃣ Data Preprocessing

In [None]:
# Separate features and target
X = df.drop('Class', axis=1)
y = df['Class']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Split data first so the scaler never sees test-set statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print("\nTraining class distribution:")
print(np.bincount(y_train))

In [None]:
# Scale Time and Amount: fit on the training split only, then apply to both splits
scaler = StandardScaler()
scale_cols = ['Time', 'Amount']
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

print("✅ Features scaled successfully!")
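
Standardization computes a mean and standard deviation and applies `z = (x - mu) / sigma`. To keep the test set unseen, those statistics are usually estimated on the training split and reused for the test split. A minimal sketch of that idea in plain Python follows; the amounts are made up and the helper `standardize` is a hypothetical name, not part of scikit-learn.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts, already split into train and test.
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

# "Fit": the statistics come from the training split only.
mu = mean(train_amounts)
sigma = pstdev(train_amounts)  # population standard deviation, as StandardScaler uses

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_amounts, mu, sigma)
test_scaled = standardize(test_amounts, mu, sigma)

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and unit variance by construction; the test values generally do not, which is expected.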

In [None]:
# Apply SMOTE to balance training data
smote = SMOTE(random_state=RANDOM_STATE, sampling_strategy=0.5)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", np.bincount(y_train))
print("After SMOTE:", np.bincount(y_train_balanced))

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.bar(['Normal', 'Fraud'], np.bincount(y_train), color=['#3498db', '#e74c3c'])
plt.title('Before SMOTE')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
plt.bar(['Normal', 'Fraud'], np.bincount(y_train_balanced), color=['#3498db', '#e74c3c'])
plt.title('After SMOTE')
plt.ylabel('Count')

plt.tight_layout()
plt.show()
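
SMOTE builds each synthetic minority sample by taking a minority point, choosing one of its nearest minority-class neighbours, and interpolating between the two. The sketch below shows only that interpolation step on made-up 2-D points. The neighbour search is omitted, and `interpolate` is a hypothetical helper, not an imblearn function.

```python
import random

def interpolate(point, neighbour, rng):
    """Return a synthetic sample on the segment between two minority points."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(42)
minority_point = [1.0, 2.0]       # hypothetical fraud sample
minority_neighbour = [3.0, 6.0]   # one of its nearest fraud neighbours

synthetic = interpolate(minority_point, minority_neighbour, rng)
print(synthetic)
```

In the cell above, `sampling_strategy=0.5` asks SMOTE to generate such samples until the fraud class is half the size of the normal class in the training data.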

## 4️⃣ Model Training

In [None]:
# Train Logistic Regression
print("Training Logistic Regression...")
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE,
    class_weight='balanced'
)
lr_model.fit(X_train_balanced, y_train_balanced)
print("✅ Logistic Regression trained!")

In [None]:
# Train Random Forest
print("Training Random Forest...")
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    random_state=RANDOM_STATE,
    class_weight='balanced',
    n_jobs=-1
)
rf_model.fit(X_train_balanced, y_train_balanced)
print("✅ Random Forest trained!")

## 5️⃣ Model Evaluation

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """Evaluate model and display metrics."""
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    print(f"\nüéØ {model_name} Performance:")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1-Score: {f1_score(y_test, y_pred):.4f}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
    
    return y_pred, y_pred_proba

# Evaluate both models
lr_pred, lr_proba = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")
rf_pred, rf_proba = evaluate_model(rf_model, X_test, y_test, "Random Forest")
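
For reference, the precision, recall, and F1 values printed above all derive from three confusion-matrix counts for the fraud class. A small sketch with hypothetical counts (`fraud_metrics` is an illustrative helper, not a scikit-learn function):

```python
def fraud_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 frauds caught, 20 false alarms, 20 frauds missed.
precision, recall, f1 = fraud_metrics(tp=80, fp=20, fn=20)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

Precision answers "how many flagged transactions were really fraud?", while recall answers "how many frauds were flagged at all?".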

In [None]:
# Confusion Matrix Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Logistic Regression
cm_lr = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Logistic Regression - Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# Random Forest
cm_rf = confusion_matrix(y_test, rf_pred)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('Random Forest - Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
# ROC and Precision-Recall curves
plt.figure(figsize=(10, 5))

# ROC Curves
plt.subplot(1, 2, 1)
fpr_lr, tpr_lr, _ = roc_curve(y_test, lr_proba)
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_proba)

plt.plot(fpr_lr, tpr_lr, label=f'Logistic Reg (AUC = {roc_auc_score(y_test, lr_proba):.3f})')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {roc_auc_score(y_test, rf_proba):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.grid(True)

# Precision-Recall Curves
plt.subplot(1, 2, 2)
precision_lr, recall_lr, _ = precision_recall_curve(y_test, lr_proba)
precision_rf, recall_rf, _ = precision_recall_curve(y_test, rf_proba)

plt.plot(recall_lr, precision_lr, label='Logistic Regression')
plt.plot(recall_rf, precision_rf, label='Random Forest')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
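
Both curves trace every possible decision threshold, whereas `predict` effectively uses a cut-off of roughly 0.5 on the fraud probability. The sketch below shows how precision and recall trade off as the threshold moves. The labels and scores are hypothetical, and `precision_recall_at` is an illustrative helper.

```python
def precision_recall_at(threshold, labels, scores):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.40, 0.45, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(threshold, labels, scores)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold catches more fraud at the cost of more false alarms; which point to pick depends on the relative cost of each error.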

In [None]:
# Feature Importance (Random Forest)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(15), x='importance', y='feature', palette='viridis')
plt.title('Top 15 Feature Importances (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

## 6️⃣ Save Model

In [None]:
import pickle
import os

# Create model directory
os.makedirs('../backend/model', exist_ok=True)

# Save the Random Forest model (assumed best performer; confirm against the metrics above)
with open('../backend/model/fraud_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

# Save scaler
with open('../backend/model/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("✅ Model and scaler saved to ../backend/model/")
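
A consumer of these files would need to load both pickles and apply the scaler to the `Time` and `Amount` columns before calling the model. The sketch below shows one possible loading helper. `load_artifacts` is a hypothetical name, and the demo round-trips placeholder dictionaries through a temporary directory instead of the real model and scaler.

```python
import os
import pickle
import tempfile

def load_artifacts(model_dir):
    """Load the pickled model and scaler from model_dir."""
    with open(os.path.join(model_dir, "fraud_model.pkl"), "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(model_dir, "scaler.pkl"), "rb") as f:
        scaler = pickle.load(f)
    return model, scaler

# Round-trip demo with placeholder objects standing in for the real
# RandomForestClassifier and StandardScaler.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, obj in [("fraud_model.pkl", {"kind": "model"}),
                      ("scaler.pkl", {"kind": "scaler"})]:
        with open(os.path.join(tmp_dir, name), "wb") as f:
            pickle.dump(obj, f)
    model, scaler = load_artifacts(tmp_dir)

print(model, scaler)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.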

## 🎉 Summary

This notebook demonstrated:
- Loading and exploring credit card transaction data
- Handling severe class imbalance using SMOTE
- Training Logistic Regression and Random Forest models
- Evaluating models with appropriate metrics (Precision, Recall, F1, ROC-AUC)
- Visualizing results

The trained Random Forest model and the fitted scaler are saved under `../backend/model/` as a starting point for deployment.