# Credit Card Fraud Detection: Data Preparation & Model Training

This notebook demonstrates data preprocessing, class balancing, model training (Random Forest with GridSearchCV), evaluation, and model saving for credit card fraud detection using the Kaggle dataset.

## 1. Import Required Libraries
Import the necessary libraries for data processing, modeling, and saving the trained model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.utils import resample
import joblib
import os

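If this cell fails with `ModuleNotFoundError: No module named 'sklearn'`, install scikit-learn in the notebook's environment (for example, `pip install scikit-learn`) and rerun the cell.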

## 2. Load the Dataset
Load the Kaggle credit card fraud dataset from the data folder.

In [None]:
# Adjust the path if your dataset filename is different
DATA_PATH = '../data/creditcard.csv'
df = pd.read_csv(DATA_PATH)
df.head()
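
A missing file or an unexpected schema is easier to diagnose if the load step checks for both up front. The following is a minimal sketch, not part of the original notebook: `load_dataset` is a hypothetical helper, and the demo uses a tiny stand-in CSV with made-up values rather than the real dataset.

```python
import os
import tempfile

import pandas as pd


def load_dataset(path, label_col='Class'):
    """Read the CSV at `path`, failing early with a clear message."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found at {path!r}")
    frame = pd.read_csv(path)
    if label_col not in frame.columns:
        raise ValueError(f"Expected a {label_col!r} column, got {list(frame.columns)}")
    return frame


# Demo on a tiny stand-in file (made-up values, not the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    pd.DataFrame({'Amount': [1.0, 2.5], 'Class': [0, 1]}).to_csv(demo_path, index=False)
    demo_df = load_dataset(demo_path)

print(demo_df.shape)
```

In the notebook itself, the same helper could be called with `DATA_PATH` in place of the stand-in file.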

## 3. Data Cleaning & Class Balancing
Drop any rows with missing values, then undersample the majority (non-fraud) class to a 5:1 ratio against the minority (fraud) class.

In [None]:
# Drop missing values (if any)
df = df.dropna()

# Check class distribution
print('Class distribution before balancing:')
print(df['Class'].value_counts())

# Undersample majority class for a 5:1 ratio
df_majority = df[df['Class'] == 0]
df_minority = df[df['Class'] == 1]
df_majority_downsampled = resample(
    df_majority,
    replace=False,                    # sample without replacement
    n_samples=len(df_minority) * 5,   # 5:1 ratio
    random_state=42,
)
df_balanced = pd.concat([df_majority_downsampled, df_minority])
df_balanced = df_balanced.sample(frac=1, random_state=42)  # Shuffle

print('Class distribution after balancing:')
print(df_balanced['Class'].value_counts())
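
The undersampling step can also be wrapped in a reusable function. The sketch below is illustrative only: `undersample` and the toy DataFrame are assumptions, not part of the original notebook. It also caps the sample size at the number of available majority rows, since `resample(..., replace=False)` raises an error when asked for more rows than exist.

```python
import pandas as pd
from sklearn.utils import resample


def undersample(frame, label_col='Class', ratio=5, random_state=42):
    """Downsample label 0 to at most `ratio` rows per label-1 row."""
    majority = frame[frame[label_col] == 0]
    minority = frame[frame[label_col] == 1]
    # Never request more majority rows than are available
    n_keep = min(len(majority), len(minority) * ratio)
    majority_down = resample(
        majority, replace=False, n_samples=n_keep, random_state=random_state
    )
    # Concatenate and shuffle so the two classes are interleaved
    return pd.concat([majority_down, minority]).sample(frac=1, random_state=random_state)


# Toy data: 1,000 legitimate rows and 10 fraud rows (made-up values)
toy_df = pd.DataFrame({'Amount': range(1010), 'Class': [0] * 1000 + [1] * 10})
toy_balanced = undersample(toy_df)
print(toy_balanced['Class'].value_counts())
```

On the toy data this keeps 50 legitimate rows and all 10 fraud rows.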

## 4. Train/Test Split
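Split the balanced data into training and testing sets (80/20 split, stratified by class). Because the split happens after undersampling, the test set has the same 5:1 class ratio as the training set rather than the original class distribution.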

In [None]:
X = df_balanced.drop('Class', axis=1)
y = df_balanced['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train set: {X_train.shape}, Test set: {X_test.shape}")
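
To see what `stratify=y` does, the following sketch runs the same kind of split on a toy label vector with a 5:1 ratio and compares the fraud fraction in each part. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 5:1 class ratio: 500 negatives, 100 positives
X_toy = np.arange(600).reshape(-1, 1)
y_toy = np.array([0] * 500 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both parts keep roughly the same positive fraction
train_frac = y_tr.mean()
test_frac = y_te.mean()
print(f"train: {len(y_tr)} rows, positive fraction {train_frac:.3f}")
print(f"test:  {len(y_te)} rows, positive fraction {test_frac:.3f}")
```

Without `stratify`, the positive fraction in a small test set can drift noticeably from one random seed to another.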

## 5. Model Training & Hyperparameter Tuning
Tune a Random Forest classifier with GridSearchCV, using 3-fold cross-validation and selecting the hyperparameter combination that maximizes recall on the fraud class.

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
grid = GridSearchCV(rf, param_grid, cv=3, scoring='recall', verbose=2)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
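
Before launching the search, it can help to estimate how many models will be trained. This short sketch counts the candidates for the grid above with scikit-learn's `ParameterGrid`; the names `grid_spec` and `cv_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
grid_spec = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}
cv_folds = 3

n_candidates = len(ParameterGrid(grid_spec))  # 2 * 3 * 2 combinations
n_fits = n_candidates * cv_folds              # one fit per candidate per fold
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV then trains one more model on the full training set using the best combination.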

## 6. Model Evaluation
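Evaluate the best model from the grid search on the test set, reporting accuracy, per-class precision and recall, and the confusion matrix. With a 5:1 class ratio, a model that always predicts "not fraud" would already reach about 83% accuracy, so fraud-class recall and precision are the more informative figures.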

In [None]:
y_pred = grid.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
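
The link between the confusion matrix and the reported precision and recall can be worked through on a small example. The labels below are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 legitimate (0) and 4 fraud (1) transactions
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud transactions that were flagged

# The manual values match scikit-learn's own metric functions
assert np.isclose(precision, precision_score(y_true_toy, y_pred_toy))
assert np.isclose(recall, recall_score(y_true_toy, y_pred_toy))
print(f"tn={tn} fp={fp} fn={fn} tp={tp} precision={precision:.2f} recall={recall:.2f}")
```

Here one fraud case is missed and one legitimate transaction is flagged, so precision and recall are both 0.75.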

## 7. Save the Trained Model
Save the best estimator to disk for use in the API.

In [None]:
MODEL_PATH = '../src/model/rf_model.joblib'
# Create the target directory if it does not exist yet
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
joblib.dump(grid.best_estimator_, MODEL_PATH)
print(f"Model saved to {MODEL_PATH}")
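
On the consuming side, the saved file can be restored with `joblib.load`. The sketch below shows a save and load round trip on a small stand-in model written to a temporary directory. The toy data and variable names are assumptions, and how the API actually loads the model is not shown in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on made-up, trivially separable data
X_demo = np.array([[0.0], [0.1], [0.9], [1.0]] * 10)
y_demo = np.array([0, 0, 1, 1] * 10)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_model_path = os.path.join(tmp_dir, 'rf_model.joblib')
    joblib.dump(demo_model, demo_model_path)
    restored_model = joblib.load(demo_model_path)

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

A joblib file is a pickle underneath, so it is safest to load it with the same scikit-learn version that produced it, and only from a trusted source.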