# Credit Card Fraud Detection: Data Preprocessing
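This notebook walks through the preprocessing steps that prepare the raw credit card transaction dataset for training an XGBoost model. The dataset is severely imbalanced (fraud makes up roughly 0.17% of the training rows after the split in Step 4), and the 'Amount' feature is on a different scale from the PCA-transformed features, so it is standardized before resampling.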

In [1]:
# Import required libraries
import os
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Cap the CPU count seen by loky (joblib's process backend); adjust '8' to your machine
os.environ['LOKY_MAX_CPU_COUNT'] = '8'

In [3]:
# Load the dataset
df = pd.read_csv('data/creditcard.csv')
df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


## Step 1: Drop Unnecessary Columns
The 'Time' column is dropped because, in its raw form, it does not contribute meaningfully to classification.

In [4]:
# Drop the 'Time' column
df = df.drop(columns=['Time'])
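
If the timing signal turns out to matter, one alternative to discarding it outright is to derive a bounded feature from 'Time' before dropping the raw column. The sketch below is illustrative only: it assumes 'Time' holds seconds elapsed since the first transaction in the file, and the toy DataFrame and the `Hour` column name are hypothetical, not part of this notebook's pipeline.

```python
import pandas as pd

# Toy frame standing in for the real data; 'Time' is assumed to be
# seconds elapsed since the first recorded transaction.
toy = pd.DataFrame({'Time': [0.0, 3600.0, 86399.0, 90000.0]})

# Hypothetical derived feature: hour offset within a 24-hour cycle.
toy['Hour'] = (toy['Time'] // 3600 % 24).astype(int)

# The raw column can then be dropped, as in the cell above.
toy = toy.drop(columns=['Time'])
print(toy['Hour'].tolist())
```

Whether such a feature helps is an empirical question; this notebook simply drops the column.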

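## Step 2: Standardize the 'Amount' Column
The 'Amount' column is not on the same scale as the PCA-transformed features, so it is standardized to zero mean and unit variance. Tree-based models such as XGBoost are largely insensitive to feature scale, but SMOTE (Step 5) relies on nearest-neighbor distances, which an unscaled 'Amount' would dominate.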

In [5]:
# Standardize the 'Amount' feature
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['Amount']])
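
To make the transformation concrete, here is a small check of what `StandardScaler` computes, using the five 'Amount' values shown in the `df.head()` output above. The `toy`, `scaled`, and `manual` names are illustrative and do not touch `df`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy amounts, taken from the first rows shown in df.head() above.
toy = pd.DataFrame({'Amount': [149.62, 2.69, 378.66, 123.50, 69.99]})

# StandardScaler computes z = (x - mean) / std, using the population
# standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(toy[['Amount']]).ravel()
manual = (toy['Amount'] - toy['Amount'].mean()) / toy['Amount'].std(ddof=0)

# Both routes should agree up to floating-point error.
print(np.allclose(scaled, manual.to_numpy()))
```

Note the `ddof=0`: pandas defaults to the sample standard deviation (`ddof=1`), so a manual `Series.std()` without that argument would give slightly different values.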

## Step 3: Define Feature Matrix and Target Variable
We separate the dataset into features (X) and target (y).

In [6]:
# Split into features and target
X = df.drop(columns=['Class'])
y = df['Class']

## Step 4: Train-Test Split
The dataset is split 80/20 into training and test sets, stratified on the class label so that both sets keep the same fraud rate.

In [7]:
# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
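
The scaler in Step 2 is fit before this split, so the test rows influence the mean and standard deviation it learns. A leakage-free variant fits the scaler on the training rows only and reuses it on the test rows. The sketch below shows that ordering on synthetic data and also checks that stratification preserves the positive rate; the `demo` frame, its 2% positive rate, and all `*_demo` / `Xd_*` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1,000 rows with a 2% positive class.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Amount': rng.exponential(scale=80.0, size=1000),
    'Class': np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)],
})

X_demo, y_demo = demo.drop(columns=['Class']), demo['Class']
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Stratification keeps the positive rate the same in both splits.
print(yd_train.mean(), yd_test.mean())

# Leakage-free variant: learn mean/std from the training rows only,
# then reuse them on the test rows.
demo_scaler = StandardScaler().fit(Xd_train[['Amount']])
Xd_train = Xd_train.assign(Amount=demo_scaler.transform(Xd_train[['Amount']]).ravel())
Xd_test = Xd_test.assign(Amount=demo_scaler.transform(Xd_test[['Amount']]).ravel())
```

With a dataset this large the difference in learned statistics is likely small, but fitting on the training split alone keeps the test set a clean estimate of unseen data.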

## Step 5: Address Class Imbalance with SMOTE
To address the severe class imbalance, we apply the Synthetic Minority Over-sampling Technique (SMOTE) to the training data only, so the test set keeps the real class distribution.

In [8]:
# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print('Original training set class counts:', y_train.value_counts())
print('Resampled training set class counts:', y_train_resampled.value_counts())

Original training set class counts: Class
0    227451
1       394
Name: count, dtype: int64
Resampled training set class counts: Class
0    227451
1    227451
Name: count, dtype: int64
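
For intuition about where the extra minority rows come from, the core idea of SMOTE is to place each synthetic point on the line segment between a minority sample and one of its nearest minority neighbors. The NumPy sketch below illustrates that idea only; it is not imbalanced-learn's implementation, and the `smote_like_samples` function, its parameters, and the toy `minority` array are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between a chosen
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # seed samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    gap = rng.random((n_new, 1))                               # weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with six points in the unit square (k must be < 6).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like_samples(minority, n_new=10, k=3)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, the new rows stay inside the region the minority class already occupies. This is also why feature scale matters: the neighbor search uses Euclidean distance.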


## Step 6: Save Preprocessed Data
We save the resampled training set and the untouched test set to disk for use in model training.

In [9]:
# Save the preprocessed data
X_train_resampled.to_csv('data/X_train_resampled.csv', index=False)
y_train_resampled.to_csv('data/y_train_resampled.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
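
When these files are read back in a training notebook, the target files come back as one-column DataFrames rather than Series. The round trip below is a minimal sketch on toy data written to a temporary directory instead of `data/`; the `X_demo`, `y_demo`, and file names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the saved objects; a temporary directory replaces 'data/'.
X_demo = pd.DataFrame({'V1': [0.1, -0.2, 0.3], 'Amount': [1.5, -0.4, 0.0]})
y_demo = pd.Series([0, 1, 0], name='Class')

with tempfile.TemporaryDirectory() as tmp:
    X_path, y_path = Path(tmp) / 'X_demo.csv', Path(tmp) / 'y_demo.csv'
    X_demo.to_csv(X_path, index=False)
    y_demo.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(X_path)
    # A Series is written as a one-column CSV; squeeze turns it back into a Series.
    y_loaded = pd.read_csv(y_path).squeeze('columns')

print(type(y_loaded).__name__, X_loaded.shape)
```

Saving with `index=False` matters here: the resampled training set has a fresh index, and writing it out would add an extra unnamed column on reload.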

## Summary
- Dropped the 'Time' column.
- Standardized the 'Amount' column using StandardScaler.
- Performed stratified train-test split.
- Applied SMOTE to balance the training data.
- Saved the preprocessed data for model training.