## Data Preprocessing & Feature Engineering

### Data Loading and Initial Partitioning

In this initial step, we load the dataset, drop exact duplicate rows, and separate the target variable (`Class`) from the feature set ($X$).

- Target ($y$): The Class column, where 1 represents a fraudulent transaction and 0 represents a legitimate one.

- Features ($X$): The remaining columns, including the 28 PCA-transformed variables (`V1` through `V28`), the transaction `Amount`, and `Time`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

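data = pd.read_csv("creditcard.csv")
# Drop exact duplicate rows, then rebuild a contiguous index so that X and y
# stay aligned with any frame later rebuilt from a NumPy array.
data = data.drop_duplicates().reset_index(drop=True)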
X = data.drop(['Class'], axis=1)
y = data['Class']
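
Before moving on, it can help to confirm that the target did not leak into the features and to look at how rare the positive class is. The following is a minimal sketch on a toy frame; `toy`, `X_toy`, `y_toy`, and all values in it are illustrative assumptions, not rows from `creditcard.csv`.

```python
import pandas as pd

# Toy stand-in for the real file: 2 fraud rows out of 10 (hypothetical values).
toy = pd.DataFrame({
    "Time": range(10),
    "Amount": [10.0, 25.5, 3.2, 99.9, 1.0, 12.0, 7.5, 60.0, 4.4, 250.0],
    "Class": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

X_toy = toy.drop(columns=["Class"])
y_toy = toy["Class"]

# The target must not appear among the features, and rows must stay aligned.
assert "Class" not in X_toy.columns
assert X_toy.index.equals(y_toy.index)

# Share of each class; on fraud data the positive class is typically very small.
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```

The same two assertions and the `value_counts(normalize=True)` call can be run on the real `X` and `y` to see the actual class balance.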

### Feature Engineering

To improve the predictive power of our models, especially the neural network and gradient boosting algorithms, we derive three contextual features from the raw variables:

1. `Log_Amount`: 

Uses `np.log1p` ($\log(1+x)$) to compress the extreme right skew of the transaction amounts; unlike a plain logarithm, it maps an amount of 0 to 0 instead of being undefined. Goal: Reduce the influence of very large amounts that could otherwise dominate the model's weights.

2. `Hour`:

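Converts the raw `Time` (seconds elapsed since the first transaction in the dataset) into a 24-hour cycle using `(Time // 3600) % 24`. Because `Time` is relative, hour 0 is the hour of the first recorded transaction, which is not necessarily midnight.
Goal: Capture daily behavior patterns, for example fraud being more frequent during hours when legitimate cardholders are less likely to monitor their accounts.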

3. `Amount_per_Time`:

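The ratio of `Amount` to the elapsed `Time`, computed as `Amount / (Time + 1)`; the `+ 1` avoids division by zero for the first transaction, where `Time` is 0. Goal: Serve as a rough proxy for "transaction velocity": high-value movements in quick succession can signal automated attacks or "card testing." Because `Time` counts from the start of the dataset rather than per card, this is a dataset-level signal, not a per-account rate.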

In [2]:
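# log(1 + Amount): compresses the right tail and is defined at Amount = 0
X['Log_Amount'] = np.log1p(X['Amount'])

# Whole hours since the first transaction, wrapped to a 0-23 cycle
X['Hour'] = (X['Time'] // 3600) % 24

# Amount relative to elapsed time; +1 avoids dividing by zero at Time = 0
X['Amount_per_Time'] = X['Amount'] / (X['Time'] + 1)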
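
To make the three transformations concrete, here is a small sketch that applies the same expressions to a toy frame covering the edge cases (zero amount, `Time` of 0, and a timestamp past the 24-hour mark). The frame `toy` and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Time": [0.0, 3599.0, 3600.0, 90000.0],  # seconds since the first transaction
    "Amount": [0.0, 9.0, 99.0, 9999.0],
})

# Same expressions as in the cell above, applied to the toy data.
toy["Log_Amount"] = np.log1p(toy["Amount"])
toy["Hour"] = (toy["Time"] // 3600) % 24
toy["Amount_per_Time"] = toy["Amount"] / (toy["Time"] + 1)

print(toy)
```

The amounts span four orders of magnitude, while `Log_Amount` stays between 0 and roughly 9.2. The last row is 25 hours into the recording, so its `Hour` wraps around to 1, and the first row shows that the `+ 1` keeps the ratio finite when `Time` is 0.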

### Data Preprocessing Pipeline

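We use `RobustScaler` instead of `StandardScaler` because, even after a log transform, credit card data often contains extreme outliers. `RobustScaler` centers each column on its median and scales it by its interquartile range, so a few extreme values barely change the result, whereas the mean and standard deviation used by `StandardScaler` are pulled toward them.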

In [3]:
num_cols = X.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', RobustScaler(), num_cols)
    ],
    verbose_feature_names_out=False
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

X_processed = pipeline.fit_transform(X)
# Keep X's index so the scaled frame stays row-aligned with y.
X_processed = pd.DataFrame(X_processed, columns=num_cols, index=X.index)
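
The difference between the two scalers is easiest to see on a tiny example. The sketch below fits both on nine ordinary values plus one extreme outlier; the array `amounts` is a hypothetical stand-in, not data from the notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Nine ordinary values plus one extreme outlier (hypothetical amounts).
amounts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

robust = RobustScaler().fit(amounts)
standard = StandardScaler().fit(amounts)

# Statistics each scaler learned from the column.
print("median / IQR:", robust.center_[0], robust.scale_[0])
print("mean / std:  ", standard.mean_[0], standard.scale_[0])

# Spread of the nine ordinary values after scaling (outlier excluded).
robust_spread = np.ptp(robust.transform(amounts)[:9])
standard_spread = np.ptp(standard.transform(amounts)[:9])
print("spread of ordinary values:", robust_spread, standard_spread)
```

The median and interquartile range are determined by the ordinary values, so those values keep a usable spread after robust scaling. The mean and standard deviation are dominated by the single outlier, which squeezes the ordinary values into a very narrow band.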

### Training/Testing Split with Stratification

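Fraud cases are a small minority of the rows, so a purely random split could leave the test set with a noticeably different share of positives than the training set. Passing `stratify=y` keeps the fraud-to-legitimate ratio the same in both parts; 20% of the rows are held out for testing.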

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42, stratify=y
)
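
Note that the pipeline above is fitted on all rows before the split, so the medians and interquartile ranges used for scaling are computed partly from test rows. One common way to avoid this is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that alternative on synthetic data; `X_toy`, `y_toy`, and the 2% positive rate are illustrative assumptions, not the notebook's data or results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 1000 rows, 2% positives (hypothetical, not the real data).
X_toy = pd.DataFrame({
    "Amount": rng.exponential(scale=50.0, size=1000),
    "Time": np.arange(1000, dtype=float),
})
y_toy = pd.Series([1] * 20 + [0] * 980, name="Class")

# Split first, keeping the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler = RobustScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(
    scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

# Positive rate in each part; stratification should make them match.
print(y_tr.mean(), y_te.mean())
```

Comparing `y_tr.mean()` with `y_te.mean()` is also a quick way to confirm that stratification worked on the real split.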

### Saving the Final Processed Variables

In [9]:
X_train.to_csv('X_train_final.csv', index=False)
X_test.to_csv('X_test_final.csv', index=False)
y_train.to_csv('y_train_final.csv', index=False)
y_test.to_csv('y_test_final.csv', index=False)
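
When these files are read back, keep in mind that `index=False` drops the original row labels and that `pd.read_csv` returns a DataFrame even for a single-column target. The sketch below shows the round trip for a small Series; the file name `y_toy.csv` and the values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A target Series with a non-contiguous index, as left behind by a shuffled split.
y_toy = pd.Series([0, 1, 0, 0], name="Class", index=[3, 7, 8, 11])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_toy.csv"  # hypothetical file name
    y_toy.to_csv(path, index=False)  # row labels are dropped, header "Class" is kept

    # read_csv returns a one-column DataFrame; select the column to get a Series.
    y_loaded = pd.read_csv(path)["Class"]

print(y_loaded.tolist())
print(y_loaded.index.tolist())
```

The values survive unchanged, but the reloaded Series has a fresh 0-based index. The feature and target files therefore line up by row position, not by the original row labels.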
