## 02_preprocessing — Data Preparation and Feature Scaling

In [None]:
import pandas as pd
from pathlib import Path
...
from sklearn.model_selection import train_test_split
...
from sklearn.preprocessing import StandardScaler
...
import joblib
import os

PROJECT_ROOT = Path().resolve().parent
DATA_PATH = PROJECT_ROOT / "data" / "creditcard.csv"

# Load the dataset
df = pd.read_csv(DATA_PATH)


## Preprocessing Strategy

- The task is formulated as a binary classification problem, where the goal is to detect
fraudulent transactions.

- The target variable is highly imbalanced, so class weighting will be used during
model training rather than resampling, which keeps the original data distribution intact.

- PCA-transformed features (V1–V28) are already scaled and do not require additional normalization.

- The Time and Amount features will be scaled using StandardScaler, as their original
distributions differ significantly from the PCA-transformed features.

- A stratified train-test split will be applied to maintain the original class distribution
in both training and test sets.
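
As a rough illustration of the class-weighting idea above, the sketch below computes "balanced" weights with scikit-learn's `compute_class_weight`. The toy label array `y_toy` and its 990/10 split are assumptions made for the example, not values from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 990 normal (0) and 10 fraudulent (1) transactions
y_toy = np.array([0] * 990 + [1] * 10)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# "balanced" uses n_samples / (n_classes * count_per_class),
# so the rare class receives a proportionally larger weight
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

Many scikit-learn classifiers (for example `LogisticRegression`) also accept `class_weight="balanced"` directly, which applies the same formula without an explicit call.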

## Feature–Target Separation

In this step, we separate the dataset into input features (X) and the target variable (y).
The target variable represents whether a transaction is fraudulent (1) or normal (0).

In [None]:
# Separate features and target
# X contains all input features
# y contains the target label (fraud indicator)
X = df.drop(columns=['Class'])
y = df['Class']

In [None]:
# Verify feature matrix dimensions
# (print explicitly: a notebook cell only displays its last expression)
print(X.shape)
# Verify class distribution in target
print(y.value_counts())

## Train-Test Split

- The dataset is split into training and testing sets.
- Stratified sampling is used to preserve the original class distribution
due to the severe class imbalance.

In [None]:
# Split the data into training and testing sets
# stratify=y ensures the class distribution is preserved
# random_state is set for reproducibility

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [None]:
# Verify that class proportions are preserved after splitting
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

### Key Observation

The class distribution in both training and test sets closely matches the
original dataset, confirming that stratified sampling was applied correctly.
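
The same check can be expressed numerically. The sketch below uses a hypothetical toy dataset (`X_toy`, `y_toy`, with a 2% positive rate chosen for illustration) and compares the positive-class rate of each subset against the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with a 2% positive rate
X_toy = pd.DataFrame({"feature": np.arange(1000)})
y_toy = pd.Series([1] * 20 + [0] * 980, name="Class")

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Positive-class rate in each subset; with stratification the gap should be near zero
rates = {"full": y_toy.mean(), "train": y_tr.mean(), "test": y_te.mean()}
max_gap = max(abs(rates["train"] - rates["full"]), abs(rates["test"] - rates["full"]))
print(rates, max_gap)
```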

## Feature Scaling

Although most features (V1–V28) are already scaled as a result of the PCA transformation,
the Time and Amount features still need to be standardized so that all features
are on comparable scales during model training.

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Columns that require scaling
scale_cols = ['Time', 'Amount']

# Create copies to avoid SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Fit on training data only
X_train.loc[:, scale_cols] = scaler.fit_transform(X_train[scale_cols])

# Apply the same transformation to test data
X_test.loc[:, scale_cols] = scaler.transform(X_test[scale_cols])

# Scaling verification
X_train[scale_cols].describe()

## Scaling Verification & Key Observations

- In the training set, the scaled features (`Time` and `Amount`) have a mean approximately equal to **0**
  and a standard deviation close to **1**, confirming correct standardization.

- StandardScaler was selected instead of RobustScaler because the majority of features
  are already PCA-transformed and approximately standardized. Applying the same kind of
  scaling to the remaining features keeps all inputs on a consistent scale, which linear models rely on.
  
- Feature scaling was applied **after the train-test split**, with the scaler fitted on the
  training set only, to prevent data leakage.

- Only `Time` and `Amount` were scaled, since the PCA-based features (`V1–V28`)
  are already approximately standardized as provided in the dataset.

- The fitted scaler is saved to ensure consistent preprocessing during inference
and future model deployment.
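
To make the leakage point concrete, the following sketch fits a `StandardScaler` on a training array only and confirms that `transform` applies the stored training statistics to new data. The arrays `amount_train` and `amount_test` are synthetic stand-ins for the `Amount` column, assumed here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the Amount column in each split
amount_train = rng.exponential(scale=90.0, size=(800, 1))
amount_test = rng.exponential(scale=90.0, size=(200, 1))

# Statistics (mean_ and scale_) are estimated from the training split only
scaler = StandardScaler().fit(amount_train)

# transform() applies z = (x - mean_) / scale_ using those stored statistics
manual = (amount_test - scaler.mean_) / scaler.scale_
same = np.allclose(manual, scaler.transform(amount_test))

train_scaled = scaler.transform(amount_train)
print(same, train_scaled.mean(), train_scaled.std())
```

Note that `StandardScaler` divides by the population standard deviation, while pandas `describe()` reports the sample standard deviation, so the value shown by `describe()` is typically very slightly above 1 rather than exactly 1.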

## Linking Preprocessing with Modeling

To maintain a clean and modular machine learning workflow, the outputs of the
preprocessing stage are explicitly saved and later loaded by the modeling stage.

This approach ensures that:
- Each notebook remains fully independent and reproducible.
- Preprocessing decisions are applied consistently during model training.
- There is no reliance on shared notebook state or execution order.
- The workflow more closely resembles real-world production pipelines, where
  data preprocessing and model training are decoupled stages.

The preprocessed feature matrices, target splits, and fitted scaler are saved as
joblib artifacts and loaded explicitly by the modeling stage.

In [None]:
# Ensure artifacts directory exists
os.makedirs("artifacts", exist_ok=True)

# Persist preprocessing artifacts for reuse during model training and inference
joblib.dump(X_train, "artifacts/X_train.pkl")
joblib.dump(X_test, "artifacts/X_test.pkl")
joblib.dump(y_train, "artifacts/y_train.pkl")
joblib.dump(y_test, "artifacts/y_test.pkl")
joblib.dump(scaler, "artifacts/standard_scaler.pkl")

print("Preprocessing artifacts saved successfully.")

## Preprocessing Outputs

- `X_train`, `X_test`: Feature matrices after scaling
- `y_train`, `y_test`: Corresponding target labels
- `standard_scaler`: Fitted scaler to ensure consistent transformations during
  inference and future deployment
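
A minimal sketch of how a downstream notebook might load such an artifact is shown below. It round-trips a small stand-in scaler through a temporary directory; the temporary path and the toy training values are assumptions for the example. In the modeling notebook, the equivalent call would presumably be `joblib.load("artifacts/standard_scaler.pkl")`, assuming the same working directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the fitted scaler produced by this notebook
fitted = StandardScaler().fit(np.array([[0.0], [10.0], [20.0]]))

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "standard_scaler.pkl"
    joblib.dump(fitted, artifact_path)      # save, as in the cell above
    restored = joblib.load(artifact_path)   # load, as the modeling stage would

# The restored scaler carries the same training statistics
new_rows = np.array([[10.0], [30.0]])
round_trip_ok = np.allclose(fitted.transform(new_rows), restored.transform(new_rows))
print(round_trip_ok)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that saved them, so it may be worth recording the environment alongside the artifacts.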