## Data Preprocessing

This notebook prepares the credit card transactions dataset for machine
learning by performing feature–target separation, feature scaling, and a
stratified train-test split. No model training is performed in this notebook.


In [1]:
import sys
sys.path.append('../src')

from loader import load_data
from preprocessing import (
    split_features_target,
    scale_amount,
    stratified_split
)

# Load dataset
df = load_data('../data/creditcard.csv')


### Feature–Target Separation

The dataset is divided into input features and the target variable (`Class`)
to enable supervised learning.


In [4]:
# Split features and target
X, y = split_features_target(df)

# Check class distribution
y.value_counts()


Class
0    284315
1       492
Name: count, dtype: int64
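
The implementation of `split_features_target` is not shown in this notebook. As a rough sketch, assuming it simply separates the `Class` column from everything else, it might look like the following. The function name `split_features_target_sketch` and the toy data are hypothetical and used only for illustration.

```python
import pandas as pd


def split_features_target_sketch(df: pd.DataFrame, target: str = "Class"):
    """Hypothetical stand-in for split_features_target: return (X, y)."""
    X = df.drop(columns=[target])  # every column except the label
    y = df[target]                 # the label column as a Series
    return X, y


# Tiny made-up frame, only to show the shapes involved.
toy = pd.DataFrame({
    "V1": [0.1, -1.2, 0.7],
    "Amount": [10.0, 250.0, 3.5],
    "Class": [0, 1, 0],
})
X_toy, y_toy = split_features_target_sketch(toy)
print(X_toy.shape, y_toy.shape)
```

Returning `X` and `y` as separate objects keeps the label out of the feature matrix, so later steps such as scaling cannot touch it by accident.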

### Handling Class Imbalance (Observation)

The dataset is highly imbalanced: only 492 of 284,807 transactions (about
0.17%) are fraudulent. The stratified split preserves this ratio in both
subsets, and the imbalance itself is addressed later, during model training.
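
To make the degree of imbalance concrete, the short calculation below works it out from the class counts printed by `y.value_counts()`. It uses only those two numbers; the variable names are illustrative.

```python
# Counts copied from the y.value_counts() output in this notebook.
n_legit = 284_315
n_fraud = 492

total = n_legit + n_fraud
fraud_share = n_fraud / total          # fraction of rows labelled 1
legit_per_fraud = n_legit / n_fraud    # legitimate rows per fraudulent row

print(f"total rows:           {total}")
print(f"fraud share:          {fraud_share:.4%}")
print(f"legitimate per fraud: {legit_per_fraud:.0f}")
```

With a positive class this rare, a model that always predicts class 0 would already be right more than 99.8% of the time, which is why plain accuracy says little about performance on this dataset.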


### Feature Scaling

The transaction amount is scaled with `RobustScaler`, which centers on the
median and scales by the interquartile range, so extreme amounts have little
influence on the result. The other features are left unchanged.


In [3]:
X_scaled, scaler = scale_amount(X)
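
The body of `scale_amount` is not shown in this notebook. The sketch below is one plausible shape for it, assuming the amount lives in a column named `Amount` and that the helper returns the scaled frame together with the fitted scaler. `scale_amount_sketch` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler


def scale_amount_sketch(X: pd.DataFrame, column: str = "Amount"):
    """Hypothetical stand-in for scale_amount: return (X_scaled, scaler)."""
    scaler = RobustScaler()  # default: subtract the median, divide by the IQR
    X_scaled = X.copy()      # leave the caller's frame untouched
    X_scaled[column] = scaler.fit_transform(X[[column]]).ravel()
    return X_scaled, scaler


# Made-up data with one extreme amount.
toy = pd.DataFrame({
    "V1": [0.5, -0.3, 1.1, 0.0, -2.0],
    "Amount": [1.0, 2.0, 3.0, 4.0, 1000.0],
})
toy_scaled, toy_scaler = scale_amount_sketch(toy)
print(toy_scaled["Amount"].tolist())
```

In this toy column the median is 3 and the interquartile range is 2, so the four ordinary amounts land between -1 and 0.5, and the outlier does not compress them the way a mean and standard deviation scaling would.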


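### Stratified Train-Test Split

The scaled features and the target are split into training and test sets,
stratified on `Class` so that both subsets keep the original fraud ratio.


In [8]: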
X_train, X_test, y_train, y_test = stratified_split(X_scaled, y)

print("Train class distribution:")
print(y_train.value_counts())

print("\nTest class distribution:")
print(y_test.value_counts())


Train class distribution:
Class
0    227451
1       394
Name: count, dtype: int64

Test class distribution:
Class
0    56864
1       98
Name: count, dtype: int64
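
The printed counts are consistent with an 80/20 split: the test set holds 56,962 of 284,807 rows. A minimal sketch of how `stratified_split` could be written with scikit-learn's `train_test_split` follows. The function name `stratified_split_sketch`, the `random_state` value, and the toy data are assumptions, not the notebook's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_sketch(X, y, test_size=0.2, random_state=42):
    """Hypothetical stand-in for stratified_split."""
    return train_test_split(
        X, y,
        test_size=test_size,        # 20% held out, inferred from the output
        stratify=y,                 # keep the class ratio in both subsets
        random_state=random_state,  # assumed seed, for reproducibility
    )


# Toy data: 1000 rows, 2% positives.
y_toy = pd.Series([1] * 20 + [0] * 980, name="Class")
X_toy = pd.DataFrame({"feature": range(1000)})

X_tr, X_te, y_tr, y_te = stratified_split_sketch(X_toy, y_toy)
print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives: ", int(y_te.sum()), "of", len(y_te))
```

Without `stratify=y`, a random split of data this imbalanced could leave the test set with noticeably more or fewer fraud cases than expected, which would make evaluation metrics less stable.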


## Preprocessing Complete

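The dataset has been prepared with a robustly scaled transaction amount and a
stratified train-test split that keeps the fraud ratio nearly identical in both
subsets (about 0.17% each). Stratification does not by itself prevent data
leakage: `scale_amount` runs before the split, so the scaler has presumably
seen test-set rows as well. The processed data is otherwise ready for model
training and evaluation.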
