# 02 — Preprocessing & Feature Engineering

This notebook handles the data preprocessing steps required before model training.  
It includes:

- Loading raw data
- Train/Test split
- Scaling / Normalization
- Light feature engineering
- Handling multicollinearity (documentation only)
- Exporting processed datasets for modeling

In [1]:
import sys
sys.path.append("..")

In [2]:
# Core libraries
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Shared configuration: random seed and test-set fraction
from src.config import RANDOM_STATE, TEST_SIZE

# Preprocessing pipeline helpers
from src.data.preprocessing import (
    load_raw_data,
    initial_cleaning,
    encode_target,
    split_features_target,
    split_train_test,
    scale_features,
    aggregate_features,
    save_processed_data
)

import joblib

## 1. Load Raw Data

In [3]:
df = load_raw_data('../data/raw/data.csv')

df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## 2. Initial Adjustments

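- Remove irrelevant columns (`id`, `Unnamed: 32`)
- Normalize column names (for example, `concave points_mean` becomes `concave_points_mean`)
- Confirm dataset integrity

The `diagnosis` target is still a string label (`M` / `B`) after this step; it is mapped to 1 / 0 in Section 3.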
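
The cleaning steps listed above can be sketched roughly as follows. This is only an illustration: `initial_cleaning_sketch` is a hypothetical stand-in, and the actual `initial_cleaning` in `src.data.preprocessing` may differ. The demo frame is a tiny hand-made sample, not the real dataset.

```python
import pandas as pd


def initial_cleaning_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for initial_cleaning (assumed behavior)."""
    # Drop the identifier and the empty trailing column when present.
    cleaned = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")
    # Replace spaces in column names with underscores.
    cleaned.columns = cleaned.columns.str.replace(" ", "_", regex=False)
    # Integrity check: no missing values should remain after the drop.
    assert not cleaned.isna().any().any(), "unexpected missing values"
    return cleaned


demo = pd.DataFrame(
    {
        "id": [842302, 842517],
        "diagnosis": ["M", "M"],
        "concave points_mean": [0.1471, 0.07017],
        "Unnamed: 32": [float("nan"), float("nan")],
    }
)
cleaned_demo = initial_cleaning_sketch(demo)
print(list(cleaned_demo.columns))  # ['diagnosis', 'concave_points_mean']
```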

In [4]:
df = initial_cleaning(df)

df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 3. Encoding the Target Variable

Map `diagnosis` to a binary label: `M` (malignant) → 1, `B` (benign) → 0.
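
A minimal sketch of such a mapping is shown below. `encode_target_sketch` is a hypothetical helper written for illustration, and the check for unexpected labels is an added assumption rather than something the project code is known to do.

```python
import pandas as pd


def encode_target_sketch(df: pd.DataFrame, column: str = "diagnosis") -> pd.DataFrame:
    """Hypothetical stand-in for encode_target (assumed behavior)."""
    mapping = {"M": 1, "B": 0}
    # Fail loudly instead of silently producing NaN for unknown labels.
    unknown = set(df[column].unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unexpected labels in {column!r}: {sorted(unknown)}")
    encoded = df.copy()
    encoded[column] = encoded[column].map(mapping).astype("int64")
    return encoded


labels = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})
encoded_labels = encode_target_sketch(labels)
print(encoded_labels["diagnosis"].tolist())  # [1, 0, 0, 1]
```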

In [5]:
df = encode_target(df)

df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 4. Train/Test Split

The split is stratified on `diagnosis` so that the training and test sets keep the same class proportions as the full dataset.
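
For reference, a stratified split with scikit-learn might look like the sketch below. The data is synthetic, and `TEST_SIZE_DEMO` and `RANDOM_STATE_DEMO` are placeholder values chosen for the example; the notebook's real values live in `src.config`, and `split_train_test` may be implemented differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder settings for this example only.
TEST_SIZE_DEMO = 0.2
RANDOM_STATE_DEMO = 42

# Synthetic imbalanced data: 60 negatives and 40 positives.
X_demo = pd.DataFrame({"feature": np.arange(100, dtype=float)})
y_demo = pd.Series([0] * 60 + [1] * 40, name="diagnosis")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=TEST_SIZE_DEMO,
    random_state=RANDOM_STATE_DEMO,
    stratify=y_demo,  # keep the class ratio the same in both splits
)

# Both splits should show the same share of positives as the full data.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```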

In [6]:
X, y = split_features_target(df)

X_train, X_test, y_train, y_test = split_train_test(X, y)

## 5. Scaling (Standardization)

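We apply `StandardScaler` to standardize the numerical features to zero mean and unit variance. `scale_features` also returns the fitted scaler, which is saved at the end of this notebook.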
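
One common way to do this, sketched here under the assumption that the scaler is fit on the training split only and then reused on the test split, is shown below. `scale_features_sketch` is a hypothetical helper and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_features_sketch(X_train: pd.DataFrame, X_test: pd.DataFrame):
    """Hypothetical stand-in for scale_features (assumed behavior)."""
    scaler = StandardScaler()
    # Learn mean and standard deviation from the training data only.
    train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
    )
    # Reuse the training statistics on the test data to avoid leakage.
    test_scaled = pd.DataFrame(
        scaler.transform(X_test), columns=X_test.columns, index=X_test.index
    )
    return train_scaled, test_scaled, scaler


rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.normal(10.0, 3.0, size=(50, 2)), columns=["a", "b"])
test_demo = pd.DataFrame(rng.normal(10.0, 3.0, size=(10, 2)), columns=["a", "b"])

train_s, test_s, fitted = scale_features_sketch(train_demo, test_demo)
print(train_s.mean().round(6).tolist())  # training means are zero up to rounding
```

The test split is not expected to have exactly zero mean, since it is transformed with the training statistics.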

In [7]:
X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test)

## 6. Feature Aggregation (Variance Smoothing)

To reduce noise and redundancy among highly correlated feature groups, this step 
computes aggregated features that summarize related measurements. The `_mean`, 
`_se`, and `_worst` versions of each variable tend to carry similar information 
with different levels of variance, so averaging them stabilizes their signal 
and dampens random fluctuations. The total variance per sample is also computed 
to capture overall dispersion.

No dimensionality reduction or feature selection is applied here.
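
The idea can be sketched as follows. This is an illustrative guess at the logic, not the project's `aggregate_features`: the helper name, the `_avg` suffix, and the `total_variance` column name are all hypothetical, and the real function may choose its groups or variance definition differently.

```python
import pandas as pd

SUFFIXES = ("_mean", "_se", "_worst")


def aggregate_features_sketch(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: per-group averages plus row-wise variance."""
    out = X.copy()
    # Base names are derived from the columns that end in "_mean".
    bases = [col[: -len("_mean")] for col in X.columns if col.endswith("_mean")]
    for base in bases:
        group = [base + suffix for suffix in SUFFIXES]
        # Only aggregate when all three variants are present.
        if all(col in X.columns for col in group):
            out[base + "_avg"] = X[group].mean(axis=1)
    # Sample variance across the original features of each row.
    out["total_variance"] = X.var(axis=1)
    return out


scaled_demo = pd.DataFrame(
    {
        "radius_mean": [1.0, 0.0],
        "radius_se": [2.0, 0.0],
        "radius_worst": [3.0, 0.0],
    }
)
aggregated_demo = aggregate_features_sketch(scaled_demo)
print(aggregated_demo[["radius_avg", "total_variance"]])
```

In this sketch the original columns are kept and the aggregates are appended, so nothing is removed from the feature set.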

In [8]:
X_train_scaled = aggregate_features(X_train_scaled)
X_test_scaled = aggregate_features(X_test_scaled)

In [9]:
X_train_scaled.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
radius_mean,455.0,-1.737316e-16,1.001101,-2.00973,-0.686986,-0.231061,0.494783,3.900239
texture_mean,455.0,3.904081e-16,1.001101,-2.265011,-0.719258,-0.120789,0.562843,4.634299
perimeter_mean,455.0,4.704418e-16,1.001101,-1.96136,-0.687765,-0.244467,0.497536,3.899731
area_mean,455.0,-1.171224e-16,1.001101,-1.433461,-0.664343,-0.314364,0.377537,5.114742
smoothness_mean,455.0,7.24207e-16,1.001101,-2.342455,-0.759968,-0.052676,0.623134,4.715773
compactness_mean,455.0,-5.075305e-17,1.001101,-1.568307,-0.744645,-0.214571,0.49238,4.485809
concavity_mean,455.0,-4.489693e-17,1.001101,-1.092835,-0.731107,-0.364967,0.527101,4.137033
concave_points_mean,455.0,2.928061e-17,1.001101,-1.23642,-0.739855,-0.3954,0.632163,3.838961
symmetry_mean,455.0,2.342449e-17,1.001101,-2.733834,-0.704202,-0.057834,0.503438,4.435961
fractal_dimension_mean,455.0,3.669836e-16,1.001101,-1.791603,-0.729551,-0.203192,0.524949,4.987148
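
A note on the `std` column: the value 1.001101 rather than exactly 1 is expected. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). Assuming the scaler was fit on these 455 training rows, the two differ by a factor of sqrt(n / (n - 1)), which the short check below reproduces.

```python
import math

n_train = 455  # row count reported in the `count` column above

# Ratio between sample std (ddof=1) and population std (ddof=0).
expected_std = math.sqrt(n_train / (n_train - 1))
print(round(expected_std, 6))  # 1.001101
```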


Beyond the aggregated features added above, no further feature engineering is 
applied. The original columns are kept intact except for scaling and basic 
cleanup. Feature selection and dimensionality reduction will be performed in 
the modeling stage.


## 7. Saving the Processed Data

In [None]:
save_processed_data(X_train_scaled, X_test_scaled, y_train, y_test)

## 8. Saving the Scaler

In [12]:
import joblib

# Persist the fitted scaler so the same transformation can be reused later
joblib.dump(scaler, "../src/models/scaler.pkl")

['../src/models/scaler.pkl']
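
A saved scaler can later be restored with `joblib.load` and applied to new data. The sketch below shows the round trip on a toy scaler written to a temporary directory; the path and data are placeholders for illustration and are not the project's files.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scaler fitted on three one-feature rows.
fitted_scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = str(Path(tmp_dir) / "scaler.pkl")
    joblib.dump(fitted_scaler, scaler_path)
    restored_scaler = joblib.load(scaler_path)

# The restored scaler should transform new rows exactly like the original.
new_rows = np.array([[2.0], [4.0]])
same_output = np.allclose(
    fitted_scaler.transform(new_rows), restored_scaler.transform(new_rows)
)
print(same_output)  # True
```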