# 02 — Preprocessing & Feature Engineering

This notebook handles the data preprocessing steps required before model training.  
It includes:

- Loading raw data
- Train/Test split
- Scaling / Normalization
- Light feature engineering
- Handling multicollinearity (documentation only)
- Exporting processed datasets for modeling

In [15]:
import sys
sys.path.append("..")

In [16]:
# Core libraries
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# random_state and test_size
from src.config import RANDOM_STATE, TEST_SIZE

## 1. Load Raw Data

In [2]:
df = pd.read_csv('../data/raw/data.csv')

df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## 2. Initial Adjustments

- Map target variable (`M` → 1, `B` → 0)
- Remove irrelevant columns (`id`, `Unnamed: 32`)
- Confirm dataset integrity

In [3]:
# standardization
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

# maping
df['diagnosis'] = df['diagnosis'].map({'M' : 1, 'B' : 0})
df['diagnosis'] = df['diagnosis'].astype(int)

# drop
df = df.drop(columns=['id' ,'concave points_worst', 'concave points_se', 'concave points_mean', 'Unnamed: 32'], errors='ignore')

df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,symmetry_mean,fractal_dimension_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,symmetry_worst,fractal_dimension_worst,concave_points_worst,concave_points_se,concave_points_mean
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.2419,0.07871,...,184.6,2019.0,0.1622,0.6656,0.7119,0.4601,0.1189,0.2654,0.01587,0.1471
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.1812,0.05667,...,158.8,1956.0,0.1238,0.1866,0.2416,0.275,0.08902,0.186,0.0134,0.07017
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.2069,0.05999,...,152.5,1709.0,0.1444,0.4245,0.4504,0.3613,0.08758,0.243,0.02058,0.1279
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.2597,0.09744,...,98.87,567.7,0.2098,0.8663,0.6869,0.6638,0.173,0.2575,0.01867,0.1052
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1809,0.05883,...,152.2,1575.0,0.1374,0.205,0.4,0.2364,0.07678,0.1625,0.01885,0.1043


## 3. Split Features (X) and Target (y)

In [4]:
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

display(X.head() , y.head())

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,symmetry_worst,fractal_dimension_worst,concave_points_worst,concave_points_se,concave_points_mean
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.2419,0.07871,1.095,...,184.6,2019.0,0.1622,0.6656,0.7119,0.4601,0.1189,0.2654,0.01587,0.1471
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.1812,0.05667,0.5435,...,158.8,1956.0,0.1238,0.1866,0.2416,0.275,0.08902,0.186,0.0134,0.07017
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.2069,0.05999,0.7456,...,152.5,1709.0,0.1444,0.4245,0.4504,0.3613,0.08758,0.243,0.02058,0.1279
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.2597,0.09744,0.4956,...,98.87,567.7,0.2098,0.8663,0.6869,0.6638,0.173,0.2575,0.01867,0.1052
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1809,0.05883,0.7572,...,152.2,1575.0,0.1374,0.205,0.4,0.2364,0.07678,0.1625,0.01885,0.1043


0    1
1    1
2    1
3    1
4    1
Name: diagnosis, dtype: int32

## 4. Train/Test Split

We use stratified sampling to maintain diagnosis distribution.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

## 5. Scaling (Standardization)

We apply StandardScaler to normalize numerical features.

In [6]:
scaler = StandardScaler()

X_test_scaled = scaler.fit_transform(X_test)
X_train_scaled = scaler.transform(X_train)

# converting back to a df
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

## 6. Feature Aggregation (Variance Smoothing)

To reduce noise and redundancy among highly correlated feature groups, this step 
computes aggregated features that summarize related measurements. Since the 
`_mean`, `_se`, and `_worst` versions of each variable tend to carry similar 
information with different levels of variance, averaging them helps stabilize 
their signal while reducing random fluctuations and noise. Additionally, the 
total variance per sample was calculated to capture overall dispersion.

No dimensionality reduction or feature selection is applied here.

In [7]:
# before
X_train_scaled["var_total"] = X_train_scaled.var(axis=1)
X_test_scaled["var_total"] = X_test_scaled.var(axis=1)

In [8]:
display(X_train_scaled['var_total'][:10], X_test_scaled["var_total"][:10])

373    0.929439
19     0.182265
527    0.160558
356    0.304566
418    0.206018
7      0.362306
35     0.478121
185    0.870349
204    0.067711
341    0.486071
Name: var_total, dtype: float64

142    0.358824
477    0.186043
476    0.136152
156    0.309783
190    2.845473
505    3.335781
243    0.313368
382    0.800224
311    0.222914
375    0.276364
Name: var_total, dtype: float64

In [9]:
prefixes = ['radius', 'perimeter', 'area', 'concavity', 'texture']

for p in prefixes:
    cols = [c for c in X_train_scaled.columns if c.startswith(p)]
    X_train_scaled[f"{p}_avg"] = X_train_scaled[cols].mean(axis=1)
    X_test_scaled[f"{p}_avg"] = X_test_scaled[cols].mean(axis=1)

In [10]:
# after
X_train_scaled["var_total"] = X_train_scaled.var(axis=1)
X_test_scaled["var_total"] = X_test_scaled.var(axis=1)

In [11]:
display(X_train_scaled['var_total'][:10], X_test_scaled["var_total"][:10])

373    0.894479
19     0.172408
527    0.167864
356    0.278735
418    0.211582
7      0.320166
35     0.400560
185    0.794812
204    0.060813
341    0.481000
Name: var_total, dtype: float64

142    0.344560
477    0.185483
476    0.118249
156    0.274224
190    2.690350
505    3.222634
243    0.284237
382    0.732282
311    0.239459
375    0.262566
Name: var_total, dtype: float64

In [12]:
X_train_scaled.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
radius_mean,455.0,-0.05868,1.036795,-2.135185,-0.770828,-0.314971,0.514923,4.039049
texture_mean,455.0,0.04347,1.083306,-2.342313,-0.733139,-0.086739,0.661396,4.995126
perimeter_mean,455.0,-0.054673,1.037966,-2.087332,-0.776221,-0.301574,0.525983,4.050839
area_mean,455.0,-0.045088,0.948014,-1.429413,-0.691168,-0.352085,0.361911,4.988509
smoothness_mean,455.0,0.109774,0.939116,-2.874384,-0.565225,0.094535,0.686557,4.628858
compactness_mean,455.0,0.005746,0.985306,-1.50929,-0.737012,-0.21848,0.495975,4.515486
concavity_mean,455.0,0.018476,1.061306,-1.15405,-0.774179,-0.364429,0.585379,4.46372
symmetry_mean,455.0,0.05324,0.955048,-2.601729,-0.629809,-0.02293,0.554046,3.892764
fractal_dimension_mean,455.0,0.069075,0.964655,-1.712564,-0.562041,-0.080763,0.50586,4.825659
radius_se,455.0,0.075218,0.887815,-0.905587,-0.493705,-0.181792,0.342227,8.175553


## 7) Saving new data

The preprocessed train and test sets are now saved to the `data/processed/` 
directory and will be used in the next notebook (`modeling.ipynb`).

In [13]:
X_train_scaled.to_csv("../data/processed/X_train_preprocessed.csv", index=False)
X_test_scaled.to_csv("../data/processed/X_test_preprocessed.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)

No feature engineering was applied yet. The dataset was kept intact except for 
scaling and basic cleanup. Feature selection and dimensionality reduction will 
be performed in the modeling stage.
