# Stage 6 — Data Preprocessing Notebook

This notebook loads a raw dataset from `data/raw/`, applies reusable cleaning functions from `src/cleaning.py`, compares original vs cleaned data, and saves the result to `data/processed/`.

> Paths are resolved relative to the current working directory (`Path.cwd()`). Run the notebook from the project root, or update `PROJECT_ROOT` in the first cell.


In [None]:
# Imports & Paths
from pathlib import Path
import pandas as pd
import numpy as np

# Project paths (adjust if needed)
PROJECT_ROOT = Path.cwd()
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
SRC_DIR = PROJECT_ROOT / "src"

# Create the processed folder up front so the save step cannot fail on a missing directory
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

# Make src/ importable when the project is not installed as a package (e.g. in Colab)
import sys
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

from cleaning import fill_missing_median, drop_missing, normalize_data

print('PROJECT_ROOT:', PROJECT_ROOT)
print('DATA_RAW:', DATA_RAW)
print('DATA_PROCESSED:', DATA_PROCESSED)


In [None]:
# Load a raw dataset
# Auto-detect the first CSV in alphabetical order; if several exist, set raw_path by name instead.
raw_candidates = sorted(DATA_RAW.glob("*.csv"))
if not raw_candidates:
    raise FileNotFoundError(f"No CSV files found in {DATA_RAW}. Place your raw CSV there.")
raw_path = raw_candidates[0]
df_raw = pd.read_csv(raw_path)

print(f"Loaded: {raw_path.name} | shape={df_raw.shape}")
df_raw.head()


In [None]:
# Quick profile of raw data
summary = {
    "shape": df_raw.shape,
    "dtypes": df_raw.dtypes.astype(str).to_dict(),
    "missing_per_column": df_raw.isna().sum().to_dict()
}
summary


In [None]:
# Apply cleaning steps
# 1) Fill numeric NaNs with median
df1 = fill_missing_median(df_raw)

# 2) Drop remaining rows with any missing values in a chosen subset (or all)
#    Adjust 'subset' to critical columns only if appropriate
df2 = drop_missing(df1, how="any", subset=None)

# 3) Normalize numeric columns (choose method="standard" or "minmax")
df_clean, scale_params = normalize_data(df2, columns=None, method="standard")

print("Scaling parameters (per column):")
scale_params
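
`src/cleaning.py` is not shown in this notebook. As a rough guide to what the three calls above do, here is a minimal sketch of helpers with the same call signatures. The `_sketch` names, the parameter keys (`center`, `spread`), and the use of the population standard deviation are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def fill_missing_median_sketch(df, columns=None):
    """Return a copy with NaNs in numeric columns replaced by each column's median."""
    out = df.copy()
    cols = columns if columns is not None else out.select_dtypes(include="number").columns
    for col in cols:
        out[col] = out[col].fillna(out[col].median())
    return out


def drop_missing_sketch(df, how="any", subset=None):
    """Thin wrapper around DataFrame.dropna that also resets the index."""
    return df.dropna(how=how, subset=subset).reset_index(drop=True)


def normalize_data_sketch(df, columns=None, method="standard"):
    """Scale numeric columns; return the scaled copy and the fitted parameters."""
    out = df.copy()
    if columns is not None:
        cols = list(columns)
    else:
        cols = list(out.select_dtypes(include="number").columns)
    params = {}
    for col in cols:
        values = out[col].astype(float)
        if method == "standard":
            center, spread = values.mean(), values.std(ddof=0)
        elif method == "minmax":
            center, spread = values.min(), values.max() - values.min()
        else:
            raise ValueError(f"Unknown method: {method!r}")
        # Guard against constant columns, which would otherwise divide by zero.
        spread = spread if spread > 0 else 1.0
        out[col] = (values - center) / spread
        params[col] = {"method": method, "center": float(center), "spread": float(spread)}
    return out, params


# Toy data: one numeric column with a gap, one text column with a gap.
demo = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["A", None, "C"]})
filled = fill_missing_median_sketch(demo)      # numeric gap is filled with the median
kept = drop_missing_sketch(filled)             # the row with a missing city is dropped
scaled, params = normalize_data_sketch(kept)   # remaining ages are standardized
```

Note that the text column is untouched by the fill step, which is why the second pass with `drop_missing` can still remove rows.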


In [None]:
# Compare original vs cleaned
comparison = {
    "raw_shape": df_raw.shape,
    "clean_shape": df_clean.shape,
    "raw_missing_total": int(df_raw.isna().sum().sum()),
    "clean_missing_total": int(df_clean.isna().sum().sum()),
}
pd.DataFrame([comparison])


In [None]:
# Save cleaned dataset
clean_name = raw_path.stem + "_cleaned.csv"
clean_path = DATA_PROCESSED / clean_name
df_clean.to_csv(clean_path, index=False)
print(f"Saved cleaned dataset to: {clean_path}")
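
The scaling parameters can be saved next to the cleaned CSV so that later data is transformed with the same values. The sketch below is one way to do that with JSON; the helper names, the file name, and the example parameter layout (`mean`, `std`) are hypothetical, since the exact structure returned by `normalize_data` depends on `src/cleaning.py`.

```python
import json
import tempfile
from pathlib import Path


def save_scale_params(params, path):
    """Write scaling parameters to JSON, casting NumPy scalars to plain Python values."""
    def to_builtin(value):
        if isinstance(value, dict):
            return {str(k): to_builtin(v) for k, v in value.items()}
        if isinstance(value, (list, tuple)):
            return [to_builtin(v) for v in value]
        # NumPy scalars expose .item(); plain Python values pass through unchanged.
        return value.item() if hasattr(value, "item") else value

    Path(path).write_text(json.dumps(to_builtin(params), indent=2))


def load_scale_params(path):
    """Read scaling parameters back from a JSON file."""
    return json.loads(Path(path).read_text())


# Illustrative values only. In the notebook you would pass `scale_params` and a
# path such as DATA_PROCESSED / (raw_path.stem + "_scale_params.json").
example_params = {"age": {"mean": 30.0, "std": 10.0}}
with tempfile.TemporaryDirectory() as tmp:
    params_path = Path(tmp) / "example_scale_params.json"
    save_scale_params(example_params, params_path)
    restored = load_scale_params(params_path)
```

JSON is used here because the parameters are small and human-readable; a pickle would also work but is harder to inspect and diff.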


## Notes & Assumptions

- **Missingness strategy**: Numeric columns are filled with their medians first to preserve row count; any rows that still contain missing values are then dropped as a conservative second pass. If the data will later be split into training and evaluation sets, fit the medians on the *training split only* to avoid leakage.
- **Scaling choice**: Used z-score standardization. For bounded features or models that prefer [0, 1], use `method="minmax"`.
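- **Columns affected**: By default, the fill and scaling functions operate on numeric columns only; pass a list of column names to target specific columns. In this notebook `drop_missing` is called with `subset=None`, so a missing value in any column removes the row.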
- **Reproducibility**: `normalize_data` returns the scaled data together with a dictionary of fitted parameters. Save that dictionary if you need to transform future or holdout data consistently.
- **Tradeoffs**:
  - Filling vs dropping: Median filling keeps more rows but pulls the filled values toward the center, which shrinks variance and can weaken correlations; dropping can bias the sample unless values are missing completely at random (MCAR).
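  - Standardization vs Min-Max: Both are linear rescalings, so neither changes the shape of a distribution. Standardization leaves values unbounded, and its mean and standard deviation are themselves pulled by outliers; Min-Max guarantees a [0, 1] range on the fitted data, but a single extreme value compresses every other value into a narrow band.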
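
The leakage and reproducibility notes above come down to one rule: fit the statistics on the training rows, then reuse them unchanged elsewhere. The sketch below shows this with plain pandas rather than the `src/cleaning.py` helpers, because their interface for applying saved parameters is not shown here. The column name, the toy values, and the split point are all illustrative.

```python
import pandas as pd

# Toy data; the column name and the 4/2 split are purely illustrative.
df = pd.DataFrame({"income": [10.0, 20.0, None, 30.0, None, 100.0]})
train, holdout = df.iloc[:4].copy(), df.iloc[4:].copy()

# Fit every statistic on the training rows only.
median = train["income"].median()
train_filled = train["income"].fillna(median)
mean, std = train_filled.mean(), train_filled.std(ddof=0)

# Reuse the training statistics on the holdout rows; never refit there.
holdout_filled = holdout["income"].fillna(median)
train_scaled = (train_filled - mean) / std
holdout_scaled = (holdout_filled - mean) / std
```

The holdout values are not guaranteed to have zero mean or unit variance after this transform, and that is expected: they are measured on the training data's scale.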
