# 📘 Feature Engineering Notebook: AquaSafe

**Notebook:** `03_feature_engineering.ipynb`

**Input:** `data/processed/csv/cleaned_water_quality.csv` (from Notebook 02)

**Output:** `data/processed/csv/train.csv`, `data/processed/csv/test.csv` (model-ready, no NaN), plus Parquet copies under `data/processed/parquet/`

---

## 🎯 Objective

Transform cleaned data into **model-ready train/test datasets** with proper handling to avoid data leakage.

### ✅ What This Notebook Does (In Order):
| Step | Task | Why This Order? |
|------|------|----------------|
| 1 | Load cleaned data | Starting point |
| 2 | **Train-Test Split** | MUST happen before any transformation |
| 3 | Imputation (fit on train) | Prevents test-set statistics from leaking into training |
| 4 | Encoding (fit on train) | Prevents test-only categories from shaping the encoder |
| 5 | Export train/test separately | Ready for modeling |

### 💡 Why Split First?
If we impute or encode on the full dataset:
- Median/mode values include test-set information
- The encoder sees categories that occur only in the test set
- This causes **data leakage** → overly optimistic metrics
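
To make the leakage concrete, here is a minimal sketch with made-up numbers (the values and the `train`/`test` slices are hypothetical, not AquaSafe data). It compares the median that would be used for imputation when it is computed on all rows versus on the training rows only.

```python
from statistics import median

# Hypothetical measurements; the last two rows play the role of the test set
values = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = values[:4], values[4:]

full_median = median(values)    # uses test rows, so test information leaks in
train_median = median(train)    # uses training rows only, so nothing leaks

print(f"Median on full data:  {full_median}")
print(f"Median on train only: {train_median}")
```

The two medians differ, so any training row imputed with the full-data median would carry a value influenced by the test rows.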

---
## 🔧 Section 1: Imports

In [1]:
# ============================================================================
# IMPORTS
# ============================================================================

import pandas as pd
import numpy as np
import os
import json
import joblib
from pathlib import Path
from datetime import datetime

# Sklearn - Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Project modules
from utils.config import DATA_DIR
from src.data_preprocessing.create_dataframe import create_dataframe

# Display settings
pd.set_option('display.max_columns', None)

print("✓ All imports successful")

✓ All imports successful


---
## 📥 Section 2: Load Cleaned Data

In [2]:
# ============================================================================
# LOAD DATA FROM NOTEBOOK 02
# ============================================================================

INPUT_PATH = os.path.join(DATA_DIR, "processed", "csv", "cleaned_water_quality.csv")

df = create_dataframe(INPUT_PATH)

print(f"✓ Loaded cleaned dataset: {df.shape}")
print(f"  Rows: {df.shape[0]}")
print(f"  Columns: {df.shape[1]}")

✓ Loaded cleaned dataset: (171, 54)
  Rows: 171
  Columns: 54


In [3]:
# ============================================================================
# INPUT VALIDATION
# ============================================================================

print("\n🔍 Input Validation:")

TARGET_COL = "use_based_class"

# Check target exists
assert TARGET_COL in df.columns, f"Target column '{TARGET_COL}' not found"
print("   ✓ Target column present")

# Check target is complete
assert df[TARGET_COL].isna().sum() == 0, "Target has missing values"
print("   ✓ Target has no missing values")

# Check valid classes and report any unexpected labels in the error message
valid_classes = {"A", "B", "C", "E"}
actual_classes = set(df[TARGET_COL].unique())
unexpected = actual_classes - valid_classes
assert not unexpected, f"Invalid classes found: {sorted(unexpected)}"
print(f"   ✓ Target classes: {sorted(actual_classes)}")

# Show missing values (expected - will be imputed)
missing_count = df.isna().sum().sum()
print(f"   ℹ Missing values: {missing_count} (will be imputed after split)")


🔍 Input Validation:
   ✓ Target column present
   ✓ Target has no missing values
   ✓ Target classes: ['A', 'B', 'C', 'E']
   ℹ Missing values: 223 (will be imputed after split)


---
## ✂️ Section 3: Feature-Target Separation

In [4]:
# ============================================================================
# SEPARATE FEATURES AND TARGET
# ============================================================================

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL].copy()

print(f"✓ Features and target separated:")
print(f"   X shape: {X.shape}")
print(f"   y shape: {y.shape}")

print(f"\nTarget distribution:")
print(y.value_counts().sort_index())

✓ Features and target separated:
   X shape: (171, 53)
   y shape: (171,)

Target distribution:
use_based_class
A    141
B      5
C      6
E     19
Name: count, dtype: int64


In [5]:
# ============================================================================
# IDENTIFY COLUMN TYPES
# ============================================================================

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "string"]).columns.tolist()
# Note: boolean flag columns match neither selector; they pass through the
# imputation and encoding steps unchanged and are cast to int in Section 7

print(f"\n📊 Column Types:")
print(f"   Numeric columns: {len(numeric_cols)}")
print(f"   Categorical columns: {len(categorical_cols)}")

if categorical_cols:
    print(f"\nCategorical columns to encode:")
    for col in categorical_cols:
        print(f"   • {col}: {X[col].nunique()} unique values")


📊 Column Types:
   Numeric columns: 29
   Categorical columns: 7

Categorical columns to encode:
   • type_water_body: 7 unique values
   • weather: 3 unique values
   • approx_depth: 3 unique values
   • human_activities: 18 unique values
   • floating_matter: 2 unique values
   • color: 9 unique values
   • odor: 5 unique values


---
## 🔀 Section 4: Train-Test Split (BEFORE Any Transformation)

**⚠️ CRITICAL: This must happen BEFORE imputation and encoding!**

In [6]:
# ============================================================================
# TRAIN-TEST SPLIT
# ============================================================================
# This MUST happen before any imputation or encoding to prevent data leakage

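X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Work on explicit copies so the in-place column assignments in later cells
# do not risk a pandas SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

print(f"✓ Train-Test Split Complete (stratified):")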
print(f"   X_train: {X_train.shape}")
print(f"   X_test:  {X_test.shape}")
print(f"   y_train: {y_train.shape}")
print(f"   y_test:  {y_test.shape}")

print(f"\nClass distribution in train:")
print(y_train.value_counts().sort_index())

print(f"\nClass distribution in test:")
print(y_test.value_counts().sort_index())

✓ Train-Test Split Complete (stratified):
   X_train: (136, 53)
   X_test:  (35, 53)
   y_train: (136,)
   y_test:  (35,)

Class distribution in train:
use_based_class
A    112
B      4
C      5
E     15
Name: count, dtype: int64

Class distribution in test:
use_based_class
A    29
B     1
C     1
E     4
Name: count, dtype: int64


In [7]:
# ============================================================================
# VERIFY STRATIFICATION
# ============================================================================

print("\n🔍 Stratification Verification:")

train_pct = y_train.value_counts(normalize=True).sort_index() * 100
test_pct = y_test.value_counts(normalize=True).sort_index() * 100

for cls in sorted(y.unique()):
    print(f"   Class {cls}: Train={train_pct[cls]:.1f}%, Test={test_pct[cls]:.1f}%")

print("\n   ✓ Class proportions preserved in both sets")


🔍 Stratification Verification:
   Class A: Train=82.4%, Test=82.9%
   Class B: Train=2.9%, Test=2.9%
   Class C: Train=3.7%, Test=2.9%
   Class E: Train=11.0%, Test=11.4%

   ✓ Class proportions preserved in both sets
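
The percentages match closely, but classes B and C each end up with a single test row, so per-class test metrics for them will be very noisy. The sketch below estimates how many test rows each class can expect for a given `test_size`. The class counts come from the target distribution printed in Section 3; the helper name `expected_test_counts` and the `MIN_TEST_ROWS` threshold are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def expected_test_counts(labels, test_size):
    """Return the expected (fractional) number of test rows per class."""
    counts = Counter(labels)
    return {cls: n * test_size for cls, n in sorted(counts.items())}

# Class counts taken from the target distribution printed in Section 3
labels = ["A"] * 141 + ["B"] * 5 + ["C"] * 6 + ["E"] * 19
expected = expected_test_counts(labels, test_size=0.2)

# Hypothetical threshold: flag classes expected to have fewer than 2 test rows
MIN_TEST_ROWS = 2
sparse_classes = [cls for cls, n in expected.items() if n < MIN_TEST_ROWS]

for cls, n in expected.items():
    print(f"Class {cls}: ~{n:.1f} expected test rows")
print(f"Classes with very few test rows: {sparse_classes}")
```

If classes are flagged this way, stratified cross-validation on the training set may give a steadier picture than the single hold-out split.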


---
## 🔧 Section 5: Imputation (Fit on Train Only)

**Strategy:**
- Numeric columns: Median imputation (robust to outliers)
- Categorical columns: Most frequent (mode) imputation
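
The same strategy can also be expressed as a single object. The sketch below is an alternative to the separate imputers used in this notebook, not a description of them: it bundles a median and a most-frequent `SimpleImputer` in a scikit-learn `ColumnTransformer`. The toy frames and the `ph` column are hypothetical stand-ins for `X_train` and `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for X_train / X_test
train = pd.DataFrame({
    "ph": [7.0, np.nan, 8.0, 6.0],
    "odor": pd.Series(["none", "none", np.nan, "foul"], dtype=object),
})
test = pd.DataFrame({
    "ph": [np.nan],
    "odor": pd.Series([np.nan], dtype=object),
})

imputer = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["ph"]),
        ("cat", SimpleImputer(strategy="most_frequent"), ["odor"]),
    ]
)

imputer.fit(train)                      # statistics come from training rows only
test_imputed = imputer.transform(test)  # test rows are filled with train statistics
print(test_imputed)
```

Because one fitted object holds both imputers, there is only a single artifact to save and reload later.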

In [8]:
# ============================================================================
# CHECK MISSING VALUES BEFORE IMPUTATION
# ============================================================================

print("📊 Missing Values Before Imputation:")
print(f"\n   Train set: {X_train.isna().sum().sum()} missing values")
print(f"   Test set:  {X_test.isna().sum().sum()} missing values")

# Show columns with missing values in train
train_missing = X_train.isna().sum()
train_missing = train_missing[train_missing > 0].sort_values(ascending=False)

if len(train_missing) > 0:
    print(f"\n   Columns with missing values (train):")
    for col, count in train_missing.head(10).items():
        pct = count / len(X_train) * 100
        print(f"      • {col}: {count} ({pct:.1f}%)")

📊 Missing Values Before Imputation:

   Train set: 176 missing values
   Test set:  47 missing values

   Columns with missing values (train):
      • odor: 106 (77.9%)
      • fecal_streptococci: 24 (17.6%)
      • temperature: 16 (11.8%)
      • boron: 13 (9.6%)
      • flouride: 10 (7.4%)
      • phosphate: 4 (2.9%)
      • phenophelene_alkanity: 2 (1.5%)
      • nitrate_n: 1 (0.7%)
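
With `odor` missing in 77.9% of training rows, mode imputation assigns one observed category to most of the column. One possible alternative, which this notebook does not use, is to keep missingness as its own category. The sketch below shows that idea with `SimpleImputer(strategy="constant")`; the toy values and the `"missing"` placeholder label are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column; "missing" is an assumed placeholder label
odor = pd.DataFrame(
    {"odor": pd.Series(["foul", np.nan, np.nan, "none", np.nan], dtype=object)}
)

constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")
odor_filled = constant_imputer.fit_transform(odor)

print(odor_filled.ravel().tolist())
```

After one-hot encoding, the placeholder would become its own dummy column, letting a model treat "not recorded" differently from any observed odor.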


In [9]:
# ============================================================================
# NUMERIC IMPUTATION (MEDIAN)
# ============================================================================
# Fit on TRAIN only, transform both train and test

if len(numeric_cols) > 0:
    print("\n🔧 Numeric Imputation (Median):")
    
    numeric_imputer = SimpleImputer(strategy='median')
    
    # Fit on train only
    numeric_imputer.fit(X_train[numeric_cols])
    
    # Transform both
    X_train[numeric_cols] = numeric_imputer.transform(X_train[numeric_cols])
    X_test[numeric_cols] = numeric_imputer.transform(X_test[numeric_cols])
    
    print(f"   ✓ Imputed {len(numeric_cols)} numeric columns")
    print(f"   ✓ Fitted on train, transformed both train and test")
else:
    numeric_imputer = None
    print("   ℹ No numeric columns to impute")


🔧 Numeric Imputation (Median):
   ✓ Imputed 29 numeric columns
   ✓ Fitted on train, transformed both train and test


In [10]:
# ============================================================================
# CATEGORICAL IMPUTATION (MOST FREQUENT)
# ============================================================================
# Fit on TRAIN only, transform both train and test

if len(categorical_cols) > 0:
    print("\n🔧 Categorical Imputation (Most Frequent):")
    
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    
    # Fit on train only
    categorical_imputer.fit(X_train[categorical_cols])
    
    # Transform both
    X_train[categorical_cols] = categorical_imputer.transform(X_train[categorical_cols])
    X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])
    
    print(f"   ✓ Imputed {len(categorical_cols)} categorical columns")
    print(f"   ✓ Fitted on train, transformed both train and test")
else:
    categorical_imputer = None
    print("   ℹ No categorical columns to impute")


🔧 Categorical Imputation (Most Frequent):
   ✓ Imputed 7 categorical columns
   ✓ Fitted on train, transformed both train and test


In [11]:
# ============================================================================
# VERIFY NO MISSING VALUES AFTER IMPUTATION
# ============================================================================

print("\n📊 Missing Values After Imputation:")
print(f"   Train set: {X_train.isna().sum().sum()} missing values")
print(f"   Test set:  {X_test.isna().sum().sum()} missing values")

assert X_train.isna().sum().sum() == 0, "Train still has missing values!"
assert X_test.isna().sum().sum() == 0, "Test still has missing values!"

print("   ✓ All missing values imputed successfully")


📊 Missing Values After Imputation:
   Train set: 0 missing values
   Test set:  0 missing values
   ✓ All missing values imputed successfully


---
## 🏷️ Section 6: One-Hot Encoding (Fit on Train Only)

In [12]:
# ============================================================================
# ONE-HOT ENCODING
# ============================================================================
# Fit on TRAIN only, transform both train and test
# handle_unknown='ignore' ensures unseen categories in test don't cause errors

if len(categorical_cols) > 0:
    print("\n🏷️ One-Hot Encoding:")
    
    encoder = OneHotEncoder(
        sparse_output=False,
        handle_unknown='ignore',  # Important: handles unseen categories in test
        drop=None  # Keep all categories
    )
    
    # Fit on train only
    encoder.fit(X_train[categorical_cols])
    
    # Get feature names
    encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
    
    # Transform both
    train_encoded = pd.DataFrame(
        encoder.transform(X_train[categorical_cols]),
        columns=encoded_feature_names,
        index=X_train.index
    )
    
    test_encoded = pd.DataFrame(
        encoder.transform(X_test[categorical_cols]),
        columns=encoded_feature_names,
        index=X_test.index
    )
    
    # Drop original categorical columns and add encoded ones
    X_train = X_train.drop(columns=categorical_cols)
    X_test = X_test.drop(columns=categorical_cols)
    
    X_train = pd.concat([X_train, train_encoded], axis=1)
    X_test = pd.concat([X_test, test_encoded], axis=1)
    
    print(f"   ✓ Encoded {len(categorical_cols)} categorical columns")
    print(f"   ✓ Created {len(encoded_feature_names)} dummy columns")
    print(f"   ✓ Fitted on train, transformed both train and test")
else:
    encoder = None
    print("   ℹ No categorical columns to encode")


🏷️ One-Hot Encoding:
   ✓ Encoded 7 categorical columns
   ✓ Created 46 dummy columns
   ✓ Fitted on train, transformed both train and test


In [13]:
# ============================================================================
# ENCODE TARGET VARIABLE
# ============================================================================

label_encoder = LabelEncoder()

# Fit on all possible classes (A, B, C, E)
label_encoder.fit(['A', 'B', 'C', 'E'])

# Transform
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(f"\n🏷️ Target Encoding:")
print(f"   Classes: {label_encoder.classes_}")
print(f"   Encoded: {list(range(len(label_encoder.classes_)))}")


🏷️ Target Encoding:
   Classes: ['A' 'B' 'C' 'E']
   Encoded: [0, 1, 2, 3]
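
Downstream code will usually need to map integer predictions back to class letters. A minimal sketch of that step, assuming a label encoder fitted on the same four classes; the `predicted_codes` values are hypothetical model output.

```python
from sklearn.preprocessing import LabelEncoder

demo_label_encoder = LabelEncoder()
demo_label_encoder.fit(["A", "B", "C", "E"])

# Hypothetical integer predictions from a downstream classifier
predicted_codes = [0, 3, 1]
predicted_classes = demo_label_encoder.inverse_transform(predicted_codes)

print(list(predicted_classes))
```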


---
## ✅ Section 7: Final Validation

In [14]:
# ============================================================================
# CONVERT BOOLEAN COLUMNS TO INTEGER
# ============================================================================
# BDL flag columns are boolean (True/False) - convert to int (0/1)

bool_cols = X_train.select_dtypes(include=['bool']).columns.tolist()

if len(bool_cols) > 0:
    print(f"\n🔧 Converting {len(bool_cols)} boolean columns to integer:")
    # Cast the same columns in both sets so train and test stay aligned
    X_train[bool_cols] = X_train[bool_cols].astype(int)
    X_test[bool_cols] = X_test[bool_cols].astype(int)
    # Guard against a negative count when fewer than 3 columns exist
    remaining = max(len(bool_cols) - 3, 0)
    print(f"   ✓ Converted: {bool_cols[:3]}... (and {remaining} more)")
else:
    print("\n   ℹ No boolean columns to convert")


🔧 Converting 17 boolean columns to integer:
   ✓ Converted: ['fecal_coliform_is_bdl', 'total_coliform_is_bdl', 'fecal_streptococci_is_bdl']... (and 14 more)


In [15]:
# ============================================================================
# FINAL DATASET VALIDATION
# ============================================================================

print("\n🔍 Final Dataset Validation:")
print("="*60)

# Shape check
print(f"\n   Train set: {X_train.shape[0]} rows × {X_train.shape[1]} features")
print(f"   Test set:  {X_test.shape[0]} rows × {X_test.shape[1]} features")

# Column alignment check
assert list(X_train.columns) == list(X_test.columns), "Column mismatch!"
print(f"   ✓ Train and test have identical columns")

# No missing values
assert X_train.isna().sum().sum() == 0, "Train has NaN!"
assert X_test.isna().sum().sum() == 0, "Test has NaN!"
print(f"   ✓ No missing values in train or test")

# Row alignment
assert len(X_train) == len(y_train_encoded), "Train X/y mismatch!"
assert len(X_test) == len(y_test_encoded), "Test X/y mismatch!"
print(f"   ✓ X and y aligned correctly")

# All numeric
non_numeric = X_train.select_dtypes(exclude=[np.number]).columns
assert len(non_numeric) == 0, f"Non-numeric columns found: {list(non_numeric)}"
print(f"   ✓ All features are numeric")

print("\n" + "="*60)
print("✅ All validation checks passed!")


🔍 Final Dataset Validation:

   Train set: 136 rows × 92 features
   Test set:  35 rows × 92 features
   ✓ Train and test have identical columns
   ✓ No missing values in train or test
   ✓ X and y aligned correctly
   ✓ All features are numeric

✅ All validation checks passed!


---
## 💾 Section 8: Export Train/Test Datasets

In [16]:
# ============================================================================
# PREPARE FINAL DATAFRAMES
# ============================================================================

# Add target back to dataframes for export
train_df = X_train.copy()
train_df[TARGET_COL] = y_train_encoded

test_df = X_test.copy()
test_df[TARGET_COL] = y_test_encoded

print(f"✓ Final dataframes prepared:")
print(f"   Train: {train_df.shape}")
print(f"   Test:  {test_df.shape}")

✓ Final dataframes prepared:
   Train: (136, 93)
   Test:  (35, 93)


In [17]:
# ============================================================================
# CREATE OUTPUT DIRECTORIES
# ============================================================================

csv_folder = os.path.join(DATA_DIR, "processed", "csv")
parquet_folder = os.path.join(DATA_DIR, "processed", "parquet")
models_folder = os.path.join(Path(DATA_DIR).parent, "models")

Path(csv_folder).mkdir(parents=True, exist_ok=True)
Path(parquet_folder).mkdir(parents=True, exist_ok=True)
Path(models_folder).mkdir(parents=True, exist_ok=True)

print("✓ Output directories ready")

✓ Output directories ready


In [18]:
# ============================================================================
# EXPORT TRAIN/TEST DATASETS
# ============================================================================

# CSV exports
TRAIN_CSV = os.path.join(csv_folder, "train.csv")
TEST_CSV = os.path.join(csv_folder, "test.csv")

train_df.to_csv(TRAIN_CSV, index=False)
test_df.to_csv(TEST_CSV, index=False)

print(f"✓ Exported: {TRAIN_CSV}")
print(f"✓ Exported: {TEST_CSV}")

# Parquet exports
TRAIN_PARQUET = os.path.join(parquet_folder, "train.parquet")
TEST_PARQUET = os.path.join(parquet_folder, "test.parquet")

train_df.to_parquet(TRAIN_PARQUET, index=False)
test_df.to_parquet(TEST_PARQUET, index=False)

print(f"✓ Exported: {TRAIN_PARQUET}")
print(f"✓ Exported: {TEST_PARQUET}")

✓ Exported: /Users/rex/Documents/personal/AquaSafe/data/processed/csv/train.csv
✓ Exported: /Users/rex/Documents/personal/AquaSafe/data/processed/csv/test.csv
✓ Exported: /Users/rex/Documents/personal/AquaSafe/data/processed/parquet/train.parquet
✓ Exported: /Users/rex/Documents/personal/AquaSafe/data/processed/parquet/test.parquet


In [19]:
# ============================================================================
# SAVE PREPROCESSORS FOR DEPLOYMENT
# ============================================================================

# Save imputers
if numeric_imputer is not None:
    joblib.dump(numeric_imputer, os.path.join(models_folder, "numeric_imputer.pkl"))
    print(f"✓ Saved: numeric_imputer.pkl")

if categorical_imputer is not None:
    joblib.dump(categorical_imputer, os.path.join(models_folder, "categorical_imputer.pkl"))
    print(f"✓ Saved: categorical_imputer.pkl")

# Save encoder
if encoder is not None:
    joblib.dump(encoder, os.path.join(models_folder, "onehot_encoder.pkl"))
    print(f"✓ Saved: onehot_encoder.pkl")

# Save label encoder
joblib.dump(label_encoder, os.path.join(models_folder, "label_encoder.pkl"))
print(f"✓ Saved: label_encoder.pkl")

# Save feature names
joblib.dump(list(X_train.columns), os.path.join(models_folder, "feature_names.pkl"))
print(f"✓ Saved: feature_names.pkl")

✓ Saved: numeric_imputer.pkl
✓ Saved: categorical_imputer.pkl
✓ Saved: onehot_encoder.pkl
✓ Saved: label_encoder.pkl
✓ Saved: feature_names.pkl


In [20]:
# ============================================================================
# CREATE FEATURE REGISTRY
# ============================================================================

feature_registry = {
    "metadata": {
        "version": "1.0",
        "created_at": datetime.now().isoformat(),
        "created_by": "03_feature_engineering.ipynb"
    },
    "dataset_info": {
        "train_records": len(train_df),
        "test_records": len(test_df),
        "total_features": len(X_train.columns),
        "target_name": TARGET_COL,
        "target_classes": list(label_encoder.classes_)
    },
    "preprocessing": {
        "numeric_imputation": "median",
        "categorical_imputation": "most_frequent",
        "encoding": "one_hot",
        "split_ratio": "80/20",
        "stratified": True,
        "random_state": 42
    },
    "feature_names": list(X_train.columns)
}

registry_path = os.path.join(DATA_DIR, "processed", "feature_registry.json")
with open(registry_path, 'w') as f:
    json.dump(feature_registry, f, indent=2)

print(f"\n✓ Feature registry saved: {registry_path}")


✓ Feature registry saved: /Users/rex/Documents/personal/AquaSafe/data/processed/feature_registry.json
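
One use for the registry is checking that incoming data has the columns the model expects. The sketch below is a hedged illustration: the helper `missing_and_extra_columns` is hypothetical, the registry is a heavily shortened stand-in for the real JSON file, and the column names `ph` and `site_id` are made up.

```python
import json

def missing_and_extra_columns(registry, columns):
    """Compare incoming column names with the registry's feature list."""
    expected = registry["feature_names"]
    missing = [c for c in expected if c not in columns]
    extra = [c for c in columns if c not in expected]
    return missing, extra

# Hypothetical, heavily shortened registry; the real file lists all features
registry = json.loads('{"feature_names": ["ph", "boron", "odor_foul"]}')

missing, extra = missing_and_extra_columns(registry, ["ph", "odor_foul", "site_id"])
print(f"Missing: {missing}, unexpected: {extra}")
```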


---
## 📋 Section 9: Summary

In [21]:
# ============================================================================
# FEATURE ENGINEERING SUMMARY
# ============================================================================

print("\n" + "="*80)
print("📋 FEATURE ENGINEERING SUMMARY")
print("="*80)

print(f"""
PIPELINE EXECUTED (In Order):
-----------------------------
1. Loaded cleaned data from Notebook 02
2. Split data: 80% train / 20% test (stratified)
3. Imputed numeric columns (median) - fitted on train only
4. Imputed categorical columns (mode) - fitted on train only
5. One-hot encoded categorical columns - fitted on train only
6. Encoded target variable (LabelEncoder)

OUTPUT DATASETS:
----------------
• Train: {len(train_df)} rows × {train_df.shape[1]} columns
• Test:  {len(test_df)} rows × {test_df.shape[1]} columns
• Features: {len(X_train.columns)}
• Target classes: {list(label_encoder.classes_)}

SAVED ARTIFACTS:
----------------
• train.csv / train.parquet
• test.csv / test.parquet
• numeric_imputer.pkl
• categorical_imputer.pkl (if applicable)
• onehot_encoder.pkl (if applicable)
• label_encoder.pkl
• feature_names.pkl
• feature_registry.json

DATA LEAKAGE PREVENTION:
------------------------
✓ Train-test split done BEFORE any transformation
✓ Imputers fitted on train only, transformed both
✓ Encoder fitted on train only, transformed both
✓ No test set information leaked to training data
""")

print("="*80)
print("✅ Feature engineering complete - Ready for model training")
print("="*80)


📋 FEATURE ENGINEERING SUMMARY

PIPELINE EXECUTED (In Order):
-----------------------------
1. Loaded cleaned data from Notebook 02
2. Split data: 80% train / 20% test (stratified)
3. Imputed numeric columns (median) - fitted on train only
4. Imputed categorical columns (mode) - fitted on train only
5. One-hot encoded categorical columns - fitted on train only
6. Encoded target variable (LabelEncoder)

OUTPUT DATASETS:
----------------
• Train: 136 rows × 93 columns
• Test:  35 rows × 93 columns
• Features: 92
• Target classes: [np.str_('A'), np.str_('B'), np.str_('C'), np.str_('E')]

SAVED ARTIFACTS:
----------------
• train.csv / train.parquet
• test.csv / test.parquet
• numeric_imputer.pkl
• categorical_imputer.pkl (if applicable)
• onehot_encoder.pkl (if applicable)
• label_encoder.pkl
• feature_names.pkl
• feature_registry.json

DATA LEAKAGE PREVENTION:
------------------------
✓ Train-test split done BEFORE any transformation
✓ Imputers fitted on 