# 📘 Feature Engineering & Encoding Notebook

### **🎯 Objective**

Convert the cleaned dataset into a model-ready feature matrix by encoding categorical variables and assembling numeric features without re-cleaning, scaling, or leakage. This notebook produces a stable features artifact for modeling.

### Executive Summary

### Notebook

**File:** `notebooks/03_feature_engineering.ipynb`

**Stage:** Feature Engineering & Encoding

**Input Contract:** `data/processed/csv/cleaned_water_quality_data.csv`

**Output Contract:** `data/processed/csv/nwmp_features_v1.csv` (plus optional `data/processed/parquet/nwmp_features_v1.parquet`)

---

### Objective

Transform the **cleaned and validated dataset** into a **model-ready feature matrix** by:

* encoding categorical variables,
* assembling numeric and engineered features,
* enforcing strict schema invariants,

**without performing cleaning, imputation, scaling, or modeling decisions.**

This notebook establishes a **stable, reusable feature representation** for all downstream models.

---

### Input Assumptions (Contract Enforcement)

The input dataset satisfies the following invariants:

* ✅ No missing values (NaNs)
* ✅ Leakage-prone, metadata, and sparse columns already removed
* ✅ Numeric features properly typed
* ✅ BDL (Below Detection Limit) information preserved as binary flags
* ✅ One row = one observation
* ✅ Target variable (`use_based_class`) present and clean

If any of these conditions fail, the pipeline must return to **data cleaning**.
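
The mechanically testable parts of this contract can be checked before any encoding runs. The following is a minimal sketch, not the project's actual validation code: the helper name `check_input_contract` is hypothetical, and it assumes the BDL flags use the `*_is_bdl` suffix and hold 0/1 values.

```python
import pandas as pd


def check_input_contract(df: pd.DataFrame, target_col: str = "use_based_class") -> list[str]:
    """Return a list of contract violations; an empty list means the input passes.

    Hypothetical helper: it covers only invariants that can be tested mechanically.
    """
    problems = []

    # Target variable must be present
    if target_col not in df.columns:
        problems.append(f"missing target column: {target_col}")

    # No missing values anywhere (this also covers the target)
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")

    # BDL flags must be binary (assumes the *_is_bdl suffix and 0/1 encoding)
    for col in df.columns:
        if col.endswith("_is_bdl") and not df[col].isin([0, 1]).all():
            problems.append(f"non-binary BDL flag: {col}")

    return problems


# Toy frame for illustration only, not the NWMP data
demo = pd.DataFrame({
    "ph": [7.1, 6.8, 7.4],
    "lead_is_bdl": [0, 1, 2],  # the value 2 violates the binary-flag invariant
    "use_based_class": ["A", "B", "A"],
})
violations = check_input_contract(demo)
print(violations)
```

A non-empty result would be the signal to stop and return to the cleaning notebook rather than patch the data here.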

---

### Feature Engineering Scope

#### Included Operations

* Target separation (`X`, `y`)
* Identification of categorical features
* One-hot encoding of low-cardinality categorical columns
* Assembly of numeric + encoded categorical features
* Schema validation (row alignment, NaN checks)
* Export of model-ready feature matrix

#### Explicitly Excluded Operations

* ❌ Data cleaning or imputation
* ❌ Column dropping or leakage decisions
* ❌ Feature scaling or normalization
* ❌ Outlier handling
* ❌ Feature selection
* ❌ Model training

---

### Feature Composition

#### Numeric Features

* Physicochemical parameters (pH, DO, conductivity, TDS, turbidity, etc.)
* Chemical contaminants (nutrients, ions, hardness, alkalinity)
* Biological indicators (fecal coliform, total coliform, streptococci)
* Engineered **BDL indicator flags** (`*_is_bdl`)

All numeric features are passed through **unchanged**.

---

#### Categorical Features

* Domain-relevant, low-cardinality features only
* Encoded using **one-hot encoding**
* No identifiers, metadata, or leakage-prone columns included

High-cardinality or free-text columns are intentionally excluded upstream.

---

### Validation & Safety Checks

Before export, the following conditions are enforced:

* Feature matrix contains **no missing values**
* Feature rows align exactly with target labels
* No duplicate rows introduced during encoding

These checks ensure downstream models receive a **stable and deterministic input**.
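
The three conditions can be gathered into one helper. This is a sketch under stated assumptions: `validate_features` is a hypothetical name, and "no duplicate rows introduced" is interpreted as the count of duplicated rows being identical before and after encoding.

```python
import pandas as pd


def validate_features(X_raw: pd.DataFrame, X_features: pd.DataFrame, y: pd.Series) -> None:
    """Raise AssertionError if the encoded matrix breaks one of the export checks."""
    # 1. No missing values in the feature matrix
    assert not X_features.isna().any().any(), "NaNs present after encoding"

    # 2. Rows align with the target: same length and same index, in the same order
    assert len(X_features) == len(y), "Row count mismatch between X and y"
    assert X_features.index.equals(y.index), "Index mismatch between X and y"

    # 3. Encoding introduced no duplicate rows: the number of duplicated rows
    #    must be the same before and after encoding
    n_before = int(X_raw.duplicated().sum())
    n_after = int(X_features.duplicated().sum())
    assert n_after == n_before, "Duplicate row count changed during encoding"


# Toy data for illustration only
X_raw = pd.DataFrame({"ph": [7.1, 6.8, 7.4], "weather": ["clear", "rainy", "clear"]})
y = pd.Series(["A", "B", "A"], name="use_based_class")
X_features = pd.get_dummies(X_raw, columns=["weather"], drop_first=False)

validate_features(X_raw, X_features, y)
print("all checks passed")
```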

---

### Output Artifact

The notebook exports a single, versioned feature dataset:

* **CSV (mandatory):** transparent, debuggable
* **Parquet (optional):** optimized for performance and scale

This artifact represents the **final feature contract** for all modeling notebooks.

---

### Key Design Principles

* **Separation of concerns:**
  Cleaning → Feature Engineering → Modeling are strictly isolated.

* **Reproducibility:**
  Feature generation is deterministic and independent of model choice.

* **Leakage safety:**
  Only features that exist prior to labeling are included.

* **Scalability:**
  Same features can be reused across multiple models and experiments.

---

### Status

✅ Feature engineering complete
✅ Schema validated
✅ Model-ready dataset exported
🟢 **Ready for model training**



### **Imports**

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
from utils.config import DATA_DIR  # Root data directory; processed files live under it
from src.data_preprocessing.create_dataframe import create_dataframe

### **💿 Data Loading**

In [3]:
# ============================================================================
# DATA LOADING: CLEANED CONTRACT INPUT
# ============================================================================

INPUT_CSV = os.path.join(DATA_DIR, "processed", "csv", "cleaned_water_quality_data.csv")
# Optional parquet if available:
# INPUT_PARQUET = os.path.join(DATA_DIR, "processed", "parquet", "cleaned_water_quality_data.parquet")

df = create_dataframe(INPUT_CSV)

print("✓ Loaded cleaned dataset")
print(df.shape)


✓ Loaded cleaned dataset
(171, 56)


### **Define Target & Separate Features**

In [4]:
# ============================================================================
# TARGET SEPARATION
# ============================================================================

TARGET_COL = "use_based_class"

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL].copy()

print("✓ Features / target separated")
print("X shape:", X.shape, "| y shape:", y.shape)


✓ Features / target separated
X shape: (171, 55) | y shape: (171,)


### Feature Engineering (Encoding Scope Only)


**3.1 Identify categorical columns to encode**

(Low-cardinality only; no identifiers, no leakage, no free-text)

In [5]:
# ============================================================================
# CATEGORICAL FEATURE SELECTION (ENCODING SCOPE)
# ============================================================================

categorical_cols = X.select_dtypes(include=["object", "string"]).columns.tolist()

# Explicit exclusions (already decided in EDA / cleaning)
EXCLUDE_CATS = [
    # identifiers / metadata / free-text already removed upstream
]

# Keep every remaining categorical column. Cardinality is not measured here;
# low cardinality is assumed to have been ensured upstream.
encode_cats = [c for c in categorical_cols if c not in EXCLUDE_CATS]

print("Categorical columns to encode:", encode_cats)


Categorical columns to encode: ['type_water_body', 'river_basin', 'district', 'weather', 'approx_depth', 'human_activities', 'floating_matter', 'color', 'odor']
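
The selection above is based on dtype alone, so the low-cardinality assumption is worth confirming by counting distinct values per column. In the sketch below, the helper `cardinality_report` and the cutoff `MAX_LEVELS = 15` are hypothetical and chosen only for illustration.

```python
import pandas as pd

MAX_LEVELS = 15  # hypothetical cutoff, not a value taken from this project


def cardinality_report(X: pd.DataFrame, cols: list[str], max_levels: int = MAX_LEVELS) -> pd.DataFrame:
    """Count distinct values per categorical column and flag those above max_levels."""
    report = pd.DataFrame({"n_levels": X[cols].nunique()})
    report["exceeds_cutoff"] = report["n_levels"] > max_levels
    return report.sort_values("n_levels", ascending=False)


# Toy data for illustration only
demo = pd.DataFrame({
    "weather": ["clear", "rainy", "clear", "cloudy"],
    "district": ["d1", "d2", "d3", "d4"],
})
report = cardinality_report(demo, ["weather", "district"], max_levels=3)
print(report)
```

In this notebook the call would be `cardinality_report(X, encode_cats)`. Any column flagged by it is a candidate for review in the cleaning stage, since column decisions are out of scope here.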


### One-hot encode categorical features

In [6]:
# ============================================================================
# ONE-HOT ENCODING (NO SCALING, NO LEAKAGE)
# ============================================================================

X_encoded = pd.get_dummies(
    X,
    columns=encode_cats,
    drop_first=False
)

print("✓ Encoding complete")
print("Encoded shape:", X_encoded.shape)


✓ Encoding complete
Encoded shape: (171, 179)
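
`pd.get_dummies` derives its output columns from whichever category values appear in the frame it is given, so a later batch with missing or unseen categories would yield a different column set. One common way to keep the feature contract fixed is to reindex new data against a saved column list. The sketch below illustrates that idea; `align_to_schema` is a hypothetical helper and is not part of this notebook.

```python
import pandas as pd


def align_to_schema(X_new: pd.DataFrame, schema_cols: list[str], cat_cols: list[str]) -> pd.DataFrame:
    """One-hot encode X_new, then force the exact column set and order of schema_cols.

    Indicator columns absent from X_new are added and filled with False;
    indicator columns for categories not in the schema are dropped.
    """
    encoded = pd.get_dummies(X_new, columns=cat_cols, drop_first=False)
    return encoded.reindex(columns=schema_cols, fill_value=False)


# Toy data for illustration only
train = pd.DataFrame({"ph": [7.1, 6.8], "weather": ["clear", "rainy"]})
schema_cols = pd.get_dummies(train, columns=["weather"]).columns.tolist()

# The new batch lacks "rainy" and contains the unseen category "foggy"
new_batch = pd.DataFrame({"ph": [7.0, 7.2], "weather": ["clear", "foggy"]})
aligned = align_to_schema(new_batch, schema_cols, ["weather"])
print(aligned.columns.tolist())
```

A row carrying an unseen category ends up with every indicator for that variable set to False, which may deserve a warning rather than silent acceptance.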


### Assemble Feature Matrix (No Scaling)

In [7]:
# ============================================================================
# FINAL FEATURE MATRIX
# ============================================================================

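X_features = X_encoded.copy()

# Invariants
assert X_features.isna().sum().sum() == 0, "NaNs present after encoding"
assert X_features.shape[0] == y.shape[0], "Row mismatch between X and y"
assert X_features.index.equals(y.index), "Index mismatch between X and y"
# Encoding must not change how many rows are duplicates of another row
assert X_features.duplicated().sum() == X.duplicated().sum(), "Duplicate row count changed during encoding"

print("✓ Feature matrix assembled and validated")


✓ Feature matrix assembled and validated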


### Export Model-Ready Features

In [8]:
# ============================================================================
# EXPORT FEATURES (MODEL-READY, UN-SCALED)
# ============================================================================

OUT_CSV = os.path.join(DATA_DIR, "processed", "csv", "nwmp_features_v1.csv")
OUT_PARQUET = os.path.join(DATA_DIR, "processed", "parquet", "nwmp_features_v1.parquet")

# Build the export frame once: features plus the target column
export_df = X_features.assign(**{TARGET_COL: y})

export_df.to_csv(OUT_CSV, index=False)

# Parquet is optional (pandas needs pyarrow or fastparquet for it),
# so a failure is reported rather than raised
try:
    export_df.to_parquet(OUT_PARQUET, index=False)
except Exception as e:
    print("Parquet export skipped:", e)

print(f"✓ Features exported to {OUT_CSV}")


✓ Features exported to /Users/rex/Documents/personal/AquaSafe/data/processed/csv/nwmp_features_v1.csv
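
A downstream modeling notebook would read this artifact back and split it into features and target again. The sketch below shows that round trip on toy data written to a temporary directory. The file name mirrors the export above, while the helper name `load_features` is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

TARGET_COL = "use_based_class"


def load_features(csv_path, target_col: str = TARGET_COL):
    """Read an exported feature file and split it back into X and y."""
    df = pd.read_csv(csv_path)
    return df.drop(columns=[target_col]), df[target_col]


# Toy stand-in for X_features and y, for illustration only
X_demo = pd.DataFrame({"ph": [7.1, 6.8], "weather_clear": [True, False]})
y_demo = pd.Series(["A", "B"], name=TARGET_COL)

with tempfile.TemporaryDirectory() as tmp:
    out_csv = Path(tmp) / "nwmp_features_v1.csv"
    X_demo.assign(**{TARGET_COL: y_demo}).to_csv(out_csv, index=False)
    X_back, y_back = load_features(out_csv)

# Column order and row count survive the CSV round trip
print(X_back.columns.tolist(), len(X_back), len(y_back))
```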
