<a href="https://colab.research.google.com/github/pandey-rakshit/AquaSafe/blob/develop/notebooks/02_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Data Cleaning Notebook: AquaSafe

---

## 🏃 Executive Summary

**Goal:** Transform the raw water quality dataset into a production-ready, modeling-safe artifact.

**What We Did:**
- ✅ **Fixed 17 numeric parameters** stored as strings with detection-limit annotations (BDL)
- ✅ **Standardized geographic coordinates** from DMS format to decimal degrees
- ✅ **Mapped water classification target** from verbose labels to compact codes (A/B/C/E)
- ✅ **Removed data leakage risks and non-features** (15 columns excluded from the model dataset)
- ✅ **Imputed missing values** using domain-appropriate strategies (median for numeric, mode for categorical)
- ✅ **Validated output** with regression checks (no NaN in target, no duplicates)

**Key Outcome:**
- **Input:** 1 raw CSV with domain noise, mixed types, and incomplete labels
- **Output:** 1 production-ready parquet + 1 CSV backup, schema-validated, ready for modeling

**Who Should Use This?**
- 👉 Downstream notebooks (feature engineering, model training)
- 👉 Future analysts (reproducible pipeline)
- 👉 Ops teams (versioned, documented artifact)

---

## 📋 Detailed Objective

This notebook serves as the **single source of truth** for data quality in the AquaSafe pipeline:

```
[Raw CSV + Noise] → [This Notebook] → [Clean Parquet ✓ Ready for ML]
```

### Responsibilities

| Task | Approach | Benefit |
|------|----------|---------|
| **Parse domain-encoded values** | Extract numeric values, preserve BDL flags | Numbers become analyzable; context preserved |
| **Standardize formats** | Coordinates (DMS→DD), column names (snake_case) | Downstream code works without surprises |
| **Remove problematic columns** | Identify leakage, metadata, structural nulls | Model learns patterns, not coincidences |
| **Handle missing values** | Median (numeric), mode (categorical), special cases | No NaN blocking; data loss minimized |
| **Validate output** | Regression checks (schema, duplicates, target completeness) | Guarantee downstream safety |
| **Create versioned artifact** | Export to parquet + CSV | Reproducible pipeline contract |
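
The "Validate output" row can be sketched as a handful of checks on the cleaned frame. This is a minimal illustration, not the project's actual validation code: the function name `check_clean_frame` and the toy frames are hypothetical, and it assumes the target column is `use_based_class` with the codes A/B/C/E.

```python
import pandas as pd

ALLOWED_CLASSES = {"A", "B", "C", "E"}  # compact target codes used in this notebook


def check_clean_frame(frame: pd.DataFrame, target_col: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    if target_col not in frame.columns:
        return [f"missing target column: {target_col}"]

    problems = []
    if frame[target_col].isna().any():
        problems.append("target contains NaN")
    unexpected = set(frame[target_col].dropna()) - ALLOWED_CLASSES
    if unexpected:
        problems.append(f"unexpected target classes: {sorted(unexpected)}")
    if frame.duplicated().any():
        problems.append("duplicate rows present")
    return problems


# Hypothetical toy data, not taken from the real dataset
good = pd.DataFrame({"ph": [7.9, 8.4], "use_based_class": ["A", "E"]})
bad = pd.DataFrame({"ph": [7.9, 7.9], "use_based_class": ["A", "A"]})

print(check_clean_frame(good, "use_based_class"))
print(check_clean_frame(bad, "use_based_class"))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every issue in a single run.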

### Non-Goals
- ❌ Feature engineering (we stop at clean features)
- ❌ Statistical testing (EDA's job)
- ❌ Outlier removal (keep all valid data; let model decide)
- ❌ Normalization/scaling (belongs in modeling pipeline)

---

## 🔧 Setup & Configuration

### Imports

In [1]:
# ============================================================================
# CORE LIBRARIES
# ============================================================================
# pandas: Data manipulation and analysis framework
# numpy: Numerical computing and array operations
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# ============================================================================
# PROJECT MODULES
# ============================================================================
# Configuration and utilities
from utils.config import DATA_PATH, DATA_DIR  # Path to raw data source
from src.data_preprocessing.create_dataframe import create_dataframe

In [3]:
# ============================================================================
# PANDAS DISPLAY CONFIGURATION
# ============================================================================
# Optimize display for better readability in notebooks
pd.set_option('display.float_format', '{:,.2f}'.format)  # Format floats: 2 decimals
pd.set_option('display.max_columns', None)  # Show all columns (no truncation)
pd.set_option('display.width', None)  # Auto-detect the display width

---

## 📥 Step 1: Data Ingestion & Initial Standardization

### 1.1 Data Loading

In [4]:
# ============================================================================
# STEP 1: LOAD RAW DATA
# ============================================================================
# Load water quality monitoring dataset from CSV source
df = create_dataframe(DATA_PATH, encoding="latin-1")
print(f"✓ Data loaded successfully from: {DATA_PATH}")

✓ Data loaded successfully from: /Users/rex/Documents/personal/AquaSafe/data/NWMP_August2025_MPCB_0.csv


In [5]:
# ============================================================================
# STEP 1.2: STANDARDIZE COLUMN NAMES
# ============================================================================
# Normalize to snake_case for consistency with Python conventions
# This ensures compatibility with common ML frameworks and improves code readability

df.columns = (
    df.columns
    .str.strip()              # Remove leading/trailing whitespace
    .str.lower()              # Convert to lowercase
    .str.replace(" ", "_")    # Replace spaces with underscores
    .str.replace("/", "_")    # Handle slashes (e.g., "mg/L" → "mg_l")
    .str.replace("-", "_")    # Handle hyphens
)

print(f"✓ Columns standardized: {df.shape[1]} features normalized to snake_case")

✓ Columns standardized: 54 features normalized to snake_case
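
To make the renaming rule concrete, here is a small sketch on made-up headers (the real CSV headers are not shown in this notebook, so the names below are hypothetical). It also checks for a side effect worth knowing about: two distinct raw names can collapse to the same cleaned name.

```python
import pandas as pd

# Hypothetical raw headers, chosen only to exercise each rule
raw_columns = pd.Index([" Stn Code", "Type Water Body", "Nitrate-N", "Use/Class "])

clean_columns = (
    raw_columns
    .str.strip()                          # drop leading/trailing whitespace
    .str.lower()                          # lowercase everything
    .str.replace(" ", "_", regex=False)   # spaces -> underscores
    .str.replace("/", "_", regex=False)   # slashes -> underscores
    .str.replace("-", "_", regex=False)   # hyphens -> underscores
)

# Distinct raw names may map to the same cleaned name, so check for collisions
has_collisions = clean_columns.duplicated().any()
print(list(clean_columns), has_collisions)
```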


In [6]:
# ============================================================================
# DATA QUALITY ISSUE: NUMERIC VALUES STORED AS STRINGS
# ============================================================================
# Problem: Chemical/biological parameters contain annotations like "(BDL)" meaning
#          "Below Detection Limit" - lab hardware measurement threshold.
# Impact:  Cannot perform numeric analysis (statistics, modeling) without conversion
# Solution: Extract numeric component + preserve BDL flag as auxiliary feature
# Rationale: BDL may be informative (equipment sensitivity) for some ML tasks

NUMERIC_STRING_COLS = [
    "fecal_coliform",              # Fecal indicator bacteria
    "total_coliform",              # Coliform bacteria (broader indicator)
    "fecal_streptococci",          # Fecal streptococci bacteria
    "total_kjeldahl_n",            # Organic nitrogen
    "nitrate_n",                   # Nitrogen form (oxidized)
    "turbidity",                   # Water clarity (particle suspended matter)
    "sulphate",                    # Sulfate concentration
    "sodium",                      # Sodium ions
    "chlorides",                   # Chloride ions
    "phosphate",                   # Phosphate (nutrient)
    "boron",                       # Trace element
    "potassium",                   # Potassium ions
    "flouride",                    # Fluoride concentration
    "dissolved_o2",                # Dissolved oxygen (critical for aquatic life)
    "total_suspended_solids",      # TSS (turbidity indicator)
    "phenophelene_alkanity",       # Alkalinity (pH buffer capacity)
    "total_alkalinity",            # Total alkalinity
]

print(f"✓ Identified {len(NUMERIC_STRING_COLS)} numeric columns stored as strings")

✓ Identified 17 numeric columns stored as strings


---

## 🔬 Step 2: Type Normalization & Domain-Aware Parsing

### 2.1 Numeric String Parsing (BDL-Aware)

**Strategy:** Extract numeric values while preserving detection-limit flags

In [7]:
def parse_numeric_with_bdl(series: pd.Series) -> tuple[pd.Series, pd.Series]:
    """
    Convert numeric strings with BDL annotations to float while preserving
    laboratory detection-limit information.

    Domain Context:
    - BDL (Below Detection Limit) indicates a measurement below lab equipment sensitivity
    - These values are valid observations, not missing data
    - Example inputs: "0.5", "0.5 (BDL)", "NaN"
    - Anything that is not a plain number once "(BDL)" is removed (e.g., "<0.1")
      is coerced to NaN

    Args:
        series (pd.Series): Column with mixed numeric / annotated values.
                           dtype: object (string)

    Returns:
        tuple:
            numeric_values (pd.Series): Parsed float values
            is_bdl_flag (pd.Series): Boolean flag where True = BDL present

    Example:
        >>> series = pd.Series(["0.5", "0.5 (BDL)", "NaN"])
        >>> numeric, is_bdl = parse_numeric_with_bdl(series)
    """
    # Step 1: Identify BDL presence before altering values
    is_bdl = series.astype(str).str.contains("BDL", na=False)

    # Step 2: Remove annotation and extract numeric portion
    numeric = (
        series.astype(str)
        .str.replace("(BDL)", "", regex=False)
        .str.strip()
    )

    # Step 3: Coerce to numeric (invalid → NaN)
    numeric = pd.to_numeric(numeric, errors="coerce")

    return numeric, is_bdl

In [8]:
# ============================================================================
# DATA STANDARDIZATION: NUMERIC STRING NORMALIZATION WITH BDL PRESERVATION
# ============================================================================
# Purpose:
# - Resolve numeric values stored as strings
# - Preserve laboratory detection-limit information via auxiliary flags
# - Enable valid numeric analysis without discarding domain context

conversions_made = 0

for col in NUMERIC_STRING_COLS:
    if col not in df.columns:
        print(f"  ⚠ Column not found: {col}")
        continue

    numeric_values, bdl_flag = parse_numeric_with_bdl(df[col])
    na_count = numeric_values.isna().sum()
    bdl_count = bdl_flag.sum()

    # Replace original column with numeric representation
    df[col] = numeric_values
    conversions_made += 1

    # Preserve detection-limit information explicitly
    df[f"{col}_is_bdl"] = bdl_flag

    print(f"  ✓ {col}: {bdl_count} BDL flags, {na_count} NaN values")

print(f"\n✓ Numeric normalization complete: {conversions_made} columns processed")


  ✓ fecal_coliform: 28 BDL flags, 7 NaN values
  ✓ total_coliform: 1 BDL flags, 7 NaN values
  ✓ fecal_streptococci: 143 BDL flags, 48 NaN values
  ✓ total_kjeldahl_n: 101 BDL flags, 7 NaN values
  ✓ nitrate_n: 20 BDL flags, 9 NaN values
  ✓ turbidity: 77 BDL flags, 7 NaN values
  ✓ sulphate: 22 BDL flags, 7 NaN values
  ✓ sodium: 22 BDL flags, 7 NaN values
  ✓ chlorides: 2 BDL flags, 7 NaN values
  ✓ phosphate: 101 BDL flags, 11 NaN values
  ✓ boron: 131 BDL flags, 34 NaN values
  ✓ potassium: 91 BDL flags, 7 NaN values
  ✓ flouride: 115 BDL flags, 18 NaN values
  ✓ dissolved_o2: 8 BDL flags, 7 NaN values
  ✓ total_suspended_solids: 40 BDL flags, 7 NaN values
  ✓ phenophelene_alkanity: 168 BDL flags, 10 NaN values
  ✓ total_alkalinity: 5 BDL flags, 7 NaN values

✓ Numeric normalization complete: 17 columns processed
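
The behavior of the BDL-aware parsing can be checked on a toy series. The sketch below is a standalone stand-in that follows the same three steps (flag, strip annotation, coerce); the function name `split_value_and_bdl` and the sample values are made up for illustration, and the exact annotation spelling in the raw file is an assumption.

```python
import pandas as pd


def split_value_and_bdl(series: pd.Series) -> tuple[pd.Series, pd.Series]:
    """Return (numeric values, BDL flags) for a column of annotated strings."""
    text = series.astype(str)
    is_bdl = text.str.contains("BDL", regex=False, na=False)       # flag first
    cleaned = text.str.replace("(BDL)", "", regex=False).str.strip()
    numeric = pd.to_numeric(cleaned, errors="coerce")              # invalid -> NaN
    return numeric, is_bdl


# Made-up inputs: plain number, annotated number, censored value, missing value
toy = pd.Series(["0.5", "1.8(BDL)", "<0.1", None])
values, flags = split_value_and_bdl(toy)
print(pd.DataFrame({"raw": toy, "value": values, "is_bdl": flags}))
```

Note that a censored value written as `"<0.1"` is not recognized by this logic: it becomes NaN with no BDL flag, so such entries would need their own rule if they occur in the data.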


In [9]:
def parse_dms_coordinate(value) -> float:
    """
    Convert geographic coordinates from degree-minute format to decimal degrees.

    Input Format:
    - "19°29.263'"

    Output:
    - Decimal degrees (e.g., 19.4877)

    Notes:
    - Replaces the Unicode replacement character (U+FFFD) with a degree sign
    - Returns NaN for parsing failures (no exceptions raised)
    - Conversion preserves original geographic meaning

    Example:
        >>> parse_dms_coordinate("19°29.263'")
        19.487716666666667
    """
    if pd.isna(value):
        return np.nan

    try:
        value = str(value).replace("\ufffd", "°")
        degree_part, minute_part = value.split("°")

        degrees = float(degree_part.strip())
        minutes = float(minute_part.replace("'", "").strip())

        return degrees + (minutes / 60)

    except Exception:
        return np.nan

In [10]:
# ============================================================================
# DATA STANDARDIZATION: GEOGRAPHIC COORDINATE FORMAT NORMALIZATION
# ============================================================================

df["latitude"] = df["latitude"].apply(parse_dms_coordinate)
df["longitude"] = df["longitude"].apply(parse_dms_coordinate)

print("✓ Coordinates standardized to decimal degrees")
print(f"  Latitude range: [{df['latitude'].min():.2f}, {df['latitude'].max():.2f}]")
print(f"  Longitude range: [{df['longitude'].min():.2f}, {df['longitude'].max():.2f}]")

✓ Coordinates standardized to decimal degrees
  Latitude range: [16.69, 21.27]
  Longitude range: [73.18, 79.20]


### 2.2 Geographic Coordinate Standardization

**Problem:** Coordinates are stored in DMS format (Degrees°Minutes'), but geospatial tools need decimal degrees.

**Transformation:** DMS → Decimal Degrees
```
19°29.263' → 19.4877 degrees
         ↓
    [degrees] + [minutes/60]
```

**Result:** Coordinates in standard format for mapping, clustering, distance calculations
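
As a worked check of the formula, 19 + 29.263/60 ≈ 19.4877. The sketch below does the same arithmetic with a strict regular expression instead of string splitting; the function name `ddm_to_decimal` is hypothetical, and it assumes non-negative coordinates written as degrees followed by decimal minutes.

```python
import math
import re

# degrees, degree sign, decimal minutes, optional trailing apostrophe
_DDM_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*°\s*(\d+(?:\.\d+)?)\s*'?\s*")


def ddm_to_decimal(text: str) -> float:
    """Convert a string like 19°29.263' to decimal degrees; NaN if it does not match."""
    match = _DDM_PATTERN.fullmatch(text)
    if match is None:
        return math.nan
    degrees, minutes = (float(group) for group in match.groups())
    return degrees + minutes / 60.0


print(round(ddm_to_decimal("19°29.263'"), 4))   # 19.4877
print(ddm_to_decimal("not a coordinate"))       # nan
```

A strict pattern makes unparseable inputs easy to count afterwards, which helps distinguish values that were missing in the source from values that failed to parse.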

In [11]:
df.head()

Unnamed: 0,stn_code,sampling_date,month,sampling_time,stn_name,type_water_body,name_of_water_body,river_basin,district,state_name,mon_agency,frequency,major_polluting_sources,use_based_class,use_of_water_in_down_stream,visibility_effluent_discharge,weather,approx_depth,human_activities,floating_matter,color,odor,flow,temperature,dissolved_o2,ph,conductivity,bod,nitrate_n,fecal_coliform,total_coliform,fecal_streptococci,turbidity,phenophelene_alkanity,total_alkalinity,chlorides,cod,total_kjeldahl_n,amonia_n,hardness_caco3,calcium_caco3,magnesium_caco3,sulphate,sodium,total_dissolved_solids,total_fixed_solids,total_suspended_solids,phosphate,boron,potassium,flouride,remark,latitude,longitude,fecal_coliform_is_bdl,total_coliform_is_bdl,fecal_streptococci_is_bdl,total_kjeldahl_n_is_bdl,nitrate_n_is_bdl,turbidity_is_bdl,sulphate_is_bdl,sodium_is_bdl,chlorides_is_bdl,phosphate_is_bdl,boron_is_bdl,potassium_is_bdl,flouride_is_bdl,dissolved_o2_is_bdl,total_suspended_solids_is_bdl,phenophelene_alkanity_is_bdl,total_alkalinity_is_bdl
0,1312,08-05-2025,Aug,16:30:00,"Godavari river at Jaikwadi Dam, Village. Paith...",River,Godavari,Godavari,Ch. Sambhaji Nagar,Maharashtra,Maharashtra PCB,Monthly,Industrial Effluent,A (Drinking Water source without conventional ...,,Industrial,Clear,Greater than 100cm,Others,No,Clear,Odor Free,1.0,28.0,6.7,8.4,575.0,3.2,0.54,1.8,35.0,1.8,1.0,6.0,122.0,58.48,16.0,1.68,0.43,144.0,64.0,80.0,72.0,60.44,497.0,448.0,10.0,0.78,0.58,2.88,0.5,,19.49,75.37,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False
1,2158,08-05-2025,Aug,16:00:00,Godavari river at U/s of Paithan at Paithan in...,River,Godavari,Godavari,Ch. Sambhaji Nagar,Maharashtra,Maharashtra PCB,Monthly,Industrial Effluent,A (Drinking Water source without conventional ...,,Industrial,Clear,Greater than 100cm,Others,No,Clear,Odor Free,3.0,28.0,6.6,8.3,576.0,3.2,0.52,1.8,25.0,1.8,1.02,6.0,120.0,63.48,16.0,2.8,0.86,142.0,66.0,76.0,68.88,61.1,489.0,441.0,11.0,0.71,0.61,3.14,0.41,,,,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,2159,08-05-2025,Aug,17:15:00,Godavari river at D/s of Paithan at Pathegaon ...,River,Godavari,Godavari,Ch. Sambhaji Nagar,Maharashtra,Maharashtra PCB,Monthly,Industrial Effluent,A (Drinking Water source without conventional ...,,Industrial,Clear,Greater than 100cm,Others,No,Clear,Odor Free,3.0,28.0,6.9,8.5,573.0,3.2,0.56,1.8,20.0,1.8,1.02,6.0,140.0,58.98,16.0,3.36,1.8,140.0,62.0,78.0,68.54,67.84,491.0,442.0,11.0,0.76,0.56,3.16,0.45,,,,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,2160,08-05-2025,Aug,15:00:00,"Godavari river at U/s of Aurangabad Reservoir,...",River,Godavari,Godavari,Ch. Sambhaji Nagar,Maharashtra,Maharashtra PCB,Monthly,Industrial Effluent,A (Drinking Water source without conventional ...,,Industrial,Clear,Greater than 100cm,Others,No,Clear,Odor Free,3.0,28.0,7.0,7.9,592.0,3.4,0.58,1.8,13.0,1.8,1.02,6.0,140.0,55.98,20.0,2.24,0.4,140.0,72.0,68.0,73.3,55.38,479.0,612.0,10.0,1.47,0.55,3.13,0.6,,,,True,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4,2161,08-04-2025,Aug,15:30:00,Godavari river at Jalna Intake water pump hous...,River,Godavari,Godavari,Jalna,Maharashtra,Maharashtra PCB,Monthly,Industrial Effluent,A (Drinking Water source without conventional ...,,Industrial,Clear,Less than 50cm,Others,Yes,Clear,,2.0,29.0,6.6,8.7,922.0,3.8,0.63,1.8,14.0,1.8,1.03,6.0,130.0,98.47,20.0,1.68,0.44,160.0,82.0,78.0,128.8,114.98,765.0,689.0,12.0,0.33,0.56,4.57,0.49,,,,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


### 📊 Step 2 Summary: Type Normalization

#### Transformations Applied

| Issue | Resolution | Count | Outcome |
|-------|-----------|-------|---------|
| Numeric strings with BDL | Parse numeric + flag BDL | 17 cols | Numeric-safe + domain-aware |
| DMS coordinates | Convert to decimal degrees | 2 cols | Geospatial-ready |
| Encoding artifacts | Normalize unicode | All | Clean parsing |

#### Key Decisions

✓ **Preserve BDL as features** (not discard)
- **Why:** BDL may indicate equipment sensitivity (useful for models)
- **Cost:** +17 binary columns
- **Benefit:** Detection-limit readings stay in the dataset instead of being lost

✓ **Standard coordinate format**
- **Why:** Most geospatial tools expect decimal degrees
- **Cost:** The original DMS strings are overwritten in `df` (they remain available in the raw CSV)
- **Benefit:** Enables geographic analysis downstream

#### Data Quality After Step 2
- ✅ The 17 parsed parameter columns are floats (no string-induced crashes)
- ✅ Domain semantics preserved (BDL flags available)
- ✅ Geographic data standardized (ready for mapping/clustering)
- ⚠️ Most coordinate values are missing after conversion (about 94% NaN in the modeling subset)
- ⚠️ Missing values remain (handled in Step 5)

---

## 🎯 Step 3: Target Variable (Use-Based Classification) Processing

### 3.1 Initial Cleanup

**Goal:** Remove formatting noise, standardize strings, prepare for mapping

In [12]:
# ============================================================================
# TARGET VARIABLE STANDARDIZATION
# ============================================================================
# Purpose:
# - Remove formatting noise
# - Ensure consistent string representation
# - Prepare for controlled mapping in later stages

TARGET_COL = "use_based_class"

df[TARGET_COL] = (
    df[TARGET_COL]
    .astype(str)
    .str.strip()
    .replace("nan", np.nan)
)

print(f"✓ Target variable cleaned")
print(f"  Unique values after cleaning: {df[TARGET_COL].nunique()}")

✓ Target variable cleaned
  Unique values after cleaning: 5


### 3.2 Drop Rows with Missing Target

Rows without a target label cannot be used for supervised training, so they are dropped before the domain-specific mapping is applied.


In [13]:
df = df.dropna(subset=[TARGET_COL])


### 3.3 Apply Target Class Mapping

**Mapping Rule:** Verbose regulatory labels → compact codes
```
A: Highest grade (potable, minimal treatment)
B: Recreational (outdoor bathing, organized)
C: Drinking source (potable after treatment)
E: Non-potable (irrigation, industrial)
```

In [14]:
# ============================================================================
# STEP 3.3: TARGET CLASS MAPPING
# ============================================================================
# Map verbose descriptions to short codes per water use classification scheme:
# - A: Highest quality (Drinking without treatment + disinfection only)
# - B: Outdoor bathing (Organized recreational use)
# - C: Drinking water source (requires treatment)
# - E: Non-potable (Irrigation, industrial cooling, waste)
# - No Information: Unmapped/missing ‚Üí remove

TARGET_MAP = {
    "A (Drinking Water source without conventional treatment but after disinfection)": "A",
    "B (Outdoor bathing(Organized))": "B",
    "C (Drinking water source)": "C",
    "E (Irrigation, industrial cooling and controlled waste)": "E",
    "No Information": np.nan,  # Unmapped values → NaN for removal
}

df[TARGET_COL] = df[TARGET_COL].replace(TARGET_MAP)
df = df.dropna(subset=[TARGET_COL])

print(f"✓ Target mapping complete")
print(f"  Final target distribution:")
print(df[TARGET_COL].value_counts().sort_index())
print(f"  Classes after mapping: {sorted(df[TARGET_COL].unique())}")

✓ Target mapping complete
  Final target distribution:
use_based_class
A    141
B      5
C      6
E     19
Name: count, dtype: int64
  Classes after mapping: ['A', 'B', 'C', 'E']
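
One caveat with dictionary-based relabeling: `Series.replace` leaves any label that is not in the dictionary unchanged, so a new or misspelled label would pass through silently. A possible guard, sketched here with two of the labels above plus one made-up label, is to list unmapped values before relabeling (the names `label_map`, `raw`, and `unmapped` are illustrative only).

```python
import pandas as pd

# Two real labels from the mapping above; the third label is invented for the demo
label_map = {
    "C (Drinking water source)": "C",
    "E (Irrigation, industrial cooling and controlled waste)": "E",
}
raw = pd.Series([
    "C (Drinking water source)",
    "E (Irrigation, industrial cooling and controlled waste)",
    "Z (hypothetical new label)",
])

# Labels present in the data but absent from the mapping
unmapped = sorted(set(raw.dropna()) - set(label_map))

# Series.map returns NaN for keys missing from the dict (unlike Series.replace)
coded = raw.map(label_map)

print("Unmapped labels:", unmapped)
print(coded.tolist())
```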


### ✅ Target Variable Processing Summary

**Objective:** Standardize and encode the water use classification target

#### Target Classes (Water Use Classification)
| Code | Full Description | Use Case | Quality Level |
|------|------------------|----------|---------------|
| **A** | Drinking (no treatment, disinfection only) | High-grade supply | Excellent |
| **B** | Outdoor bathing (organized) | Recreation | Good |
| **C** | Drinking water source | Municipal supply | Acceptable |
| **E** | Irrigation/Industrial/Waste | Non-potable | Controlled use |

#### Standardization Pipeline

```
Input  →  Strip spaces  →  Map verbose labels  →  Remove unmapped  →  Output
Raw       Whitespace        → A/B/C/E codes      (No Info)→NaN        Clean
```

#### Quality Assurance

✓ **Domain-traceable** (can revert to original strings if needed)
✓ **Ordinal relationships** understood (A > B > C > E in quality)
✓ **Binary encodable** (e.g., One-Hot for 4 classes)
⚠️ **Imbalanced** (class A is about 82% of rows; consider stratification in CV/train-test split)
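
For illustration only (model training is out of scope for this notebook), a stratified hold-out can be sketched with pandas alone by sampling the same fraction from every class. The toy frame reproduces the class counts reported above; the 20% fraction and the seed are arbitrary assumptions.

```python
import pandas as pd

# Toy frame with the reported class counts: A=141, B=5, C=6, E=19
labels = ["A"] * 141 + ["B"] * 5 + ["C"] * 6 + ["E"] * 19
toy = pd.DataFrame({"use_based_class": labels})

# Draw 20% of each class, so the rare classes B and C still appear in the hold-out
test = toy.groupby("use_based_class").sample(frac=0.2, random_state=42)
train = toy.drop(index=test.index)

print(test["use_based_class"].value_counts().sort_index())
```

With only 5 and 6 rows in classes B and C, any split leaves very few examples per side, so cross-validated metrics for those classes will be noisy.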

#### Approach

- ✅ Stripped whitespace and converted "nan" strings to missing values
- ✅ Applied domain-based mapping: 5 raw labels (4 verbose descriptions plus "No Information") → 4 short codes
- ✅ Removed "No Information" entries (non-trainable)
- ✅ Verified final class distribution

#### Results

| Metric | Value | Status |
|--------|-------|--------|
| Data Quality | 100% Valid | ✓ Ready for modeling |
| Classes | 4 | ✓ Valid (A, B, C, E) |
| Class Balance | A 82%, E 11%, C 4%, B 3% | ⚠️ Strong imbalance |
| Records with Valid Target | 171 | ✓ Complete labels |
| Records Removed | Rows with a missing or "No Information" target (count not reported) | ✓ Unmappable cleaned |
| No target variable missing | NaN = 0 | ✓ Complete |

#### Modeling Readiness

- ✓ Domain semantics preserved
- ✓ Reproducible encoding applied
- ✓ All samples mapped to valid class

---

## 🚀 Step 4: Feature Curation & Model Dataset Assembly

### 4.1 Identify Problematic Columns

**Goal:** Remove columns that violate ML principles or add no signal

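### 4.2 Exclusion Rationale (15 Columns Removed)

**Categories of Removal:** structural issues (empty columns), metadata, data leakage risk, and identifiers/geospatial fields. The next cell is a preliminary, identifier-only drop. The final 15-column exclusion list (`MODEL_DROP_COLS`) is defined further down and is applied to `df` directly, so it supersedes this first pass.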

In [15]:
DROP_FOR_MODEL = [
    "stn_code",
    "stn_name",
    "name_of_water_body",
    "district",          # optional: keep for geo analysis, not model
    "river_basin",       # optional
]

df_model = df.drop(columns=DROP_FOR_MODEL, errors="ignore")


## 🛠️ Model Dataset Preparation

### Objective
Create final clean dataset for modeling by removing metadata, identifiers, and leakage-prone columns; handle remaining missing values through imputation.


In [16]:
df_model.info()


<class 'pandas.DataFrame'>
Index: 171 entries, 0 to 221
Data columns (total 66 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   sampling_date                  171 non-null    str    
 1   month                          171 non-null    str    
 2   sampling_time                  171 non-null    str    
 3   type_water_body                171 non-null    str    
 4   state_name                     171 non-null    str    
 5   mon_agency                     171 non-null    str    
 6   frequency                      171 non-null    str    
 7   major_polluting_sources        137 non-null    str    
 8   use_based_class                171 non-null    str    
 9   use_of_water_in_down_stream    0 non-null      float64
 10  visibility_effluent_discharge  108 non-null    str    
 11  weather                        171 non-null    str    
 12  approx_depth                   171 non-null    str    
 13  human_

In [17]:
df_model.isna().mean().sort_values(ascending=False)


use_of_water_in_down_stream   1.00
remark                        1.00
longitude                     0.94
latitude                      0.94
odor                          0.78
                              ... 
cod                           0.00
total_kjeldahl_n              0.00
month                         0.00
hardness_caco3                0.00
total_alkalinity_is_bdl       0.00
Length: 66, dtype: float64

In [18]:
df_model[TARGET_COL].value_counts(normalize=True)


use_based_class
A   0.82
E   0.11
C   0.04
B   0.03
Name: proportion, dtype: float64

In [19]:
# ============================================================================
# FEATURE SELECTION: COLUMNS TO EXCLUDE FROM MODELING
# ============================================================================
# Rationale: Remove columns that violate modeling principles or provide no signal
# See prior categorical analysis summaries for decision justification

MODEL_DROP_COLS = [
    # ── Structural Issues ──
    "use_of_water_in_down_stream",  # Entirely empty (0 non-null values)
    "remark",                       # Effectively all missing, no predictive value

    # ── Metadata (Not Features) ──
    # These describe when/where data was collected, not water quality itself
    "sampling_date",                # Temporal context (feature engineering needed)
    "sampling_time",                # Temporal context
    "month",                        # Seasonality proxy, redundant with sampling_date
    "state_name",                   # Geographic metadata
    "mon_agency",                   # Data collection agency (not water property)
    "frequency",                    # Sampling frequency (metadata)

    # ── Data Leakage Risk ──
    # These features have a near-deterministic relationship with the target
    # Model would learn the mapping, not water quality patterns
    "major_polluting_sources",      # Quasi-determined by water class (leakage)
    "visibility_effluent_discharge", # Highly correlated with classification

    # ── Identifiers / Geospatial ──
    # Keep geospatial in separate analysis; exclude from general model
    "stn_code",                     # Station ID (unique identifier, no signal)
    "stn_name",                     # Station name (identifier)
    "name_of_water_body",           # Water body name (identifier)
    "latitude",                     # Geographic location (use in geo-analysis only)
    "longitude",                    # Geographic location
]

print(f"  Rationale categories: Structural, Metadata, Leakage, Identifiers")
print(f"✓ Feature exclusion list prepared: {len(MODEL_DROP_COLS)} columns will be removed")

  Rationale categories: Structural, Metadata, Leakage, Identifiers
✓ Feature exclusion list prepared: 15 columns will be removed


In [20]:
# Rebuild from `df` (not the earlier `df_model`), so only MODEL_DROP_COLS are
# removed; `district` and `river_basin` are retained at this stage.
df_model = df.drop(columns=MODEL_DROP_COLS, errors="ignore")

print(f"\n✓ Removed {len(MODEL_DROP_COLS)} problematic columns")
print(f"  Remaining columns: {df_model.shape[1]} → ready for imputation")



✓ Removed 15 problematic columns
  Remaining columns: 56 → ready for imputation
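
Because `errors="ignore"` silently skips names that do not exist, a typo in the drop list would leave the intended column in place while the printed count still looks correct. A small pre-check, sketched below on a hypothetical toy frame with a deliberately misspelled entry, makes such cases visible.

```python
import pandas as pd

# Hypothetical toy frame and drop list; "stn_cod" is a deliberate typo
frame = pd.DataFrame({"stn_code": [1312], "remark": [None], "ph": [8.4]})
drop_cols = ["stn_cod", "remark"]

# Names requested for removal that are not actually present
not_found = [col for col in drop_cols if col not in frame.columns]

reduced = frame.drop(columns=drop_cols, errors="ignore")

print("Requested but not found:", not_found)
print("Remaining columns:", list(reduced.columns))
```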


In [21]:
df_model.info()


<class 'pandas.DataFrame'>
Index: 171 entries, 0 to 221
Data columns (total 56 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   type_water_body                171 non-null    str    
 1   river_basin                    149 non-null    str    
 2   district                       171 non-null    str    
 3   use_based_class                171 non-null    str    
 4   weather                        171 non-null    str    
 5   approx_depth                   171 non-null    str    
 6   human_activities               171 non-null    str    
 7   floating_matter                171 non-null    str    
 8   color                          171 non-null    str    
 9   odor                           38 non-null     str    
 10  flow                           171 non-null    float64
 11  temperature                    149 non-null    float64
 12  dissolved_o2                   171 non-null    float64
 13  ph    

In [22]:
df_model.isna().mean().sort_values(ascending=False)


odor                            0.78
fecal_streptococci              0.18
river_basin                     0.13
temperature                     0.13
boron                           0.11
flouride                        0.06
phosphate                       0.02
phenophelene_alkanity           0.02
nitrate_n                       0.01
total_kjeldahl_n_is_bdl         0.00
fecal_coliform_is_bdl           0.00
potassium                       0.00
total_suspended_solids          0.00
total_coliform_is_bdl           0.00
total_fixed_solids              0.00
fecal_streptococci_is_bdl       0.00
type_water_body                 0.00
nitrate_n_is_bdl                0.00
sodium                          0.00
turbidity_is_bdl                0.00
sulphate_is_bdl                 0.00
sodium_is_bdl                   0.00
chlorides_is_bdl                0.00
phosphate_is_bdl                0.00
boron_is_bdl                    0.00
potassium_is_bdl                0.00
flouride_is_bdl                 0.00
d

In [23]:
NUMERIC_COLS = df_model.select_dtypes(include=["float64", "int64"]).columns.tolist()
CATEGORICAL_COLS = df_model.select_dtypes(include=["object", "string"]).columns.tolist()

print(f"✓ Column classification:")
print(f"  Numeric: {len(NUMERIC_COLS)} columns")
print(f"  Categorical: {len(CATEGORICAL_COLS)} columns")

✓ Column classification:
  Numeric: 29 columns
  Categorical: 10 columns


In [24]:
HIGH_MISSING_CATS = ["odor"]  # Mark as special case for imputation

print(f"✓ Special handling flags set: {len(HIGH_MISSING_CATS)} high-missing categorical column(s)")

✓ Special handling flags set: 1 high-missing categorical column(s)


---

## 🔧 Step 5: Missing Value Imputation

### 5.1 Numeric Imputation (Median Strategy)

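**Rationale:**
- **Median**: Robust to outliers (unlike the mean), so it is better suited to skewed distributions
- **Limited distortion**: The fill value sits in the bulk of the data. Imputation adds a spike at the median, which is tolerable at the missing rates seen here (roughly 18% or less per numeric column)
- **Assumption**: Data are missing completely at random (MCAR)
- **Alternative**: KNN imputation, MICE (more complex; not needed here)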
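
The difference between the two fill values is easy to see on a small skewed sample. The numbers below are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed readings with one extreme value and one gap
readings = pd.Series([1.0, 2.0, 3.0, 1000.0, np.nan])

mean_fill = readings.mean()      # pulled upward by the extreme value
median_fill = readings.median()  # stays near the bulk of the data

imputed = readings.fillna(median_fill)
print(f"mean={mean_fill:.1f}, median={median_fill:.1f}")  # mean=251.5, median=2.5
```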

In [25]:
# ============================================================================
# STEP 5.1: NUMERIC IMPUTATION - MEDIAN STRATEGY
# ============================================================================
# Rationale:
# - Median is robust to outliers (unlike mean) → better for skewed distributions
# - Single, stable fill value per column (simple and reproducible)
# - Missing completely at random (MCAR) assumption acceptable here
# Alt. Strategies: KNN imputation, iterative (MICE), model-based (more complex)

for col in NUMERIC_COLS:
    if col == TARGET_COL:  # Skip target variable
        continue

    median_value = df_model[col].median()
    missing_count = df_model[col].isna().sum()
    df_model[col] = df_model[col].fillna(median_value)
    print(f"  ✓ {col}: Imputed {missing_count} NaN values with median={median_value:.2f}")

print(f"\n✓ Numeric imputation complete (median strategy)")

  ✓ flow: Imputed 0 NaN values with median=0.00
  ✓ temperature: Imputed 22 NaN values with median=27.00
  ✓ dissolved_o2: Imputed 0 NaN values with median=6.10
  ✓ ph: Imputed 0 NaN values with median=7.90
  ✓ conductivity: Imputed 0 NaN values with median=467.00
  ✓ bod: Imputed 0 NaN values with median=4.00
  ✓ nitrate_n: Imputed 2 NaN values with median=0.85
  ✓ fecal_coliform: Imputed 0 NaN values with median=32.00
  ✓ total_coliform: Imputed 0 NaN values with median=280.00
  ✓ fecal_streptococci: Imputed 30 NaN values with median=1.80
  ✓ turbidity: Imputed 0 NaN values with median=1.40
  ✓ phenophelene_alkanity: Imputed 3 NaN values with median=5.00
  ✓ total_alkalinity: Imputed 0 NaN values with median=104.00
  ✓ chlorides: Imputed 0 NaN values with median=27.49
  ✓ cod: Imputed 0 NaN values with median=16.00
  ✓ total_kjeldahl_n: Imputed 0 NaN values with median=1.68
  ✓ amonia_n: Imputed 0 NaN values with median=0.41
  ✓ hardness_caco3: Imp

### 5.2 Categorical Imputation (Mode + Special Cases)

**Two-Tier Strategy:**

**Tier 1 ‚Äì High-Missing Columns (>30% NaN):**
- **Action:** Fill with explicit "unknown" value
- **Rationale:** "unknown" becomes informative (signals data quality issue)
- **Columns:** odor (and others if discovered)

**Tier 2 – Low-Missing Columns (≤30% NaN):**
- **Action:** Fill with mode (most frequent value)
- **Rationale:** Preserves distribution, minimal information loss
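
The two tiers can be sketched as a single function driven by the 30% threshold. This is an illustrative assumption: the notebook itself relies on a predefined `HIGH_MISSING_CATS` list, whereas the hypothetical `impute_categorical` below derives the tier from each column's missing ratio. The toy values are made up.

```python
import pandas as pd

HIGH_MISSING_THRESHOLD = 0.30  # mirrors the >30% rule described above


def impute_categorical(df, cols, threshold=HIGH_MISSING_THRESHOLD):
    """Hypothetical two-tier imputer: 'unknown' above the threshold, mode otherwise."""
    out = df.copy()
    strategy = {}
    for col in cols:
        missing_ratio = out[col].isna().mean()
        if missing_ratio > threshold:
            # Tier 1: make missingness an explicit category
            out[col] = out[col].fillna("unknown")
            strategy[col] = "unknown"
        else:
            # Tier 2: fill with the most frequent observed value
            mode_value = out[col].mode().iloc[0]
            out[col] = out[col].fillna(mode_value)
            strategy[col] = f"mode={mode_value}"
    return out, strategy


# Toy data (made up, not taken from the AquaSafe dataset)
toy_categorical = pd.DataFrame({
    "odor": ["None", None, None, None, "Fishy"],            # 60% missing
    "weather": ["Clear", "Clear", None, "Rainy", "Clear"],  # 20% missing
})

toy_categorical_imputed, toy_strategy = impute_categorical(
    toy_categorical, ["odor", "weather"]
)
print(toy_strategy)
```

Deriving the tier from the data avoids a hand-maintained list going stale, at the cost of the tier assignment changing silently if missingness shifts between dataset versions.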

In [26]:
# ============================================================================
# STEP 6.2: CATEGORICAL IMPUTATION - MODE + SPECIAL CASE HANDLING
# ============================================================================
# Strategy:
# 1. High-missing categories (>30% missing) → Explicit "unknown" flag
#    Rationale: "unknown" becomes informative feature (represents data quality)
# 2. Low-missing categories → Mode (most frequent value)
#    Rationale: Preserves original distribution, minimal information loss

for col in CATEGORICAL_COLS:
    if col == "use_based_class":  # Skip target
        continue
    
    missing_count = df_model[col].isna().sum()
    missing_pct = (missing_count / len(df_model)) * 100
    
    if col in HIGH_MISSING_CATS:
        # Special handling: Create explicit "unknown" category
        df_model[col] = df_model[col].fillna("unknown")
        print(f"  ✓ {col}: Imputed {missing_count} ({missing_pct:.1f}%) with 'unknown' [HIGH_MISSING]")
    else:
        # Standard handling: Use mode (most frequent value)
        mode_value = df_model[col].mode().iloc[0]
        df_model[col] = df_model[col].fillna(mode_value)
        print(f"  ✓ {col}: Imputed {missing_count} ({missing_pct:.1f}%) with mode='{mode_value}'")

print("\n✓ Categorical imputation complete")

  ✓ type_water_body: Imputed 0 (0.0%) with mode='River'
  ✓ river_basin: Imputed 22 (12.9%) with mode='Godavari'
  ✓ district: Imputed 0 (0.0%) with mode='Thane'
  ✓ weather: Imputed 0 (0.0%) with mode='Clear'
  ✓ approx_depth: Imputed 0 (0.0%) with mode='Less than 50cm'
  ✓ human_activities: Imputed 0 (0.0%) with mode='Others'
  ✓ floating_matter: Imputed 0 (0.0%) with mode='Yes'
  ✓ color: Imputed 0 (0.0%) with mode='Clear'
  ✓ odor: Imputed 133 (77.8%) with 'unknown' [HIGH_MISSING]

✓ Categorical imputation complete


In [27]:
df_model.isna().mean().sort_values(ascending=False)


type_water_body                 0.00
river_basin                     0.00
sulphate                        0.00
sodium                          0.00
total_dissolved_solids          0.00
total_fixed_solids              0.00
total_suspended_solids          0.00
phosphate                       0.00
boron                           0.00
potassium                       0.00
flouride                        0.00
fecal_coliform_is_bdl           0.00
total_coliform_is_bdl           0.00
fecal_streptococci_is_bdl       0.00
total_kjeldahl_n_is_bdl         0.00
nitrate_n_is_bdl                0.00
turbidity_is_bdl                0.00
sulphate_is_bdl                 0.00
sodium_is_bdl                   0.00
chlorides_is_bdl                0.00
phosphate_is_bdl                0.00
boron_is_bdl                    0.00
potassium_is_bdl                0.00
flouride_is_bdl                 0.00
dissolved_o2_is_bdl             0.00
total_suspended_solids_is_bdl   0.00
phenophelene_alkanity_is_bdl    0.00
m

In [28]:
df_model.info()



<class 'pandas.DataFrame'>
Index: 171 entries, 0 to 221
Data columns (total 56 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   type_water_body                171 non-null    str    
 1   river_basin                    171 non-null    str    
 2   district                       171 non-null    str    
 3   use_based_class                171 non-null    str    
 4   weather                        171 non-null    str    
 5   approx_depth                   171 non-null    str    
 6   human_activities               171 non-null    str    
 7   floating_matter                171 non-null    str    
 8   color                          171 non-null    str    
 9   odor                           171 non-null    str    
 10  flow                           171 non-null    float64
 11  temperature                    171 non-null    float64
 12  dissolved_o2                   171 non-null    float64
 13  ph    

### 5.3 Final Cleanup: Drop Fully Empty Columns

**Why?**
- Columns with 100% NaN carry **zero information**
- Keeping them breaks regression checks
- Impacts downstream assumptions (correlation matrices, etc.)
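
As a minimal sketch of this cleanup step, pandas can both list and drop fully empty columns in one pass each. The toy frame and the column name `remarks` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): "remarks" is 100% NaN, "ph" is only partly missing
toy_sparse = pd.DataFrame({
    "ph": [7.5, np.nan, 8.1],
    "remarks": [np.nan, np.nan, np.nan],
})

# Columns where every value is missing
toy_empty_cols = toy_sparse.columns[toy_sparse.isna().all()].tolist()

# dropna(axis=1, how="all") removes only columns that are entirely NaN
toy_trimmed = toy_sparse.dropna(axis=1, how="all")

print(toy_empty_cols)
print(toy_trimmed.columns.tolist())
```

Partly missing columns such as `ph` survive, so this step is safe to run even when no column is fully empty.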

In [29]:
# ============================================================================
# DROP FULLY EMPTY COLUMNS (FINAL CLEANUP)
# ============================================================================
# Columns with 100% missing values carry no information
# Keeping them breaks regression checks and downstream assumptions

empty_cols = [
    col for col in df_model.columns
    if df_model[col].isna().mean() == 1.0
]

print(f"Dropping fully empty columns ({len(empty_cols)}):")
for col in empty_cols:
    print(f"  - {col}")

df_model = df_model.drop(columns=empty_cols)


Dropping fully empty columns (0):


In [30]:
# ============================================================================
# FINAL CLEANED DATASET SNAPSHOT
# ============================================================================
# Purpose:
# - Establish a clear boundary between cleaning and downstream steps
# - Prevent accidental re-cleaning or mutation in later notebooks

df_cleaned = df_model.copy()

print(f"✓ Cleaned dataset snapshot created: {df_cleaned.shape}")


✓ Cleaned dataset snapshot created: (171, 56)


In [31]:
# ============================================================================
# REGRESSION CHECKS (SCHEMA & DATA INTEGRITY)
# ============================================================================
# These assertions ensure the dataset is safe to consume downstream

# Dataset existence
assert df_cleaned.shape[0] > 0, "Dataset is empty after cleaning"

# Target integrity
assert df_cleaned[TARGET_COL].isna().sum() == 0, "Target contains NaN values"

# Duplicate safety
assert df_cleaned.duplicated().sum() == 0, "Duplicate rows detected"

# Numeric sanity (ensures no column is fully missing)
numeric_nan_ratio = (
    df_cleaned
    .select_dtypes(include="number")
    .isna()
    .mean()
    .max()
)

assert numeric_nan_ratio < 1.0, "At least one numeric column is fully NaN"

print("✓ Regression checks passed")


✓ Regression checks passed
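
If the same checks need to run in later notebooks, one option is to collect them in a function that reports every failure instead of stopping at the first failed `assert`. The sketch below is an assumption about how that could look: `validate_cleaned` is a hypothetical name, and the toy frames are made up.

```python
import pandas as pd


def validate_cleaned(df, target_col):
    """Hypothetical validator: returns a list of failures (empty list = all passed)."""
    if df.shape[0] == 0:
        return ["Dataset is empty after cleaning"]

    failures = []
    if target_col not in df.columns:
        failures.append(f"Target column '{target_col}' is missing")
    elif df[target_col].isna().any():
        failures.append("Target contains NaN values")
    if df.duplicated().any():
        failures.append("Duplicate rows detected")
    fully_nan = df.columns[df.isna().all()].tolist()
    if fully_nan:
        failures.append(f"Fully NaN columns: {fully_nan}")
    return failures


# Toy data (made up, not taken from the AquaSafe dataset)
toy_good = pd.DataFrame({"ph": [7.5, 8.1], "use_based_class": ["A", "B"]})
toy_bad = pd.DataFrame({
    "ph": [7.5, 7.5, 8.0],
    "use_based_class": ["A", "A", None],  # duplicate row + missing target
})

print(validate_cleaned(toy_good, "use_based_class"))
print(validate_cleaned(toy_bad, "use_based_class"))
```

Returning a list makes it possible to log all problems at once; a caller that wants the original fail-fast behavior can still `assert not failures`.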


In [32]:
csv_folder_path = os.path.join(DATA_DIR, "processed", "csv")
parquet_folder_path = os.path.join(DATA_DIR, "processed", "parquet")
Path(csv_folder_path).mkdir(parents=True, exist_ok=True)  # create folder if it does not exist
Path(parquet_folder_path).mkdir(parents=True, exist_ok=True)  # create folder if it does not exist

In [33]:
# ============================================================================
# EXPORT CLEANED DATASET (PIPELINE CONTRACT)
# ============================================================================
# This file is the ONLY input for subsequent notebooks

OUTPUT_PATH = os.path.join(parquet_folder_path, "cleaned_water_quality_data.parquet")

df_cleaned.to_parquet(
    OUTPUT_PATH,
    index=False
)

print(f"✓ Cleaned dataset exported to {OUTPUT_PATH}")


✓ Cleaned dataset exported to /Users/rex/Documents/personal/AquaSafe/data/processed/parquet/cleaned_water_quality_data.parquet


In [34]:
# ============================================================================
# EXPORT CSV BACKUP (HUMAN-READABLE COPY)
# ============================================================================
# Backup copy for inspection; the parquet file above remains the pipeline input

OUTPUT_PATH = os.path.join(csv_folder_path, "cleaned_water_quality_data.csv")

df_cleaned.to_csv(
    OUTPUT_PATH,
    index=False
)

print(f"✓ Cleaned dataset exported to {OUTPUT_PATH}")


✓ Cleaned dataset exported to /Users/rex/Documents/personal/AquaSafe/data/processed/csv/cleaned_water_quality_data.csv
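
A quick round-trip read is one way to confirm that an export is usable before downstream notebooks depend on it. The sketch below uses a temporary directory and a made-up toy frame rather than the real output paths; the file name `toy_cleaned.csv` is hypothetical. CSV does not store dtypes, so the reader has to re-infer them on load, which is one reason to treat the parquet file as the primary artifact.

```python
import os
import tempfile

import pandas as pd

# Toy data (made up, not taken from the AquaSafe dataset)
toy_export = pd.DataFrame({
    "district": ["Thane", "Pune"],
    "ph": [7.5, 8.0],
    "ph_is_bdl": [False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy_export.to_csv(csv_path, index=False)  # same call pattern as the export cell
    toy_reloaded = pd.read_csv(csv_path)

# Structural checks: same shape and same column order after the round trip
same_shape = toy_reloaded.shape == toy_export.shape
same_columns = list(toy_reloaded.columns) == list(toy_export.columns)
print(same_shape, same_columns)
```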
