# Phase 3: Enhanced Feature Modeling

This notebook reproduces and extends the Benchmark Model experiment using the **enhanced spectral indices** generated in Phase 2.

## Objectives
1. Load the enhanced Landsat feature set from `landsat_features_training_enhanced.csv`
2. Reproduce the benchmark Random Forest pipeline for a fair apples-to-apples comparison
3. Train an enhanced model using all 18 new spectral indices plus the raw bands, baseline indices, PET, and temporal features
4. Compare performance metrics against the benchmark
5. Plot feature importances for each target
6. Generate submission predictions using the best model

## Benchmark Results (to beat)
| Parameter | R² Train | RMSE Train | R² Test | RMSE Test |
|---|---|---|---|---|
| Total Alkalinity | 0.903 | 23.12 | 0.546 | 50.88 |
| Electrical Conductance | 0.918 | 98.03 | 0.585 | 220.21 |
| Dissolved Reactive Phosphorus | 0.882 | 17.45 | 0.529 | 35.18 |
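
The gap between the train and test scores in this table is worth quantifying before trying to beat it. A small sketch, using only the values from the table above (the names `benchmark` and `r2_gap` are illustrative):

```python
# Benchmark scores copied from the table above: (R2 train, R2 test).
benchmark = {
    "Total Alkalinity": (0.903, 0.546),
    "Electrical Conductance": (0.918, 0.585),
    "Dissolved Reactive Phosphorus": (0.882, 0.529),
}

# Train-minus-test R2 gap per target; a large gap suggests overfitting.
r2_gap = {name: round(train - test, 3) for name, (train, test) in benchmark.items()}

for name, gap in r2_gap.items():
    print(f"{name:32s} R2 gap: {gap:.3f}")
```

All three gaps exceed 0.3, so the benchmark fits its training split far better than the held-out split. Closing that gap is as relevant as raising the test score itself.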


## Step 1: Load Dependencies


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

RANDOM_STATE = 42
TEST_SIZE    = 0.3

print('Libraries loaded.')

## Step 2: Load Data


In [None]:
# Water quality labels
wq_df = pd.read_csv('water_quality_training_dataset.csv')
print(f'Water quality shape:      {wq_df.shape}')

# Enhanced Landsat features (produced by 02_Feature_Engineering_Spectral_Indices.ipynb)
landsat_train = pd.read_csv('landsat_features_training_enhanced.csv')
print(f'Enhanced Landsat train:   {landsat_train.shape}')

landsat_val   = pd.read_csv('landsat_features_validation_enhanced.csv')
print(f'Enhanced Landsat val:     {landsat_val.shape}')

# TerraClimate PET
tc_train = pd.read_csv('terraclimate_features_training.csv')
tc_val   = pd.read_csv('terraclimate_features_validation.csv')
print(f'TerraClimate train:       {tc_train.shape}')
print(f'TerraClimate val:         {tc_val.shape}')

# Submission template
submission_template = pd.read_csv('submission_template.csv')
print(f'Submission template:      {submission_template.shape}')

## Step 3: Merge Datasets & Add Temporal Features


In [None]:
def build_training_frame(wq, landsat, terraclimate):
    """
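    Concatenate water-quality labels, enhanced Landsat bands/indices,
    and TerraClimate PET into a single training DataFrame.
    The frames are joined column-wise on their index, so all three must
    share the same row order. Duplicate columns (Lat/Lon/Date) are
    removed after concat, keeping the first occurrence.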
    """
    df = pd.concat([wq, landsat, terraclimate], axis=1)
    df = df.loc[:, ~df.columns.duplicated()]
    return df


def add_temporal_features(df):
    """
    Extract month and year from 'Sample Date' as numeric features.
    Month captures seasonality; year captures any long-term drift.
    """
    df = df.copy()
    dates = pd.to_datetime(df['Sample Date'], dayfirst=True, errors='coerce')
    df['Month'] = dates.dt.month.astype(float)
    df['Year']  = dates.dt.year.astype(float)
    return df


train_df = build_training_frame(wq_df, landsat_train, tc_train)
train_df = add_temporal_features(train_df)

print(f'Merged training shape: {train_df.shape}')
print(f'Columns: {list(train_df.columns)}')
train_df.head()
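
`build_training_frame` relies on the three CSVs sharing the same row order. If that cannot be guaranteed, a key-based join is a safer alternative. The sketch below is a minimal illustration, not the benchmark's method: the helper name `merge_on_keys` and the key columns in `KEYS` are assumptions and should be adjusted to the real CSV headers.

```python
import pandas as pd

# Assumed key columns; adjust to the real CSV headers.
KEYS = ["Latitude", "Longitude", "Sample Date"]


def merge_on_keys(wq, landsat, terraclimate, keys=KEYS):
    """Join the three frames on shared keys instead of relying on row order."""
    df = wq.merge(landsat, on=keys, how="left", validate="one_to_one")
    df = df.merge(terraclimate, on=keys, how="left", validate="one_to_one")
    return df


# Toy frames; the Landsat rows are deliberately in a different order.
wq_demo = pd.DataFrame({
    "Latitude": [1.0, 2.0], "Longitude": [10.0, 20.0],
    "Sample Date": ["01-02-2015", "03-04-2015"],
    "Total Alkalinity": [50.0, 80.0],
})
landsat_demo = pd.DataFrame({
    "Latitude": [2.0, 1.0], "Longitude": [20.0, 10.0],
    "Sample Date": ["03-04-2015", "01-02-2015"],
    "NDMI": [0.4, 0.1],
})
tc_demo = pd.DataFrame({
    "Latitude": [1.0, 2.0], "Longitude": [10.0, 20.0],
    "Sample Date": ["01-02-2015", "03-04-2015"],
    "pet": [120.0, 95.0],
})

merged_demo = merge_on_keys(wq_demo, landsat_demo, tc_demo)
print(merged_demo)
```

`validate="one_to_one"` makes pandas raise if a key combination appears more than once in either frame, which would otherwise silently multiply rows. If the real data contains repeated site/date pairs, that check needs to be relaxed or the duplicates resolved first.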

## Step 4: Handle Missing Values (Median Imputation)


In [None]:
print('Missing values before imputation:')
missing = train_df.isnull().sum()
print(missing[missing > 0].to_string())

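# Medians are computed over the full training frame, before the train/test split;
# numeric target columns with gaps are filled as well.
train_df = train_df.fillna(train_df.median(numeric_only=True))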

print(f'\nMissing after imputation: {train_df.isnull().sum().sum()}')
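
Because the medians above are computed before the train/test split in Step 6, the held-out rows contribute to the imputation values. A stricter variant fits the medians on the training rows only. The following is a minimal sketch on toy data, assuming scikit-learn's `SimpleImputer`; it is not part of the benchmark pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing values (illustrative only).
X_demo = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0],
                   [4.0, 40.0], [5.0, 50.0], [6.0, np.nan]])
y_demo = np.arange(len(X_demo), dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Fit medians on the training rows only, then reuse them for the test rows.
imputer = SimpleImputer(strategy="median")
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)

print("Medians learned from the training split:", imputer.statistics_)
```

Only feature columns are passed to the imputer here. Rows with a missing target are usually better dropped than filled, since an imputed label is not a real measurement.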

## Step 5: Define Feature Sets

Two feature sets are defined for a controlled comparison:

- **`BASELINE_FEATURES`** — identical to the Benchmark notebook (`swir22`, `NDMI`, `MNDWI`, `pet`)
- **`ENHANCED_FEATURES`** — all 18 spectral indices from Phase 2 + raw bands + PET + temporal features
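
For reference, the two baseline indices are normalized differences of raw bands. Phase 2's exact formulas are not shown in this notebook, so the sketch below uses the commonly cited definitions (NDMI from NIR and SWIR1, MNDWI from green and SWIR1). The helper name `normalized_difference` and the toy reflectance values are illustrative.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom == 0, np.nan, (a - b) / denom)


# Toy reflectance values (illustrative only).
nir_demo    = np.array([0.30, 0.20])
green_demo  = np.array([0.10, 0.20])
swir16_demo = np.array([0.10, 0.20])

ndmi_demo  = normalized_difference(nir_demo, swir16_demo)    # (NIR - SWIR1) / (NIR + SWIR1)
mndwi_demo = normalized_difference(green_demo, swir16_demo)  # (Green - SWIR1) / (Green + SWIR1)
print(ndmi_demo, mndwi_demo)
```

Every index of this form lies in [-1, 1] for non-negative reflectances, which is a quick range check to run on the Phase 2 columns.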


In [None]:
TARGETS = ['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']

BASELINE_FEATURES = ['swir22', 'NDMI', 'MNDWI', 'pet']

ENHANCED_FEATURES = [
    # Raw bands
    'nir', 'green', 'swir16', 'swir22',
    # Baseline indices
    'NDMI', 'MNDWI',
    # New vegetation / water indices
    'NDVI', 'NDWI', 'NDSI_water', 'NDTI',
    'Turbidity_Index', 'Chlorophyll_Proxy', 'BSI',
    # Band ratios
    'SWIR22_NIR_ratio', 'SWIR16_NIR_ratio', 'Green_NIR_ratio',
    'SWIR22_Green_ratio', 'SWIR16_Green_ratio',
    # Log-transformed bands
    'log_nir', 'log_green', 'log_swir16', 'log_swir22',
    # Quadratic terms
    'nir_squared', 'swir22_squared',
    # Climate
    'pet',
    # Temporal
    'Month', 'Year',
]

# Keep only columns that actually exist after merging
ENHANCED_FEATURES = [c for c in ENHANCED_FEATURES if c in train_df.columns]

print(f'Baseline feature count:  {len(BASELINE_FEATURES)}')
print(f'Enhanced feature count:  {len(ENHANCED_FEATURES)}')
print(f'\nEnhanced features:\n{ENHANCED_FEATURES}')

## Step 6: Helper Functions (Pipeline)


In [None]:
def split_data(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


def scale_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled  = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler


def train_model(X_train_scaled, y_train, n_estimators=200):
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features='sqrt',
        min_samples_leaf=2,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )
    model.fit(X_train_scaled, y_train)
    return model


def evaluate_model(model, X_scaled, y_true, dataset_name='Test'):
    y_pred = model.predict(X_scaled)
    r2   = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f'  {dataset_name:6s} — R²: {r2:.4f}  |  RMSE: {rmse:.4f}')
    return y_pred, r2, rmse


def run_pipeline(df, feature_cols, target_col, label=''):
    """
    Full pipeline: split → scale → train → evaluate.
    Returns (model, scaler, results_dict).
    """
    print(f'\n  Target: {target_col}')
    X = df[feature_cols].values
    y = df[target_col].values

    X_train, X_test, y_train, y_test = split_data(X, y)
    X_train_sc, X_test_sc, scaler    = scale_data(X_train, X_test)
    model = train_model(X_train_sc, y_train)

    _, r2_tr, rmse_tr = evaluate_model(model, X_train_sc, y_train, 'Train')
    _, r2_te, rmse_te = evaluate_model(model, X_test_sc,  y_test,  'Test')

    results = {
        'Model':       label,
        'Parameter':   target_col,
        'R2_Train':    round(r2_tr,   4),
        'RMSE_Train':  round(rmse_tr, 4),
        'R2_Test':     round(r2_te,   4),
        'RMSE_Test':   round(rmse_te, 4),
    }
    return model, scaler, results
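
`run_pipeline` scores each model on a single 70/30 split, so the reported test metrics depend on one random partition. `cross_val_score` is imported in Step 1 but not used; the sketch below shows one way it could give a more stable estimate. The helper name `cv_r2` and the synthetic data are assumptions, and the forest settings mirror `train_model` apart from `n_jobs`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_r2(X, y, n_splits=5, random_state=42):
    """Mean and std of R2 across K folds; the scaler is refit inside each fold."""
    pipe = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(
            n_estimators=200,
            max_features="sqrt",
            min_samples_leaf=2,
            random_state=random_state,
        ),
    )
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = cross_val_score(pipe, X, y, cv=folds, scoring="r2")
    return scores.mean(), scores.std()


# Synthetic data, used only to show the call.
rng = np.random.default_rng(0)
X_cv_demo = rng.normal(size=(200, 4))
y_cv_demo = 2.0 * X_cv_demo[:, 0] + rng.normal(scale=0.1, size=200)

mean_r2, std_r2 = cv_r2(X_cv_demo, y_cv_demo)
print(f"CV R2: {mean_r2:.3f} +/- {std_r2:.3f}")
```

With the real data the call would look like `cv_r2(train_df[ENHANCED_FEATURES].values, train_df[target].values)`. The fold-to-fold standard deviation indicates whether a difference between baseline and enhanced scores is larger than split noise.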

## Step 7: Baseline Experiment (Reproduce Benchmark)

Using the same 4 features as the official benchmark notebook: `swir22`, `NDMI`, `MNDWI`, `pet`.


In [None]:
print('=' * 60)
print('BASELINE MODEL  (swir22, NDMI, MNDWI, pet)')
print('=' * 60)

baseline_results = []
baseline_models  = {}
baseline_scalers = {}

for target in TARGETS:
    model, scaler, res = run_pipeline(
        train_df, BASELINE_FEATURES, target, label='Baseline'
    )
    baseline_models[target]  = model
    baseline_scalers[target] = scaler
    baseline_results.append(res)

baseline_df = pd.DataFrame(baseline_results)
print('\nBaseline Summary:')
baseline_df

## Step 8: Enhanced Experiment (All Spectral Indices + Temporal)


In [None]:
print('=' * 60)
print('ENHANCED MODEL  (all spectral indices + PET + temporal)')
print('=' * 60)

enhanced_results = []
enhanced_models  = {}
enhanced_scalers = {}

for target in TARGETS:
    model, scaler, res = run_pipeline(
        train_df, ENHANCED_FEATURES, target, label='Enhanced'
    )
    enhanced_models[target]  = model
    enhanced_scalers[target] = scaler
    enhanced_results.append(res)

enhanced_df = pd.DataFrame(enhanced_results)
print('\nEnhanced Summary:')
enhanced_df

## Step 9: Performance Comparison


In [None]:
comparison_df = pd.concat([baseline_df, enhanced_df], ignore_index=True)

print('Full Comparison Table:')
print(comparison_df.to_string(index=False))

# Delta
delta_rows = []
for target in TARGETS:
    b = baseline_df[baseline_df['Parameter'] == target].iloc[0]
    e = enhanced_df[enhanced_df['Parameter'] == target].iloc[0]
    delta_rows.append({
        'Parameter':      target,
        'ΔR²_Test':       round(e['R2_Test']   - b['R2_Test'],   4),
        'ΔRMSE_Test':     round(e['RMSE_Test']  - b['RMSE_Test'], 4),
    })

delta_df = pd.DataFrame(delta_rows)
print('\nDelta (Enhanced − Baseline):')
print(delta_df.to_string(index=False))

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

metrics = ['R2_Test', 'RMSE_Test']
titles  = ['Test R²  (higher is better)', 'Test RMSE  (lower is better)']

x = np.arange(len(TARGETS))
width = 0.35

for ax, metric, title in zip(axes, metrics, titles):
    b_vals = [baseline_df[baseline_df['Parameter'] == t][metric].values[0] for t in TARGETS]
    e_vals = [enhanced_df[enhanced_df['Parameter'] == t][metric].values[0] for t in TARGETS]

    bars_b = ax.bar(x - width/2, b_vals, width, label='Baseline', color='steelblue', alpha=0.8)
    bars_e = ax.bar(x + width/2, e_vals, width, label='Enhanced', color='coral',     alpha=0.8)

    ax.set_xticks(x)
    ax.set_xticklabels(['Total Alk.', 'Elec. Cond.', 'DRP'], fontsize=10)
    ax.set_title(title, fontsize=12)
    ax.legend()

    # Annotate every bar (baseline and enhanced) with its value
    for bar in (*bars_b, *bars_e):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Baseline vs Enhanced Model — Out-of-Sample Performance', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## Step 10: Feature Importance (Enhanced Model)


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 7))

for ax, target in zip(axes, TARGETS):
    model = enhanced_models[target]
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]

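    # ENHANCED_FEATURES is filtered to existing columns, so it may hold fewer than 15
    top_n = min(15, len(ENHANCED_FEATURES))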
    top_idx   = indices[:top_n]
    top_names = [ENHANCED_FEATURES[i] for i in top_idx]
    top_vals  = importances[top_idx]

    colors = plt.cm.viridis(np.linspace(0.2, 0.85, top_n))
    ax.barh(range(top_n), top_vals[::-1], color=colors)
    ax.set_yticks(range(top_n))
    ax.set_yticklabels(top_names[::-1], fontsize=9)
    ax.set_xlabel('Importance', fontsize=10)
    ax.set_title(f'{target}\n(Top {top_n} features)', fontsize=11)

plt.suptitle('Feature Importances — Enhanced Random Forest', fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig('feature_importance_enhanced.png', dpi=150, bbox_inches='tight')
plt.show()
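
The `feature_importances_` plotted above are impurity-based: they are computed from the training data and can be spread across correlated features, which is likely here because the log, squared, and ratio features are derived from the same bands. Permutation importance on held-out data is a common cross-check. The sketch below runs on synthetic data and is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 drives the target, features 1 and 2 are noise.
rng = np.random.default_rng(0)
X_pi_demo = rng.normal(size=(300, 3))
y_pi_demo = 3.0 * X_pi_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi_demo, y_pi_demo, test_size=0.3, random_state=42
)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out rows and measure the drop in R2.
perm = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=10, random_state=42, scoring="r2"
)

for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

Applied to this notebook, the model would be `enhanced_models[target]` and the inputs the scaled test split, which `run_pipeline` would need to return for that purpose.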

## Step 11: Prepare Validation Data & Generate Submission


In [None]:
# Build validation frame from enhanced Landsat + TerraClimate val
val_df = pd.concat([landsat_val, tc_val], axis=1)
val_df = val_df.loc[:, ~val_df.columns.duplicated()]
val_df = add_temporal_features(val_df)

# Impute with training medians so that no validation-set statistics are used
train_medians = train_df[ENHANCED_FEATURES].median()
for col in ENHANCED_FEATURES:
    if col in val_df.columns:
        val_df[col] = val_df[col].fillna(train_medians[col])
    else:
        # Column absent from the validation files: fall back to a constant
        val_df[col] = train_medians[col]

print(f'Validation shape: {val_df.shape}')
print(f'Missing after imputation: {val_df[ENHANCED_FEATURES].isnull().sum().sum()}')

In [None]:
X_val = val_df[ENHANCED_FEATURES].values

pred_TA  = enhanced_models['Total Alkalinity'].predict(
    enhanced_scalers['Total Alkalinity'].transform(X_val)
)
pred_EC  = enhanced_models['Electrical Conductance'].predict(
    enhanced_scalers['Electrical Conductance'].transform(X_val)
)
pred_DRP = enhanced_models['Dissolved Reactive Phosphorus'].predict(
    enhanced_scalers['Dissolved Reactive Phosphorus'].transform(X_val)
)

submission_df = pd.DataFrame({
    'Latitude':                     submission_template['Latitude'].values,
    'Longitude':                    submission_template['Longitude'].values,
    'Sample Date':                  submission_template['Sample Date'].values,
    'Total Alkalinity':             pred_TA,
    'Electrical Conductance':       pred_EC,
    'Dissolved Reactive Phosphorus': pred_DRP,
})

submission_df.to_csv('submission_enhanced.csv', index=False)
print(f'Submission saved: submission_enhanced.csv  ({submission_df.shape})')
submission_df.head(10)
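
The submission takes its coordinates and dates from `submission_template` and its predictions from `val_df`, so it is only correct if both frames list the same samples in the same order. A quick alignment check before writing the CSV can catch a mismatch. This is a minimal sketch: the helper name `rows_aligned` is hypothetical, and which key columns exist in both files is an assumption.

```python
import pandas as pd


def rows_aligned(features, template, keys):
    """True when both frames have equal length and identical key values row by row."""
    if len(features) != len(template):
        return False
    left = features[keys].reset_index(drop=True)
    right = template[keys].reset_index(drop=True)
    return left.equals(right)


# Toy frames (illustrative only).
template_demo = pd.DataFrame({
    "Latitude": [1.0, 2.0],
    "Sample Date": ["01-02-2015", "03-04-2015"],
})
val_demo = template_demo.assign(NDMI=[0.1, 0.4])
shuffled_demo = val_demo.iloc[::-1]

keys_demo = ["Latitude", "Sample Date"]
print(rows_aligned(val_demo, template_demo, keys_demo))       # same order
print(rows_aligned(shuffled_demo, template_demo, keys_demo))  # reversed order
```

`DataFrame.equals` also compares dtypes, so a date column parsed as `datetime64` in one frame and left as strings in the other is reported as misaligned. Normalizing the key dtypes first avoids that false alarm.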

## Step 12: Predicted Value Distributions (Sanity Check)


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

preds  = [pred_TA, pred_EC, pred_DRP]
labels = ['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']
colors = ['steelblue', 'coral', 'mediumseagreen']

for ax, pred, label, color in zip(axes, preds, labels, colors):
    ax.hist(pred, bins=30, color=color, edgecolor='white', alpha=0.85)
    ax.axvline(train_df[label].median(), color='black', linestyle='--',
               linewidth=1.5, label='Train median')
    ax.set_title(label, fontsize=11)
    ax.set_xlabel('Predicted Value')
    ax.set_ylabel('Count')
    ax.legend(fontsize=8)

plt.suptitle('Distribution of Validation Predictions vs Training Median', fontsize=13)
plt.tight_layout()
plt.savefig('submission_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

## Summary

| Model | Feature Count | TA R² Test | EC R² Test | DRP R² Test |
|---|---|---|---|---|
| **Benchmark (original)** | 4 | 0.546 | 0.585 | 0.529 |
| **Baseline (reproduced)** | 4 | see Step 7 | see Step 7 | see Step 7 |
| **Enhanced** | 27 | see Step 8 | see Step 8 | see Step 8 |

### Key improvements over the benchmark
- Added 18 physics-motivated spectral indices (NDVI, NDWI, NDTI, Turbidity Index, Chlorophyll Proxy, BSI, band ratios, log-transformed bands, quadratic terms)
- Added raw Landsat bands (`nir`, `green`, `swir16`) which the benchmark excluded
- Added `Month` and `Year` as temporal features to capture seasonality and long-term drift, respectively
- Imputed the validation set with training medians (not validation medians) to prevent leakage
- Increased `n_estimators` to 200 and set `min_samples_leaf=2` for better generalisation; both values are fixed in `train_model` and apply to the baseline and enhanced runs alike
