# Water Quality Prediction: XGBoost Regression Notebook

This notebook is based on `BENCHMARK_Jin.ipynb` but replaces **Random Forest** with **XGBoost Regression**, with the aim of improving prediction accuracy.

## Challenge Overview

Welcome to the EY AI & Data Challenge 2026!  
The objective of this challenge is to build a robust **machine learning model** capable of predicting water quality across various river locations in South Africa. In addition to accurate predictions, the model should also identify and emphasize the key factors that significantly influence water quality.

Participants will be provided with a dataset containing three water quality parameters — **Total Alkalinity**, **Electrical Conductance**, and **Dissolved Reactive Phosphorus** — collected between 2011 and 2015 from approximately 200 river locations across South Africa. Each data point includes the geographic coordinates (latitude and longitude) of the sampling site, the date of collection, and the corresponding water quality measurements.

Using this dataset, participants are expected to build a machine learning model to predict water quality parameters for a separate validation dataset, which includes locations from different regions not present in the training data. The challenge also encourages participants to explore feature importance and provide insights into the factors most strongly associated with variations in water quality.

This challenge is designed for participants with varying levels of experience in data science, remote sensing, and environmental analytics. It offers a valuable opportunity to apply machine learning techniques to real-world environmental data and contribute to advancing water quality monitoring using artificial intelligence.

## Load In Dependencies

In [None]:
!pip install uv
!uv pip install -r requirements.txt

In [None]:
import snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Data manipulation and analysis
import numpy as np
import pandas as pd
from IPython.display import display

# Multi-dimensional arrays and datasets (e.g., NetCDF, Zarr)
import xarray as xr

# Geospatial raster data handling with CRS support
import rioxarray as rxr

# Raster operations and spatial windowing
import rasterio
from rasterio.windows import Window

# Feature preprocessing and data splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from scipy.spatial import cKDTree

# Machine Learning - XGBoost
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

from datetime import date
from tqdm import tqdm
import os 

## Response Variable

In [None]:
Water_Quality_df = pd.read_csv("water_quality_training_dataset.csv")
display(Water_Quality_df.head(5))

## Predictor Variables

### Loading Pre-Extracted Landsat Data

In [None]:
landsat_train_features = pd.read_csv("landsat_features_training_200m.csv")
display(landsat_train_features.head(5))

### Loading Pre-Extracted TerraClimate Data

In [None]:
Terraclimate_df = pd.read_csv("terraclimate_features_training_full.csv")
display(Terraclimate_df.head(50))

In [None]:
urbanization_df = pd.read_csv("urbanization_train.csv")

cat_cols = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

urbanization_df = urbanization_df.drop(columns=cat_cols)

display(urbanization_df.head(5))

In [None]:
urbanization_df.dtypes

## Joining the Predictor Variables and Response Variables

In [None]:
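# Combine three datasets side by side (column-wise) using the pandas concat function.
def combine_three_datasets(dataset1,dataset2,dataset3):
    '''
    Returns a column-wise (axis=1) concatenation of three datasets,
    keeping only the first occurrence of any duplicated column name.
    Rows are aligned on the index, so all inputs must share the same row order.
    Attributes:
    dataset1 - Dataset 1 to be combined 
    dataset2 - Dataset 2 to be combined
    dataset3 - Dataset 3 to be combined
    '''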
    
    data = pd.concat([dataset1,dataset2,dataset3], axis=1)
    data = data.loc[:, ~data.columns.duplicated()]
    return data

In [None]:
# Combine the ground measurements with the Landsat and TerraClimate features into a single dataset.
wq_data = combine_three_datasets(Water_Quality_df, landsat_train_features, Terraclimate_df)
display(wq_data.head(5))
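
`pd.concat(..., axis=1)` aligns rows by index, so the combination above is only correct if all three files list the samples in the same order. Below is a minimal sketch of a key-based alternative. It assumes every file carries `Latitude`, `Longitude` and `Sample Date` columns with unique combinations (an assumption, not verified against the challenge files); the helper `merge_on_keys`, the toy frames and the `nir` feature are made up for illustration.

```python
import pandas as pd

KEYS = ["Latitude", "Longitude", "Sample Date"]  # assumed shared key columns

def merge_on_keys(base, *feature_frames, keys=KEYS):
    """Left-join each feature frame onto base using the shared key columns."""
    merged = base
    for frame in feature_frames:
        # validate="one_to_one" raises if a key combination is duplicated
        merged = merged.merge(frame, on=keys, how="left", validate="one_to_one")
    return merged

# Toy data: the feature frame is deliberately in a different row order
targets = pd.DataFrame({
    "Latitude": [-25.0, -26.0],
    "Longitude": [28.0, 29.0],
    "Sample Date": ["2011-01-01", "2011-02-01"],
    "Total Alkalinity": [100.0, 200.0],
})
features = pd.DataFrame({
    "Latitude": [-26.0, -25.0],
    "Longitude": [29.0, 28.0],
    "Sample Date": ["2011-02-01", "2011-01-01"],
    "nir": [0.2, 0.1],
})

combined = merge_on_keys(targets, features)
print(combined)
```

If the real files contain repeated key combinations, the `validate` check will raise, which is a useful signal that a positional concat may also be pairing rows incorrectly.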

### Handling Missing Values

In [None]:
wq_data = wq_data.fillna(wq_data.median(numeric_only=True))
wq_data.isna().sum()
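
Median imputation uses the medians of whichever frame it is called on. The sketch below shows one way to reuse the training medians for new data, so that both sets are filled with the same values. The toy frames and the `nir`/`pet` column names are illustrative only.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"nir": [0.1, np.nan, 0.3], "pet": [10.0, 20.0, np.nan]})
new = pd.DataFrame({"nir": [np.nan, 0.5], "pet": [np.nan, 40.0]})

# Compute the medians once, on the training features only
train_medians = train.median(numeric_only=True)

train_filled = train.fillna(train_medians)
new_filled = new.fillna(train_medians)  # same fill values as training

print(train_medians.to_dict())
print(new_filled)
```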

## Model Building (XGBoost Regression)

### XGBoost vs Random Forest

XGBoost (eXtreme Gradient Boosting) is a gradient boosting algorithm that builds trees **sequentially**, where each new tree corrects the errors of the previous ensemble. Potential advantages over Random Forest:

- **Boosting** learns from residuals, which often yields better accuracy on tabular data
- **Regularization** (L1/L2 penalties on leaf weights) helps limit overfitting
- **Early stopping** halts training once the validation score stops improving, which sets the number of trees
- **Built-in handling** of missing values (each split learns a default direction for them)

We keep the same features from Landsat and TerraClimate as the original benchmark.
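
To make the "learns from residuals" point concrete, here is a hand-rolled sketch of gradient boosting for squared error, using scikit-learn's `DecisionTreeRegressor` as the weak learner. It illustrates the idea only and is not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The synthetic data and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean of the target
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction              # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                  # each tree targets the current errors
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE at start: {mse_history[0]:.4f}, after 50 trees: {mse_history[-1]:.4f}")
```

A Random Forest would instead fit every tree to `y` directly on a bootstrap sample and average the results.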

In [None]:
def split_data(X, y, test_size=0.3, random_state=42):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def scale_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler

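def train_model(X_train_scaled, y_train, X_val_scaled=None, y_val=None):
    """
    Train an XGBoost Regressor, with early stopping when a validation set is given.
    """
    # Early stopping needs an eval set, so enable it only when one is provided
    use_early_stopping = X_val_scaled is not None and y_val is not None

    model = XGBRegressor(
        n_estimators=1000,          # Max number of trees: set high and let early stopping pick the best count
        max_depth=6,                # Tree depth: shallower than RF; boosting does not need deep trees
        learning_rate=0.05,         # Learning rate: kept low for stable convergence
        subsample=0.8,              # Each tree samples 80% of the rows to reduce overfitting
        colsample_bytree=0.8,       # Each tree samples 80% of the features
        reg_alpha=0.1,              # L1 regularization: encourages sparsity
        reg_lambda=1.0,             # L2 regularization: guards against overfitting
        min_child_weight=5,         # Minimum sum of instance weight in a leaf: avoids fitting extreme values
        gamma=0.1,                  # Minimum loss reduction required to split: prevents unnecessary splits
        random_state=42,
        n_jobs=-1,
        verbosity=0,
        # Stop when the validation score has not improved for 50 consecutive rounds
        early_stopping_rounds=50 if use_early_stopping else None
    )
    
    if use_early_stopping:
        model.fit(
            X_train_scaled, y_train,
            eval_set=[(X_val_scaled, y_val)],
            verbose=False
        )
        try:
            print(f"  Best iteration: {model.best_iteration}")
        except AttributeError:
            print(f"  Trained {model.n_estimators} trees (no early stop triggered)")
    else:
        # No validation data: train all n_estimators trees without early stopping
        model.fit(X_train_scaled, y_train)
    
    return model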


def evaluate_model(model, X_scaled, y_true, dataset_name="Test"):
    y_pred = model.predict(X_scaled)
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"\n{dataset_name} Evaluation:")
    print(f"R²: {r2:.3f}")
    print(f"RMSE: {rmse:.3f}")
    return y_pred, r2, rmse
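
`split_data` draws a random row-level split, so samples from the same river site can land in both the train and test portions. Since the challenge validation set covers locations absent from training, a split grouped by site may give a more representative estimate. Below is a sketch using scikit-learn's `GroupShuffleSplit`, assuming a site is identified by its `Latitude`/`Longitude` pair. The helper name `split_by_site` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_site(df, test_size=0.3, random_state=42):
    """Return train/test row positions such that no site appears in both."""
    # One group label per row, built from the coordinate pair
    groups = df["Latitude"].astype(str) + "_" + df["Longitude"].astype(str)
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(df, groups=groups))
    return train_idx, test_idx

# Toy data: 10 sites with 5 samples each
rng = np.random.default_rng(0)
sites = pd.DataFrame({"Latitude": -25.0 - np.arange(10), "Longitude": 28.0 + np.arange(10)})
df = sites.loc[sites.index.repeat(5)].reset_index(drop=True)
df["feature"] = rng.normal(size=len(df))

train_idx, test_idx = split_by_site(df)
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]
train_sites = set(zip(train_part["Latitude"], train_part["Longitude"]))
test_sites = set(zip(test_part["Latitude"], test_part["Longitude"]))
print(len(train_sites), "train sites,", len(test_sites), "test sites")
```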

## Model Workflow (Pipeline) — XGBoost with Early Stopping

In [None]:
def run_pipeline(X, y, param_name="Parameter"):
    print(f"\n{'='*60}")
    print(f"Training XGBoost Model for {param_name}")
    print(f"{'='*60}")
    
    # Split data
    X_train, X_test, y_train, y_test = split_data(X, y)
    
    # Scale
    X_train_scaled, X_test_scaled, scaler = scale_data(X_train, X_test)
    
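    # Train with early stopping, using the test set as the eval set.
    # Note: the stopping point is tuned on test data, so the test scores below are optimistic.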
    model = train_model(X_train_scaled, y_train, X_test_scaled, y_test)
    
    # Evaluate (in-sample)
    y_train_pred, r2_train, rmse_train = evaluate_model(model, X_train_scaled, y_train, "Train")
    
    # Evaluate (out-of-sample)
    y_test_pred, r2_test, rmse_test = evaluate_model(model, X_test_scaled, y_test, "Test")
    
    # Return summary
    results = {
        "Parameter": param_name,
        "R2_Train": r2_train,
        "RMSE_Train": rmse_train,
        "R2_Test": r2_test,
        "RMSE_Test": rmse_test
    }
    return model, scaler, pd.DataFrame([results])

### Model Training and Evaluation for Each Parameter

In [None]:
X = wq_data.iloc[:, 6:]

y_TA = wq_data['Total Alkalinity']
y_EC = wq_data['Electrical Conductance']
y_DRP = wq_data['Dissolved Reactive Phosphorus']

model_TA, scaler_TA, results_TA = run_pipeline(X, y_TA, "Total Alkalinity")
model_EC, scaler_EC, results_EC = run_pipeline(X, y_EC, "Electrical Conductance")
model_DRP, scaler_DRP, results_DRP = run_pipeline(X, y_DRP, "Dissolved Reactive Phosphorus")

### Model Performance Summary

In [None]:
results_summary = pd.concat([results_TA, results_EC, results_DRP], ignore_index=True)
results_summary

### Feature Importance (XGBoost)

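XGBoost exposes several built-in feature importance measures, including how often a feature is used in splits (weight) and the average loss reduction its splits achieve (gain). The `feature_importances_` attribute plotted below uses gain by default for tree boosters, normalized so the values sum to 1. This helps identify which satellite bands and climate features matter most for water quality prediction.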

In [None]:
feature_names = X.columns.tolist()

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for ax, model, name in zip(axes, [model_TA, model_EC, model_DRP],
                            ['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']):
    importances = model.feature_importances_
    sorted_idx = np.argsort(importances)
    ax.barh(range(len(sorted_idx)), importances[sorted_idx], align='center')
    ax.set_yticks(range(len(sorted_idx)))
    ax.set_yticklabels([feature_names[i] for i in sorted_idx])
    ax.set_title(f'{name}')
    ax.set_xlabel('Feature Importance (Gain)')

plt.suptitle('XGBoost Feature Importance', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Submission

In [None]:
test_file = pd.read_csv("submission_template.csv")
display(test_file)

In [None]:
landsat_val_features = pd.read_csv("landsat_features_validation_200m.csv")
display(landsat_val_features)

In [None]:
Terraclimate_val_df = pd.read_csv("terraclimate_features_validation_full.csv")
display(Terraclimate_val_df)

In [None]:
urbanization_val_df = pd.read_csv("urbanization_val.csv")

cat_cols = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

urbanization_val_df = urbanization_val_df.drop(columns=cat_cols)
display(urbanization_val_df)

In [None]:
cols_to_drop = [
    'Total Alkalinity',
    'Electrical Conductance',
    'Dissolved Reactive Phosphorus'
]

test_file = test_file.drop(columns=cols_to_drop, errors='ignore')

val_data = combine_three_datasets(test_file, landsat_val_features, Terraclimate_val_df)
display(val_data.head(5))

In [None]:
# Impute the missing values
val_data = val_data.fillna(val_data.median(numeric_only=True))
val_data

In [None]:
# Select the feature columns: everything after the first three columns
# (expected to be Latitude, Longitude and Sample Date once the targets are dropped)
submission_val_data = val_data.iloc[:, 3:]
display(submission_val_data.head())

In [None]:
submission_val_data.shape
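
Because features are selected by position (`iloc[:, 6:]` for training, `iloc[:, 3:]` for validation), the two frames only line up if the files share the same column layout. The sketch below selects validation features by the training column names instead. The helper `align_features` and the toy columns are hypothetical.

```python
import pandas as pd

def align_features(train_X, new_df):
    """Return new_df restricted to train_X's columns, in the same order."""
    missing = [c for c in train_X.columns if c not in new_df.columns]
    if missing:
        raise KeyError(f"Columns missing from new data: {missing}")
    return new_df[train_X.columns.tolist()]

# Toy frames: the validation frame has an extra column and a different order
train_X = pd.DataFrame({"nir": [0.1], "swir22": [0.2], "pet": [10.0]})
val = pd.DataFrame({"Latitude": [-25.0], "pet": [12.0], "swir22": [0.3], "nir": [0.4]})

aligned = align_features(train_X, val)
print(aligned.columns.tolist())
```

In this notebook the equivalent call would be `align_features(X, val_data)`, under the assumption that the validation files use the same feature names as the training files.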

In [None]:
# --- Predicting for Total Alkalinity ---
X_sub_scaled_TA = scaler_TA.transform(submission_val_data)
pred_TA_submission = model_TA.predict(X_sub_scaled_TA)

# --- Predicting for Electrical Conductance ---
X_sub_scaled_EC = scaler_EC.transform(submission_val_data)
pred_EC_submission = model_EC.predict(X_sub_scaled_EC)

# --- Predicting for Dissolved Reactive Phosphorus ---
X_sub_scaled_DRP = scaler_DRP.transform(submission_val_data)
pred_DRP_submission = model_DRP.predict(X_sub_scaled_DRP)

In [None]:
submission_df = pd.DataFrame({
    'Latitude': test_file['Latitude'].values,
    'Longitude': test_file['Longitude'].values,
    'Sample Date': test_file['Sample Date'].values,
    'Total Alkalinity': pred_TA_submission,
    'Electrical Conductance': pred_EC_submission,
    'Dissolved Reactive Phosphorus': pred_DRP_submission
})

In [None]:
# Display the first rows of the submission dataframe
display(submission_df.head())
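
Before writing the file, a few cheap checks can catch shape or NaN problems. The sketch below assumes the template file lists the three key columns followed by the three target columns, which is an assumption about the template layout. The helper `check_submission` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

TARGETS = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

def check_submission(submission, template):
    """Raise AssertionError if the submission does not match the template layout."""
    assert len(submission) == len(template), "row count differs from template"
    assert list(submission.columns) == list(template.columns), "column order differs"
    assert not submission[TARGETS].isna().any().any(), "NaN in predictions"
    assert np.isfinite(submission[TARGETS].to_numpy()).all(), "non-finite prediction"

# Toy template with empty target columns, and a filled-in submission
template = pd.DataFrame({
    "Latitude": [-25.0], "Longitude": [28.0], "Sample Date": ["2014-01-01"],
    "Total Alkalinity": [np.nan], "Electrical Conductance": [np.nan],
    "Dissolved Reactive Phosphorus": [np.nan],
})
submission = template.copy()
submission[TARGETS] = [[120.0, 300.0, 15.0]]

check_submission(submission, template)
print("submission looks consistent with the template")
```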

In [None]:
submission_df.to_csv("/tmp/submission_v10.csv",index = False)

session.sql("""
    PUT file:///tmp/submission_v10.csv
    'snow://workspace/USER$.PUBLIC."EY-AI-and-Data-Challenge"/versions/live/'
    AUTO_COMPRESS=FALSE
    OVERWRITE=TRUE
""").collect()

print("File saved! Refresh the browser to see the files in the sidebar")



### Upload submission file on platform

Upload the saved submission file (`submission_v10.csv` in the cell above) on the challenge platform to generate your score on the leaderboard.

## Conclusion

This notebook replaces Random Forest with **XGBoost Regression**, with the aim of improving prediction accuracy. Key changes include:
- **Gradient boosting** instead of bagging (sequential learning from residuals)
- **Early stopping** to choose the number of trees from validation performance
- **L1/L2 regularization** to control overfitting
- **Feature importance visualization** to understand which features drive predictions