# Water Quality Prediction: Benchmark Notebook 

## Challenge Overview

Welcome to the EY AI & Data Challenge 2026!  
The objective of this challenge is to build a robust **machine learning model** capable of predicting water quality across various river locations in South Africa. In addition to accurate predictions, the model should also identify and emphasize the key factors that significantly influence water quality.

Participants will be provided with a dataset containing three water quality parameters — **Total Alkalinity**, **Electrical Conductance**, and **Dissolved Reactive Phosphorus** — collected between 2011 and 2015 from approximately 200 river locations across South Africa. Each data point includes the geographic coordinates (latitude and longitude) of the sampling site, the date of collection, and the corresponding water quality measurements.

Using this dataset, participants are expected to build a machine learning model to predict water quality parameters for a separate validation dataset, which includes locations from different regions not present in the training data. The challenge also encourages participants to explore feature importance and provide insights into the factors most strongly associated with variations in water quality.

This challenge is designed for participants with varying levels of experience in data science, remote sensing, and environmental analytics. It offers a valuable opportunity to apply machine learning techniques to real-world environmental data and contribute to advancing water quality monitoring using artificial intelligence.

**About the Notebook:**  

In this notebook, we demonstrate a basic workflow that serves as a foundation for the challenge. The model has been developed to predict **water quality parameters** using features derived from the **Landsat** and **TerraClimate** datasets. Specifically, four spectral bands — **SWIR22** (Shortwave Infrared 2), **NIR** (Near Infrared), **Green**, and **SWIR16** (Shortwave Infrared 1) — were utilized from Landsat, along with derived spectral indices such as **NDMI** (Normalized Difference Moisture Index) and **MNDWI** (Modified Normalized Difference Water Index). In addition, the **PET** (Potential Evapotranspiration) variable was incorporated from the **TerraClimate** dataset to account for climatic influences on water quality.

The dataset spans a five-year period from **2011 to 2015**. Using **API-based data extraction** methods, both Landsat and TerraClimate features were retrieved directly from the [Microsoft Planetary Computer portal](https://planetarycomputer.microsoft.com).

These combined spectral, index-based, and climatic features were used as predictors in a regression model to estimate three key water quality parameters: **Total Alkalinity (TA)**, **Electrical Conductance (EC)**, and **Dissolved Reactive Phosphorus (DRP)**.

Please note that this notebook serves only as a starting point. Several assumptions were made during the data extraction and model development process, which you may find opportunities to improve upon. Participants are encouraged to explore additional features, enhance preprocessing techniques, or experiment with different regression algorithms to optimize predictive performance.

## Load In Dependencies
The following code installs the required Python libraries (listed in the `requirements.txt` file) in the Snowflake environment so that the rest of the notebook can run. After running this code for the first time, you must restart the kernel so the Python libraries become available in the environment. To do so, select the “Connected” menu above the notebook (next to “Run all”) and click the “restart kernel” link. Subsequent runs of the notebook do not require a restart.

In [None]:
!pip install uv
!uv pip install -r requirements.txt xgboost

In [None]:
import snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Data manipulation and analysis
import numpy as np
import pandas as pd
from IPython.display import display

# Multi-dimensional arrays and datasets (e.g., NetCDF, Zarr)
import xarray as xr

# Geospatial raster data handling with CRS support
import rioxarray as rxr

# Raster operations and spatial windowing
import rasterio
from rasterio.windows import Window

# Feature preprocessing and data splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GroupKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from scipy.spatial import cKDTree
from scipy.stats import uniform, randint

# Machine Learning
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

from datetime import date
from tqdm import tqdm
import os

## Response Variable

Before building the model, we first load the **water quality training dataset**. The curated dataset contains samples collected from various monitoring stations across the study region. Each record includes the geographical coordinates (Latitude and Longitude), the sample collection date, and the corresponding **measured values** for the three key water quality parameters — **Total Alkalinity (TA)**, **Electrical Conductance (EC)**, and **Dissolved Reactive Phosphorus (DRP)**.

In [None]:
Water_Quality_df = pd.read_csv("water_quality_training_dataset.csv")
display(Water_Quality_df.head(5))

## Predictor Variables

Now that we have our water quality dataset, the next step is to gather the predictor variables from the **Landsat** and **TerraClimate** datasets. In this notebook, we demonstrate how to **load previously extracted satellite and climate data** from separate files, rather than performing the extraction directly, which allows for a smoother and faster experience. Participants can refer to the dedicated extraction notebooks—one for Landsat and another for TerraClimate—to understand how the data was retrieved and processed, and they can also generate their own output CSV files if needed. Using these pre-extracted CSV files, this notebook focuses on loading the predictor features and running the subsequent analysis and model training efficiently.

For more detailed guidance on the original data extraction process, you can review the Landsat and TerraClimate example notebooks available on the Planetary Computer portal:

- [Landsat-c2-l2 - Example-Notebook](https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2#Example-Notebook)  
- [Terraclimate - Example-Notebook](https://planetarycomputer.microsoft.com/dataset/terraclimate#Example-Notebook)

We have used selected spectral bands — **SWIR22** (Shortwave Infrared 2), **NIR** (Near Infrared), **Green**, and **SWIR16** (Shortwave Infrared 1) — and computed key spectral indices such as **NDMI** (Normalized Difference Moisture Index) and **MNDWI** (Modified Normalized Difference Water Index). These features capture surface moisture, vegetation, and water content characteristics that influence water quality variability.

In addition to Landsat features, we also incorporated the **Potential Evapotranspiration (PET)** variable from the **TerraClimate** dataset, which provides high-resolution global climate data. The PET feature captures the atmospheric demand for moisture, representing climatic conditions such as temperature, humidity, and radiation that influence surface water evaporation and thus affect water quality parameters.

The predictor features include:

- **SWIR22** – Sensitive to surface moisture and turbidity variations in water bodies.  
- **NIR** – Helps in identifying vegetation and suspended matter in water.  
- **Green** – Useful for detecting water color and surface reflectance changes.  
- **SWIR16** – Provides information on surface dryness and sediment concentration.  
- **NDMI** – Derived from NIR and SWIR16, indicates moisture and vegetation–water interaction.  
- **MNDWI** – Derived from Green and SWIR22, effective for distinguishing open water areas and reducing built-up noise.  
- **PET** – Extracted from the TerraClimate dataset, represents potential evapotranspiration influencing hydrological and water quality dynamics.

### **Tip 1**

Participants are encouraged to experiment with different combinations of **Landsat** bands or even include data from other public satellite data sources. By creating mathematical combinations of bands, you can derive various spectral indices that capture surface and environmental characteristics.
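
As a minimal sketch of such a band combination, the snippet below computes a generic normalized difference and applies it with the band pairings listed above (NIR with SWIR16 for NDMI, Green with SWIR22 for MNDWI). The helper name `normalized_difference` and the reflectance values are illustrative assumptions, not taken from the extraction notebook. MNDWI is also commonly defined with the first shortwave-infrared band, so the extraction notebook remains the reference for what was actually used.

```python
import numpy as np
import pandas as pd

def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(band_a, dtype=float)
    b = np.asarray(band_b, dtype=float)
    denom = a + b
    # Replace zero denominators with 1.0 for the division, then mask them as NaN
    return np.where(denom == 0, np.nan, (a - b) / np.where(denom == 0, 1.0, denom))

# Made-up reflectance values, for illustration only
bands = pd.DataFrame({
    "nir":    [0.30, 0.10],
    "green":  [0.10, 0.08],
    "swir16": [0.20, 0.02],
    "swir22": [0.15, 0.02],
})

bands["NDMI"] = normalized_difference(bands["nir"], bands["swir16"])
bands["MNDWI"] = normalized_difference(bands["green"], bands["swir22"])
print(bands)
```

The same helper can be reused for any other pair of bands you decide to test.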

### Loading Pre-Extracted Landsat Data

In this notebook, we **load previously extracted Landsat data** from CSV files generated in a separate extraction notebook. This approach ensures a smoother and faster workflow, allowing participants to focus on data analysis and model development without waiting for time-consuming data retrieval.

Participants are expected to generate their own data extraction CSV files by running the dedicated Landsat extraction notebook. These CSV files can then be used here to smoothly run this benchmark notebook. Participants can refer to the extraction notebook to understand the API-based process, including how individual bands and indices like **NDMI** were computed. Using these pre-extracted CSV files simplifies preprocessing and is ideal for large-scale environmental and water quality analysis.

### **Tip 2**

In the data extraction process (performed in the dedicated extraction notebooks), a 100 m focal buffer was applied around each sampling location rather than using a single point. Participants may explore creating different focal buffers around the locations (e.g., 50 m, 150 m, etc.) during extraction. For example, if a 50 m buffer was used for “Band 2”, the extracted CSV values would reflect the average of Band 2 within 50 meters of each location. This approach can help reduce errors associated with spatial autocorrelation.
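
To make the buffer idea concrete, here is a rough sketch that averages raster pixels within a given radius of a sampling pixel. It assumes a circular window measured in pixel space and a 30 m pixel size; the function name `buffer_mean` and the toy raster are hypothetical, and the extraction notebooks may define the buffer differently (for example as a bounding box).

```python
import numpy as np

def buffer_mean(raster, row, col, radius_m, pixel_size_m=30.0):
    """Mean of the pixels whose centres lie within radius_m of pixel (row, col)."""
    radius_px = radius_m / pixel_size_m
    rows, cols = np.ogrid[:raster.shape[0], :raster.shape[1]]
    # Boolean mask of pixels inside the circular buffer
    mask = (rows - row) ** 2 + (cols - col) ** 2 <= radius_px ** 2
    return float(np.nanmean(raster[mask]))

# Toy 11 x 11 raster with made-up values
raster = np.arange(121, dtype=float).reshape(11, 11)
for radius in (50, 100, 150):
    print(radius, buffer_mean(raster, row=5, col=5, radius_m=radius))
```

Comparing model scores across a few radii is one way to judge which buffer size suits a given parameter.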

In [None]:
landsat_train_features = pd.read_csv("landsat_features_training.csv")
display(landsat_train_features.head(5))

In [None]:
# If NDMI and MNDWI columns are of type object, convert them to float
landsat_train_features['NDMI'] = landsat_train_features['NDMI'].astype(float)
landsat_train_features['MNDWI'] = landsat_train_features['MNDWI'].astype(float)
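
If the index columns contain entries that cannot be parsed as numbers, `astype(float)` raises an error. A more tolerant variant, sketched here on a made-up frame standing in for `landsat_train_features`, uses `pd.to_numeric` with `errors="coerce"` so that unparseable entries become NaN and can be imputed later.

```python
import pandas as pd

# Toy frame standing in for landsat_train_features; values are made up
features = pd.DataFrame({
    "NDMI": ["0.12", "nan", "bad"],
    "MNDWI": ["-0.30", "0.05", ""],
})

for col in ["NDMI", "MNDWI"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    features[col] = pd.to_numeric(features[col], errors="coerce")

print(features.dtypes)
print(features)
```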

### Loading Pre-Extracted TerraClimate Data

In this notebook, we **load previously extracted TerraClimate data** from CSV files generated in a dedicated extraction notebook. This approach ensures a smoother and faster workflow, allowing participants to focus on data analysis and model development without waiting for time-consuming data retrieval.

Participants are expected to generate their own data extraction CSV files by running the dedicated TerraClimate extraction notebook. These CSV files can then be used here to smoothly run this benchmark notebook. Participants can refer to the extraction notebook to understand the API-based process, including how climate variables such as **Potential Evapotranspiration (PET)** were extracted. Using these pre-extracted CSV files ensures consistent, automated retrieval of high-resolution climate data that can be easily integrated with satellite-derived features for comprehensive environmental and hydrological analysis.

In [None]:
Terraclimate_df = pd.read_csv("terraclimate_features_training.csv")
display(Terraclimate_df.head(5))

## Joining the Predictor Variables and Response Variables

Now that we have loaded our predictor variables, we need to join them with the response variables. The **combine_two_datasets** function (which, despite its name, accepts three datasets) places the water quality, Landsat, and TerraClimate tables side by side using the pandas **concat** function. Because the rows are matched by position, all three files must list the samples in the same order.

In [None]:
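# Combine the datasets side by side (column-wise, axis=1) using the pandas concat function.
def combine_two_datasets(dataset1, dataset2, dataset3):
    '''
    Returns a column-wise (horizontally) concatenated dataset.
    Rows are aligned by index, so all inputs must share the same row order.
    Attributes:
    dataset1 - Dataset 1 to be combined
    dataset2 - Dataset 2 to be combined
    dataset3 - Dataset 3 to be combined
    '''

    data = pd.concat([dataset1, dataset2, dataset3], axis=1)
    # Drop repeated columns (e.g. coordinates present in more than one input)
    data = data.loc[:, ~data.columns.duplicated()]
    return data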

In [None]:
# Combining the water quality, Landsat, and TerraClimate data into a single dataset.
wq_data = combine_two_datasets(Water_Quality_df, landsat_train_features, Terraclimate_df)
display(wq_data.head(5))
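
Positional concatenation only works when every file lists the samples in the same order. As a hedged alternative, the sketch below joins on explicit key columns instead. It assumes that each table carries `Latitude`, `Longitude`, and `Sample Date` and that each key combination is unique; the toy frames and their values are made up.

```python
import pandas as pd

keys = ["Latitude", "Longitude", "Sample Date"]

# Toy frames; the second one is deliberately in a different row order
water = pd.DataFrame({
    "Latitude": [-25.0, -26.0],
    "Longitude": [28.0, 29.0],
    "Sample Date": ["01-01-2011", "02-01-2011"],
    "Total Alkalinity": [100.0, 120.0],
})
landsat = pd.DataFrame({
    "Latitude": [-26.0, -25.0],
    "Longitude": [29.0, 28.0],
    "Sample Date": ["02-01-2011", "01-01-2011"],
    "swir22": [0.2, 0.1],
})

# validate="one_to_one" raises if a key combination is duplicated on either side
merged = water.merge(landsat, on=keys, how="left", validate="one_to_one")
print(merged)
```

A key-based join keeps each measurement paired with its own features even if one file has been re-sorted.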

### Handling Missing Values

Before model training, missing values are filled so that every row can be used. Each numerical column is imputed with its median, which is less sensitive to outliers than the mean. Note that the code below applies this to all numerical columns, including the three target columns.

In [None]:
wq_data = wq_data.fillna(wq_data.median(numeric_only=True))
wq_data.isna().sum()
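
One possible refinement, sketched under the assumption that you prefer to impute only the predictor columns, is to learn the medians from the training rows with scikit-learn's `SimpleImputer` and reuse them for new rows. The toy values below are made up, and this is not how the benchmark itself imputes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

feature_cols = ["swir22", "NDMI", "MNDWI", "pet"]

# Toy data with gaps in the predictors; values are made up
train = pd.DataFrame({
    "swir22": [0.10, np.nan, 0.30],
    "NDMI": [0.2, 0.4, np.nan],
    "MNDWI": [-0.1, 0.0, 0.1],
    "pet": [100.0, np.nan, 140.0],
})
new = pd.DataFrame({"swir22": [np.nan], "NDMI": [0.3], "MNDWI": [np.nan], "pet": [np.nan]})

imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train[feature_cols]), columns=feature_cols)
# New rows are filled with the medians learned from the training rows
new_filled = pd.DataFrame(imputer.transform(new[feature_cols]), columns=feature_cols)
print(new_filled)
```

Placed as the first step of a `Pipeline`, the same imputer would be refitted inside each cross-validation fold.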

## Model Building

We select the feature columns for the model: **SWIR22**, **NDMI**, **MNDWI**, and **PET**. We also retain **Latitude** and **Longitude** in the dataframe to create the `GroupKFold` groups — however, these coordinates are **not** used as predictor features.

In [None]:
# Retaining predictor columns + Latitude/Longitude (needed for GroupKFold groups)
wq_data = wq_data[['Latitude', 'Longitude', 'swir22', 'NDMI', 'MNDWI', 'pet',
                    'Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']]

### **Tip 3**

We are developing individual models for each water quality parameter using a common set of features: **SWIR22**, **NDMI**, **MNDWI**, and **PET**. However, participants are encouraged to experiment with different feature combinations to build more robust machine learning models.

## Helper Functions

### Pipeline with StandardScaler + XGBoost
We use a scikit-learn `Pipeline` that wraps the `StandardScaler` and the `XGBRegressor` into a single object. This **prevents data leakage** because the scaler is fitted only on the training data of each fold during cross-validation, never seeing the validation data.
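
The following small sketch illustrates the point on synthetic data: after fitting a pipeline on a training subset, the scaler's stored mean matches the training rows only. `Ridge` is used here purely as a lightweight stand-in for `XGBRegressor`, and the train/validation split is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(20, 2))
y = X[:, 0] - X[:, 1]

# Arbitrary split: first 15 rows for training, last 5 for validation
train_idx, val_idx = np.arange(0, 15), np.arange(15, 20)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
pipe.fit(X[train_idx], y[train_idx])

# The scaler's statistics come from the 15 training rows only
fitted_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(fitted_mean, X[train_idx].mean(axis=0)))
print(np.allclose(fitted_mean, X.mean(axis=0)))

# Validation rows are scaled with the training statistics before prediction
val_pred = pipe.predict(X[val_idx])
```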

### GroupKFold by Location
Instead of a simple random split, we use **GroupKFold** grouping by geographic location (combination of Latitude and Longitude). This ensures that **all samples from the same monitoring station stay in the same fold**, simulating real spatial evaluation — the model is evaluated on locations it has never seen during training.

The locations (Latitude, Longitude) are used only to define the fold splits during GroupKFold — they are not used as input features to the model.

So, the `groups` parameter tells `GroupKFold` which rows belong together (same station), but the model itself only ever sees `swir22`, `NDMI`, `MNDWI`, and `pet`. The groups ensure that, during hyperparameter tuning, no validation fold contains a station that also appears in the corresponding training folds. The trained model has no knowledge of coordinates.

At final prediction time, `pipeline_TA.predict(submission_val_data)` receives only `['swir22', 'NDMI', 'MNDWI', 'pet']`, with no location information at all.
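
A quick way to convince yourself of this separation is to inspect the splits directly. The sketch below uses six made-up stations with three samples each and placeholder features; only the grouping logic mirrors the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: 6 stations with 3 samples each; coordinates are made up
stations = pd.DataFrame({
    "Latitude": np.repeat([-25.0, -26.0, -27.0, -28.0, -29.0, -30.0], 3),
    "Longitude": np.repeat([28.0, 29.0, 30.0, 31.0, 32.0, 33.0], 3),
})
groups = stations["Latitude"].astype(str) + "_" + stations["Longitude"].astype(str)
X = np.zeros((len(stations), 1))  # placeholder features

fold_is_clean = []
for fold, (train_idx, val_idx) in enumerate(GroupKFold(n_splits=3).split(X, groups=groups)):
    train_groups = set(groups.iloc[train_idx])
    val_groups = set(groups.iloc[val_idx])
    # A fold is clean when no station appears on both sides of the split
    fold_is_clean.append(train_groups.isdisjoint(val_groups))
    print(fold, sorted(val_groups), fold_is_clean[-1])
```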

### RandomizedSearchCV for Hyperparameter Tuning
We use `RandomizedSearchCV` to efficiently search for the best XGBoost hyperparameters by sampling random combinations from the search space. This is faster than an exhaustive `GridSearchCV` and generally finds good configurations with fewer iterations.
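
The search space is expressed with `scipy.stats` distributions, whose arguments are easy to misread: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The short sketch below samples from two such distributions to show the resulting ranges; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

learning_rate = uniform(0.01, 0.29)   # loc=0.01, scale=0.29, so values lie in [0.01, 0.30]
max_depth = randint(3, 12)            # integers 3..11 (upper bound excluded)

lr_samples = learning_rate.rvs(size=1000, random_state=0)
depth_samples = max_depth.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(depth_samples.min(), depth_samples.max())
```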

### Model Evaluation
After tuning, we evaluate the best model found using:
- **R² Score**: Measures how well the model explains the variance in the observed values.
- **RMSE (Root Mean Square Error)**: Quantifies the average magnitude of prediction errors.
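
For reference, both metrics can be reproduced by hand. The sketch below computes them with NumPy on four made-up values and prints them next to the scikit-learn results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```

RMSE is expressed in the units of the target, so values are not comparable across the three parameters, whereas R² is unitless.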

### **Tip 4**

There are many data preprocessing methods available that may help improve model performance. Participants are encouraged to explore various preprocessing techniques as well as different machine learning algorithms to build a more robust model.

In [None]:
def create_location_groups(wq_data):
    """Create group labels based on unique (Latitude, Longitude) pairs for GroupKFold."""
    groups = wq_data[['Latitude', 'Longitude']].apply(lambda row: f"{row['Latitude']}_{row['Longitude']}", axis=1)
    return groups

def build_pipeline():
    """Build a Pipeline with StandardScaler + XGBRegressor (avoids data leakage)."""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', XGBRegressor(
            objective='reg:squarederror',
            random_state=42,
            n_jobs=-1,
            verbosity=0
        ))
    ])
    return pipeline

def get_xgb_param_distributions():
    """Define hyperparameter search space for XGBoost."""
    param_distributions = {
        'model__n_estimators': randint(100, 1000),
        'model__max_depth': randint(3, 12),
        'model__learning_rate': uniform(0.01, 0.29),       # [0.01, 0.30]
        'model__subsample': uniform(0.5, 0.5),             # [0.5, 1.0]
        'model__colsample_bytree': uniform(0.5, 0.5),      # [0.5, 1.0]
        'model__min_child_weight': randint(1, 10),
        'model__gamma': uniform(0, 5),
        'model__reg_alpha': uniform(0, 10),
        'model__reg_lambda': uniform(0, 10),
    }
    return param_distributions

def evaluate_predictions(y_true, y_pred, dataset_name="Test"):
    """Evaluate predictions with R² and RMSE."""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"\n{dataset_name} Evaluation:")
    print(f"  R²:   {r2:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    return r2, rmse

## Model Workflow (Pipeline)

The complete pipeline follows a robust structure to ensure **reproducibility**, **no data leakage**, and **realistic spatial evaluation**:

1. **Location-based group creation** — each unique (Latitude, Longitude) combination receives a group label.
2. **GroupKFold (5 folds)** — ensures that samples from the same station never appear simultaneously in training and validation.
3. **Pipeline (StandardScaler → XGBRegressor)** — the scaler is fitted only within each training fold.
4. **RandomizedSearchCV** — explores 50 random hyperparameter combinations, evaluating each with GroupKFold.
5. **Final evaluation** — the best model is refitted on all data and also evaluated via cross-validation scores.

In [None]:
def run_pipeline(X, y, groups, param_name="Parameter", n_splits=5, n_iter=50):
    print(f"\n{'='*60}")
    print(f"Training Model for {param_name}")
    print(f"{'='*60}")
    
    # GroupKFold by location — ensures spatial separation between folds
    group_kfold = GroupKFold(n_splits=n_splits)
    
    # Build pipeline: StandardScaler + XGBRegressor (no leakage)
    pipeline = build_pipeline()
    
    # Hyperparameter search space
    param_distributions = get_xgb_param_distributions()
    
    # RMSE scorer (negative because sklearn maximizes)
    rmse_scorer = make_scorer(
        lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
        greater_is_better=False
    )
    
    # RandomizedSearchCV with GroupKFold
    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=group_kfold,
        scoring={'r2': 'r2', 'rmse': rmse_scorer},
        refit='r2',  # Refit using the best R² score
        random_state=42,
        n_jobs=-1,
        verbose=1,
        return_train_score=True
    )
    
    # Fit with groups
    search.fit(X, y, groups=groups)
    
    # Best model (already a full Pipeline with scaler + xgb)
    best_pipeline = search.best_estimator_
    
    # Cross-validation results
    best_idx = search.best_index_
    cv_results = search.cv_results_
    
    r2_train = cv_results['mean_train_r2'][best_idx]
    r2_test = cv_results['mean_test_r2'][best_idx]
    rmse_train = -cv_results['mean_train_rmse'][best_idx]
    rmse_test = -cv_results['mean_test_rmse'][best_idx]
    
    print(f"\nBest Hyperparameters:")
    for param, value in search.best_params_.items():
        print(f"  {param}: {value}")
    
    print(f"\nCross-Validation Results (GroupKFold, {n_splits} folds):")
    print(f"  Train R²:  {r2_train:.4f}  |  Train RMSE: {rmse_train:.4f}")
    print(f"  Val   R²:  {r2_test:.4f}  |  Val   RMSE: {rmse_test:.4f}")
    
    # Summary
    results = {
        "Parameter": param_name,
        "CV_R2_Train": round(r2_train, 4),
        "CV_RMSE_Train": round(rmse_train, 4),
        "CV_R2_Val": round(r2_test, 4),
        "CV_RMSE_Val": round(rmse_test, 4),
    }
    
    return best_pipeline, pd.DataFrame([results])
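
The sign handling above follows scikit-learn's convention that scorers are maximized, so error metrics are stored as negative numbers. As an alternative to a custom scorer, scikit-learn also ships a built-in `"neg_root_mean_squared_error"` scorer. The sketch below shows it on synthetic data, with `Ridge` as a stand-in model and six made-up stations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + 0.1 * rng.normal(size=30)
groups = np.repeat(np.arange(6), 5)  # 6 made-up stations, 5 samples each

scores = cross_validate(
    Ridge(), X, y, groups=groups, cv=GroupKFold(n_splits=3),
    scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
)
# Scores are negated errors, so flip the sign to report RMSE
rmse_per_fold = -scores["test_rmse"]
print(rmse_per_fold)
```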

### Model Training and Evaluation for Each Parameter

In this step, we apply the complete modeling pipeline to each of the three selected water quality parameters — Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus. 

**Importantly**, we create **location-based groups** from the (Latitude, Longitude) pairs and pass them to `GroupKFold`, so the model is always validated on locations it has never seen during training. The `RandomizedSearchCV` explores 50 random hyperparameter combinations for XGBoost, evaluated across 5 spatial folds.

In [None]:
# Create location-based groups for GroupKFold
groups = create_location_groups(wq_data)
print(f"Total samples: {len(wq_data)}")
print(f"Unique locations (groups): {groups.nunique()}")

# Feature matrix (only predictor variables — no Lat/Lon)
X = wq_data[['swir22', 'NDMI', 'MNDWI', 'pet']]

# Target variables
y_TA = wq_data['Total Alkalinity']
y_EC = wq_data['Electrical Conductance']
y_DRP = wq_data['Dissolved Reactive Phosphorus']

# Run pipeline for each parameter
pipeline_TA, results_TA = run_pipeline(X, y_TA, groups, "Total Alkalinity")
pipeline_EC, results_EC = run_pipeline(X, y_EC, groups, "Electrical Conductance")
pipeline_DRP, results_DRP = run_pipeline(X, y_DRP, groups, "Dissolved Reactive Phosphorus")

### Model Performance Summary

The table below consolidates the **GroupKFold cross-validation** results for each water quality parameter. The training and validation metrics represent **averages across folds**, where each fold ensures complete spatial separation between training and validation stations.

In [None]:
results_summary = pd.concat([results_TA, results_EC, results_DRP], ignore_index=True)
results_summary
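
The challenge overview also asks which factors most influence water quality. One common, model-agnostic option is permutation importance. The sketch below is only an illustration: it uses synthetic data in which `pet` is constructed to dominate, and a `RandomForestRegressor` as a stand-in for the tuned pipelines. In practice the importance would ideally be computed on held-out stations rather than on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["swir22", "NDMI", "MNDWI", "pet"])
# Synthetic target driven mostly by "pet", so it should rank first
y = 3.0 * X["pet"] + 0.5 * X["NDMI"] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Mean drop in score when each feature is shuffled, largest first
importance = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importance)
```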

## Submission

Once you are satisfied with your model’s performance, you can proceed to make predictions for unseen data. To do this, use your trained models to estimate the values of the target water quality parameters (Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus) for the set of test locations provided in the `submission_template.csv` file.

The predicted results can then be uploaded to the challenge platform for evaluation.

In [None]:
test_file = pd.read_csv("submission_template.csv")
display(test_file.head(5))

In [None]:
landsat_val_features = pd.read_csv("landsat_features_validation.csv")
display(landsat_val_features.head(5))

In [None]:
Terraclimate_val_df = pd.read_csv("terraclimate_features_validation.csv")
display(Terraclimate_val_df.head(5))

Similarly, participants can use the **Landsat** and **TerraClimate** data extraction demonstration notebooks to produce feature CSVs for their **validation** data. For convenience, example validation outputs are already provided as `landsat_features_validation.csv` and `terraclimate_features_validation.csv`, which are the files loaded above.

Participants should save their own extracted files in the same format and column schema; doing so will allow this benchmark notebook to load the validation features directly and run smoothly.

In [None]:
# Consolidate all the extracted bands and features in a single dataframe
val_data = pd.DataFrame({
    'Longitude': landsat_val_features['Longitude'].values,
    'Latitude': landsat_val_features['Latitude'].values,
    'Sample Date': landsat_val_features['Sample Date'].values,
    'nir': landsat_val_features['nir'].values,
    'green': landsat_val_features['green'].values,
    'swir16': landsat_val_features['swir16'].values,
    'swir22': landsat_val_features['swir22'].values,
    'NDMI': landsat_val_features['NDMI'].values,
    'MNDWI': landsat_val_features['MNDWI'].values,
    'pet': Terraclimate_val_df['pet'].values,
})

In [None]:
# Impute the missing values
val_data = val_data.fillna(val_data.median(numeric_only=True))

In [None]:
# Extracting specific columns (swir22, NDMI, MNDWI, pet) from the validation dataset
submission_val_data=val_data.loc[:,['swir22','NDMI','MNDWI','pet']]
display(submission_val_data.head())

In [None]:
submission_val_data.shape

In [None]:
# The pipeline already includes StandardScaler + XGBRegressor,
# so we can call .predict() directly — no manual scaling needed.

# --- Predicting for Total Alkalinity ---
pred_TA_submission = pipeline_TA.predict(submission_val_data)

# --- Predicting for Electrical Conductance ---
pred_EC_submission = pipeline_EC.predict(submission_val_data)

# --- Predicting for Dissolved Reactive Phosphorus ---
pred_DRP_submission = pipeline_DRP.predict(submission_val_data)

In [None]:
submission_df = pd.DataFrame({
    'Longitude': test_file['Longitude'].values,
    'Latitude': test_file['Latitude'].values,
    'Sample Date': test_file['Sample Date'].values,
    'Total Alkalinity': pred_TA_submission,
    'Electrical Conductance': pred_EC_submission,
    'Dissolved Reactive Phosphorus': pred_DRP_submission
})

In [None]:
# Display the first rows of the submission dataframe
display(submission_df.head())
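
Before writing the file, a few basic checks can catch common mistakes. The helper `check_submission` below is a hypothetical sketch, not part of the challenge tooling. It only verifies the row count against the template, the presence of the three target columns, and the absence of NaN predictions; the toy frames are made up.

```python
import pandas as pd

def check_submission(submission, template, target_cols):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if len(submission) != len(template):
        problems.append("row count differs from template")
    missing = [c for c in target_cols if c not in submission.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    present = [c for c in target_cols if c in submission.columns]
    if submission[present].isna().any().any():
        problems.append("NaN values in predictions")
    return problems

targets = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]
template = pd.DataFrame({"Latitude": [-25.0, -26.0], "Longitude": [28.0, 29.0]})
good = template.assign(**{t: [1.0, 2.0] for t in targets})
print(check_submission(good, template, targets))
```

In the notebook, the equivalent call would pass `submission_df` and `test_file` in place of the toy frames.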

In [None]:
# Write the predictions to a CSV file.
submission_df.to_csv("/tmp/submission.csv", index=False)

In [None]:
session.sql("""
    PUT file:///tmp/submission.csv
    'snow://workspace/USER$.PUBLIC."ey-hackathon"/versions/live/'
    AUTO_COMPRESS=FALSE
    OVERWRITE=TRUE
""").collect()
print("File saved! Refresh the browser to see the files in the sidebar")

### Upload submission file on platform

Upload the `submission.csv` file on the challenge platform to generate your score on the leaderboard.

## Conclusion

Now that you have learned a basic approach to model training, it’s time to explore your own techniques and ideas! Feel free to modify any of the functions presented in this notebook to experiment with alternative preprocessing steps, feature engineering strategies, or machine learning algorithms. 

We look forward to seeing your enhanced model and the insights you uncover. Best of luck with the challenge!