## 2026 EY AI & Data Challenge - Landsat Data Extraction Notebook

This notebook demonstrates Landsat data extraction and the creation of an output file to be used by the benchmark notebook. The baseline data is [Landsat Collection 2 Level 2](https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2) data from the Microsoft Planetary Computer catalog.

**Caution:** This notebook requires significant execution time because 9,319 data points (unique locations and times) are used for data extraction from the Landsat archive. The code takes about 7 hours to run to completion on a typical laptop computer with a typical internet connection. Lower execution times are likely possible by optimizing the data extraction process and using cloud computing services.


### Load In Dependencies
The following code installs the required Python libraries (listed in the requirements.txt file) in the Snowflake environment so that the rest of the notebook can run. After running this code for the first time, you must restart the kernel so the Python libraries are available in the environment. To do so, open the “Connected” menu above the notebook (next to “Run all”) and select the “restart kernel” link. Subsequent runs of the notebook do not require a restart.

In [None]:
!pip install uv
!uv pip install  -r requirements.txt 

In [1]:
import snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()

import warnings
warnings.filterwarnings("ignore")

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

from datetime import date
from tqdm import tqdm
import time
import os


### Extracting Landsat Data Using API Calls

The API-based method allows us to efficiently access **Landsat** data for specific coordinates and time periods, ensuring scalability and reproducibility of the process.

Through the API, we can query individual bands or compute indices like **NDMI** on the fly. This approach reduces storage requirements and simplifies data preprocessing, making it ideal for large-scale environmental and water quality analysis.

The **compute_Landsat_values** function extracts Landsat Collection 2 Level-2 band values for specific sampling locations using a bounding box roughly 100 m across, centered on each point. For each location:

- A bounding box (bbox) is created around the latitude and longitude coordinates.
- The Microsoft Planetary Computer STAC API is queried for **landsat-c2-l2** scenes that intersect the bounding box, fall between 2011-01-01 and 2015-12-31, and have less than 10% cloud cover. The search does not filter by platform, so scenes from any Landsat mission in the collection may be returned.
- The scene closest in time to the sample date is selected, and the bands **blue**, **green**, **red**, **nir08**, **swir16**, and **swir22** are loaded.
- The median value of the pixels within the bounding box is computed for each band to reduce the effect of noise or outliers. The **nir08** band is stored under the column name **nir**.
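
The scene-selection step can be illustrated without calling the API. The sketch below is a simplified stand-in, not the notebook's extraction code: the `scenes` list, its IDs, and the `pick_nearest_scene` helper are hypothetical, and each scene is reduced to a dict with an ISO 8601 `datetime` string in place of a full STAC item.

```python
import pandas as pd


def pick_nearest_scene(scenes, sample_date):
    """Return the scene whose acquisition time is closest to sample_date.

    `scenes` is a list of dicts with a "datetime" key (ISO 8601, UTC).
    This is a hypothetical, simplified stand-in for STAC item properties.
    """
    target = pd.Timestamp(sample_date)
    # Make the sample date timezone-aware so it can be compared with UTC scene times
    if target.tzinfo is None:
        target = target.tz_localize("UTC")
    else:
        target = target.tz_convert("UTC")
    return min(
        scenes,
        key=lambda s: abs(pd.Timestamp(s["datetime"]).tz_convert("UTC") - target),
    )


# Made-up acquisition times, for illustration only
scenes = [
    {"id": "scene_a", "datetime": "2013-05-01T08:10:00Z"},
    {"id": "scene_b", "datetime": "2013-05-17T08:10:00Z"},
    {"id": "scene_c", "datetime": "2013-06-02T08:10:00Z"},
]
nearest = pick_nearest_scene(scenes, "2013-05-20")
print(nearest["id"])
```

Using `min` with a key avoids sorting the whole list when only the closest scene is needed; the result is the same as sorting by time difference and taking the first element.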

**Why the buffer value is 0.00089831**

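We want a bounding box roughly 100 m across, centered on each point.  
At the equator, 1 degree ≈ 111.32 km.

Therefore, the degree equivalent of 100 m is:

*buffer_deg ≈ 100 m / 111,320 m per degree ≈ 0.00089831*

The code subtracts and adds half of this value to each coordinate, so the box extends about 50 m from the point in each direction. Landsat pixels are 30 m across, so the box covers only a few pixels around each sampling location. Away from the equator, a degree of longitude is shorter than a degree of latitude, so the box is slightly narrower east to west than north to south.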
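
The conversion can be checked with a few lines of arithmetic. This is a minimal sketch under a spherical-Earth approximation, assuming about 111.32 km per degree (the figure that reproduces 0.00089831); the helper name `meters_to_degrees` and the example latitude are illustrative and not part of the notebook.

```python
import math

# Approximate length of one degree at the equator (assumption: spherical Earth)
METERS_PER_DEGREE = 111_320


def meters_to_degrees(meters, latitude_deg=0.0):
    """Convert a ground distance to (latitude, longitude) extents in degrees."""
    dlat = meters / METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude
    dlon = meters / (METERS_PER_DEGREE * math.cos(math.radians(latitude_deg)))
    return dlat, dlon


dlat, dlon = meters_to_degrees(100)
print(round(dlat, 8))

# At an illustrative latitude of 30 degrees, 100 m spans more degrees of longitude
dlat_30, dlon_30 = meters_to_degrees(100, latitude_deg=30.0)
print(round(dlon_30, 8))
```

The notebook uses a single value for both axes, which is a reasonable simplification for a box this small; the latitude correction matters more for larger buffers or for sites far from the equator.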


In [2]:
# Setup
tqdm.pandas()

def compute_Landsat_values(row, sleep_sec=0.5):
    lat = row['Latitude']
    lon = row['Longitude']
    date = pd.to_datetime(row['Sample Date'], dayfirst=True, errors='coerce')

    bbox_size = 0.00089831
    bbox = [
        lon - bbox_size / 2,
        lat - bbox_size / 2,
        lon + bbox_size / 2,
        lat + bbox_size / 2
    ]

    # Rate-limiting: pause before each API call
    time.sleep(sleep_sec)

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1",
        modifier=pc.sign_inplace,
    )

    search = catalog.search(
        collections=["landsat-c2-l2"],
        bbox=bbox,
        datetime="2011-01-01/2015-12-31",
        query={"eo:cloud_cover": {"lt": 10}},
    )

    items = search.item_collection()

    NAN_RESULT = pd.Series({
        "blue": np.nan, "green": np.nan, "red": np.nan,
        "nir": np.nan, "swir16": np.nan, "swir22": np.nan
    })

    # No matching scenes, or an unparseable sample date: return NaNs
    if not items or pd.isna(date):
        return NAN_RESULT

    try:
        sample_date_utc = date.tz_localize("UTC") if date.tzinfo is None else date.tz_convert("UTC")

        items = sorted(
            items,
            key=lambda x: abs(pd.to_datetime(x.properties["datetime"]).tz_convert("UTC") - sample_date_utc)
        )
        selected_item = pc.sign(items[0])

        bands_of_interest = ["blue", "green", "red", "nir08", "swir16", "swir22"]
        data = stac_load([selected_item], bands=bands_of_interest, bbox=bbox).isel(time=0)

        medians = {}
        band_map = {"blue": "blue", "green": "green", "red": "red",
                    "nir": "nir08", "swir16": "swir16", "swir22": "swir22"}

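        for out_name, band_key in band_map.items():
            # Median over all pixels in the bounding box, ignoring NaNs
            val = float(data[band_key].astype("float").median(skipna=True).values)
            # A median of exactly 0 is treated as missing data
            medians[out_name] = val if val != 0 else np.nan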

        return pd.Series(medians)

    except Exception:
        return NAN_RESULT


### Extracting features for the training dataset

In [3]:
Water_Quality_df = pd.read_csv('water_quality_training_dataset.csv')
display(Water_Quality_df.head())

In [4]:
Water_Quality_df.shape

### Note

The Landsat data extraction process for all 9,319 locations typically requires 7 hours or more when executed in a single run. During long executions, you may encounter API rate limits, timeout errors, or request failures. To limit the impact of these interruptions, we recommend running the extraction in smaller batches.

The code below processes the dataset in batches of 500 rows, pauses 2 seconds between batches, and writes a checkpoint file (**landsat_features_training_full.csv**) after each batch so that completed work is not lost if a later batch fails. Participants can use the same approach with a different batch size or on a subset of the rows.

We have already executed the full extraction for all 9,319 locations and saved the output to **landsat_features_training.csv**, which is used in the benchmark notebook.  
Participants who run the extraction themselves should combine the batch outputs and save the final merged dataset as **landsat_features_training.csv** so that the benchmark notebook runs without changes.
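
If each batch is saved to its own file, the outputs can be merged afterwards with pandas. The sketch below is one possible approach, not the notebook's own method: the `landsat_batch_*.csv` naming scheme, the `combine_batches` helper, and the toy values are assumptions made for illustration. Zero-padded batch numbers keep the files in row order when sorted by name.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_batches(batch_dir, pattern="landsat_batch_*.csv"):
    """Concatenate batch CSV files, in filename order, into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(batch_dir, pattern)))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern} in {batch_dir}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny stand-in batches written to a temporary folder (made-up values)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"green": [8000.0, 8100.0], "swir16": [9000.0, 9100.0]}).to_csv(
        os.path.join(tmp, "landsat_batch_0000.csv"), index=False
    )
    pd.DataFrame({"green": [8200.0], "swir16": [9200.0]}).to_csv(
        os.path.join(tmp, "landsat_batch_0001.csv"), index=False
    )
    merged = combine_batches(tmp)
    merged.to_csv(os.path.join(tmp, "landsat_features_training.csv"), index=False)

print(len(merged))
```

Because the rows are matched back to the sampling locations by position later in the notebook, the merged file must preserve the original row order and contain one row per input row.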


In [None]:
Water_Quality_df = pd.read_csv('water_quality_training_dataset.csv')

chunksize       = 500          # rows per batch
sleep_between   = 2            # seconds to wait between batches
output_path     = "landsat_features_training_full.csv"

dfs = []
total_batches = (len(Water_Quality_df) + chunksize - 1) // chunksize

print("🚀 Starting Landsat feature extraction for training data...")
print(f"   Total rows: {len(Water_Quality_df)} | Batch size: {chunksize} | Batches: {total_batches}\n")

for batch_num, i in enumerate(tqdm(range(0, len(Water_Quality_df), chunksize),
                                   desc="Batches", total=total_batches), start=1):
    chunk = Water_Quality_df.iloc[i : i + chunksize]

    print(f"\n⏳ Batch {batch_num}/{total_batches}: rows {i} to {i + len(chunk) - 1}")
    t0 = time.time()

    chunk_features = chunk.progress_apply(compute_Landsat_values, axis=1)
    dfs.append(chunk_features)

    elapsed = time.time() - t0
    print(f"   ✅ Batch {batch_num} done in {elapsed:.1f}s")

    # Checkpoint: save incrementally so no work is lost on failure
    partial = pd.concat(dfs, ignore_index=True)
    partial.to_csv(output_path, index=False)
    print(f"   💾 Checkpoint saved → {output_path} ({len(partial)} rows so far)")

    if batch_num < total_batches:
        print(f"   💤 Sleeping {sleep_between}s before next batch...")
        time.sleep(sleep_between)

landsat_train_features = pd.concat(dfs, ignore_index=True)
print(f"\n🎉 Extraction complete! Total rows: {len(landsat_train_features)}")


**NDMI and MNDWI Indices**

In this notebook, we compute two commonly used water-related indices from the extracted Landsat bands:

- **NDMI (Normalized Difference Moisture Index):**  
  Measures vegetation water content and surface moisture.  
  Computed as *(NIR - SWIR16) / (NIR + SWIR16)*.

- **MNDWI (Modified Normalized Difference Water Index):**  
  Highlights open water features by enhancing water reflectance and suppressing built-up areas.  
  Computed as *(Green - SWIR16) / (Green + SWIR16)*.

An **epsilon value** (*eps = 1e-10*) is added to the denominators to avoid division by zero.  
These indices are widely used in hydrological and water quality analyses for detecting water presence and vegetation moisture levels.
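
Both indices share the same normalized-difference form, so a single helper can compute either one. The sketch below uses made-up band values and a hypothetical `normalized_difference` helper to show the arithmetic, including how a missing band propagates to a missing index.

```python
import numpy as np
import pandas as pd


def normalized_difference(a, b, eps=1e-10):
    """Return (a - b) / (a + b + eps) elementwise; NaN inputs give NaN."""
    return (a - b) / (a + b + eps)


# Made-up band medians, for illustration only
toy = pd.DataFrame({
    "green":  [9000.0, 12000.0, np.nan],
    "nir":    [15000.0, 8000.0, 14000.0],
    "swir16": [10000.0, 8000.0, 11000.0],
})
toy["NDMI"] = normalized_difference(toy["nir"], toy["swir16"])
toy["MNDWI"] = normalized_difference(toy["green"], toy["swir16"])
print(toy.round(3))
```

In the first row, NDMI is (15000 - 10000) / (15000 + 10000) = 0.2. In the third row, the missing green value makes MNDWI missing as well, while NDMI is still computed.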


In [7]:
# Create indices: NDMI and MNDWI
eps = 1e-10
landsat_train_features['NDMI'] = (landsat_train_features['nir'] - landsat_train_features['swir16']) / (landsat_train_features['nir'] + landsat_train_features['swir16'] + eps)
landsat_train_features['MNDWI'] = (landsat_train_features['green'] - landsat_train_features['swir16']) / (landsat_train_features['green'] + landsat_train_features['swir16'] + eps)

In [8]:
landsat_train_features['Latitude'] = Water_Quality_df['Latitude']
landsat_train_features['Longitude'] = Water_Quality_df['Longitude']
landsat_train_features['Sample Date'] = Water_Quality_df['Sample Date']
# 'nir08' is stored under the name 'nir' by compute_Landsat_values, so only 'nir' is selected
landsat_train_features = landsat_train_features[['Latitude', 'Longitude', 'Sample Date', 'nir', 'red', 'blue', 'green', 'swir16', 'swir22', 'NDMI', 'MNDWI']]

In [None]:
# Preview File
landsat_train_features.head()

In [None]:
landsat_train_features.to_csv("/tmp/landsat_features_training.csv", index=False)


In [None]:
session.sql("""
    PUT file:///tmp/landsat_features_training.csv
    'snow://workspace/USER$.PUBLIC."ey-hackathon"/versions/live/'
    AUTO_COMPRESS=FALSE
    OVERWRITE=TRUE
""").collect()

print("File saved! Refresh the browser to see the files in the sidebar")


**Note:** If you're using your own workspace, remember to replace "ey-hackathon" with your workspace name in the file path.

### Extracting features for the validation dataset

In [11]:
Validation_df = pd.read_csv('submission_template.csv')
display(Validation_df.head())

In [12]:
Validation_df.shape

In [None]:
chunksize       = 500          # rows per batch
sleep_between   = 2            # seconds to wait between batches
val_output_path = "landsat_features_validation_full.csv"

val_dfs = []
total_batches = (len(Validation_df) + chunksize - 1) // chunksize

print("🚀 Starting Landsat feature extraction for validation data...")
print(f"   Total rows: {len(Validation_df)} | Batch size: {chunksize} | Batches: {total_batches}\n")

for batch_num, i in enumerate(tqdm(range(0, len(Validation_df), chunksize),
                                   desc="Batches", total=total_batches), start=1):
    chunk = Validation_df.iloc[i : i + chunksize]

    print(f"\n⏳ Batch {batch_num}/{total_batches}: rows {i} to {i + len(chunk) - 1}")
    t0 = time.time()

    chunk_features = chunk.progress_apply(compute_Landsat_values, axis=1)
    val_dfs.append(chunk_features)

    elapsed = time.time() - t0
    print(f"   ✅ Batch {batch_num} done in {elapsed:.1f}s")

    # Checkpoint: save incrementally so no work is lost on failure
    partial = pd.concat(val_dfs, ignore_index=True)
    partial.to_csv(val_output_path, index=False)
    print(f"   💾 Checkpoint saved → {val_output_path} ({len(partial)} rows so far)")

    if batch_num < total_batches:
        print(f"   💤 Sleeping {sleep_between}s before next batch...")
        time.sleep(sleep_between)

landsat_val_features = pd.concat(val_dfs, ignore_index=True)
print(f"\n🎉 Extraction complete! Total rows: {len(landsat_val_features)}")


In [14]:
# Create indices: NDMI and MNDWI
eps = 1e-10
landsat_val_features['NDMI'] = (landsat_val_features['nir'] - landsat_val_features['swir16']) / (landsat_val_features['nir'] + landsat_val_features['swir16'] + eps)
landsat_val_features['MNDWI'] = (landsat_val_features['green'] - landsat_val_features['swir16']) / (landsat_val_features['green'] + landsat_val_features['swir16'] + eps)

In [15]:
landsat_val_features['Latitude'] = Validation_df['Latitude']
landsat_val_features['Longitude'] = Validation_df['Longitude']
landsat_val_features['Sample Date'] = Validation_df['Sample Date']
# 'nir08' is stored under the name 'nir' by compute_Landsat_values, so only 'nir' is selected
landsat_val_features = landsat_val_features[['Latitude', 'Longitude', 'Sample Date', 'nir', 'red', 'blue', 'green', 'swir16', 'swir22', 'NDMI', 'MNDWI']]

In [None]:
# Preview File
landsat_val_features.head()

In [None]:
landsat_val_features.to_csv("/tmp/landsat_features_validation.csv", index=False)

In [None]:
session.sql("""
    PUT file:///tmp/landsat_features_validation.csv
    'snow://workspace/USER$.PUBLIC."ey-hackathon"/versions/live/'
    AUTO_COMPRESS=FALSE
    OVERWRITE=TRUE
""").collect()

print("File saved! Refresh the browser to see the files in the sidebar")

**Note:** If you're using your own workspace, remember to replace "ey-hackathon" with your workspace name in the file path.