# GeoAI Hack - Locust Breeding Ground Prediction using HLS Data

<a href="https://colab.research.google.com/github/instadeepai/InstaGeo-E2E-Geospatial-ML/blob/main/notebooks/InstaGeo_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This starter notebook showcases the capabilities of InstaGeo, an end-to-end package designed for geospatial machine learning with multispectral data.

In this demonstration, we use ground-truth locust observations downloaded from the [UN FAO Locust Hub](https://locust-hub-hqfao.hub.arcgis.com/) on March 17, 2022 to train a model that identifies desert locust breeding grounds in Africa. The notebook guides you through creating segmentation-like data from these observations, fine-tuning the [Prithvi](https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M) model, and finally visualizing the inference results on an interactive map.

By the end of this demo, you will gain hands-on experience with key InstaGeo functionalities and learn how it streamlines geospatial ML workflows from data preparation to model inference.

# Install InstaGeo

In [None]:
repository_url = "https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML"

!git clone {repository_url}

In [None]:
%%bash
cd InstaGeo-E2E-Geospatial-ML
pip install -e .[all]

## EarthData Login

InstaGeo currently supports multispectral data from NASA [Harmonized Landsat and Sentinel-2 (HLS)](https://hls.gsfc.nasa.gov/). Accessing HLS data requires an EarthData user account, which can be created [here](https://urs.earthdata.nasa.gov/).

In [None]:
from getpass import getpass
import os

In [None]:
# Enter your EarthData user account credentials
USERNAME = getpass('Enter your EarthData username: ')
PASSWORD = getpass('Enter your EarthData password: ')

content = f"machine urs.earthdata.nasa.gov login {USERNAME} password {PASSWORD}\n"

netrc_path = os.path.expanduser('~/.netrc')
with open(netrc_path, 'w') as file:
    file.write(content)
# Restrict access to the credentials file to the current user
os.chmod(netrc_path, 0o600)
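
The credentials file written above can be sanity-checked with Python's standard `netrc` module. The sketch below is illustrative only: it writes placeholder credentials to a temporary file rather than to `~/.netrc`, and the helper names `write_netrc` and `read_login` are hypothetical, not part of InstaGeo.

```python
import netrc
import os
import tempfile
from pathlib import Path


def write_netrc(path, host, username, password):
    """Write a single-host netrc entry and restrict it to the current user."""
    path = Path(path)
    path.write_text(f"machine {host} login {username} password {password}\n")
    os.chmod(path, 0o600)
    return path


def read_login(path, host):
    """Return the login stored for `host`, or None if the host is absent."""
    auth = netrc.netrc(str(path)).authenticators(host)
    return None if auth is None else auth[0]


# Demo with placeholder credentials in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_netrc(
        Path(tmp) / "demo_netrc", "urs.earthdata.nasa.gov", "demo_user", "demo_pass"
    )
    demo_login = read_login(demo_path, "urs.earthdata.nasa.gov")
    demo_other = read_login(demo_path, "example.invalid")
```

Calling `read_login(os.path.expanduser('~/.netrc'), 'urs.earthdata.nasa.gov')` on the real file should return the username you entered.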

## InstaGeo - Data (Optional)

With InstaGeo installed and EarthData authentication configured, we are now ready to download and process HLS granules using the `InstaGeo-Data` module. This module offers several functionalities for handling geospatial data, including:

- Searching and retrieving metadata for HLS granules
- Downloading specific spectral bands from HLS granules
- Generating data chips and corresponding target labels for machine learning tasks

These capabilities streamline the preprocessing of multispectral data, setting the foundation for efficient geospatial model development.



In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
import os
import re

The ground-truth locust observations used in this challenge were downloaded from the [UN FAO Locust Hub](https://locust-hub-hqfao.hub.arcgis.com/) on March 17, 2022. The raw data was processed to derive our locust breeding ground dataset, and the label dataset was then split into training and test sets.

After splitting the data into training and test splits, the next step is to group the data by the HLS granules they belong to and download the corresponding spectral bands for each granule. Once the bands are retrieved, we will generate smaller chips and target labels with dimensions of 256 x 256 pixels.

By the end of this process, the input data will have a shape of 3 x 6 x 256 x 256 (representing three sets of six spectral bands and 256 x 256 pixel chips), and the target labels will have a shape of 256 x 256.
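
To make these shapes concrete, here is a minimal NumPy sketch that stacks three dummy six-band images into one input array. The zeros are placeholders for reflectance values, and the variable names are illustrative; how `InstaGeo-Data` lays out the bands on disk is not shown here.

```python
import numpy as np

num_steps, num_bands, chip_size = 3, 6, 256

# One dummy six-band image per temporal step (zeros stand in for real data).
steps = [
    np.zeros((num_bands, chip_size, chip_size), dtype=np.float32)
    for _ in range(num_steps)
]

# Stack along a new leading axis: (steps, bands, height, width).
chip = np.stack(steps, axis=0)

# One label value per pixel of the chip.
label = np.zeros((chip_size, chip_size), dtype=np.int64)

print(chip.shape)
print(label.shape)
```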

While these tasks might seem complex, the `InstaGeo-Data` module abstracts this process, allowing you to configure it with a single command, as shown in the following cells.

In [None]:
def generate_label_mapping(root_dir, input_subdir, output_csv):
    """
    Generate a CSV mapping input chips to corresponding segmentation maps.

    Args:
        root_dir (str or Path): Root directory containing the subdirectories for chips and segmentation maps.
        input_subdir (str): Subdirectory path for chips within the root directory.
        output_csv (str or Path): Output path for the generated CSV file.
    """
    root_dir = Path(root_dir)
    chips_orig = os.listdir(root_dir / input_subdir / "chips")

    chips = [chip.replace("chip", f"{input_subdir}/chips/chip") for chip in chips_orig]
    seg_maps = [chip.replace("chip", f"{input_subdir}/seg_maps/seg_map") for chip in chips_orig]

    df = pd.DataFrame({"Input": chips, "Label": seg_maps})
    df.to_csv(root_dir / output_csv, index=False)
    
    print(f"Number of rows is: {df.shape[0]}")
    print(f"CSV generated and saved to: {root_dir / output_csv}")

### Training Split

In [None]:
!mkdir train

!python -m "instageo.data.chip_creator" \
    --dataframe_path="train.csv" \
    --output_directory="train" \
    --min_count=10 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=3 \
    --processing_method=cog

In [None]:
generate_label_mapping(Path.cwd(), "train", "train_ds.csv")

### Test Split

In [None]:
!mkdir test

!python -m "instageo.data.chip_creator" \
    --dataframe_path="test.csv" \
    --output_directory="test" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --processing_method=cog

In [None]:
generate_label_mapping(Path.cwd(), "test", "test_ds.csv")

## Prepare Data

Creating the chips and labels with `InstaGeo-Data` took 57 hours, so given the limited time available for this hackathon, we have precomputed them for you.

The data is provided as part of this competition, so you can simply download it and start hacking your way to a TOP SOLUTION!

Extract the compressed data

In [None]:
!tar -xvzf train.tar.gz
!tar -xvzf test.tar.gz

Create input and label mapping

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
import os
import re

In [None]:
def generate_label_mapping(root_dir, input_subdir, output_csv):
    """
    Generate a CSV mapping input chips to corresponding segmentation maps.

    Args:
        root_dir (str or Path): Root directory containing the subdirectories for chips and segmentation maps.
        input_subdir (str): Subdirectory path for chips within the root directory.
        output_csv (str or Path): Output path for the generated CSV file.
    """
    root_dir = Path(root_dir)
    chips_orig = os.listdir(root_dir / input_subdir / "chips")

    chips = [chip.replace("chip", f"{input_subdir}/chips/chip") for chip in chips_orig]
    seg_maps = [chip.replace("chip", f"{input_subdir}/seg_maps/seg_map") for chip in chips_orig]

    df = pd.DataFrame({"Input": chips, "Label": seg_maps})
    df.to_csv(root_dir / output_csv, index=False)
    
    print(f"Number of rows is: {df.shape[0]}")
    print(f"CSV generated and saved to: {root_dir / output_csv}")

In [None]:
generate_label_mapping(Path.cwd(), 'train', "train_ds.csv")

In [None]:
generate_label_mapping(Path.cwd(), 'test', "test_ds.csv")
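
Before moving on, it can be worth confirming that every row of a mapping CSV points at files that actually exist. The sketch below uses only the standard library; `find_missing_files` is a hypothetical helper, and the demo builds a throwaway directory with made-up file names rather than touching the competition data.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_files(mapping_csv, root_dir):
    """Return paths listed under Input/Label that do not exist under root_dir."""
    root_dir = Path(root_dir)
    missing = []
    with open(mapping_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in ("Input", "Label"):
                if not (root_dir / row[column]).exists():
                    missing.append(row[column])
    return missing


# Demo: the chip exists, the segmentation map deliberately does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train" / "chips").mkdir(parents=True)
    (root / "train" / "chips" / "chip_demo.tif").touch()
    (root / "demo_ds.csv").write_text(
        "Input,Label\n"
        "train/chips/chip_demo.tif,train/seg_maps/seg_map_demo.tif\n"
    )
    demo_missing = find_missing_files(root / "demo_ds.csv", root)
```

On the real data, `find_missing_files("train_ds.csv", Path.cwd())` should return an empty list.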

Split out Validation Set

In [None]:
def split_validation_data(mapping_csv, data_dir, train_dir, validation_dir, validation_split=0.3):
    """
    Split data into training and validation sets based on a CSV file mapping `chips` and `seg_maps`.

    Args:
        mapping_csv (str or Path): Path to the CSV file containing the mapping between `chips` and `seg_maps`.
        data_dir (str or Path): Path to the merged directory containing all files.
        train_dir (str or Path): Path to the new directory for training files.
        validation_dir (str or Path): Path to the new directory for validation files.
        validation_split (float): Fraction of the data to use as the validation set.
    """
    data_dir = Path(data_dir)
    validation_dir = Path(validation_dir)
    train_dir = Path(train_dir)

    validation_dir.mkdir(parents=True, exist_ok=True)
    train_dir.mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(mapping_csv)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    num_val = int(len(df) * validation_split)
    train_df = df[num_val:]
    val_df = df[:num_val]

    for _, row in val_df.iterrows():
        chip_file = data_dir / Path(row['Input']).relative_to(data_dir)
        seg_map_file = data_dir / Path(row['Label']).relative_to(data_dir)

        for file, subfolder in [(chip_file, "chips"), (seg_map_file, "seg_maps")]:
            if file.exists():
                dest_path = validation_dir / subfolder / file.relative_to(data_dir).name
                dest_path.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(file), str(dest_path))
            else:
                print(f"File not found: {file}")
                
    for _, row in train_df.iterrows():
        chip_file = data_dir / Path(row['Input']).relative_to(data_dir)
        seg_map_file = data_dir / Path(row['Label']).relative_to(data_dir)

        for file, subfolder in [(chip_file, "chips"), (seg_map_file, "seg_maps")]:
            if file.exists():
                dest_path = train_dir / subfolder / file.relative_to(data_dir).name
                dest_path.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(file), str(dest_path))
            else:
                print(f"File not found: {file}")
                
    print(f"Train files moved to {train_dir}. Train set size: {len(train_df)}.")
    print(f"Validation files moved to {validation_dir}. Validation set size: {len(val_df)}.")
    

In [None]:
split_validation_data(
    mapping_csv="train_ds.csv",
    data_dir="train",  # directory created when extracting train.tar.gz
    train_dir="train_split",
    validation_dir="validation_split",
    validation_split=0.3
)

In [None]:
# Generate label mapping

In [None]:
generate_label_mapping(Path.cwd(), 'train_split', "train_split.csv")

In [None]:
generate_label_mapping(Path.cwd(), 'validation_split', "validation_split.csv")
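
A quick way to rule out leakage between the two splits is to compare chip file names across the mapping files. This is a minimal sketch with hypothetical helper names (`read_inputs`, `overlapping_chips`) and made-up file names in the demo.

```python
import csv
from pathlib import PurePosixPath


def read_inputs(mapping_csv):
    """Read the Input column of a mapping CSV into a list of path strings."""
    with open(mapping_csv, newline="") as handle:
        return [row["Input"] for row in csv.DictReader(handle)]


def overlapping_chips(train_inputs, val_inputs):
    """Return the sorted chip file names that appear in both splits."""
    train_names = {PurePosixPath(p).name for p in train_inputs}
    val_names = {PurePosixPath(p).name for p in val_inputs}
    return sorted(train_names & val_names)


# Demo with made-up paths: chip_b.tif is deliberately present in both lists.
demo_overlap = overlapping_chips(
    ["train_split/chips/chip_a.tif", "train_split/chips/chip_b.tif"],
    ["validation_split/chips/chip_b.tif", "validation_split/chips/chip_c.tif"],
)
```

On the real splits, `overlapping_chips(read_inputs("train_split.csv"), read_inputs("validation_split.csv"))` is expected to be empty, since each file is moved to exactly one directory.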

## InstaGeo - Model

After creating our dataset using the `InstaGeo-Data` module, we can move on to fine-tuning a model that includes a Prithvi backbone paired with a classification head. For regression tasks, the classification head can easily be replaced with a suitable regression head. Additionally, if a completely different model architecture is needed, it can be designed and implemented within this framework.

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path

**Launch Training**

First, compute the mean and standard deviation for the dataset and update the corresponding config file, in this case `locust.yaml`.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    train.num_epochs=5 \
    mode=stats \
    train_filepath="train_split.csv"
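
For intuition, per-band statistics of this kind can be computed by accumulating sums and squared sums over all chips. The NumPy sketch below illustrates the arithmetic only and is not InstaGeo's implementation; `band_statistics` and its `no_data_value` parameter are hypothetical.

```python
import numpy as np


def band_statistics(chips, no_data_value=None):
    """Per-band mean and std over an iterable of (bands, height, width) arrays."""
    sums = sq_sums = counts = None
    for chip in chips:
        data = np.asarray(chip, dtype=np.float64)
        if no_data_value is None:
            valid = np.ones(data.shape, dtype=bool)
        else:
            valid = data != no_data_value
        masked = np.where(valid, data, 0.0)  # invalid pixels contribute nothing
        s = masked.sum(axis=(1, 2))
        q = (masked ** 2).sum(axis=(1, 2))
        n = valid.sum(axis=(1, 2))
        if sums is None:
            sums, sq_sums, counts = s, q, n
        else:
            sums, sq_sums, counts = sums + s, sq_sums + q, counts + n
    mean = sums / counts
    std = np.sqrt(sq_sums / counts - mean ** 2)  # population standard deviation
    return mean, std


# Demo: two 2-band chips filled with 1.0 and 3.0 respectively.
demo_chips = [np.full((2, 4, 4), 1.0), np.full((2, 4, 4), 3.0)]
demo_mean, demo_std = band_statistics(demo_chips)
```

In the demo, each band has equally many 1.0 and 3.0 pixels, so the mean is 2.0 and the standard deviation is 1.0.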

Run training

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    train.num_epochs=5 \
    mode=train \
    train_filepath="train_split.csv" \
    valid_filepath="validation_split.csv"

**Run Model Evaluation**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    test_filepath="test_ds.csv" \
    train.batch_size=8 \
    checkpoint_path='checkpoint-path' \
    mode=eval

### Make Submission

We first run inference on the test chips to get the predictions. As before, adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    test_filepath="test_ds.csv" \
    train.batch_size=16 \
    checkpoint_path='checkpoint-path' \
    mode=chip_inference

After getting the predictions for each chip, we retrieve the predicted value for each observation in our test split.

In [None]:
from pyproj import CRS, Transformer
import os
import rasterio
import numpy as np

predictions_directory = "predictions"
prediction_files = os.listdir(predictions_directory)

def get_prediction_value(row):
    matching_files = [f for f in prediction_files if (str(row['date']) in f) and (row['mgrs_tile_id'] in f)]
    if not matching_files:
        return (np.nan, np.nan)
    for file in matching_files:
        with rasterio.open(f"{predictions_directory}/{file}") as src:
            width, height = src.width, src.height
            affine_transform = rasterio.transform.AffineTransformer(src.transform)
            transformer = Transformer.from_crs(CRS.from_epsg(4326), src.crs, always_xy=True)
            x_chip, y_chip = transformer.transform(row['x'], row['y'])
            # rowcol returns (row, col), i.e. (vertical, horizontal) pixel indices
            row_idx, col_idx = affine_transform.rowcol(x_chip, y_chip)

            if 0 <= row_idx < height and 0 <= col_idx < width:
                return src.read(1)[row_idx, col_idx], file
    return (np.nan, np.nan)
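
The coordinate-to-pixel step above can be illustrated without rasterio. The sketch below assumes a north-up raster with square pixels and no rotation, which is a simplification of a general affine transform; the function names and the origin coordinates are made up for the example, and 30 m is the HLS pixel size.

```python
import math


def world_to_rowcol(x, y, origin_x, origin_y, pixel_size):
    """Map projected coordinates to (row, col) for a north-up raster.

    (origin_x, origin_y) is the outer corner of the top-left pixel.
    """
    col = math.floor((x - origin_x) / pixel_size)
    row = math.floor((origin_y - y) / pixel_size)  # y decreases as rows increase
    return row, col


def inside(row, col, height, width):
    """Rows are bounded by the raster height, columns by its width."""
    return 0 <= row < height and 0 <= col < width


# A point 75 m east and 45 m south of the raster origin, with 30 m pixels.
demo_row, demo_col = world_to_rowcol(
    x=500_075.0, y=4_999_955.0,
    origin_x=500_000.0, origin_y=5_000_000.0,
    pixel_size=30.0,
)
```

Note that the row index is checked against the height and the column index against the width; mixing them up goes unnoticed on square chips but breaks on rectangular rasters.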

In [None]:
submission_df = pd.read_csv("hls_submission.csv")
submission_df[['prediction', 'filename']] = submission_df.apply(get_prediction_value, axis=1, result_type='expand')
submission_df.to_csv("hls_submission.csv", index=False)

**Upload submission file to Kaggle to see leaderboard score**

**Run Inference**

In [None]:
# !gsutil cp gs://instageo/utils/africa_prediction_template.csv .
!mkdir -p inference/2021-06

**Create Inference Data**

For inference, we only need to download the necessary HLS tiles and run inference directly using the sliding window inference feature.

If you're running inference across the entire African continent, you can use the `africa_prediction_template.csv`, which will automatically download 2,120 HLS granules covering Africa and parts of Asia.

For this demo, we'll limit the scope to the HLS granules included in our test split.

Note: If you are running inference across Africa, ensure you have approximately 1 TB of storage space available for this process.
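
To illustrate the idea behind sliding window inference, the sketch below enumerates the top-left corners of fixed-size windows that together cover a full tile. It is a generic illustration and not InstaGeo's implementation; the function names, the edge-handling strategy, and the sizes in the demo are assumptions.

```python
def window_starts(length, window, stride):
    """Start offsets such that windows of size `window` cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        # Add one extra window flush with the far edge so no pixel is skipped.
        starts.append(length - window)
    return starts


def sliding_windows(height, width, window, stride):
    """Return the (row, col) top-left corner of every window over the tile."""
    return [
        (row, col)
        for row in window_starts(height, window, stride)
        for col in window_starts(width, window, stride)
    ]


# Demo on a tiny 5 x 5 grid with 2 x 2 windows and a stride of 2.
demo_windows = sliding_windows(height=5, width=5, window=2, stride=2)
```

The model is then applied to each window, and the per-window outputs are written back into a prediction raster of the same size as the tile.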

In [None]:
# !python -m "instageo.data.chip_creator" \
#     --dataframe_path="africa_prediction_template.csv" \
#     --output_directory="inference/2021-06" \
#     --min_count=1 \
#     --no_data_value=-1 \
#     --temporal_tolerance=3 \
#     --temporal_step=30 \
#     --num_steps=3 \
#     --download_only

In [None]:
# Instead of downloading a new set of HLS tiles, we can reuse the ones from our test split for inference.

!cp -r test/* inference/2021-06

**Run Inference**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='inference/2021-06' \
    test_filepath='hls_dataset.json' \
    train.batch_size=16 \
    test.mask_cloud=True \
    checkpoint_path='checkpoint-path' \
    mode=predict

## InstaGeo - Apps
Once inference has been completed on the HLS tiles and the results have been saved, we can use the `InstaGeo-Apps` module to visualize the predictions on an interactive map.

To visualize the results, simply move the HLS prediction GeoTIFF files to the appropriate directory, and `InstaGeo-Apps` will handle the rest, providing an intuitive and interactive mapping experience.

In [None]:
!mkdir -p predictions/2021/6
!mv inference/2021-06/predictions/* predictions/2021/6

In [None]:
!npm install localtunnel

In [None]:
!nohup streamlit run InstaGeo-E2E-Geospatial-ML/instageo/apps/app.py --server.address=localhost &

Retrieve your public IP address, which serves as the password for the localtunnel endpoint.

In [None]:
import urllib.request
print("Password/Endpoint IP for localtunnel is:",urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))

In [None]:
!npx localtunnel --port 8501

## Summary

In this notebook, we demonstrated the end-to-end capabilities of InstaGeo for geospatial machine learning using multispectral data. We began by downloading and processing HLS granules, creating data chips for training, and fine-tuning a model with the Prithvi backbone. Finally, we ran inference on test data and visualized the results using the `InstaGeo-Apps` module.

By leveraging InstaGeo, complex tasks such as data preprocessing, model training, and large-scale inference can be streamlined and efficiently handled with minimal configuration.

If you found this demo helpful, please consider giving our [InstaGeo GitHub repository](https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML) a star ⭐! Your support helps us continue improving the tool for the community.

Thank you for exploring InstaGeo with us!