# InstaGeo Demo

<a href="https://colab.research.google.com/github/instadeepai/InstaGeo-E2E-Geospatial-ML/blob/main/notebooks/InstaGeo_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the InstaGeo demo notebook! This tutorial showcases the capabilities of InstaGeo, an end-to-end package designed for geospatial machine learning with multispectral data.

In this demonstration, we use ground truth geospatial point observations for cropland classification in Rwanda. The notebook will guide you through the process of creating segmentation-like data from these observations, fine-tuning the [Prithvi](https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M) model, and finally visualizing the inference results on an interactive map.

By the end of this demo, you will gain hands-on experience with key InstaGeo functionalities and learn how it streamlines geospatial ML workflows from data preparation to model inference.

# Install InstaGeo

In [None]:
repository_url = "https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML"

!git clone {repository_url}

In [None]:
%%bash
cd InstaGeo-E2E-Geospatial-ML
pip install -e .[all]

## EarthData Login

InstaGeo currently supports multispectral data from NASA [Harmonized Landsat and Sentinel-2 (HLS)](https://hls.gsfc.nasa.gov/). Accessing HLS data requires an EarthData user account, which can be created on the [EarthData login page](https://urs.earthdata.nasa.gov/).

In [None]:
from getpass import getpass
import os

In [None]:
# Enter your EarthData user account credentials
USERNAME = getpass('Enter your EarthData username: ')
PASSWORD = getpass('Enter your EarthData password: ')

content = f"""machine urs.earthdata.nasa.gov login {USERNAME} password {PASSWORD}"""

with open(os.path.expanduser('~/.netrc'), 'w') as file:
    file.write(content)
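A quick way to confirm that a netrc file is readable is to parse it back with Python's standard `netrc` module. The sketch below is illustrative only: the helper names `write_netrc` and `read_login` are hypothetical, the credentials are placeholders, and it works in a temporary directory so that your real `~/.netrc` is left alone. It also restricts the file to the current user, which is a common precaution for files that hold passwords.

```python
import netrc
import os
import tempfile
from pathlib import Path

HOST = "urs.earthdata.nasa.gov"


def write_netrc(path, username, password, host=HOST):
    """Write a single-host netrc entry (hypothetical helper).

    Note: this overwrites the whole file, like the cell above.
    """
    path = Path(path)
    path.write_text(f"machine {host} login {username} password {password}\n")
    os.chmod(path, 0o600)  # readable and writable by the owner only
    return path


def read_login(path, host=HOST):
    """Return the login stored for `host`, or None if there is no entry."""
    entry = netrc.netrc(str(path)).authenticators(host)
    return None if entry is None else entry[0]


# Demonstrate on a throwaway file with placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_netrc(Path(tmp) / ".netrc", "demo_user", "demo_pass")
    demo_login = read_login(demo_path)

print(demo_login)
```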

## InstaGeo - Data

With InstaGeo installed and EarthData authentication configured, we are now ready to download and process HLS (Harmonized Landsat and Sentinel-2) granules using the `InstaGeo-Data` module. This module offers several functionalities for handling geospatial data, including:

- Searching and retrieving metadata for HLS granules
- Downloading specific spectral bands from HLS granules
- Generating data chips and corresponding target labels for machine learning tasks

These capabilities streamline the preprocessing of multispectral data, setting the foundation for efficient geospatial model development.



In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

The ground-truth geospatial observations for Rwanda cropland classification used in this notebook were sourced from the [Rwanda 2019 Crop/Non-Crop Labels (HarvestPortal)](https://data.harvestportal.org/dataset/rwanda-2019-crop-non-crop-labels) dataset. Run the following cell to download the data.

In [None]:
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/ed0ab379-a688-4419-ab96-181c726e1b22/download/ceo-2019-rwanda-cropland-sample-data-2021-04-20.csv
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/0cfc1320-f909-4759-90f9-cb5c92ca019e/download/ceo-2019-rwanda-cropland-rcmrd-set-1-sample-data-2021-04-20.csv
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/6675cc7e-e6da-4889-9905-60c0d5369ce6/download/ceo-2019-rwanda-cropland-rcmrd-set-2-sample-data-2021-04-20.csv

In [None]:
df1 = pd.read_csv("ceo-2019-rwanda-cropland-sample-data-2021-04-20.csv")
df2 = pd.read_csv("ceo-2019-rwanda-cropland-rcmrd-set-1-sample-data-2021-04-20.csv")
df3 = pd.read_csv("ceo-2019-rwanda-cropland-rcmrd-set-2-sample-data-2021-04-20.csv")

df = pd.concat([df1, df2, df3])

In [None]:
df = df[['lat', 'lon', 'collection_time', 'Crop/ or not', 'sample_id']]
df = df.rename({"lon": "x", "lat":"y", "Crop/ or not":'label', 'collection_time':"date"}, axis=1)
df.head(10)

In [None]:
def label_map(x):
    if x == "Cropland":
        return 1
    elif x == "Non-crop":
        return 0
    else:
        return np.nan

df['date'] = df['date'].map(lambda x: pd.to_datetime(x).strftime("%Y-%m-%d"))
df['label'] = df['label'].map(label_map)
df = df.dropna().reset_index(drop=True)  # drop=True avoids adding a leftover 'index' column
df.head(10)
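To see what the cleaning step does in isolation, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and the dictionary-based mapping is an alternative to `label_map`, not the notebook's own code. Any label outside the dictionary becomes `NaN` and is removed by `dropna`.

```python
import pandas as pd

# Toy data: the third label is deliberately not one of the two known classes.
raw = pd.DataFrame(
    {
        "label": ["Cropland", "Non-crop", "Unsure"],
        "date": [
            "2021-04-20 10:15:00",
            "2021-04-21 08:00:00",
            "2021-04-22 17:30:00",
        ],
    }
)

LABEL_TO_ID = {"Cropland": 1, "Non-crop": 0}

# Series.map with a dict yields NaN for keys that are not in the dict.
raw["label"] = raw["label"].map(LABEL_TO_ID)
# Keep only the calendar date as a YYYY-MM-DD string.
raw["date"] = pd.to_datetime(raw["date"]).dt.strftime("%Y-%m-%d")

cleaned = raw.dropna().reset_index(drop=True)
print(cleaned)
```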

In [None]:
print(f"The number of labeled observations in the aggregated dataset is: {df.shape[0]}")

**Optional**: For the sake of rapid experimentation, let's use a subset of the observations (for instance 10%), while keeping approximately the same distribution for the labels.

In [None]:
df = df.groupby('label', as_index=False).sample(frac=0.1).reset_index(drop=True)
print(f"The number of labeled observations in the subset is: {df.shape[0]}")
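The per-group sampling above keeps the class balance because the same fraction is drawn from each label. A small sketch on synthetic data (80 positive and 20 negative rows, invented for illustration) makes this visible. The `random_state` argument is an addition here, used only to make the draw repeatable.

```python
import pandas as pd

# Synthetic, imbalanced labels: 80% class 1 and 20% class 0.
toy = pd.DataFrame({"label": [1] * 80 + [0] * 20, "sample_id": range(100)})

# Draw 10% from each label group separately.
subset = toy.groupby("label").sample(frac=0.1, random_state=0)

counts = subset["label"].value_counts().to_dict()
print(counts)
```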

In [None]:
from sklearn.model_selection import train_test_split

train, val_and_test = train_test_split(df, test_size=0.3)
val, test = train_test_split(val_and_test, test_size=0.5)

# Report the number of rows per split (DataFrame.size would count rows x columns)
print(len(train), len(val), len(test))
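If you want the split to be repeatable and to preserve the label distribution in every subset, `train_test_split` accepts `random_state` and `stratify`. The following is a sketch on synthetic data, not a change to the cell above. The seed value 42 and the toy frame are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with an 80/20 class balance.
toy = pd.DataFrame({"label": [1] * 80 + [0] * 20, "x": range(100)})

# 70% train, then split the remaining 30% evenly into validation and test.
train_toy, rest = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)
val_toy, test_toy = train_test_split(
    rest, test_size=0.5, stratify=rest["label"], random_state=42
)

print(len(train_toy), len(val_toy), len(test_toy))
```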

In [None]:
train.to_csv("rwanda_cropland_data_train.csv")
val.to_csv("rwanda_cropland_data_val.csv")
test.to_csv("rwanda_cropland_data_test.csv")

After splitting the data into training, validation, and test sets, the next step is to group the data by the HLS granules they belong to and download the corresponding spectral bands for each granule. Once the bands are retrieved, we will generate smaller chips and target labels with dimensions of 256 x 256 pixels.

By the end of this process, the input data will have a shape of 3 x 6 x 256 x 256 (representing three sets of six spectral bands and 256 x 256 pixel chips), and the target labels will have a shape of 256 x 256.
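To make these shapes concrete, the sketch below builds an empty input array and a sparse target for one point observation. It is an illustration under assumptions rather than InstaGeo's implementation: the ignore value of `-1`, the helper `make_sparse_target`, and the idea that a window of 1 labels a 3 x 3 neighborhood are all assumptions made for this example.

```python
import numpy as np

NUM_STEPS, NUM_BANDS, CHIP_SIZE = 3, 6, 256
IGNORE_VALUE = -1  # assumed marker for pixels without a ground-truth label

# Input: three time steps, six spectral bands, 256 x 256 pixels.
chip = np.zeros((NUM_STEPS, NUM_BANDS, CHIP_SIZE, CHIP_SIZE), dtype=np.float32)


def make_sparse_target(row, col, label, size=CHIP_SIZE, window=1):
    """Label a small neighborhood around one observation (hypothetical helper)."""
    target = np.full((size, size), IGNORE_VALUE, dtype=np.int16)
    r0, r1 = max(row - window, 0), min(row + window + 1, size)
    c0, c1 = max(col - window, 0), min(col + window + 1, size)
    target[r0:r1, c0:c1] = label
    return target


target = make_sparse_target(row=100, col=50, label=1)
labeled_pixels = int((target == 1).sum())
print(chip.shape, target.shape, labeled_pixels)
```

Because only a handful of pixels carry a label, a loss function for this kind of target would typically skip the ignore value.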

While these tasks might seem complex, the `InstaGeo-Data` module abstracts this process, allowing you to configure it with a single command, as shown in the following cells.

### Training Split

In [None]:
%%bash
mkdir -p train
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_train.csv" \
    --output_directory="train" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog
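One way to read the temporal flags is as a set of date windows per observation. The sketch below assumes that the steps go backwards in time from the observation date in increments of `temporal_step` days and that `temporal_tolerance` widens each target date by that many days on either side. This interpretation and the helper `temporal_windows` are assumptions for illustration, so check the `chip_creator` help text for the authoritative definitions.

```python
from datetime import date, timedelta


def temporal_windows(obs_date, num_steps=3, temporal_step=30, temporal_tolerance=3):
    """Return (start, end) search windows, newest first (hypothetical helper)."""
    tolerance = timedelta(days=temporal_tolerance)
    windows = []
    for step in range(num_steps):
        # Assumed: each step looks `temporal_step` days further into the past.
        center = obs_date - timedelta(days=step * temporal_step)
        windows.append((center - tolerance, center + tolerance))
    return windows


windows = temporal_windows(date(2019, 6, 30))
for start, end in windows:
    print(start, "to", end)
```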

In [None]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "train/chips"))
chips = [chip.replace("chip", "train/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "train/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("train.csv")

In [None]:
print(f"The size of the train split: {df.shape[0]}")

### Validation Split

In [None]:
%%bash
mkdir -p val
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_val.csv" \
    --output_directory="val" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog

In [None]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "val/chips"))
chips = [chip.replace("chip", "val/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "val/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("val.csv")

In [None]:
print(f"The size of the validation split: {df.shape[0]}")

### Test Split

In [None]:
%%bash
mkdir -p test
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_test.csv" \
    --output_directory="test" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog

In [None]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "test/chips"))
chips = [chip.replace("chip", "test/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "test/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("test.csv")

In [None]:
print(f"The size of the test split: {df.shape[0]}")
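The three index-building cells follow the same pattern, so they could be folded into one function. The sketch below shows one possible version. `build_split_index` is a hypothetical helper, and the file names in the demonstration are invented. It assumes, as the cells above do, that each chip file name starts with `chip` and that its segmentation map has the same name with `chip` replaced by `seg_map`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_split_index(root, split):
    """Pair every chip in <root>/<split>/chips with its seg map path."""
    chips_dir = Path(root) / split / "chips"
    names = sorted(p.name for p in chips_dir.iterdir() if p.name.startswith("chip"))
    return pd.DataFrame(
        {
            "Input": [f"{split}/chips/{name}" for name in names],
            # Replace only the leading 'chip' prefix of the file name.
            "Label": [
                f"{split}/seg_maps/{name.replace('chip', 'seg_map', 1)}"
                for name in names
            ],
        }
    )


# Demonstrate on a temporary directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo_chips = Path(tmp) / "val" / "chips"
    demo_chips.mkdir(parents=True)
    for name in ("chip_b.tif", "chip_a.tif"):
        (demo_chips / name).touch()
    index = build_split_index(tmp, "val")

print(index)
```

In the notebook, a call such as `build_split_index(".", "train").to_csv("train.csv")` would reproduce the training cell.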

## InstaGeo - Model

After creating our dataset using the `InstaGeo-Data` module, we can move on to fine-tuning a model that includes a Prithvi backbone paired with a classification head. For regression tasks, the classification head can easily be replaced with a suitable regression head. Additionally, if a completely different model architecture is needed, it can be designed and implemented within this framework.

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path

**Launch Training**

First, compute the mean and standard deviation of the dataset and update the corresponding config file (in this case `locust.yaml`) with the reported values.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    mode=stats \
    train_filepath="train.csv"
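Conceptually, normalization statistics of this kind are one mean and one standard deviation per spectral band, taken over all chips, time steps, and pixels. The NumPy sketch below illustrates that idea on synthetic data. It is not InstaGeo's implementation, and both the `band_stats` helper and the `-9999` no-data value are assumptions made for this example.

```python
import numpy as np


def band_stats(chips, no_data=None):
    """Per-band mean and std for an array shaped (N, T, C, H, W)."""
    data = np.asarray(chips, dtype=np.float64)
    if no_data is not None:
        # Exclude no-data pixels from the statistics.
        data = np.where(data == no_data, np.nan, data)
    reduce_axes = (0, 1, 3, 4)  # everything except the band axis
    return np.nanmean(data, axis=reduce_axes), np.nanstd(data, axis=reduce_axes)


# Synthetic chips where band c is filled with the constant value c.
band_values = np.arange(6, dtype=np.float32).reshape(1, 1, 6, 1, 1)
chips = np.broadcast_to(band_values, (4, 3, 6, 8, 8)).copy()
chips[0, 0, :, 0, 0] = -9999  # simulate a few no-data pixels

mean, std = band_stats(chips, no_data=-9999)
print(mean, std)
```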


In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    train.num_epochs=5 \
    train_filepath="train.csv" \
    valid_filepath="val.csv"

**Run Model Evaluation**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    test_filepath="test.csv" \
    train.batch_size=8 \
    checkpoint_path='checkpoint-path' \
    mode=eval

**Set Up Inference Directory**

In [None]:
# !gsutil cp gs://instageo/utils/africa_prediction_template.csv .
!mkdir -p inference/2023-06

**Create Inference Data**

For inference, we only need to download the necessary HLS tiles and run inference directly using the sliding window inference feature.
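The idea behind sliding window inference can be sketched in a few lines: pad the tile so that its sides are multiples of the window size, predict window by window, and crop the result back to the original extent. The code below is a simplified illustration with non-overlapping windows and a stand-in `predict_fn`. It does not reflect how InstaGeo implements the feature, and `sliding_window_predict` is a hypothetical name.

```python
import numpy as np


def sliding_window_predict(tile, predict_fn, window=256):
    """Apply predict_fn to non-overlapping windows of a (C, H, W) tile."""
    _, height, width = tile.shape
    pad_h = (-height) % window
    pad_w = (-width) % window
    # Pad the bottom and right edges by repeating the border pixels.
    padded = np.pad(tile, ((0, 0), (0, pad_h), (0, pad_w)), mode="edge")
    output = np.zeros(padded.shape[1:], dtype=np.float32)
    for top in range(0, padded.shape[1], window):
        for left in range(0, padded.shape[2], window):
            patch = padded[:, top:top + window, left:left + window]
            output[top:top + window, left:left + window] = predict_fn(patch)
    # Crop away the padding so the output matches the input extent.
    return output[:height, :width]


# Stand-in "model": average over the channel axis.
rng = np.random.default_rng(0)
tile = rng.random((2, 300, 500), dtype=np.float32)
pred = sliding_window_predict(tile, lambda patch: patch.mean(axis=0))
print(pred.shape)
```

A production version would usually add overlapping windows and blend the overlaps to reduce edge artifacts.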

If you're running inference across the entire African continent, you can use the `africa_prediction_template.csv`, which will automatically download 2,120 HLS granules covering Africa and parts of Asia.

For this demo, we'll limit the scope to the HLS granules included in our test split.

Note: If you are running inference across Africa, ensure that you have approximately 1 TB of storage space available.

In [None]:
!python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_test.csv" \
    --output_directory="inference/2023-06" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=download-only

**Run Inference**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='inference/2023-06' \
    test_filepath='hls_dataset.json' \
    train.batch_size=16 \
    test.mask_cloud=True \
    checkpoint_path='checkpoint-path' \
    mode=sliding_inference

## InstaGeo - Apps
Once inference has been completed on the HLS tiles and the results have been saved, we can use the `InstaGeo-Apps` module to visualize the predictions on an interactive map.

To visualize the results, simply move the HLS prediction GeoTIFF files to the appropriate directory, and `InstaGeo-Apps` will handle the rest, providing an intuitive and interactive mapping experience.

In [None]:
!mkdir -p predictions/2023/6
!mv inference/2023-06/predictions/* predictions/2023/6
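The same staging step can be written in Python if you prefer to avoid shell commands. The sketch below mirrors the `predictions/<year>/<month>` layout used in the cell above. The helper `stage_predictions` and the file name `tile_a.tif` are hypothetical, and the demonstration runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def stage_predictions(src_dir, dst_root, year, month):
    """Move every GeoTIFF in src_dir to <dst_root>/<year>/<month>."""
    target = Path(dst_root) / str(year) / str(month)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for tif in sorted(Path(src_dir).glob("*.tif")):
        destination = target / tif.name
        shutil.move(str(tif), str(destination))
        moved.append(destination)
    return moved


# Demonstrate with an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "inference" / "2023-06" / "predictions"
    source.mkdir(parents=True)
    (source / "tile_a.tif").touch()
    staged = stage_predictions(source, Path(tmp) / "predictions", 2023, 6)
    staged_names = [path.relative_to(tmp).as_posix() for path in staged]
    source_is_empty = not any(source.iterdir())

print(staged_names)
```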

In [None]:
!npm install localtunnel

In [None]:
!nohup streamlit run InstaGeo-E2E-Geospatial-ML/instageo/apps/app.py --server.address=localhost &

Retrieve your public IP address, which serves as the password for the localtunnel page.

In [None]:
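import urllib.request  # importing urllib alone does not guarantee that the request submodule is loaded

public_ip = urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip()
print("Password/Endpoint IP for localtunnel is:", public_ip)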

In [None]:
!npx localtunnel --port 8501

## Summary

In this notebook, we demonstrated the end-to-end capabilities of InstaGeo for geospatial machine learning using multispectral data. We began by downloading and processing HLS granules, creating data chips for training, and fine-tuning a model with the Prithvi backbone. Finally, we ran inference on test data and visualized the results using the `InstaGeo-Apps` module.

By leveraging InstaGeo, complex tasks such as data preprocessing, model training, and large-scale inference can be streamlined and efficiently handled with minimal configuration.

If you found this demo helpful, please consider giving our [InstaGeo GitHub repository](https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML) a star ⭐! Your support helps us continue improving the tool for the community.

Thank you for exploring InstaGeo with us!