# Pangaea Training Notebook

This notebook aims to help users train their desired models on whichever datasets they wish, by building a console command from user preferences. 

## Building the environment

1) Find the `environment.yml` file in the repo, and executing the command: `conda env create -f environment.yml`. Make sure you are in the ilab-pangaea-repo directory that you've cloned!
2) In the jupyterhub session, click on the top right area that says `Python [...]`, and select `conda env: conda-pangaea-bench`.
3) You can now run the notebook as needed!

## How to use this notebook

Each section (e.g. "X Options") refers to a different facet of using Foundation Models. Tasks, Datasets, Encoders, Decoders, and so on, all have their own section here. Each section is organized into subsections for consistency and readability. Below is a brief explanation of how the subsections function and relate to one another. 

#### Filenames subsection: 

This will contain a set of preconfigured code options that the user can choose from. The first cell will run a check in the repository for existing code, and will print out a list of options for the user to pick from. For example, the first section, called Task Options, will print out a list of the various tasks that users can perform with the FMs in this repo. Users will pick from this list to perform the task that they desire. This will hold True for any section in the notebook; if using code from the repo directly, one only needs to change the associated variable (`task` in our Task Options example).  

#### Parameters subsection: 

This section contains contents of .yaml files, which represent the configuration for a certain code behavior. The first variable controls the behavior of this entire subsection: if you wish to override config parameters or create your own, set the corresponding `override` variable to `True`. Otherwise, leave it as `False`.

Example: The Dataset Options section has a parameters subsection that contains multiple different variables that need to be edited by the user, if they wish to use their own dataset. After setting `ds_override = True`, users can now change other parameters of their choosing, which will override the stock config option. 

**Note: if you change variables but leave the override variable as `False`, your changes will not take place!**

Each variable in this subsection represents a configurable value -- <span style="color: red">each variable has its own format and other considerations, so be sure to read comments</span> (`# this is a comment`). *If users prefer, they can read the documentation (linked below) and edit .yaml files directly. The parameters subsection is designed to be easy to use, but some users may find the direct configuration of the yaml file more desirable.* 

#### Build config subsection: no edits needed! 

At the end of sections that allow for more customization, some code will run to incorporate user preferences into stock preferences. If no stock preferences were chosen (by leaving the associated variable as the empty string, `""`, in the filename section), then the configuration will be built entirely from user preferences. If the override variable in the parameters section was false, then the config will use the stock config directly. 

#### Build/execute command subsections: 

The very bottom of the notebook will run a command-line interface to begin training with the user configuration that was built in this notebook. This will include logging to display progress and any errors that may arise. 

## Questions?

If you have any questions, please read the [documentation](https://nasa-nccs-hpda.github.io/ilab-pangaea-bench/), which includes sections on troubleshooting and individual config file needs. 

## Setup

Import python libraries, clone repo into our current directory, and import from repo.

### Imports and cloning

In [1]:
import os
import torch
import subprocess
from hydra import initialize, compose
from omegaconf import OmegaConf
import copy
from pprint import pprint
import sys
import warnings

In [2]:
warnings.filterwarnings('ignore')
repo_name = "ilab-pangaea-bench"
if not os.path.exists(repo_name):
    subprocess.run(["git", "clone", "https://github.com/nasa-nccs-hpda/ilab-pangaea-bench.git"], 
                   check=True, stdout=subprocess.DEVNULL)
else:
    subprocess.run(["git", "-C", repo_name, "pull"], check=True, stdout=subprocess.DEVNULL)

subprocess.run([sys.executable, "-m", "pip", "install", "-e", "./ilab-pangaea-bench"], 
               check=True, stdout=subprocess.DEVNULL)
sys.path.append("ilab-pangaea-bench")

From https://github.com/nasa-nccs-hpda/ilab-pangaea-bench
   df1aeb0..83a208a  main       -> origin/main
[0m[33m  DEPRECATION: Legacy editable install of pangaea==1.0.0 from file:///panfs/ccds02/nobackup/people/ajkerr1/EO_FM/ilab-pangaea-bench/notebooks/ilab-pangaea-bench (setup.py develop) is deprecated. pip 25.3 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. Discussion can be found at https://github.com/pypa/pip/issues/11457[0m[33m
[0m

In [4]:
from pangaea.utils.train_utils import (build_config, get_folder_options, is_valid_override)

### Helper functions
Streamline some cell operations

In [None]:
def collect_config_variables(param_var_names=[]):
    """Create user parameters dict from global variable names.
    
    Allows for merging of default config and user params."""
    all_vars = globals()  # Get all variables in the notebook
    
    # Extract variables from user params into a dict
    params_dict = {}
    for var_name in param_var_names:
        if var_name in all_vars: 
            var = all_vars[var_name]
            if is_valid_override(var):  # Add user param if it's truthy (except booleans)
                params_dict[var_name] = var
            del globals()[var_name]
        else:
            print(f"Warning: Variable '{var_name}' not found in globals")
    
    return params_dict

def create_split_dict(split_name, list_of_classes, default_dict):
    """Create a dict for train/val/test split preprocessing. Used to generate config."""
    split_dict = copy.deepcopy(default_dict)
    split_dict["preprocessor_cfg"] = [
        {"_target_": f"pangaea.engine.data_preprocessor.{class_name}"} 
        for class_name in list_of_classes
    ]
    return split_dict

## Defining command options
Here is where users will pick from a list of options for training. The first cell in each subsection will generate a list of options, and the second cell will be where users will enter their choice from the list provided. 

**This is where you need to edit this notebook.**

### Task Options

#### Filename

In [5]:
# PRINT OPTIONS
task_options = get_folder_options("task")
print(task_options)

['change_detection', 'knn_probe', 'knn_probe_multi_label', 'linear_classification', 'linear_classification_multi_label', 'regression', 'segmentation']


In [6]:
task = "regression"

### Dataset Options

#### Filename

In [7]:
# PRINT OPTIONS
dataset_options = get_folder_options("dataset")
print(dataset_options)

['aboveshrubschm', 'ai4smallfarms', 'biomassters', 'croptypemapping', 'dynamicen', 'fivebillionpixels', 'fivebillionpixels_cross_sensors', 'hlsburnscars', 'landsatnlcd-Copy1', 'landsatnlcd', 'mados', 'mbigearthnet', 'mbrickkiln', 'mcashew-plantation', 'mchesapeake-landcover', 'meurosat', 'mforestnet', 'mneontree', 'mnz-cattle', 'mpv4ger-seg', 'mpv4ger', 'msa-crop-type', 'mso2sat', 'oceancolor', 'oceancolor_tm', 'oceancolorval', 'opencanopy', 'pastis', 'potsdam', 'sen1floods11', 'spacenet7', 'spacenet7cd', 'xview2']


In [8]:
dataset = "oceancolor"

#### Parameters

In [None]:
ds_override = False  # Set to True if you want to define your own functionality

# If making your own dataset
target_class = "CustomPangaeaDataset"
_target_ = f"pangaea.datasets.{target_class}"

# ALWAYS EDIT
root_path = "."  # Where to gather dataset files from on ADAPT
download_url = "https://yourdataURL.com"  # Where to download files from (will be downloaded to
img_size = 224  # Height and Width of individual dataset images (must be square)

In [10]:
# Change to match band numbers for inputs
# Bands are separated by modality; each modality is a key: value pair in the dictionary
bands = {
    "optical": ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B9", "B10", "B11", "B12"]
}

# REQUIRED FOR STOCK MIN-MAX SCALING PREPROCESSING
# As with bands above, these are separated by modality, add more k: v pairs if needed
data_mean = {
    "optical": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    # other modalities
}
data_std = {
    "optical": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    # other modalities
}
data_min = {
    "optical": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    # other modalities
}
data_max = {
    "optical": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
    # other modalities
}

In [11]:
# TASK-DEPENDANT OPTIONS
multi_temporal = False  # Whether input data is multi-temporal
multi_modal = False  # Whether input data is multimodal
num_classes = 1  # For classification/segmentation, change this value to your number of classes
classes = ["regression"]  # List of class names; only needs changing for classification or segmentation
distribution = [0]  # List of probability of encountering certain target values; only used for classification or segmentation

#### Build config - no need to edit this section

In [12]:
# Automatically populate variables
ds_var_names = [
    '_target_', 'root_path', 'download_url', 
    'img_size', 'distribution', 'bands', 'multi_temporal', 
    'multi_modal', 'num_classes', 'classes', 'data_mean', 
    'data_std', 'data_min', 'data_max'
]
ds_params = collect_config_variables(ds_var_names)

# Load stock config if defined
if (dataset):
    with initialize(config_path="ilab-pangaea-bench/configs/dataset", version_base=None):
        ds_cfg = compose(config_name=dataset)
else:
    ds_cfg = {}
    ds_override = True

# Merge user and stock configs depending on override preference
ds_cfg, ds_cfg_filename = build_config("dataset", ds_params, ds_cfg, ds_override)

original cfg: {'_target_': 'pangaea.datasets.ocean_color.OceanColorDataset', 'dataset_name': 'OceanColorDataset', 'root_path': '/explore/nobackup/people/ajkerr1/SatVision/OceanColor/chips/prithvi_chips/train', 'download_url': 'None', 'auto_download': False, 'img_size': 224, 'multi_temporal': False, 'multi_modal': False, 'ignore_index': -1, 'num_classes': 1, 'classes': ['regression'], 'distribution': [1.0], 'bands': {'optical': ['B8', 'B9', 'B10', 'B11', 'B12', 'B13']}, 'data_mean': {'optical': [0.485, 0.456, 0.404, 0.404, 0.404, 0.404]}, 'data_std': {'optical': [0.485, 0.456, 0.404, 0.404, 0.404, 0.404]}, 'data_min': {'optical': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, 'data_max': {'optical': [32767.0, 32767.0, 32767.0, 32767.0, 32767.0, 32767.0]}, 'num_inputs': 6, 'num_targets': 1, 'target_idx': 6}
user params: {'_target_': 'pangaea.datasets.CustomPangaeaDataset', 'root_path': '.', 'download_url': 'https://fakewebsite.com', 'img_size': 224, 'distribution': [0], 'bands': {'optical': ['B1', 'B2

### Encoder options

**Note**: encoders have very different architectures from one another. This means that the config files may look quite different from one to the other, so we have included the absolute bare minimum to editing the encoder file. Many other values are left as default values for existing encoders, and if making your own it's assumed you will know how to edit those parameters on your own (outside of this notebook). 

#### Filename

In [13]:
# PRINT OPTIONS
encoder_options = get_folder_options("encoder")
print(encoder_options)

['croma_joint', 'croma_optical', 'croma_sar', 'dofa', 'gfmswin', 'prithvi', 'remoteclip', 'resnet50_pretrained', 'resnet50_scratch', 'satlasnet_mi', 'satlasnet_si', 'scalemae', 'spectralgpt', 'ssl4eo_data2vec', 'ssl4eo_dino', 'ssl4eo_mae_optical', 'ssl4eo_mae_sar', 'ssl4eo_moco', 'terramind_large', 'unet_encoder', 'unet_encoder_mi', 'vit', 'vit_mi', 'vit_scratch']


In [14]:
# DEFINE YOUR ENCODER HERE FROM THE LIST ABOVE
encoder = "prithvi"

#### Parameters

<span style="color: red">Note: these are not exahaustive parameters needed to create a custom encoder. A python file must be created in the `pangaea/encoders` directory, as well as some more specific .yaml parameters. Check the [documentation](https://nasa-nccs-hpda.github.io/ilab-pangaea-bench/) and see some other encoder configuration files for examples.</span>

In [None]:
enc_override = False  # Set to True if you want to define your own functionality

# IF MAKING A CUSTOM ENCODER
enc_py_filename = ""  # .py filename of encoder you've written
enc_class = ""  # Name of python class in the .py file above
weights_filename = ""  # Filename of weights to load
download_url = "https://yourdataURL.com"  # URL to fetch weights from (e.g. huggingface)
output_layers = [3, 5, 7, 11]  # Indices of layers to output to decoder
output_dim = 1024  # Dimensionality of output to decoder

#### Build config

In [16]:
# Reformat user input to match config formatting, if non-empty string provided
_target_ = (f"pangaea.encoders.{target_py_filename}.{target_class}" 
           if (enc_py_filename and enc_class) 
           else "")

encoder_weights = (f"./pretrained_models/{weights_filename}" 
                  if weights_filename 
                  else "")

# Automatically populate variables
enc_var_names = [
    '_target_', 'encoder_weights', 'download_url',
    'output_layers', 'output_dim'
]
enc_params = collect_config_variables(enc_var_names)

# Load stock config if defined
if (encoder):
    with initialize(config_path="ilab-pangaea-bench/configs/encoder", version_base=None):
        enc_cfg = compose(config_name=encoder)
else:
    enc_cfg = {}
    enc_override = True

# Merge user and stock configs depending on override preference
enc_cfg, enc_cfg_filename = build_config("encoder", enc_params, enc_cfg, enc_override)

original cfg: {'_target_': 'pangaea.encoders.prithvi_encoder.Prithvi_Encoder', 'encoder_weights': './pretrained_models/Prithvi_EO_V2_300M.pt', 'download_url': 'https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M/resolve/main/Prithvi_EO_V2_300M.pt', 'embed_dim': 768, 'input_size': 224, 'in_chans': 6, 'patch_size': 16, 'num_heads': 12, 'depth': 12, 'mlp_ratio': 4, 'tubelet_size': 1, 'num_frames': '${dataset.multi_temporal}', 'input_bands': {'optical': ['B2', 'B3', 'B4', 'B8A', 'B11', 'B12']}, 'output_layers': [3, 5, 7, 11], 'output_dim': 768}
user params: {'download_url': 'https://fakewebsite.com', 'output_layers': [3, 5, 7, 11], 'output_dim': 1024}
merged cfg: {'_target_': 'pangaea.encoders.prithvi_encoder.Prithvi_Encoder', 'encoder_weights': './pretrained_models/Prithvi_EO_V2_300M.pt', 'download_url': 'https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M/resolve/main/Prithvi_EO_V2_300M.pt', 'embed_dim': 768, 'input_size': 224, 'in_chans': 6, 'patch_size': 16, 'num_

### Decoder options

#### Filename

In [17]:
# PRINT OPTIONS
decoder_options = get_folder_options("decoder")
print(decoder_options)

['cls_knn', 'cls_knn_multilabel', 'cls_linear', 'reg_unet', 'reg_upernet', 'reg_upernet_mt_linear', 'reg_upernet_mt_ltae', 'seg_siamunet_conc', 'seg_siamunet_diff', 'seg_siamupernet_conc', 'seg_siamupernet_diff', 'seg_unet', 'seg_upernet', 'seg_upernet_mt_linear', 'seg_upernet_mt_ltae', 'seg_upernet_mt_none']


In [18]:
# DEFINE YOUR DECODER HERE FROM THE LIST ABOVE
decoder = "reg_upernet"

#### Parameters

In [None]:
dec_override = False  # Set to True if you want to define your own functionality
dec_py_filename = "custom_decoder.py"
dec_class = "CustomDecoder"

# Feel free to add any custom options if creating your own decoder

#### Build config

In [20]:
# Automatically populate variables
_target_ = (f"pangaea.encoders.{dec_py_filename}.{dec_class}" 
            if dec_py_filename and dec_class else "")
dec_var_names = ["_target_"]
dec_params = collect_config_variables(dec_var_names)

# Load stock config if defined
if (decoder):
    with initialize(config_path="ilab-pangaea-bench/configs/decoder", version_base=None):
        dec_cfg = compose(config_name=decoder)
else:
    dec_cfg = {}
    dec_override = True

# Merge user and stock configs depending on override preference
dec_cfg, dec_cfg_filename = build_config("decoder", dec_params, dec_cfg, dec_override)

original cfg: {'_target_': 'pangaea.decoders.upernet.RegUPerNet', 'encoder': None, 'finetune': '${finetune}', 'channels': 512}
user params: {'_target_': 'pangaea.encoders.custom_decoder.py.CustomDecoder'}
merged cfg: {'_target_': 'pangaea.decoders.upernet.RegUPerNet', 'encoder': None, 'finetune': '${finetune}', 'channels': 512}
Config saved to regupernet.yaml


### Preprocessing Options
This config has a different format from all of the other configs; it's separated into train, val, and test fields. See the first cell under the "parameters" subsection for details. There are also further instructions on how to edit this config below. 

To edit preprocessing: edit each list of preprocessing targets to change the preprocessing behavior of the stock preprocessor; lists should contain base class names, which are found in pangaea/engine/data_preprocessor.py. 

Example: `train_preprocessing = ["RandomCropToEncoder", "BandFilter", ...]` 

NOT: `train_preprocessing = ["pangaea.engine.data_preprocessor.RandomCropToEncoder", "pangaea.engine.data_preprocessor.BandFilter", ...]`

#### Filename

In [21]:
# PRINT OPTIONS
preprocessing_options = get_folder_options("preprocessing")
print(preprocessing_options)

['cls_resize', 'pb_minmax', 'reg_default', 'seg_default', 'seg_focus_crop', 'seg_importance_crop', 'seg_resize', 'seg_wo_norm']


In [22]:
# DEFINE YOUR PREPROCESSING HERE FROM THE LIST ABOVE
preprocessing = "pb_minmax"

#### Parameters

In [None]:
pre_override = False  # Set to True if you want to define your own functionality

In [24]:
# Load stock config if defined
if (preprocessing):
    with initialize(config_path="ilab-pangaea-bench/configs/preprocessing", version_base=None):
        pre_cfg = compose(config_name=preprocessing)
else:
    pre_cfg = {}
    pre_override = True

# This will display the config contents, since preprocessing has a specific format
print(OmegaConf.to_yaml(pre_cfg))

train:
  _target_: pangaea.engine.data_preprocessor.Preprocessor
  preprocessor_cfg:
  - _target_: pangaea.engine.data_preprocessor.PBMinMaxNorm
val:
  _target_: pangaea.engine.data_preprocessor.Preprocessor
  preprocessor_cfg:
  - _target_: pangaea.engine.data_preprocessor.PBMinMaxNorm
test:
  _target_: pangaea.engine.data_preprocessor.Preprocessor
  preprocessor_cfg:
  - _target_: pangaea.engine.data_preprocessor.PBMinMaxNorm



In [25]:
train_preprocessing = []
val_preprocessing = []
test_preprocessing = []

#### Build config

In [26]:
# 3 sections: train, val, test; all have same basic layout, so we copy a default dict
preprocessor_target = "pangaea.engine.data_preprocessor.Preprocessor"
default_pre_dict = {"_target_": preprocessor_target, "preprocessor_cfg": []}

if train_preprocessing:
    train = create_split_dict("train", train_preprocessing, default_pre_dict)
if val_preprocessing:
    val = create_split_dict("val", val_preprocessing, default_pre_dict)
if test_preprocessing:
    test = create_split_dict("test", test_preprocessing, default_pre_dict)

# Automatically populate variables
pre_var_names = {"train", "val", "test"}
pre_params = collect_config_variables(pre_var_names)



In [27]:
# Merge user and stock configs depending on override preference
pre_cfg, pre_cfg_filename = build_config("preprocessing", pre_params, pre_cfg, pre_override)

original cfg: {'train': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}, 'val': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}, 'test': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}}
user params: {}
merged cfg: {'train': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}, 'val': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}, 'test': {'_target_': 'pangaea.engine.data_preprocessor.Preprocessor', 'preprocessor_cfg': [{'_target_': 'pangaea.engine.data_preprocessor.PBMinMaxNorm'}]}}
Config saved to cust

### Loss Options
Also known as a "criterion".

#### Filename

In [28]:
# PRINT OPTIONS
loss_fn_options = get_folder_options("criterion")
print(loss_fn_options)

['binary_cross_entropy', 'cross_entropy', 'dice', 'mse', 'none', 'spectral_spatial', 'weighted_cross_entropy']


In [29]:
# DEFINE YOUR LOSS FUNCTION HERE FROM THE LIST ABOVE
loss_fn = "mse"

#### Parameters

In [None]:
loss_override = False  # Set to True if you want to define your own functionality
_target_ = "pangaea.utils.losses.DICELoss"  # Can also be an external package, like torch.nn.MSELoss

#### Build config

In [31]:
# Automatically populate variables
loss_var_names = ["_target_"]
loss_params = collect_config_variables(loss_var_names)

# Load stock config if defined
if (loss_fn):
    with initialize(config_path="ilab-pangaea-bench/configs/criterion", version_base=None):
        loss_cfg = compose(config_name=loss_fn)
else:
    loss_cfg = {}

# Merge user and stock configs depending on override preference
loss_cfg, loss_cfg_filename = build_config("criterion", loss_params, loss_cfg, loss_override)

original cfg: {'_target_': 'torch.nn.MSELoss'}
user params: {'_target_': 'pangaea.utils.losses.DICELoss'}
merged cfg: {'_target_': 'torch.nn.MSELoss'}
Config saved to mseloss.yaml


### Optimizer, LR Scheduler Options
These are optional. If you don't want to include one or both, leave them as the empty string, `""`.

In [32]:
optimizer = "adamw"  # or "sgd"
lr_scheduler = "multi_step_lr" 

### GPU Acceleration Options
You do not need to edit this section unless you are an advanced user.

In [33]:
nnodes = 1
nproc_per_node = 1
script_path = "pangaea/run.py"

## Build and run command
Do NOT edit anything in this section, or the training command won't execute properly. As of right now, the logger doesn't output properly to the jupyter notebook. 

**If the command is taking a long time (sometimes hours!), the model is running the training loop.**

In [34]:
# Build the command
command = [
    "torchrun",
    f"--nnodes={nnodes}",
    f"--nproc_per_node={nproc_per_node}",
    "pangaea/run.py",
    "--config-name=train",
    f"dataset={ds_cfg_filename}",
    f"encoder={enc_cfg_filename}",
    f"decoder={dec_cfg_filename}",
    f"preprocessing={pre_cfg_filename}",
    f"criterion={loss_cfg_filename}",
    f"task={task}"
]

if optimizer:
    command.append(f"optimizer={optimizer}")
if lr_scheduler:
    command.append(f"lr_scheduler={lr_scheduler}")

In [35]:
try:
    os.chdir("./ilab-pangaea-bench")
    result = subprocess.run(command, 
                          capture_output=True, 
                          text=True, 
                          check=True)
    print("Training complete!")
except subprocess.CalledProcessError as e:
    print(f"Command failed with return code {e.returncode}")
    print("Output from stderr:", e.stderr)

KeyboardInterrupt: 