# Pangaea Training Notebook

This notebook helps users train a model of their choice on a dataset of their choice by building a console command from their preferences.

## Using this notebook

Each section (e.g. "X Options") refers to a different facet of using Foundation Models. Tasks, Datasets, Encoders, Decoders, and so on, all have their own section here. Each section is organized into subsections for consistency and readability. Below is a brief explanation of how the subsections function and relate to one another. 

**Filenames subsection**: 

This will contain a set of preconfigured code options that the user can choose from. The first cell will run a check in the repository for existing code, and will print out a list of options for the user to pick from. For example, the first section, called Task Options, will print out a list of the various tasks that users can perform with the FMs in this repo. Users will pick from this list to perform the task that they desire. This will hold true for any section in the notebook; if using code from the repo directly, one only needs to change the associated variable (`task` in our Task Options example).  

**Parameters subsection**: 

This subsection exposes the contents of the .yaml files that configure a given part of the pipeline. The Dataset Options section, for instance, has a parameters subsection with several variables that users must edit if they wish to use their own dataset. Each variable in this subsection represents a configurable value. <span style="color: red">Each variable has its own format and other considerations, so be sure to read the comments</span> (`# this is a comment`). *If users prefer, they can read the documentation (linked below) and edit the .yaml files directly. The parameters subsection is designed to be easy to use, but some users may prefer to configure the .yaml files directly.*

**Build config**: 

At the end of sections that allow for more customization, some code will run to incorporate user preferences with stock preferences. If no stock preferences were chosen (by leaving the associated variable as the empty string, `""`, in the filename section), then the configuration will be built entirely from user preferences. 
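
The idea of layering user preferences over stock preferences can be sketched with plain dictionaries. This is only an illustration: `merge_preferences`, `stock_cfg`, and `user_cfg` are hypothetical names, and the real `build_config` imported from `train_script` may behave differently.

```python
from copy import deepcopy


def merge_preferences(stock, user):
    """Recursively overlay user values on top of stock values (hypothetical helper)."""
    merged = deepcopy(stock)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries, so merge them key by key
            merged[key] = merge_preferences(merged[key], value)
        else:
            # Otherwise the user value replaces the stock value
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stock and user settings
stock_cfg = {"img_size": 224, "bands": {"optical": ["B2", "B3", "B4"]}}
user_cfg = {"img_size": 128, "root_path": "."}

merged_cfg = merge_preferences(stock_cfg, user_cfg)
print(merged_cfg)
```

With an empty `stock` dictionary, the result is simply a copy of the user preferences, which matches the case where the filename variable is left as `""`.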

**Final notebook section**: 

The very bottom of the notebook will run a command-line interface to begin training with the user configuration that was built in this notebook. This will include logging to display progress and any errors that may arise. 

## Questions?

If you have any questions, please read the [documentation](https://nasa-nccs-hpda.github.io/ilab-pangaea-bench/), which includes sections on troubleshooting and individual config file needs. 

## Setup

Imports and helper function definitions

### Imports

In [1]:
import os
import torch
import subprocess
from hydra import initialize, compose
from omegaconf import OmegaConf

In [2]:
from train_script import build_config

### get_folder_options
This function lists the available options for a given parameter by collecting the names of the .yaml files in the matching subfolder of `../configs`.

In [3]:
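def get_folder_options(base_path):
    # Each parameter has its own subfolder under ../configs (relative to this notebook)
    path = os.path.join("../configs", base_path)
    return [
        os.path.splitext(fn)[0]
        for fn in os.listdir(path)
        if fn.endswith(".yaml")  # match the extension, not ".yaml" anywhere in the name
    ]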
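
To show what kind of list this produces, here is a minimal, self-contained sketch that runs against a temporary folder instead of the repository. The helper name `list_yaml_stems` and the file names are hypothetical, and the sorting is an addition for readability rather than something the notebook does.

```python
import os
import tempfile


def list_yaml_stems(folder):
    """Return the sorted base names of files whose extension is exactly .yaml."""
    return sorted(
        os.path.splitext(fn)[0]
        for fn in os.listdir(folder)
        if os.path.splitext(fn)[1] == ".yaml"
    )


with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty files, only two of which are real .yaml configs
    for name in ("regression.yaml", "segmentation.yaml", "notes.yaml.bak", "README.md"):
        open(os.path.join(tmp, name), "w").close()
    stems = list_yaml_stems(tmp)

print(stems)
```

Comparing the extension directly means that a backup file such as `notes.yaml.bak` is not offered as an option.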

### collect_config_variables
Takes the global variables from this notebook and creates a parameters dictionary that will be used to create a config file. 

In [4]:
def collect_config_variables(config_var_names=()):
    """
    Collect user-defined configuration variables from the notebook globals.

    Each variable is removed from the global namespace once it has been
    processed, so a later section can safely reuse the same name
    (e.g. `download_url`).
    """
    # Get all variables from the global namespace
    all_vars = globals()

    # Extract only the config variables
    config_dict = {}
    for var_name in config_var_names:
        if var_name in all_vars:
            # Keep the value only if it is truthy and not a negative integer.
            # Note that "", 0, False, and empty containers are all skipped.
            var = all_vars[var_name]
            if var and not (isinstance(var, int) and var < 0):
                config_dict[var_name] = var
            del all_vars[var_name]
        else:
            print(f"Warning: Variable '{var_name}' not found in globals")

    return config_dict
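
The filtering rule used by this function is easy to misread, so the sketch below isolates it. `keep_value` and the `candidates` dictionary are hypothetical and only mirror the condition in the function body; the use of `-1` as a "not set" value is an assumption for illustration.

```python
def keep_value(value):
    """Mirror the filter: keep truthy values that are not negative integers."""
    return bool(value) and not (isinstance(value, int) and value < 0)


# Hypothetical notebook variables and the values a user might leave in them
candidates = {
    "img_size": 224,                  # kept: positive integer
    "dataset": "",                    # dropped: empty string
    "multi_temporal": False,          # dropped: False is falsy
    "num_classes": -1,                # dropped: negative integer
    "distribution": [0],              # kept: a non-empty list is truthy
    "bands": {"optical": ["B1"]},     # kept: non-empty dictionary
}

kept = {name: value for name, value in candidates.items() if keep_value(value)}
print(sorted(kept))
```

One consequence worth knowing is that a boolean set to `False` is never passed on, so with this rule a stock value of `True` cannot be overridden to `False` from the notebook.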

## Defining command options
Here is where users will pick from a list of options for training. The first cell in each subsection will generate a list of options, and the second cell will be where users will enter their choice from the list provided. 

**This is where you need to edit this notebook.**

### Task Options

#### Filename

In [5]:
task_options = get_folder_options("task")
print(task_options)

['change_detection', 'knn_probe', 'knn_probe_multi_label', 'linear_classification', 'linear_classification_multi_label', 'regression', 'segmentation']


In [6]:
task = "regression"

### Dataset Options

#### Filename

In [7]:
# PRINT OPTIONS
dataset_options = get_folder_options("dataset")
print(dataset_options)

['spacenet7', 'spacenet7cd', 'xview2', 'aboveshrubschm', 'ai4smallfarms', 'biomassters', 'croptypemapping', 'dynamicen', 'fivebillionpixels', 'fivebillionpixels_cross_sensors', 'hlsburnscars', 'landsatnlcd-Copy1', 'landsatnlcd', 'mados', 'mbigearthnet', 'mbrickkiln', 'mcashew-plantation', 'mchesapeake-landcover', 'meurosat', 'mforestnet', 'mneontree', 'mnz-cattle', 'mpv4ger-seg', 'mpv4ger', 'msa-crop-type', 'mso2sat', 'oceancolor', 'oceancolor_tm', 'oceancolorval', 'opencanopy', 'pastis', 'potsdam', 'sen1floods11']


In [8]:
dataset = ""  # Choose from the options listed above, or leave as "" to make your own dataset config

#### Parameters

In [10]:
# If making your own dataset
target_class = "CustomPangaeaDataset"
_target_ = f"pangaea.datasets.{target_class}"

In [11]:
# ALWAYS EDIT
root_path = "."  # Where to gather dataset files from on ADAPT
download_url = "https://fakewebsite.com"  # Where to download files from (will be downloaded to
img_size = 224  # Height and Width of individual dataset images (must be square)

# Bands are separated by modality; each modality is a key: value pair in the dictionary
bands = {
    "optical": [f"B{i}" for i in range(1, 13)]  # B1 thru B12
}

In [12]:
# TASK-DEPENDENT OPTIONS
multi_temporal = False  # Whether input data is multi-temporal
multi_modal = False  # Whether input data is multimodal
num_classes = 1  # For classification/segmentation, change this value to your number of classes
classes = ["regression"]  # List of class names; only needs changing for classification or segmentation
distribution = [0]  # List of probabilities of encountering each target value; only used for classification or segmentation

In [13]:
# REQUIRED FOR STOCK MIN-MAX SCALING PREPROCESSING
# As with bands above, these are separated by modality; add more key: value pairs if needed
data_mean = {
    "optical": [0.5] * 13
}
data_std = {
    "optical": [0] * 13
}
data_min = {
    "optical": [0.0] * 13
}
data_max = {
    "optical": [1.0] * 13
}
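
Because the statistics are stored per modality, a quick consistency check can catch mismatched list lengths before training starts. The sketch below assumes (this is not confirmed by the notebook) that each statistic should have one value per band; `check_band_stats` and the example values are hypothetical.

```python
def check_band_stats(bands, stats):
    """Return a list of human-readable problems; an empty list means the shapes agree."""
    problems = []
    for stat_name, per_modality in stats.items():
        for modality, band_names in bands.items():
            values = per_modality.get(modality)
            if values is None:
                problems.append(f"{stat_name} has no entry for modality '{modality}'")
            elif len(values) != len(band_names):
                problems.append(
                    f"{stat_name}['{modality}'] has {len(values)} values "
                    f"for {len(band_names)} bands"
                )
    return problems


# Hypothetical example: three bands, but data_std lists only two values
example_bands = {"optical": ["B2", "B3", "B4"]}
example_stats = {
    "data_mean": {"optical": [0.5, 0.5, 0.5]},
    "data_std": {"optical": [0.1, 0.1]},
}

band_problems = check_band_stats(example_bands, example_stats)
for problem in band_problems:
    print(problem)
```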

#### Build config 

In [None]:
# LOAD STOCK DATASET CONFIG IF DESIRED
if dataset:
    with initialize(config_path="../configs/dataset", version_base=None):
        ds_cfg = compose(config_name=dataset)
else:
    ds_cfg = {}

In [14]:
# COLLECT USER-DEFINED VARIABLES
ds_var_names = [
    '_target_', 'root_path', 'download_url', 
    'img_size', 'distribution', 'bands', 'multi_temporal', 
    'multi_modal', 'num_classes', 'classes', 'data_mean', 
    'data_std', 'data_min', 'data_max'
]
ds_params = collect_config_variables(ds_var_names)

In [15]:
# COMBINE STOCK AND USER-DEFINED INTO ONE CONFIG DICT
ds_cfg, ds_cfg_filename = build_config("dataset", ds_params, ds_cfg)

params: {'_target_': 'pangaea.datasets.OceanColorDataset', 'root_path': '.', 'download_url': 'https://fakewebsite.com', 'img_size': 224, 'distribution': [0], 'bands': {'optical': ['B1', 'B2', 'B3']}, 'multi_modal': True, 'num_classes': 1, 'classes': ['regression'], 'data_mean': {'optical': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}, 'data_std': {'optical': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}, 'data_min': {'optical': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, 'data_max': {'optical': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}
cfg: {}

### Encoder options

**Note**: encoders have very different architectures from one another, so their config files can look quite different. For that reason, this section includes only the bare minimum needed to edit an encoder config. Many other values are left at their defaults for existing encoders. If you are making your own encoder, it is assumed that you know how to edit those parameters yourself (outside of this notebook).

#### Filename

In [16]:
encoder_options = get_folder_options("encoder")
print(encoder_options)

['prithvi', 'terramind_large', 'unet_encoder', 'unet_encoder_mi', 'vit', 'vit_mi', 'vit_scratch', 'croma_joint', 'croma_optical', 'croma_sar', 'dofa', 'gfmswin', 'remoteclip', 'resnet50_pretrained', 'resnet50_scratch', 'satlasnet_mi', 'satlasnet_si', 'scalemae', 'spectralgpt', 'ssl4eo_data2vec', 'ssl4eo_dino', 'ssl4eo_mae_optical', 'ssl4eo_mae_sar', 'ssl4eo_moco']


In [17]:
# DEFINE YOUR ENCODER HERE FROM THE LIST ABOVE
encoder = "prithvi"

In [18]:
# LOAD STOCK ENCODER CONFIG
if encoder:
    with initialize(config_path="../configs/encoder", version_base=None):
        enc_cfg = compose(config_name=encoder)
else:
    enc_cfg = {}

#### Parameters

<span style="color: red">Note: this is not an exhaustive list of the parameters needed to create a custom encoder. A Python file must be created in the `pangaea/encoders` directory, and some more specific .yaml parameters are also required. Check the [documentation](https://nasa-nccs-hpda.github.io/ilab-pangaea-bench/) and see some other encoder configuration files for examples.</span>

In [19]:
# IF MAKING A CUSTOM ENCODER
target_py_filename = ""  # .py filename of encoder you've written
target_class = ""  # Name of python class in the .py file above
weights_filename = ""  # Filename of weights to load
download_url = "https://fakewebsite.com"  # URL to fetch weights from (e.g. huggingface)
output_layers = [3, 5, 7, 11]
output_dim = 1024

In [20]:
# Copied from dataset, no need to edit
input_size = ds_cfg["img_size"] 
input_bands = ds_cfg["bands"]

# NO NEED TO EDIT
# Reformats user input to match config formatting, if non-empty string provided
if target_py_filename and target_class:
    _target_ = f"pangaea.encoders.{target_py_filename}.{target_class}" 
else:
    _target_ = ""
if weights_filename:
    encoder_weights = f"./pretrained_models/{weights_filename}" 
else:
    encoder_weights = ""

#### Build config

In [21]:
enc_var_names = [
    '_target_', 'encoder_weights', 'download_url',
    'output_layers', 'output_dim'
]
enc_params = collect_config_variables(enc_var_names)



In [24]:
enc_cfg, enc_cfg_filename = build_config("encoder", enc_params, enc_cfg)

params: {'download_url': 'https://fakewebsite.com', 'embed_dim': 128, 'patch_size': 16, 'mlp_ratio': 4, 'output_layers': [3, 5, 7, 11]}
cfg: {'_target_': 'pangaea.encoders.prithvi_encoder.Prithvi_Encoder', 'encoder_weights': './pretrained_models/Prithvi_EO_V2_300M.pt', 'download_url': 'https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M/resolve/main/Prithvi_EO_V2_300M.pt', 'embed_dim': 768, 'input_size': 224, 'in_chans': 6, 'patch_size': 16, 'num_heads': 12, 'depth': 12, 'mlp_ratio': 4, 'tubelet_size': 1, 'num_frames': '${dataset.multi_temporal}', 'input_bands': {'optical': ['B2', 'B3', 'B4', 'B8A', 'B11', 'B12']}, 'output_layers': [3, 5, 7, 11], 'output_dim': 768}

### Decoder options

In [None]:
decoder_options = get_folder_options("decoder")
print(decoder_options)

In [None]:
# DEFINE YOUR DECODER HERE FROM THE LIST ABOVE
decoder = "reg_upernet"

### Training and task options

#### Preprocessing

In [None]:
preprocessing_options = get_folder_options("preprocessing")
print(preprocessing_options)

In [None]:
# DEFINE YOUR PREPROCESSING HERE FROM THE LIST ABOVE
preprocessing = "reg_default"

#### Loss function
Also known as a "criterion".

In [None]:
loss_fn_options = get_folder_options("criterion")
print(loss_fn_options)

In [None]:
# DEFINE YOUR LOSS FUNCTION HERE FROM THE LIST ABOVE
loss_fn = "reg_default"

### GPU Acceleration Options
You do not need to edit this section unless you are an advanced user.

In [None]:
nnodes = 1
nproc_per_node = 1
script_path = "../pangaea/run.py"
config_name = "train"

## Building the command
Do NOT edit anything in this section, or the training command won't execute properly. 

In [None]:
# Build the command
command = [
    "torchrun",
    f"--nnodes={nnodes}",
    f"--nproc_per_node={nproc_per_node}",
    script_path,
    f"--config-name={config_name}",
    f"dataset={dataset}",
    f"encoder={encoder}",
    f"decoder={decoder}",
    f"preprocessing={preprocessing}",
    f"criterion={loss_fn}",
    f"task={task}"
]
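
Before running anything, it can help to print the command as a single shell-style string and check it by eye. The sketch below uses the standard-library `shlex.join` on a hypothetical argument list; the values (including the dataset name with a space) are placeholders, not ones taken from this repository.

```python
import shlex

# Hypothetical argument list in the same shape as the command built above
example_command = [
    "torchrun",
    "--nnodes=1",
    "--nproc_per_node=1",
    "../pangaea/run.py",
    "--config-name=train",
    "dataset=my dataset",  # hypothetical value containing a space
    "task=regression",
]

# shlex.join quotes any argument that a shell would otherwise split
printable = shlex.join(example_command)
print(printable)
```

Arguments that contain spaces show up quoted in the printed string, which makes accidental whitespace in an option name easy to spot.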

## Running the command

In [None]:
try:
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        check=True,
    )
    print("Command executed successfully!")
    print("STDOUT:", result.stdout)
except subprocess.CalledProcessError as e:
    print(f"Command failed with return code {e.returncode}")
    # Output is captured, so show both streams to help with debugging
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
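
The cell above waits for the command to finish before printing anything, which can feel unresponsive during a long training run. As an alternative, the following minimal sketch streams output line by line with `subprocess.Popen`. The helper `run_and_stream` and the demo command are hypothetical; the demo runs a tiny Python one-liner rather than `torchrun` so that it works anywhere.

```python
import subprocess
import sys


def run_and_stream(cmd):
    """Run cmd, echo each output line as it arrives, and return (return code, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave errors with normal output
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set here
    return proc.returncode, lines


# Stand-in for the training command: a short Python one-liner
demo_cmd = [sys.executable, "-c", "print('epoch 1'); print('epoch 2')"]
return_code, output_lines = run_and_stream(demo_cmd)
```

A non-zero `return_code` signals failure, so the caller can decide whether to raise an error or just report it.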