Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.

# Custom Trainers

_Written by: Caleb Robinson_

In this tutorial, we demonstrate how to extend a TorchGeo ["trainer class"](https://torchgeo.readthedocs.io/en/latest/api/trainers.html). TorchGeo provides several trainer classes: pre-made PyTorch Lightning Modules that make it easy to train models for tasks such as semantic segmentation, classification, and change detection using TorchGeo's [prebuilt DataModules](https://torchgeo.readthedocs.io/en/latest/api/datamodules.html). While the trainers aim to provide sensible defaults and customization options for common tasks, they cannot cover every situation (e.g. researchers will likely want to use their own architectures, loss functions, optimizers, etc. in the training routine). In such a situation, you can extend the trainer class you are interested in and write custom logic to override the default functionality.

This tutorial shows how to do exactly this to customize a learning rate schedule, logging, and model checkpointing for a semantic segmentation task using the [LandCover.ai](https://landcover.ai.linuxpolska.com/) dataset.

It's recommended to run this notebook on Google Colab if you don't have your own GPU. Click the "Open in Colab" button above to get started.

## Setup

As always, we install TorchGeo.

In [None]:
%pip install torchgeo

## Imports

Next, we import TorchGeo and any other libraries we need.

In [1]:
# Get rid of the pesky warnings raised by kornia
# UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
import warnings
from collections.abc import Sequence
from typing import Any

warnings.filterwarnings('ignore', category=UserWarning, module='torch.nn.functional')
warnings.filterwarnings('ignore', category=FutureWarning)

In [2]:
import os

import lightning
import lightning.pytorch as pl
import torch
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.callbacks.callback import Callback
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchmetrics import MetricCollection
from torchmetrics.classification import (
    Accuracy,
    FBetaScore,
    JaccardIndex,
    Precision,
    Recall,
)

from torchgeo.datamodules import LandCoverAI100DataModule
from torchgeo.trainers import SemanticSegmentationTask

## Custom SemanticSegmentationTask

Now, we create a `CustomSemanticSegmentationTask` class that inherits from `SemanticSegmentationTask` and overrides a few methods:
- `__init__`: We add two new parameters `tmax` and `eta_min` to control the learning rate scheduler
- `configure_optimizers`: We use the `CosineAnnealingLR` learning rate scheduler instead of the default `ReduceLROnPlateau`
- `configure_metrics`: We add a "MeanIoU" metric (what we will use to evaluate the model's performance) and a variety of other classification metrics
- `configure_callbacks`: We demonstrate how to stack `ModelCheckpoint` callbacks to save the best checkpoint as well as periodic checkpoints
- `on_train_epoch_start`: We log the learning rate at the start of each epoch so we can easily see how it decays over a training run

Overall, these overrides demonstrate how to customize the training routine to investigate specific research questions (e.g. the effect of the learning rate scheduler on test performance).

In [3]:
class CustomSemanticSegmentationTask(SemanticSegmentationTask):
    # any keywords we add here between *args and **kwargs will be found in self.hparams
    def __init__(
        self, *args: Any, tmax: int = 50, eta_min: float = 1e-6, **kwargs: Any
    ) -> None:
        super().__init__(*args, **kwargs)  # pass args and kwargs to the parent class

    def configure_optimizers(
        self,
    ) -> 'lightning.pytorch.utilities.types.OptimizerLRSchedulerConfig':
        """Initialize the optimizer and learning rate scheduler.

        Returns:
            Optimizer and learning rate scheduler.
        """
        tmax: int = self.hparams['tmax']
        eta_min: float = self.hparams['eta_min']

        optimizer = AdamW(self.parameters(), lr=self.hparams['lr'])
        scheduler = CosineAnnealingLR(optimizer, T_max=tmax, eta_min=eta_min)
        return {
            'optimizer': optimizer,
            'lr_scheduler': {'scheduler': scheduler, 'monitor': self.monitor},
        }

    def configure_metrics(self) -> None:
        """Initialize the performance metrics."""
        num_classes: int = self.hparams['num_classes']

        self.train_metrics = MetricCollection(
            {
                'OverallAccuracy': Accuracy(
                    task='multiclass', num_classes=num_classes, average='micro'
                ),
                'OverallPrecision': Precision(
                    task='multiclass', num_classes=num_classes, average='micro'
                ),
                'OverallRecall': Recall(
                    task='multiclass', num_classes=num_classes, average='micro'
                ),
                'OverallF1Score': FBetaScore(
                    task='multiclass',
                    num_classes=num_classes,
                    beta=1.0,
                    average='micro',
                ),
                'MeanIoU': JaccardIndex(
                    num_classes=num_classes, task='multiclass', average='macro'
                ),
            },
            prefix='train_',
        )
        self.val_metrics = self.train_metrics.clone(prefix='val_')
        self.test_metrics = self.train_metrics.clone(prefix='test_')

    def configure_callbacks(self) -> Sequence[Callback] | Callback:
        """Initialize callbacks for saving the best and latest models.

        Returns:
            List of callbacks to apply.
        """
        return [
            ModelCheckpoint(every_n_epochs=50, save_top_k=-1, save_last=True),
            ModelCheckpoint(monitor=self.monitor, mode=self.mode, save_top_k=5),
        ]

    def on_train_epoch_start(self) -> None:
        """Log the learning rate at the start of each training epoch."""
        optimizers = self.optimizers()
        if isinstance(optimizers, list):
            lr = optimizers[0].param_groups[0]['lr']
        else:
            lr = optimizers.param_groups[0]['lr']
        self.logger.experiment.add_scalar('lr', lr, self.current_epoch)  # type: ignore
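
To build intuition for what the logged `lr` curve should look like, the following is a minimal sketch of the closed-form cosine annealing schedule, assuming the scheduler is stepped once per epoch (Lightning's default interval) and that we only look at epochs up to `tmax`. The helper name `cosine_annealing_lr` is hypothetical and not part of TorchGeo or PyTorch; it only mirrors the formula that `CosineAnnealingLR` documents.

```python
import math


def cosine_annealing_lr(
    epoch: int, base_lr: float, t_max: int, eta_min: float
) -> float:
    """Closed-form cosine annealing value for epochs 0..t_max (hypothetical helper)."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Values matching the task used in this tutorial: lr=1e-3, tmax=50, eta_min=1e-6
for epoch in (0, 10, 25, 40, 50):
    lr = cosine_annealing_lr(epoch, base_lr=1e-3, t_max=50, eta_min=1e-6)
    print(f'epoch {epoch:2d}: lr = {lr:.6e}')
```

The schedule starts at the base learning rate, passes roughly halfway between `lr` and `eta_min` at epoch `tmax / 2`, and reaches `eta_min` at epoch `tmax`. Comparing these values with the `lr` scalar logged in `on_train_epoch_start` is a quick sanity check that the scheduler is configured as intended.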

## Train model

The remainder of the tutorial is straightforward and follows the typical [PyTorch Lightning](https://lightning.ai/) training routine. We instantiate a `DataModule` for the LandCover.AI 100 dataset (a small version of the LandCover.AI dataset for notebook testing), instantiate a `CustomSemanticSegmentationTask` that uses a U-Net with a ResNet-50 backbone, then train the model using a Lightning trainer.

In [4]:
dm = LandCoverAI100DataModule(root='data', batch_size=10, num_workers=2, download=True)

# You can use the following for actual training runs
# from torchgeo.datamodules import LandCoverAIDataModule
# dm = LandCoverAIDataModule(root='data', batch_size=64, num_workers=8, download=True)

In [5]:
task = CustomSemanticSegmentationTask(
    model='unet',
    backbone='resnet50',
    weights=True,
    in_channels=3,
    num_classes=6,
    loss='ce',
    lr=1e-3,
    tmax=50,
)

Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/davrob/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 85.1MB/s]


In [6]:
# validate that the task's hyperparameters are as expected
task.hparams

"backbone":        resnet50
"class_weights":   None
"eta_min":         1e-06
"freeze_backbone": False
"freeze_decoder":  False
"ignore_index":    None
"in_channels":     3
"loss":            ce
"lr":              0.001
"model":           unet
"num_classes":     6
"num_filters":     3
"patience":        10
"tmax":            50

In [7]:
# The following Trainer config is useful just for testing the code in this notebook.
trainer = pl.Trainer(
    limit_train_batches=1,
    limit_val_batches=1,
    num_sanity_val_steps=0,
    max_epochs=1,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
)
# You can use the following for actual training runs.
# trainer = pl.Trainer(min_epochs=150, max_epochs=250, log_every_n_steps=50)

Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.


In [8]:
trainer.fit(task, dm)

The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: ModelCheckpoint


Downloading https://cdn-lfs-us-1.hf.co/repos/76/99/7699a6c85994316c8a0bbf95d41627e5f1b3ea8501f66f73c0e2f53eb0afec45/bfe5bcf501a54cfd8ebf985346da50be5e8b751d3491812cd0c226b5a3abff41?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27landcoverai100.zip%3B+filename%3D%22landcoverai100.zip%22%3B&response-content-type=application%2Fzip&Expires=1739923030&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczOTkyMzAzMH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzc2Lzk5Lzc2OTlhNmM4NTk5NDMxNmM4YTBiYmY5NWQ0MTYyN2U1ZjFiM2VhODUwMWY2NmY3M2MwZTJmNTNlYjBhZmVjNDUvYmZlNWJjZjUwMWE1NGNmZDhlYmY5ODUzNDZkYTUwYmU1ZThiNzUxZDM0OTE4MTJjZDBjMjI2YjVhM2FiZmY0MT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=RGUUjudREyJZqzMW4JtvPeXoaY9Gv0-UKN52uotxrgpzKTXxLIlwU6fM6GEqfxmSWDq5D%7EoAqrv0vrH68SLQZl17Uud3TYFj6-rmt0D7I%7EpGzc7t2ZVKtSrGGVPGNzkoqILO642PDymktCuw7-rOWg7R0rxdLXcHTwvkeXwIKdz7GmTXX5etNMN9x%7EF0sgScLSqkx

100%|██████████| 9.28M/9.28M [00:00<00:00, 40.1MB/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name          | Type             | Params | Mode 
-----------------------------------------------------------
0 | model         | Unet             | 32.5 M | train
1 | criterion     | CrossEntropyLoss | 0      | train
2 | train_metrics | MetricCollection | 0      | train
3 | val_metrics   | MetricCollection | 0      | train
4 | test_metrics  | MetricCollection | 0      | train
-----------------------------------------------------------
32.5 M    Trainable params
0         Non-trainable params
32.5 M    Total params
130.087   Total estimated model params size (MB)
242       Modules in train mode
0         Modules in eval mode
/opt/conda/envs/geo/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see l

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.


## Test model

Finally, we test the model (optionally loading from a previously saved checkpoint).

In [9]:
# You can load directly from a saved checkpoint with `.load_from_checkpoint(...)`
# Note that you can also just call `trainer.test(task, dm)` if you've already trained
# the model in the current notebook session.

task = CustomSemanticSegmentationTask.load_from_checkpoint(
    os.path.join('lightning_logs', 'version_0', 'checkpoints', 'epoch=0-step=1.ckpt')
)
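
The path above is hard-coded to the first run (`version_0`), so it will point at a stale or missing file once you have trained more than once. As a rough sketch, assuming the default `lightning_logs/version_*/checkpoints/` layout used above, a small helper can pick the most recently written checkpoint instead. The name `find_latest_checkpoint` is hypothetical and is not a Lightning or TorchGeo function.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(log_dir: str = 'lightning_logs') -> Optional[Path]:
    """Return the most recently modified .ckpt file under log_dir, or None."""
    checkpoints = list(Path(log_dir).glob('version_*/checkpoints/*.ckpt'))
    if not checkpoints:
        return None
    # Newest by modification time; this is not necessarily the best-scoring model
    return max(checkpoints, key=lambda path: path.stat().st_mtime)


ckpt_path = find_latest_checkpoint()
if ckpt_path is not None:
    print(f'Most recent checkpoint: {ckpt_path}')
```

The returned path can then be passed to `CustomSemanticSegmentationTask.load_from_checkpoint(...)` in place of the hard-coded one. Note that "most recent" is chosen by file modification time, so to evaluate the best model rather than the latest one you would still need to select the checkpoint by its monitored metric.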

In [10]:
trainer.test(task, dm)

The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: ModelCheckpoint
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]


Testing: |          | 0/? [00:00<?, ?it/s]

[{'test_loss': 3.8608505725860596,
  'test_MeanIoU': 0.002409903099760413,
  'test_OverallAccuracy': 0.008463033474981785,
  'test_OverallF1Score': 0.008463033474981785,
  'test_OverallPrecision': 0.008463033474981785,
  'test_OverallRecall': 0.008463033474981785}]
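
Notice that the four "Overall" metrics above are identical. This is expected rather than a bug: with `average='micro'` on a single-label multiclass problem, every misclassified pixel counts as exactly one false positive (for the predicted class) and one false negative (for the true class), so micro precision, recall, and F1 all reduce to accuracy. The following toy example sketches this in plain Python, assuming every pixel has exactly one label and none is ignored. The labels are made up for illustration and `micro_metrics` is a hypothetical helper, not a torchmetrics function.

```python
def micro_metrics(
    targets: list[int], preds: list[int], num_classes: int
) -> tuple[float, float, float, float]:
    """Compute accuracy and micro-averaged precision, recall, and F1 by hand."""
    tp = fp = fn = 0
    # Sum true positives, false positives, and false negatives over all classes
    for c in range(num_classes):
        for t, p in zip(targets, preds):
            tp += int(p == c and t == c)
            fp += int(p == c and t != c)
            fn += int(p != c and t == c)
    accuracy = sum(t == p for t, p in zip(targets, preds)) / len(targets)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Assumes at least one correct prediction, otherwise this divides by zero
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up labels for eight "pixels" and four classes
targets = [0, 0, 1, 1, 2, 2, 2, 3]
preds = [0, 1, 1, 1, 2, 0, 2, 2]
print(micro_metrics(targets, preds, num_classes=4))
```

All four returned values coincide for this input. If you want the extra metrics to carry independent information, one option is to use `average='macro'` for some of them, as is already done for `MeanIoU`.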