Add command line script #1559

Merged: 26 commits into microsoft:main from scripts/lightningcli, Sep 27, 2023
Conversation

@adamjstewart (Collaborator) commented Sep 12, 2023

This PR replaces our train.py script with a torchgeo script installed by pip. It can be invoked in two ways:

# If torchgeo has been installed
torchgeo
# If torchgeo has been installed, or if it has been cloned to the current directory
python3 -m torchgeo

It is based on LightningCLI and supports command-line configuration or YAML/JSON config files. Valid options can be found in the help messages:

# See valid stages
torchgeo --help
# See valid trainer options
torchgeo fit --help
# See valid model options
torchgeo fit --model.help ClassificationTask
# See valid data options
torchgeo fit --data.help EuroSAT100DataModule

An example of the script in action:

# Train and validate a model
torchgeo fit --config tests/conf/eurosat100.yaml --trainer.max_epochs=1
# Validate-only
torchgeo validate --config tests/conf/eurosat100.yaml
# Calculate and report test accuracy
torchgeo test --config tests/conf/eurosat100.yaml
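
For reference, a LightningCLI-style config for the commands above might look roughly like this; it is a hypothetical sketch, and the real tests/conf/eurosat100.yaml may use different names and values:

# Hypothetical sketch of a LightningCLI config file
model:
  class_path: torchgeo.trainers.ClassificationTask
  init_args:
    model: resnet18
    in_channels: 13
    num_classes: 10
data:
  class_path: torchgeo.datamodules.EuroSAT100DataModule
  init_args:
    batch_size: 8
trainer:
  max_epochs: 1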

It can also be imported and used in a Python script if you need to extend it to add new features:

from torchgeo.main import main

main(["fit", "--config", "tests/conf/eurosat100.yaml"])

Documentation

  • Setuptools packaging
  • LightningCLI: https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html

Closes #228
Closes #1352

@adamjstewart added this to the 0.5.0 milestone Sep 12, 2023
@github-actions bot added the testing, dependencies, scripts, and trainers labels Sep 12, 2023
@github-actions bot added the datasets and datamodules labels Sep 14, 2023
@calebrob6 (Member)

Why do we want this vs. a train.py script?

@adamjstewart (Collaborator, Author)

I think there are two ways to interpret your question:

Why bother with an entry point, what's wrong with train.py?

It's important to distinguish between TorchGeo (the Python library) and TorchGeo (the GitHub repo). Developers like you and me primarily install TorchGeo (the GitHub repo) using:

> git clone https://github.com/microsoft/torchgeo.git

This also gives you train.py, the unit tests, some conf files, and experiment code. However, the majority of users will instead install TorchGeo (the Python library) using:

> pip install torchgeo  # or conda/spack

The primary advantage of an entry point is that the training script now gets installed for all users, not just developers. Other advantages:

  • It's easier to unit test and collect coverage reports for
  • It's an official part of our stable API and can be documented in our API docs
  • It's easier to use and document in tutorials
  • It can be extended in custom scripts (from torchgeo.main import main; ...); see the sketch below
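
A minimal sketch of that last point, assuming only what is shown in this PR (main takes the same arguments you would pass on the command line; the config filename is hypothetical):

# custom_train.py: wrap the torchgeo entry point with a fixed config
import sys

from torchgeo.main import main

# Forward any extra command-line arguments on top of the fixed config
main(["fit", "--config", "my_experiment.yaml", *sys.argv[1:]])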

What's the advantage of LightningCLI over our current custom omegaconf/hydra code?

There are a ton of advantages of LightningCLI over our current implementation:

  • Runtime verification: all init_args undergo runtime type verification! Use dict_kwargs for things that can't be verified, like dataset kwargs. See Add config file schema #1352
  • Self-documenting: use torchgeo fit --help for all valid trainer options, torchgeo fit --model.help ClassificationTask for model options, and torchgeo fit --data.help EuroSAT100DataModule for data module options
  • Less code: automatically handles merging multiple config files with CLI args
  • Override optimizers: can override the default optimizer or lr scheduler from config/CLI for experimentation (see the example after this list)
  • JSON support: probably most interesting for web app integration
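
As an illustration of the optimizer override mentioned above, LightningCLI accepts command-line overrides along these lines; the flags and values here are illustrative rather than taken from this PR:

# Illustrative only: swap the optimizer and LR scheduler from the CLI
torchgeo fit --config tests/conf/eurosat100.yaml \
    --optimizer torch.optim.AdamW --optimizer.lr 0.0001 \
    --lr_scheduler CosineAnnealingLR --lr_scheduler.T_max 10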

LightningCLI has a ton of exciting features. I would encourage everyone to read through all of https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html before deciding whether or not it is useful.

@calebrob6 (Member) commented Sep 22, 2023

Got it, I don't have a problem with this in principle. This is just a huge PR and I haven't wrapped my head around what you get from LightningCLI.

@adamjstewart (Collaborator, Author)

If you ignore the config files and tests it's actually pretty small. But yeah, definitely takes a bit to wrap your head around. But I think this will put us more in line with other Lightning projects. The config file format will be the same for TorchGeo and every other project that uses LightningCLI.

@adamjstewart marked this pull request as ready for review September 23, 2023 19:05
@calebrob6 (Member) left a comment


I just played around with this and I really like it, e.g. Lightning parallelized over 4 GPUs by default, and I trained RESISC45 for 10 epochs in like 2 minutes.

Some things that aren't clear to me:

  • How do I get it to save model checkpoints somewhere?
  • How can I name the experiment run something that makes sense?
  • When running torchgeo validate --config conf/something.yaml, how do I specify that it should validate with a model checkpoint? Currently it just uses the default model. It also creates a new tensorboard log entry.

Some things that I noticed:

  • Train classification accuracy and the other metrics aren't being logged during training, validation accuracy isn't being computed during validation, etc.

I really want to add a docs page to go along with this to help get people started. Maybe we should implement a new requirement on big PRs that you have to write a little mini tutorial to go along with it?

@adamjstewart (Collaborator, Author) commented Sep 25, 2023

Preliminary answers before I start fixing things:

How do I get it to save model checkpoints somewhere?

You would override --trainer.default_root_dir. It defaults to the current directory.

How can I name the experiment run something that makes sense?

Still need to add the ability to customize that. It defaults to lightning_logs. It's not hard, just want to make sure I'm not missing something builtin.

how do I specify that it should validate with a model checkpoint?

You would use the ckpt_path or model.init_args.weights parameter in config or on the CLI.
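
For example (the checkpoint path below is just a placeholder):

# Validate a previously trained model from a saved checkpoint
torchgeo validate --config tests/conf/eurosat100.yaml --ckpt_path path/to/checkpoint.ckpt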

Train classification accuracy and the other metrics aren't being logged during training, validation accuracy isn't being computed during validation, etc.

Will fix.

I really want to add a docs page to go along with this to help get people started. Maybe we should implement a new requirement on big PRs that you have to write a little mini tutorial to go along with it?

I've been really lazy with this because I keep telling myself to put it off until we start writing tons of tutorials. This one will be particularly hard because it's all command-line commands. But I think it will still work.

@adamjstewart (Collaborator, Author) commented Sep 25, 2023

Before I accidentally delete this file, I did do a survey of how other common CLI tools organize their code:

# black

# pyproject.toml
[project.scripts]
black = "black:patched_main"
blackd = "blackd:patched_main [d]"

# src/black/__main__.py
from black import patched_main

patched_main()

# src/black/__init__.py
def patched_main() -> None:
    # PyInstaller patches multiprocessing to need freeze_support() even in non-Windows
    # environments so just assume we always need to call it if frozen.
    if getattr(sys, "frozen", False):
        from multiprocessing import freeze_support

        freeze_support()

    patch_click()
    main()


if __name__ == "__main__":
    patched_main()

# flake8

# setup.cfg
[options.entry_points]
console_scripts =
        flake8 = flake8.main.cli:main

# src/flake8/__main__.py
from flake8.main.cli import main

if __name__ == "__main__":
    raise SystemExit(main())

# src/flake8/main/cli.py
from flake8.main import application


def main(argv: Sequence[str] | None = None) -> int:
    if argv is None:
        argv = sys.argv[1:]

    app = application.Application()
    app.run(argv)
    return app.exit_code()

# isort

# pyproject.toml
[tool.poetry.scripts]
isort = "isort.main:main"
isort-identify-imports = "isort.main:identify_imports_main"

# isort/__main__.py
from isort.main import main

main()

# isort/main.py
def main():
    pass

if __name__ == "__main__":
    main()

# pytest

# setup.cfg
[options.entry_points]
console_scripts =
        pytest=pytest:console_main
        py.test=pytest:console_main

# src/pytest/__main__.py
import pytest

if __name__ == "__main__":
    raise SystemExit(pytest.console_main())

# src/_pytest/config/__init__.py
def main():
    ...

def console_main():
    try:
        main()
    except:
        ...

# pydocstyle

# pyproject.toml
[tool.poetry.scripts]
pydocstyle = "pydocstyle.cli:main"

# src/pydocstyle/__main__.py
def main() -> None:
    from pydocstyle import cli

    cli.main()

if __name__ == '__main__':
    main()

# src/pydocstyle/cli.py
def main():
    """Run pydocstyle as a script."""
    try:
        sys.exit(run_pydocstyle())
    except KeyboardInterrupt:
        pass

# pyupgrade

# setup.cfg
[options.entry_points]
console_scripts =
        pyupgrade = pyupgrade._main:main

# pyupgrade/__main__.py
from pyupgrade._main import main

if __name__ == '__main__':
    raise SystemExit(main())

# pyupgrade/_main.py
def main():
    ...

if __name__ == '__main__':
    raise SystemExit(main())

# mypy

# setup.py
    entry_points={
        "console_scripts": [
            "mypy=mypy.__main__:console_entry",
            "stubgen=mypy.stubgen:main",
            "stubtest=mypy.stubtest:main",
            "dmypy=mypy.dmypy.client:console_entry",
            "mypyc=mypyc.__main__:main",
        ]
    },

# mypy/__main__.py
def console_entry() -> None:
    try:
        main()
        sys.stdout.flush()
        sys.stderr.flush()
    except BrokenPipeError:
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(2)
    except KeyboardInterrupt:
        _, options = process_options(args=sys.argv[1:])
        if options.show_traceback:
            sys.stdout.write(traceback.format_exc())
        formatter = FancyFormatter(sys.stdout, sys.stderr, False)
        msg = "Interrupted\n"
        sys.stdout.write(formatter.style(msg, color="red", bold=True))
        sys.stdout.flush()
        sys.stderr.flush()
        sys.exit(2)


if __name__ == "__main__":
    console_entry()

# pip

# setup.py
    entry_points={
        "console_scripts": [
            "pip=pip._internal.cli.main:main",
            "pip{}=pip._internal.cli.main:main".format(sys.version_info[0]),
            "pip{}.{}=pip._internal.cli.main:main".format(*sys.version_info[:2]),
        ],
    },

# src/pip/__main__.py
if __name__ == "__main__":
    from pip._internal.cli.main import main as _main

    sys.exit(_main())

# src/pip/_internal/cli/main.py
def main():
    ...

# build

# pyproject.toml
[project.scripts]
pyproject-build = "build.__main__:entrypoint"

# src/build/__main__.py
def main(args):
    ...

def entrypoint() -> None:
    main(sys.argv[1:])

if __name__ == '__main__':
    main(sys.argv[1:], 'python -m build')

We don't have to use torchgeo.main.main, that's just a design choice.
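
For comparison, torchgeo's own wiring presumably ends up looking roughly like the tools above. This is a sketch based only on the invocations shown in this PR (torchgeo, python3 -m torchgeo, from torchgeo.main import main), not a copy of the actual files:

# torchgeo (sketch)

# pyproject.toml
[project.scripts]
torchgeo = "torchgeo.main:main"

# torchgeo/__main__.py
import sys

from torchgeo.main import main

main(sys.argv[1:])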

@adamjstewart (Collaborator, Author)

Train classification accuracy and the other metrics aren't being logged during training, validation accuracy isn't being computed during validation, etc.

@calebrob6 can you see if this is fixed by the last commit? I can also play around with on_step/on_epoch to control frequency and prog_bar to control verbosity.
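
For context, those knobs live on LightningModule.log; a generic Lightning sketch (not torchgeo's actual trainer code) of what tweaking them looks like:

import torch
from lightning.pytorch import LightningModule


class ExampleModule(LightningModule):
    def training_step(self, batch, batch_idx):
        loss = torch.tensor(0.0)  # placeholder for the real loss computation
        self.log(
            "train_loss",
            loss,
            on_step=True,    # log at every optimization step
            on_epoch=True,   # also aggregate and log once per epoch
            prog_bar=True,   # show the value in the progress bar
            sync_dist=True,  # reduce across devices in multi-GPU runs
        )
        return loss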

@adamjstewart (Collaborator, Author)

Hoping for a quick response to Lightning-AI/pytorch-lightning#18641, otherwise I'll just ignore the mypy warnings.

@calebrob6 (Member)

Yep, metrics are logged now; however, they are logged at strange intervals (see the x-axis of the graphs below) and I get this warning in the multi-GPU case:

PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

[image: TensorBoard plots of the logged metrics, showing irregular logging intervals on the x-axis]

@calebrob6 (Member) commented Sep 26, 2023

By default, lightning should log every 50 training steps (https://lightning.ai/docs/pytorch/stable/extensions/logging.html#logging-frequency), which is why the first train point is near 50, but it doesn't explain why the frequency of train logging increases after that.

Update: for whatever reason, when the trainer gets the early-stopping signal (I see "Trainer was signaled to stop but the required min_epochs=15 or min_steps=None has not been met. Training will continue..." in the logs), the log frequency gets set to 1.
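
For reference, that 50-step default comes from the Trainer's log_every_n_steps option, which should be overridable on the CLI like any other trainer flag (the value here is illustrative):

torchgeo fit --config tests/conf/eurosat100.yaml --trainer.log_every_n_steps 10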

@adamjstewart (Collaborator, Author)

Type errors will be fixed by Lightning-AI/pytorch-lightning#18646

@adamjstewart (Collaborator, Author) commented Sep 27, 2023

@calebrob6 I'm inclined to call this PR "good enough" for a first draft and get it merged so we can focus on other last minute features. We can add docs and add new features or fix bugs in future releases. What do you think?

@calebrob6 (Member) commented Sep 27, 2023

First thought: Are you feeling okay!?
Second thought: We do need to figure out how to save model checkpoints and pick where the outputs go. Currently running torchgeo fit --config ... will create a directory called ./lightning_logs/ that just holds the tensorboard logs. If we release this in 0.5 but then figure out we need to add something to main.py to get the checkpoints to save, then we might confuse a lot of users.

@calebrob6 (Member)

(otherwise I don't care if the metric logging is working perfectly)

@adamjstewart (Collaborator, Author)

We do need to figure out how to save model checkpoints

They are automatically saved in lightning_logs/version_#/checkpoints

and pick where the outputs go.

The root directory can be controlled using --trainer.default_root_dir my_logs or:

trainer:
  default_root_dir: my_logs
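
For finer control over where checkpoints land, a ModelCheckpoint callback could presumably be configured in the same file; a hypothetical sketch (keys and values are illustrative):

trainer:
  default_root_dir: my_logs
  callbacks:
    - class_path: lightning.pytorch.callbacks.ModelCheckpoint
      init_args:
        dirpath: my_logs/checkpoints
        monitor: val_loss
        save_top_k: 1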

@calebrob6 (Member)

Problem solved!

@calebrob6 merged commit 984e222 into microsoft:main Sep 27, 2023
38 checks passed
@adamjstewart deleted the scripts/lightningcli branch December 19, 2023 18:11
Labels: datamodules (PyTorch Lightning datamodules), dependencies (Packaging and dependencies), scripts (Training and evaluation scripts), testing (Continuous integration testing), trainers (PyTorch Lightning trainers)
Projects: none yet
Development: successfully merging this pull request may close these issues: Add config file schema, Add train entrypoint
2 participants