Deepspeed Integration #109

Merged: 55 commits, Dec 1, 2021

@jbloxham jbloxham commented Nov 29, 2021

Adds support for using DeepSpeed instead of PyTorch DDP in the trainer. Somewhat WIP, but good enough to merge for now. The major TODOs are:

  • Support for regular checkpointing
  • Cleaner support for precision casting
  • Config support for DeepSpeed goodies like ZeRO and activation checkpointing
  • Quite a bit of testing
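
For the config TODO, a minimal sketch of what assembling a DeepSpeed config dict with ZeRO and activation checkpointing might look like. The helper name `build_deepspeed_config` and its parameters are hypothetical and not part of this PR or of Composer; the dict keys (`train_batch_size`, `zero_optimization`, `fp16`, `activation_checkpointing`) follow DeepSpeed's documented JSON config schema.

```python
from typing import Any, Dict


def build_deepspeed_config(
    train_batch_size: int,
    zero_stage: int = 0,
    fp16: bool = False,
    activation_checkpointing: bool = False,
) -> Dict[str, Any]:
    """Assemble a DeepSpeed-style config dict.

    Hypothetical helper for illustration only; not the trainer's actual code.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"zero_stage must be 0, 1, 2 or 3, got {zero_stage}")

    config: Dict[str, Any] = {"train_batch_size": train_batch_size}

    # ZeRO is only configured when a non-zero stage is requested.
    if zero_stage > 0:
        config["zero_optimization"] = {"stage": zero_stage}

    # Mixed-precision training via DeepSpeed's fp16 section.
    if fp16:
        config["fp16"] = {"enabled": True}

    # Activation checkpointing section; partitioning activations across
    # model-parallel GPUs is one of the documented options (assumed on here).
    if activation_checkpointing:
        config["activation_checkpointing"] = {"partition_activations": True}

    return config


if __name__ == "__main__":
    print(build_deepspeed_config(32, zero_stage=2, fp16=True))
```

A dict of this shape is the kind of object that could be handed to DeepSpeed as its config; how the trainer would surface these options in its own hparams is left open here.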

I'm waiting on a node to run a quick regression test.

@Averylamp Averylamp left a comment

LGTM. Maybe wait for one more LGTM on the training loop changes.

composer/trainer/trainer.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
@ravi-mosaicml ravi-mosaicml left a comment

I'll take another look once the TODOs are addressed, but looks awesome!

composer/core/precision.py (resolved)
composer/trainer/deepspeed.py (resolved)
composer/trainer/trainer.py (resolved)
composer/trainer/deepspeed.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
composer/trainer/trainer.py (outdated, resolved)
@pytest.mark.gpu
@pytest.mark.parametrize("world_size", [pytest.param(1), pytest.param(2, marks=pytest.mark.world_size(2))])
def test_deepspeed(world_size: int, mosaic_trainer_hparams: TrainerHparams, tmpdir: pathlib.Path) -> None:
    """Pretty much just copied from ./test_ddp"""

#110 changes the DDP test fyi

@jbloxham jbloxham merged commit 0a17521 into mosaicml:dev Dec 1, 2021
@jbloxham jbloxham mentioned this pull request Dec 1, 2021
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022