Deepspeed Integration #109

jbloxham · 2021-11-29T23:34:59Z

Support for using DeepSpeed instead of PyTorch DDP in the trainer. Somewhat WIP, but good enough to merge for now. The major TODOs are:

Support for regular checkpointing
Cleaner support for precision casting
Config support for DeepSpeed goodies like ZeRO and activation checkpointing
Quite a bit of testing
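
For context on the config TODO above, here is a minimal sketch of what exposing ZeRO and activation checkpointing could look like. The helper name `build_deepspeed_config` and its parameters are hypothetical and are not part of this PR. The dictionary keys (`train_batch_size`, `fp16`, `zero_optimization`, `activation_checkpointing`) are the ones DeepSpeed's JSON config is documented to accept, but check them against the DeepSpeed version in use.

```python
from typing import Any, Dict


def build_deepspeed_config(
    train_batch_size: int,
    fp16: bool = False,
    zero_stage: int = 0,
    partition_activations: bool = False,
) -> Dict[str, Any]:
    """Build a DeepSpeed-style config dict (hypothetical helper, not from the PR)."""
    if train_batch_size <= 0:
        raise ValueError("train_batch_size must be positive")
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")

    return {
        # Global batch size across all ranks and accumulation steps.
        "train_batch_size": train_batch_size,
        # Mixed precision is toggled in the config rather than in the training loop.
        "fp16": {"enabled": fp16},
        # Stage 0 disables ZeRO; higher stages shard more training state.
        "zero_optimization": {"stage": zero_stage},
        # Activation checkpointing options live in their own section.
        "activation_checkpointing": {"partition_activations": partition_activations},
    }


if __name__ == "__main__":
    print(build_deepspeed_config(train_batch_size=2048, fp16=True, zero_stage=1))
```

A dict like this would typically be passed to DeepSpeed when the engine is initialized; how it should be surfaced through the trainer hparams is left open here.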

I'm waiting on a node to run a quick regression test.

Averylamp

LGTM, but maybe wait on one more LGTM for the training loop changes.

composer/trainer/trainer.py

ravi-mosaicml

I'll take another look once the TODOs are addressed, but looks awesome!

composer/core/precision.py

composer/trainer/deepspeed.py

composer/trainer/trainer.py

ravi-mosaicml · 2021-12-01T03:10:53Z

tests/trainer/test_deepspeed.py

+@pytest.mark.gpu
+@pytest.mark.parametrize("world_size", [pytest.param(1), pytest.param(2, marks=pytest.mark.world_size(2))])
+def test_deepspeed(world_size: int, mosaic_trainer_hparams: TrainerHparams, tmpdir: pathlib.Path) -> None:
+    """Pretty much just copied from ./test_ddp"""


FYI, #110 changes the DDP test.

jbloxham force-pushed the deepspeed-integration branch from 35bbee2 to 7556fcd Compare November 30, 2021 18:35

jbloxham marked this pull request as ready for review November 30, 2021 23:35

jbloxham requested review from Averylamp, ravi-mosaicml, abhi-mosaic, ajaysaini725 and siriuslee November 30, 2021 23:36

Averylamp approved these changes Dec 1, 2021

jbloxham added 22 commits December 1, 2021 00:56

hacky integration

e986fa7

for convenience also include a yaml file

c1ff8da

more hacks

ab18fdd

woops

c4918b5

lmao

7dd5f7d

woops again

a4fd8a1

who needs to test locally when you have a cluster

b992bee

#python

0e0f992

woops

a38d29a

woops

04539fe

fp32 testing

d1b03de

fix loss scaling

7a926ba

fix loss calculation

b9ac49c

small fix

1246ca0

tiny typo

d250814

re-enable fp16

29d7617

fp32 again

7734912

fp16

2efe689

fix sampler

ba02cae

initialize dataloader after deepspeed

0eb8685

fp32

c43248c

restore distinct loss

045f04c

jbloxham added 16 commits December 1, 2021 00:56

slightly hacky fix for fp16 metrics

4d6aa12

unneeded config

e6cdf80

starting cleanup

93f0ec5

remove merge conflict

a8806f7

more cleanup, and error if fp16 without deepspeed

8123226

mostly fix pyright

fdd54bb

remove run_mosaic_deepspeed_trainer

5aec632

ditch deepspeed-specific yamls

b84f7f3

reintroduce device to deepspeed hparams

082f0d4

supposedly merged trainers

53c4fd8

is this really the only bug???

1011e29

purge

b4d08ef

some cleanup

3f2a117

versioning and licensing

e326115

add a test

d8e1744

add license

5c14354

jbloxham force-pushed the deepspeed-integration branch from ead595e to 5c14354 Compare December 1, 2021 00:57

jbloxham added 2 commits December 1, 2021 01:04

don't update optimizer when deepspeed on

bea1158

try changing the loss scaling

aefd95f

ravi-mosaicml reviewed Dec 1, 2021

jbloxham added 7 commits December 1, 2021 18:27

some consistency

9289e98

don't zero optimizers when using deepspeed

e473bc2

force null sync context when using deepspeed

f40a1a2

engine.backward

601ab00

satisfy pyright with some type:ignore

c10d924

address comments

47962d4

missing conditional import

f704bb8
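
Several of the commits above ("don't zero optimizers when using deepspeed", "force null sync context when using deepspeed", "engine.backward") point at one difference: with DeepSpeed, the engine owns the backward pass and the optimizer update. The following is a rough sketch of that branching, not the code from this PR. The function name `backward_and_step` and its parameters are hypothetical; `engine.backward(loss)` and `engine.step()` are the DeepSpeed engine methods, and the objects are duck-typed so the sketch runs without DeepSpeed installed.

```python
from contextlib import nullcontext
from typing import Any, Callable, ContextManager, Optional


def backward_and_step(
    loss: Any,
    optimizer: Any,
    engine: Optional[Any] = None,
    sync_context: Callable[[], ContextManager] = nullcontext,
) -> None:
    """Run one backward pass and optimizer update (hypothetical helper).

    `engine` is assumed to behave like a DeepSpeed engine; `sync_context` is
    assumed to be something like a DDP model's `no_sync` on the non-DeepSpeed path.
    """
    if engine is not None:
        # DeepSpeed path: no manual zero_grad, no DDP sync context, and no
        # direct optimizer.step(); the engine handles all three.
        with nullcontext():
            engine.backward(loss)
        engine.step()
        return

    # Plain PyTorch / DDP path.
    optimizer.zero_grad()
    with sync_context():
        loss.backward()
    optimizer.step()
```

Keeping the branch in one helper makes it harder to accidentally zero or step the optimizer twice when DeepSpeed is enabled.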

jbloxham merged commit 0a17521 into mosaicml:dev Dec 1, 2021

jbloxham mentioned this pull request Dec 1, 2021

DeepSpeed Integration #60

Closed

coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022

Deepspeed Integration (mosaicml#109)

65da7c5
