
Elastic training support #602

Merged
merged 34 commits into from Dec 23, 2020

Conversation

jeffra
Contributor

@jeffra jeffra commented Dec 14, 2020

Supports scaling training up or down across compatible GPU counts. Adds a new "elasticity" key to our config JSON. Users indicate their maximum acceptable train batch size and their acceptable micro batch sizes, and DeepSpeed finds a batch size that is usable with the largest set of compatible GPU counts. The intended consumers of this API and JSON addition are both the user's training code and the infrastructure scheduler.

    "elasticity": {
        "enabled": true,
        "max_train_batch_size": 2000,
        "micro_batch_sizes": [2,4,6],
        "min_gpus": 1,
        "max_gpus" : 10000,
        "min_time": 20,
        "version": 0.1
    }
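
To make the batch-size search concrete, here is a minimal sketch of the idea (an illustration only, not DeepSpeed's actual implementation; the function names and the brute-force strategy are assumptions):

    # Illustrative sketch only -- not DeepSpeed's implementation.
    # A GPU count g is compatible with a total batch size b if, for some
    # allowed micro batch size mb, b is evenly divisible by mb * g
    # (the remaining factor becomes the gradient accumulation steps).

    def compatible_gpu_counts(batch_size, micro_batch_sizes, min_gpus, max_gpus):
        counts = set()
        for mb in micro_batch_sizes:
            if batch_size % mb:
                continue
            q = batch_size // mb
            # valid GPU counts are the divisors of batch_size // mb in range
            for g in range(min_gpus, min(max_gpus, q) + 1):
                if q % g == 0:
                    counts.add(g)
        return counts

    def find_elastic_batch_size(max_batch, micro_batch_sizes, min_gpus, max_gpus):
        # Pick the batch size <= max_batch that is usable with the most
        # GPU counts, breaking ties in favor of the larger batch size.
        best_n, best_b = -1, 0
        for b in range(1, max_batch + 1):
            n = len(compatible_gpu_counts(b, micro_batch_sizes, min_gpus, max_gpus))
            if (n, b) > (best_n, best_b):
                best_n, best_b = n, b
        return best_b

For the config above, this search would be invoked as find_elastic_batch_size(2000, [2, 4, 6], 1, 10000).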

@g-karthik

@jeffra I haven't looked at this closely, but am I right to assume this requires the user to also use the training_data argument of deepspeed.initialize()? Also, how does the infrastructure scheduler tie into this config?
