Improve checkpointing for Zero stage 1 by ashbhandare · Pull Request #5478 · microsoft/onnxruntime

ashbhandare · 2020-10-13T22:25:33Z

This PR makes the following changes:

1. Completely shard the FP32 weight in case of an fp16 run
2. Simplify the aggregation logic for Zero checkpoints

The correctness has been verified with the test added through #5476:
CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 4 --tag-output python orttraining_run_bert_pretrain.py ORTBertPretrainTest.test_pretrain_zero
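
For readers unfamiliar with stage 1 sharding, the idea of fully sharding a flattened FP32 weight across ranks can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this PR: `shard_bounds` and `shard_weight` are hypothetical helper names, the weight is modelled as a plain Python list, and the ceil-division chunking is one possible partitioning scheme.

```python
from typing import List, Tuple


def shard_bounds(numel: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat tensor owned by `rank` (hypothetical helper)."""
    # Ceil-divide so ranks get equal-size chunks; trailing ranks may get a shorter or empty one.
    chunk = (numel + world_size - 1) // world_size
    start = min(rank * chunk, numel)
    end = min(start + chunk, numel)
    return start, end


def shard_weight(flat_weight: List[float], world_size: int) -> List[List[float]]:
    """Split a flattened FP32 weight into one contiguous shard per rank."""
    shards = []
    for rank in range(world_size):
        start, end = shard_bounds(len(flat_weight), world_size, rank)
        shards.append(flat_weight[start:end])
    return shards


# Toy example: a 10-element weight split across 4 ranks.
weight = [float(i) for i in range(10)]
shards = shard_weight(weight, world_size=4)
```

Each rank would then keep only its own shard of the FP32 master weight; concatenating the shards in rank order recovers the original tensor.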

thiagocrepaldi · 2020-10-13T22:37:50Z

Is this PR enabling any new scenario for checkpointing, such as fp32 -> fp16 and vice versa? Or is this just a performance improvement for the existing scenarios?

thiagocrepaldi · 2020-10-13T22:38:32Z

@baijumeswani FYI

ashbhandare · 2020-10-15T23:18:55Z

The fp16 -> fp32 support already exists. This PR removes the dependency of zero aggregation on the optimizer state being present in the state_dict. This allows saving only the model weights for zero_1.
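
As a rough illustration of what aggregation without optimizer state could look like, the sketch below merges per-rank state dicts that contain only weight shards. It is an assumption-laden toy, not the logic in checkpoint.py: the `_shard_<rank>` key suffix, the `SHARD_TAG` constant, and `aggregate_zero_checkpoints` are all hypothetical names chosen for this example.

```python
from typing import Dict, List

SHARD_TAG = "_shard_"  # hypothetical key convention for this sketch, not ORT's naming


def aggregate_zero_checkpoints(
    rank_state_dicts: List[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Merge per-rank state dicts, concatenating weight shards in rank order."""
    pieces: Dict[str, Dict[int, List[float]]] = {}
    full: Dict[str, List[float]] = {}
    for state in rank_state_dicts:
        for key, value in state.items():
            if SHARD_TAG in key:
                # "<name>_shard_<rank>" -> group shards by the base name.
                name, rank = key.rsplit(SHARD_TAG, 1)
                pieces.setdefault(name, {})[int(rank)] = value
            else:
                # Unsharded entries are replicated on every rank; keep one copy.
                full.setdefault(key, value)
    for name, by_rank in pieces.items():
        full[name] = [x for rank in sorted(by_rank) for x in by_rank[rank]]
    return full


# Weights-only checkpoints: no optimizer entries are required for the merge.
rank0 = {"dense.weight_shard_0": [0.0, 1.0], "step": [5.0]}
rank1 = {"dense.weight_shard_1": [2.0, 3.0], "step": [5.0]}
merged = aggregate_zero_checkpoints([rank1, rank0])
```

Because the merge keys off the shard names alone, it works the same whether or not optimizer moments were saved alongside the weights.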

ashbhandare · 2020-10-29T16:31:39Z

The test graph bert_toy_postprocessed.onnx had to be run through onnxruntime/tools/python/remove_initializer_from_input.py to remove the initializers bert.embeddings.position_embeddings.weight and bert.embeddings.word_embeddings.weight from the graph inputs. Trainable weights should not be expected to be overridden, and zero partitioning is conditioned upon the initializers not being in the graph inputs.

Additionally, the script orttraining/orttraining/test/python/orttrainer_bert_toy_onnx_ckpt_gen.py has been added to generate the zero checkpoints required for the test testToyBertCheckpointLoadZero().
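
The effect of removing initializers from the graph inputs, as described above, can be pictured with a toy model of a graph. This sketch deliberately does not use the onnx package or reproduce remove_initializer_from_input.py: `ToyGraph` and `remove_initializers_from_inputs` are hypothetical stand-ins, and the `input_ids` input name is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyGraph:
    """Toy stand-in for a model graph; this is not the onnx API."""
    inputs: List[str]
    initializers: Dict[str, List[float]] = field(default_factory=dict)


def remove_initializers_from_inputs(graph: ToyGraph) -> List[str]:
    """Drop every graph input that is also an initializer; return the removed names."""
    removed = [name for name in graph.inputs if name in graph.initializers]
    graph.inputs = [name for name in graph.inputs if name not in graph.initializers]
    return removed


graph = ToyGraph(
    inputs=[
        "input_ids",  # hypothetical real input
        "bert.embeddings.position_embeddings.weight",
        "bert.embeddings.word_embeddings.weight",
    ],
    initializers={
        "bert.embeddings.position_embeddings.weight": [0.1, 0.2],
        "bert.embeddings.word_embeddings.weight": [0.3, 0.4],
    },
)
removed = remove_initializers_from_inputs(graph)
```

After the cleanup the weights remain as initializers but can no longer be overridden by a caller-supplied input.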

baijumeswani

Looks good.

change has been addressed, dismissing to unblock

thiagocrepaldi · 2020-11-02T16:49:07Z

@ashbhandare Are you familiar with the PyTorch flexible API specs? It is a new frontend for ORT which requires all graph initializers to be passed as graph inputs. The initializers come from the original PyTorch model and are passed into ORT, so the ORT backend is actually stateless in this sense, as it will only compute stuff on top of its inputs.

@mrry Do you think this behavior is compatible with ORTModule design? Do we intend to support ZeRO on the flexible API?

ashbhandare · 2020-11-02T23:30:45Z

If the initializers are inputs in the flexible API, will the optimizer be handled by ORT? If not, zero partitioning for stage 1 should not happen within ORT and this change will not be touched. If yes, we could enable the older way of adding a 'View' for the flexible API alone.

thiagocrepaldi · 2020-12-03T17:30:38Z

/azp run orttraining-linux-gpu-ci-pipeline

ashbhandare requested review from a team, BowenBao, liqunfu, spandantiwari and thiagocrepaldi as code owners October 13, 2020 22:25

ashbhandare requested a review from jessebenson October 13, 2020 22:27

thiagocrepaldi previously requested changes Oct 13, 2020

Comment thread orttraining/orttraining/python/training/checkpoint.py Outdated

thiagocrepaldi reviewed Oct 13, 2020

Comment thread orttraining/orttraining/test/python/orttraining_run_bert_pretrain.py

thiagocrepaldi reviewed Oct 13, 2020

Comment thread orttraining/orttraining/test/python/orttraining_run_bert_pretrain.py

thiagocrepaldi requested a review from baijumeswani October 13, 2020 22:38

jessebenson reviewed Oct 15, 2020

Comment thread orttraining/orttraining/core/session/training_session.cc Outdated

jessebenson reviewed Oct 15, 2020

Comment thread orttraining/orttraining/core/graph/zero_optimizer_graph_builder.cc Outdated

jessebenson reviewed Oct 15, 2020

Comment thread orttraining/orttraining/core/graph/zero_optimizer_graph_builder.cc

jessebenson previously approved these changes Oct 15, 2020

ashbhandare dismissed jessebenson’s stale review via df4070c October 29, 2020 01:16

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch 2 times, most recently from df4070c to 06ac0bb Compare October 29, 2020 16:07

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch from 06ac0bb to a1ca6e9 Compare October 29, 2020 18:08

baijumeswani reviewed Oct 29, 2020

baijumeswani previously approved these changes Oct 30, 2020

ashbhandare dismissed baijumeswani’s stale review via 5a453d0 October 30, 2020 21:13

thiagocrepaldi suggested changes Nov 2, 2020

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch from 7506db9 to 2340098 Compare November 2, 2020 23:25

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch 3 times, most recently from ae1fb60 to 59a2698 Compare December 2, 2020 17:53

ashbhandare added 17 commits December 3, 2020 17:53

Initial running changes (d71857d)
Checkpointing aggregation changes (2cbe436)
compare with older version (903a44a)
initial cleanup (b31e833)
Add zero test, minor fix (340ef43)
Fix zero test, transform, formatting (3d77678)
Review comments (1757e5c)
add more unit tests (bd40234)
review comments (0146849)
Try fix CI (29510b7)
Add additional check on just aggregation code (9055d29)
Try fix ckpt gen (82fef12)
Add pregenerated ckpt for CI, enable zero test in e2e (30d2ce7)
Moving test to nightly, removing ckpt files (a928d87)
Add tests to dist GPU CI (6adec1b)
Fix dist test (99ca90d)
Review comments (5bc5911)

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch from 59a2698 to 5bc5911 Compare December 3, 2020 17:59

thiagocrepaldi previously approved these changes Dec 3, 2020

ashbhandare dismissed thiagocrepaldi’s stale review via fa20163 December 4, 2020 18:52

thiagocrepaldi reviewed Dec 4, 2020

Comment thread onnxruntime/test/python/onnxruntime_test_ort_trainer.py Outdated

Fix test (eac63e6)

ashbhandare force-pushed the aibhanda/zero_1_ckpt branch from fa20163 to eac63e6 Compare December 4, 2020 21:05

thiagocrepaldi approved these changes Dec 4, 2020

ashbhandare merged commit 7cebf76 into master Dec 7, 2020

ashbhandare deleted the aibhanda/zero_1_ckpt branch December 7, 2020 17:16
