ZeRO2-Offload: Load balance gradient copying to CPU #1067
Merged
Conversation
…/zero2_offload_balance_backward
eltonzheng approved these changes on May 12, 2021
samyam reviewed on May 13, 2021
Contributor
samyam
left a comment
This is quite clever. :)
Just documenting what we discussed on the potential simplification by just re-ordering the parameters in a round robin fashion directly in the self.fp16_groups to avoid all the book-keeping.
Contributor
Author
My initial attempt at doing this hurt performance, so I am leaving it as a TODO for when I have more time to investigate.
During backward in ZeRO-2 Offload, reduced gradients are accumulated into CPU memory for the later optimizer step. Under the previous model partitioning scheme, gradient copying to CPU occurs one rank at a time, which slows down the backward pass and under-utilizes PCIe bandwidth. This PR introduces a new model partitioning scheme that spreads gradient copying evenly across all (or most) ranks at any point in time, improving both backward time and PCIe utilization.
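To illustrate the idea (this is a simplified sketch, not the actual DeepSpeed implementation; the function name and parameter-size list are hypothetical), assigning parameters to ranks in round-robin order means that as backward visits parameters in sequence, the owning rank rotates, so gradient-to-CPU copies from different ranks interleave instead of one rank copying a long contiguous run while the others wait:

```python
def round_robin_partition(num_params, world_size):
    """Hypothetical sketch: assign parameter indices to ranks round-robin.

    With this layout, consecutive parameters in the backward pass belong
    to different ranks, so at any point in time the CPU copies issued by
    the ranks are spread out rather than bunched on a single rank.
    """
    partitions = [[] for _ in range(world_size)]
    for idx in range(num_params):
        partitions[idx % world_size].append(idx)
    return partitions

# Example: 6 parameters over 3 ranks; each rank owns every third parameter.
parts = round_robin_partition(6, 3)
# parts == [[0, 3], [1, 4], [2, 5]]
```

The contiguous alternative (rank 0 owns parameters 0..1, rank 1 owns 2..3, ...) would make each rank's copies arrive back-to-back, serializing PCIe traffic per rank; round-robin interleaves them, which is the load-balancing effect this PR targets.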