ZeRO2-Offload: Load balance gradient copying to CPU#1067

Merged
tjruwase merged 11 commits into master from
olruwase/zero2_offload_balance_backward
May 15, 2021
Conversation

@tjruwase
Contributor

During the backward pass in ZeRO-2 Offload, reduced gradients are accumulated in CPU memory for the later optimizer step. With the previous model partitioning scheme, gradient copying to CPU occurs one rank at a time, which slows down backward and under-utilizes PCIe. This PR introduces a new model partitioning scheme that spreads gradient copying evenly among all (or most) ranks at any point in time, which improves backward time and PCIe utilization.
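
The effect described above can be illustrated with a toy model. This is a minimal sketch and not the code in this PR: it assumes every parameter has the same size, that gradients become ready in reverse parameter order during backward, and that each gradient is copied to CPU by the rank that owns it. The function names are hypothetical.

```python
from typing import List


def contiguous_owners(num_params: int, world_size: int) -> List[int]:
    """Owner rank per parameter when each rank owns one contiguous block."""
    per_rank = -(-num_params // world_size)  # ceiling division
    return [i // per_rank for i in range(num_params)]


def round_robin_owners(num_params: int, world_size: int) -> List[int]:
    """Owner rank per parameter when ownership rotates across ranks."""
    return [i % world_size for i in range(num_params)]


def busy_ranks_per_window(owners: List[int], window: int) -> List[int]:
    """Count distinct ranks copying gradients in each window of backward.

    Assumption: gradients are produced in reverse parameter order, so the
    owner list is walked back to front in chunks of `window` gradients.
    """
    order = owners[::-1]
    return [
        len(set(order[start:start + window]))
        for start in range(0, len(order), window)
    ]


# 16 equally sized parameters spread over 4 ranks.
print(busy_ranks_per_window(contiguous_owners(16, 4), window=4))   # [1, 1, 1, 1]
print(busy_ranks_per_window(round_robin_owners(16, 4), window=4))  # [4, 4, 4, 4]
```

In this toy setting, contiguous ownership keeps only one rank busy copying at a time, while rotating ownership keeps all four ranks busy in every window.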

@samyam samyam left a comment

This is quite clever. :)

Just documenting what we discussed on the potential simplification by just re-ordering the parameters in a round robin fashion directly in the self.fp16_groups to avoid all the book-keeping.
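
A rough sketch of that re-ordering idea, for illustration only: it is not what was merged, and `round_robin_reorder` and `contiguous_split` are hypothetical helpers operating on a plain list rather than on `self.fp16_groups`. It assumes the parameter count is divisible by the number of ranks.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_robin_reorder(params: Sequence[T], world_size: int) -> List[T]:
    """Reorder params so that a plain contiguous split gives rank r the
    parameters r, r + world_size, r + 2 * world_size, ...

    Assumption: len(params) is divisible by world_size; otherwise the
    contiguous chunks no longer line up with the round-robin buckets.
    """
    reordered: List[T] = []
    for rank in range(world_size):
        reordered.extend(params[rank::world_size])
    return reordered


def contiguous_split(params: Sequence[T], world_size: int) -> List[List[T]]:
    """Split params into world_size contiguous chunks, one per rank."""
    per_rank = -(-len(params) // world_size)  # ceiling division
    return [
        list(params[rank * per_rank:(rank + 1) * per_rank])
        for rank in range(world_size)
    ]


names = [f"p{i}" for i in range(8)]
reordered = round_robin_reorder(names, world_size=4)
print(reordered)                       # ['p0', 'p4', 'p1', 'p5', 'p2', 'p6', 'p3', 'p7']
print(contiguous_split(reordered, 4))  # [['p0', 'p4'], ['p1', 'p5'], ['p2', 'p6'], ['p3', 'p7']]
```

With the list reordered once up front, the existing contiguous partitioning would yield round-robin ownership without any extra index tracking.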

@tjruwase
Contributor Author

> This is quite clever. :)
>
> Just documenting what we discussed on the potential simplification by just re-ordering the parameters in a round robin fashion directly in the self.fp16_groups to avoid all the book-keeping.

My initial attempt at doing this hurt performance. So I am making it a TODO for when I have more time to investigate.

@tjruwase tjruwase merged commit ee4deab into master May 15, 2021
@mrwyattii mrwyattii deleted the olruwase/zero2_offload_balance_backward branch July 7, 2023 02:40