[feature] A new ZeRO implementation #1644

Merged
merged 17 commits into from Oct 9, 2022

@1SAA 1SAA commented Sep 26, 2022

A New ZeRO Implementation

This PR introduces a new implementation of ZeRO.

Background

The current version of our ZeRO has a performance issue. Chunks are distributed asymmetrically across processes, so a process that reads a chunk located in CPU memory holds up the other processes. This prolongs data transfers and undermines the efficiency of ZeRO.

Implementation

To solve this problem, I refactored the Chunk class. Each chunk is now distributed evenly across all processes, so all processes can move their data from CPU memory to CUDA memory at the same time. I also added an option to enable pinned memory for chunks, so every chunk can keep a copy in pinned CPU memory. Together, these optimizations significantly improve the efficiency of data movement between CPU and CUDA.
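The even distribution can be illustrated with a minimal sketch in plain Python. This is not the code from this PR: the function name `shard_bounds` and the padding scheme (rounding the chunk up to a multiple of the number of processes) are assumptions made for illustration only.

```python
from typing import List, Tuple


def shard_bounds(chunk_size: int, world_size: int) -> List[Tuple[int, int]]:
    """Return one (start, end) slice per rank over a flat chunk.

    Hypothetical scheme: the chunk is conceptually padded up to a multiple
    of `world_size`, so every rank owns a shard of identical length.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    shard_size = -(-chunk_size // world_size)  # ceiling division
    return [(rank * shard_size, (rank + 1) * shard_size) for rank in range(world_size)]


# A chunk of 10 elements split across 4 processes: each rank owns 3 elements,
# and the last shard extends into 2 elements of padding.
bounds = shard_bounds(chunk_size=10, world_size=4)
print(bounds)
```

Because every shard has the same length, each process would transfer the same amount of data between CPU and CUDA, and no single process owns a whole chunk. In PyTorch, a pinned CPU copy of a tensor can be obtained with `tensor.pin_memory()`; whether the new Chunk class uses that exact call is not stated here.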

Another Advantage

The new ZeRO supports true hybrid parallelism. It creates separate chunk groups for parameters that have different DP communication groups. This brings great flexibility to our upcoming automatic configuration of parallelism.
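The grouping idea can be sketched as follows, again in plain Python and not taken from this PR. The function `group_params_by_dp_ranks`, the parameter names, and the representation of a DP communication group as a tuple of ranks are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_params_by_dp_ranks(
    param_dp_ranks: Dict[str, Tuple[int, ...]]
) -> Dict[Tuple[int, ...], List[str]]:
    """Bucket parameter names by the set of ranks in their DP group.

    Parameters that share a DP communication group land in the same bucket,
    so each bucket could be backed by its own chunk group.
    """
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, ranks in param_dp_ranks.items():
        groups[tuple(sorted(ranks))].append(name)
    return dict(groups)


# Hypothetical model: dense layers are replicated on all 4 ranks,
# while one expert layer is replicated only on ranks 0 and 2.
params = {
    "dense.weight": (0, 1, 2, 3),
    "dense.bias": (0, 1, 2, 3),
    "expert.weight": (0, 2),
}
chunk_groups = group_params_by_dp_ranks(params)
print(chunk_groups)
```

Keying the buckets by DP group means chunks are only ever sharded and gathered among the processes that actually replicate those parameters.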

@feifeibear feifeibear merged commit b28991d into hpcaitech:main Oct 9, 2022
@1SAA 1SAA deleted the demo_moe branch October 14, 2022 02:50

3 participants