
[feature] new zero implementation #1623

Merged
merged 13 commits into hpcaitech:main on Sep 24, 2022

Conversation

@1SAA (Contributor) commented Sep 21, 2022

A New ZeRO Implementation

Background

The current version of our ZeRO has a performance issue: chunks are distributed asymmetrically across processes, so one process can block the others while it reads the content of a chunk located in CPU memory. This prolongs data transmission and undermines the efficiency of ZeRO.

Implementation

To solve this problem, I refactored the Chunk class. A new chunk is distributed evenly across all processes, so every process can move its data from CPU memory to CUDA memory at the same time. Furthermore, I added an option to enable pinned memory for chunks, so every chunk can keep a copy in pinned CPU memory. These optimizations noticeably improve the efficiency of data movement between CPU and CUDA.
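As a rough illustration of the idea, here is a minimal sketch of an evenly sharded chunk with an optional pinned CPU copy. This is not the actual Chunk class from this PR; the names ShardedChunk, cuda_shard, cpu_shard, and the shard_move signature are illustrative only:

```python
import torch
import torch.distributed as dist

class ShardedChunk:
    """Illustrative chunk: each rank owns an equal shard plus an
    optional pinned CPU copy (not the real Chunk class)."""

    def __init__(self, chunk_size: int, pin_memory: bool = True):
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        assert chunk_size % world_size == 0
        self.shard_size = chunk_size // world_size
        # Each rank holds only its own equal-sized shard on CUDA,
        # so no rank has to wait on another during movement.
        self.cuda_shard = torch.empty(self.shard_size, device='cuda')
        # A pinned CPU buffer enables fast, asynchronous H2D/D2H copies.
        self.cpu_shard = (torch.empty(self.shard_size, pin_memory=True)
                          if pin_memory else None)

    def shard_move(self, device: torch.device, non_blocking: bool = True):
        # Every rank moves its own shard; with pinned memory the copy
        # can overlap with computation.
        if device.type == 'cuda':
            self.cuda_shard.copy_(self.cpu_shard, non_blocking=non_blocking)
        else:
            self.cpu_shard.copy_(self.cuda_shard, non_blocking=non_blocking)
```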

Another Advantage

The new ZeRO supports true hybrid parallelism: it creates separate chunk groups for parameters that belong to different DP communication groups. This brings great flexibility to our upcoming automatic parallelism configuration.
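To make this concrete, the sketch below buckets parameters into separate chunk groups keyed by their DP process group. It is an assumption-laden illustration, not the real ChunkManager API from this PR:

```python
import torch

class ChunkSketch:
    """Illustrative fixed-capacity bucket of parameters (not the real Chunk)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.params = []

    def can_fit(self, numel: int) -> bool:
        return self.used + numel <= self.capacity

    def append(self, param: torch.Tensor) -> None:
        self.params.append(param)
        self.used += param.numel()

class ChunkManagerSketch:
    """Illustrative manager: parameters with different DP communication
    groups never share a chunk, so each group can be gathered and
    reduced independently."""

    def __init__(self, chunk_capacity: int):
        self.chunk_capacity = chunk_capacity
        self.chunk_groups = {}  # DP process group -> list of chunks

    def register_param(self, param: torch.Tensor, dp_group) -> None:
        chunks = self.chunk_groups.setdefault(dp_group, [])
        if not chunks or not chunks[-1].can_fit(param.numel()):
            chunks.append(ChunkSketch(self.chunk_capacity))
        chunks[-1].append(param)
```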

# Map each fp16 parameter to its fp32 master copy and pair their chunks.
self.fp16_param_to_fp32_param[p] = fp32_p
chunk_16 = self.chunk_manager.get_chunk(p)
chunk_32 = self.chunk_manager.get_chunk(fp32_p)
chunk_32.init_pair(chunk_16)
Member: Why not init pair in ZeroDDP?

@1SAA (Contributor, Author): It's there for a legacy reason. I'll move it to ZeroDDP.

else:
    # The chunk has no CPU copy here, so its shard must be resident on CUDA.
    assert self.cuda_shard is not None  # only check in CUDA
    valid_tensor = self.cuda_shard[:self.valid_end]
Member: Where do you copy the fp32 param to the fp16 param when both of them are on the CPU?

@1SAA (Contributor, Author): See the shard_move function in the Chunk class.
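For readers following along, here is a heavily simplified sketch of where such a copy could live. The helper below is hypothetical; the real logic sits inside Chunk's shard_move in this PR and may differ. It only illustrates that a shard-to-shard copy works regardless of which device both shards currently occupy:

```python
import torch

def update_paired_shard(chunk_32, chunk_16):
    # Hypothetical helper, not the actual shard_move: write the fp32 master
    # shard into its paired fp16 shard. Pick whichever copy of each shard is
    # currently resident; the copy works CPU-to-CPU as well as on CUDA.
    src = chunk_32.cuda_shard if chunk_32.cuda_shard is not None else chunk_32.cpu_shard
    dst = chunk_16.cuda_shard if chunk_16.cuda_shard is not None else chunk_16.cpu_shard
    dst.copy_(src)  # Tensor.copy_ casts fp32 -> fp16 during the copy
```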

@feifeibear merged commit 5be118f into hpcaitech:main on Sep 24, 2022
feifeibear added a commit that referenced this pull request Sep 26, 2022