[feature] new zero implementation #1623
Conversation
self.fp16_param_to_fp32_param[p] = fp32_p
chunk_16 = self.chunk_manager.get_chunk(p)
chunk_32 = self.chunk_manager.get_chunk(fp32_p)
chunk_32.init_pair(chunk_16)
Why not init pair in ZeroDDP?
It's done this way for a legacy reason. I'll move it to ZeroDDP.
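Moving the pairing into a ZeroDDP-style setup step could look roughly like the sketch below. The classes here are simplified stand-ins for the real Chunk and ChunkManager (init_pair just records a back-reference between the fp16 chunk and its fp32 copy), not Colossal-AI's actual API:

```python
class Chunk:
    """Minimal stand-in for the real Chunk class (illustration only)."""
    def __init__(self, dtype):
        self.dtype = dtype
        self.paired_chunk = None

    def init_pair(self, friend_chunk):
        # Link this fp32 chunk with the fp16 chunk it mirrors, and vice versa.
        self.paired_chunk = friend_chunk
        friend_chunk.paired_chunk = self


class ChunkManager:
    """Maps each parameter to the chunk that stores it (illustration only)."""
    def __init__(self):
        self._param_to_chunk = {}

    def register(self, param, chunk):
        self._param_to_chunk[param] = chunk

    def get_chunk(self, param):
        return self._param_to_chunk[param]


def pair_fp16_fp32(chunk_manager, fp16_param_to_fp32_param):
    # ZeroDDP-style pairing: for each fp16 param, link the chunk holding it
    # with the chunk holding its fp32 master copy.
    for p, fp32_p in fp16_param_to_fp32_param.items():
        chunk_16 = chunk_manager.get_chunk(p)
        chunk_32 = chunk_manager.get_chunk(fp32_p)
        if chunk_32.paired_chunk is None:  # pair each chunk only once
            chunk_32.init_pair(chunk_16)
```

Doing the pairing once in the DDP wrapper keeps the optimizer free of chunk bookkeeping.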
else:
    assert self.cuda_shard is not None  # only check in CUDA
    valid_tensor = self.cuda_shard[:self.valid_end]
Where do you copy fp32 param to fp16 param when both of them are on CPU?
See the shard_move function in the Chunk class.
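As a rough illustration of the CPU-to-CPU case (simplified stand-ins, not the real shard_move signature): each rank owns one shard of the chunk, and when both the fp32 and fp16 shards sit in CPU memory, the updated fp32 values are cast and written into the fp16 shard with no device transfer. Python's struct 'e' (half-precision) format stands in for the fp32 → fp16 cast here:

```python
import struct


class Shard:
    """One rank's slice of a chunk (illustration only)."""
    def __init__(self, data, device="cpu"):
        self.data = list(data)
        self.device = device


def to_fp16(value):
    # Round-trip through IEEE half precision using struct's 'e' format,
    # mimicking the precision loss of an fp32 -> fp16 cast.
    return struct.unpack("e", struct.pack("e", value))[0]


def copy_fp32_shard_to_fp16(fp32_shard, fp16_shard):
    # Both shards are resident in CPU memory, so the copy is a plain
    # cast-and-write with no CPU<->CUDA traffic.
    assert fp32_shard.device == "cpu" and fp16_shard.device == "cpu"
    fp16_shard.data = [to_fp16(v) for v in fp32_shard.data]
```

Each rank only ever touches its own shard, which is what lets every process do this copy independently.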
This reverts commit 5be118f.
A New ZeRO Implementation
Background
In the current version, our ZeRO has a performance issue: chunks are distributed asymmetrically, so one process can stall the others when it reads the content of a chunk located in CPU memory. This prolongs data transmission and undermines the efficiency of ZeRO.
Implementation
To solve this problem, I refactored the Chunk class. The new chunk is distributed evenly across all processes, so all processes can move their data from CPU memory to CUDA memory at the same time. Furthermore, I provide an option to enable pinned memory for chunks: every chunk can keep a copy in pinned CPU memory. These optimizations prominently improve the efficiency of data movement between CPU and CUDA.
Another Advantage
The new ZeRO supports true hybrid parallelism. It creates separate chunk groups for parameters that belong to different DP communication groups. This brings great flexibility to our upcoming automatic configuration of parallelism.
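The even distribution and the per-group chunks could be sketched as below. The names (shard_bounds, group_chunks_by_dp_group) and the padding assumption are illustrative, not the actual ChunkManager API:

```python
def shard_bounds(chunk_size, world_size, rank):
    """Return the [start, end) slice of the chunk owned by `rank`,
    assuming the chunk is padded to a multiple of world_size so that
    every rank holds a shard of identical size and can move it
    CPU <-> CUDA concurrently with the other ranks."""
    shard_size = -(-chunk_size // world_size)  # ceil division
    start = rank * shard_size
    return start, start + shard_size


def group_chunks_by_dp_group(params, dp_group_of):
    """Bucket parameters by their DP communication group, so each
    group gets its own set of chunks (the 'chunk groups' above)."""
    groups = {}
    for p in params:
        groups.setdefault(dp_group_of(p), []).append(p)
    return groups
```

With identical shard sizes, every rank issues the same-shaped transfer at the same time, which is what removes the asymmetry described in the Background section.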