Hi, thanks for the fantastic work!
I have a question about the micro-batch lockstep.
In the comment it says "During this partition is executing a micro-batch, to copy a micro-batch by the next partition would be blocked". Why would dumping the message into the queue be blocked by the next partition thread?
Hi, thanks for diving into our code.
When we copy a PyTorch tensor from one GPU to another, the copy may be delayed until both GPUs have finished executing all of their scheduled CUDA kernels. In other words, a GPU-to-GPU tensor copy requires synchronizing both GPUs.
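As a simplified timing model (my own illustration, not PyTorch's actual stream semantics): the copy cannot begin until both devices have drained everything already scheduled on them, so the busier GPU dictates when it starts.

```python
def copy_start(src_busy_until: float, dst_busy_until: float,
               issue_time: float) -> float:
    """Simplified model: a GPU-to-GPU copy issued at issue_time starts
    only once both GPUs have finished all previously scheduled kernels."""
    return max(issue_time, src_busy_until, dst_busy_until)

# Source GPU is free at t=5, destination is busy until t=9:
# a copy issued at t=6 is delayed to t=9 even though the data is ready.
print(copy_start(5.0, 9.0, 6.0))  # → 9.0
```

The `copy_start` helper and the numbers are hypothetical; they only illustrate that the later of the two GPUs determines when the copy runs.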
GPipe requires users to design well-balanced partitions to achieve optimal performance, so we can assume every partition has a similar computational cost. But a partition doesn't produce just one CUDA kernel; each partition produces several. Even if the total kernel cost per partition is almost identical, the cost of each individual kernel may be jagged.
The lockstep approach ensures that copy commands are registered behind the final kernel of each partition, rather than in the middle of a partition. This deterministic behavior reduces the frequency of delayed copies. The diagram below compares how the lockstep minimizes delayed copies:
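The same effect can be shown with a simplified serial-stream model (my own sketch with illustrative kernel costs, not the library's code): items on a stream run in enqueue order, so a copy registered directly behind a micro-batch's final kernel finishes early, while a copy that slips behind the next micro-batch's kernels is delayed.

```python
def copy_completion(stream):
    """Items on a CUDA stream execute serially in enqueue order.
    Return the finish time of the item tagged 'copy' (simplified model)."""
    t = 0.0
    for name, duration in stream:
        t += duration
        if name == "copy":
            return t
    raise ValueError("no copy on stream")

mb0 = [("k0", 3.0), ("k1", 1.0), ("k2", 4.0)]  # jagged kernels, micro-batch 0
mb1 = [("k0", 3.0), ("k1", 1.0), ("k2", 4.0)]  # same partition, micro-batch 1
copy = [("copy", 0.5)]

# Lockstep: the copy of micro-batch 0's output is registered directly
# behind its final kernel, so it completes at t = 8.5:
print(copy_completion(mb0 + copy + mb1))  # → 8.5

# Without the lockstep, the next micro-batch's kernels can be enqueued
# first, delaying the copy (and the next partition) until t = 16.5:
print(copy_completion(mb0 + mb1 + copy))  # → 16.5
```

All names and durations here are hypothetical; the point is only the enqueue-order dependence described above.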
Furthermore, the time it takes later partitions to compute the first micro-batch matters more than the time it takes earlier partitions to compute the last micro-batch. The lockstep approach makes this priority explicit to the GPUs.
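One way to picture this priority is the diagonal pipeline schedule: at each clock tick, micro-batch i runs on partition j where i + j equals the tick, so the first micro-batch reaches the last partition as early as possible. The sketch below illustrates that idea (it mirrors the clock-cycle concept, not necessarily the library's exact implementation):

```python
def clock_cycles(m, n):
    """Yield, for each clock tick, the (micro-batch, partition) tasks
    that can run in parallel: micro-batch i on partition j when i+j == k."""
    for k in range(m + n - 1):
        yield [(k - j, j) for j in range(max(0, k - m + 1), min(k + 1, n))]

# With 3 micro-batches and 2 partitions, micro-batch 0 reaches the last
# partition at clock 1, before micro-batch 2 even starts at clock 2:
for tasks in clock_cycles(3, 2):
    print(tasks)
# [(0, 0)]
# [(1, 0), (0, 1)]
# [(2, 0), (1, 1)]
# [(2, 1)]
```

The schedule makes the dependency explicit: later partitions never wait on earlier partitions' trailing micro-batches longer than necessary.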
I also attached the timeline comparison: