
Question about worker thread in GPipe #2

Closed
842974287 opened this issue Aug 2, 2019 · 4 comments

@842974287 842974287 commented Aug 2, 2019

Hi, thanks for the fantastic work!

I have a question about the micro-batch lockstep.

out_queue.join()

In the comment it says "During this partition is executing a micro-batch, to copy a micro-batch by the next partition would be blocked". Why would dumping the message into the queue be blocked by the next partition thread?
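For context, Python's `queue.Queue.join()` blocks the caller until every item that was `put()` has been marked finished by a `task_done()` call from the consumer. A minimal sketch of that blocking behavior (illustrative names only, not torchgpipe's actual worker code):

```python
import queue
import threading
import time

out_queue = queue.Queue()
events = []  # records the order of operations

def next_partition():
    """Consumer: pretend to copy and process one micro-batch."""
    batch = out_queue.get()
    events.append(f"copied {batch}")
    time.sleep(0.1)  # simulate work done by the next partition
    events.append(f"done {batch}")
    out_queue.task_done()  # this is what unblocks the producer's join()

t = threading.Thread(target=next_partition)
t.start()

out_queue.put("micro-batch-0")
out_queue.join()  # blocks here until next_partition calls task_done()
events.append("producer resumed")
t.join()

print(events)  # producer resumes only after the consumer is finished
```

So the `put()` itself does not block; it is the `join()` right after it that waits on the next partition's thread.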

@mrJeong mrJeong added the question label Aug 2, 2019
@sublee sublee self-assigned this Aug 2, 2019
@sublee sublee commented Aug 2, 2019

Hi, thanks for diving into our code.

When we copy a PyTorch tensor from one GPU to another, the copy might be delayed until both GPUs finish executing all of their scheduled CUDA kernels. In other words, a GPU-to-GPU tensor copy requires synchronizing both GPUs.

GPipe requires users to design well-balanced partitions to achieve optimal performance, so we can assume every partition has a similar computational cost. But a partition doesn't produce just one CUDA kernel; each partition produces several. Even if the total kernel cost per partition is almost identical, the cost of each individual kernel may be jagged.

The lockstep approach ensures that copy commands are registered after the final kernel of each partition, rather than in the middle of one. This deterministic behavior reduces the frequency of delayed copies. The diagram below compares how the lockstep minimizes delayed copies:

[diagram comparing delayed copies with and without the lockstep]

Furthermore, the time later partitions take to compute the first micro-batch matters more than the time earlier partitions take to compute the last micro-batch. The lockstep approach makes this priority explicit to the GPUs.
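As a simplified sketch of the lockstep (illustrative names only, not torchgpipe's actual worker code, and with plain Python threads standing in for CUDA streams): each stage blocks on `out_queue.join()` after handing off a micro-batch, so the next stage's copy always happens between whole partitions, never in the middle of one.

```python
import queue
import threading

NUM_MICROBATCHES = 3
log = []
log_lock = threading.Lock()

def record(*event):
    with log_lock:
        log.append(event)

def partition(name, in_q, out_q):
    """One pipeline stage: copy a micro-batch in, compute, hand it off."""
    for _ in range(NUM_MICROBATCHES):
        batch = in_q.get()
        record(name, "copy", batch)
        in_q.task_done()           # copy finished: unblock the previous stage
        record(name, "compute", batch)
        if out_q is not None:
            out_q.put(batch)
            out_q.join()           # lockstep: wait for the next stage's copy

q0, q1 = queue.Queue(), queue.Queue()
p1 = threading.Thread(target=partition, args=("p1", q0, q1))
p2 = threading.Thread(target=partition, args=("p2", q1, None))
p1.start(); p2.start()

for b in range(NUM_MICROBATCHES):
    q0.put(b)

p1.join(); p2.join()
print(log)
```

In the log, `p1` never starts computing micro-batch `b + 1` until `p2` has copied micro-batch `b`, which is the ordering guarantee described above.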

I also attached the timeline comparison:

[timeline comparison of execution with and without the lockstep]


@842974287 842974287 commented Aug 5, 2019

Thanks a lot for the detailed explanation!

For the timeline comparison, what do the blue and red represent, though?

@842974287 842974287 closed this Aug 5, 2019

@sublee sublee commented Aug 6, 2019

Blue and red represent computation and copy, respectively. The timeline was captured with NVIDIA Nsight Systems, a visual CUDA profiler.


@842974287 842974287 commented Aug 7, 2019

I see, thank you so much!
