
Issues on torchgpipe project and paper #28

Open
xshaun opened this issue Jun 9, 2021 · 1 comment
Labels
question (Further information is requested)

Comments


xshaun commented Jun 9, 2021

Thanks for sharing this project and paper.

I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the streams wrapped by torch?

  1. According to NVIDIA's webinar slides (https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf), it is impossible to run kernels on the default stream and on other streams simultaneously. If developers want to overlap communication and computation with two or more streams, they have to explicitly create non-blocking streams; the default stream cannot be one of them.
  2. Besides, due to the Python GIL, it also seems hard or impossible to launch kernels into several non-blocking streams simultaneously from Python (see the sketch below for what such launches look like).

Or did you introduce other techniques to address this issue?
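
For concreteness, here is roughly what launching kernels onto several non-blocking streams from a single Python thread looks like (a minimal sketch using `torch.cuda.Stream`; the tensor names and sizes are illustrative, not from torchgpipe):

```python
import torch

# Two non-default CUDA streams.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

# The inputs were initialized on the default stream, so both side
# streams must wait for that work before reading them.
s1.wait_stream(torch.cuda.default_stream())
s2.wait_stream(torch.cuda.default_stream())

with torch.cuda.stream(s1):
    c = a @ a  # kernel launch is asynchronous; this call returns immediately
with torch.cuda.stream(s2):
    d = b @ b  # enqueued while s1's kernel may still be running

torch.cuda.synchronize()  # single blocking call at the very end
```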

My doubts are as follows:

  1. Figure 5 in your paper shows kernels running on both the default stream and non-blocking streams. Is that accurate?
  2. Figure 7 shows the improvement you achieved, profiled as a kernel timeline with the NVIDIA Nsight tool. If we zoom into that timeline, do the kernels actually overlap, or do they still execute sequentially overall despite using more streams?
  3. Is it possible that the improvement comes from reduced idle gaps between kernels rather than from overlapping communication and computation?

Thanks for your time and answer.

chiheonk (Contributor) commented Jun 9, 2021

Hi xshaun,

  1. Indeed, copy kernels can overlap with execution kernels (on reasonably new GPUs). We used a non-default stream only for copy kernels, since execution kernels do not need to run simultaneously.
  2. We have seen the overlaps between copy kernels and execution kernels; see the sketch after this list. For example, the first GPU in GPipe copies out the result of the first micro-batch and simultaneously starts computation of the second micro-batch.
     However, as you said, execution kernels cannot overlap each other, and copy kernels cannot overlap each other if they share the destination device.
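
A minimal sketch of this pattern, with execution kernels on the default stream and copy kernels on a dedicated stream, might look like the following. This is illustrative only, not torchgpipe's actual implementation; the model, sizes, and `prefetch` helper are made up:

```python
import torch

device = torch.device("cuda:0")
model = torch.nn.Linear(4096, 4096).to(device)

# Dedicated non-default stream used only for copy kernels;
# execution kernels stay on the default stream.
copy_stream = torch.cuda.Stream(device=device)

# Pinned host memory is required for truly asynchronous H2D copies.
micro_batches = [torch.randn(64, 4096).pin_memory() for _ in range(4)]

def prefetch(i):
    # Issue the copy on the copy stream so it can overlap with
    # whatever execution kernel is running on the default stream.
    with torch.cuda.stream(copy_stream):
        return micro_batches[i].to(device, non_blocking=True)

outputs = []
current = prefetch(0)
for i in range(len(micro_batches)):
    # Compute must not read `current` before its copy has finished.
    torch.cuda.default_stream(device).wait_stream(copy_stream)
    # Tell the caching allocator that `current` is consumed on the
    # default stream, so its memory is not recycled too early.
    current.record_stream(torch.cuda.default_stream(device))
    # The copy of the next micro-batch overlaps with the compute below.
    nxt = prefetch(i + 1) if i + 1 < len(micro_batches) else None
    outputs.append(model(current))  # execution kernel on the default stream
    current = nxt
```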

chiheonk closed this as completed Jun 9, 2021
chiheonk reopened this Jun 9, 2021
chiheonk added the question label Jun 9, 2021