
Add non_blocking kwarg to send_to_device()#607

Merged
NouamaneTazi merged 6 commits into huggingface:main from NouamaneTazi:asynchronous_to
Oct 5, 2022
Conversation

@NouamaneTazi
Member

@NouamaneTazi NouamaneTazi commented Aug 5, 2022

The .to() call seems to throttle the forward pass by triggering a cudaStreamSynchronize every time the AlignDevicesHook is called.
For example, I managed to reduce the wall time of a single forward pass of bloom-175b by 15% (with profiling on) by applying this fix.

cc @sgugger

- to make .to() run asynchronously
Collaborator

@sgugger sgugger left a comment


Thanks for your PR!

Since this method is used everywhere (and not just in big model inference), can we expose the non_blocking argument (defaulting to True) on send_to_device? That way, if we see it cause problems in another part of the lib, we can turn it off without touching big model inference.
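The suggested API can be sketched roughly like this (a hypothetical, simplified version of send_to_device written for illustration; accelerate's real implementation handles more container types, and the default shown here is False, matching what was ultimately merged):

```python
# Minimal sketch: thread a non_blocking kwarg through nested containers.
# Hypothetical simplification of send_to_device(), not accelerate's code.

def send_to_device(obj, device, non_blocking=False):
    """Recursively move anything with a .to() method to `device`."""
    if hasattr(obj, "to"):
        # For torch tensors, non_blocking=True queues the copy on the
        # CUDA stream without forcing a cudaStreamSynchronize on the host.
        return obj.to(device, non_blocking=non_blocking)
    if isinstance(obj, dict):
        return {k: send_to_device(v, device, non_blocking) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(send_to_device(v, device, non_blocking) for v in obj)
    return obj  # non-tensor leaves are returned unchanged
```

Keeping a single kwarg at this level means callers such as big model inference can opt in, while every other call site keeps the synchronous behavior.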

@NouamaneTazi NouamaneTazi marked this pull request as ready for review August 5, 2022 14:10
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@NouamaneTazi NouamaneTazi requested a review from sgugger August 5, 2022 14:35
@github-actions
Contributor

github-actions bot commented Sep 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Sep 12, 2022
@Chris-hughes10
Contributor

Hi @sgugger, was there a reason that this was closed and not merged? I came across this when I was about to request the same thing.

@sgugger
Collaborator

sgugger commented Sep 26, 2022

@Chris-hughes10 it was automatically closed by the stale bot due to the absence of activity. @NouamaneTazi is still investigating whether this could have some negative impact sometimes.

@NouamaneTazi
Member Author

NouamaneTazi commented Oct 5, 2022

Should we make .to(non_blocking=True) the default in accelerate?

Let’s use this script to find out:

import torch
if __name__ == '__main__':
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    stream = torch.cuda.current_stream()

    x = torch.rand(32, 256, 220, 220).cuda()

    t = (x.min() - x.max()).to(torch.device("cpu"), non_blocking=False) # try with False then True
    # t = (x.min() - x.max()).to("cuda:1", non_blocking=True) # try GPU0 to GPU1 copy

    print(stream.query()) # False - Checks if all the work submitted has been completed.
    print(t)
    stream.synchronize() # wait for stream to finish the work
    
    print(stream.query()) # True - work done
    print(t)
Copy to CPU with non_blocking=False (default)

In this case, .to() adds a cudaStreamSynchronize op, which makes the CPU see the correct value of the tensor when printing.
[profiler trace screenshot]

Copy to CPU with non_blocking=True

In this case, the CPU submits the kernels for .to() to the GPU, then moves on to the print operation, which uses an incorrect value for the tensor, tensor(0.) (the dangerous part).
[profiler trace screenshot]

Copy to another GPU with non_blocking=True

It seems that non_blocking doesn’t do much here (we get basically the same thing as with non_blocking=False). In both cases GPU 1 waits for GPU 0 to finish working on the tensor, THEN copies it to GPU 1, and finally the CPU prints the tensor that now lives on GPU 1.
In this case .to() creates a cudaStreamWaitEvent event (figure 2), which makes GPU 1 wait for GPU 0. I opened an issue on PyTorch’s forums to investigate why this is the case.

[profiler trace screenshot (figure 1)]

[profiler trace screenshot (figure 2)]

tl;dr: non_blocking can be a game changer for using your GPUs efficiently.

  • Good use scenario: copying your data from CPU to GPU in a non_blocking way, then running your model, which lives on the GPU (the CPU launches the copy kernel, then moves on to queuing the model’s kernels on the GPU, instead of waiting for the copy to finish before launching them). Example from PyTorch's repo.
  • Bad use scenario: copying your data from CPU to GPU in a non_blocking way, then starting operations on the CPU that use the not-yet-ready tensors (could be in if statements, or simple arithmetic...).
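The good-use scenario above can be sketched as follows (a hypothetical example; the function name and loop are mine; note that the source tensor must live in pinned memory for a host-to-device copy to actually be asynchronous — with pageable memory, .to(non_blocking=True) silently falls back to a synchronous copy):

```python
import torch

def run_batches(model, batches, device):
    """Overlap host-to-device copies with GPU compute.

    `batches` are CPU tensors; pinning them lets .to(non_blocking=True)
    return immediately while the DMA transfer proceeds, so the host keeps
    queuing the model's kernels instead of stalling on the copy.
    """
    outputs = []
    for batch in batches:
        if device.type == "cuda":
            batch = batch.pin_memory()  # pageable memory would force a sync copy
        x = batch.to(device, non_blocking=True)
        # Safe: the model's kernels run on the same CUDA stream as the copy,
        # so they only execute once the tensor has fully arrived.
        outputs.append(model(x))
    return outputs
```

On a CPU-only machine, non_blocking is simply ignored, so the sketch degrades gracefully; the danger only appears if the host itself reads the tensor's values before synchronizing.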

=> It’s good to support that argument in accelerate, but it’s better to keep the default as it is, just as PyTorch does.

@NouamaneTazi NouamaneTazi reopened this Oct 5, 2022
@Chris-hughes10
Contributor

Chris-hughes10 commented Oct 5, 2022

@NouamaneTazi These are really interesting insights! If you are already inspecting signatures, do you think it would add too much complexity to set non_blocking automatically, based on whether it is a CPU -> GPU or GPU -> CPU transfer, by inspecting the tensor's current device?

If you like the idea, perhaps this would be a different feature though, toggled by an accelerator flag?

@NouamaneTazi
Member Author

NouamaneTazi commented Oct 5, 2022

@Chris-hughes10 CPU -> GPU and GPU -> CPU both lead to the same issues mentioned above. Only GPU -> GPU is safe, but as I said above, it seems to require synchronization between the two GPUs whether we set non_blocking=True or not.

@Chris-hughes10
Contributor

@Chris-hughes10 CPU -> GPU and GPU -> CPU both lead to the same issues mentioned above. Only GPU -> GPU is safe, but as I said above, it seems to require synchronization between the two GPUs whether we set non_blocking=True or not.

Apologies, I misread the post above. Please ignore my previous suggestion!

@NouamaneTazi NouamaneTazi changed the title Make send_to_device() run asynchronously Add non_blocking kwarg to send_to_device() Oct 5, 2022
@NouamaneTazi NouamaneTazi merged commit 66edfe1 into huggingface:main Oct 5, 2022
@NouamaneTazi NouamaneTazi deleted the asynchronous_to branch October 5, 2022 18:52