
Add non_blocking kwarg to send_to_device()#607

Merged
NouamaneTazi merged 6 commits into huggingface:main from NouamaneTazi:asynchronous_to
Oct 5, 2022
Conversation

@NouamaneTazi
Member

@NouamaneTazi NouamaneTazi commented Aug 5, 2022

The .to() call seems to throttle the forward pass by triggering a cudaStreamSynchronize every time the AlignDevicesHook is called.
For example, I managed to reduce the wall time of a single forward pass of bloom-175b by 15% (with profiling on) by applying this fix.

cc @sgugger

- to make .to() run asynchronously
Collaborator

@sgugger sgugger left a comment


Thanks for your PR!

Since this method is used everywhere (and not just in big model inference), can we expose the non_blocking argument (defaulting to True) on send_to_device? That way, if we see it cause problems in another part of the lib, we can turn it off without touching big model inference.
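The suggested API can be sketched roughly like this (a hypothetical, simplified version of send_to_device written for illustration; accelerate's real implementation handles more container types, and the default shown here is False, matching what was ultimately merged):

```python
# Minimal sketch: thread a non_blocking kwarg through nested containers.
# Hypothetical simplification of send_to_device(), not accelerate's code.

def send_to_device(obj, device, non_blocking=False):
    """Recursively move anything with a .to() method to `device`."""
    if hasattr(obj, "to"):
        # For torch tensors, non_blocking=True queues the copy on the
        # CUDA stream without forcing a cudaStreamSynchronize on the host.
        return obj.to(device, non_blocking=non_blocking)
    if isinstance(obj, dict):
        return {k: send_to_device(v, device, non_blocking) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(send_to_device(v, device, non_blocking) for v in obj)
    return obj  # non-tensor leaves are returned unchanged
```

Keeping a single kwarg at this level means callers such as big model inference can opt in, while every other call site keeps the synchronous behavior.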

@NouamaneTazi NouamaneTazi marked this pull request as ready for review August 5, 2022 14:10
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@NouamaneTazi NouamaneTazi requested a review from sgugger August 5, 2022 14:35
@github-actions
Contributor

github-actions bot commented Sep 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Sep 12, 2022
@Chris-hughes10
Contributor

Hi @sgugger, was there a reason that this was closed and not merged? I came across this when I was about to request the same thing.

@sgugger
Collaborator

sgugger commented Sep 26, 2022

@Chris-hughes10 it was automatically closed by the stale bot due to the absence of activity. @NouamaneTazi is still investigating whether this could have some negative impact sometimes.

@NouamaneTazi
Member Author

NouamaneTazi commented Oct 5, 2022

Should we make .to(non_blocking=True) the default in accelerate?

Let’s use this script to find out:

import torch
if __name__ == '__main__':
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    stream = torch.cuda.current_stream()

    x = torch.rand(32, 256, 220, 220).cuda()

    t = (x.min() - x.max()).to(torch.device("cpu"), non_blocking=False) # try with False then True
    # t = (x.min() - x.max()).to("cuda:1", non_blocking=True) # try GPU0 to GPU1 copy

    print(stream.query()) # False - Checks if all the work submitted has been completed.
    print(t)
    stream.synchronize() # wait for stream to finish the work
    
    print(stream.query()) # True - work done
    print(t)
Copy to CPU with non_blocking=False (default)

In this case, .to() adds a cudaStreamSynchronize op, which makes the CPU see the correct value of the tensor when printing.
[profiler trace screenshot]

Copy to CPU with non_blocking=True

In this case, the CPU submits the kernels for .to() to the GPU, then moves on to the print operation, which uses an incorrect value for the tensor, tensor(0.) (the dangerous part).
[profiler trace screenshot]

Copy to another GPU with non_blocking=True

It seems that non_blocking doesn’t do much here (we get basically the same thing as with non_blocking=False). In both cases GPU 1 waits for GPU 0 to finish working on the tensor, THEN copies it to GPU 1, and finally the CPU prints the tensor that now lives on GPU 1.
In this case .to() creates a cudaStreamWaitEvent event (figure 2), which makes GPU 1 wait for GPU 0. I opened an issue on PyTorch’s forums to investigate why this is the case.

[profiler trace screenshot (figure 1)]

[profiler trace screenshot (figure 2)]

tl;dr: non_blocking can be a game changer for using your GPUs efficiently.

  • Good use scenario: copying your data from CPU to GPU in a non_blocking way, then running your model, which lives on the GPU (the CPU launches the copy kernel, then moves on to queuing the model’s kernels on the GPU, instead of waiting for the copy to finish before launching them). Example from PyTorch's repo.
  • Bad use scenario: copying your data from CPU to GPU in a non_blocking way, then starting operations on the CPU that use the not-yet-ready tensors (could be in if statements, or simple arithmetic...).
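The good-use scenario above can be sketched as follows (a hypothetical example; the function name and loop are mine; note that the source tensor must live in pinned memory for a host-to-device copy to actually be asynchronous — with pageable memory, .to(non_blocking=True) silently falls back to a synchronous copy):

```python
import torch

def run_batches(model, batches, device):
    """Overlap host-to-device copies with GPU compute.

    `batches` are CPU tensors; pinning them lets .to(non_blocking=True)
    return immediately while the DMA transfer proceeds, so the host keeps
    queuing the model's kernels instead of stalling on the copy.
    """
    outputs = []
    for batch in batches:
        if device.type == "cuda":
            batch = batch.pin_memory()  # pageable memory would force a sync copy
        x = batch.to(device, non_blocking=True)
        # Safe: the model's kernels run on the same CUDA stream as the copy,
        # so they only execute once the tensor has fully arrived.
        outputs.append(model(x))
    return outputs
```

On a CPU-only machine, non_blocking is simply ignored, so the sketch degrades gracefully; the danger only appears if the host itself reads the tensor's values before synchronizing.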

=> It’s good to support that argument in accelerate, but it’s better to keep the default as it is, just as PyTorch does.

@NouamaneTazi NouamaneTazi reopened this Oct 5, 2022
@Chris-hughes10
Contributor

Chris-hughes10 commented Oct 5, 2022

@NouamaneTazi These are really interesting insights! If you are already inspecting signatures, do you think it would add too much complexity to set non_blocking automatically, based on whether it is a CPU -> GPU or GPU -> CPU transfer, by inspecting the tensor's current device?

If you like the idea, perhaps this would be a different feature though, toggled by an accelerator flag?

@NouamaneTazi
Member Author

NouamaneTazi commented Oct 5, 2022

@Chris-hughes10 CPU -> GPU and GPU -> CPU both lead to the same issues mentioned above. Only GPU -> GPU is safe, but as I said above, it seems to require synchronization between the two GPUs whether we set non_blocking=True or not.

@Chris-hughes10
Contributor

@Chris-hughes10 CPU -> GPU and GPU -> CPU both lead to the same issues mentioned above. Only GPU -> GPU is safe, but as I said above, it seems to require synchronization between the two GPUs whether we set non_blocking=True or not.

Apologies, I misread the post above. Please ignore my previous suggestion!

@NouamaneTazi NouamaneTazi changed the title Make send_to_device() run asynchronously Add non_blocking kwarg to send_to_device() Oct 5, 2022
@NouamaneTazi NouamaneTazi merged commit 66edfe1 into huggingface:main Oct 5, 2022
@NouamaneTazi NouamaneTazi deleted the asynchronous_to branch October 5, 2022 18:52