
Conversation


@Narsil Narsil commented Sep 14, 2022

While investigating the performance of diffusers, I figured out a relatively simple way to get a 35% speedup on the default Stable Diffusion pipeline.

Simply by removing autocast.

Happy to add any test that might be worthwhile, or move some of the casting to different places.

Ran on Titan RTX:

Before

Took 0:00:18.994290

After

Took 0:00:14.049952

The reason autocast adds so much overhead is that a few tensors were still in fp32, so EVERY single operation afterwards would downcast them to fp16 for the op and upcast the result back to fp32 afterwards, leading to an insane number of tensor copies.
Screenshot from 2022-09-14 17-17-48

Since it seems very easy to end up with inefficient code when using autocast, I took the liberty of removing it in the library itself.
If that's OK with the maintainers, I would like to remove it from the docs too.
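For illustration, here is a minimal sketch of the failure mode (not part of this PR; the tensors are made up and a CUDA device is assumed): under autocast, a stray fp32 tensor is silently papered over with a cast (an extra copy) on every op, whereas without autocast the same mismatch fails loudly and gets fixed at the source.

import torch

weight = torch.randn(64, 64, device="cuda", dtype=torch.float16)
stray = torch.randn(1, 64, device="cuda")  # accidentally left in fp32

with torch.autocast("cuda"):
    out = stray @ weight  # works, but autocast inserts a cast (an extra copy) on every call

try:
    out = stray @ weight  # without autocast, the dtype mismatch is an immediate error
except RuntimeError as err:
    print(err)  # e.g. complains that mat1 and mat2 do not have the same dtype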


HuggingFaceDocBuilderDev commented Sep 14, 2022

The documentation is not available anymore as the PR was closed or merged.

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
# before this PR:
#     with autocast("cuda"):
#         image = pipe(prompt).images[0]
# after this PR:
image = pipe(prompt).images[0]
Contributor

Are you sure this is faster? Using autocast currently (before this PR) gives a 2x boost in generation speed.

Will also test a bit locally on a GPU tomorrow.

Contributor Author

This is extremely surprising, but I am also measuring a 2x speedup with autocast on fp32.

I am looking into it. I do see copies, but not nearly the same amount; there's probably a device-to-host / host-to-device transfer somewhere that kills performance, but I haven't found it yet.

@Narsil Narsil Sep 15, 2022

Okay, I figured it out.

autocast will actually use fp16 for some ops, based on heuristics: https://pytorch.org/docs/stable/amp.html#cuda-op-specific-behavior

So it's faster because it's actually running in fp16 even though the model was loaded in fp32, and without autocast it's slower simply because it really is running in fp32.

If we enable real fp16 with its big performance boost, I feel like we shouldn't need autocast for fp32 (pure fp32 is slower, but also "more" correct). Some ops are kind of dangerous to actually run in fp16, but we should be able to tell them apart (and for now the generations seem to still be good enough even when running everything in fp16).

But it's still a nice way to get fp16 running "for free". The heuristics they use seem OK (though, as they mention, they probably wouldn't work for gradients, I guess because fp16 overflows faster).
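For reference, here is a tiny sketch of that heuristic in action (illustrative only, assuming a CUDA device): matmul is on autocast's fp16 op list, while precision-sensitive ops such as softmax are kept in fp32.

import torch

a = torch.randn(1024, 1024, device="cuda")  # fp32
b = torch.randn(1024, 1024, device="cuda")  # fp32

with torch.autocast("cuda"):
    c = a @ b                     # autocast runs matmul in fp16
    s = torch.softmax(c, dim=-1)  # autocast runs softmax in fp32

print(c.dtype, s.dtype)  # torch.float16 torch.float32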

@patrickvonplaten patrickvonplaten Oct 5, 2022

Commented extensively here: #511

Should we maybe change the example to load native fp16 weights then?

Contributor

So replace:

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)

by

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, revision="fp16")

Contributor Author

That seems OK. Is fp16 supported by enough GPU cards at this point? That would be my only point of concern.
But given the speed difference, advocating for fp16 is definitely something we should do!

Contributor

Yes, I think fp16 is widely supported now.

@patrickvonplaten patrickvonplaten left a comment

Overall this looks very good to me. I'm not 100% sure whether we're getting a big speed-up if everything is in fp32, though - will test tomorrow.


keturn commented Sep 14, 2022

Do you think it would be practical to take something from test_pipelines and make some sort of pipeline benchmark test from it?

I've always been reluctant to try to write performance tests when CI is made out of The Cloud, but if you can trust the tests will run somewhere with a consistent execution environment, then I think it could be very informative.


Narsil commented Sep 15, 2022

I've always been reluctant to try to write performance tests when CI is made out of The Cloud, but if you can trust the tests will run somewhere with a consistent execution environment, then I think it could be very informative.

100% agree, I was not thinking about performance, merely correctness (the pipeline works in fp16 without autocast).
This is the biggest point with autocast: if ANY tensor forgets to be instantiated in the "correct" dtype, then a huge number of copies will probably be added. So making sure we can run without it saves us from this issue creeping up again. On a pure fp16 pipeline, autocast is then a slight slowdown, but only because it will UPcast the ops it decides should run in fp32 (which might be more correct).

Edit: Added a slow test.
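For context, a rough sketch of what such a slow test could look like (hypothetical; this is not the actual test added in the PR): load the fp16 weights, run the pipeline without autocast, and assert that the call succeeds and produces an image of the expected size.

import torch
from diffusers import StableDiffusionPipeline

def test_stable_diffusion_fp16_no_autocast():
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, revision="fp16"
    ).to("cuda")
    generator = torch.Generator(device="cuda").manual_seed(0)
    # a couple of steps is enough to exercise every dtype path
    image = pipe("an astronaut riding a horse", generator=generator, num_inference_steps=2).images[0]
    assert image.size == (512, 512)  # PIL size is (width, height)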


keturn commented Sep 27, 2022

I've been successfully using some code based on this patch. The one change I've had to make is to the stochastic schedulers such as DDIM. Their use of randn needs to be adjusted to use an explicit dtype, in much the same way the randn call has been changed in pipeline_stable_diffusion here.
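Something along these lines, presumably (a sketch of the kind of change described, not the actual scheduler code): any noise the scheduler draws has to inherit the latents' device and dtype, otherwise an fp32 tensor sneaks back into an otherwise fp16 pipeline.

import torch

def scheduler_noise_like(latents: torch.Tensor, generator=None) -> torch.Tensor:
    # explicit device/dtype, mirroring the randn change in pipeline_stable_diffusion
    return torch.randn(
        latents.shape,
        generator=generator,
        device=latents.device,
        dtype=latents.dtype,
    )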


keturn commented Sep 30, 2022

With #371 merged, this needs conflicts resolved.

I think #371 actually included most, if not all, of the changes from this PR, but the DDIM issue I mentioned seems to still exist when I try to call the pipeline without autocast.

@patrickvonplaten

Sorry to be so late here @Narsil! Do you think there is a way to extract everything out of this PR that was not already merged in #371?


Narsil commented Oct 4, 2022

I think it included most of the changes here - sorry, I'm also lagging behind on my GH notifications.

For DDIM I don't know, but probably all schedulers/snippets need to handle precision.

I still advocate removing autocast from all snippets (and making sure they work).
If we don't, the latency regression is likely to come back very fast.


patrickvonplaten commented Oct 4, 2022

@NouamaneTazi do you have time by any chance to rebase this PR on current main and see if there is anything we forgot to add in your PR, #371?

Also, I agree with @Narsil that we should update the docs / readmes to not have "autocast" if pure fp16 is faster and less error-prone. @NouamaneTazi, if you have time, maybe we could go through the docs together and see if we can remove autocast in favor of pure fp16, provided we get more or less the same visual results.

    device=latents_device,
    dtype=text_embeddings.dtype,
)
if self.device.type == "mps":
Contributor

@pcuenca could you take a look here?

Member

Yes, I commented in the other conversation; I think it's OK like this.

diff = np.abs(image_chunked.flatten() - image.flatten())
# They ARE different since ops are not always run at the same precision;
# however, they should be extremely close.
assert diff.mean() < 2e-2
Contributor

this test tolerance seems a bit high to me... => will play around with it a bit!

@Narsil Narsil Oct 5, 2022

It's roughly the same difference as the pure fp16 run in #371.

Running some ops directly in fp32 instead of fp16 (which is what autocast will do) does change some patches of the image.
This is less than 2% total variance in the images, so it really doesn't show visually. (Much less than some other changes in #371, where I think some visual differences were actually visible.)

@patrickvonplaten

@Narsil I ran a quick benchmark which I think is in line with your findings:

The following script:

#!/usr/bin/env python3
from torch import autocast
from diffusers import StableDiffusionPipeline
import torch
from time import time
pipe_fp32 = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe_fp16 = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, revision="fp16").to("cuda")


prompt = 4 * ["Dark, eerie forest with beautiful sunshine."]


def benchmark(name, do_autocast, pipe, generator):
    print(name)
    start = time()
    if not do_autocast:
        image = pipe(prompt, generator=generator).images[0]
    else:
        with autocast("cuda"):
            image = pipe(prompt, generator=generator).images[0]
    image.save(name + ".png")
    end = time()
    print("time", end - start)
    print(50 * "-")


torch_device = "cuda"
name = "FP32 - Autocast"
generator = torch.Generator(device=torch_device).manual_seed(0)
benchmark(name, True, pipe_fp32, generator)

name = "FP32"
generator = torch.Generator(device=torch_device).manual_seed(0)
benchmark(name, False, pipe_fp32, generator)

name = "FP16 - Autocast"
generator = torch.Generator(device=torch_device).manual_seed(0)
benchmark(name, True, pipe_fp16, generator)

name = "FP16"
generator = torch.Generator(device=torch_device).manual_seed(0)
benchmark(name, False, pipe_fp16, generator)

gives these results on a TITAN RTX:

FP32 - Autocast
100%|██████████| 51/51 [00:19<00:00,  2.62it/s]
time 20.597806692123413
--------------------------------------------------
FP32
100%|██████████| 51/51 [00:37<00:00,  1.34it/s]
time 38.97595691680908
--------------------------------------------------
FP16 - Autocast
100%|██████████| 51/51 [00:19<00:00,  2.61it/s]
time 20.543357133865356
--------------------------------------------------
FP16
100%|██████████| 51/51 [00:14<00:00,  3.58it/s]
time 15.096464395523071
--------------------------------------------------

The saved images are all identical - in general it doesn't seem like fp16 is hurting the results (see: https://huggingface.co/datasets/patrickvonplaten/images/tree/main/to_delete) => so should we maybe change all of the docs to advertise native fp16, meaning we'll advertise the following in all our examples:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, revision="fp16").to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

It seems to make the most sense given that we already advertise everywhere that it should be run on a GPU, no?

What do you think @Narsil @pcuenca @patil-suraj @anton-l ?


anton-l commented Oct 5, 2022

@patrickvonplaten I agree!

But just to be on the safe side I think we need to throw a custom informative error when users try to use the fp16 weights on the CPU, as RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' isn't too useful there. Happy to do that in another PR :)
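Something like the following could do it (a hypothetical sketch of such a check, not the actual implementation that ended up in the library):

import torch

def ensure_fp16_on_gpu(pipe):
    for module in (pipe.unet, pipe.vae, pipe.text_encoder):
        param = next(module.parameters())
        if param.dtype == torch.float16 and param.device.type == "cpu":
            raise ValueError(
                "Half-precision weights are not supported on CPU. "
                "Move the pipeline to a GPU with `pipe.to('cuda')` or reload the fp32 weights."
            )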


Narsil commented Oct 5, 2022

@Narsil I ran a quick benchmark which I think is in line with your findings:

Perfectly in line. The other kind of optimization, torch.jit.script, actually helped a lot too, and with the attention scripted, autocast led to a 7% slowdown instead of 25%. Just FYI (I'm guessing because the fp32 softmax can be done more efficiently). Also, the speedups/slowdowns did seem to vary from GPU to GPU (the overall direction + magnitude are always correlated, but not necessarily the same throughout).
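For anyone unfamiliar, torch.jit.script compiles a module ahead of time; a purely illustrative toy (not the actual diffusers attention, just the general idea) looks like this:

import math
import torch

class TinyAttention(torch.nn.Module):
    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        scale = 1.0 / math.sqrt(q.shape[-1])
        scores = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
        return scores @ v

# scripting compiles and optimizes the module ahead of time
scripted = torch.jit.script(TinyAttention().cuda().half())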

@patrickvonplaten

Merging now and then updating the docs in a follow up PR

@patrickvonplaten patrickvonplaten merged commit 3dcc75c into main Oct 5, 2022
@patrickvonplaten patrickvonplaten deleted the optimizations branch October 5, 2022 13:33
shirayu added a commit to shirayu/purepale that referenced this pull request Oct 6, 2022

WASasquatch commented Oct 14, 2022

For the Tesla T4, on Google Colab, this results in a 2-second increase in time, from 28s to 30s (this is timing the actual process itself, not the time the pipe gives you).

Using PNDM, 50 steps, guidance scale 13.5 (what I usually use), resolution 512x768 (portrait).


Narsil commented Oct 14, 2022

@WASasquatch Most of the speedups were already included in #371, as there was some duplicated effort here.

If you're interested you can even try https://github.com/huggingface/diffusers/tree/optim_attempts where we went a bit further.
using https://pytorch.org/docs/stable/backends.html#torch.backends.cudnn.torch.backends.cudnn.benchmark (Set it to True in your script). That provides an additional few %.
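For reference, the flag mentioned above is a single line (standard PyTorch, nothing specific to that branch):

import torch

# let cuDNN benchmark convolution algorithms and pick the fastest ones
# (helps when input shapes stay constant between calls)
torch.backends.cudnn.benchmark = True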


WASasquatch commented Oct 15, 2022

@WASasquatch Most of the speedups were already included in #371, as there was some duplicated effort here.

If you're interested you can even try https://github.com/huggingface/diffusers/tree/optim_attempts where we went a bit further. using https://pytorch.org/docs/stable/backends.html#torch.backends.cudnn.torch.backends.cudnn.benchmark (Set it to True in your script). That provides an additional few %.

I'll check it out.

The speed I got after the Attention Slicing PR (over a month ago) was 28 seconds, all throughout these changes (I use the current GitHub branch and pull in changes as they happen). With this PR it's increased to 30 seconds.

Could it be that enable_attention_slicing() is having undesired effects with this PR?


Narsil commented Oct 17, 2022

Could it be that enable_attention_slicing() is having undesired effects with this PR?

Most likely. Attention slicing is for memory reduction, not inference speed.
But it's hard to tell without looking at the whole script.

You could also try using the PyTorch profiler to see what's wrong. Maybe the slow part lies somewhere else.
Can't recommend it enough: https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html
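As a starting point, here is a small sketch of profiling one pipeline call with the PyTorch profiler (illustrative; `pipe` and `prompt` are assumed to be set up as in the earlier snippets):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    image = pipe(prompt).images[0]

# the ops dominating cuda_time_total are the ones worth optimizing
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))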

@WASasquatch

You could also try using the PyTorch profiler to see what's wrong. Maybe the slow part lies somewhere else.
Can't recommend it enough: https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html

@Narsil Thanks for that. I assume I'm looking at the profiler bit. Thanks again.

prathikr pushed a commit to prathikr/diffusers that referenced this pull request Oct 26, 2022
…ful). (huggingface#511)

* Removing `autocast` for `35-25% speedup`.

* iQuality

* Adding a slow test.

* Fixing mps noise generation.

* Raising error on wrong device, instead of just casting on behalf of user.

* Quality.

* fix merge

Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023