
minimal stable diffusion GPU memory usage with accelerate hooks #850

Merged (21 commits) on Oct 26, 2022

Conversation

@piEsposito (Contributor) commented Oct 14, 2022

Attempts to solve #540 in a more readable, less intrusive, less verbose way than #537.

@patil-suraj I made another attempt to solve this, now in a non-intrusive way, using accelerate.cpu_offload to keep on the GPU only the parts of the models that are actually being used, keeping the GPU memory footprint as low as 800 MB.

Would you be so kind as to take a look and tell me if I'm going in the right direction?

Thanks!

Changes:

  • If the .device getter would return torch.device("meta"), return torch.device("cpu") instead; otherwise we won't be able to move tensors to self.device when using accelerate (see the sketch below).
  • Created a StableDiffusionPipeline.cuda_with_minimal_gpu_usage method that uses accelerate.cpu_offload on all models to reduce the GPU memory footprint.
  • Made the PR tests depend on accelerate from master, as per @sgugger's suggestion.

We need to install accelerate from master to use it, as this PR depends on huggingface/accelerate#768.
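A minimal sketch of the .device fallback described in the first bullet above (the helper name and the way modules are passed in are hypothetical, not the actual diff):

    import torch
    from torch import nn

    def pipeline_device(modules):
        # If accelerate has offloaded a module, its parameters live on the "meta"
        # device; report "cpu" so that `.to(self.device)` calls still work.
        for module in modules:
            if not isinstance(module, nn.Module):
                continue
            param_device = next(module.parameters()).device
            if param_device == torch.device("meta"):
                return torch.device("cpu")
            return param_device
        return torch.device("cpu")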

@HuggingFaceDocBuilderDev commented Oct 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@hafriedlander (Contributor)

Just for reference, this is how I solved the same problem - https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b75deaa743c77415f277fedd37494a661a0cbaf2/sdgrpcserver/pipeline/unified_pipeline.py#L364.

This gets used as a base class for a pipeline instead of DiffusionPipeline, and it moves one model at a time onto the GPU. The advantage compared to this PR is that all the models run on the GPU; the disadvantage is that it depends on garbage collection to free the GPU memory.
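Roughly, the pattern described there (sketched from the description above, not taken from the linked code) is:

    import gc
    import torch

    def run_on_gpu(model, *args, **kwargs):
        # Move a single sub-model onto the GPU just for its step...
        model.to("cuda")
        try:
            return model(*args, **kwargs)
        finally:
            # ...then send it back and rely on garbage collection plus an explicit
            # cache flush to free the GPU memory before the next sub-model runs.
            model.to("cpu")
            gc.collect()
            torch.cuda.empty_cache()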

@piEsposito (Contributor, Author)

@hafriedlander thank you for the suggestion. I found out that we can do that with accelerate: if we add this hook to each model, it will be moved onto the GPU only while performing inference, keeping the memory usage minimal.

I tried to avoid losing time on I/O and thus did not set everything up to be loaded onto the GPU. Nevertheless, I agree with you that this is a possible add-on to the PR.

@patrickvonplaten (Contributor)

Hey @piEsposito,

Generally, I think I'm fine with this PR! Just one question: In our experiments it's better and faster to run the model in pure "fp16" and to not use autocast at all - see: #371

Should we maybe try this here as well? E.g. just removing the autocast call?
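For context, running in pure fp16 without autocast looks roughly like this (a sketch; the checkpoint id and fp16 revision are the usual public ones and are assumptions, not part of this PR):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        revision="fp16",
        torch_dtype=torch.float16,
    ).to("cuda")

    # No `with torch.autocast("cuda"):` wrapper -- the weights are already fp16.
    image = pipe("a photo of an astronaut riding a horse").images[0]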

@patrickvonplaten (Contributor)

Otherwise happy to add this to the pipeline :-)

@piEsposito (Contributor, Author) commented Oct 17, 2022

@patrickvonplaten thank you for your feedback.

About your concern: I think we can't run everything in pure fp16 because some layers (e.g. LayerNorm) are only implemented for fp32 on the CPU. Because of that we can only keep the unet in fp16 (as it will run on the GPU), while the other models need to stay in fp32 to be able to run on the CPU. I know this is not optimal, but it would give people with smaller GPUs access to Stable Diffusion at the price of a decrease in performance.

When I put everything in fp16 I get:

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

This way, autocast serves to put the tensors in the right precision without having to intrude on the pipeline code.
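A minimal reproduction of that error (behaviour of the PyTorch builds current at the time; newer releases may add the fp16 CPU kernel):

    import torch

    layer_norm = torch.nn.LayerNorm(8).half()           # fp16 LayerNorm on the CPU
    x = torch.randn(2, 8, dtype=torch.float16)

    try:
        layer_norm(x)
    except RuntimeError as err:
        print(err)  # "LayerNormKernelImpl" not implemented for 'Half'

    # The same op runs fine in fp16 on the GPU (or under torch.autocast("cuda")),
    # which is why only the unet is kept in half precision.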

Please let me know if there is a way to keep them in fp16 on the CPU and still run those layers; if so, I can add it to the PR.

I tried using accelerate.cpu_offload, but for some reason it won't move the StableDiffusionSafetyChecker attributes concept_embeds, special_care_embeds, concept_embeds_weights, and special_care_embeds_weights to the execution device when predicting (it seems cpu_offload has an issue with conflicting hooks). So we would still have one model in fp32 on the CPU, but once the issue in accelerate is solved we can try using cpu_offload (edit: I opened an issue, huggingface/accelerate#767, and will try to open a PR).

For now, the best I could do is explicitly keep the unet in fp16 on the GPU and everything else in fp32 on the CPU.

It is not the best scenario, but at least people with less powerful setups (like my mother) will be able to run it.

What do you think?

Thanks!

@piEsposito (Contributor, Author) commented Oct 17, 2022

Update: I've opened huggingface/accelerate#768 to solve the issue with accelerate.cpu_offload and the execution devices.

If the PR on accelerate is merged, we can change this PR to get GPU memory usage as low as 640 MB.

We would then just have to change

def cuda_with_minimal_gpu_usage(self):
    if is_accelerate_available():
        from accelerate.hooks import attach_execution_device_hook
    else:
        raise ImportError("Please install accelerate via `pip install accelerate`")

    device = torch.device("cuda")
    self.unet.half().to(device)
    attach_execution_device_hook(self.unet, device)
    self.unet.forward = torch.autocast("cuda")(self.unet.forward)
    self.enable_attention_slicing(1)

    for cpu_offloaded_model in [self.text_encoder, self.vae, self.safety_checker]:
        cpu_offloaded_model.to(torch.float32)
        attach_execution_device_hook(cpu_offloaded_model, "cpu")

to

def cuda_with_minimal_gpu_usage(self):
      if is_accelerate_available():
          from accelerate import cpu_offload
      else:
          raise ImportError("Please install accelerate via `pip install accelerate`")

      device = torch.device("cuda")
      self.enable_attention_slicing(1)

      for cpu_offloaded_model in [self.unet, self.text_encoder, self.vae, self.safety_checker]:
          cpu_offload(cpu_offloaded_model, device)

And reduce memory usage even further thanks to CPU offload.
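For reference, the user-facing call would then look roughly like this (a usage sketch; the checkpoint id is an assumption, and the method keeps the name used in this PR):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe.cuda_with_minimal_gpu_usage()  # offload submodels; keep the GPU footprint small

    image = pipe("a fantasy landscape, detailed oil painting").images[0]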

@piEsposito (Contributor, Author)

@patrickvonplaten I will wait for accelerate to have a release with this commit (huggingface/accelerate@5e8ab12 from huggingface/accelerate#768) merged, so that I can just use cpu_offload and keep the PR cleaner.

Would that be ok?

@piEsposito (Contributor, Author) commented Oct 19, 2022

@sgugger, can I ask for a new patch release of accelerate that includes huggingface/accelerate@5e8ab12 (from huggingface/accelerate#768)? It would unblock this PR.

@sgugger (Contributor) commented Oct 20, 2022

Not sure we will do a patch release for that; it's not a regression introduced by anything, just a bug fix. Big model inference in Accelerate is being improved every day, so Diffusers should test against the main branch of Accelerate for now (that's what we do in Transformers).

@patrickvonplaten (Contributor)

Hey @piEsposito,

Cool to see that the change has been merged into accelerate. If you want, feel free to go ahead with making this PR simpler, and then we'll just update the requirement to accelerate>=0.14.0, since accelerate is currently on 0.14.0.dev0, meaning the next minor version will be 0.14.0 (see: https://github.com/huggingface/accelerate/blob/87a7e0783f7b428fafa72a9e64d36625ee22888f/src/accelerate/__init__.py#L5).

Until 0.14.0 is out, we can just ask people to install accelerate from GitHub :-):

pip install git+https://github.com/huggingface/accelerate.git

Does this sound good to you? :-)

@piEsposito (Contributor, Author) commented Oct 20, 2022

@patrickvonplaten just did it. The problem is that the tests won't pass because accelerate==0.14.0.dev0 is not on PyPI. Should we wait for the next release of accelerate, or is there a way to make the tests pass? Is there a way to make them run against the master branch of accelerate?

From my commits below this comment you will notice that, if this is possible, I'm having a hard time figuring it out, haha.

@patrickvonplaten (Contributor)

Ah, I see, regarding the tests: let me and @anton-l take care of it :-) Don't worry about the failing test; we'll make a change to the testing files.

run: |
  python -m pip install --upgrade pip
  python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
  python -m pip install -e .[quality,test]
Review comment (Contributor):

Suggested change:
  - python -m pip install -e .[quality,test]
  + python -m pip install git+https://github.com/huggingface/accelerate
  + python -m pip install -e .[quality,test]

This should work I think :-)

Reply (Contributor, Author):

I tried installing accelerate from master after the [quality,test] install, to ensure that's the version we keep for the tests.

python -m pip install --upgrade pip
python -m pip uninstall -y torch torchvision torchtext
python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -e .[quality,test]
Review comment (Contributor):

Suggested change:
  - python -m pip install -e .[quality,test]
  + python -m pip install git+https://github.com/huggingface/accelerate
  + python -m pip install -e .[quality,test]

Reply (Contributor, Author):

I tried installing accelerate from master after the [quality,test] install, to ensure that's the version we keep for the tests. How does that sound?

@patrickvonplaten (Contributor)

Actually, you seem to have already figured out how the GitHub Actions work :-) I think you'll just need to install accelerate before pip install .[dev] (maybe it also works the way you've done it now, though). In any case, it's the right approach :-)

@piEsposito (Contributor, Author)

Actually, you seem to have already figured out how the GitHub Actions work :-) I think you'll just need to install accelerate before pip install .[dev] (maybe it also works the way you've done it now, though). In any case, it's the right approach :-)

Actually, I think you were right. We can install it before because it will comply with accelerate>=0.11.0. Do you want me to change that?

@piEsposito (Contributor, Author)

@patrickvonplaten it worked; only the MPS tests on Apple M1 are not passing, but I think that's happening on the other PRs too, right?

@piEsposito (Contributor, Author)

Now I think we are good to go.

@piEsposito (Contributor, Author)

@patrickvonplaten, can we add this functionality to the other Stable Diffusion pipelines in follow-up PRs?

@piEsposito (Contributor, Author)

@patrickvonplaten, sorry to bother you, but is there anything I can do to help move this PR forward?

@patrickvonplaten (Contributor) left a review comment:

Great PR @piEsposito - sorry for replying so late! I was a bit swamped with issues 😅

@patrickvonplaten (Contributor)

@anton-l FYI regarding the tests: until the next accelerate release we'll install from main to make these tests pass.

@patrickvonplaten merged commit b2e2d14 into huggingface:main on Oct 26, 2022
@piEsposito (Contributor, Author)

@patrickvonplaten anything I can do to help with the issues?

Thank you for merging that one. Can I do the same for the other stable diffusion pipelines?

@patrickvonplaten (Contributor)

Yes, I think adding this to the img2img and also the inpaint pipeline makes a lot of sense :-) Also, would you be interested in adding a section about this feature and the maximum memory usage to https://huggingface.co/docs/diffusers/optimization/fp16, maybe?

@piEsposito (Contributor, Author)

All right, I will do that in the next few days. About the documentation: how can I do that?

@patrickvonplaten (Contributor)

@piEsposito regarding the documentation, you simply need to change the corresponding files here: https://github.com/huggingface/diffusers/tree/main/docs/source as well as add a docstring to your method.
BTW, I've done some renaming here: https://github.com/huggingface/diffusers/pull/1016/files

@patrickvonplaten (Contributor)

I think as soon as we have the docs up, as mentioned here: #850 (comment), we can promote this super cool new reduction to less than 2 GB of RAM. What do you think? :-)

@piEsposito (Contributor, Author)

@patrickvonplaten great idea, I will try to do that by the end of the week. Should I add it to https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx?

If we combine this with attention slicing, it gets as low as 700 MB.

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
…ingface#850)

* add method to enable cuda with minimal gpu usage to stable diffusion

* add test to minimal cuda memory usage

* ensure all models but unet are onn torch.float32

* move to cpu_offload along with minor internal changes to make it work

* make it test against accelerate master branch

* coming back, its official: I don't know how to make it test againt the master branch from accelerate

* make it install accelerate from master on tests

* go back to accelerate>=0.11

* undo prettier formatting on yml files

* undo prettier formatting on yml files againn