
minimal stable diffusion GPU memory usage with accelerate hooks #850

Merged (21 commits) on Oct 26, 2022

Conversation

@piEsposito (Contributor) commented Oct 14, 2022

Attempts to solve #540 in a more readable, less intrusive, less verbose way than #537.

@patil-suraj I made another attempt to solve this, now in a non-intrusive way, using accelerate.cpu_offload to keep on the GPU only the parts of the models that are actually being used, keeping the GPU memory footprint as low as 800 MB.

Would you be so kind as to take a look and tell me if I'm going in the right direction?

Thanks!

Changes:

  • If the .device getter would return torch.device("meta"), return torch.device("cpu") instead; otherwise we won't be able to move tensors to self.device when using accelerate (see the sketch below).
  • Created a StableDiffusionPipeline.cuda_with_minimal_gpu_usage method that uses accelerate.cpu_offload on all models to reduce the GPU memory footprint.
  • Made the PR tests depend on accelerate from master, as per @sgugger's suggestion.

We need to install accelerate from master to use it, as this PR depends on huggingface/accelerate#768.
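A minimal sketch of the .device fallback described in the first bullet above (the helper name and the way modules are passed in are hypothetical, not the actual diff):

    import torch
    from torch import nn

    def pipeline_device(modules):
        # If accelerate has offloaded a module, its parameters live on the "meta"
        # device; report "cpu" so that `.to(self.device)` calls still work.
        for module in modules:
            if not isinstance(module, nn.Module):
                continue
            param_device = next(module.parameters()).device
            if param_device == torch.device("meta"):
                return torch.device("cpu")
            return param_device
        return torch.device("cpu")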

@HuggingFaceDocBuilderDev commented Oct 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@hafriedlander (Contributor)

Just for reference, this is how I solved the same problem - https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b75deaa743c77415f277fedd37494a661a0cbaf2/sdgrpcserver/pipeline/unified_pipeline.py#L364.

This gets used as a base class for a pipeline instead of DiffusionPipeline, and it moves one model at a time onto the GPU. The advantage compared to this PR is that all the models run on the GPU; the disadvantage is that it depends on garbage collection to free the GPU memory.
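Roughly, the pattern described there (sketched from the description above, not taken from the linked code) is:

    import gc
    import torch

    def run_on_gpu(model, *args, **kwargs):
        # Move a single sub-model onto the GPU just for its step...
        model.to("cuda")
        try:
            return model(*args, **kwargs)
        finally:
            # ...then send it back and rely on garbage collection plus an explicit
            # cache flush to free the GPU memory before the next sub-model runs.
            model.to("cpu")
            gc.collect()
            torch.cuda.empty_cache()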

@piEsposito (Contributor, Author)

@hafriedlander thank you for the suggestion. I found out that we can do that with accelerate: if we add this hook to each model, it will be moved onto the GPU only while performing inference, keeping the memory usage minimal.

I tried to avoid losing time on I/O and thus did not set everything up to be loaded onto the GPU. Nevertheless, I agree with you that this is a possible add-on to the PR.

@patrickvonplaten (Contributor)

Hey @piEsposito,

Generally, I think I'm fine with this PR! Just one question: In our experiments it's better and faster to run the model in pure "fp16" and to not use autocast at all - see: #371

Should we maybe try this here as well? E.g. just removing the autocast call?
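For context, running in pure fp16 without autocast looks roughly like this (a sketch; the checkpoint id and fp16 revision are the usual public ones and are assumptions, not part of this PR):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        revision="fp16",
        torch_dtype=torch.float16,
    ).to("cuda")

    # No `with torch.autocast("cuda"):` wrapper -- the weights are already fp16.
    image = pipe("a photo of an astronaut riding a horse").images[0]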

@patrickvonplaten (Contributor)

Otherwise happy to add this to the pipeline :-)

@piEsposito (Contributor, Author) commented Oct 17, 2022

@patrickvonplaten thank you for your feedback.

About your concern: I think we can't run everything in pure fp16 because some layers (e.g. LayerNorm) are only implemented for fp32 on the CPU. Because of that we can only keep the unet in fp16 (as it will run on the GPU), while the other models need to stay in fp32 to be able to run on the CPU. I know this is not optimal, but it would give people with smaller GPUs access to Stable Diffusion at the price of a decrease in performance.

When I put everything in fp16 I get:

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

This way, autocast serves to put the tensors in the right precision without having to intrude on the pipeline code.
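A minimal reproduction of that error (behaviour of the PyTorch builds current at the time; newer releases may add the fp16 CPU kernel):

    import torch

    layer_norm = torch.nn.LayerNorm(8).half()           # fp16 LayerNorm on the CPU
    x = torch.randn(2, 8, dtype=torch.float16)

    try:
        layer_norm(x)
    except RuntimeError as err:
        print(err)  # "LayerNormKernelImpl" not implemented for 'Half'

    # The same op runs fine in fp16 on the GPU (or under torch.autocast("cuda")),
    # which is why only the unet is kept in half precision.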

Please let me know if there is a way to keep them in fp16 on the CPU and still run those layers; if so, I can add it to the PR.

I tried using accelerate.cpu_offload, but for some reason it won't move the StableDiffusionSafetyChecker attributes concept_embeds, special_care_embeds, concept_embeds_weights, and special_care_embeds_weights to the execution device when predicting (it seems cpu_offload has an issue with conflicting hooks). So we would still have one model in fp32 on the CPU, but once the issue in accelerate is solved we can try using cpu_offload (edit: I opened an issue, huggingface/accelerate#767, and will try to open a PR).

For now, the best I could do is explicitly keep the unet in fp16 on the GPU and everything else in fp32 on the CPU.

It is not the best scenario, but at least people with less powerful setups (like my mother) will be able to run it.

What do you think?

Thanks!

@piEsposito (Contributor, Author) commented Oct 17, 2022

Update: I've opened huggingface/accelerate#768 to solve the issue with accelerate.cpu_offload and the execution devices.

If the PR on accelerate is merged, we can change this PR to get GPU memory usage as low as 640 MB.

We would then just have to change

def cuda_with_minimal_gpu_usage(self):
    if is_accelerate_available():
        from accelerate.hooks import attach_execution_device_hook
    else:
        raise ImportError("Please install accelerate via `pip install accelerate`")

    device = torch.device("cuda")
    self.unet.half().to(device)
    attach_execution_device_hook(self.unet, device)
    self.unet.forward = torch.autocast("cuda")(self.unet.forward)
    self.enable_attention_slicing(1)

    for cpu_offloaded_model in [self.text_encoder, self.vae, self.safety_checker]:
        cpu_offloaded_model.to(torch.float32)
        attach_execution_device_hook(cpu_offloaded_model, "cpu")

to

def cuda_with_minimal_gpu_usage(self):
      if is_accelerate_available():
          from accelerate import cpu_offload
      else:
          raise ImportError("Please install accelerate via `pip install accelerate`")

      device = torch.device("cuda")
      self.enable_attention_slicing(1)

      for cpu_offloaded_model in [self.unet, self.text_encoder, self.vae, self.safety_checker]:
          cpu_offload(cpu_offloaded_model, device)

And reduce memory usage even further thanks to CPU offload.
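For reference, the user-facing call would then look roughly like this (a usage sketch; the checkpoint id is an assumption, and the method keeps the name used in this PR):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe.cuda_with_minimal_gpu_usage()  # offload submodels; keep the GPU footprint small

    image = pipe("a fantasy landscape, detailed oil painting").images[0]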

@piEsposito (Contributor, Author)

@patrickvonplaten I will wait for accelerate to have a release with this commit (huggingface/accelerate@5e8ab12 from huggingface/accelerate#768) merged, so that I can just use cpu_offload and keep the PR cleaner.

Would that be ok?

@piEsposito (Contributor, Author) commented Oct 19, 2022

@sgugger, can I ask for a new patch release of accelerate that includes huggingface/accelerate@5e8ab12 (from huggingface/accelerate#768)? It would unblock this PR.

@sgugger (Contributor) commented Oct 20, 2022

Not sure we will do a patch release for that; it's not a regression introduced by anything, just a bug fix. Big model inference in Accelerate is being improved every day, so Diffusers should test against the main branch of Accelerate for now (that's what we do in Transformers).

@patrickvonplaten (Contributor)

Hey @piEsposito,

Cool to see that the change has been merged into accelerate. If you want, feel free to go ahead with making this PR simpler, and then we'll just update the requirement to accelerate>=0.14.0, since accelerate is currently on 0.14.0.dev0, meaning the next minor version will be 0.14.0 (see: https://github.com/huggingface/accelerate/blob/87a7e0783f7b428fafa72a9e64d36625ee22888f/src/accelerate/__init__.py#L5).

Until 0.14.0 is out, we can just ask people to install accelerate from GitHub :-):

pip install git+https://github.com/huggingface/accelerate.git

Does this sound good to you? :-)

@piEsposito (Contributor, Author) commented Oct 20, 2022

@patrickvonplaten just did it. The problem is that the tests won't pass because accelerate==0.14.0.dev0 is not on PyPI. Should we wait for the next release of accelerate, or is there a way to make the tests pass? Is there a way to make them run against the master branch of accelerate?

From my commits below this comment you will notice that, if this is possible, I'm having a hard time figuring it out, haha.

@patrickvonplaten (Contributor)

Ah, I see, regarding the tests: let me and @anton-l take care of it :-) Don't worry about the failing test; we'll make a change to the testing files.

run: |
  python -m pip install --upgrade pip
  python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
  python -m pip install -e .[quality,test]
Review comment (Contributor):

Suggested change:
  - python -m pip install -e .[quality,test]
  + python -m pip install git+https://github.com/huggingface/accelerate
  + python -m pip install -e .[quality,test]

This should work I think :-)

Reply (Contributor, Author):

I tried installing accelerate from master after the [quality,test] install, to ensure that's the version we keep for the tests.

python -m pip install --upgrade pip
python -m pip uninstall -y torch torchvision torchtext
python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -e .[quality,test]
Review comment (Contributor):

Suggested change:
  - python -m pip install -e .[quality,test]
  + python -m pip install git+https://github.com/huggingface/accelerate
  + python -m pip install -e .[quality,test]

Reply (Contributor, Author):

I tried installing accelerate from master after the [quality,test] install, to ensure that's the version we keep for the tests. How does that sound?

@patrickvonplaten (Contributor)

Actually, you seem to have already figured out how the GitHub Actions work :-) I think you'll just need to install accelerate before pip install .[dev] (maybe it also works the way you've done it now, though). In any case, it's the right approach :-)

@piEsposito (Contributor, Author)

Actually, you seem to have already figured out how the GitHub Actions work :-) I think you'll just need to install accelerate before pip install .[dev] (maybe it also works the way you've done it now, though). In any case, it's the right approach :-)

Actually, I think you were right. We can install it before because it will comply with accelerate>=0.11.0. Do you want me to change that?

@piEsposito (Contributor, Author)

@patrickvonplaten it worked; only the MPS tests on Apple M1 are not passing, but I think that's happening on the other PRs too, right?

@piEsposito (Contributor, Author)

Now I think we are good to go.

@piEsposito (Contributor, Author)

@patrickvonplaten, can we add this functionality to the other Stable Diffusion pipelines in follow-up PRs?

@piEsposito (Contributor, Author)

@patrickvonplaten, sorry to bother you, but is there anything I can do to help move this PR forward?

@patrickvonplaten (Contributor) left a review comment:

Great PR @piEsposito - sorry for replying so late! I was a bit swamped with issues 😅

@patrickvonplaten (Contributor)

@anton-l FYI regarding the tests: until the next accelerate release we'll install from main to make these tests pass.

@patrickvonplaten merged commit b2e2d14 into huggingface:main on Oct 26, 2022
@piEsposito (Contributor, Author)

@patrickvonplaten anything I can do to help with the issues?

Thank you for merging that one. Can I do the same for the other stable diffusion pipelines?

@patrickvonplaten (Contributor)

Yes, I think adding this to the img2img and also the inpaint pipeline makes a lot of sense :-) Also, would you be interested in adding a section about this feature and the maximum memory usage to https://huggingface.co/docs/diffusers/optimization/fp16, maybe?

@piEsposito (Contributor, Author)

All right, I will do that in the next few days. About the documentation: how can I do that?

@patrickvonplaten (Contributor)

@piEsposito regarding the documentation, you simply need to change the corresponding files here: https://github.com/huggingface/diffusers/tree/main/docs/source as well as add a docstring to your method.
BTW, I've done some renaming here: https://github.com/huggingface/diffusers/pull/1016/files

@patrickvonplaten (Contributor)

I think as soon as we have the docs up, as mentioned here: #850 (comment), we can promote this super cool new reduction to less than 2 GB of RAM. What do you think? :-)

@piEsposito (Contributor, Author)

@patrickvonplaten great idea, I will try to do that by the end of the week. Should I add it to https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx?

If we combine this with attention slicing, it gets as low as 700 MB.

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
…ingface#850)

* add method to enable cuda with minimal gpu usage to stable diffusion

* add test to minimal cuda memory usage

* ensure all models but unet are onn torch.float32

* move to cpu_offload along with minor internal changes to make it work

* make it test against accelerate master branch

* coming back, its official: I don't know how to make it test againt the master branch from accelerate

* make it install accelerate from master on tests

* go back to accelerate>=0.11

* undo prettier formatting on yml files

* undo prettier formatting on yml files againn