
Conversation

@dg845 (Collaborator) commented Jan 26, 2024

What does this PR do?

This PR enables gradient checkpointing for UNet2DModel by setting the _supports_gradient_checkpointing flag to True. Since UNet2DConditionModel has _supports_gradient_checkpointing = True, it seems like UNet2DModel should support gradient checkpointing as well, unless I'm missing something.
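
Concretely, the initial change is just the class attribute (a minimal sketch of the diff):

class UNet2DModel(ModelMixin, ConfigMixin):
    # Opt the model into ModelMixin.enable_gradient_checkpointing() / disable_gradient_checkpointing().
    _supports_gradient_checkpointing = True

# Usage after this change (sketch):
# model = UNet2DModel(...)
# model.enable_gradient_checkpointing()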

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten
@sayakpaul

@sayakpaul (Member) left a comment

Do we not also have to configure the gradient checkpointing blocks, like how we do here?
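
i.e., something along these lines (an illustrative sketch only; the real implementation may check specific block classes rather than hasattr, mirroring what UNet2DConditionModel does):

def _set_gradient_checkpointing(self, module, value=False):
    # Propagate the flag to every sub-block that exposes a gradient_checkpointing attribute.
    if hasattr(module, "gradient_checkpointing"):
        module.gradient_checkpointing = value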

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dg845 (Collaborator, Author) commented Jan 27, 2024

Do we not also have to configure the gradient checkpointing blocks, like how we do here?

You're right, I missed this 😅.

@dg845 (Collaborator, Author) commented Jan 27, 2024

The UNetMidBlock2D, AttnDownBlock2D, and AttnUpBlock2D blocks used in UNet2DModel currently do not have gradient checkpointing implemented, so I have added gradient checkpointing to each of those blocks.
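
Roughly, each of these blocks' forward now follows the usual diffusers gradient checkpointing pattern (a simplified sketch; the scale/lora_scale handling is discussed below):

for resnet, attn in zip(self.resnets, self.attentions):
    if self.training and self.gradient_checkpointing:
        # Recompute the resnet's activations in the backward pass instead of storing them.
        hidden_states = torch.utils.checkpoint.checkpoint(
            create_custom_forward(resnet),  # create_custom_forward as written below
            hidden_states,
            temb,
        )
    else:
        hidden_states = resnet(hidden_states, temb)
    hidden_states = attn(hidden_states)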

As a note, in their current forward methods, AttnDownBlock2D and AttnUpBlock2D use the scale keyword argument when calling resnet, e.g.:

cross_attention_kwargs.update({"scale": lora_scale})
hidden_states = resnet(hidden_states, temb, scale=lora_scale)

So I have written the create_custom_forward function as

def create_custom_forward(module, return_dict=None):
    def custom_forward(*inputs, **kwargs):
        if return_dict is not None:
            return module(*inputs, return_dict=return_dict, **kwargs)
        else:
            return module(*inputs, **kwargs)

    return custom_forward

This has the potential to cause problems if return_dict is also supplied through kwargs (see https://docs.python.org/3/tutorial/controlflow.html#function-examples).
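
For concreteness, a toy example of the failure mode (f is a made-up stand-in, not diffusers code; create_custom_forward is the function written above):

def f(x, scale=1.0, return_dict=False):
    # Hypothetical stand-in for a block's forward.
    return (x * scale, return_dict)

fn = create_custom_forward(f, return_dict=True)
fn(2.0, scale=0.5)                     # fine: returns (1.0, True)
fn(2.0, scale=0.5, return_dict=False)  # TypeError: f() got multiple values for keyword argument 'return_dict'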

CrossAttnDownBlock2D handles this by omitting the scale argument entirely when calling custom_forward:

hidden_states = torch.utils.checkpoint.checkpoint(
    create_custom_forward(resnet),
    hidden_states,
    temb,
    **ckpt_kwargs,
)

which seems wrong when lora_scale is not ResnetBlock2D.forward's default scale value of 1.0 because the forward passes with and without gradient checkpointing are not equivalent.

Since scale is currently the third parameter of ResnetBlock2D.forward, we can probably supply it as a positional argument and still use CrossAttnDownBlock2D's create_custom_forward implementation. I'm not sure which approach is best.
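
Concretely, the positional-argument version would look something like this (a sketch, assuming ResnetBlock2D.forward keeps the signature (input_tensor, temb, scale=1.0)):

hidden_states = torch.utils.checkpoint.checkpoint(
    create_custom_forward(resnet),
    hidden_states,
    temb,
    lora_scale,  # supplied positionally in place of the scale keyword argument
    **ckpt_kwargs,
)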

@sayakpaul (Member) commented

which seems wrong when lora_scale is not ResnetBlock2D.forward's default scale value of 1.0 because the forward passes with and without gradient checkpointing are not equivalent.

I think this is still fine because lora_scale is never going to interfere during training. This still seems reasonable to me because there are always some subtle differences between training and inference in a model's forward pass. WDYT?

@sayakpaul (Member) left a comment

Just some nits, but looks very good. Nice test, too.

dg845 added 5 commits January 27, 2024 16:22
… positional arg when gradient checkpointing for AttnDownBlock2D/AttnUpBlock2D.
…checkpointing for CrossAttnDownBlock2D/CrossAttnUpBlock2D as well.
@dg845 (Collaborator, Author) commented Jan 28, 2024

Regarding #6718 (comment): I think in this case the best short-term solution is to use the standard create_custom_forward implementation and supply lora_scale as a positional argument when calling resnet through the custom_forward function. This allows the gradient checkpointing forward pass to be the same as the non-gradient checkpointing forward pass during training (which is also the same as the forward pass during inference). [Note that this is implemented in e0ee9ca and 8756be5.] This does introduce some dependency on the order of the positional arguments in ResnetBlock2D.forward, but I think that's probably fine since the ResnetBlock2D API is likely to remain stable over time.

In the long term, at least in src/diffusers/models/unets/unet_2d_blocks.py, I think it might make sense to revisit the create_custom_forward implementation. A quick search through that file indicates that create_custom_forward is never called with the return_dict keyword argument. So perhaps a more general implementation like

def create_custom_forward(module):
    def custom_forward(*inputs, **kwargs):
        return module(*inputs, **kwargs)

    return custom_forward

could be used, and return_dict could be supplied through kwargs if necessary: for example, if we're using torch.utils.checkpoint.checkpoint, something like

ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = torch.utils.checkpoint.checkpoint(
    create_custom_forward(resnet),
    hidden_states,
    temb,
    scale=lora_scale,
    return_dict=True,
    **ckpt_kwargs,
)

Note that torch.utils.checkpoint.checkpoint forwards extra keyword arguments to the wrapped function, at least in non-reentrant mode (use_reentrant=False), which is what ckpt_kwargs selects for torch >= 1.11.
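
For example, a toy check of the kwargs pass-through with non-reentrant checkpointing (block here is a made-up stand-in, not a diffusers module):

import torch
from torch.utils.checkpoint import checkpoint

def block(x, scale=1.0, return_dict=False):
    out = torch.relu(x) * scale
    return {"sample": out} if return_dict else out

x = torch.randn(4, requires_grad=True)
# Both scale and return_dict are forwarded to block by the non-reentrant checkpoint.
out = checkpoint(block, x, scale=0.5, return_dict=True, use_reentrant=False)
out["sample"].sum().backward()  # gradients flow through the recomputed forward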

@sayakpaul (Member) commented

@dg845 I don't think I quite follow the concern fully.

Could you maybe try to demonstrate the issue with a simpler example?

which seems wrong when lora_scale is not ResnetBlock2D.forward's default scale value of 1.0 because the forward passes with and without gradient checkpointing are not equivalent.

Would like to see when this case arises. From what I understand, gradient checkpointing is used during training, and lora_scale is never supposed to be supplied during training. So I don't quite understand how a discrepancy arises here. Maybe I am missing something.

I would like to keep the legacy blocks as they are unless absolutely necessary. This is why I am asking for a simpler example, to understand the consequences.

@dg845 (Collaborator, Author) commented Jan 28, 2024

which seems wrong when lora_scale is not ResnetBlock2D.forward's default scale value of 1.0 because the forward passes with and without gradient checkpointing are not equivalent.

Sorry, I should have made it clear that the above follows from my belief that the lora_scale should be supplied during training.

My understanding is that in the original LoRA paper the LoRA scale parameter $\alpha$ is a hyperparameter during training:

[image "lora_alpha_param": excerpt from the LoRA paper describing the $\alpha$ scaling hyperparameter]

I think in practice $\alpha$ is typically held constant and the learning rate is tuned during training (following the highlighted section), but theoretically we could treat $\alpha$ and the learning rate as independent hyperparameters and tune them both.

Similarly, if we look at peft.tuners.lora.layer.Linear, the forward method does not disable scaling the LoRA update $\Delta W$ during training:

https://github.com/huggingface/peft/blob/bfc102c0c095dc9094cdd3523b729583bfad4688/src/peft/tuners/lora/layer.py#L318-L320

unlike for something like dropout where the forward pass would be different depending on whether torch.nn.Module.training is set.
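
Schematically, a simplified LoRA linear looks like this (a sketch to illustrate the point, not the actual peft class):

import torch.nn as nn

class LoRALinearSketch(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=4.0, p_dropout=0.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.dropout = nn.Dropout(p_dropout)  # behaves differently in train vs. eval mode
        self.scaling = alpha / rank           # applied unconditionally, in both train and eval

    def forward(self, x):
        # The LoRA update is always scaled, regardless of self.training.
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling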

So in my view the discrepancy between the gradient checkpointing code and non-gradient checkpointing code in e.g. CrossAttnDownBlock2D when lora_scale != 1.0 is a bug because

  1. The gradient checkpointing forward pass differs from the non-gradient checkpointing forward pass during training.
  2. For the same reason, the gradient checkpointing forward pass during training differs from the forward pass during inference, which will result in a train-test mismatch if we train using gradient checkpointing.

Practically speaking, we might not consider the train-test mismatch that arises to be that bad, since we may want to tune the scaling of the LoRA update during inference anyway (e.g. if we are performing inference with multiple LoRAs simultaneously).
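
As a toy illustration of the mismatch (resnet_like is a made-up stand-in for ResnetBlock2D.forward, not diffusers code):

import torch
from torch.utils.checkpoint import checkpoint

def resnet_like(x, temb, scale=1.0):
    # The output depends on scale, like ResnetBlock2D.forward does under the legacy LoRA path.
    return x + temb * scale

def create_custom_forward(module):
    def custom_forward(*inputs):
        return module(*inputs)

    return custom_forward

x = torch.randn(4, requires_grad=True)
temb = torch.randn(4)
lora_scale = 0.5

eager = resnet_like(x, temb, scale=lora_scale)
# Mirrors the current checkpointing path in CrossAttnDownBlock2D: scale is dropped,
# so the checkpointed call silently runs with the default scale of 1.0.
ckpt = checkpoint(create_custom_forward(resnet_like), x, temb, use_reentrant=False)

print(torch.allclose(eager, ckpt))  # False whenever lora_scale != 1.0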

@dg845 (Collaborator, Author) commented Jan 28, 2024

That being said, perhaps it's better if I move the changes (especially to CrossAttnDownBlock2D and CrossAttnUpBlock2D in 8756be5) out of this PR, and revisit this in a separate issue/PR.

@sayakpaul (Member) commented

alpha is not the same as the scale parameter in LoRA training, in my understanding. scale is an inference-time parameter and shouldn't influence training, whereas alpha could be tuned during LoRA training. For LoRA training, we rely on PEFT. If we want to expose alpha for training, we can easily do so. Example:

But this discussion is starting to deviate from the original topic of the PR a bit IMO.

The gradient checkpointing forward pass differs from the non-gradient checkpointing forward pass during training.
For the same reason, the gradient checkpointing forward pass during training differs from the forward pass during inference, which will result in a train-test mismatch if we train using gradient checkpointing.

^ I agree with this. And maybe this could be handled first in a separate PR, and then we revisit this PR. Does that work?

@dg845 (Collaborator, Author) commented Jan 29, 2024

And maybe this could be handled first in a separate PR and then we revisit this PR. Does that work?

Sounds good :). To be more precise, would something like this sound good to you?

  1. In this PR, gradient checkpointing is implemented for UNet2DModel and its associated blocks such as AttnDownBlock2D/AttnUpBlock2D in a way which is exactly parallel to the current gradient checkpointing implementation in UNet2DConditionModel and CrossAttnDownBlock2D/CrossAttnUpBlock2D.
  2. The question of how the scale parameter should be handled for the legacy LoRA implementation will be revisited in a separate issue and/or PR.

dg845 added 2 commits January 28, 2024 18:19
…radient checkpointing for CrossAttnDownBlock2D/CrossAttnUpBlock2D as well."

This reverts commit 8756be5.
…ions exactly parallel to CrossAttnDownBlock2D/CrossAttnUpBlock2D implementations.
@sayakpaul (Member) commented

Yeah that is right.

@dg845 (Collaborator, Author) commented Jan 30, 2024

I have updated the gradient checkpointing implementation in this PR to be exactly parallel to that of UNet2DConditionModel and opened a new issue regarding the LoRA scale parameter at #6756.

@sayakpaul (Member) left a comment

Thanks a mile!

woshiyyya and others added 19 commits February 2, 2024 14:53
…a movement. (huggingface#6704)

* load cumprod tensor to device

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* fixing ci

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* make fix-copies

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

---------

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…uggingface#6736)

Fix bug in ResnetBlock2D.forward when not USE_PEFT_BACKEND and using scale_shift for time emb where the lora scale  gets overwritten.

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update train_diffusion_dpo.py

Address huggingface#6702

* Update train_diffusion_dpo_sdxl.py

* Empty-Commit

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* update

* update

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
…ss (huggingface#6762)

* add is_flaky to test_model_cpu_offload_forward_pass

* style

* update

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
* update

* update

* updaet

* add tests and docs

* clean up

* add to toctree

* fix copies

* pr review feedback

* fix copies

* fix tests

* update docs

* update

* update

* update docs

* update

* update

* update

* update
---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Alvaro Somoza <somoza.alvaro@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
move sigma to device

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* add

* remove transformer

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
…gface#6738)

* harmonize the module structure for models in tests

* make the folders modules.

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update testing_utils.py

* Update testing_utils.py
@github-actions (Contributor) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the "stale (Issues that haven't received updates)" label on Feb 27, 2024
@sayakpaul removed the "stale (Issues that haven't received updates)" label on Feb 27, 2024
@sayakpaul (Member) commented

I think the PR is borked. Should we open a new PR instead? @dg845

dg845 added a commit to dg845/diffusers that referenced this pull request Mar 4, 2024
@dg845 (Collaborator, Author) commented Mar 4, 2024

Created a new PR with the changes at #7201. Will close this PR.

@dg845 dg845 closed this Mar 4, 2024
yiyixuxu pushed a commit that referenced this pull request Dec 20, 2024
* Port UNet2DModel gradient checkpointing code from #6718.


---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Vincent Neemie <92559302+VincentNeemie@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>
Foundsheep pushed a commit to Foundsheep/diffusers that referenced this pull request Dec 23, 2024
* Port UNet2DModel gradient checkpointing code from huggingface#6718.


---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Vincent Neemie <92559302+VincentNeemie@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>
sayakpaul added a commit that referenced this pull request Dec 23, 2024
* Port UNet2DModel gradient checkpointing code from #6718.


---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Vincent Neemie <92559302+VincentNeemie@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>