
7529 do not disable autocast for cuda devices #7530

Merged: 15 commits merged from bugfix/7529-autocast-failure-cuda into huggingface:main on Apr 2, 2024

Conversation

@bghira (Contributor) commented Mar 30, 2024

What does this PR do?

Fixes #7529

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@bghira (Contributor, Author) commented Mar 30, 2024

@sayakpaul (Member) left a comment


Thanks for taking care of it!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bghira (Contributor, Author) commented Mar 31, 2024

i'm going to switch from this method to contextlib's nullcontext, since local testing with nightly pytorch indicates that'll avoid some new errors coming down the pipe. specifically, the null context lets us entirely bypass autocast issues; sometimes autocast is disabled incorrectly for a platform, or the platform only has partial support.

the nullcontext lets us decide explicitly when to commit to providing autocast on a new platform.
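
roughly the pattern i have in mind (a sketch; the helper name and conditions are illustrative, not the exact code in this PR):

```python
import contextlib

import torch


def autocast_ctx(device_type: str, dtype: torch.dtype):
    # Illustrative helper, not the code merged in this PR.
    # Only hand out a real autocast context where we've committed to it (CUDA with a
    # half-precision dtype); everywhere else a no-op nullcontext sidesteps platforms
    # where autocast is disabled incorrectly or only partially supported.
    if device_type == "cuda" and dtype in (torch.float16, torch.bfloat16):
        return torch.autocast(device_type=device_type, dtype=dtype)
    return contextlib.nullcontext()


# usage during validation/inference:
# with autocast_ctx("cuda", torch.float16):
#     images = pipeline(prompt, num_inference_steps=25).images
```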

@bghira (Contributor, Author) commented Mar 31, 2024

@sayakpaul ready :-)

@bghira (Contributor, Author) commented Mar 31, 2024

@pcuenca i tried to unify the behaviour of the context selection to use contextlib

@sayakpaul (Member) left a comment


Nice 👌

@simbrams commented Apr 1, 2024

Can you also remove the error test in the instruct pix2pix xl pipeline 🙏 :

```python
else:
    raise ValueError(
        "For the given accelerator, there seems to be an unexpected problem in type-casting. Please file an issue on the PyTorch GitHub repository. See also: https://github.com/huggingface/diffusers/pull/7446/."
    )
```

@bghira (Contributor, Author) commented Apr 1, 2024

@simbrams done

@bghira (Contributor, Author) commented Apr 1, 2024

@sayakpaul @DN6 @yiyixuxu i've updated every training example that uses 🤗 Accelerate to disable AMP, so that accelerator.prepare(...) does not try to load an autocast ctx.
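
concretely, each script gets a change along these lines (a sketch, not the exact diff; in the real scripts the Accelerator is built from the parsed training args):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # illustrative; the real scripts pass their configured kwargs

# Disable 🤗 Accelerate's native AMP on MPS so that accelerator.prepare() does not
# wrap the forward pass in an autocast context.
if torch.backends.mps.is_available():
    accelerator.native_amp = False
```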

@christopher5106 left a comment


Works for me now. I mainly checked the pipeline and the train script for sdxl lora.

@sayakpaul (Member) commented

@bghira thanks for the changes. A couple of comments:

  • GPUs are first-class citizens for the training scripts. In order to support Silicon compatibility, we cannot break that. I am sorry that I didn't check that earlier.
  • I see we're now making changes to the pipelines so that they use an inference context. I don't think we can have that level of change in our pipelines. The silicon support is experimental from PyTorch itself. So when that gets stabilized a bit, we might have to undo the changes we're currently making. So, I am not in favor of going via this route.
  • IMO, the changes should be kept as minimal as possible. If making changes to the training scripts and the schedulers (like we did before) to support Silicon compatibility isn't enough, that means this feature itself is a little too experimental to support as that requires non-trivial changes to the legacy core.

So, IMO, we should make sure:

  • Training works as before on GPUs.
  • Silicon training works but with minimal changes. If that prevents FP16 training, I would still be okay with that because it seems like it's quite experimental. If disabling the intermediate validation could work -- that could be another route too (albeit not a pragmatic one).

@pcuenca @yiyixuxu would like to get your opinions too.

@yiyixuxu (Collaborator) commented Apr 2, 2024

@sayakpaul

> I see we're now making changes to the pipelines so that they use an inference context.

I didn't see that in this PR, though.

@yiyixuxu (Collaborator) commented Apr 2, 2024

@sayakpaul

> So, IMO, we should make sure:
>
> • Training works as before on GPUs.
> • Silicon training works but with minimal changes. If that prevents FP16 training, I would still be okay with that because it seems like it's quite experimental. If disabling the intermediate validation could work -- that could be another route too (albeit not a pragmatic one).

agree with this

@sayakpaul (Member) commented

@bghira (Contributor, Author) commented Apr 2, 2024

that was already using a context manager, i simply updated it to be consistent

@bghira (Contributor, Author) commented Apr 2, 2024

as an aside, i've finetuned the issue out of 2.1-v weights. so, a fixed model is available ... but it seems like the actual 2.1-v weights should be fixable somehow with a new or fixed extraction. (SDXL works great)

@sayakpaul (Member) commented

Okay good to know. Thanks!

What are our blockers now? I am happy to cook up a context manager in training_utils.py and then we can take it from there?

@bghira (Contributor, Author) commented Apr 2, 2024

also, the instruct pix2pix script (which you wrote :D) is where a lot of the original confusion came from! there, the autocast disables for fp16 and pulls the device type str to chop :0 off, so i learned a few things along the way to fixing this. but now everything is at least consistent; there were some community pipelines i did not update, as it didn't seem appropriate to do so.

@bghira (Contributor, Author) commented Apr 2, 2024

i've been trying to chase down that training issue. i want to try running SD 1.5 and see if the problem for legacy models is limited to 2.x, in which case it'll be best to limit training to SD 1.x. will update shortly.

@sayakpaul (Member) commented

> the autocast disables for fp16 and pulls the device type str to chop :0 off

I see it's enabled when mixed precision is FP16:

```python
str(accelerator.device).replace(":0", ""), enabled=accelerator.mixed_precision == "fp16"
```

But anyway, as mentioned I am down to writing a context manager to help ease things a bit. Maybe providing everything you have discovered so far regarding training on Silicon could be a very nice resource.
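
For reference, a minimal illustration of why the `:0` suffix gets stripped (illustrative snippet, not the training script's exact code):

```python
import torch

# torch.autocast expects a bare device *type* string ("cuda", "cpu", "mps"), but an
# accelerator device stringifies with an index, e.g. "cuda:0", hence the .replace(":0", "").
device = torch.device("cuda:0")
device_type = str(device).replace(":0", "")   # -> "cuda"
ctx = torch.autocast(device_type, enabled=True)
```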

@bghira (Contributor, Author) commented Apr 2, 2024

on mps,

  • sd 2.1 does not work in diffusers or simpletuner
  • sd 1.5 does not work in diffusers or simpletuner
    • both of these have MPS internal crashes
  • sdxl does not work in diffusers, but does work in simpletuner
    • in diffusers, fp16 training on sdxl causes the attention processor query to be in float32 while the weights are half

i've gone through them a lot today, and i'm really not sure why it would cause the MPS crash; it's incredibly unfortunate.

the SDXL error does seem like it can be overcome, i just need to put a bit more time into hunting it down

so far, this patch works as advertised to at least resolve the cuda-specific regression, and it should probably be merged.

@sayakpaul (Member) commented

That's quite informative indeed.

> sdxl does not work in diffusers, but does work in simpletuner
> in diffusers, fp16 training on sdxl causes the attention processor query to be in float32 while the weights are half

Curious, what do you do on simpletuner to have that fixed?

@sayakpaul (Member) commented

I will wait for @pcuenca to give it a shot too, before merging.

@bghira (Contributor, Author) commented Apr 2, 2024

in simpletuner i have a lot more brute-force general handling of dtypes and leave less responsibility for that on other parts of the stack.

@bghira (Contributor, Author) commented Apr 2, 2024

good morning. after some more checking, fp32 training works fine for MPS on SD 1.5 and 2.1. but this feels disappointing, as dtype issues just shouldn't be the only thing holding back memory-efficient training. but i guess it's all we can do for now?

@bghira (Contributor, Author) commented Apr 2, 2024

pytorch 2.3 supports MPS bf16:

```
Valid Types: [torch.float32, torch.float32, torch.float16, torch.float16, torch.bfloat16, torch.complex64, torch.uint8, torch.int8, torch.int16, torch.int16, torch.int32, torch.int32, torch.int64, torch.int64, torch.bool]
Invalid Types: [torch.float64, torch.float64, torch.complex128, torch.complex128, torch.quint8, torch.qint8, torch.quint4x2]
```

pytorch 2.2.1 does not:

```
Valid Types: [torch.float32, torch.float32, torch.float16, torch.float16, torch.uint8, torch.int8, torch.int16, torch.int16, torch.int32, torch.int32, torch.int64, torch.int64, torch.bool]
Invalid Types: [torch.float64, torch.float64, torch.bfloat16, torch.complex64, torch.complex128, torch.complex128, torch.quint8, torch.qint8, torch.quint4x2]
---
[bfloat16] BFloat16 is not supported on MPS
```
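
a quick runtime probe along these lines confirms it (a sketch; the helper name is mine, and the exception type may differ across torch builds):

```python
import torch


def mps_supports_bf16() -> bool:
    # Illustrative check: allocating a bfloat16 tensor on MPS succeeds on pytorch >= 2.3
    # and raises ("BFloat16 is not supported on MPS") on 2.2.x.
    if not torch.backends.mps.is_available():
        return False
    try:
        torch.zeros(1, dtype=torch.bfloat16, device="mps")
        return True
    except (TypeError, RuntimeError):
        return False
```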

@sayakpaul (Member) commented

@bghira I am going to merge the PR in a while. But tomorrow, I will take a closer look into autocast related things that we're using in our training scripts and see if we can get rid of them. I can do these tests fast in a CUDA environment. Will look into an autocast context manager (to add to training_utils.py too).

But just so I understand the issues for M3 training correctly, w.r.t #7530 (comment), were you able to pinpoint a bug when training with FP16? Does it fail during intermediate inference?

```
@@ -752,6 +752,10 @@ def main(args):
        project_config=accelerator_project_config,
    )

    # Disable AMP for MPS.
```
A Member commented on the diff above:

@bghira possible to add a more descriptive comment here?

@sayakpaul (Member) commented Apr 2, 2024

Also, I see this is not added to some scripts such as the advanced diffusion or consistency distillation scripts.

@bghira (Contributor, Author) commented Apr 2, 2024

the fp16 mps bug happens on the third call to nn.Linear: the Parameter object is created with fp32 dtype for weight and bias, but the query is fp16.
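
a minimal repro of that shape of failure (illustrative dims and layer name, assumes an MPS device; not the actual attention processor code):

```python
import torch
import torch.nn as nn

# nn.Linear creates its weight/bias Parameters in float32, so feeding it a fp16
# "query" on mps reproduces the mismatched-dtype failure described above.
to_q = nn.Linear(320, 320).to("mps")                                 # weight/bias: float32
query = torch.randn(2, 77, 320, device="mps", dtype=torch.float16)   # activations: float16
out = to_q(query)                                                    # raises a dtype-mismatch error
```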

@bghira (Contributor, Author) commented Apr 2, 2024

it's an actual training failure on step 1, when t=999

i saw that nn.Linear has a dtype parameter and tried setting it, but the Parameter class it gets passed into doesn't actually make use of the dtype parameter, which seems like a torch bug

@bghira (Contributor, Author) commented Apr 2, 2024

@sayakpaul (Member) commented

Thanks very much folks for the insightful discussions on autocast and bfloat16. I am under the impression that PyTorch, indeed, has some weird voodoo going under the hood for these.

Meanwhile, I am going to merge this PR to unblock our users. So, thanks to @bghira for the prompt action here.

This thread still remains open for discussions around autocast and bfloat16 which I believe will be valuable for the community.

@bghira (Contributor, Author) commented Apr 2, 2024

without attention slicing, we see:

```
/AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:788: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

but with attention slicing, this error disappears.

the resolution seems to be:

```python
    # Base components to prepare
    if torch.backends.mps.is_available():
        accelerator.native_amp = False
    results = accelerator.prepare(unet, lr_scheduler, optimizer, *train_dataloaders)
    unet = results[0]
    if torch.backends.mps.is_available():
        unet.set_attention_slice()
```

however, the casting type issue is still there for bf16 and fp16, but bf16 at least no longer crashes. still digging into it.

@sayakpaul merged commit 8e963d1 into huggingface:main on Apr 2, 2024 (15 checks passed).
@bghira deleted the bugfix/7529-autocast-failure-cuda branch on April 2, 2024 at 15:12.
@AmericanPresidentJimmyCarter (Contributor) commented

> Thanks very much folks for the insightful discussions on autocast and bfloat16. I am under the impression that PyTorch, indeed, has some weird voodoo going under the hood for these.
>
> Meanwhile, I am going to merge this PR to unblock our users. So, thanks to @bghira for the prompt action here.
>
> This thread still remains open for discussions around autocast and bfloat16 which I believe will be valuable for the community.

The issues seem to be scoped to bfloat16, which is unfortunate because most hardware has moved to it over the past several years.

  • pytorch/pytorch#120930: The assert_close should fail. Interestingly, the assert_close passes if we change the autocast dtype to torch.float16.
  • pytorch/pytorch#71631: A model with PyTorch AMP produces different model results when used in conjunction with CUDA graphs. In other words, it seems that conducting CUDA graph capture while wrapped with autocast leads to wrong outputs during training (specifically, after the first weight update).

The second issue is from Jan 2022, so we may still be waiting for a while for a fix. It is clear that autocast was originally designed with fp16 and the fp16 grad scaler in mind and has issues with other types.

@bghira (Contributor, Author) commented Apr 2, 2024

so disabling autocast for bf16 everywhere does seem like the way to go. honestly, the new optimiser removes any need to use fp16; pure bf16 is lighter weight and simpler code.
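
for illustration, a bare-bones picture of what pure bf16 means here (a toy model standing in for the unet; not code from this PR): cast the weights once, then train with no autocast and no GradScaler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the unet: weights cast to bfloat16 once, trained without
# autocast or a gradient scaler.
model = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64)).to(torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 64, dtype=torch.bfloat16)
target = torch.randn(8, 64, dtype=torch.bfloat16)

loss = F.mse_loss(model(x).float(), target.float())
loss.backward()
optimizer.step()
optimizer.zero_grad()
```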

noskill pushed a commit to noskill/diffusers that referenced this pull request Apr 5, 2024
* 7529 do not disable autocast for cuda devices

* Remove typecasting error check for non-mps platforms, as a correct autocast implementation makes it a non-issue

* add autocast fix to other training examples

* disable native_amp for dreambooth (sdxl)

* disable native_amp for pix2pix (sdxl)

* remove tests from remaining files

* disable native_amp on huggingface accelerator for every training example that uses it

* convert more usages of autocast to nullcontext, make style fixes

* make style fixes

* style.

* Empty-Commit

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>