
Dreambooth: crash after saving a checkpoint if fp16 output is enabled #1566

Closed · timh opened this issue Dec 6, 2022 · 1 comment
Labels: bug (Something isn't working)

Comments

timh (Contributor) commented Dec 6, 2022

Describe the bug

If accelerate is configured with fp16 (or --mixed_precision=fp16 is passed on the command line) and --save_steps is specified, Dreambooth crashes after writing a checkpoint:

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
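
(For context, this error arises whenever a half-precision tensor is fed into fp32 weights. A minimal standalone snippet that triggers the same RuntimeError on a CUDA machine; this is only an illustration of the error class, not the Dreambooth code path:)

import torch
import torch.nn as nn

conv = nn.Conv2d(4, 4, 3).cuda()                     # weights stay in fp32
x = torch.randn(1, 4, 64, 64, device="cuda").half()  # fp16 input
conv(x)  # RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same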

Reproduction

Repro at command line (with some irrelevant fetching output snipped):

$ cd .../diffusers/examples/dreambooth
$ accelerate launch --num_cpu_threads_per_process 4 -- train_dreambooth.py \
  --output_dir /home/tim/models/repro \
  --instance_data_dir /tmp/images.repro --instance_prompt "photo of reprocase" \
  --save_steps 5 --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5
Steps:  45%|████                      | 5/11 [00:14<00:05,  1.17it/s, loss=0.156, lr=5e-6]

Traceback (most recent call last):
  File "/home/tim/devel/diffusers-hf/examples/dreambooth/train_dreambooth.py", line 713, in <module>
    main(args)
  File "/home/tim/devel/diffusers-hf/examples/dreambooth/train_dreambooth.py", line 632, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tim/devel/diffusers-hf/src/diffusers/models/unet_2d_condition.py", line 371, in forward
    sample = self.conv_in(sample)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
Steps:  45%|████                      | 5/11 [00:14<00:17,  2.95s/it, loss=0.156, lr=5e-6]
Traceback (most recent call last):
  File "/home/tim/.conda/envs/diffusers-hf/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/tim/.conda/envs/diffusers-hf/bin/python', 'train_dreambooth.py', '--output_dir', '/home/tim/models/repro', '--instance_data_dir', '/tmp/images.repro', '--instance_prompt', 'photo of reprocase', '--save_steps', '5', '--pretrained_model_name_or_path', 'runwayml/stable-diffusion-v1-5']' returned non-zero exit status 1.

Accelerate config:

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.15.0
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
- Python version: 3.10.8
- Numpy version: 1.23.4
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

(PR to fix is incoming)
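
The fix referenced in the commits below keeps accelerate's fp32 forward wrapper on the model when unwrapping it for checkpointing, so subsequent training steps still run the mixed-precision forward pass. A minimal sketch of that idea, including the guard for older accelerate versions whose unwrap_model() does not accept the argument (variable names are illustrative, not the exact diff):

import inspect

# keep_fp32_wrapper=True preserves the mixed-precision forward wrapper that
# accelerator.prepare() installed; stripping it leaves fp32 weights being fed
# fp16 inputs on the next training step, producing the error above.
unwrap_params = inspect.signature(accelerator.unwrap_model).parameters
extra_args = {"keep_fp32_wrapper": True} if "keep_fp32_wrapper" in unwrap_params else {}
unet_to_save = accelerator.unwrap_model(unet, **extra_args)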

Logs

No response

System Info

$ git show --format=oneline|head -1
af04479e858c7fbb2ff3bf4c31f8c077703a339e [docs] [dreambooth training] default accelerate config (#1564)
timh added the bug label Dec 6, 2022
timh added a commit to timh/diffusers that referenced this issue Dec 6, 2022
dreambooth: fix #1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16
patil-suraj self-assigned this Dec 6, 2022

patil-suraj (Contributor) commented:

Great catch! Will take a look at the PR!

timh added further commits to timh/diffusers that referenced this issue Dec 8 and Dec 10, 2022
dreambooth: fix #1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16
tcapelle pushed a commit to tcapelle/diffusers that referenced this issue Dec 12, 2022
dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16 (huggingface#1618)

* dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16

* dreambooth: guard against passing keep_fp32_wrapper arg to older versions of accelerate. part of fix for huggingface#1566

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update examples/dreambooth/train_dreambooth.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
sliard pushed a commit to sliard/diffusers that referenced this issue Dec 21, 2022
dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16 (huggingface#1618)