
Dreambooth: crash after saving a checkpoint if fp16 output is enabled #1566

Closed · timh opened this issue Dec 6, 2022 · 1 comment
Labels: bug (Something isn't working)

Comments

timh (Contributor) commented Dec 6, 2022

Describe the bug

If accelerate is configured with fp16 (or --mixed_precision=fp16 is passed on the command line) and --save_steps is specified, Dreambooth crashes after writing a checkpoint:

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
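
(For context, this error arises whenever a half-precision tensor is fed into fp32 weights. A minimal standalone snippet that triggers the same RuntimeError on a CUDA machine; this is only an illustration of the error class, not the Dreambooth code path:)

import torch
import torch.nn as nn

conv = nn.Conv2d(4, 4, 3).cuda()                     # weights stay in fp32
x = torch.randn(1, 4, 64, 64, device="cuda").half()  # fp16 input
conv(x)  # RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same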

Reproduction

Repro at command line (with some irrelevant fetching output snipped):

$ cd .../diffusers/examples/dreambooth
$ accelerate launch --num_cpu_threads_per_process 4 -- train_dreambooth.py \
  --output_dir /home/tim/models/repro \
  --instance_data_dir /tmp/images.repro --instance_prompt "photo of reprocase" \
  --save_steps 5 --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5
Steps:  45%|████                      | 5/11 [00:14<00:05,  1.17it/s, loss=0.156, lr=5e-6]

Traceback (most recent call last):
  File "/home/tim/devel/diffusers-hf/examples/dreambooth/train_dreambooth.py", line 713, in <module>
    main(args)
  File "/home/tim/devel/diffusers-hf/examples/dreambooth/train_dreambooth.py", line 632, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tim/devel/diffusers-hf/src/diffusers/models/unet_2d_condition.py", line 371, in forward
    sample = self.conv_in(sample)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
Steps:  45%|████                      | 5/11 [00:14<00:17,  2.95s/it, loss=0.156, lr=5e-6]
Traceback (most recent call last):
  File "/home/tim/.conda/envs/diffusers-hf/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/tim/.conda/envs/diffusers-hf/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/tim/.conda/envs/diffusers-hf/bin/python', 'train_dreambooth.py', '--output_dir', '/home/tim/models/repro', '--instance_data_dir', '/tmp/images.repro', '--instance_prompt', 'photo of reprocase', '--save_steps', '5', '--pretrained_model_name_or_path', 'runwayml/stable-diffusion-v1-5']' returned non-zero exit status 1.

Accelerate config:

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.15.0
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
- Python version: 3.10.8
- Numpy version: 1.23.4
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

(PR to fix is incoming)
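
The fix referenced in the commits below keeps accelerate's fp32 forward wrapper on the model when unwrapping it for checkpointing, so subsequent training steps still run the mixed-precision forward pass. A minimal sketch of that idea, including the guard for older accelerate versions whose unwrap_model() does not accept the argument (variable names are illustrative, not the exact diff):

import inspect

# keep_fp32_wrapper=True preserves the mixed-precision forward wrapper that
# accelerator.prepare() installed; stripping it leaves fp32 weights being fed
# fp16 inputs on the next training step, producing the error above.
unwrap_params = inspect.signature(accelerator.unwrap_model).parameters
extra_args = {"keep_fp32_wrapper": True} if "keep_fp32_wrapper" in unwrap_params else {}
unet_to_save = accelerator.unwrap_model(unet, **extra_args)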

Logs

No response

System Info

$ git show --format=oneline|head -1
af04479e858c7fbb2ff3bf4c31f8c077703a339e [docs] [dreambooth training] default accelerate config (#1564)
timh added the bug label Dec 6, 2022
timh added a commit to timh/diffusers that referenced this issue Dec 6, 2022
dreambooth: fix #1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16
patil-suraj self-assigned this Dec 6, 2022

patil-suraj (Contributor) commented:

Great catch! Will take a look at the PR!

timh added further commits to timh/diffusers that referenced this issue Dec 8 and Dec 10, 2022
dreambooth: fix #1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16
tcapelle pushed a commit to tcapelle/diffusers that referenced this issue Dec 12, 2022
dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16 (huggingface#1618)

* dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16

* dreambooth: guard against passing keep_fp32_wrapper arg to older versions of accelerate. part of fix for huggingface#1566

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update examples/dreambooth/train_dreambooth.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
sliard pushed a commit to sliard/diffusers that referenced this issue Dec 21, 2022
dreambooth: fix huggingface#1566: maintain fp32 wrapper when saving a checkpoint to avoid crash when running fp16 (huggingface#1618)