
ValueError: Attempting to unscale FP16 gradients. when using examples/dreambooth/train_dreambooth_lora_sd3.py #9628

@yafeim


Describe the bug

I was using the example training script examples/dreambooth/train_dreambooth_lora_sd3.py.

To work around RuntimeError: Input type (float) and bias type (c10::Half) should be the same, I changed autocast_ctx = nullcontext() to autocast_ctx = torch.autocast(accelerator.device.type, dtype=torch.float32).

However, I then got ValueError: Attempting to unscale FP16 gradients. It appears right after validation runs; the traceback points to accelerator.clip_grad_norm_ in the training loop.

The complete error message:

(/home/ubuntu/code/yafeimao_env/diffusers) ubuntu@ip-172-31-20-87:~/code/yafeimao_code/diffusers_new/diffusers/examples/dreambooth$ accelerate launch train_dreambooth_lora_sd3.py --mixed_precision="fp16" --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR --instance_prompt="a photo of green floral dress" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=4 --learning_rate=1e-5 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=500 --validation_prompt="A photo of a female model wearing green floral dress" --validation_epochs=25 --seed="0" --push_to_hub
10/09/2024 19:14:52 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'base_shift', 'max_shift', 'use_dynamic_shifting', 'max_image_seq_len', 'base_image_seq_len'} was not found in config. Values will be initialized to default values.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 288.50it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [02:22<00:00, 71.45s/it]
{'mid_block_add_attention'} was not found in config. Values will be initialized to default values.
10/09/2024 19:19:14 - INFO - __main__ - ***** Running training *****
10/09/2024 19:19:14 - INFO - __main__ - Num examples = 4
10/09/2024 19:19:14 - INFO - __main__ - Num batches each epoch = 4
10/09/2024 19:19:14 - INFO - __main__ - Num Epochs = 500
10/09/2024 19:19:14 - INFO - __main__ - Instantaneous batch size per device = 1
10/09/2024 19:19:14 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
10/09/2024 19:19:14 - INFO - __main__ - Gradient Accumulation steps = 4
10/09/2024 19:19:14 - INFO - __main__ - Total optimization steps = 500
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 312.30it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [02:22<00:00, 71.10s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [02:22<00:00, 70.46s/it]
Loaded tokenizer_3 as T5TokenizerFast from tokenizer_3 subfolder of stabilityai/stable-diffusion-3-medium-diffusers.
Loaded tokenizer_2 as CLIPTokenizer from tokenizer_2 subfolder of stabilityai/stable-diffusion-3-medium-diffusers.
Loaded tokenizer as CLIPTokenizer from tokenizer subfolder of stabilityai/stable-diffusion-3-medium-diffusers.
{'base_shift', 'max_shift', 'use_dynamic_shifting', 'max_image_seq_len', 'base_image_seq_len'} was not found in config. Values will be initialized to default values.
Loaded scheduler as FlowMatchEulerDiscreteScheduler from scheduler subfolder of stabilityai/stable-diffusion-3-medium-diffusers.
Loading pipeline components...: 100%|██████████| 9/9 [00:00<00:00, 16.60it/s]
10/09/2024 19:22:07 - INFO - __main__ - Running validation...
Generating 4 images with prompt: A photo of a female model wearing green floral dress.
Steps: 0%|▍ | 1/500 [04:08<24:06, 2.90s/it, loss=0.176, lr=
Steps: 0%|▍ | 1/500 [04:08<24:06, 2.90s/it, loss=0.301, lr=
Steps: 0%|▍ | 1/500 [04:08<24:06, 2.90s/it, loss=0.134, lr=1e-5]
Traceback (most recent call last):
  File "/home/ubuntu/code/yafeimao_code/diffusers_new/diffusers/examples/dreambooth/train_dreambooth_lora_sd3.py", line 1872, in <module>
    main(args)
  File "/home/ubuntu/code/yafeimao_code/diffusers_new/diffusers/examples/dreambooth/train_dreambooth_lora_sd3.py", line 1727, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 2346, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 2290, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self.unscale_grads(
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in unscale_grads
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0%|▍ | 1/500 [04:09<34:32:16, 249.17s/it, loss=0.134, lr=1e-5]
Traceback (most recent call last):
  File "/home/ubuntu/code/yafeimao_env/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1174, in launch_command
    simple_launcher(args)
  File "/home/ubuntu/code/yafeimao_env/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 769, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/code/yafeimao_env/diffusers/bin/python3.9', 'train_dreambooth_lora_sd3.py', '--mixed_precision=fp16', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-3-medium-diffusers', '--instance_data_dir=/home/ubuntu/code/yafeimao_code/diffusers/images/green_floral_dress', '--output_dir=trained-sd3-lora', '--instance_prompt=a photo of green floral dress', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-5', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=500', '--validation_prompt=A photo of a female model wearing green floral dress', '--validation_epochs=25', '--seed=0', '--push_to_hub']' returned non-zero exit status 1.

Can you help? Thanks!
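
A note for anyone debugging the same message: the gradient scaler raises this ValueError when a parameter it is asked to unscale has an fp16 gradient, which in practice means a trainable parameter is stored in fp16. One possible cause here, stated as an assumption rather than a confirmed diagnosis, is that the validation step casts a module shared with training to fp16 in place, taking the trainable LoRA weights with it. The sketch below uses a toy nn.Sequential as a stand-in for the real model; fp16_trainable_params and upcast_trainable_params are hypothetical helper names, not part of the training script.

```python
import torch
from torch import nn


def fp16_trainable_params(model: nn.Module) -> list[str]:
    """Return the names of trainable parameters that are stored in float16."""
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.dtype == torch.float16
    ]


def upcast_trainable_params(model: nn.Module, dtype: torch.dtype = torch.float32) -> None:
    """Cast only the trainable parameters to `dtype`; frozen weights keep their dtype."""
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(dtype)


# Toy stand-in (assumption): layer 0 plays the frozen fp16 base weights,
# layer 1 plays the trainable fp32 adapter weights.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
model[0].requires_grad_(False).to(torch.float16)
print(fp16_trainable_params(model))  # []

# Casting the whole module is in-place, so it also downcasts the trainable
# weights. A validation pipeline that reuses the training module could do this.
model.to(dtype=torch.float16)
print(fp16_trainable_params(model))  # ['1.weight', '1.bias']

# Restore the trainable weights to fp32 before the next optimizer step.
upcast_trainable_params(model)
print(fp16_trainable_params(model))  # []
```

Calling fp16_trainable_params on the transformer just before accelerator.clip_grad_norm_ would show whether any trainable weight has ended up in fp16. diffusers ships a similar helper, cast_training_params in diffusers.training_utils, which is probably the better choice inside the actual script.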

Reproduction

I was using the script examples/dreambooth/train_dreambooth_lora_sd3.py with the following command:

accelerate launch train_dreambooth_lora_sd3.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of green floral dress" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of a female model wearing green floral dress" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

Logs

No response

System Info

Ubuntu 20.04

NVIDIA A10G Single GPU

And my pip list

Package Version Editable project location


absl-py 2.1.0
accelerate 0.34.2
bitsandbytes 0.44.1
certifi 2024.8.30
charset-normalizer 3.3.2
diffusers 0.31.0.dev0 /home/ubuntu/code/yafeimao_code/diffusers
filelock 3.16.0
fsspec 2024.9.0
ftfy 6.2.3
grpcio 1.66.2
huggingface-hub 0.24.6
idna 3.8
importlib_metadata 8.4.0
Jinja2 3.1.4
Markdown 3.7
MarkupSafe 3.0.1
mpmath 1.3.0
networkx 3.2.1
numpy 1.24.1
packaging 24.1
peft 0.11.1
pillow 10.4.0
pip 24.2
protobuf 5.28.2
psutil 6.0.0
PyYAML 6.0.2
regex 2024.7.24
requests 2.32.3
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 73.0.1
six 1.16.0
sympy 1.12
tensorboard 2.18.0
tensorboard-data-server 0.7.2
tokenizers 0.20.0
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
tqdm 4.66.5
transformers 4.45.2
triton 2.1.0
typing_extensions 4.12.2
urllib3 2.2.2
wcwidth 0.2.13
Werkzeug 3.0.4
wheel 0.44.0
zipp 3.20.1

Who can help?

@yiyixuxu @sayakpaul


Labels: bug
