Running train_text_to_image_lora.py on Colab with a V100 GPU fails with ValueError: Attempting to unscale FP16 gradients. #6086

@shangvo

Describe the bug

12/07/2023 07:37:24 - INFO - main - ***** Running training *****
12/07/2023 07:37:24 - INFO - main - Num examples = 833
12/07/2023 07:37:24 - INFO - main - Num Epochs = 72
12/07/2023 07:37:24 - INFO - main - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - main - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - main - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
main()
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self.unscale_grads(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in unscale_grads
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
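
The traceback ends in `GradScaler.unscale_`, which refuses to unscale gradients stored in float16. A quick way to narrow this down is to list which trainable parameters are held in fp16. The sketch below is only a diagnostic aid, not the maintainers' fix: `find_fp16_trainable` is a hypothetical helper, and the demo objects are stand-ins so the snippet runs without torch. With a real model you would pass something like `unet.named_parameters()`.

```python
from types import SimpleNamespace


def find_fp16_trainable(named_params):
    """Return the names of trainable parameters whose dtype is float16.

    Works on any iterable of (name, param) pairs where param exposes
    `requires_grad` and `dtype`; str(torch.float16) is "torch.float16".
    """
    return [
        name
        for name, param in named_params
        if param.requires_grad and str(param.dtype) == "torch.float16"
    ]


# Stand-in parameters (hypothetical names) so the sketch runs without torch.
demo_params = [
    ("lora.down.weight", SimpleNamespace(requires_grad=True, dtype="torch.float16")),
    ("conv_in.weight", SimpleNamespace(requires_grad=False, dtype="torch.float16")),
    ("lora.up.weight", SimpleNamespace(requires_grad=True, dtype="torch.float32")),
]

print(find_fp16_trainable(demo_params))
```

If this list is non-empty for the parameters handed to the optimizer, one commonly used workaround (an assumption here, not confirmed in this report) is to keep those trainable parameters in float32 while the frozen weights stay in fp16.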

Reproduction

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb

from huggingface_hub import HfFolder, login

# Log in with the Hugging Face API token

login(token='hf_tlt---------BRqMBjwdi')

# Set the WandB API key

import wandb
wandb.login(key='b6a210-------------7f543c')

# Run the training script

!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
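
As a sanity check that the run used these flags, the figures in the "Running training" banner (72 epochs, total batch size 4) can be reproduced from them. This is a rough sketch of the arithmetic, assuming a single process and the usual ceil-based step counting; `training_schedule` is a hypothetical helper, not a function from the script.

```python
import math


def training_schedule(num_examples, train_batch_size, grad_accum_steps,
                      max_train_steps, num_processes=1):
    """Derive epoch count and effective batch size from the CLI flags."""
    # Batches the dataloader yields per epoch on each process.
    batches_per_epoch = math.ceil(num_examples / (train_batch_size * num_processes))
    # One optimizer update happens every `grad_accum_steps` batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    # Epochs needed to reach the requested number of optimizer updates.
    num_epochs = math.ceil(max_train_steps / updates_per_epoch)
    total_batch_size = train_batch_size * num_processes * grad_accum_steps
    return {
        "updates_per_epoch": updates_per_epoch,
        "num_epochs": num_epochs,
        "total_batch_size": total_batch_size,
    }


# Values from the log and command: 833 examples, batch size 1,
# 4 accumulation steps, 15000 optimization steps.
schedule = training_schedule(833, 1, 4, 15000)
print(schedule)
```

With these inputs the helper gives 209 updates per epoch, 72 epochs, and a total batch size of 4, matching the logged values.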

Logs

|Timestamp|Level|Message|
|---|---|---|
|Dec 7, 2023, 3:42:20 PM|INFO|Kernel started: 27fdce74-a69a-40c5-989e-8877ec3aa3d0, name: python3|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.2:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.12:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:04 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.975 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.899 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.881 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.877 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.872 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.861 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|

System Info

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Who can help?

@sayakpaul @patrickvonplaten
