Description
Describe the bug
12/07/2023 07:37:24 - INFO - __main__ - ***** Running training *****
12/07/2023 07:37:24 - INFO - __main__ - Num examples = 833
12/07/2023 07:37:24 - INFO - __main__ - Num Epochs = 72
12/07/2023 07:37:24 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - __main__ - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - __main__ - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126] Traceback (most recent call last):
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
main()
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
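The last frame of the first traceback shows `GradScaler` refusing to unscale a gradient stored in float16. That usually means the trainable (LoRA) parameters themselves ended up in fp16 instead of staying in fp32. The following is a minimal sketch of one possible workaround, not the script's actual code: `upcast_trainable_params` is a hypothetical helper, and the toy `Sequential` model only stands in for a frozen fp16 base with a trainable adapter. The assumption is that it would be called before the optimizer is created.

```python
import torch


def upcast_trainable_params(model: torch.nn.Module, dtype: torch.dtype = torch.float32) -> int:
    """Cast every trainable parameter of `model` to `dtype`, in place.

    Hypothetical helper (not part of the training script). Frozen weights
    keep their dtype, so a frozen fp16 base model still saves memory.
    Returns the number of parameters that were cast.
    """
    n_cast = 0
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            param.data = param.data.to(dtype)
            n_cast += 1
    return n_cast


# Toy stand-in for "frozen fp16 base + trainable adapter" (illustrative assumption).
base = torch.nn.Linear(4, 4).half().requires_grad_(False)
adapter = torch.nn.Linear(4, 4).half()  # trainable, but mistakenly in fp16
model = torch.nn.Sequential(base, adapter)

num_cast = upcast_trainable_params(model)  # casts adapter.weight and adapter.bias
```

With the trainable parameters in fp32, their gradients are fp32 as well, so the scaler's dtype check should no longer trigger, while autocast can still run the forward pass in half precision.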
Reproduction
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb
from huggingface_hub import HfFolder, login
# Log in with the Hugging Face API token
login(token='hf_tlt---------BRqMBjwdi')
# Set the WandB API key
import wandb
wandb.login(key='b6a210-------------7f543c')
# Run the training script
!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
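As a sanity check that the flags above match the training header in the error output, the logged values can be recomputed by hand. This sketch assumes a single process (one GPU on the Colab runtime) and ceil-based rounding for both the batches per epoch and the epoch count. Neither assumption is stated in the report, but together they reproduce the logged numbers.

```python
import math

num_examples = 833          # "Num examples" from the log
per_device_batch_size = 1   # --train_batch_size
num_processes = 1           # assumption: single-GPU runtime
grad_accum_steps = 4        # --gradient_accumulation_steps
max_train_steps = 15000     # --max_train_steps

# Effective batch size per optimizer update.
total_batch_size = per_device_batch_size * num_processes * grad_accum_steps

# Dataloader batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_processes))
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)

# Epochs needed to reach the requested number of optimizer updates.
num_epochs = math.ceil(max_train_steps / updates_per_epoch)

print(total_batch_size, updates_per_epoch, num_epochs)  # prints: 4 209 72
```

The results agree with "Total train batch size = 4" and "Num Epochs = 72" in the log, which suggests the run configuration itself is consistent and the failure comes from the precision handling rather than from the step or batch settings.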
Logs
|Timestamp|Level|Message|
|---|---|---|
|Dec 7, 2023, 3:42:20 PM|INFO|Kernel started: 27fdce74-a69a-40c5-989e-8877ec3aa3d0, name: python3|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.2:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.12:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:04 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:04 PM|WARNING| /root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:03 PM|WARNING| /root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.975 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.899 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.881 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.877 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.872 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.861 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
System Info
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management: