
ValueError: Attempting to unscale FP16 gradients. #6454

@qinchangchang

Description


Describe the bug

Traceback (most recent call last):
  File "train_text_to_image_lora.py", line 950, in <module>
    main()
  File "train_text_to_image_lora.py", line 777, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/qc/miniconda3/envs/lora-diffussion/lib/python3.8/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/qc/miniconda3/envs/lora-diffussion/lib/python3.8/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/qc/miniconda3/envs/lora-diffussion/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/qc/miniconda3/envs/lora-diffussion/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
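For context, `GradScaler.unscale_` refuses to divide fp16 gradients by the loss scale, because the division itself can overflow or underflow in half precision. A minimal sketch of that guard (a simplification for illustration, not torch's actual `_unscale_grads_` implementation; the dict-based "gradient" records are purely hypothetical):

```python
def unscale_grads(grads, inv_scale):
    """Divide each gradient by the loss scale, mimicking the guard in
    torch.cuda.amp.GradScaler: fp16 gradients are rejected outright."""
    for g in grads:
        if g["dtype"] == "float16":
            # This is the branch the traceback above ends in.
            raise ValueError("Attempting to unscale FP16 gradients.")
        g["value"] *= inv_scale
    return grads

# fp32 gradients unscale fine; fp16 ones reproduce the error.
ok = unscale_grads([{"dtype": "float32", "value": 8.0}], inv_scale=0.5)
print(ok[0]["value"])  # 4.0
try:
    unscale_grads([{"dtype": "float16", "value": 8.0}], inv_scale=0.5)
except ValueError as e:
    print(e)  # Attempting to unscale FP16 gradients.
```

So the error fires whenever the optimizer holds fp16 parameters (and thus produces fp16 gradients) while `--mixed_precision="fp16"` enables the scaler.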

Reproduction

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora" \
  --validation_prompt="cute dragon creature" --report_to="wandb"
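The usual workaround for this error is to keep the trainable LoRA parameters in fp32 even when the rest of the model runs in fp16, so the optimizer and `GradScaler` only ever see fp32 gradients. A sketch of that pattern (the `lora_layer` module here is an illustrative stand-in, not the script's actual variable; recent diffusers versions provide a `cast_training_params` helper in `diffusers.training_utils` for this):

```python
import torch

# Illustrative stand-in for a LoRA adapter attached to an fp16 model.
lora_layer = torch.nn.Linear(4, 4)
lora_layer.to(dtype=torch.float16)  # what --mixed_precision="fp16" effectively does

# Workaround: upcast only the trainable parameters back to fp32
# before building the optimizer, so their gradients are fp32 too.
for param in lora_layer.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

optimizer = torch.optim.AdamW(lora_layer.parameters(), lr=1e-4)

# Gradients now come out as fp32, which GradScaler.unscale_ accepts.
out = lora_layer(torch.randn(2, 4, dtype=torch.float32))
out.sum().backward()
print(lora_layer.weight.grad.dtype)  # torch.float32
```

With the parameters upcast this way, `accelerator.clip_grad_norm_` no longer hits the fp16 guard shown in the traceback.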

Logs

No response

System Info

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.10
  • Python version: 3.8.0
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.20.1
  • Transformers version: 4.36.2
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Labels

bug (Something isn't working)