Attempting to Unscale FP16 Gradients Bug #10752

@iszihan

Describe the bug

Hello, I get the following error when training an SDXL LoRA with train_dreambooth_lora_sdxl.py and --mixed_precision="fp16":

ValueError: Attempting to unscale FP16 gradients.
Traceback (most recent call last):
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
    main(args)
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2396, in clip_grad_norm_
    self.unscale_gradients()
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2340, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
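
For anyone reading the traceback: the last frame is the gradient scaler refusing to unscale because at least one gradient it was handed is float16, which usually means a trainable parameter is itself stored in float16. Below is a minimal diagnostic sketch for listing such parameters. The helper name `fp16_trainable_params` is hypothetical (it is not part of diffusers, accelerate, or PyTorch), and the demo uses stand-in objects that only mimic the `.requires_grad` and `.dtype` attributes of real parameters, so it runs without torch or a GPU. The parameter names in the demo are illustrative, not taken from this run.

```python
from types import SimpleNamespace


def fp16_trainable_params(named_parameters, fp16_dtype):
    """Return the names of parameters that require grad and are stored in fp16_dtype."""
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and param.dtype == fp16_dtype
    ]


# Stand-ins (assumption: they mimic the two attributes of torch.nn.Parameter
# that the helper reads), so the sketch runs without torch installed.
FP16, FP32 = "float16", "float32"
fake_params = [
    ("base.weight", SimpleNamespace(requires_grad=False, dtype=FP16)),   # frozen, ignored
    ("lora_A.weight", SimpleNamespace(requires_grad=True, dtype=FP16)),  # would be flagged
    ("lora_B.weight", SimpleNamespace(requires_grad=True, dtype=FP32)),  # fine
]

offenders = fp16_trainable_params(fake_params, FP16)
print(offenders)

# With the real models in the training script (not run here):
#   import torch
#   offenders = fp16_trainable_params(unet.named_parameters(), torch.float16)
```

If the list is non-empty just before `accelerator.clip_grad_norm_` is called, those parameters are the likely source of the float16 gradients.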

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
# export INSTANCE_DIR="dog"
export INSTANCE_DIR="/scratch/year/zling/progressive-shading/picasso-data/surrealism_images"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch --gpu_ids 0,1 train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME  \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a drawing in sks style" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A drawing of sks style" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub

System Info

  • 🤗 Diffusers version: 0.33.0.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.28.1
  • Transformers version: 4.48.3
  • Accelerate version: 1.3.0
  • PEFT version: 0.7.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.2
  • xFormers version: not installed
  • Accelerator:
    NVIDIA RTX A6000, 49140 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

Labels: bug, training
