Explosive variance observed in latents and noise_pred when using torch.autocast() #119
-
Hello, recently I tried training LoRA using your code https://github.com/PixArt-alpha/PixArt-alpha/blob/93f6bfe8942664052937b78171d3ed5ed56d8dba/train_scripts/train_pixart_lora_hf.py but ran into an issue: the DiT (PixArt model) does not work with torch.autocast(). Briefly, PixArt-alpha returns drastically different output if cast to fp16. Details below. Before everything, my bash script is:
And my accelerator configuration is:
My dataset contains 13 pixel-art-styled images. To begin with, running the code from the repo directly returns an error:
The next issue is:
Line 893 +
Now the training code can run without errors. But the validation images look as if the fine-tuning is not working at all, and every validation image stays invariant throughout the training process. Below are the validation images at steps 2, 300, 700, and 1000, and the image generated by the testing (final inference) script around L970:

Notice that although the validation images are almost invariant over time, the test image shows the actual effectiveness of the trained LoRA. In the code, the only difference is that the test inference uses

The tricky part is, if we apply

In particular, the first image is pure black; in that case PixArt is cast to fp16 by autocast() without LoRA attached. But without using autocast, the variance stays stable:

Here is my inference code with
Does anyone have an idea why?
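For reference, a minimal sketch of how one could compare noise_pred statistics for the same inputs with and without torch.autocast() (the checkpoint name, prompt, and tensor shapes are illustrative assumptions, not the original script):

```python
import torch
from diffusers import PixArtAlphaPipeline

# Illustrative sketch only: checkpoint, prompt, and shapes are assumptions.
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.float32
).to("cuda")

latents = torch.randn(1, 4, 64, 64, device="cuda")  # 512px -> 64x64 latent
timestep = torch.tensor([500], device="cuda")
prompt_embeds, prompt_mask, _, _ = pipe.encode_prompt("a pixel art cat", device="cuda")

@torch.no_grad()
def noise_pred_variance(use_autocast: bool) -> float:
    # One transformer forward pass; report the variance of noise_pred.
    with torch.autocast("cuda", dtype=torch.float16, enabled=use_autocast):
        noise_pred = pipe.transformer(
            latents,
            encoder_hidden_states=prompt_embeds,
            encoder_attention_mask=prompt_mask,
            timestep=timestep,
            added_cond_kwargs={"resolution": None, "aspect_ratio": None},
        ).sample
    return noise_pred.float().var().item()

print("fp32 variance:    ", noise_pred_variance(False))
print("autocast variance:", noise_pred_variance(True))
```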
-
Hi @AlezHibali, really nice observation. Do you mean that if we remove the
-
Hi @AlezHibali, I noticed something similar, mentioned here: on inference it generates as if no peft model is loaded. I made the change below once I noticed this problem, using the base 512 model. This works:
while this does not:
So I removed the dtype params from the "to" calls in the training script. Here is the patch that I created after I observed the point above: https://github.com/raulc0399/PixArt-alpha-finetuning/blob/main/train_pixart_lora_hf_2.patch Basically, I load the models in float16 and move them to cuda without conversion. Using the train_hf.sh from my repo, I managed after 100 epochs to fine-tune on Simpsons. Attached is a sample generated after fine-tuning. It is not quite there yet, but you can see that it is learning.
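To illustrate the pattern (a sketch under assumptions: the model id, LoRA directory, and loading approach are placeholders, not the exact code from the patch):

```python
import torch
from diffusers import PixArtAlphaPipeline, Transformer2DModel
from peft import PeftModel

MODEL_ID = "PixArt-alpha/PixArt-XL-2-512x512"
LORA_DIR = "output/pixart-lora"  # hypothetical path to the trained adapter

# Works: load the weights already in float16, attach the adapter,
# then move to CUDA *without* passing a dtype to .to().
transformer = Transformer2DModel.from_pretrained(
    MODEL_ID, subfolder="transformer", torch_dtype=torch.float16
)
transformer = PeftModel.from_pretrained(transformer, LORA_DIR)
pipe = PixArtAlphaPipeline.from_pretrained(
    MODEL_ID, transformer=transformer, torch_dtype=torch.float16
)
pipe.to("cuda")

# Does not work (images come out as if no LoRA were loaded):
# pipe.to("cuda", torch.float16)   # converting the dtype inside the .to() call
```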
-
Hi guys. The LoRA training script fixing PR is merged: cab13f2
@lawrence-cj Two small changes in the PR: I added params for DoRA and rsLoRA:
https://huggingface.co/docs/peft/package_reference/lora#peft.LoraConfig.use_rslora
https://huggingface.co/docs/peft/package_reference/lora#peft.LoraConfig.use_dora
Both of them show better results on my dataset. @AlezHibali, they might give you better results as well.
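For reference, a minimal sketch of a peft LoraConfig with both flags enabled (the rank, alpha, and target modules are illustrative values, not the ones from the PR):

```python
from peft import LoraConfig

# Illustrative values only; rank, alpha, and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    use_rslora=True,  # rank-stabilized scaling: lora_alpha / sqrt(r) instead of lora_alpha / r
    use_dora=True,    # weight-decomposed low-rank adaptation
)
```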