Why does train_text_to_image.py perform so differently from the CompVis script? #1153

john-sungjin · 2022-11-06T07:13:41Z

I posted about this on the forum but didn't get any useful feedback - would love to hear from someone who knows the ins and outs of the diffusers codebase!

https://discuss.huggingface.co/t/discrepancies-between-compvis-and-diffuser-fine-tuning/25556

To summarize the post: the train_text_to_image.py script and the original CompVis repo perform very differently when fine-tuning on the same dataset with the same hyperparameters. I'm trying to reproduce the Lambda Labs Pokemon fine-tuning results and having difficulty doing so (picture results in forum post).

I've been digging into the implementations and I'm not noticing any obvious differences in how the models are trained, losses are calculated, etc - so what explains the large behavioral discrepancies?

Would really appreciate any insight on what might be causing this.


patrickvonplaten · 2022-11-07T20:45:20Z

Thanks a lot for posting the question here @john-sungjin!

Sorry, for the moment we try to keep everything on GitHub. From the discussion it seems like many people are encountering your issue, so let's spend some time figuring out what's going on. @patil-suraj do you think you might have some time soon to look into it? If not, I can try to allocate time (or maybe cc @anton-l).

patil-suraj · 2022-11-08T10:20:38Z

Thanks for posting the detailed issue @john-sungjin !

As you said, the implementation is very similar to the CompVis one. The one difference that I'm aware of is that the CompVis script (for example, the Pokemon fine-tuning script) initialises the model from the sd-v1-4-full-ema.ckpt checkpoint, so it loads the non-EMA weights for training and the EMA weights for the EMA updates. The diffusers script, in contrast, uses the EMA checkpoint for both training and EMA.

I am going to add an option that enables loading both the non-EMA weights (for training) and the EMA weights (for EMA updates) in the diffusers script, and then compare again. Will report here as soon as possible :)
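
To make the distinction between the two weight sets concrete, here is a minimal sketch in plain Python. It is not the diffusers `EMAModel`: the class name `SimpleEMA`, the toy weight values, and the unrealistically small decay of 0.5 are all assumptions chosen for illustration. It shows the setup described above, where the optimizer updates the non-EMA weights while a separate shadow copy, seeded from the EMA weights, is pulled slowly towards them.

```python
from typing import List


class SimpleEMA:
    """Toy exponential moving average over a flat list of float weights."""

    def __init__(self, initial: List[float], decay: float = 0.9999) -> None:
        self.shadow = list(initial)  # copy, so the caller's list is not mutated
        self.decay = decay

    def step(self, current: List[float]) -> None:
        # shadow <- decay * shadow + (1 - decay) * current
        for i, value in enumerate(current):
            self.shadow[i] = self.decay * self.shadow[i] + (1.0 - self.decay) * value


# Hypothetical stand-ins for the two weight sets in a "full-ema" checkpoint.
non_ema_weights = [1.0, -2.0]  # raw weights, the ones the optimizer updates
ema_weights = [0.5, -1.0]      # averaged weights, used to seed the shadow copy

ema = SimpleEMA(ema_weights, decay=0.5)       # decay=0.5 only to make the effect visible
trained = [w + 1.0 for w in non_ema_weights]  # pretend one optimizer step happened
ema.step(trained)
print(ema.shadow)  # [1.25, -1.0]
```

With a realistic decay such as 0.9999 the shadow copy moves far more slowly, which is why the starting point of each weight set can matter for a short fine-tuning run.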

AIXiaoBaiDemon · 2022-11-09T01:27:22Z

I had the same problem, looking forward to your experiment.

treksis · 2022-11-09T19:57:58Z

same problem

Line290 · 2022-11-21T09:39:25Z

Mark it

patil-suraj · 2022-11-21T17:11:30Z

Going to update the script soon. I am getting good results with the script now; see, for example, the emoji model.

Line290 · 2022-11-25T06:59:46Z

Hi @patil-suraj,
I tried to load unet_with_ema in EMAModel and unet_without_ema for fine-tuning, but the results remain terrible. Could you help me fix it? Thanks.

Here are my code snippets:

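from typing import Iterable

import torch
from diffusers import UNet2DConditionModel


class EMAModel:
    """
    Exponential Moving Average of model weights
    """

    def __init__(self, parameters: Iterable[torch.nn.Parameter], device, dtype, decay=0.9999):
        parameters = list(parameters)
        # Shadow copies start from whatever weights are passed in (here: the EMA weights).
        self.shadow_params = [p.clone().detach().to(device=device, dtype=dtype)
                              for p in parameters]
        self.decay = decay
        self.optimization_step = 0


# `args`, `accelerator` and `weight_dtype` come from the surrounding training script.
# "unet" is loaded as the EMA source; "unet_wo_ema" as the weights to train.
unet_ema = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet")
unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet_wo_ema")

# Create EMA for the unet, initialised from the EMA weights.
if args.use_ema:
    ema_unet = EMAModel(unet_ema.parameters(), device=accelerator.device, dtype=weight_dtype)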

haofanwang · 2022-12-02T08:04:43Z

Any update? The results are bad, at least on the Pokemon dataset.

patrickvonplaten · 2022-12-02T17:08:22Z

Ping @patil-suraj here

patrickvonplaten · 2022-12-11T16:06:20Z

@patil-suraj can you please make the required updates? Also cc @williamberman

haofanwang · 2022-12-12T07:05:58Z

The problem goes away in the latest 0.10.0 version. But it would be helpful to show which specific modifications make it work.

patrickvonplaten · 2022-12-20T00:55:36Z

Another ping here @patil-suraj - could you look into this? Should we show how to use the EMA weights?

patil-suraj · 2022-12-26T13:40:08Z

Super sorry for being late here. I think using non-ema weights for training and ema weights for EMA updates will fix this issue. Also, the script is working for some users (and me also) as is, for example, @Norod has gotten good results with it, see https://huggingface.co/Norod78/sd2-simpsons-blip.

Here's the fix I'm proposing.

For now, upload the non-EMA weights under a non-ema branch in all SD repos, and use that branch to load the non-EMA weights for training.
Soon, we'll have a variation argument in from_pretrained which we could use to load the EMA or non-EMA weights. With that, we'll have all the weights in a single repo, which will make this easy to implement. But that will take some time to add, so for now we will resort to using branches.
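
As a rough sketch of the branch-based approach, the snippet below loads the UNet twice with `from_pretrained`: once from the `non-ema` branch via the `revision` argument (weights to train) and once from the default branch (weights used to seed the EMA copy). The helper names `unet_load_kwargs` and `load_training_and_ema_unets` are hypothetical, and the sketch assumes the repo actually has a `non-ema` branch and that its default branch holds the EMA weights, as described in this thread; it is not necessarily how the merged PR implements it.

```python
from typing import Any, Dict, Optional


def unet_load_kwargs(revision: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for loading the UNet from a given repo branch."""
    kwargs: Dict[str, Any] = {"subfolder": "unet"}
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_training_and_ema_unets(model_name: str, non_ema_revision: str = "non-ema"):
    """Return (unet_to_train, unet_for_ema_init) loaded from two branches."""
    # Imported lazily so the helper above works without diffusers installed.
    from diffusers import UNet2DConditionModel

    # Non-EMA weights: these are the ones the optimizer will update.
    unet = UNet2DConditionModel.from_pretrained(
        model_name, **unet_load_kwargs(non_ema_revision)
    )
    # Default branch (assumed to hold the EMA weights): seeds the EMA copy.
    ema_source = UNet2DConditionModel.from_pretrained(
        model_name, **unet_load_kwargs()
    )
    return unet, ema_source
```

Calling `load_training_and_ema_unets("CompVis/stable-diffusion-v1-4")` would download both sets of weights, so it is left out of the snippet itself.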

patil-suraj · 2022-12-26T15:27:59Z

Also, make sure to pass the --use_ema argument to the script; otherwise EMA updates won't be used, which might make the results worse.

patil-suraj · 2022-12-26T15:35:23Z

WIP PR is here #1834, will start running some experiments with it.

patil-suraj · 2022-12-26T16:40:19Z

I have started a training run with SD1.5 and should have results in about 5 hours.

patil-suraj · 2022-12-30T20:55:13Z

Hey everyone, we merged two PRs today that should fix the issues with the script and make it perform similarly to the CompVis script.

#1868: fixes a subtle bug with EMA updates.
#1834: allows using non-EMA weights for training and EMA weights for EMA updates.

With these changes I trained a model on the pokemon dataset and results are looking good now!

(Guess the prompts -:) )

Here's the model if you want to try it yourself.

I trained it for about 140 epochs on 2 A100s. Here's the command I used:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export NOM_EMA_REVISION="non-ema"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export WANDB_PROJECT="stable-diffusion-pokemon"

accelerate launch --multi_gpu --gpu_ids="0,1" --mixed_precision="no"  \
   ../diffusers/examples/text_to_image/train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --non_ema_revision=$NOM_EMA_REVISION \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=12 --gradient_checkpointing \
  --max_train_steps=5000 --checkpointing_steps=500 \
  --learning_rate=1e-04 --use_ema \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="models/pokemon-model" --seed=34637847 \
  --allow_tf32 \
  --enable_xformers_memory_efficient_attention

The script should work well now, but if you still have issues with it, feel free to re-open this issue :)
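
As a sanity check on the "about 140 epochs" figure, the arithmetic for the command above can be sketched as follows. Two values are assumptions not stated in this thread: that `gradient_accumulation_steps` is 1 because the flag is not passed, and that the Pokemon dataset has 833 image-caption pairs.

```python
num_gpus = 2            # --gpu_ids="0,1"
per_device_batch = 12   # --train_batch_size=12
grad_accum_steps = 1    # assumption: flag not passed in the command
max_train_steps = 5000  # --max_train_steps=5000
dataset_size = 833      # assumption: size of lambdalabs/pokemon-blip-captions

# Samples consumed per optimizer step across all GPUs.
effective_batch = num_gpus * per_device_batch * grad_accum_steps
samples_seen = effective_batch * max_train_steps
epochs = samples_seen / dataset_size

print(effective_batch)  # 24
print(round(epochs))    # 144
```

Under those assumptions the run covers roughly 144 passes over the data, which is consistent with the figure quoted above.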

Line290 · 2023-01-09T01:57:36Z

Cool thanks.

In my experiment, I got the same results, the setting was as below:
4*NVIDIA-V100-PCIE-16GB,
batch_size=2,
gradient_accumulation_steps=3,
mixed_precision="fp16"
deepspeed_stage=2
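
One detail worth noting about this setting: its effective batch size works out to the same value as the 2-GPU command earlier in the thread. The small sketch below does the arithmetic; the function name `effective_batch_size` is made up for illustration, and the 2-GPU run is assumed to use no gradient accumulation because the flag is not passed there.

```python
def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int = 1) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


a100_run = effective_batch_size(num_gpus=2, per_device_batch=12)                      # 2 * 12 * 1
v100_run = effective_batch_size(num_gpus=4, per_device_batch=2, grad_accum_steps=3)  # 4 * 2 * 3

print(a100_run, v100_run)  # 24 24
```

Matching the effective batch size is one plausible reason the two setups behave similarly at the same learning rate, although the thread does not confirm that this was intentional.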

patrickvonplaten assigned patil-suraj and patrickvonplaten Nov 7, 2022

patrickvonplaten removed their assignment Dec 20, 2022

patil-suraj mentioned this issue Dec 26, 2022

[train_text_to_image] allow using non-ema weights for training #1834

Merged

patil-suraj closed this as completed in #1834 Dec 30, 2022
