
Degraded Image Generation Performance in SD-XL U-Net Fine-tuning #5956

@wddwzwhhxx

Description


Describe the bug

Problem:
I am experiencing degraded image generation quality while fine-tuning the U-Net of the SDXL model with the train_text_to_image_sdxl.py script from the diffusers library. The generated results are noticeably worse than those from training SD1.5 on the exact same dataset.
Here are some examples of generating indoor decoration images:
sd1.5: [sample images]
sd-xl1.0: [sample images]

Then I tried fine-tuning SDXL on the pokemon dataset. As the number of training steps increased, the generated images seemed to deteriorate; the fine-tuning script appeared only to damage the base model's generation quality (a fixed-seed sampling sketch for reproducing this comparison follows the examples below):
steps 100: [image]
steps 300: [image]
steps 1000: [image]
steps 2000: [image]
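
In case it helps with reproduction: the per-step samples can be regenerated by loading each intermediate checkpoint's U-Net into the base pipeline with a fixed seed, so that differences between images come from the weights rather than the noise. A minimal sketch, assuming each checkpoint-* directory written by --checkpointing_steps contains a diffusers-format unet subfolder (prompt and paths are illustrative):

import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
OUT_DIR = "ft_model/pokemon_modify"  # illustrative path

for step in (100, 300, 1000, 2000):
    # Load the U-Net weights saved at this checkpoint (assumed diffusers format).
    unet = UNet2DConditionModel.from_pretrained(
        f"{OUT_DIR}/checkpoint-{step}/unet", torch_dtype=torch.float16
    )
    pipe = StableDiffusionXLPipeline.from_pretrained(
        BASE, unet=unet, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    # Fixed seed: any change between steps then reflects the fine-tuned weights.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe("a pokemon with blue eyes", generator=generator).images[0]
    image.save(f"sample_step_{step}.png")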

Has anyone else encountered a similar issue?
Alternatively, has anyone achieved good results when fine-tuning the SDXL U-Net with this script?
I hope to receive insights from the community on this matter. Thank you!

Reproduction

export MODEL_NAME="stable-diffusion-xl-base-1.0"
export VAE_NAME="sdxl-vae-fp16-fix"
export DATASET_NAME="pokemon-blip-captions"

/aistudio/workspace/system-default/envs/diffusers/bin/python train_sdxl_test.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --output_dir=/aistudio/workspace/sd/diffusers/examples/text_to_image/ft_model/pokemon_modify/ \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --enable_xformers_memory_efficient_attention \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=2000 \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --dataloader_num_workers=20
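
For checking the finished run, the fine-tuned model can be loaded back for inference. A minimal sketch, assuming the script saved a full diffusers pipeline into --output_dir at the end of training (prompt and seed are illustrative):

import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline assumed to have been saved into --output_dir by the script.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "/aistudio/workspace/sd/diffusers/examples/text_to_image/ft_model/pokemon_modify/",
    torch_dtype=torch.float16,
).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "a pokemon with blue eyes", generator=generator, num_inference_steps=30
).images[0]
image.save("finetuned_sample.png")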

Logs

No response

System Info

  • diffusers version: 0.22.0.dev0
  • Platform: Linux-4.19.96-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Huggingface_hub version: 0.17.3
  • Transformers version: 4.35.1
  • Accelerate version: 0.24.1
  • xFormers version: 0.0.22.post7
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: yes

Who can help?

@sayakpaul @patrickvonplaten
