Merged

42 commits
eb8f2db
Add files via upload
cene555 Jul 12, 2023
13a409c
Rename examples/tune_decoder.py to examples/kandinsky2_2_train/tune_d…
cene555 Jul 12, 2023
03502dc
Rename examples/tune_prior.py to examples/kandinsky2_2_train/tune_pri…
cene555 Jul 12, 2023
eb434ec
Rename examples/tune_decoder_lora.py to examples/kandinsky2_2_train/t…
cene555 Jul 12, 2023
e7924e6
Rename examples/tune_prior_lora.py to examples/kandinsky2_2_train/tun…
cene555 Jul 12, 2023
f17466f
style
Jul 14, 2023
9c368c1
Add files via upload
cene555 Jul 21, 2023
dce997a
Merge branch 'main' of github.com:ai-forever/diffusers into kandinsky…
Sep 2, 2023
272f789
update tune_decoder
Sep 4, 2023
3d7f795
Merge remote-tracking branch 'origin/main' into kandinsky-finetune
Sep 4, 2023
64abbaf
Merge remote-tracking branch 'origin/main' into kandinsky-finetune
Sep 4, 2023
6e8faa4
rename
Sep 4, 2023
d39efc7
style
Sep 4, 2023
03c9da0
save only decoder pipeline
Sep 4, 2023
810425c
update text-to-image-lora
Sep 5, 2023
4e3a210
remove xformer
Sep 5, 2023
9ec48d2
update train_prior
Sep 6, 2023
3f8ea1c
fix clip_mean
Sep 6, 2023
a7d136b
fix
Sep 6, 2023
429498e
fix more
Sep 6, 2023
78812ef
update prior lora
Sep 7, 2023
982b2a6
fix
Sep 7, 2023
5ce8398
test lora loader
Sep 7, 2023
9be8440
style + fix
Sep 7, 2023
dd42e9d
rename files and add readme
Sep 7, 2023
3545212
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
a3581d5
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
121f3cb
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
8ee5597
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
c3707a7
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
3676185
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
8146003
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
af96d29
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
cdc83a4
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
d5b06b9
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
eaf6797
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
8e8cb1c
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
db25048
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
91c8367
Update examples/kandinsky2_2/text_to_image/README.md
yiyixuxu Sep 11, 2023
7066be9
add
Sep 11, 2023
cf8ea15
make style
Sep 13, 2023
77c86be
Merge branch 'main' into kandinsky-finetune
sayakpaul Sep 14, 2023
5 changes: 5 additions & 0 deletions docs/source/en/training/text2image.md
@@ -281,3 +281,8 @@ image.save("yoda-pokemon.png")

* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).
* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).


## Kandinsky 2.2

* We support fine-tuning both the decoder and prior in Kandinsky 2.2 with the `train_text_to_image_prior.py` and `train_text_to_image_decoder.py` scripts. LoRA support is also included. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/README.md).
317 changes: 317 additions & 0 deletions examples/kandinsky2_2/text_to_image/README.md
@@ -0,0 +1,317 @@
# Kandinsky 2.2 text-to-image fine-tuning

Kandinsky 2.2 includes a prior pipeline that generates image embeddings from text prompts, and a decoder pipeline that generates the output image based on the image embeddings. We provide `train_text_to_image_prior.py` and `train_text_to_image_decoder.py` scripts to show you how to fine-tune the Kandinsky prior and decoder models separately based on your own dataset. To achieve the best results, you should fine-tune **_both_** your prior and decoder models.

___Note___:

___This script is experimental. The script fine-tunes the whole model, and often the model overfits and runs into issues like catastrophic forgetting. It's recommended to try different hyperparameters to get the best results on your dataset.___
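
Before fine-tuning, it can help to see how the two stages fit together at inference time. The sketch below (assuming the public `kandinsky-community` checkpoints and a CUDA GPU) runs the prior to turn a prompt into image embeddings, then the decoder to turn those embeddings into an image:

```python
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch

# Stage 1: the prior maps a text prompt to CLIP image embeddings.
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
# Stage 2: the decoder maps image embeddings to pixels.
pipe_decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

prompt = "A robot pokemon, 4k photo"
image_embeds, negative_image_embeds = pipe_prior(prompt).to_tuple()
image = pipe_decoder(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
).images[0]
image.save("robot-pokemon.png")
```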


## Running locally with PyTorch

Before running the scripts, make sure to install the library's training dependencies:

**Important**

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date, as we update the example scripts frequently and some examples have example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then cd into the example folder and run:
```bash
pip install -r requirements.txt
```

And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```
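
Or, if you prefer a default 🤗Accelerate configuration without answering questions about your environment:

```bash
accelerate config default
```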

___

### Pokemon example

For all our examples, we will directly store the trained weights on the Hub, so we need to be logged in and add the `--push_to_hub` flag. In order to do that, you have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to the [User Access Tokens](https://huggingface.co/docs/hub/security-tokens) guide.

Run the following command to authenticate your token

```bash
huggingface-cli login
```

We also use [Weights and Biases](https://docs.wandb.ai/quickstart) logging by default because it makes it easy to monitor training progress by regularly generating sample images during training. To install wandb, run:

```bash
pip install wandb
```

To disable wandb logging, remove the `--report_to="wandb"` and `--validation_prompts="A robot pokemon, 4k photo"` flags from the examples below.

#### Fine-tune decoder
<br>

<!-- accelerate_snippet_start -->
```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
--dataset_name=$DATASET_NAME \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--checkpoints_total_limit=3 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--validation_prompts="A robot pokemon, 4k photo" \
--report_to="wandb" \
--push_to_hub \
--output_dir="kandi2-decoder-pokemon-model"
```
<!-- accelerate_snippet_end -->


To train on your own files, prepare the dataset in the format required by `datasets`. You can find the instructions for how to do that in the [ImageFolder with metadata](https://huggingface.co/docs/datasets/en/image_load#imagefolder-with-metadata) guide. If you wish to use custom loading logic, you should modify the script; we have left pointers for that in the training script.
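
As a concrete sketch of that format (the file names and captions here are hypothetical), a folder containing your images plus a `metadata.jsonl` file can be loaded by `datasets` directly, with the captions exposed under the `text` column:

```python
# Expected layout (hypothetical file names):
#
#   path_to_your_dataset/
#   ├── metadata.jsonl   # one JSON object per line, e.g. {"file_name": "0001.png", "text": "a robot pokemon"}
#   ├── 0001.png
#   └── 0002.png
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="path_to_your_dataset", split="train")
print(dataset[0]["text"])  # prints the caption of the first image
```

The training command then simply points `--train_data_dir` at that folder: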

```bash
export TRAIN_DIR="path_to_your_dataset"

accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
--train_data_dir=$TRAIN_DIR \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--checkpoints_total_limit=3 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--validation_prompts="A robot pokemon, 4k photo" \
--report_to="wandb" \
--push_to_hub \
--output_dir="kandi22-decoder-pokemon-model"
```


Once the training is finished, the model will be saved in the `output_dir` specified in the command, which in this example is `kandi22-decoder-pokemon-model`. To load the fine-tuned model for inference, just pass that path to `AutoPipelineForText2Image`:

```python
from diffusers import AutoPipelineForText2Image
import torch

output_dir = "kandi22-decoder-pokemon-model"  # the --output_dir used during training
pipe = AutoPipelineForText2Image.from_pretrained(output_dir, torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "A robot pokemon, 4k photo"
images = pipe(prompt=prompt).images
images[0].save("robot-pokemon.png")
```

Checkpoints only save the UNet, so to run inference from a checkpoint, just load the UNet:
```python
from diffusers import AutoPipelineForText2Image, UNet2DConditionModel
import torch

model_path = "path_to_saved_model"

# Replace <N> with the step number of the checkpoint you want to load.
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet")

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", unet=unet, torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

image = pipe(prompt="A robot pokemon, 4k photo").images[0]
image.save("robot-pokemon.png")
```

#### Fine-tune prior

You can fine-tune the Kandinsky prior model with the `train_text_to_image_prior.py` script. Note that we currently do not support `--gradient_checkpointing` for prior model fine-tuning.

<br>

<!-- accelerate_snippet_start -->
```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_prior.py \
--dataset_name=$DATASET_NAME \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--checkpoints_total_limit=3 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--validation_prompts="A robot pokemon, 4k photo" \
--report_to="wandb" \
--push_to_hub \
--output_dir="kandi2-prior-pokemon-model"
```
<!-- accelerate_snippet_end -->


To perform inference with the fine-tuned prior model, you will first need to create a prior pipeline by passing the `output_dir` to `DiffusionPipeline`. Then create a `KandinskyV22CombinedPipeline` from a pretrained or fine-tuned decoder checkpoint along with all the modules of the prior pipeline you just created:

```python
from diffusers import AutoPipelineForText2Image, DiffusionPipeline
import torch

output_dir = "kandi2-prior-pokemon-model"  # the --output_dir used during training
pipe_prior = DiffusionPipeline.from_pretrained(output_dir, torch_dtype=torch.float16)
prior_components = {"prior_" + k: v for k, v in pipe_prior.components.items()}
pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", **prior_components, torch_dtype=torch.float16)

pipe.enable_model_cpu_offload()

prompt = "A robot pokemon, 4k photo"
negative_prompt = "low quality, bad quality"  # example value; the negative prompt is optional
images = pipe(prompt=prompt, negative_prompt=negative_prompt).images
images[0].save("robot-pokemon.png")
```
> **Reviewer note:** Can we also make a note on how to perform inference when someone fine-tunes both the prior and the decoder? Of course, it's obvious to us as to what needs to be done but it might not be for all.


If you want to use a fine-tuned decoder checkpoint along with your fine-tuned prior checkpoint, simply replace "kandinsky-community/kandinsky-2-2-decoder" in the code above with your custom model repo name. Note that in order to create a `KandinskyV22CombinedPipeline`, your model repository needs to have a prior tag. If you created your model repo using our training script, the prior tag is automatically included.
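
For example, putting the two together might look like this (a sketch; the repo names are placeholders for your own fine-tuned prior and decoder repos):

```python
from diffusers import AutoPipelineForText2Image, DiffusionPipeline
import torch

# Load your fine-tuned prior pipeline and prefix its components with "prior_".
pipe_prior = DiffusionPipeline.from_pretrained("your-username/kandi2-prior-pokemon-model", torch_dtype=torch.float16)
prior_components = {"prior_" + k: v for k, v in pipe_prior.components.items()}

# Create the combined pipeline from your fine-tuned decoder repo plus the prior components.
pipe = AutoPipelineForText2Image.from_pretrained(
    "your-username/kandi22-decoder-pokemon-model", **prior_components, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

image = pipe(prompt="A robot pokemon, 4k photo").images[0]
image.save("robot-pokemon.png")
```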

#### Training with multiple GPUs

`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:

```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image_decoder.py \
--dataset_name=$DATASET_NAME \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--checkpoints_total_limit=3 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--validation_prompts="A robot pokemon, 4k photo" \
--report_to="wandb" \
--push_to_hub \
--output_dir="kandi2-decoder-pokemon-model"
```


#### Training with Min-SNR weighting

We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556), which helps achieve faster convergence by rebalancing the loss. To enable it, pass the `--snr_gamma` argument and set it to the recommended value of 5.0.
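
For example, appending the flag to the decoder fine-tuning command (a sketch; all other flags as in the earlier examples):

```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
--dataset_name=$DATASET_NAME \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--snr_gamma=5.0 \
--output_dir="kandi2-decoder-pokemon-model"
```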


## Training with LoRA

Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and **only** training those newly added weights. This has a couple of advantages:

- Previous pretrained weights are kept frozen so that the model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers allow controlling the extent to which the model is adapted toward new training images via a `scale` parameter.

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

With LoRA, it's possible to fine-tune Kandinsky 2.2 on a custom image-caption dataset on consumer GPUs like the Tesla T4 or Tesla V100.

### Training

First, you need to set up your development environment as explained in the [Running locally with PyTorch](#running-locally-with-pytorch) section above. Make sure to set the `DATASET_NAME` environment variable. Here, we will use the [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder) model and the [Pokemon dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions).


#### Train decoder

```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_decoder_lora.py \
--dataset_name=$DATASET_NAME --caption_column="text" \
--resolution=768 \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--rank=4 \
--gradient_checkpointing \
--output_dir="kandi22-decoder-pokemon-lora" \
--validation_prompt="cute dragon creature" --report_to="wandb" \
--push_to_hub
```

#### Train prior

```bash
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_prior_lora.py \
--dataset_name=$DATASET_NAME --caption_column="text" \
--resolution=768 \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--rank=4 \
--output_dir="kandi22-prior-pokemon-lora" \
--validation_prompt="cute dragon creature" --report_to="wandb" \
--push_to_hub
```

**___Note: When using LoRA, we can use a much higher learning rate than for non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run the above scripts on consumer GPUs like the T4 or V100.___**


### Inference

#### Inference using fine-tuned LoRA checkpoint for decoder

Once you have trained a Kandinsky decoder model using the above command, inference can be done with `AutoPipelineForText2Image` after loading the trained LoRA weights. You need to pass the `output_dir` for loading the LoRA weights, which in this case is `kandi22-decoder-pokemon-lora`:


```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)

output_dir = "kandi22-decoder-pokemon-lora"  # the --output_dir used during training
pipe.unet.load_attn_procs(output_dir)
pipe.enable_model_cpu_offload()

prompt = "A robot pokemon, 4k photo"
image = pipe(prompt=prompt).images[0]
image.save("robot_pokemon.png")
```
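
Since we passed `--push_to_hub` during training, the LoRA weights also live in your Hub repository, so (assuming a hypothetical `your-username` namespace) they can be loaded directly by repo id instead of a local path:

```python
pipe.unet.load_attn_procs("your-username/kandi22-decoder-pokemon-lora")
```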

#### Inference using fine-tuned LoRA checkpoint for prior

```python
from diffusers import AutoPipelineForText2Image
import torch

# Note (from the review discussion): we intentionally load the *decoder* checkpoint
# here to create a combined pipeline; once we have the combined pipeline, the prior
# is accessed as `pipe.prior_prior`, which is where the LoRA weights are loaded below.
pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)

output_dir = "kandi22-prior-pokemon-lora"  # the --output_dir used during training
pipe.prior_prior.load_attn_procs(output_dir)
pipe.enable_model_cpu_offload()

prompt = "A robot pokemon, 4k photo"
image = pipe(prompt=prompt).images[0]
image.save("robot_pokemon.png")
```

### Training with xFormers

You can enable memory efficient attention by [installing xFormers](https://huggingface.co/docs/diffusers/main/en/optimization/xformers) and passing the `--enable_xformers_memory_efficient_attention` argument to the script.

xFormers training is not available for fine-tuning the prior model.
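
A sketch of enabling it for decoder fine-tuning (flags otherwise as in the earlier examples):

```bash
pip install xformers

export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
--dataset_name=$DATASET_NAME \
--resolution=768 \
--train_batch_size=1 \
--enable_xformers_memory_efficient_attention \
--output_dir="kandi2-decoder-pokemon-model"
```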

**Note**:

According to [this issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training in some GPUs. If you observe that problem, please install a development version as indicated in that comment.
7 changes: 7 additions & 0 deletions examples/kandinsky2_2/text_to_image/requirements.txt
@@ -0,0 +1,7 @@
accelerate>=0.16.0
torchvision
transformers>=4.25.1
datasets
ftfy
tensorboard
Jinja2