[WIP] adding Kandinsky training scripts #4890
Conversation
…une_decoder_lora.py
The documentation is not available anymore as the PR was closed or merged.
```diff
-class PriorTransformer(ModelMixin, ConfigMixin):
+class PriorTransformer(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
```
I'm a little bit surprised to see UNet2DConditionLoadersMixin work out of the box for PriorTransformer :)
Just diffusers things 😎
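For context, a rough sketch of what the mixin buys us here: the prior can now save/load attention processors (e.g. LoRA weights) through the same entry points the UNet uses. The weights path below is hypothetical:

```python
from diffusers import PriorTransformer

# With UNet2DConditionLoadersMixin, PriorTransformer picks up load_attn_procs /
# save_attn_procs, the same LoRA entry points the UNet exposes.
prior = PriorTransformer.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", subfolder="prior"
)
prior.load_attn_procs("path/to/prior_lora_weights")  # hypothetical output dir
```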
### Training with xFormers:

You can enable memory efficient attention by [installing xFormers](https://huggingface.co/docs/diffusers/main/en/optimization/xformers) and passing the `--enable_xformers_memory_efficient_attention` argument to the script.

xFormers training is not available for Prior model fine-tuning.

**Note**:

According to [this issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training on some GPUs. If you observe that problem, please install a development version as indicated in that comment.
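For reference, a minimal sketch of how this flag is typically gated in our training scripts (assumes the script's `args` namespace and a `unet` model in scope):

```python
from diffusers.utils import is_xformers_available

# Only switch on xFormers attention when the library is actually installed.
if args.enable_xformers_memory_efficient_attention:
    if is_xformers_available():
        unet.enable_xformers_memory_efficient_attention()
    else:
        raise ValueError("xformers is not available. Make sure it is installed correctly.")
```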
We have LoRA too. Let's include a section on LoRA as well in the README.
```python
block_id = int(name[len("down_blocks.")])
hidden_size = unet.config.block_out_channels[block_id]

lora_attn_procs[name] = LoRAAttnAddedKVProcessor(
```
Hmm, seems like we don't have a faster implementation of this processor. Based on the usage of the training scripts, I think we can monitor and add one later if needed. WDYT?
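For readers following along, a sketch of how this processor map is assembled across the UNet's attention blocks; `unet` and `args.rank` come from the training script, and the `cross_attention_dim` handling is assumed from our other added-KV LoRA scripts:

```python
from diffusers.models.attention_processor import LoRAAttnAddedKVProcessor

# Build a LoRA processor for every attention module in the decoder UNet,
# sizing each one to the hidden width of the block it lives in.
lora_attn_procs = {}
for name in unet.attn_processors.keys():
    if name.startswith("mid_block"):
        hidden_size = unet.config.block_out_channels[-1]
    elif name.startswith("up_blocks"):
        block_id = int(name[len("up_blocks.")])
        hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
    elif name.startswith("down_blocks"):
        block_id = int(name[len("down_blocks.")])
        hidden_size = unet.config.block_out_channels[block_id]

    lora_attn_procs[name] = LoRAAttnAddedKVProcessor(
        hidden_size=hidden_size,
        cross_attention_dim=unet.config.cross_attention_dim,  # assumption
        rank=args.rank,  # assumes the script exposes a --rank argument
    )

unet.set_attn_processor(lora_attn_procs)
```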
```python
model_pred = unet(noisy_latents, timesteps, None, added_cond_kwargs=added_cond_kwargs).sample[:, :4]

if args.snr_gamma is None:
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
```
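For context, the `snr_gamma` branch applies min-SNR loss weighting. A sketch of the other side of that conditional, following the pattern in the SD text-to-image script (`model_pred`, `target`, `noise_scheduler`, and `timesteps` come from the training step):

```python
import torch
import torch.nn.functional as F
from diffusers.training_utils import compute_snr  # older scripts define this helper inline

if args.snr_gamma is None:
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
else:
    # Min-SNR weighting: clamp each timestep's SNR at snr_gamma, then rescale
    # the per-sample MSE by min(snr, gamma) / snr before averaging.
    snr = compute_snr(noise_scheduler, timesteps)
    mse_loss_weights = (
        torch.stack([snr, args.snr_gamma * torch.ones_like(timesteps)], dim=1).min(dim=1)[0] / snr
    )
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="none")
    loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
    loss = loss.mean()
```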
IIRC Kandinsky models weren't trained using the epsilon prediction objective. If so, does this still lead to reasonable results?
Actually, I'm not sure what objective was used for decoder training. I looked back at the DALL-E 2 paper, and I think it only mentions that the prior predicts samples directly.
sayakpaul left a comment
Amazing work!
Questions / comments:
- Do we not LoRA fine-tune the text encoders as typically done in the SD world?
- Let's add tests! Pinging @DN6 for a second opinion here.
- Let's port over the README to our training documentation too so that it shows up on https://huggingface.co/docs/diffusers/main/en/index.
stevhliu left a comment
Very nice, thanks for writing this training guide up! 😄
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
```
Suggested change:

```diff
-pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
```
Actually, here we want to create a combined pipeline, so we'll have to use the decoder checkpoint for that.
Once we have `pipe` as the combined pipeline, we can then access the prior with `pipe.prior_prior`.
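To make this concrete, a quick sketch (the LoRA weights path is hypothetical):

```python
import torch
from diffusers import AutoPipelineForText2Image

# The decoder checkpoint resolves to the combined (prior + decoder) pipeline.
pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)

# Components of the prior sub-pipeline are exposed under the `prior_` prefix.
pipe.prior_prior.load_attn_procs("path/to/prior_lora_output")  # hypothetical path
```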
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Is it okay to merge now? Maybe @sayakpaul can take a look again.

@sayakpaul about your questions:
Kandinsky does not have a pure text-conditioned diffusion process; it decodes an image embedding instead. It does use a text encoder, but I'm not sure how important it is to fine-tune it. Maybe the community can try it out.
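For anyone reading along, the two-stage flow being described, roughly (the model IDs are the public Kandinsky 2.2 checkpoints):

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

# Stage 1: the prior maps a text prompt to a CLIP image embedding.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
image_embeds, negative_image_embeds = prior("A robot pokemon, 4k photo").to_tuple()

# Stage 2: the decoder is conditioned on the image embedding rather than raw
# text, which is why fine-tuning the text encoder matters less here.
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeds=image_embeds, negative_image_embeds=negative_image_embeds
).images[0]
```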
sayakpaul left a comment
Dope! Thank you so much for seeing this through, Yiyi!
* Add files via upload Co-authored-by: Shahmatov Arseniy <62886550+cene555@users.noreply.github.com> Co-authored-by: yiyixuxu <yixu310@gmail.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
🚨🚨🚨 Note: The main author of this PR is @cene555 and the Kandinsky team. In this PR we mainly just make some small modifications to the training script provided by the authors in the original PR so that it's consistent with all our other training scripts. Thanks a million for the contribution @cene555 🚨🚨🚨
Authors of this PR:
Arseniy Shakhmatov
Anton Razzhigaev
Aleksandr Nikolich
Igor Pavlov
Andrey Kuznetsov
Denis Dimitrov
Note: DreamBooth will be included in a separate PR.