clip skip for Stable diffusion pipelines #1721
Hi,
Is the "clip skip" feature supported by the Stable Diffusion pipelines? I did a bit of searching in the StableDiffusionPipeline documentation but found nothing related.
See: Using Hidden States of CLIP’s Penultimate Layer
Thanks in advance.

Comments
Hey @xunings, yes, we indeed also skip the final layer of CLIP and instead use the "penultimate" layer. See: https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/text_encoder/config.json#L19
Hi @patrickvonplaten,
SD 1.x models also use a CLIPTextModel: https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/text_encoder/config.json#L19, but I have no idea how you look at a config like that and figure out what is being "skipped." None of those values obviously look like "next-to-last" to me, and reading the CLIPTextConfig docs didn't shed any light on that either.
Also replied on #1863, but in short: https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/dev/sdgrpcserver/pipeline/text_embedding/text_encoder_alt_layer.py will wrap a TextEncoder and return some earlier layer to the pipeline.
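For illustration, here is a minimal sketch of that wrapper idea; it is not the linked implementation, and the class name and `layers_back` parameter are made up for this example. The key steps are requesting hidden states, indexing back from the last layer, and re-applying the final LayerNorm:

```python
import torch
from transformers import CLIPTextModel


class CLIPSkipTextEncoder(torch.nn.Module):
    """Hypothetical wrapper: returns an earlier CLIP hidden state instead of the last one."""

    def __init__(self, text_encoder: CLIPTextModel, layers_back: int = 1):
        super().__init__()
        self.text_encoder = text_encoder
        self.layers_back = layers_back  # 1 = penultimate layer

    def forward(self, input_ids, attention_mask=None):
        outputs = self.text_encoder(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # hidden_states[-1] is the last layer's pre-norm output, so index back from it.
        hidden = outputs.hidden_states[-(self.layers_back + 1)]
        # Re-apply the final LayerNorm so the embedding matches the distribution
        # the UNet's cross-attention was trained against.
        return self.text_encoder.text_model.final_layer_norm(hidden)
```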
Yeah, sorry, it's not super obvious and we should probably have left at least a comment in the code here. But just to confirm:
How is this implemented? The CLIP model used for SD 1.x has 12 layers; the one for SD 2.x has 24 layers (conventions are usually: "base": 12 layers, "large": 24 layers, ..., GPT-3: 96 layers). What we do is simply use all layers for Stable Diffusion 1.4/1.5, but for SD 2.x we've thrown away the final layer (the 24th) when converting the checkpoint, since it's never used.
Hope this helps a bit. @patil-suraj we should leave a comment in the code and/or conversion script going forward, I think.
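A quick way to verify this from the configs linked above (a sketch; the printed values are what those config files contain):

```python
from transformers import CLIPTextConfig

# Layer counts as stored in each converted checkpoint's text_encoder/config.json.
cfg_sd1 = CLIPTextConfig.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder"
)
cfg_sd2 = CLIPTextConfig.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder"
)
print(cfg_sd1.num_hidden_layers)  # 12 -- all layers of CLIP ViT-L/14 are kept
print(cfg_sd2.num_hidden_layers)  # 23 -- the final (24th) OpenCLIP ViT-H layer was dropped
```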
That explains a thing or two, thank you. But people are looking to do this during inference on various fine-tunings of SD 1.x. (How this makes any sense mathematically, I have no idea. I guess that's part of the magic of neural networks and diffusion models.) It sounds like those models should then be distributed with that adjustment made to their text encoder config.
People are actually sometimes using the penultimate (or even deeper) layers even when the model doesn't recommend it. I believe the theory is that the ultimate layer is too tied to specific concepts, while earlier layers give you a more conceptual, wider latent space (so "Oak" becomes "Tree"). The model doesn't stress itself out trying to hit a specific target, and you get a less prompt-accurate but more coherent result. I'm not sure how much of this has actually been tested and how much is just assumption based on the success of using it for models designed for it.
Oh, I guess the other consideration here is that we're talking about the CLIPTextModel. Different diffusion "models" may be different diffusion UNets but all refer to the same CLIPTextModel, in which case it would make sense to have a parameter for this instead of shipping clip-vit-h-12-layers.bin, clip-vit-h-11-layers.bin, clip-vit-h-10-layers.bin, etc.
Thanks a lot for the explanation. Regarding SD 1.4/1.5, I am just wondering if we can switch to the "penultimate" layer by changing the num_hidden_layers config to 11, or is there anything deeper in the code that needs to be modified?
For 1.4/1.5 it would be enough to just change the config. It'll throw a warning at loading, but should work. Nevertheless, one can also easily retrieve previous hidden layers by adding output_hidden_states=True. So for 1.4/1.5 I would maybe recommend doing this for training, e.g.:

```python
text_embeddings = self.text_encoder(
    text_input_ids.to(device),
    attention_mask=attention_mask,
    output_hidden_states=True,
)
text_embeddings = text_embeddings.hidden_states[-2]  # penultimate layer
```

Note this does not make too much sense for inference, as the UNet needs to be conditioned on this, but we could definitely add it for training when doing DreamBooth, LoRA, text-to-image, etc. cc @patil-suraj
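For reference, a minimal sketch of the config-change route mentioned above (assuming the standard CompVis/stable-diffusion-v1-4 repo layout; from_pretrained forwards overrides like num_hidden_layers to the model config):

```python
from transformers import CLIPTextModel

# Load the SD 1.x text encoder with one layer fewer, so its "last" hidden
# state is what used to be the penultimate layer. Loading warns about the
# unused 12th-layer weights, which is expected.
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    subfolder="text_encoder",
    num_hidden_layers=11,  # the SD 1.x default is 12
)
```

A convenient side effect of this route: the checkpoint's final_layer_norm is then applied to the eleventh layer's output automatically, which anticipates the normalization question raised below.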
The last hidden state is normed in the code, but the other hidden states seem not to be:

```python
last_hidden_state = self.final_layer_norm(last_hidden_state)
```

Can we also support the final_layer_norm for the penultimate layer in the pipeline? At least it is explicitly done in automatic1111:

```python
if opts.CLIP_stop_at_last_layers > 1:
    z = outputs.hidden_states[-opts.CLIP_stop_at_last_layers]
    z = self.wrapped.transformer.text_model.final_layer_norm(z)
else:
    z = outputs.last_hidden_state
```
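For completeness, the same normalization expressed against a transformers CLIPTextModel; a sketch assuming the SD 1.x text encoder, equivalent to CLIP_stop_at_last_layers == 2 above:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

input_ids = tokenizer("an oak tree", padding="max_length", return_tensors="pt").input_ids
outputs = text_encoder(input_ids, output_hidden_states=True)
z = outputs.hidden_states[-2]                     # penultimate layer, pre-norm
z = text_encoder.text_model.final_layer_norm(z)   # re-apply the checkpoint's final LayerNorm
```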
@xunings - see the code I linked to earlier. It gets the earlier layers correctly normalized.
If we skip the last layer when using the model and set the penultimate layer as the last layer, it will get normalized.
We could definitely support this for fine-tuning, for example by adding an argument for it. Is there any example or experiment with SD 1.x using the penultimate layer that we could check?
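As an illustration of what such a fine-tuning argument could look like; the flag name --clip_skip and its indexing convention are hypothetical, loosely following the automatic1111 semantics above (clip_skip=1 selects the penultimate layer):

```python
import argparse

from transformers import CLIPTextModel, CLIPTokenizer

parser = argparse.ArgumentParser()
# Hypothetical flag: 0 keeps the usual last_hidden_state, 1 uses the penultimate layer, etc.
parser.add_argument("--clip_skip", type=int, default=0)
args = parser.parse_args()

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# Text-encoding step of a training script:
text_input_ids = tokenizer("a photo of an oak tree", padding="max_length", return_tensors="pt").input_ids
outputs = text_encoder(text_input_ids, output_hidden_states=True)
if args.clip_skip > 0:
    text_embeddings = outputs.hidden_states[-(args.clip_skip + 1)]
    text_embeddings = text_encoder.text_model.final_layer_norm(text_embeddings)
else:
    text_embeddings = outputs.last_hidden_state
```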
@patil-suraj, by "if we skip the last layer when using the model", are you referring to modifying the num_hidden_layers config? @hafriedlander, thanks for the code; the solution looks very solid. I am just curious: did you try simply modifying num_hidden_layers? If yes, what was the problem?
@xunings yes, that's right.
Thanks for the confirmation, it's clear to me now.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Would be nice to just have a simple config option for this.