
clip skip for Stable diffusion pipelines #1721

Closed
xunings opened this issue Dec 16, 2022 · 18 comments
Labels
stale Issues that haven't received updates

Comments

@xunings

xunings commented Dec 16, 2022

Hi,
Is the "clip skip" feature supported by the Stable diffusion pipelines? Just did a little bit of search on the documentation of StableDiffusionPipeline but found nothing related.

See: Using Hidden States of CLIP’s Penultimate Layer

Thanks in advance.

@patrickvonplaten
Contributor

Hey @xunings,

Yes, we indeed also skip the final layer of CLIP and instead use the "penultimate" layer. See: https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/text_encoder/config.json#L19
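
For reference, a minimal way to check this with transformers (this just reads the shipped config, nothing SD-specific):

    from transformers import CLIPTextConfig

    # num_hidden_layers reflects how many CLIP layers the converted
    # checkpoint actually keeps.
    config = CLIPTextConfig.from_pretrained(
        "stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder"
    )
    print(config.num_hidden_layers)  # 23: the original 24-layer CLIP minus one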

@xunings
Author

xunings commented Dec 21, 2022

> Hey @xunings,
>
> Yes, we indeed also skip the final layer of CLIP and instead use the "penultimate" layer. See: https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/text_encoder/config.json#L19

Hi @patrickvonplaten ,
Thanks a lot for the reply. Is "num_hidden_layers" the correct parameter to modify in the config file? Can you also show an example for sd v1?

@keturn
Contributor

keturn commented Dec 30, 2022

SD 1.x models also use a CLIPTextModel: https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/text_encoder/config.json#L19

but I have no idea how you look at a config like that and figure out what is being "skipped." None of those values obviously look like "next-to-last" to me, and reading the CLIPTextConfig docs didn't shed any light on that either.

@hafriedlander
Contributor

Also replied on #1863 but https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/dev/sdgrpcserver/pipeline/text_embedding/text_encoder_alt_layer.py will wrap a TextEncoder and return some earlier layer to the pipeline.
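
Not the linked implementation, just a rough sketch of the general shape such a wrapper can take, assuming a transformers CLIPTextModel:

    import torch
    from transformers import CLIPTextModel

    class TextEncoderAltLayer(torch.nn.Module):
        """Wraps a CLIP text encoder and returns an earlier hidden state."""

        def __init__(self, text_encoder: CLIPTextModel, layer_offset: int = -2):
            super().__init__()
            self.text_encoder = text_encoder
            self.layer_offset = layer_offset  # -1 = final layer, -2 = penultimate, ...

        def forward(self, input_ids, attention_mask=None):
            outputs = self.text_encoder(
                input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True,
            )
            hidden = outputs.hidden_states[self.layer_offset]
            # hidden_states are collected before the final layer norm, so
            # re-apply it to keep the earlier layer correctly normalized.
            return (self.text_encoder.text_model.final_layer_norm(hidden),)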

@patrickvonplaten
Contributor

Yeah, sorry, it's not super obvious and we should probably have left at least a comment in the code here. But just to confirm:

  • We use the "final" layer when using Stable Diffusion 1.4/1.5.
  • We use the "penultimate" layer when using Stable Diffusion 2.x.

How is this implemented?!

The CLIP model used for SD 1.x has 12 layers; the one for SD 2.x has 24 layers (conventions are usually: "base": 12 layers, "large": 24 layers, ..., GPT-3: 96 layers).

What we do is simply use all layers for Stable Diffusion 1.4/1.5, see here:
https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/text_encoder/config.json#L19 <- all 12 layers

but for SD 2.x we've thrown away the final layer (the 24th) when converting the checkpoint since it's never used:
https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/text_encoder/config.json#L19 <- only 23 layers
Compare to the original checkpoint:
https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/config.json#L54 <- 24 layers

Hope this helps a bit.

@patil-suraj we should leave a comment in the code and/or conversion script going forward I think.
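
A quick way to see that difference, reading both configs (CLIPTextConfig can pull the text sub-config out of the full LAION CLIP config):

    from transformers import CLIPTextConfig

    converted = CLIPTextConfig.from_pretrained(
        "stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder"
    )
    original = CLIPTextConfig.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
    print(converted.num_hidden_layers, original.num_hidden_layers)  # 23 vs. 24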

@keturn
Contributor

keturn commented Jan 5, 2023

That explains a thing or two, thank you.

But people are looking to do this during inference on various fine-tunings of SD 1.x. (How this makes any sense mathematically, I have no idea. I guess that's part of the magic of neural networks and diffusion models.)

It sounds like those models should be distributed with that adjustment made to their text_encoder/config.json? Or, if they're confident about that being the best way to use the model, the pre-trained weights they distribute should have those unused layers truncated entirely?

@hafriedlander
Contributor

People are actually sometimes using the penultimate (or even earlier) layers even when the model doesn't recommend it.

I believe the theory is that the ultimate layer is tied too closely to specific concepts, whereas with earlier layers you get a more conceptual / wider latent space (so "Oak" becomes "Tree"). The model doesn't stress itself out trying to hit a specific target, and you get a less prompt-accurate but more coherent result. I'm not sure how much of this has actually been tested and how much is just assumption based on the success of using it for models designed for it.

@keturn
Contributor

keturn commented Jan 5, 2023

oh, I guess the other consideration here is that we're talking about the CLIPTextModel. So different diffusion "models" may be different diffusion UNets but all refer to the same CLIPTextModel, in which case it would make sense to have a parameter for that instead of having clip-vit-h-12-layers.bin, clip-vit-h-11-layers.bin, clip-vit-h-10-layers.bin, etc.

@xunings
Author

xunings commented Jan 9, 2023

> What we do is simply use all layers for Stable Diffusion 1.4/1.5 […] but for SD 2.x we've thrown away the final layer (the 24th) when converting the checkpoint since it's never used.

Thanks a lot for the explanation. Regarding sd 1.4/1.5, I am just wondering if we can switch to the "penultimate" layer by changing the num_hidden_layers config to 11, or is there anything deeper in the code to be modified?

@patrickvonplaten
Contributor

For 1.4/1.5 it would be enough to just change the config. It'll throw a warning at loading, but should work.
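
A minimal sketch of that config change done at load time (the override can be passed straight to from_pretrained; the warning about the unused 12th layer's weights is expected):

    from diffusers import StableDiffusionPipeline
    from transformers import CLIPTextModel

    # Drop one layer so the (former) penultimate layer becomes the output
    # and still goes through final_layer_norm.
    text_encoder = CLIPTextModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        subfolder="text_encoder",
        num_hidden_layers=11,  # the full model has 12
    )
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", text_encoder=text_encoder
    )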

Nevertheless, one can also easily retrieve earlier hidden states by adding output_hidden_states=True to the text encoder call, which will return a list of hidden_states that can then be used for training.

So for 1.4/1.5 I would maybe recommend doing this for training. E.g. do:

        text_embeddings = self.text_encoder(
            text_input_ids.to(device),
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        text_embeddings = text_embeddings.hidden_states[-2]  # for penultimate layer

Note this does not make too much sense for inference, as the unet needs to be conditioned on this, but we could definitely add this for training when doing dreambooth, lora, text-to-image etc. cc @patil-suraj

@xunings
Author

xunings commented Jan 16, 2023

> For 1.4/1.5 it would be enough to just change the config. It'll throw a warning at loading, but should work. […] Note this does not make too much sense for inference, as the unet needs to be conditioned on this, but we could definitely add this for training when doing dreambooth, lora, text-to-image etc.

The last hidden state is normed in the code, but the other hidden states seem not to be:

https://github.com/huggingface/transformers/blob/05b8e25fffd61feecb21928578ad39e63af21b4f/src/transformers/models/clip/modeling_clip.py#L735

    last_hidden_state = self.final_layer_norm(last_hidden_state)

Can we also support the final_layer_norm for the penultimate layer in the pipeline? At least it is explicitly done in automatic1111:

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/fcbe0f35fbf1355e986d17410feb5c734d55ed47/modules/sd_hijack_clip.py#L306

        if opts.CLIP_stop_at_last_layers > 1:
            z = outputs.hidden_states[-opts.CLIP_stop_at_last_layers]
            z = self.wrapped.transformer.text_model.final_layer_norm(z)
        else:
            z = outputs.last_hidden_state

@hafriedlander
Contributor

@xunings - see the code I linked to earlier. It gets the earlier layers correctly normalized.

@patil-suraj
Contributor

> The last hidden state is normed in the code, but the other hidden states seem not to be.

If we skip the last layer when using the model and set the penultimate layer as the last layer, it will get normalized.

> Note this does not make too much sense for inference, as the unet needs to be conditioned on this, but we could definitely add this for training when doing dreambooth, lora, text-to-image etc. cc @patil-suraj

We could definitely support this for fine-tuning; for example, add an argument called use_penultimate_layer, and if it's True, the script will throw out the last layer and use the penultimate layer (like it's done for SD 2.0 now). This way, it'll also save the model correctly, and during inference it should work out of the box with pipelines (rough sketch below).

Is there any example or experiment with SD1.x using the penultimate layer that we could check?
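
A rough sketch of what that truncation could look like (use_penultimate_layer is only the proposed name; nothing in the scripts implements this yet):

    from transformers import CLIPTextModel

    def drop_final_clip_layer(text_encoder: CLIPTextModel) -> CLIPTextModel:
        # Remove the last transformer layer so the penultimate layer becomes
        # the output (and is normalized by final_layer_norm), mirroring the
        # SD 2.x conversion. save_pretrained then stores the truncated model
        # with a consistent config.
        text_encoder.text_model.encoder.layers = text_encoder.text_model.encoder.layers[:-1]
        text_encoder.config.num_hidden_layers -= 1
        return text_encoder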

@xunings
Author

xunings commented Jan 16, 2023

> We could definitely support this for fine-tuning; for example, add an argument called use_penultimate_layer […] Is there any example or experiment with SD1.x using the penultimate layer that we could check?

@patil-suraj by "If we skip the last layer when using the model" are you referring to modifying the "num_hidden_layers" config?

@hafriedlander thanks for the code. The solution looks very solid. I am just curious: did you try simply modifying the "num_hidden_layers"? If so, what was the problem?

@patil-suraj
Contributor

> @patil-suraj by "If we skip the last layer when using the model" are you referring to modifying the "num_hidden_layers" config?

@xunings yes, that's right.

@xunings
Author

xunings commented Jan 30, 2023

> > @patil-suraj by "If we skip the last layer when using the model" are you referring to modifying the "num_hidden_layers" config?
>
> @xunings yes, that's right.

Thanks for the confirmation, it's clear to me now.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@JemiloII

JemiloII commented Jun 7, 2023

It would be nice to just have a simple config option, e.g. clip_skip: int, for pipeline.image() or pipeline.pretrained() etc.
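
For readers landing here later: newer diffusers releases did add a clip_skip argument to the Stable Diffusion pipelines' __call__, roughly along these lines (clip_skip=1 uses the penultimate layer's normalized hidden states):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    image = pipe("an oak tree", clip_skip=1).images[0]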
