clip skip for Stable diffusion pipelines #1721
Hi,
Is the "clip skip" feature supported by the Stable Diffusion pipelines? I did a bit of searching in the StableDiffusionPipeline documentation but found nothing related.
See: Using Hidden States of CLIP’s Penultimate Layer
Thanks in advance.

Comments
Hey @xunings, yes, we indeed also skip the final layer of CLIP and instead use the "penultimate" layer. See: https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/text_encoder/config.json#L19
Hi @patrickvonplaten,
SD 1.x models also use a CLIPTextModel: https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/text_encoder/config.json#L19, but I have no idea how you look at a config like that and figure out what is being "skipped." None of those values obviously look like "next-to-last" to me, and reading the CLIPTextConfig docs didn't shed any light on that either.
Also replied on #1863, but in short: https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/dev/sdgrpcserver/pipeline/text_embedding/text_encoder_alt_layer.py will wrap a TextEncoder and return some earlier layer to the pipeline.
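For illustration, here is a minimal sketch of that wrapper idea; it is not the linked implementation, and the class name and `layers_back` parameter are made up for this example. The key steps are requesting hidden states, indexing back from the last layer, and re-applying the final LayerNorm:

```python
import torch
from transformers import CLIPTextModel


class CLIPSkipTextEncoder(torch.nn.Module):
    """Hypothetical wrapper: returns an earlier CLIP hidden state instead of the last one."""

    def __init__(self, text_encoder: CLIPTextModel, layers_back: int = 1):
        super().__init__()
        self.text_encoder = text_encoder
        self.layers_back = layers_back  # 1 = penultimate layer

    def forward(self, input_ids, attention_mask=None):
        outputs = self.text_encoder(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # hidden_states[-1] is the last layer's pre-norm output, so index back from it.
        hidden = outputs.hidden_states[-(self.layers_back + 1)]
        # Re-apply the final LayerNorm so the embedding matches the distribution
        # the UNet's cross-attention was trained against.
        return self.text_encoder.text_model.final_layer_norm(hidden)
```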
Yeah, sorry, it's not super obvious and we should probably have left at least a comment in the code here. But just to confirm:
How is this implemented? The CLIP model used for SD 1.x has 12 layers; the one for SD 2.x has 24 layers (conventions are usually: "base": 12 layers, "large": 24 layers, ..., GPT-3: 96 layers). What we do is simply use all layers for Stable Diffusion 1.4/1.5, but for SD 2.x we've thrown away the final layer (the 24th) when converting the checkpoint, since it's never used.
Hope this helps a bit. @patil-suraj we should leave a comment in the code and/or conversion script going forward, I think.
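A quick way to verify this from the configs linked above (a sketch; the printed values are what those config files contain):

```python
from transformers import CLIPTextConfig

# Layer counts as stored in each converted checkpoint's text_encoder/config.json.
cfg_sd1 = CLIPTextConfig.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder"
)
cfg_sd2 = CLIPTextConfig.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder"
)
print(cfg_sd1.num_hidden_layers)  # 12 -- all layers of CLIP ViT-L/14 are kept
print(cfg_sd2.num_hidden_layers)  # 23 -- the final (24th) OpenCLIP ViT-H layer was dropped
```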
That explains a thing or two, thank you. But people are looking to do this during inference on various fine-tunings of SD 1.x. (How this makes any sense mathematically, I have no idea. I guess that's part of the magic of neural networks and diffusion models.) It sounds like those models should then be distributed with that adjustment made to their text encoder config.
People are actually sometimes using the penultimate (or even deeper) layers even when the model doesn't recommend it. I believe the theory is that the ultimate layer is too tied to specific concepts, while earlier layers give you a more conceptual, wider latent space (so "Oak" becomes "Tree"). The model doesn't stress itself out trying to hit a specific target, and you get a less prompt-accurate but more coherent result. I'm not sure how much of this has actually been tested and how much is just assumption based on the success of using it for models designed for it.
Oh, I guess the other consideration here is that we're talking about the CLIPTextModel. Different diffusion "models" may be different diffusion UNets but all refer to the same CLIPTextModel, in which case it would make sense to have a parameter for this instead of shipping clip-vit-h-12-layers.bin, clip-vit-h-11-layers.bin, clip-vit-h-10-layers.bin, etc.
Thanks a lot for the explanation. Regarding SD 1.4/1.5, I am just wondering if we can switch to the "penultimate" layer by changing the num_hidden_layers config to 11, or is there anything deeper in the code that needs to be modified?
For 1.4/1.5 it would be enough to just change the config. It'll throw a warning at loading, but should work. Nevertheless, one can also easily retrieve previous hidden layers by adding output_hidden_states=True. So for 1.4/1.5 I would maybe recommend doing this for training, e.g.:

```python
text_embeddings = self.text_encoder(
    text_input_ids.to(device),
    attention_mask=attention_mask,
    output_hidden_states=True,
)
text_embeddings = text_embeddings.hidden_states[-2]  # penultimate layer
```

Note this does not make too much sense for inference, as the UNet needs to be conditioned on this, but we could definitely add it for training when doing DreamBooth, LoRA, text-to-image, etc. cc @patil-suraj
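For reference, a minimal sketch of the config-change route mentioned above (assuming the standard CompVis/stable-diffusion-v1-4 repo layout; from_pretrained forwards overrides like num_hidden_layers to the model config):

```python
from transformers import CLIPTextModel

# Load the SD 1.x text encoder with one layer fewer, so its "last" hidden
# state is what used to be the penultimate layer. Loading warns about the
# unused 12th-layer weights, which is expected.
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    subfolder="text_encoder",
    num_hidden_layers=11,  # the SD 1.x default is 12
)
```

A convenient side effect of this route: the checkpoint's final_layer_norm is then applied to the eleventh layer's output automatically, which anticipates the normalization question raised below.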
The last hidden state is normed in the code, but the other hidden states seem not to be:

```python
last_hidden_state = self.final_layer_norm(last_hidden_state)
```

Can we also support the final_layer_norm for the penultimate layer in the pipeline? At least it is explicitly done in automatic1111:

```python
if opts.CLIP_stop_at_last_layers > 1:
    z = outputs.hidden_states[-opts.CLIP_stop_at_last_layers]
    z = self.wrapped.transformer.text_model.final_layer_norm(z)
else:
    z = outputs.last_hidden_state
```
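For completeness, the same normalization expressed against a transformers CLIPTextModel; a sketch assuming the SD 1.x text encoder, equivalent to CLIP_stop_at_last_layers == 2 above:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

input_ids = tokenizer("an oak tree", padding="max_length", return_tensors="pt").input_ids
outputs = text_encoder(input_ids, output_hidden_states=True)
z = outputs.hidden_states[-2]                     # penultimate layer, pre-norm
z = text_encoder.text_model.final_layer_norm(z)   # re-apply the checkpoint's final LayerNorm
```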
@xunings - see the code I linked to earlier. It gets the earlier layers correctly normalized.
If we skip the last layer when using the model and set the penultimate layer as the last layer, it will get normalized.
We could definitely support this for fine-tuning, for example by adding an argument for it. Is there any example or experiment with SD 1.x using the penultimate layer that we could check?
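As an illustration of what such a fine-tuning argument could look like; the flag name --clip_skip and its indexing convention are hypothetical, loosely following the automatic1111 semantics above (clip_skip=1 selects the penultimate layer):

```python
import argparse

from transformers import CLIPTextModel, CLIPTokenizer

parser = argparse.ArgumentParser()
# Hypothetical flag: 0 keeps the usual last_hidden_state, 1 uses the penultimate layer, etc.
parser.add_argument("--clip_skip", type=int, default=0)
args = parser.parse_args()

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# Text-encoding step of a training script:
text_input_ids = tokenizer("a photo of an oak tree", padding="max_length", return_tensors="pt").input_ids
outputs = text_encoder(text_input_ids, output_hidden_states=True)
if args.clip_skip > 0:
    text_embeddings = outputs.hidden_states[-(args.clip_skip + 1)]
    text_embeddings = text_encoder.text_model.final_layer_norm(text_embeddings)
else:
    text_embeddings = outputs.last_hidden_state
```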
@patil-suraj, by "if we skip the last layer when using the model", are you referring to modifying the num_hidden_layers config? @hafriedlander, thanks for the code; the solution looks very solid. I am just curious: did you try simply modifying num_hidden_layers? If yes, what was the problem?
@xunings yes, that's right.
Thanks for the confirmation, it's clear to me now.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Would be nice to just have a simple config option for this.