Add clip skip for diffusion pipeline #3212

@NormXU

Description

Introduction

Clip skip is a trick that feeds features from an earlier layer of the CLIPTextModel into the cross-attention, instead of the features from the final layer. If clip_skip = 2, we use the features from the second-to-last layer of the CLIP text encoder to guide image generation. The current diffusion pipeline can be regarded as clip_skip = 1, i.e. it uses the features from the last layer of the CLIP text encoder.

Here is a brief introduction to clip skip in the webui wiki and a related discussion link.
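To make the idea concrete, here is a minimal sketch (not part of the proposed change) using a plain CLIPTextModel from transformers; the model id is the text encoder used by Stable Diffusion v1 and the prompt is just a placeholder. With output_hidden_states=True, clip_skip simply selects which hidden state is handed to the UNet:

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

clip_skip = 2  # 1 = last layer (current pipeline behavior), 2 = second-to-last layer, ...

tokens = tokenizer("a photo of an astronaut", padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(tokens.input_ids, output_hidden_states=True)

# hidden_states[0] is the embedding output, hidden_states[-1] is the last encoder layer
prompt_embeds = outputs.hidden_states[-clip_skip]
# the text model's final layer norm still has to be applied, as the override below does
prompt_embeds = text_encoder.text_model.final_layer_norm(prompt_embeds)

With clip_skip = 1 this reduces to outputs.last_hidden_state, which is what the pipeline uses today.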

A large majority of community models need clip_skip=2 to reach more aesthetically pleasing generations. I think adding this feature would give people more options for optimizing their generations.

Implementation

Adding clip_skip to diffusers is both simple and difficult. The main idea of clip_skip is simple; however, since our text encoder is imported from transformers, it is not easy to hack the CLIPTextModel from within diffusers.

To do so, we need to override CLIPTextModel and CLIPTextTransformer. Here is my implementation:

from typing import Optional, Tuple, Union

import torch

from transformers import CLIPTextConfig, CLIPTextModel
from transformers.modeling_outputs import BaseModelOutputWithPooling
from transformers.models.clip.modeling_clip import CLIPTextTransformer, _expand_mask


class MyCLIPTextTransformer(CLIPTextTransformer):

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        clip_skip: Optional[int] = 1,  # <-- newly added: take the clip_skip-th hidden state from the end of the encoder
    ) -> Union[Tuple, BaseModelOutputWithPooling]:

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is None:
            raise ValueError("You have to specify input_ids")

        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])

        hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)

        bsz, seq_len = input_shape
        # CLIP's text model uses causal mask, prepare it here.
        # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
        causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len, hidden_states.dtype).to(
            hidden_states.device
        )
        # expand attention_mask
        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            attention_mask = _expand_mask(attention_mask, hidden_states.dtype)

        # always request a dict from the encoder so that encoder_outputs.hidden_states
        # is available even when the caller asked for a tuple output
        encoder_outputs = self.encoder(
            inputs_embeds=hidden_states,
            attention_mask=attention_mask,
            causal_attention_mask=causal_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=True,
            return_dict=True,
        )

        last_hidden_state = encoder_outputs.hidden_states[-clip_skip]  # <-- newly added: take the clip_skip-th hidden state from the end
        last_hidden_state = self.final_layer_norm(last_hidden_state)

        # text_embeds.shape = [batch_size, sequence_length, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
        pooled_output = last_hidden_state[
            torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
            input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
        ]

        if not return_dict:
            return (last_hidden_state, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=last_hidden_state,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )

    def _build_causal_attention_mask(self, bsz, seq_len, dtype):
        # lazily create causal attention mask, with full attention between the vision tokens
        # pytorch uses additive attention mask; fill with -inf
        mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
        mask.fill_(torch.tensor(torch.finfo(dtype).min))
        mask.triu_(1)  # zero out the lower diagonal
        mask = mask.unsqueeze(1)  # expand mask
        return mask


class MyCLIPTextModel(CLIPTextModel):
    config_class = CLIPTextConfig

    _no_split_modules = ["CLIPEncoderLayer"]

    def __init__(self, config: CLIPTextConfig):
        super().__init__(config)
        self.text_model = MyCLIPTextTransformer(config)  # <-- newly added: use the overridden CLIPTextTransformer
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        clip_skip: Optional[int] = 1,  # <-- newly added: take the clip_skip-th hidden state from the end of the encoder
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            clip_skip=clip_skip,  # <-- newly added: forward clip_skip to the overridden text transformer
        )
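For context, here is a hedged sketch of how the overridden encoder could be loaded and plugged into a pipeline; MyCLIPTextModel is the class above, while the model id and the text_encoder override in from_pretrained follow the usual diffusers component-swapping pattern and are only illustrative:

import torch
from diffusers import StableDiffusionPipeline

# load the original text encoder weights into the overridden class defined above
text_encoder = MyCLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder", torch_dtype=torch.float16
)

# swap it into the pipeline; every other component stays unchanged
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", text_encoder=text_encoder, torch_dtype=torch.float16
).to("cuda")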

We can then use the overridden CLIP text encoder in the _encode_prompt function of any diffusers pipeline, for example in pipeline_stable_diffusion.py:

    def _encode_prompt(
        self,
        prompt,
        device,
        num_images_per_prompt,
        do_classifier_free_guidance,
        negative_prompt=None,
        prompt_embeds: Optional[torch.FloatTensor] = None,
        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
        clip_skip: Optional[int] = 1,
    ):
        # ......  # unchanged code omitted

        prompt_embeds = self.text_encoder(
            text_input_ids.to(device),
            clip_skip=clip_skip,  
            attention_mask=attention_mask,
        )
        prompt_embeds = prompt_embeds[0]

        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)

        # ......  # unchanged code omitted
        # this trick is usually applied to the prompt embedding only,
        # not to the negative prompt embedding
        negative_prompt_embeds = self.text_encoder(
            uncond_input.input_ids.to(device),
            attention_mask=attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]

        # ......  # unchanged code omitted

        return prompt_embeds
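For completeness, the pipeline's __call__ would also need to accept and forward the new argument. The snippet below is a hypothetical sketch of that plumbing (clip_skip is the proposed parameter, not an existing diffusers argument):

        # hypothetical: inside StableDiffusionPipeline.__call__, accept a clip_skip
        # parameter and forward it when encoding the prompt
        prompt_embeds = self._encode_prompt(
            prompt,
            device,
            num_images_per_prompt,
            do_classifier_free_guidance,
            negative_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            clip_skip=clip_skip,  # <-- newly added
        )

The user-facing call would then look like pipe(prompt, clip_skip=2).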

Implementing CLIPTextTransformer and CLIPTextModel to support clip_skip cleanly and elegantly is difficult for me. I'd like to leave this issue to the diffusers team.
