[SD3] Fix mis-matched shape when num_images_per_prompt > 1 using without T5 (text_encoder_3=None) #8558

Dalanke · 2024-06-14T12:57:29Z

What does this PR do?

This PR fixes a bug that occurs when num_images_per_prompt is set to any value greater than 1 in a call to StableDiffusion3Pipeline. An exception is raised when the pipeline is created without the T5 text encoder (text_encoder_3=None).

Reproduction (follow the documentation here):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    './stable-diffusion-3-medium/sd3_medium_incl_clips.safetensors',
    torch_dtype=torch.float16,
    text_encoder_3=None,  # load the pipeline without the T5 text encoder
)
pipe = pipe.to("cuda")

image = pipe(
    "a picture of a cat holding a sign that says hello world",
    negative_prompt="",
    # any value greater than 1 raises the exception
    num_images_per_prompt=4,
    num_inference_steps=28,
    guidance_scale=7.0,
).images

for i, img in enumerate(image):
    # pass the path directly: a file opened in text mode ('w+') cannot receive JPEG bytes
    # (the ./output directory must already exist)
    img.save(f'./output/test_{i}.jpg')

Bug output

Loading pipeline components...:  62%|███████████████████████████████████████▍                       | 5/8 [00:00<00:00, 16.23it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.47it/s]
Traceback (most recent call last):
  File "/home/xxx/workspace/sd3/sd3_inference.py", line 16, in <module>
    image = pipe(
  File "/home/xxx/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/workspace/diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py", line 778, in __call__
    ) = self.encode_prompt(
  File "/home/xxx/workspace/diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py", line 413, in encode_prompt
    prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 1 for tensor number 1 in the list.

Mitigation:
The bug is caused by a shape mismatch. For example, with num_images_per_prompt=4, line 413:

prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)

concatenates tensors whose batch dimensions differ: torch.Size([4, 77, 4096]) for clip_prompt_embeds and torch.Size([1, 77, 4096]) for t5_prompt_embed.
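To make the shape rule concrete, here is a minimal pure-Python sketch of the check that torch.cat applies: every dimension except the concatenation dimension must match. The helper name cat_shape is hypothetical and is not part of torch or diffusers; it only models the shape arithmetic, using the two shapes reported above.

```python
def cat_shape(shapes, dim):
    """Return the shape produced by concatenating tensors of `shapes` along `dim`.

    Hypothetical helper that models the shape rule only: all dimensions
    other than `dim` must be equal across the inputs.
    """
    first = shapes[0]
    axis = dim % len(first)  # normalise a negative dim such as -2
    for idx, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(first):
            raise ValueError("all tensors must have the same number of dimensions")
        for d, (expected, got) in enumerate(zip(first, shape)):
            if d != axis and expected != got:
                raise ValueError(
                    f"dimension {d} differs: expected {expected}, got {got} "
                    f"for tensor number {idx}"
                )
    out = list(first)
    out[axis] = sum(shape[axis] for shape in shapes)
    return tuple(out)


clip_shape = (4, 77, 4096)  # CLIP embeddings, already repeated for 4 images
t5_shape = (1, 77, 4096)    # zero placeholder, batch dimension not repeated

try:
    cat_shape([clip_shape, t5_shape], dim=-2)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)  # dimension 0 differs: expected 4, got 1 for tensor number 1
```

The mismatch is in dimension 0 (the batch dimension), which is why the concatenation along dim=-2 fails even though the sequence length and embedding width agree.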

The fix is in the value that _get_t5_prompt_embeds returns when text_encoder_3=None:

if self.text_encoder_3 is None:
    return torch.zeros(
        # before: (batch_size, self.tokenizer_max_length, self.transformer.config.joint_attention_dim)
        # after: the batch dimension is multiplied by num_images_per_prompt
        (
            batch_size * num_images_per_prompt,
            self.tokenizer_max_length,
            self.transformer.config.joint_attention_dim,
        ),
        device=device,
        dtype=dtype,
    )
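The effect of the change can be checked outside the pipeline with a small standalone sketch. The function t5_placeholder below is a hypothetical stand-in for the text_encoder_3=None branch, not the pipeline method itself, and the embedding width is shrunk from 4096 to 8 as an assumption to keep the tensors small.

```python
import torch


def t5_placeholder(batch_size, num_images_per_prompt, max_length, joint_attention_dim,
                   device=None, dtype=None):
    """Hypothetical stand-in for the zero tensor returned when no T5 encoder is loaded."""
    return torch.zeros(
        (batch_size * num_images_per_prompt, max_length, joint_attention_dim),
        device=device,
        dtype=dtype,
    )


batch_size = 1
num_images_per_prompt = 4
max_length = 77
joint_attention_dim = 8  # assumption: reduced from 4096 for illustration

# Stand-in for the CLIP embeddings, already repeated once per requested image.
clip_prompt_embeds = torch.randn(batch_size * num_images_per_prompt, max_length, joint_attention_dim)
t5_prompt_embed = t5_placeholder(batch_size, num_images_per_prompt, max_length, joint_attention_dim)

# Same concatenation as in encode_prompt; the batch dimensions now agree.
prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)
print(prompt_embeds.shape)  # torch.Size([4, 154, 8])
```

With the placeholder built from batch_size alone, the same torch.cat call raises the RuntimeError shown in the bug output.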

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


HuggingFaceDocBuilderDev · 2024-06-14T21:11:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Dalanke · 2024-06-18T02:31:14Z

It seems some checks were not successful, but I could not figure out the reason. A one-line code change should not cause a code quality issue. Could anyone kindly look into it?

yiyixuxu · 2024-06-18T22:42:57Z

thanks!
we have a make style and make fix-copies command you can run to pass the quality test https://huggingface.co/docs/diffusers/en/conceptual/contribution#how-to-open-a-pr

…out T5 (text_encoder_3=None) (#8558) * fix shape mismatch when num_images_per_prompt > 1 and text_encoder_3=None * style * fix copies --------- Co-authored-by: YiYi Xu <yixu310@gmail.com> Co-authored-by: yiyixuxu <yixu310@gmail,com>

fix shape mismatch when num_images_per_prompt > 1 and text_encoder_3=None

6f4ada9

yiyixuxu and others added 3 commits June 18, 2024 11:06

Merge branch 'main' into fix/num_images_per_prompt

dfa5ed4

style

3c1f545

fix copies

7238b71

yiyixuxu merged commit 2921a20 into huggingface:main Jun 18, 2024

yiyixuxu mentioned this pull request Jun 21, 2024

SD3 - num_images_per_prompt no longer honoured (throws error) #8649

Closed
