Skip to content

StableDiffusion3Pipeline with InstantX/SD3.5-Large-IP-Adapter #11627

@molar-volume

Description

@molar-volume

Describe the bug

Example comes from the Stable Diffusion 3 documentation:
https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_3#image-prompting-with-ip-adapters

Using InstantX/SD3.5-Large-IP-Adapter with StableDiffusion3Pipeline fails because the code expects an ip-adapter.safetensors file, which is not available.

Specifying the weight_name="ip-adapter.bin" resolves the issue but leads to the following runtime error:

RuntimeError: The size of tensor a (333) must match the size of tensor b (4096) at non-singleton dimension 1

Reproduction

import torch
from PIL import Image

from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor

image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"

feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
).to( "cuda")

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to("cuda")

pipe.load_ip_adapter(ip_adapter_id, weight_name="ip-adapter.bin")

pipe.set_ip_adapter_scale(0.6)

ref_img = Image.open("image.jpg").convert('RGB')

image = pipe(
    width=1024,
    height=1024,
    prompt="a cat",
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    ip_adapter_image=ref_img
).images[0]

image.save("result.jpg")

Logs

File "/home/appuser/.local/lib/python3.11/site-packages/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py", line 1060, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/appuser/.local/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 396, in forward
    encoder_hidden_states, hidden_states = block(
                                           ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/appuser/.local/lib/python3.11/site-packages/diffusers/models/attention.py", line 244, in forward
    encoder_hidden_states = encoder_hidden_states + context_attn_output
                            ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (333) must match the size of tensor b (4096) at non-singleton dimension 1

System Info

base image:
pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime

diffusers == 0.33.1
transformers == 4.51.3

Who can help?

@yiyixuxu
@sayakpaul
@stevhliu

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions