
Adding out-of-the-box support for multilingual models like BAAI/AltDiffusion to diffusers #2135

Open · jslegers opened this issue Jan 27, 2023 · 13 comments
Labels: good first issue (Good for newcomers)

@jslegers commented Jan 27, 2023

Description of the problem

I'm unable to load multilingual models like BAAI/AltDiffusion with diffusers.

The solution I'd like

I would like diffusers to have out-of-the-box support for models like BAAI/AltDiffusion, which use AltCLIP or another multilingual CLIP model.

Alternatives I've considered

I tried loading BAAI/AltDiffusion, but it told me I was missing a library.

After installing that library, it produced a different error.

I kinda gave up after that.

Additional context

Since AltCLIP has a max_position_embeddings value of 514 for its text encoder instead of 77, I had hoped I could just replace the text encoder and tokenizer of my models with those of BAAI/AltDiffusion to overcome the 77 token limit.
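For reference, this is roughly the kind of swap I had in mind - purely an untested sketch, where "some/multilingual-clip" is just a placeholder id, the replacement encoder's output dimension would still have to match the UNet's cross_attention_dim, and whether this actually lifts the 77 token limit is a separate question:

# Untested sketch of the swap described above; "some/multilingual-clip" is a placeholder id.
from transformers import AutoModel, AutoTokenizer
from diffusers import StableDiffusionPipeline

tokenizer = AutoTokenizer.from_pretrained("some/multilingual-clip")
text_encoder = AutoModel.from_pretrained("some/multilingual-clip")

# from_pretrained accepts component overrides, so the tokenizer and text encoder
# passed here replace the ones stored with the checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD 1.x checkpoint
    tokenizer=tokenizer,
    text_encoder=text_encoder,
)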

@apolinario (Contributor)

AltDiffusion has its own pipeline, as you can see on the model page:

from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "黑暗精灵公主,非常详细,幻想,非常详细,数字绘画,概念艺术,敏锐的焦点,插图"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"

image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")

If that doesn't work, please feel free to describe in a bit more detail what error was produced.

@jslegers (Author)

AltDiffusion has its own pipeline, as you can see on the model page

Would it be an option for you guys to add support for these models to the StableDiffusionPipeline?

Almost all SD diffusers code I've seen uses the StableDiffusionPipeline. If the models I release aren't compatible with the StableDiffusionPipeline, most people aren't going to figure out how to use them, which is going to seriously harm the potential adoption rate of my models.

It's also more user-friendly IMO to have one unified StableDiffusionPipeline API for all SD models rather than having to use a dedicated AltDiffusionPipeline for AltDiffusion, another one for the next variation, etc...

@patrickvonplaten (Contributor)

Hey @jslegers,

Pipelines are not supposed to be "all-in-one" toolboxes, but rather examples of how to run a certain model.
Note that AltDiffusion is not the same as Stable Diffusion, since, among other things, it uses a different text encoder.

For more detailed information, you can have a look at the docs here:
https://huggingface.co/docs/diffusers/main/en/conceptual/philosophy#pipelines

@jslegers (Author) commented Jan 31, 2023

Pipelines are not supposed to be "all-in-one" toolboxes, but rather examples of how to run a certain model.

Pipelines are just examples?

* confused *

Note that AltDiffusion is not the same as Stable Diffusion, since, among other things, it uses a different text encoder.

If the only difference between Stable Diffusion & AltDiffusion is a different text encoder & tokenizer, they're pretty much the same thing from an end-user perspective, as changing any Stable Diffusion model into an AltDiffusion model (or vice versa) is trivial: it can be achieved with a few tweaks to JSON files & drag-and-dropping some files...

Or am I missing something?

@apolinario (Contributor) commented Feb 1, 2023

I think you can use the generic DiffusionPipeline, which should auto-detect the pipeline according to the model, so with DiffusionPipeline.from_pretrained(model_id) you could load both Stable Diffusion and AltDiffusion models.

from diffusers import DiffusionPipeline
# model_id can be either a Stable Diffusion or an AltDiffusion model
pipe = DiffusionPipeline.from_pretrained(model_id)

However, I gotta say I do understand and resonate with your desire for a super-powerful all-in-one pipeline that loads multiple models and contains all the latest features (including getting rid of the 77 token limit, and maybe more). That said, I think maintaining such a pipeline - and making sure all features are inter-compatible with one another - may be beyond the scope of what the maintainers aim for with the library, as you can see in the pipeline philosophy docs @patrickvonplaten shared above.

However, if you (or anyone in the community!) would be interested in maintaining an all-in-one interoperable pipeline with as many features as possible, feel free to add it as a community pipeline - community pipelines are maintained by the community and loadable from the library. The Stable Diffusion Mega community pipeline is an example of how merging multiple pipelines could work out if expanded to even more features.
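For example, the Stable Diffusion Mega community pipeline can be loaded through the custom_pipeline argument of DiffusionPipeline.from_pretrained - a minimal example, using a standard Stable Diffusion checkpoint:

from diffusers import DiffusionPipeline

# Load the Stable Diffusion Mega community pipeline on top of a regular SD checkpoint;
# custom_pipeline points at a pipeline maintained in the diffusers community folder.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="stable_diffusion_mega",
)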

@jslegers (Author) commented Feb 1, 2023

@apolinario :

I think you can use the generic DiffusionPipeline, which should auto-detect the pipeline according to the model, so with DiffusionPipeline.from_pretrained(model_id) you could load both Stable Diffusion and AltDiffusion models.

That's definitely acceptable for personal use... as in code I write myself for, e.g., demos or prototypes.

But it's not a suitable approach for when I create a custom diffusion model (customized through Dreambooth, LoRA and/or merging checkpoints) that I want others to use. When you release something to the broader public, you want it to be as simple / dummy-proof as possible. For those with little to no experience with AI, learning how to use Stable Diffusion can already be a steep learning curve. Having to learn to code in Python, use Google Colab & become familiar with the diffusers API only adds to the steepness of that curve.

With so much demo code & so many tutorials out there using StableDiffusionPipeline rather than DiffusionPipeline, many - if not most - people are going to use the StableDiffusionPipeline by default. It's going to leave a lot of people puzzled why their StableDiffusionPipeline throws errors when they try to load my models, and it's very likely the vast majority are going to throw their hands up in the air and just give up rather than do what it takes to figure out they should be using a DiffusionPipeline instead.

I think maintaining such a pipeline - and making sure all features are inter-compatible with one another - may be beyond the scope of what the maintainers aim for with the library, as you can see in the pipeline philosophy docs @patrickvonplaten shared above.

Fair point.

Having worked in R&D for a former employer on a high-end JavaScript GIS library with a similar philosophy, I very much appreciate and respect that you want to avoid over-engineering this library and turning it into yet another convoluted monolith that tries to do everything and does it poorly. I appreciate that you want to keep things simple and empower the user of the library rather than give too much power to the library itself.

A good compromise, in my very humble opinion, would be to optimize error reporting and inform the user when they're using the wrong pipeline and which pipeline they should be using instead. So, for example, if they're using a StableDiffusionPipeline when trying to load an AltDiffusion model, it could throw an error with a message like: Error: You are trying to load an AltDiffusion model with StableDiffusionPipeline. Please use AltDiffusionPipeline or the generic DiffusionPipeline instead.

This way, users at least know what they're doing wrong and how to fix it without having to waste lots of precious time wading through GitHub issues or posts on the Hugging Face community forum. Basically, what I'm saying is that errors caused by loading the wrong kind of diffusion model into a particular pipeline should be dummy-proof enough that the error message is all the documentation users need to figure out how to fix their mistake.
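Purely to illustrate the kind of check I mean - a hypothetical helper, not a proposal for the actual implementation, and I'm assuming the _class_name field in model_index.json records the intended pipeline:

import json

# Hypothetical helper, only to illustrate the check described above. A diffusers
# checkpoint stores the intended pipeline class in model_index.json under "_class_name".
def check_pipeline_class(pipeline_cls, model_index_path):
    with open(model_index_path) as f:
        expected = json.load(f).get("_class_name")
    if expected is not None and expected != pipeline_cls.__name__:
        raise ValueError(
            f"You are trying to load a {expected} checkpoint with {pipeline_cls.__name__}. "
            f"Please use {expected} or the generic DiffusionPipeline instead."
        )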

However, if you (or anyone in the community!) would be interested in maintaining an all-in-one interoperable pipeline with as many features as possible, feel free to add it as a community pipeline

I'll consider it if and when I have the time for it. Right now, creating my own custom Stable Diffusion models, testing them, creating demo apps for them, marketing them, experimenting with ChatGPT and its API, and trying to come up with a business model that allows me to monetize my experience with all this shit is already more than a full-time job. I'm already biting off way more than I can chew and struggling to prioritize my ongoing activities, all while still learning to find my way in what was a completely unknown ecosystem to me just a few months ago.

Also, I'm not convinced that adding an additional community pipeline alongside the existing pipelines - one that supports all of the models supported by every other pipeline - is the way to go. IMO this would result in nothing but unnecessary redundancy & bloat.

Based on my understanding of the current diffusers API, my approach would be two-fold:

  1. Specific pipelines for specific diffusion models: StableDiffusionPipeline for Stable Diffusion models, AltDiffusionPipeline for AltDiffusion models, etc.
  2. A singular generic DiffusionPipeline that detects which type of diffusion model is being loaded, then dynamically loads the corresponding pipeline and delegates the implementation of its methods to that pipeline's methods.

Additionally, ...

  1. Each pipeline should detect whether the right type of model is loaded and throw an error message in a consistent format informing the user of what they're doing wrong and how to fix it
  2. The name of each diffusion model & each diffusion pipeline should match, to make it self-evident which pipeline to use in which case
  3. Each diffusion model should implement at least each of the methods of the DiffusionPipeline as a shared interface
  4. N00b users should be encouraged to use the generic DiffusionPipeline, because this makes it easier to switch models without having to worry about the differences between different models. More expert users should be encouraged to use model-specific pipelines, because it means one less layer of abstraction and it grants them access to methods specific to that model, which wouldn't be available in a more generic DiffusionPipeline.

But hey...

[image]

@patrickvonplaten (Contributor) commented Feb 3, 2023

Very much agree with all four of your points here:

  1. Each pipeline should detect whether the right type of model is loaded and throw an error message in a consistent format informing the user of what they're doing wrong and how to fix it

Very good point! We could/should think about a nice system here. Note that when loading with DiffusionPipeline we'll always use the correct pipeline class as defined here: https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/39593d5650112b4cc580433f6b0435385882d819/model_index.json#L2

We could nevertheless make sure better error messages are thrown when something like this is run:

from diffusers import StableDiffusionInpaintPipeline

pipeline = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

which probably doesn't give a good error message at the moment.

  2. The name of each diffusion model & each diffusion pipeline should match, to make it self-evident which pipeline to use in which case

Models and pipelines are on a bit different levels, as models are sub-components of pipelines. See here for more explanation:
https://huggingface.co/docs/diffusers/main/en/conceptual/philosophy

  3. Each diffusion model should implement at least each of the methods of the DiffusionPipeline as a shared interface

Yes, they subclass DiffusionPipeline, so they have to.

  4. N00b users should be encouraged to use the generic DiffusionPipeline, because this makes it easier to switch models without having to worry about the differences between different models. More expert users should be encouraged to use model-specific pipelines, because it means one less layer of abstraction and it grants them access to methods specific to that model, which wouldn't be available in a more generic DiffusionPipeline.

100% agree - we should probably change all docs to just always use DiffusionPipeline.

@patrickvonplaten (Contributor)

I think 1. and 4. are quite actionable, in case anybody wants to pick them up and open PRs.

@patrickvonplaten added the good first issue label on Feb 3, 2023
@dg845 (Contributor) commented Mar 22, 2023

Hi, new contributor here. I've investigated 1. a bit and found that if we have something like

from diffusers import StableDiffusionInpaintPipeline

pipeline = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

this doesn't actually throw an error: it succeeds and pipeline is of type StableDiffusionInpaintPipeline (as expected).

However, if we try to generate an image from this pipeline via __call__ following the StableDiffusionInpaintPipeline example:

import PIL
import requests
import torch
from io import BytesIO
from diffusers import StableDiffusionInpaintPipeline

def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

we get an error on the last line:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│ /home/tamamo/miniconda3/envs/diffusers-dev/lib/python3.10/site-packages/torch/autograd/grad_mode │
│ .py:27 in decorate_context                                                                       │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/tamamo/code/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_i │
│ npaint.py:834 in __call__                                                                        │
│                                                                                                  │
│   831 │   │   num_channels_mask = mask.shape[1]                                                  │
│   832 │   │   num_channels_masked_image = masked_image_latents.shape[1]                          │
│   833 │   │   if num_channels_latents + num_channels_mask + num_channels_masked_image != self.   │
│ ❱ 834 │   │   │   raise ValueError(                                                              │
│   835 │   │   │   │   f"Incorrect configuration settings! The config of `pipeline.unet`: {self   │
│   836 │   │   │   │   f" {self.unet.config.in_channels} but received `num_channels_latents`: {   │
│   837 │   │   │   │   f" `num_channels_mask`: {num_channels_mask} + `num_channels_masked_image   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Incorrect configuration settings! The config of `pipeline.unet`: FrozenDict([('sample_size', 64),
('in_channels', 4), ('out_channels', 4), ('center_input_sample', False), ('flip_sin_to_cos', True), ('freq_shift', 0),
('down_block_types', ['CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']),
('mid_block_type', 'UNetMidBlock2DCrossAttn'), ('up_block_types', ['UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D',
'CrossAttnUpBlock2D']), ('only_cross_attention', False), ('block_out_channels', [320, 640, 1280, 1280]),
('layers_per_block', 2), ('downsample_padding', 1), ('mid_block_scale_factor', 1), ('act_fn', 'silu'), ('norm_num_groups',
32), ('norm_eps', 1e-05), ('cross_attention_dim', 768), ('attention_head_dim', 8), ('dual_cross_attention', False),
('use_linear_projection', False), ('class_embed_type', None), ('num_class_embeds', None), ('upcast_attention', False),
('resnet_time_scale_shift', 'default'), ('time_embedding_type', 'positional'), ('timestep_post_act', None),
('time_cond_proj_dim', None), ('conv_in_kernel', 3), ('conv_out_kernel', 3), ('projection_class_embeddings_input_dim',
None), ('_class_name', 'UNet2DConditionModel'), ('_diffusers_version', '0.6.0'), ('_name_or_path',
'/home/tamamo/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/39593d5650112b4cc580433f6b0435385882d
819/unet')]) expects 4 but received `num_channels_latents`: 4 + `num_channels_mask`: 1 + `num_channels_masked_image`: 4 = 9.
Please verify the config of `pipeline.unet` or your `mask_image` or `image` input.

which makes sense, because the inpainting pipeline expects the first layer of the UNet to have channels for the mask and the masked image as well as the latent base image (4 latent + 1 mask + 4 masked-image = 9 input channels), whereas the base Stable Diffusion checkpoint runwayml/stable-diffusion-v1-5 only has the 4 channels for the latents.

Perhaps I'm not familiar enough with the code, but it's unclear to me how we could generically catch errors of this sort from inside from_pretrained(...). Perhaps there is a good way to expose these kinds of shape requirements so that they can be checked from within from_pretrained(...)? (I could also imagine that there are other sorts of "semantic" errors besides the model being the wrong shape that could result from a model-pipeline mismatch.)

As for the error message itself, perhaps adding something like "please verify that the model and pipeline are consistent" might be helpful?
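Putting those two thoughts together, here is the rough kind of early check I have in mind - a purely hypothetical sketch using the pipe from the snippet above; the expected channel count (4 latent + 1 mask + 4 masked-image = 9) is specific to the inpainting pipeline and would somehow need to be exposed per pipeline class:

# Hypothetical sketch: a check the inpainting pipeline could run right after loading
# instead of failing deep inside __call__. The value 9 (4 latent + 1 mask + 4
# masked-image channels) is specific to StableDiffusionInpaintPipeline.
EXPECTED_UNET_IN_CHANNELS = 9

if pipe.unet.config.in_channels != EXPECTED_UNET_IN_CHANNELS:
    raise ValueError(
        f"`pipeline.unet` has {pipe.unet.config.in_channels} input channels, but "
        f"StableDiffusionInpaintPipeline expects {EXPECTED_UNET_IN_CHANNELS}. "
        "Please verify that the model and pipeline are consistent."
    )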

(I'm also working on 4. but haven't made enough progress to submit a PR.)

@patrickvonplaten (Contributor)

@dg845 could you open a new issue? Happy to answer there :-)

@dg845 (Contributor) commented Mar 23, 2023

I have opened a new issue at #2799.

@mcgeestocks

Since the merging of #2799 and #2809, is there anything else left to do here, or should this be closed, @patrickvonplaten?

@patrickvonplaten (Contributor)

Good call, I think we can close this one, no @jslegers?
