Come on, come on, let's adapt the conversion script to SD 2.0 #1388
anton-l commented: 🤗 Diffusers with Stable Diffusion 2 is live! (#1388 (comment)) The pinned announcement covers installation, release information, contributors, and news.
✏️ Notes & Information: related huggingface/diffusers pull requests, quick links, and user-submitted resources.
💭 User Story (prior to the Hugging Face Diffusers 0.9.0 release): Stability-AI has released the Stable Diffusion 2.0 models/workflow. When you run convert_original_stable_diffusion_to_diffusers.py on an SD 2.0 checkpoint, it fails:
Output: Traceback (most recent call last):
File "convert_original_stable_diffusion_to_diffusers.py", line 720, in <module>
unet.load_state_dict(converted_unet_checkpoint)
File "lib\site-packages\torch\nn\modules\module.py", line 1667, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
size mismatch for down_blocks.0.attentions.0.proj_in.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for down_blocks.0.attentions.0.proj_out.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
…and similar size mismatches for down_blocks.1.attentions, down_blocks.2.attentions, etc.
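(For context on those mismatches: SD 2.0 changed the UNet in two ways that are visible in the shapes above, and the diffusers UNet config has options for both. A minimal sketch, assuming diffusers >= 0.9.0; the values below are illustrative, not the official conversion fix:)

from diffusers import UNet2DConditionModel

# cross_attention_dim: SD 2.0 conditions on a 1024-dim OpenCLIP text encoder,
#   while SD 1.x used the 768-dim CLIP ViT-L/14 (hence 1024 vs 768 above)
# use_linear_projection: SD 2.0's proj_in/proj_out are linear layers rather than
#   1x1 convs (hence torch.Size([320, 320]) vs torch.Size([320, 320, 1, 1]) above)
unet = UNet2DConditionModel(
    cross_attention_dim=1024,
    use_linear_projection=True,
)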
I'm trying to, but likely I won't be able to do it, lol.
After looking at it, I'm not sure it has anything to do with the script; it seems like the UNet in diffusers needs the tensors to have 4 dimensions.
So I guess this will take time…
Maybe not. I'm not that knowledgeable on the subject, but I assume a UNet2D needs to be 4D, or maybe you can just artificially add the extra dimensions, idk.
Notes:
Here's an example of accessing the penultimate text embedding layer: https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b34bb27cf30940f6a6a41f4b77c5b77bea11fd76/sdgrpcserver/pipeline/text_embedding/basic_text_embedding.py#L33
Doesn't seem to work for me on the 768-v model using the v2 config for v: TypeError: EulerDiscreteScheduler.__init__() got an unexpected keyword argument 'prediction_type'
Appears I'm also having an unexpected-argument error, but for a different arg.
Command: python convert.py --checkpoint_path="models/512-base-ema.ckpt" --dump_path="outputs/" --original_config_file="v2-inference.yaml"
Result:
I can't seem to find a resolution to this one.
You need to use the absolute latest Diffusers and merge this PR (or use my branch, which has it in it): #1386
(My branch is at https://github.com/hafriedlander/diffusers/tree/stable_diffusion_2)
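(Once #1386 is in, a minimal sketch of the scheduler side; this assumes the stabilityai/stable-diffusion-2 scheduler config, where prediction_type is expected to be "v_prediction" for the v-objective model:)

from diffusers import EulerDiscreteScheduler

# Loading the scheduler config that previously triggered the
# "unexpected keyword argument 'prediction_type'" error above
scheduler = EulerDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-2", subfolder="scheduler")
print(scheduler.config.prediction_type)  # expected: "v_prediction"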
Amazing to see the excitement here! We'll merge #1386 in a bit :-)
@patrickvonplaten the problems I've run into so far:
That's super helpful @hafriedlander - thanks! BTW, weights for the 512x512 are up:
Looking into the 768x768 model now.
Nice. Do you have a solution in mind for how to flag to the pipeline that it should use the penultimate layer of the CLIP model? (I just pass it in as an option at the moment.)
Can you send me a link? Does the pipeline not work out of the box? cc @anton-l @patil-suraj
It works, but I don't think it's correct. The Stability configuration files explicitly say to use the penultimate CLIP layer: https://github.com/Stability-AI/stablediffusion/blob/33910c386eaba78b7247ce84f313de0f2c314f61/configs/stable-diffusion/v2-inference-v.yaml#L68
It's relatively easy to get access to the penultimate layer; I do it in my custom pipeline (see the sketch below). The problem is knowing when to do it and when not to.
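(The code block from this comment didn't survive extraction; below is a hedged reconstruction of the general technique rather than the exact pipeline code: take the penultimate hidden state from a transformers CLIPTextModel and apply the encoder's final layer norm.)

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")

tokens = tokenizer("a prompt", padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(tokens.input_ids, output_hidden_states=True)

# hidden_states[-1] is what feeds the final layer norm to produce last_hidden_state;
# hidden_states[-2] is the penultimate layer, which still needs that norm applied
penultimate = text_encoder.text_model.final_layer_norm(out.hidden_states[-2])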
I see! Thanks for the links - so they do this for both the 512x512 SD 2 and the 768x768 SD 2 models?
Both.
It's a technique NovelAI discovered, FYI (https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac).
@patrickvonplaten how sure are you that your conversion is correct? I'm trying to diagnose a difference I get between your 768 weights and my conversion script. There's a big difference, and in general I much prefer the results from my conversion. It seems specific to the unet: if I replace my unet with yours, I get the same results.
OK, differential diagnostic done: it's the Tokenizer. How did you create the Tokenizer at https://huggingface.co/stabilityai/stable-diffusion-2/tree/main/tokenizer? I just built a Tokenizer from laion/CLIP-ViT-H-14-laion2B-s32B-b79K.
@hamzafar In one of the last cells (that sets up …
I've put "my" version of the Tokenizer at https://huggingface.co/halffried/sd2-laion-clipH14-tokenizer/tree/main. You can just replace the tokenizer in any pipeline to test it if you're interested (a sketch follows below).
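(A minimal sketch of that test, assuming the repo IDs above; it just swaps the tokenizer attribute on a loaded pipeline:)

import torch
from diffusers import DiffusionPipeline
from transformers import AutoTokenizer

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", torch_dtype=torch.float16).to("cuda")
# Replace the pipeline's tokenizer with the alternative one to compare outputs
pipe.tokenizer = AutoTokenizer.from_pretrained("halffried/sd2-laion-clipH14-tokenizer")
image = pipe("a prompt", num_inference_steps=25).images[0]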
@hafriedlander Given that that is the official stabilityai repo, presumably no one here in huggingface/diffusers made it, and it was just what was released with SDv2?
@0xdevalias Not sure. @patrickvonplaten said that the penultimate-layer fix was invented by @patil-suraj, who's a HuggingFace person, not a Stability person. Anyway, I'm not saying mine is correct or anything, just that, in the limited testing I've done, I like the result way more, and that's weird.
Thanks, will take a look. Also, could you post some results here so we can see the differences? I compared the results with the original repo and they seemed to match; I'll take a look again.
Also, could you post the prompts that gave you bad results?
The whole model seems very sensitive to style shifts. https://imgur.com/a/dUb93fD is three images with the standard tokenizer.
The prompt for the first is "A full portrait of a teenage smiling, beautiful post apocalyptic female princess, intricate, elegant, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha". The prompt for the second is exactly the same, but with the addition of a negative prompt: "bad teeth, missing teeth". The third is the first prompt, but without the word "smiling".
Here is the same with my version of the tokenizer: https://imgur.com/a/Wr5Sw9P
The second version with the original tokenizer is great, but I would not normally expect to see a big shift in quality from the addition of a negative prompt like that. I'll track down another of my recent prompts where I much preferred my tokenizer, and see if adding a negative prompt helps.
Thank you! Will also compare using these prompts.
I noticed one difference: the original open_clip tokenizer that was used to train SD2 uses 0 as the padding token, which the current SD2 tokenizer matches but a stock tokenizer built from the LAION repo does not:

import torch
from transformers import CLIPTokenizer, AutoTokenizer
from open_clip import tokenize

tok = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2", subfolder="tokenizer")
tok2 = AutoTokenizer.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

prompt = "A full portrait of a teenage smiling, beautiful post apocalyptic female princess, intricate, elegant, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha"

tok_orig = tokenize(prompt)
tok_current = tok(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
tok_auto = tok2(prompt, padding="max_length", max_length=77, return_tensors="pt", truncation=True).input_ids

assert torch.all(tok_orig == tok_current)  # True
assert torch.all(tok_orig == tok_auto)  # False
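(As a follow-up check on the padding difference, continuing with tok and tok2 from the snippet above; the specific pad-token values noted in the comments are an assumption based on this discussion:)

print(tok.pad_token, tok.pad_token_id)    # "!" / 0 in the fixed SD2 tokenizer, matching open_clip's zero-padding
print(tok2.pad_token, tok2.pad_token_id)  # "<|endoftext|>" / 49407 in the stock LAION tokenizer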
@0xdevalias I have generated images with and without ftfy. I can't observe any difference in the results:
Sorry, the warning is misleading and coming from …
UPDATE: the issue is gone with the newer build of xformers.

Hi, I'm using diffusers==0.9.0 and xformers==0.0.15.dev0+1515f77.d20221129, and for me SD 2.0 runs roughly 1.5x slower with xformers than without it (while it indeed saves some VRAM). At the same time, SD 1.5 runs about 1.5x faster with xformers, so it's unlikely that there's something wrong with my setup :) Is it a known issue? Here are some code samples to reproduce the issue:

# SD2, xformers disabled -> 5.02it/s
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.disable_xformers_memory_efficient_attention()
prompt = "An oil painting of white De Tomaso Pantera parked in the forest by Ivan Shishkin"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0] # warmup
image = pipe(prompt, guidance_scale=9, num_inference_steps=250, width=1024, height=576).images[0]
Fetching 12 files: 100%|##########| 12/12 [00:00<00:00, 52648.17it/s]
100%|##########| 25/25 [00:05<00:00, 4.70it/s]
100%|##########| 250/250 [00:49<00:00, 5.02it/s]

# SD2, xformers enabled -> 2.93it/s
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention() # explicitly enable xformers just in case
prompt = "An oil painting of white De Tomaso Pantera parked in the forest by Ivan Shishkin"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0] # warmup
image = pipe(prompt, guidance_scale=9, num_inference_steps=250, width=1024, height=576).images[0]
Fetching 12 files: 100%|##########| 12/12 [00:00<00:00, 43804.74it/s]
100%|##########| 25/25 [00:08<00:00, 2.90it/s]
100%|##########| 250/250 [01:25<00:00, 2.93it/s]

# SD1.5, xformers disabled -> 5.66it/s
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")
pipe.disable_xformers_memory_efficient_attention()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image = pipe(prompt, width=880, num_inference_steps=150).images[0]
Fetching 15 files: 100%|##########| 15/15 [00:00<00:00, 56987.83it/s]
100%|##########| 51/51 [00:04<00:00, 10.85it/s]
100%|##########| 151/151 [00:26<00:00, 5.66it/s]

# SD1.5, xformers enabled -> 7.94it/s
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image = pipe(prompt, width=880, num_inference_steps=150).images[0]
Fetching 15 files: 100%|##########| 15/15 [00:00<00:00, 54660.78it/s]
100%|##########| 51/51 [00:04<00:00, 12.42it/s]
100%|##########| 151/151 [00:19<00:00, 7.94it/s]
Hi @hafriedlander, it is helpful for me to fine-tune the SD 2.0 model.
Let me give you the details. I generated the former attached image with the following prompt.
I generated the former image with the following command.
Next, I converted the model with the following command.
I generated the latter image with the following code.
Thanks in advance.
@alfredplpl I haven't kept the converter updated for the recent Diffusers changes, since there's an official release of the model. The tokenizer specifically is definitely wrong, and it uses "velocity" instead of "v_predict". It's probably broken in other ways too. I'd start off by just trying to copy the tokenizer from the official SD2 version and see if that helps (a sketch follows below).
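(A hedged sketch of that suggestion; the local output path is hypothetical:)

from transformers import CLIPTokenizer

# Pull the official SD2 tokenizer and write it over the converted
# pipeline's tokenizer folder (the path below is just an example)
tok = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2", subfolder="tokenizer")
tok.save_pretrained("outputs/tokenizer")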
I finetuned stabilityai/stable-diffusion-2-base with the diffusers repo. I put this yaml beside the converted checkpoint to use it in the webui:
What can I do to get the same results? Do we need an updated conversion script for SD 2.0?
Two things here:
@hafriedlander Kohya S., a Japanese developer, has written the conversion code: https://note.com/kohya_ss/n/n374f316fe4ad
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is your feature request related to a problem? Please describe.
It would be great if we could run SD 2 with cpu_offload, attention slicing, xformers, etc...
Describe the solution you'd like
Adapt the conversion script to SD 2.0
Describe alternatives you've considered
Stability AI's repo is not as flexible.