
feat: add load lora weights implementation for 'lora_' prefix LoRA we… #3294

Closed
wants to merge 7 commits into from

Conversation

showxu

@showxu showxu commented Apr 30, 2023

Add a load_lora_weights implementation for the 'lora_' prefix LoRA weights format to LoraLoaderMixin. This should fix #3064. Use case:

checkpoint_path = os.path.abspath('/path/to/lora_models/lora.safetensors')
pipeline.load_lora_weights(checkpoint_path, lora_weight=1.0)
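For context, a fuller end-to-end sketch of the intended usage (the model ID, path, and weight value are placeholders, not part of the PR):

import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# `lora_weight` is the scaling argument this PR adds to load_lora_weights.
pipeline.load_lora_weights("/path/to/lora_models/lora.safetensors", lora_weight=0.8)
image = pipeline("masterpiece, best quality, a portrait photo").images[0]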

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@alexblattner

@showxu will this work inside prompts? also how does Lora differ from TIs?

@showxu
Author

showxu commented May 1, 2023

@showxu will this work inside prompts? also how does Lora differ from TIs?

It will work with a prompt, but not in the a1111 format used in sd-webui. This PR adds a way to load LoRA weights in the 'lora_' prefix format, and Textual Inversion should be used directly via pipe.load_textual_inversion in the pipeline, AFAIK.
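For reference, a minimal sketch of the load_textual_inversion route (the concept repo and its learned token are just illustrative):

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Loads a learned embedding; its token (here <cat-toy>) can then be used in prompts.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a <cat-toy> sitting on a park bench").images[0]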

@sayakpaul
Member

I don't think this is a format that you get from diffusers. Could you share more details on how you obtained the LoRA weights?

Cc: @patrickvonplaten

@showxu
Author

showxu commented May 1, 2023

I don't think this is a format that you get from diffusers. Could you share more details on how you obtained the LoRA weights?

Cc: @patrickvonplaten

It's the LoRA weights trained with the a1111 sd-webui and most commonly distributed on Civitai, as discussed in #3064 (comment); #3064 (comment) is also helpful.

@alexblattner

It will work with a prompt, but not in the a1111 format used in sd-webui. This PR adds a way to load LoRA weights in the 'lora_' prefix format, and Textual Inversion should be used directly via pipe.load_textual_inversion in the pipeline, AFAIK.

thank you for your answer, I understand what you are saying, but I am deeply confused by the fact that a1111 includes loras in the prompt. I am wondering about the mechanisms behind things because I've seen many use loras as if they were TIs inside the prompt. Could you perhaps shed some light on that whole thing?

@RustyKettle

Would this work with the controlnet pipelines like StableDiffusionControlNetInpaintImg2ImgPipeline?

@BEpresent

BEpresent commented May 2, 2023

It will work with a prompt, but not in the a1111 format used in sd-webui. This PR adds a way to load LoRA weights in the 'lora_' prefix format, and Textual Inversion should be used directly via pipe.load_textual_inversion in the pipeline, AFAIK.

thank you for your answer, I understand what you are saying, but I am deeply confused by the fact that a1111 includes loras in the prompt. I am wondering about the mechanisms behind things because I've seen many use loras as if they were TIs inside the prompt. Could you perhaps shed some light on that whole thing?

I am not an expert in the a1111 codebase, but looking at this https://github.com/AUTOMATIC1111/stable-diffusion-webui/tree/5ab7f213bec2f816f9c5644becb32eb72c8ffb89/extensions-builtin/Lora it appears that specifying LoRAs in the prompt is just a convenient way to load the LoRAs and then just call the .loadLora functions to replace the weights.

I wonder if this is how it could be done here too? The prompt would need to be parsed for the LoRAs in a special syntax, and then a load_lora_weights would be called for each of those (with appropriate weights). If "Compel" does not support this, it might require a custom function to parse the prompt and load LoRAs into the pipeline.
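One way that could look outside the library (a rough sketch, assuming the A1111-style <lora:name:weight> prompt syntax and this PR's lora_weight argument; the helper name and directory are hypothetical):

import os
import re

LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9.]+)>")

def apply_prompt_loras(pipe, prompt, lora_dir="/path/to/loras"):
    """Load every <lora:name:weight> referenced in the prompt, then strip the tags."""
    for name, weight in LORA_TAG.findall(prompt):
        pipe.load_lora_weights(os.path.join(lora_dir, f"{name}.safetensors"), lora_weight=float(weight))
    return LORA_TAG.sub("", prompt).strip()

The cleaned prompt returned here is what would actually be passed to the pipeline (or to Compel).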

@alexblattner

@BEpresent I doubt that's all, because there's something called latent couples, which applies multiple masks and prompts at the same time; you can apply LoRAs in some masks and not others, and it works.

@BEpresent

@BEpresent I doubt that's all, because there's something called latent couples, which applies multiple masks and prompts at the same time; you can apply LoRAs in some masks and not others, and it works.

I see, do you mean multiple LoRAs at the same time? Do you have an example where this is being used in the prompt? I could imagine that in the short term some custom solutions would need to be written to have something like this in diffusers.

@alexblattner

@BEpresent I am actually working on latent couples as we speak, here's what I am using as starting points:
https://github.com/miZyind/sd-webui-latent-couple (look at the colab at the bottom too)
https://www.youtube.com/watch?v=uR89wZMXiJ8

@BEpresent

BEpresent commented May 2, 2023

@BEpresent I am actually working on latent couples as we speak, here's what I am using as starting points: https://github.com/miZyind/sd-webui-latent-couple (look at the colab at the bottom too) https://www.youtube.com/watch?v=uR89wZMXiJ8

thanks will have a look at this - naively I would have just loaded a lora when a parser finds it in the prompt. There might be more to it - edit: I remember that vid yeah - this seems to be one level up in terms of complexity, having different masked regions in an image. For the moment I would be happy to load a single Lora or multiple Loras from the prompt, which should be doable just by inserting them into the pipeline. This particular extension could be ported too of course.

@sayakpaul
Member

Sorry for the delay here. Will get back to this next week, for sure. Also, seeking reviews from @patrickvonplaten and @pdoane.

Meanwhile, if you could provide additional details on how you obtained the LoRA weights and the weights themselves, and a code snippet to test the results, that would be genuinely helpful.

@alexblattner

alexblattner commented May 4, 2023

@BEpresent I think I found how they load the loras in A111: https://github.com/opparco/stable-diffusion-webui-composable-lora

edit: this one is even better: https://github.com/a2569875/stable-diffusion-webui-composable-lora

@pdoane
Contributor

pdoane commented May 4, 2023

I understand what you are saying, but I am deeply confused by the fact that a1111 includes loras in the prompt

@alexblattner - we've answered the prompt question elsewhere. Unlike textual inversions, LoRAs are not related to the prompt at all - #3064. Compel provides parsing support - damian0815/compel#30. It is purely a UI/design decision that A1111 uses the prompt.

For the moment I would be happy to load a single Lora or multiple Loras from the prompt, which should be doable just by inserting them into the pipeline.

@BEpresent - I don't think that is a good direction for diffusers library. Complex prompt weighting is already handled externally (e.g. with Compel) and LoRAs/textual inversions management is another layer higher than that.

@pdoane
Contributor

pdoane commented May 4, 2023

@sayakpaul - I haven't tried the code here yet but it looks like a reasonable start from a functionality perspective. There is a design issue to sort out though.

This PR directly modifies the weights of the layers rather than creating attention processors like the existing LoRA implementation. The attention processor framework provides a way to change what is currently active, but also has increased memory usage and slightly slower performance with the forwarding overhead. Directly modifying the weight as in this PR is simpler, has no extra memory or performance penalty, but currently there is no way to undo it.

Continuing with the strategy in the PR, there could be an optional dictionary as a parameter to save the original layer weights and another function to reapply.

Another option would be to update the PR to create attention processors and follow existing design.

Either way, I think the LoRA implementations should align on one approach eventually.
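A rough sketch of the first option (helper names are hypothetical; the backup dict records a layer's original tensor the first time it is touched so the merge can be reverted later):

import torch

def merge_lora_into_layer(curr_layer, weight_up, weight_down, alpha, weight, backup=None):
    """Add the LoRA delta into the layer weight; optionally remember the original tensor."""
    if backup is not None and curr_layer not in backup:
        backup[curr_layer] = curr_layer.weight.data.clone()
    curr_layer.weight.data += weight * alpha * torch.mm(weight_up, weight_down)

def restore_original_weights(backup):
    """Undo every merge recorded in `backup`."""
    for layer, original in backup.items():
        layer.weight.data.copy_(original)
    backup.clear()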

@alexblattner

@alexblattner - we've answered the prompt question elsewhere. Unlike textual inversions, LoRAs are not related to the prompt at all - #3064. Compel provides parsing support - damian0815/compel#30. It is purely a UI/design decision that A1111 uses the prompt.

With all due respect @pdoane, it seems to me like there's a knowledge gap regarding Loras. It seems like there's a way to apply Loras weights to specific areas of the image and not other areas (which is not currently possible in diffusers).

As I pointed out here:

@BEpresent I think I found how they load the loras in A111: https://github.com/opparco/stable-diffusion-webui-composable-lora

edit: this one is even better: https://github.com/a2569875/stable-diffusion-webui-composable-lora

It seems to me like there's a place for Loras in prompts as they can be key to more complex and interesting implementations.

By being too strict you may be dismissing a larger picture.

@pdoane
Contributor

pdoane commented May 4, 2023

Attention processors are a better fit for more dynamic use of LoRAs and puts things on a path towards extensions like that. It's a big refactor on the PR though.

@BEpresent

BEpresent commented May 4, 2023

For the moment I would be happy to load a single Lora or multiple Loras from the prompt, which should be doable just by inserting them into the pipeline.

@BEpresent - I don't think that is a good direction for diffusers library. Complex prompt weighting is already handled externally (e.g. with Compel) and LoRAs/textual inversions management is another layer higher than that.

I see. I would also be fine with excluding any LoRAs from prompts; those who want to use a1111 prompts with LoRAs in diffusers (I'm one of them) would need to come up with a conversion script to e.g. parse them manually and then use the diffusers library to load the LoRA weights with some kind of load_lora_weights function (which has been posted in some issues and still seems to be an open issue for the safetensors case?). At least this is how I would do it.

For textual inversion I would use pipeline.load_textual_inversion(...) which seems to be the recommended way to load them. Anyone feel free to correct me if this is not the case.

@patrickvonplaten
Contributor

Sorry a bit lost here in the context - is this needed for A1111?

@pdoane
Contributor

pdoane commented May 6, 2023

Sorry a bit lost here in the context - is this needed for A1111?

@patrickvonplaten - I did not open the PR but will try my best to summarize. This PR adds support for LoRAs often found in community usage (e.g. CivitAI). I suspect this is A1111 format but hard for me to say as none of these formats are well documented.

The implementation is based on work done on an issue thread that is functional, but does not match the rest of the design of the diffusers library. It directly modifies the layer weights rather than creating attention processors.

Most of the discussion is about the API and how LoRAs should be exposed. Tools like InvokeAI and A1111 support LoRAs and their parameters in the prompt so some users/developers will expect that even if it's an implementation detail. FYI - Compel provides parsing support for the scope of this PR. The composable LoRA extension discussed allows a LoRA to operate on a subset of a prompt, but that would impact an attention processor implementation and not the prompt embeds.

My recommendations and thoughts:

  • Attention processors are likely a better long-term fit for how to implement the behavior.
  • I would hold on merging until this is using attention processors to avoid implementation divergence. However, the whole API surface area is still experimental so maybe fine.
  • Can multiple LoRAs work in the current API? Eventually there is a call to set_attn_processor and maybe that can be called multiple times but the semantics aren't clear.
  • Related question - what is the right way to disable some attention processors while keeping others active?
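On the last question above, one possible sketch (relying on the unet.attn_processors mapping and the fact that set_attn_processor also accepts a dict keyed by processor name):

from diffusers.models.attention_processor import AttnProcessor

# Example: reset the mid-block processors to the default while keeping LoRA processors elsewhere.
procs = dict(pipe.unet.attn_processors)
for name in procs:
    if name.startswith("mid_block"):
        procs[name] = AttnProcessor()
pipe.unet.set_attn_processor(procs)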

@patrickvonplaten
Contributor

Thanks @pdoane!

Ok, I would suggest we first tackle loading LoRAs in the A1111 format into diffusers, but we should not do this by introducing that many changes; instead, we should rename the A1111 state dict to the diffusers state dict format on the fly IMO.
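A minimal sketch of what that on-the-fly renaming could look like (the two keys below are only illustrative; a real converter needs the full key mapping and has to cope with module names that themselves contain underscores):

# Hypothetical, partial mapping from the A1111/kohya 'lora_' key layout to the diffusers layout.
A1111_TO_DIFFUSERS = {
    "lora_unet_down_blocks_0_attentions_0_transformer_blocks_0_attn1_to_q.lora_down.weight":
        "unet.down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor.to_q_lora.down.weight",
    "lora_unet_down_blocks_0_attentions_0_transformer_blocks_0_attn1_to_q.lora_up.weight":
        "unet.down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor.to_q_lora.up.weight",
}

def convert_a1111_state_dict(state_dict, key_map=A1111_TO_DIFFUSERS):
    """Rename keys that have a known diffusers equivalent; leave everything else untouched."""
    return {key_map.get(k, k): v for k, v in state_dict.items()}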

@sayakpaul
Member

Ok, I would suggest we first tackle loading LoRAs in the A1111 format into diffusers, but we should not do this by introducing that many changes; instead, we should rename the A1111 state dict to the diffusers state dict format on the fly IMO.

Agreed with Patrick here. @showxu could we maybe try to amend this PR to follow this design instead?

Most of the discussion is about the API and how LoRAs should be exposed. Tools like InvokeAI and A1111 support LoRAs and their parameters in the prompt so some users/developers will expect that even if it's an implementation detail. FYI - Compel provides parsing support for the scope of this PR. The composable LoRA extension discussed allows a LoRA to operate on a subset of a prompt, but that would impact an attention processor implementation and not the prompt embeds.

I think this part should be separated from this PR. Also, I am not too sure if it's something we want to allow from diffusers. @patrickvonplaten what are your thoughts? IIUC it boils down to a few questions:

  • Should / can we support passing LoRA params in the input prompts? On the surface, I think it might disturb the core designs of Diffusers, so I say no.
  • Same for supporting the ability to allow a LoRA to operate on a subset of a prompt. If there are simpler alternatives, e.g. with a better prompt and with inference-time hyperparameters, I'd prefer that rather than making changes to the attention processors.

Can multiple LoRAs work in the current API? Eventually there is a call to set_attn_processor and maybe that can be called multiple times but the semantics aren't clear.

Not yet, I think. Let's discuss that in a separate issue.

@sayakpaul
Member

sayakpaul commented May 8, 2023

Also, this is a very important consideration:

This PR directly modifies the weights of the layers rather than creating attention processors like the existing LoRA implementation. The attention processor framework provides a way to change what is currently active, but also has increased memory usage and slightly slower performance with the forwarding overhead. Directly modifying the weight as in this PR is simpler, has no extra memory or performance penalty, but currently there is no way to undo it.

When operating directly on a pipeline, users can load a particular LoRA which, under the hood, initializes a LoRA attention processor for the UNet. This is what happens when pipeline.load_lora_weights() is called:

if is_lora:
    is_new_lora_format = all(
        key.startswith(self.unet_name) or key.startswith(self.text_encoder_name) for key in state_dict.keys()
    )
    if is_new_lora_format:
        # Strip the `"unet"` prefix.
        is_text_encoder_present = any(key.startswith(self.text_encoder_name) for key in state_dict.keys())
        if is_text_encoder_present:
            warn_message = "The state_dict contains LoRA params corresponding to the text encoder which are not being used here. To use both UNet and text encoder related LoRA params, use [`pipe.load_lora_weights()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraLoaderMixin.load_lora_weights)."
            warnings.warn(warn_message)
        unet_keys = [k for k in state_dict.keys() if k.startswith(self.unet_name)]
        state_dict = {k.replace(f"{self.unet_name}.", ""): v for k, v in state_dict.items() if k in unet_keys}

    lora_grouped_dict = defaultdict(dict)
    for key, value in state_dict.items():
        attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:])
        lora_grouped_dict[attn_processor_key][sub_key] = value

    for key, value_dict in lora_grouped_dict.items():
        rank = value_dict["to_k_lora.down.weight"].shape[0]
        cross_attention_dim = value_dict["to_k_lora.down.weight"].shape[1]
        hidden_size = value_dict["to_k_lora.up.weight"].shape[0]

        attn_processors[key] = LoRAAttnProcessor(
            hidden_size=hidden_size, cross_attention_dim=cross_attention_dim, rank=rank
        )
        attn_processors[key].load_state_dict(value_dict)

elif is_custom_diffusion:
    custom_diffusion_grouped_dict = defaultdict(dict)
    for key, value in state_dict.items():
        if len(value) == 0:
            custom_diffusion_grouped_dict[key] = {}
        else:
            if "to_out" in key:
                attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:])
            else:
                attn_processor_key, sub_key = ".".join(key.split(".")[:-2]), ".".join(key.split(".")[-2:])
            custom_diffusion_grouped_dict[attn_processor_key][sub_key] = value

    for key, value_dict in custom_diffusion_grouped_dict.items():
        if len(value_dict) == 0:
            attn_processors[key] = CustomDiffusionAttnProcessor(
                train_kv=False, train_q_out=False, hidden_size=None, cross_attention_dim=None
            )
        else:
            cross_attention_dim = value_dict["to_k_custom_diffusion.weight"].shape[1]
            hidden_size = value_dict["to_k_custom_diffusion.weight"].shape[0]
            train_q_out = True if "to_q_custom_diffusion.weight" in value_dict else False
            attn_processors[key] = CustomDiffusionAttnProcessor(
                train_kv=True,
                train_q_out=train_q_out,
                hidden_size=hidden_size,
                cross_attention_dim=cross_attention_dim,
            )
        attn_processors[key].load_state_dict(value_dict)
else:
    raise ValueError(
        f"{model_file} does not seem to be in the correct format expected by LoRA or Custom Diffusion training."
    )

But users also have the flexibility to quickly set the attention processor to something else by doing:

from diffusers.models.attention_processor import AttnProcessor

pipeline.unet.set_attn_processor(AttnProcessor())

Modifying weights on the fly (the way it's done in this PR) removes this flexibility.

@pdoane
Contributor

pdoane commented May 9, 2023

As a point of comparison, A1111 has switched to directly modifying weights to improve performance:

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Seed-breaking-changes#2023-03-26---apply-lora-by-altering-layers-weights

@patrickvonplaten
Contributor

Related to: #3064

@JemiloII

JemiloII commented May 13, 2023

Just ignore this comment; I found the error in the PR that fixed my CUDA issue.

Just curious. I'm trying this PR out with version 0.16.1. Just merged the changes with it. I noticed that there is extra code floating in main, so I decided to not use main currently. So the issue I have, may or may not actually be an issue. But I'm just sharing my observation.

Here is my loaders.py I've merged this PR with.
https://gist.github.com/JemiloII/64043d8a162e86f7af979890cb950acd

Here is the code I was running that caused the issue.
https://gist.github.com/JemiloII/9ed9f02cc741ef355d3d0936d2b557ea

Traceback (most recent call last):
  File "D:\diffusion\generate.py", line 85, in <module>
    run()
  File "D:\diffusion\generate.py", line 63, in run
    pipeline.load_lora_weights(lycoris_path, lora_weight=1.0)
  File "D:\diffusion\.venv\lib\site-packages\diffusers\loaders.py", line 829, in load_lora_weights
    self._load_lora_weights(state_dict, weight=lora_weight)
  File "D:\diffusion\.venv\lib\site-packages\diffusers\loaders.py", line 909, in _load_lora_weights
    curr_layer.weight.data += weight * alpha * torch.mm(weight_up, weight_down)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

Found addmm_impl_cpu_ in Pytorch. Not sure if it helps, thought I'd link it.
https://github.com/pytorch/pytorch/blob/2361f7f0cec081ed7ab5b4508b85a2e3b8c47e46/aten/src/ATen/native/LinearAlgebra.cpp#L1312

Not sure if this matters but my system is Win 10; I'm using CUDA, with an AMD 5950x & RTX 4090.

@JemiloII

@sayakpaul @pdoane

How does one undo/unload an AttnProcessor? I see that it's mentioned. Does that mean there is a way to undo/unload a Lora while keeping the rest of the diffuser in memory? I currently do not unload from memory to speed up my prompt-to-generation flow.

This PR directly modifies the weights of the layers rather than creating attention processors like the existing LoRA implementation. The attention processor framework provides a way to change what is currently active, but also has increased memory usage and slightly slower performance with the forwarding overhead. Directly modifying the weight as in this PR is simpler, has no extra memory or performance penalty, but currently there is no way to undo it.

@cmdr2
Contributor

cmdr2 commented May 19, 2023

I have a strong hunch that this approach misses some of the keys in the models. The diffusers format uses slightly different key names for the same things, and the ckpt-to-diffusers script does this mapping. But this one doesn't do any mapping, so I suspect it's missing a lot of keys.

I think supporting A1111 format will require looking at the ckpt-to-diffusers script, and ensuring that all the keys are accounted for.

It's definitely possible that I'm wrong, but it's worth confirming that all the keys are mapped correctly.


PS: I agree that we should just convert the state dict to diffusers format, and then load it via the usual function. No changes will be needed to the LoRA mixin.

This conversion function could be in convert_from_ckpt.py, similar to how it has functions to convert other checkpoint modules (like UNet, VAE, etc) to diffusers.

@frankjoshua

I tested this branch and I'm having issues with
torch_dtype=torch.float16

It works with float32 but of course that's very slow.

  File "/root/app/models.py", line 203, in loadModel
    model_holder = ModelHolder(model_data)
  File "/root/app/model_holder.py", line 57, in __init__
    self.model.load_lora_weights("./lora/", weight_name="light_and_shadow.safetensors")
  File "/opt/conda/lib/python3.9/site-packages/diffusers/loaders.py", line 845, in load_lora_weights
    self._load_lora_weights(state_dict, weight=lora_weight)
  File "/opt/conda/lib/python3.9/site-packages/diffusers/loaders.py", line 926, in _load_lora_weights
    curr_layer.weight.data += weight * alpha * torch.mm(weight_up, weight_down)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

@sayakpaul
Member

We recently added limited support for A1111 LoRAs, so this PR is superseded by that. Cc: @takuma104.

It might be worth closing this PR and discussing other options, maybe.

@showxu

@patrickvonplaten
Contributor

Think it'd be ok to close this PR here

@sayakpaul sayakpaul closed this Jun 12, 2023
@JemiloII

JemiloII commented Jun 14, 2023

While the PR is closed, this method still worked best for my use cases with the modified bit of code I provided. The current implementation doesn't load LyCORIS models correctly, and I keep hitting this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm), making LoRAs just unusable.

I have to move my pipeline.to() command in order to get rid of the cuda:0 and cpu error; however, no LoRA is ever applied.
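For the device mismatch specifically, one hedged workaround (assuming the UNet already lives on cuda:0 while the LoRA tensors were loaded on the CPU; this does not address the LyCORIS key layout) is to move and cast the state dict before handing it to load_lora_weights, which also accepts a dict:

import safetensors.torch

# Hypothetical path; load the file yourself, align device/dtype with the UNet, then load.
state_dict = safetensors.torch.load_file("/path/to/lycoris.safetensors")
device, dtype = pipeline.unet.device, pipeline.unet.dtype
state_dict = {k: v.to(device=device, dtype=dtype) for k, v in state_dict.items()}
pipeline.load_lora_weights(state_dict)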

@sayakpaul
Member

While the PR is closed, this method still worked best for my use cases with the modified bit of code I provided. The current implementation doesn't load LyCORIS models correctly, and I keep hitting this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm), making LoRAs just unusable.

I have to move my pipeline.to() command in order to get rid of the cuda:0 and cpu error; however, no LoRA is ever applied.

As stated in the earlier comments, we cannot go with this implementation because it directly merges the LoRA params into the UNet params. This is something we want to avoid.

Development

Successfully merging this pull request may close these issues.

Loading .safetensors Lora