
Adding weighted adapter as LoRAs combination gives unexpected result with StableDiffusion compared to webui #643

Closed
2 of 4 tasks
kovalexal opened this issue Jun 27, 2023 · 7 comments

@kovalexal
Contributor

kovalexal commented Jun 27, 2023

System Info

Python 3.8; diffusers, transformers, accelerate, and peft installed from the main branch of each library (I used a slightly modified version of your peft-gpu Dockerfile)

Who can help?

@pacman100 @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Hi!

I've discovered some unexpected results when combining multiple LoRA adapters for StableDiffusion with PEFT, compared to the results webui produces.
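All snippets below assume a StableDiffusion pipeline has already been created; here is a minimal sketch of that setup (the checkpoint path is hypothetical):

```python
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
from peft import PeftModel  # used in the snippets below

# Hypothetical local path to a diffusers-format deliberate_v2 checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/deliberate_v2",
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")
# Match webui's "Euler a" sampler
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```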

webui setting: (screenshot)

peft setting: (screenshot)

Sanity check

Let's check that both setups give the same results without using any LoRAs.

candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight

Negative prompt: bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic

Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1428928479, Size: 512x512, Model hash: 9aba26abdf, Model: deliberate_v2
  • webui output:

webui_nolora_output

  • diffusers + peft sample code:
```python
torch.manual_seed(1428928479)
image = pipe(
    prompt="candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight",
    negative_prompt="bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic",
    num_inference_steps=20,
    guidance_scale=7,
).images[0]
```
  • diffusers + peft output:

diffusers_peft_nolora_output

The results are quite similar, so this test has passed.

Single LoRA

Let's check that both setups give the same results when using a single LoRA (let's use Detail Tweaker aka add-detail).

<lora:add_detail:1> candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight

Negative prompt: bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic

Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1428928479, Size: 512x512, Model hash: 9aba26abdf, Model: deliberate_v2
  • webui output:

webui_singlelora_output

  • diffusers + peft sample code:
```python
# Path elided; lora_path points to the add_detail LoRA in PEFT format
lora_path = ...
lora_name = "add_detail"

pipe.unet = PeftModel.from_pretrained(
    pipe.unet,
    f"{lora_path}/unet",
    adapter_name=lora_name,
)

pipe.text_encoder = PeftModel.from_pretrained(
    pipe.text_encoder,
    f"{lora_path}/text_encoder",
    adapter_name=lora_name,
)

pipe.unet.set_adapter(lora_name)
pipe.text_encoder.set_adapter(lora_name)

torch.manual_seed(1428928479)
image = pipe(
    prompt="candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight",
    negative_prompt="bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic",
    num_inference_steps=20,
    guidance_scale=7,
).images[0]
```
  • diffusers + peft output:

diffusers_peft_singlelora_output

The results are also quite similar, so this test has passed.

Mixture of two LoRAs

Let's check that both setups give the same results when using a mixture of two LoRAs (let's use Detail Tweaker aka add-detail and 3D rendering style aka 3DMM_V11, both with weight 1.0).

<lora:add_detail:1> <lora:3DMM_V11:1> candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight

Negative prompt: bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic

Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1428928479, Size: 512x512, Model hash: 9aba26abdf, Model: deliberate_v2
  • webui output:

webui_twolora_output

  • diffusers + peft sample code:
```python
# Add the first LoRA
lora_path = ...  # path to the add_detail LoRA
lora_name = "add_detail"
pipe.unet = PeftModel.from_pretrained(
    pipe.unet,
    f"{lora_path}/unet",
    adapter_name=lora_name,
)
pipe.text_encoder = PeftModel.from_pretrained(
    pipe.text_encoder,
    f"{lora_path}/text_encoder",
    adapter_name=lora_name,
)

# Add the second LoRA
lora_path = ...  # path to the 3DMM_V11 LoRA
lora_name = "3DMM_V11"
pipe.unet.load_adapter(
    f"{lora_path}/unet",
    adapter_name=lora_name,
)
pipe.text_encoder.load_adapter(
    f"{lora_path}/text_encoder",
    adapter_name=lora_name,
)

# Mix the two LoRAs together
pipe = create_weighted_lora_adapter(pipe, ["add_detail", "3DMM_V11"], [1.0, 1.0], "combined")
pipe.unet.set_adapter("combined")
pipe.text_encoder.set_adapter("combined")

torch.manual_seed(1428928479)
image = pipe(
    prompt="candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight",
    negative_prompt="bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic",
    num_inference_steps=20,
    guidance_scale=7,
).images[0]
```
  • diffusers + peft output:

diffusers_peft_twolora_output

We can see that the results differ dramatically.
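Note: the create_weighted_lora_adapter helper used above isn't defined in this issue; a minimal sketch of what it presumably does, assuming it simply wraps LoraModel.add_weighted_adapter for both sub-models:

```python
# Hypothetical helper (not part of PEFT): create a combined adapter on both
# the UNet and the text encoder, then return the pipeline.
def create_weighted_lora_adapter(pipe, adapters, weights, adapter_name):
    pipe.unet.add_weighted_adapter(adapters, weights, adapter_name)
    pipe.text_encoder.add_weighted_adapter(adapters, weights, adapter_name)
    return pipe
```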

Mixture of two LoRAs - what is going on?

Let's try this approach in peft:

  1. Load the first LoRA and merge it into the base model
  2. Load the second LoRA, apply it on top of the merged model, and investigate the results
  • diffusers + peft sample code:
```python
# Add the first LoRA
lora_path = ...  # path to the add_detail LoRA
lora_name = "add_detail"
pipe.unet = PeftModel.from_pretrained(
    pipe.unet,
    f"{lora_path}/unet",
    adapter_name=lora_name,
)
pipe.text_encoder = PeftModel.from_pretrained(
    pipe.text_encoder,
    f"{lora_path}/text_encoder",
    adapter_name=lora_name,
)

# Merge the first LoRA into the model weights
pipe.unet = pipe.unet.merge_and_unload()
pipe.text_encoder = pipe.text_encoder.merge_and_unload()

# Load the second LoRA on top of the merged weights
lora_path = ...  # path to the 3DMM_V11 LoRA
lora_name = "3DMM_V11"
pipe.unet = PeftModel.from_pretrained(
    pipe.unet,
    f"{lora_path}/unet",
    adapter_name=lora_name,
)
pipe.text_encoder = PeftModel.from_pretrained(
    pipe.text_encoder,
    f"{lora_path}/text_encoder",
    adapter_name=lora_name,
)

torch.manual_seed(1428928479)
image = pipe(
    prompt="candid RAW portrait photo of a woman (Crystal Simmerman:1.0) with (dark hair:1.0) and a (purple colored suit:1.0) on a dark street with shopping windows (at night:1.2), bokeh, Ilford Delta 3200 film, dof, high definition, detailed, intricate, flashlight",
    negative_prompt="bad-hands-5, asian, cropped, lowres, poorly drawn face, out of frame, blurry, blurred, text, watermark, disfigured, closed eyes, ugly, cartoon, render, 3d, plastic, 3d (artwork), rendered, comic",
    num_inference_steps=20,
    guidance_scale=7,
).images[0]
```
  • diffusers + peft output:

diffusers_peft_twolora_step_by_step_output

We can see that the results are quite similar to what we get in webui. So we can definitely say that the problem lies in how the weighted adapter for two LoRAs is created.

Mixture of two LoRAs - what is going on? - diving deeper

From my perspective, there is a possible error inside the LoraModel.add_weighted_adapter method. These are the relevant lines (loop context reconstructed around the excerpt):

```python
for adapter, weight in zip(adapters, weights):
    if adapter not in target.lora_A:
        continue
    target.lora_A[adapter_name].weight.data += (
        target.lora_A[adapter].weight.data * weight * target.scaling[adapter]
    )
    target.lora_B[adapter_name].weight.data += target.lora_B[adapter].weight.data * weight
```

A LoRA is an addition to the base weights:

$h = W_0 x + B A x $

So a mixture of multiple LoRAs should be calculated like this:

$h = W_0 x + \alpha_1 B_1 A_1 x + \alpha_2 B_2 A_2 x + \ldots$

But currently, for LoRAs of the same rank, the mixture is calculated like this:

$h = W_0 x + (B_1 + B_2 + \ldots) (\alpha_1 A_1 + \alpha_2 A_2 + \ldots) x$

Expanding this product introduces cross terms such as $\alpha_2 B_1 A_2 x$ and $\alpha_1 B_2 A_1 x$ that do not appear in the correct mixture.
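A quick numerical check (shapes chosen arbitrarily, names hypothetical) confirms that the two expressions disagree:

```python
import torch

torch.manual_seed(0)
d, r = 8, 4
A1, A2 = torch.randn(r, d), torch.randn(r, d)
B1, B2 = torch.randn(d, r), torch.randn(d, r)
a1, a2 = 1.0, 1.0

correct = a1 * (B1 @ A1) + a2 * (B2 @ A2)  # sum of the individual deltas
current = (B1 + B2) @ (a1 * A1 + a2 * A2)  # what add_weighted_adapter computes
print(torch.allclose(correct, current))    # False: cross terms B1 @ A2, B2 @ A1 remain
```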

Mixture of multiple LoRAs - possible solutions:

I see the following possible solutions to overcome this issue (minimal sketches of options 1 and 2 follow this list):

  1. Perform concatenation instead of a sum (concatenate $B$ and $A$ along different dims):

    • pros: easy to implement, and we can mix LoRAs with different ranks;
    • cons: the output rank grows with every LoRA we mix in, so combining many LoRAs can produce an adapter with a very large rank, which leads to a serious performance drawback.
  2. Perform some decomposition (like SVD) of just the LoRA mixture $\alpha_1 B_1 A_1 + \alpha_2 B_2 A_2 + \ldots$ and drop the least important components:

    • pros: we can get any output rank we want, and we can mix LoRAs with different ranks;
    • cons: there will definitely be some accuracy loss if the rank is too small, and the interface of add_weighted_adapter changes.
  3. Replace the base weights with the merged LoRAs and store a copy of the base weights for unmerging/unmixing:

    • pros: the most reliable solution; we would be able to merge, unmerge, mix, and unmix anything we want (as far as I understand, this is what webui does);
    • cons: we need to store a copy of the base weights, a lot of code would have to be rewritten, and current interfaces would break.
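Hypothetical helpers (assumed names and shapes, not the PEFT API) sketching options 1 and 2 for a single layer:

```python
import torch

def concat_merge(loras, weights):
    """Option 1: concatenate factors so the deltas add exactly (rank = sum of ranks).

    loras: list of (B, A) pairs, B of shape (d_out, r_i), A of shape (r_i, d_in).
    """
    B_new = torch.cat([B for B, _ in loras], dim=1)
    A_new = torch.cat([w * A for (_, A), w in zip(loras, weights)], dim=0)
    return B_new, A_new  # B_new @ A_new == sum_i w_i * (B_i @ A_i)

def svd_merge(loras, weights, out_rank):
    """Option 2: sum the weighted deltas, then truncate back to out_rank via SVD."""
    delta = sum(w * (B @ A) for (B, A), w in zip(loras, weights))
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B_new = U[:, :out_rank] * S[:out_rank].sqrt()                # (d_out, out_rank)
    A_new = S[:out_rank].sqrt().unsqueeze(1) * Vh[:out_rank, :]  # (out_rank, d_in)
    return B_new, A_new
```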

@pacman100 I am not sure whether this applies only to my case (maybe things work differently for text models), but I would be happy to help your team fix this issue.

Expected behavior

From my perspective, merging multiple LoRAs in peft should work just like merging in webui.

@kovalexal kovalexal changed the title Adding weighted adapter as LoRAs combination gives unexpected result on StableDiffusion compared to webui Adding weighted adapter as LoRAs combination gives unexpected result with StableDiffusion compared to webui Jun 27, 2023
@pacman100
Contributor

Hello @kovalexal, what a detailed, insightful, and helpful issue description. Thank you!

Yes, I know the weighted adapter method isn't mathematically equivalent to merging LoRAs one after another. I mentioned consecutive merging in #280 (comment).

The current implementation is inspired by https://github.com/cloneofsimo/lora/tree/master, which seems to work in practice:
Screenshot 2023-06-28 at 12 58 31 PM

I agree that it is mathematically incorrect, but it is an easier way of mixing LoRAs.

I believe point 2 would fit properly without many changes:

    Perform some decomposition (like SVD) of just the LoRA mixture and drop the least important components:

    pros: we can get any output rank we want, and we can mix LoRAs with different ranks;
    cons: there will definitely be some accuracy loss if the rank is too small, and the interface of add_weighted_adapter changes.

@kovalexal
Contributor Author

Hello @pacman100, thanks for the clarification!

I'll dig into it when I have some capacity.

@pacman100
Contributor

Hello, the merged PR #695 should address this using point 2 you suggested, the SVD decomposition. The new rank is the max of the ranks of the LoRAs being combined.
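For reference, a usage sketch (assuming the call signature stays the same as in the snippets above):

```python
# After PR #695, add_weighted_adapter combines the LoRA deltas via SVD.
pipe.unet.add_weighted_adapter(["add_detail", "3DMM_V11"], [1.0, 1.0], "combined")
pipe.text_encoder.add_weighted_adapter(["add_detail", "3DMM_V11"], [1.0, 1.0], "combined")
pipe.unet.set_adapter("combined")
pipe.text_encoder.set_adapter("combined")
```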

@pacman100
Contributor

Also, I have been working on adding PEFT support to Kohya-ss for training and to webui extensions for inference.

PEFT training of DreamBooth: pacman100/peft-dreambooth (Branch: peft-dreambooth/ at smangrul/add-peft-support)

Extension to use PEFT in webui: pacman100/peft-sd-webui-additional-networks (Branch: peft-sd-webui-additional-networks/ at smangrul/add-peft-support)

Sample output trying it out:

Screenshot 2023-07-15 at 2 25 05 PM

@kovalexal
Contributor Author

@pacman100 Wow, great, thank you, a very useful addition!

I've also worked on my own version of SVD decomposition for LoRA weights. I assumed it could be useful to also let users specify an output rank for the combined adapter (so one can create an adapter with similar characteristics at the cost of some precision). Would you mind if I create a PR for this?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@kovalexal
Contributor Author

This issue was fully addressed in #817; we can now get results identical to what we get in webui!
