
Fix LoftQ docs and tests #1532

Merged

Conversation


@BenjaminBossan BenjaminBossan commented Mar 4, 2024

Relates to #1525

Don't merge this, some GPU tests are failing

Unfortunately, the docs I wrote about how to use LoftQ were incorrect, based on a misunderstanding I had. In reality, getting LoftQ to work is quite a bit more involved and requires a complete roundtrip: first loading a non-quantized model with LoftQ, saving the LoRA weights and the modified base model, loading the just-saved base model again, but this time with quantization, and finally loading the LoftQ-initialized adapter on top. The docs now link to the example that demonstrates how to move through these steps.
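
Roughly, the roundtrip looks like the sketch below (the model name, paths, and 4-bit settings are placeholders, and `unload()` stands in for the custom unwrapping helpers used in the example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoftQConfig, LoraConfig, PeftModel, get_peft_model

model_id = "bigscience/bloomz-560m"  # placeholder model

# 1. load the *non-quantized* base model and apply LoftQ initialization
base_model = AutoModelForCausalLM.from_pretrained(model_id)
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config, target_modules="all-linear")
lora_model = get_peft_model(base_model, lora_config)

# 2. save the LoftQ-initialized LoRA weights and the modified base model
lora_model.save_pretrained("loftq-adapter")
modified_base = lora_model.unload()  # strips the LoRA layers, keeps the overwritten base weights
modified_base.save_pretrained("loftq-base")

# 3. reload the just-saved base model, this time with bnb quantization
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized_base = AutoModelForCausalLM.from_pretrained("loftq-base", quantization_config=bnb_config)

# 4. load the LoftQ-initialized adapter on top of the quantized base model
model = PeftModel.from_pretrained(quantized_base, "loftq-adapter", is_trainable=True)
```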

The unit tests have been adjusted to go through these same steps, but now most of them fail, i.e. the quantization error is greater with LoftQ than without. This needs to be investigated.

Relates to huggingface#1525

Unfortunately, the docs I wrote about how to use LoftQ were incorrect,
based on a misunderstanding I had. In reality, it is quite a bit more
involved to get LoftQ working, requiring a complete roundtrip first
loading a non-quantized model with LoftQ, saving the LoRA weights and
the modified base model, loading the just stored base model again but
this time with quantization, and finally loading the LoftQ-initialized
adapter on top. The docs now link to the example which demonstrates how
to move through these steps.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member Author

@yxli2123 As described above, I made a mistake in how LoftQ needs to be used. This PR adjusts the docs and the unit tests to apply LoftQ correctly (from my understanding).

However, after I changed the unit tests, 8 of them started failing and only 4 passed:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃  File                        ┃  Function                                       ┃  Function Line  ┃  Error Line  ┃  Error           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit[cuda]         │  1244           │  1260        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit[cpu]          │  1244           │  1260        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit_iter_5[cuda]  │  1263           │  1275        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit_iter_5[cpu]   │  1263           │  1275        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_8bit_iter_5[cuda]  │  1293           │  1305        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_8bit_iter_5[cpu]   │  1293           │  1305        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_t5_loftq_4bit[cuda]             │  1308           │  1320        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_t5_loftq_4bit[cpu]              │  1308           │  1320        │  AssertionError  │
└──────────────────────────────┴─────────────────────────────────────────────────┴─────────────────┴──────────────┴──────────────────┘

In these unit tests, I measure the model output from the base model without any quantization, then the model output of the bnb-quantized model, and finally the model output of the quantized model using LoftQ. The expectation is that the model output with LoftQ should be closer to the base model output than the bnb-quantized model output (without LoftQ).

(Note that I'm only testing that LoftQ leads to closer results, even if it's only marginally closer. Ideally, however, we want it to be X% better, i.e. there should be a certain margin.)
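
In pseudocode, the comparison in the adjusted tests is roughly the following (a simplified sketch; `get_logits`, `inputs`, and the three model objects are stand-ins for the actual test fixtures):

```python
import torch

def mae(a: torch.Tensor, b: torch.Tensor) -> float:
    # mean absolute error between two logit tensors
    return (a - b).abs().mean().item()

logits_base = get_logits(base_model, inputs)         # no quantization
logits_quant = get_logits(quantized_model, inputs)   # bnb quantization, no LoftQ
logits_loftq = get_logits(loftq_model, inputs)       # bnb quantization + LoftQ-initialized LoRA

# expectation: LoftQ initialization should bring the quantized model's outputs
# closer to the non-quantized model's outputs, ideally by a clear margin
assert mae(logits_base, logits_loftq) < mae(logits_base, logits_quant)
```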

I'm not sure why the error with LoftQ is greater and not smaller. Am I still applying LoftQ incorrectly? Please check the unit tests in case I made a mistake there. For debugging purposes, I added a check that, on each individual layer, the residual error does indeed decrease when applying the LoRA weights initialized with LoftQ, so this part seems to be correct.

Moreover, while working on this, some questions came up:

  1. Why do we need to override the weights with the dequantized weights (here)? Why can we not just load the quantized base model with the LoftQ-LoRA weights, i.e. skip this step? I tried that as well, but tests are failing no matter what.
  2. Is it really working correctly when we pass num_iter > 1? I feel like this line is not correct, because we are iteratively changing the residual, but lora_A and lora_B are only determined by the very last step. Maybe those need to be incremented iteratively as well?

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Mar 7, 2024
Related to huggingface#1532

At the moment, using LoftQ is quite cumbersome, as shown in this
example:

https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning

Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model with custom functions
5. Save the base model with modified weights (i.e. a whole copy of the
   base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but this still has the
disadvantage that we need to load the non-quantized model and that we
have to create a completely new model checkpoint with the modified
weights.

This PR aims to make this process more convenient by adding a single
function replace_lora_weights_loftq. This function takes the
bnb-quantized LoRA model as input. Then it goes through each module with
LoRA weights, lazily loads the corresponding non-quantized weights one
at a time using safetensors, computes the quantization error, and
replaces the LoRA weights with LoftQ-initialized LoRA weights.

This is much more convenient because we only require very little extra
memory thanks to lazy loading, and we don't have to keep an extra copy
of the weights.
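
A minimal usage sketch of the proposed helper (the exact signature may still change while this PR is in progress; the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "bigscience/bloomz-560m"  # placeholder

# load the base model directly with bnb 4-bit quantization -- no roundtrip needed
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# create a regular LoRA model on top of the quantized base model
peft_model = get_peft_model(base_model, LoraConfig(target_modules="all-linear"))

# replace the freshly initialized LoRA weights with LoftQ-initialized ones,
# lazily reading the non-quantized weights from the safetensors checkpoint
replace_lora_weights_loftq(peft_model)
```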

While working on this, I still found that LoftQ initialization often did
not seem to help a lot, as mentioned in huggingface#1532. I measured this by
creating (1) logits with the base model, (2) with the quantized+LoRA
model, and (3) with the quantized+LoRA+LoftQ model. The expectation is
that (1) should be closer to (3) than to (2). This was often not the
case.

I therefore added the possibility to run a check each time that we
replace a LoRA weight with the LoftQ weights. If this check returns
True, we proceed to the next weight, otherwise we discard the change.
That way, we only make the replacement with LoftQ weights if we see a
real improvement. Of course, this is only a form of greedy optimization,
but it seems to work in practice. And since it's optional, users can
choose not to use it.
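
A sketch of such a check (assuming the callback receives the model and the name of the module that was just replaced and returns whether to keep the change; `inputs` and `logits_base` are stand-ins):

```python
import torch

# `inputs` is a fixed batch of tokenized text and `logits_base` are the logits of the
# original, non-quantized model on that batch (both are stand-ins in this sketch)
with torch.no_grad():
    logits_before = peft_model(**inputs).logits
best_mae = [(logits_base - logits_before).abs().mean().item()]  # error before any replacement

def loftq_callback(model, module_name):
    """Return True to keep the LoftQ replacement for `module_name`, False to discard it."""
    with torch.no_grad():
        logits = model(**inputs).logits
    mae = (logits_base - logits).abs().mean().item()
    if mae < best_mae[0]:
        best_mae[0] = mae
        return True   # the replacement brought the outputs closer to the base model
    return False      # the replacement made things worse, so discard it

replace_lora_weights_loftq(peft_model, callback=loftq_callback)
```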

This PR is not yet finished since I ran into an issue with the key
names from safetensors not matching.

Furthermore, for now this doesn't support 8bit quantization or the
num_iter argument of LoftQ, which I'm not sure is really working.
However, I guess the replace_lora_weights_loftq function could be called
multiple times in a row.
@pacman100 pacman100 left a comment

Thank you Benjamin. After going through the tests and the examples for LoftQ, I believe that the discrepancy and the failing tests happen due to the casting of activations to float16 by bitsandbytes when loading the LoftQ base model and adapters.

@pacman100

So, while the difference between the original weights W and the quantized+adapter weights (Q+BA) would be smaller compared to just using Q, it isn't reflected in the logits because the activations at each layer are being cast to float16 by bnb.

@BenjaminBossan BenjaminBossan marked this pull request as ready for review March 19, 2024 16:26
@pacman100 pacman100 left a comment

Thank you @BenjaminBossan for all the work on rectifying LoftQ tests and uncovering best practices for applying it! 🤗

@@ -44,6 +44,8 @@ config = LoraConfig(init_lora_weights=False, ...)

When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://arxiv.org/abs/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).

In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.

Nice point to note in the docs.
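
For reference, a minimal sketch of the two recommended settings from the quoted docs:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization with the recommended nf4 quant type
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# target as many layers as possible so that LoftQ can be applied to all of them
lora_config = LoraConfig(target_modules="all-linear")
```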

@BenjaminBossan BenjaminBossan changed the title WIP Fix LoftQ docs and tests Fix LoftQ docs and tests Mar 20, 2024
@BenjaminBossan BenjaminBossan merged commit a86b29a into huggingface:main Mar 20, 2024
14 checks passed
@BenjaminBossan BenjaminBossan deleted the fix-loftq-docs-and-tests branch March 20, 2024 09:37
BenjaminBossan added a commit that referenced this pull request Mar 20, 2024
Related to #1532

At the moment, using LoftQ is quite cumbersome, as shown in this
example:

https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning

Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model
5. Save the base model with modified weights (i.e. a whole copy of the
   base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but this still has the
disadvantage that we need to load the non-quantized model and that we
have to create a completely new model checkpoint with the modified
weights.

This PR aims to make this process more convenient by adding a single
function replace_lora_weights_loftq. This function takes the
bnb-quantized LoRA model as input. Then it goes through each module with
LoRA weights, lazily loads the corresponding non-quantized weights one
at a time using safetensors, computes the quantization error, and
replaces the LoRA weights with LoftQ-initialized LoRA weights.

This is much more convenient because we only require very little extra
memory thanks to lazy loading, and we don't have to keep an extra copy
of the weights.

While working on this, I still found that LoftQ initialization often did
not seem to help a lot, as mentioned in #1532. I measured this by
creating (1) logits with the base model, (2) with the quantized+LoRA
model, and (3) with the quantized+LoRA+LoftQ model. The expectation is
that (1) should be closer to (3) than to (2). This was often not the
case.

I therefore added the possibility to run a check each time that we
replace a LoRA weight with the LoftQ weights. If this check returns
True, we proceed to the next weight, otherwise we discard the change.
That way, we only make the replacement with LoftQ weights if we see a
real improvement. Of course, this is only a form of greedy optimization,
but it seems to work in practice. And since it's optional, users can
choose not to use it.

This doesn't support 8bit quantization or the num_iter argument of LoftQ.
However, the replace_lora_weights_loftq function can be called multiple
times in a row for slightly improved results.

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>