
Fix LoftQ docs and tests #1532

Merged

Conversation


@BenjaminBossan BenjaminBossan commented Mar 4, 2024

Relates to #1525

Don't merge this, some GPU tests are failing

Unfortunately, the docs I wrote about how to use LoftQ were incorrect, based on a misunderstanding I had. In reality, getting LoftQ to work is quite a bit more involved and requires a complete roundtrip: first loading a non-quantized model with LoftQ, saving the LoRA weights and the modified base model, loading the just-saved base model again, but this time with quantization, and finally loading the LoftQ-initialized adapter on top. The docs now link to the example that demonstrates how to move through these steps.
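
Roughly, the roundtrip looks like the sketch below (the model name, paths, and 4-bit settings are placeholders, and `unload()` stands in for the custom unwrapping helpers used in the example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoftQConfig, LoraConfig, PeftModel, get_peft_model

model_id = "bigscience/bloomz-560m"  # placeholder model

# 1. load the *non-quantized* base model and apply LoftQ initialization
base_model = AutoModelForCausalLM.from_pretrained(model_id)
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config, target_modules="all-linear")
lora_model = get_peft_model(base_model, lora_config)

# 2. save the LoftQ-initialized LoRA weights and the modified base model
lora_model.save_pretrained("loftq-adapter")
modified_base = lora_model.unload()  # strips the LoRA layers, keeps the overwritten base weights
modified_base.save_pretrained("loftq-base")

# 3. reload the just-saved base model, this time with bnb quantization
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized_base = AutoModelForCausalLM.from_pretrained("loftq-base", quantization_config=bnb_config)

# 4. load the LoftQ-initialized adapter on top of the quantized base model
model = PeftModel.from_pretrained(quantized_base, "loftq-adapter", is_trainable=True)
```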

The unit tests have been adjusted to go through these same steps, but now most of them fail, i.e. the quantization error is greater with LoftQ than without. This needs to be investigated.

Relates to huggingface#1525

Unfortunately, the docs I wrote about how to use LoftQ were incorrect,
based on a misunderstanding I had. In reality, it is quite a bit more
involved to get LoftQ working, requiring a complete roundtrip first
loading a non-quantized model with LoftQ, saving the LoRA weights and
the modified base model, loading the just stored base model again but
this time with quantization, and finally loading the LoftQ-initialized
adapter on top. The docs now link to the example which demonstrates how
to move through these steps.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member Author

@yxli2123 As described above, I made a mistake in how LoftQ needs to be used. This PR adjusts the docs and the unit tests to apply LoftQ correctly (from my understanding).

However, after I changed the unit tests, 8 of them started failing and only 4 passed:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃  File                        ┃  Function                                       ┃  Function Line  ┃  Error Line  ┃  Error           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit[cuda]         │  1244           │  1260        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit[cpu]          │  1244           │  1260        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit_iter_5[cuda]  │  1263           │  1275        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_4bit_iter_5[cpu]   │  1263           │  1275        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_8bit_iter_5[cuda]  │  1293           │  1305        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_bloomz_loftq_8bit_iter_5[cpu]   │  1293           │  1305        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_t5_loftq_4bit[cuda]             │  1308           │  1320        │  AssertionError  │
│  tests/test_gpu_examples.py  │  TestLoftQ.test_t5_loftq_4bit[cpu]              │  1308           │  1320        │  AssertionError  │
└──────────────────────────────┴─────────────────────────────────────────────────┴─────────────────┴──────────────┴──────────────────┘

In these unit tests, I measure the model output from the base model without any quantization, then the model output of the bnb-quantized model, and finally the model output of the quantized model using LoftQ. The expectation is that the model output with LoftQ should be closer to the base model output than the bnb-quantized model output (without LoftQ).

(Note that I'm only testing that LoftQ leads to closer results, even if it's only marginally closer. Ideally, however, we want it to be X% better, i.e. there should be a certain margin.)
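
In pseudocode, the comparison in the adjusted tests is roughly the following (a simplified sketch; `get_logits`, `inputs`, and the three model objects are stand-ins for the actual test fixtures):

```python
import torch

def mae(a: torch.Tensor, b: torch.Tensor) -> float:
    # mean absolute error between two logit tensors
    return (a - b).abs().mean().item()

logits_base = get_logits(base_model, inputs)         # no quantization
logits_quant = get_logits(quantized_model, inputs)   # bnb quantization, no LoftQ
logits_loftq = get_logits(loftq_model, inputs)       # bnb quantization + LoftQ-initialized LoRA

# expectation: LoftQ initialization should bring the quantized model's outputs
# closer to the non-quantized model's outputs, ideally by a clear margin
assert mae(logits_base, logits_loftq) < mae(logits_base, logits_quant)
```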

I'm not sure why the error with LoftQ is greater and not smaller. Am I still applying LoftQ incorrectly? Please check the unit tests in case I made a mistake there. For debugging purposes, I added a check that, on each individual layer, the residual error does indeed decrease when applying the LoRA weights initialized with LoftQ, so this part seems to be correct.

Moreover, while working on this, some questions came up:

  1. Why do we need to override the weights with the dequantized weights (here)? Why can we not just load the quantized base model with the LoftQ-LoRA weights, i.e. skip this step? I tried that as well, but tests are failing no matter what.
  2. Is it really working correctly when we pass num_iter > 1? I feel like this line is not correct, because we are iteratively changing the residual, but lora_A and lora_B are only determined by the very last step. Maybe those need to be incremented iteratively as well?

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Mar 7, 2024
Related to huggingface#1532

At the moment, using LoftQ is quite cumbersome, as shown in this
example:

https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning

Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model with custom functions
5. Save the base model with modified weights (i.e. a whole copy of the
   base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but this still has the
disadvantage that we need to load the non-quantized model and that we
have to create a completely new model checkpoint with the modified
weights.

This PR aims to make this process more convenient by adding a single
function replace_lora_weights_loftq. This function takes the
bnb-quantized LoRA model as input. Then it goes through each module with
LoRA weights, lazily loads the corresponding non-quantized weights one
at a time using safetensors, computes the quantization error, and
replaces the LoRA weights with LoftQ-initialized LoRA weights.

This is much more convenient because we only require very little extra
memory thanks to lazy loading, and we don't have to keep an extra copy
of the weights.
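
A minimal usage sketch of the proposed helper (the exact signature may still change while this PR is in progress; the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "bigscience/bloomz-560m"  # placeholder

# load the base model directly with bnb 4-bit quantization -- no roundtrip needed
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# create a regular LoRA model on top of the quantized base model
peft_model = get_peft_model(base_model, LoraConfig(target_modules="all-linear"))

# replace the freshly initialized LoRA weights with LoftQ-initialized ones,
# lazily reading the non-quantized weights from the safetensors checkpoint
replace_lora_weights_loftq(peft_model)
```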

While working on this, I still found that LoftQ initialization often did
not seem to help a lot, as mentioned in huggingface#1532. I measured this by
creating (1) logits with the base model, (2) with the quantized+LoRA
model, and (3) with the quantized+LoRA+LoftQ model. The expectation is
that (1) should be closer to (3) than to (2). This was often not the
case.

I therefore added the possibility to run a check each time that we
replace a LoRA weight with the LoftQ weights. If this check returns
True, we proceed to the next weight, otherwise we discard the change.
That way, we only make the replacement with LoftQ weights if we see a
real improvement. Of course, this is only a form of greedy optimization,
but it seems to work in practice. And since it's optional, users can
choose not to use it.
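
A sketch of such a check (assuming the callback receives the model and the name of the module that was just replaced and returns whether to keep the change; `inputs` and `logits_base` are stand-ins):

```python
import torch

# `inputs` is a fixed batch of tokenized text and `logits_base` are the logits of the
# original, non-quantized model on that batch (both are stand-ins in this sketch)
with torch.no_grad():
    logits_before = peft_model(**inputs).logits
best_mae = [(logits_base - logits_before).abs().mean().item()]  # error before any replacement

def loftq_callback(model, module_name):
    """Return True to keep the LoftQ replacement for `module_name`, False to discard it."""
    with torch.no_grad():
        logits = model(**inputs).logits
    mae = (logits_base - logits).abs().mean().item()
    if mae < best_mae[0]:
        best_mae[0] = mae
        return True   # the replacement brought the outputs closer to the base model
    return False      # the replacement made things worse, so discard it

replace_lora_weights_loftq(peft_model, callback=loftq_callback)
```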

This PR is not yet finished since I ran into an issue with the key
names from safetensors not matching.

Furthermore, for now this doesn't support 8bit quantization or the
num_iter argument of LoftQ, which I'm not sure is really working.
However, I guess the replace_lora_weights_loftq function could be called
multiple times in a row.
@pacman100 pacman100 left a comment

Thank you Benjamin. After going through the tests and the examples for LoftQ, I believe that the discrepancy and the failing tests happen due to the casting of activations to float16 by bitsandbytes when loading the LoftQ base model and adapters.

@pacman100

So, while the difference between the original weights W and the quantized+adapter weights (Q+BA) would be smaller compared to just using Q, it isn't reflected in the logits because the activations at each layer are being cast to float16 by bnb.

@BenjaminBossan BenjaminBossan marked this pull request as ready for review March 19, 2024 16:26
@pacman100 pacman100 left a comment

Thank you @BenjaminBossan for all the work on rectifying LoftQ tests and uncovering best practices for applying it! 🤗

@@ -44,6 +44,8 @@ config = LoraConfig(init_lora_weights=False, ...)

When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://arxiv.org/abs/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).

In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.

Nice point to note in the docs.
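
For reference, a minimal sketch of the two recommended settings from the quoted docs:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization with the recommended nf4 quant type
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# target as many layers as possible so that LoftQ can be applied to all of them
lora_config = LoraConfig(target_modules="all-linear")
```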

@BenjaminBossan BenjaminBossan changed the title WIP Fix LoftQ docs and tests Fix LoftQ docs and tests Mar 20, 2024
@BenjaminBossan BenjaminBossan merged commit a86b29a into huggingface:main Mar 20, 2024
14 checks passed
@BenjaminBossan BenjaminBossan deleted the fix-loftq-docs-and-tests branch March 20, 2024 09:37
BenjaminBossan added a commit that referenced this pull request Mar 20, 2024
Related to #1532

At the moment, using LoftQ is quite cumbersome, as shown in this
example:

https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning

Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model
5. Save the base model with modified weights (i.e. a whole copy of the
   base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but this still has the
disadvantage that we need to load the non-quantized model and that we
have to create a completely new model checkpoint with the modified
weights.

This PR aims to make this process more convenient by adding a single
function replace_lora_weights_loftq. This function takes the
bnb-quantized LoRA model as input. Then it goes through each module with
LoRA weights, lazily loads the corresponding non-quantized weights one
at a time using safetensors, computes the quantization error, and
replaces the LoRA weights with LoftQ-initialized LoRA weights.

This is much more convenient because we only require very little extra
memory thanks to lazy loading, and we don't have to keep an extra copy
of the weights.

While working on this, I still found that LoftQ initialization often did
not seem to help a lot, as mentioned in #1532. I measured this by
creating (1) logits with the base model, (2) with the quantized+LoRA
model, and (3) with the quantized+LoRA+LoftQ model. The expectation is
that (1) should be closer to (3) than to (2). This was often not the
case.

I therefore added the possibility to run a check each time that we
replace a LoRA weight with the LoftQ weights. If this check returns
True, we proceed to the next weight, otherwise we discard the change.
That way, we only make the replacement with LoftQ weights if we see a
real improvement. Of course, this is only a form of greedy optimization,
but it seems to work in practice. And since it's optional, users can
choose not to use it.

This doesn't support 8bit quantization or the num_iter argument of LoftQ.
However, the replace_lora_weights_loftq function can be called multiple
times in a row for slightly improved results.

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>