
Different hyper-parameters between the paper and the code? (lora_alpha and global batch size) #37

t-hyun opened this issue Nov 22, 2022 · 3 comments



t-hyun commented Nov 22, 2022

Hello, thank you for sharing the source code. While trying to reproduce the SST-2 result with the RoBERTa-base model, I've run into some questions about the hyper-parameters lora_alpha and the global batch size, since the paper's hyper-parameter settings and the reproduction script that does both training and evaluation (examples/NLU/roberta_base_sst2.sh) conflict in a few places.

First of all, is this reproduction script the actual script that you used to produce the numbers in the paper?


  1. lora_alpha (8 or 16?)
    I'd like to know the exact lora_alpha that you used in training.
    In Appendix D, lora_alpha is 8; however, in examples/NLU/roberta_base_sst2.sh, lora_alpha is set to 16.


https://github.com/microsoft/LoRA/blob/70ca1efd17b6ca4a45bbdba98554d5b312a8d48c/examples/NLU/roberta_base_sst2.sh#L24

When I ran evaluation, lora_alpha = 16 gave the better result.

Maybe you trained with lora_alpha = 8 but evaluated with lora_alpha = 16, or something else entirely; it's a little confusing.

  2. global batch size during training (16, 64, 128, or something else?)
    In Appendix D the batch size is listed as 16, so I assumed 16 was the global batch size during training. However, in examples/NLU/roberta_base_sst2.sh, per_device_train_batch_size is 16 and the number of GPUs is 8, so the global batch size should be 128. Moreover, the explanation at https://github.com/microsoft/LoRA/tree/main/examples/NLU#adapting-to-the-glue-benchmark says that 4 GPUs were used, so the global batch size should be 64. (See the small sketch after this list.)

When the global batch size was 128, my reproduced result (94.5 accuracy) was lower than in the paper. Thanks.

  3. weight decay of the AdamW optimizer
    The weight-decay hyper-parameter is set in examples/NLU/roberta_base_sst2.sh, but it is not mentioned in the paper (for the GLUE tasks).
    Did you use weight decay?
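
For reference, here is how I computed the candidate global batch sizes from the different sources. This is only a small sketch; gradient_accumulation_steps = 1 is my assumption, since I did not see it overridden in the script:

```python
# Candidate global batch sizes implied by the paper, the README, and the script.
per_device_train_batch_size = 16   # same value in Appendix D and in roberta_base_sst2.sh
gradient_accumulation_steps = 1    # assumption: not overridden in the script

for source, num_gpus in [("Appendix D, if 1 GPU", 1), ("README, 4 GPUs", 4), ("script, 8 GPUs", 8)]:
    global_batch = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
    print(f"{source}: global batch size = {global_batch}")
# Appendix D, if 1 GPU: global batch size = 16
# README, 4 GPUs: global batch size = 64
# script, 8 GPUs: global batch size = 128
```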

I wrote down my understanding of your hyper-parameter settings in the table below, and I'd appreciate it if you could confirm the exact values.
[screenshot: hyper-parameter table]


Bannng commented May 19, 2023

I have exactly the same questions that @t-hyun raised.
It would be really appreciated if you could respond to his questions, especially regarding lora_alpha!

I cannot clearly understand the effect of the lora_alpha / rank r ratio when merging the parameters during training.
Isn't setting (lora_alpha / r) to 2 just the same as doubling the learning rate?

Some other posts set this ratio to 1.0 as the default during training (fully the same as fine-tuning) and then use a lower/higher ratio at inference time in order to interpolate the effect of fusing the updated parameters into the originally pre-trained ones.

Thus, can you explain the effect of the alpha / r ratio more clearly?
Thanks!
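
To make the question concrete, this is how I currently picture the ratio entering a LoRA layer. It is a minimal sketch under my own assumptions, not the actual loralib implementation:

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer; not the loralib code."""
    def __init__(self, in_features, out_features, r=8, lora_alpha=8):
        super().__init__()
        # frozen pre-trained weight W0
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # trainable low-rank factors: A is Gaussian-initialised, B starts at zero
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = lora_alpha / r  # the ratio in question

    def forward(self, x):
        # h = x W0^T + (alpha / r) * x (BA)^T
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

    def merged_weight(self):
        # weight obtained when the update is fused into W0 (e.g. for inference)
        return self.weight + self.scaling * (self.lora_B @ self.lora_A)
```

If this matches the real implementation, doubling alpha at a fixed r doubles the scale of the BA update, but whether that is exactly the same as doubling the learning rate presumably depends on the optimizer, which is part of what I'd like clarified.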

@roshan-gopalakrishnan

Is there any reply to these questions?

@thusinh1969

I used rank 32, alpha 16, lr 1e-4, and a global batch size of 128. It works well.
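
In loralib terms that corresponds to something like the sketch below (the 768x768 layer size and the optimizer setup are placeholders of mine, not my actual training setup):

```python
import torch
import loralib as lora

# rank 32, alpha 16 on a placeholder 768x768 projection
layer = lora.Linear(768, 768, r=32, lora_alpha=16)

model = torch.nn.Sequential(layer)
lora.mark_only_lora_as_trainable(model)  # train only the LoRA matrices
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
)
```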
Steve
