Hello, thank you for sharing the source code. While trying to reproduce the SST-2 result with the RoBERTa-base model, I ran into some questions about the hyper-parameters, specifically lora_alpha and the global batch size.
The hyper-parameter settings in the paper and the reproduction script that does both training and evaluation (examples/NLU/roberta_base_sst2.sh) conflict in a few places.
First of all, is that script the actual one you used to produce the numbers in the paper?
lora_alpha (8 or 16?)
I'd like to know the exact lora_alpha value that you used in training.
In Appendix D, lora_alpha is 8; however, in examples/NLU/roberta_base_sst2.sh it is set to 16.
When I evaluated, lora_alpha = 16 gave the better result.
Perhaps you trained with lora_alpha = 8 but evaluated with lora_alpha = 16, or something else; it's a little confusing.
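For reference, here is my understanding of where lora_alpha enters the forward pass (a minimal NumPy sketch of the scaling, not the repo's actual loralib code; the shapes and init values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                        # hidden size, LoRA rank (r = 8 per Appendix D)
W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # LoRA down-projection (pretend-trained here;
B = rng.normal(size=(d, r)) * 0.01   # in real LoRA, B starts at zero)

def lora_forward(x, lora_alpha):
    # y = x W^T + (lora_alpha / r) * x (BA)^T
    # lora_alpha only rescales the low-rank update, never the frozen weight.
    scaling = lora_alpha / r
    return x @ W.T + scaling * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
y8  = lora_forward(x, lora_alpha=8)   # scaling = 1.0
y16 = lora_forward(x, lora_alpha=16)  # scaling = 2.0 -> update is exactly doubled
```

So, if I read the code correctly, switching 8 to 16 exactly doubles the contribution of the learned update at evaluation time, which is why the two settings give different scores on the same checkpoint.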
global batch size during training (16, 64, 128, or something else?)
In Appendix D it is written that the batch size is 16, so I assumed 16 was the global batch size during training. However, examples/NLU/roberta_base_sst2.sh sets per_device_train_batch_size to 16 and uses 8 GPUs, which would make the global batch size 128. Moreover, the explanation at https://github.com/microsoft/LoRA/tree/main/examples/NLU#adapting-to-the-glue-benchmark says that 4 GPUs were used, which would make it 64.
With a global batch size of 128, my reproduction came out lower than the paper's result (94.5% accuracy). Thanks.
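To spell out the three readings (a trivial sketch; I'm assuming gradient_accumulation_steps stays at its default of 1):

```python
def global_batch(per_device, num_gpus, grad_accum=1):
    """Effective batch size under data-parallel training."""
    return per_device * num_gpus * grad_accum

print(global_batch(16, 1))   # 16  -- if Appendix D's "batch size 16" is already global
print(global_batch(16, 4))   # 64  -- if the README's 4-GPU setup is correct
print(global_batch(16, 8))   # 128 -- if the 8-GPU launch in roberta_base_sst2.sh is correct
```

Knowing which of these three was used would also pin down the effective learning-rate schedule (steps per epoch differ by up to 8x).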
weight decay of AdamW optimizer
The weight decay hyper-parameter appears in the script examples/NLU/roberta_base_sst2.sh but is not mentioned in the paper (for the GLUE tasks).
Did you actually use weight decay?
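In case it matters for reproduction: with AdamW the decay is decoupled from the gradient, so it is a genuinely independent hyper-parameter rather than something absorbed into the loss. A simplified sketch of one AdamW step (my own code, not the HF Trainer's; the lr and weight_decay values are placeholders, not the script's):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW step. Decay is applied directly to w (decoupled),
    not added to the gradient g, so it shrinks w even when g == 0."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

Because of this decoupling, leaving the flag on in the script while the paper omits it could plausibly shift results, which is why I'd like the exact value confirmed.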
I have written down your hyper-parameter settings as above, and I'd appreciate it if you could confirm the exact values.
I have the exact same questions that @t-hyun raised.
A response would be really appreciated, especially regarding lora_alpha!
I can't clearly understand the effect of the lora_alpha / rank r ratio when merging the parameters during training.
If we set lora_alpha / r to 2, isn't that just equivalent to doubling the learning rate for the LoRA update?
Some other posts set this ratio to 1.0 as the default during training (exactly matching fine-tuning) and then use a lower or higher ratio at inference time to interpolate the effect of fusing the updated parameters into the original pre-trained ones.
So, could you explain the effect of the alpha / r ratio more clearly?
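To make the question concrete, here is a small NumPy sketch (my own notation, not code from this repo) of the merge-then-rescale trick those posts describe:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W0 = rng.normal(size=(d, d))        # pre-trained weight
A = rng.normal(size=(r, d)) * 0.1   # pretend-trained LoRA factors
B = rng.normal(size=(d, r)) * 0.1

def merge(W0, B, A, alpha, r):
    # Fuse the low-rank update into the base weight with ratio alpha / r.
    return W0 + (alpha / r) * (B @ A)

W_train = merge(W0, B, A, alpha=8, r=r)  # ratio 1.0, as used during training
W_half  = merge(W0, B, A, alpha=4, r=r)  # ratio 0.5 at inference time

# Lowering the ratio linearly interpolates between W0 and the trained merge:
assert np.allclose(W_half, 0.5 * (W0 + W_train))
```

So lowering alpha at inference really is a linear interpolation between the pre-trained weights and the fully merged ones, which is the part I'd like confirmed against how alpha behaves during training.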
Thanks!