Description
I am working on fine-tuning LLMs (6B to 40B parameters) using the LoRA framework on an instruction-tuning dataset comprising instructions for ~20 tasks (a mix of factual and open-ended tasks). The input to the model consists of a conversation snippet between two individuals along with a task-specific prompt. The results I am observing do not align with the performance improvements reported in the paper. Specifically, the paper reports that fine-tuning with LoRA generally performs on par with or better than full fine-tuning of the model; however, across my experiments LoRA consistently trails full fine-tuning by an absolute margin of ~4-6% in RougeL score.
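For clarity on the metric: RougeL here is the aggregate F-measure. A minimal sketch of how the scoring can be reproduced with the `evaluate` library (the helper name is illustrative, not my exact pipeline):

```python
# Sketch of RougeL scoring via the `evaluate` library.
# `score_predictions` is an illustrative helper, not part of my pipeline.
import evaluate

rouge = evaluate.load("rouge")

def score_predictions(predictions, references):
    # Returns the aggregate rougeL F-measure as a percentage,
    # matching how the ~4-6% gap is reported above.
    results = rouge.compute(predictions=predictions, references=references)
    return results["rougeL"] * 100
```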
Sharing some of the training details below:
[Framework versions]
Python: 3.8
PyTorch: 1.13.1
Transformers: 4.27.4
PEFT: 0.3.0
[Infrastructure]
8 × A100 40 GB GPUs
[Hyper-parameter Range]
Learning rate: 5e-5 to 3e-3
Learning rate scheduler: [Constant, Linear]
Epochs: [1, 2]
Batch size: [2, 4, 8]
Weight decay: 0.0
Precision: bf16
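For reference, one point from this search space maps onto Transformers 4.27 training arguments roughly as follows (a sketch; the output path is a placeholder and the commented alternatives reflect the ranges above):

```python
# One configuration from the hyper-parameter search space above,
# expressed as Transformers 4.27 Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs",          # placeholder path
    learning_rate=1e-3,              # swept from 5e-5 to 3e-3
    lr_scheduler_type="constant",    # also tried "linear"
    num_train_epochs=1,              # also tried 2
    per_device_train_batch_size=8,   # also tried 2 and 4
    weight_decay=0.0,
    bf16=True,
    predict_with_generate=True,      # needed for RougeL evaluation
)
```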
Specifically, I tried fine-tuning the google/flan-t5-xxl model in the following two scenarios:
- Scenario 1: Full fine-tuning with constant learning rate = 5e-5, batch size = 8, epochs = 1
- Scenario 2: Fine-tuning using LoRA with constant learning rate = 1e-3, batch size = 8, epochs = 1, and LoraConfig as follows:
  LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias='none', task_type="SEQ_2_SEQ_LM")
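For completeness, the Scenario 2 model construction looks roughly like this (a sketch; note that with target_modules unset, PEFT 0.3.0 defaults to the q and v projections for T5):

```python
# Sketch of the Scenario 2 setup with PEFT 0.3.0 / Transformers 4.27.
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    # target_modules left unset: PEFT 0.3.0 defaults to ["q", "v"] for T5
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only LoRA weights are trainable
```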
Observation: Scenario 2 resulted in an absolute ~4% lower RougeL compared to Scenario 1. I have also tried tuning the hyper-parameters in Scenario 2 across the ranges specified above; however, the best run still leaves a gap of ~4% RougeL.
Thank you very much for your time and consideration. Looking forward to any relevant insights here.