Can't reproduce the result of "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks" #25

Closed
CrazyElements opened this issue Mar 16, 2024 · 7 comments

CrazyElements commented Mar 16, 2024

Has anyone successfully replicated the results of the fine-tuning tasks?
I followed the hyperparameters outlined in the README and the paper, and tried the cola and mrpc tasks on a single GPU without gradient accumulation. However, the results I obtained differ from those reported in the paper.
Here are the best performances from my runs (the numbers in parentheses are the results from the paper):

  • mrpc: 0.8971 (92.25)
  • cola: 0.6274 (0.6035)

I would appreciate any insights on this matter.
jiaweizzhao (Owner) commented Mar 16, 2024

I am happy to help, @CrazyElements.
Can you provide the full training script and hyperparameters you are using?
You can also join our Slack for a quick discussion.

CrazyElements (Author) commented Mar 17, 2024

Thanks @jiaweizzhao. For mrpc, I just used the hyperparameters listed in the README.

python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc

For the other tasks, I only modified learning_rate and num_train_epochs, and I trained with run_glue.py as well.

jiaweizzhao (Owner) commented
I tried it and it works as expected. The issue might be that we report the F1 score for mrpc in the paper, which causes the confusion. I will change it back to accuracy in the new revision.

CrazyElements (Author) commented Mar 18, 2024

Thank you for your response, but I'm still unable to replicate the results. The final F1 score for mrpc is 91.93, and the matthews_correlation for cola is 59.6.
By the way, did you use the eval-set results from the last epoch as the final outcomes? The results I mentioned above were extracted from all_results.json, which actually corresponds to the eval-set results of the last epoch.
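For reference, this is how I read those metrics (a minimal sketch; the mrpc path matches my --output_dir above, and the cola path is just an assumption about my own directory layout):

# Pull the last-epoch eval metrics that run_glue.py saves into --output_dir.
python -m json.tool results/ft/roberta_base/mrpc/all_results.json | grep '"eval_'
python -m json.tool results/ft/roberta_base/cola/all_results.json | grep '"eval_'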

jiaweizzhao (Owner) commented
This might be due to the choice of the random seed. I did a quick sweep using my previous setup (based on the config you provided):
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 16 \
    --per_device_train_batch_size 32 \
    --update_proj_gap 500 \
    --learning_rate 2e-5 \
    --num_train_epochs 20 \
    --output_dir results/ft/roberta_base/mrpc
This gives {"eval_accuracy": 0.8970588235294118, "eval_f1": 0.925531914893617}

CrazyElements (Author) commented Mar 18, 2024

This might be due to the choice of the random seed

So I wonder if you used a different seed (not 1234)? I may have mistakenly assumed that the example script in the README would yield the same results. If you did use different seeds, would you consider open-sourcing the fine-tuning scripts?

--galore_scale 16 \ --per_device_train_batch_size 32 \

For these two flags, I used the hyperparameters listed in Table 7 of the paper.

jiaweizzhao (Owner) commented
We use the average score of repeated runs. We will release the fine-tuning scripts later, along with a few more fine-tuning experiments.
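Roughly, such a repeated-run sweep looks like the sketch below (the seed values are placeholders, not the exact ones we used; the other flags are from the command above):

# Sketch: repeat the run with different seeds, then average the scores.
# The seeds here are placeholders, not the ones used for the paper.
for seed in 1234 42 2024; do
    python run_glue.py \
        --model_name_or_path roberta-base \
        --task_name mrpc \
        --enable_galore \
        --lora_all_modules \
        --max_length 512 \
        --seed=$seed \
        --lora_r 4 \
        --galore_scale 16 \
        --per_device_train_batch_size 32 \
        --update_proj_gap 500 \
        --learning_rate 2e-5 \
        --num_train_epochs 20 \
        --output_dir results/ft/roberta_base/mrpc_seed$seed
done
# Average eval_f1 (or eval_accuracy) across the runs' all_results.json files.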
