Can't reproduce the result of "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks" #25

Closed
CrazyElements opened this issue Mar 16, 2024 · 7 comments

CrazyElements commented Mar 16, 2024

Has anyone successfully replicated the results of the fine-tuning tasks?
I followed the hyperparameters outlined in the README and the paper, and tried the cola and mrpc tasks on a single GPU without gradient accumulation. However, the results I obtained differ from those reported in the paper.
Here are the best performances from my runs (the numbers in parentheses are the results from the paper):

  • mrpc: 0.8971 (92.25)
  • cola: 0.6274 (0.6035)

I would appreciate any insights on this matter.
jiaweizzhao (Owner) commented Mar 16, 2024

I am happy to help, @CrazyElements.
Can you provide the full training script and hyperparameters you are using?
You can also join our Slack for a quick discussion.

CrazyElements (Author) commented Mar 17, 2024

Thanks @jiaweizzhao. For mrpc, I just used the hyperparameters listed in the README.

python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc

For the other tasks, I only modified learning_rate and num_train_epochs, and I trained with run_glue.py as well.

jiaweizzhao (Owner) commented
I tried it and it works as expected. The issue might be that we report the F1 score for mrpc in the paper, which causes the confusion. I will change it back to accuracy in the new revision.

CrazyElements (Author) commented Mar 18, 2024

Thank you for your response, but I'm still unable to replicate the results. The final F1 score for mrpc is 91.93, and the matthews_correlation for cola is 59.6.
By the way, did you use the eval-set results from the last epoch as the final outcomes? The results I mentioned above were extracted from all_results.json, which actually corresponds to the eval-set results of the last epoch.
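For reference, this is how I read those metrics (a minimal sketch; the mrpc path matches my --output_dir above, and the cola path is just an assumption about my own directory layout):

# Pull the last-epoch eval metrics that run_glue.py saves into --output_dir.
python -m json.tool results/ft/roberta_base/mrpc/all_results.json | grep '"eval_'
python -m json.tool results/ft/roberta_base/cola/all_results.json | grep '"eval_'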

jiaweizzhao (Owner) commented
This might be due to the choice of the random seed. I did a quick sweep using my previous setup (based on the config you provided):
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 16 \
    --per_device_train_batch_size 32 \
    --update_proj_gap 500 \
    --learning_rate 2e-5 \
    --num_train_epochs 20 \
    --output_dir results/ft/roberta_base/mrpc
This gives {"eval_accuracy": 0.8970588235294118, "eval_f1": 0.925531914893617}

CrazyElements (Author) commented Mar 18, 2024

This might be due to the choice of the random seed

So I wonder if you used a different seed (not 1234)? I may have mistakenly assumed that the example script in the README would yield the same results. If you did use different seeds, would you consider open-sourcing the fine-tuning scripts?

--galore_scale 16 \ --per_device_train_batch_size 32 \

For these two flags, I used the hyperparameters listed in Table 7 of the paper.

jiaweizzhao (Owner) commented
We use the average score of repeated runs. We will release the fine-tuning scripts later, along with a few more fine-tuning experiments.
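Roughly, such a repeated-run sweep looks like the sketch below (the seed values are placeholders, not the exact ones we used; the other flags are from the command above):

# Sketch: repeat the run with different seeds, then average the scores.
# The seeds here are placeholders, not the ones used for the paper.
for seed in 1234 42 2024; do
    python run_glue.py \
        --model_name_or_path roberta-base \
        --task_name mrpc \
        --enable_galore \
        --lora_all_modules \
        --max_length 512 \
        --seed=$seed \
        --lora_r 4 \
        --galore_scale 16 \
        --per_device_train_batch_size 32 \
        --update_proj_gap 500 \
        --learning_rate 2e-5 \
        --num_train_epochs 20 \
        --output_dir results/ft/roberta_base/mrpc_seed$seed
done
# Average eval_f1 (or eval_accuracy) across the runs' all_results.json files.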
