
Albert Hyperparameters for Fine-tuning SQuAD 2.0 #1974

Closed
ahotrod opened this issue Nov 27, 2019 · 4 comments
@ahotrod
Contributor

ahotrod commented Nov 27, 2019

❓ Questions & Help

I want to fine-tune albert-xxlarge-v1 on SQuAD 2.0 and need good hyperparameters. The original Albert paper offers no discussion of suggested fine-tuning hyperparameters, unlike the original XLNet paper. I did find the following hard-coded defaults in the Google-research Albert run_squad_sp.py code:

'do_lower_case' = True
'train_batch_size' = 32
'predict_batch_size' = 8
'learning_rate' = 5e-5
'num_train_epochs' = 3.0
'warmup_proportion' = 0.1
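For context, here is a minimal sketch of how those defaults might map onto the flags that Transformers' run_squad.py exposes. The 2-GPU split, per-GPU batch size, and gradient-accumulation factor are assumptions chosen only to reach the same effective batch size of 32; they are not values taken from either script.

```python
# Sketch only: mapping the Google-research defaults onto Transformers' run_squad.py
# arguments for a 2-GPU machine (per-GPU batch size and accumulation are assumptions).
google_defaults = {
    "do_lower_case": True,
    "train_batch_size": 32,
    "predict_batch_size": 8,
    "learning_rate": 5e-5,
    "num_train_epochs": 3.0,
    "warmup_proportion": 0.1,
}

n_gpus = 2          # assumed hardware
per_gpu_batch = 2   # assumed per-GPU capacity for albert-xxlarge-v1 at max_seq_length 512
accum = google_defaults["train_batch_size"] // (n_gpus * per_gpu_batch)  # -> 8

args = [
    "--model_type", "albert",
    "--model_name_or_path", "albert-xxlarge-v1",
    "--do_lower_case",
    "--version_2_with_negative",
    "--max_seq_length", "512",
    "--per_gpu_train_batch_size", str(per_gpu_batch),
    "--gradient_accumulation_steps", str(accum),   # 2 GPUs * 2 * 8 = effective batch of 32
    "--learning_rate", str(google_defaults["learning_rate"]),
    "--num_train_epochs", str(google_defaults["num_train_epochs"]),
    # warmup_proportion has no direct flag; see the --warmup_steps arithmetic below
]
print("python run_squad.py " + " ".join(args))
```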

With fine-tuning on my 2x GPUs taking ~69 hours, I'd like to shrink the number of fine-tuning iterations necessary to attain optimal model performance. Anyone have a bead on the optimal hyperparameters?

Also, the Google-research comments in run_squad_sp.py describe warmup_proportion as the "Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training." Since 3 epochs at batch size 32 on SQuAD 2.0 works out to approximately 12.5K total optimization steps, would I set --warmup_steps = 1250 when calling Transformers' run_squad.py?
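For what it's worth, the back-of-the-envelope arithmetic behind that 12.5K figure looks like this. This is only a sketch: run_squad.py actually steps over tokenized features rather than raw questions, so the true step count from your max_seq_length/doc_stride settings is usually somewhat higher.

```python
import math

train_examples = 130_319   # SQuAD 2.0 training questions (published dataset statistics)
batch_size = 32
epochs = 3
warmup_proportion = 0.1

steps_per_epoch = math.ceil(train_examples / batch_size)   # ~4,073
total_steps = steps_per_epoch * epochs                     # ~12,219, i.e. the ~12.5K above
warmup_steps = int(warmup_proportion * total_steps)        # ~1,222

print(total_steps, warmup_steps)   # so --warmup_steps 1250 is in the right ballpark
```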

Thanks in advance for any input.

@frankfka

Wondering this as well, but for GLUE tasks. There doesn't seem to be a consensus on hyperparameters such as weight decay.
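On the weight-decay point specifically, the convention the Transformers example scripts follow is to exempt biases and LayerNorm weights from decay. A minimal sketch with PyTorch's AdamW; the 0.01 decay value is an assumption, which is exactly the knob without a clear consensus:

```python
import torch
from transformers import AlbertForSequenceClassification

# Small model purely for illustration; the same grouping applies to albert-xxlarge-v1.
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2")

# Convention: no weight decay on biases or LayerNorm weights.
no_decay = ("bias", "LayerNorm.weight")
grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},   # assumed value
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(grouped_parameters, lr=5e-5)
```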

@ahotrod
Contributor Author

ahotrod commented Dec 7, 2019

Results using hyperparameters from my first post above, varying only batch size:

albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}

albert_xxlargev1_squad2_512_bs48:
{
  "exact": 83.65198349195654,
  "f1": 87.4736247587816,
  "total": 11873,
  "HasAns_exact": 81.73076923076923,
  "HasAns_f1": 89.38501126197984,
  "HasAns_total": 5928,
  "NoAns_exact": 85.5677039529016,
  "NoAns_f1": 85.5677039529016,
  "NoAns_total": 5945
}

[Training curves: learning rate schedule and loss]
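Since the two runs differ by only a few hundredths of a point on most metrics, a quick delta print is easier to scan than the raw dictionaries. A throwaway sketch; the compare_runs helper and file names are hypothetical, not part of any script in the repo:

```python
import json

def compare_runs(path_a, path_b, keys=("exact", "f1", "HasAns_f1", "NoAns_f1")):
    """Print metric deltas (run B minus run A) between two SQuAD 2.0 eval JSON files."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    for k in keys:
        print(f"{k:>12}: {a[k]:7.3f} -> {b[k]:7.3f}   (delta {b[k] - a[k]:+.3f})")

# Hypothetical file names for the two runs above:
compare_runs("albert_xxlargev1_squad2_512_bs32.json",
             "albert_xxlargev1_squad2_512_bs48.json")
```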

@fgksgf

fgksgf commented Dec 20, 2019

@ahotrod There is a table in the appendix section of the ALBERT paper, which shows hyperparameters for ALBERT in downstream tasks:
[Image: hyperparameter table from the ALBERT paper appendix]

@stale

stale bot commented Feb 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 18, 2020
@stale stale bot closed this as completed Feb 25, 2020