Question regarding release of best current model #36
Comments
Hi, have you been fine-tuning XLM-R, or which model have you been fine-tuning that achieves this comparatively low performance on PAWS-X? We are not currently planning to release fine-tuned models, as this would mean releasing one model for each task. Even though fine-tuned models may be helpful to some extent for certain downstream tasks (see for instance this recent paper), we believe that the original pre-trained models should generally be used as the starting point for further experiments.
Hi @sebastianruder, thank you for the quick response.

Fine-tuning instances

As I only started fine-tuning recently, I have managed to test just a few instances. Essentially, I set up this repo (as per the readme) and ran the following commands, with the corresponding best (train/dev/test) results:
Do you think these results were due to an "unlucky" initial configuration, and that I should re-run these commands a few more times? I believe there is still some stochasticity here with the optimizer, despite the model being loaded from a fixed initial checkpoint.
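For context, a minimal sketch of where that run-to-run variance comes from and how fixing seeds makes reruns comparable. This is an assumed setup, not the repo's actual training script; the model name and loop are illustrative.

```python
# Minimal sketch (assumed setup, not the repo's training script):
# even with a fixed pre-trained checkpoint, the new classification head is
# randomly initialised, dropout is stochastic, and data shuffling differs
# between runs. Fixing the seed before building the model makes reruns
# comparable, and averaging dev accuracy over several seeds helps separate
# an "unlucky" run from a systematic problem in the setup.
from transformers import XLMRobertaForSequenceClassification, set_seed

dev_accuracies = []
for seed in (1, 2, 3):  # rerun the same configuration with different seeds
    set_seed(seed)      # seeds Python, NumPy, and PyTorch RNGs
    model = XLMRobertaForSequenceClassification.from_pretrained(
        "xlm-roberta-base",  # or xlm-roberta-large, depending on the experiment
        num_labels=2,        # PAWS-X is binary paraphrase classification
    )
    # ... fine-tune on PAWS-X train and evaluate on the dev set here ...
    # dev_accuracies.append(dev_acc)
```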
Fine-tuning models

I understand. Hmm, would it be possible to share the hyperparameters and training arguments that were used to fine-tune the best-performing model so far for PAWS-X (e.g. for those listed on the leaderboard)? I could perhaps try to reproduce the best model with those values.
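In the absence of the actual leaderboard configuration, a sketch of hyperparameter values commonly reported for XLM-R sentence-pair fine-tuning is shown below; every value here is an assumption meant only as a starting point for reproduction attempts, not the maintainers' settings.

```python
# Not the leaderboard configuration (which has not been published in this
# thread); just commonly used values for XLM-R sentence-pair fine-tuning.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlmr-pawsx",         # hypothetical output directory
    learning_rate=2e-5,              # XLM-R is usually fine-tuned with 1e-5 to 3e-5
    per_device_train_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,                # linear warmup over the first 10% of steps
    evaluation_strategy="epoch",     # evaluate on the dev set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the best dev score
    seed=42,
)
```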
Just to add more information for this issue: I ran
Observations and questions

Is there a reason why the first few checkpoints (above) all show an accuracy of 1.0?
Hi @atreyasha, the testing score is around 0.5 because we remove the true class label for each example in the test set and use a fake placeholder ("0") as the label for all the testing examples. This is because we want to encourage users to generate their predictions and submit them to our XTREME benchmark; removing the labels is intentional, in order to avoid trivial submissions. At the beginning of training, the model is not yet well trained and predicts all zeros for all the examples, which is why you observe a testing accuracy of 1 at the beginning. If you want to select the best system for PAWS-X, you can refer to the dev set accuracy, which is computed against the true labels, and we also find that the scores on the dev and test sets are well correlated.
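A toy illustration of the effect described above (not code from the repo, and the counts are made up): with every gold test label replaced by the placeholder "0", an untrained model that predicts all zeros matches every placeholder, while a trained model that predicts both classes matches only about half of them.

```python
# Toy illustration (not code from the repo) of why the reported *test*
# accuracy starts at 1.0 and settles near 0.5 when the gold labels have
# been replaced by the placeholder "0".
placeholder_labels = [0] * 1000   # fake labels shipped with the test set

untrained_preds = [0] * 1000      # an untrained classifier collapses to one class
trained_preds = [0, 1] * 500      # a trained model predicts both classes

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(accuracy(untrained_preds, placeholder_labels))  # 1.0 -> early checkpoints
print(accuracy(trained_preds, placeholder_labels))    # 0.5 -> later checkpoints
```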
Hi @JunjieHu. Thank you, this makes sense now. Closing the issue.
Hello Google Research Team,
Thank you for this awesome repo and for the baseline code. As part of a downstream task in machine translation, I require a well-performing model on the PAWS-X dataset. I have been attempting to fine-tune some models using the code here, but my test accuracies on PAWS-X are still in the mid-50s.
I was wondering when the current best-performing XLM-R model would be released for downstream use?
Thank you.