
Possible error in structure prediction tasks #56

Closed

tonytan48 opened this issue Jan 4, 2021 · 2 comments

Comments

@tonytan48
Contributor

tonytan48 commented Jan 4, 2021

Hi xtreme team,
Thank you for your work on proposing the leaderboard. However, the evaluation metric reported for UDPOS seems inconsistent with the current code release. According to Table 20 (POS results) of the paper https://arxiv.org/pdf/2003.11080.pdf, the evaluation metric for POS is accuracy, and the average result for XLM-R is 73.8. However, third_party/run_tag.py only imports F1-related measurements from seqeval, so the default evaluation for UDPOS is actually an F1 score.

I reproduced the UDPOS experiment and applied both measurements on the test set (sorry, I used the leaked test set on my local machine for quicker evaluation). With the default script and XLM-R large, I get an average F1 of 74.2, which is in line with the reported 73.8; for English, the F1 score is 96.15. If I instead evaluate with accuracy, I get 96.7 for English and 78.23 on average. Hence I suspect the UDPOS numbers on the leaderboard and in the paper are actually F1 scores. Could you help address this issue? My reproduced results are here: https://docs.google.com/spreadsheets/d/16Cv0IIdZGOyx6xUawcKScb38Cl3ofy0tHJSdWrt07LI/edit?usp=sharing
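For illustration, here is a minimal sketch (with made-up tag sequences, not from the actual data) of how the two measurements can diverge: seqeval's f1_score was designed for BIO-style NER tags and scores spans of consecutive identical labels, while POS accuracy is a simple per-token match rate.

```python
# Minimal sketch contrasting span-based seqeval F1 with token-level accuracy.
# The tag sequences below are made up for illustration only.
from seqeval.metrics import f1_score

gold = [["NOUN", "NOUN", "VERB"]]
pred = [["NOUN", "VERB", "VERB"]]

# Token-level accuracy: 2 of 3 tokens match.
correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
total = sum(len(gs) for gs in gold)
print("accuracy:", correct / total)   # ~0.67

# seqeval F1 (what run_tag.py computes by default): consecutive identical
# labels are merged into spans, and neither gold span ("NOUN" over tokens 0-1,
# "VERB" over token 2) is predicted exactly, so the F1 here is 0.0.
print("seqeval f1:", f1_score(gold, pred))
```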

@sebastianruder
Collaborator

Thanks for flagging this, @tonytan48. @JunjieHu, could you take a look?

@tonytan48
Contributor Author

@sebastianruder Thanks for the prompt reply. I noticed that in the main table (Table 2) the metric for POS is F1, so maybe it is just a typo in Table 20. Out of curiosity: previous work mostly evaluates POS with accuracy, so is there some intuition behind using F1?
