Hi xtreme team,
Thank you for your work on the benchmark and leaderboard. However, the evaluation metric reported for UDPOS seems inconsistent with the current code release. According to the Table 20 POS accuracy results in the paper https://arxiv.org/pdf/2003.11080.pdf, the evaluation metric for POS is accuracy, and the average result for XLM-R is 73.8. However, third_party/run_tag.py only imports F1-related measurements from seqeval, so the default evaluation for UDPOS is actually F1. I reproduced the UDPOS experiment and computed both metrics on the test set (sorry, I used the leaked test set on my local machine for quicker evaluation). With the default script and XLM-R large, I get an average F1 of 74.2, which is in line with the reported 73.8; for English, the F1 is 96.15. However, if I evaluate with accuracy, I get 96.7 for English and 78.23 on average. Hence I suspect the scores on the leaderboard and in the paper for UDPOS are actually F1. Could you help address this issue? My reproduced results are here: https://docs.google.com/spreadsheets/d/16Cv0IIdZGOyx6xUawcKScb38Cl3ofy0tHJSdWrt07LI/edit?usp=sharing
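For reference, here is a minimal sketch (not the exact XTREME evaluation code) of the comparison I ran, assuming seqeval is installed. The toy tag sequences are made up, and exact F1 values depend on how seqeval chunks plain (non-BIO) POS tags; the point is only that the two metrics can diverge on the same predictions.

```python
# Minimal sketch contrasting seqeval F1 (the default in third_party/run_tag.py)
# with plain token-level accuracy on toy UDPOS-style data.
from seqeval.metrics import f1_score

gold = [["DET", "NOUN", "VERB", "ADJ", "PUNCT"],
        ["PRON", "VERB", "DET", "NOUN", "PUNCT"]]
pred = [["DET", "NOUN", "VERB", "ADV", "PUNCT"],
        ["PRON", "VERB", "DET", "NOUN", "PUNCT"]]

# seqeval F1, as computed by the released evaluation script
print("seqeval F1:", f1_score(gold, pred))

# plain token-level accuracy, as (apparently) described in Table 20 of the paper
correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
total = sum(len(gs) for gs in gold)
print("token accuracy:", correct / total)
```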
@sebastianruder Thanks for the prompt reply. I noticed that in the main results (Table 2) the metric for POS is F1, so maybe it's just a typo in Table 20. Out of curiosity: previous work mostly reports accuracy for POS tagging. Is there a particular intuition behind using F1 instead?