Accuracy of baseline values #13

Closed
MinionAttack opened this issue Oct 28, 2021 · 4 comments

MinionAttack commented Oct 28, 2021

Hi,

I am using the latest code available in the repo and have trained both baselines (graph parser and sequence labeling). Then I measured the Sentiment Tuple F1 on the dev.json file and got these values:

| Dataset          | Graph Parsing | Sequence Labeling |
|------------------|---------------|-------------------|
| Darmstadt Unis   | 0.077         | 0.107             |
| MPQA             | 0.141         | 0.016             |
| Multibooked (CA) | 0.534         | 0.286             |
| Multibooked (EU) | 0.545         | 0.372             |
| Norec            | 0.296         | 0.190             |
| Opener (EN)      | 0.546         | 0.339             |
| Opener (ES)      | 0.536         | 0.328             |
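
A simplified sketch of what such a tuple-level F1 computation might look like (exact match only, assuming the datasets' JSON layout of sentences with a sent_id and opinions holding Source/Target/Polar_expression offsets plus Polarity; this is not necessarily the repo's official scorer, which may weight partial span overlaps differently):

```python
# Hypothetical, simplified tuple-level F1 (exact match only).
# Assumes each JSON file is a list of sentences, each with "sent_id" and
# "opinions" whose Source/Target/Polar_expression are [texts, offsets] pairs.
import json

def extract_tuples(sentence):
    """Collect (holder, target, expression, polarity) tuples for one sentence."""
    tuples = set()
    for opinion in sentence.get("opinions", []):
        holder = tuple(opinion["Source"][1])           # character offsets
        target = tuple(opinion["Target"][1])
        expression = tuple(opinion["Polar_expression"][1])
        tuples.add((holder, target, expression, opinion["Polarity"]))
    return tuples

def tuple_f1(gold_file, pred_file):
    with open(gold_file) as g, open(pred_file) as p:
        gold = {s["sent_id"]: extract_tuples(s) for s in json.load(g)}
        pred = {s["sent_id"]: extract_tuples(s) for s in json.load(p)}
    tp = sum(len(gold[i] & pred.get(i, set())) for i in gold)
    n_pred = sum(len(t) for t in pred.values())
    n_gold = sum(len(t) for t in gold.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(tuple_f1("dev.json", "predictions.json"))
```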

For Graph Parsing, the values are around 0.5xx except for Darmstadt Unis, MPQA and Norec, which are lower, especially Darmstadt Unis.
For Sequence Labeling, the values are around 0.3xx except for Darmstadt Unis, MPQA and Norec, which are lower, especially MPQA.

Are these values reasonable to take as a reference, or am I doing something wrong when training the models or when running inference to get the scores?

Regards.

@jerbarnes
Owner

Hi Iago,

The values look pretty normal, except for Darmstadt with the Graph Parsing approach. The OpeNER/Multibooked datasets are right where I would expect, Norec is always a bit lower because it's a more diverse dataset, and MPQA is quite hard because of the ambiguity of many of the polar expressions and the size of the holders and targets. But Darmstadt using the graph parser should score higher than MPQA.

@MinionAttack
Author

Thanks for the clarification. Could the issue be caused by this change? #9

@jerbarnes
Owner

I doubt it. Those issues aren't common enough to cause a large drop in performance, and the original paper we take the baselines from (https://aclanthology.org/2021.acl-long.263/) used the same data. Perhaps it's the effect of a particularly poor random seed, as there is a bit of variance (±2.0 in the paper)?
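
If you want to rule the seed out, one option is to fix all seeds explicitly and retrain a couple of times. A minimal sketch, assuming a PyTorch-based training loop (the baseline scripts may already expose a seed option of their own):

```python
# Minimal seed-fixing sketch for comparable reruns (assumes PyTorch/NumPy;
# adapt to however the baseline training scripts accept a seed).
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Retrain with a handful of seeds and compare the dev tuple F1; the paper
# reports roughly ±2.0 F1 of seed-to-seed variance.
for seed in (101, 202, 303):
    set_seed(seed)
    # ... train and evaluate here ...
```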

@MinionAttack
Author

Ah, OK, I'll try training it again. Thank you.
