ToDo CLEF scorer #1
@simon-clematide @aflueckiger @mromanello I was just wondering: columns are treated separately when computing the scores, which is normal, but for the 'fine-grained' setting, wouldn't it make sense to have a score that considers the fine, components and nested columns together? Or at least fine + components together, to capture the fine case? |
That's worth thinking about. For the components case, it makes sense to take into account what a label is a component of. One way to think about it could be that the components are always considered with the fine-grained type attached. As if
would actually be
Meaning: the components always have the fine-grained type attached to them. |
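To make the idea concrete, here is a minimal sketch of attaching the fine-grained type to the component column before evaluation. The function name and the label scheme details are hypothetical, not the scorer's actual code:

```python
def attach_fine_type(fine_tags, comp_tags):
    """Combine the per-token fine-grained and component columns into
    composite labels, so that a component is always evaluated together
    with the fine-grained type it belongs to (e.g. a B-comp.name token
    inside a B-pers.ind entity becomes B-pers.ind.comp.name)."""
    combined = []
    for fine, comp in zip(fine_tags, comp_tags):
        if comp == "O":
            # no component here: fall back to the plain fine-grained tag
            combined.append(fine)
        else:
            prefix, comp_type = comp.split("-", 1)
            fine_type = fine.split("-", 1)[1] if fine != "O" else "UNK"
            combined.append(f"{prefix}-{fine_type}.{comp_type}")
    return combined
```

With composite labels like these, the existing column-wise scorer could evaluate fine + components in one pass without special-casing.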
Super convenient, that would definitely ease the eval script. I think it makes sense to consider fine type + components for the NERC-fine task. Summarizing the eval settings for NERC, there would be:
In terms of concrete output (though you might have discussed/checked this already), we need to think about how to communicate results, since all of this has to fit neatly somewhere. What about one table (csv) per bundle and per team? Happy to sketch it if needed. |
Evaluation sample: the script produces the following tsv output when evaluating the coarse format, covering all regimes.
(shortened) |
@e-maud @mromanello @simon-clematide
Following Batista's definition of FP/FN, the example would result in 1 FN and 1 FP. Unfortunately, the strict scenario severely punishes wrong boundaries. Moreover, we reward systems that miss entities over systems that predict the wrong boundaries. Do we really want to follow this? Source: http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/ On the upside, I am glad that we are doing this, as there are severe miscalculations in the original code. 🙄 |
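For reference, a minimal sketch of how the strict and fuzzy regimes could count TP/FP/FN on BIO columns. This is my own reconstruction, not the scorer's actual code; the fuzzy regime here matches each gold span at most once, by token overlap and matching type:

```python
def extract_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence.
    A stray I- tag after O (or a type change) opens a new span."""
    spans = []
    start = etype = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel O flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:                      # the current span (if any) ends here
            if start is not None:
                spans.append((start, i - 1, etype))
                start = etype = None
            if tag != "O":                  # B-x, or a stray I-x, opens a span
                start, etype = i, tag[2:]
    return spans

def count_errors(pred, gold, strict=True):
    """Return (tp, fp, fn). Strict requires exact boundaries and type;
    fuzzy accepts any token overlap with a same-typed, unmatched gold span."""
    unmatched = list(gold)
    tp = fp = 0
    for ps, pe, pt in pred:
        match = None
        for gs, ge, gt in unmatched:
            exact = (gs, ge) == (ps, pe)
            overlap = gs <= pe and ps <= ge
            if gt == pt and (exact if strict else overlap):
                match = (gs, ge, gt)
                break
        if match is not None:
            tp += 1
            unmatched.remove(match)
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

On the artificial example from this thread it reproduces the counts discussed above:

```python
pred = extract_spans(["B-pers.ind", "I-pers.ind", "B-pers.ind", "I-pers.ind"])
gold = extract_spans(["O", "B-pers.ind", "I-pers.ind", "O"])
count_errors(pred, gold, strict=True)   # -> (0, 2, 1): two FP plus one FN
count_errors(pred, gold, strict=False)  # -> (1, 1, 0): one overlap TP, one FP
```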
Can you check what the old conll eval does here?
Sent from my iPhone
… On 10.02.2020 at 15:41, aflueckiger ***@***.***> wrote:
@e-maud @mromanello @simon-clematide
I am currently writing the unit tests for the evaluation. I want to ask a quick question to make sure we are all on the same page concerning FP/FN. Consider the artificial example:
TOKEN     PRED        GOLD
.         B-pers.ind  O
Herr      I-pers.ind  B-pers.ind
Pasitsch  B-pers.ind  I-pers.ind
er        I-pers.ind  O
In the fuzzy scenario, this would lead to a single FP, as one predicted entity overlaps with the gold one and the other is spurious (over-generated).
In the strict scenario, this would lead to a threefold error: both predictions are wrong, resulting in two FP, and the correct entity is missing, yielding another FN. In short, systems that over-generate entities while missing the correct ones are punished severely.
Moreover, I am glad that we do this as there are severe miscalculations in the original code. 🙄
I think the CoNLL 2003 shared task evaluated similarly:
paper: https://www.aclweb.org/anthology/W03-0419.pdf I could not find the evaluation script, so I am not entirely sure. |
@e-maud @simon-clematide @mromanello Ok, we are getting to the bottom of our dive for fishy metrics. 😆 CoNLL-2000 also punishes wrong boundaries twice: one FP and one FN. I think this is suboptimal, as conservative systems that predict nothing are better off than systems that predict entities even in cases of unclear boundaries. Nevertheless, I suggest following this standard. Yet we need to keep this in mind when evaluating the systems. We could also draw participants' attention to this peculiarity in the README of the scorer. What's your take? PS: our numbers are in line with the CoNLL-2000 standard. |
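A small numeric sketch of why double punishment favors conservative systems (illustrative numbers of my own, not thread data). Two systems face 10 gold entities: one simply misses 2 of them, the other attempts all 10 but gets 2 boundaries wrong, which costs both an FP and an FN per error:

```python
def f1(tp, fp, fn):
    """Micro precision/recall/F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Conservative system: misses 2 of 10 entities, predicts nothing wrong.
f1(tp=8, fp=0, fn=2)  # -> ~0.889 (P = 1.0, R = 0.8)

# Bolder system: finds all 10, but 2 have wrong boundaries (each counted
# as one FP for the wrong span plus one FN for the missed gold span).
f1(tp=8, fp=2, fn=2)  # -> 0.800 (P = 0.8, R = 0.8)
```

Identical recall, but the system that at least located the entities scores lower, which is the peculiarity worth flagging in the README.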
Many thanks @aflueckiger for this deep dive! |
Simon also shares this view. Thus, we keep the double punishment.
can be closed |
decisions
--> compare F1-score with the CoNLL script as a sanity check
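A sketch of that sanity check, assuming the classic conlleval input format (one "token gold pred" line per token, blank line between sentences); the function name is hypothetical:

```python
def write_conlleval_input(sentences, path):
    """Dump sentences of (token, gold, pred) triples as 'token gold pred'
    lines, with a blank line between sentences -- the input format of the
    classic CoNLL evaluation script, so its F1 can be compared with ours."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            for token, gold, pred in sent:
                f.write(f"{token} {gold} {pred}\n")
            f.write("\n")
```

The resulting file can then be piped into the CoNLL script (e.g. `conlleval < file`) and its FB1 compared against our scorer's strict F1.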
programming