
ToDo CLEF scorer #1

Closed
11 of 12 tasks
aflueckiger opened this issue Jan 20, 2020 · 11 comments

@aflueckiger
Collaborator

aflueckiger commented Jan 20, 2020

decisions

  • document-level F1-score to weight long/short docs equally
  • treat nested entities (columns) individually
  • treat every column separately; NEL & NERC & components
  • no fuzziness on the type, only for the boundary (one token needs to match); see the matching sketch below
  • no macro_type in TSV

--> compare F1-score with the CoNLL script as a sanity check
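A minimal sketch of the exact vs. fuzzy matching regimes decided above; the span representation (type, start, end) with token offsets and the function names are assumptions for illustration, not the actual scorer.

# Minimal sketch of exact vs. fuzzy span matching; spans are
# (type, start, end) with an exclusive end offset (assumed here).

def exact_match(pred, gold):
    # Exact regime: type and both boundaries must agree.
    return pred == gold

def fuzzy_match(pred, gold):
    # Fuzzy regime: the type must still match exactly, but it suffices
    # that at least one token of the prediction overlaps the gold span.
    pred_type, pred_start, pred_end = pred
    gold_type, gold_start, gold_end = gold
    if pred_type != gold_type:
        return False
    return min(pred_end, gold_end) - max(pred_start, gold_start) >= 1

# Boundaries differ by one token: exact fails, fuzzy succeeds.
assert not exact_match(("loc.adm.town", 3, 6), ("loc.adm.town", 3, 5))
assert fuzzy_match(("loc.adm.town", 3, 6), ("loc.adm.town", 3, 5))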

programming

  • implement script scaffolding
  • output relevant metrics as tsv
  • output all metrics as json
  • implement document-level F1 macro score with std. dev. as a stability measure (see the sketch after this list)
  • unit test for complicated cases (double hits including TP and FP)
  • include original numbers TP / FP / FN
  • check the probably erroneous F1-score formula for partial matches in the official definition
  • add an argument to glue comp and fine labels: ToDo CLEF scorer #1 (comment)
  • sanity check whether the provided labels are all available in the gold standard
  • add the system name as a separate column, extracted from the filename
  • check how to make a union of two columns (or lists) to evaluate against gold standard
  • implement slot error rate
    • also per type
    • compute F1/P/R
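A rough sketch of the document-level macro F1 with standard deviation as a stability measure, as listed above; the per-document (TP, FP, FN) counts are assumed inputs, not the actual scorer interface.

# Rough sketch: document-level macro F1 with std. dev. as a stability
# measure. Per-document (tp, fp, fn) counts are assumed inputs.
import statistics

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def doc_macro_f1(doc_counts):
    # doc_counts: one (tp, fp, fn) tuple per document. The unweighted
    # mean over documents weights long and short documents equally.
    scores = [f1(tp, fp, fn) for tp, fp, fn in doc_counts]
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# e.g. three documents of different sizes
print(doc_macro_f1([(10, 2, 1), (3, 0, 0), (1, 1, 2)]))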
@e-maud
Contributor

e-maud commented Jan 31, 2020

@simon-clematide @aflueckiger @mromanello I was just wondering:

Columns are treated separately to compute the metrics, which is fine, but for the 'fine-grained' setting, wouldn't it make sense to have a score that considers the fine, component, and nested columns together? Or at least fine + components together, to capture the fine case?
Then there would be the question of how to handle fuzzy vs. exact, and it might multiply the evaluation cases too much...

@simon-clematide
Contributor

That's worth thinking about. For the components case, it makes sense to take into account what it is a component of. One way to think about it could be that the components are always thought of with the fine-grained type attached. As if

# segment_iiif_link = _
M	B-pers	O	B-pers.ind	O	B-comp.title	O	Q2853810	_	NoSpaceAfter
.	I-pers	O	I-pers.ind	O	I-comp.title	O	Q2853810	_	_
Thibaudeau	I-pers	O	I-pers.ind	O	B-comp.name	O	Q2853810	_	_

would actually be

# segment_iiif_link = _
M	B-pers	O	B-pers.ind	O	B-pers.ind.comp.title	O	Q2853810	_	NoSpaceAfter
.	I-pers	O	I-pers.ind	O	I-pers.ind.comp.title	O	Q2853810	_	_
Thibaudeau	I-pers	O	I-pers.ind	O	B-pers.ind.comp.name	O	Q2853810	_	_

Meaning: the components always have the fine-grained type attached to them.
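A possible sketch of this gluing step; the function name and the label parsing are assumptions for illustration, not the actual scorer code.

# Possible sketch of gluing the fine-grained type onto component labels.
# Label parsing is an assumption for illustration, not the actual scorer.

def glue_fine_and_comp(fine_label, comp_label):
    # Combine e.g. fine 'B-pers.ind' + comp 'B-comp.title'
    # into 'B-pers.ind.comp.title'; 'O' components stay 'O'.
    if comp_label == "O":
        return "O"
    comp_prefix, comp_type = comp_label.split("-", 1)  # e.g. 'B', 'comp.title'
    _, fine_type = fine_label.split("-", 1)            # e.g. 'pers.ind'
    return f"{comp_prefix}-{fine_type}.{comp_type}"

assert glue_fine_and_comp("B-pers.ind", "B-comp.title") == "B-pers.ind.comp.title"
assert glue_fine_and_comp("I-pers.ind", "B-comp.name") == "B-pers.ind.comp.name"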

@e-maud
Contributor

e-maud commented Jan 31, 2020

Super convenient, that would definitely ease the eval script. I think it makes sense to consider fine type + components for the NERC-fine task.

Summarizing eval settings for NERC, there would be:

  • 3 metrics: P, R, F1
  • 3 ways to compute them: micro, doc-macro, type-macro
  • 2 scenarios: fuzzy, exact
  • 2 tasks: NERC-coarse (lit + meto = 2 figures), NERC-fine (lit + meto + nested = 3 figures)

In terms of concrete output (but you might have discussed/checked it already), we need to think about how to communicate results, since all this has to fit neatly somewhere. What about 1 table (csv) per bundle and per team? Happy to sketch it if needed.
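To make the combinatorics explicit, a quick enumeration of the resulting evaluation cells; the task/column names are only illustrative, not a final naming scheme.

# Enumeration of the NERC evaluation settings summarized above;
# naming is illustrative only.
from itertools import product

metrics = ["P", "R", "F1"]
aggregations = ["micro", "doc-macro", "type-macro"]
scenarios = ["fuzzy", "exact"]
columns = ["NERC-coarse-lit", "NERC-coarse-meto",
           "NERC-fine-lit", "NERC-fine-meto", "NERC-fine-nested"]

settings = list(product(columns, aggregations, scenarios, metrics))
print(len(settings))  # 5 * 3 * 2 * 3 = 90 figures per system
for col, agg, scen, metric in settings[:3]:
    print(f"{col}-{agg}-{scen}\t{metric}")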

@aflueckiger
Collaborator Author

aflueckiger commented Feb 4, 2020

Evaluation sample

The script produces the following TSV output when evaluating the coarse format, covering all regimes.
What do you think @e-maud @mromanello @simon-clematide?
I also computed the type-based macro scores. Should I include them here as well, even though this bloats the file considerably?

Evaluation Label P R F1 F1_std P_std R_std TP FP FN
NE_COARSE_LIT-micro-fuzzy LOC 1 0.987012987012987 0.993464052287582       76 0 1

shortened

@aflueckiger
Collaborator Author

aflueckiger commented Feb 10, 2020

@e-maud @mromanello @simon-clematide
I am currently writing the unit tests for the evaluation. I want to ask a quick question to make sure we are all on the same page concerning FP/FN. Consider the artificial example:

TOKEN PRED GOLD
Winthertur B-loc.adm.town B-loc.adm.town
Test I-loc.adm.town O

Following Batista's official definition of FP/FN, the example would result in 1 FN and 1 FP. Unfortunately, the strict scenario severely punishes wrong boundaries: it rewards systems that miss entities over systems that predict the wrong boundaries. Do we really want to follow this?

source: http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
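For concreteness, a tiny sketch of how strict (exact) matching turns this single boundary error into one FP and one FN; the span representation is assumed for illustration.

# Strict matching of the example above; spans are (type, start, end)
# token offsets and assumed for illustration only.
pred_spans = {("loc.adm.town", 0, 2)}  # prediction spans "Winthertur Test"
gold_spans = {("loc.adm.town", 0, 1)}  # gold spans only "Winthertur"

tp = len(pred_spans & gold_spans)  # 0: no exact match
fp = len(pred_spans - gold_spans)  # 1: the wrongly bounded prediction
fn = len(gold_spans - pred_spans)  # 1: the gold entity counts as missed
print(tp, fp, fn)                  # 0 1 1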

In any case, I am glad that we are doing this ourselves, as there are severe miscalculations in the original code. 🙄

@simon-clematide
Contributor

simon-clematide commented Feb 10, 2020 via email

@aflueckiger
Collaborator Author

I think the CoNLL 2003 shared task evaluated similarly:

“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”

paper: https://www.aclweb.org/anthology/W03-0419.pdf

I could not find the evaluation script, so I am not entirely sure.

@aflueckiger
Collaborator Author

aflueckiger commented Feb 11, 2020

@e-maud @simon-clematide @mromanello

Ok, we are getting to the bottom in our dive for fishy metrics. 😆

CoNLL2000 also punishes wrong boundaries twice: one FP and one FN.
source: https://www.clips.uantwerpen.be/conll2000/chunking/output.html

I think this is suboptimal, as conservative systems that predict nothing are better off than systems that predict entities even in cases of unclear boundaries. Nevertheless, I suggest following this standard. Yet, we need to keep this in mind when evaluating the systems.

We could also draw participants' attention to this peculiarity in the README of the scorer. What's your take?

PS: our numbers are in line with the CoNLL2000 standard.

@e-maud
Contributor

e-maud commented Feb 11, 2020

Many thanks @aflueckiger for this deep dive!
I would also vote for aligning ourselves with CoNLL (or are we fostering evaluation-script error propagation through the years? In any case, it is nice to be able to compare, even at a high level). Regarding warning the participants, at first I thought that by doing so we might encourage them not to predict when they are unsure, and therefore that it would be better not to emphasize this point. However, systems behaving that way would have bad fuzzy scores, so participants might not tune their systems in this direction. And in any case, it is also a matter of fairness, so I think it is good if we mention it.

@aflueckiger
Collaborator Author

Simon also shares this view. Thus, we keep the double punishment.

@mromanello
Contributor

can be closed
