Overly optimistic F1 scoring #87

Closed

mkroutikov opened this issue Jan 25, 2019 · 10 comments
Comments

@mkroutikov

mkroutikov commented Jan 25, 2019

Your code is overly optimistic when comparing two different labelings. This becomes important when CRF decoding is not used and predicted BIOES labels can be invalid.

For example, your function get_ner_fmeasure will return F1=1.0, but token accuracy acc=0.5. This does not make any sense to me.

Here is a test that fails, because the computed F1 is 1.0:

from utils.metric import get_ner_fmeasure

def test_metric():
    acc, p, r, f1 = get_ner_fmeasure(
        [['O', 'B-PER', 'I-PER', 'E-PER']],
        [['I-PER', 'B-PER', 'O', 'E-PER']],
    )

    assert f1 < 1.0
@jiesutd
Owner

jiesutd commented Jan 26, 2019

@mkroutikov My code is not overly optimistic in its F1 scoring.

  1. We need to be clear about why F1 is used to evaluate the NER task: the goal is to measure the precision and recall of the entities recognized by the model. In your example, the invalid label cannot generate a new entity. If the predicted label sequence is ['I-PER', 'B-PER', 'O', 'E-PER'], the leading I-PER is meaningless and is not regarded as an entity, so the model does not produce an entity at the first word and the F1 result is not affected. Again, F1 cares about the recognized entities (see the sketch after this list).

  2. We also have the accuracy metric in the output.

  3. The cases you raise almost never occur with our CRF decoder, as they never exist in the training data and the transition probability of such a case is close to 0. Anyway, even if it occurs, my evaluation is reasonable and correct, as explained in "Unable to replicate the reported numbers on CoNLL dataset" #1.
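To make item 1 concrete, here is a minimal sketch of the kind of resolution rule described above: only B-/S- tags open an entity and only a matching E-/S- tag closes one, so a stray I-PER never creates a new entity. The function name and the details are illustrative assumptions, not the actual code in utils.metric.get_ner_fmeasure.

def extract_entities_bioes(labels):
    # Illustrative only: only B-/S- open a span, only a matching E-/S- closes it.
    entities = []
    start, etype = None, None
    for i, tag in enumerate(labels):
        prefix, _, label = tag.partition('-')
        if prefix == 'S':                       # single-token entity
            entities.append((label, i, i))
            start, etype = None, None
        elif prefix == 'B':                     # open a new span
            start, etype = i, label
        elif prefix == 'E' and etype == label:  # close the open span
            entities.append((etype, start, i))
            start, etype = None, None
        # stray I-/E- tags and O neither open nor close anything here
    return entities

Under this rule both the gold ['O', 'B-PER', 'I-PER', 'E-PER'] and the prediction ['I-PER', 'B-PER', 'O', 'E-PER'] yield the single entity ('PER', 1, 3), which is consistent with an entity-level F1 of 1.0 while token accuracy is only 0.5.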

@jiesutd jiesutd closed this as completed Jan 26, 2019
@mkroutikov
Author

  1. When you report F1 on NER tasks (in your paper), you should note that the F1 score depends on how invalid tags in a BIO/BIOES sequence are resolved. Scores for all models without a top CRF will be affected.
  2. Per-token accuracy does not make much sense, because it depends on how the sentence is tokenized; using BPE will produce a completely different (incomparable) result. Suggestion: report record-level (aka sentence-level) accuracy (see the sketch after this list).
  3. Invalid predicted sequences DO happen (when there is no top CRF layer). In fact, they can happen with a CRF layer as well, just rarely. Changing how those invalid sequences are resolved will change the F1, as explained in item 1.
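For reference, record-level accuracy is just exact-match accuracy over whole label sequences. A minimal sketch, assuming gold_sequences and pred_sequences are lists of label lists, one per sentence:

def sentence_accuracy(gold_sequences, pred_sequences):
    # A sentence counts as correct only if every predicted tag matches the gold tag.
    correct = sum(1 for g, p in zip(gold_sequences, pred_sequences) if g == p)
    return correct / len(gold_sequences)

# e.g. sentence_accuracy([['O', 'B-PER', 'I-PER', 'E-PER']],
#                        [['I-PER', 'B-PER', 'O', 'E-PER']]) -> 0.0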

@jiesutd
Owner

jiesutd commented Jan 26, 2019

@mkroutikov

  1. The F1 score is definitely affected by the way you resolve invalid tags (not only the invalid case you raised; other invalid cases exist, such as B-ORG I-PER). You can't say my measure is overly optimistic (I can also give examples showing it is overly pessimistic, see the example below). It is not "overly" anything; it is just one choice of how to deal with the invalid cases. This is also a reason why we wrote our paper "Design Challenges and Misconceptions in Neural Sequence Labeling": to compare different models under the same settings.

  2. It is widely accepted to use F1 to evaluate NER performance; the accuracy is just for reference. I think every evaluation method has its own limitations, including the 'sentence-level' accuracy you raised.

  3. It will happen more often without a CRF, but it is still very rare, and it happens only in extremely rare cases with a CRF. Different resolution rules affect the F1 on a very limited scale (even negligible). Besides, we cannot say that one resolution rule is more optimistic than another; the example below shows that my measurement is 'overly pessimistic' in some cases. They are just slightly different, and if you want to compare different models, just run them under the same settings (not only the invalid-tag resolution, but more importantly the dataset split, embedding choice, pre-processing, etc.), as we did in the paper mentioned in item 1.

Example:
Gold: ['S-PER', 'B-PER', 'I-PER', 'E-PER']
Pred: ['I-PER', 'B-PER', 'I-PER', 'E-PER']
(The gold sequence contains two entities, but the prediction is credited with only one, because the leading I-PER opens no entity; here the resolution rule lowers recall rather than inflating the score.)

@mkroutikov
Author

@jiesutd I hear you, and generally agree with your points.

To summarize: F1 depends on how you resolve invalid tags. Your way is just one way of doing it, and it happened to produce better scores on the CoNLL-2003 task; that is why I called it "overly optimistic". Any rule can be shown to be "optimistic" or "pessimistic", so do not take this personally :) I am more used to the way entities are defined in nltk, which would show a slightly different F1 for models without a top CRF layer.

I am trying to make sense of comparing different NER models, which is why I am focusing on this issue right now. It would be nice if your code could provide different scoring implementations, just for comparison. In my experiments, the difference is up to 1%, which is not negligible.

@jiesutd
Owner

jiesutd commented Jan 26, 2019

"Your way is just one way of doing it, and it happened to produce better scores on the CoNLL-2003 task." How did you reach this conclusion? Have you tried different ways of handling the invalid cases using my code? Which ones have you tried?

@jiesutd
Owner

jiesutd commented Jan 26, 2019

@mkroutikov I am confident that the good performance of my model comes from the network framework rather than from the invalid-tag handling method. If you think "it (the way of dealing with invalid tags) happened to produce better scores on the CoNLL-2003 task", it would be better to give experimental comparisons of different invalid-tag handling methods using my code. Otherwise, it is not fair to say that.

@mkroutikov
Author

@jiesutd I will post here when I finish my experiments. I have a BIO/BIOES decoder that uses different resolution logic. The plan is to try it on CoNLL-2003 and compare with your published results.

That said, I think we both understand that different entity decoding rules will produce different F1 results (whether the difference is negligible or not is a research question). Arguing about which rules are better is moot: for some datasets one rule will be better, and for other datasets another. The lesson is that one cannot compare two systems that may use different entity extraction logic. This is especially true in the early training stages, where the learning trajectory (in F1 terms) may look very different (for example, your entity extraction is very forgiving in the early training stages; I often see F1 greater than token accuracy!).

So, how can systems be compared strictly? I think the only way of extracting entities that does not depend on arbitrary rules is to use Viterbi decoding with hard transition constraints (and no learned transition weights). That way the learned logits, not hand-written rules, guide the entity resolution decisions. A rough sketch of what I mean is below.
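For illustration, here is a minimal sketch of such constrained decoding, assuming per-token tag scores (logits) of shape (seq_len, n_tags) and a small illustrative BIOES tag set; none of this is code from this repository.

import numpy as np

TAGS = ['O', 'B-PER', 'I-PER', 'E-PER', 'S-PER']   # illustrative tag set

def allowed(prev, curr):
    # Hard BIOES constraint: I-/E- must continue an open span of the same type;
    # O, B- and S- may only follow tags that do not leave a span open.
    if curr.startswith(('I-', 'E-')):
        return prev.startswith(('B-', 'I-')) and prev[2:] == curr[2:]
    return not prev.startswith(('B-', 'I-'))

def constrained_viterbi(logits):
    # logits: np.ndarray of shape (seq_len, n_tags); returns the best valid tag path.
    n, t = logits.shape
    score = np.full((n, t), -np.inf)
    back = np.zeros((n, t), dtype=int)
    for j, tag in enumerate(TAGS):                  # only valid start tags
        if not tag.startswith(('I-', 'E-')):
            score[0, j] = logits[0, j]
    for i in range(1, n):
        for j, curr in enumerate(TAGS):
            for k, prev in enumerate(TAGS):
                cand = score[i - 1, k] + logits[i, j]
                if allowed(prev, curr) and cand > score[i, j]:
                    score[i, j] = cand
                    back[i, j] = k
    # a sequence may not end in the middle of a span
    end = [score[n - 1, j] if not TAGS[j].startswith(('B-', 'I-')) else -np.inf
           for j in range(t)]
    path = [int(np.argmax(end))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[j] for j in reversed(path)]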

Thanks for the discussion.

@mkroutikov
Author

@jiesutd Here is the comparison: http://blog.innodatalabs.com/7776_ways_to_compute_f1_for_ner_task/

To conclude: there is no problem with your scoring, but it is not fair to compare with others who may have used a different method.

@jiesutd
Owner

jiesutd commented Feb 1, 2019

@mkroutikov A very detailed comparison, impressive! It seems you compared the different F1 variants on the LSTM model without CRF.

May I know how much the F1 computation methods differ on a model with CRF? I believe that with the CRF the difference would be significantly reduced, and in better CRF models (e.g. adding character LSTM/CRF) the difference would be even smaller.

@mkroutikov
Author

You are right, I only tried a bidirectional LSTM (single layer) and did not try attaching a CRF layer (yet). I will post here if I do.

I agree that the more advanced models will make far fewer token errors, and therefore leave less ambiguity when decoding entities.

With a CRF layer on top, I would suggest always using constrained transition weights. That automatically guarantees valid decoding and does not cost anything extra in CPU/GPU time (see the sketch below).
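As an illustration, here is a minimal sketch of what I mean by constraining the weights, assuming a CRF whose learned transition matrix has shape (n_tags, n_tags); the tag set and the allowed() rule are the same illustrative assumptions as in the earlier sketch, not code from this repository.

import numpy as np

TAGS = ['O', 'B-PER', 'I-PER', 'E-PER', 'S-PER']   # illustrative tag set

def allowed(prev, curr):
    # Same hard BIOES rule as before: I-/E- must continue a span of the same type.
    if curr.startswith(('I-', 'E-')):
        return prev.startswith(('B-', 'I-')) and prev[2:] == curr[2:]
    return not prev.startswith(('B-', 'I-'))

def constrain_transitions(transitions):
    # Mask invalid BIOES transitions to -inf so Viterbi can never select them.
    masked = transitions.copy()
    for i, prev in enumerate(TAGS):
        for j, curr in enumerate(TAGS):
            if not allowed(prev, curr):
                masked[i, j] = -np.inf
    return masked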
