Overly optimistic F1 scoring #87
@mkroutikov opened the issue:

Your code is overly optimistic when comparing two different labelings. This becomes important when CRF decoding is not used and predicted BIOES labels can be invalid.

For example, your function get_ner_fmeasure will return F1 = 1.0 while token accuracy is acc = 0.5, which does not make any sense to me. Here is the test code that fails, because the computed F1 is 1.
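The test snippet itself did not survive in this copy of the thread. Below is a plausible reconstruction of the kind of failure described, assuming the NCRF++-style signature get_ner_fmeasure(golden_lists, predict_lists, label_type="BMES") from utils/metric.py, returning (accuracy, precision, recall, f_measure); the tag sequences are made up for illustration:

```python
# Hypothetical reconstruction, not the original test. Assumes that
# get_ner_fmeasure takes lists of gold and predicted tag sequences and
# returns (accuracy, precision, recall, f_measure).
from utils.metric import get_ner_fmeasure

golden_lists = [['B-PER', 'E-PER']]
# The second token is wrong (E-ORG instead of E-PER), so token accuracy
# is 0.5. A forgiving decoder that lets any E- tag close the open B-PER
# span extracts the same entity PER[0,1] from both sequences, so F1 = 1.0.
predict_lists = [['B-PER', 'E-ORG']]

acc, p, r, f = get_ner_fmeasure(golden_lists, predict_lists, label_type="BMES")
print(acc, f)  # under the forgiving rule: 0.5 1.0
assert f < 1.0, "F1 should not be perfect when the labelings disagree"
```

The assert encodes the reporter's expectation, so the test fails whenever the forgiving rule yields a perfect F1.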
Comments

@mkroutikov My code does not compute the F1 score in an overly optimistic way.
@jiesutd I hear you, and generally agree with your points. To summarize: F1 depends on the way you resolve invalid tags. Your way is just one way of doing it, and it happened to produce better scores on the CoNLL2003 task. That is why I called it "overly optimistic". Any rule can be shown to be "optimistic" or "pessimistic", so do not take this personally :) I am more used to the way entities are defined in nltk, which would show a slightly different F1 on models without a top CRF layer. I am trying to make sense of comparing different NER models, which is why I am focusing on this issue right now. It would be nice if your code could provide different scoring implementations, just for comparison. In my experiments the difference is up to 1%, which is not negligible.
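To make the dependence on resolution rules concrete, here is a small illustrative sketch (simplified to BIO; the tag sequence and both decoders are made up for illustration and are taken neither from this repository nor from nltk). The same invalid prediction yields different entities, and therefore a different F1, under two plausible rules:

```python
# Illustrative sketch: two ways to resolve an invalid BIO sequence.
# The prediction below is invalid: I-PER appears without a preceding B-PER.
pred = ['O', 'I-PER', 'I-PER', 'O']

def decode_forgiving(tags):
    """Treat a stray I-X as if it opened a new X entity."""
    spans, start, label = [], None, None
    for i, t in enumerate(tags + ['O']):
        if t.startswith(('B-', 'I-')) and (start is None or t[2:] != label):
            if start is not None:
                spans.append((start, i - 1, label))
            start, label = i, t[2:]
        elif t == 'O' and start is not None:
            spans.append((start, i - 1, label))
            start, label = None, None
    return spans

def decode_strict(tags):
    """Only accept spans that begin with an explicit B-X."""
    spans, start, label = [], None, None
    for i, t in enumerate(tags + ['O']):
        if t.startswith('B-'):
            if start is not None:
                spans.append((start, i - 1, label))
            start, label = i, t[2:]
        elif t.startswith('I-') and start is not None and t[2:] == label:
            continue  # valid continuation of the open span
        else:
            if start is not None:
                spans.append((start, i - 1, label))
            start, label = None, None
    return spans

print(decode_forgiving(pred))  # [(1, 2, 'PER')] - recovers an entity
print(decode_strict(pred))     # []              - rejects it entirely
```

If the gold standard contains a PER entity over tokens 1-2, the forgiving rule scores a true positive where the strict rule scores a false negative, so the two rules report different F1 for the same model output.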
"Your way is just one way of doing it, and it happened to produce better scores on CoNLL2003 task." How do you get this conclusion? Have you tried different ways of handling invalid cases using my code? Which kind of ways have you tried? |
@mkroutikov I am confident that the good performance of my model comes from the network framework rather than from the invalid-tag handling method. If you think "it (the way of dealing with invalid tags) happened to produce better scores on the CoNLL2003 task", it would be better to show experimental comparisons of different invalid-tag handling rules using my code. Otherwise, it is not fair to say that.
@jiesutd I will post here when I finish my experiments. I have a BIO/BIOES decoder that uses different resolution logic; the plan is to try it on CoNLL2003 and compare with your published results.

That said, I think we both understand that different entity decoding rules will produce different F1 results (which could be negligible or not - a research point). The point of arguing which rules are better is moot: for some datasets one rule will be better, and for others another may be better. The lesson is that one cannot compare two different systems, because these systems may use different entity extraction logic. This is especially so in the early training stages, where the learning trajectory (in F1 terms) may be very different (for example, your way of extracting entities is very forgiving in the early stages - I often see F1 greater than token accuracy!).

So, how to compare in a strict way? I think the only way of extracting entities that does not depend on arbitrary rules is to use Viterbi decoding with hard transition rules (and no transition weights). This way the learned logits, not hand-written rules, guide the entity resolution decisions. Thanks for the discussion.
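A minimal sketch of that proposal (the label set, mask construction, and toy logits are made up for illustration, not taken from the repository): Viterbi decoding over per-token logits with a hard 0/-inf mask that forbids invalid BIO transitions and starts, and no learned transition weights, so only the logits choose among valid paths:

```python
import numpy as np

# Hard transition constraints (allowed = 0, forbidden = -inf) and no
# learned transition weights: the logits alone decide between valid paths.
LABELS = ['O', 'B-PER', 'I-PER']  # toy BIO label set, for illustration only

def allowed(prev, curr):
    """Hard BIO rule: I-X may only follow B-X or I-X."""
    if curr.startswith('I-'):
        return prev in ('B-' + curr[2:], 'I-' + curr[2:])
    return True

MASK = np.array([[0.0 if allowed(p, c) else -np.inf for c in LABELS]
                 for p in LABELS])
# I-X is also invalid at the start of a sequence.
START = np.array([-np.inf if l.startswith('I-') else 0.0 for l in LABELS])

def constrained_viterbi(logits):
    """logits: (seq_len, n_labels) array of per-token label scores."""
    n, _ = logits.shape
    score = logits[0] + START
    back = np.zeros(logits.shape, dtype=int)
    for t in range(1, n):
        total = score[:, None] + MASK + logits[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [LABELS[i] for i in reversed(path)]

# Greedy argmax of these toy logits would emit an invalid leading I-PER;
# the constrained decoder lets the logits pick the best legal path instead.
logits = np.array([[0.1, 0.0, 0.9],
                   [0.1, 0.2, 0.7],
                   [0.8, 0.1, 0.1]])
print(constrained_viterbi(logits))  # ['B-PER', 'I-PER', 'O']
```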
@jiesutd Here is the comparison: http://blog.innodatalabs.com/7776_ways_to_compute_f1_for_ner_task/ To conclude: there is no problem with your scoring, but it is not fair to compare with others who may have used a different method.
@mkroutikov Very detailed comparison, impressive! It seems you compared the different F1 variants on the LSTM model without CRF. May I ask how the F1 computing methods differ on the model with CRF? I believe that with the CRF the difference would be significantly reduced, and in better CRF models (e.g. adding character LSTM/CRF) the difference would be even smaller.
You are right, I only tried a BidiLSTM (single layer); I have not tried attaching a CRF layer (yet). I will post if I do. I agree that with more advanced models we will get far fewer token errors, and therefore less ambiguity in decoding entities. With a CRF layer on top, I would suggest always using constrained weights.