Thank you very much for sharing Wapiti. This is really awesome.
Before I describe a potential issue with the German model (my hypothesis) that can be downloaded from your homepage, let me say that I have little experience with NLP or linguistics. Thus, I might be completely wrong.
Having played around with Wapiti (and now also RFTagger), it seems Wapiti consistently swaps the classification of definite and indefinite articles. For example:
Der ART.Indef.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
die ART.Indef.Nom.Sg.Fem*
Schwester N.Reg.Nom.Sg.Fem
des ART.Indef.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
Ein ART.Def.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
eine ART.Def.Nom.Sg.Fem*
Frau N.Reg.Nom.Sg.Fem
eines ART.Def.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
In German, “der”, “die”, and “des” are definite articles, while “ein”, “eine”, and “eines” are indefinite.
The RFTagger homepage shows the following sample output of RFTagger (which seems to use the same tagset):
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Indef.Nom.Sg.Masc
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
Tagging the same sentence with Wapiti (using the German model I downloaded from your website) yields the following output:
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Def.Nom.Sg.Masc*
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
While RFTagger correctly classifies “ein” as an indefinite article, Wapiti classifies it as a definite article.
Am I right in assuming this is an issue with the German model? What would be the best way to correct this?
@ThomasBarnekow Every tagger (e.g. Wapiti, RFTagger, etc.) has a level of quality that depends heavily on the trained model. In any case, if you can tag a large test set and then compare accuracy, this will determine which tagger/model suits you best.
PS: I also found that accuracy depends on the implementation itself. I trained CRF++, CRFSuite and Wapiti on the same data but ended up with different results :)
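The comparison suggested above can be sketched in a few lines of Python. This is only a minimal illustration, assuming both taggers emit one `token TAG` pair per line as in the samples in this thread. The function names `read_tagged` and `accuracy` are hypothetical, and stripping a trailing `*` from tags is an assumption based on how the flagged tags appear in the output quoted above. Note that the RFTagger sample is used here as a reference for illustration only; a real evaluation needs a hand-annotated gold corpus.

```python
def read_tagged(text):
    """Parse 'token TAG' lines into (token, tag) pairs; blank lines are skipped."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, tag = line.split(None, 1)
        # Assumption: a trailing '*' is only a marker and not part of the tag.
        pairs.append((token, tag.rstrip("*")))
    return pairs


def accuracy(reference, predicted):
    """Fraction of positions where both sequences carry the same tag."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same number of tokens")
    matches = sum(1 for (_, r), (_, p) in zip(reference, predicted) if r == p)
    return matches / len(reference)


# Sample outputs quoted earlier in this thread.
rftagger_out = """Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Indef.Nom.Sg.Masc
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent"""

wapiti_out = """Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Def.Nom.Sg.Masc*
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent"""

score = accuracy(read_tagged(rftagger_out), read_tagged(wapiti_out))
print(score)
```

On this five-token sample the two outputs differ only in the tag for “ein”, so four of five tags agree.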
@almasaud Thanks! I understand that there are differences between the trained models and the implementations. In this case, my assumption is that the training data might be wrong (e.g., because the tags for definite and indefinite articles might be swapped).
I'm just starting to play around with natural language processing. For example, I don't have any training corpora I could use to train Wapiti (or other taggers). Could you point me to something I could use for further testing?
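If the hypothesis in this thread holds (the `ART.Def` and `ART.Indef` tags are swapped consistently, and nothing else is affected), a post-processing step over the tagger output could serve as a stopgap until the model is retrained. The following is a sketch under exactly that assumption, not a fix for the model itself. The function name `fix_line` is hypothetical, and the one-pair-per-line `token TAG` layout is taken from the samples above.

```python
import re

# Matches the token, the whitespace separator, and the 'ART.' prefix, followed
# by 'Def' or 'Indef' as a complete tag component.
_ARTICLE = re.compile(r"^(\S+\s+ART\.)(Def|Indef)(?=\.|\*|$)")


def fix_line(line):
    """Swap ART.Def and ART.Indef in one 'token TAG' line; leave other lines alone."""
    def flip(match):
        swapped = "Indef" if match.group(2) == "Def" else "Def"
        return match.group(1) + swapped
    return _ARTICLE.sub(flip, line)


wapiti_lines = [
    "Der ART.Indef.Nom.Sg.Masc",
    "Mann N.Reg.Nom.Sg.Masc",
    "heiratet VFIN.Full.3.Sg.Pres.Ind",
    "eine ART.Def.Nom.Sg.Fem",
    "Frau N.Reg.Nom.Sg.Fem",
    ". SYM.Pun.Sent",
]

for line in wapiti_lines:
    print(fix_line(line))
```

Only lines whose tag starts with `ART.` are changed; the separator and the rest of the tag are preserved as they were.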