Incorrect classification of definite and indefinite articles in German #16

Closed
ThomasBarnekow opened this issue Apr 17, 2016 · 3 comments

Comments

@ThomasBarnekow

Thank you very much for sharing Wapiti. This is really awesome.

Before I describe a potential issue with the German model (my hypothesis) that can be downloaded from your homepage, let me say that I have little experience with NLP or linguistics. Thus, I might be completely wrong.

Having played around with Wapiti (and now also RFTagger), it seems Wapiti consistently swaps the classification of definite and indefinite articles. For example:

Der ART.Indef.Nom.Sg.Masc*
Mann    N.Reg.Nom.Sg.Masc
heiratet    VFIN.Full.3.Sg.Pres.Ind
die ART.Indef.Nom.Sg.Fem*
Schwester   N.Reg.Nom.Sg.Fem
des ART.Indef.Gen.Sg.Masc*
Freundes    N.Reg.Gen.Sg.Masc
.   SYM.Pun.Sent

Ein ART.Def.Nom.Sg.Masc*
Mann    N.Reg.Nom.Sg.Masc
heiratet    VFIN.Full.3.Sg.Pres.Ind
eine    ART.Def.Nom.Sg.Fem*
Frau    N.Reg.Nom.Sg.Fem
eines   ART.Def.Gen.Sg.Masc*
Freundes    N.Reg.Gen.Sg.Masc
.   SYM.Pun.Sent

In German, “der”, “die”, and “des” are definite articles, while “ein”, “eine”, and “eines” are indefinite.

The RFTagger homepage contains the following sample output of RFTagger (which seems to use the same tagset):

Das PRO.Dem.Subst.-3.Nom.Sg.Neut 
ist VFIN.Sein.3.Sg.Pres.Ind 
ein ART.Indef.Nom.Sg.Masc 
Testsatz    N.Reg.Nom.Sg.Masc 
.   SYM.Pun.Sent 

Having Wapiti tag the same sentence (using the German model I downloaded from your website) yields the following output:

Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Def.Nom.Sg.Masc*
Testsatz    N.Reg.Nom.Sg.Masc
.   SYM.Pun.Sent

While RFTagger correctly classifies “ein” as an indefinite article, Wapiti classifies it as a definite article.

Am I right in assuming this is an issue with the German model? What would be the best way to correct this?
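One way to check the hypothesis beyond a handful of sentences is to compare each article tag against the closed list of German article forms. The following is only a minimal sketch: it assumes the tagger output has one token and one tag per line, separated by whitespace, and the helper names (`parse_tagged`, `suspicious_articles`) are made up for illustration and are not part of Wapiti or RFTagger.

```python
# Closed-class German article forms (lower-cased).
DEFINITE = {"der", "die", "das", "des", "dem", "den"}
INDEFINITE = {"ein", "eine", "eines", "einem", "einen", "einer"}


def parse_tagged(text):
    """Turn 'token<whitespace>tag' lines into (token, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # Drop a trailing '*' marker, as used in the listings above.
            pairs.append((parts[0], parts[1].rstrip("*")))
    return pairs


def suspicious_articles(pairs):
    """Return (token, tag, expected) for article tags that contradict the word form."""
    flagged = []
    for token, tag in pairs:
        fields = tag.split(".")
        if len(fields) < 2 or fields[0] != "ART":
            continue
        word = token.lower()
        if fields[1] == "Indef" and word in DEFINITE:
            flagged.append((token, tag, "Def"))
        elif fields[1] == "Def" and word in INDEFINITE:
            flagged.append((token, tag, "Indef"))
    return flagged


# First example sentence from this issue, as tagged by the German Wapiti model.
sample = """Der ART.Indef.Nom.Sg.Masc*
Mann    N.Reg.Nom.Sg.Masc
heiratet    VFIN.Full.3.Sg.Pres.Ind
die ART.Indef.Nom.Sg.Fem*
Schwester   N.Reg.Nom.Sg.Fem
des ART.Indef.Gen.Sg.Masc*
Freundes    N.Reg.Gen.Sg.Masc
.   SYM.Pun.Sent
"""

flagged = suspicious_articles(parse_tagged(sample))
for token, tag, expected in flagged:
    print(f"{token}: tagged {tag}, expected ART.{expected}")
```

If nearly every article in a larger text is flagged, that would point to swapped labels rather than an occasional tagging error.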

@ansdma

ansdma commented Apr 17, 2016

@ThomasBarnekow Every tagger (e.g. Wapiti, RFTagger, etc.) has a level of quality that depends heavily on the trained model. In any case, if you can tag a large test set and then compare accuracy, that will determine which tagger/model suits you best.

PS: I did find that accuracy also depends on the implementation itself. I trained CRF++, CRFSuite and Wapiti on the same data but ended up with different results :)
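The comparison suggested above can be sketched roughly as follows. This is an illustration only: it assumes the reference tags and the predicted tags are already aligned token by token, the function names are hypothetical, and the data is just the single "Testsatz" sentence from this issue, with the RFTagger sample output treated as the reference.

```python
from collections import Counter


def token_accuracy(reference, predicted):
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(reference) != len(predicted):
        raise ValueError("tag sequences must be aligned and of equal length")
    if not reference:
        return 0.0
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def confusions(reference, predicted):
    """Count (reference_tag, predicted_tag) pairs for every mismatch."""
    return Counter((r, p) for r, p in zip(reference, predicted) if r != p)


# Tags for "Das ist ein Testsatz ." from the listings in this issue.
rftagger_tags = [
    "PRO.Dem.Subst.-3.Nom.Sg.Neut",
    "VFIN.Sein.3.Sg.Pres.Ind",
    "ART.Indef.Nom.Sg.Masc",
    "N.Reg.Nom.Sg.Masc",
    "SYM.Pun.Sent",
]
wapiti_tags = [
    "PRO.Dem.Subst.-3.Nom.Sg.Neut",
    "VFIN.Sein.3.Sg.Pres.Ind",
    "ART.Def.Nom.Sg.Masc",
    "N.Reg.Nom.Sg.Masc",
    "SYM.Pun.Sent",
]

accuracy = token_accuracy(rftagger_tags, wapiti_tags)
mismatches = confusions(rftagger_tags, wapiti_tags)
print(accuracy)
print(mismatches.most_common())
```

On a real test set, the confusion counts are the more telling part: a systematic swap would show up as a large count for the Def/Indef pairs, whereas ordinary model errors would be spread over many tag pairs.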

@ThomasBarnekow
Author

@almasaud Thanks! I understand that there are differences between the trained models and the implementations. In this case, my assumption is that the training data might be wrong (e.g., because the tags for definite and indefinite articles might be swapped).
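If the labels really are swapped, one possible workaround is to flip them, either in the tagger output or in the training data before retraining. The snippet below is a sketch under stated assumptions, not a confirmed fix: it assumes exactly one token and one tag per line, and both function names are hypothetical.

```python
def swap_article_definiteness(tag):
    """Swap 'Def' and 'Indef' in the second field of ART.* tags; leave other tags alone."""
    fields = tag.split(".")
    if len(fields) >= 2 and fields[0] == "ART":
        if fields[1] == "Def":
            fields[1] = "Indef"
        elif fields[1] == "Indef":
            fields[1] = "Def"
    return ".".join(fields)


def fix_line(line):
    """Apply the swap to a 'token<whitespace>tag' line; pass other lines through unchanged."""
    parts = line.split()
    if len(parts) != 2:
        return line
    token, tag = parts
    return f"{token}\t{swap_article_definiteness(tag)}"


print(fix_line("ein\tART.Def.Nom.Sg.Masc"))
print(fix_line("Testsatz\tN.Reg.Nom.Sg.Masc"))
```

This only makes sense if the swap is systematic, so it is worth confirming that on a larger sample first.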

I'm just starting to play around with natural language processing. For example, I don't have any training corpora I could use to train Wapiti (or other taggers). Could you point me to something I could use for further testing?

@ansdma

ansdma commented Apr 17, 2016

@ThomasBarnekow That could be the case. However, there is still a chance that the model is just confused by these particular sentences.

Anyhow, I just googled and found this dataset: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html I have never tried it, but it looks like a good candidate.
