Japanese Corpus readers do not return properly formatted (word, tag) tuples #1384
Comments
Actually, the 動詞-自立 一段 基本形 tag is not unlike the hierarchical POS tags from other languages; you can break them up or collapse them, e.g. see page 22 on the And the two orthographic forms are actually good because it's additional information. Possibly the In the case of JEITA, it's a "morphologically"-tagged corpus, so it's a little different from how we conceive of a POS-tagged corpus in European languages. Masato Hagiwara has a good blog post on this: http://lilyx.net/nltk-japanese-corpus/
I would still hold that it's an error to include orthographic information as part of a POS tag.
Agreed that the API should be better. Maybe:
for word, tag in sent:
    print(tag.form1)       # デル
    print(tag.form2)       # 出る
    print(tag.coarse_pos)  # 動詞-自立
    print(tag.fine_pos)    # 動詞-自立 一段 基本形
That looks much better to me.
I agree with @brendanpatrickmurphy that it's confusing from an NLTK perspective to mix orthographic information in with the POS tag. It seems like the word form flexibility is similar to some corpora which provide lemmas. How is this handled in other NLTK corpora? The corpus readers I'm familiar with include:
So it might make sense to have arguments indicating the type of POS and/or word-form desired: e.g., |
A possibly better naming scheme would be:
for word, tag in sent:
    print(tag.pronunciation)  # デル
    print(tag.lemma)          # 出る
    print(tag.pos)            # 動詞-自立
    print(tag.paradigm)       # 一段 (or conjugation type)
    print(tag.inflection)     # 基本形 (or conjugation form)
Note that only inflecting words have paradigm/inflection.
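To make the naming proposal concrete, here is a minimal sketch of how such a structured tag could be built from the raw tag string. `ChasenTag` and `parse_chasen_tag` are hypothetical names, not part of NLTK, and the sketch assumes the fields arrive in the order shown in the .chasen line (reading, lemma, POS, paradigm, inflection), separated by tabs or, failing that, whitespace.

```python
from typing import NamedTuple, Optional


class ChasenTag(NamedTuple):
    """Hypothetical structured view of one raw JEITA tag string."""
    pronunciation: str
    lemma: str
    pos: str
    paradigm: Optional[str]    # None for non-inflecting words
    inflection: Optional[str]  # None for non-inflecting words


def parse_chasen_tag(raw: str) -> ChasenTag:
    # Assumption: fields are tab-separated; fall back to whitespace otherwise.
    fields = raw.split("\t") if "\t" in raw else raw.split()
    fields = [f.strip() for f in fields]
    # Pad so that words without paradigm/inflection still unpack cleanly.
    fields += [""] * (5 - len(fields))
    pronunciation, lemma, pos, paradigm, inflection = fields[:5]
    return ChasenTag(pronunciation, lemma, pos,
                     paradigm or None, inflection or None)


tag = parse_chasen_tag("デル\t出る\t動詞-自立\t一段\t基本形")
print(tag.pronunciation, tag.lemma, tag.pos, tag.paradigm, tag.inflection)
```

A named tuple keeps the tag hashable and comparable, so existing code that treats tags as opaque values would still work while gaining attribute access.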
There are still problems with how POS Tagging works for these corpora. This afternoon, I loaded up JEITA, and called jeita.tagged_words(). The problem is that the second half of each tuple in JEITA doesn't contain a tag that is easy to test against a POS-tagger. The second half of each tuple contains both orthographic information (each word in the corpus has a spelling for each syllabary in Japanese) and the tag information, so a word tagged as a noun won't have the same tag as another word tagged as a noun. This leads to quite a few problems when testing a tagger against the corpus.
Open up one of the .chasen files and you'll see what I mean. Here's line 6 of a0010.chasen in the jeita.zip file.
出る デル 出る 動詞-自立 一段 基本形
There are four (or maybe five) elements here. The first three are ways of writing the word /deru/, and the last is the tag (verb, independent, ichidan, plain form).
So I wrote this loop:
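from nltk.corpus import jeita

tagged_sents = jeita.tagged_sents()  # assumed source of tagged_sents
for sent in tagged_sents:
    for (word, tag) in sent:
        print(word)
        print(tag)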
And here's some sample output:
出る
デル 出る 動詞-自立 一段 基本形
As you can see, the tag includes two forms of orthography, which throws things off.
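As a stopgap for tagger evaluation, the orthographic fields can be dropped after the fact. The following is only a sketch: `pos_only` is a hypothetical helper, and it assumes the POS is the third field of the tag string (index 2), with fields separated by tabs or, failing that, whitespace.

```python
def pos_only(tagged_sent, pos_index=2):
    """Return (word, pos) pairs, dropping reading, lemma and conjugation info.

    Assumption: the POS sits at position `pos_index` in the raw tag string.
    """
    result = []
    for word, raw_tag in tagged_sent:
        fields = raw_tag.split("\t") if "\t" in raw_tag else raw_tag.split()
        # Words with too few fields get an empty POS rather than an error.
        pos = fields[pos_index].strip() if len(fields) > pos_index else ""
        result.append((word, pos))
    return result


# Illustrative input modelled on the sample output above.
sample = [("出る", "デル\t出る\t動詞-自立\t一段\t基本形")]
print(pos_only(sample))
```

With tags reduced this way, two words tagged as nouns compare equal regardless of their readings, which is what a tagger's accuracy test needs.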
(Also, as a side note, it would be really great if we could have a "simple" POS tag version of these files, which didn't include some of the additional categories like "plain form" or which group (ichidan/godan) the verb belonged to, since I don't think a lot of parsers care too much about which is which, but doing this would probably take help from someone fluent in Japanese.)
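One rough way to approximate a "simple" tag without relabelling the corpus is to truncate the hyphen-separated POS hierarchy. This is a sketch under the assumption that the levels are joined with "-" as in 動詞-自立; `coarse_pos` is a hypothetical helper, and whether the top level alone is a linguistically sensible tagset is a separate question.

```python
def coarse_pos(pos, depth=1):
    """Keep only the first `depth` levels of a hyphen-separated POS tag."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    parts = pos.split("-")
    return "-".join(parts[:depth])


print(coarse_pos("動詞-自立"))            # top level only
print(coarse_pos("助詞-格助詞-一般", 2))  # first two levels
```

Tags that have fewer levels than `depth` are returned unchanged.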
I can check again with KNBC, the other Japanese corpus included in NLTK, but it does even funkier things with tags last I checked.