
Japanese Corpus readers do not return properly formatted (word, tag) tuples #1384

Open
brendanpatrickmurphy opened this issue Apr 29, 2016 · 6 comments

Comments

@brendanpatrickmurphy

brendanpatrickmurphy commented Apr 29, 2016

There are still problems with how POS tagging works for these corpora. This afternoon I loaded JEITA and called jeita.tagged_words(). The second element of each tuple is not a tag that can easily be tested against a POS tagger: it contains both orthographic information (each word in the corpus has a spelling for each syllabary in Japanese) and the tag itself, so two words tagged as nouns won't necessarily have the same tag string. This causes quite a few problems when testing a tagger against the corpus.

Open up one of the .chasen files and you'll see what I mean. Here's line 6 of a0010.chasen in the jeita.zip file.

出る デル 出る 動詞-自立 一段 基本形

There are four (or maybe five) elements here. The first three are ways of writing the word /deru/, and the last is the tag (verb, independent, ichidan conjugation, plain form).

So I wrote this loop:

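from nltk.corpus import jeita

for sent in jeita.tagged_sents():
    for word, tag in sent:
        print(word)  # surface form only
        print(tag)   # everything else on the line, joined into one string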

And here's some sample output:

出る
デル 出る 動詞-自立 一段 基本形

As you can see, the tag includes two forms of orthography, which throws things off.
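As a stopgap, the combined string can be pulled apart by hand. This is only a minimal sketch: it assumes the fields are whitespace-separated (tabs or spaces), as in the sample output above, and `split_chasen_tag` is a hypothetical helper, not part of NLTK.

```python
def split_chasen_tag(tag):
    """Split a combined tag string into (reading, lemma, pos_fields).

    Hypothetical helper, not an NLTK API. Assumes whitespace-separated
    fields in the order: reading, lemma, POS, then optional conjugation
    fields.
    """
    fields = tag.split()
    if len(fields) < 3:
        raise ValueError("expected at least reading, lemma and POS: %r" % tag)
    reading, lemma = fields[0], fields[1]
    pos_fields = fields[2:]  # the part a POS tagger should be compared against
    return reading, lemma, pos_fields


reading, lemma, pos_fields = split_chasen_tag("デル 出る 動詞-自立 一段 基本形")
print(pos_fields)
```

Comparing a tagger's output against `pos_fields` alone avoids the mismatch caused by the two extra orthographic forms.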

(Also, as a side note, it would be really great to have a "simple" POS tag version of these files that leaves out some of the additional categories, such as "plain form" or which group (ichidan/godan) a verb belongs to. I don't think many parsers care much about those distinctions, but producing it would probably take help from someone fluent in Japanese.)

I can check again with KNBC, the other Japanese corpus included in NLTK, but it does even funkier things with tags last I checked.

@alvations
Contributor

alvations commented Apr 29, 2016

Actually, the 動詞-自立 一段 基本形 tag is not unlike the hierarchical POS tags used for other languages: you can break them up or collapse them. See, e.g., page 22 on the NTU-MC, and also this: https://www.sketchengine.co.uk/xdocumentation/wiki/tagsets/jpwac
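Collapsing a hierarchical tag can be as simple as keeping its leading levels. A rough sketch, assuming the levels are hyphen-separated as in 動詞-自立; `collapse_pos` is a hypothetical name, and 助詞-格助詞-一般 is used only as an illustrative three-level tag:

```python
def collapse_pos(pos, depth=1):
    """Keep only the first `depth` hyphen-separated levels of a POS tag.

    Hypothetical helper, not an NLTK API.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return "-".join(pos.split("-")[:depth])


print(collapse_pos("動詞-自立"))               # top level only
print(collapse_pos("助詞-格助詞-一般", depth=2))  # top two levels
```

A tag with fewer levels than `depth` is returned unchanged, so the same call works across the whole tagset.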

And the two orthographic forms are actually good to have because they are additional information. The NLTK API possibly needs some more work to split them up, but the information in the corpus is good and we shouldn't remove it.

JEITA is a "morphologically" tagged corpus, so it's a little different from how we conceive of POS-tagged corpora for European languages. Masato Hagiwara has a good blog post on this: http://lilyx.net/nltk-japanese-corpus/

@brendanpatrickmurphy
Author

I would still hold that it's an error to include orthographic information as part of a POS tag.

@alvations
Contributor

alvations commented Apr 29, 2016

Agreed that the API should be better. Maybe a namedtuple for the tag would be a better interface:

for word, tag in sent:
    print(tag.form1)       # デル
    print(tag.form2)       # 出る
    print(tag.coarse_pos)  # 動詞-自立
    print(tag.fine_pos)    # 動詞-自立 一段 基本形

@brendanpatrickmurphy
Author

That looks much better to me.

@nschneid
Contributor

I agree with @brendanpatrickmurphy that it's confusing from an NLTK perspective to mix orthographic information in with the POS tag.

It seems like the word form flexibility is similar to some corpora which provide lemmas. How is this handled in other NLTK corpora? The corpus readers I'm familiar with include:

  • SemCor: semcor.tagged_chunks(tag='pos') vs. semcor.tagged_chunks(tag='sense'). The latter returns a WordNet Lemma object as the tag.
  • CHILDESCorpusReader: words(), tagged_words(), etc. have a boolean stem argument.

So it might make sense to have arguments indicating the type of POS and/or word-form desired: e.g., tagged_words(form='form1', pos='fine').
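Here is a rough sketch of how such arguments might behave, written as a standalone generator over existing (word, tag) tuples rather than as a change to the corpus reader. `select_tagged_words` and the accepted values ('surface', 'form1', 'form2', 'fine', 'coarse') are assumptions for illustration, not an existing NLTK API; the raw tag is assumed to be whitespace-separated.

```python
def select_tagged_words(tagged_words, form="surface", pos="fine"):
    """Yield (word, tag) pairs with the requested word form and POS detail.

    Hypothetical helper, not an NLTK API. Each raw tag is assumed to hold
    form1 (reading), form2 (lemma), the POS, and optional conjugation fields.
    """
    if form not in ("surface", "form1", "form2"):
        raise ValueError("unknown form: %r" % form)
    if pos not in ("fine", "coarse"):
        raise ValueError("unknown pos: %r" % pos)
    for surface, raw_tag in tagged_words:
        fields = raw_tag.split()
        forms = {"surface": surface, "form1": fields[0], "form2": fields[1]}
        pos_fields = fields[2:]
        # 'coarse' keeps only the POS field; 'fine' keeps the conjugation fields too.
        tag = pos_fields[0] if pos == "coarse" else " ".join(pos_fields)
        yield forms[form], tag


sample = [("出る", "デル 出る 動詞-自立 一段 基本形")]
for word, tag in select_tagged_words(sample, form="form1", pos="coarse"):
    print(word, tag)
```

Keeping the defaults at the surface form and the fine tag would leave the word half of each tuple unchanged while dropping the orthographic forms from the tag.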

@fcbond
Contributor

fcbond commented Oct 4, 2017

A possibly better naming scheme would be:

for word, tag in sent:
    print(tag.pronunciation)  # デル
    print(tag.lemma)          # 出る
    print(tag.pos)            # 動詞-自立
    print(tag.paradigm)       # 一段 (or conjugation type)
    print(tag.inflection)     # 基本形 (or conjugation form)

Note that only inflecting words have a paradigm/inflection. See the ChaSen manual:
https://osdn.net/projects/chasen-legacy/docs/chasen-2.4.0-manual-en.pdf/en/1/chasen-2.4.0-manual-en.pdf.pdf
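A minimal sketch of this naming scheme as a namedtuple, with paradigm and inflection left as None for non-inflecting words. `ChasenTag` and `parse_tag` are hypothetical names, not part of NLTK, and the raw tag is assumed to be a whitespace-separated string in the order pronunciation, lemma, POS, paradigm, inflection.

```python
from collections import namedtuple

# Field names follow the scheme proposed above; the type itself is hypothetical.
ChasenTag = namedtuple(
    "ChasenTag", ["pronunciation", "lemma", "pos", "paradigm", "inflection"]
)


def parse_tag(raw_tag):
    """Build a ChasenTag from a combined tag string (hypothetical helper)."""
    fields = raw_tag.split()
    if len(fields) < 3:
        raise ValueError("expected at least pronunciation, lemma and POS")
    # Non-inflecting words have no paradigm/inflection: pad with None.
    fields += [None] * (5 - len(fields))
    return ChasenTag(*fields[:5])


verb = parse_tag("デル 出る 動詞-自立 一段 基本形")
particle = parse_tag("ヲ を 助詞-格助詞-一般")
print(verb.pos, verb.paradigm, verb.inflection)
print(particle.pos, particle.paradigm)
```

Because a namedtuple is still a tuple, code that only needs the POS can read `tag.pos` without parsing the string again.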
