Japanese Corpus readers do not return properly formatted (word, tag) tuples #1384

brendanpatrickmurphy · 2016-04-29T20:41:46Z

There are still problems with how POS Tagging works for these corpora. This afternoon, I loaded up JEITA, and called jeita.tagged_words(). The problem is that the second half of each tuple in JEITA doesn't contain a tag that is easy to test against a POS-tagger. The second half of each tuple contains both orthographic information (each word in the corpus has a spelling for each syllabary in Japanese) and the tag information, so a word tagged as a noun won't have the same tag as another word tagged as a noun. This leads to quite a few problems when testing a tagger against the corpus.

Open up one of the .chasen files and you'll see what I mean. Here's line 6 of a0010.chasen in the jeita.zip file.

出るデル出る動詞-自立一段基本形

There are four (or maybe five) elements here. The first three are ways of writing the word /deru/, and the last is the tag (verb, independent, ichidan conjugation, plain form).

So I wrote this loop:

from nltk.corpus import jeita

tagged_sents = jeita.tagged_sents()
for sent in tagged_sents:
    for (word, tag) in sent:
        print(word)
        print(tag)

And here's some sample output:

出る
デル出る動詞-自立一段基本形

As you can see, the tag includes two forms of orthography, which throws things off.
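As a stopgap for tagger evaluation, the orthographic fields can be stripped from the tag before comparing. The sketch below assumes the tag string is tab-separated in the order reading, base form, POS, conjugation type, conjugation form (the tabs are not visible in the output pasted above). `pos_only` is a hypothetical helper, not part of NLTK.

```python
def pos_only(tag, sep="\t"):
    """Drop the two leading orthographic fields and keep only the POS fields."""
    fields = tag.split(sep)
    # Assumption: fields[0] is the reading and fields[1] is the base form.
    # Empty fields (e.g. for non-inflecting words) are skipped.
    return " ".join(f for f in fields[2:] if f)


# Sample tag built from the a0010.chasen line quoted above, with assumed tabs.
sample_tag = "デル\t出る\t動詞-自立\t一段\t基本形"
print(pos_only(sample_tag))
```

With this, two words tagged as the same kind of verb end up with the same tag string regardless of how they are written.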

(Also, as a side note, it would be really great if we could have a "simple" POS tag version of these files, one that leaves out additional categories like "plain form" or which group (ichidan/godan) the verb belongs to. I don't think many parsers care much about those distinctions, but doing this would probably take help from someone fluent in Japanese.)

I can check again with KNBC, the other Japanese corpus included in NLTK, but it does even funkier things with tags last I checked.

alvations · 2016-04-29T21:40:02Z

Actually, the 動詞-自立一段基本形 tag is not unlike the hierarchical POS tags used for other languages; you can break them up or collapse them. See, e.g., page 22 on the NTU-MC, and also this: https://www.sketchengine.co.uk/xdocumentation/wiki/tagsets/jpwac

And the two orthographic forms are actually good because they carry additional information. The NLTK API possibly needs some more work to split them up, but the information in the corpus is good and we shouldn't remove it.

In the case of JEITA, it's a "morphologically" tagged corpus, so it's a little different from how we conceive of a POS-tagged corpus for European languages. Masato Hagiwara has a good blog post on this: http://lilyx.net/nltk-japanese-corpus/

brendanpatrickmurphy · 2016-04-29T21:48:36Z

I would still hold that it's an error to include orthographic information as part of a POS tag.

alvations · 2016-04-29T21:55:17Z

Agreed that the API should be better. Maybe a namedtuple for the tag would be a better interface:

for word, tag in sent:
    print(tag.form1)       # デル
    print(tag.form2)       # 出る
    print(tag.coarse_pos)  # 動詞-自立
    print(tag.fine_pos)    # 動詞-自立 一段 基本形

brendanpatrickmurphy · 2016-04-29T22:30:56Z

That looks much better to me.

nschneid · 2016-04-29T22:37:45Z

I agree with @brendanpatrickmurphy that it's confusing from an NLTK perspective to mix orthographic information in with the POS tag.

It seems like the word form flexibility is similar to some corpora which provide lemmas. How is this handled in other NLTK corpora? The corpus readers I'm familiar with include:

SemCor: semcor.tagged_chunks(tag='pos') vs. semcor.tagged_chunks(tag='sense'). The latter returns a WordNet Lemma object as the tag.
CHILDESCorpusReader: words(), tagged_words(), etc. have a boolean stem argument.

So it might make sense to have arguments indicating the type of POS and/or word-form desired: e.g., tagged_words(form='form1', pos='fine').
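A rough sketch of what such arguments could look like, written as a standalone generator over (word, tag) pairs rather than as a change to the corpus reader. The name `select_tags`, the accepted values for `form` and `pos`, and the tab-separated tag layout (reading, base form, POS, conjugation type, conjugation form) are all assumptions for illustration, not an existing NLTK API.

```python
def select_tags(tagged_words, form="surface", pos="fine", sep="\t"):
    """Yield (word, tag) pairs with the requested word form and POS granularity.

    form: "surface" (token as written), "form1" or "form2"
    pos:  "coarse" (first POS field only) or "fine" (all POS fields joined)
    """
    for word, tag in tagged_words:
        # Assumption: the first two fields are orthographic, the rest are POS.
        form1, form2, *pos_fields = tag.split(sep)
        pos_fields = [f for f in pos_fields if f]  # skip empty fields
        chosen = {"surface": word, "form1": form1, "form2": form2}[form]
        if pos == "coarse":
            label = pos_fields[0] if pos_fields else ""
        else:
            label = " ".join(pos_fields)
        yield chosen, label


# Sample pair built from the example earlier in this thread, with assumed tabs.
sample = [("出る", "デル\t出る\t動詞-自立\t一段\t基本形")]
print(list(select_tags(sample, form="form1", pos="coarse")))
```

Keeping the default at the surface form with the fine tag would leave existing callers with the full information while giving tagger evaluations a plain POS string.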

fcbond · 2017-10-04T09:40:52Z

A possibly better naming scheme would be:

for word, tag in sent:
    print(tag.pronunciation)  # デル
    print(tag.lemma)          # 出る
    print(tag.pos)            # 動詞-自立
    print(tag.paradigm)       # 一段  (or conjugation type)
    print(tag.inflection)     # 基本形 (or conjugation form)

Note that only inflecting words have a paradigm/inflection; see the ChaSen manual:
https://osdn.net/projects/chasen-legacy/docs/chasen-2.4.0-manual-en.pdf/en/1/chasen-2.4.0-manual-en.pdf.pdf
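One way this naming scheme might look in practice, with `paradigm` and `inflection` left as `None` for non-inflecting words. This is only a sketch: `ChasenTag` and `parse_chasen_tag` are hypothetical names, the tag is assumed to be tab-separated in the field order shown above, and the noun row is a made-up illustration rather than a line taken from the corpus.

```python
from typing import NamedTuple, Optional


class ChasenTag(NamedTuple):
    pronunciation: Optional[str]
    lemma: Optional[str]
    pos: Optional[str]
    paradigm: Optional[str] = None    # conjugation type, inflecting words only
    inflection: Optional[str] = None  # conjugation form, inflecting words only


def parse_chasen_tag(tag, sep="\t"):
    """Split a raw tag string into named fields, padding missing ones with None."""
    fields = [f or None for f in tag.split(sep)]  # empty string -> None
    fields += [None] * (5 - len(fields))          # pad short rows to five fields
    return ChasenTag(*fields[:5])


verb = parse_chasen_tag("デル\t出る\t動詞-自立\t一段\t基本形")
noun = parse_chasen_tag("ネコ\t猫\t名詞-一般")  # illustrative row, not from JEITA
print(verb.pos, verb.paradigm, verb.inflection)
print(noun.pos, noun.paradigm, noun.inflection)
```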

brendanpatrickmurphy mentioned this issue Apr 29, 2016

Adding Japanese tagged & parsed corpora #123

Closed

nschneid added the corpus label Apr 29, 2016

alvations added the enhancement, good first issue, and nice idea labels Oct 4, 2017
