nltk.pos_tag performance #1110

stevenbird · 2015-09-05T23:10:02Z

NLTK's built-in POS tagger doesn't perform so well:

>>> from nltk import pos_tag, word_tokenize
>>> text = word_tokenize("The quick brown fox jumps over the lazy dog")
>>> pos_tag(text)
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

We need a better built-in model.

cf. http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag

alvations · 2015-09-06T20:38:35Z

@stevenbird, I think the simplest tagger to "reimplement" with top-quality accuracy is actually honnibal's perceptron tagger.

Next, the simplest to implement, though with unknown accuracy, would be n-gram taggers with backoff, trained either on the tagged corpora already in NLTK or as a redistributable model built from LDC's corpora (something like the Spaghetti tagger).
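To make the backoff idea concrete, here is a minimal, dependency-free sketch: a bigram tagger (previous tag + word) that falls back to a unigram tagger (word only), and then to a default tag. The training sentences, the `train`/`tag` helpers and the `NN` default are all invented for illustration; this is not NLTK's implementation, which exposes the same idea through the `backoff` argument of `UnigramTagger` and `BigramTagger`.

```python
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration only (not a real corpus).
TRAIN = [
    [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")],
    [("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("jumps", "VBZ")],
    [("high", "JJ"), ("jumps", "NNS"), ("win", "VBP")],
]


def train(sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag_ in sent:
            unigram[word][tag_] += 1
            bigram[(prev, word)][tag_] += 1
            prev = tag_
    return unigram, bigram


def tag(tokens, unigram, bigram, default="NN"):
    """Tag left to right: try the bigram context, then the unigram, then the default."""
    tagged, prev = [], "<s>"
    for token in tokens:
        word = token.lower()
        if (prev, word) in bigram:
            best = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default  # unseen word: last resort
        tagged.append((token, best))
        prev = best
    return tagged


UNIGRAM, BIGRAM = train(TRAIN)
print(tag(["The", "lazy", "fox", "jumps"], UNIGRAM, BIGRAM))
print(tag(["high", "jumps"], UNIGRAM, BIGRAM))
```

With this toy data, "jumps" comes out as `VBZ` after a noun and as `NNS` after an adjective, which is the kind of context sensitivity a plain unigram lookup cannot provide.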

There's also a hand-crafted tagger with unknown accuracy in nltk.parse.malt. Another rule-based tagger to consider is the one from CLIPS' pattern.en. The major caveat of pattern's tagger is its dependence on a lexicon: https://raw.githubusercontent.com/clips/pattern/master/pattern/text/en/en-lexicon.txt

My last suggestion is a pre-trained Brill tagger; pattern has a list of rules: https://github.com/clips/pattern/blob/master/pattern/text/en/en-context.txt
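For readers unfamiliar with the approach: a Brill-style tagger starts from an initial tagging and then patches it with an ordered list of contextual rules. The sketch below shows only that rule-application step. The tuple format and the two rules are hypothetical, written by hand for this one sentence; they are not taken from pattern's rule file and would be far too crude for real text.

```python
# Hypothetical rule format: (from_tag, to_tag, condition, argument).
# Both rules are hand-written for this example only.
RULES = [
    ("NN", "JJ", "NEXTTAG", "NN"),    # NN directly before another NN -> JJ
    ("NNS", "VBZ", "PREVTAG", "NN"),  # NNS directly after an NN -> VBZ
]


def apply_rules(tagged, rules):
    """Apply each rule in order, scanning the sentence left to right."""
    tags = [t for _, t in tagged]
    for from_tag, to_tag, condition, argument in rules:
        for i, current in enumerate(tags):
            if current != from_tag:
                continue
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i + 1 < len(tags) else None
            if (condition == "PREVTAG" and prev_tag == argument) or (
                condition == "NEXTTAG" and next_tag == argument
            ):
                tags[i] = to_tag  # later positions and rules see the patched tag
    return [(word, t) for (word, _), t in zip(tagged, tags)]


# Initial tagging as reported at the top of this issue.
initial = [
    ("The", "DT"), ("quick", "NN"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "NNS"), ("over", "IN"), ("the", "DT"), ("lazy", "NN"),
    ("dog", "NN"),
]
print(apply_rules(initial, RULES))
```

A real Brill tagger learns its rules from a tagged corpus rather than having them written by hand; NLTK ships the machinery for this in `nltk.tag.brill`.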

stevenbird · 2015-09-07T03:18:47Z

@alvations: thanks for those suggestions. It looks like Honnibal's tagger uses an MIT license. @syllog1sm is there any objection to including this tagger in NLTK?

honnibal · 2015-09-07T07:30:16Z

Sure, go ahead. You might want my transition-based parser too, which is also under MIT I think.

stevenbird · 2015-09-07T18:02:32Z

@honnibal – great, thanks very much

stevenbird · 2015-09-15T01:11:16Z

Another example of bad behaviour: http://stackoverflow.com/questions/32571486/nltk-red-not-recognised-as-an-adjective

stevenbird · 2015-10-03T06:34:06Z

Resolved by #1122

stevenbird added this to the 3.1 milestone Sep 5, 2015

stevenbird mentioned this issue Sep 8, 2015

Perceptron tagger #1122

Closed

stevenbird closed this as completed Oct 3, 2015
