# Tagging
## or part-of-speech tagging
POS tagging, or simply tagging, is the process of assigning each word in a text to a word class, also called a lexical category or part of speech.

For information on tagging, see http://www.nltk.org/book/ch05.html

For a basic example of POS tagging and its place in the NLP workflow, see nltk.ipynb

In [18]:
import nltk
from nltk.data import load

## Example 1: Lookup tagger
Use a lookup table of tagged words to tag the text first; when the lookup fails for a word, fall back to a default tagger.
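The idea can be sketched in plain Python before turning to NLTK. This is only an illustration: the `LOOKUP` table and the `lookup_tag` helper are hypothetical names made up for this sketch, and a real table would be built from a tagged corpus rather than written by hand.

```python
# Hypothetical toy lookup table; a real one would come from a tagged corpus.
LOOKUP = {"the": "AT", "of": "IN", "said": "VBD"}


def lookup_tag(tokens, table, default="NN"):
    """Tag each token from the table; use the default tag when the word is unknown."""
    return [(tok, table.get(tok, default)) for tok in tokens]


tagged = lookup_tag(["the", "jury", "said"], LOOKUP)
print(tagged)
```

Here "jury" is missing from the table, so it receives the default tag, while the other two words get their table entries.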

We will use the Brown corpus from NLTK because it is already tagged for part of speech.

In [5]:
from nltk.corpus import brown

In [6]:
brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [8]:
fd = nltk.FreqDist(brown.words(categories='news'))
fd.most_common(15)

[('the', 5580),
 (',', 5188),
 ('.', 4030),
 ('of', 2849),
 ('and', 2146),
 ('to', 2116),
 ('a', 1993),
 ('in', 1893),
 ('for', 943),
 ('The', 806),
 ('that', 802),
 ('``', 732),
 ('is', 732),
 ('was', 717),
 ("''", 702)]

In [10]:
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
cfd

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'The': FreqDist({'AT': 775, 'AT-HL': 3, 'AT-TL': 28}),
                     'Fulton': FreqDist({'NP': 4, 'NP-TL': 10}),
                     'County': FreqDist({'NN-TL': 35}),
                     'Grand': FreqDist({'FW-JJ-TL': 1, 'JJ-TL': 5}),
                     'Jury': FreqDist({'NN-TL': 2}),
                     'said': FreqDist({'VBD': 382, 'VBN': 20}),
                     'Friday': FreqDist({'NR': 41}),
                     'an': FreqDist({'AT': 300}),
                     'investigation': FreqDist({'NN': 9}),
                     'of': FreqDist({'IN': 2716, 'IN-HL': 5, 'IN-TL': 128}),
                     "Atlanta's": FreqDist({'NP$': 4}),
                     'recent': FreqDist({'JJ': 20}),
                     'primary': FreqDist({'JJ': 4, 'NN': 13}),
                     'election': FreqDist({'NN': 38}),
                     'produced': FreqDist({'VBD': 5, 'VBN': 1}),
                     '``': FreqDist({'`

In [11]:
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
likely_tags

{"''": "''",
 '(': '(',
 ')': ')',
 ',': ',',
 '--': '--',
 '.': '.',
 ':': ':',
 ';': '.',
 '?': '.',
 'A': 'AT',
 'But': 'CC',
 'He': 'PPS',
 'I': 'PPSS',
 'In': 'IN',
 'It': 'PPS',
 'Mr.': 'NP',
 'Mrs.': 'NP',
 'New': 'JJ-TL',
 'President': 'NN-TL',
 'The': 'AT',
 '``': '``',
 'a': 'AT',
 'about': 'IN',
 'after': 'IN',
 'against': 'IN',
 'all': 'ABN',
 'also': 'RB',
 'an': 'AT',
 'and': 'CC',
 'any': 'DTI',
 'are': 'BER',
 'as': 'CS',
 'at': 'IN',
 'be': 'BE',
 'been': 'BEN',
 'before': 'IN',
 'but': 'CC',
 'by': 'IN',
 'can': 'MD',
 'could': 'MD',
 'first': 'OD',
 'for': 'IN',
 'from': 'IN',
 'had': 'HVD',
 'has': 'HVZ',
 'have': 'HV',
 'he': 'PPS',
 'her': 'PP$',
 'him': 'PPO',
 'his': 'PP$',
 'home': 'NN',
 'in': 'IN',
 'into': 'IN',
 'is': 'BEZ',
 'it': 'PPS',
 'its': 'PP$',
 'last': 'AP',
 'made': 'VBN',
 'more': 'AP',
 'new': 'JJ',
 'no': 'AT',
 'not': '*',
 'of': 'IN',
 'on': 'IN',
 'one': 'CD',
 'only': 'AP',
 'or': 'CC',
 'other': 'AP',
 'out': 'RP',
 'over': 'IN',
 'said':
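What the `cfd[word].max()` step does can be reproduced with the standard library alone. The sketch below is a rough equivalent under stated assumptions: `toy_tagged_words` is a small made-up list standing in for `brown.tagged_words(categories='news')`, and the other names are chosen for illustration only.

```python
from collections import Counter, defaultdict

# Made-up sample standing in for brown.tagged_words(categories='news').
toy_tagged_words = [
    ("The", "AT"),
    ("said", "VBD"), ("said", "VBD"), ("said", "VBN"),
    ("primary", "NN"), ("primary", "JJ"), ("primary", "NN"),
]

# Count how often each tag occurs for each word (the role of ConditionalFreqDist).
tag_counts = defaultdict(Counter)
for word, tag in toy_tagged_words:
    tag_counts[word][tag] += 1

# Keep only the most frequent tag per word (the role of FreqDist.max()).
toy_likely_tags = {
    word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()
}
print(toy_likely_tags)
```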

In [13]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags)

brown_tagged_sents = brown.tagged_sents(categories='news')
#brown_sents = brown.sents(categories='news')
baseline_tagger.evaluate(brown_tagged_sents)

0.45578495136941344
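The score is the fraction of tokens whose predicted tag matches the gold tag. A lookup tagger with no backoff assigns `None` to every word outside its table, and those tokens always count as errors, which is why the score is low. A minimal sketch of that computation, using a hypothetical `token_accuracy` helper and made-up toy data rather than NLTK itself:

```python
def token_accuracy(model, gold_sents):
    """Fraction of tokens whose looked-up tag equals the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            predicted = model.get(word)  # None when the word is not in the table
            if predicted == gold_tag:
                correct += 1
            total += 1
    return correct / total


# Made-up lookup table and gold-standard sentence for illustration.
toy_model = {"the": "AT", "said": "VBD"}
toy_gold = [[("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("it", "PPS")]]

acc = token_accuracy(toy_model, toy_gold)
print(acc)
```

In recent NLTK releases the `evaluate` method appears to have been deprecated in favor of `accuracy`; check the documentation for the installed version before relying on either name.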

In [14]:
backoff_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
backoff_tagger.evaluate(brown_tagged_sents)

0.5817769556656125

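#### To use the default pos_tag model as a backoff, first load the model from its pickle file. See below.

Note that this model is named for the Penn Treebank, so it presumably emits Penn Treebank tags rather than Brown tags; only tags that coincide between the two tagsets can count as correct in the evaluation.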

In [20]:
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_pos_tagger = load(_POS_TAGGER)  

backoff_tagger2 = nltk.UnigramTagger(model=likely_tags, backoff=default_pos_tagger)
backoff_tagger2.evaluate(brown_tagged_sents)

0.7573045328878015

In [None]:
#backoff_tagger2.tag(...)