# Chapter 4: Part-of-Speech Tagging

## Default tagging

### Let's use the DefaultTagger! Write code to tag the simple phrase "Hello World".

In [1]:
s = "Hello World"
words = s.split()

from nltk.tag import DefaultTagger
tagger = DefaultTagger('NN')
print(tagger.tag(words))

[('Hello', 'NN'), ('World', 'NN')]


## Training a unigram or n-gram tagger

### Write code to train a unigram tagger on the first 3000 (already-tagged) sentences from the 'treebank' corpus.

In [2]:
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]

## You can also pass the 'cutoff' keyword argument (e.g. cutoff=3)
## to set a minimum frequency threshold.
tagger = UnigramTagger(train_sents)

## The untagged words
print(treebank.sents()[0])

## The same sentence, tagged by the trained tagger.
print(tagger.tag(treebank.sents()[0]))

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]


### Adapt the above code to use the BigramTagger and TrigramTagger in a backoff chain to gain some accuracy.

In [5]:
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger, DefaultTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

from tag_util import backoff_tagger
backoff = DefaultTagger('NN')
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=backoff)
tagger.evaluate(test_sents)

0.8812432549104252
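The `backoff_tagger` helper comes from `tag_util`, which is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming it simply trains each tagger class in order and hands the previously built tagger to the next one as its `backoff`. The `Stub*` classes are hypothetical stand-ins so the sketch runs without NLTK or a corpus.

```python
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    """Chain taggers so each one falls back to the tagger built before it."""
    for cls in tagger_classes:
        # Train the next tagger on the same sentences and let it defer
        # to the previous tagger whenever it cannot choose a tag.
        backoff = cls(train_sents, backoff=backoff)
    return backoff


# Hypothetical stand-ins that only record their backoff (assumption:
# real tagger classes accept (train_sents, backoff=...) in the same way).
class StubTagger:
    def __init__(self, train_sents, backoff=None):
        self.backoff = backoff


class StubUnigram(StubTagger):
    pass


class StubBigram(StubTagger):
    pass


chain = backoff_tagger([], [StubUnigram, StubBigram], backoff="default")
print(type(chain).__name__)          # StubBigram
print(type(chain.backoff).__name__)  # StubUnigram
print(chain.backoff.backoff)         # default
```

Under this assumption the last class in the list ends up outermost, so it is consulted first and the default tagger is consulted last.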

## Creating a model of likely word tags

### Using the 200 most frequent words in 'treebank' as keys, write code that builds a model mapping each word to its most frequent tag.

In [6]:
## For reference, here's the code for word_tag_model... take a look!
## Note the "limit=200" default argument in the function.
## 
## from nltk.probability import FreqDist, ConditionalFreqDist
## def word_tag_model(words, tagged_words, limit = 200):
##     fd = FreqDist(words)
##     cfd = ConditionalFreqDist(tagged_words)
##     most_freq = (word for word, count in fd.most_common(limit))
##     return dict((word, cfd[word].max()) for word in most_freq)

## Here's the actual code when using UnigramTagger in your backoff chain.

from tag_util import backoff_tagger, word_tag_model
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

model = word_tag_model(treebank.words(), treebank.tagged_words())
default_tagger = DefaultTagger('NN')
likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=likely_tagger)

tagger.evaluate(test_sents)

0.8812432549104252
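To see what the model actually contains, here is a small stand-alone sketch of the same idea on toy data. It uses `collections.Counter` as a stand-in for NLTK's `FreqDist` and `ConditionalFreqDist`; the function name `toy_word_tag_model` and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def toy_word_tag_model(words, tagged_words, limit=200):
    """Map each of the `limit` most frequent words to its most frequent tag."""
    word_counts = Counter(words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    most_freq = [word for word, _ in word_counts.most_common(limit)]
    return {word: tag_counts[word].most_common(1)[0][0] for word in most_freq}


# Made-up sample: "run" appears once as a noun and twice as a verb.
toy_tagged = [("the", "DT"), ("run", "NN"), ("the", "DT"),
              ("run", "VB"), ("run", "VB"), ("dog", "NN")]
toy_words = [word for word, _ in toy_tagged]

toy_model = toy_word_tag_model(toy_words, toy_tagged, limit=2)
print(toy_model)  # {'run': 'VB', 'the': 'DT'}
```

With `limit=2`, "dog" is dropped because it is not among the two most frequent words, and "run" maps to `VB` because that tag outnumbers `NN`.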

## Tagging with regular expressions, affix tagging

### After reading up on the RegexpTagger, write code that trains an AffixTagger on three-character prefixes.

### How do you think the AffixTagger works relative to the RegexpTagger?

In [12]:
from nltk.tag import AffixTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

prefix_tagger = AffixTagger(train_sents, affix_length=3)
prefix_tagger.evaluate(test_sents)

0.236088927260954
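One way to think about the question above: a RegexpTagger applies patterns you write by hand, while an AffixTagger learns a prefix (or suffix) table from tagged training data. The sketch below illustrates the learned-table idea in plain Python. It is not NLTK's implementation; the function names, the length check, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_prefix_model(tagged_sents, prefix_length=3):
    """Learn the most frequent tag for each prefix of the given length."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            # Skip words that are no longer than the prefix itself.
            if len(word) > prefix_length:
                counts[word[:prefix_length]][tag] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}


def tag_with_prefixes(words, model, prefix_length=3):
    """Tag each word by prefix lookup; unknown or short words get None."""
    tagged = []
    for word in words:
        tag = model.get(word[:prefix_length]) if len(word) > prefix_length else None
        tagged.append((word, tag))
    return tagged


# Made-up training data.
toy_train = [
    [("preview", "NN"), ("predict", "VB"), ("prepare", "VB")],
    [("disagree", "VB"), ("dislike", "VB")],
]
prefix_model = train_prefix_model(toy_train)
prefix_result = tag_with_prefixes(["presume", "distrust", "cat"], prefix_model)
print(prefix_result)
# [('presume', 'VB'), ('distrust', 'VB'), ('cat', None)]
```

The `None` for "cat" is where a backoff tagger would take over, which is why a prefix tagger on its own scores poorly but can still help inside a chain.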

## Training a Brill tagger

### Train a Brill tagger, using a backoff chain of NgramTagger subclasses as the initial tagger (pass it into train_brill_tagger()).

In [3]:
from nltk.tag import DefaultTagger
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:250]
test_sents = treebank.tagged_sents()[250:1000]
## Here we use fewer sentences than in the book;
## otherwise training and processing take a very long time.

from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')
initial_tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], 
                                backoff=default_tagger)
## print(initial_tagger.evaluate(test_sents))

from tag_util import train_brill_tagger

brill_tagger = train_brill_tagger(initial_tagger, train_sents)
brill_tagger.evaluate(test_sents)

0.8834016835743579

## Training the TnT tagger

### Write a code to train the TnT tagger on the training sentences.

In [2]:
## Old imports
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:250]
test_sents = treebank.tagged_sents()[250:1000]
## Here we use fewer sentences than in the book;
## otherwise training and processing take a very long time.

from nltk.tag import tnt

tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sents)
tnt_tagger.evaluate(test_sents)

0.8757176775307576


## Using WordNet for tagging

### Write code that uses the WordNet tagger as part of a backoff chain and check its accuracy.

In [6]:
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
train_sents = treebank.tagged_sents()[:250]
test_sents = treebank.tagged_sents()[250:1000]
## Here we use fewer sentences than in the book;
## otherwise training and processing take a very long time.

## Save taggers.py in the folder where you are working.
from taggers import WordNetTagger
from tag_util import backoff_tagger
wn_tagger = WordNetTagger()

tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=wn_tagger)
tagger.evaluate(test_sents)

0.7569835369091875

## Tagging proper names

### Write code that uses the NamesTagger from taggers.py. Think about where you would use this in a backoff chain.

In [7]:
from taggers import NamesTagger
nt = NamesTagger()
nt.tag(['Jacob'])

[('Jacob', 'NNP')]
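To reason about where a name lookup belongs in a backoff chain, it helps to see the deferral step spelled out. The sketch below is a plain-Python illustration, assuming a names tagger checks a word against a list of names and otherwise passes the word on. `LookupTagger` and `ConstantTagger` are hypothetical classes, not the ones in taggers.py.

```python
class ConstantTagger:
    """Assigns the same tag to every word (stand-in for a default tagger)."""

    def __init__(self, tag):
        self.constant_tag = tag

    def tag_word(self, word):
        return self.constant_tag


class LookupTagger:
    """Tags words found in a fixed name set; defers all others to a backoff."""

    def __init__(self, names, name_tag="NNP", backoff=None):
        self.names = {name.lower() for name in names}
        self.name_tag = name_tag
        self.backoff = backoff

    def tag_word(self, word):
        if word.lower() in self.names:
            return self.name_tag
        if self.backoff is not None:
            # No opinion on this word: let the next tagger decide.
            return self.backoff.tag_word(word)
        return None


names_chain = LookupTagger(["Jacob"], backoff=ConstantTagger("NN"))
names_result = [(w, names_chain.tag_word(w)) for w in ["Jacob", "table"]]
print(names_result)  # [('Jacob', 'NNP'), ('table', 'NN')]
```

A lookup like this only ever answers for words on its list, so it ignores context and leaves every other word to whichever tagger sits behind it.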

## Classifier-based tagging

### Write code to train the classifier-based tagger on the training sentences.

In [6]:
## Old imports...
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

from nltk.tag.sequential import ClassifierBasedPOSTagger
tagger = ClassifierBasedPOSTagger(train=train_sents)
tagger.evaluate(test_sents)

0.9309734513274336

## Training a tagger with NLTK-Trainer

### Read about command-line NLTK-Trainer (https://github.com/japerk/nltk-trainer) and get some practice in.

### Train a classifier-based tagger with the treebank corpus.