# Creating a PoS Tagger

Several of the corpora included with NLTK have been **tagged** with part-of-speech (PoS) labels. A tagged corpus is represented as a sequence of tagged tokens: tuples that pair each token (word) with its tag (part of speech). In this notebook, we will use `brown`, which is one such tagged corpus.

We can train a classifier that uses word endings to classify words into their parts of speech. First, we need to work out which suffixes are most informative for PoS tagging. We can begin by simply finding out what the most common suffixes are.

In [1]:
from nltk.corpus import brown
from nltk import FreqDist

suffix_f_dist = FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_f_dist[word[-1:]] += 1
    suffix_f_dist[word[-2:]] += 1
    suffix_f_dist[word[-3:]] += 1

suffix_f_dist

FreqDist({'e': 202946, ',': 175002, '.': 152999, 's': 128722, 'd': 105687, 't': 94459, 'he': 92084, 'n': 87889, 'a': 74912, 'of': 72978, ...})

In [2]:
common_suffixes = [suffix for (suffix, _) in suffix_f_dist.most_common(100)]
common_suffixes[:10]

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of']
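One quirk of the counting loop above is worth knowing about. For a token shorter than three characters, the slices `word[-1:]`, `word[-2:]` and `word[-3:]` overlap, so a one-character token such as `,` is counted three times and a two-character word such as `of` is counted twice under its own spelling. The sketch below is one possible way to count each suffix once per token. It uses `collections.Counter` and a toy word list in place of `brown.words()`, and the name `count_suffixes` is hypothetical. The rest of this notebook keeps using the original counts.

```python
from collections import Counter


def count_suffixes(words, max_len=3):
    """Count every suffix of length 1..max_len once per token."""
    counts = Counter()
    for word in words:
        word = word.lower()
        # Stop at len(word) so short tokens are not counted repeatedly.
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]] += 1
    return counts


# Toy stand-in for brown.words(); illustrative only.
toy_words = ["The", "jury", "said", ",", "of"]
toy_counts = count_suffixes(toy_words)
print(toy_counts[","], toy_counts["of"])
```

With this guard, punctuation and very short function words no longer get inflated counts, which may change which suffixes land in the top 100.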

Next, we’ll define a feature extractor function which checks a given word for these suffixes.

In [3]:
def pos_suffix_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features


test_features = pos_suffix_features('test')

print('POS SUFFIX FEATURES FOR THE WORD "TEST"')
print('endswith(e):', test_features['endswith(e)'])
print('endswith(t):', test_features['endswith(t)'])

POS SUFFIX FEATURES FOR THE WORD "TEST"
endswith(e): False
endswith(t): True


Now that we’ve defined our feature extractor, we can use it to train a decision tree classifier.

First, we use the `tagged_words()` method to read the `brown` corpus as a list of tagged tokens.

In [4]:
tagged_words = brown.tagged_words(categories='news')
tagged_words[:10]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN')]

We then extract all suffix features for each word in the list of tagged tokens.

In [5]:
feature_sets = [(pos_suffix_features(word), tag) for (word, tag) in tagged_words]

print('POS SUFFIX FEATURES FOR THE WORD "THE"')
print('endswith(e):', feature_sets[0][0]['endswith(e)'])
print('endswith(t):', feature_sets[0][0]['endswith(t)'])

POS SUFFIX FEATURES FOR THE WORD "THE"
endswith(e): True
endswith(t): False


In [6]:
from nltk import DecisionTreeClassifier
from nltk.classify import accuracy

# Hold out the first 10% of the feature sets for testing; train on the rest.
cutoff = int(len(feature_sets) * 0.1)
train_set, test_set = feature_sets[cutoff:], feature_sets[:cutoff]

NLTK is a toolkit meant for learning and is not optimized for speed. The next cell, which trains the classifier, can take up to 10 minutes to run, depending on the processor. For a practical classifier, use a library such as `scikit-learn` rather than the classifiers that come with NLTK.

In [7]:
classifier = DecisionTreeClassifier.train(train_set)

In [8]:
round(accuracy(classifier, test_set), 2)

0.63

In [9]:
classifier.classify(pos_suffix_features('goblins'))

'NNS'

A nice feature of decision tree models is that they are often easy to interpret. We can instruct NLTK to print them out as pseudocode:

In [10]:
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(is) == False: return 'PP$'
      if endswith(is) == True: return 'BEZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



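To improve the classifier, we can add contextual features. The extractor below takes a whole sentence (a list of words) and the index `i` of the target word, so that it can also look at the preceding word: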

```py
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features
```

Then, instead of working with tagged words, we work with tagged sentences:

```py
tagged_sents = brown.tagged_sents(categories='news')
```
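As a rough sketch of how the sentence-based extractor could be turned into feature sets, the block below strips the tags from each sentence, then pairs the features for every position with that position's tag. It uses a small hand-written list (`toy_tagged_sents`, with illustrative tags) in place of `brown.tagged_sents(categories='news')`, and the name `context_feature_sets` is an assumption of this example.

```python
def pos_features(sentence, i):
    """Suffix and previous-word features for the word at index i."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features


# Toy stand-in for brown.tagged_sents(categories='news'); illustrative only.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("An", "AT"), ("investigation", "NN")],
]

context_feature_sets = []
for tagged_sent in toy_tagged_sents:
    # The extractor expects plain words, so drop the tags first.
    words = [word for (word, _) in tagged_sent]
    for i, (_, tag) in enumerate(tagged_sent):
        context_feature_sets.append((pos_features(words, i), tag))

print(context_feature_sets[1])
```

The resulting list has the same `(features, tag)` shape as `feature_sets`, so it can be split and passed to a classifier in the same way.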

The classifier can be improved further by adding more features, such as `prev-tag`, the tag assigned to the preceding word.
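A `prev-tag` feature needs the tags already assigned to earlier words in the sentence. One way to sketch this is to pass the extractor a `history` list of those tags. In the block below, the function name `pos_features_with_history`, the `history` parameter and the toy sentence are assumptions of this example. During training the history can be filled with the gold tags, as shown here; when tagging new text it would hold the classifier's own earlier predictions instead.

```python
def pos_features_with_history(sentence, i, history):
    """Suffix, previous-word and previous-tag features for word i."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-tag"] = history[i - 1]
    return features


# Toy tagged sentence; tags are illustrative only.
toy_tagged_sent = [("The", "AT"), ("jury", "NN"), ("said", "VBD")]
words = [word for (word, _) in toy_tagged_sent]

history = []
history_feature_sets = []
for i, (_, tag) in enumerate(toy_tagged_sent):
    features = pos_features_with_history(words, i, history)
    history_feature_sets.append((features, tag))
    # Record the gold tag so the next word can see it as prev-tag.
    history.append(tag)

print(history_feature_sets[1][0]["prev-tag"])
```

Because each prediction now depends on the previous one, tagging has to proceed word by word through a sentence rather than classifying every token independently.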