# Creating a Part-of-Speech Tagger
In this notebook I'll train a classifier to determine which suffixes are most informative for POS tagging. 

### What is POS tagging?
Part-of-Speech tagging (POS tagging for short) is the task of labelling each word with its appropriate part of speech, such as noun, verb, adjective, adverb, or pronoun. These word classes (also known as lexical categories) are useful for many language processing tasks. 

### Applications: Text-to-Speech
You might be wondering why POS tagging is needed. Let's look at one application. The word "refuse" can be either a verb or a noun: refUSE is a verb meaning _deny_, while REFuse is a noun meaning _trash_. The two are spelled the same but pronounced differently, so we need to know which one is being used in order to pronounce the text correctly. This is why text-to-speech applications perform POS tagging. 
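
As a rough illustration of the idea, a text-to-speech system could look up the pronunciation by word and tag together. The table and the `pronounce` helper below are hypothetical; the tags `VB` (verb) and `NN` (noun) follow the Brown tagset used later in this notebook.

```python
# Hypothetical pronunciation table keyed by (word, Brown-style POS tag).
# The stress markings reuse the notation from the text above.
PRONUNCIATIONS = {
    ("refuse", "VB"): "refUSE",  # verb: to deny
    ("refuse", "NN"): "REFuse",  # noun: trash
}

def pronounce(word, tag):
    """Return the stress pattern for (word, tag), or the word unchanged if unknown."""
    return PRONUNCIATIONS.get((word.lower(), tag), word)

print(pronounce("refuse", "VB"))  # refUSE
print(pronounce("refuse", "NN"))  # REFuse
```

Without the tag, the lookup key would be ambiguous, which is exactly the problem a tagger solves.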

## Train the POS tagger using the [Brown](https://www.nltk.org/book/ch02.html#tab-brown-sources) corpus

### Why use the Brown Corpus? 
Because the Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. By the late 70s the tagging was nearly perfect. [[Source](https://en.wikipedia.org/wiki/Part-of-speech_tagging#The_Brown_Corpus)]

* It was the first of the modern, computer readable general corpora.
* For a long time, the Brown and LOB (British) corpora were the only ones easily available online, so many studies have been done on them.
* Studying the same data allows comparison of findings without having to take into consideration possible variation caused by the use of different data. 
* It consists of about 1 million words of American English text (printed in 1961), made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

### Let's briefly explore the Brown corpus

In [2]:
from nltk.corpus import brown
from nltk import FreqDist

brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [3]:
brown.sents(categories=['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction'])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [4]:
brown.words()[:20]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that']

In [5]:
len(brown.words())

1161192

In [6]:
brown.readme().replace('\n', ' ')

'BROWN CORPUS  A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.  by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA  Revised 1971, Revised and Amplified 1979  http://www.hit.uib.no/icame/brown/bcm.html  Distributed with the permission of the copyright holder, redistribution permitted. '

In [7]:
brown.tagged_words()[:20] # List of tuples which contain the word and its POS tag

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS')]

## Let's find out what the most common suffixes are
We can train a classifier to work out which suffixes are most informative for POS tagging.

Before training a classifier, we must first decide which features to use. Let's use word suffixes, in particular the **2-letter suffix** and the **3-letter suffix**. The 2-letter suffix is a good indicator of past-tense verbs ending in “-ed”, and the 3-letter suffix helps recognize present participles ending in “-ing”. The counting code below also tracks 1-letter suffixes.
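
To see why these suffixes are informative but not sufficient on their own, here is a toy rule-based guesser. It is only a sketch: `guess_tag` is a hypothetical helper, not part of NLTK, and the tags `VBG`, `VBD`, and `NN` are Brown tags for present participle, past-tense verb, and singular noun.

```python
def guess_tag(word):
    """Guess a Brown-style tag from the suffix alone (toy rule, not a trained model)."""
    w = word.lower()
    if w[-3:] == "ing":
        return "VBG"  # present participle
    if w[-2:] == "ed":
        return "VBD"  # past-tense verb
    return "NN"       # fall back to singular noun

for word in ["walked", "running", "thing", "bed"]:
    print(word, guess_tag(word))
# walked VBD
# running VBG
# thing VBG   (wrong: "thing" is a noun)
# bed VBD     (wrong: "bed" is a noun)
```

Hand-written rules like these break quickly, which is the motivation for letting a classifier learn which suffixes matter.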

I'd like to note that we can do better by also looking at the word itself, the word before and the word after. However, for the scope of this project we'll move forward with just the suffixes. 


### Create a Frequency Distribution for suffixes

In [8]:
from nltk import FreqDist

suffix_fdist = FreqDist()

# Need a refresher on python array slice notation? 
# Visit https://stackoverflow.com/questions/509211/understanding-slice-notation
for word in brown.words():
    suffix_fdist[word[-1:]] += 1 # Keep a count of suffixes containing only one letter
    suffix_fdist[word[-2:]] += 1 # Keep a count of suffixes containing two letters
    suffix_fdist[word[-3:]] += 1 # Keep a count of suffixes containing three letters
    
suffix_fdist

FreqDist({'e': 202808, ',': 175002, '.': 152999, 's': 128590, 'd': 105493, 't': 94237, 'n': 87776, 'he': 86119, 'of': 72314, 'a': 70852, ...})
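
One detail of the loop above is easy to miss: for words shorter than three letters, the slices overlap, so the same suffix is counted more than once per word. That may be part of why short words such as "of" rank so high. The sketch below reproduces the loop on a toy token list with `collections.Counter` (which `FreqDist` subclasses) and compares it with a variant that counts each distinct suffix once per word. The token list and the deduplicated variant are assumptions for illustration, not what this notebook uses.

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "a", "mat"]  # toy data, not the Brown corpus

raw = Counter()    # same counting scheme as the loop above
dedup = Counter()  # alternative: each distinct suffix counted once per word
for word in tokens:
    slices = [word[-1:], word[-2:], word[-3:]]
    raw.update(slices)
    dedup.update(set(slices))

print(raw["a"], dedup["a"])    # 3 1  ("a" is its own 1-, 2- and 3-letter suffix)
print(raw["on"], dedup["on"])  # 2 1
print(raw["at"], dedup["at"])  # 3 3  (longer words are unaffected)
```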

### Let's preview some of the most common suffixes

In [9]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
common_suffixes

['e',
 ',',
 '.',
 's',
 'd',
 't',
 'n',
 'he',
 'of',
 'a',
 'the',
 'y',
 'r',
 'to',
 'in',
 'f',
 'o',
 'ed',
 'nd',
 'is',
 'on',
 'l',
 'g',
 'and',
 'ng',
 'er',
 'ing',
 'as',
 'h',
 'at',
 'es',
 'or',
 're',
 '``',
 "''",
 'an',
 'm',
 ';',
 'ly',
 'I',
 'it',
 'ion',
 'en',
 'al',
 '?',
 'nt',
 'be',
 'hat',
 'st',
 'th',
 'his',
 'll',
 'le',
 'ce',
 'ts',
 've',
 'me',
 'by',
 "'",
 'se',
 'ut',
 'was',
 'ent',
 'ch',
 'k',
 'w',
 'ld',
 'for',
 '`',
 'rs',
 'ted',
 'ere',
 'ne',
 'her',
 'ns',
 'ith',
 'ad',
 'ry',
 ')',
 '(',
 'The',
 'te',
 '--',
 'ay',
 'ty',
 'ot',
 'p',
 'nce',
 'He',
 "'s",
 'ter',
 'om',
 'ss',
 ':',
 'are',
 'ers',
 'uld',
 'had',
 'ey',
 'ow']

# Feature Extraction
### Define a feature extractor function that checks a given word for these suffixes

Feature extraction functions highlight some of the properties of our data, but they also hide every other property. The classifier relies exclusively on the highlighted properties when deciding how to label inputs. In this case, the classifier will make its decisions based only on which of the common suffixes (if any) a given word has.
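
A small sketch makes this loss of information concrete. With a hypothetical three-suffix list (`TOY_SUFFIXES`, chosen only for illustration), two words with different Brown tags, "cats" (`NNS`) and "is" (`BEZ`), produce identical features, so a classifier that sees only these features cannot tell them apart.

```python
TOY_SUFFIXES = ["ed", "ing", "s"]  # hypothetical short list, for illustration only

def toy_features(word):
    """Map a word to True/False flags, one per suffix in TOY_SUFFIXES."""
    return {"endswith({})".format(s): word.lower().endswith(s) for s in TOY_SUFFIXES}

print(toy_features("cats"))
# {'endswith(ed)': False, 'endswith(ing)': False, 'endswith(s)': True}
print(toy_features("cats") == toy_features("is"))  # True
```

Using a longer suffix list, as this notebook does with the 100 most common suffixes, reduces such collisions but does not eliminate them.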

In [10]:
def get_features(word):
    '''Extract features of given word'''
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix) # Returns a True/False if string ends with the specified suffix.
    return features

**Test it out**

In [11]:
get_features('test')

{'endswith(e)': False,
 'endswith(,)': False,
 'endswith(.)': False,
 'endswith(s)': False,
 'endswith(d)': False,
 'endswith(t)': True,
 'endswith(n)': False,
 'endswith(he)': False,
 'endswith(of)': False,
 'endswith(a)': False,
 'endswith(the)': False,
 'endswith(y)': False,
 'endswith(r)': False,
 'endswith(to)': False,
 'endswith(in)': False,
 'endswith(f)': False,
 'endswith(o)': False,
 'endswith(ed)': False,
 'endswith(nd)': False,
 'endswith(is)': False,
 'endswith(on)': False,
 'endswith(l)': False,
 'endswith(g)': False,
 'endswith(and)': False,
 'endswith(ng)': False,
 'endswith(er)': False,
 'endswith(ing)': False,
 'endswith(as)': False,
 'endswith(h)': False,
 'endswith(at)': False,
 'endswith(es)': False,
 'endswith(or)': False,
 'endswith(re)': False,
 'endswith(``)': False,
 "endswith('')": False,
 'endswith(an)': False,
 'endswith(m)': False,
 'endswith(;)': False,
 'endswith(ly)': False,
 'endswith(I)': False,
 'endswith(it)': False,
 'endswith(ion)': False,
 'endsw

# Train a Decision Tree Classifier
Here's a [cool video](https://www.youtube.com/watch?v=LDRbO9a6XPU) that talks about Decision Tree Classifiers.

In [13]:
tagged_words = brown.tagged_words(categories='news')
tagged_words

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [16]:
featuresets = [(get_features(n), g) for (n,g) in tagged_words]
featuresets[0] # Preview the first element

({'endswith(e)': True,
  'endswith(,)': False,
  'endswith(.)': False,
  'endswith(s)': False,
  'endswith(d)': False,
  'endswith(t)': False,
  'endswith(n)': False,
  'endswith(he)': True,
  'endswith(of)': False,
  'endswith(a)': False,
  'endswith(the)': True,
  'endswith(y)': False,
  'endswith(r)': False,
  'endswith(to)': False,
  'endswith(in)': False,
  'endswith(f)': False,
  'endswith(o)': False,
  'endswith(ed)': False,
  'endswith(nd)': False,
  'endswith(is)': False,
  'endswith(on)': False,
  'endswith(l)': False,
  'endswith(g)': False,
  'endswith(and)': False,
  'endswith(ng)': False,
  'endswith(er)': False,
  'endswith(ing)': False,
  'endswith(as)': False,
  'endswith(h)': False,
  'endswith(at)': False,
  'endswith(es)': False,
  'endswith(or)': False,
  'endswith(re)': False,
  'endswith(``)': False,
  "endswith('')": False,
  'endswith(an)': False,
  'endswith(m)': False,
  'endswith(;)': False,
  'endswith(ly)': False,
  'endswith(I)': False,
  'endswith(it)': 

**Split the train/test set**

In [26]:
cutoff = int(len(featuresets) * 0.2) 
train_set, test_set = featuresets[cutoff:], featuresets[:cutoff] # train on 80%, test on 20%
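
The slicing above can be checked on a toy list. In this sketch, `split_train_test` is a hypothetical helper that wraps the same two slices. Note that the split is positional rather than shuffled, so the test set is simply the first 20% of the data.

```python
def split_train_test(items, test_fraction=0.2):
    """Return (train, test), where test is the first test_fraction of items."""
    cutoff = int(len(items) * test_fraction)
    return items[cutoff:], items[:cutoff]

toy = list(range(10))
train, test = split_train_test(toy)
print(test)   # [0, 1]
print(train)  # [2, 3, 4, 5, 6, 7, 8, 9]
```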

**Train the Classifier (took about 15 minutes to complete)**

NLTK is a teaching toolkit which is not really optimized for speed. Therefore, this may take a while. For speed, use [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the classifiers.

In [29]:
from nltk import DecisionTreeClassifier # https://www.nltk.org/book/ch06.html#sec-decision-trees
from nltk.classify import accuracy

classifier = DecisionTreeClassifier.train(train_set)
accuracy(classifier, test_set)

0.624763799104923

Accuracy is about 62%, which is not great. To improve the classifier, we could work with tagged sentences instead of isolated tagged words. That would let us add the contextual features mentioned earlier: the word itself, the word before, the word after, and the previous tag. For the scope of this project, we'll stop right here.
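
For readers who want to try the contextual approach, a feature extractor along those lines might look like the sketch below. It is not used anywhere in this notebook: the function name, the feature names, and the `<START>`/`<END>` placeholders are assumptions, and the three-word sentence with its tags is a toy example.

```python
def context_features(sentence, i, history):
    """Features for the word at position i.

    sentence: list of words; history: tags already assigned to words 0..i-1.
    """
    word = sentence[i]
    return {
        "word": word.lower(),
        "suffix(2)": word[-2:].lower(),
        "suffix(3)": word[-3:].lower(),
        "prev-word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next-word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
        "prev-tag": history[i - 1] if i > 0 else "<START>",
    }

sentence = ["The", "jury", "said"]
tags = ["AT", "NN", "VBD"]

# Build (features, tag) pairs in the same shape the classifier expects.
context_featuresets = [
    (context_features(sentence, i, tags[:i]), tag) for i, tag in enumerate(tags)
]
print(context_featuresets[1])
```

During training the previous tag comes from the gold annotations; at prediction time it would have to come from the tagger's own earlier output.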

**Now we can use our classifier:**

In [31]:
classifier.classify(get_features('cats'))

'NNS'

It correctly predicted that the Part-of-Speech for "cats" is 'NNS' which is a plural noun.

## Pseudocode
NLTK can print out the decision tree's steps as pseudocode so that it's fairly easy to interpret.

In [32]:
print(classifier.pseudocode(depth=4)) # depth=4 argument just displays the top portion of the decision tree.

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(was) == False: return 'PP$'
      if endswith(was) == True: return 'BEDZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



We can see that the classifier begins by checking whether a word ends with "the". If so, it is tagged "AT" (article). If not, the classifier checks whether the word ends with a comma; if it does, it receives the special "," tag. Otherwise the classifier checks whether the word ends in "s". If it does not, the tree goes on to test for a trailing ".", but at the displayed depth both branches return the punctuation tag ".". If the word does end in "s", the classifier checks whether it ends with "was": if so, it receives the tag "BEDZ" (the Brown tag for "was"), and if not, it receives the possessive pronoun tag "PP$".
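
The printed tree can also be read as ordinary nested conditionals. The function below is a hand translation of the `depth=4` printout above, written only to make the branching explicit. `top_of_tree` is a hypothetical name, and because the printout is cut off at depth 4, its answers can differ from the full classifier's (which returned 'NNS' for "cats").

```python
def top_of_tree(word):
    """Follow the first four levels of the printed decision tree."""
    w = word.lower()  # get_features lowercases the word before testing suffixes
    if w.endswith("the"):
        return "AT"
    if w.endswith(","):
        return ","
    if not w.endswith("s"):
        return "."  # both endswith(.) branches return '.' at this depth
    if w.endswith("was"):
        return "BEDZ"
    return "PP$"

for word in ["The", ",", "was", "his", "cats", "dog"]:
    print(repr(word), "->", top_of_tree(word))
# 'The' -> AT
# ',' -> ,
# 'was' -> BEDZ
# 'his' -> PP$
# 'cats' -> PP$
# 'dog' -> .
```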