# NLP Learning Series Section 2 - POS Tagging

# Table of Contents
0. Notebook Setup

1. What is Parts-of-Speech (POS) tagging?
    - 1.1 Basic Introduction and use cases
    - 1.2 Experimenting with NLTK's built in POS tagger
    - 1.3 Building a very simple probabilistic POS tagger with NLTK
    

2. Feature Extraction for POS tagging
    - 2.1 shape function
    - 2.2 POS_features function
    

3. Build a custom POS tagger
    - Tagging and evaluating
    - Putting it all together
    - Training  and tuning the model
    

4. Build a POS tagger using NLTK


5. Basic sentiment analysis (progressively add more complex features)
    - Tweet sentiment analysis


# 0\. Notebook Setup

**Import Packages**

In [None]:
# Import packages
import nltk

# Download nltk packages
nltk.download('book')

**Load Data**

In this exercise, we will use the 'English Universal Dependencies' dataset for training and testing:
- Train data: 12543 sentences, runtime=~60s 
- Test data: 2077 sentences, runtime=~10s 

The data is loaded as a list of lists, where each individual list item represents a sentence, and each sentence is broken down into a set of token/tag tuples as follows:

```[('What', 'WP'),
 ('if', 'IN'),
 ('Google', 'NNP'),
 ('Morphed', 'VBD'),
 ('Into', 'IN'),
 ('GoogleOS', 'NNP'),
 ('?', '.')]```
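
As a minimal sketch of how to work with this structure, the example sentence above can be split into parallel token and tag sequences with `zip(*...)`. The variable names here are illustrative only.

```python
# One sentence in the format described above: a list of (token, tag) tuples
sentence = [('What', 'WP'), ('if', 'IN'), ('Google', 'NNP'), ('Morphed', 'VBD'),
            ('Into', 'IN'), ('GoogleOS', 'NNP'), ('?', '.')]

# zip(*...) transposes the list of pairs into two parallel tuples
tokens, tags = zip(*sentence)

print(tokens)  # ('What', 'if', 'Google', 'Morphed', 'Into', 'GoogleOS', '?')
print(tags)    # ('WP', 'IN', 'NNP', 'VBD', 'IN', 'NNP', '.')
```

The same idiom is used later in this notebook whenever tokens and tags need to be handled separately.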

In [None]:
# Run bash commands to get data from course site
!curl -o train_raw.conllu http://vsarangian.com/NLPLearningSeries/datasets/en_ewt-ud-train.conllu
!curl -o test_raw.conllu http://vsarangian.com/NLPLearningSeries/datasets/en_ewt-ud-test.conllu

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.6M  100 12.6M    0     0  9715k      0  0:00:01  0:00:01 --:--:-- 9708k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1622k  100 1622k    0     0  1756k      0 --:--:-- --:--:-- --:--:-- 1754k


In [None]:
import os
from nltk import conlltags2tree
from time import time
def read_pos_data(filename):
    """
    Iterate through the Universal Dependencies Corpus Part-Of-Speech data
    Yield sentences one by one, don't load all the data in memory
    """
    t0 = time()
    current_sentence = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            # ignore comments
            if line.startswith('#'):
                continue
            # empty line indicates end of sentence
            if not line:
                yield current_sentence
                # Reset current_sentence each time so yield works!
                current_sentence = []
                continue
            annotations = line.split('\t')
            # Get only the word and the part of speech
            current_sentence.append((annotations[1], annotations[4]))
    print("Runtime:", time()-t0)

train_data = read_pos_data('train_raw.conllu')
train_list = list(train_data) # consume the generator into a list of sentences

test_data = read_pos_data('test_raw.conllu')
test_list = list(test_data) # consume the generator into a list of sentences

print("Train data size:", len(train_list))
print("Test data size:", len(test_list))

Runtime: 0.3187830448150635
Runtime: 0.03853797912597656
Train data size: 12543
Test data size: 2077


In [None]:
# Take a sneak peek at the data
train_list[1]

[('[', '-LRB-'),
 ('This', 'DT'),
 ('killing', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('respected', 'JJ'),
 ('cleric', 'NN'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('causing', 'VBG'),
 ('us', 'PRP'),
 ('trouble', 'NN'),
 ('for', 'IN'),
 ('years', 'NNS'),
 ('to', 'TO'),
 ('come', 'VB'),
 ('.', '.'),
 (']', '-RRB-')]

# 1\. What is Parts-of-Speech (POS) Tagging?

## 1.1 POS Introduction


Part-of-Speech tagging (or POS tagging, for short) is one of the most widely studied tasks in NLP. The process involves **assigning a grammatical label to every word in a sequence (usually a sentence).** When I say grammatical label, I mean: *Noun, Verb, Preposition, Pronoun, etc.* 

In NLP, a collection of labels is called a **tag set**. There are multiple POS tag sets for various languages, but for the English language the most widely used one is the **Penn Treebank Tag Set**. Below is a subset of the alphabetical list of part-of-speech tags used in the Penn Treebank Project (the full list can be found [here](https://www.clips.uantwerpen.be/pages/mbsp-tags)):

![alt text](http://vsarangian.com/NLPLearningSeries/images/nlp2_pic1.jpg)


Your instinct now might be to run to your high school grammar book, but don't worry: you don't really need to know what all those POS tags mean. In fact, not all corpora implement this exact tag set; many use only a subset of it. For example, I've never encountered the LS (list item marker) or the PDT (predeterminer) tags anywhere.

Part-of-speech tagging also serves as a base for deeper NLP analyses, and there are only a few cases where you will work directly with the tagged sentence. One scenario that comes to mind is keyword extraction, where you usually only want to extract the adjectives and nouns. In this case, you use the tags to filter out the words that can't be keywords.

A major use case is **Named Entity Recognition** (NER for short), which is almost as well known and studied as POS tagging. NER means extracting named entities and their classes from a given text. The usual named entities stand for people, organizations, locations, events, etc. Sometimes things like currencies, numbers, percentages, dates and time expressions are also treated as named entities, even though they technically aren't. The entities are used in information extraction tasks, and they can usually be attributed to a real-life object or concept.
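
The keyword-extraction use case mentioned above can be sketched in a few lines. This is only an illustration, assuming Penn Treebank tags where noun tags start with `NN` and adjective tags start with `JJ`; the helper name `keyword_candidates` is hypothetical, and the tagged tokens are taken from the sample training sentence shown earlier.

```python
# Tagged tokens in the same (token, tag) format used throughout this notebook
tagged = [('This', 'DT'), ('killing', 'NN'), ('of', 'IN'), ('a', 'DT'),
          ('respected', 'JJ'), ('cleric', 'NN'), ('will', 'MD'), ('be', 'VB'),
          ('causing', 'VBG'), ('us', 'PRP'), ('trouble', 'NN'), ('for', 'IN'),
          ('years', 'NNS'), ('to', 'TO'), ('come', 'VB'), ('.', '.')]

def keyword_candidates(tagged_sentence):
    """Keep tokens whose tag marks a noun (NN*) or an adjective (JJ*)."""
    return [tok for tok, tag in tagged_sentence if tag.startswith(('NN', 'JJ'))]

candidates = keyword_candidates(tagged)
print(candidates)  # ['killing', 'respected', 'cleric', 'trouble', 'years']
```

A real keyword extractor would add more filtering (stop words, frequency thresholds), but the tag filter is the part that depends on POS tagging.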


As you can imagine, it would be very difficult (and probably pretty ineffective) to manually assign a fixed POS tag to every word in the English language. In fact, even if we could do that, some words have different tags depending on the context they are found in. 

To address this problem,  **we use machine learning to learn the important patterns associated with each POS tag.** The approach looks something like this:
1. Get some humans to annotate some texts with POS tags (we’ll call this the gold standard)
2. Extract a series of features from each word in the gold standard corpus
3. Build some mathematical models to predict tags based on the extracted features
4. Assess how well the model is performing on data the model hasn’t seen yet 

As you can probably tell, POS tagging is really just a classification task. The most complicated part is often the process of extracting features from each word. A word's features can include things like:
- position in sentence
- shape (capitalized, non-capitalized, CamelCase, etc.)
- previous word
- next word in sentence
- end of sentence
- beginning of sentence

There are various other types of features that can be extracted from a single word, and we will go over these in the next section. For now, we will test out the **pre-trained POS tagger** from NLTK, just to get a sense of the kinds of outputs we should expect.

## 1.2 Experimenting with the NLTK POS tagger

Using the pretrained POS tagger from NLTK is as simple as passing a tokenized sentence into the POS tagging function, just as in the code below. The output should be a list of tuples, each with the individual token and the corresponding (predicted)  POS tag.

In [None]:
# Import necessary packages
import nltk

# Create test sentence and tokenize
sentence = "Hey my name is Mike, and I am the smartest man alive!"
tokens = nltk.word_tokenize(sentence)

# Use nltk's pos_tag pre trained model to tag the words 
nltk.pos_tag(tokens)

[('Hey', 'NNP'),
 ('my', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Mike', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('the', 'DT'),
 ('smartest', 'JJS'),
 ('man', 'NN'),
 ('alive', 'JJ'),
 ('!', '.')]

Note that NLTK's pretrained POS tagger is not perfect, as it is trained on a general dataset that may differ from the dataset in your specific use case.  

To showcase this, let's try using **NLTK's POS tagger to tag all of the data in the English Universal Dependencies test dataset** we imported at the beginning of this notebook. This should give us a good benchmark for our future, more customized models. We will use the `test_list` data for this exercise. 

In [None]:
sent_tokens = []
pos_predict = []
pos_actual = []

for sent in test_list:
    # Get words/tags from sentences
    words = [token[0] for token in sent]
    tags = [token[1] for token in sent]
    # Predict tags using nltk.pos_tag
    predictions = [tag for word, tag in nltk.pos_tag(words)]
    # Join all predictions and actuals into a single list
    pos_predict.extend(predictions)
    pos_actual.extend(tags)
    
# Calculate nltk.pos_tag prediction accuracy
correct = [a==b for a,b in zip(pos_predict, pos_actual)]
print("Accuracy:", sum(correct)/len(correct))

Accuracy: 0.8495437701717337


Approximately 85% accuracy isn't bad, especially for a pretrained model that has supposedly never seen our data before. That said, we can do much better by training our own model on the English Universal Dependencies training dataset. 

## 1.3 Building a very simple n-gram POS tagger with NLTK
Before we jump into building a POS tagger from scratch, we are going to build one more simple tagger to add another benchmark for evaluating our own models. The model will be a simple **n-gram** tagger, which chooses a token's tag based on the token itself and the tags of the preceding `n-1` tokens. Several taggers are chained together so that each one falls back on a simpler one:

1. **Train a trigram tagger with NLTK's `TrigramTagger()`:** When tagging a word w3 that follows two words tagged t1, t2, if the training data contained w3 in that same context (t1, t2, w3), we output the tag t3 most often seen for it. Otherwise, we fall back on the bigram tagger.
2. **Train a bigram tagger with NLTK's `BigramTagger()`:** When tagging a word w2 that follows a word tagged t1, if the training data contained w2 in that same context (t1, w2), we output the tag t2 most often seen for it. Otherwise, we fall back on the unigram tagger.
3. **Train a unigram tagger with NLTK's `UnigramTagger()`:** When tagging a word w1, if we encountered that word in the training data, we output the tag t1 most often seen for it. Otherwise, we fall back on a default choice.
4. **Create a default tagger with NLTK's `DefaultTagger()`:** If all of the above taggers fail, we output the most common tag in the dataset. 
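
The backoff chain described in the list above can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation (NLTK's classes include refinements that are not shown here); the function names `train_context_table` and `backoff_tag` and the toy training data are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_context_table(tagged_sents, n):
    """Map (previous n-1 tags, current word) -> most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for i, (word, tag) in enumerate(sent):
            context = (tuple(tags[max(0, i - n + 1):i]), word)
            counts[context][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def backoff_tag(words, tables, default_tag):
    """Tag left to right, trying the longest context first and backing off."""
    history = []
    for i, word in enumerate(words):
        tag = default_tag
        for n, table in tables:  # ordered from largest n to smallest
            context = (tuple(history[max(0, i - n + 1):i]), word)
            if context in table:
                tag = table[context]
                break
        history.append(tag)
    return list(zip(words, history))

# Toy training data (hypothetical)
toy_train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
             [('dogs', 'NNS'), ('bark', 'VBP')]]
tables = [(n, train_context_table(toy_train, n)) for n in (3, 2, 1)]

# 'sleeps' was never seen, so every table misses and the default tag is used
result = backoff_tag(['the', 'dog', 'sleeps'], tables, default_tag='NN')
print(result)  # [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'NN')]
```

Note that the history used at tagging time consists of the tagger's own previous predictions, which is the same idea we will rely on when building the custom tagger later.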

Note that the methodology used for training and testing the n-gram tagger below is the same for all NLTK tagger classes:
- **Training:** the model is trained as soon as the class is initialized with training data
- **Tagging:** a trained model can tag new sentences by passing a list of tokens through the `tag` method
- **Testing:** a trained model can be evaluated by passing unseen tagged test data through the `evaluate` method (recent NLTK releases deprecate this name in favor of `accuracy`)




In [None]:
from time import time
from collections import Counter

# First establish most common tag to train the 'default tagger'
tag_counter = Counter() 
for sentence in train_list:
    tag_counter.update([t for _, t in sentence])
    
# Take a look at most common tags
print("Top 5 tags:", tag_counter.most_common(5))

most_common_tag = tag_counter.most_common()[0][0]
print("Most Common Tag is: ", most_common_tag) # NN

Top 5 tags: [('NN', 26906), ('IN', 20715), ('DT', 16819), ('NNP', 12449), ('PRP', 12198)]
Most Common Tag is:  NN


In [None]:

# Train the 4 taggers
print("Starting training ...")
t0 = time()
tag0 = nltk.DefaultTagger(most_common_tag)
tag1 = nltk.UnigramTagger(train_list, backoff=tag0)
tag2 = nltk.BigramTagger(train_list, backoff=tag1)
tag3 = nltk.TrigramTagger(train_list, backoff=tag2)
t1 = time()
print("Training complete. Time={0:.2f}s".format(t1 - t0))

# Compute test set accuracy
print(tag3.evaluate(test_list))
print()

# Here's how to use our new tagger
sentence = "Hey everyone! My name is Mike and I am the best at everything!"
sent_tokens = nltk.word_tokenize(sentence)
tag3.tag(sent_tokens)

Starting training ...
Training complete. Time=4.22s
0.8630912061202534



[('Hey', 'UH'),
 ('everyone', 'NN'),
 ('!', '.'),
 ('My', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Mike', 'NNP'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('at', 'IN'),
 ('everything', 'NN'),
 ('!', '.')]

Approximately 86% accuracy! **That's even better than the pretrained NLTK POS tagger!** However, that is still not great, especially if we want accuracy high enough to parse entities (names, organizations, places, etc.), as we will see in the Named Entity Recognition section. That said, this is another useful benchmark. Next we will build our own model from scratch and try to improve on this accuracy level! 

# 2\. Feature Extraction for POS

In this section, we will create a series of functions that extract features from a set of tokens and tags, and convert those features into a vectorized format that can be ingested by any machine learning algorithm. The three functions are:
1. `shape`: helper function that uses regex to extract information about an individual word
2. `pos_features`: helper function that **slides over the sentence extracting features (including the `shape` function output) for every word, taking into account the tags already assigned (the `tag_history` parameter).** The function returns a dictionary with the feature names as keys and the feature values as values. 
3. `tags_to_dataset`: uses the `pos_features` function to convert tagged training sentences into a list of (feature dictionary, tag) tuples, which we then vectorize into X, y data

## 2.1 Extracting 'word shapes'



One of the most powerful features for POS tagging is the **shape of the word.** Our implementation uses regex (which we covered in the last section) to extract a word's shape and classify it into one of 10 word shape groups (plus a `wildcard` group for the padding tokens and a catch-all `other` group):
- Numbers (1, 1.25, 100000)
- Punctuation (. , ;)
- Capitalized
- UPPERCASE
- lowercase
- CamelCase
- MiXEdCaSE
- EndingWithADot.
- ABB.RE.VI.ATION.
- Contains-Hyphen


In [None]:
# Import regex
import re

# Define shape function
def shape(word):
    # Raw strings (r'...') keep the regex escapes such as \. and \W intact.
    # Note: only the second alternative of the number pattern ends in '$', so
    # tokens that merely start with digits (e.g. '1st') also count as 'number'.
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word):
        return 'number'
    elif re.match(r'\W+$', word):
        return 'punct'
    elif re.match(r'[A-Z][a-z]+$', word):
        return 'capitalized'
    elif re.match(r'[A-Z]+$', word):
        return 'uppercase'
    elif re.match(r'[a-z]+$', word):
        return 'lowercase'
    elif re.match(r'[A-Z][a-z]+[A-Z][a-z]+[A-Za-z]*$', word):
        return 'camelcase'
    elif re.match(r'[A-Za-z]+$', word):
        return 'mixedcase'
    elif re.match(r'__.+__$', word):
        return 'wildcard'
    elif re.match(r'[A-Za-z0-9]+\.$', word):
        return 'ending-dot'
    elif re.match(r'[A-Za-z0-9]+\.[A-Za-z0-9\.]+\.$', word):
        return 'abbreviation'
    elif re.match(r'[A-Za-z0-9]+\-[A-Za-z0-9\-]+.*$', word):
        return 'contains-hyphen'
    return 'other'

# Test it out
sentence = "This is the 1st test of Mike's CrAzY word-shape FUNCTION."
tokens = nltk.word_tokenize(sentence)
for token in tokens:
    print(token,":", shape(token))

This : capitalized
is : lowercase
the : lowercase
1st : number
test : lowercase
of : lowercase
Mike : capitalized
's : other
CrAzY : camelcase
word-shape : contains-hyphen
FUNCTION : uppercase
. : punct
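
One subtlety in the output above: `1st` is labelled `number`. The sketch below isolates why, assuming the number pattern from the `shape` function. `re.match` anchors only at the start of the string, and only the second alternative of the pattern ends in `$`, so the first alternative is satisfied by the leading `1`. Whether `1st` should count as a number is a design choice; `re.fullmatch` is one way to make the check strict.

```python
import re

# The number pattern from the shape function: only the second alternative ends in '$'
pattern = r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$'

# re.match anchors only at the start, so the first alternative matches the '1' in '1st'
loose = bool(re.match(pattern, '1st'))

# re.fullmatch requires the entire string to match one of the alternatives
strict_pattern = r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+'
strict = bool(re.fullmatch(strict_pattern, '1st'))

print(loose, strict)  # True False
```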


## 2.2 Extracting contextual features



In addition to word shapes, we need to extract features that correspond to a given word's context (e.g. the previous/next word in the sentence). These types of features are useful for words which may have multiple tags based on the context that they appear in. For instance, in the sentences below, we can see that the word 'work' can be either a noun ('NN') or a verb ('VBP') based on the context it is found in! 

In [None]:
sent1 = nltk.word_tokenize("Mike does his work!")
sent2 = nltk.word_tokenize("Mike, you work so hard!")

print("'Work' as a NOUN:",nltk.pos_tag(sent1))
print("'Work' as a VERB:",nltk.pos_tag(sent2))

'Work' as a NOUN: [('Mike', 'NN'), ('does', 'VBZ'), ('his', 'PRP$'), ('work', 'NN'), ('!', '.')]
'Work' as a VERB: [('Mike', 'NNP'), (',', ','), ('you', 'PRP'), ('work', 'VBP'), ('so', 'RB'), ('hard', 'JJ'), ('!', '.')]


To address this issue, for any given word in the sentence we will extract: 
- a) the two preceding and the two following words 
- b) the tags of the two preceding words (note that we only use the preceding words' tags, as we will NOT have the following words' tags when we need to make predictions)

We will create a function that loops through each word, extracts the contextual features, and stores them in a dictionary. There are a few nuances that we will need to address here:
- To handle the first two and last two words in the sentence, we will need to pad the sentence with `__START2__`, `__START1__` and `__END1__`, `__END2__` tokens.  
- We will also need to pad the tags associated with the preceding words. Note that we do not directly feed the tags in, but create a `tag_history` variable that grows one tag at a time as the function loops through the words in the sentence. This is because when we make predictions (tag an untagged sentence) we predict one tag at a time and then add that tag to the history. In effect, we use our tag predictions for the earlier words to predict the tags of the following words. A basic implementation of this idea is as follows: 


In [None]:
# Create context extractor
def extract_context(tokens, tag_history, index):
  # Pad sentence with start and end tokens
  tokens = ['__START2__', '__START1__'] + list(tokens) + ['__END1__', '__END2__']

  # We will be looking two words back in history, so need to make sure we do not go out of bounds
  tag_history = ['__START2__', '__START1__'] + list(tag_history)
  # shift the index with 2, to accommodate the padding
  index += 2
  # Store features in a dictionary
  context_features = {# Context words
                      'prev-word': tokens[index - 1],
                      'prev-prev-word': tokens[index - 2],
                      'next-word': tokens[index + 1],
                      'next-next-word': tokens[index + 2],
                      # Historical tags
                      'prev-pos': tag_history[-1],
                      'prev-prev-pos': tag_history[-2]}
  return context_features

# Create tagged sentence (pretend this is our human tagged training data!)
raw_sentence = "Mike is the smartest and most handsome man on the planet!"
tokens = nltk.word_tokenize(raw_sentence)
tags = [tag for token, tag in nltk.pos_tag(tokens)]

# Loop through each word and extract context
tag_history = []
for i, token in enumerate(tokens):
  # Add another historical tag on each iteration
  tag_history = tags[0:i]
  # Extract features (the padding offset is handled inside extract_context)
  feats = extract_context(tokens, tag_history, i)
  print(token, feats)

Mike {'prev-word': '__START1__', 'prev-prev-word': '__START2__', 'next-word': 'is', 'next-next-word': 'the', 'prev-pos': '__START1__', 'prev-prev-pos': '__START2__'}
is {'prev-word': 'Mike', 'prev-prev-word': '__START1__', 'next-word': 'the', 'next-next-word': 'smartest', 'prev-pos': 'NNP', 'prev-prev-pos': '__START1__'}
the {'prev-word': 'is', 'prev-prev-word': 'Mike', 'next-word': 'smartest', 'next-next-word': 'and', 'prev-pos': 'VBZ', 'prev-prev-pos': 'NNP'}
smartest {'prev-word': 'the', 'prev-prev-word': 'is', 'next-word': 'and', 'next-next-word': 'most', 'prev-pos': 'DT', 'prev-prev-pos': 'VBZ'}
and {'prev-word': 'smartest', 'prev-prev-word': 'the', 'next-word': 'most', 'next-next-word': 'handsome', 'prev-pos': 'JJS', 'prev-prev-pos': 'DT'}
most {'prev-word': 'and', 'prev-prev-word': 'smartest', 'next-word': 'handsome', 'next-next-word': 'man', 'prev-pos': 'CC', 'prev-prev-pos': 'JJS'}
handsome {'prev-word': 'most', 'prev-prev-word': 'and', 'next-word': 'man', 'next-next-word': 'on'

## 2.3 Creating the Feature Extraction function

Now that we know how to extract word shapes as well as contextual features for each word in a sentence, we are going to put these things together to create our `pos_features` function. We will also add some additional features like the stem of words (we discussed word stemming and lemmatizing in the last section), as well as word suffixes, just to capture as much information as possible. The implementation below is very similar to the POS feature extractor used by NLTK.

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

def pos_features(tokens, index, tag_history):
    """
    tokens = list of words: [word1, word2, ...]
    index = the index of the word we want to extract features for
    tag_history = the list of predicted tags of the previous tokens
    """
    # Pad the sequence with placeholders
    # We will be looking at two words back and forward, so need to make sure we do not go out of bounds
    tokens = ['__START2__', '__START1__'] + list(tokens) + ['__END1__', '__END2__']
    # We will be looking two words back in history, so need to make sure we do not go out of bounds
    tag_history = ['__START2__', '__START1__'] + list(tag_history)
    # shift the index with 2, to accommodate the padding
    index += 2
    return {
        # Intrinsic features
        'word': tokens[index],
        'stem': stemmer.stem(tokens[index]),
        'shape': shape(tokens[index]),
        # Suffixes
        'suffix-1': tokens[index][-1],
        'suffix-2': tokens[index][-2:],
        'suffix-3': tokens[index][-3:],
        # Context
        'prev-word': tokens[index - 1],
        'prev-stem': stemmer.stem(tokens[index - 1]),
        'prev-prev-word': tokens[index - 2],
        'prev-prev-stem': stemmer.stem(tokens[index - 2]),
        'next-word': tokens[index + 1],
        'next-stem': stemmer.stem(tokens[index + 1]),
        'next-next-word': tokens[index + 2],
        'next-next-stem': stemmer.stem(tokens[index + 2]),
        # Historical features
        'prev-pos': tag_history[-1],
        'prev-prev-pos': tag_history[-2],
        # Composite
        'prev-word+word': tokens[index - 1].lower() + '+' + tokens[index]}

# Create tagged sentence (pretend this is our human tagged training data!)
raw_sentence = "Mike is the smartest and most handsome man on the planet!"
tokens = nltk.word_tokenize(raw_sentence)
tags = [tag for token, tag in nltk.pos_tag(tokens)]

# Loop through each word and extract context
tag_history = []
for i, token in enumerate(tokens):
  # Add another historical tag on each iteration
  tag_history = tags[0:i]
  # Extract features (the padding offset is handled inside pos_features)
  feats = pos_features(tokens, i, tag_history)
  print(token, feats)

Mike {'word': 'Mike', 'stem': 'mike', 'shape': 'capitalized', 'suffix-1': 'e', 'suffix-2': 'ke', 'suffix-3': 'ike', 'prev-word': '__START1__', 'prev-stem': '__start1__', 'prev-prev-word': '__START2__', 'prev-prev-stem': '__start2__', 'next-word': 'is', 'next-stem': 'is', 'next-next-word': 'the', 'next-next-stem': 'the', 'prev-pos': '__START1__', 'prev-prev-pos': '__START2__', 'prev-word+word': '__start1__+Mike'}
is {'word': 'is', 'stem': 'is', 'shape': 'lowercase', 'suffix-1': 's', 'suffix-2': 'is', 'suffix-3': 'is', 'prev-word': 'Mike', 'prev-stem': 'mike', 'prev-prev-word': '__START1__', 'prev-prev-stem': '__start1__', 'next-word': 'the', 'next-stem': 'the', 'next-next-word': 'smartest', 'next-next-stem': 'smartest', 'prev-pos': 'NNP', 'prev-prev-pos': '__START1__', 'prev-word+word': 'mike+is'}
the {'word': 'the', 'stem': 'the', 'shape': 'lowercase', 'suffix-1': 'e', 'suffix-2': 'he', 'suffix-3': 'the', 'prev-word': 'is', 'prev-stem': 'is', 'prev-prev-word': 'Mike', 'prev-prev-stem':

## 2.4 Vectorizing the data for ML ingestion

Now that we have built our POS feature extractor, we need two more steps: extract the features from an entire corpus of training data, and then vectorize the feature dictionaries to convert them to a format that can be ingested by our machine learning algorithm!

To start, we will build a `tags_to_dataset` function, which will do the following:
1. Take in a list of tokenized and tagged sentences as well as a 'feature_detector' (which will be our `pos_features` function)
2. Loop through each sentence in the corpus
3. Loop through each word in each sentence and extract the features

On the full training set, the result should be a list of 204507 tuples, each containing the features of an individual word (stored in a dictionary) and its corresponding tag (our classifier label, or y value). Note that these are 204507 NON-UNIQUE tokens: every occurrence of a word is treated as a separate observation to be fed into our classifier!

To start, we will only use **1000 sentences** to speed up the process. We will use the whole training set when we actually want to train and test our final model.

In [None]:
# Extract the number of non-unique words
n_sents = 1000
total_words = 0
for i in train_list[:n_sents]:
  total_words+=len(i)

print("Total Sentences in subset:", len(train_list[:n_sents]))
print("Total Words:", total_words)

# As a reminder, print out what the first sentence in the training set looks like
print(train_list[0])

Total Sentences in subset: 1000
Total Words: 21859
[('Al', 'NNP'), ('-', 'HYPH'), ('Zaman', 'NNP'), (':', ':'), ('American', 'JJ'), ('forces', 'NNS'), ('killed', 'VBD'), ('Shaikh', 'NNP'), ('Abdullah', 'NNP'), ('al', 'NNP'), ('-', 'HYPH'), ('Ani', 'NNP'), (',', ','), ('the', 'DT'), ('preacher', 'NN'), ('at', 'IN'), ('the', 'DT'), ('mosque', 'NN'), ('in', 'IN'), ('the', 'DT'), ('town', 'NN'), ('of', 'IN'), ('Qaim', 'NNP'), (',', ','), ('near', 'IN'), ('the', 'DT'), ('Syrian', 'JJ'), ('border', 'NN'), ('.', '.')]



In [None]:
# Build the function to convert tagged sentences to (features, tag) tuples
def tags_to_dataset(tagged_sentences, feature_detector):
  """
  Helper function:
    Take in train data (tagged sentences) and return feature dict (pre-vectorized dataset)
    eg. tagged_sentences = [[(word1, tag1),(word2, tag2)],[(word1, tag1),(word2, tag2)]]
  """
  # Initialize empty featuresets list
  classifier_corpus = []
  for sentence in tagged_sentences:
      # Initialize empty history list (will be updated as loop through each word in sent)
      tag_history = []
      # Use zip(* ) to zip tokens & tags into two separate lists
      sentence_tokens, sentence_tags = zip(*sentence)
      # Loop through each word in sentence
      # Duplicate words are kept because contexts (prev/post words) may differ
      for index in range(len(sentence)):
          # Use the feature detector passed in as an argument (e.g. pos_features)
          featureset = feature_detector(sentence_tokens, index, tag_history)
          # Update featuresets list with tuple (featureset, tag)
          classifier_corpus.append((featureset, sentence_tags[index]))
          # Update history for next index word
          tag_history.append(sentence_tags[index])
  return classifier_corpus

# Try extracting the features from a subset of the training data
prevec_data = tags_to_dataset(train_list[:n_sents], pos_features)

# Print the first two words!
print(len(prevec_data))
prevec_data[0:2]

21859


[({'next-next-stem': 'zaman',
   'next-next-word': 'Zaman',
   'next-stem': '-',
   'next-word': '-',
   'prev-pos': '__START1__',
   'prev-prev-pos': '__START2__',
   'prev-prev-stem': '__start2__',
   'prev-prev-word': '__START2__',
   'prev-stem': '__start1__',
   'prev-word': '__START1__',
   'prev-word+word': '__start1__+Al',
   'shape': 'capitalized',
   'stem': 'al',
   'suffix-1': 'l',
   'suffix-2': 'Al',
   'suffix-3': 'Al',
   'word': 'Al'},
  'NNP'),
 ({'next-next-stem': ':',
   'next-next-word': ':',
   'next-stem': 'zaman',
   'next-word': 'Zaman',
   'prev-pos': 'NNP',
   'prev-prev-pos': '__START1__',
   'prev-prev-stem': '__start1__',
   'prev-prev-word': '__START1__',
   'prev-stem': 'al',
   'prev-word': 'Al',
   'prev-word+word': 'al+-',
   'shape': 'punct',
   'stem': '-',
   'suffix-1': '-',
   'suffix-2': '-',
   'suffix-3': '-',
   'word': '-'},
  'HYPH')]

Looks great! Now we can convert our whole training corpus into a list of features/tag tuples! The next step is to take this data and vectorize it so it can be ingested by a machine learning algorithm. To do this, we will use sklearn's `DictVectorizer` (which we discussed in the last section). Note that when we want to make predictions, we need to use `transform` and not `fit_transform`, as we want our prediction data matrices to have the same columns (the same feature-to-column mapping) as the data we trained the model on!

In [None]:
# Initialize sklearn DictVectorizer
from sklearn.feature_extraction import DictVectorizer
dictvect = DictVectorizer()

# Extract featuresets and tags
featuresets, y = zip(*prevec_data)

# Vectorize the data
X = dictvect.fit_transform(featuresets)
print("X shape:",X.shape)
print("X type:",type(X))

X shape: (21859, 56941)
X type: <class 'scipy.sparse.csr.csr_matrix'>
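
To make the `fit_transform` versus `transform` distinction concrete, here is a simplified pure-Python stand-in for what a dictionary vectorizer does with string-valued features. It is not sklearn's implementation, and the function names `fit_vocabulary` and `transform` are hypothetical. The point is that the column mapping is learned once from the training data and then reused unchanged at prediction time.

```python
def fit_vocabulary(feature_dicts):
    """Assign one column index to every distinct 'name=value' pair seen in training."""
    vocab = {}
    for feats in feature_dicts:
        for name, value in feats.items():
            vocab.setdefault(f'{name}={value}', len(vocab))
    return vocab

def transform(feature_dicts, vocab):
    """One-hot encode with a fixed vocabulary; unseen pairs are silently dropped."""
    rows = []
    for feats in feature_dicts:
        row = [0] * len(vocab)
        for name, value in feats.items():
            col = vocab.get(f'{name}={value}')
            if col is not None:
                row[col] = 1
        rows.append(row)
    return rows

train_feats = [{'word': 'Al', 'shape': 'capitalized'}, {'word': '-', 'shape': 'punct'}]
vocab = fit_vocabulary(train_feats)        # the "fit" step: 4 columns
X_train = transform(train_feats, vocab)    # the "transform" step

# A new token at prediction time: 'word=Mike' was never seen, 'shape=capitalized' was
X_new = transform([{'word': 'Mike', 'shape': 'capitalized'}], vocab)
print(len(vocab), X_new)  # 4 [[0, 1, 0, 0]]
```

Refitting on the prediction data would instead build a new vocabulary with different columns, and the trained model could no longer interpret the matrix.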


# 3\. Build a custom POS tagger

## 3.1 Tagging and evaluating

Once we have vectorized our data (the X values) and extracted the labels (the y values), all we really have to do is feed this data into a machine learning model. We assume you have a strong understanding of ML algorithms and know how to properly train and fine-tune a model, so we will not go over that here. 

The more challenging component of POS models is making predictions, which we will go over now. But first, let's train a simple model so we have something to make predictions with!

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Use X and y values generated above to train model
model = RandomForestClassifier()
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Great, now that we have a model, let's figure out how to make predictions and evaluate the accuracy of our model. To do this, we will need four functions:
- `tag`: tag all words in a sentence
- `tag_sents`: tag multiple sentences
- `accuracy`: calculate the accuracy (correct predictions / total observations)
-  `evaluate`: use above functions to predict tags of test set and then return accuracy value

The trickiest component of this process involves the `tag` function. After we predict a new tag, we append that tag to the `tag_history` argument. This argument is then fed back into the `pos_features` function for the next word in the sentence. So when we train a model, we use actual human-tagged POS features to fill the `prev-pos` and `prev-prev-pos` elements of our feature dictionary. But when we predict tags, the predicted previous tags fill those feature slots!

Note that because we have to tag each word individually, we also have to apply the `pos_features` and vectorizing functions to each word individually, which can make the tagging process extremely slow. In the final version of the model, we will implement an option to turn the `tag_history` parameter off. But for now we will stick with the basic implementation.

Let's start by building the `tag` and `tag_sents` functions.

In [None]:
# Create function to tag a single sentence (a list of tokens)
def tag(tokens):
    """
    Tag untagged tokens, use predictions from previous tags as features. 
    """
    tags_predicted = []
    for i in range(len(tokens)):
        # Extract features from single word
        single_feats_dict = pos_features(tokens=tokens, index=i, tag_history=tags_predicted)
        # Vectorize single word feats
        single_feats_vector = dictvect.transform(single_feats_dict)
        # Use trained model to predict new tag
        pred_tag = model.predict(single_feats_vector)[0]
        # Append predicted tag to 'tags_predicted' arg to be fed into next feature set
        tags_predicted.append(pred_tag)
    return list(zip(tokens,tags_predicted))

# Create function to tag multiple sentences
def tag_sents(sent_tokens):
    return [tag(tokens) for tokens in sent_tokens]
  

# Test out the tag function and compare the results to nltk's pos tagger!
raw_sentences = ["Mike is the smartest and most handsome man on the planet!",
                "It is also true that Mike is the best Data Scientist of all time!",
                "Oh and don't forget about how athletic and strong he is, obviously"]
tokens = [nltk.word_tokenize(sent) for sent in raw_sentences]

print("Custom tagger:", tag_sents(tokens)[0])
print("NLTK tagger:", nltk.pos_tag(tokens[0]))

Custom tagger: [('Mike', 'IN'), ('is', 'VBZ'), ('the', 'DT'), ('smartest', 'NN'), ('and', 'CC'), ('most', 'JJS'), ('handsome', 'DT'), ('man', 'NN'), ('on', 'IN'), ('the', 'DT'), ('planet', 'NN'), ('!', '.')]
NLTK tagger: [('Mike', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('smartest', 'JJS'), ('and', 'CC'), ('most', 'RBS'), ('handsome', 'JJ'), ('man', 'NN'), ('on', 'IN'), ('the', 'DT'), ('planet', 'NN'), ('!', '.')]


Not bad! Remember, we are using a model trained on only 1000 sentences (which is not a lot of data in an NLP context). Let's now build the `accuracy` and `evaluate` functions.

In [None]:
from nltk.tag.util import untag
from itertools import chain
from time import time  # used by evaluate() to report runtime

# Create accuracy function 
def accuracy(reference, test):
    """
    Given a list of reference values and a corresponding list of test
    values, return the fraction of corresponding values that are equal.
    """
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    return sum(x == y for x, y in zip(reference, test)) / len(test)

# Create evaluate function 
def evaluate(test_set, verbose=True):
    """
    Evaluation function:
      Retags the test set with 'tag_sents' and returns the 'accuracy' of the
      predicted (token, tag) pairs against the reference pairs.
    """
    t0 = time()
    untagged_sents = [untag(sent) for sent in test_set]
    # Use tagging function created above to retag (predict)
    retagged_sents = tag_sents(untagged_sents)
    preds = list(chain(*retagged_sents))
    actual = list(chain(*test_set))
    score = accuracy(actual, preds)
    if verbose:
        print("Accuracy: {}, Eval Runtime: {:.2f}".format(score, time()-t0))
    return score

# Quick smoke test: NLTK's own tags stand in for gold labels here, so this
# score measures agreement with nltk.pos_tag rather than true accuracy
tokens_test = [nltk.pos_tag(sent) for sent in tokens]
evaluate(tokens_test)

Accuracy: 0.8048780487804879, Eval Runtime: 0.09


0.8048780487804879
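A single accuracy number hides which tags the model gets wrong. As a possible extension (`per_tag_accuracy` is a hypothetical helper, not part of this notebook), the same comparison can be broken down by reference tag:

```python
from collections import Counter

def per_tag_accuracy(reference, test):
    """Return {reference_tag: fraction of tokens with that tag predicted correctly}."""
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    total, correct = Counter(), Counter()
    for (_, gold_tag), (_, pred_tag) in zip(reference, test):
        total[gold_tag] += 1
        correct[gold_tag] += (gold_tag == pred_tag)
    return {tag: correct[tag] / total[tag] for tag in total}

# A few (token, tag) pairs taken from the first example sentence above
reference = [('Mike', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('smartest', 'JJS'), ('man', 'NN')]
predicted = [('Mike', 'IN'), ('is', 'VBZ'), ('the', 'DT'), ('smartest', 'NN'), ('man', 'NN')]
print(per_tag_accuracy(reference, predicted))
# {'NNP': 0.0, 'VBZ': 1.0, 'DT': 1.0, 'JJS': 0.0, 'NN': 1.0}
```

On this small sample the misses are the proper noun and the superlative adjective.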

## 3.2 Putting it all together

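Now that we have learned all the building blocks for POS tagging, let's put everything together into one big glorious `Custom_POS_Tagger` class! We have done so below using the functions we created above, with a few minor changes: the feature detector, classifier, and vectorizer are passed in when the class is created, and `tag` checks that the model has been fitted before predicting.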


In [None]:
from time import time
from itertools import chain  # used by evaluate() to flatten sentences
import nltk
from nltk.tag.util import untag
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction import DictVectorizer

class Custom_POS_Tagger:
    def __init__(self, 
                 feature_detector,
                 classifier,
                 vectorizer=None,
                 sparse=True,
                 scaler=None):
   
        # Save params
        self.feature_detector = feature_detector
        self.classifier = classifier
        # Default to a DictVectorizer if no vectorizer is supplied
        if vectorizer is None:
            self.vectorizer = DictVectorizer(sparse=sparse)
        else:
            self.vectorizer = vectorizer
        self.fitted = False
        # Store the scaler argument (it is not used by fit or tag yet)
        self.scaler = scaler
        
    def _tags_to_dataset(self, tagged_sentences):
        """
        Helper function:
          Take in train data (tagged sentences) and return feature dict (pre-vectorized dataset)
          eg. tagged_sentences = [[(word1, tag1),(word2, tag2)],[(word1, tag1),(word2, tag2)]]
        """
        # Initialize empty featursets list
        classifier_corpus = []
        for sentence in tagged_sentences:
            # Initialize empty history list (will be updated as loop through each word in sent)
            tag_history = []
            # Use zip(* ) to zip tokens & tags into two separate lists
            sentence_tokens, sentence_tags = zip(*sentence)
            # Loop through each word in sentence
            # Duplicate words are kept because contexts (prev/post words) may differ
            for index in range(len(sentence)):
                # Use the feature detector (eg. pos_features) initialized with the class
                featureset = self.feature_detector(sentence_tokens, index, tag_history)
                # Update featursets list with tuple (featureset, tag)
                classifier_corpus.append((featureset, sentence_tags[index]))
                # Update history for next index word
                tag_history.append(sentence_tags[index])
        return classifier_corpus

    def fit(self, train_data):
        """
        Training function:
          Uses fit method of self.classifier to fit model.          
        """
        t0_model = time()
        print("Commencing feature extraction...")
        # Transform tagged sentences into a list of (featureset, tag) tuples
        classifier_corpus = self._tags_to_dataset(train_data)
        # Split into featuresets and target tags (y)
        featuresets, y = zip(*classifier_corpus)
        # Vectorize the featuresets to get X
        X = self.vectorizer.fit_transform(featuresets)
        # Run fit method on model
        print("Training model on features set size {}".format(X.shape))
        self.classifier.fit(X, y)
        print("Training complete. Total runtime: {:.2f} sec".format(time()-t0_model))
        self.fitted = True
        
    def _is_trained(self):
        # Raise an error if tagging is attempted before fit() has been called
        if not self.fitted:
            raise ValueError("Train the model first dummy!")
          
    def tag(self, tokens):
        """
        Tag untagged tokens, use predictions from previous tags as features. 
        """
        self._is_trained() # Make sure model is trained
        tags_predicted = []
        for i in range(len(tokens)):
            single_feats_dict = self.feature_detector(tokens, i, tags_predicted)
            single_feats_vector = self.vectorizer.transform(single_feats_dict)
            tags_predicted.append(self.classifier.predict(single_feats_vector)[0])
        return list(zip(tokens,tags_predicted))
    
    def tag_sents(self, sent_tokens):
        return [self.tag(tokens) for tokens in sent_tokens]
    
    def accuracy(self, reference, test):
        """
        Given a list of reference values and a corresponding list of test
        values, return the fraction of corresponding values that are equal.
        """
        if len(reference) != len(test):
            raise ValueError("Lists must have the same length.")
        return sum(x == y for x, y in zip(reference, test)) / len(test)
    
    def evaluate(self, test_set, verbose=True):
        """
        Evaluation function:
          Retags the test set and returns the 'accuracy' of the predicted
          (token, tag) pairs against the reference pairs.
        """
        t0=time()
        untagged_sents = [untag(sent) for sent in test_set]
        retagged_sents = self.tag_sents(untagged_sents)
        preds = list(chain(*retagged_sents))
        actual = list(chain(*test_set))
        score = self.accuracy(actual, preds)
        if verbose:
            print("Accuracy: {}, Eval Runtime: {:.2f}".format(score, time()-t0))
        return score
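To see what `_tags_to_dataset` hands to the vectorizer, here is a standalone sketch of the same loop. `toy_detector` is a hypothetical feature detector used only for illustration (the notebook uses `pos_features`), and `tags_to_dataset` is a plain-function copy of the method's logic.

```python
def toy_detector(tokens, index, history):
    """Hypothetical feature detector with the same signature as pos_features."""
    return {"word": tokens[index].lower(),
            "prev-pos": history[-1] if history else "<START>"}

def tags_to_dataset(tagged_sentences, feature_detector):
    dataset = []
    for sentence in tagged_sentences:
        tag_history = []  # reset at the start of every sentence
        tokens, tags = zip(*sentence)
        for index in range(len(sentence)):
            featureset = feature_detector(tokens, index, tag_history)
            dataset.append((featureset, tags[index]))
            # During training the reference tag (not a prediction) is the history
            tag_history.append(tags[index])
    return dataset

train = [[('What', 'WP'), ('if', 'IN')], [('Google', 'NNP')]]
dataset = tags_to_dataset(train, toy_detector)
for featureset, label in dataset:
    print(featureset, '->', label)
# {'word': 'what', 'prev-pos': '<START>'} -> WP
# {'word': 'if', 'prev-pos': 'WP'} -> IN
# {'word': 'google', 'prev-pos': '<START>'} -> NNP
```

The result is one (featureset, tag) pair per word, flattened across sentences, which is why `fit` can split it with `zip(*...)`.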

## 3.3 Training and tuning the model

In [None]:
import nltk
from nltk.tag.util import untag
from itertools import chain
from time import time
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

In [None]:
# Build RandomForest model
from sklearn.ensemble import RandomForestClassifier
model_rf = Custom_POS_Tagger(feature_detector=pos_features, classifier=RandomForestClassifier())
model_rf.fit(train_list)
model_rf.evaluate(test_list)

Training model on features set size (204607, 256137)
Training complete. Total runtime: 345.97 sec
Accuracy: 0.8852850938359167, Eval Runtime: 110.01


0.8852850938359167

In [None]:
# Build Regularized RandomForest model
from sklearn.ensemble import RandomForestClassifier
model_rf = Custom_POS_Tagger(feature_detector=pos_features, classifier=RandomForestClassifier(max_depth=200)) 
model_rf.fit(train_list)
model_rf.evaluate(test_list)

Commencing feature extraction...
Training model on features set size (204607, 256137)
Training complete. Total runtime: 220.65 sec
Accuracy: 0.8717775032872455, Eval Runtime: 108.51


0.8717775032872455

In [None]:
# Build SVC model
from sklearn.svm import LinearSVC
model_svc = Custom_POS_Tagger(feature_detector=pos_features, classifier=LinearSVC())
model_svc.fit(train_list)
model_svc.evaluate(test_list)

Commencing feature extraction...
Training model on features set size (204607, 256137)
Training complete. Total runtime: 54.81 sec
Accuracy: 0.9373630314380205, Eval Runtime: 7.60


0.9373630314380205

In [None]:
# Build Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
model_mnb = Custom_POS_Tagger(feature_detector=pos_features, classifier=MultinomialNB())
model_mnb.fit(train_list)
model_mnb.evaluate(test_list)

Commencing feature extraction...
Training model on features set size (204607, 256137)
Training complete. Total runtime: 21.81 sec
Accuracy: 0.8730127106825517, Eval Runtime: 1376.65


0.8730127106825517

# 4\. Build a POS tagger using NLTK

In this section, we are going to create our own POS tagger, just as we did above, but this time using NLTK's built-in POS tagger training functionality. We will then compare our custom model with the NLTK implementation to see how the accuracy and runtime values compare!

**Here is the documentation:** 

`Init signature: 
ClassifierBasedPOSTagger(feature_detector=None, train=None, classifier_builder=<bound method NaiveBayesClassifier.train of <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>>, classifier=None, backoff=None, cutoff_prob=None, verbose=False)

Docstring:      A classifier based part of speech tagger.
File:           /usr/local/lib/python3.6/dist-packages/nltk/tag/sequential.py
Type:           ABCMeta`

*NOTE: its base class is `ClassifierBasedTagger`, which takes the same arguments; `ClassifierBasedPOSTagger` adds a default feature detector for POS tagging.*


**Parameters**:	
- feature_detector – A function used to generate the featureset input for the classifier: `feature_detector(tokens, index, history) -> featureset` *(ClassifierBasedPOSTagger has a default feature detector that is very similar to the one we constructed above)*
- train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
- classifier_builder – A function used to train a new classifier based on the data in train. It should take one argument, a list of labeled featuresets (i.e., (featureset, label) tuples).
- classifier – The classifier that should be used by the tagger. This is only useful if you want to manually construct the classifier; normally, you would use train instead.
- backoff – A backoff tagger, used if this tagger is unable to determine a tag for a given token (for example, when it encounters an unknown context).
- cutoff_prob – If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.
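As an illustration of the `feature_detector(tokens, index, history) -> featureset` contract, here is a small hypothetical detector (`suffix_features` is an invented name, and the features chosen are only an example). The commented-out lines sketch how it might be passed to the tagger, following the init signature shown above; the `cutoff_prob` value is arbitrary.

```python
def suffix_features(tokens, index, history):
    """Hypothetical feature detector: (tokens, index, history) -> featureset."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
        # history holds the tags already assigned to tokens[:index]
        "prev-pos": history[index - 1] if index > 0 else "<START>",
    }

print(suffix_features(["Google", "Morphed"], 1, ["NNP"]))
# {'word': 'morphed', 'suffix3': 'hed', 'is_title': True, 'prev-pos': 'NNP'}

# Assumed usage (not run here):
# from nltk.tag import DefaultTagger
# from nltk.tag.sequential import ClassifierBasedPOSTagger
# tagger = ClassifierBasedPOSTagger(feature_detector=suffix_features,
#                                   train=train_list,
#                                   backoff=DefaultTagger('NN'),
#                                   cutoff_prob=0.3)
```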

In [None]:
from nltk.tag.sequential import ClassifierBasedPOSTagger
t0 = time()
model_nltk = ClassifierBasedPOSTagger(train=train_list)
print("Runtime:", time()-t0)


Runtime: 7.268903017044067


In [None]:
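t0 = time()
# Note: scored on the first 1000 test sentences only, unlike the custom models above
# (recent NLTK versions deprecate `evaluate` in favour of `accuracy`)
score = model_nltk.evaluate(test_list[0:1000])
print("Runtime:", time()-t0)
score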

Runtime: 16.696187019348145


0.8973832344439373
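To make the comparison easier to read, the snippet below simply collects the numbers printed in the cell outputs above (rounded) and ranks them by accuracy. Keep in mind that the NLTK tagger was scored on only the first 1000 test sentences, so its row is not strictly comparable to the others.

```python
# (model, accuracy, train seconds, eval seconds) as reported in the outputs above
results = [
    ("RandomForest",                  0.8853, 345.97,  110.01),
    ("RandomForest max_depth=200",    0.8718, 220.65,  108.51),
    ("LinearSVC",                     0.9374,  54.81,    7.60),
    ("MultinomialNB",                 0.8730,  21.81, 1376.65),
    ("NLTK ClassifierBasedPOSTagger", 0.8974,   7.27,   16.70),
]

# Rank from most to least accurate
ranked = sorted(results, key=lambda row: row[1], reverse=True)

print("{:<30} {:>8} {:>9} {:>9}".format("model", "accuracy", "train_s", "eval_s"))
for name, acc, train_s, eval_s in ranked:
    print("{:<30} {:>8.4f} {:>9.2f} {:>9.2f}".format(name, acc, train_s, eval_s))
```

In these runs `LinearSVC` has the highest accuracy and also the fastest evaluation time of the custom models.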

# 5\. POS use cases

As you can imagine, POS tagging serves many uses in NLP, and is the building block for many higher-level concepts, particularly:
- **Named Entity Recognition (NER)**: extracting important entities from text (like names, companies, locations, etc.) by essentially grouping POS-tagged tokens into chunks
- **Dependency Parsing**: parsing dependencies and parent-child relationships between words in a sentence

We will learn about these more advanced concepts in later sections. For now, take a look at the example below, which uses NLTK's pre-trained models, to get a sense of what the outputs look like.

In [None]:
import nltk

# Create sentence
sentence = """Hi! My name is Mike Ciniello and I am the smartest man alive. I work at KPMG Canada."""
print("Raw Sentence:\n", sentence)

# Tokenize
tokens = nltk.word_tokenize(sentence)
print("\nTokens:\n", tokens)

# Tag tokenized data
tagged_tokens = nltk.pos_tag(tokens)
print("\nPOS Tokens:\n", tagged_tokens)

# Create NER tree
ner_annotated_tree = nltk.ne_chunk(tagged_tokens)
print("\nNER Tree:\n",ner_annotated_tree)

Raw Sentence:
 Hi! My name is Mike Ciniello and I am the smartest man alive. I work at KPMG Canada.

Tokens:
 ['Hi', '!', 'My', 'name', 'is', 'Mike', 'Ciniello', 'and', 'I', 'am', 'the', 'smartest', 'man', 'alive', '.', 'I', 'work', 'at', 'KPMG', 'Canada', '.']

POS Tokens:
 [('Hi', 'NN'), ('!', '.'), ('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Mike', 'NNP'), ('Ciniello', 'NNP'), ('and', 'CC'), ('I', 'PRP'), ('am', 'VBP'), ('the', 'DT'), ('smartest', 'JJS'), ('man', 'NN'), ('alive', 'JJ'), ('.', '.'), ('I', 'PRP'), ('work', 'VBP'), ('at', 'IN'), ('KPMG', 'NNP'), ('Canada', 'NNP'), ('.', '.')]

NER Tree:
 (S
  (GPE Hi/NN)
  !/.
  My/PRP$
  name/NN
  is/VBZ
  (PERSON Mike/NNP Ciniello/NNP)
  and/CC
  I/PRP
  am/VBP
  the/DT
  smartest/JJS
  man/NN
  alive/JJ
  ./.
  I/PRP
  work/VBP
  at/IN
  (ORGANIZATION KPMG/NNP Canada/NNP)
  ./.)
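The tree printed above can also be walked programmatically to pull out just the entities. The sketch below assumes only that `nltk` is installed; `extract_entities` is a hypothetical helper, and the tree is built by hand to mirror part of the output above so that no model download is needed.

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Return (label, text) pairs for each named-entity chunk in an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Chunks are subtrees; ordinary tokens are plain (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, _ in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built tree mirroring part of the NER output above
sample_tree = Tree('S', [
    Tree('PERSON', [('Mike', 'NNP'), ('Ciniello', 'NNP')]),
    ('and', 'CC'), ('I', 'PRP'), ('work', 'VBP'), ('at', 'IN'),
    Tree('ORGANIZATION', [('KPMG', 'NNP'), ('Canada', 'NNP')]),
])
print(extract_entities(sample_tree))
# [('PERSON', 'Mike Ciniello'), ('ORGANIZATION', 'KPMG Canada')]
```

Applied to `ner_annotated_tree`, the same function would also return the `(GPE Hi/NN)` chunk, a reminder that the pre-trained chunker is not always right.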
