## Part-of-Speech Tagging

We will look at the following four main techniques used for POS tagging:

Lexicon-based

Rule-based

Probabilistic (or stochastic) techniques

Deep learning techniques

<b>Lexicon tagger</b> will tag 'run' basis the highest frequency tag. In most contexts, 'run' is likely to appear as a verb, implying that 'run' will be wrongly tagged in the first sentence.

 

But if there’s a rule that is applied to the entire text, such as, 'replace VB with NN if the previous tag is DT', or 'tag all words ending with ing as VBG', the tag can be corrected. <b>Rule-based tagging</b> methods use such an approach.

 

<b>Probabilistic taggers</b> don't naively assign the highest frequency tag to each word, instead, they look at slightly longer parts of the sequence and often use the tag(s) and the word(s) appearing before the target word to be tagged.

## POS Tagging - Lexicon and Rule Based Taggers

### 1. Reading and understanding the tagged dataset

In [14]:
# Importing libraries
import nltk
import numpy as np
import pandas as pd
import pprint, time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
import math

In [15]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to /Users/sarab/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

In [35]:
# loading the Treebank tagged sentences
wsj = list(nltk.corpus.treebank.tagged_sents())

In [36]:
# samples: Each sentence is a list of (word, pos) tuples
print(wsj[:3])

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ','), ('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')]]


In [37]:
# coverting list of sents to a list of (word, pos tag) tuples to make th other data preprocessing steps easier and convenient to process
tagged_words = [tup for sent in wsj for tup in sent]
print(len(tagged_words))
tagged_words[:10]

100676


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT')]

### 2. Exploratory Analysis

We can explore the data to understand the POS tags. We can try to get the tags and words details.

1. How many unique tags are there in the corpus? 
2. Which is the most frequent tag in the corpus?
3. Which tag is most commonly assigned to the following words:
    - "bank"
    - "executive"

In [38]:
# number of unique POS tags in the corpus
tags = [pair[1] for pair in tagged_words]
unique_tags = set(tags)
len(unique_tags)

46

In [40]:
# Most frequent tag in the dataset
from collections import Counter
result_counts = Counter(tags)
result_counts

Counter({'NN': 13166,
         'IN': 9857,
         'NNP': 9410,
         'DT': 8165,
         '-NONE-': 6592,
         'NNS': 6047,
         'JJ': 5834,
         ',': 4886,
         '.': 3874,
         'CD': 3546,
         'VBD': 3043,
         'RB': 2822,
         'VB': 2554,
         'CC': 2265,
         'TO': 2179,
         'VBN': 2134,
         'VBZ': 2125,
         'PRP': 1716,
         'VBG': 1460,
         'VBP': 1321,
         'MD': 927,
         'POS': 824,
         'PRP$': 766,
         '$': 724,
         '``': 712,
         "''": 694,
         ':': 563,
         'WDT': 445,
         'JJR': 381,
         'NNPS': 244,
         'WP': 241,
         'RP': 216,
         'JJS': 182,
         'WRB': 178,
         'RBR': 136,
         '-RRB-': 126,
         '-LRB-': 120,
         'EX': 88,
         'RBS': 35,
         'PDT': 27,
         '#': 16,
         'WP$': 14,
         'LS': 13,
         'FW': 4,
         'UH': 3,
         'SYM': 1})

In [41]:
# most common tags
tag_counts.most_common(5)

[('NN', 13166), ('IN', 9857), ('NNP', 9410), ('DT', 8165), ('-NONE-', 6592)]

In [46]:
# Most commonly tag assigned to the word bank.
bank = [pair for pair in tagged_words if pair[0].lower() == 'bank']
print(bank)

[('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank', 'NN'), ('bank', 'NN'), ('Bank', 'NNP'), ('bank',

In [47]:
# Most commonly tag assigned to the word executive.
executive = [pair for pair in tagged_words if pair[0].lower() == 'executive']
print(executive)

[('executive', 'NN'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'NN'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'JJ'), ('executive', 'NN'), ('executive', 'NN'), ('executive'

In [48]:
# Words with the tag 'VBD' (verb, past tense) end with 'ed'
past_tense_verbs = [pair for pair in tagged_words if pair[1]=='VBD']
ed_verbs = [pair for pair in past_tense_verbs if pair[0].endswith('ed')]
print(len(ed_verbs) / len(past_tense_verbs))
ed_verbs[:20]

0.3881038448899113


[('reported', 'VBD'),
 ('stopped', 'VBD'),
 ('studied', 'VBD'),
 ('led', 'VBD'),
 ('worked', 'VBD'),
 ('explained', 'VBD'),
 ('imposed', 'VBD'),
 ('dumped', 'VBD'),
 ('poured', 'VBD'),
 ('mixed', 'VBD'),
 ('described', 'VBD'),
 ('ventilated', 'VBD'),
 ('contracted', 'VBD'),
 ('continued', 'VBD'),
 ('eased', 'VBD'),
 ('ended', 'VBD'),
 ('lengthened', 'VBD'),
 ('reached', 'VBD'),
 ('resigned', 'VBD'),
 ('approved', 'VBD')]

In [49]:
# Words with the tag 'VBG' end with 'ing'
participle_verbs = [pair for pair in tagged_words if pair[1]=='VBG']
ing_verbs = [pair for pair in participle_verbs if pair[0].endswith('ing')]
print(len(ing_verbs) / len(participle_verbs))
ing_verbs[:20]

0.9972602739726028


[('publishing', 'VBG'),
 ('causing', 'VBG'),
 ('using', 'VBG'),
 ('talking', 'VBG'),
 ('having', 'VBG'),
 ('making', 'VBG'),
 ('surviving', 'VBG'),
 ('including', 'VBG'),
 ('including', 'VBG'),
 ('according', 'VBG'),
 ('remaining', 'VBG'),
 ('according', 'VBG'),
 ('declining', 'VBG'),
 ('rising', 'VBG'),
 ('yielding', 'VBG'),
 ('waiving', 'VBG'),
 ('holding', 'VBG'),
 ('holding', 'VBG'),
 ('cutting', 'VBG'),
 ('manufacturing', 'VBG')]

## 3. Lexicon and Rule-Based Models for POS Tagging

Let's now see lexicon and rule-based models for POS tagging. We'll first split the corpus into training and test sets and then use built-in NLTK taggers. 

### 3.1 Splitting into Train and Test Sets

In [50]:
# splitting into train and test
random.seed(1234)
train_set, test_set = train_test_split(wsj, test_size=0.3)

print(len(train_set))
print(len(test_set))
print(train_set[:2])

2739
1175
[[('Sen.', 'NNP'), ('Danforth', 'NNP'), ('and', 'CC'), ('others', 'NNS'), ('also', 'RB'), ('want', 'VBP'), ('the', 'DT'), ('department', 'NN'), ('to', 'TO'), ('require', 'VB'), ('additional', 'JJ'), ('safety', 'NN'), ('equipment', 'NN'), ('*ICH*-1', '-NONE-'), ('in', 'IN'), ('light', 'JJ'), ('trucks', 'NNS'), ('and', 'CC'), ('minivans', 'NNS'), (',', ','), ('including', 'VBG'), ('air', 'NN'), ('bags', 'NNS'), ('or', 'CC'), ('automatic', 'JJ'), ('seat', 'NN'), ('belts', 'NNS'), ('in', 'IN'), ('front', 'JJ'), ('seats', 'NNS'), ('and', 'CC'), ('improved', 'VBN'), ('side-crash', 'JJ'), ('protection', 'NN'), ('.', '.')], [('The', 'DT'), ('company', 'NN'), ('said', 'VBD'), ('0', '-NONE-'), ('local', 'JJ'), ('authorities', 'NNS'), ('held', 'VBD'), ('hearings', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('allegations', 'NNS'), ('last', 'JJ'), ('spring', 'NN'), ('and', 'CC'), ('had', 'VBD'), ('returned', 'VBN'), ('the', 'DT'), ('plant', 'NN'), ('to', 'TO'), ('``', '``'), ('routine', 'JJ'), 

### 3.2 Lexicon (Unigram) Tagger

In NLTK, the `UnigramTagger()`  can be used to train a model.

In [51]:
# Lexicon (or unigram tagger)
unigram_tagger = nltk.UnigramTagger(train_set)
unigram_tagger.evaluate(test_set)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  unigram_tagger.evaluate(test_set)


0.8748310086721404

### 3.3. Rule-Based (Regular Expression) Tagger

In [58]:
# specify patterns for tagging
# example from the NLTK book
patterns = [
    (r'.*ing$', 'VBG'),              # gerund
    (r'.*ed$', 'VBD'),               # past tense
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
    (r'.*\'s$', 'NN$'),              # possessive nouns
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                    # nouns
]

In [59]:
regexp_tagger = nltk.RegexpTagger(patterns)
# help(regexp_tagger)

In [60]:
regexp_tagger.evaluate(test_set)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  regexp_tagger.evaluate(test_set)


0.21871599564744287