# Part of Speech Tagging
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. 

## Tagsets
The collection of tags used for a particular task 
Tags for parts of speech:	
- Nouns, verbs, adverbs, adjectives, articles, etc	
- Subtagging
    - nouns	can	be singular	or plural	
    - verbs	have tenses	
- Different	tagsets	have different focuses	

## Using a Tagger
The tagger converts a sentence, in the form of a list of words,
into a list of tuples, where each tuple is of the form (word, tag).

In [1]:
import nltk
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

We don’t even remember what all the tags mean sometimes

In [2]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


### Homographs


In [3]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit") 
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

## Tagged Corpora

### Representing Tagged Tokens
By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag.

In [4]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')

In [5]:
tagged_token[0]

'fly'

In [6]:
tagged_token[1]

'NN'

We can construct a list of tagged tokens directly from a string

In [7]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
# TODO: Convert it to list of tuples (word,tag)


### Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

*The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.*

#### Many Tag Sets
Different corpora have different conventions for tagging.	
- NLTK made a simplified, unified tagset
 …	which	no	one	uses.	

In [8]:
print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.brown.tagged_words(tagset='universal'))

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('The', 'DET'), ('Fulton', 'NOUN'), ...]


In [9]:
print(nltk.corpus.treebank.tagged_words())
print(nltk.corpus.treebank.tagged_words(tagset='universal'))

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]


###  Exploring Tagged Corpora

#### Finding the Tagset
Let's find the list of universal tages sorted by frequency

In [10]:
from nltk.corpus import brown 
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal') 
# TODO: do it here


In [11]:
from nltk.corpus import brown
def process(sentence): 
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
            
for tagged_sent in brown.tagged_sents()[:100]:
    process(tagged_sent)

combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
expected to approve
expected to make
intends to make
seek to set
like to see


###### Syntax note: Python Default Dictionaries

## Automatic Tagging
We will explore various ways to automatically add part-of-speech tags to text. We will see that the tag of a word depends on the word and its context within a sentence. For this reason, we will be working with data at the level of (tagged) sentences rather than words. We'll begin by loading the data we will be using.

In [12]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

### The Default Tagger
The simplest possible tagger assigns the same tag to each token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag. Let's find out which tag is most likely

In [13]:
brown_tagged_words = brown.tagged_words(categories='news')
#TODO: what is the most frequent tag in brown


Now we can create a tagger that tags everything as NN.

In [14]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.tag(tokens))

[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]


Unsurprisingly, this method performs rather **poorly**. On a typical corpus, it will tag only about an eighth of the tokens correctly, 

###### How to evalute the porformance?
We evaluate the performance of a tagger relative to the tags a human expert would assign. Since we don't usually have access to an expert and impartial human judge, we make do instead with gold standard test data. This is a corpus which has been manually annotated and which is accepted as a standard against which the guesses of an automatic system are assessed. The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.

In [15]:
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028

### The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of human-defined matching patterns. 

In [16]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*es$', 'VBZ'),                # 3rd singular present
     (r'.*ould$', 'MD'),               # modals
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'.*', 'NN')                     # nouns (default)
 ]

In [17]:
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(brown_sents[3]))

[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]


In [18]:
regexp_tagger.evaluate(brown_tagged_sents)

0.20326391789486245

### The Lookup Tagger (Unigram Tagger)
A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger):

In [19]:
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

most_freq_words = fd.most_common(100)

# TODO : make a dictionary of (word, most common tag for this word) for each word in most_freq_words
likely_tags = 

baseline_tagger = nltk.UnigramTagger(model=likely_tags)

baseline_tagger.evaluate(brown_tagged_sents)

SyntaxError: invalid syntax (<ipython-input-19-adc7bf950e00>, line 7)

It should come as no surprise by now that simply knowing the tags for the 100 most frequent words enables us to tag a large fraction of tokens correctly (nearly half in fact).

In [None]:
sent = brown.sents(categories='news')[3]
#TODO use it to tag sent


###### Combining taggers with backoff tagging
**Backoff tagging**: It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger. If that one can't do it, it can pass the word on to the next backoff tagger, and so on until there are no backoff taggers left to check.

In [None]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags,backoff=nltk.DefaultTagger('NN'))

Evaluate:

In [None]:
baseline_tagger.evaluate(brown_tagged_sents)

Let's put all this together and write a program to create and evaluate lookup taggers having a range of sizes,

In [None]:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

In [None]:
display()

###  N-Gram Tagging

#### Unigram Tagging
Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
A unigram tagger behaves just like a lookup tagger, except there is a more convenient technique for setting it up, called training. In the following code sample, we train a unigram tagger, use it to tag a sentence, then evaluate.

In [None]:
from nltk.corpus import brown
brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

# TODO: make a unigram tagger using brown_tagged_sents as a training data, simply be passing it to the UnigramTagger
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

print(unigram_tagger.tag(brown_sents[2007]))

In [None]:
unigram_tagger.evaluate(brown_tagged_sents)

###### Separating the Training and Testing Data
A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10%:

In [None]:
size = int(len(brown_tagged_sents) * 0.9)
print(size)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

In [None]:
# other data: treebank
#from nltk.corpus import treebank
#train_sents = treebank.tagged_sents()[:3000]
#test_sents = treebank.tagged_sents()[3000:]

In [None]:
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

#### General N-Gram Tagging
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens, 	

![title](img/tag-context.png)

##### Bigram Tagger

In [None]:
bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.tag(brown_sents[2007]))

In [None]:
unseen_sent = brown_sents[4203]
print(bigram_tagger.tag(unseen_sent))

In [None]:
#TODO: Evaluate bigram_tagger


**Why?**
*sparse data problem* As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. 
 there is a trade-off between the accuracy and the coverage of our results (and this is related to the *precision/recall trade-off* in information retrieval)

##### Layering taggers
Use the benefits of several types of taggers	
- Try the bigram tagger	
- When it is unable to find a tag, use the unigram tagger	
- If that fails, then use the default tagger	

In [20]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
#TODO: build a bigram tagger that backs off to unigram tagger t1

t2.evaluate(test_sents)

NameError: name 'train_sents' is not defined

Putting it all togather in a method...

In [21]:
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

In [22]:
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger
default_tagger = DefaultTagger('NN')
initial_tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger], backoff=default_tagger)
initial_tagger.evaluate(test_sents)

NameError: name 'train_sents' is not defined

##### Tagging Unknown Words
- Our approach to tagging unknown words still uses *backoff* to a regular-expression tagger or a default tagger. These are unable to make use of context.
- if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context *the blog* or *to blog*
- How can we do better with these unknown words, or **out-of-vocabulary** items?
limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word *UNK* During training, a unigram tagger will probably learn that *UNK* is usually a *noun*.
However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then *UNK* will probably be tagged as a *verb*.

#####  Storing Taggers
Training a tagger on a large corpus may take a significant time.

In [23]:
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()

NameError: name 't2' is not defined

In [24]:
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()

EOFError: Ran out of input

In [25]:
text = """The board's action shows what free enterprise
...     is up against in our complex maze of regulatory laws ."""
tokens = text.split()
print(tagger.tag(tokens))

NameError: name 'tagger' is not defined

##### Performance Limitations
What is the upper limit to the performance of an n-gram tagger? 
- What is the precentage of words that have ambiguous contexts and could be assigned the wrong tag ...
Using trigram tagger

In [None]:
cfd = nltk.ConditionalFreqDist(
            ((x[1], y[1], z[0]), z[1])
            for sent in brown_tagged_sents
            for x, y, z in nltk.trigrams(sent))
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()

Use the **confusion matrix** to study the tagger's mistakes

**How *"fine"* the tagset should be?**
tagging process collapses distinctions: e.g. *lexical identity* is usually lost when all personal pronouns are tagged *PRP*. At the same time, the tagging process introduces new *distinction*s and removes ambiguities: e.g. *deal* tagged as *VB* or *NN*. 
- Finer distinctions (detailed tagset):  more information to tag on, but more work to do
- Fewer distinctions ( simplified tagset):  less information about context, but smaller range of choices in classification

##### What’s wrong with this picture? 
- Size of the n-gram table language model
- Context!	
        Words,not just pos, matter in context	
- Not understandable rules	

##  Transformation-Based Tagging (Brill Tagger)
Brill tagging is a kind of transformation-based learning, named after its inventor.
- The general idea: guess the tag of each word, then go back and fix the mistakes. 
- It uses supervised learning to build a list of transformational correction rules.
- First tag with the unigram tagger, then applying the rules to fix the errors.
- Rules ("replace T1 with T2 in the context C") 
    like:
    - (a) Replace *NN* with *VB* when the *previous* **word** is *TO*; 
    - (b) Replace *TO* with *IN* when the *next* **tag** is *NNS*
- Training: 
    - **guess** values for T1, T2 and C, to create thousands of candidate rules.
    - Each rule is **scored** according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies
- The rules are linguistically **interpretable**

In [26]:
from nltk.tbl import demo as brill_tagger
brill_tagger.demo()

Loading tagged data from treebank... 
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
    Accuracy on test set: 0.8366
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
    Found 12799 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]
  12  12   0   0  | VBP->VB if Pos:TO@[-1]
  

### Training Brill Tagger

In [27]:
from nltk.tag import brill, brill_trainer
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Pos([1])),
        brill.Template(brill.Pos([-2])),
        brill.Template(brill.Pos([2])),
        brill.Template(brill.Pos([-2, -1])),
        brill.Template(brill.Pos([1, 2])),
        brill.Template(brill.Pos([-3, -2, -1])),
        brill.Template(brill.Pos([1, 2, 3])),
        brill.Template(brill.Pos([-1]), brill.Pos([1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([1])),brill.Template(brill.Word([-2])),
        brill.Template(brill.Word([2])),
        brill.Template(brill.Word([-2, -1])),
        brill.Template(brill.Word([1, 2])),
        brill.Template(brill.Word([-3, -2, -1])),
        brill.Template(brill.Word([1, 2, 3])),
        brill.Template(brill.Word([-1]), brill.Word([1])),
    ]
    trainer = brill_trainer.BrillTaggerTrainer(initial_tagger,templates, deterministic=True)
    return trainer.train(train_sents, **kwargs)

***Compare Trigram tagger with brill***

In [28]:
from nltk.tag import TrigramTagger
initial_tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=default_tagger)
initial_tagger.evaluate(test_sents)

NameError: name 'train_sents' is not defined

In [None]:
brill_tagger = train_brill_tagger(initial_tagger, train_sents)
brill_tagger.evaluate(test_sents)

So, the BrillTagger has slightly increased accuracy over the initial_tagger

## Taggers are just classifiers
The modern approach: just turn your training data into features and throw them into a good classifier	

### TnT tagger
TnT stands for Trigrams'n'Tags. It is a statistical tagger based on second order Markov models.
The TnT tagger maintains a number of internal FreqDist and ConditionalFreqDist instances based on the training data. 
- These frequency distributions *count unigrams, bigrams, and trigrams*. 
- Then, during tagging, the frequencies are used to calculate the probabilities of possible tags for each word. So, instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger **uses all the ngram models together to choose the best tag**. 
- It also tries to guess the tags for the **whole sentence** at once by choosing the most likely model for the entire sentence, based on the probabilities of each possible tag.

In [None]:
from nltk.tag import tnt
tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sents)
tnt_tagger.evaluate(test_sents[:100])
#tnt_tagger.evaluate(test_sents)

Training is fast but Tagging Process is slow

### Classifier-based tagging
The ClassifierBasedPOSTagger class uses classification to do part-of-speech tagging. Features are extracted from words, and then passed  to an internal classifier.

It defaults to training a ***NaiveBayesClassifier*** class with the given training data.
The feature detector finds multiple length suffixes, does some regular expression matching, and looks at the unigram, bigram, and trigram history to produce a fairly complete set of features for each word.
The feature sets it produces are used to train the internal classifier, and are used for classifying words into part-of-speech tags.

In [None]:
from nltk.tag.sequential import ClassifierBasedPOSTagger
tagger = ClassifierBasedPOSTagger(train=train_sents)
tagger.evaluate(test_sents)
#0.9309734513274336

Using ***MaxentClassifier***

In [None]:
from nltk.classify import MaxentClassifier
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger.evaluate(test_sents)

#### Detecting features with a custom feature detector
If you want to do your own feature detection, there are two ways to do it:
1. Subclass ClassifierBasedTagger and implement a feature_detector() method.
2. Pass a function as the feature_detector keyword argument into ClassifierBasedTagger at initialization.

In [None]:
def unigram_feature_detector(tokens, index, history):
    return {'word': tokens[index]}

In [None]:
from nltk.tag.sequential import ClassifierBasedTagger

tagger = ClassifierBasedTagger(train=train_sents, feature_detector=unigram_feature_detector)
tagger.evaluate(test_sents)

#### How to Determine the Category of a Word
- Morphological Clues: The internal structure of a word may give useful clues as to the word's category. 
    For example, -ness is a suffix that combines with an adjective to produce a noun, e.g. happy → happiness.
- Syntactic Clues: the typical contexts in which a word can occur. 
    For example, adjective in English is that it can occur immediately before a noun, or immediately following the words *be* or *very*. e.g. the near window .... The end is (very) near.
- Semantic Clues: the meaning of a word.  
    For example, the best-known definition of a noun is semantic: "the name of a person, place or thing".

New Words: mostly nouns, (open class)
