***
# <center>***Part of speech Tagging***

***

## ***I learned the following natural language processing techniques:***

- [Default tagging](#Default-tagging)  
- [Training a unigram part-of-speech tagger](#Training-a-unigram-part-of-speech-tagger)  
- [Combining taggers with backoff tagging](#Combining-taggers-with-backoff-tagging)  
- [Training and combining ngram taggers](#Training-and-combining-ngram-taggers)  
- [Creating a model of likely word tags](#Creating-a-model-of-likely-word-tags)  
- [Tagging with regular expressions](#Tagging-with-regular-expressions)  
- [Affix tagging](#Affix-tagging)  
- [Training a Brill tagger](#Training-a-Brill-tagger)  
- [Training the TnT tagger](#Training-the-TnT-tagger)  
- [Using WordNet for tagging](#Using-WordNet-for-tagging)  
- [Tagging proper names](#Tagging-proper-names)  
- [Classifier-based tagging](#Classifier-based-tagging)  
- [Training a tagger with NLTK-Trainer](#Training-a-tagger-with-NLTK-Trainer)  


**Part-of-speech tagging** is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on. Part-of-speech tagging is a necessary step before chunking. Without the part-of-speech tags, a chunker cannot know how to extract 
phrases from a sentence. But with part-of-speech tags, you can tell a chunker how to identify phrases based on tag patterns.



***
## ***<a id="Default-tagging"></a>Default tagging:***
***


**Default tagging** provides a baseline for part-of-speech tagging. It simply assigns the same part-of-speech tag to every token. We do this using the `DefaultTagger` class. This tagger is useful as a last-resort tagger at the end of a backoff chain, and its score is the reference point against which accuracy improvements are measured.

We are going to use the **treebank** corpus because it is a common standard and is quick to load and test. But everything we do should apply equally well to **brown**, **conll2000**, and any other part-of-speech tagged corpus.

The **`DefaultTagger`** class takes a single argument, the tag you want to apply. We will give it **NN**, which is the tag for a singular noun. **DefaultTagger** is most useful when you choose the most common part-of-speech tag. 

In [20]:
import warnings
warnings.filterwarnings("ignore")

from nltk.tag import DefaultTagger
tagger = DefaultTagger('NN')
tagger.tag(['Hello', 'World'])


[('Hello', 'NN'), ('World', 'NN')]

Every **tagger** has a **tag()** method that takes a list of tokens, where each token is a single word. This list of tokens is usually a list of words produced by a word tokenizer. As you can see, **tag()** returns a list of **tagged tokens**, where a tagged token is a tuple of **(word, tag)**.

***`DefaultTagger`*** is a subclass of ***`SequentialBackoffTagger`***. Every subclass of **SequentialBackoffTagger** must implement the choose_tag() method, which takes three arguments:
 - The list of tokens
 - The index of the current token whose tag we want to choose
 - The history, which is a list of the previous tags

`SequentialBackoffTagger` implements the `tag()` method, which calls the `choose_tag()` method of the subclass for each index in the tokens list while accumulating a history of the previously chosen tags. This history is the reason for the *Sequential* in `SequentialBackoffTagger`. We'll get to the *backoff* portion of the name in the [Combining taggers with backoff tagging](#Combining-taggers-with-backoff-tagging) section.
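
The loop described above can be sketched in plain Python. This is a simplified illustration, not NLTK's actual source: the class names `SketchSequentialTagger` and `SketchDefaultTagger` are hypothetical, and backoff handling is left out entirely.

```python
class SketchSequentialTagger:
    """Hypothetical stand-in for a sequential tagger (no backoff support)."""

    def choose_tag(self, tokens, index, history):
        # Subclasses decide the tag for tokens[index].
        raise NotImplementedError

    def tag(self, tokens):
        history = []  # tags chosen so far, in sentence order
        for index in range(len(tokens)):
            history.append(self.choose_tag(tokens, index, history))
        return list(zip(tokens, history))


class SketchDefaultTagger(SketchSequentialTagger):
    def __init__(self, tag):
        self._tag = tag

    def choose_tag(self, tokens, index, history):
        # Ignores both the token and the history: always the same tag.
        return self._tag


sketch_result = SketchDefaultTagger('NN').tag(['Hello', 'World'])
print(sketch_result)
```

A default tagger never looks at `history`, but the ngram taggers later in this notebook depend on it.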

**Evaluating accuracy:**  

To know how accurate a tagger is, you can use the evaluate() method, which takes a list 
of tagged tokens as a gold standard to evaluate the tagger. Using our default tagger created 
earlier, we can evaluate it against a subset of the treebank corpus tagged sentences.

In [21]:

from nltk.corpus import treebank
test_sents = treebank.tagged_sents()[3000:]
tagger.evaluate(test_sents)


0.14331966328512843

So, by just choosing **NN** for every token, we achieve about **14%** accuracy when testing on roughly one-fourth of the treebank corpus. Of course, accuracy will be different if you choose a different default tag.
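
Conceptually, the score is the fraction of tokens whose predicted tag matches the gold-standard tag. The helper below is a minimal sketch of that computation, assuming only an object with a `tag()` method; `sketch_accuracy` and `AlwaysNN` are hypothetical names, not NLTK functions. Recent NLTK releases also expose this measure as `accuracy()` and mark `evaluate()` as deprecated, so it is worth checking which one your installed version prefers.

```python
def sketch_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]   # strip the gold tags
        predicted = tagger.tag(words)        # re-tag the bare words
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0


class AlwaysNN:
    """Hypothetical tagger that mimics DefaultTagger('NN')."""

    def tag(self, tokens):
        return [(token, 'NN') for token in tokens]


gold = [
    [('the', 'DT'), ('board', 'NN')],
    [('join', 'VB'), ('director', 'NN')],
]
score = sketch_accuracy(AlwaysNN(), gold)
print(score)
```

Two of the four toy tokens are nouns, so the all-`NN` tagger gets half of them right.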

**Tagging sentences:**

**TaggerI** also implements a **tag_sents()** method that can be used to tag a list of sentences, instead of a single sentence. Here is an example of tagging two simple sentences:

In [22]:

tagger.tag_sents([['Hello', 'world', '.'], ['How', 'are', 'you', '?']])


[[('Hello', 'NN'), ('world', 'NN'), ('.', 'NN')],
 [('How', 'NN'), ('are', 'NN'), ('you', 'NN'), ('?', 'NN')]]

The result is a list of two tagged sentences, and of course, every tag is `NN` because we're using the `DefaultTagger` class. The `tag_sents()` method can be quite useful if you have many sentences you wish to tag all at once.

**Untagging a tagged sentence:**

Tagged sentences can be untagged using `nltk.tag.untag()`. Calling this function with  a tagged sentence will return a list of words without the tags.

In [23]:

from nltk.tag import untag
untag([('Hello', 'NN'), ('World', 'NN')])


['Hello', 'World']



***
## ***<a id="Training-a-unigram-part-of-speech-tagger"></a>Training a unigram part-of-speech tagger:***
***




A **unigram** generally refers to a single token. Therefore, a unigram tagger only uses a single word as its context for determining the part-of-speech tag. `UnigramTagger` inherits from `NgramTagger`, which is a subclass of `ContextTagger`, which inherits from `SequentialBackoffTagger`. In other words, `UnigramTagger` is a context-based tagger whose context is a single word, or unigram.

**UnigramTagger** can be trained by giving it a list of tagged sentences at initialization.

In [24]:

from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
treebank.sents()[0]


['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [25]:

tagger.tag(treebank.sents()[0])


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

We use the first 3000 tagged sentences of the treebank corpus as the training set to initialize the `UnigramTagger` class. We then print the first sentence as a list of words and see how the `tag()` method transforms it into a list of tagged tokens.

Let's see how accurate the `UnigramTagger` class is on the test sentences:

In [26]:

tagger.evaluate(test_sents)


0.8571551910209367

It has almost `86%` accuracy for a tagger that only uses single word lookup to determine the part-of-speech tag. All accuracy gains from here on will be much smaller.
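
The single-word lookup behind this tagger amounts to counting how often each tag occurs for each word and keeping the most frequent one. The following is a minimal sketch of that idea on toy data, not NLTK's implementation; `train_unigram_model` and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Map every word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


toy_sents = [
    [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('meet', 'VB')],
    [('they', 'PRP'), ('board', 'VBP'), ('the', 'DT'), ('plane', 'NN')],
    [('the', 'DT'), ('board', 'NN'), ('voted', 'VBD')],
]
toy_model = train_unigram_model(toy_sents)

# Unknown words have no entry, so the lookup yields None for them.
toy_tagged = [(w, toy_model.get(w)) for w in ['the', 'board', 'resigned']]
print(toy_tagged)
```

Here `board` is seen twice as `NN` and once as `VBP`, so the model always picks `NN`, and a word that never appeared in training gets no tag at all.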

**Overriding the context model:**

All taggers that inherit from ContextTagger can take a pre-built model instead of training their own. This model is simply a Python dict mapping a context key to a tag. The context keys will depend on what the ContextTagger subclass returns from its context() method. For UnigramTagger, context keys are individual words. But for other NgramTagger subclasses, the context keys will be tuples.

Here's an example where we pass a very simple model to the `UnigramTagger` class instead of a training set. Every word that is not in the model is tagged `None`:

In [27]:

tagger = UnigramTagger(model={'Pierre': 'NN'})
tagger.tag(treebank.sents()[0])


[('Pierre', 'NN'),
 ('Vinken', None),
 (',', None),
 ('61', None),
 ('years', None),
 ('old', None),
 (',', None),
 ('will', None),
 ('join', None),
 ('the', None),
 ('board', None),
 ('as', None),
 ('a', None),
 ('nonexecutive', None),
 ('director', None),
 ('Nov.', None),
 ('29', None),
 ('.', None)]

**Minimum frequency cutoff:**

The `ContextTagger` class uses frequency of occurrence to decide which tag is most likely for a given context. By default, it will do this even if the context word and tag occur only once. If you would like to set a minimum frequency threshold, you can pass a `cutoff` value to the `UnigramTagger` class.

In [28]:

tagger = UnigramTagger(train_sents, cutoff=3)
tagger.evaluate(test_sents)


0.775350744657889
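
In this run the cutoff lowers accuracy compared with the tagger trained without one (about 78% versus about 86%), presumably because rarely seen words lose their entry in the model and are left untagged. The sketch below illustrates the pruning on toy data. It assumes an entry is kept only when its best tag was seen strictly more than `cutoff` times, which is my reading of NLTK's behavior and worth verifying against your version; `unigram_model_with_cutoff` is a hypothetical name.

```python
from collections import Counter


def unigram_model_with_cutoff(tagged_words, cutoff=0):
    """Keep a word -> tag entry only if the best tag was seen more than `cutoff` times."""
    pair_counts = Counter(tagged_words)
    best = {}  # word -> (tag, hits) for the most frequent tag seen so far
    for (word, tag), hits in pair_counts.items():
        if word not in best or hits > best[word][1]:
            best[word] = (tag, hits)
    # Assumption: the comparison is strict (hits > cutoff).
    return {word: tag for word, (tag, hits) in best.items() if hits > cutoff}


toy_words = [
    ('the', 'DT'), ('cat', 'NN'),
    ('the', 'DT'), ('dog', 'NN'),
    ('the', 'DT'), ('cat', 'NN'),
]
full_model = unigram_model_with_cutoff(toy_words)
pruned_model = unigram_model_with_cutoff(toy_words, cutoff=1)
print(full_model)
print(pruned_model)
```

With `cutoff=1`, the word `dog` (seen once) drops out of the model, while `the` and `cat` stay.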


***
## ***<a id="Combining-taggers-with-backoff-tagging"></a>Combining taggers with backoff tagging:***
***



**Backoff tagging** is one of the core features of `SequentialBackoffTagger`. It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger. If that one can't do it, it can pass the word on to the next backoff tagger, and so on until there are no backoff taggers left to check.

In [29]:

tagger1 = DefaultTagger('NN')
tagger2 = UnigramTagger(train_sents, backoff=tagger1)
tagger2.evaluate(test_sents)


0.8741204403194475
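
The gain over the plain unigram tagger comes from tokens that would otherwise be left as `None`. A minimal sketch of the chaining idea, using plain functions rather than NLTK classes (all names here are hypothetical):

```python
def make_lookup_tagger(model, backoff=None):
    """Return a function that tags one word, deferring to `backoff` when unsure."""
    def choose(word):
        tag = model.get(word)
        if tag is None and backoff is not None:
            return backoff(word)  # pass the word down the chain
        return tag
    return choose


def default_nn(word):
    # Last resort: every word is a singular noun.
    return 'NN'


toy_model = {'the': 'DT', 'will': 'MD'}
no_backoff = make_lookup_tagger(toy_model)
with_backoff = make_lookup_tagger(toy_model, backoff=default_nn)

words = ['the', 'chairman', 'will', 'resign']
plain_tags = [(w, no_backoff(w)) for w in words]
chained_tags = [(w, with_backoff(w)) for w in words]
print(plain_tags)
print(chained_tags)
```

The chained version is not always right (`resign` is a verb, not a noun), but a default guess is correct more often than no tag at all.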

**Saving and loading a trained tagger with pickle:**

Since training a tagger can take a while, and you generally only need to do the training once, pickling a trained tagger is a useful way to save it for later usage. If your trained tagger is called tagger, then here's how to dump and load it with pickle:

In [30]:

import pickle

# The with-block closes the file even if dump() raises.
with open('tagger.pickle', 'wb') as f:
    pickle.dump(tagger, f)


In [31]:

# Only unpickle files you trust; the with-block closes the file afterwards.
with open('tagger.pickle', 'rb') as f:
    tagger = pickle.load(f)


In [33]:

tagger.evaluate(test_sents)


0.775350744657889


***
## ***<a id="Training-and-combining-ngram-taggers"></a>Training and combining ngram taggers:***
***


In addition to `UnigramTagger`, there are two more NgramTagger subclasses: `BigramTagger` and `TrigramTagger`. The BigramTagger subclass uses the previous tag as part of its context, while the TrigramTagger subclass uses the previous two tags. An `ngram` is a subsequence of *n* items, so the `BigramTagger` subclass looks at two items and the `TrigramTagger` subclass looks at three items.

In [34]:

from nltk.tag import BigramTagger, TrigramTagger
bitagger = BigramTagger(train_sents)
bitagger.evaluate(test_sents)


0.11318799913662854

In [35]:

tritagger = TrigramTagger(train_sents)
tritagger.evaluate(test_sents)


0.06902654867256637
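
A likely explanation for these low scores is that an ngram tagger's context includes the previous tags, not just the current word. Once one token is left untagged, the following contexts contain `None`, which never occurs in the training data, so the rest of the sentence tends to go untagged as well. The sketch below shows how such a context key could be built. It follows my understanding of `NgramTagger` (a tuple of up to *n* - 1 previous tags plus the current word), and `ngram_context` is a hypothetical helper, not an NLTK function.

```python
def ngram_context(tokens, index, history, n):
    """Build a (previous_tags, word) key from up to n - 1 previous tags."""
    prev_tags = tuple(history[max(0, index - n + 1):index])
    return (prev_tags, tokens[index])


tokens = ['will', 'join', 'the', 'board']
history = ['MD', 'VB']  # tags already chosen for 'will' and 'join'

bigram_key = ngram_context(tokens, 2, history, n=2)
trigram_key = ngram_context(tokens, 2, history, n=3)

# If 'join' had been left untagged, the key for 'the' would contain None
# and could not match anything learned from fully tagged training data.
broken_key = ngram_context(tokens, 2, ['MD', None], n=2)

print(bigram_key)
print(trigram_key)
print(broken_key)
```

This is also why these taggers work better with a backoff: the backoff fills in the gaps, so the history stays usable.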

`BigramTagger` and `TrigramTagger` become useful when we combine them with backoff tagging. This time, instead of creating each tagger individually, we will create a function that takes `train_sents`, a list of `SequentialBackoffTagger` classes, and an optional final backoff tagger, then trains each tagger with the previous tagger as its backoff.

In [37]:

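def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Train each class in order, handing it the previously built tagger
    # as its backoff; the last class in the list ends up outermost.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff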
    

In [39]:

backoff = DefaultTagger('NN')
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=backoff)


In [41]:

tagger.evaluate(test_sents)


0.8806388948845241

**Quadgram tagger:**

The `NgramTagger` class can be used by itself to create a tagger whose context key spans more than three items. Here, *n* = 4 gives a quadgram tagger.

In [44]:

from nltk.tag import NgramTagger
quadtagger = NgramTagger(4, train_sents)
quadtagger.evaluate(test_sents)


0.058493416792575005

It's even worse than the `TrigramTagger`. Here's an alternative implementation, a `QuadgramTagger` class that we can include in the list passed to `backoff_tagger`.

In [45]:

class QuadgramTagger(NgramTagger):
    def __init__(self, *args, **kwargs):
        NgramTagger.__init__(self, 4, *args, **kwargs)
        

In [46]:

quadtagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger, QuadgramTagger], backoff=backoff)
quadtagger.evaluate(test_sents)


0.8805093891646881

Adding the quadgram tagger does not help here: the accuracy is marginally lower than with the trigram chain (0.8805 versus 0.8806).



***
## ***<a id="Creating-a-model-of-likely-word-tags"></a>Creating a model of likely word tags:***
***



A model of likely word tags maps each of the most frequent words in a tagged corpus to the part-of-speech (POS) tag it most often carries. Such a model can be passed to `UnigramTagger` in place of a training set.


In [48]:

from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    # Map each of the `limit` most frequent words to its most frequent tag.
    fd = FreqDist(words)
    cfd = ConditionalFreqDist(tagged_words)
    most_freq = (word for word, count in fd.most_common(limit))
    return dict((word, cfd[word].max()) for word in most_freq)
    

In [49]:

from nltk.corpus import treebank
model = word_tag_model(treebank.words(), treebank.tagged_words())
tagger = UnigramTagger(model=model)
tagger.evaluate(test_sents)


0.5593352039715087

An accuracy of almost `56%` is OK for a model of only 200 words, but nowhere near as good as the trained `UnigramTagger`. Let's try adding it to our backoff chain.

In [50]:

default_tagger = DefaultTagger('NN')
likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=likely_tagger)
tagger.evaluate(test_sents)


0.8806388948845241

The final accuracy is exactly the same as without the `likely_tagger`. This is because the frequency calculations we did to create the model are almost exactly the same as what happens when we train a UnigramTagger class.


***
## ***<a id="Tagging-with-regular-expressions"></a>Tagging with regular expressions:***
***


You can use regular expression matching to tag words.

In [51]:

patterns = [
  (r'^\d+$', 'CD'),    # cardinal numbers, e.g. 29
  (r'.*ing$', 'VBG'),  # gerunds, e.g. wondering
  (r'.*ment$', 'NN'),  # e.g. wonderment
  (r'.*ful$', 'JJ')    # e.g. wonderful
]


Once you have constructed this list of patterns, you can pass it into RegexpTagger.


In [53]:

from nltk.tag import RegexpTagger
tagger = RegexpTagger(patterns)
tagger.evaluate(test_sents)


0.037470321605870924

So, with just a few patterns the accuracy is under 4%, but since `RegexpTagger` is a subclass of `SequentialBackoffTagger`, it can be a useful part of a backoff chain. For example, it could be positioned just before a `DefaultTagger` class, to tag words that the ngram tagger(s) missed.
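
The matching step can be sketched with the standard `re` module. This is an illustration under the assumption that patterns are tried in order and the first match wins. `regexp_choose_tag` is a hypothetical helper rather than part of NLTK, and the `default` argument stands in for a backoff tagger.

```python
import re


def regexp_choose_tag(word, patterns, default=None):
    """Return the tag of the first pattern that matches `word`."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return default  # plays the role of a backoff such as DefaultTagger('NN')


patterns = [
    (r'^\d+$', 'CD'),    # cardinal numbers
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

words = ['29', 'wondering', 'wonderment', 'wonderful', 'board']
regexp_tagged = [(w, regexp_choose_tag(w, patterns, default='NN')) for w in words]
print(regexp_tagged)

# Suffix rules over-match: 'thing' is a noun but ends in 'ing'.
print(regexp_choose_tag('thing', patterns))
```

The last call shows why such rules are better placed late in a backoff chain, after taggers that have seen the word in training.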



***
## ***<a id="Affix-tagging"></a>Affix tagging:***
***




***
## ***<a id="Training-a-Brill-tagger"></a>Training a Brill tagger:***
***




***
## ***<a id="Training-the-TnT-tagger"></a>Training the TnT tagger:***
***




***
## ***<a id="Using-WordNet-for-tagging"></a>Using WordNet for tagging:***
***




***
## ***<a id="Tagging-proper-names"></a>Tagging proper names:***
***




***
## ***<a id="Classifier-based-tagging"></a>Classifier-based tagging:***
***




***
## ***<a id="Training-a-tagger-with-NLTK-Trainer"></a>Training a tagger with NLTK-Trainer:***
***
