## PART OF SPEECH TAGGER USING HIDDEN MARKOV MODELS


#### iNTRODUCTION
    Part of speech tagging is the process of determining the syntactic category of a word from the words in its surrounding context. It is often used to help disambiguate natural language phrases because it can be done quickly with high accuracy.
    
In this notebook, we'll use the <a href ="https://pomegranate.readthedocs.io/en/latest/" target="_blank">Pomegranate library </a> to build a hidden Markov model for part of speech tagging using a "universal" tagset. Hidden Markov models have been able to achieve >96% tag accuracy with larger tagsets on realistic text corpora. Hidden Markov models have also been used for speech recognition and speech generation, machine translation, gene recognition for bioinformatics, and human gesture recognition for computer vision, and more.


Visit <a href ="https://mhardik003.notion.site/NLP-751ad844946e499c9c64445a1254f648" target="_blank"> My Notion Page </a> for better understading of HMM networks.

Code help taken from <a href="https://github.com/udacity/artificial-intelligence/tree/master/Projects/4_HMM%20Tagger"> here </a>

***


Brief of what we will be doing in this notebook
* Reading and preprocessing the data
    * Evaluating the `dataset` interface (in the helpers.py code) (will be discussed further later)
    
<br>

* Building a most frequent class tagger
    * using pair counts
    * most frequent class tagger (MFC Tagger)
    * Making predictions with a model for checking
    * Decoding sequences with a MFC tagger
    * Evaluating the accuracy of this model

<br>

* Build the HMM tagger
    * Unigram counts
    * Bigram Counts
    * Sequence starting Counts
    * Seqeucne ending counts
    * Basic HMM tagger
    * Decoding sequences with the HMM tagger

***


In [81]:
%load_ext autoreload
%aimport helpers, tests
%autoreload 1

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [82]:
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint

from IPython.core.display import HTML
from itertools import chain
from collections import Counter, defaultdict
from helpers import show_model, Dataset


from pomegranate import State, HiddenMarkovModel, DiscreteDistribution

***

# Processing the data

### The dataset interface 

* It isan iterable collection of sentences having partitions for training & testing
* Dataset only attributes
    * `training_Set` : reference to a subset object containing the samples for trainig
    *  `testing_set` : reference to a subset object containing the samples for testing
    
<br>

* Dataset & Subset Attributes
    * `sentences` : a dictionary with an entry {sentence_key : Sentence()} for each sentence in the corpus
    * `key` : an immutable ordered (not sorted) collection of the sentence_keys for the corpus
    * `vocab` : an immutable collection of the unique words in the corpus
    * `tagset` : an immutable collection of the unique tags in the corpus
    * `X` : returns an array of words grouped by sentences ((w11,w12,w13,..),(w21,w22,w23...)...)
    * `Y` : returns an array of tags grouped by sentences ((t11,t12,t13,..),(t21,t22,t23...)...)
    * `N` : returns the number of distinct samples (individual words/tags) in the dataset

<br>

* Methods
    * `stream()` : returns a flat iterable over all (word,tag) pairs across all the sentences in the corpus
    * `__iter__()` : returns an iterable over the data as (sentence_key, Sentence()) pairs
    * `__len__()` : returns the number of sentences in the corpus


***

For example, consider a Subset, subset, of the sentences <br>
` {"s0": Sentence(("See", "Spot", "run"), ("VERB", "NOUN", "VERB")), "s1": Sentence(("Spot", "ran"), ("NOUN", "VERB"))}.` <br>
The subset will have these attributes:

* `subset.keys` == {"s1", "s0"}  `# unordered`
* `subset.vocab` == {"See", "run", "ran", "Spot"} ` # unordered`
* `subset.tagset` == {"VERB", "NOUN"} ` # unordered`
* `subset.X` == (("Spot", "ran"), ("See", "Spot", "run"))  `# order matches .keys`
* `subset.Y` == (("NOUN", "VERB"), ("VERB", "NOUN", "VERB")) ` # order matches .keys`
* `subset.N` == 7 ` # there are a total of seven observations over all sentences`
* `len(subset)` == 2 ` # because there are two sentences`

***

In [83]:
data = Dataset("tags-universal.txt", "brown-universal.txt", train_test_split=0.8)

print("There are {} sentences in the corpus.".format(len(data)))
print("There are {} sentences in the training set.".format(len(data.training_set)))
print("There are {} sentences in the testing set.".format(len(data.testing_set)))

assert len(data) == len(data.training_set) + len(data.testing_set), \
       "The number of sentences in the training set + testing set should sum to the number of sentences in the corpus"


There are 57341 sentences in the corpus.
There are 45872 sentences in the training set.
There are 11469 sentences in the testing set.


***
## Sentences

Dataset.sentences is a dictionary of all sentences in the training corpus, each keyed to a unique sentence identifier. Each Sentence is itself an object with two attributes: a tuple of the words in the sentence named words and a tuple of the tag corresponding to each word named tags.

***

In [84]:
key = 'b100-38532'
print("Setence : {}". format(key))
print("Words in the sentence : \n\t{!s}" .format(data.sentences[key].words))
print("Tags in the sentence : \n\t{!s}" .format(data.sentences[key].tags))

Setence : b100-38532
Words in the sentence : 
	('Perhaps', 'it', 'was', 'right', ';', ';')
Tags in the sentence : 
	('ADV', 'PRON', 'VERB', 'ADJ', '.', '.')


***

## Unique Elements

We can access the list of unique words (the dataset vocabulary) via `Dataset.vocab` and the unique list of tags via `Dataset.tagset`.

***

In [85]:
print("There are a total of {} samples of {} unique words in the corpus."
      .format(data.N, len(data.vocab)))
print("There are {} samples of {} unique words in the training set."
      .format(data.training_set.N, len(data.training_set.vocab)))
print("There are {} samples of {} unique words in the testing set."
      .format(data.testing_set.N, len(data.testing_set.vocab)))
print("There are {} words in the test set that are missing in the training set."
      .format(len(data.testing_set.vocab - data.training_set.vocab)))   # since there are the words on which the POS tagger will work with only the transition probabilities

assert data.N == data.training_set.N + data.testing_set.N, \
       "The number of training + test samples should sum to the total number of samples"

There are a total of 1161241 samples of 56057 unique words in the corpus.
There are 928478 samples of 50422 unique words in the training set.
There are 232763 samples of 25277 unique words in the testing set.
There are 5635 words in the test set that are missing in the training set.



## Accessing word and tag Sequences

The `Dataset.X` and `Dataset.Y` attributes provide access to ordered collections of matching word and tag sequences for each sentence in the dataset.


In [86]:
# accessing words with Dataset.X and tags with Dataset.Y 
for i in range(2):    
    print("Sentence {}:".format(i + 1), data.X[i])
    print()
    print("Labels {}:".format(i + 1), data.Y[i])
    print()


Sentence 1: ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')

Labels 1: ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')

Sentence 2: ('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.')

Labels 2: ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')




## Accessing (word, tag) Samples

The `Dataset.stream()` method returns an iterator that chains together every pair of (word, tag) entries across all sentences in the entire corpus.


In [87]:
# use Dataset.stream() (word, tag) samples for the entire corpus
print("\nStream (word, tag) pairs:\n")
for i, pair in enumerate(data.stream()):
    print("\t", pair)
    if i > 5: break



Stream (word, tag) pairs:

	 ('Mr.', 'NOUN')
	 ('Podger', 'NOUN')
	 ('had', 'VERB')
	 ('thanked', 'VERB')
	 ('him', 'PRON')
	 ('gravely', 'ADV')
	 (',', '.')


# Building Most Frequent Class tagger (MFC tagger)

For both our baseline tagger and the HMM model we'll build, we need to estimate the frequency of tags & words from the frequency counts of observations in the training corpus. 
In the next several cells we will create functions to compute the counts of several sets of counts.

<br>

### How a MFC tagger works?
Perhaps the simplest tagger (and a good baseline for tagger performance) is to simply choose the tag most frequently assigned to each word. This "most frequent class" tagger inspects each observed word in the sequence and assigns it the label that was most often assigned to that word in the corpus.

In [88]:
def pair_counts(words, tags):
    """Returns a dictionary keyed to each unique value in the first sequence list
    that counts the number of occurrences of the corresponding value from the
    second sequences list.
    
    For example, if sequences_A is tags and sequences_B is the corresponding
    words, then if 1244 sequences contain the word "time" tagged as a NOUN, then
    it would return a dictionary such that pair_counts[NOUN][time] == 1244
    """

    emission_counts_dict ={}
    temp_dict={}
    for i in range(len(words)):
        # print(i,"\n\n")
        for j in range(len(words[i])):
            # print(words[i][j],"    =>    ",tags[i][j])
            if tags[i][j] not in emission_counts_dict.keys():   # if there is no entry for that key in the dictionary
                emission_counts_dict[tags[i][j]] = {words[i][j]:1}
            if words[i][j] not in emission_counts_dict[tags[i][j]]: # if there is no entry for that words in the dictionary of its tag
                emission_counts_dict[tags[i][j]].update({words[i][j]:1})

            else: # if the words along with its tag is present, then to increase its count by 1
                emission_counts_dict[tags[i][j]].update({words[i][j]:emission_counts_dict[tags[i][j]][words[i][j]]+1})


    return emission_counts_dict

emission_counts=pair_counts(data.X, data.Y)

# pprint(emission_counts)
assert len(emission_counts) == 12, \
       "Uh oh. There should be 12 tags in your dictionary."

assert max(emission_counts["NOUN"], key=emission_counts["NOUN"].get) == 'time', \
       "Hmmm...'time' is expected to be the most common NOUN."

HTML('<div class="alert alert-block alert-success">Your emission counts look good!</div>')

## Implementation of the MFC tagger

Using the pair_counts() function and the training dataset to find the most frequent class label for each word in the training data, and populate the mfc_table below. The table keys would be words, and the values would be the appropriate tag string.

The MFCTagger class is provided to mock the interface of Pomegranite HMM models so that they can be used interchangeably.

In [101]:
# Create a lookup table mfc_table where mfc_table[word] contains the tag label most frequently assigned to that word
from collections import namedtuple

FakeState = namedtuple("FakeState", "name")


class MFCTagger:
    # NOTE: You should not need to modify this class or any of its methods
    missing = FakeState(name="<MISSING>")

    def __init__(self, table):
        # if the key is not found, then the value attached to the missing key will be "<MISSING>"
        self.table = defaultdict(lambda: MFCTagger.missing)
        # dictionary where the key is the word and the attahced value to the key is the namedtuple FakedState whose name arguement is the tag of that word which is the most frequent
        self.table.update({word: FakeState(name=tag)
                          for word, tag in table.items()})

    def viterbi(self, seq):
        """This method simplifies predictions by matching the Pomegranate viterbi() interface"""
        return 0., list(enumerate(["<start>"] + [self.table[w] for w in seq] + ["<end>"]))


# calculating the frequency of each tag being assigned to each word (hint: similar, but not
# the same as the emission probabilities) and use it to fill the mfc_table

# this will return a dictionary of dictionaries where each insdie dictionary has the key the tag of that word and the number of the occurence of that tag with that word
word_counts = pair_counts(data.training_set.Y, data.training_set.X)

# print(word_counts)

mfc_table = {}

for i in word_counts:
    # print(word_counts[i])
    max_count = 0
    for j in word_counts[i]:
        # print(word_counts[i][j])
        if (word_counts[i][j] > max_count):
            # print(i," ==> ", j)
            mfc_table[i] = j
            max_count = word_counts[i][j]


# pprint(mfc_table)

# Create a Most Frequent Class tagger instance
mfc_model = MFCTagger(mfc_table)

# for i in mfc_model:
#     print(i)

assert len(mfc_table) == len(data.training_set.vocab), ""
assert all(k in data.training_set.vocab for k in mfc_table.keys()), ""
assert sum(int(k not in mfc_table) for k in data.testing_set.vocab) == 5635, ""
HTML('<div class="alert alert-block alert-success">Your MFC tagger has all the correct words!</div>')


## Overcoming an issue in pomegranate

The helper functions provided below interface with Pomegranate network models & the mocked MFCTagger to take advantage of the <a href ="https://pomegranate.readthedocs.io/en/latest/nan.html" target="_blank">missing value  </a>functionality in Pomegranate through a simple sequence decoding function. Run these functions, then run the next cell to see some of the predictions made by the MFC tagger.

In [90]:
def replace_unknown(sequence):
    """Return a copy of the input sequence where each unknown word is replaced
    by the literal string value 'nan'. Pomegranate will ignore these values
    during computation.
    """
    return [w if w in data.training_set.vocab else 'nan' for w in sequence]


def simplify_decoding(X, model):
    """X should be a 1-D sequence of observations for the model to predict"""
    _, state_path = model.viterbi(replace_unknown(X))
    # do not show the start/end state predictions
    return [state[1].name for state in state_path[1:-1]]


## Decoding sequences with MFC tagger

In [91]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, mfc_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-37269

Predicted labels:
-----------------
['DET', 'ADJ', 'NOUN', 'VERB', 'ADP', 'NOUN', '.', 'CONJ', 'PRT', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'ADP', 'DET', 'ADJ', '.']

Actual labels:
--------------
('DET', 'ADJ', 'NOUN', 'VERB', 'ADP', 'NOUN', '.', 'CONJ', 'PRT', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'ADP', 'DET', 'ADJ', '.')


Sentence Key: b100-28144

Predicted labels:
-----------------
['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Predicted labels:
-----------------
['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']

Actual labels:
--------------
('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', 

### Evaluating the accuracy of the MFC tagger model

The function below will evaluate the accuracy of the MFC tagger on the collection of all sentences from a text corpus.

In [92]:


def accuracy(X, Y, model):
    """Calculate the prediction accuracy by using the model to decode each sequence
    in the input X and comparing the prediction with the true labels in Y.
    
    The X should be an array whose first dimension is the number of sentences to test,
    and each element of the array should be an iterable of the words in the sequence.
    The arrays X and Y should have the exact same shape.
    
    X = [("See", "Spot", "run"), ("Run", "Spot", "run", "fast"), ...]
    Y = [(), (), ...]
    """
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        
        # The model.viterbi call in simplify_decoding will return None if the HMM
        # raises an error (for example, if a test sentence contains a word that
        # is out of vocabulary for the training set). Any exception counts the
        # full sentence as an error (which makes this a conservative estimate).
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except:
            pass
        total_predictions += len(observations)
    return correct / total_predictions




### Evaluate the accuracy of the MFC tagger

Run the next cell to evaluate the accuracy of the tagger on the training and test corpus.


In [93]:
mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)
print("training accuracy mfc_model: {:.2f}%".format(100 * mfc_training_acc))

mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)
print("testing accuracy mfc_model: {:.2f}%".format(100 * mfc_testing_acc))

assert mfc_training_acc >= 0.955, "Uh oh. Your MFC accuracy on the training set doesn't look right."
assert mfc_testing_acc >= 0.925, "Uh oh. Your MFC accuracy on the testing set doesn't look right."
HTML('<div class="alert alert-block alert-success">Your MFC tagger accuracy looks correct!</div>')


training accuracy mfc_model: 95.71%
testing accuracy mfc_model: 92.95%


***
## Building the HMM tagger

The HMM tagger has one hidden state for each possible tag, and parameterized by two distributions: 
* the emission probabilties giving the conditional probability of observing a given word from each hidden state (basically the MFC tagger we made above)
* the transition probabilities giving the conditional probability of moving between tags during the sequence.

We will also estimate the starting probability distribution (the probability of each tag being the first tag in a sequence), and the terminal probability distribution (the probability of each tag being the last tag in a sequence).

The maximum likelihood estimate of these distributions can be calculated from the frequency counts as described in the following sections where you'll implement functions to count the frequencies, and finally build the model. The HMM model will make predictions according to the formula:

<img src="hmm_formula.png">

<br>

***



*** 
## Unigram Counting
Estimating the co-occurence frequency of each symbol over all of the input sentences, where

<img src="unigram_formula.png">

In [94]:
def unigram_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequence list that
    counts the number of occurrences of the value in the sequences list. The sequences
    collection should be a 2-dimensional array.
    
    For example, if the tag NOUN appears 275558 times over all the input sequences,
    then you should return a dictionary such that your_unigram_counts[NOUN] == 275558.
    """
    tag_unigrams={}
    for i in sequences:
        # print(i)
        for j in i:
            if j not in tag_unigrams :
                tag_unigrams[j]=1
            else :
                tag_unigrams[j] = tag_unigrams[j]+1
    return tag_unigrams

    
    # raise NotImplementedError

# calling unigram_counts with a list of tag sequences from the training set
tag_unigrams = unigram_counts(data.training_set.Y)

print(tag_unigrams)

assert set(tag_unigrams.keys()) == data.training_set.tagset, \
       "Uh oh. It looks like your tag counts doesn't include all the tags!"
assert min(tag_unigrams, key=tag_unigrams.get) == 'X', \
       "Hmmm...'X' is expected to be the least common class"
assert max(tag_unigrams, key=tag_unigrams.get) == 'NOUN', \
       "Hmmm...'NOUN' is expected to be the most common class"
HTML('<div class="alert alert-block alert-success">Your tag unigrams look good!</div>')

{'DET': 109608, 'NOUN': 220754, 'VERB': 146138, 'ADP': 115857, 'ADJ': 66761, 'ADV': 44863, '.': 117718, 'PRT': 23866, 'PRON': 39369, 'CONJ': 30552, 'NUM': 11893, 'X': 1099}


## Bigram Counting
To estimate the co-occurence frequency of each pair of symbols in each of the input sentences. These counts are used in the HMM model to estimate the bigram probability of two tags from the frequency counts according to the formula :

<img src = "bigram_formula.png">

In [95]:
def bigram_counts(sequences):
       """Return a dictionary keyed to each unique PAIR of values in the input sequences
       list that counts the number of occurrences of pair in the sequences list. The input
       should be a 2-dimensional array.

       For example, if the pair of tags (NOUN, VERB) appear 61582 times, then you should
       return a dictiona
       try such that your_bigram_counts[(NOUN, VERB)] == 61582
       """
       tag_bigrams={}
       for sentence_tags in sequences:
              for i in range(len(sentence_tags)):
                     if(i>0):
                            if((sentence_tags[i-1], sentence_tags[i]) not in tag_bigrams):
                                   tag_bigrams[(sentence_tags[i-1], sentence_tags[i])]=1
                            else:
                                   tag_bigrams[(sentence_tags[i-1], sentence_tags[i])]+=1

       return tag_bigrams


# TODO: call bigram_counts with a list of tag sequences from the training set
tag_bigrams = bigram_counts(data.training_set.Y)

assert len(tag_bigrams) == 144, \
       "Uh oh. There should be 144 pairs of bigrams (12 tags x 12 tags)"
assert min(tag_bigrams, key=tag_bigrams.get) in [('X', 'NUM'), ('PRON', 'X')], \
       "Hmmm...The least common bigram should be one of ('X', 'NUM') or ('PRON', 'X')."
assert max(tag_bigrams, key=tag_bigrams.get) in [('DET', 'NOUN')], \
       "Hmmm...('DET', 'NOUN') is expected to be the most common bigram."
HTML('<div class="alert alert-block alert-success">Your tag bigrams look good!</div>')


***
## Sequence starting counts

To estimate the bigram probabilities of a sequence starting with each tag.

***

In [96]:
def starting_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the beginning of
    a sequence.
    
    For example, if 8093 sequences start with NOUN, then you should return a
    dictionary such that your_starting_counts[NOUN] == 8093
    """
    tag_starts={}
    for sentence_tags in sequences:
        if(sentence_tags[0] not in tag_starts):
            tag_starts[sentence_tags[0]]=1
        else:
            tag_starts[sentence_tags[0]]+=1
    
    return tag_starts

# TODO: Calculate the count of each tag starting a sequence
tag_starts = starting_counts(data.training_set.Y)

assert len(tag_starts) == 12, "Uh oh. There should be 12 tags in your dictionary."
assert min(tag_starts, key=tag_starts.get) == 'X', "Hmmm...'X' is expected to be the least common starting bigram."
assert max(tag_starts, key=tag_starts.get) == 'DET', "Hmmm...'DET' is expected to be the most common starting bigram."
HTML('<div class="alert alert-block alert-success">Your starting tag counts look good!</div>')

***
## Sequence ending counts
To estimate the bigram probabilities of a sequence ending with each tag.

***

In [97]:
def ending_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the end of
    a sequence.
    
    For example, if 18 sequences end with DET, then you should return a
    dictionary such that your_starting_counts[DET] == 18
    """
    tag_ends={}
    for sentence_tags in sequences:
        if(sentence_tags[-1] not in tag_ends):
            tag_ends[sentence_tags[-1]]=1
        else:
            tag_ends[sentence_tags[-1]]+=1
    
    return tag_ends

# TODO: Calculate the count of each tag ending a sequence
tag_ends = ending_counts(data.training_set.Y)

assert len(tag_ends) == 12, "Uh oh. There should be 12 tags in your dictionary."
assert min(tag_ends, key=tag_ends.get) in ['X', 'CONJ'], "Hmmm...'X' or 'CONJ' should be the least common ending bigram."
assert max(tag_ends, key=tag_ends.get) == '.', "Hmmm...'.' is expected to be the most common ending bigram."
HTML('<div class="alert alert-block alert-success">Your ending tag counts look good!</div>')


## Implementing a basic HMM tagger

Using the tag unigrams and bigrams calculated above to construct a HMM

* Add one state per tag
    * The emission distribution at each state should be estimated with the formula : `P(w|t) = C(t,w)/C(t)`

<br>

* Add an edge from the starting state basic_model.start to each tag
    * The transition probability should be estimated with the formula : `P(t|start) = C(start,t)/C(start)`

<br>

* Add an edge from each tag to the end state basic_model.end
    * The transition probability should be estimated with the formula : `P(end|t) = C(t,end)/C(t)`

<br>

* Add an edge between every pair of tags
    * The transition probability should be estimated with the formula : `P(t2|t1) = C(t1,t2)/C(t1)`

In [98]:
from operator import ilshift


basic_model = HiddenMarkovModel(name="base-hmm-tagger")

# Creating states with emission probability distributions P(word | tag) and add to the model
tag_counts = pair_counts(data.training_set.X, data.training_set.Y)
# pprint(tag_counts)
countingg = 0
# to convert them into their probabilities
for i in tag_counts:
    for j in tag_counts[i]:
        tag_counts[i][j] /= data.training_set.N

states_dict = {}
tags_states_dict = {}
for i in tag_counts:
    states_dict = {}
    # pprint(tag_counts[i])
    states_dict = tag_counts[i]

    # pprint(states_dict)
    emissions = DiscreteDistribution(states_dict)
    states = State(emissions, name=i)
    tags_states_dict[i] = states
    basic_model.add_state(states)


# Adding edges between states for the observed transition frequencies P(tag_i | tag_i-1)

for i in tag_starts:
    # print("Start, ", i, " ==> ", tag_starts[i]/len(data.training_set))
    basic_model.add_transition(
        basic_model.start, tags_states_dict[i], tag_starts[i]/len(data.training_set))


for i in tag_bigrams:
    # print(i[0]," , ",i[1]," ==> ", ((tag_bigrams[i]*2)/data.training_set.N))
    basic_model.add_transition(
        tags_states_dict[i[0]], tags_states_dict[i[1]], ((tag_bigrams[i]*2)/data.training_set.N))

for i in tag_ends:
    # print(i, " , End ==>", tag_ends[i]/len(data.training_set))
    basic_model.add_transition(
        tags_states_dict[i], basic_model.end, tag_ends[i]/len(data.training_set))
    # countingg+=tag_ends[i]/len(data.training_set)

# print("countingggg ==> ",countingg)


# finalizing the model
basic_model.bake()

assert all(tag in set(s.name for s in basic_model.states) for tag in data.training_set.tagset), \
    "Every state in your network should use the name of the associated tag, which must be one of the training set tags."
assert basic_model.edge_count() == 168, \
    ("Your network should have an edge from the start node to each state, one edge between every " +
        "pair of tags (states), and an edge from each state to the end node.")
HTML('<div class="alert alert-block alert-success">Your HMM network topology looks good!</div>')

hmm_training_acc = accuracy(
    data.training_set.X, data.training_set.Y, basic_model)
# print("training accuracy basic hmm model: {:.2f}%".format(
# 100 * hmm_training_acc))

hmm_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, basic_model)
# print("testing accuracy basic hmm model: {:.2f}%".format(
# 100 * hmm_testing_acc))


print(hmm_training_acc)
print(hmm_testing_acc)

assert hmm_training_acc > 0.965, "Uh oh. Your HMM accuracy on the training set doesn't look right."
assert hmm_testing_acc > 0.95, "Uh oh. Your HMM accuracy on the training set doesn't look right."
HTML('<div class="alert alert-block alert-success">Your HMM tagger accuracy looks correct! Congratulations, you\'ve finished the project.</div>')
#


0.9699098955494907
0.9537469443167513


In [99]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, basic_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-37269

Predicted labels:
-----------------
['DET', 'ADJ', 'NOUN', 'VERB', 'ADP', 'NOUN', '.', 'CONJ', 'PRT', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'ADP', 'DET', 'ADJ', '.']

Actual labels:
--------------
('DET', 'ADJ', 'NOUN', 'VERB', 'ADP', 'NOUN', '.', 'CONJ', 'PRT', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'ADP', 'DET', 'ADJ', '.')


Sentence Key: b100-28144

Predicted labels:
-----------------
['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Predicted labels:
-----------------
['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']

Actual labels:
--------------
('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', 

## Improving model performance

Other methods can be used to improve the performance on the larger tagsets where the data sparsity problem is more significant (more missing data tags that have zero occurrences in the data)

* Laplace Smoothing
    * Adding a small, non-zero value to all the observed counts to offset for unobserved values

<br>

* Backoff Smoothing
    * Interpolating between n-grams for missing data
    * Refer to chapter 4,9,10 of <a href="https://web.stanford.edu/~jurafsky/slp3/">Jurafsky Martin </a> to know more

<br>

* Using trigrams instead of bigrams
    * Consering three consecutive states instead of two