## What is Part of Speech (POS) tagging?

Back in elementary school, we learned the differences between the various parts of speech, such as nouns, verbs, adjectives, and adverbs. Associating each word in a sentence with its proper part of speech (POS) is known as POS tagging or POS annotation. POS tags are also known as word classes, morphological classes, or lexical tags.

### Stochastic (Probabilistic) tagging: 
A stochastic approach relies on frequency, probability, or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. But sometimes this approach produces tag sequences that are not acceptable according to the grammar rules of a language. An alternative is to calculate the probabilities of the various tag sequences that are possible for a sentence and assign the POS tags from the sequence with the highest probability. Hidden Markov Models (HMMs) are one such probabilistic approach to assigning POS tags.
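
The most-frequent-tag baseline described above can be sketched in a few lines. This is only an illustration: the toy corpus, the tag abbreviations (N = noun, M = modal, V = verb), and the helper names `train_most_frequent_tag` and `tag_sentence` are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs; N = noun, M = modal, V = verb.
TAGGED_SENTENCES = [
    [("Mary", "N"), ("Jane", "N"), ("can", "M"), ("see", "V"), ("Will", "N")],
    [("Spot", "N"), ("will", "M"), ("see", "V"), ("Mary", "N")],
    [("Will", "M"), ("Jane", "N"), ("spot", "V"), ("Mary", "N")],
    [("Mary", "N"), ("will", "M"), ("pat", "V"), ("Spot", "N")],
]

def train_most_frequent_tag(tagged_sentences):
    """Map each lower-cased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(words, lexicon, default_tag="N"):
    """Tag every word independently; unseen words get the default tag."""
    return [lexicon.get(word.lower(), default_tag) for word in words]

lexicon = train_most_frequent_tag(TAGGED_SENTENCES)
baseline_tags = tag_sentence(["Will", "can", "spot", "Mary"], lexicon)
print(baseline_tags)  # ['M', 'M', 'N', 'N']
```

Because every word is tagged in isolation, the baseline here produces a modal followed by another modal, which is the kind of ungrammatical tag sequence that motivates scoring whole sequences instead.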

### POS tagging with Hidden Markov Model
The Hidden Markov Model (HMM) is a stochastic technique for POS tagging. Hidden Markov models are known for their applications to reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following, partial discharges, and bioinformatics.

Let us consider an example proposed by Dr. Luis Serrano and find out how an HMM selects an appropriate tag sequence for a sentence.

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16134154/pos2.png" style="width:600px;height:200px;">

In this example, we consider only three POS tags: **noun, modal and verb**. Let the sentence “Ted will spot Will” be tagged as noun, modal, verb and noun. To **calculate the probability associated with this particular sequence of tags**, we require their **transition probabilities** and **emission probabilities**.

### Hidden Markov Model (HMM):
An HMM has no input, and the probability distribution for the output must be given. An HMM is a five-tuple $\lambda = \{S,V,A,B,\Pi\} $

with:

* **State Vector:** $S= \{s_1,\ldots,s_N\}$
* **Output Vector:** $V= \{v_1,\ldots,v_M\}$
* **Matrix of transition probabilities:** $ A = (a_{ij}) $,  where  $ a_{ij} $ is the probability that $s_j $ comes after $s_i $
* **Matrix of emission probabilities:** $ B $, where $b_i(k)$ is the probability of observing $v_k$ in the state $ s_i $
* **Initial state distribution:** $ \Pi $, where $ \pi_i $ is the probability that $ s_i $ is the initial state
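
As a rough illustration of the five-tuple, the sketch below stores the components in a small container and checks that $A$, $B$ and $\Pi$ are valid probability tables. The names `HMMParams` and `check_hmm` and the two-state numbers are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple

import numpy as np

class HMMParams(NamedTuple):
    """Hypothetical container for the five-tuple (S, V, A, B, Pi)."""
    states: List[str]    # S
    symbols: List[str]   # V
    A: np.ndarray        # A[i, j] = P(next state s_j | current state s_i)
    B: np.ndarray        # B[i, k] = P(observing v_k | state s_i)
    pi: np.ndarray       # pi[i]   = P(initial state is s_i)

def check_hmm(params: HMMParams) -> None:
    """Raise ValueError if shapes or normalisation are inconsistent."""
    n, m = len(params.states), len(params.symbols)
    if params.A.shape != (n, n) or params.B.shape != (n, m) or params.pi.shape != (n,):
        raise ValueError("matrix shapes do not match the number of states/symbols")
    for name, rows in (("A", params.A), ("B", params.B), ("pi", params.pi[None, :])):
        if not np.allclose(rows.sum(axis=1), 1.0):
            raise ValueError(f"every row of {name} must sum to 1")

# Made-up two-state example, used only to exercise the check.
toy = HMMParams(
    states=["N", "V"],
    symbols=["dog", "runs"],
    A=np.array([[0.3, 0.7], [0.6, 0.4]]),
    B=np.array([[0.9, 0.1], [0.2, 0.8]]),
    pi=np.array([0.8, 0.2]),
)
check_hmm(toy)  # passes silently
```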


<img src="https://imgur.com/MiIxToo.png" width="520" height="400" />


## Dataset
Let us calculate the two probabilities mentioned above (transition and emission) for the set of sentences below:

* Mary Jane can see Will
* Spot will see Mary
* Will Jane spot Mary?
* Mary will pat Spot

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/17112900/pos3-1.png" style="width:600px;height:200px;">

### Emission probabilities

Now, what is the probability that the word Ted is a noun, will is a modal, spot is a verb, and Will is a noun? These probabilities are called **emission probabilities**, and they should be high for our tagging to be likely.


In the above sentences, the word Mary appears four times as a noun. To calculate the emission probabilities, let us create a counting table in a similar manner.

| Words | Noun | Modal  | Verb |
|-------|------|--------|------|
| Mary  | 4    | 0      | 0    |
| Jane  | 2    | 0      | 0    |
| Will  | 1    | 3      | 0    |
| Spot  | 2    | 0      | 1    |
| Can   | 0    | 1      | 0    |
| See   | 0    | 0      | 2    |
| pat   | 0    | 0      | 1    |

Now let us divide each column by the total number of appearances of its tag. For example, ‘noun’ appears nine times in the above sentences, so we divide each term in the noun column by 9. We get the following table after this operation.

| Words | Noun | Modal  | Verb |
|-------|------|--------|------|
| Mary  | 4/9  | 0      | 0    |
| Jane  | 2/9  | 0      | 0    |
| Will  | 1/9  | 3/4    | 0    |
| Spot  | 2/9  | 0      | 1/4  |
| Can   | 0    | 1/4    | 0    |
| See   | 0    | 0      | 2/4  |
| pat   | 0    | 0      | 1/4  |


From the above table, we infer that
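* The emission probability of Will given the tag Noun, $P(\text{Will} \mid \text{Noun})$, is 1/9
* The emission probability of Will given the tag Modal, $P(\text{Will} \mid \text{Modal})$, is 3/4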
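
The counting and normalising steps can be reproduced with a short sketch. The tagged corpus is typed in by hand from the four sentences (N = noun, M = modal, V = verb), words are lower-cased so that "Will" and "will" share a row, and `emission_probabilities` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

# The four training sentences with their tags, entered by hand.
TAGGED_SENTENCES = [
    [("Mary", "N"), ("Jane", "N"), ("can", "M"), ("see", "V"), ("Will", "N")],
    [("Spot", "N"), ("will", "M"), ("see", "V"), ("Mary", "N")],
    [("Will", "M"), ("Jane", "N"), ("spot", "V"), ("Mary", "N")],
    [("Mary", "N"), ("will", "M"), ("pat", "V"), ("Spot", "N")],
]

def emission_probabilities(tagged_sentences):
    """Return {(word, tag): count(word, tag) / count(tag)} as exact fractions."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(word.lower(), tag)] += 1
            tag_counts[tag] += 1
    return {
        (word, tag): Fraction(count, tag_counts[tag])
        for (word, tag), count in pair_counts.items()
    }

emission = emission_probabilities(TAGGED_SENTENCES)
print(emission[("will", "N")], emission[("will", "M")])  # 1/9 3/4
```

Using `Fraction` keeps the values in the same form as the table, which makes it easy to compare them by eye.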

### Transition Probability

The **transition probability** is the likelihood of a particular tag sequence, for example, how likely it is that a noun is followed by a modal, a modal by a verb, and a verb by a noun. It should be high for a particular sequence to be correct.

Next, we have to calculate the transition probabilities, so define two more tags $<S>$ and $<E>$. 
* $<S>$ is placed at the beginning of each sentence and 
* $<E>$ at the end as shown in the figure below.


<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16134911/pos4.png" style="width:600px;height:200px;">

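Counting how often each tag (row) is followed by another tag (column) in the tagged sentences gives the following table.

|     | N   | M   | V   | \<E> |
|-----|-----|-----|-----|-----|
|  \<S> | 3   | 1   | 0   | 0   |
| N   | 1   | 3   | 1   | 4   |
| M   | 1   | 0   | 3   | 0   |
| V   | 4   | 0   | 0   | 0   |

Next, we divide each term in a row of the table by the total number of co-occurrences of the tag in consideration. For example, the Modal tag is followed by another tag four times, so we divide each element in the third row by four.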

|     | N   | M   | V   | \<E> |
|-----|-----|-----|-----|-----|
|  \<S> | 3/4 | 1/4 | 0   | 0   |
| N   | 1/9 | 3/9 | 1/9 | 4/9 |
| M   | 1/4 | 0   | 3/4 | 0   |
| V   | 4/4 | 0   | 0   | 0   |
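
The transition table can be derived the same way. In this sketch the tag sequences of the four sentences are typed in by hand, and `transition_probabilities` is a hypothetical helper that pads each sequence with `<S>` and `<E>` before counting adjacent pairs.

```python
from collections import Counter
from fractions import Fraction

# Tag sequences of the four training sentences (N = noun, M = modal, V = verb).
TAG_SEQUENCES = [
    ["N", "N", "M", "V", "N"],  # Mary Jane can see Will
    ["N", "M", "V", "N"],       # Spot will see Mary
    ["M", "N", "V", "N"],       # Will Jane spot Mary
    ["N", "M", "V", "N"],       # Mary will pat Spot
]

def transition_probabilities(tag_sequences):
    """Return {(previous_tag, next_tag): probability} as exact fractions."""
    pair_counts = Counter()
    from_counts = Counter()
    for tags in tag_sequences:
        padded = ["<S>", *tags, "<E>"]
        for previous_tag, next_tag in zip(padded, padded[1:]):
            pair_counts[(previous_tag, next_tag)] += 1
            from_counts[previous_tag] += 1
    return {
        pair: Fraction(count, from_counts[pair[0]])
        for pair, count in pair_counts.items()
    }

transition = transition_probabilities(TAG_SEQUENCES)
print(transition[("<S>", "N")], transition[("M", "V")], transition[("N", "<E>")])  # 3/4 3/4 4/9
```

Note that `Fraction` reduces automatically, so the table entry 3/9 appears as 1/3 and 4/4 as 1.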

Hidden Markov models are characterized by three fundamental problems:

* **Problem 1 (Likelihood):** Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood $P(O|λ )$.
* **Problem 2 (Decoding):** Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence X.
* **Problem 3 (Learning):** Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Now how does the HMM determine the appropriate sequence of tags for a particular sentence from the above tables?

Take a new sentence, ‘Will can spot Mary’, and first tag it with wrong tags. One possible tagging would be:


<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135107/pos6-1.png" style="width:600px;height:200px;">

To go from the start state $\pi$ to the first observation, you multiply the corresponding transition probability $(1/4)$ by the corresponding emission probability $(3/4)$. You keep doing that for all the words until you get the probability of the entire sequence.

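Now calculate the joint probability of this hidden state sequence $S$ and the observed words $O$ in the following manner:
$$P(O, S \mid \lambda) = \frac{1}{4} \cdot \frac{3}{4} \cdot \frac{3}{4} \cdot 0 \cdot 1 \cdot \frac{2}{9} \cdot \frac{1}{9} \cdot \frac{4}{9} \cdot \frac{4}{9} = 0$$

The product is zero because the word "can" never appears as a verb in the training sentences.

Another possible tagging is shown below:

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135135/pos7.png" style="width:600px;height:200px;">


Calculating the product of these terms, we get
$$P(O, S \mid \lambda) = \frac{3}{4} \cdot \frac{1}{9} \cdot \frac{3}{9} \cdot \frac{1}{4} \cdot \frac{3}{4} \cdot \frac{1}{4} \cdot 1 \cdot \frac{4}{9} \cdot \frac{4}{9} = 0.00025720164$$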
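
The multiplication can be checked mechanically. The sketch below copies the transition and emission tables into dictionaries and multiplies along a given tag sequence, including the final step into `<E>`; `path_probability` and the dictionary names are hypothetical.

```python
from fractions import Fraction as F

# Tables copied from the text (N = noun, M = modal, V = verb).
START = {"N": F(3, 4), "M": F(1, 4), "V": F(0)}
TRANS = {
    "N": {"N": F(1, 9), "M": F(3, 9), "V": F(1, 9), "<E>": F(4, 9)},
    "M": {"N": F(1, 4), "M": F(0), "V": F(3, 4), "<E>": F(0)},
    "V": {"N": F(1), "M": F(0), "V": F(0), "<E>": F(0)},
}
EMIT = {
    "N": {"mary": F(4, 9), "jane": F(2, 9), "will": F(1, 9), "spot": F(2, 9)},
    "M": {"will": F(3, 4), "can": F(1, 4)},
    "V": {"spot": F(1, 4), "see": F(2, 4), "pat": F(1, 4)},
}

def path_probability(words, tags):
    """Multiply transition and emission probabilities along one tag sequence."""
    prob = START[tags[0]]
    for k, (word, tag) in enumerate(zip(words, tags)):
        if k > 0:
            prob *= TRANS[tags[k - 1]][tag]          # tag-to-tag transition
        prob *= EMIT[tag].get(word.lower(), F(0))    # word given tag
    return prob * TRANS[tags[-1]]["<E>"]             # step into <E>

words = ["Will", "can", "spot", "Mary"]
print(path_probability(words, ["M", "V", "N", "N"]))  # 0
print(path_probability(words, ["N", "M", "V", "N"]))  # 1/3888
```

The exact value 1/3888 is approximately 0.000257, matching the decimal result above.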

So far we have considered only the three POS tags mentioned above, yet even so $3^4 = 81$ different tag combinations can be formed for this four-word sentence. Now let us visualize these 81 combinations as paths and, using the transition and emission probabilities, mark each vertex and edge as shown below.

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135201/pos8.png" style="width:600px;height:200px;">

The next step is to delete all the vertices and edges with probability zero; the vertices that do not lead to the endpoint are also removed.

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135225/pos9.png" style="width:600px;height:200px;">


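After pruning, only two paths remain, with the following probabilities:

$$<S>→N→M→N→N→<E> = \frac{3}{4} \cdot \frac{1}{9} \cdot \frac{3}{9} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{2}{9} \cdot \frac{1}{9} \cdot \frac{4}{9} \cdot \frac{4}{9} = 0.00000846754$$

$$<S>→N→M→V→N→<E> = \frac{3}{4} \cdot \frac{1}{9} \cdot \frac{3}{9} \cdot \frac{1}{4} \cdot \frac{3}{4} \cdot \frac{1}{4} \cdot 1 \cdot \frac{4}{9} \cdot \frac{4}{9} = 0.00025720164$$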

Clearly, the probability of the second sequence is much higher and hence the HMM is going to tag each word in the sentence according to this sequence.
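
For a sentence this short, the selection can be verified by brute force: score all $3^4 = 81$ tag sequences and keep the best one. This is a sketch for checking the toy example only (the array names are assumptions); the number of paths grows exponentially with sentence length, which is what the Viterbi algorithm avoids.

```python
from itertools import product

import numpy as np

TAGS = ["N", "M", "V"]
start = np.array([3/4, 1/4, 0.0])      # <S> -> tag
A = np.array([[1/9, 3/9, 1/9],         # rows: from N, M, V
              [1/4, 0.0, 3/4],         # columns: to N, M, V
              [1.0, 0.0, 0.0]])
end = np.array([4/9, 0.0, 0.0])        # tag -> <E>
# Emission probability of each word of "Will can spot Mary" under N, M, V.
emit = np.array([[1/9, 3/4, 0.0],      # will
                 [0.0, 1/4, 0.0],      # can
                 [2/9, 0.0, 1/4],      # spot
                 [4/9, 0.0, 0.0]])     # mary

def score(path):
    """Probability of one tag path (a tuple of tag indices) for the sentence."""
    p = start[path[0]] * emit[0, path[0]]
    for k in range(1, len(path)):
        p *= A[path[k - 1], path[k]] * emit[k, path[k]]
    return p * end[path[-1]]

scores = {path: score(path) for path in product(range(len(TAGS)), repeat=4)}
nonzero = {path: s for path, s in scores.items() if s > 0}
best = max(scores, key=scores.get)

print(len(scores), "paths,", len(nonzero), "with non-zero probability")
print("best:", [TAGS[i] for i in best])
```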

## Optimizing HMM with Viterbi Algorithm 

"The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM)."

Let us use the same example as before and apply the Viterbi algorithm to it.

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135201/pos8.png" style="width:600px;height:200px;">


## Viterbi Initialization

You will now populate a matrix C of dimension (num_tags, num_words). This matrix will have the probabilities that will tell you what part of speech each word belongs to. 

Now, to populate the first column, you multiply the initial distribution $\pi_i$ for each tag by $b_{i, \operatorname{cindex}\left(w_{1}\right)}$, where $i$ corresponds to the tag of the initial distribution and $\operatorname{cindex}(w_1)$ is the index of word 1 in the emission matrix.

|   | w_1     | w_2 | ... | w_k |
|---|---------|-----|-----|-----|
| N | c_(N,1) |     |     |     |
| M | c_(M,1) |     |     |     |
| V | c_(V,1) |     |     |     |


<img src="https://imgur.com/i05hwKj.png" width="220" height="300" />
<img src="https://imgur.com/qrEJYxO.png" width="720" height="400" />

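$$c_{(N,1)} = \pi_N * b_{N,\operatorname{cindex}\left(w_{1}\right)} = 3/4 * 1/9 = 0.08333$$
$$c_{(M,1)} = \pi_M * b_{M,\operatorname{cindex}\left(w_{1}\right)} = 1/4 * 3/4 = 0.1875$$
$$c_{(V,1)} = \pi_V * b_{V,\operatorname{cindex}\left(w_{1}\right)} = 0  *  0   = 0 $$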

And that's it, you are done with populating the first column of your new $C$ matrix.

You will now need to keep track of which part of speech you are coming from. Hence we introduce a matrix $D$, which stores the labels of the states you pass through when finding the most likely sequence of POS tags for the given sequence of words $ w_1,... ,w_{K_w} $.

At first, you set the first column to $0$, because you are not coming from any POS tag.

|   | w_1     | w_2 | ... | w_k |
|---|---------|-----|-----|-----|
| N | d_(N,1) = 0 |     |     |     |
| M | d_(M,1) = 0 |     |     |     |
| V | d_(V,1) = 0 |     |     |     |
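
The initialization step might look as follows in NumPy. This is a sketch under a few assumptions: the tags are ordered N, M, V, the vocabulary order follows the emission table (mary, jane, will, spot, can, see, pat), and `viterbi_initialize` is a hypothetical helper name.

```python
import numpy as np

pi = np.array([3/4, 1/4, 0.0])                   # initial distribution over N, M, V
B = np.array([[4/9, 2/9, 1/9, 2/9, 0, 0, 0],     # N
              [0, 0, 3/4, 0, 1/4, 0, 0],         # M
              [0, 0, 0, 1/4, 0, 2/4, 1/4]])      # V
vocab = {"mary": 0, "jane": 1, "will": 2, "spot": 3, "can": 4, "see": 5, "pat": 6}

def viterbi_initialize(pi, B, word_ids):
    """Create C and D and fill the first column of C; D starts as all zeros."""
    num_tags, num_words = len(pi), len(word_ids)
    C = np.zeros((num_tags, num_words))
    D = np.zeros((num_tags, num_words), dtype=int)
    C[:, 0] = pi * B[:, word_ids[0]]   # c_(i,1) = pi_i * b_(i, cindex(w_1))
    return C, D

word_ids = [vocab[w] for w in "will can spot mary".split()]
C, D = viterbi_initialize(pi, B, word_ids)
print(C[:, 0])
```

For the sentence "Will can spot Mary", the first column holds 3/4 · 1/9 for N, 1/4 · 3/4 for M, and 0 for V.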

## Viterbi: Forward Pass


To populate a cell (e.g. row 1, column 2) in the image above, take each cell $k$ in the previous column, multiply it by the transition probability from the $k$-th POS to the first POS and by the emission probability of the first POS for the current word, and keep the maximum of these products. You do that for all the cells.

<img src="https://imgur.com/MYqpamU.png" width="280" height="300" />
<img src="https://imgur.com/WllgNdh.png" width="720" height="400" />

$$\begin{aligned}
c_{(N,2)} &= \max [\, c_{(N,1)} *  a_{(N,N)} * b_{N,\operatorname{cindex}\left(w_{2}\right)},\; c_{(M,1)} *  a_{(M,N)} * b_{N,\operatorname{cindex}\left(w_{2}\right)},\; c_{(V,1)} *  a_{(V,N)} * b_{N,\operatorname{cindex}\left(w_{2}\right)} \,] \\
&= \max [\, 0.08333 * 1/9 * 0,\; 0.1875 * 1/4 * 0,\; 0 * 1 * 0 \,] = 0
\end{aligned}$$


|   | w_1     | w_2 | ... | w_k |
|---|---------|-----|-----|-----|
| N | c_(N,1) | c_(N,2)   |     |     |
| M | c_(M,1) | c_(M,2)   |     |     |
| V | c_(V,1) | c_(V,2)   |     |     |


Now to populate the D matrix, you will keep track of the argmax of where you came from as follows: 
<img src="https://imgur.com/EGzZS3K.png" width="340" height="300" />

|   | w_1     | w_2 | ... | w_k |
|---|---------|-----|-----|-----|
| N | d_(N,1) = 0 | d_(N,2) = {} |     |     |
| M | d_(M,1) = 0 | d_(M,2) = N  |     |     |
| V | d_(V,1) = 0 | d_(V,2) = {}   |     |     |
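
A minimal sketch of the forward pass for the running example is given below. It assumes the tags are ordered N, M, V and the vocabulary order follows the emission table, and it stores 0-based row indices in $D$ (so 0 means "came from N", which differs from the convention of the tables above, where empty cells are written as {}). The name `viterbi_forward` is hypothetical.

```python
import numpy as np

pi = np.array([3/4, 1/4, 0.0])                   # initial distribution over N, M, V
A = np.array([[1/9, 3/9, 1/9],                   # A[j, i] = P(tag i follows tag j)
              [1/4, 0.0, 3/4],
              [1.0, 0.0, 0.0]])
B = np.array([[4/9, 2/9, 1/9, 2/9, 0, 0, 0],     # N
              [0, 0, 3/4, 0, 1/4, 0, 0],         # M
              [0, 0, 0, 1/4, 0, 2/4, 1/4]])      # V
word_ids = [2, 4, 3, 0]                          # will can spot mary

def viterbi_forward(pi, A, B, word_ids):
    """Fill C (best path probabilities) and D (best previous tag) column by column."""
    num_tags, num_words = len(pi), len(word_ids)
    C = np.zeros((num_tags, num_words))
    D = np.zeros((num_tags, num_words), dtype=int)
    C[:, 0] = pi * B[:, word_ids[0]]
    for k in range(1, num_words):
        for i in range(num_tags):
            # One candidate per previous tag j: c_(j,k-1) * a_(j,i) * b_(i, w_k)
            candidates = C[:, k - 1] * A[:, i] * B[i, word_ids[k]]
            D[i, k] = np.argmax(candidates)
            C[i, k] = candidates[D[i, k]]
    return C, D

C, D = viterbi_forward(pi, A, B, word_ids)
print(C)
print(D)
```

Cells whose probability is zero still receive a backpointer (the first index returned by `np.argmax`), but they are never reached when the best path is traced back.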



There are two paths leading to this vertex as shown below along with the probabilities of the two mini-paths
<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135641/pos11-3.png" style="width:600px;height:200px;">

<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135742/pos1-5.png" style="width:600px;height:200px;">

## Viterbi: Backward Pass

Great, now that you know how to compute A, B, C, and D, we will put it all together and show you how to construct the path that will give you the part of speech tags for your sentence. 

Let us first consider a general example:

<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/GbakvlBQRk-2pL5QUNZPQg_dfd81697a97845e6809fb91cb80e2038_Screen-Shot-2021-03-10-at-3.34.53-PM.png?expiry=1643760000000&hmac=aulIWYQkywdtz0Tf-Wh3X7IPaN8gvsNHcsRvPfT_uSc" style="width:600px;height:200px;">

The equation above just gives you the index of the highest row in the last column of C. Once you have that, you can go ahead and start using your D matrix as follows: 
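
In symbols, the step just described can be written as follows, where $K$ denotes the number of words and $s$ the selected row. Both symbol names are assumptions made here, since the original figure may use different ones.

$$s = \arg\max_{i} \; c_{i,K}$$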

<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/eKIMiBJVQZ6iDIgSVVGeRQ_eca8f4f04a464168ace61d69f63bc29c_Screen-Shot-2021-03-10-at-3.36.07-PM.png?expiry=1643760000000&hmac=RQ6R2O677zV2NFAt5aqFPQmon7vQ_nCjQ8rjM1OBt4s" style="width:600px;height:200px;">


Note that since we started at index one, the last word is $w_5$, and the row found above gives its tag. In this example we then go to the first row of $D$, and whatever number is stored there indicates the row of the next part of speech tag to visit, that is, the tag of the preceding word. That tag in turn indicates the row of the one before it, and so forth. This allows you to reconstruct the POS tags for your sentence.
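
The backward pass can be sketched as below. The matrices $C$ and $D$ are written out by hand for the sentence "Will can spot Mary", using the tables from this notebook, tags ordered N, M, V, and 0-based row indices in $D$; `viterbi_backward` is a hypothetical helper name.

```python
import numpy as np

TAGS = ["N", "M", "V"]
# Forward-pass results for "Will can spot Mary" (rows N, M, V; one column per word).
C = np.array([[1/12, 0.0,   1/2592, 1/1728],
              [3/16, 1/144, 0.0,    0.0],
              [0.0,  0.0,   1/768,  0.0]])
# D[i, k] = row of the best previous tag for tag i at word k (0-based).
D = np.array([[0, 0, 1, 2],
              [0, 0, 0, 0],
              [0, 0, 1, 0]])

def viterbi_backward(C, D, tags):
    """Start from the best cell in the last column and follow the backpointers."""
    num_words = C.shape[1]
    row = int(np.argmax(C[:, -1]))          # best tag for the last word
    path = [row]
    for k in range(num_words - 1, 0, -1):   # walk from the last word to the second
        row = int(D[row, k])
        path.append(row)
    path.reverse()
    return [tags[i] for i in path]

best_tags = viterbi_backward(C, D, TAGS)
print(best_tags)  # ['N', 'M', 'V', 'N']
```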

### Back to our example
The Forward Pass:
<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135838/pos13.png" style="width:600px;height:200px;">

The Backward Pass:
<img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16135907/pos14.png" style="width:600px;height:200px;">

In [None]:
## Install the library
!pip install hmmlearn

### Import Libraries:
First, we will import all the packages that are required for this exercise. 
- [numpy](https://www.numpy.org) is the main package for scientific computing with Python.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- `np.random.seed(1)` is used to keep all the random function calls consistent.
- `hmmlearn` implements the Hidden Markov Models (HMMs). 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn import hmm

%matplotlib inline
np.random.seed(1)

In [None]:
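## Hidden Markov Model with categorical (discrete) emissions.
## Newer hmmlearn releases provide this model as CategoricalHMM;
## older releases expose the same behaviour as MultinomialHMM.
_DiscreteHMM = getattr(hmm, "CategoricalHMM", hmm.MultinomialHMM)

class HMM(_DiscreteHMM):
    def __init__(self,A,B,pi,**kwargs): # extra keyword arguments go to the hmmlearn base class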
        n_components        = A.shape[0]
        super().__init__(n_components,**kwargs)
        self.transmat_     = A
        self.emissionprob_ = B
        self.startprob_    = pi
        
    def likelihood(self,obs_seq):
        if len(obs_seq.shape)==1:
            obs_seq = obs_seq.reshape(-1, 1)
        # logprob -> probability
        return np.exp(self.score(obs_seq))
         
    def decoding(self,obs_seq):
        if len(obs_seq.shape)==1:
            obs_seq = obs_seq.reshape(-1, 1)
        # logprob -> probability
        logprob, seq = self.decode(obs_seq)
        return np.exp(logprob), seq
    
    def learning(self,obs_seq):
        if len(obs_seq.shape)==1:
            obs_seq = obs_seq.reshape(-1, 1)
            
        self.fit(obs_seq)
    
    def show_model(self):
        np.set_printoptions(precision=4, suppress=True)
        print('A: Transition probability matrix')
        print(self.transmat_)
        print('------------------------------')
        print('B: Emission probability matrix')
        print(self.emissionprob_)
        print('-------------------------------')
        print('pi: Initial state distribution')
        print(self.startprob_)

For reference, these are the emission and transition tables computed earlier:

| Words | Noun | Modal  | Verb |
|-------|------|--------|------|
| Mary  | 4/9  | 0      | 0    |
| Jane  | 2/9  | 0      | 0    |
| Will  | 1/9  | 3/4    | 0    |
| Spot  | 2/9  | 0      | 1/4  |
| Can   | 0    | 1/4    | 0    |
| See   | 0    | 0      | 2/4  |
| pat   | 0    | 0      | 1/4  |

|     | N   | M   | V   | \<E> |
|-----|-----|-----|-----|-----|
|  \<S> | 3/4 | 1/4 | 0   | 0   |
| N   | 1/9 | 3/9 | 1/9 | 4/9 |
| M   | 1/4 | 0   | 3/4 | 0   |
| V   | 4/4 | 0   | 0   | 0   |

In [None]:
states = ['Noun', 'Modal', 'Verb','End']
 
observations = ['Mary','Jane','Will','Spot','Can','See','pat','.']

In [None]:
# Define the HMM with discrete emissions
pi= np.array([3/4, 1/4 , 0  , 0])  # initial probability  
A = np.array([[1/9 ,3/9,1/9 , 4/9],
              [1/4 , 0 ,3/4,  0],
              [4/4 , 0 , 0 ,  0],
              [ 0 , 0 , 0 ,  4/4]]) # transition probability (the End state only leads to itself)

B = np.array([[4/9, 2/9, 1/9 , 2/9 , 0, 0, 0, 0],
              [0, 0, 3/4 ,0 , 1/4 ,0 , 0, 0 ],
              [0, 0, 0 ,1/4,  0,  2/4 , 1/4, 0],
              [0, 0, 0 ,0 , 0 ,0 , 0 ,1 ]
             ]) # Emission probability

model = HMM(A,B,pi)   # n_components (the number of states) is taken from A
model.show_model()

**Problem 1 (Likelihood):** Given an HMM $λ = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O|λ )$

**Note:** The likelihood (converted from hmmlearn's log-likelihood) is returned by calling `.likelihood`.

How likely is a given sequence?
* $ O= \{\text{"Mary"}\}$
* $ O= \{\text{"Mary Jane"}\}$
* $ O= \{\text{"Mary can"}\}$
* $ O= ...$ Test your own example
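
As a cross-check that does not depend on `hmmlearn`, the likelihood can also be computed with the forward algorithm in plain NumPy. This is a sketch that reuses the `pi`, `A` and `B` values defined above (four states including End); `forward_likelihood` is a hypothetical helper name.

```python
import numpy as np

pi = np.array([3/4, 1/4, 0, 0])
A = np.array([[1/9, 3/9, 1/9, 4/9],
              [1/4, 0, 3/4, 0],
              [4/4, 0, 0, 0],
              [0, 0, 0, 4/4]])
B = np.array([[4/9, 2/9, 1/9, 2/9, 0, 0, 0, 0],
              [0, 0, 3/4, 0, 1/4, 0, 0, 0],
              [0, 0, 0, 1/4, 0, 2/4, 1/4, 0],
              [0, 0, 0, 0, 0, 0, 0, 1]])

def forward_likelihood(obs_seq, pi, A, B):
    """P(O | lambda): sum over all hidden state paths via the forward recursion."""
    alpha = pi * B[:, obs_seq[0]]        # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs_seq[1:]:
        alpha = (alpha @ A) * B[:, o]    # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()

for name, obs in [("Mary", [0]), ("Mary Jane", [0, 1]), ("Mary can", [0, 4])]:
    print("Prob({} | pi, A, B) = {:0.4f}".format(name, forward_likelihood(obs, pi, A, B)))
```

For "Mary" the recursion reduces to 3/4 · 4/9 = 1/3, since only the Noun state can emit that word.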

In [None]:
obs_seq = np.array([0])
print("Prob(Mary | pi, A, B) = {:0.4f}".format(model.likelihood(obs_seq)))

In [None]:
0.75 * 0.4444

In [None]:
obs_seq = np.array([0,1])
print("Prob(Mary Jane | pi, A, B) = {:0.4f}".format(model.likelihood(obs_seq)))

In [None]:
obs_seq = np.array([0,4])
print("Prob(Mary can | pi, A, B) = {:0.4f}".format(model.likelihood(obs_seq)))

**Problem 2 (Decoding):** Given an observation sequence $O$ and an HMM $λ = (A, B)$, discover the best hidden state sequence $X$.

The **Viterbi algorithm** is one of the most common decoding algorithms for HMMs. Its goal is to find the most likely hidden state sequence corresponding to a series of observations. 

**Note:** The decoding is provided by calling `.decoding`.

What is the most probable “path” for generating a given sequence?
* $ O= \{\text{"Mary"}\}$
* $ O= \{\text{"Mary Jane"}\}$
* $ O= \{\text{"Mary can"}\}$
* $ O= \{\text{"Will can spot Mary"}\}$

In [None]:
obs_seq = np.array([0])
prob,state_seq = model.decoding(obs_seq)
print ("Most likely state sequence for observation (Mary): ", state_seq)
print("Probability: {:0.6f}".format(prob))

In [None]:
obs_seq = np.array([0,1])
prob,state_seq = model.decoding(obs_seq)
print ("Most likely state sequence for observation (Mary Jane): ", state_seq)
print("Probability: {:0.6f}".format(prob))

In [None]:
obs_seq = np.array([0,4])
prob,state_seq = model.decoding(obs_seq)
print ("Most likely state sequence for observation (Mary can): ", state_seq)
print("Probability: {:0.6f}".format(prob))

In [None]:
obs_seq = np.array([2,4,3,0])
prob,state_seq = model.decoding(obs_seq)
print ("Most likely state sequence for observation (Will can spot Mary): ", state_seq)
print("Probability: {:0.6f}".format(prob))

## Generate Sequence

In [None]:
# Generate a sequence of 5 observations from the model
O, _ = model.sample(5)

words = []
for o in O:
    words.append(observations[o[0]])
    
print("The generated sentence: ", words)

## References:
* https://www.mygreatlearning.com/blog/pos-tagging/
* https://www.coursera.org/