# Lab 7: Graphical and hidden Markov models
```
- [S23] Advanced Machine Learning, Innopolis University
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>


```
Lab Plan
1. Project start
2. Recap on HMMs
3. Simple example for HMMs
4. Task
```

<hr>


## 1. Recap on HMMs and Likelihood

<!--![](https://www.dropbox.com/scl/fi/wm8nbmqgchv3orw7r488f/Lecture-5-HMM.jpg?rlkey=3x0lmmzxchgz6esmq6yrjfzne&dl=1)-->

In [1]:
import numpy as np
states = ('Sunny', 'Rainy')
observations = ('happy', 'grumpy')
pi = np.array([2./3., 1./3.])  # initial probability
A = np.array([[7./9., 2./9.],[0.4, 0.6]]) # Transition probability, A[i, j] = P(next state j | current state i)
B = np.array([[0.8, 0.2],[0.4, 0.6]]) # Emission probability
bob_says = np.array([0,0,1,1,1,0])

def forward(obs_seq, pi, A, B):
    T = len(obs_seq)
    N = A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi*B[:,obs_seq[0]]
    for t in range(1, T):
        # alpha[t-1] @ A sums over the previous state i: sum_i alpha[t-1, i] * A[i, j]
        alpha[t] = (alpha[t-1] @ A) * B[:, obs_seq[t]]
    return alpha

def likelihood(alpha):
    # using the forward part of the forward-backward algorithm
    return alpha[-1].sum()

alpha = forward(bob_says, pi, A, B)
print(alpha)

print(f"{likelihood(alpha):.8f}")

[[0.53333333 0.13333333]
 [0.37451852 0.07940741]
 [0.06461103 0.07852247]
 [0.0163324  0.03688289]
 [0.00549123 0.01545549]
 [0.00836252 0.00419743]]
0.01255995
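
A quick way to sanity-check a forward pass on a short sequence is to compare it with brute-force enumeration: sum the joint probability of the observations over every possible hidden path. The sketch below re-declares the model so it runs on its own, and assumes the convention `A[i, j] = P(next state j | current state i)`. The helper names `forward_likelihood` and `brute_force_likelihood` are illustrative and not part of the lab code.

```python
import itertools
import numpy as np

pi = np.array([2/3, 1/3])                # initial state probabilities
A = np.array([[7/9, 2/9], [0.4, 0.6]])   # A[i, j] = P(next state j | current state i)
B = np.array([[0.8, 0.2], [0.4, 0.6]])   # B[i, k] = P(observation k | state i)
obs = [0, 0, 1, 1, 1, 0]                 # happy, happy, grumpy, grumpy, grumpy, happy

def forward_likelihood(obs, pi, A, B):
    """P(obs) via the forward recursion, O(T * N^2)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def brute_force_likelihood(obs, pi, A, B):
    """P(obs) by summing over all N^T hidden paths (only feasible for tiny T)."""
    n_states = len(pi)
    total = 0.0
    for path in itertools.product(range(n_states), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

fast = forward_likelihood(obs, pi, A, B)
slow = brute_force_likelihood(obs, pi, A, B)
print(np.isclose(fast, slow))
```

The enumeration grows exponentially with the sequence length, so it is only useful as a test on toy inputs.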


## 2. Viterbi algorithm simple example


In [2]:
# Transition probabilities (s - sunny; r - rainy)
p_ss = 7./9.
p_sr = 2./9.
p_rs = 0.4
p_rr = 0.6

# Initial Probabilities
p_s = 2/3
p_r = 1/3

# Emission probabilities (h - happy; g - grumpy)
p_sh = 0.8
p_sg = 0.2
p_rh = 0.4
p_rg = 0.6

moods = ['H', 'H', 'G', 'G', 'G', 'H']
probabilities = []  # best path probability ending in (sunny, rainy) on each day
backpointers = []   # best previous state for (sunny, rainy) on each day after the first
weather = []

if moods[0] == 'H':
    probabilities.append((p_s*p_sh, p_r*p_rh))
else:
    probabilities.append((p_s*p_sg, p_r*p_rg))

for i in range(1,len(moods)):
    yesterday_sunny, yesterday_rainy = probabilities[-1]
    # emission probabilities for today's mood
    e_s, e_r = (p_sh, p_rh) if moods[i] == 'H' else (p_sg, p_rg)
    to_sunny = (yesterday_sunny*p_ss*e_s, yesterday_rainy*p_rs*e_s)
    to_rainy = (yesterday_sunny*p_sr*e_r, yesterday_rainy*p_rr*e_r)
    probabilities.append((max(to_sunny), max(to_rainy)))
    backpointers.append(('S' if to_sunny[0] >= to_sunny[1] else 'R',
                         'S' if to_rainy[0] >= to_rainy[1] else 'R'))

# Backtrack from the most probable final state to recover the best path
state = 'S' if probabilities[-1][0] > probabilities[-1][1] else 'R'
weather.append(state)
for bp in reversed(backpointers):
    state = bp[0] if state == 'S' else bp[1]
    weather.append(state)
weather.reverse()

weather

['S', 'S', 'R', 'R', 'R', 'S']
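
The example above handles two states with hand-written products. For the POS task, with dozens of tags and longer sentences, a general N-state version in log space may be more practical, since products of many small probabilities underflow. The following is a minimal sketch under these assumptions: `A[i, j] = P(next state j | current state i)`, `B[i, k] = P(observation k | state i)`, and no zero entries in `pi`, `A` or `B`. The function name `viterbi` is illustrative. It stores a back-pointer for every state and time step and recovers the path by backtracking from the best final state. Picking the larger probability independently at each step need not give the same sequence.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most likely state sequence and its log-probability."""
    T, N = len(obs), len(pi)
    log_delta = np.zeros((T, N))            # best log-probability of a path ending in each state
    backptr = np.zeros((T, N), dtype=int)   # best predecessor of each state
    log_delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        # scores[i, j]: best path ending in state i at t-1, followed by a move to j
        scores = log_delta[t - 1][:, None] + np.log(A)
        backptr[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack from the best final state
    path = [int(log_delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(log_delta[-1].max())

states = ('Sunny', 'Rainy')
pi = np.array([2/3, 1/3])
A = np.array([[7/9, 2/9], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.4, 0.6]])
bob_says = [0, 0, 1, 1, 1, 0]   # 0 = happy, 1 = grumpy

path, log_prob = viterbi(bob_says, pi, A, B)
print([states[s] for s in path])
print(np.exp(log_prob))
```

For real tagging data, some transitions or emissions will have zero counts, so smoothing (or masking `-inf` scores) would be needed before taking logarithms.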

## 3. Parts of Speech (POS) Tagging

---

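* The process of assigning a part-of-speech tag to each word in a text
* POS tagging is a disambiguation task: some words can take several parts of speech depending on the context (for example, "book" is a noun in "read a book" and a verb in "book a flight")
* Gives an idea of the syntactic structure of a sentence (useful for parsing)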

**How does it work?**

<!--![](http://www.cs.virginia.edu/~hw5x/Course/TextMining-2019Spring/_site/docs/codes/HMM.PNG)-->

## 4. Task (Viterbi algorithm)

<!-- ![](https://miro.medium.com/max/1080/1*8-5KZVj-_jZOWN83gGhD5A.png) -->
<!--![](https://www.dropbox.com/scl/fi/twg8yqdhqwxcbr5lcqj4w/pos_tag1.png?rlkey=5kizqwfv8gk8vg26434sp0rk2&dl=1)-->

First steps:
* Calculate or estimate transition probabilities between different part-of-speech tags
* Calculate or estimate prior probabilities of tags
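
One common way to estimate these quantities is by counting over a tagged corpus. The sketch below is a minimal illustration on a made-up toy corpus (`tagged_sentences` is hypothetical, not the lab data). It assumes that "prior" means the probability that a sentence starts with a given tag, the counterpart of `pi` in section 1. It uses plain maximum-likelihood estimates without smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dogs", "NNS"), ("bark", "VBP")],
]

def estimate_priors_and_transitions(sentences):
    """Maximum-likelihood estimates from counts (no smoothing)."""
    first_tag_counts = Counter()
    transition_counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        if not tags:
            continue  # skip empty sentences
        first_tag_counts[tags[0]] += 1
        for prev_tag, next_tag in zip(tags, tags[1:]):
            transition_counts[prev_tag][next_tag] += 1
    n_sentences = sum(first_tag_counts.values())
    priors = {tag: c / n_sentences for tag, c in first_tag_counts.items()}
    transitions = {
        prev: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for prev, counts in transition_counts.items()
    }
    return priors, transitions

priors, transitions = estimate_priors_and_transitions(tagged_sentences)
print(priors)
print(transitions["DT"])
```

Tag pairs that never occur in the training data get no entry here, which corresponds to probability zero. Add-one smoothing is one option if that turns out to be too strict.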


In this task you will use a fraction of the CoNLL-2000 dataset. The dataset can be retrieved from: [Train](https://www.dropbox.com/s/x9n6f9o9jl7pno8/train_pos.txt?dl=1), [Test](https://www.dropbox.com/s/v8nccvq7jewcl8s/test_pos.txt?dl=1)

1. Calculate the transition probability and emission probability matrices (the first step towards Viterbi) <!-- 10 points -->
1. Use the Viterbi algorithm for the POS tagging task (you can use libraries such as [`hmmlearn`](https://github.com/hmmlearn/hmmlearn)).
1. Test your Viterbi model on the test set and record the accuracy. Accuracy here refers to the proportion of correctly predicted tags across all test samples.
1. Solve the POS tagging task using a recurrent neural network (RNN, GRU or LSTM) or a Transformer.
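
The accuracy in step 3 could be computed per token, as in the sketch below. It assumes gold and predicted tags are given as parallel lists of tag sequences, one per sentence. The function name `tagging_accuracy` and the toy inputs are illustrative only.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each predicted sentence must match its gold sentence in length")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0

# Toy example: 4 of 5 tags are correct
gold = [["DT", "NN", "VBZ"], ["NNS", "VBP"]]
pred = [["DT", "NN", "NN"], ["NNS", "VBP"]]
print(tagging_accuracy(gold, pred))
```

Counting per token rather than per sentence keeps long and short sentences on the same footing, and the same function can be reused to compare the HMM tagger with the neural model.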

In [None]:
# Load the train and test splits of the dataset
from datasets import load_dataset

dataset = load_dataset('conll2000')

train_data = dataset['train']
test_data = dataset['test']

print(len(train_data))

In [7]:
print(train_data)

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags'],
    num_rows: 8937
})


In [8]:
print(test_data)

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags'],
    num_rows: 2013
})
