# DSCI 563 Lab Assignment 2: Hidden Markov Model Project

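**Sequence labeling** covers tasks such as

- part-of-speech tagging (POS tagging), 
- chunking, 
- named entity recognition (NER), and 
- semantic role labeling (SRL).

For each task, a sequence labeling algorithm (such as an HMM) is trained and evaluated on a labeled data set. The first table below shows one sentence annotated for all four tasks, and the second lists commonly used data sets and splits.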

| | *The*| *luxury*| *auto* | *maker* | *last* | *year*| *sold*  | *1,214*  | *cars*  | *in*  | *the* |*U.S.*  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| POS | dt |nn |nn |nn |jj |nn |vbd |cd |nns |in |dt |nnp |
| Chunking |b-np |i-np |i-np |i-np |o |o |o |b-np |i-np |o |b-np |i-np |
| NER | o|o|o|o|o|o|o|o|o|o|o|loc|
| SRL | b-ar$_0$| i-ar$_0$| i-ar$_0$| i-ar$_0$| o| o| pred| b-ar$_1$| i-ar$_1$| o| o| o|


| | dataset | train |dev| test |
|---|---|---|---|---|
|POS tagging | WSJ | Section 0-18 |  Section 19-21 | Section 22-24 |
|Chunking |WSJ |Section 15-18 | - | Section 20 |
|NER | Reuters| eng.train |eng.testa | eng.testb |
|SRL | WSJ| Section 2-21  | Section 24   | Section 23   |

## Assignment Objectives

In this assignment you will
- Build a Hidden Markov Model
- Use Hard EM to do semi-supervised part of speech tagging

In Part 1, you see the `HMM` class which we will use for this lab. Parts 1-2 ask you to fill in methods in the class and Part 3 asks you to apply the class for semi-supervised POS tagging of the Brown corpus. 

Parts 2 and 3 depend on the first part which implements training for HMMs, but they do not depend on each other.

Part 3 uses inference for the HMM class. You should start developing a solution using the provided greedy inference algorithm, but switch to Viterbi once Part 2 has been completed.

## Getting Started

Run the code below to access relevant modules (you can add to this as needed)

In [1]:
import nltk
from nltk.corpus import brown
import numpy as np
import scipy
from collections import defaultdict,Counter
from random import shuffle,seed,choice,random
from sklearn.metrics import adjusted_rand_score
nltk.download('universal_tagset')

## Tidy Submission

rubric={mechanics:2}

- You have been assigned a team for this project, which you will find in the teams.txt file on the DSCI 563 course repo.
- One person in each group should create a private UBC GitHub repo, and give access to all group members as well as the members of the teaching team.
- In the `README.md` in the individual lab repo (the one created when the lab is opened) for all members of the group, you should have a link to this private, shared repo. Pushing that link is your only "submission". Don't put anything else in your repo for this lab.
- In the private shared repo, include the final notebook which contains the work by all team members. 
- **Note:** Any commits to the private shared repo after the deadline will result in a late penalty being applied to the project, so be careful about that.

### Part 1: HMM Initialization and training

#### Assignment 1.1
rubric={accuracy:1}

Your first task is to initialize an HMM model in the function `HMM.__init__()`, which takes two arguments: a vocabulary `emissions` and a state set `states`. The `HMM` class already contains the members

* `self.emissions`, the list of word types,
* `self.states`, the list of possible POS tags,
* `self.w2i` (a dictionary) and `self.i2w` (a list), for converting between word types like `"dog"` and index numbers like 134, and
* `self.s2i` (a dictionary) and `self.i2s` (a list), for converting between states like `"NOUN"` and index numbers like 12.

You should initialize three member variables:

* `self.init_prob`, an `np.array` of initial state probabilities of shape `1 x size_of_state_set`.
* `self.emission_prob`, an `np.array` of emission probabilities of shape `size_of_state_set x size_of_vocabulary`. 
* `self.transition_prob` an `np.array` of transition probabilities of shape `size_of_state_set x size_of_state_set`.

All array values should be initialized to 0.

After you complete this assignment correctly, you should be able to pass the assertions for 1.1 below the HMM class definition.



#### Assignment 1.2
rubric={accuracy:3, quality:1}

Your next task is fully supervised training of the HMM model in the function `HMM.train()`. The function takes a training set `data`, which is a list of tagged sentences, e.g.:
```
[[("the", "DET"),("dog", "NOUN"),("barks","VERB")], [("the","DET"),("dog","NOUN")]]
```

**Before you do anything else, please initialize all initial, emission and transition probabilities to 0.**

You should then convert words and POS tags in `data` into index numbers using the function `HMM.data2i()`. The output will look something like this: 

```
[[(101, 10), (1000, 5), (5, 2)], [(101, 10), (1000, 5)]]
```

The left element of each pair is an index number corresponding to a word type in the vocabulary and the right element is an index number of a state (i.e. POS tag in our case).

Start the actual training by storing **counts** of emissions, transitions and initial states in `self.emission_prob`, `self.transition_prob` and `self.init_prob`, respectively. For example, if the word number `101` is emitted twice in the state `10`, then we want `self.emission_prob[10][101] == 2`. Similarly, if we transition from state `10` to state `5` twice in `data`, we want `self.transition_prob[10][5] == 2`.

Then **apply add-one smoothing** to all counts, and normalize probabilities according to:

$$\large P_{initial}(s) = \frac{{\rm count}_{initial}(s)}{\sum_{t} {\rm count}_{initial}(t)}$$

$$\large P_{emission}(w | s) = \frac{{\rm count}(w,s)}{{\rm count}(s)}$$

$$\large P_{transition}(t|s) = \frac{{\rm count}(s,t)}{{\rm count}(s)}$$

Finally, convert all probabilities to log-probabilities using $p \mapsto \log_2 p$ (note that the base of the logarithm is 2). 
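
The counting, smoothing, normalization and log steps can be sketched on the toy data set from above. This is a minimal standalone sketch under stated assumptions, not the `HMM.train()` method itself: the helper name `smoothed_log_probs` and the hard-coded index encoding are made up for illustration.

```python
import numpy as np

def smoothed_log_probs(counts):
    """Add-one smooth a count array, normalize each row, and return log2 probabilities."""
    smoothed = counts + 1                                  # add-one (Laplace) smoothing
    probs = smoothed / smoothed.sum(axis=-1, keepdims=True)  # each row sums to 1
    return np.log2(probs)

# Toy data as (word_index, state_index) pairs. The encoding is an assumption:
# words: the=0, dog=1, barks=2, <UNK>=3; states: DET=0, NOUN=1, VERB=2, ADJ=3.
data = [[(0, 0), (1, 1), (2, 2)], [(0, 0), (1, 1)]]
n_states, n_words = 4, 4

init_counts = np.zeros((1, n_states))
emission_counts = np.zeros((n_states, n_words))
transition_counts = np.zeros((n_states, n_states))

for sentence in data:
    init_counts[0, sentence[0][1]] += 1                    # state of the first token
    for word, state in sentence:
        emission_counts[state, word] += 1
    for (_, prev_state), (_, next_state) in zip(sentence, sentence[1:]):
        transition_counts[prev_state, next_state] += 1

init_lp = smoothed_log_probs(init_counts)
emission_lp = smoothed_log_probs(emission_counts)
transition_lp = smoothed_log_probs(transition_counts)
```

Because the smoothing is applied to the counts before normalizing, rows for states that never occur in the data (such as `ADJ` here) end up as uniform distributions rather than causing a division by zero.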

In [16]:
from copy import deepcopy

# Symbol used to replace unknown tokens in the input 
UNK="<UNK>"

class HMM:
    def __init__(self, emissions, states):
            # Vocabulary and tag set.
            self.emissions = deepcopy(emissions + [UNK])
            self.states = deepcopy(states)

            # Use these to convert between strings and ID numbers
            self.w2i = {w:i for i, w in enumerate(self.emissions)}
            self.i2w = self.emissions
            self.s2i = {s:i for i,s in enumerate(self.states)}
            self.i2s = self.states
        
            # your code here

            
    def data2i(self, data):
        """ Encode emissions and states into index numbers. 
        
            ex is either a sequence of words or a sequence 
            of word-state pairs.
        """
        idx_data = []
        if type(data[0][0]) == type(""):
            for ex in data:
                idx_data.append([])
                for w in ex:
                    w = w if w in self.w2i else UNK
                    idx_data[-1].append(self.w2i[w])
        else:
            for ex in data:
                idx_data.append([])
                for w,s in ex:
                    w = w if w in self.w2i else UNK
                    idx_data[-1].append((self.w2i[w], self.s2i[s]))
        return idx_data

    def train(self, data):
        # Initialize all parameters to 0.
        self.init_prob[:] = 0
        self.emission_prob[:] = 0
        self.transition_prob[:] = 0
        
        data = self.data2i(data)
        # your code here
   

    def greedy_decode(self, ex):
        """ Greedy (or beam 1 decoding). """
        ex = self.data2i([ex])[0]
        state_distr = np.array(self.init_prob)
        output = []
        log_prob = 0
        for w in ex:  
            state_distr += self.emission_prob[:, w].reshape(1,-1)
            output.append(state_distr.argmax())
            log_prob = state_distr.max()
            state_distr = log_prob + self.transition_prob[[output[-1]], :]
        return [(self.i2w[w], self.i2s[s]) for w, s in zip(ex, output)], log_prob
    
    def extract_output(self, trellis, back_pointers, ex):
        log_prob = trellis[:,-1].max()
        output = [trellis[:,-1].argmax()]
        while len(output) < len(ex):
            output.append(back_pointers[output[-1], len(ex) - len(output)])
        output = output[::-1]
        
        return [(self.i2w[w], self.i2s[s]) for w, s in zip(ex, output)], log_prob
    
    def viterbi_decode(self, ex):
        """ Viterbi decoding using loops. """
        # your code here

    
    def fast_viterbi_decode(self, ex):
        """ Vectorized Viterbi decoding. """
        # your code here

Assertions to check your code for Assignment 1.1:

In [3]:
hmm = HMM(["the", "dog", "barks"],["DET", "NOUN", "VERB", "ADJ"])
assert hmm.init_prob.shape == (1,4)                 # 1 x size_of_state_set (4)
assert hmm.transition_prob.shape == (4,4)           # size_of_state_set x size_of_state_set
assert hmm.emission_prob.shape == (4,4)             # size_of_state_set x size_of_vocabulary (3+UNK)
print("Success!")

init_prob
[[0. 0. 0. 0.]]
transition_prob
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
emission_prob
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Success!


Assertions to check your code for Assignment 1.2:
`hmm = HMM(["the", "dog", "barks"],["DET", "NOUN", "VERB", "ADJ"])`
 self.emission_prob, self.transition_prob]:


`init_prob`: 
- before Laplace smoothing: `[2/2, 0/2, 0/2, 0/2]`
- after Laplace smoothing: `[(2+1)/(2+4), (0+1)/(2+4), (0+1)/(2+4), (0+1)/(2+4)]`, where `4` is the number of tags

`emission_prob` (rows: `DET`, `NOUN`, `VERB`, `ADJ`; columns: `the`, `dog`, `barks`, `UNK`):
- before Laplace smoothing: `[[2/2, 0/2, 0/2, 0/2], [0/2, 2/2, 0/2, 0/2], [0/1, 0/1, 1/1, 0/1], [0, 0, 0, 0]]`
- after Laplace smoothing: `[[(2+1)/(2+4), (0+1)/(2+4), (0+1)/(2+4), (0+1)/(2+4)], [(0+1)/(2+4), (2+1)/(2+4), (0+1)/(2+4), (0+1)/(2+4)], [(0+1)/(1+4), (0+1)/(1+4), (1+1)/(1+4), (0+1)/(1+4)], [(0+1)/(0+4), (0+1)/(0+4), (0+1)/(0+4), (0+1)/(0+4)]]`, where `4` is the vocabulary size including `UNK`

`transition_prob` (rows and columns: `DET`, `NOUN`, `VERB`, `ADJ`), where `DET->NOUN` occurs twice and `NOUN->VERB` once:
- counts before smoothing: `[[0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]`
- counts after adding 1: `[[1, 3, 1, 1], [1, 1, 2, 1], [1, 1, 1, 1], [1, 1, 1, 1]]`
- after dividing each row by its sum: `[[1/6, 3/6, 1/6, 1/6], [1/5, 1/5, 2/5, 1/5], [1/4, 1/4, 1/4, 1/4], [1/4, 1/4, 1/4, 1/4]]`

In [4]:
data = [[("the", "DET"),("dog", "NOUN"),("barks","VERB")], [("the","DET"),("dog","NOUN")]]
hmm.train(data)

assert np.abs(np.exp2(hmm.init_prob) - [3/6, 1/6, 1/6, 1/6]).sum() < 0.001
assert np.abs(np.exp2(hmm.emission_prob) - [[3/6, 1/6, 1/6, 1/6],[1/6, 3/6, 1/6, 1/6],[1/5, 1/5, 2/5, 1/5], [1/4, 1/4, 1/4, 1/4]]).sum() < 0.001
assert np.abs(np.exp2(hmm.transition_prob) - [[1/6, 3/6, 1/6, 1/6], [1/5, 1/5, 2/5, 1/5], [1/4, 1/4, 1/4, 1/4], [1/4, 1/4, 1/4, 1/4]]).sum() < 0.001

output, prob = hmm.greedy_decode("the dog barks".split())
assert output == [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]
assert np.abs(prob - np.log2(3/6 * 3/6 * 3/6 * 3/6 * 2/5 * 2/5)).sum() < 0.0001
print("Success!")

init_prob
[[-1.        -2.5849625 -2.5849625 -2.5849625]]
transition_prob
[[-2.5849625  -1.         -2.5849625  -2.5849625 ]
 [-2.32192809 -2.32192809 -1.32192809 -2.32192809]
 [-2.         -2.         -2.         -2.        ]
 [-2.         -2.         -2.         -2.        ]]
emission_prob
[[-1.         -2.5849625  -2.5849625  -2.5849625 ]
 [-2.5849625  -1.         -2.5849625  -2.5849625 ]
 [-2.32192809 -2.32192809 -1.32192809 -2.32192809]
 [-2.         -2.         -2.         -2.        ]]
Success!


`init_prob`
```
 ["DET",     "NOUN",    "VERB",    "ADJ"]  
[[-1.        -2.5849625 -2.5849625 -2.5849625]]
```
$\log_2 P(D|bos) = -1$, $\log_2 P(V|bos) = -2.5849625$, ...


`transition_prob`
```
         "DET"       "NOUN"      "VERB"      "ADJ"
"DET"  [[-2.5849625  -1.         -2.5849625  -2.5849625 ]
"NOUN"  [-2.32192809 -2.32192809 -1.32192809 -2.32192809]
"VERB"  [-2.         -2.         -2.         -2.        ]
"ADJ"   [-2.         -2.         -2.         -2.        ]]
```
$\log_2 P(N|D) = -1$, $\log_2 P(V|N) = -1.32192809$, ...

`emission_prob`
```
         "the"       "dog"       "barks"     "UNK"
"DET"  [[-1.         -2.5849625  -2.5849625  -2.5849625 ]
"NOUN"  [-2.5849625  -1.         -2.5849625  -2.5849625 ]
"VERB"  [-2.32192809 -2.32192809 -1.32192809 -2.32192809]
"ADJ"   [-2.         -2.         -2.         -2.        ]]
```
$\log_2 P(the|D) = -1$, $\log_2 P(dog|N) = -1$, ...

### Part 2: Viterbi decoding

In this part, you'll implement Viterbi decoding in two ways. 

**Note!** Assignments 2.1 and 2.2 do not depend on each other. You can work on them in parallel. 

#### Assignment 2.1
rubric={accuracy:4,quality:2}

In this assignment, you will implement the Viterbi algorithm using loops (i.e. as a non-vectorized algorithm) in the function `HMM.viterbi_decode()`.

The function takes a single argument `ex`, a list representing a sentence, e.g.:
```
["The", "dog", "sleeps"]
```

Start by converting `ex` into a list of index numbers using the function `HMM.data2i()`. Note that the function takes a list of sentences as input instead of a single sentence. 

You should then initialize two `np.array` objects: 

* `trellis`, which contains the Viterbi probabilities $v_i(s)$ for each state $s$ and position $i$ in the sentence, and 
* `back_pointers`, which contains back pointers. These identify the optimal tag history.

Both of these need to have dimension `len(self.states) x len(ex)` and you should initialize all values in `trellis` to negative infinity (`-float("inf")`) and all values in `back_pointers` to `-1`.

We can then start filling the trellis one column (i.e. one sentence position) at a time:

1. We'll first initialize all elements `trellis[s,0]` to the sum of the initial log-probability of state `s` and the emission log-probability of the first input word `ex[0]` in the given state.
2. When filling in the cell for state `s` in position `i+1`, we need to loop over all states `r` in column `i` and find the state $r_{max}$ which maximizes $\log_2 v_{i}(r) + \log_2 P_{transition}(s|r) + \log_2 P_{emission}(w|s)$, where $w$ is the token `ex[i+1]`. The maximum value is the Viterbi log-probability $\log_2 v_{i+1}(s)$.
3. You should also store $r_{max}$ in cell `s,i+1` in `back_pointers`.

When you've filled in `trellis` and `back_pointers`, call the function `self.extract_output()` which will extract the output tag sequence.
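
The steps above can be sketched as a standalone function. This is a hedged illustration rather than the `HMM.viterbi_decode()` method: the name `viterbi_loops`, its argument list, and the inlined back-pointer walk (in place of `self.extract_output()`) are assumptions, and the inputs are plain arrays of log2-probabilities plus a list of word indices.

```python
import numpy as np

def viterbi_loops(init_lp, transition_lp, emission_lp, word_ids):
    """Loop-based Viterbi over log2-probabilities; returns (state indices, log2-probability)."""
    n_states = transition_lp.shape[0]
    n = len(word_ids)
    trellis = np.full((n_states, n), -np.inf)
    back_pointers = np.full((n_states, n), -1, dtype=int)

    # Position 0: initial log-probability plus emission log-probability.
    for s in range(n_states):
        trellis[s, 0] = init_lp[0, s] + emission_lp[s, word_ids[0]]

    # Remaining positions: best previous state r for every current state s.
    for i in range(1, n):
        for s in range(n_states):
            for r in range(n_states):
                score = trellis[r, i - 1] + transition_lp[r, s] + emission_lp[s, word_ids[i]]
                if score > trellis[s, i]:
                    trellis[s, i] = score
                    back_pointers[s, i] = r

    # Follow the back pointers from the best final state to the start.
    states = [int(trellis[:, -1].argmax())]
    for i in range(n - 1, 0, -1):
        states.append(int(back_pointers[states[-1], i]))
    return states[::-1], float(trellis[:, -1].max())

# Smoothed toy parameters from Part 1 (states: DET, NOUN, VERB, ADJ).
init_lp = np.log2(np.array([[3/6, 1/6, 1/6, 1/6]]))
transition_lp = np.log2(np.array([[1/6, 3/6, 1/6, 1/6], [1/5, 1/5, 2/5, 1/5],
                                  [1/4, 1/4, 1/4, 1/4], [1/4, 1/4, 1/4, 1/4]]))
emission_lp = np.log2(np.array([[3/6, 1/6, 1/6, 1/6], [1/6, 3/6, 1/6, 1/6],
                                [1/5, 1/5, 2/5, 1/5], [1/4, 1/4, 1/4, 1/4]]))

# "the dog barks" encoded as word indices 0, 1, 2 (an assumed encoding).
states, log_prob = viterbi_loops(init_lp, transition_lp, emission_lp, [0, 1, 2])
```

On this toy input the best path is `DET NOUN VERB` (state indices `[0, 1, 2]`), which agrees with the assertions for this assignment.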

After successfully completing the Viterbi algorithm, you should be able to pass the following assertions:

![HMM overview](hmm-overview.png)

In [18]:
hmm = HMM(["the", "dog", "barks"],["DET", "NOUN", "VERB", "ADJ"])

data = [[("the", "DET"),("dog", "NOUN"),("barks","VERB")], [("the","DET"),("dog","NOUN")]]
hmm.train(data)

output, prob = hmm.viterbi_decode("the dog barks".split())
assert(output == [("the", "DET"),("dog", "NOUN"),("barks","VERB")])
assert np.abs(prob - np.log2(3/6 * 3/6 * 3/6 * 3/6 * 2/5 * 2/5)).sum() < 0.0001
print("Success!")

Success!


#### Assignment 2.2
rubric={accuracy:4,quality:2,efficiency:3}

You will now implement a vectorized version of Viterbi in the function `HMM.fast_viterbi_decode()`, which again takes a single argument: `ex` representing a sentence. A successfully implemented vectorized Viterbi can be substantially faster than a loop-based approach.

Start by converting word tokens in `ex` into index numbers and initialize `trellis` and `back_pointers` as in Assignment 2.1.

You should then initialize the first column of the trellis (position 0) to the sum of your initial log-probability vector and the emission log-probabilities for `ex[0]`. Note that for full efficiency marks, you have to compute all the probabilities using a single addition of `np.array` objects.

When filling in column `i+1`, you should start by computing a `len(self.states) x len(self.states)` matrix `log_probs`, where the element `log_probs[r,s]` represents the log-probability: 

$\log_2 v_{i}(r) + \log_2 P_{transition}(s|r) + \log_2 P_{emission}(w|s)$

where $w$ is the token `ex[i+1]`. For full efficiency marks, you should compute this matrix using a single addition of `np.array` objects. Specifically, you'll need to use the previous column of the trellis `trellis[:,i]`, the transition log-probabilities `self.transition_prob` and the emission log-probabilities for $w$. 

After you've computed `log_probs`, you need to find the maximal element in each column (i.e. the maximum over the previous states `r`) and assign it to column `i+1` in `trellis`. These will be your Viterbi log-probabilities $\log_2 v_{i+1}(s)$. You also need to store the index `r` of each maximal element in `back_pointers`. For full efficiency marks you should use a single NumPy operation to fill in your trellis column and a single operation to fill in the back pointers. 

When you've completed the entire `trellis` and `back_pointers`, call the function `self.extract_output` which will extract the output tag sequence.
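
A single vectorized update might look roughly like the sketch below. It assumes that `log_probs[r, s]` is indexed by previous state `r` (rows) and current state `s` (columns), so the maximum over previous states is a reduction along axis 0. The function name `viterbi_step` and the toy numbers are hypothetical.

```python
import numpy as np

def viterbi_step(prev_column, transition_lp, emission_column):
    """One vectorized Viterbi update.

    prev_column:     shape (n,), Viterbi log-probabilities at position i
    transition_lp:   shape (n, n), entry [r, s] is the log-probability of r -> s
    emission_column: shape (n,), emission log-probabilities of the token at i+1
    """
    # Previous scores vary along rows (r), emissions vary along columns (s).
    log_probs = prev_column.reshape(-1, 1) + transition_lp + emission_column.reshape(1, -1)
    return log_probs.max(axis=0), log_probs.argmax(axis=0)

# Toy values, chosen only so the arithmetic is easy to follow.
prev_column = np.array([0.0, -1.0])
transition_lp = np.array([[-1.0, -5.0], [-3.0, -1.0]])
emission_column = np.array([-0.5, -0.25])

new_column, pointers = viterbi_step(prev_column, transition_lp, emission_column)
# new_column is [-1.5, -2.25] and pointers is [0, 1]
```

The two `reshape` calls make the intended broadcasting explicit: getting either one the wrong way round still produces an `n x n` matrix, just with the wrong entries.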

**Note!** You need to ensure that [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html) works correctly when summing arrays. Your transition log-probabilities will be an `n x n` matrix and your trellis row and emission log-probabilities will be either `1 x n` or `n x 1` arrays. Both shapes can be broadcast to `n x n` but this will lead to different results as demonstrated by the following example of NumPy addition:

$$\begin{bmatrix}
1 & 2\\
3 & 4
\end{bmatrix} + \begin{bmatrix}
5 & 6
\end{bmatrix} =
\begin{bmatrix}
1 & 2\\
3 & 4
\end{bmatrix} + \begin{bmatrix}
5 & 6 \\
5 & 6 
\end{bmatrix}=
\begin{bmatrix}
6 & {\color{red} 8}\\
{\color{red} 8} & 10
\end{bmatrix}
$$

vs.

$$\begin{bmatrix}
1 & 2\\
3 & 4
\end{bmatrix} + \begin{bmatrix}
5\\
6
\end{bmatrix} =
\begin{bmatrix}
1 & 2\\
3 & 4
\end{bmatrix} + \begin{bmatrix}
5 & 5\\
6 & 6
\end{bmatrix}=
\begin{bmatrix}
6 & {\color{red}7}\\
{\color{red}9} & 10
\end{bmatrix}
$$
    
Make sure that you know how broadcasting should happen and use `numpy.ndarray.reshape` to convert between `1 x n` and `n x 1` arrays if needed.
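
The two additions from the example can be reproduced directly in NumPy. This short sketch only restates the matrices shown above; the variable names are arbitrary.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
v = np.array([5, 6])

row_sum = m + v.reshape(1, -1)   # 1 x n: v is added to every row
col_sum = m + v.reshape(-1, 1)   # n x 1: v is added to every column

print(row_sum)   # [[ 6  8] [ 8 10]]
print(col_sum)   # [[ 6  7] [ 9 10]]
```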

After successfully completing the vectorized Viterbi algorithm, you should be able to pass the following assertions:

In [19]:
hmm = HMM(["the", "dog", "barks"],["DET", "NOUN", "VERB", "ADJ"])

data = [[("the", "DET"),("dog", "NOUN"),("barks","VERB")], [("the","DET"),("dog","NOUN")]]
hmm.train(data)

output, prob = hmm.fast_viterbi_decode("the dog barks".split())
assert(output == [("the", "DET"),("dog", "NOUN"),("barks","VERB")])
assert np.abs(prob - np.log2(3/6 * 3/6 * 3/6 * 3/6 * 2/5 * 2/5)) < 0.0001
print("Success!")

Success!


### Part 3: Hard EM

In this part, you will use hard EM to train an HMM in a semi-supervised manner. We'll use the Brown corpus and form a small manually annotated training set which is combined with a large amount of unlabeled data.


The `accuracy` function below computes tagging accuracy:

In [20]:
def accuracy(sys_data, gold_data):
    """ Compute tagging accuracy. """
    sys_tags = [t for ex in sys_data for w,t in ex]
    gold_tags = [t for ex in gold_data for w,t in ex]
    return 100 * (np.array(sys_tags) == np.array(gold_tags)).sum()/len(gold_tags)

#### Assignment 3.1
rubric={accuracy:1}

Start by reading the tagged sentences in the Brown corpus using the tag set `"universal"`. You should then divide the corpus into a train split `train_set`, containing 80% of the sentences in the corpus, and a test split `test_set` containing the remaining 20%.

To avoid over-representation of any domain in the training and test set, you should assign sentences into the train and test splits evenly over the entire corpus. For every consecutive 10 sentences in the Brown corpus, assign 8 to `train_set` and the remaining 2 to `test_set`. E.g. if the sentences in the Brown corpus are $s_1 ... s_n$, then the test set will contain sentences $s_9, s_{10}, s_{19}, s_{20}, s_{29}, s_{30}, ...$ and all remaining sentences will end up in `train_set`. 

You should also generate `train_input` and `test_input` which contain untagged versions of the sentences in `train_set` and `test_set`, i.e. simply lists of word tokens. 
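
The interleaved split can be illustrated on a toy list before applying it to the corpus. This is a sketch under assumptions: `split_train_test` is a hypothetical helper, and the toy sentences stand in for the tagged Brown sentences (which NLTK provides via `brown.tagged_sents(tagset="universal")`).

```python
def split_train_test(sentences, block=10, n_train=8):
    """Within every block of `block` sentences, send the first `n_train` to train, the rest to test."""
    train, test = [], []
    for i, sentence in enumerate(sentences):
        (train if i % block < n_train else test).append(sentence)
    return train, test

# Stand-ins for sentences s_1 ... s_20.
toy_sentences = [f"s{k}" for k in range(1, 21)]
toy_train, toy_test = split_train_test(toy_sentences)
# toy_test is ['s9', 's10', 's19', 's20']

# Dropping the tags from tagged sentences leaves plain lists of word tokens.
toy_tagged = [[("the", "DET"), ("dog", "NOUN")]]
toy_input = [[word for word, _ in sentence] for sentence in toy_tagged]
```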

In [8]:
# your code here

In [9]:
print("len. train_input = ", len(train_input))
print("len. test_input = ", len(test_input))

len. train_input =  45872
len. test_input =  11468


#### Assignment 3.2
rubric={accuracy:1}

We will now generate a small labeled training set `mini_train_set`. Sample every 5000th sentence from `train_set` into the small labeled training set.

You should also generate 

- `vocab`, a list of word types occurring in `train_set` and `test_set`, as well as 
- `tags`, a list of unique tags occurring in `mini_train_set`.
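
As a rough sketch of these steps on toy data: slicing with a step picks every `rate`-th sentence starting from the first one, and set comprehensions collect the unique word types and tags. The toy sentences and the sampling rate of 2 are made up for illustration.

```python
toy_train = [
    [("the", "DET"), ("dog", "NOUN")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("big", "ADJ"), ("dogs", "NOUN")],
]
toy_test = [[("the", "DET"), ("cat", "NOUN")]]

rate = 2                          # plays the role of INV_TRAIN_SAMPLING_RATE
toy_mini = toy_train[::rate]      # sentences 0 and 2

# Word types come from both splits; tags only from the small labeled set.
toy_vocab = sorted({word for sentence in toy_train + toy_test for word, _ in sentence})
toy_tags = sorted({tag for sentence in toy_mini for _, tag in sentence})
```

Note that `ADJ` is missing from `toy_tags` because it never occurs in the sampled sentences. The same effect can occur with a very small labeled set on the real corpus.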

In [10]:
# Rate at which we sample examples from our large 
# training set into our small annotated training set
INV_TRAIN_SAMPLING_RATE=5000

# your code here

In [11]:
print("len. mini_train_set = ", len(mini_train_set))
print("len. vocab = ", len(vocab))
print("len, tags = ", len(tags))

len. mini_train_set =  10
len. vocab =  49815
len, tags =  11


#### Assignment 3.3
rubric={accuracy:1}

Initialize an `HMM` object `hmm` using the vocabulary `vocab` and tag set `tags` from Assignment 3.2. Train the model on `mini_train_set`.

Apply inference to the sentences in `test_input` and print tagging accuracy. Initially, you can use `HMM.greedy_decode` (you should get accuracy around 30%). Switch to `HMM.fast_viterbi_decode` when it is ready (you should get accuracy close to 50%).

This is our baseline accuracy before we run hard EM.

In [21]:
# your code here

Tagging accuracy on test set: 49.20%


#### Assignment 3.4
rubric={accuracy:2}

When we run hard EM, we will use tag perplexity as the stopping criterion. You should now implement the function `get_perplexity` which takes as input a list $\mathcal{D}$ of $N$ pairs, each consisting of a tagged sentence $(x_i, y_i)$ and its log-probability $\log_2 P(x_i, y_i)$. It then returns average per-token tag perplexity as defined by the following formula:

$$\large {\rm PP}(\mathcal{D}) = 2^{- \frac{1}{N} \cdot \sum_{i=1}^N \frac{\log_2 P(x_i,y_i)}{|x_i|}}$$

where $|x_i|$ is the length of sentence $x_i$. 
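
As a worked check of the definition, the sketch below divides each sentence log-probability by the sentence length, averages over sentences, and exponentiates. This is the reading that reproduces the value `1115.9062` used in the assertion for this assignment. The function name `per_token_perplexity` is a placeholder.

```python
import numpy as np

def per_token_perplexity(data):
    """data: list of (tagged_sentence, log2_probability) pairs."""
    per_token = [log_prob / len(sentence) for sentence, log_prob in data]
    return float(2 ** (-np.mean(per_token)))

toy_data = [([("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")], -35.853),
            ([("the", "DET"), ("dog", "NOUN")], -16.594)]

# Exponent: -((-35.853 / 3) + (-16.594 / 2)) / 2 = 10.124
pp = per_token_perplexity(toy_data)
```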

In [22]:
def get_perplexity(data):
    """ Compute average per-token tag perplexity. """
    # your code here


Assertion to check your code:

In [23]:
data = [([("the", "DET"),("dog", "NOUN"),("barks","VERB")], -35.853), 
        ([("the","DET"),("dog","NOUN")], -16.594)]
assert np.abs(get_perplexity(data) - 1115.9062) < 0.001
print("Success!")
get_perplexity(data)

Success!


1115.9061629064486

#### Assignment 3.5
rubric={accuracy:3,quality:1}

Finally, we will implement hard EM. Here you will alternate between:

* the **E-step** where you tag both `train_input` and `test_input` using current HMM parameters, and
* the **M-step** where you retrain the HMM using the tagged output from the E-step.

We'll use perplexity as the stopping criterion. Generally, the M-step reduces perplexity. When this reduction `delta` is smaller than a threshold `PERPLEXITY_TH` (0.1), we will stop the EM algorithm. Start by initializing two variables `old_perplexity` and `delta` (i.e. the change in perplexity) to infinity (`float("inf")`). Also reinitialize an `HMM` object `hmm` using `vocab` and `tags`, and train it on `mini_train_set`.

At every step of the algorithm, you should first tag the entire training set `train_input` and the test set `test_input` using your current parameters for `hmm` (use `HMM.fast_viterbi_decode()` if it's available, or `HMM.greedy_decode()`, otherwise). 

You should then compute `perplexity` on the tagged output. Using your old and new perplexity value, compute an updated value for `delta`. You should also print the current `perplexity` and `delta`. 

If `delta` is less than `PERPLEXITY_TH`, you can stop. Otherwise, use the tagger output for `train_input` and `test_input` to retrain `hmm`.
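
The control flow of this loop can be sketched independently of the `HMM` class. Everything below is an assumption made for illustration: `hard_em` is a hypothetical helper, the model is only assumed to expose `decode(sentence)` returning a `(tagged_sentence, log_prob)` pair and `train(tagged_sentences)`, and `StubModel` is a stand-in whose log-probabilities improve by a shrinking amount after each retraining.

```python
def hard_em(model, inputs, perplexity_fn, threshold=0.1, max_iters=50):
    """Alternate E-steps (decode) and M-steps (retrain) until perplexity stops dropping."""
    old_perplexity = float("inf")
    history = []
    for _ in range(max_iters):
        decoded = [model.decode(sentence) for sentence in inputs]   # E-step
        perplexity = perplexity_fn(decoded)
        delta = old_perplexity - perplexity
        history.append((perplexity, delta))
        if delta < threshold:
            break
        model.train([tagged for tagged, _ in decoded])              # M-step
        old_perplexity = perplexity
    return history

class StubModel:
    """Stand-in for an HMM: every retraining raises the log-probability a bit less."""
    def __init__(self):
        self.log_prob = -8.0
        self.gain = 2.0

    def decode(self, sentence):
        return [(word, "X") for word in sentence], self.log_prob

    def train(self, tagged_sentences):
        self.log_prob += self.gain
        self.gain /= 4

def toy_perplexity(decoded):
    """Average per-token perplexity from (tagged_sentence, log2_prob) pairs."""
    per_token = [log_prob / len(sentence) for sentence, log_prob in decoded]
    return 2 ** (-sum(per_token) / len(per_token))

history = hard_em(StubModel(), [["a", "b"]], toy_perplexity)
for perplexity, delta in history:
    print(f"Perplexity {perplexity}, Delta {delta}")
```

In the lab itself, `inputs` would correspond to the concatenation of `train_input` and `test_input`, and the stub would be replaced by the trained `hmm`.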

The perplexity should continuously drop when using Viterbi, i.e. `delta` should always be positive (note that this is not necessarily true for `HMM.greedy_decode()`). The output of your code should look somewhat like this (exact numbers may vary):

```
Perplexity 93187.23582920311, Delta inf
Perplexity 1139.9238338813068, Delta 92047.31199532181
Perplexity 1091.1388085910141, Delta 48.78502529029265
Perplexity 1070.2039949884722, Delta 20.934813602541908
Perplexity 1058.0225338763507, Delta 12.181461112121497
Perplexity 1046.6394965626928, Delta 11.383037313657951
Perplexity 1038.5689868317825, Delta 8.070509730910317
Perplexity 1031.4118994196706, Delta 7.157087412111878
...
```

Note that it will take a while to run EM. Using `HMM.fast_viterbi_decode` will be much faster than using `HMM.viterbi_decode`. If you need to, you can increase `PERPLEXITY_TH` (for example to `10`) so that the algorithm runs for fewer iterations. However, your improvements in tagging accuracy will then also be smaller.

In [24]:
PERPLEXITY_TH = 0.1

# your code here

Perplexity 93187.23582920311, Delta inf
Perplexity 1138.5166673205285, Delta 92048.71916188259
Perplexity 1090.714779722392, Delta 47.801887598136545
Perplexity 1070.4186072568796, Delta 20.296172465512427
Perplexity 1058.3564657661252, Delta 12.062141490754357
Perplexity 1046.70582331659, Delta 11.650642449535098
Perplexity 1038.5945742586457, Delta 8.111249057944406
Perplexity 1031.4303413596956, Delta 7.164232898950104
Perplexity 1025.611318659956, Delta 5.819022699739662
Perplexity 1020.3892310824843, Delta 5.222087577471598
Perplexity 1015.9301294775604, Delta 4.459101604923944
Perplexity 1012.3090026168911, Delta 3.6211268606692784
Perplexity 1009.4388950559606, Delta 2.870107560930478
Perplexity 1007.3941124301826, Delta 2.044782625777998
Perplexity 1005.692491205116, Delta 1.701621225066674
Perplexity 1004.5079769338549, Delta 1.1845142712610368
Perplexity 1003.4261644342002, Delta 1.0818124996546885
Perplexity 1002.8497844898543, Delta 0.5763799443459448
Perplexity 1002.276679

Finally, tag `test_input` using your EM-trained `hmm` and evaluate accuracy. If you used Viterbi for decoding, you should get at least a few percentage points improvement in tagging accuracy (accuracy at least 52%). When using greedy decoding, results may actually be a bit worse than without hard EM training, so make sure to run your final results using Viterbi. 

Using soft EM, we could get larger improvements. 

In [25]:
# your code here for fast_viterbi_decode

56.61697411852475


In [26]:
# your code here for viterbi_decode

56.61697411852475


In [27]:
# your code here for greedy_decode

47.10324396771213


Johnson, M. (2007). Why Doesn’t EM Find Good HMM POS-Taggers? *Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)*, 296–305. http://www.aclweb.org/anthology/D/D07/D07-1031