# Tagging lab
## Viterbi
$k = 2$
The Markov assumption:
$$
\mathbb{P}(w_1, w_2, \ldots, w_n, l_1, l_2, \ldots, l_n) =
    \prod_{i=1}^n\mathbb{P}(l_i \ | \ l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$
The Viterbi function gives the optimal value:
$$
V(w_1, w_2, \ldots w_n) =
{\max}_{l_i\in L}
    \prod_{i=1}^n\mathbb{P}(l_i \ | l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$
For small $n$ (the start transition $\mathbb{P}(l_1 \ | \ l_0)$ is left out):
$$
V(w_1) = {\max}_{l_1\in L} \mathbb{P}(w_1 \ | \ l_1)
$$

$$
V(w_1, w_2) = {\max}_{l_1, l_2\in L} \mathbb{P}(w_1 \ | \ l_1) \cdot \mathbb{P}(l_2 \ | \ l_1) \cdot \mathbb{P}(w_2 \ | \ l_2)
$$

$$
V(w_1, w_2, w_3) = {\max}_{l_1, l_2, l_3\in L} V(w_1, w_2) \cdot \mathbb{P}(l_3 \ | \ l_2) \cdot \mathbb{P}(w_3 \ | \ l_3)
$$

$$
V(w_1, w_2, w_3, w_4) = {\max}_{l_1, l_2, l_3, l_4\in L} V(w_1, w_2, w_3) \cdot \mathbb{P}(l_4 \ | \ l_3) \cdot \mathbb{P}(w_4 \ | \ l_4)
$$
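
The recursion above only works if one best score is kept for each possible label of the most recent word, because the next transition depends on that label. The following is a minimal sketch of this idea on a made-up model: the matrices `emit` and `trans`, their index order, and the function names are assumptions of the example, not part of the lab data. It compares an explicit maximum over all label sequences with the word-by-word computation.

```python
import itertools
import numpy as np

# Hypothetical toy model with 2 labels and 3 words (all numbers are made up).
# emit[w, l]  = P(word w | label l)
# trans[p, l] = P(label l | previous label p)
emit = np.array([[0.5, 0.1],
                 [0.3, 0.3],
                 [0.2, 0.6]])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])

def brute_force(words):
    """Maximise the product over every label sequence explicitly."""
    n_labels = emit.shape[1]
    best = 0.0
    for labels in itertools.product(range(n_labels), repeat=len(words)):
        p = emit[words[0], labels[0]]
        for i in range(1, len(words)):
            p *= trans[labels[i - 1], labels[i]] * emit[words[i], labels[i]]
        best = max(best, p)
    return best

def dynamic(words):
    """Same maximum, keeping one best score per label of the last word."""
    pi = emit[words[0]].copy()
    for w in words[1:]:
        # pi[p] * trans[p, l] for all pairs (p, l), maximised over p
        pi = (pi[:, None] * trans).max(axis=0) * emit[w]
    return pi.max()

words = [0, 2, 1, 2]  # word indices into the rows of emit
print(brute_force(words), dynamic(words))
```

Both functions should print the same value; the brute-force version visits $|L|^n$ sequences, while the word-by-word version needs about $n\cdot|L|^2$ multiplications.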

## Implementing Viterbi
For $k=2$
### Task 1
Read the file [`umbc.casesensitive.word_pos.1M.txt`](http://sandbox.hlt.bme.hu/~gaebor/ea_anyag/python_nlp) line-by-line and make a vocabulary of words and labels. The file is in the format:

`"word\tpos\n"`

You need:
* a dict of words to indices
* a dict of labels to indices
* the reverse dicts: indices to words and indices to pos tags (labels)


Note that the file may contain errors: invalid character encoding and lines with a broken delimiter!

In [None]:
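dowi = {}  # dict of words to indices
doli = {}  # dict of labels to indices
with open('umbc.casesensitive.word_pos.1M.txt', 'r', encoding='utf8', errors='ignore') as f:
    for line in f:
        fields = line.strip().split('\t')
        if len(fields) != 2:
            continue  # skip lines with a broken delimiter
        word, label = fields
        # new items get the next free index, known items keep theirs
        dowi.setdefault(word, len(dowi))
        doli.setdefault(label, len(doli))

# reverse dicts: indices to words, indices to labels
doiw = {i: w for w, i in dowi.items()}
doil = {i: l for l, i in doli.items()}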

### Task 2
Obtain the following matrices:
* counts of words with pos tags
  * a $|V|\times |L|$ matrix of integers
  $$M(i,j) = \# \ i^\text{th} \text{ word with pos tag } j$$
  * an $|L|\times |L|$ matrix of integers
  $$N(i,j) = \# \ j^\text{th} \text{ pos directly after } i^\text{th} \text{ pos}$$

After that
* empirical probabilities
  * a $|V|\times |L|$ matrix of floats, estimating $\mathbb{P}(\text{word } i \ | \ \text{pos tag } j)$
  $$P_1(i,j) = \frac{\# \ i^\text{th} \text{ word with pos tag } j}{\# \ \text{pos tag } j}$$
  * an $|L|\times |L|$ matrix of floats, estimating $\mathbb{P}(\text{pos tag } j \ | \ \text{previous pos tag } i)$
  $$P_2(i,j) = \frac{\# \ j^\text{th} \text{ pos directly after } i^\text{th} \text{ pos}}{\# \ \text{pos tag } i}$$
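
Turning counts into empirical probabilities is a single broadcasted division. The sketch below shows it for a word-by-tag count matrix; the numbers and the names `M_toy`, `tag_counts` and `P1_toy` are made up for illustration.

```python
import numpy as np

# Made-up counts: 3 words (rows) x 2 pos tags (columns).
M_toy = np.array([[4, 0],
                  [1, 2],
                  [0, 6]])

tag_counts = M_toy.sum(axis=0)   # how often each tag occurs, one entry per column
P1_toy = M_toy / tag_counts      # broadcasting divides every column by its tag count

print(P1_toy)
print(P1_toy.sum(axis=0))        # each column is a probability distribution over words
```

True division of integer arrays already returns floats, so no explicit type conversion is needed.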


In [None]:
import numpy as np


V = len(dowi)
L = len(doli)

M = np.zeros((V, L), dtype=np.int32)
N = np.zeros((L, L), dtype=np.int32)

l_prev = None  # there is no previous label before the first valid line
with open('umbc.casesensitive.word_pos.1M.txt', 'r', encoding='utf8', errors='ignore') as f:
    for line in f:
        fields = line.strip().split('\t')
        if len(fields) != 2:
            continue  # skip lines with a broken delimiter
        word, label = fields
        M[dowi[word], doli[label]] += 1
        if l_prev is not None:
            # rows index the previous label, columns the current one
            N[doli[l_prev], doli[label]] += 1
        l_prev = label

In [None]:
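P1 = M / M.sum(axis=0)  # P1[w, l] = P(word w | label l)
# N[i, j] counts label i followed by label j, so each row is normalised:
# P2[i, j] = P(label j | previous label i); the maximum guards an all-zero row
P2 = N / np.maximum(N.sum(axis=1, keepdims=True), 1)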



### Task 3
Implement the pseudocode in the tagging lecture ($k=2$).
Use the global variables `P1` and `P2` (and the dictionaries).

It is useful to have a one-step Viterbi function:
$$
\mathrm{step}(\pi_{k-1}, v, word) = \max_{w\in L} \left(\pi(w, k-1)\cdot \mathbb{P}(v \ | \ w)\cdot \mathbb{P}(word \ | \ v)\right)
$$
Here $\pi_{k-1}$ is a vector, the previous column of the dynamic programming table, and $w$ ranges over the labels of the previous word.
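
The maximum over previous labels can also be written as one NumPy expression instead of a Python loop. This is a minimal sketch with made-up numbers; `step_vectorised` and its argument names are hypothetical and only illustrate the formula above.

```python
import numpy as np

def step_vectorised(previous, trans_col, emission):
    """max over w of previous[w] * P(v | w) * P(word | v), for one fixed label v.

    previous  -- vector pi(., k-1), one entry per previous label w
    trans_col -- vector holding P(v | w) for every previous label w
    emission  -- scalar P(word | v)
    """
    # the emission does not depend on w, so it can be multiplied in after the max
    return np.max(previous * trans_col) * emission

previous = np.array([0.10, 0.02, 0.30])   # made-up previous column
trans_col = np.array([0.5, 0.9, 0.1])     # made-up transition probabilities into v
print(step_vectorised(previous, trans_col, 0.2))
```

Replacing `np.max` with `np.argmax` on the same product gives the index of the best previous label, which is what backtracking needs.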

In [3]:
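def viterbi_step(previous, v, word):
    """best score of label index v for word index word, given the previous column of pi"""
    max_ = 0
    for i in range(L):  # i runs over the possible previous labels
        prod = previous[i] * P2[i, v] * P1[word, v]
        if prod > max_:
            max_ = prod
    return max_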

In [2]:
def viterbi(words):
    """from a list of strings returns the optimal probability"""
    num_words = len(words)
    # pi[v, k]: best probability of any labelling of words[:k+1] that ends in label v
    pi = np.zeros((L, num_words))

    # first column of pi: emission probabilities only
    pi[:, 0] = P1[dowi[words[0]], :]

    # step through the other columns
    for k in range(1, num_words):
        i = dowi[words[k]]  # raises KeyError for a word that is not in the vocabulary
        for v in range(L):
            pi[v, k] = viterbi_step(pi[:, k-1], v, i)

    return np.max(pi[:, -1])

### Task 4
Add backtracking with an extra table that stores the argmax instead of the max value.

In [None]:
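def viterbi_step(previous, v, word):
    """best score of label index v and the index of the previous label that achieves it"""
    max_ = 0
    max_i = 0
    for i in range(L):  # i runs over the possible previous labels
        prod = previous[i] * P2[i, v] * P1[word, v]
        if prod > max_:
            max_ = prod
            max_i = i
    return max_, max_i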

In [None]:
def viterbi(words):
    """from a list of strings returns the optimal probability and the optimal labels"""
    num_words = len(words)
    pi = np.zeros((L, num_words))
    # label_is[v, k]: index of the best previous label if words[k] gets label v
    label_is = np.zeros((L, num_words), np.int32)

    # first column of pi: emission probabilities only
    pi[:, 0] = P1[dowi[words[0]], :]

    # step through the other columns
    for k in range(1, num_words):
        i = dowi[words[k]]  # raises KeyError for a word that is not in the vocabulary
        for v in range(L):
            pi[v, k], label_is[v, k] = viterbi_step(pi[:, k-1], v, i)

    v_max = np.max(pi[:, -1])
    i_max = np.argmax(pi[:, -1])
    # backtrack label indices from the last word to the first
    indices = [i_max]
    for word_idx in range(num_words-1, 0, -1):
        indices.append(label_is[indices[-1], word_idx])

    indices.reverse()
    return v_max, [doil[int(i)] for i in indices]

In [None]:
viterbi("The guy from the pub smoked a cigarette .".split())
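
The probabilities in `pi` shrink with every word, so for long inputs the products can underflow to zero and all labellings look equally bad. A common remedy is to add logarithms instead of multiplying probabilities. The following is a minimal sketch of that variant on a made-up model; `viterbi_log`, the toy matrices, and their index order (`emit[w, l]`, `trans[p, l]`) are assumptions of the example, and it works on word indices rather than strings.

```python
import numpy as np

def viterbi_log(words, emit, trans):
    """Best log-probability and label indices for a list of word indices.

    emit[w, l]  = P(word w | label l)
    trans[p, l] = P(label l | previous label p)
    """
    with np.errstate(divide="ignore"):          # log(0) = -inf is acceptable here
        log_emit, log_trans = np.log(emit), np.log(trans)
    n, n_labels = len(words), emit.shape[1]
    pi = np.empty((n_labels, n))                # best log-score per label and position
    back = np.zeros((n_labels, n), dtype=int)   # argmax table for backtracking
    pi[:, 0] = log_emit[words[0]]
    for k in range(1, n):
        scores = pi[:, k - 1][:, None] + log_trans   # scores[p, l]
        back[:, k] = scores.argmax(axis=0)
        pi[:, k] = scores.max(axis=0) + log_emit[words[k]]
    # follow the stored argmax entries from the last word back to the first
    labels = [int(pi[:, -1].argmax())]
    for k in range(n - 1, 0, -1):
        labels.append(int(back[labels[-1], k]))
    labels.reverse()
    return pi[:, -1].max(), labels

# Made-up model: 3 words, 2 labels.
emit = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])

log_p, labels = viterbi_log([0, 2], emit, trans)
print(labels, np.exp(log_p))

# 2000 words: the plain product would underflow, the log-score stays finite
long_log_p, _ = viterbi_log([0, 2] * 1000, emit, trans)
print(long_log_p)
```

The same change could be applied to `viterbi` and `viterbi_step` by taking the logarithm of `P1` and `P2` once and replacing the products with sums.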