# Tagging lab
## Viterbi
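We use $k = 2$, so each label depends only on the previous label.
Under the Markov assumption the joint probability of the words and the labels factorizes as:
$$
\mathbb{P}(w_1, w_2, \ldots w_n, \ l_1, l_2 \ldots l_n) =
    \prod_{i=1}^n\mathbb{P}(l_i \ | l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$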
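
As a minimal sketch of the product above, the following scores one labelled sequence. The toy tables `trans` and `emit`, the start symbol `<s>` and the function name `sequence_score` are assumptions made for illustration; the numbers are not estimated from the lab corpus.

```python
# Toy tables (illustrative numbers only).
# trans[(prev, cur)] = P(cur | prev), emit[(label, word)] = P(word | label)
trans = {("<s>", "D"): 0.8, ("<s>", "N"): 0.2,
         ("D", "N"): 0.9, ("D", "D"): 0.1,
         ("N", "N"): 0.5, ("N", "D"): 0.5}
emit = {("D", "the"): 0.5, ("N", "dog"): 0.25}

def sequence_score(words, labels, start="<s>"):
    """Product of P(l_i | l_{i-1}) * P(w_i | l_i) over the whole sequence."""
    score = 1.0
    prev = start
    for w, l in zip(words, labels):
        # pairs missing from the toy tables are treated as probability 0
        score *= trans.get((prev, l), 0.0) * emit.get((l, w), 0.0)
        prev = l
    return score

print(sequence_score(["the", "dog"], ["D", "N"]))
```

Every factor depends on at most two neighbouring labels, which is what makes the dynamic programming below possible.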
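The Viterbi function gives the optimal value, the score of the best label sequence:
$$
V(w_1, w_2, \ldots w_n) =
\max_{l_1, \ldots, l_n\in L}
    \prod_{i=1}^n\mathbb{P}(l_i \ | l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$
For small $n$-s (leaving out the start term $\mathbb{P}(l_1 \ | \ l_0)$):
$$
V(w_1) = \max_{l_1\in L} \mathbb{P}(w_1 \ | \ l_1)
$$

$$
V(w_1, w_2) = \max_{l_1, l_2\in L} \mathbb{P}(w_1 \ | \ l_1) \cdot \mathbb{P}(l_2 \ | \ l_1) \cdot \mathbb{P}(w_2 \ | \ l_2)
$$

For longer inputs the maximum is built up step by step. Let $\pi(k, v)$ be the best score of labelling $w_1, \ldots, w_k$ with last label $l_k = v$, so that $\pi(1, l_1) = \mathbb{P}(w_1 \ | \ l_1)$ and $V(w_1, \ldots, w_k) = \max_{v\in L}\pi(k, v)$. Then:
$$
\pi(2, l_2) = \max_{l_1\in L} \pi(1, l_1) \cdot \mathbb{P}(l_2 \ | \ l_1) \cdot \mathbb{P}(w_2 \ | \ l_2)
$$

$$
\pi(3, l_3) = \max_{l_2\in L} \pi(2, l_2) \cdot \mathbb{P}(l_3 \ | \ l_2) \cdot \mathbb{P}(w_3 \ | \ l_3)
$$

$$
\pi(4, l_4) = \max_{l_3\in L} \pi(3, l_3) \cdot \mathbb{P}(l_4 \ | \ l_3) \cdot \mathbb{P}(w_4 \ | \ l_4)
$$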
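
The step-by-step construction can be checked numerically on a toy model. The sketch below compares a brute-force maximum over all label sequences with a version that keeps only the best score per last label. The arrays `emit` and `trans`, their numbers and the function names are assumptions for illustration; the start term is left out, as in the formulas above.

```python
import itertools
import numpy as np

# Toy model with 3 words and 2 labels (illustrative numbers only).
# emit[w, l] = P(word w | label l), trans[v, u] = P(label v | previous label u)
emit = np.array([[0.6, 0.1],
                 [0.3, 0.2],
                 [0.1, 0.7]])
trans = np.array([[0.7, 0.4],
                  [0.3, 0.6]])

def brute_force(words):
    """Maximum of the product over all |L|^n label sequences."""
    n_labels = emit.shape[1]
    best = 0.0
    for labels in itertools.product(range(n_labels), repeat=len(words)):
        score = emit[words[0], labels[0]]
        for k in range(1, len(words)):
            score *= trans[labels[k], labels[k - 1]] * emit[words[k], labels[k]]
        best = max(best, score)
    return best

def dynamic(words):
    """Same maximum, keeping one best score per last label."""
    best_by_label = emit[words[0]].copy()
    for w in words[1:]:
        # (trans * best_by_label)[v, u] = P(v | u) * best score ending in u
        best_by_label = (trans * best_by_label).max(axis=1) * emit[w]
    return best_by_label.max()

words = [0, 2, 1, 2]
print(brute_force(words), dynamic(words))
```

The brute-force version needs $|L|^n$ products, the dynamic one about $n\cdot|L|^2$ multiplications, and both should agree up to floating-point rounding.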

## Implementing Viterbi
For $k=2$
### Task 1
Read the file [`umbc.casesensitive.word_pos.1M.txt`](http://sandbox.hlt.bme.hu/~gaebor/ea_anyag/python_nlp) line by line and build a vocabulary of words and labels. Each line of the file has the format:

`"word\tpos\n"`

You need:
* a dict of words to indices
* a dict of labels to indices
* a reverse dict of indices to words

Note that the file may contain errors in both the character encoding and the delimiters!

In [None]:
dowi = {}  # word -> index
doli = {}  # label -> index
doiw = {}  # index -> word
# errors='ignore' drops bytes that are not valid UTF-8
with open('umbc.casesensitive.word_pos.1M.txt', 'r', encoding='utf8', errors='ignore') as f:
    for line in f:
        tmp = line.strip().split('\t')
        # skip lines with a broken delimiter
        if len(tmp) != 2:
            continue
        w, l = tmp
        # setdefault assigns the next free index only to unseen keys
        i = dowi.setdefault(w, len(dowi))
        doli.setdefault(l, len(doli))
        doiw[i] = w

print(len(dowi), len(doli), len(doiw))
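
Before running on the full corpus, the indexing logic can be tried on a few in-memory lines. This is a minimal sketch: the sample lines and the helper name `build_vocab` are hypothetical and stand in for the real file.

```python
# Hypothetical sample lines standing in for the corpus file.
sample = ["The\tDT\n", "dog\tNN\n", "broken line\n",
          "barks\tVBZ\n", "The\tDT\n", "a\tb\tc\n"]

def build_vocab(lines):
    """Return (word->index, label->index, index->word), skipping malformed lines."""
    word_idx, label_idx = {}, {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not all(parts):
            continue  # wrong number of fields or an empty field
        word, label = parts
        word_idx.setdefault(word, len(word_idx))
        label_idx.setdefault(label, len(label_idx))
    idx_word = {i: w for w, i in word_idx.items()}
    return word_idx, label_idx, idx_word

word_idx, label_idx, idx_word = build_vocab(sample)
print(word_idx, label_idx, idx_word)
```

A quick consistency check is that `idx_word[word_idx[w]] == w` holds for every word and that the indices run from 0 to `len(word_idx) - 1` without gaps.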

### Task 2
Obtain the following matrices:
* counts of words with pos tags and of consecutive pos tags
  * a $|V|\times |L|$ matrix of integers
  $$M(i,j) = \# \ i^\text{th} \text{ word with pos tag } j$$
  * an $|L|\times |L|$ matrix of integers
  $$N(i,j) = \# \ i^\text{th} \text{ pos after } j^\text{th} \text { pos}$$

After that
* empirical probabilities
  * a $|V|\times |L|$ matrix of floats
  $$P_1(i,j) = \frac{\# \ i^\text{th} \text{ word with pos tag } j}{\# \ \text{pos tag } j}$$
  * an $|L|\times |L|$ matrix of floats
  $$P_2(i,j) = \frac{\# \ i^\text{th} \text{ pos after } j^\text{th} \text { pos}}{\# \ \text{pos tag } j}$$


In [None]:
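import numpy as np

V = len(dowi)
L = len(doli)

# counts must start from zero, so use np.zeros instead of np.empty
M = np.zeros((V, L), dtype=np.int32)
N = np.zeros((L, L), dtype=np.int32)

prev = None  # label index of the previous valid line
with open('umbc.casesensitive.word_pos.1M.txt', 'r', encoding='utf8', errors='ignore') as f:
    for line in f:
        tmp = line.strip().split('\t')
        if len(tmp) != 2:
            continue
        # doli is expected to map each label to its integer index
        i, j = dowi[tmp[0]], doli[tmp[1]]
        # M(i,j) = # ith word with pos tag j
        M[i, j] += 1
        # N(i,j) = # ith pos after jth pos
        if prev is not None:
            N[j, prev] += 1
        prev = j

# number of occurrences of each pos tag (every tag in doli occurs at least once)
tag_count = M.sum(axis=0)
# P1(i,j) = \frac{ # ith word with pos tag j }{ # pos tag j }
P1 = M / tag_count
# P2(i,j) = \frac{ # ith pos after jth pos }{ # pos tag j }
P2 = N / tag_count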
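
The four matrices can also be illustrated on a tiny index sequence without any Python-level loop. This is only a sketch: the arrays `word_ids` and `label_ids` and the sizes are made-up toy data, and `np.add.at` is one possible way to accumulate the counts.

```python
import numpy as np

# Hypothetical tagged sequence, already mapped to indices.
word_ids = np.array([0, 1, 2, 0, 1])    # e.g. The dog barks The dog
label_ids = np.array([0, 1, 2, 0, 1])   # e.g. DT  NN  VBZ   DT  NN
n_words, n_labels = 3, 3

M = np.zeros((n_words, n_labels), dtype=np.int64)
N = np.zeros((n_labels, n_labels), dtype=np.int64)
# np.add.at accumulates correctly even when an index pair repeats
np.add.at(M, (word_ids, label_ids), 1)
# N[i, j]: tag i occurs right after tag j
np.add.at(N, (label_ids[1:], label_ids[:-1]), 1)

tag_count = M.sum(axis=0)   # occurrences of each tag
P1 = M / tag_count          # column j is divided by the count of tag j
P2 = N / tag_count

print(P1)
print(P2)
```

Each column of `P1` sums to 1. A column of `P2` can sum to slightly less than 1, because the last token of the sequence has no successor.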

### Task 3
Implement the pseudocode from the tagging lecture ($k=2$).
Use the global variables `P1` and `P2` (and the dictionaries).

It is useful to have a one-step Viterbi function:
$$
\mathrm{step}(\pi_{k-1}, v, \mathrm{word}) = \max_{u\in L} \left(\pi_{k-1}(u)\cdot \mathbb{P}(v \ | \ u)\cdot \mathbb{P}(\mathrm{word} \ | \ v)\right)
$$
Here $\pi_{k-1}$ is a vector, the previous column of the dynamic table, and the maximum runs over the possible previous labels.
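
One possible shape for such a step is sketched below. It assumes that labels and words are given as integer indices, that `P1[word, label]` holds $\mathbb{P}(\mathrm{word} \ | \ \mathrm{label})$ and `P2[v, u]` holds $\mathbb{P}(v \ | \ u)$, and it passes the tables as parameters instead of using globals so that the example is self-contained. The toy numbers are made up.

```python
import numpy as np

def step(previous, v, word, P1, P2):
    """Maximum over previous labels u of previous[u] * P(v | u) * P(word | v).

    previous: previous column of the dynamic table, one entry per label
    v, word:  integer indices of the current label and the current word
    """
    # P(word | v) does not depend on u, so it can be pulled out of the max
    return np.max(previous * P2[v, :]) * P1[word, v]

# Toy tables (illustrative numbers only): 2 words, 2 labels.
P1 = np.array([[0.9, 0.2],
               [0.1, 0.8]])
P2 = np.array([[0.3, 0.6],
               [0.7, 0.4]])
previous = np.array([0.5, 0.25])
print(step(previous, v=1, word=1, P1=P1, P2=P2))
```

Calling the step once for every label `v` fills the next column of the table.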

In [None]:
def viterbi_step(previous, v, word):
    """one Viterbi step: previous is the previous column, v the current label, word the current word"""
    return 0.0

In [None]:
def viterbi(words):
    """from a list of strings returns the optimal probability"""
    return 0.0

### Task 4
Add the backtracking with an extra table, which stores the argmax, not the max value.
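
A minimal sketch of backtracking with a second table is shown below. It assumes integer word indices, `P1[word, label]` and `P2[v, u]` oriented as in Task 2, a non-empty input, and no start transition. The function name `viterbi_path` and the toy numbers are hypothetical.

```python
import numpy as np

def viterbi_path(word_ids, P1, P2):
    """Return (best probability, best label sequence) for a list of word indices."""
    n, n_labels = len(word_ids), P1.shape[1]
    pi = np.zeros((n, n_labels))                # max values
    back = np.zeros((n, n_labels), dtype=int)   # argmax: best previous label
    pi[0] = P1[word_ids[0]]
    for k in range(1, n):
        scores = P2 * pi[k - 1]                 # scores[v, u] = P(v | u) * pi[k-1, u]
        back[k] = scores.argmax(axis=1)
        pi[k] = scores.max(axis=1) * P1[word_ids[k]]
    # follow the stored argmax pointers back from the best final label
    labels = [int(pi[-1].argmax())]
    for k in range(n - 1, 0, -1):
        labels.append(int(back[k, labels[-1]]))
    return float(pi[-1].max()), labels[::-1]

# Toy tables (illustrative numbers only): 2 words, 2 labels.
P1 = np.array([[0.9, 0.2],
               [0.1, 0.8]])
P2 = np.array([[0.3, 0.6],
               [0.7, 0.4]])
prob, labels = viterbi_path([0, 1, 0], P1, P2)
print(prob, labels)
```

On long inputs the products can underflow. A common variation, not required by the task, is to store sums of log-probabilities instead of products.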

In [None]:
import numpy, scipy
from collections import defaultdict

In [None]:
with open("umbc.casesensitive.word_pos.1M.txt", "rb") as f:
    for line in f:
        parts = line.strip().split(b'\t')
        if len(parts) == 2:
            word = parts[0]
            pos = parts[1]
            pass