## Katona Máté (PD6YOR)


# Homework 2

The maximum score of this homework is 100 points. Grading is listed in this table:

| Grade | Score range |
| --- | --- |
| 5 | 85+ |
| 4 | 70-84 |
| 3 | 55-69 |
| 2 | 40-54 |
| 1 | 0-39 |

Most exercises include tests which should pass if your solution is correct.
However successful test do not guarantee that your solution is correct.
You are free to add more tests.

## Deadline
__2017 November 20<sup>th</sup> Monday 23:59__


# Main exercise (60 points)

Implement the Viterbi algorithm fot $k=3$.

The input can be found [here](http://sandbox.hlt.bme.hu/~gaebor/ea_anyag/python_nlp/).
You can use the 1, 10 or 100 million word corpus, it is advised to use the 1 million corpus while you are developing and you can try te larger ones afterwards.

Write a class called `Viterbi` with the following attributes:
* `__init__`: has no arguments, except `self`
* `train`: one argument (aside `self`), an iterable object which generates 2-tuples of strings `[(word1, pos1), (word2, pos2), ...]`
  * initializes the vocabularies, empirical probabilities or any other data attributes needed for the algorithm.
  * you can read the data (generator) only once
  * returns `None`
* predict: one argument (aside `self`), an iterable object, a list of words.
  * returns the predicted label sequence: a list of labels with the same length as the input.

Don't use global variables!

### Hint
* Use a 3-dimensional array for the transition probabilities $\mathbb{P}(v \ | \ w, u)$.
* Use a 3-dimensional array for the Viterbi table, one index for time, one index for the previous state and one index for the state before that.
* Same for the backtracking table

In [2]:
import numpy as np

In [114]:
class NotTrainedException(Exception):
    pass


class Viterbi(object):
    def __init__(self):
        self.trained = False
        
    def train(self, data):
        # copy and count data - ugly
        data = [(w, t) for w, t in data]
        dowi, doti = {}, {}
        for w, t in data:
            dowi[w] = dowi.get(w, len(dowi))
            doti[t] = doti.get(t, len(doti))
        self.doiw = {i: w for w, i in dowi.items()}
        self.doit = {i: t for t, i in doti.items()}
        self.dowi = dowi
        self.doti = doti
        self.W = len(dowi)  # number of words
        self.T = len(doti)  # number of tags
        # probs
        word_tags = np.zeros((self.W, self.T), dtype=np.int32)  # number of (word, tag) occurences
        tag_seqs = np.zeros((self.T, self.T, self.T), dtype=np.int32)  # number of (tag, tag, tag) occurences
        prev_ti = None
        prev_prev_ti = None
        for w, t in data:
            ti = doti[t]
            word_tags[dowi[w], ti] += 1
            if prev_ti is not None and prev_prev_ti is not None:
                tag_seqs[ti, prev_ti, prev_prev_ti] += 1
            prev_prev_ti = prev_ti
            prev_ti = ti
        self.word_tag_probs = word_tags / word_tags.sum(axis=0).astype(np.float32)
        self.tag_seq_probs = tag_seqs / tag_seqs.sum(axis=(1,2)).astype(np.float32)
        # enable prediction
        self.trained = True
        print("trained on {} words and {} tags".format(self.W, self.T))
        return
    
    def _step(self, viterbi_prev, p_tag, tag, wi):
            m, arg_m = 0., -1  # 0 is a valid index
            word_tag_prob = self.word_tag_probs[wi, tag]
            if word_tag_prob == 0:
                return 0., 0
            for pp_tag in range(self.T):
                # print('word: {}'.format(self.doiw[wi]))
                # print('tested tag seq: {}, {}, {}'.format(self.doit[pp_tag], self.doit[p_tag], self.doit[tag]))
                prod = (viterbi_prev[pp_tag, p_tag] * 
                        self.tag_seq_probs[tag, p_tag, pp_tag] * 
                        word_tag_prob)
                # print('prev[{}, {}] = {}'.format(pp_tag, p_tag, viterbi_prev[pp_tag, p_tag]))
                # print('tag_seq_prob: {}'.format(self.tag_seq_probs[tag, p_tag, pp_tag]))
                # print('word_tag_prob: {}'.format(word_tag_prob))
                # print('pi: {}'.format(prod))
                if prod > m:
                    m, arg_m = prod, pp_tag
                    print('new max found: {} for word {}, tag {}'.format(m, self.doiw[wi], self.doit[arg_m]))
            return m, arg_m
        
    def predict(self, words):
        # raise error if not trained yet    
        if not self.trained:
            raise NotTrainedException("Train me first!")
        # viterbi init
        N = len(words)  # length of sentence
        viterbi = np.zeros((N, self.T, self.T), dtype=np.float32)
        backpts = -np.ones((N, self.T, self.T), dtype=np.int32)  # 0 is a valid index
        # k=0 -> pi(0, *, *) = 1
        viterbi[0, :, :] = np.ones((self.T, self.T))
        # k=1 -> prev_prev_tag does not exist
        # set pi(1, *, v) = P(w_1|v)
        wi = self.dowi[words[1]]
        for tag in range(self.T):
            viterbi[1, :, tag] = self.word_tag_probs[wi, tag] * np.ones((self.T))
        # k=2...N
        for k in range(2, N):
            wi = self.dowi[words[k]]
            for tag in range(self.T):
                for prev_tag in range(self.T):
                    viterbi[k, prev_tag, tag], backpts[k, prev_tag, tag] = self._step(viterbi[k-1, :, :], prev_tag, tag, wi)
        # get max likelihood last tags
        print(backpts)
        max_prob = viterbi[-1, :, :].max()
        max_prob_tags = np.where(viterbi[-1, :, :] == max_prob)
        next_tag = max_prob_tags[0][0]
        next_next_tag = max_prob_tags[1][0]
        pred_tags_reversed = [self.doit[next_next_tag], self.doit[next_tag]]
        print(pred_tags_reversed)
        # backtrack max likelihood tags
        for k in range(N-1, 2, -1):
            pred_current_tag = backpts[k, next_tag, next_next_tag]
            pred_tags_reversed.append(self.doit[pred_current_tag])
            next_next_tag = next_tag
            next_tag = pred_current_tag
            print(pred_tags_reversed)
        # flip predicted tags
        pred_tags_reversed.reverse()
        return max_prob,  pred_tags_reversed

In [None]:
with open("umbc.casesensitive.word_pos.1M.txt", "r", encoding="utf8", errors="ignore") as f:
    generator1 = (line.strip().split("\t") for line in f)
    generator2 = (line for line in generator1 if len(line) == 2)
    
    viterbi = Viterbi()
    viterbi.train(generator2)

In [64]:
tags = {'the': 'DT',
        'cat': 'NN', 'dog': 'NN', 'man': 'NN',
        'goes': 'VBZ', 'sits': 'VBZ',
        'to': 'TO', 'on': 'IN', 
        'store': 'NN', 'chair': 'NN', 'bed': 'NN',
        '.': '.',
       }
sents = ('the cat goes to the store .',
         'the cat sits on the bed .',
         'the cat sits on the chair .',
         'the dog goes to the store .',
         'the dog sits on the bed .',
         'the dog sits on the chair .',
         'the man goes to the store .',
         'the man sits on the bed .',
         # 'the man sits on the chair .',
        )
dummy = [(word, tags[word]) for sent in sents for word in sent.split(' ')]

In [112]:
v = Viterbi()
v.train(dummy)

trained on 12 words and 6 tags


In [113]:
v.predict('the man sits on the chair .'.split(' '))

new max found: 0.041666666666666664 for word sits, tag DT
new max found: 0.013888889302810032 for word on, tag NN
new max found: 0.008680555620230734 for word the, tag VBZ
new max found: 0.0010850694961845875 for word chair, tag IN
new max found: 0.0005787037312984467 for word ., tag DT
[[[-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]]

 [[-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]
  [-1 -1 -1 -1 -1 -1]]

 [[ 0  0 -1  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0 -1  0  0  0]
  [ 0  0 -1  0  0  0]
  [ 0  0 -1  0  0  0]
  [ 0  0 -1  0  0  0]]

 [[ 0  0  0  0  0 -1]
  [ 0  0  0  0  0 -1]
  [ 0  0  0  0  0  1]
  [ 0  0  0  0  0 -1]
  [ 0  0  0  0  0 -1]
  [ 0  0  0  0  0 -1]]

 [[-1  0  0  0  0  0]
  [-1  0  0  0  0  0]
  [-1  0  0  0  0  0]
  [-1  0  0  0  0  0]
  [-1  0  0  0  0  0]
  [ 2  0  0  0  0  0]]

 [[ 0  5  0  0  0  0]
  [ 0 -1  0  0  0  0

IndexError: index 7 is out of bounds for axis 0 with size 7

In [None]:
assert(viterbi.predict("You talk the talk .".split()) == ['PRP', 'VB', 'DT', 'NN', '.'])
print(viterbi.predict("The dog .".split()))
print(viterbi.predict("The dog runs .".split()))
print(viterbi.predict("The dog runs slowly .".split()))
print(viterbi.predict("The dog 's run was slow .".split()))

## Small exercise 1. (10 points)
Modify the Viterbi class: use a logarithmic scale for probabilities.

In the Viterbi table instead of 
$$
\pi(k,u,v) = \max_{w\in L} \pi(k-1, w, u)\cdot \mathbb{P}(v \ | \ w,u)\cdot \mathbb{P}(w_k \ | \ v) 
$$
use
$$
\hat\pi(k,u,v) = \max_{w\in L} \hat\pi(k−1,w,u) + \log\mathbb{P}(v \ | \ w,u) + \log\mathbb{P}(w_k \ | \ v) 
$$

Note that the minimum probability is $0$, but the minimum logarithm is $-\infty$. Both numpy and python float can deal with minus infinity.<br>
Precalculate the log-probabilities in the initializer, not during the dymanic programming.

This should not affect the result, just the numbers in the viterbi table.

Name the log-scaled imlementation `ViterbiLog`, it can inherit from `Viterbi` or it can be a whole new class.

In [None]:
with open("umbc.casesensitive.word_pos.1M.txt", "r", encoding="utf8", errors="ignore") as f:
    generator1 = (line.strip().split("\t") for line in f)
    generator2 = (line for line in generator1 if len(line) == 2)

    viterbi_log = ViterbiLog()
    viterbi_log.train(generator2)

In [None]:
assert(viterbi.predict("The dog runs slowly .".split()) == viterbi_log.predict("The dog runs slowly .".split()))

## Small exercise 2. (30 points)
### a) 15 points
Modify the Viterbi class: use a sparse storage for transition probabilities, not a 3-dimensional array.

Use a dict to store the frequencies of the 2 and 3 tuples of labels.

For example if you had _"adjective noun"_ 10 times and _"adjective noun determinant"_ 5 times, then store the following

In [None]:
{('JJ', 'NN'): 10, ('JJ', 'NN', 'DT'): 5}

In the example $\mathbb{P}(DT \ | \ JJ, NN ) = 0.5$

Note that whenever $\#\{JJ, NN\} = 0$ or $\#\{JJ, NN, DT\} = 0$, then $\mathbb{P}(DT \ | \ JJ, NN ) = 0$.

Implement this in a new class `ViterbiSparse`, it can inherit from the original one or it can be a new class.

### b) 15 points
Try to find a sparse representation (with `dict`-s) which makes the inner for loop shorter. Note that you don't have to take the maximum over all the $w\in L$ elements, if you already know that some transition probabilities are zeros.

$$
\max_{\substack{w\in L \\ \mathbb{P}(v \ | \ w,u) > 0}} \pi(k-1, w, u)\cdot \mathbb{P}(v \ | \ w,u)\cdot \mathbb{P}(w_k \ | \ v)
$$