# Hidden Markov Models
This practice relies on training the model using the [Universal Dependencies 2.4](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2988#) corpus.

In [1]:
!curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988{/ud-treebanks-v2.4.tgz,/ud-documentation-v2.4.tgz,/ud-tools-v2.4.tgz}


[1/3]: https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988/ud-treebanks-v2.4.tgz --> ud-treebanks-v2.4.tgz
--_curl_--https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988/ud-treebanks-v2.4.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 16  326M   16 52.8M    0     0   682k      0  0:08:10  0:01:19  0:06:51  858k
curl: (18) transfer closed with 287118753 bytes remaining to read

[2/3]: https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988/ud-documentation-v2.4.tgz --> ud-documentation-v2.4.tgz
--_curl_--https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988/ud-documentation-v2.4.tgz
 71 73.7M   71 52.8M    0     0   678k      0  0:01:51  0:01:19  0:00:32  664k02:43  0:00:10  0:02:33  588k    0  0:01:58  0:00:46  0:01:12  411k
curl: (18) transfer closed with 21936762 bytes remaining to read

[

In [2]:
!tar -xf ud-treebanks-v2.4.tgz


gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Let `count`, `countB`, and `p`, respectively functions to count how many times a tag is in the corpus, how many times a tag is in the corpus given another previous one, and finally, the probability of a tag given another previous one.

In [9]:
def count(tag, corpus):
  count = 0
  for sentence in corpus:
    for token in sentence:
      if token.upos == tag:
        count+=1
  return count

def countB(tagi, tagi_1, corpus):
  count = 0
  for sentence in corpus:
    idx = 1
    while idx <= len(sentence):
      if sentence[idx-1].upos == tagi and sentence[idx-2].upos == tagi_1:
        count+=1    
      idx+=1
  return count

def p(tagi, tagi_1, corpus):
  num = countB(tagi_1, tagi, corpus)
  den = count(tagi, corpus)
  return num/den

Let's also define the training corpus and discover the tags inside it. 

Let's also build the transition matrix `tm` that gets the probability of change between tags.

In [10]:
import pyconll
import pyconll.util
import pandas

UD_ENGLISH_TRAIN = './ud-treebanks-v2.4/UD_English-LinES/en_lines-ud-train.conllu'

train = pyconll.load_from_file(UD_ENGLISH_TRAIN)
train = [train[2]]

tags = set()

for sentence in train:
  for token in sentence:
    print(token.lemma, token.upos)
    tags.add(token.upos)
    
print(countB('NOUN', 'DET', train))
    
tags = {tag: 0 for tag in tags}
tm = {tag: tags for tag in tags}

for rowtag in tm:
  for coltag in tm[rowtag]:
    tm[rowtag][coltag] = p(rowtag, coltag, train)

print('Transition Matrix')
print('-----------------')
print(pandas.DataFrame(tm))

some PRON
of ADP
the DET
content NOUN
in ADP
this DET
topic NOUN
may AUX
not PART
be AUX
applicable ADJ
to ADP
some DET
language NOUN
. PUNCT
0
Transition Matrix
-----------------
       PUNCT  ADJ  ADP  AUX  DET  PART  NOUN  PRON
ADJ      0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
ADP      1.0  1.0  1.0  1.0  1.0   1.0   1.0   1.0
AUX      0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
DET      0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
NOUN     0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
PART     0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
PRON     0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0
PUNCT    0.0  0.0  0.0  0.0  0.0   0.0   0.0   0.0


In [5]:
summ = 0
for x in tm:
    for y in tm[x]:
        summ += tm[x][y]
print(summ)

16.999999999999996


Let's also define two functions `tag_word_count` to tell how many times a word with a specific tag appears in the corpus, and `p_B` to define the probability of having such word with such tag in the corpus.

In [5]:
def tag_word_count(tag, word, corpus):
  count = 0
  for sentence in corpus:
    for token in sentence:
      if word == token.lemma and token.upos == tag:
        count += 1
  return count

def p_B(tag, word, corpus):
  num = tag_word_count(tag, word, corpus)
  den = count(tag, corpus)
  return num/den

Having the training functions, let's define `ctrl_phrase` as a testing phrase to check its B matrix.

In [6]:
ctrl_phrase = 'you will back the bill'.split()

mb = {word: {tag: 0 for tag in tags} for word in ctrl_phrase}

for word in ctrl_phrase:
  for sentence in train:
    for token in sentence:
      if token.lemma == word:
        mb[word][token.upos] = p_B(token.upos, word, train)

print(pandas.DataFrame(mb))

            you      will      back       the      bill
ADJ    0.000000  0.000000  0.000000  0.000000  0.000000
ADP    0.000000  0.000000  0.000000  0.000000  0.000000
ADV    0.000000  0.000000  0.024531  0.000722  0.000000
AUX    0.000000  0.052336  0.000000  0.000000  0.000000
CCONJ  0.000000  0.000000  0.000000  0.000000  0.000000
DET    0.000000  0.000000  0.000000  0.591573  0.000000
INTJ   0.000000  0.000000  0.000000  0.000000  0.000000
NOUN   0.000000  0.000443  0.001993  0.000000  0.000111
NUM    0.000000  0.000000  0.000000  0.000000  0.000000
PART   0.000000  0.000000  0.000000  0.000000  0.000000
PRON   0.097779  0.000000  0.000000  0.000000  0.000000
PROPN  0.000000  0.000000  0.000000  0.000000  0.000599
PUNCT  0.000000  0.000000  0.000000  0.000000  0.000000
SCONJ  0.000000  0.000000  0.000000  0.000000  0.000000
SYM    0.000000  0.000000  0.000000  0.000000  0.000000
VERB   0.000000  0.000169  0.000507  0.000000  0.000000
X      0.000000  0.000000  0.000000  0.000000  0

Finally, let's define a list of dictionaries `hmm` to save the Hidden Markov Model for this case.

In [7]:
hmm = []
for word in mb:
  argmax = {'p': 0, 'upos': '', 'word': word}
  for tag in mb[word]:
    if mb[word][tag] > argmax['p']:
      argmax = {'p': mb[word][tag], 'upos': tag, 'word': word}
  hmm.append(argmax)

for e in hmm:
  print(e)

{'p': 0.09777870043595599, 'upos': 'PRON', 'word': 'you'}
{'p': 0.052336448598130844, 'upos': 'AUX', 'word': 'will'}
{'p': 0.024531024531024532, 'upos': 'ADV', 'word': 'back'}
{'p': 0.591572799332499, 'upos': 'DET', 'word': 'the'}
{'p': 0.0005991611743559018, 'upos': 'PROPN', 'word': 'bill'}
