# Training a Hidden Markov Model (HMM)

## Spanish Corpus:

* AnCora | GitHub: https://github.com/UniversalDependencies/UD_Spanish-AnCora

* The corpus is read with the `conllu` parser: https://pypi.org/project/conllu/

* Universal POS tags (documentation): https://universaldependencies.org/u/pos/

In [1]:
#@title Prerequisites
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting conllu
  Downloading conllu-4.5.2-py2.py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-4.5.2
Cloning into 'UD_Spanish-AnCora'...
remote: Enumerating objects: 1054, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 1054 (delta 53), reused 70 (delta 53), pack-reused 984
Receiving objects: 100% (1054/1054), 388.11 MiB | 24.44 MiB/s, done.
Resolving deltas: 100% (746/746), done.


In [3]:
#@title Reading AnCora Corpus
from conllu import parse_incr
# Empty list (not used yet)
wordList = []
# Open the dev split read-only and print every sentence in CoNLL-U format;
# the `with` block closes the file when the loop ends
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
    for tokenlist in parse_incr(data_file):
        print(tokenlist.serialize())

Streaming output truncated to the last 5000 lines.
22	en	en	ADP	sps00	_	23	case	23:case	_
23	multitud	multitud	NOUN	ncfs000	Gender=Fem|Number=Sing	18	obl	18:obl	Entity=(CESSCASTAA200004128873c18--1-gstype:gen
24	de	de	ADP	sps00	_	25	case	25:case	_
25	vehículos	vehículo	NOUN	ncmp000	Gender=Masc|Number=Plur	23	nmod	23:nmod	Entity=(CESSCASTAA200004128873c11--1-gstype:gen
26	que	que	PRON	pr0cn000	PronType=Int,Rel	28	nsubj	28:nsubj	Entity=(CESSCASTAA200004128873c11--1-CorefType:ident,gstype:gen)
27	están	estar	AUX	vmip3p0	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	28	cop	28:cop	_
28	preparados	preparado	ADJ	aq0mpp	Gender=Masc|Number=Plur|VerbForm=Part	25	acl	25:acl	_
29	para	para	ADP	sps00	_	30	mark	30:mark	_
30	consumir	consumir	VERB	vmn0000	VerbForm=Inf	28	advcl	28:advcl	_
31	indistintamente	indistintamente	ADV	rg	_	32	advmod	32:advmod	Entity=(CESSCASTAA200004128873s4.sn.78--2-gstype:gen
32	combustible	combustible	NOUN	ncms000	Gender=Masc|Number=Sing	30	obj	30:obj




In [5]:
#@title Structure of the Corpus's Tagged Tokens - Checking One Token
tokenlist[1]

{'id': 2,
 'form': 'El',
 'lemma': 'el',
 'upos': 'DET',
 'xpos': 'da0ms0',
 'feats': {'Definite': 'Def',
  'Gender': 'Masc',
  'Number': 'Sing',
  'PronType': 'Art'},
 'head': 5,
 'deprel': 'det',
 'deps': [('det', 5)],
 'misc': {'Entity': '(CESSCASTP2002080117s20.sn.3--1-gstype:gen'}}

In [7]:
# Retrieving the Word & Grammatical Category of the Token Above
tokenlist[1]['form']+'|'+tokenlist[1]['upos']

'El|DET'
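To make the token structure above more concrete, here is a minimal sketch of how one tab-separated CoNLL-U line maps onto the ten field names. It uses only the standard library and is not how `conllu` is implemented (the real parser also converts `id`, `feats`, `head`, `deps` and `misc` into richer types). The names `CONLLU_FIELDS` and `split_conllu_line` are hypothetical helpers introduced for illustration.

```python
# The ten CoNLL-U columns, in file order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def split_conllu_line(line):
    """Map one tab-separated CoNLL-U token line to a dict (all values stay strings)."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(CONLLU_FIELDS, values))

# Token line copied from the corpus output printed earlier in this notebook.
line = "24\tde\tde\tADP\tsps00\t_\t25\tcase\t25:case\t_"
token = split_conllu_line(line)

# Same 'word|tag' string as in the cell above.
print(token["form"] + "|" + token["upos"])  # prints de|ADP
```

For real work the `conllu` parser remains the better choice, since it also handles comments, sentence boundaries and typed fields.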

## Model Training - Count Calculation:

* Tags `tagCountDict`: $C(tag)$
* Emissions (key `word|tag`) `emissionDict`: $C(word, tag)$
* Transitions (key `tag|prevtag`) `transitionDict`: $C(prevtag, tag)$

In [8]:
tagCountDict = {}
emissionDict = {}
transitionDict = {}

tagtype = 'upos'

# Count calculation (pre-probabilities)
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    prevtag = None  # reset at every sentence boundary
    for token in tokenlist:

      # C(tag)
      tag = token[tagtype]
      tagCountDict[tag] = tagCountDict.get(tag, 0) + 1

      # C(word, tag) -> emission counts, keyed as 'word|tag'
      wordtag = token['form'].lower()+'|'+tag
      emissionDict[wordtag] = emissionDict.get(wordtag, 0) + 1

      # C(prevtag, tag) -> transition counts, keyed as 'tag|prevtag'
      # The first token of a sentence has no previous tag, so it adds no transition
      if prevtag is not None:
        transitiontags = tag+'|'+prevtag
        transitionDict[transitiontags] = transitionDict.get(transitiontags, 0) + 1
      prevtag = tag
    
#transitionDict
#emissionDict
#tagCountDict
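The same three counts can be checked by hand on a tiny example. The sketch below uses `collections.Counter` on two made-up sentences (hypothetical data, not taken from AnCora) and the same `word|tag` and `tag|prevtag` key conventions as the cell above. The variable names are illustrative only.

```python
from collections import Counter

# Toy tagged sentences (hypothetical, not from the corpus).
toy_sentences = [
    [("El", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
    [("El", "DET"), ("perro", "NOUN"), ("come", "VERB")],
]

tag_counts = Counter()         # C(tag)
emission_counts = Counter()    # C(word, tag), key 'word|tag'
transition_counts = Counter()  # C(prevtag, tag), key 'tag|prevtag'

for sentence in toy_sentences:
    prevtag = None  # no transition into the first token of a sentence
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[word.lower() + "|" + tag] += 1
        if prevtag is not None:
            transition_counts[tag + "|" + prevtag] += 1
        prevtag = tag

print(tag_counts["DET"])              # prints 2
print(emission_counts["el|DET"])      # prints 2
print(transition_counts["NOUN|DET"])  # prints 2
```

A `Counter` returns 0 for unseen keys, which removes the need for the explicit "key already present" branch.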

## Model Training - Probability Calculation
* Transition Probabilities:
$$P(tag|prevtag) = \frac{C(prevtag, tag)}{C(prevtag)}$$

* Emission Probabilities:
$$P(word|tag) = \frac{C(word, tag)}{C(tag)}$$
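
As a quick worked example with made-up counts (assumed for illustration, not measured on AnCora): suppose `DET` occurs 10 times, is followed by `NOUN` 7 times, and the word "el" is tagged `DET` 4 times. The two estimates are then:

$$P(NOUN|DET) = \frac{C(DET, NOUN)}{C(DET)} = \frac{7}{10} = 0.7$$

$$P(el|DET) = \frac{4}{10} = 0.4$$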

In [9]:
transitionProbDict = {} # matrix A
emissionProbDict = {} # matrix B

# Transition Probabilities 
for key in transitionDict.keys():
  tag, prevtag = key.split('|')
  if tagCountDict[prevtag]>0:
    transitionProbDict[key] = transitionDict[key]/(tagCountDict[prevtag])
  else:
    print(key)

# Emission Probabilities 
for key in emissionDict.keys():
  word, tag = key.rsplit('|', 1)  # split on the last '|' in case the word itself contains one
  if emissionDict[key]>0:
    emissionProbDict[key] = emissionDict[key]/tagCountDict[tag]
  else:
    print(key)

transitionProbDict['ADJ|ADJ']
#emissionProbDict

0.030217452696978255
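
One simple sanity check on the estimates is that, for each tag, the emission probabilities add up to 1. The sketch below shows the idea on hypothetical counts in the same `word|tag` key format. The names `tag_counts`, `emission_counts` and `totals` are illustrative and not part of the notebook.

```python
from collections import defaultdict
import math

# Hypothetical counts, same key format as emissionDict ('word|tag').
tag_counts = {"DET": 4, "NOUN": 2}
emission_counts = {"el|DET": 3, "la|DET": 1, "gato|NOUN": 1, "casa|NOUN": 1}

# P(word|tag) = C(word, tag) / C(tag)
emission_probs = {}
for key, count in emission_counts.items():
    word, tag = key.rsplit("|", 1)
    emission_probs[key] = count / tag_counts[tag]

# Sum the emission probabilities per tag.
totals = defaultdict(float)
for key, prob in emission_probs.items():
    _, tag = key.rsplit("|", 1)
    totals[tag] += prob

for tag, total in totals.items():
    assert math.isclose(total, 1.0), (tag, total)
print(dict(totals))  # prints {'DET': 1.0, 'NOUN': 1.0}
```

The same check does not hold exactly for the transition probabilities as computed above: a tag that ends a sentence is counted in $C(prevtag)$ but has no outgoing transition, so the per-tag sums can fall slightly below 1.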

## Saving Model Parameters

In [10]:
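import numpy as np
# np.save pickles each dictionary inside a 0-d object array
np.save('transitionHMM.npy', transitionProbDict)
np.save('emissionHMM.npy', emissionProbDict)
# Reload the dictionary from disk; .item() unwraps the 0-d array back into a dict
transitionProbDict = np.load('transitionHMM.npy', allow_pickle=True).item()
transitionProbDict['ADJ|ADJ']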

0.030217452696978255
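
Because the parameters are plain dictionaries of string keys and floats, they could also be stored as JSON, which avoids the need for `allow_pickle` when loading. This is only a sketch of an alternative, not what the notebook does. The values and the `transitionHMM.json` file name are hypothetical, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Hypothetical parameter values in the notebook's 'tag|prevtag' key format.
transition_probs = {"NOUN|DET": 0.7, "ADJ|ADJ": 0.03}

path = os.path.join(tempfile.mkdtemp(), "transitionHMM.json")

# Save: ensure_ascii=False keeps accented Spanish characters readable in the file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(transition_probs, f, ensure_ascii=False)

# Load and verify the round trip.
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == transition_probs)  # prints True
```

JSON files are human-readable and portable across languages, while the `.npy` approach keeps everything inside NumPy.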