
# Training a Hidden Markov Model (HMM)

## Spanish corpus

* AnCora | GitHub: https://github.com/UniversalDependencies/UD_Spanish-AnCora

* We use the conllu parser to read the corpus: https://pypi.org/project/conllu/

* Universal POS tags (documentation): https://universaldependencies.org/u/pos/

In [1]:
#@title Prerequisites
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting conllu
  Downloading conllu-4.5.2-py2.py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-4.5.2
Cloning into 'UD_Spanish-AnCora'...
remote: Enumerating objects: 928, done.
remote: Counting objects: 100% (181/181), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 928 (delta 128), reused 174 (delta 121), pack-reused 747
Receiving objects: 100% (928/928), 337.55 MiB | 28.03 MiB/s, done.
Resolving deltas: 100% (653/653), done.


In [2]:
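#@title Reading the AnCora corpus
from conllu import parse_incr

# Print every sentence of the dev split in CoNLL-U format.
# After the loop, `tokenlist` still holds the last sentence read.
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
    for tokenlist in parse_incr(data_file):
        print(tokenlist.serialize())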

Streaming output truncated to the last 5000 lines.
25.1	_	_	PRON	p	_	_	_	26:nsubj	Entity=(CESSCASTAA200007042484c7--1-CorefType:ident,gstype:gen)
26	tuvieron	tener	VERB	vmis3p0	Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin	22	acl	22:acl	_
27	que	que	SCONJ	cs	_	28	cc	28:cc	_
28	vender	vender	VERB	vmn0000	VerbForm=Inf	26	conj	26:conj	_
29	a	a	ADP	sps00	_	31	case	31:case	_
30	un	uno	DET	di0ms0	Definite=Ind|Gender=Masc|Number=Sing|PronType=Art	31	det	31:det	Entity=(CESSCASTAA200007042484s4.sn.73--2-gstype:gen
31	valor	valor	NOUN	ncms000	Gender=Masc|Number=Sing	28	obl	28:obl	_
32	muy	mucho	ADV	rg	_	33	advmod	33:advmod	_
33	inferior	inferior	ADJ	aq0cs0	Number=Sing	31	amod	31:amod	_
34-35	al	_	_	_	_	_	_	_	_
34	a	a	ADP	spcms	_	37	case	37:case	_
35	el	el	DET	_	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	37	det	37:det	_
36	de	de	ADP	sps00	_	37	case	37:case	Entity=(CESSCASTAA200007042484s4.sn.86--2-gstype:gen
37	compra	compra	NOUN	ncfs000	Gender=Fem|Number=Sing	33	nmod


In [3]:
#@title Structure of the tagged tokens in the corpus
tokenlist[1]

{'deprel': 'det',
 'deps': [('det', 5)],
 'feats': {'Definite': 'Def',
  'Gender': 'Masc',
  'Number': 'Sing',
  'PronType': 'Art'},
 'form': 'El',
 'head': 5,
 'id': 2,
 'lemma': 'el',
 'misc': {'Entity': '(CESSCASTP2002080117s20.sn.3--1-gstype:gen'},
 'upos': 'DET',
 'xpos': 'da0ms0'}

In [4]:
tokenlist[1]['form']+'|'+tokenlist[1]['upos']

'El|DET'

## Training the model: computing counts

* tags, `tagCountDict`: $C(tag)$
* emissions, `emissionDict` (keys of the form `word|tag`): $C(word, tag)$
* transitions, `transitionDict` (keys of the form `tag|prevtag`): $C(prevtag, tag)$

In [5]:
tagCountDict = {}
emissionDict = {}
transitionDict = {}

tagtype = 'upos'

# Compute the raw counts that the probabilities are derived from
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    prevtag = None  # a sentence-initial token has no previous tag
    for token in tokenlist:

      # C(tag)
      tag = token[tagtype]
      tagCountDict[tag] = tagCountDict.get(tag, 0) + 1

      # C(word, tag) -> emission counts, keyed as 'word|tag'
      wordtag = token['form'].lower()+'|'+tag
      emissionDict[wordtag] = emissionDict.get(wordtag, 0) + 1

      # C(prevtag, tag) -> transition counts, keyed as 'tag|prevtag'
      if prevtag is not None:
        transitiontags = tag+'|'+prevtag
        transitionDict[transitiontags] = transitionDict.get(transitiontags, 0) + 1
      prevtag = tag

#transitionDict
#emissionDict
#tagCountDict
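
The corpus output shown earlier includes multiword-token ranges such as `34-35 al` and empty nodes such as `25.1`, whose tag or form is `_`. The counting loop treats them like ordinary words, so `_` can end up in the count tables. The following is a minimal sketch of one way to skip them. It assumes that the conllu package parses such ids into tuples like `(34, '-', 35)` while ordinary ids stay plain integers, which is worth verifying against the installed version. The helper `is_syntactic_word` and the sample tokens are hypothetical and only mimic the token structure printed above.

```python
def is_syntactic_word(token):
    """Return True for ordinary tokens, whose id is a plain integer.

    Assumption: range ids (34-35) and empty-node ids (25.1) are
    represented as tuples, not integers.
    """
    return isinstance(token["id"], int)


# Hypothetical tokens modelled on the "inferior al ..." fragment of the output.
tokens = [
    {"id": 33, "form": "inferior", "upos": "ADJ"},
    {"id": (34, "-", 35), "form": "al", "upos": "_"},
    {"id": 34, "form": "a", "upos": "ADP"},
    {"id": 35, "form": "el", "upos": "DET"},
]

# Keep only the tokens that should contribute to the counts.
kept = [t for t in tokens if is_syntactic_word(t)]
tags = [t["upos"] for t in kept]
print(tags)
```

In the counting cell, the same check could be applied as the first statement inside `for token in tokenlist:` with a `continue`. Doing so would change the resulting counts, so any probabilities printed later in this notebook would need to be recomputed.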

## Training the model: computing probabilities
* transition probabilities:
$$P(tag|prevtag) = \frac{C(prevtag, tag)}{C(prevtag)}$$

* emission probabilities:
$$P(word|tag) = \frac{C(word, tag)}{C(tag)}$$

In [6]:
transitionProbDict = {} # matrix A
emissionProbDict = {} # matrix B

# transition Probabilities 
for key in transitionDict.keys():
  tag, prevtag = key.split('|')
  if tagCountDict[prevtag]>0:
    transitionProbDict[key] = transitionDict[key]/(tagCountDict[prevtag])
  else:
    print(key)

# emission Probabilities 
for key in emissionDict.keys():
  word, tag = key.split('|')
  if emissionDict[key]>0:
    emissionProbDict[key] = emissionDict[key]/tagCountDict[tag]
  else:
    print(key)

transitionProbDict['ADJ|ADJ']
#emissionProbDict

0.030217452696978255
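
A quick way to check estimates like these is to sum them per conditioning tag. The sketch below repeats the same counting and normalization on a toy corpus (hypothetical data, not taken from AnCora) with the same `word|tag` and `tag|prevtag` key conventions. Emission probabilities should sum to 1 for every tag. Transition probabilities can sum to less than 1, because $C(prevtag)$ also counts sentence-final occurrences that have no following tag.

```python
from collections import defaultdict

# Toy tagged corpus (hypothetical, for illustration only).
sentences = [
    [("el", "DET"), ("perro", "NOUN"), ("ladra", "VERB")],
    [("la", "DET"), ("casa", "NOUN")],
]

tag_count = defaultdict(int)
emission_count = defaultdict(int)    # keys: 'word|tag'
transition_count = defaultdict(int)  # keys: 'tag|prevtag'

for sentence in sentences:
    prevtag = None
    for word, tag in sentence:
        tag_count[tag] += 1
        emission_count[word + "|" + tag] += 1
        if prevtag is not None:
            transition_count[tag + "|" + prevtag] += 1
        prevtag = tag

# Maximum-likelihood estimates: divide each count by the count of the
# conditioning tag, which is the second field of the key in both tables.
emission_prob = {k: c / tag_count[k.split("|")[1]] for k, c in emission_count.items()}
transition_prob = {k: c / tag_count[k.split("|")[1]] for k, c in transition_count.items()}

# Sum the probabilities per conditioning tag.
emission_sums = defaultdict(float)
for key, p in emission_prob.items():
    emission_sums[key.split("|")[1]] += p

transition_sums = defaultdict(float)
for key, p in transition_prob.items():
    transition_sums[key.split("|")[1]] += p

print(dict(emission_sums))
print(dict(transition_sums))
```

Here `NOUN` ends one of the two sentences, so its outgoing transition mass is 0.5 rather than 1. Whether to correct for this, for example with an explicit end-of-sentence tag, is a modeling choice that the notebook does not make.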

## Saving the model parameters

In [7]:
import numpy as np
np.save('transitionHMM.npy', transitionProbDict)
np.save('emissionHMM.npy', emissionProbDict)
# Reload into a separate name and query the loaded copy to check the round trip
loadedTransitionProbDict = np.load('transitionHMM.npy', allow_pickle=True).item()
loadedTransitionProbDict['ADJ|ADJ']

0.030217452696978255
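
Because both parameter tables are plain dictionaries from strings to floats, they can also be stored as JSON, which avoids pickling and keeps the files human-readable. This is a minimal sketch of that alternative, not what the notebook does. The dictionary contents, the file name `transitionHMM.json`, and the temporary directory are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical parameters in the same 'tag|prevtag' key format used above.
transition_prob = {"NOUN|DET": 0.75, "ADJ|NOUN": 0.25}

# Write to a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "transitionHMM.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(transition_prob, f, ensure_ascii=False)

# Read the file back and compare with the original dictionary.
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == transition_prob)
```

For the emission table, `ensure_ascii=False` together with UTF-8 encoding keeps accented Spanish word forms readable in the file.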