# HMM 

An HMM Part-of-Speech (POS) Tagger is a statistical model that assigns parts of speech to words in a sentence using the Hidden Markov Model (HMM) framework.

First we import our model with the needed dependencies:

In [1]:
from utils.conllu_dataloader import *
from model.hmm import HMMPOSTagger

In [2]:
import csv

def csv_to_list_of_lists(file_path):
    # Read every CSV row as a list of strings (one row per sentence).
    with open(file_path, 'r', newline='') as file:
        return [list(row) for row in csv.reader(file)]

def csv_to_list(file_path):
    # Read only the first CSV row, which holds a single flat list.
    with open(file_path, 'r', newline='') as file:
        return next(csv.reader(file))

In [3]:
sentences = csv_to_list_of_lists('./datasets/dataset_sentences.csv')
pos_tags = csv_to_list_of_lists('./datasets/dataset_pos_tags.csv')
vocabulary = csv_to_list('./datasets/dataset_vocab.csv')
tags = csv_to_list('./datasets/dataset_tags.csv')

tags.append('*')
tags.append("<STOP>")
vocabulary.append('*')
vocabulary.append("<STOP>")

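Initializing the model takes two parameters:

- `tags`: the list of possible tags, including the `*` start marker and the `<STOP>` end marker appended above
- `vocab`: the list of possible words, with the same two markers appended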

 

In [4]:
hmm = HMMPOSTagger(tags, vocabulary)

After initialization the model exposes the following attributes:

- `self.tags2idx`: dictionary mapping each tag to its index
- `self.idx2tags`: dictionary mapping each index back to its tag
- `self.tags`: set of tags in use
- `self.Q`: number of tags
- `self.vocab`: vocabulary
- `self.transition_counts`: matrix that stores the transition counts
- `self.emission_counts`: matrix that stores the emission counts
- `self.transition_probs`: matrix that stores the transition probabilities
- `self.emission_probs`: matrix that stores the emission probabilities
- `self.word_counts`: counter of each word
- `self.tag_counts`: counter of each tag
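
As a minimal sketch of how the two lookup dictionaries relate to each other, assuming they are built by enumerating the tag list (the internals of `HMMPOSTagger` are not shown here, and the `_demo` names are hypothetical):

```python
# Hypothetical tag list, including the start and stop markers.
tags_demo = ["NOUN", "VERB", "*", "<STOP>"]

# Map each tag to a position, and each position back to its tag.
tags2idx_demo = {tag: idx for idx, tag in enumerate(tags_demo)}
idx2tags_demo = {idx: tag for tag, idx in tags2idx_demo.items()}

# The number of tags, corresponding to Q in the attribute list.
Q_demo = len(tags_demo)
```

The two dictionaries are inverses of each other, so a tag can be converted to a matrix row or column and back without loss.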

# CONLLU DATALOADER

Converts CoNLL-U files to CSV. It generates four files: the sentences, the POS tags of each sentence, the full tag set, and the vocabulary.
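
The reading half of that conversion can be sketched as follows. This is an illustration only, not the code in `utils.conllu_dataloader`: the function name `read_conllu` is hypothetical, and it relies only on the standard CoNLL-U layout (tab-separated columns with the word form in column 2 and the universal POS tag in column 4, `#` comment lines, and a blank line between sentences).

```python
def read_conllu(lines):
    """Parse CoNLL-U lines into parallel lists of word forms and UPOS tags."""
    sentences, pos_tags = [], []
    words, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # a blank line closes the current sentence
            if words:
                sentences.append(words)
                pos_tags.append(tags)
                words, tags = [], []
            continue
        if line.startswith("#"):          # sentence-level comment, e.g. "# text = ..."
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue                      # skip multiword-token ranges and empty nodes
        words.append(cols[1])             # FORM column
        tags.append(cols[3])              # UPOS column
    if words:                             # input may not end with a blank line
        sentences.append(words)
        pos_tags.append(tags)
    return sentences, pos_tags


sample = [
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
demo_sentences, demo_tags = read_conllu(sample)
```

The two returned lists line up one to one, which is the shape that `csv_to_list_of_lists` reads back in the cells above.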

# Loader class: functions

## TRAIN FUNCTION

Function to train our HMM model

The train function has the following parameters:
- `sentences`: list of all the sentences in the training dataset
- `pos_tags`: list of the PoS tags for each of those sentences
- `change_vocab`: if true, updates the vocabulary by removing the least frequent words and mapping them all to `<UNK>`

First the model counts tag and word occurrences and stores them in two dictionaries.
The first, `transition_counts`, records how often each tag follows another tag; the second, `emission_counts`, records how often each word is associated with each tag.
The total number of occurrences of each tag and each word is kept in `tag_counts` and `word_counts` respectively.

At the start of every sentence we insert a `*` in the first position and a `<STOP>` in the last position. This lets the model take into account which tags occur on the first and last words of a sentence.

After processing all the sentences we check whether the vocabulary has to be changed. If so, we go through the `word_counts` dictionary, remove the words that appear fewer than 5 times, and add their counts to the `<UNK>` token. This lets the model handle words it has not seen before.
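
The counting pass described above can be sketched like this. It is a simplified stand-in for `HMMPOSTagger.train`, not its actual code: the function name `count_events` and the `min_count` parameter are assumptions, and rare words are mapped to `<UNK>` while counting rather than in a separate pass afterwards, which yields the same emission counts.

```python
from collections import Counter, defaultdict


def count_events(sentences, pos_tags, min_count=5):
    """Collect transition, emission, and tag counts with '*' and '<STOP>' markers."""
    word_counts = Counter(w for sent in sentences for w in sent)
    transition_counts = defaultdict(Counter)   # transition_counts[prev_tag][tag]
    emission_counts = defaultdict(Counter)     # emission_counts[tag][word]
    tag_counts = Counter()

    for words, tags in zip(sentences, pos_tags):
        prev = "*"                             # every sentence starts from '*'
        tag_counts[prev] += 1
        for word, tag in zip(words, tags):
            if word_counts[word] < min_count:  # rare words share one token
                word = "<UNK>"
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            tag_counts[tag] += 1
            prev = tag
        transition_counts[prev]["<STOP>"] += 1  # last tag transitions to '<STOP>'
    return transition_counts, emission_counts, tag_counts


toy_sentences = [["the", "dog"], ["the", "cat"]]
toy_tags = [["DET", "NOUN"], ["DET", "NOUN"]]
toy_trans, toy_emit, toy_tag_counts = count_events(toy_sentences, toy_tags, min_count=2)
```

In the toy run, "dog" and "cat" each appear once, so both are counted as `<UNK>` under the `NOUN` tag.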

Then, to tag sentences with the model, we have to turn the transition and emission counts into probabilities. This is done with the following formula:
$$
P(q_i|q_{i-1})={count(q_{i-1},q_i) \over count(q_{i-1})}
$$

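In the transition counts, $q_{i-1}$ is the previous tag and $q_i$ is the current tag. The emission probabilities are obtained with the same ratio, pairing each tag with a word instead of with another tag; in the standard HMM formulation this is $P(w_i|q_i)={count(q_i,w_i) \over count(q_i)}$, where $w_i$ is the current word and $q_i$ its tag.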
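
The count-to-probability step can be sketched as a row-wise normalization. This assumes nested dictionaries of the form `counts[context][outcome]` (a hypothetical layout), and it uses the row total as the denominator, which equals `count(q_{i-1})` whenever every occurrence of a tag is followed by another tag or by `<STOP>`.

```python
def normalize(counts):
    """Turn counts[context][outcome] into conditional probabilities P(outcome | context)."""
    probs = {}
    for context, row in counts.items():
        total = sum(row.values())
        probs[context] = {outcome: n / total for outcome, n in row.items()}
    return probs


# Hypothetical counts: DET was followed by NOUN three times and by ADJ once.
toy_transition_probs = normalize({"DET": {"NOUN": 3, "ADJ": 1}})
```

Each row of the result sums to 1, so it can be read directly as a distribution over the next tag (or, for emissions, over the words of a tag).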


After running the train function we will have two matrices:
 1) transition probability matrix
 2) emission probability matrix


In [5]:
hmm.train(sentences, pos_tags, change_vocab = True)

If we want to inspect the transition and emission probabilities we can check them like this:

In [6]:
print(hmm.transition_probs.items())
print(hmm.emission_probs.items())

dict_items([(18, defaultdict(<class 'float'>, {3: 0.04433281639808509, 8: 0.060818555727867304, 1: 0.10322972152922932, 15: 0.0812150225878228, 6: 0.10066752073359854, 12: 0.021070730227226754, 4: 0.034320005394106935, 2: 0.052491403142067294, 9: 0.22739532061223114, 10: 0.049052659968983885, 13: 0.00300047198435709, 11: 0.06375160137549726, 14: 0.025622007956307733, 7: 0.035027981929741756, 5: 0.03371316836356281, 0: 0.0051581147596251094, 16: 0.05512103027442519, 17: 0.004011867035263974})), (3, defaultdict(<class 'float'>, {8: 0.540289126335638, 1: 0.061345065996228784, 17: 0.0010999371464487744, 11: 0.00785669390320553, 3: 0.054682589566310495, 7: 0.03909490886235072, 2: 0.07272155876807039, 5: 0.13060967944688875, 0: 0.027341294783155248, 6: 0.004116907605279698, 4: 0.01891891891891892, 14: 0.007982401005656819, 15: 0.01272784412319296, 10: 0.003959773727215588, 9: 0.008610936517913262, 12: 0.003802639849151477, 13: 0.0008485229415461973, 16: 0.0005971087366436204, '<STOP>': 0.003

# VITERBI ALGORITHM

This algorithm finds the most probable sequence of PoS tags for a sentence using the HMM that we have trained.
It takes a single parameter:
- `sentence`: the sentence we have to predict the PoS tags of


First, we look up each word of the sentence in the vocabulary. If a word is not in the vocabulary, we replace it with the `<UNK>` token.
Next, we apply the algorithm. To do that, we define two variables: the viterbi matrix, where we store the probabilities, and the backpointer, where we store the most probable path. For each position $t$ and each tag $q$ we compute the probability with the following formula:

$$viterbi[q,t]=\max_{q'} viterbi[q',t-1] \cdot A_{[q',q]} \cdot B_{[q,t]}$$

Where $A$ is the transition matrix and $B$ is the emission matrix, so $A_{[q',q]}$ is the probability of moving from tag $q'$ to tag $q$ and $B_{[q,t]}$ is the probability of tag $q$ emitting the word at position $t$.

The $q'$ that gives the maximum probability is stored in the backpointer. Following the backpointers from the last position recovers the best tag sequence. The viterbi function returns the input sentence (with unknown words replaced by `<UNK>`) and the PoS tags predicted by the algorithm.
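
A compact sketch of this recurrence, for illustration only: it is not `HMMPOSTagger.viterbi_alg`. The function name `viterbi_sketch` and the nested dictionary layout (`trans[prev_tag][tag]`, `emit[tag][word]`) are assumptions, and missing entries are treated as probability 0.

```python
def viterbi_sketch(words, tags, trans, emit):
    """Return the most probable tag sequence for `words` under the given HMM."""
    if not words:
        return []

    def a(prev, cur):                      # transition probability A[prev, cur]
        return trans.get(prev, {}).get(cur, 0.0)

    def b(tag, word):                      # emission probability B[tag, word]
        return emit.get(tag, {}).get(word, 0.0)

    # Initialization: transitions out of the '*' start marker.
    best = [{q: a("*", q) * b(q, words[0]) for q in tags}]
    back = [{}]

    # Recursion: keep, for every tag, the best previous tag and its score.
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for q in tags:
            prev = max(tags, key=lambda p: best[t - 1][p] * a(p, q))
            best[t][q] = best[t - 1][prev] * a(prev, q) * b(q, words[t])
            back[t][q] = prev

    # Termination: include the transition into '<STOP>'.
    last = max(tags, key=lambda q: best[-1][q] * a(q, "<STOP>"))

    # Follow the backpointers from the last position to the first.
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


toy_tags = ["DET", "NOUN", "VERB"]
toy_trans = {
    "*": {"DET": 0.8, "NOUN": 0.2},
    "DET": {"NOUN": 1.0},
    "NOUN": {"VERB": 0.6, "<STOP>": 0.4},
    "VERB": {"<STOP>": 1.0},
}
toy_emit = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "barks": 0.5},
    "VERB": {"barks": 1.0},
}
toy_path = viterbi_sketch(["the", "dog", "barks"], toy_tags, toy_trans, toy_emit)
```

The sketch multiplies raw probabilities to mirror the formula; on long sentences the products can underflow, and summing log-probabilities is the usual way to avoid that.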

In [7]:
sentence = ['Jeremy','loves','NLP']
hmm.viterbi_alg(sentence)

(['<UNK>', 'loves', '<UNK>'], ['PROPN', 'VERB', 'NOUN'])

# EVALUATE FUNCTION

This function runs Viterbi on every sentence of the test dataset and compares the predicted tags with the reference tags to calculate accuracy and F1.

In [11]:
test = [['Jeremy', 'Loves', 'NLP'],
        ['How', 'do', 'people', 'live', 'in', 'houses'],
        ['Those','in', 'power', 'have', 'little', 'interest', 'in', 'education']
]
tags = [['PROPN', 'VERB', 'NOUN'],
        ['ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'NOUN'],
        ['PRON', 'ADP', 'NOUN', 'VERB', 'ADJ', 'NOUN', 'ADP', 'NOUN']]
        
acc, pred_tags = hmm.evaluate(test, tags)
print(f'Accuracy: %{acc*100}')
print(f'Sentences given: {test}')
print(f'Pred_tags: {pred_tags}')

Accuracy: %94.11764705882352
Sentences given: [['Jeremy', 'Loves', 'NLP'], ['How', 'do', 'people', 'live', 'in', 'houses'], ['Those', 'in', 'power', 'have', 'little', 'interest', 'in', 'education']]
Pred_tags: [['PROPN', 'VERB', 'NOUN'], ['ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'NOUN'], ['PRON', 'ADP', 'NOUN', 'AUX', 'ADJ', 'NOUN', 'ADP', 'NOUN']]
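
The accuracy figure above can be reproduced by hand. The sketch below assumes token-level accuracy (matching tags divided by total tokens); the helper name `token_accuracy` is hypothetical and is not part of `HMMPOSTagger`. The gold and predicted tags are copied from the cell and its output.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


gold_tags = [['PROPN', 'VERB', 'NOUN'],
             ['ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'NOUN'],
             ['PRON', 'ADP', 'NOUN', 'VERB', 'ADJ', 'NOUN', 'ADP', 'NOUN']]
predicted_tags = [['PROPN', 'VERB', 'NOUN'],
                  ['ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'NOUN'],
                  ['PRON', 'ADP', 'NOUN', 'AUX', 'ADJ', 'NOUN', 'ADP', 'NOUN']]

demo_acc = token_accuracy(gold_tags, predicted_tags)
```

The only mismatch is `have`, tagged `AUX` instead of `VERB`, so 16 of the 17 tokens are correct, which agrees with the reported accuracy.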
