# PoS tagging using HMMs and the Viterbi algorithm

This notebook showcases a PoS tagger based on hidden Markov models (HMMs) and decoded with the Viterbi algorithm.

In [1]:
from HMM import HMM

import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import seaborn as sns
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

## First predictions in Basque and Spanish

Let's start by making our first predictions in Basque and Spanish, the two languages chosen for this exercise. The procedure is the same in both cases. We first create an object of the HMM class with a given name (we use this name to tell the languages apart). Then we load the .conllu file with the training data and train the model for the corresponding language. After that, we can make predictions. Each prediction returns two values: the tags obtained by applying Viterbi, and the log probability of that tag sequence.

In [2]:
basque_hmm = HMM("EUS")

print("Training the model: ", basque_hmm.name)
basque_hmm.train("UD_Basque-BDT/eu_bdt-ud-train.conllu")

basque_sentence = "Nire etxea oso handia da"
print("Tagging the sentence: ", basque_sentence)
basque_tags, basque_log_prob = basque_hmm.pos_tagging(basque_sentence)
print("POS: ", basque_tags)
print("Log probability: ", basque_log_prob)

Training the model:  EUS
Tagging the sentence:  Nire etxea oso handia da
POS:  [('nire', 'PRON'), ('etxea', 'NOUN'), ('oso', 'ADV'), ('handia', 'ADJ'), ('da', 'AUX')]
Log probability:  -1042.0



In [3]:
spanish_hmm = HMM("ESP")

print("Training the model: ", spanish_hmm.name)
spanish_hmm.train("./UD_Spanish-AnCora/es_ancora-ud-train.conllu")

spanish_sentence = "El gato Juan vive aqui"
print("Tagging the sentence: ", spanish_sentence)
spanish_tags, spanish_log_prob = spanish_hmm.pos_tagging(spanish_sentence)
print("POS: ", spanish_tags)
print("Log probability: ", spanish_log_prob)

Training the model:  ESP
Tagging the sentence:  El gato Juan vive aqui
POS:  [('el', 'DET'), ('gato', 'NOUN'), ('juan', 'PROPN'), ('vive', 'VERB'), ('aqui', 'NOUN')]
Log probability:  -31.0
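
The training step described above is not shown in this notebook. As a rough illustration only, the sketch below estimates log transition and emission probabilities by counting over a tiny hand-written CoNLL-U fragment. The function names (`read_sentences`, `estimate`), the `<s>` start symbol, the lowercasing of words, and the sample data are assumptions made for this example, not the actual implementation of the HMM class, and no smoothing is applied.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-written CoNLL-U fragment (hypothetical data, not taken from the treebanks).
CONLLU = """\
# text = El gato vive
1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_
2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tvive\tvivir\tVERB\t_\t_\t0\troot\t_\t_

# text = El perro come
1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_
2\tperro\tperro\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tcome\tcomer\tVERB\t_\t_\t0\troot\t_\t_
"""


def read_sentences(text):
    """Yield each sentence as a list of (lowercased form, UPOS tag) pairs."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain word lines; skip ranges ("1-2") and empty nodes ("1.1").
            if cols[0].isdigit():
                sentence.append((cols[1].lower(), cols[3]))
    if sentence:
        yield sentence


def estimate(sentences):
    """Return (log transition, log emission) tables from relative frequencies."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in sentences:
        prev = "<s>"  # assumed start-of-sentence symbol
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag

    def to_log(table):
        # Normalise each row of counts and take the natural logarithm.
        return {
            key: {x: math.log(c / sum(counts.values())) for x, c in counts.items()}
            for key, counts in table.items()
        }

    return to_log(transitions), to_log(emissions)


log_A, log_B = estimate(read_sentences(CONLLU))
print(log_A["<s>"])   # every sentence starts with DET
print(log_B["NOUN"])  # "gato" and "perro" are equally likely
```

Pairs that were never seen in training get no entry at all here; a real tagger needs some policy for them (a floor value, smoothing, or similar).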


## Testing the Viterbi algorithm

To train the HMMs, we use the train split of the Universal Dependencies treebanks. A held-out split can then be used to measure how accurate our models are; the cell below uses the dev split. We have implemented a test function that outputs several metrics.

In [5]:
print("Testing the model: ", spanish_hmm.name)
test_scores = spanish_hmm.test("./UD_Spanish-AnCora/es_ancora-ud-dev.conllu")

for metric, value in test_scores.items():
    print(f"{metric}: {value}")

print("---------------------")

print("Testing the model: ", basque_hmm.name)
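# Evaluate the Basque model on the Basque dev set. The EUS scores recorded below
# came from a run that called spanish_hmm.test on this file instead.
test_scores = basque_hmm.test("./UD_Basque-BDT/eu_bdt-ud-dev.conllu")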

for metric, value in test_scores.items():
    print(f"{metric}: {value}")

Testing the model:  ESP


Accuracy: 0.40991535671100365
Recall: 0.9177177622792699
Micro-averaged F1 score: 0.9449558806677109
Macro-averaged F1 score: 0.6338176712145135
---------------------
Testing the model:  EUS
Accuracy: 0.0011123470522803114
Recall: 0.3368869936034115
Micro-averaged F1 score: 0.4622457509055447
Macro-averaged F1 score: 0.1516984897255314
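
To make the metric names above concrete, here is a minimal token-level sketch in plain Python. The gold tags are hypothetical, and `token_metrics` is an illustrative helper, not the notebook's test function, which may aggregate differently (for example per sentence). In this single-label, token-level setting the micro-averaged F1 always equals accuracy, while the macro average weights every tag equally.

```python
from collections import Counter


def token_metrics(gold, pred):
    """Return (accuracy, micro F1, macro F1) for two aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    accuracy = sum(tp.values()) / len(gold)
    # Micro: pool the counts of all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return accuracy, micro, macro


# Hypothetical gold tags for "El gato Juan vive aqui" against the tags predicted earlier.
gold = ["DET", "NOUN", "PROPN", "VERB", "ADV"]
pred = ["DET", "NOUN", "PROPN", "VERB", "NOUN"]

accuracy, micro_f1, macro_f1 = token_metrics(gold, pred)
print(accuracy, micro_f1, macro_f1)
```

A single wrong token lowers the macro average more than the micro average here, because the missed ADV tag contributes an F1 of zero on its own.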



## Interpreting the Viterbi algorithm graphically

The Viterbi algorithm is a dynamic programming algorithm that finds the most likely state sequence of an HMM given a sequence of observations. At each time step it computes, for every state, the score of the best path that ends in that state, reusing the scores of the previous time step. Since we work with log probabilities, these scores are sums of log transition and log emission probabilities. The algorithm stores them in a matrix in which each row corresponds to a state (a PoS tag) and each column to a time step (a word of the sentence). Plotting this matrix lets us follow the decision process of the HMM.
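
The recurrence can be sketched in a few lines of NumPy. This is a minimal log-space version on a hand-made toy model, not the code of the HMM class; the function name `viterbi`, the three-tag model, and all probabilities below are assumptions chosen for illustration.

```python
import numpy as np


def viterbi(obs, log_pi, log_A, log_B):
    """Log-space Viterbi.

    obs:    list of observation indices
    log_pi: (S,) initial log probabilities
    log_A:  (S, S) log transitions, indexed [previous state, next state]
    log_B:  (S, V) log emissions, indexed [state, observation]
    """
    n_states, n_steps = log_A.shape[0], len(obs)
    V = np.full((n_states, n_steps), -np.inf)        # best score per state and step
    back = np.zeros((n_states, n_steps), dtype=int)  # best predecessor per cell

    V[:, 0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n_steps):
        # scores[i, j]: best path ending in state i at t-1, then moving to j.
        scores = V[:, t - 1][:, None] + log_A
        back[:, t] = np.argmax(scores, axis=0)
        V[:, t] = np.max(scores, axis=0) + log_B[:, obs[t]]

    # Follow the backpointers from the best final state.
    path = [int(np.argmax(V[:, -1]))]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    path.reverse()
    return V, path, float(np.max(V[:, -1]))


# Toy model (hypothetical numbers): states DET, NOUN, VERB; words el, gato, vive.
tags = ["DET", "NOUN", "VERB"]
log_pi = np.log(np.array([0.8, 0.1, 0.1]))
log_A = np.log(np.array([[0.05, 0.90, 0.05],
                         [0.10, 0.20, 0.70],
                         [0.50, 0.30, 0.20]]))
log_B = np.log(np.array([[0.90, 0.05, 0.05],
                         [0.05, 0.80, 0.15],
                         [0.05, 0.15, 0.80]]))

V, path, best = viterbi([0, 1, 2], log_pi, log_A, log_B)
print([tags[i] for i in path], best)
```

The matrix `V` has one row per state and one column per word, which is the same layout as the matrices plotted in this notebook.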

In [9]:
basque_hmm = HMM("EUS")

print("Training the model: ", basque_hmm.name)
basque_hmm.train("UD_Basque-BDT/eu_bdt-ud-train.conllu")

basque_sentence = "Nire etxea oso handia da"
viterbi = basque_hmm.pos_get_viterbi(basque_sentence)
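# Best log score in the final column; used to scale the colour range.
max_value = np.max(viterbi[:, -1])

# Rows are tags, columns are the words of the Basque sentence.
df = pd.DataFrame(viterbi, columns=basque_sentence.split(), index=basque_hmm.tags)
sns.heatmap(df, annot=True, cmap="YlGnBu", vmin=max_value*1.5, vmax=0)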

Training the model:  EUS
[[-1026. -1034. -1033.   -17. -1043.]
 [-1027. -1034. -1035. -1041. -1044.]
 [-1027. -1034.   -14. -1040. -1044.]
 [-1025. -1036. -1036. -1041.   -20.]
 [-1026. -1035. -1035. -1040. -1043.]
 [-1027. -1034. -1035. -1041. -1044.]
 [-1033. -2052. -1043. -1047. -1050.]
 [-1023.    -9. -1033. -1038. -1040.]
 [-2044. -2052. -2053. -2058. -2061.]
 [-1028. -1034. -1037. -1041. -1045.]
 [   -8. -1036. -1039. -1043. -1047.]
 [-1027. -1037. -1035. -1039. -1044.]
 [-2044. -2052. -2053. -2058. -2061.]
 [-2044. -2052. -2053. -2058. -2061.]
 [-2044. -2052. -1044. -2058. -2061.]
 [-1023. -1032. -1032. -1037.   -21.]
 [-1025. -1034. -1034. -1039. -1042.]]
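
One way to read the matrix printed above is to pick the highest-scoring row in each column. The sketch below does this for the Basque matrix, under the assumption that the rows follow the 17 Universal Dependencies UPOS tags in alphabetical order. This ordering is inferred, not confirmed by the notebook, although it reproduces the tags predicted earlier for this sentence. A per-column argmax is only a reading aid: in general the Viterbi path is recovered through backpointers and need not coincide with it.

```python
import numpy as np

# Assumed row order: the 17 UPOS tags sorted alphabetically (not confirmed by the notebook).
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

# The Basque matrix exactly as printed above (rows: tags, columns: words).
viterbi_matrix = np.array([
    [-1026, -1034, -1033,   -17, -1043],
    [-1027, -1034, -1035, -1041, -1044],
    [-1027, -1034,   -14, -1040, -1044],
    [-1025, -1036, -1036, -1041,   -20],
    [-1026, -1035, -1035, -1040, -1043],
    [-1027, -1034, -1035, -1041, -1044],
    [-1033, -2052, -1043, -1047, -1050],
    [-1023,    -9, -1033, -1038, -1040],
    [-2044, -2052, -2053, -2058, -2061],
    [-1028, -1034, -1037, -1041, -1045],
    [   -8, -1036, -1039, -1043, -1047],
    [-1027, -1037, -1035, -1039, -1044],
    [-2044, -2052, -2053, -2058, -2061],
    [-2044, -2052, -2053, -2058, -2061],
    [-2044, -2052, -1044, -2058, -2061],
    [-1023, -1032, -1032, -1037,   -21],
    [-1025, -1034, -1034, -1039, -1042],
], dtype=float)

words = "Nire etxea oso handia da".split()

# Index of the highest log score in each column, mapped to its tag.
best_rows = np.argmax(viterbi_matrix, axis=0)
best_tags = [UPOS[i] for i in best_rows]
print(list(zip(words, best_tags)))
```

In the last column the two top candidates are close (-20 against -21), which is the kind of near-tie that the heatmap makes visible.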


In [10]:
spanish_hmm = HMM("ESP")

print("Training the model: ", spanish_hmm.name)
spanish_hmm.train("./UD_Spanish-AnCora/es_ancora-ud-train.conllu")

spanish_sentence = "El gato Juan vive aqui"
viterbi = spanish_hmm.pos_get_viterbi(spanish_sentence)
max_value = np.max(viterbi[:, -1])

df = pd.DataFrame(viterbi, columns=spanish_sentence.split(), index=spanish_hmm.tags)
sns.heatmap(df, annot=True, cmap="YlGnBu", vmin=max_value*1.5, vmax=0)

[[-1026. -1029. -1028. -1036.   -19.]
 [-1023. -1031. -1027. -1033.   -25.]
 [-1027. -1033. -1030. -1036.   -21.]
 [-1027. -1036. -1031. -1035.   -27.]
 [-1025. -1035. -1029. -1034.   -30.]
 [   -4. -1031. -1030. -1035.   -23.]
 [-1034. -2048. -1038. -1043.   -35.]
 [-1024.    -4. -1031. -1036.   -16.]
 [-2044. -2048. -2048. -2053. -2056.]
 [-1033. -1040. -1038. -1042. -1045.]
 [-1026. -1031. -1029. -1035.   -24.]
 [-1026.    -9.    -9. -1032.   -19.]
 [-2044. -2048. -2048. -2053. -2056.]
 [-1027. -1038. -1031. -1036.   -30.]
 [-1034. -1033. -1039. -1041.   -34.]
 [-1026.   -13. -1030.   -12.   -17.]
 [-1025. -1036. -1029. -1034.   -31.]]


