# Hidden Markov Model for Part-of-Speech tagging

Part-of-speech (POS) tagging is the process of labeling each word in a text with its part of speech, or word class. In English, the word classes (lexical categories) include *nouns*, *verbs*, *adjectives* and *adverbs*. POS tagging is an essential early step in an NLP pipeline. One well-known use is word disambiguation: a word can have different meanings depending on its definition and its relationship with the words surrounding it in a sentence. For instance, in the following two sentences, the word "*book*" is a *noun* and a *verb*, respectively. 

"*They like the new book.*" 
<br>
"*They book a table for dinner.*" 

POS tagging also provides features for parsing, coreference resolution and relation extraction. Example applications include text-to-speech systems and translation. 

In this notebook, we will use the MASC tagged corpus to train a Hidden Markov Model tagger. 
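
To make the idea concrete before training on real data, here is a minimal sketch of how an HMM tagger picks a tag sequence with Viterbi decoding. All probabilities and the coarse tag names (`PRON`, `VERB`, `DET`, `NOUN`) are hand-picked assumptions for illustration only; they are not estimated from MASC and do not reflect its tag set.

```python
# Toy HMM: every number below is an illustrative assumption, not a MASC estimate.
tags = ["PRON", "VERB", "DET", "NOUN"]
start = {"PRON": 0.6, "VERB": 0.1, "DET": 0.2, "NOUN": 0.1}
trans = {  # trans[prev][cur] = P(cur tag | prev tag)
    "PRON": {"PRON": 0.05, "VERB": 0.8, "DET": 0.05, "NOUN": 0.1},
    "VERB": {"PRON": 0.1, "VERB": 0.1, "DET": 0.6, "NOUN": 0.2},
    "DET": {"PRON": 0.05, "VERB": 0.05, "DET": 0.05, "NOUN": 0.85},
    "NOUN": {"PRON": 0.1, "VERB": 0.4, "DET": 0.2, "NOUN": 0.3},
}
emit = {  # emit[tag][word] = P(word | tag); missing words count as 0
    "PRON": {"they": 1.0},
    "VERB": {"book": 0.3, "like": 0.7},
    "DET": {"the": 0.6, "a": 0.4},
    "NOUN": {"book": 0.5, "table": 0.5},
}


def viterbi(words):
    """Return the most probable tag sequence for a list of lowercase words."""
    # Each column maps tag -> (probability of the best path ending in tag, previous tag).
    best = [{t: (start[t] * emit[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans[p][t] * emit[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag to recover the path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["they", "like", "the", "book"]))
print(viterbi(["they", "book", "a", "table"]))
```

With these toy numbers, "book" comes out as a noun after a determiner and as a verb after a pronoun, which is the kind of context-dependent choice the trained tagger below is meant to make.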


In [1]:
## Initialization
#
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import masc_tagged
#nltk.download("masc_tagged")


In [3]:
## Use MASC tag labeled corpus for supervised training
##


s_labeled = masc_tagged.tagged_sents()  # List of sentences in word-tag pairs as training data

## Split data into 80%-20% as training and test sets
#
k = round(0.8*len(s_labeled))

s_train = list(s_labeled[:k])   # first 80% of the sentences
s_test = list(s_labeled[k:])    # remaining 20%

# print('Original : ', len(s_labeled), '\n',
#       'Training : ', len(s_train), '\n',
#       'Testing : ', len(s_test))


In [None]:
## Use Radio planet text as unlabeled corpus for unsupervised learning  
#

# with open(r"C:/Users/meiye/radio_planet_tokens.txt", encoding='utf8') as fobj:
#     L = fobj.readlines()
    
# s_unlabeled = [word_tokenize(line) for line in L]
# s_unlabeled = [[(word, None) for word in sent] for sent in s_unlabeled]


# train_unlabeled = [sen for i, sen in enumerate(s_unlabeled) if i < 900]
# test_unlabeled = [sen for i, sen in enumerate(s_unlabeled) if i >= 900]

# trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)

In [5]:
## Maximum Likelihood Estimation vs. Lidstone smoothing
#
from nltk.util import unique_list
from nltk.tag import HiddenMarkovModelTrainer
from nltk.probability import MLEProbDist, LidstoneProbDist

symbols = unique_list(word for sent in s_train for word, tag in sent)
tag_set = unique_list(tag for sent in s_train for word, tag in sent)
# print('sym_len : ', len(symbols), '\n', 'tagset_len : ', len(tag_set))


## Extend symbols with those in the unlabelled set (for semisupervised learning with additional unlabeled text)
# symbols = unique_list(symbols + unique_list(word for sent in train_unlabeled for word, tag in sent))
# print('ext_sym_len : ', len(symbols))

trainer = HiddenMarkovModelTrainer(tag_set, symbols)

## Each estimator turns a frequency distribution (and its bin count) into a probability distribution.
## MLE uses raw relative frequencies; Lidstone adds gamma to every count before normalizing.
#
hmm_mle = trainer.train_supervised(s_train, estimator=lambda fd, bins: MLEProbDist(fd))
hmm_lid00 = trainer.train_supervised(s_train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0, bins))
hmm_lid01 = trainer.train_supervised(s_train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
hmm_lid05 = trainer.train_supervised(s_train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.5, bins))
hmm_lid08 = trainer.train_supervised(s_train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.8, bins))

In [6]:
print('MLE : ', hmm_mle.evaluate(s_test))
print('Lidstone gamma=0 : ', hmm_lid00.evaluate(s_test))
print('Lidstone gamma=0.1 : ', hmm_lid01.evaluate(s_test))
print('Lidstone gamma=0.5 : ', hmm_lid05.evaluate(s_test))
print('Lidstone gamma=0.8 : ', hmm_lid08.evaluate(s_test))

MLE :  0.48726554562370855
Lidstone gamma=0 :  0.48726554562370855
Lidstone gamma=0.1 :  0.8488076075664692
Lidstone gamma=0.5 :  0.8339669166752286
Lidstone gamma=0.8 :  0.8240464856102377
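
The identical scores for MLE and Lidstone with gamma=0 are expected, because the Lidstone estimate (count + gamma) / (N + gamma * bins) reduces to the relative frequency count / N when gamma is 0. A likely reason for the large jump at gamma=0.1 is that unsmoothed estimates give probability zero to any word not seen with a tag during training. The sketch below illustrates the formula in plain Python; the counts, the `bins` value, and the helper name `lidstone_prob` are made up for illustration and are not taken from MASC or NLTK.

```python
from collections import Counter


def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: (count + gamma) / (N + gamma * bins)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)


# Hypothetical emission counts for a single tag (not real corpus counts).
noun_counts = Counter({"book": 3, "table": 1})
bins = 4  # assumed vocabulary size: two seen words plus two unseen ones

for gamma in (0, 0.1, 0.5):
    seen = lidstone_prob(noun_counts, "book", gamma, bins)
    unseen = lidstone_prob(noun_counts, "dinner", gamma, bins)
    print(f"gamma={gamma}: P(book)={seen:.3f}, P(dinner)={unseen:.3f}")
```

Larger gamma values move more probability mass from seen to unseen words, which is consistent with accuracy peaking at a small gamma and falling off slightly at 0.5 and 0.8 in the results above.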
