In [1]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to /home/jan/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

# HMM in NLTK

Hidden Markov Models (HMMs) are largely used to assign the correct label
sequence to sequential data or to assess the probability of a given label and
data sequence. These models are finite state machines characterised by a
number of states, transitions between these states, and output symbols emitted
while in each state. The HMM is an extension of the Markov chain. In a Markov
chain, each state corresponds deterministically to a given observable event;
in the HMM, the observation is a probabilistic function of the state. HMMs
share the Markov chain's assumption that the probability of a transition from
one state to another depends only on the current state, i.e. the series of
states that led to the current state is not used. They are also time
invariant: the transition probabilities do not change from one step to the
next.
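
The Markov assumption and time invariance can be made concrete with a small sketch. The two weather states and all probabilities below are invented for illustration; they are not taken from NLTK or from the treebank corpus, and `sequence_probability` is a hypothetical helper. The probability of a state sequence is the initial probability of the first state times one transition probability per step, and the same transition table is reused at every step.

```python
# Toy first-order Markov chain (hypothetical states and probabilities).
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sequence_probability(states):
    """Probability of a non-empty state sequence under the chain above."""
    prob = initial[states[0]]
    # Each step looks only at the previous state (Markov assumption) and
    # always uses the same table (time invariance).
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob


# P(sunny) * P(sunny -> sunny) * P(sunny -> rainy)
print(sequence_probability(["sunny", "sunny", "rainy"]))
```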

The HMM is a directed graph with probability-weighted edges (representing the
probability of a transition between the source and sink states) where each
vertex emits an output symbol when entered. The symbol (or observation) is
non-deterministically generated. For this reason, knowing that a sequence of
output observations was generated by a given HMM does not mean that the
corresponding sequence of states (and what the current state is) is known.
This is the 'hidden' in the hidden Markov model.
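
A small generative sketch shows why the states stay hidden. The tags, words, and probabilities are made up for illustration, and `draw` and `sample` are hypothetical helpers, not NLTK functions. An observer would see only the emitted words; the state path that produced them is returned separately here, but would normally be unknown. Because a word such as `fish` can be emitted from either state, the observation alone does not identify the state.

```python
import random

# Hypothetical toy HMM: two hidden states, three output symbols.
initial = {"N": 0.7, "V": 0.3}
transition = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emission = {
    "N": {"fish": 0.5, "people": 0.4, "sleep": 0.1},
    "V": {"fish": 0.3, "people": 0.1, "sleep": 0.6},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, rng):
    """Generate (hidden states, observations) of the given length."""
    states, observations = [], []
    state = draw(initial, rng)
    for _ in range(length):
        states.append(state)
        # The symbol is chosen at random, conditioned only on the current state.
        observations.append(draw(emission[state], rng))
        state = draw(transition[state], rng)
    return states, observations


rng = random.Random(0)  # fixed seed so the run is repeatable
states, observations = sample(5, rng)
print("observed:", observations)
print("hidden:  ", states)
```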

Formally, an HMM can be characterised by:

- the output observation alphabet. This is the set of symbols which may be
  observed as output of the system.
- the set of states.
- the transition probabilities $a_{ij} = P(s_t = j \mid s_{t-1} = i)$. These
  represent the probability of transition to each state from a given state.
- the output probability matrix $b_i(k) = P(X_t = o_k \mid s_t = i)$. These
  represent the probability of observing each symbol in a given state.
- the initial state distribution. This gives the probability of starting
  in each state.
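
The five components listed above map naturally onto plain Python containers. The sketch below is an illustration with invented values, not NLTK's internal representation; `check_distribution` and `joint_probability` are hypothetical helper names. It checks that each distribution sums to one and computes the joint probability of a state sequence and an observation sequence as the initial probability times the product of the transition and output probabilities along the path.

```python
import math

symbols = ["fish", "people", "sleep"]  # output observation alphabet
states = ["N", "V"]                    # set of states
a = {                                  # a[i][j] = P(s_t = j | s_{t-1} = i)
    "N": {"N": 0.4, "V": 0.6},
    "V": {"N": 0.8, "V": 0.2},
}
b = {                                  # b[i][k] = P(X_t = k | s_t = i)
    "N": {"fish": 0.5, "people": 0.4, "sleep": 0.1},
    "V": {"fish": 0.3, "people": 0.1, "sleep": 0.6},
}
pi = {"N": 0.7, "V": 0.3}              # initial state distribution


def check_distribution(dist, support):
    """Return True if dist covers exactly `support` and sums to 1."""
    return set(dist) == set(support) and math.isclose(sum(dist.values()), 1.0)


def joint_probability(state_seq, obs_seq):
    """P(state_seq, obs_seq) under the toy parameters above."""
    if len(state_seq) != len(obs_seq):
        raise ValueError("sequences must have the same length")
    if not state_seq:
        return 1.0  # the empty path has probability one by convention
    # Start in the first state and emit the first symbol.
    prob = pi[state_seq[0]] * b[state_seq[0]][obs_seq[0]]
    # Then alternate: one transition, one emission per step.
    for t in range(1, len(state_seq)):
        prob *= a[state_seq[t - 1]][state_seq[t]] * b[state_seq[t]][obs_seq[t]]
    return prob


valid = (
    check_distribution(pi, states)
    and all(check_distribution(a[s], states) for s in states)
    and all(check_distribution(b[s], symbols) for s in states)
)
print("parameters valid:", valid)
print(joint_probability(["N", "V"], ["people", "sleep"]))
```

Summing `joint_probability` over every state path for a fixed observation sequence would give the probability of that observation sequence; practical implementations use dynamic programming for this rather than enumeration.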


In [2]:
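from nltk.tag.hmm import HiddenMarkovModelTrainer
from nltk.corpus import treebank

# Train on the first 30 tagged sentences; test on sentences 3000 onwards.
train_data = treebank.tagged_sents()[:30]
test_data = treebank.tagged_sents()[3000:]

# Supervised training estimates the HMM parameters from the tagged sentences.
trainer = HiddenMarkovModelTrainer()
HMM = trainer.train_supervised(train_data)

'accuracy:' + str(round(HMM.evaluate(test_data), 3))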

'accuracy:0.106'

In [3]:
HMM.tag(['the', 'men', 'attended', 'to', 'the', 'meetings'])

[('the', 'DT'),
 ('men', 'NNP'),
 ('attended', 'NNP'),
 ('to', 'NNP'),
 ('the', 'NNP'),
 ('meetings', 'NNP')]