# Hidden Markov Model for Phonotactics

This spring, I spent three weeks building a Hidden Markov Model for a class in computational linguistics, and it's been a fun thing to think about from several perspectives. I set aside some time to write up a full explanation that should hopefully cross-pollinate the reader's insight into both machine learning and formal linguistics.

## Background

In broad strokes, one of the main shifts in linguistics since Chomsky is a movement toward computation and statistical methods. 

The field of linguistics, as evidenced by the proceedings of most of my interactions after I announce that I hold a degree in it, is not particularly well-understood, even by linguists. One way that I explain this is that the field itself is at a strange junction of disciplines: much as I might protest the classification, linguistics is not considered part of STEM. It's "further out" than even economics, because it doesn't have a singular paradigm to point to. And I think this is what makes the field so interesting: as an almost-humanistic science, linguistics gets to play the game of trying to figure out what mold it's going to force the human being into in order to create an object for its studies.

The foremost task of the linguist is, or ought to be, to determine whether or not we're even asking good questions about language in the first place. Language is the house of being, after all: it's always a little bit larger than we are, and we can't even try to understand it without using it. And this leads to one of the crucial political problems within the field right now: big data. The statistician, the data scientist, and the software engineer come from a totally different lineage than the formal linguist. They look for patterns, build products, and invest the economic profits into R&D. Coding interviews are explicitly about breaking down problems into known methods and engineering one's way through the unknown. And I love it, don't get me wrong - machine learning and information processing more generally are absolutely the future of scientific praxis. These methods have produced and will continue to produce great experiments. But this is the tension inherent in doing computational linguistics: we're trying to get things conceptually right, not only statistically likely. Models have to be selected for qualitative reasons before they can be tested. After all, testing can only tell you if you found what you were looking for.

The task here, as articulated by a number of my professors, is to rebrand linguistics as the "data science of language." This is somewhat appealing because of the unique position language takes in our own being. We can't have a human experience that isn't mediated through language at least implicitly. So, while it's entirely appropriate to attack text with any statistical model we have, I maintain that it's philosophically important to understand how we think the text got there.

In the case of looking at English prose, we think the text came from the mental processes of an English speaker. Now, in the Chomskyian tradition, we would want to look at the "average competent speaker in a neutral setting" - the linguistic version of the physics problem involving a spherical chicken in a vacuum - in order to build a "descriptively adequate grammar," that is, a minimum-length description of linguistic competence. (For the computer scientist, this means presenting an algorithm to bound the Kolmogorov complexity of a given phenomenon, which I've done in my thesis). But Chomsky, back in 1968, intended a live linguistic being as an informant. The shift in praxis toward computation allows us to try to make inferences from a corpus of the speaker's language usage rather than from interpersonal inquiry.

## Our Model of Language

One of the most important assumptions in the history of phonology is the phoneme: a symbolic, semantically meaningless, and discrete unit that concatenates to form words and enjoys a psychological reality prior to and distinct from the sounds made to render it. Unsurprisingly, the psychic reality of the entity has been a hotly debated topic in the history of linguistics - I'm personally a strong realist after reading Edward Sapir's essay on the topic - and the insight of the symbolic and discrete minimal unit is nonetheless vastly influential. Notably, Claude Levi-Strauss and Jacques Lacan used it as the conceptual keystone of their work in structural anthropology and psychoanalysis. Today it allows us to exploit a homology between speech and gene transcription: both involve the production of a complex organic phenomenon from information stored in a string of symbols. 

For the sake of philosophical correctness, then, it's important to highlight this: when we look at English text, we believe that the symbols on the page are actually renditions of strings of psychically real symbols arranged according to psychologically real rules of syntax and morphology. Those fields of linguistics, sadly, are beyond my own expertise.

Here's an overview of the model to which I subscribe:

![diagram](https://i.imgur.com/ZdGYWlF.png)

The three red boxes are collections of strings of phonemes. Importantly, so far as we know, these are psychologically real units. We can do empirical research on, say, recall time of various words, and the contents of the lexicon present themselves robustly in experimental psychology. The box labeled "text" is importantly a collection of symbols in a script, not phonemes (hence the different color). There is a not-necessarily-bijective correspondence between the two. The atrocity that is English spelling conventions should testify to this fact. The circle labeled "speech" should be self-explanatory. Note that, on this view, text-to-speech or speech-to-text technology will be tasked with inverting one of the mappings "orthography" or "phonetics," even if this is done implicitly. Note also that the difference between our computational approach and the "traditional" approach is here merely reduced to attacking a different node in a directed graph, but that this belies substantial practical differences.

(A final, somewhat pedantic clarification: the mappings on the box "sentences" are imprecise. Strings of morphemes have to go through a process of phonological alternation before a final string of phonemes is produced, but it's unsettled as to exactly how the concatenation of morphemes interfaces with the rewriting of the full phrase. It's also not necessarily the case that morphemes are made of the same phonemes as appear after phonological rewrite, and some models might involve an alphabet of phonemes in addition to an alphabet of so-called "archiphonemes" that comprise a different representation for lexical morphemes.)

I'll reiterate that this isn't an engineering project here; it's a low-resolution model of a particular phenomenon. We know merely from introspection that sentences are made of words and that the literal meanings of sentences we speak don't vary based on, for instance, our auditory production of them or on trivial typos. Now, certainly this simplifies away language acquisition and pragmatics and world-knowledge and all that jazz, but so does most contemporary AI.

So what we intend to investigate is this orthography mapping. In particular, we believe in what a phonologist calls a "surface form" - a string of phonemes that gets rendered into either text or speech (or signing, whistling, etc.) - and we can use this postulate to influence our choice of model class. Chomsky himself endorsed using a loss function (he used the term "evaluation metric" in "Aspects of the Theory of Syntax") to select an optimal model, and so the task of phoneme inference is indeed, if adapted from foundational postulates of linguistics, properly a machine learning task.


## Frequency, Acceptability, and Grammaticality

Again, the adaptation is substantial. For Chomsky, a grammar's evaluation metric is description length and the method of achieving it is retrieving grammaticality judgments on various researcher-curated data from as many live informants as necessary. This naturally results in a workflow where many linguists will tell you that they "live in the exceptions." For instance, a challenge in English phonology is what's called the rule of trisyllabic shortening: when we have the word <serene\> and the word <serenity\>, we see that the second vowel shortens and remains stressed. But there are exceptions to this, like <nightingale\>. It gets more complicated, and becomes a motivating issue for a research program called Lexical Phonology. (More data [here](http://people.cs.uchicago.edu/~jagoldsm/slides/2016-phono-lexical.pdf).) The upshot is that a few choice examples and thousands of hours of introspection are what motivate traditional formal linguistic praxis.

In contrast, the computational linguist lives in the corpus. From the text one has to infer a grammar that could have produced it, and without getting any new information from probing exceptions. Instead, we're reduced to analyzing frequencies of given features simply because it's the only thing we can do. Of course, even this is deceptive, because a computer does not _a priori_ know what features are. Give a machine an English corpus and it only "sees" binary - inferring even that the corpus is "in English" is already a challenging task. Thus, I find it important to remember that even by stipulating that we're looking for language features, we're already importing substantial priors to the task.

But we really have to remember that frequency and grammaticality are not the same thing. For instance, Sanskrit has ten vowels (five qualities, and long and short are distinct) but in the corpus a full eighty percent of vowel phoneme occurrences are short /a/. The distribution of vowels in Sanskrit is a very handy fingerprint for the corpus, as it turns out, but it can't really tell us in itself what's grammatical. One is certainly justified in claiming that things that aren't grammatical should occur with zero frequency, but even this might not be the case. Suppose we're counting occurrences of words in an English corpus. It's entirely possible that typos like <teh\>, or possibly loanwords with strange spelling, are going to occur actually with greater frequency than grammatical words that occur uniquely. And, of course, the computer does not know about such things.
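To make the point about typos concrete, here is a minimal sketch on a made-up toy corpus (the sentence and the variable names are my own assumptions, not data from the actual corpus). It only shows that raw counts can rank a typo above a legitimate word that happens to occur once:

```python
from collections import Counter

# Hypothetical toy corpus: "teh" is a typo, "serenity" is a real word seen once.
toy_corpus = "teh cat sat on teh mat and teh dog felt serenity".split()

# Raw frequency counts know nothing about grammaticality.
counts = Counter(toy_corpus)

# The typo outranks the perfectly grammatical hapax.
typo_outranks_real_word = counts["teh"] > counts["serenity"]
print(typo_outranks_real_word)
```

A frequency table alone would therefore "prefer" the typo, which is exactly why priors about what counts as a word have to come from somewhere else.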

But okay, supposing the corpus is totally clean phonemic transcription from before 2015, compare the word <meh\> with the word <squanch\>. The former is actually a phonological exception in English - it ends in a lax vowel unlike any other word in the language - and the latter does not yet have widely-accepted meaning or usage but is totally valid on a strictly phonological level. Here's "squanch" on Google Trends:

![squanch](https://i.imgur.com/c1VxJou.png)


Indeed, it will have zero frequency in almost any corpus from before 2015 (apparently except for some minor coinages, which should attest to its validity) but should ideally be counted as more valid in the phonological grammar than "meh," which has high frequency in many corpora. Here's Google Trends comparing the two words ("meh" is blue): 

![meh_squanch](https://i.imgur.com/gg30QzN.png)

This should actually be a familiar concept from machine learning and I think is an interesting perspective on language: we're trying to learn a grammar, which is a classifier function $G : \Sigma^* \rightarrow \{ 0, 1 \}$, where $\Sigma$ is the phonemic inventory. Now, obviously phonemic inventories vary across languages and it's still unclear how to get a universal theory of phonemes, but regardless, presuming that every language works on a subset of some universal phonemic alphabet, the task of grammar inference is spelled out for us. Every language has both a grammar and a corpus, and the corpus constitutes noisy training data for the grammar. 

Now, Chomsky has long espoused an argument usually called the "Poverty of the Stimulus": children learn from this noisy corpus of natural language astonishingly quickly. That is, in oversimplified ML terms, the learning rate is much higher than one might expect from pure unsupervised learning. Human beings must be acquiring language through a model-based approach, and the task of the linguist is to speculate about this model. (It's referred to as Universal Grammar, or UG.) Therefore the task of the computational linguist in particular is, to come full circle, one of model selection.

The takeaway here is the usual parable that language is always a little larger than we are. I've demonstrated that, in order to avoid serious errors, frequency analysis must be supplemented by priors that come essentially from our own linguistic _a priori_. The idea of a distinction between competence (grammar) and performance (corpus) goes all the way back to Chomsky's early publications, and it can be helpfully rephrased in terms of machine learning. Now, is our task here or in general actually the construction of a phonological grammar for English? Not necessarily. But it's important nonetheless to respect in any language-related task that there's deep structure to the data and that we can directly glimpse this structure through our own intuitions as speakers. Moreover, the corpus and the grammar have a very complex interdependency and it's difficult but valuable to try to keep the full complexity in view. This is my personal excitement for computational linguistics, and I hope I can share it with the world.

## Modeling Phonology

Now, there's one other key ingredient: we know that phonology is regular. That is, rules for sound production can only be local. To see this, read any descriptive grammar of any natural language. You'll more likely than not find a bunch of what linguists call "rewrite rules" - things like "/s/ becomes /z/ after a voiced consonant, e.g. <cats\> versus <dogs\>" - which are just regular expressions in disguise. That linguistics should be so intimately tied with formal language theory is not particularly surprising in light of the fact that regular languages exist on the _Chomsky_ hierarchy.
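As an illustration of "rewrite rules are regular expressions in disguise," here is a small sketch of the plural voicing rule. The ASCII "transcriptions" and the `VOICED` consonant set are crude assumptions of mine standing in for real phonemic representations; the function name is likewise hypothetical.

```python
import re

# Assumption: a rough ASCII stand-in for the voiced consonants of English.
VOICED = "bdgvmnlr"

def plural_voicing(form):
    """Apply the rewrite rule  s -> z / [voiced consonant] _ #  via a regex."""
    # The lookbehind checks the preceding segment; "$" is the word boundary.
    return re.sub("(?<=[" + VOICED + "])s$", "z", form)

print(plural_voicing("kats"))
print(plural_voicing("dogs"))
```

The rule only ever inspects the segment immediately to the left of its target, which is the locality that the rest of this section leans on.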

The regularity of phonology allowed the rise of finite-state transducer models for computational linguistics, which is a wonderful thing to geek out about. The Xerox Research Center in Palo Alto produced, in the eighties, a full-fledged program for building and evaluating rewrite-rule regular grammars of various languages. Though the technology never made it to market and never even changed research praxis in phonology, it got me excited enough to build a similar model in my thesis.

Now, what's important about regularity is that it means we're not unjustified in choosing a Markovian process to perform our frequency analysis. When we implement the HMM, it's only going to look at bigram frequencies, which won't even be enough to infer a regular grammar, but at the very least we know that any local properties it misses will actually be local rather than arbitrarily high up a syntax tree. If you subscribe to an autosegmental theory of phonology, as I do, locality might exist on multiple strata - for instance, tonality in East African languages has segments that extend well beyond individual phonemes. This is no issue: just run a model on each stratum.
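For readers who haven't counted bigrams before, here is a minimal sketch of what "looking only at bigram frequencies" amounts to. The three-word mini-corpus is made up, and padding each word with the boundary symbol `#` is an assumption on my part about how one might mark word edges.

```python
from collections import Counter

def bigram_counts(words):
    """Count adjacent character pairs, padding each word with the boundary '#'."""
    counts = Counter()
    for word in words:
        padded = "#" + word + "#"
        # zip pairs each character with its right-hand neighbour
        counts.update(zip(padded, padded[1:]))
    return counts

toy_words = ["cat", "cab", "act"]  # hypothetical mini-corpus
counts = bigram_counts(toy_words)
print(counts[("c", "a")])
```

Everything a bigram model can know about the corpus is contained in a table like `counts`; anything that depends on segments two or more positions apart is invisible to it.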

In sum, we are justified in trying to apply local-only models to phonology. To wit, we can bring in the HMM.

## Hidden Markov Model

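The Hidden Markov Model is closely related to a mixture model, but for sequences of categorical data: rather than drawing each observation's mixture component independently, it lets the component (the hidden state) at each position depend on the one before it. (In this case, our categorical data are the observed characters in the corpus.) An HMM supposes that behind each observed symbol there is some other, "hidden" symbol that produced it. Given a string of observed symbols, then, there should exist a "hidden" string that the observations are actually a manifestation of. It associates to each hidden state a probability distribution predicting the following hidden state and a probability distribution predicting the emission of an output symbol. Through this, the model makes two independence assumptions: the _Markov property_ of the hidden states (each hidden state depends only on the one immediately before it) and output independence (each observed symbol depends only on the hidden state that emits it).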


The HMM is associated with three main tasks:

* Given an observed string, estimate its probability of occurrence; 
* Given an observed string, compute the most likely hidden string that would produce it;
* Given a corpus of observed strings, learn, without supervision, the model's probability distributions (transitions between hidden states and emissions from hidden to observed states).

We choose to use this model for phonology because we know each character is underlyingly a phoneme, and we can use the unsupervised learning feature of the model to put real quantitative strength to grammatical predictions about the corpus. For instance, at the most basic, we know that English has two categories of letters - consonants and vowels - and this is a pretty easy thing to determine from introspection. A Hidden Markov Model told only that there are two categories of symbol, if it can determine this distribution from the tendency of English words not to cluster vowels but to sometimes cluster consonants, will put some numbers to this intuition. 

So we'll choose separating consonants from vowels as our starting task. To investigate the three problems, let's define exactly what the HMM is. Imagine that the interface between "sentences" and "text" in the above diagram is a bit collapsed and viewed from the following perspective:

![diagram2](https://i.imgur.com/2HQ1wB7.png)

Then the HMM is a hypothesis about the box of hidden states to which we will ultimately assign a probability based on how well it predicts the corpus. Now suppose we look at an individual string in the text box and the corresponding underlying string in the hidden states box (graphic taken from user Hakeem.gadi on Wikimedia Commons):

![HMMsequence](https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/HMMsequence.svg/579px-HMMsequence.svg.png)

The model directly contains, in its distributions, the probability of each hidden state succeeding each other hidden state and the probability of each hidden state emitting each observed state. We can directly use these distributions to solve the first of the three problems: if $w$ is our observed word with length $k$, we can take each possible $k$-long string of hidden states and figure out its probability of emitting $w$ and simply sum over the whole set. This will be done in a later section. Additionally, we can see how using Bayes's rule to take the string $w$ as an observation immediately allows us to compute the most likely underlying string that could have produced it, given the distributions in the model, and we'll also do this.
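One way to read the transition and emission tables described above is generatively: pick a starting hidden state, emit a letter, move to the next hidden state, and repeat. The sketch below does just that on a hand-picked toy model. The parameter values, the state names `"C"` and `"V"`, and the nested-dictionary layout are all assumptions for illustration (the implementation later in this post keys its dictionaries by tuples instead).

```python
import random

def sample_word(pi, A, B, length, rng):
    """Draw a hidden string and the letters it emits from a toy HMM."""
    def draw(dist):
        # dist maps outcomes to probabilities that sum to 1
        r, total = rng.random(), 0.0
        for outcome, p in sorted(dist.items()):
            total += p
            if r < total:
                return outcome
        return outcome  # guard against floating-point round-off

    state = draw(pi)
    hidden, letters = [], []
    for _ in range(length):
        hidden.append(state)
        letters.append(draw(B[state]))  # emit a letter from the current state
        state = draw(A[state])          # then move to the next hidden state
    return hidden, "".join(letters)

# Hypothetical parameters: state "C" emits consonants, state "V" emits vowels.
pi = {"C": 0.7, "V": 0.3}
A = {"C": {"C": 0.2, "V": 0.8}, "V": {"C": 0.9, "V": 0.1}}
B = {"C": {"t": 0.5, "n": 0.5}, "V": {"a": 0.6, "i": 0.4}}

hidden, word = sample_word(pi, A, B, 5, random.Random(0))
print(word)
```

Only `word` would ever be visible in a corpus; `hidden` is what the three HMM tasks try to reason about.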

But notice that neither of those two problems will tell us much about consonants and vowels until we've actually trained the model. This is a little more difficult to explain, and it will involve solving the first problem iteratively. For what I take as useful visual intuition, consider an analogous unsupervised learning task: k-means clustering. In this task, we're given data in Euclidean space (say $\mathbb{R}^2$) and we decide we want $k$ clusters from it (in the below example, say $k = 3$). The algorithm, somewhat counterintuitively, _randomly assigns_ initial values for the mean points of each cluster and then computes the [Voronoi diagram](https://en.wikipedia.org/wiki/Voronoi_diagram) of the space, i.e. determines for each mean point $m$ which data points $x$ are closer to $m$ than to any other mean point. From these newly-estimated clusters, new means are computed and the process repeats iteratively. Here we see an animation of the process (taken from user Chire on Wikimedia Commons):

![K-means_convergence](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)

What happens, as we notice, is that the algorithm will eventually terminate. However, it does not always terminate in the same place: the result depends on the random starting means, and sometimes the model gets stuck in a local extremum. This is simply a risk we run. On the other hand, it manages to pick out clusters pretty well despite not being told anything other than how many there are going to be.
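For reference, the alternation between assigning points and recomputing means can be sketched in a few lines. This is a bare-bones version of the standard procedure (often called Lloyd's algorithm), not the code behind the animation; the nine data points are invented, and choosing the initial means by sampling from the data is one common convention that I am assuming here.

```python
import random

def kmeans(points, k, rng, max_iter=100):
    """Bare-bones k-means on 2-D points; returns (means, cluster index per point)."""
    means = rng.sample(points, k)  # random initial means drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest mean (its Voronoi cell).
        new_labels = [
            min(range(k), key=lambda j: (x - means[j][0]) ** 2 + (y - means[j][1]) ** 2)
            for (x, y) in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each mean to the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster went empty
                means[j] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
    return means, labels

# Hypothetical data: three loose blobs in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
means, labels = kmeans(points, 3, random.Random(0))
print(labels)
```

Depending on which points are drawn as starting means, the loop can settle with two means inside one blob and a single mean shared by the other two, which is the local-extremum behaviour mentioned above.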

The HMM will train itself somewhat similarly. Instead of three clusters, we'll initially look for two. And we won't be trying to actually cluster; instead, we'll be trying to assign the most suitable probability distributions of emission from hidden states to surface states. This process will also be iterative: instead of computing means, we'll randomly initialize the distributions and perform the analogue of re-clustering, using Bayes's rule to compute new distributions that correspond to what's observed in the corpus, then setting those as our new distributions. This gets very technical, but the iterative movement is analogous to the clustering depicted above.

Now, we're actually going to have to solve the first problem - computing word probability - before we can train the model. But before that, we need a starting distribution. And again, somewhat counter-intuitively, _any starting guess is as good as any other_. Here's the naive implementation of this:

In [1]:
import random

def distribute_random(distribution, from_states, to_states):
	for from_state in from_states:
		magnitude = 0.0
		for to_state in to_states:
			distribution[(from_state, to_state)] = random.random()
			magnitude += distribution[(from_state, to_state)]
		for to_state in to_states:
			distribution[(from_state, to_state)] /= magnitude
		

def distribute_pi(pi, from_states):
	magnitude = 0.0
	for from_state in from_states:
		pi[from_state] = random.random()
		magnitude += pi[from_state]
	for from_state in from_states:
		pi[from_state] /= magnitude

Now here's my Python implementation of the constructor for our model, keeping track of the longest word in the corpus as well as the corpus size, for convenience, and inferring the output alphabet $\Delta$ from the collection of all characters in the corpus:

In [2]:
verbose_flag = True
print_flag = True

class HMM():

	def __init__(self, _hidden_states, filename):
		self.A = dict()
		self.B = dict()
		self.Pi = dict()
		self.softcount = dict()
		self.hidden_states = _hidden_states
		self.corpus_name = filename
		self.word_count = 1
		self.longest = 0

		# Determine the output states from the corpus
		chars = set(['#'])
		# Read one word per line; the with-block closes the file when done
		with open(self.corpus_name, "r") as corpus_file:
			for line in corpus_file:
				line = line.strip('\n').lower()
				if not line:
					continue  # skip blank lines rather than stopping at the first one
				chars |= set(line)
				self.word_count += 1
				self.longest = max(len(line), self.longest)
		self.output_states = sorted(list(chars))

		distribute_random(self.A, self.hidden_states, self.hidden_states)
		distribute_random(self.B, self.hidden_states, self.output_states)
		distribute_pi(self.Pi, self.hidden_states)

		# Print out the discrete distributions
		if verbose_flag:
			print "========================"
			print "==   Initialization   =="
			print "========================"

		for from_state in self.hidden_states:
			if verbose_flag:
				print "\n--\n\nCreating state " + from_state.__str__() + "\n\n"
				print "Transitions:\n"
			transition_sum = 0.0
			for to_state in self.hidden_states:
				if verbose_flag:
					print "    To state     " + to_state.__str__() + ":    " + self.A[(from_state, to_state)].__str__()
				transition_sum += self.A[(from_state, to_state)]
			if verbose_flag:
				print "\nTotal: " + transition_sum.__str__()
				print "\n\nEmissions:\n"
			emission_sum = 0.0
			for to_state in self.output_states:
				if verbose_flag:
					print "    To letter    " + to_state.__str__() + ":    " + self.B[(from_state, to_state)].__str__()
				emission_sum += self.B[(from_state, to_state)]
			if verbose_flag:
				print "\nTotal: " + emission_sum.__str__()

		if verbose_flag:
			print "\n--\n"
			print " Starting distribution:\n"
		start_sum = 0.0
		for state in self.hidden_states:
			if verbose_flag:
				print "    For state    " + state.__str__() + ":    " + self.Pi[state].__str__()
			start_sum += self.Pi[state]
		if verbose_flag:
			print "\nTotal: " + start_sum.__str__()

We can run this on the 1000-word English corpus I've included. The following code will print out the randomly-initialized distributions decided after inferring the output symbol alphabet:

In [3]:
states = set([0, 1])
my_hmm = HMM(states, "english1000.txt")

==   Initialization   ==

--

Creating state 0


Transitions:

    To state     0:    0.313775963271
    To state     1:    0.686224036729

Total: 1.0


Emissions:

    To letter    #:    0.0138307993741
    To letter    ':    0.0223516349764
    To letter    .:    0.00547429141631
    To letter    a:    0.0211041496365
    To letter    b:    0.00306361266567
    To letter    c:    0.0240465102024
    To letter    d:    0.0510861008098
    To letter    e:    0.0764764480147
    To letter    f:    0.0673579146557
    To letter    g:    0.0463518684155
    To letter    h:    0.0480514242245
    To letter    i:    0.0737159747059
    To letter    j:    0.0397397093505
    To letter    k:    0.00584211021557
    To letter    l:    0.0424010904947
    To letter    m:    0.0398494958431
    To letter    n:    0.007482193855
    To letter    o:    0.00905501905756
    To letter    p:    0.0382000699608
    To letter    q:    0.0749026125024
    To letter    r:    0.00248408563261
    To lette

## Estimating Word Probability: Forward Method

Now, we can use the random distribution to start estimating the probabilities of hypothesized words. It'd be nice to unravel the probability of a word in terms of the conditional probabilities of its characters, i.e. $P(w) = P(w_0)P(w_1\mid w_0)\cdots P(w_{k-1}\mid w_0 \ldots w_{k-2})$, but we don't actually have these probabilities. We only know the probabilities of transition and emission.

Instead, what we can do is observe that for a $k$-long string $w = w_0 \ldots w_{k-1}$, we've actually assigned distributions such that every hidden state $q$ has some probability of emitting each character $w_t$. Thus _any_ underlying string of hidden states $S = q_0 \ldots q_k$ (one state per character, plus the state reached after the last emission) could emit $w$, and because there are only finitely many such strings, all we need do is take the following sum:

$$\hat{P}_{\theta}(w) = \sum_{S \in \Sigma^{k+1}} \hat{P}_{\theta}(S) \hat{P}_{\theta}(w \mid S)$$

That is, the probability of $w$ is simply the sum of probabilities where each $S$ occurs and $S$ also emits $w$. So we have to compute the probability of each hidden string and then the probability of emission. Recall that the model explicitly gives us the emission probabilities - that much is simple. The probability given $S$ of seeing $w$ is simply

$$\hat{P}_{\theta}(w \mid S) = \prod_{t = 0}^{k-1} B(q_t, w_t)$$

Now, to compute the probability of a hidden string $S$ is possible because we have implicitly defined the hidden states to have the Markov property: by keeping track only of bigram frequencies, our prediction is that the probability of seeing a state $q_1$ given previous state $q_0$ is invariant given any more information. That is, for any $S = q_0...q_k$,

$$\hat{P}_{\theta}(q_k \mid q_{k-1}) = \hat{P}_{\theta}(q_k \mid q_{k-1}...q_0)$$

This is emphatically not the case for the words. Take a moment to think about this: the corpus itself has no guarantee of being Markovian. Instead, we cleverly decided to model it as the nondeterministic rendition of a hidden sequence that we take _a priori_ to be Markovian. This allows us to do whatever math we want behind the scenes and to simply state at the end how probable our prediction was.

But this goes deeper than the mere disorder of the corpus. The real reason that we couldn't compute the probability above is that the corpus is an observation, not a model, and probability is a property assigned by a researcher through a model. So we're not actually going to determine the probabilities of observed words, but rather we're going to "estimate" them. The corpus does not have probability, properly speaking: it has frequency. Think of any issue in natural science, like the often-deemed-improbable fact of life at all. It's meaningless to talk about the probability of an actual observation like "life exists," but it is interesting to ask "how would we build a model that assigns a meaningful probability to this fact?"

This said, we observe that the probability of $S$ is indeed what we'd expect (recalling that $\pi$ is the initial-state distribution):

$$\hat{P}_{\theta}(S) = \hat{P}_{\theta}(q_0)\hat{P}_{\theta}(q_1 \mid q_0)...\hat{P}_{\theta}(q_k\mid q_{k-1}) = \pi(q_0) \prod_{t = 1}^k \hat{P}_{\theta}(q_t \mid q_{t-1})$$

And therefore, taking the full expression yields a formula we can actually compute and can even understand just from looking at it.

$$\hat{P}_{\theta}(w) = \sum_{S \in \Sigma^{k+1}} \left( \pi(q_0) \prod_{t = 0}^{k-1} A(q_t, q_{t+1})B(q_t, w_t) \right)$$
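Taken literally, the sum above can be computed by enumerating every hidden string. Here is a brute-force sketch on a made-up two-state model. I'm assuming the same conventions as the `forward` function below: dictionaries keyed by tuples, letters and times counted from zero, and state $q_t$ emitting $w_t$ before moving to $q_{t+1}$. The parameter values are arbitrary.

```python
from itertools import product

def brute_force_probability(word, states, pi, A, B):
    """Sum pi(q_0) * prod_t A(q_t, q_{t+1}) * B(q_t, w_t) over all hidden strings."""
    k = len(word)
    total = 0.0
    for S in product(states, repeat=k + 1):  # S = q_0 ... q_k
        p = pi[S[0]]
        for t in range(k):
            p *= B[(S[t], word[t])] * A[(S[t], S[t + 1])]
        total += p
    return total

# Hypothetical toy parameters over hidden states {0, 1} and letters {a, b}.
states = [0, 1]
pi = {0: 0.6, 1: 0.4}
A = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}
B = {(0, "a"): 0.9, (0, "b"): 0.1, (1, "a"): 0.2, (1, "b"): 0.8}

print(brute_force_probability("ab", states, pi, A, B))
```

The loop visits $|\Sigma|^{k+1}$ hidden strings, so this is only workable for very short words; that blow-up is the practical reason to reorganize the computation.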

We call this the "forward method" for computation because it moves forward in phonological "time," i.e. along the word. We're going to want to reorganize this a bit, though, to make things easier down the road. 

Consider a new function $\alpha_w(q, t)$. If we imagine, for a given word $w$, that each hidden string $S$ emitting $w$ is a pipe and that an original volume of 1 unit of fluid is poured through $\pi$ at time $0$, and it splits at each time according to $\theta$, then $\alpha_w(q, t)$ represents the amount of liquid expected to flow into hidden state $q$ at time $t$ given word $w$. That is,

$$\alpha_w(q, 0) = \pi(q); \ \ \ \ \ \alpha_w(q, t + 1) = \sum_{q' \in \Sigma} \alpha_w(q', t) A(q', q) B(q', w_t)$$

Then we have

$$\hat{P}_{\theta}(w) = \sum_{q \in \Sigma} \alpha_w(q, |w|)$$

Why did we do this? In effect, looking at $\alpha_w$ lets us see the partial products of the expression above. We're going to need this to recompute the initial distribution $\pi$ when it comes time to train, since $\pi$ is only sensitive to the first term in the product. It will also be instructive to see that the backward computation algorithm lines up at every time $t$, which may help visualize the notion of probability "flowing" through the model.
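Before the fuller implementation, here is a stripped-down sketch of just the recurrence, with none of the printing. It assumes times and letters are counted from zero, with $\alpha_w(q, 0) = \pi(q)$, which is the indexing the implementation below uses. The two-state toy parameters and the function names are made up for illustration.

```python
def forward_table(word, states, pi, A, B):
    """alpha[t][q]: probability of emitting word[:t] and then being in state q."""
    alpha = [{q: pi[q] for q in states}]  # time 0 is just the initial distribution
    for t, letter in enumerate(word):
        prev = alpha[t]
        # Each state at time t emits the letter, then flows into state q.
        alpha.append({
            q: sum(prev[p] * B[(p, letter)] * A[(p, q)] for p in states)
            for q in states
        })
    return alpha

def word_probability(word, states, pi, A, B):
    """Sum the final column of the forward table."""
    return sum(forward_table(word, states, pi, A, B)[len(word)].values())

# Hypothetical toy parameters over hidden states {0, 1} and letters {a, b}.
states = [0, 1]
pi = {0: 0.6, 1: 0.4}
A = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}
B = {(0, "a"): 0.9, (0, "b"): 0.1, (1, "a"): 0.2, (1, "b"): 0.8}

print(word_probability("ab", states, pi, A, B))
```

Each step only touches $|\Sigma|^2$ products, so a $k$-letter word costs on the order of $k|\Sigma|^2$ multiplications instead of one term per hidden string.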

Below is a Python implementation of the forward algorithm:

In [4]:
def forward(hmm, word):
		# Initialize the distribution
		alpha = dict()

		# Print if flagged
		if verbose_flag:
			print "\n--\n\nComputing forward probabilities."

		# Initialize \alpha values to the initial distribution
		for state in hmm.hidden_states:
			alpha[(state, 0)] = hmm.Pi[state]

		# Moving forward, compute new alpha values from probability products
		for t in range(1, len(word) + 1):

			# Print if flagged
			if verbose_flag:
				print "\n\n    Time " + t.__str__() + ": \'" + word[t-1] + "\'"

			# Keep a running sum at each time t
			t_sum = 0.0
			# Run through possible next states
			for to_state in hmm.hidden_states:
				alpha[(to_state, t)] = 0

				# Print if flagged
				if verbose_flag:
					print "        To state " + to_state.__str__()

				# Find the forward probability given the next letter
				for from_state in hmm.hidden_states:
					increment = alpha[(from_state, t-1)] * hmm.B[(from_state, word[t-1])] * hmm.A[(from_state, to_state)]
					alpha[(to_state, t)] += increment

					# Print if flagged
					if verbose_flag:
						print "            From state " + from_state.__str__() + ": \\alpha_w(" + from_state.__str__() + ", " + (t-1).__str__() + ") * B(" + from_state.__str__() + ", " +  word[t-1] + ") * A(" + from_state.__str__() + ", " + to_state.__str__() + ") = " + increment.__str__() 
				
				# Print if flagged
				if verbose_flag:
					print "        \\alpha_w(" + to_state.__str__() + ", " + t.__str__() + ") = " +  alpha[(to_state, t)].__str__()
				
				# Add the probability from the current state to the sum for t
				t_sum += alpha[(to_state, t)]
			
			# Print if flagged
			if verbose_flag:
				print "\n    \\sum_{q} \\alpha_w(q, " + t.__str__() + ") = " + t_sum.__str__()

		# Print if flagged
		if verbose_flag:
			print "\n--\n\n"		
			for t in range(0, len(word)+1):
				print "Time " + t.__str__() + ":"
				for state in hmm.hidden_states:
					print "    \\alpha_w(" + state.__str__() + ", " + t.__str__() + ") = " + alpha[(state, t)].__str__()
			print "\nTotal estimated probability of word \"" + word + "\": " + sum([alpha[(state, len(word))] for state in hmm.hidden_states]).__str__()
                    
		return alpha

Using this function with a randomly-initialized model, we can get the estimated probability of any word:

In [10]:
forward(my_hmm, "asdfjkl")

print "\n\n--\n\n"
forward(my_hmm, raw_input("Now you enter something: "))



--


Now you enter something: worry


{(0, 0): 0.14062705893551103,
 (0, 1): 0.02982245745122407,
 (0, 2): 0.001347132451159962,
 (0, 3): 0.00017841294068976645,
 (0, 4): 5.975262019869549e-06,
 (0, 5): 1.1266806165470617e-07,
 (1, 0): 0.8583729410644889,
 (1, 1): 0.0063434040812072445,
 (1, 2): 0.001886362761747577,
 (1, 3): 5.096260804132496e-05,
 (1, 4): 2.994424183321664e-06,
 (1, 5): 2.7917308360974185e-08}

Time for a sanity check: why, for the word "asdfjkl," are we getting values on the order of $10^{-10}\sim10^{-13}$? Well, think about how many possible seven-letter strings there are on an alphabet of twenty-six letters: $26^7 \approx 8.03 \cdot 10^9$. Even if the probability distribution is uniform, we can expect to get each string only $1.24 \cdot 10^{-10}$ of the time. So tiny probabilities, especially before we train the model at all, are entirely expected. And, actually, after training, you'd hope that "asdfjkl" is reported as even more improbable.
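
To make that arithmetic concrete, here is a quick check of the uniform baseline. It is only a sketch, written in Python 3 syntax (unlike the Python 2 cells in this post); the function name `uniform_string_probability` is introduced here for illustration, and the 26-letter alphabet is the simplifying assumption from the paragraph above.

```python
# Assumption: a uniform, independent model over a 26-letter alphabet.
# (The corpus alphabet in this post also contains a few extra symbols.)
def uniform_string_probability(alphabet_size, length):
    """Probability of one specific string when every symbol is equally likely."""
    return (1.0 / alphabet_size) ** length

baseline = uniform_string_probability(26, 7)
print("{:.3e}".format(baseline))
```

Anything an untrained model reports in this neighborhood for a seven-letter string is therefore unremarkable.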

## Estimating String Probability: Backward Method

Now, part of the forward-backward algorithm is the fact that we can use Bayes's theorem to take partial products in reverse. To wit, we define another function, this one representing the amount of liquid flowing out of a hidden state $q$ at time $t$ given word $w$:

$$\beta_w(q, |w|) = 1, \ \ \ \ \ \ \beta_w(q, t) = \sum_{q' \in \Sigma} \beta_w(q', t+1) A(q, q') B(q, w_t)$$

Now note that, at any time $t$, for a given word $w$, the probability distributions we've just defined line up as follows:

$$ \hat{P}_{\theta}(w) = \sum_{q \in \Sigma} \alpha_w(q, t) \beta_w(q, t)$$

Quickly, here's the backward algorithm in Python:

In [5]:
def backward(hmm, word):
		# Initialize the distribution
		beta = dict()

		# Print if flagged
		if verbose_flag:
			print "\n--\n\nComputing backward probabilities."

		for s in hmm.hidden_states:
			beta[(s, len(word))] = 1

		for t in range(len(word), 0, -1):
			
			# Print if flagged
			if verbose_flag:
				print "\n\n    Time " + t.__str__() + ": \'" + word[t-1] + "\'"
			
			# Keep a running sum at each time t
			t_sum = 0.0
			
			for from_state in hmm.hidden_states:
				# Initialize \beta
				beta[(from_state, t-1)] = 0.0

				# Print if flagged
				if verbose_flag:
					print "        From state " + from_state.__str__()

				# Find the backward probability given the last letter
				for to_state in hmm.hidden_states:
					increment = beta[(to_state, t)] * hmm.B[(from_state, word[t-1])] * hmm.A[(from_state, to_state)]
					beta[(from_state, t-1)] += increment

					# Print if flagged
					if verbose_flag:
						print "            To state " + to_state.__str__() + ": \\beta_w(" + to_state.__str__() + ", " + t.__str__() + ") * B(" + from_state.__str__() + ", " +  word[t-1] + ") * A(" + from_state.__str__() + ", " + to_state.__str__() + ") = " + increment.__str__() 
				
				# Add the probability from the current state to the sum for t
				t_sum += beta[(from_state, t-1)]

				# Print if flagged
				if verbose_flag:
					print "\n    \sum_{q \in S} \\beta_w(s, " + t.__str__() + ") = " + t_sum.__str__()
			
		

		# Print if flagged
		if verbose_flag:
			print "\n--\n\n"
			for t in range(0, len(word)+1):
				print "Time " + t.__str__() + ":"
				for state in hmm.hidden_states:
					print "    \\beta_w(" + state.__str__() + ", " + t.__str__() + ") = " + beta[(state, t)].__str__()
		return beta

And a test-run to show it lines up:

In [None]:
word = "qwerty"

alpha = forward(my_hmm, word)
beta = backward(my_hmm, word)

print "\n--\n\nSumming over distributions at each position in the input word:\n\n"

for t in range(len(word)):
    sum_t = 0
    for state in my_hmm.hidden_states:
        sum_t += alpha[(state, t)] * beta[state, t]
    print "P('" + word + "') at time t = " + t.__str__() + " (w_t = " + word[t] + ") is: " + sum_t.__str__()
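
The same check can be run end to end on a toy model. The sketch below is self-contained and in Python 3 syntax; the two-state parameters are made up for illustration, and `toy_forward` and `toy_backward` are hypothetical stand-ins that follow the same indexing as `forward` and `backward` above (time 0 holds the initial distribution, and letter `word[t-1]` is emitted on the step into time `t`).

```python
def toy_forward(pi, A, B, states, word):
    """alpha[(q, t)]: probability of emitting word[:t] and arriving in q."""
    alpha = {(q, 0): pi[q] for q in states}
    for t in range(1, len(word) + 1):
        for q2 in states:
            alpha[(q2, t)] = sum(
                alpha[(q1, t - 1)] * B[(q1, word[t - 1])] * A[(q1, q2)]
                for q1 in states)
    return alpha

def toy_backward(pi, A, B, states, word):
    """beta[(q, t)]: probability of emitting word[t:] starting from q."""
    n = len(word)
    beta = {(q, n): 1.0 for q in states}
    for t in range(n, 0, -1):
        for q1 in states:
            beta[(q1, t - 1)] = sum(
                beta[(q2, t)] * B[(q1, word[t - 1])] * A[(q1, q2)]
                for q2 in states)
    return beta

# Made-up parameters: each row of A and B sums to one.
states = [0, 1]
pi = {0: 0.6, 1: 0.4}
A = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.2}
B = {(0, 'a'): 0.1, (0, 'b'): 0.9, (1, 'a'): 0.7, (1, 'b'): 0.3}

word = "abba"
alpha = toy_forward(pi, A, B, states, word)
beta = toy_backward(pi, A, B, states, word)

# The sum over states of alpha * beta should be the same at every time step.
totals = [sum(alpha[(q, t)] * beta[(q, t)] for q in states)
          for t in range(len(word) + 1)]
for t, total in enumerate(totals):
    print("t = {}: {}".format(t, total))
```

Every printed total should agree up to floating-point rounding: the value at the first time step is the backward estimate of the word's probability, and the value at the last is the forward estimate.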

## Training the Model: The Baum-Welch Algorithm

Now that we've seen the "flow" idea play out in estimating word probability, we can actually use it to solve the third problem and train the model. Returning to the analogy of clustering, what we've in essence accomplished thus far is the random initialization of mean points (probability distributions) and we've defined two equivalent ways to norm vectors (compute word probabilities). What remains to be done is to compute new means from the words (reassign probability distributions). This learning process is called the [Baum-Welch Algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm).

What we'll do at this point is compute the atomized expectations across the entire corpus: we define a function called "softcount" that gives the expectation that state $q$ will transition to state $q'$ and emit symbol $o$ at time $t$.

$$SC(q, q', o, t) = \mathbb{E}_K\left[ \alpha_w(q, t) A(q, q') B(q, o) \beta_w(q', t+1) \mid w_t = o \right] = \sum_{w \in K, \ w_t = o} \frac{\alpha_w(q, t) A(q, q') B(q, o) \beta_w(q', t+1)}{\hat{P}_{\theta}(w)} $$
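
As a rough illustration of what these numbers look like, the sketch below computes softcounts for a single word on a made-up two-state model. It is written in Python 3 syntax and is not the implementation used in this post: `softcounts_for_word` is a hypothetical helper, the parameters are invented, and time is 0-based as in the code cells, so the flow out of `q` at position `t` is paired with the backward value of the destination state at `t + 1`.

```python
def softcounts_for_word(pi, A, B, states, word):
    """Softcounts keyed by (t, symbol, from_state, to_state) for one word."""
    n = len(word)
    # Forward pass
    alpha = {(q, 0): pi[q] for q in states}
    for t in range(1, n + 1):
        for q2 in states:
            alpha[(q2, t)] = sum(
                alpha[(q1, t - 1)] * B[(q1, word[t - 1])] * A[(q1, q2)]
                for q1 in states)
    # Backward pass
    beta = {(q, n): 1.0 for q in states}
    for t in range(n, 0, -1):
        for q1 in states:
            beta[(q1, t - 1)] = sum(
                beta[(q2, t)] * B[(q1, word[t - 1])] * A[(q1, q2)]
                for q2 in states)
    # Probability of the whole word, used to normalize
    p_word = sum(alpha[(q, n)] for q in states)
    sc = {}
    for t in range(n):
        for q1 in states:
            for q2 in states:
                sc[(t, word[t], q1, q2)] = (
                    alpha[(q1, t)] * A[(q1, q2)] * B[(q1, word[t])]
                    * beta[(q2, t + 1)] / p_word)
    return sc

# Made-up parameters: each row of A and B sums to one.
states = [0, 1]
pi = {0: 0.6, 1: 0.4}
A = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.2}
B = {(0, 'a'): 0.1, (0, 'b'): 0.9, (1, 'a'): 0.7, (1, 'b'): 0.3}

word = "abba"
sc = softcounts_for_word(pi, A, B, states, word)

# At each position, the softcounts over all (from, to) pairs sum to one:
# the word's single emission at that position is shared out among them.
per_position = [sum(v for (t2, o, q1, q2), v in sc.items() if t2 == t)
                for t in range(len(word))]
print(per_position)
```

This is why they are called soft counts: each letter of each word contributes exactly one count, split fractionally across the state pairs that could have produced it.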

In this step, we also need to determine the probability that the model assigns to the corpus. This requires some discussion: remember that we're trying to assign a probability to the model, taking the corpus as an observation. We've set up the model to estimate a probability for each possible string of output characters, and we can use these estimates along with the actual observed spread of words in the corpus to assign a probability, or, more accurately, a likelihood, to the model. This is a method called [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). Define the _likelihood_ of the HMM with parameters $\theta$ with respect to the corpus $K$ by

$$\hat{\ell}_\theta(K) = - \frac{1}{|K|} \sum_{w \in K} \log \left( \hat{P}_{\theta}(w) \right) $$

(The negative sign is added because all the probabilities are less than one, so their logarithms are negative; flipping the sign gives a positive score.) We now simply minimize this value, which is a measure of how badly the model fails to account for the data, where strong outliers are penalized less due to the logarithm function. One can argue for the correctness of this evaluation metric by looking at loanwords and strange exceptions: every language has them, and there should be insignificant penalty assigned to a description of a language for the fact of some (even many) outliers. That's just the nature of the data.
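
To get a feel for the scale of this score, here is a small sketch in Python 3 syntax. It assumes base-2 logarithms, which is what the code in this post uses when it computes a "plog"; the helper names `plog` and `mean_plog` and the probabilities are made up for illustration.

```python
from math import log2

def plog(p):
    """Positive log (base 2) of a probability: the cost of a word in bits."""
    return -log2(p)

def mean_plog(word_probs):
    """Average plog over a list of per-word probabilities."""
    return sum(plog(p) for p in word_probs) / len(word_probs)

# Three made-up word probabilities costing 1, 2 and 2 bits.
typical = [0.5, 0.25, 0.25]
print(mean_plog(typical))

# A word that is 1000 times less probable costs only about 10 more bits.
extra_bits = plog(1e-6) - plog(1e-3)
print(extra_bits)
```

The second number is the point of the logarithm: a thousandfold drop in probability adds a fixed handful of bits to the total rather than multiplying it.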

Because we need both the likelihood and the softcounts, the softcounts are kept in the HMM as member variables and the expectation function simply returns the likelihood. The computation is implemented in Python below, also checking at every step that the forward and backward algorithms agree:

In [6]:
def forward_probability(hmm, alpha, length):
		alpha_sum = 0.0
		for state in hmm.hidden_states:
			alpha_sum += alpha[(state, length)]
		return alpha_sum


def backward_probability(hmm, beta):
    beta_sum = 0.0
    for state in hmm.hidden_states:
        beta_sum += hmm.Pi[state] * beta[(state, 0)]
    return beta_sum

def expectation(hmm):
		hmm.softcount = dict()
		# Set initial values
		plog_sum = 0.0
		# Open and read file
		corpus_file = open(hmm.corpus_name, "r")
		line = corpus_file.readline().strip('\n').lower()
		if print_flag:
			print "\n\nPlogs:\n"
		while line:
			# Append endline character
			line += "#"
			
			# Compute probabilities
			alpha = forward(hmm, line)
			beta = backward(hmm, line)
			f_prob = forward_probability(hmm, alpha, len(line))
			b_prob = backward_probability(hmm, beta)
			
			# Run through the word and tabulate softcounts
			for t in range(len(line)):
				for from_state in hmm.hidden_states:
					for to_state in hmm.hidden_states:
						if (t, line[t], from_state, to_state) not in hmm.softcount:
							hmm.softcount[(t, line[t], from_state, to_state)] = 0.0	
						hmm.softcount[(t, line[t], from_state, to_state)] += (alpha[(from_state, t)] * hmm.A[(from_state, to_state)] * hmm.B[(from_state, line[t])] * beta[(to_state, t+1)]) / f_prob

			# If we have agreement on the probabilities, more or less
			if (fabs(f_prob - b_prob) < 0.00001):
				plog = -1 * log(f_prob, 2)
				plog_sum += plog

				# Print if flagged
				if print_flag:
					print "plog(\"" + line + "\") = " + plog.__str__()

			else:
				print "Unacceptable probability mismatch at word " + line + ": forward (" + f_prob.__str__() + ") != backward (" + b_prob.__str__() + ")."	
			line = corpus_file.readline().strip('\n').lower()
		
		# Print if flagged
		if print_flag:
			print "\n--\n\nSum of positive logs: " + plog_sum.__str__() + "\n\n--\nSoftcounts:\n"

			for t in range(hmm.longest):
				print "\n    At time " + t.__str__() + ": "
				mysum = 0.0
				for from_state in hmm.hidden_states:
					print "\n        From state " + from_state.__str__() + ": "
					for to_state in hmm.hidden_states:
						print "\n            To state " + to_state.__str__() + ": "
						for char in hmm.output_states:
							if (t, char, from_state, to_state) in hmm.softcount:
								print "                Emitting " + char + ": " + hmm.softcount[(t, char, from_state, to_state)].__str__()
								mysum += hmm.softcount[(t, char, from_state, to_state)]
				print "Sum = " + mysum.__str__()
		
		# Return
		return plog_sum

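From here, we can compute the expectations we're after. The maximization step, like in the clustering algorithm, simply involves resetting the probability distributions to the new expectations, normalized so that each distribution sums to one. Thus, for $\pi$, we set $\pi(q)$ to the expected fraction of words that start in state $q$, that is, the softcounts flowing out of $q$ at time 1, averaged over the corpus:

$$\pi'(q) := \frac{1}{|K|} \sum_{q' \in \Sigma} \sum_{o \in \Delta} SC(q, q', o, 1)$$

(The sum over $w \in K$ is already built into $SC$.) And for $A$, we want the total expectation of transitioning from state $q$ to state $q'$ at any time and emitting any symbol, divided by the total expectation of leaving $q$ at all:

$$A'(q, q') := \frac{\sum_{t = 1}^{\max |w|} \sum_{o \in \Delta} SC(q, q', o, t)}{\sum_{t = 1}^{\max |w|} \sum_{o \in \Delta} \sum_{q'' \in \Sigma} SC(q, q'', o, t)}$$

And for $B$, we want the total expectation of emitting symbol $o$ at state $q$ regardless of time or destination, divided by the same total:

$$B'(q, o) := \frac{\sum_{t = 1}^{\max |w|} \sum_{q' \in \Sigma} SC(q, q', o, t)}{\sum_{t = 1}^{\max |w|} \sum_{o' \in \Delta} \sum_{q' \in \Sigma} SC(q, q', o', t)}$$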

In Python:

In [7]:
def maximization(hmm):
		# Reset Pi

		# Print if flagged
		if print_flag:
			print "Distribution \\pi:"
		
		for from_state in hmm.hidden_states:
			# Print if flagged
			if print_flag:
				print "    For state " + from_state.__str__() + " was    " + hmm.Pi[from_state].__str__()
				print "Recomputing... \n"

			softcount_i = sum([hmm.softcount[(0, char, from_state, to_state)] for char in hmm.output_states for to_state in hmm.hidden_states if (0, char, from_state, to_state) in hmm.softcount])

			hmm.Pi[from_state] = 1/float(hmm.word_count) * softcount_i 
		
			# Print if flagged
			if print_flag:
				print "    For state " + from_state.__str__() + " is now " + hmm.Pi[from_state].__str__() + "\n"


		# For each (i, j), assign A_{i, j}
		
		# Print if flagged
		if print_flag:
			print "\nDistribution A:"
		for from_state in hmm.hidden_states:
			
			# Print if flagged
			if print_flag:
				print "\n    From state " + from_state.__str__() + ":\n"
			
			a_denom = sum([hmm.softcount[(t, l, from_state, k)] for t in range(hmm.longest) for l in hmm.output_states for k in hmm.hidden_states if (t, l, from_state, k) in hmm.softcount])

			if print_flag:
				print "    Computed the denominator at state i = " + from_state.__str__() + " (sum over hidden_states, output_states, and t): " + a_denom.__str__() + "\n"

			for to_state in hmm.hidden_states:
				a_num = sum([hmm.softcount[(t, l, from_state, to_state)] for t in range(hmm.longest) for l in hmm.output_states if (t, l, from_state, to_state) in hmm.softcount])

				# Print if flagged
				if print_flag:
					print "\n        Computed the numerator at states i = " + from_state.__str__() + "; j = " + to_state.__str__() + " (sum over output_states and t): " + a_num.__str__() + "\n"
					print "        To state " + to_state.__str__() + " was    " + hmm.A[(from_state, to_state)].__str__()

				hmm.A[(from_state, to_state)] = a_num / a_denom
				
				# Print if flagged
				if print_flag:
					print "        To state " + to_state.__str__() + " is now " + hmm.A[(from_state, to_state)].__str__()


		# For each (i, l), assign B_{i, l}
		
		# Print if flagged
		if print_flag:
			print "\nDistribution B:"
		for from_state in hmm.hidden_states:
			# Print if flagged
			if print_flag:
				print "\n    From state " + from_state.__str__() + ": "

			b_denom = sum([hmm.softcount[(t, m, from_state, j)] for t in range(hmm.longest) for m in hmm.output_states for j in hmm.hidden_states if (t, m, from_state, j) in hmm.softcount])

			# Print if flagged
			if print_flag:
				print "    Computed the denominator at state i = " + from_state.__str__() + " (sum over hidden_states, output_states, and t): " + b_denom.__str__() + "\n"

			for char in hmm.output_states:
				b_num = sum([hmm.softcount[(t, char, from_state, j)] for t in range(hmm.longest) for j in hmm.hidden_states if (t, char, from_state, j) in hmm.softcount])

				# Print if flagged
				if print_flag:
					print "\n        Computed the numerator at states i = " + from_state.__str__() + "; \\ell = " + char + " (sum over output_states and t): " + b_num.__str__() + "\n"
					print "        To state " + char + " was    " + hmm.B[(from_state, char)].__str__()

				hmm.B[(from_state, char)] = b_num / b_denom

				# Print if flagged
				if print_flag:
					print "        To state " + char + " is now " + hmm.B[(from_state, char)].__str__()

		# Print if flagged
		if print_flag:
			print "\n\n--\n"
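
Stripped of the printing, the update amounts to three normalized sums over the softcount table. The sketch below restates it compactly in Python 3 syntax. It is an illustration, not the code used for the results in this post: `reestimate` is a hypothetical helper, the softcounts are hand-made for a one-word corpus, and the key layout `(t, symbol, from_state, to_state)` is borrowed from the `expectation` function above.

```python
from collections import defaultdict

def reestimate(softcount, states, alphabet, word_count):
    """Turn softcounts into new (pi, A, B); assumes every state has some flow."""
    out_of = defaultdict(float)   # total flow leaving each state
    trans = defaultdict(float)    # flow per (from_state, to_state)
    emit = defaultdict(float)     # flow per (from_state, symbol)
    start = defaultdict(float)    # flow leaving each state at time 0
    for (t, o, q1, q2), v in softcount.items():
        out_of[q1] += v
        trans[(q1, q2)] += v
        emit[(q1, o)] += v
        if t == 0:
            start[q1] += v
    pi = {q: start[q] / word_count for q in states}
    A = {(q1, q2): trans[(q1, q2)] / out_of[q1]
         for q1 in states for q2 in states}
    B = {(q1, o): emit[(q1, o)] / out_of[q1]
         for q1 in states for o in alphabet}
    return pi, A, B

# Hand-made softcounts for a corpus containing the single word "ab".
states = [0, 1]
alphabet = ['a', 'b']
softcount = {
    (0, 'a', 0, 0): 0.1, (0, 'a', 0, 1): 0.5,
    (0, 'a', 1, 0): 0.3, (0, 'a', 1, 1): 0.1,
    (1, 'b', 0, 0): 0.2, (1, 'b', 0, 1): 0.2,
    (1, 'b', 1, 0): 0.4, (1, 'b', 1, 1): 0.2,
}

new_pi, new_A, new_B = reestimate(softcount, states, alphabet, word_count=1)
print(new_pi)
print(new_A)
print(new_B)
```

Because each numerator is a slice of its denominator, every row of the new `A` and `B` sums to one, and the new `pi` sums to one because each word contributes exactly one unit of softcount at time 0.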

Now, having done this, we can finally run the HMM on the corpus and watch it learn.

## Clustering Vowels and Consonants

We're going to declare a two-state HMM and see if the clustering agrees with our intuition that English phonemes are either vowels or consonants. The reason that this should line up is essentially the salience of the categorization: by learning these categories without supervision, we can show something computationally that the formal linguist does not have the tools to show. We can show that the corpus itself naturally presents a dichotomy between two classes of character. If it does, the statistical analysis validates the grammatical model that predicted that outcome.

To do this, we take the HMM defined above (which already has two states) and we'll run it over the corpus provided until the plog sum converges. We'll also cap the number of iterations at 200, just to be safe. Additionally, because we only have two states, we can conveniently have the model print out the smoothed log of the ratio of probabilities between the two states per character. That is, we can easily see toward which of the two clusters each symbol skews. (If we had, say, twenty states trying to pick out all the English phonemes, then each symbol would have twenty different values and we'd have to choose a different representation.)
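
The smoothed log ratio is simple enough to show in isolation. The sketch below is in Python 3 syntax, uses an invented three-letter emission table, and assumes the same smoothing constant (0.001) and base-2 logarithm as the training code in this post; `log_emission_ratios` is a name introduced here for illustration.

```python
from math import log2

def log_emission_ratios(B, alphabet, smoothing=0.001):
    """Sorted (log2 ratio, symbol) pairs comparing state 0 against state 1."""
    return sorted(
        (log2((B[(0, ch)] + smoothing) / (B[(1, ch)] + smoothing)), ch)
        for ch in alphabet)

# Made-up emissions: state 0 prefers 'a' and 'e', state 1 prefers 't'.
alphabet = ['a', 'e', 't']
B = {(0, 'a'): 0.5, (0, 'e'): 0.5, (0, 't'): 0.0,
     (1, 'a'): 0.0, (1, 'e'): 0.1, (1, 't'): 0.9}

ratios = log_emission_ratios(B, alphabet)
for ratio, ch in ratios:
    print("    {}: {}".format(ch, ratio))
```

Symbols that state 0 prefers come out positive and symbols that state 1 prefers come out negative. The smoothing term is what keeps the ratio finite when one state assigns a symbol zero probability, as with `'t'` and `'a'` here.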

In [8]:
from math import fabs, log

print_flag = False
verbose_flag = False

states = set([0, 1])
my_hmm = HMM(states, "english1000.txt")

def train_hmm(hmm):
	# Use the hmm argument throughout rather than the global my_hmm
	plog_sum = expectation(hmm)
	delta = plog_sum
	i = 0


	# Run until the plog doesn't change very much
	while delta > 0.001 and i < 200:
		i += 1

		print "\n\n--\nITERATION " + i.__str__() + ":\n"

		# Run E-M
		maximization(hmm)
		new_plog = expectation(hmm)

		# Consider the change in plog
		delta = fabs(new_plog - plog_sum)
		plog_sum = new_plog
		print "\n\n--\n\nplog sum at iteration " + i.__str__() + ": " + plog_sum.__str__()
		print "\\Delta = " + delta.__str__()
		print "\n\nLog emission ratios ((log(B(l, 0) + 0.001) / (B(l, 1) + 0.001))):\n"
		ratios = sorted([(log((hmm.B[(0, char)] + 0.001) / (hmm.B[(1, char)] + 0.001), 2), char) for char in hmm.output_states])
		for (ratio, char) in ratios:
			print "    " + char + ": " + ratio.__str__()
		print ""
		print "\n\n--"
	print "HMM terminated after " + i.__str__() + " iterations; total plog = " + plog_sum.__str__()
    
train_hmm(my_hmm)



--
ITERATION 1:



--

plog sum at iteration 1: 24769.9347726
\Delta = 3567.27190067


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    m: -2.07355745683
    f: -1.70330888383
    a: -1.70047218066
    t: -1.42132196971
    d: -1.16926276393
    w: -0.932060509469
    n: -0.810630406917
    e: -0.790257413747
    b: -0.730049249451
    z: -0.639185416411
    s: -0.416492568457
    .: -0.311603497325
    l: -0.0657754096944
    c: 0.00586792381009
    r: 0.249901699225
    ': 0.397403387684
    h: 0.407491435826
    j: 0.549434948663
    x: 0.606694582674
    #: 0.628288925848
    q: 0.81119819473
    p: 0.816042036028
    o: 0.859964927478
    u: 0.99456693311
    i: 1.88009239888
    y: 1.91294894612
    k: 2.2662246555
    v: 2.35789884789
    g: 2.37686465041



--


--
ITERATION 2:



--

plog sum at iteration 2: 24733.771018
\Delta = 36.1637546469


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    m: -2.19654686693
    f: -1.999680



--

plog sum at iteration 13: 24016.4421032
\Delta = 24.264031347


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.41751672669
    b: -5.27666471251
    f: -5.21217629777
    s: -4.9307471502
    m: -3.96071662735
    c: -3.66720058772
    p: -3.00623890868
    j: -2.23173617898
    g: -1.86245690137
    q: -1.79271510687
    d: -1.61636376575
    h: -1.50073513667
    a: -1.16271229475
    t: -1.09307791418
    k: -0.111438108397
    l: -0.0457406146482
    .: 0.580511006361
    r: 0.681713256771
    y: 0.836411577479
    n: 1.04660150381
    z: 1.05084797707
    i: 1.27971607424
    e: 1.55902709001
    v: 1.61998720771
    o: 1.66299023528
    u: 1.76363149183
    x: 2.12498583581
    ': 3.08940010203
    #: 5.30959473515



--


--
ITERATION 14:



--

plog sum at iteration 14: 23996.413029
\Delta = 20.0290742225


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.49858767215
    b: -5.3118379568
    f: -5.27681368763
 



--

plog sum at iteration 25: 23912.7817162
\Delta = 1.66811715301


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -5.82894059192
    w: -5.74007473047
    b: -5.47509679633
    f: -5.46470011237
    c: -4.61834764271
    m: -3.75177276469
    g: -3.28645903645
    d: -2.93374047851
    p: -2.70537768865
    j: -2.39328282332
    q: -1.93948093364
    t: -0.947629894451
    h: -0.311284015171
    k: -0.281661200387
    y: -0.148619714591
    l: 0.165925844229
    a: 0.352815555719
    r: 0.482031039179
    .: 0.614323351992
    v: 0.91439137742
    n: 1.1936548899
    z: 1.21299753302
    i: 1.69978620307
    e: 2.14373374027
    x: 2.15468631618
    o: 2.17360541494
    u: 2.19959650148
    ': 3.36366540338
    #: 8.03445114772



--


--
ITERATION 26:



--

plog sum at iteration 26: 23911.4296478
\Delta = 1.35206833465


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -5.76648376986
    w: -5.7408066088
    b: -5.4757040577



--

plog sum at iteration 37: 23901.258595
\Delta = 0.739416445162


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.72770207576
    b: -5.46246071873
    s: -4.76889984901
    c: -4.63374954068
    f: -4.03769223634
    g: -3.94543685664
    d: -3.91665464142
    m: -3.22346951467
    p: -2.3925535219
    j: -2.38282456973
    q: -1.92993455187
    t: -1.13715509493
    y: -0.793973881405
    k: -0.25657128046
    h: -0.054844641163
    l: 0.234318430086
    r: 0.326338522022
    a: 0.471006313339
    v: 0.502717106471
    .: 0.617195866544
    n: 1.180367526
    z: 1.22220770112
    i: 1.76372065292
    x: 2.16037676283
    e: 2.16240761713
    o: 2.23652007676
    u: 2.25761404514
    ': 3.37060477983
    #: 8.06144999441



--


--
ITERATION 38:



--

plog sum at iteration 38: 23900.6267368
\Delta = 0.631858117049


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.72664486534
    b: -5.46140725358
    s: -4.68389250339



--

plog sum at iteration 50: 23897.4577717
\Delta = 0.1050707738


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.71296196153
    b: -5.44777715064
    c: -4.336667195
    d: -4.20995314444
    g: -4.12130631624
    s: -4.06497194204
    f: -3.408624476
    m: -3.09465748351
    j: -2.37069163598
    p: -2.34681526293
    q: -1.91886709363
    t: -1.32171324556
    y: -1.31430674436
    k: -0.312663059741
    h: 0.059284023624
    v: 0.146455486087
    r: 0.188524118252
    l: 0.267031949233
    a: 0.526349848024
    .: 0.62021047796
    n: 1.1279887384
    z: 1.22778900193
    i: 1.78946603653
    e: 2.16451267329
    x: 2.16708397091
    o: 2.26679255321
    u: 2.27222646588
    ': 3.37840648997
    #: 8.07008936199



--


--
ITERATION 51:



--

plog sum at iteration 51: 23897.3656276
\Delta = 0.0921441496175


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.71175610615
    b: -5.4465760253
    c: -4.31541070058
    



--

plog sum at iteration 62: 23896.8111649
\Delta = 0.0311915611601


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.69876193951
    b: -5.43363308918
    d: -4.25352333361
    c: -4.16333391447
    g: -4.16247963771
    s: -3.93535520768
    f: -3.33362110326
    m: -3.08822898309
    j: -2.35902484153
    p: -2.34089027736
    q: -1.90823257456
    y: -1.57738072663
    t: -1.40631039258
    k: -0.365064292835
    v: -0.0498165042606
    h: 0.0952115870335
    r: 0.11628994391
    l: 0.2801071795
    a: 0.549277700268
    .: 0.623170361081
    n: 1.09027805441
    z: 1.23277607131
    i: 1.80415630646
    e: 2.17062863141
    x: 2.17366011587
    u: 2.27957816125
    o: 2.28151229378
    ': 3.38604996543
    #: 8.07851234501



--


--
ITERATION 63:



--

plog sum at iteration 63: 23896.7818627
\Delta = 0.029302259547


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.69762366912
    b: -5.43249932861
    d: -4.2564310



--

plog sum at iteration 74: 23896.5306417
\Delta = 0.0195699565629


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.68565715462
    b: -5.42058044055
    d: -4.29398171281
    g: -4.23158704458
    c: -4.11818519521
    s: -3.93169489043
    f: -3.3177169785
    m: -3.12115908234
    p: -2.34906502212
    j: -2.34827613177
    q: -1.89844166409
    y: -1.72799306378
    t: -1.45855881079
    k: -0.400185022529
    v: -0.173203556406
    r: 0.0689313656176
    h: 0.110952165072
    l: 0.284045651648
    a: 0.564151138765
    .: 0.625954796678
    n: 1.06429543525
    z: 1.237369938
    i: 1.81653315926
    e: 2.17816320104
    x: 2.17983167257
    u: 2.28680845662
    o: 2.29163203219
    ': 3.39321805784
    #: 8.08640729329



--


--
ITERATION 75:



--

plog sum at iteration 75: 23896.5114581
\Delta = 0.0191836052581


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.68461743852
    b: -5.41954488111
    d: -4.2978766



--

plog sum at iteration 86: 23896.3181682
\Delta = 0.0165595560975


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.67364026187
    b: -5.40861177236
    d: -4.34389932481
    g: -4.32375996814
    c: -4.12405967045
    s: -3.96239215908
    f: -3.32023475334
    m: -3.16645311617
    p: -2.36225674779
    j: -2.33843519572
    q: -1.88948335352
    y: -1.83355449425
    t: -1.49891617496
    k: -0.425103638851
    v: -0.265631297484
    r: 0.0321281623861
    h: 0.120026118095
    l: 0.284904679707
    a: 0.57622843539
    .: 0.628553895716
    n: 1.04455619541
    z: 1.2416057588
    i: 1.82792715337
    x: 2.18557942672
    e: 2.18589879994
    u: 2.29421150396
    o: 2.30000063338
    ': 3.39988947431
    #: 8.09375163819



--


--
ITERATION 87:



--

plog sum at iteration 87: 23896.3017632
\Delta = 0.0164049494851


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.67268042782
    b: -5.40765580747
    d: -4.348273



--

plog sum at iteration 98: 23896.1294872
\Delta = 0.0151518024395


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.66247759713
    b: -5.39749426394
    g: -4.42678654667
    d: -4.39722481745
    c: -4.14654533615
    s: -4.00453551113
    f: -3.32897570858
    m: -3.21561425507
    p: -2.37692935493
    j: -2.32930713966
    y: -1.92102094089
    q: -1.88117889786
    t: -1.5344082412
    k: -0.446020610922
    v: -0.344957875879
    r: -8.16309575381e-05
    h: 0.126440121232
    l: 0.284639109156
    a: 0.587143053166
    .: 0.631008455671
    n: 1.02746536323
    z: 1.24555167669
    i: 1.83886908887
    x: 2.19099606093
    e: 2.19365725558
    u: 2.30178217958
    o: 2.30776373505
    ': 3.40617266057
    #: 8.100665469



--


--
ITERATION 99:



--

plog sum at iteration 99: 23896.1144161
\Delta = 0.0150711427195


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.66157958378
    b: -5.39659989847
    g: -4.4356



--

plog sum at iteration 110: 23895.952524
\Delta = 0.014516975265


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.65198071392
    b: -5.38704016241
    g: -4.53554131448
    d: -4.45122375494
    c: -4.17402070053
    s: -4.0512696452
    f: -3.34010368934
    m: -3.26623294991
    p: -2.39197767162
    j: -2.32073528582
    y: -2.0004580808
    q: -1.87338478679
    t: -1.56764014271
    k: -0.465553276781
    v: -0.418464706567
    r: -0.0301587124637
    h: 0.131577262774
    l: 0.283854136605
    a: 0.597593680517
    .: 0.633352754459
    n: 1.01139282089
    z: 1.24923362994
    i: 1.84979980992
    x: 2.19615902147
    e: 2.20159087855
    u: 2.30981513632
    o: 2.31557800287
    ': 3.41215808592
    #: 8.10724884888



--


--
ITERATION 111:



--

plog sum at iteration 111: 23895.9380253
\Delta = 0.0144986493469


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.651131566
    b: -5.38619448889
    g: -4.544805



--

plog sum at iteration 122: 23895.7778934
\Delta = 0.0147406679753


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.6420197075
    b: -5.37712004128
    g: -4.64867859672
    d: -4.50544692284
    c: -4.20284993029
    s: -4.10073350038
    f: -3.35257658163
    m: -3.31797568511
    p: -2.40716999359
    j: -2.31261163426
    y: -2.0761020263
    q: -1.8660020987
    t: -1.59978641127
    v: -0.489926676814
    k: -0.484751384385
    r: -0.0593131809019
    h: 0.136032189521
    l: 0.282741236344
    a: 0.608083588738
    .: 0.635610454505
    n: 0.995545289532
    z: 1.25260373485
    i: 1.86132113262
    x: 2.2011217522
    e: 2.21008873417
    u: 2.31904849967
    o: 2.32415137608
    ': 3.41790817573
    #: 8.11357081922



--


--
ITERATION 123:



--

plog sum at iteration 123: 23895.7630822
\Delta = 0.0148111712952


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.6412110456
    b: -5.37631471092
    g: -4.6583



--

plog sum at iteration 134: 23895.5922231
\Delta = 0.0163846133728


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.63251982137
    b: -5.36765941054
    g: -4.76646872937
    d: -4.56051911125
    c: -4.23215102094
    s: -4.15296257163
    m: -3.37143390579
    f: -3.36649876112
    p: -2.42262318504
    j: -2.30487370614
    y: -2.15042114988
    q: -1.85897348577
    t: -1.63160636514
    v: -0.56275035022
    k: -0.504142297744
    r: -0.088468608453
    h: 0.140207986679
    l: 0.281349989782
    a: 0.619269825569
    .: 0.637794260547
    n: 0.979316403719
    z: 1.25544581007
    i: 1.87451027863
    x: 2.20591322369
    e: 2.21995192305
    u: 2.33103274226
    o: 2.33468752314
    ': 3.42345686428
    #: 8.11966898738



--


--
ITERATION 135:



--

plog sum at iteration 135: 23895.5756016
\Delta = 0.0166215421159


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.63174816666
    b: -5.36689095578
    g: -4.7



--

plog sum at iteration 146: 23895.3701666
\Delta = 0.0209373165671


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.62347581503
    b: -5.35865301576
    g: -4.89007093522
    d: -4.61791116163
    c: -4.2622151217
    s: -4.20920308395
    m: -3.42809441659
    f: -3.38288373997
    p: -2.43871958589
    j: -2.29751590365
    y: -2.22578242044
    q: -1.85229336538
    t: -1.66388561034
    v: -0.642199921842
    k: -0.52416245403
    r: -0.118827122593
    h: 0.14459635316
    l: 0.279651727252
    a: 0.632361590232
    .: 0.639901587855
    n: 0.961865096374
    z: 1.25704331666
    i: 1.89156351765
    x: 2.21052870339
    e: 2.23289369845
    u: 2.34906730201
    o: 2.34963009257
    ': 3.42879900034
    #: 8.12553795881



--


--
ITERATION 147:



--

plog sum at iteration 147: 23895.3486266
\Delta = 0.0215400337729


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.62274504675
    b: -5.35792529682
    g: -4.900



--

plog sum at iteration 158: 23895.0553337
\Delta = 0.0322732383611


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.61500108322
    b: -5.35021374178
    g: -5.0215556833
    d: -4.68051704397
    c: -4.29435356595
    s: -4.27244828065
    m: -3.49119180841
    f: -3.40430556444
    p: -2.45629514375
    y: -2.30566887903
    j: -2.29062905714
    q: -1.8460436728
    t: -1.69782153014
    v: -0.738947406394
    k: -0.545432315428
    r: -0.152581694762
    h: 0.150062194015
    l: 0.277507236652
    .: 0.641901802287
    a: 0.649929994366
    n: 0.941538318197
    z: 1.25485995582
    i: 1.91728858587
    x: 2.21490217888
    e: 2.25278091582
    o: 2.37431645545
    u: 2.38044657922
    ': 3.43385855351
    #: 8.13109450744



--


--
ITERATION 159:



--

plog sum at iteration 159: 23895.0215618
\Delta = 0.0337718385272


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.61432894904
    b: -5.34954442867
    g: -5.03



--

plog sum at iteration 170: 23894.507655
\Delta = 0.061118008085


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.60744169807
    b: -5.3426861602
    g: -5.16419450729
    d: -4.75442907416
    s: -4.3494437175
    c: -4.33180345544
    m: -3.56825176578
    f: -3.43696167444
    p: -2.47722055964
    y: -2.39644987113
    j: -2.28449245851
    q: -1.84047714671
    t: -1.73567889863
    v: -0.877563208616
    k: -0.569161328362
    r: -0.194391687864
    h: 0.158344900598
    l: 0.274497234182
    .: 0.643707181013
    a: 0.67786671586
    n: 0.914558203138
    z: 1.23803841614
    i: 1.96273715813
    x: 2.21884348152
    e: 2.28885908778
    o: 2.42095731019
    u: 2.44188137013
    ': 3.4384160805
    #: 8.13609808078



--


--
ITERATION 171:



--

plog sum at iteration 171: 23894.4426161
\Delta = 0.0650388767972


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.60687323198
    b: -5.34212009245
    g: -5.176709



--

plog sum at iteration 182: 23893.341022
\Delta = 0.140337193745


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.60155162328
    b: -5.33682098118
    g: -5.32139675612
    d: -4.85321546866
    s: -4.45534819364
    c: -4.3825890512
    m: -3.67826822798
    f: -3.49604640186
    y: -2.51015442115
    p: -2.50573057579
    j: -2.27971518237
    q: -1.83614519085
    t: -1.78209375457
    v: -1.11997657592
    k: -0.598058873706
    r: -0.254789708982
    h: 0.172959960844
    l: 0.269335813029
    .: 0.645127954346
    a: 0.730383493854
    n: 0.871538651355
    z: 1.14526894457
    i: 2.05411662055
    x: 2.2219410608
    e: 2.36484441559
    o: 2.51843193199
    u: 2.57195469665
    ': 3.44199660351
    #: 8.14002794719



--


--
ITERATION 183:



--

plog sum at iteration 183: 23893.1893627
\Delta = 0.15165938827


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.60116901928
    b: -5.33643999786
    g: -5.3351344



--

plog sum at iteration 194: 23890.3458171
\Delta = 0.383294691343


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.59815129754
    g: -5.48711940586
    b: -5.33343507387
    d: -5.00675526889
    s: -4.62072151349
    c: -4.46694677178
    m: -3.87326584777
    f: -3.61776582023
    y: -2.66422686578
    p: -2.54989670832
    j: -2.27695895176
    t: -1.84631657912
    q: -1.8336464949
    v: -1.63581955526
    k: -0.638619483157
    r: -0.357584370827
    h: 0.199974766517
    l: 0.257698675726
    z: 0.595481048824
    .: 0.645953839502
    n: 0.787914414645
    a: 0.842551029376
    x: 2.22374000831
    i: 2.25281682446
    e: 2.54735489124
    o: 2.73539125819
    u: 2.84971191126
    ': 3.4440754754
    #: 8.14230921364



--


--
ITERATION 195:



--

plog sum at iteration 195: 23889.9285885
\Delta = 0.417228596951


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.59795895437
    g: -5.50036060471
    b: -5.3332

Now, sometimes this works and sometimes it fails. Running the example above, you may instead see something like the following:

    plog sum at iteration 139: 23815.8468722
    \Delta = 0.000971715664491

    Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    #: -8.41894700151
    e: -7.7308759155
    a: -6.95966224615
    o: -6.92463221197
    i: -6.49724116567
    u: -5.84491933561
    ': -3.69822119505
    .: -0.752382788926
    h: 1.37603167112
    z: 1.42733351564
    l: 1.54862067232
    q: 1.59807453731
    j: 2.01474269367
    x: 2.43100197023
    v: 4.07734740352
    k: 4.80861729702
    b: 5.00390111339
    f: 5.04877419887
    w: 5.26703814239
    g: 5.35304104853
    y: 5.36509377385
    p: 5.38580307633
    m: 5.52169396818
    c: 5.78147983326
    d: 5.9181502138
    t: 6.10605486044
    n: 6.65363342073
    s: 6.66808653806
    r: 6.88569341469

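Training is not guaranteed to avoid local extrema, because this sort of algorithm, called expectation-maximization, is greedy: each iteration only moves to parameters that fit the corpus at least as well as the current ones, so where it ends up depends on the random initialization.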

Nonetheless, we now have a way to actually tune our probability distributions. After training, we can go back and look at the probabilities of strings that actually occurred in the corpus. Hopefully they'll be significantly higher than they were before training:

In [19]:
states = set([0, 1])
my_hmm = HMM(states, "english1000.txt")

verbose_flag = True
forward(my_hmm, "worry")

verbose_flag = False
train_hmm(my_hmm)

verbose_flag = True
forward(my_hmm, "worry")

==   Initialization   ==

--

Creating state 0


Transitions:

    To state     0:    0.486194075109
    To state     1:    0.513805924891

Total: 1.0


Emissions:

    To letter    #:    0.0297224936385
    To letter    ':    0.00142852678354
    To letter    .:    0.0191907394081
    To letter    a:    0.00700679548795
    To letter    b:    0.0185892893014
    To letter    c:    0.0567774342881
    To letter    d:    0.037662060215
    To letter    e:    0.0260544196092
    To letter    f:    0.0340724015612
    To letter    g:    0.0456450911651
    To letter    h:    0.00282788782721
    To letter    i:    0.0283177794759
    To letter    j:    0.0408258851458
    To letter    k:    0.0649689028612
    To letter    l:    0.0626402336681
    To letter    m:    0.0424325405599
    To letter    n:    0.0305954127592
    To letter    o:    0.0542050764227
    To letter    p:    0.0625922344693
    To letter    q:    0.0319872783065
    To letter    r:    0.0251994740534
    To letter 



--

plog sum at iteration 4: 24765.4156269
\Delta = 14.7751048891


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    u: -4.39300704393
    y: -3.19979521264
    ': -2.95735366756
    h: -2.89244792684
    b: -1.6555165689
    a: -1.21661179674
    r: -1.06317101456
    n: -0.95081060293
    j: -0.665032306175
    m: -0.531083606057
    f: -0.516872104609
    d: -0.447127609794
    i: -0.308885002107
    v: -0.188815063805
    e: -0.178556117614
    x: -0.136854500312
    w: -0.133747490014
    t: -0.0434908295236
    c: -0.0391682113362
    q: 0.0108136230344
    l: 0.101067255537
    .: 0.170237291835
    z: 0.221443248439
    k: 0.278663163939
    g: 0.69683591713
    s: 0.909440573551
    p: 1.28705610957
    #: 2.33139764595
    o: 3.95823902018



--


--
ITERATION 5:



--

plog sum at iteration 5: 24741.0123367
\Delta = 24.403290124


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    u: -4.34785708181
    y: -3.32956998072
    h: 



--

plog sum at iteration 16: 23948.6254352
\Delta = 30.2971773427


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.40253726041
    b: -5.14789029032
    f: -4.78802005087
    y: -4.60484580572
    m: -4.51674656036
    s: -4.14328563891
    d: -3.9825737986
    p: -2.98129323732
    c: -2.88901870959
    g: -2.2553292265
    k: -2.15325244565
    j: -2.12994951904
    h: -1.76442277475
    q: -1.70103784185
    t: -1.4784453775
    v: -1.4701177335
    r: -0.975081755264
    n: -0.557811680663
    .: -0.113979554635
    l: 0.0257612224858
    z: 0.680871430269
    a: 0.950482983949
    u: 0.995524900023
    x: 1.68914154044
    e: 2.34652245816
    i: 2.58672270365
    ': 2.88604463808
    o: 4.74564734105
    #: 6.46039783359



--


--
ITERATION 17:



--

plog sum at iteration 17: 23926.6070321
\Delta = 22.018403111


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    w: -5.41445216056
    b: -5.15395869163
    f: -4.867234910



--

plog sum at iteration 28: 23871.270372
\Delta = 1.04287036787


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.08753834203
    d: -5.68796583221
    m: -5.63491382471
    w: -5.4396261743
    g: -5.22694358332
    b: -5.17562111104
    f: -5.09748197582
    y: -4.85944523067
    v: -4.1340667691
    c: -3.94358869564
    p: -3.25232069311
    t: -2.69333962403
    j: -2.14985942412
    k: -2.10256299022
    q: -1.71891917985
    r: -1.52085197017
    z: -1.22402119178
    n: -0.385372662752
    l: -0.0346525312514
    h: 0.0151476823193
    .: 0.0701386473599
    a: 1.49873757952
    x: 1.74425223726
    i: 3.31716008824
    e: 3.39951320457
    u: 3.49913235041
    ': 3.55161153869
    o: 4.18205232636
    #: 8.21572320787



--


--
ITERATION 29:



--

plog sum at iteration 29: 23870.363228
\Delta = 0.907144057419


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.16865493755
    d: -5.74627255692
    m: -5.65195549



--

plog sum at iteration 40: 23864.1327577
\Delta = 0.438399926083


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.68406042002
    d: -6.03796786804
    m: -5.69418969576
    g: -5.51807883846
    w: -5.43959699403
    f: -5.19837597647
    b: -5.17559208376
    y: -4.8102283513
    c: -4.65921402731
    v: -4.24409670591
    p: -3.39436655981
    t: -2.93670744291
    j: -2.14983622117
    k: -1.90379628612
    r: -1.83327335247
    q: -1.71889832642
    z: -1.50383696683
    n: -0.657940007023
    l: -0.138617592008
    h: 0.0333163579561
    .: 0.223096506575
    x: 1.13509247677
    a: 1.92629658076
    ': 3.55166821402
    e: 4.00792890102
    i: 4.02518643848
    u: 4.29291614317
    o: 5.16412410513
    #: 8.25921918063



--


--
ITERATION 41:



--

plog sum at iteration 41: 23863.7042756
\Delta = 0.428482036783


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.70549280928
    d: -6.04601473322
    m: -5.693702



--

plog sum at iteration 52: 23859.1339258
\Delta = 0.434027382471


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.80559521876
    d: -6.06995909597
    m: -5.67914795977
    g: -5.50990781151
    w: -5.42385479756
    f: -5.20346446945
    c: -5.18388805214
    b: -5.15992485044
    y: -4.79359397407
    v: -4.22907976156
    p: -3.57962943463
    t: -3.08324787967
    r: -2.20268355336
    j: -2.1373678543
    k: -1.92728289446
    q: -1.70769742346
    z: -1.52893592772
    n: -1.05051627958
    l: -0.264913450862
    h: -0.0330254346239
    .: 0.363383732948
    x: 0.486301578285
    a: 2.32031210641
    ': 3.5636201509
    e: 4.58027605079
    i: 4.75018372668
    u: 4.95028384617
    o: 6.14766223018
    #: 8.27295685337



--


--
ITERATION 53:



--

plog sum at iteration 53: 23858.6898127
\Delta = 0.44411308553


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.80723747668
    d: -6.06899309598
    m: -5.6771490



--

plog sum at iteration 64: 23852.5863256
\Delta = 0.687534433695


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.793765426
    d: -6.0441528215
    m: -5.64785247989
    c: -5.51148550994
    g: -5.47886141377
    w: -5.39267764653
    f: -5.17384760947
    b: -5.12889858691
    y: -4.83913792027
    v: -4.19887138269
    p: -3.81719594118
    t: -3.30555032815
    r: -2.72640367095
    k: -2.16576682578
    j: -2.11275841299
    q: -1.6856186826
    n: -1.61492786649
    z: -1.50959234228
    l: -0.434738807601
    h: -0.168728705724
    x: -0.108692433037
    .: 0.526059904755
    a: 2.78655449094
    ': 3.58805993461
    e: 5.20886774008
    u: 5.47160619229
    i: 5.56321563148
    o: 6.69370350023
    #: 8.29956274604



--


--
ITERATION 65:



--

plog sum at iteration 65: 23851.862369
\Delta = 0.723956552607


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.79078409225
    d: -6.04092881706
    m: -5.644519830



--

plog sum at iteration 76: 23840.8946841
\Delta = 1.24073033082


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.74909986675
    d: -5.99862637181
    c: -5.66815857047
    m: -5.60186254282
    g: -5.43299425101
    w: -5.34687169108
    f: -5.1282690339
    b: -5.08332028725
    y: -4.90982749191
    v: -4.15452587356
    p: -4.14249578442
    t: -3.66411050685
    r: -3.57326842292
    k: -2.68038884755
    n: -2.62570095916
    j: -2.07680672055
    q: -1.65343430266
    z: -1.47932370039
    x: -0.864736169876
    l: -0.696846798253
    h: -0.423180742901
    .: 0.6708149202
    a: 3.50873991325
    ': 3.62594026335
    u: 5.72801615927
    e: 5.95474363386
    i: 6.23798110063
    o: 6.83867185569
    #: 8.34070123448



--


--
ITERATION 77:



--

plog sum at iteration 77: 23839.6123317
\Delta = 1.28235240256


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.744881763
    d: -5.99442692878
    c: -5.67546206264



--

plog sum at iteration 88: 23825.391432
\Delta = 1.09569147984


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.70328579642
    d: -5.95310273752
    c: -5.72398331801
    m: -5.55647719714
    g: -5.38772943257
    w: -5.30167373211
    f: -5.08326074559
    b: -5.03835412293
    y: -4.99087867408
    r: -4.88113981526
    p: -4.53322892118
    n: -4.5061959685
    t: -4.19131830012
    v: -4.11081363129
    k: -3.49214093359
    j: -2.04157507834
    x: -1.82332686727
    q: -1.62197685078
    z: -1.44976606717
    l: -1.06739005914
    h: -0.818998859045
    .: 0.731339750808
    ': 3.66582093381
    a: 4.64462645521
    u: 5.80757506348
    i: 6.45124928441
    e: 6.72548783732
    o: 6.8894877515
    #: 8.38391294321



--


--
ITERATION 89:



--

plog sum at iteration 89: 23824.3723533
\Delta = 1.0190786997


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.70040215435
    d: -5.95023968389
    c: -5.7267077386
 



--

plog sum at iteration 100: 23818.0132599
\Delta = 0.287237006974


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.68020143024
    n: -6.27650640005
    r: -6.1744783353
    d: -5.93018205167
    c: -5.75045483572
    m: -5.53366350674
    g: -5.3649776942
    w: -5.27895647514
    y: -5.10823574355
    f: -5.06064084297
    b: -5.01575612646
    p: -4.86959873192
    t: -4.76065990181
    k: -4.24111542695
    v: -4.08886028655
    x: -2.34749025647
    j: -2.02395950373
    q: -1.60627940266
    z: -1.43503128292
    l: -1.36380730928
    h: -1.15776221153
    .: 0.746609969089
    ': 3.68688291892
    a: 5.75033540252
    u: 5.83268789186
    i: 6.48473939891
    o: 6.91243679884
    e: 7.23391361837
    #: 8.40669430935



--


--
ITERATION 101:



--

plog sum at iteration 101: 23817.7629486
\Delta = 0.25031130738


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    s: -6.67919020195
    n: -6.34793510483
    r: -6.2518518



--

plog sum at iteration 112: 23816.4262545
\Delta = 0.0587747620229


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.74400540186
    s: -6.67270404183
    n: -6.63479689922
    d: -5.92273639584
    c: -5.7670400106
    m: -5.52625590878
    g: -5.35759042075
    w: -5.27158051706
    y: -5.21031099605
    t: -5.20302066053
    p: -5.08972392868
    f: -5.05329682355
    b: -5.00841929396
    k: -4.61997402567
    v: -4.08173486946
    x: -2.42746061617
    j: -2.01825341469
    q: -1.6011991257
    l: -1.48781909588
    z: -1.43026467117
    h: -1.30694132319
    .: 0.750370383736
    ': 3.69387596538
    u: 5.84028353589
    a: 6.42505738971
    i: 6.4925752088
    o: 6.91996229113
    e: 7.48783022681
    #: 8.41425225269



--


--
ITERATION 113:



--

plog sum at iteration 113: 23816.3740371
\Delta = 0.0522173594654


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.76399932123
    s: -6.67238520143
    n: -6.63973



--

plog sum at iteration 124: 23816.0556935
\Delta = 0.0168312507194


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.86525848791
    s: -6.67020825326
    n: -6.65470615308
    d: -5.92025756736
    c: -5.77587403035
    m: -5.52379013027
    t: -5.51318341125
    g: -5.35513143068
    y: -5.27731006523
    w: -5.2691253065
    p: -5.21970789889
    f: -5.05085228042
    b: -5.0059771512
    k: -4.75368111757
    v: -4.07936333051
    x: -2.43227337586
    j: -2.01635551413
    q: -1.59950986512
    l: -1.52769765557
    z: -1.42867993828
    h: -1.35463284176
    .: 0.751488390107
    ': 3.6962209298
    u: 5.84278913881
    i: 6.49509755696
    a: 6.74729411937
    o: 6.92248270295
    e: 7.61265920385
    #: 8.41678598132



--


--
ITERATION 125:



--

plog sum at iteration 125: 23816.040303
\Delta = 0.0153904575673


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.8684881187
    s: -6.67008831274
    n: -6.6548286



--

plog sum at iteration 136: 23815.9336373
\Delta = 0.00640658605334


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.88358009618
    s: -6.66917500475
    n: -6.65467783484
    d: -5.9192313136
    c: -5.78001063739
    t: -5.73391324406
    m: -5.52276932115
    g: -5.35411343529
    y: -5.31863211792
    p: -5.29623215956
    w: -5.26810887768
    f: -5.04984027309
    b: -5.00496613878
    k: -4.79388347414
    v: -4.07838158105
    x: -2.43186856878
    j: -2.01557001986
    q: -1.59881079329
    l: -1.54067057055
    z: -1.42802415707
    h: -1.36898943079
    .: 0.75192867835
    ': 3.69719425467
    u: 5.84382597135
    i: 6.49614078603
    a: 6.8813359072
    o: 6.92352867997
    e: 7.67508817159
    #: 8.41783755929



--


--
ITERATION 137:



--

plog sum at iteration 137: 23815.9276722
\Delta = 0.0059650729163


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.88400796814
    s: -6.66911709254
    n: -6.6546



--

plog sum at iteration 148: 23815.8827638
\Delta = 0.00291889304572


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.8858010128
    s: -6.66862837995
    n: -6.65417341114
    d: -5.91868838797
    t: -5.89724274459
    c: -5.78177381889
    m: -5.5222292808
    g: -5.35357488429
    y: -5.34410133336
    p: -5.3424698706
    w: -5.2675711559
    f: -5.04930489166
    b: -5.00443128397
    k: -4.80507384037
    v: -4.07786221524
    x: -2.4314460159
    j: -2.01515452034
    q: -1.59844102522
    l: -1.54548272334
    z: -1.42767729515
    h: -1.37364688644
    .: 0.752157336691
    ': 3.69770977573
    u: 5.84437485897
    i: 6.4966931837
    o: 6.92408265719
    a: 6.93341020003
    e: 7.70652850952
    #: 8.41839450397



--


--
ITERATION 149:



--

plog sum at iteration 149: 23815.8800128
\Delta = 0.00275103375316


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.88583444481
    s: -6.66859389217
    n: -6.654139



--

plog sum at iteration 160: 23815.8580255
\Delta = 0.00151274286691


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.88584015382
    s: -6.66828267555
    n: -6.65382947051
    t: -6.02272003969
    d: -5.91834502351
    c: -5.78244593665
    m: -5.5218877419
    p: -5.37113500575
    y: -5.35997488662
    g: -5.3532342876
    w: -5.26723108378
    f: -5.0489663001
    b: -5.00409302557
    k: -4.80797911412
    v: -4.07753375544
    x: -2.43116323046
    j: -2.01489176359
    q: -1.59820719451
    l: -1.54763379383
    z: -1.42745795249
    h: -1.37541332261
    .: 0.75230122383
    ': 3.69803602296
    u: 5.84472219643
    i: 6.49704275683
    o: 6.92443323367
    a: 6.95312014271
    e: 7.72243107793
    #: 8.41874695784



--


--
ITERATION 161:



--

plog sum at iteration 161: 23815.8565864
\Delta = 0.00143910416591


Log emission ratios ((log(B_{l, 0} + 0.001) / (B_{l, 1} + 0.001))):

    r: -6.88582614026
    s: -6.66825931691
    n: -6.6538

{(0, 0): 0.15975842152914438,
 (0, 1): 0.023139506690383435,
 (0, 2): 0.0008521574738648654,
 (0, 3): 0.0001668518877502322,
 (0, 4): 5.182089563183865e-06,
 (0, 5): 5.612541192027499e-08,
 (1, 0): 0.8392415784708557,
 (1, 1): 0.008337853230005196,
 (1, 2): 0.0019357869278451285,
 (1, 3): 6.012170783173098e-05,
 (1, 4): 1.86726208993712e-06,
 (1, 5): 2.0265990520750236e-08}

I tend to get a value around $7.6 \cdot 10^{-8}$ as the output after training, but sometimes this is lower than the value before training. We would expect that to happen occasionally, namely when the random initialization happens to over-weight the word.
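
As a quick sanity check on that number, here is a minimal sketch. It assumes the dictionary printed above is the forward table keyed by `(state, position)`, so that the probability of the whole string is the sum of the entries in the final column. The variable names are made up for this illustration; the two values are copied from the output.

```python
# Final-column entries copied from the table printed above,
# assumed to be keyed by (state, position).
final_column = {
    (0, 5): 5.612541192027499e-08,
    (1, 5): 2.0265990520750236e-08,
}

# Summing over the hidden states gives the probability of the whole string.
string_probability = sum(final_column.values())
print(string_probability)
```

This comes to roughly $7.6 \cdot 10^{-8}$, which matches the value quoted above.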

## Viterbi Parsing

Now we can solve the third problem and use the solution to check whether the HMM has picked out a distinction that agrees with our intuitions about vowels and consonants. For this we use the Viterbi algorithm, which reconstructs the most likely sequence of hidden states to have produced a given observed string. It runs through the string $w$ and stores, for every position $t$ and every hidden state $q$, the probability of the best path that ends in $q$ at position $t$, together with the state that preceded $q$ on that path. In other words, define

$$P(q, 1) := \pi(q) B(q, w_1)$$

$$P(q, t) := \max_{q' \in \Sigma} P(q', t-1) A(q', q) B(q, w_t) \quad (t > 1)$$

And

$$Q(q, 1) := q$$

$$\DeclareMathOperator*{\argmax}{arg\,max}Q(q, t) := \argmax_{q' \in \Sigma} P(q', t-1) A(q', q) B(q, w_t) \quad (t > 1)$$

Then we start at the end of the string, where the most information has accumulated, pick the state $q$ with the highest $P(q, t)$, and follow the stored $Q$ entries backward. The algorithm is implemented in Python below:

In [None]:
def viterbi_parse(hmm, word):
    """Print and return the most likely hidden-state sequence for `word`."""
    path = [None] * len(word)

    # For each (state, position): the probability of the best path ending
    # there, and the previous state on that best path
    max_probability = dict()
    argmax_state = dict()

    # Initialize with the start distribution and the first emission
    for state in hmm.hidden_states:
        max_probability[(state, 0)] = hmm.Pi[state] * hmm.B[(state, word[0])]
        argmax_state[(state, 0)] = state

    # Moving forward, memoize for each state the best previous state
    # and the probability of the best path that arrives through it
    for i in range(1, len(word)):
        for state in hmm.hidden_states:
            func = lambda q: max_probability[(q, i-1)] * hmm.A[(q, state)] * hmm.B[(state, word[i])]
            max_probability[(state, i)] = max(map(func, hmm.hidden_states))
            argmax_state[(state, i)] = max(hmm.hidden_states, key=func)

    # Connect the path by backtracking from the best final state
    path[len(word) - 1] = max(hmm.hidden_states, key=(lambda q: max_probability[(q, len(word) - 1)]))
    for i in range(len(word) - 1, 0, -1):
        path[i-1] = argmax_state[(path[i], i)]

    # Parenthesized form works as a statement in Python 2 and a call in Python 3
    print("Viterbi parse: " + str(path))
    return path
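
To see the recursion at work without a trained model, here is a self-contained sketch on a hand-made two-state HMM. Everything in it is invented for illustration and is not part of the `HMM` class used above: the parameters `pi`, `A`, `B`, the two-letter alphabet, and the helper names `toy_viterbi` and `brute_force`. The brute-force function scores every possible state sequence, which is only feasible for very short strings, and serves as a check that the dynamic program finds the same best path.

```python
from itertools import product

# Hypothetical toy parameters: state 0 prefers 'a', state 1 prefers 'b',
# and the states tend to alternate.
states = [0, 1]
pi = {0: 0.5, 1: 0.5}
A = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.2}
B = {(0, 'a'): 0.9, (0, 'b'): 0.1, (1, 'a'): 0.1, (1, 'b'): 0.9}


def toy_viterbi(states, pi, A, B, word):
    """Return (best_path, probability_of_best_path) for a non-empty word."""
    best = dict((((q, 0), pi[q] * B[(q, word[0])]) for q in states))
    back = {}
    for t in range(1, len(word)):
        for q in states:
            # Best predecessor of q at position t
            prev = max(states, key=lambda p: best[(p, t - 1)] * A[(p, q)])
            best[(q, t)] = best[(prev, t - 1)] * A[(prev, q)] * B[(q, word[t])]
            back[(q, t)] = prev
    # Backtrack from the best final state
    last = max(states, key=lambda q: best[(q, len(word) - 1)])
    path = [last]
    for t in range(len(word) - 1, 0, -1):
        path.append(back[(path[-1], t)])
    path.reverse()
    return path, best[(last, len(word) - 1)]


def brute_force(states, pi, A, B, word):
    """Score every state sequence explicitly and return the best one."""
    def joint(path):
        p = pi[path[0]] * B[(path[0], word[0])]
        for t in range(1, len(word)):
            p *= A[(path[t - 1], path[t])] * B[(path[t], word[t])]
        return p
    best_path = max(product(states, repeat=len(word)), key=joint)
    return list(best_path), joint(best_path)


path, prob = toy_viterbi(states, pi, A, B, "bab")
print(path)  # [1, 0, 1]
print(brute_force(states, pi, A, B, "bab")[0] == path)
```

With these toy numbers the best parse of "bab" alternates between the two states, as one would hope from how the emissions were set up.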

We can now validate the HMM after training it. Try it!

In [None]:
states = set([0, 1])
my_hmm = HMM(states, "english1000.txt")

train_hmm(my_hmm)

print "\n\n--\n\nTesting Viterbi algorithm on the word 'Viterbi': "

viterbi_parse(my_hmm, "viterbi")

print "\n\n--\n\nAll right, have at!\n\n"

while True:
    word = raw_input("Enter word for Viterbi parse: ")
    viterbi_parse(my_hmm, word)

Notice that 0 and 1 won't consistently map to vowels and consonants. This is because the training is unsupervised and the random initialization gives no preference to either labeling. Additionally, sometimes every letter ends up in one cluster, and sometimes the split is badly scrambled. This has, of course, to do with the local extrema that training can get stuck in.
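
Because the labels can come out either way, one rough way to score a parse against the vowel/consonant split is to count agreement under both labelings and keep the better one. The sketch below is only an illustration. The function name `agreement_up_to_swap` is hypothetical, it assumes the Viterbi path is available as a list of 0s and 1s, and it treats the letters a, e, i, o, u as the vowels, which is an orthographic simplification.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels only, 'y' excluded


def agreement_up_to_swap(path, word):
    """Fraction of letters whose state matches the vowel/consonant split,
    under whichever of the two possible 0/1 labelings fits better."""
    if len(path) != len(word) or len(word) == 0:
        raise ValueError("path and word must be non-empty and equally long")
    # Count matches under the labeling "state 1 = vowel"
    matches = sum(1 for state, letter in zip(path, word)
                  if (state == 1) == (letter in VOWELS))
    # Under the swapped labeling, every mismatch becomes a match
    n = len(word)
    return max(matches, n - matches) / float(n)


# The same parse with the labels swapped scores identically.
print(agreement_up_to_swap([0, 1, 0, 1, 0, 0, 1], "viterbi"))  # 1.0
print(agreement_up_to_swap([1, 0, 1, 0, 1, 1, 0], "viterbi"))  # 1.0
```

A score near 1.0 means the two states line up with vowels and consonants in one direction or the other. A path that puts every letter in one cluster scores only as well as the majority class of the word.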