# EECS 738: Hidden Markov Models
#### Liam Ormiston & Patrick Canny
### Background
We thought it would be fun to see how well a Hidden Markov Model could generate a State of the Union Address. This is partly due to Liam's background in communications, and in speech writing in particular.

### Code
Let's first import the code.

<i>Note: we adapted our code from https://github.com/ashwinmj/word-prediction</i>

In [1]:
import hmm, tools, combine_all, combine_pres

<ul>
    <li>hmm: contains most of the logic for the hidden markov model</li>
    <li>tools: contains helper functions that support the hmm functions</li>
    <li>combine_all: gives us combined_all.txt, a file of all State of the Union Addresses</li>
    <li>combine_pres: contains functions to create and remove combined_pres.txt, a file of a certain president's addresses. If given nothing, default is set to President Woodrow Wilson.</li>
</ul>

In [2]:
from hmm import HMM
initial_model = HMM('combined_all.txt')

In [3]:
initial_model.train()

Training on: combined_all.txt
Done Training on: combined_all.txt


initial_model is trained on combined_all.txt, which contains every State of the Union address from 1790 through 2018. Our code looks at only two words at a time and tries to predict the third word.
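
To make the "two words in, third word out" idea concrete, here is a minimal sketch of that kind of counting. It is not the code in hmm.py; the names `train_trigrams` and `next_word` and the tiny `corpus` string are made up for illustration, and we assume simple whitespace tokenization.

```python
import random
from collections import Counter, defaultdict

def train_trigrams(text):
    """Count how often each word follows each pair of consecutive words."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        counts[(first, second)][third] += 1
    return counts

def next_word(counts, first, second, rng=random):
    """Sample a third word in proportion to how often it followed (first, second)."""
    followers = counts[(first, second)]
    candidates = list(followers)
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy text standing in for combined_all.txt (an assumption for this sketch).
corpus = "the state of the union is strong and the state of the economy is sound"
counts = train_trigrams(corpus)

print(counts[("of", "the")])           # words seen after "of the", with counts
print(next_word(counts, "of", "the"))  # one of those words, chosen at random
```

Generating a line is then just repeating `next_word` with the two most recent words until we decide to stop.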

Let's generate 5 lines and see how our model does!

In [4]:
initial_model.generate(5)

ending on the part of the immense sphere of national as well due
public opinion in favor of substituting a
the government has labored to establish its
should the war between france and belgium has been made to secure reasonableness in prices of
sentiments through a long day we began to flow between those


Ok, so this is somewhat gibberish, but it's a good place to start!

Now let's take a look at what happens if we limit our model to just a specific president. We can pick Lincoln by passing in a string.

In [5]:
combine_pres.create_txt('Lincoln')

file created


In [6]:
president_specific = HMM('combined_pres.txt')

In [7]:
president_specific.train()

Training on: combined_pres.txt
Done Training on: combined_pres.txt


Again, let's just do 5 lines and see what we get.

In [8]:
president_specific.generate(5)

the united states passed the
this plan is presented which may reach the loyal regions of east
state of nevada has been paid to pensioners of all the
long a line it has
improvement and governmental institutions over the new commercial treaty between the now living had been prodded it is some relief to know that the actual receipts for the fiscal year


Still gibberish, but the word choice is clearly different. The output depends on what the model generates for you, but we have seen words related to the Civil War, territories, and supplies appear more often, which makes sense since Lincoln served throughout the entirety of the Civil War.

We can also give our model a sentence to start with, and it will try to complete it. For this, it looks only at the last two words and starts from there.
Let's take a look at how this goes with our initial model trained on all of the addresses.

In [9]:
initial_model.generate(1, 'It has come to my attention that our nation')

It has come to my attention that our nation tonight is also true that


Kind of cool to see what it comes up with! 

If you give it words that do not appear in any of the addresses, it will throw an error, because the model cannot predict what comes after words it has never seen. We've added an error catch to inform you if this happens.
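
As a rough sketch of what such a check can look like (this is not the actual catch in our code; `UnseenContextError`, `followers_for`, and the hand-written `counts` table are hypothetical names for illustration), we can look up the last two words of the prompt and raise a readable error when that pair was never seen in training:

```python
from collections import Counter

class UnseenContextError(LookupError):
    """Hypothetical error type; the project's own exception may differ."""

def followers_for(counts, prompt):
    """Return the words seen after the last two words of the prompt."""
    words = prompt.lower().split()
    if len(words) < 2:
        raise ValueError("the prompt needs at least two words")
    context = (words[-2], words[-1])
    if context not in counts:
        raise UnseenContextError(
            f"'{context[0]} {context[1]}' never appears in the training text"
        )
    return counts[context]

# Hand-written counts standing in for a trained model (an assumption).
counts = {("our", "nation"): Counter({"is": 3, "tonight": 1})}

print(followers_for(counts, "It has come to my attention that our nation"))

try:
    followers_for(counts, "zebras enjoy quantum")
except UnseenContextError as err:
    print("Cannot continue this sentence:", err)
```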

### Potential Improvements
#### Look at more words
Currently our model only looks at two words and picks the third word based on the probability of it appearing after those two. This does a fairly good job. However, it would likely produce more coherent sentences if we analyzed a longer chain of words, say five or six, at the cost of storing more data.
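
One way this could be sketched is to make the context length a parameter instead of hard-coding two words. The function name `train_ngrams` and the toy sentence are assumptions for illustration, not part of our current code.

```python
from collections import Counter, defaultdict

def train_ngrams(words, context_size):
    """Count which word follows each run of `context_size` consecutive words."""
    counts = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        counts[context][words[i + context_size]] += 1
    return counts

# Toy text standing in for the real addresses (an assumption for this sketch).
words = "the state of the union is strong and the state of the economy is strong".split()

for context_size in (1, 2, 3):
    model = train_ngrams(words, context_size)
    print(context_size, "word context:", len(model), "distinct contexts stored")
```

On a full corpus, longer contexts generally mean many more distinct keys to store and fewer examples per key, so the output tends to drift toward copying the training text word for word.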
#### Add Grammar Rules
We could also weight probabilities of words using grammar rules. For instance, if the most likely word to come after another violates a grammar rule, reduce the probability by a set amount per violation. This could lower the probability enough to perhaps pick the second most likely word.
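
Here is a small sketch of how that reweighting might work. The two rules in `violations` are toy examples we made up, and multiplying by a fixed `penalty` per violation is just one possible reading of "reduce the probability by a set amount"; real grammar checking would need much more than this.

```python
def violations(previous_word, candidate):
    """Count toy grammar violations for `candidate` following `previous_word`."""
    count = 0
    if candidate == previous_word:
        count += 1  # immediate repetition, e.g. "the the"
    if previous_word == "a" and candidate[0] in "aeiou":
        count += 1  # "a economy" should be "an economy"
    return count

def reweight(followers, previous_word, penalty=0.5):
    """Scale each candidate's count by penalty ** (number of violations)."""
    return {
        word: count * penalty ** violations(previous_word, word)
        for word, count in followers.items()
    }

# Made-up counts: "economy" is the most likely word before any penalty.
followers = {"economy": 3, "nation": 2}
adjusted = reweight(followers, previous_word="a", penalty=0.5)
best = max(adjusted, key=adjusted.get)

print(adjusted)
print(best)
```

With these made-up numbers, the penalty is enough to push the second most likely word ahead of the first.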
#### More Data
Obviously, more data would be nice. However, a State of the Union address only occurs once a year. We could look at all presidential speeches and add those speeches with a lower weight than the State of the Union addresses. This would at least give us more words to work with and minimize the likelihood of feeding the model words that it has never seen before.
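
A minimal sketch of the weighting idea, assuming we count word triples as before but add a fractional amount for the extra speeches. The function `add_weighted_counts`, the example sentences, and the weight of 0.25 are all placeholders, not tuned values.

```python
from collections import Counter, defaultdict

def add_weighted_counts(counts, text, weight):
    """Add trigram counts from `text`, with each occurrence worth `weight`."""
    words = text.lower().split()
    for first, second, third in zip(words, words[1:], words[2:]):
        counts[(first, second)][third] += weight
    return counts

counts = defaultdict(Counter)

# State of the Union text counts fully; other speeches count for a quarter.
add_weighted_counts(counts, "the union is strong", weight=1.0)
add_weighted_counts(counts, "the union is growing and the union is growing", weight=0.25)

print(counts[("union", "is")])
```

Here "growing" appears twice in the lower-weight text but still ends up with a smaller total than the single "strong" from the full-weight text.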