# An analysis of the State of the Union speeches

**Authors:** Aaron Ou, Brian Lin, Julien Yu

## Abstract

The State of the Union address is an annual address given by the President of the United States in which the President outlines his agenda for the country. We were curious whether exploring these addresses would tell us anything interesting, such as how the English language has changed over time, and so we applied some text processing to the speeches. We found that over time, while these addresses have used more sentences, the sentences have become shorter. Furthermore, presidents from around the same era have very similar speech patterns, an indication that these speeches do in fact reflect the change of the English language over time. The two current major parties, the Democratic Party and the Republican Party, have followed similar trends in their speech characteristics, and both showed interest in the economy.

## 1. Overview of Dataset

The dataset could be easily separated into individual speeches, as each speech is separated by three asterisks. The dataset contains **227** addresses delivered by **42** US Presidents from 1790 to 2017. Interestingly but unsurprisingly, most speeches occurred at the end of a year or the start of a new one, possibly in order to address the goals of the coming year. Even though there are 45 presidents, only 42 appear in this dataset. Two presidents are missing: William Henry Harrison, who passed away one month into his presidency, and James A. Garfield, who was assassinated in his first year. Neither was able to deliver an address. In addition, Grover Cleveland counts as 2 of the 45 presidents because he was elected to two non-consecutive terms.
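
The splitting step described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `split_speeches` and the assumption that each separator sits on its own line as exactly `***` are ours.

```python
import re

def split_speeches(raw_text):
    """Split a raw corpus into speeches on lines holding only three asterisks."""
    # Assumption: a separator is a line containing "***" and optional spaces/tabs.
    parts = re.split(r"^[ \t]*\*{3}[ \t]*$", raw_text, flags=re.MULTILINE)
    # Drop empty chunks, such as the one after a trailing separator.
    return [part.strip() for part in parts if part.strip()]

sample = "First speech text.\n***\nSecond speech text.\n***\nThird speech text.\n"
speeches = split_speeches(sample)
print(len(speeches))
```

Each element of `speeches` can then be processed independently in the later steps.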

![addresses_month](fig/addresses_month.png)

However, the dataset does not actually contain all the addresses delivered, as it is missing the addresses delivered during the second term of Grover Cleveland. 

![timeline](fig/timeline.png)

## 2. Speech Characteristics
First, we have summarised a few characteristics for each speech:
- `n_sent`: number of sentences in the speech
- `n_words_all`: number of words in the speech, including stop words and punctuation
- `n_words`: number of words in the speech, excluding stop words and punctuation
- `n_uwords`: number of *unique* words in the speech, excluding stop words and punctuation
- `n_swords`: number of *unique, stemmed* words in the speech, excluding stop words and punctuation
- `n_chars`: number of letters in the speech
---
- `logn_words`: number of words on a logarithmic scale, for plotting purposes
- `logn_sent`: number of sentences on a logarithmic scale, for plotting purposes
---
- `vocab_per_word`: vocabulary size per word
- `word_per_sent`: average sentence length
- `char_per_word`: average word length
- `frac_stop`: fraction of words that are stop words
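
To make these definitions concrete, here is a rough sketch of how most of them could be computed for a single speech. It is not the original pipeline: the regex tokenizer, the tiny `STOP_WORDS` set, and the function name `speech_features` are illustrative assumptions, and stemming (`n_swords`) is left out. The ratios divide by `n_words`, which is our reading of the summary statistics rather than something stated in the text.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# full list from an NLP library.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "we", "our"}

def speech_features(text):
    """Compute a subset of the per-speech characteristics for non-empty text."""
    # Naive sentence split: one or more of . ! ? followed by whitespace or end.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Tokens are runs of letters, digits, ' and -, or single punctuation marks.
    tokens = re.findall(r"[a-z0-9'-]+|[^\w\s]", text.lower())
    word_tokens = [t for t in tokens if t[0].isalnum()]
    words = [t for t in word_tokens if t not in STOP_WORDS]
    n_stop = len(word_tokens) - len(words)

    n_sent, n_words = len(sentences), len(words)
    n_chars = sum(ch.isalpha() for ch in text)  # letters only
    return {
        "n_sent": n_sent,
        "n_words_all": len(tokens),
        "n_words": n_words,
        "n_uwords": len(set(words)),
        "n_chars": n_chars,
        "logn_words": math.log(n_words),  # natural log
        "logn_sent": math.log(n_sent),
        "vocab_per_word": len(set(words)) / n_words,
        "word_per_sent": n_words / n_sent,
        "char_per_word": n_chars / n_words,
        "frac_stop": n_stop / n_words,  # assumed denominator
    }

features = speech_features("The union is strong. We build, and we grow.")
print(features)
```

Because the denominators exclude stop words, `char_per_word` and `frac_stop` come out larger than they would if divided by the total word count.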

**Jimmy Carter** gave the longest speech in 1981 with 21041 words, while **George Washington** gave the shortest speech in 1790 with 538 words. During Carter's presidency, the United States faced many problems, from international ones such as the Cold War and the Iran hostage crisis to domestic ones such as the recovering economy, so Carter had a lot to talk about. Meanwhile, Washington's was the first State of the Union address ever delivered, so it makes sense for it to be the shortest.

```python
import pandas as pd

# Load the per-speech summary table and show its descriptive statistics.
addresses = pd.read_hdf('results/df2.h5', 'addresses')
addresses.describe()
```

| | n_sent | n_words_all | n_words | n_uwords | n_swords | n_chars | logn_words | logn_sent | vocab_per_word | word_per_sent | char_per_word | frac_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 |
| mean | 266.145374 | 8305.678414 | 4048.22467 | 1639.810573 | 1290.480176 | 45474.770925 | 8.093769 | 5.366073 | 0.378614 | 15.773301 | 11.209835 | 1.05718 |
| std | 178.797065 | 5871.458447 | 2900.451196 | 747.674971 | 508.086636 | 33019.958825 | 0.659142 | 0.70817 | 0.103929 | 3.838767 | 0.898774 | 0.120146 |
| min | 24.0 | 1059.0 | 538.0 | 395.0 | 356.0 | 5649.0 | 6.287859 | 3.178054 | 0.143292 | 8.87766 | 9.4046 | 0.757236 |
| 25% | 165.0 | 4396.0 | 2190.5 | 1088.5 | 914.5 | 23394.5 | 7.691806 | 5.105945 | 0.313566 | 12.176852 | 10.425492 | 0.951403 |
| 50% | 240.0 | 6655.0 | 3388.0 | 1530.0 | 1246.0 | 36677.0 | 8.127995 | 5.480639 | 0.362074 | 16.22449 | 11.487615 | 1.076265 |
| 75% | 347.5 | 10055.0 | 4801.5 | 1998.5 | 1552.0 | 55211.0 | 8.476678 | 5.850764 | 0.426852 | 18.62706 | 11.935899 | 1.166681 |
| max | 1343.0 | 36974.0 | 21041.0 | 4282.0 | 3015.0 | 218009.0 | 9.954228 | 7.202661 | 0.677892 | 26.677966 | 12.762673 | 1.286419 |


From the chart below, we can see an interesting pattern that can be attributed to changes in the English language over time.
> **Presidential speeches use more but shorter sentences, shorter words, and fewer stop words over time.**

!['speech_changes'](fig/speech_changes.png)

Looking at the two major parties today, both seem to follow a similar trend for all characteristics. The average word length, sentence length, and fraction of stop words all show a decreasing trend, while there does not seem to be a trend in the log number of sentences, vocabulary size per word, or log number of words for either party. Note that we cannot make any comparisons between the two parties: the first <span style="color:blue">Democratic</span> president was **Andrew Jackson** in 1829 while the first <span style="color:red"> Republican </span> president was **Abraham Lincoln** in 1861, so there are many external factors we do not account for due to the difference in time. We are only looking at the trend within each party.

![party_characteristics](fig/party_characteristics.png)

By president, a few fun facts are observed:
- Flexible: **Herbert Hoover**'s and **Jimmy Carter**'s speeches varied widely in style. Maybe they had different speech writers in different years, or they had to deal with issues going on in the U.S. at the time (the Great Depression and stagflation respectively).
- Long vs concise: **James Madison** liked to use long sentences compared to other presidents, while **George H. W. Bush** liked to use short sentences.


!['speech_characteristics'](fig/speech_characteristics.png)

## 3. Words matrix
We ran through the 227 speeches to count the frequency of each unique word and obtained the following matrix, with one row per word and one column per speech. `wmat` is 93.26% zeros. Two things contribute to this sparsity: we keep numbers that appear in the speeches as tokens, and the English language is constantly evolving, so the words used by George Washington differ significantly from the words used by Donald Trump. We saw this evolution in the previous part.

```python
import pandas as pd

# Load the word-by-speech count matrix and show ten sample rows.
wmat = pd.read_hdf('results/df3.h5', 'wmat')
wmat[500:510]
```

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,217,218,219,220,221,222,223,224,225,226
beriberi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
decri,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
overtop,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
produc,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,2.0,2.0,1.0,2.0,2.0
one-room,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jenna,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
14502250.67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
289303794.50,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
broke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
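
A word-by-speech count matrix of this shape, along with its share of zeros, could be built roughly as follows. This is a plain-Python sketch rather than the code behind `wmat`: the helper names `build_word_matrix` and `zero_fraction` are hypothetical, and the input is assumed to be speeches that are already tokenized and stemmed.

```python
from collections import Counter

def build_word_matrix(tokenized_speeches):
    """Return (vocab, matrix) where matrix[i][j] counts word i in speech j."""
    counts = [Counter(tokens) for tokens in tokenized_speeches]
    # The vocabulary is the union of all words seen in any speech.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words a speech never uses.
    matrix = [[float(c[word]) for c in counts] for word in vocab]
    return vocab, matrix

def zero_fraction(matrix):
    """Share of cells in the matrix that are exactly zero."""
    cells = [value for row in matrix for value in row]
    return sum(value == 0 for value in cells) / len(cells)

toy_speeches = [["produc", "broke", "produc"], ["jenna", "produc"], ["broke"]]
vocab, matrix = build_word_matrix(toy_speeches)
print(vocab, round(zero_fraction(matrix), 3))
```

Even in this three-speech toy example, words used by only one speaker leave most of their row empty, which is the same effect that drives the sparsity of the full matrix.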


## 4. MDS

According to Wikipedia, "Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction." In our case, the closer two points (speeches) are to each other on the plot, the more "similar" in style they are. 

We measured distance in two ways: Euclidean distance and Jensen-Shannon distance. Both yielded similar results, in that presidents from around the same era cluster together. For example, the "modern" presidents Clinton, Trump, Bush, and Obama all sit near each other, indicating that their word usage is similar.

Funnily enough, under the Jensen-Shannon distance, Trump is basically isolated by himself.
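
For readers unfamiliar with the two measures, the sketch below computes both between word-frequency profiles and assembles the pairwise distance matrix that an MDS routine would take as input. Several choices here are assumptions, not the original method: counts are normalized to probabilities before either distance is applied, the Jensen-Shannon distance is taken as the square root of the base-2 divergence, and all function names are ours.

```python
import math

def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, m):
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)

def jensen_shannon_distance(p, q):
    # Compare each distribution with their midpoint, then take the square root.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))

def distance_matrix(vectors, metric):
    """Pairwise distances between all vectors under the given metric."""
    return [[metric(u, v) for v in vectors] for u in vectors]

# Toy word counts for three speeches over a three-word vocabulary.
speech_counts = [[2, 1, 0], [1, 1, 0], [0, 0, 3]]
profiles = [normalize(c) for c in speech_counts]
js = distance_matrix(profiles, jensen_shannon_distance)
eu = distance_matrix(profiles, euclidean)
print(js[0][1], js[0][2])
```

The first two toy speeches share vocabulary and end up close together, while the third shares no words with them and sits at the maximum Jensen-Shannon distance of 1.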

!['mds_naive'](fig/mds_naive.png)

!['mds_jsdiv'](fig/mds_jsdiv.png)

We also looked at the individual speeches themselves. Both distances again yielded similar results. For the most part, all the speeches by one president are near each other. Previously we noted that **Herbert Hoover** and **Jimmy Carter** had divergent styles across their own speeches over time. We can see this in these plots as well, since some of their points are far from each other. We still see that presidents from around the same era cluster together.

!['mds_naive_all'](fig/mds_naive_all.png)

![mds_jdsiv_all](fig/mds_jdsiv_all.png)

## 5. The Economy 

The economy is one of the most important aspects of our lives. As such, we would expect the presidents to talk about the state of the economy in their speeches. We looked at the following terms in a speech:

- `Number Count`: Overall how many numbers were used in the speech 
- `Percent Count`: How many times percents were used in the speech (%, percent, percentage)
- `Dollar Count`: How many times terms related to the dollar were used in the speech ($, dollar, dollars)
- `Employ Count`: How many times "employ" was part of a word in the speech
- `Inflation Count`: How many times "inflation" was part of a word in the speech
- `Economy Count`: How many times "economy" was part of a word in the speech
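
The counts listed above could be obtained with a few regular expressions, as in this sketch. The patterns are our guesses at a reasonable implementation, not the originals: in particular, a run of digits with internal commas or periods is counted as one number, and the function name `economic_counts` is hypothetical.

```python
import re

def economic_counts(text):
    """Count number, percent, dollar and economy-related mentions in a speech."""
    lowered = text.lower()
    return {
        # A number is a run of digits, optionally with , or . inside it.
        "Number Count": len(re.findall(r"\d+(?:[.,]\d+)*", lowered)),
        "Percent Count": len(re.findall(r"%|\bpercent(?:age)?\b", lowered)),
        "Dollar Count": len(re.findall(r"\$|\bdollars?\b", lowered)),
        # Substring matches, so "unemployment" counts toward "employ".
        "Employ Count": len(re.findall(r"\w*employ\w*", lowered)),
        "Inflation Count": len(re.findall(r"\w*inflation\w*", lowered)),
        "Economy Count": len(re.findall(r"\w*economy\w*", lowered)),
    }

sample = ("Unemployment fell to 5 percent, and we invested $1,200 million "
          "dollars to fight inflation and grow the economy.")
counts = economic_counts(sample)
print(counts)
```

Note that matching "economy" as a substring misses related forms such as "economic", which may partly explain why that count stays low.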

Among the economic terms, the dollar had the highest mean number of mentions, which is not surprising since the dollar comes up when talking about any kind of spending, stimulus, etc. Employment had the next highest mean, while inflation had the lowest, which is also not surprising since presidents tend to care more about employment while the Federal Reserve cares more about inflation. The term "economy" itself did not have a particularly high mean, which suggests that presidents go into more detail about specific statistics such as unemployment.

```python
import pandas as pd

# Load the per-speech economic term counts and summarise them.
addresses = pd.read_hdf('results/df5.h5', 'addresses')
addresses[["Number Count", "Percent Count", "Dollar Count", "Employ Count", "Inflation Count", "Economy Count"]].describe()
```

| | Number Count | Percent Count | Dollar Count | Employ Count | Inflation Count | Economy Count |
|---|---|---|---|---|---|---|
| count | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 | 227.0 |
| mean | 104.330396 | 2.052863 | 17.127753 | 5.726872 | 1.070485 | 3.704846 |
| std | 128.735653 | 4.716458 | 23.76203 | 8.495461 | 2.925984 | 4.942691 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 26.0 | 0.0 | 4.0 | 2.0 | 0.0 | 0.5 |
| 50% | 61.0 | 0.0 | 11.0 | 3.0 | 0.0 | 2.0 |
| 75% | 138.5 | 2.0 | 22.5 | 7.0 | 1.0 | 5.0 |
| max | 837.0 | 50.0 | 252.0 | 91.0 | 24.0 | 32.0 |


Not surprisingly, there seem to be spikes when the economy is in a downturn or recession. Examples:

There is a large spike in all of the statistics just before 1950. World War II ended in September of 1945. As such, the American people no longer needed to worry about war and could instead focus on the economy again. While World War II brought the United States out of the Great Depression, there is no doubt that the fear of an economic downturn still lingered in the minds of many, and **Harry Truman** needed to address these concerns.

There is another large spike in many of the statistics around 1980.
During the 1970s, an economic phenomenon known as stagflation hit the United States. By conventional Keynesian theory, when unemployment goes up, inflation goes down and vice versa. Yet in this decade both unemployment and inflation went up together, resulting in an economic crisis never seen before. By the late 1970s, new theories had been developed to tackle this problem. It makes sense that inflation and the economy were constantly mentioned throughout these years.

![numbers_graphs](fig/numbers_graphs.png)

Partisan-wise, while all presidents care about the economy, it seems that <span style="color:red"> Republicans </span> talk about numbers (0-9) more than <span style="color:blue"> Democrats</span>, especially a few outlier Republican presidents. However, in general there does not seem to be any significant difference in how often economic terms are used by presidents of the two parties.

![party_counts](fig/party_counts.png)