# An analysis of the State of the Union speeches - Part 3
# Word analysis

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from collections import Counter
import shelve

# Note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-dark'
plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (10, 6)

Load data we need from previous runs

In [2]:
# Read data from result created in p2
addresses = pd.read_hdf('results/df2.h5', 'addresses')
addresses.head()

Unnamed: 0,president,title,date,n_sent,n_words_all,n_words,n_uwords,n_swords,n_chars,logn_words,logn_sent,vocab_per_word,word_per_sent,char_per_word,frac_stop
0,George Washington,State of the Union Address,1790-01-08,24,1178,538,395,356,6753,6.287859,3.178054,0.66171,22.416667,12.552045,1.189591
1,George Washington,State of the Union Address,1790-12-08,40,1515,683,513,463,8455,6.526495,3.688879,0.677892,17.075,12.379209,1.218155
2,George Washington,State of the Union Address,1791-10-25,60,2487,1136,731,626,14203,7.035269,4.094345,0.551056,18.933333,12.502641,1.189261
3,George Washington,State of the Union Address,1792-11-06,61,2298,1042,682,580,12764,6.948897,4.110874,0.556622,17.081967,12.24952,1.205374
4,George Washington,State of the Union Address,1793-12-03,56,2132,972,714,652,11696,6.879356,4.025352,0.670782,17.357143,12.032922,1.193416


In [3]:
# Read the variables saved to the shelve store by the previous notebook
with shelve.open('results/vars2') as db:
    speech_words = db['speech_words']
    speeches_cleaned = db['speeches_cleaned']

Let's make a single set of all unique words across all speeches

In [4]:
unique_words = list(set(speech_words))
n_words = len(unique_words)
n_words  # number of unique words across all speeches

19140

That is a large vocabulary: 19,140 distinct tokens. It suggests that word choice varies considerably from speech to speech.
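
One caveat about building the vocabulary with `list(set(...))`: Python randomizes string hashes per process by default, so the order of the resulting list can change between runs, and with it the position of every word in the matrix built later. A minimal sketch of one way to pin the order, assuming alphabetical order is acceptable; the `words` list is a toy stand-in for `speech_words`, and `index_of` is a hypothetical helper that is not part of the notebook:

```python
# Toy stand-in for `speech_words` (assumption: a flat list of word strings)
words = "the union of the states and the people".split()

# Sorting the set fixes the order, so each word keeps the same position on every run
vocab = sorted(set(words))

# Hypothetical lookup table from word to its position in the vocabulary
index_of = {w: i for i, w in enumerate(vocab)}

print(vocab)
print(index_of["the"])
```

With a fixed order, a slice such as rows 500 to 510 of the word matrix would show the same words every time the notebook is re-run.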

Now we create a word matrix whose columns are the word vectors of the individual speeches. A word vector holds, for one speech, the count of every word in the vocabulary built from the entire document set.

In [5]:
def word_vector(doc, vocab):
    """Return a word vector for the input document in the context of a given vocabulary.

    Parameters
    ----------
    doc : iterable of words
        The document whose words are counted.
    vocab : iterable of words
        The entire vocabulary across documents.

    Returns
    -------
    array
        An integer array, of length equal to `len(vocab)`, containing the count for each
        word in `doc` at its corresponding position in `vocab`.

    Example
    -------
    >>> doc = "b c b c e".split()
    >>> vocab = "a b c d e f".split()
    >>> word_vector(doc, vocab)
    array([0, 2, 2, 0, 1, 0])
    """
    # Count every word in the document once, then look up each vocabulary word
    counter = Counter(doc)
    return np.array([counter[v] for v in vocab])

Let's write a simple unit test for this:

In [6]:
def test_word_vector():
    doc = "b c b c e".split()
    vocab = "a b c d e f".split()
    wv = word_vector(doc, vocab)
    np.testing.assert_array_equal(wv, np.array([0, 2, 2, 0, 1, 0]))

test_word_vector()
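
Two edge cases may be worth checking as well: words that are not in the vocabulary, and an empty document. The sketch below repeats the body of `word_vector` so that it runs on its own; the inputs are made up for illustration.

```python
from collections import Counter

import numpy as np


def word_vector(doc, vocab):
    """Same logic as the function defined above, repeated so this block runs alone."""
    counter = Counter(doc)
    return np.array([counter[v] for v in vocab])


vocab = "a b c d e f".split()

# Words that are not in the vocabulary ("z" here) are silently ignored
oov = word_vector("b z z".split(), vocab)

# An empty document maps to an all-zero vector of the same length
empty = word_vector([], vocab)

np.testing.assert_array_equal(oov, np.array([0, 1, 0, 0, 0, 0]))
np.testing.assert_array_equal(empty, np.zeros(len(vocab), dtype=int))
```

Both cases work because a `Counter` returns 0 for keys it has never seen.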

Now let's make the word matrix for our entire set of documents

In [7]:
# Build a matrix with one row per speech and one column per vocabulary word;
# each entry is the count of that word in that speech
mat = np.zeros((len(speeches_cleaned), n_words))
for i in range(len(speeches_cleaned)):
    row = word_vector(speeches_cleaned[i], unique_words)
    mat[i] = row
# Transpose it so that each column represents a speech and each row a word
wmat = pd.DataFrame(mat.T)
wmat.index = unique_words
wmat[500:510]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,217,218,219,220,221,222,223,224,225,226
beriberi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
decri,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
overtop,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
produc,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,2.0,2.0,1.0,2.0,2.0
one-room,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jenna,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
14502250.67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
289303794.50,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
broke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
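
To make the layout concrete (one row per word, one column per speech), here is a small sketch on a made-up corpus of three tiny documents. It stacks the word vectors as columns with `np.column_stack`, which avoids the transpose and keeps integer counts; this is an illustrative variant, not the code used above.

```python
from collections import Counter

import numpy as np
import pandas as pd


def word_vector(doc, vocab):
    """Same logic as the notebook's function, repeated so this block runs alone."""
    counter = Counter(doc)
    return np.array([counter[v] for v in vocab])


# Made-up toy corpus: three already-tokenized documents
docs = ["b c b c e".split(), "a a f".split(), "e e e d".split()]
vocab = sorted({w for d in docs for w in d})

# Each word vector becomes one column, so rows are words and columns are documents
toy = pd.DataFrame(
    np.column_stack([word_vector(d, vocab) for d in docs]),
    index=vocab,
)
print(toy)
```

Reading across a row shows how often one word appears in each document; reading down a column gives the word vector of one document.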


Just from looking at these rows we already expect this matrix to be very sparse.

How sparse is this matrix exactly?

In [8]:
# Fraction of entries that are zero: 1 - (nonzero entries / total entries)
sparsity = 1 - wmat.astype(bool).sum().sum() / wmat.size
print(f"wmat is comprised of {100*sparsity:.2f}% zeros.")

wmat is comprised of 93.26% zeros.


Not surprisingly, this matrix is very sparse. One reason is that we kept numbers in the speeches, and a numeric token such as `14502250.67` is likely to occur in very few of them. Another is that the English language is constantly evolving: the words used by George Washington differ significantly from those used by Donald Trump, as we saw in the previous part.
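
When most entries are zero, a sparse representation stores only the nonzero ones. The following is a minimal sketch on a made-up 3x4 matrix, assuming SciPy is installed; the notebook itself keeps the dense `wmat`, so this is an optional alternative rather than part of the analysis.

```python
import numpy as np
from scipy import sparse

# Made-up dense matrix standing in for a small slice of the word matrix
dense = np.array([
    [0, 2, 0, 0],
    [0, 0, 0, 1],
    [3, 0, 0, 0],
])

# Compressed sparse row format keeps only the nonzero values and their positions
csr = sparse.csr_matrix(dense)

# Sparsity computed the same way as above: 1 - (nonzero entries / total entries)
n_cells = csr.shape[0] * csr.shape[1]
sparsity = 1 - csr.nnz / n_cells
print(f"{csr.nnz} stored values, {100 * sparsity:.2f}% zeros")
```

The conversion is lossless: `csr.toarray()` gives back the original dense array.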

## Intermediate results storage

We'll need a few results for the next step, so let's store them in a new set of HDF5/shelve stores for this notebook:

In [9]:
unique_words = pd.DataFrame(unique_words)

In [10]:
wmat.to_hdf('results/df3.h5', 'wmat')
with shelve.open('results/vars3') as db:
    db['unique_words'] = unique_words
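
A later notebook can read a shelve store back by opening it again and indexing by the same key. The sketch below shows the round trip on a toy list in a temporary directory; the store name `vars_demo` and the `vocab` list are hypothetical and stand in for `results/vars3` and `unique_words`.

```python
import os
import shelve
import tempfile

vocab = ["a", "b", "c"]  # toy stand-in for the stored vocabulary

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vars_demo")  # hypothetical store name

    # Write: the shelf behaves like a persistent dictionary
    with shelve.open(path) as db:
        db["unique_words"] = vocab

    # Read back in read-only mode, as a later notebook might do
    with shelve.open(path, flag="r") as db:
        restored = db["unique_words"]

print(restored == vocab)
```

Values in a shelf are pickled, so whatever object is stored (a list here, a DataFrame in the notebook) comes back as the same type.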