# An Analysis of the 2016 Presidential Election

**Authors: Ye Zhong, Yizhuang Kang, Omid Zargham**

## Abstract
This paper explores and analyzes the text of US State of the Union addresses from 1790 to the present. After some basic exploratory analysis of the metadata, we use NLTK to compute general text analysis metrics, create a term-document matrix for direct corpus analysis, present scatterplots of the speeches using multidimensional scaling to collapse the high-dimensional data onto two dimensions, and then compute a variety of complexity metrics to investigate how the complexity of State of the Union addresses has changed over time.

## Introduction
A common criticism of modern American politics is that it has been dumbed down over time. The level of discourse in modern politics is said to pale in comparison to that of the intellectual giants of the past, from Thomas Jefferson to FDR. We sought to examine whether such criticisms are valid or are merely the product of malcontents looking at the past through rose-tinted glasses. By evaluating State of the Union speeches with a variety of complexity metrics, we found a strong trend toward decreasing complexity over time.


## Part 1: 
To begin, we represented each speech in the dataset as a row in a Pandas dataframe, together with metadata such as the date, which we used to explore the dataset. Using the date, we computed the number of speeches given in each month:
<img src="fig/addresses_month.png">
Further exploration revealed a gap in the timeline, which we attribute to the dataset missing the speeches made by Grover Cleveland during his second term, between 1892 and 1897:
<img src="fig/timeline.png">
Then, we represented the speeches as a list with each element being the raw text of each speech directly from the dataset. 
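
The per-month count described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the notebook's code: the list `speech_dates` is a hypothetical stand-in for the date column of the dataframe, and only a handful of dates are included.

```python
from collections import Counter
from datetime import date

# Hypothetical stand-in for the dataframe's date column (a few example dates only).
speech_dates = [
    date(1790, 1, 8),
    date(1790, 12, 8),
    date(1913, 12, 2),
    date(2016, 1, 12),
]

# Count how many speeches fall in each calendar month (1 = January, ..., 12 = December).
speeches_per_month = Counter(d.month for d in speech_dates)

for month in sorted(speeches_per_month):
    print(month, speeches_per_month[month])
```

With the full dataset, the same counts could feed the bar chart of addresses per month.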



## Part 2: 
In Part 2, we used a new package, plotly, to make a scatter plot of the adjusted poll data over time. The graph shows the poll percentage each candidate received as time passed. Throughout the year, each candidate's poll percentage changed.



We then created graphs to visualize how each candidate's poll percentage changed over time:
<img src="fig/Percentage_change_among_time.jpeg">
The graph shows that, compared to Clinton, Trump began the election campaign with lower support. At the top right of the graph, some polls indicate that Clinton had 80%-85% of the vote, which contradicts the actual election result. However, we can see how the poll percentages changed over time. At the beginning of 2016, Clinton polled much higher, and her numbers gradually decreased through the year until November 2016 (right before the election). In contrast, Trump's poll percentage rose throughout the year. This was probably a sign that his actual vote share in the election at the end of 2016 would rise sharply.


The following boxplot shows the poll results grouped by pollster grade:
<img src="fig/boxplot_vote_among_time.jpeg">

We cleaned the data and combined some of the grades. A+ represents the grade-A+ institutions, while B represents B-, B, and B+ combined; C and D are grouped in the same way.
According to the boxplot, the data did not include many reports from grade-D institutions. Surprisingly, both the grade-A+ and grade-A poll results indicate that Clinton had a higher support rate than Trump; none of them indicates that Trump had a higher rate than Clinton. Another surprise is that the grade-B and grade-C boxplots are widely spread. The large range indicates that the reported levels of public support differ widely across these polls.
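
The grade grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `coarse_grade` is hypothetical, and folding A- into A is an assumption, since the text only says that A+ is kept separate and that B-, B, and B+ are combined (likewise for C and D).

```python
def coarse_grade(grade):
    """Collapse a pollster grade such as 'B+' or 'B-' to its letter.

    'A+' is kept as its own group, as in the text. Folding 'A-' into 'A'
    is an assumption made for this sketch.
    """
    if grade == "A+":
        return "A+"
    return grade.rstrip("+-")

# Example grades as they might appear in the poll data.
grades = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C-", "D"]
grouped = [coarse_grade(g) for g in grades]
print(grouped)
```

Grouping before plotting keeps the number of boxes small enough to compare the spread between grade levels.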

## Part 3: 
In Part 3 of the project, we loaded the data from Part 2 and saved the variables as speech_word and speeches_cleaned. In the first cell, we use a for loop to go through speeches_cleaned line by line, collect all the unique words, and store them in a new variable called unique_word. For later use, we applied the sorted() function so that unique_word is in a consistent order. This gave us 18797 unique words in total. To create the function word_vector(doc, vocab), we use two for loops to record each word and the number of times it appears. For the stemmed words in speeches_cleaned, we append each document's count vector over the unique words to a list and build a matrix from it. Lastly, we converted the matrix into a data frame using pd.DataFrame and np.array and took the transpose, so that words are rows and the counts for each president are columns. The resulting wmat matrix contains the count of each word for each president.
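
A minimal sketch of this construction is shown below. It is not the notebook's code: it uses `collections.Counter` instead of the two explicit for loops, plain lists instead of a dataframe, and a tiny hypothetical `speeches_cleaned` with made-up stemmed tokens.

```python
from collections import Counter

def word_vector(doc, vocab):
    """Return the count of each vocabulary word in doc (a list of tokens)."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

# Hypothetical stand-in for speeches_cleaned: one list of stemmed tokens per document.
speeches_cleaned = [
    ["union", "state", "peac", "union"],
    ["state", "war", "tax"],
]

# Sorted vocabulary of all unique words across the documents.
unique_word = sorted({word for speech in speeches_cleaned for word in speech})

# One row per document, then transpose so that rows are words and columns are documents.
doc_rows = [word_vector(speech, unique_word) for speech in speeches_cleaned]
wmat = [list(row) for row in zip(*doc_rows)]

for word, row in zip(unique_word, wmat):
    print(word, row)
```

Sorting the vocabulary once and reusing it for every document guarantees that row `i` of the matrix always refers to the same word.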
For the last calculation, we found the entries of wmat that are 0, summed the number of zeros for each president, and then summed those totals. This gave us the total number of zeros in the whole matrix, which we divided by the total number of entries. The total number of entries is the number of rows multiplied by the number of columns, that is, wmat.shape[0] times wmat.shape[1].
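
The zero-fraction calculation can be sketched as follows. This is a dependency-free illustration on a small hypothetical matrix stored as a list of lists, not the notebook's dataframe code.

```python
# Hypothetical word-by-document count matrix (rows: words, columns: documents).
wmat = [
    [1, 0],
    [1, 1],
    [0, 1],
    [2, 0],
    [0, 1],
]

n_rows = len(wmat)      # plays the role of wmat.shape[0]
n_cols = len(wmat[0])   # plays the role of wmat.shape[1]
total_entries = n_rows * n_cols

# Count the zeros in each column, then add the column totals together.
zeros_per_column = [sum(1 for row in wmat if row[j] == 0) for j in range(n_cols)]
total_zeros = sum(zeros_per_column)

zero_fraction = total_zeros / total_entries
print(f"{zero_fraction:.2%} of the entries are zero")
```

A high zero fraction means that most words appear in only a few of the documents.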

In the end, 93.15% of all entries were zeros, which indicates that a large share of the words were not used by most presidents. Each president had his own way of giving speeches, and it was not common for presidents to repeat each other's words. That is why more than 93% of the wmat entries are zeros.

## Part 4: 
Using the term-document matrix calculated in Part 3, we normalize the word counts to convert them into probability distributions, both for the speeches grouped by president and for each speech individually. We then perform multidimensional scaling on these probabilities via manifold learning, with a stress majorization optimization strategy, using Euclidean distance and Jensen-Shannon divergence as our dissimilarity measures.
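
The normalization and the two dissimilarity measures can be sketched as follows. This is a minimal illustration under assumptions: the count vectors are hypothetical, the logarithm is taken in base 2, and the divergence itself (not its square root) is used; the original notebook may differ on these points. The multidimensional scaling step is not shown; the resulting matrix could be passed to any MDS implementation that accepts precomputed dissimilarities.

```python
import math

def normalize(counts):
    """Convert a vector of word counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits, skipping terms where p is zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def euclidean(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical count vectors, one per document, over a shared vocabulary.
doc_counts = [
    [1, 1, 0, 2, 0],
    [0, 1, 1, 0, 1],
    [2, 2, 0, 4, 0],  # same proportions as the first document
]

probs = [normalize(c) for c in doc_counts]
js_matrix = [[js_divergence(p, q) for q in probs] for p in probs]
euclid_matrix = [[euclidean(p, q) for q in probs] for p in probs]
```

Because the counts are normalized first, the first and third documents are treated as identical even though one is twice as long, which is the point of comparing distributions rather than raw counts.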

The scatter plots of the per-speech and president-aggregated word distributions, projected onto the two axes found by stress majorization, are as follows:

<img src="fig/mds_naive.png">

<img src="fig/mds_naive_all.png">

<img src="fig/mds_jsdiv.png">

<img src="fig/mds_jdsiv_all.png">




## Part 5: 
In Part 5, we import the data from the previous notebooks. The results file has to be converted to numbers for later analysis, so we first loop over all elements to strip the “%” sign and then use pd.to_numeric to convert every string in the results dataframe to a number. We create a state dictionary that maps each state name to its abbreviation. A for loop then goes through every state in our dataset and sums the poll numbers for Trump and Clinton. We use the command “(clinton/trump * samplesize).sum()” to weight each candidate's poll percentage by sample size. We then use append to build a new dataframe, df_diff, that contains each candidate's poll percentage, the state name, and the total sample size. We sorted the results file so that the states are in alphabetical order and match the df_diff dataframe. To calculate the difference between the poll data and the actual voting result, we subtract the two columns; because the results dataframe is sorted, its order matches that of df_diff.
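
The steps above can be sketched as follows. This is a dependency-free illustration, not the notebook's pandas code. All numbers and the state labels "A" and "B" are made up for the example, the helper names are hypothetical, and dividing the weighted sum by the total sample size is an assumption about how the weighted percentage is obtained.

```python
def percent_to_float(text):
    """Convert a string such as '47.8%' to the number 47.8."""
    return float(text.strip().rstrip("%"))

def weighted_poll_average(polls):
    """Return {state: (clinton_pct, trump_pct, total_sample)} weighted by sample size."""
    totals = {}
    for state, sample, clinton, trump in polls:
        entry = totals.setdefault(state, [0.0, 0.0, 0])
        entry[0] += clinton * sample   # analogous to (clinton * samplesize).sum()
        entry[1] += trump * sample     # analogous to (trump * samplesize).sum()
        entry[2] += sample
    # Assumption: divide by the total sample size to get a weighted percentage.
    return {s: (c / n, t / n, n) for s, (c, t, n) in totals.items()}

# Made-up poll rows: (state, sample size, Clinton %, Trump %).
polls = [
    ("A", 1000, 45.0, 46.0),
    ("A", 500, 48.0, 43.0),
    ("B", 400, 30.0, 50.0),
]

# Made-up election results stored as strings with a percent sign.
results = {"A": ("44.0%", "51.0%"), "B": ("31.0%", "49.0%")}

averages = weighted_poll_average(polls)

# Actual result minus poll average, in alphabetical state order.
differences = {}
for state in sorted(averages):
    poll_clinton, poll_trump, _ = averages[state]
    actual_clinton, actual_trump = (percent_to_float(x) for x in results[state])
    differences[state] = (actual_clinton - poll_clinton, actual_trump - poll_trump)

print(differences)
```

Matching the two tables by state name, as done here with a dictionary lookup, avoids relying on both tables being sorted in exactly the same order.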

To visualize the actual voting result, we plot the percentage of votes each candidate received in each state.

<img src="fig/points_of_election_results.png">

Clinton is in blue and Trump is in red. As the graph shows, the two candidates received very close vote percentages in the 13 swing states. In Florida, Wisconsin, Pennsylvania, and Michigan in particular, Trump received only a slightly higher percentage than Clinton. Among these close states, Nevada, Maine, and Minnesota went to Clinton. Apart from the states where the two candidates were close, only California and the District of Columbia leaned heavily toward Clinton. In 29 states the red dot is higher than the blue one, which indicates that those states gave their electoral votes to Trump. This gave Trump a decisive advantage in the election.

Why were the actual voting results so different from our poll predictions? Let’s visualize the percentage differences between our poll predictions and the actual voting results.

<img src="fig/percentage_differences.png">

We draw bar plots of the percentage differences between the poll prediction and the actual vote each candidate received. A negative bar indicates that the candidate received a smaller percentage of the vote than the polls predicted. The higher the bar, the more the candidate's actual vote percentage exceeded the poll prediction. Blue bars show Clinton's percentage differences, while red bars show Trump's. In 11 states Clinton received a smaller percentage than predicted, whereas this was true for Trump in only 3 states. Comparing the bars state by state, we conclude that in most states Trump gained more relative to the polls than Clinton did, and that many more states fell short of the polls for Clinton than for Trump.


This explains why Trump, surprisingly, received more votes in the actual election than the polls predicted: many swing states shifted toward Trump. The same effect appears in notebook 3, where our bootstrap prediction of the presidency was not accurate. Many states and voters changed their minds at the last minute, which cannot be predicted from the polling data set.
