# Presidential Election 2016: An Exploratory Data Analysis

#### Table of Contents
1. Environment Setup
2. Preparing Packages and Loading Data
3. Plotting
4. Cleaning Another Dataset
    - From http://charts.realclearpolitics.com/charts/%i.xml
5. Predicting the Result Using Bootstrap 
    
    
## Environment Setup
Information regarding environment setup can be found under Prerequisites on the [NewREADME](../project-3-p2-zh-za-ka/NewREADME.md).

## Preparing Packages and Loading Data
We start off by loading the packages that we want to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go

pd.set_option('display.max_columns', 100) #overrides default to display up to 100 columns in dataframes

ModuleNotFoundError: No module named 'plotly'

In [None]:
df = pd.read_csv('http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv')
df.head() #display the first five rows of the dataframe (the default for head())

In [None]:
print("Number of rows (polls): " + str(df.shape[0]))
print("Number of columns (categories): " + str(df.shape[1]))
print("\nNumber of empty values for each column:")
print(df.isnull().sum())

We see that there are 12624 polls and 27 categories of data. Of these, we can subset the dataframe to select only the categories that we're interested in. Let's go ahead and do that:

In [None]:
categories = ['type', 'state', 'enddate', 'pollster', 'grade', 'samplesize', 'population',
             'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin', 'poll_id']
df2 = df.loc[:, categories]
df2.head()

*Note: We've decided to use the adjusted poll data (adjpoll) instead of the raw poll data (rawpoll); this will give us a slight adjustment to account for sampling error. This information was found on the FiveThirtyEight website.*

Awesome! But what is this "type" variable? We can tell from `df2.head()` that there's a type called "polls-plus", but we can't tell much else.

In [None]:
print(df2.loc[:,'type'].unique()) #display unique values of the 'type' factor

We can see three unique types of polls. According to the source of the dataset on [FiveThirtyEight](https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/):
+ **Polls-plus**: Combines polls with an economic index. Since the economic index implies that this election should be a tossup, it assumes the race will tighten somewhat.
+ **Polls-only**: A simpler, what-you-see-is-what-you-get version of the model. It assumes current polls reflect the best forecast for November, although with a lot of uncertainty.
+ **Now-cast**: A projection of what would happen in a hypothetical election held today. Much more aggressive than the other models.

We want to work with the simple adjusted poll data, not combined with other data. So we're going to take out all the polls that have been adjusted to "polls-plus" and "now-cast."

In [None]:
df_po = df2[df2.loc[:,'type']=='polls-only'] #create df_po containing only the polls of type 'polls-only'
df_po = df_po.reset_index(drop=True) #renumber the rows from 0; drop=True discards the old index instead of keeping it as a column
df_po.head()
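
To make the filtering step concrete, here is a minimal sketch on a toy dataframe. The name `toy` and its values are made up for illustration and are not taken from the FiveThirtyEight data; the pattern (boolean mask, then `reset_index`) is the same one used for `df_po`.

```python
import pandas as pd

# Toy stand-in for df2; the values are invented for illustration only.
toy = pd.DataFrame({
    'type': ['polls-plus', 'polls-only', 'now-cast', 'polls-only'],
    'adjpoll_clinton': [45.0, 44.0, 46.0, 43.0],
})

mask = toy['type'] == 'polls-only'         # boolean Series, True where the row qualifies
toy_po = toy[mask].reset_index(drop=True)  # keep matching rows, renumber them from 0

print(toy_po)
```

Without `reset_index(drop=True)`, the surviving rows would keep their original labels (1 and 3 here), which can be confusing when indexing by position later.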

In [None]:
df_po.describe() #display summary statistics for numerical variables

Before we can plot anything, there's an issue that prevents us from being able to place time on the x-axis. The original dataset contained `startdate`, `enddate`, and `forecastdate`; of these three, we've subsetted only the `enddate` into `df2` and `df_po` because it's the most accurate representation of the timeframe of each poll.

In [None]:
df_po.loc[:,'enddate'].head() #view first 5 'enddate' values

Each date is stored as an `object` type, which means pandas treats the values as unrelated strings (discrete labels) rather than as points on a continuous time axis. To fix this, we apply the `to_datetime` function from pandas to the whole column.

In [None]:
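#convert 'enddate' to datetime; plain column assignment replaces the column, whereas
#assigning through .loc[:, 'enddate'] can keep the object dtype in recent pandas versions
df_po['enddate'] = pd.to_datetime(df_po['enddate'])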
df_po.loc[:, 'enddate'].head()
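
The small sketch below illustrates why the conversion matters, using made-up date strings. The `month/day/year` format passed to `to_datetime` is an assumption about how the dates are written; adjust it if the actual column differs.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up sample dates, assumed to be written as month/day/year.
dates = pd.Series(['10/1/2016', '9/30/2016', '11/6/2016'])

# Sorted as text, '10/...' comes before '9/...' because '1' < '9'.
as_text = dates.sort_values().tolist()

# Sorted as datetimes, the order is chronological.
converted = pd.to_datetime(dates, format='%m/%d/%Y')
as_dates = converted.sort_values().dt.strftime('%Y-%m-%d').tolist()

print(as_text)
print(as_dates)
print(is_datetime64_any_dtype(converted))
```

Plotting libraries rely on the same ordering, so a datetime column is what lets time run correctly along the x-axis.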

In [None]:
df_po.loc[:, ['enddate', 'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin']].head(10)

In [None]:
df_po.loc[:,'grade'].unique() #display unique values of the 'grade' factor

We see that there are 10 different `grade` types: A+, A, A-, B+, B, B-, C+, C, C-, and D. In addition, some polls do not have a grade at all. That's a lot to work with, so we'll whittle it down to six: A+, A, B, C, D, and N/A. With the exception of A+, we drop the +/- from all the grades; then we'll draw a scatterplot for each grade.
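
One way to collapse the grades as described is sketched below. The helper name `simplify_grade` and the sample series are hypothetical, and the sketch assumes that ungraded polls appear as missing values (`NaN`) in the `grade` column.

```python
import numpy as np
import pandas as pd

def simplify_grade(grade):
    """Collapse a pollster grade to one of A+, A, B, C, D, or N/A (hypothetical helper)."""
    if pd.isna(grade):
        return 'N/A'             # assumption: ungraded polls are stored as NaN
    if grade == 'A+':
        return 'A+'              # A+ is the one grade that keeps its sign
    return grade.rstrip('+-')    # strip a trailing + or - from every other grade

# Made-up sample covering each case.
grades = pd.Series(['A+', 'A-', 'B+', 'C-', 'D', np.nan])
simple = grades.map(simplify_grade)

print(simple.tolist())
```

On the real data, the same function could be applied with `df_po['grade'].map(simplify_grade)` and the result stored in a new column, leaving the original grades intact.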