# Presidential Election 2016: An Exploratory Data Analysis

#### Table of Contents
1. Environment Setup
2. Preparing Packages and Loading Data
3. Plotting
4. Cleaning Another Dataset
    - From http://charts.realclearpolitics.com/charts/%i.xml
5. Predicting the Result Using Bootstrap 
    
    
## Environment Setup
Information regarding environment setup can be found under Prerequisites on the [NewREADME](../project-3-p2-zh-za-ka/NewREADME.md).

## Preparing Packages and Loading Data
We start off by loading the packages that we want to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100) #overrides default to display up to 100 columns in dataframes

In [2]:
df = pd.read_csv('http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv')
df.head() #display the first five rows of the dataframe

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,samplesize,population,poll_wt,rawpoll_clinton,rawpoll_trump,rawpoll_johnson,rawpoll_mcmullin,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
0,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,8.720654,47.0,43.0,4.0,,45.20163,41.7243,4.626221,,,https://www.washingtonpost.com/news/the-fix/wp...,48630,76192,11/7/16,09:35:33 8 Nov 2016
1,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/1/2016,11/7/2016,Google Consumer Surveys,B,26574.0,lv,7.628472,38.03,35.69,5.46,,43.34557,41.21439,5.175792,,,https://datastudio.google.com/u/0/#/org//repor...,48847,76443,11/7/16,09:35:33 8 Nov 2016
2,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/2/2016,11/6/2016,Ipsos,A-,2195.0,lv,6.424334,42.0,39.0,6.0,,42.02638,38.8162,6.844734,,,http://projects.fivethirtyeight.com/polls/2016...,48922,76636,11/8/16,09:35:33 8 Nov 2016
3,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/4/2016,11/7/2016,YouGov,B,3677.0,lv,6.087135,45.0,41.0,5.0,,45.65676,40.92004,6.069454,,,https://d25d2506sfb94s.cloudfront.net/cumulus_...,48687,76262,11/7/16,09:35:33 8 Nov 2016
4,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Gravis Marketing,B-,16639.0,rv,5.316449,47.0,43.0,3.0,,46.84089,42.33184,3.726098,,,http://www.gravispolls.com/2016/11/final-natio...,48848,76444,11/7/16,09:35:33 8 Nov 2016


In [3]:
print("Number of rows (polls): " + str(df.shape[0]))
print("Number of columns (categories): " + str(df.shape[1]))
print("\nNumber of empty values for each column:")
print(df.isnull().sum())

Number of rows (polls): 12624
Number of columns (categories): 27

Number of empty values for each column:
cycle                   0
branch                  0
type                    0
matchup                 0
forecastdate            0
state                   0
startdate               0
enddate                 0
pollster                0
grade                1287
samplesize              3
population              0
poll_wt                 0
rawpoll_clinton         0
rawpoll_trump           0
rawpoll_johnson      4227
rawpoll_mcmullin    12534
adjpoll_clinton         0
adjpoll_trump           0
adjpoll_johnson      4227
adjpoll_mcmullin    12534
multiversions       12588
url                     3
poll_id                 0
question_id             0
createddate             0
timestamp               0
dtype: int64


We see that there are 12624 polls and 27 categories of data. Of these, we can subset the dataframe to select only the categories that we're interested in. Let's go ahead and do that:

In [4]:
categories = ['type', 'state', 'enddate', 'pollster', 'grade', 'samplesize', 'population',
             'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin', 'poll_id']
df2 = df.loc[:, categories]
df2.head()

Unnamed: 0,type,state,enddate,pollster,grade,samplesize,population,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,poll_id
0,polls-plus,U.S.,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,45.20163,41.7243,4.626221,,48630
1,polls-plus,U.S.,11/7/2016,Google Consumer Surveys,B,26574.0,lv,43.34557,41.21439,5.175792,,48847
2,polls-plus,U.S.,11/6/2016,Ipsos,A-,2195.0,lv,42.02638,38.8162,6.844734,,48922
3,polls-plus,U.S.,11/7/2016,YouGov,B,3677.0,lv,45.65676,40.92004,6.069454,,48687
4,polls-plus,U.S.,11/6/2016,Gravis Marketing,B-,16639.0,rv,46.84089,42.33184,3.726098,,48848


*Note: We've decided to use the adjusted poll data (adjpoll) instead of the raw poll data (rawpoll); this will give us a slight adjustment to account for sampling error. This information was found on the FiveThirtyEight website.*

Awesome! But what is this "type" variable? We can tell from `df2.head()` that there's a type called "polls-plus", but we can't tell much else.

In [5]:
print(df2.loc[:,'type'].unique()) #display unique values of the 'type' factor

['polls-plus' 'now-cast' 'polls-only']


We can see three unique types of polls. According to the source of the dataset on [FiveThirtyEight](https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/):
+ **Polls-plus**: Combines polls with an economic index. Since the economic index implies that this election should be a tossup, it assumes the race will tighten somewhat.
+ **Polls-only**: A simpler, what-you-see-is-what-you-get version of the model. It assumes current polls reflect the best forecast for November, although with a lot of uncertainty.
+ **Now-cast**: A projection of what would happen in a hypothetical election held today. Much more aggressive than the other models.

We want to work with the simple adjusted poll data, not combined with other data. So we'll drop every row whose type is "polls-plus" or "now-cast" and keep only the "polls-only" rows.

In [6]:
df_po = df2[df2.loc[:,'type']=='polls-only'] #create df_po containing only the polls of type 'polls-only'
df_po = df_po.reset_index(drop=True) #reset the dataframe indices, and drop the original indices from memory
df_po.head()

Unnamed: 0,type,state,enddate,pollster,grade,samplesize,population,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,poll_id
0,polls-only,U.S.,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,45.21947,41.70754,4.606925,,48630
1,polls-only,U.S.,11/7/2016,Google Consumer Surveys,B,26574.0,lv,43.40083,41.14659,5.164047,,48847
2,polls-only,U.S.,11/6/2016,Ipsos,A-,2195.0,lv,42.01984,38.74365,6.816055,,48922
3,polls-only,U.S.,11/7/2016,YouGov,B,3677.0,lv,45.68214,40.90047,6.118311,,48687
4,polls-only,U.S.,11/6/2016,Gravis Marketing,B-,16639.0,rv,46.83107,42.27754,3.749071,,48848


In [7]:
df_po.describe() #display summary statistics for numerical variables

Unnamed: 0,samplesize,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,poll_id
count,4207.0,4208.0,4208.0,2799.0,30.0,4208.0
mean,1148.216068,43.322517,42.654425,4.651088,24.508827,45910.899477
std,2630.856265,7.097772,6.948612,2.47239,5.235812,2864.763228
min,35.0,17.11589,4.488276,-3.677883,11.02832,35362.0
25%,447.5,40.22023,38.449348,3.130344,23.108497,45151.75
50%,772.0,44.142125,42.70472,4.36681,25.135225,46384.5
75%,1236.5,46.901398,46.315503,5.763004,27.976062,47741.25
max,84292.0,86.7132,72.37661,20.357,31.57469,48922.0


Before we can plot anything, there's an issue that prevents us from being able to place time on the x-axis. The original dataset contained `startdate`, `enddate`, and `forecastdate`; of these three, we've subsetted only the `enddate` into `df2` and `df_po` because it's the most accurate representation of the timeframe of each poll.

In [8]:
df_po.loc[:,'enddate'].head() #view first 5 'enddate' values

0    11/6/2016
1    11/7/2016
2    11/6/2016
3    11/7/2016
4    11/6/2016
Name: enddate, dtype: object

The column's dtype is `object`, which means pandas is storing each date as a plain string; the values would be treated as unrelated labels rather than as points on a timeline. To fix this, we pass the column to the `to_datetime` function from Pandas.

In [9]:
df_po['enddate'] = pd.to_datetime(df_po['enddate']) #convert 'enddate' to datetime; assigning the whole column lets its dtype change
df_po.loc[:, 'enddate'].head()

0   2016-11-06
1   2016-11-07
2   2016-11-06
3   2016-11-07
4   2016-11-06
Name: enddate, dtype: datetime64[ns]
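To see why the conversion matters, here's a minimal sketch on a few made-up dates (they are not rows from the dataset). Sorted as strings, the dates come out in character order; sorted as datetimes, they come out in calendar order, which is what we need for a time axis.

```python
import pandas as pd

# Hypothetical dates for illustration only; not taken from the dataset
raw = pd.Series(['11/6/2016', '2/15/2016', '9/30/2016'])

# As strings, sorting compares character by character, so '11/...' lands before '2/...'
sorted_as_text = raw.sort_values().tolist()

# After conversion, sorting follows the calendar
dates = pd.to_datetime(raw, format='%m/%d/%Y')
sorted_as_dates = dates.sort_values().dt.strftime('%Y-%m-%d').tolist()

print(sorted_as_text)
print(sorted_as_dates)
```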

In [10]:
df_po.loc[:, ['enddate', 'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin']].head(10)

Unnamed: 0,enddate,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin
0,2016-11-06,45.21947,41.70754,4.606925,
1,2016-11-07,43.40083,41.14659,5.164047,
2,2016-11-06,42.01984,38.74365,6.816055,
3,2016-11-07,45.68214,40.90047,6.118311,
4,2016-11-06,46.83107,42.27754,3.749071,
5,2016-11-06,49.05626,43.87898,3.018706,
6,2016-11-06,45.31196,40.80614,4.230162,
7,2016-11-05,43.68695,40.80897,5.381917,
8,2016-11-06,45.03026,41.83415,8.034579,
9,2016-11-07,42.88452,42.18602,6.367243,


In [11]:
df_po.loc[:,'grade'].unique() #display unique values of the 'grade' factor

array(['A+', 'B', 'A-', 'B-', 'A', nan, 'B+', 'C+', 'C-', 'C', 'D'], dtype=object)

We see that there are 10 different `grade` types: A+, A, A-, B+, B, B-, C+, C, C-, and D. In addition, some polls do not have a grade at all. That's a lot to work with, so we'll whittle it down to six groups: A+, A, B, C, D, and N/A. With the exception of A+, we drop the +/- from all the grades; then we'll draw a scatterplot for each group.
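One possible way to do this collapsing is sketched below. The helper name `simplify_grade` and the `'N/A'` label are our own choices for illustration, and the sample grades are made up rather than read from `df_po`.

```python
import pandas as pd

def simplify_grade(grade):
    """Collapse a pollster grade to one of: A+, A, B, C, D, N/A."""
    if pd.isna(grade):
        return 'N/A'            # polls without a grade
    if grade == 'A+':
        return 'A+'             # A+ stays as its own group
    return grade.rstrip('+-')   # drop the +/- from every other grade

# Made-up sample covering each case, including a missing grade
sample = pd.Series(['A+', 'A-', 'B+', 'B', 'C-', 'D', None])
simplified = sample.apply(simplify_grade)
print(simplified.tolist())
```

On the real data, the same helper could be applied with `df_po['grade'].apply(simplify_grade)`.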

In [12]:
import requests

def get_poll_xml(poll_id):
    """
    Function
    --------
    get_poll_xml
    Given a poll_id, return the XML data as a text string
    """
    url = "http://charts.realclearpolitics.com/charts/%i.xml" % int(poll_id)
    return requests.get(url).text

In [13]:
import pandas as pd
import xml.etree.ElementTree as ET

def rcp_poll_data(xml):
    """
    Function
    ---------
    rcp_poll_data
    Extract poll information from an XML string, and convert to a DataFrame
    Parameters
    ----------
    xml : str
        A string, containing the XML data from a page like
        get_poll_xml(1044)
    Returns
    -------
    A pandas DataFrame with the following columns:
        date: The date for each entry
        title_n: The data value for the gid=n graph (take the column name
        from the `title` tag)
    """
    tree = ET.fromstring(xml)
    dictionary = dict()

    # the first <series> element holds one <value> per date
    dates = list()
    series = tree.findall('series')
    for value in series[0].findall('value'):
        dates.append(value.text)
    dictionary['date'] = pd.to_datetime(dates)

    # each <graph> holds one candidate's values, keyed by its title
    graphs = tree.findall('graphs/graph')
    for graph in graphs:
        values = list()
        title = graph.get('title')
        for value in graph.findall('value'):
            try:
                values.append(float(value.text))
            except (TypeError, ValueError):
                # keep missing or non-numeric entries as they are
                values.append(value.text)
        dictionary[title] = values

    df = pd.DataFrame(dictionary)
    df_clean = df.dropna() # drop dates with missing values
    return df_clean
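To make the expected input concrete, here's a small sketch that parses a hand-written XML string laid out the way the function above assumes: a `series` element holding the dates, and one `graph` per candidate under `graphs`. The XML content, the `chart` root tag, and the candidate names are invented for illustration; real RCP files may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed layout; not real RCP data
sample_xml = """
<chart>
  <series>
    <value xid="0">10/1/2010</value>
    <value xid="1">10/2/2010</value>
  </series>
  <graphs>
    <graph gid="1" title="Candidate A" color="#D30015">
      <value xid="0">48.0</value>
      <value xid="1">49.5</value>
    </graph>
    <graph gid="2" title="Candidate B" color="#3B5998">
      <value xid="0">44.0</value>
      <value xid="1">43.5</value>
    </graph>
  </graphs>
</chart>
"""

root = ET.fromstring(sample_xml)

# Dates come from the <series> element
dates = [v.text for v in root.find('series').findall('value')]

# One list of numbers per <graph>, keyed by the graph's title attribute
values = {g.get('title'): [float(v.text) for v in g.findall('value')]
          for g in root.findall('graphs/graph')}

print(dates)
print(values)
```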

In [14]:
import re
import requests

def find_governor_races(url):
    """
    Function
    --------
    find_governor_races

    Find and return links to RCP races on a page like
    http://www.realclearpolitics.com/epolls/2010/governor/
                                        2010_elections_governor_map.html
    Parameters
    ----------
    url : str
        The URL of the page to scan
    Returns
    -------
    A list of urls for Governor race pages
    """
    text = requests.get(url).text
    # raw string with escaped dots, so '.' only matches a literal dot in the fixed parts
    links = re.findall(r'http://www\.realclearpolitics\.com/epolls/\d{4}/governor/\D{2}/.*?-\d{,4}\.html', text)
    links = list(set(links)) # remove duplicate links
    return links
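As a quick check of what the pattern picks up, the sketch below runs it against a hand-written HTML snippet instead of a live page. The race names in the links are made up; the pattern is the one from the function, written as a raw string with the dots escaped.

```python
import re

pattern = (r'http://www\.realclearpolitics\.com/epolls/\d{4}/governor/'
           r'\D{2}/.*?-\d{,4}\.html')

# Hypothetical page content: a duplicated governor link and one senate link
sample_html = '''
<a href="http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race_a_vs_b-1113.html">CA</a>
<a href="http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race_a_vs_b-1113.html">CA again</a>
<a href="http://www.realclearpolitics.com/epolls/2010/senate/ny/example_senate_race-1234.html">NY Senate</a>
'''

# set() removes the duplicate; the senate link does not match at all
links = sorted(set(re.findall(pattern, sample_html)))
print(links)
```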

In [23]:
from bs4 import BeautifulSoup
import requests

def race_result(url):
    """
    Function
    --------
    race_result
    Return the actual voting results on a race page
    Parameters
    ----------
    url : string
        The website to search through
    Returns
    -------
    A dictionary whose keys are candidate names,
    and whose values are the percentages of votes they received.
    """
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser') # name the parser explicitly
    tables = soup.find_all('table', {'class': 'data'})
    table = tables[0]
    rows = table.find_all("tr")
    columns = [col.get_text() for col in rows[0].find_all("th")]
    candidates = [column.split('(')[0].strip() for column in columns[3:-1]]

    # read the numbers from the first row after the header
    row = rows[1]
    tds = row.find_all("td")
    results = [float(t.get_text()) for t in tds[3:-1]]
    # rescale so the candidates' shares sum to 100
    tot = sum(results) / 100

    return {l: r / tot for l, r in zip(candidates, results)}
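The rescaling step at the end of `race_result` can be sketched on its own. The numbers below are hypothetical vote shares, not results scraped from any page; the point is only that dividing by `sum(results) / 100` makes the listed candidates' shares add up to 100.

```python
# Hypothetical raw percentages from a results row; they need not sum to 100
candidates = ['Candidate A', 'Candidate B', 'Candidate C']
results = [53.8, 40.9, 2.3]

tot = sum(results) / 100  # scale factor
normalized = {name: share / tot for name, share in zip(candidates, results)}

print(normalized)
print(round(sum(normalized.values()), 6))
```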




## Data Visualization



In [31]:
import re
import xml.etree.ElementTree as ET

def _strip(s):
    """Remove every character that is not a letter or digit from a word."""
    return re.sub(r'[\W_]+', '', s)

def plot_colors(xml):
    """
    Given an XML document from Real Clear Politics, returns a python dictionary
    that maps a graph title to a graph color.
    """
    # parse with ElementTree, the same parser rcp_poll_data uses
    tree = ET.fromstring(xml)
    result = {}
    for graph in tree.findall('graphs/graph'):
        title = _strip(graph.get('title'))
        result[title] = graph.get('color')
    return result

In [32]:
def poll_plot(poll_id):
    """
    Make a plot of an RCP Poll over time
    Parameters
    ----------
    poll_id : int
        An RCP poll identifier
    """
    xml = get_poll_xml(poll_id)
    data = rcp_poll_data(xml)
    colors = plot_colors(xml)
    data = data.rename(columns={c: _strip(c) for c in data.columns})

    #normalize poll numbers so they add to 100%
    norm = data[list(colors)].sum(axis=1) / 100 # list() so pandas selects these columns
    for c in colors:
        data[c] /= norm

    for label, color in colors.items():
        plt.plot(data.date, data[label], color=color, label=label)

    plt.xticks(rotation=70)
    plt.legend(loc='best')
    plt.xlabel("Date")
    plt.ylabel("Normalized Poll Percentage")

In [30]:
poll_plot(1113)



In [None]:
def id_from_url(url):
    """Given a URL, look up the RCP identifier number"""
    return url.split('-')[-1].split('.html')[0]

def plot_race(url):
    """Make a plot summarizing the historical poll data and the actual results
    """
    poll_id = id_from_url(url) # avoid shadowing the built-in id()
    xml = get_poll_xml(poll_id)
    colors = plot_colors(xml)
    if len(colors) == 0:
        return
    result = race_result(url)
    poll_plot(poll_id)
    plt.xlabel("Date")
    plt.ylabel("Polling Percentage")
    # dashed horizontal line at each candidate's actual result
    for r in result:
        plt.axhline(result[r], color=colors[_strip(r)], alpha=0.6, ls='--')
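For reference, here's how the identifier is pulled out of a race URL. The URL below is a made-up example in the same shape as the governor race links; note that the result is a string, which `get_poll_xml` then converts with `int()`.

```python
def id_from_url(url):
    """Given a URL, look up the RCP identifier number"""
    return url.split('-')[-1].split('.html')[0]

# Hypothetical race URL, used only to show the string handling
example_url = ('http://www.realclearpolitics.com/epolls/2010/governor/ca/'
               'example_race_a_vs_b-1113.html')

poll_id = id_from_url(example_url)  # text after the last '-' and before '.html'
print(repr(poll_id))
print(int(poll_id))
```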

In [None]:
url = 'http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html'
for race in find_governor_races(url):
    plot_race(race)
    plt.show()