## temp

For our project, we would like to collect data to see whether voter preferences correlate with how strongly certain populations were affected by the Great Recession. We can measure the “recession effect” in a number of different ways. For example, we can use statistics from the US Bureau of Labor Statistics for specific geographical areas and compare pre-2008 and post-2008 figures to see how the recession affected jobs in those regions. From there, we could look at polls or voter data to see how those regions voted in either the 2010 midterm elections or the 2012 presidential election. Comparing these two sets of data, we would like to find out which parties or administrations people blamed more for the recession, whether voting habits changed in the areas most affected by the recession, and why.

This sounds good, but because political landscapes change constantly, it is difficult to attribute particular changes to the 2008 recession rather than to other factors that happen to correlate with the areas hit by the recession (the classic correlation/causation issue). If you keep this in mind, I think the analysis can be interesting.
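As a rough sketch of the comparison described above, one option is to define the recession effect for a region as the change in its unemployment rate between a pre-2008 and a post-2008 year, and then compute a plain Pearson correlation against the change in a party's vote share. The function names `recession_effect` and `pearson` are hypothetical, the choice of unemployment rate as the measure is an assumption, and the numbers in the usage lines are toy values rather than real statistics.

```python
import math

def recession_effect(pre_rate, post_rate):
    """Change in a region's unemployment rate, in percentage points."""
    return post_rate - pre_rate

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length >= 2')
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / math.sqrt(var_x * var_y)

# Toy values only: (pre, post) unemployment rates for three made-up regions
effects = [recession_effect(pre, post) for pre, post in [(4.0, 6.0), (5.0, 9.0), (3.0, 4.0)]]
# Toy values only: change in one party's vote share in the same regions
swings = [0.02, 0.04, 0.01]
r = pearson(effects, swings)
```

A coefficient near 1 or -1 only says the two series move together across regions. It does not separate the recession from other factors that vary along with it, which is the caveat raised above.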

1. recession effect data
	- industries affected by the ’08 recession
	- recovery rate of affected industries (pre-’08 / post-’08 industry stats)
	- outsourcing in respective industries
	- [BLS API](http://www.bls.gov/developers/api_signature_v2.htm#multiple)
	- [BLS series](http://www.bls.gov/help/hlpforma.htm#OE)
2. voter preference data
	- pre-’08 / post-’08
3. find a way to partition each data set geographically
	- west coast / midwest / east coast
	- rural / metropolitan
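
One possible way to handle the coarse geographic partition in step 3 is a simple lookup from state to region. This is only a sketch: the helper `region_for` is hypothetical, the grouping below (Pacific coast states, the twelve states usually counted as the Midwest, and states on the Atlantic coast) is one illustrative choice among several, and states are written as lowercase hyphenated names. The rural / metropolitan split would need county-level data and is not covered here.

```python
# Illustrative grouping only; other definitions of these regions exist.
WEST_COAST = {'california', 'oregon', 'washington'}
MIDWEST = {
    'illinois', 'indiana', 'iowa', 'kansas', 'michigan', 'minnesota',
    'missouri', 'nebraska', 'north-dakota', 'ohio', 'south-dakota', 'wisconsin',
}
EAST_COAST = {
    'maine', 'new-hampshire', 'massachusetts', 'rhode-island', 'connecticut',
    'new-york', 'new-jersey', 'delaware', 'maryland', 'virginia',
    'north-carolina', 'south-carolina', 'georgia', 'florida',
}

def region_for(state_slug):
    """Map a lowercase, hyphenated state name to a coarse region label."""
    if state_slug in WEST_COAST:
        return 'west coast'
    if state_slug in MIDWEST:
        return 'midwest'
    if state_slug in EAST_COAST:
        return 'east coast'
    # Everything else falls into a catch-all bucket
    return 'other'
```

Applying the same function to both the recession data and the voter data would keep the two sets partitioned the same way.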


## Analyzing the Election

To analyze the election data, we'll use the election coverage pages of [Politico](http://www.politico.com/) and the [New York Times](http://www.nytimes.com). Unfortunately, because the page formats have changed over the years, it is harder than we'd like to fetch county-level voter data for the 2008, 2012, and 2016 elections with a single script. However, it's not too hard to write a separate script for each of those elections.

In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# List of states and D.C.
states = [
    'Alabama','Alaska','Arizona','Arkansas','California','Colorado',
    'Connecticut','Delaware','District Of Columbia','Florida','Georgia','Hawaii','Idaho', 
    'Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana',
    'Maine','Maryland','Massachusetts','Michigan','Minnesota',
    'Mississippi', 'Missouri','Montana','Nebraska','Nevada',
    'New Hampshire','New Jersey','New Mexico','New York',
    'North Carolina','North Dakota','Ohio',    
    'Oklahoma','Oregon','Pennsylvania','Rhode Island',
    'South Carolina','South Dakota','Tennessee','Texas','Utah',
    'Vermont','Virginia','Washington','West Virginia',
    'Wisconsin','Wyoming'
]

In [21]:
def voter_html_to_data(voter_html):
    '''
    Converts data taken from one row of the table in the html
        Input: a table row of voting data, in html
        Output: a list [county name, Dem votes, Dem share, Rep votes, Rep share]
    '''
    data = [datum.get_text().strip() for datum in voter_html.find_all('td')]
    # For the 2008 elections, the data list will be in the following format:
    # County name, % votes for Obama, Votes for Obama, % votes for McCain, Votes for McCain
    # For simplicity's sake, we'll calculate the percentages ourselves based on the raw vote counts
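    county_name = data[0].lower()
    # Drop the 5-character suffix after each count, then the thousands separators
    blue_votes = int(data[2][:-5].replace(',',''))
    red_votes = int(data[4][:-5].replace(',',''))
    # Shares are of the two-party (Obama + McCain) vote, not of all ballots cast
    blue_percentage = float(blue_votes) / (blue_votes + red_votes)
    red_percentage = float(red_votes) / (blue_votes + red_votes)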
    return [county_name, blue_votes, blue_percentage, red_votes, red_percentage]

def get_voter_data_2008():
    # Set up our data frame
    df = pd.DataFrame(columns=('election_year', 'state', 'county', 'dem_votes', 'dem_percentage', 'rep_votes', 'rep_percentage'))
    
    # Base url we'll be getting data from 
    base_url = 'http://elections.nytimes.com/2008/results/states/president/'
    
    # Get state names for url endings
    state_urls = sorted([state.lower().replace(' ', '-') for state in states])

    # Iterate through the states (except for Alaska and D.C.)
    num_counties = 0
    for state in state_urls:
        if state != 'alaska' and state != 'district-of-columbia':
            # Get data from the site
            response = requests.get(base_url + state + '.html')
            election_soup = BeautifulSoup(response.text, 'html.parser')

            # Rows of the county-level results table for this state
            data_rows = election_soup.find(id='winners-by-county-table').tbody.find_all('tr')
            
            # Data that every row will have (election year and state)
            header_data = ['2008', state]
            for row in data_rows:
                voter_data = header_data + voter_html_to_data(row)
                df.loc[num_counties] = voter_data
                num_counties += 1
    
    # Since Alaska and D.C. don't have counties, we process them slightly differently
    # First, Alaska
    alaska_html = requests.get('http://elections.nytimes.com/2008/results/states/alaska.html')
    alaska_election_soup = BeautifulSoup(alaska_html.text, 'html.parser')
    
    alaska_obama = alaska_election_soup.find(id='presidential-results-table').tbody.find_all('tr')[:2][1].find_all('td')
    alaska_mccain = alaska_election_soup.find(id='presidential-results-table').tbody.find_all('tr')[:2][0].find_all('td')
    alaska_blue_votes = int(alaska_obama[1].get_text().strip().replace(',',''))
    alaska_red_votes = int(alaska_mccain[2].get_text().strip().replace(',',''))
    alaska_blue_percent = float(alaska_blue_votes) / (alaska_blue_votes + alaska_red_votes)
    alaska_red_percent = float(alaska_red_votes) / (alaska_blue_votes + alaska_red_votes)
    alaska_data = ['2008', 'alaska', 'alaska', alaska_blue_votes, alaska_blue_percent, alaska_red_votes, alaska_red_percent]
    df.loc[num_counties] = alaska_data
    num_counties += 1
    
    # Finally, D.C.
    dc_html = requests.get('http://elections.nytimes.com/2008/results/states/district-of-columbia.html')
    dc_election_soup = BeautifulSoup(dc_html.text, 'html.parser')

    dc_obama = dc_election_soup.find(id='presidential-results-table').tbody.find_all('tr')[:2][0].find_all('td')
    dc_mccain = dc_election_soup.find(id='presidential-results-table').tbody.find_all('tr')[:2][1].find_all('td')
    dc_blue_votes = int(dc_obama[2].get_text().strip().replace(',',''))
    dc_red_votes = int(dc_mccain[1].get_text().strip().replace(',',''))
    dc_blue_percent = float(dc_blue_votes) / (dc_blue_votes + dc_red_votes)
    dc_red_percent = float(dc_red_votes) / (dc_blue_votes + dc_red_votes)
    dc_data = ['2008', 'district-of-columbia', 'district-of-columbia', dc_blue_votes, dc_blue_percent, dc_red_votes, dc_red_percent]
    df.loc[num_counties] = dc_data
    num_counties += 1
    
    return df
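
The `[:-5]` slice in `voter_html_to_data` depends on the exact length of the text that follows each vote count, so it would break if the page used a different suffix. A more tolerant variant is sketched below. The helper names `parse_vote_count` and `two_party_shares` are hypothetical, and the sample cell text `'6,091 votes'` is an assumption about how a cell might look, not a copy of the real page.

```python
import re

def parse_vote_count(cell_text):
    """Pull an integer vote count out of a table cell's text.

    Keeps only the digits, so thousands separators and any trailing
    label are ignored. Assumes the cell holds a single number.
    """
    digits = re.sub(r'[^0-9]', '', cell_text)
    if not digits:
        raise ValueError('no digits found in %r' % cell_text)
    return int(digits)

def two_party_shares(blue_votes, red_votes):
    """Return (blue share, red share) of the two-party vote."""
    total = blue_votes + red_votes
    if total == 0:
        # Avoid dividing by zero for rows with no reported votes
        return (0.0, 0.0)
    return (float(blue_votes) / total, float(red_votes) / total)

# Assumed cell text, for illustration only
count = parse_vote_count('6,091 votes')
```

Because the helper strips every non-digit character, a cell containing more than one number would be merged into a single value, so it is only suitable when each cell holds one count.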

In [23]:
df = get_voter_data_2008()
print(df)

     election_year                 state                county  dem_votes  \
0             2008               alabama               autauga     6091.0   
1             2008               alabama               baldwin    19362.0   
2             2008               alabama               barbour     5685.0   
3             2008               alabama                  bibb     2289.0   
4             2008               alabama                blount     3518.0   
5             2008               alabama               bullock     4001.0   
6             2008               alabama                butler     4174.0   
7             2008               alabama               calhoun    16325.0   
8             2008               alabama              chambers     6782.0   
9             2008               alabama              cherokee     2299.0   
10            2008               alabama               chilton     3666.0   
11            2008               alabama               choctaw     3633.0   