In [3]:
import pandas as pd
import numpy as np

# Homework 3

The instructors/staff set the solution up using dictionaries. However, it can easily be solved using dataframes, which is the data structure I use most frequently.

The solution also doesn't do much to help students understand the data or how to match data across different representations (polls to results). As such, I'm implementing a solution independent of the instructor-written code in the `election.py` file.

## Problem 1: State edges

"Represent the result of a state election in terms of the “Democratic edge”, or the difference between the Democratic and Republican popular vote percentages in that state. For example, if the Democratic candidate receives 53% of the vote (actual or predicted), and the Republican candidate receives 47% of the vote, then the Democratic edge is 6 (percentage points). A positive edge indicates a Democratic lead, and a negative edge indicates a Republican lead. The sign associated with each party is arbitrary — no positive or negative connotation is intended."

First, the data must be read in. I always like to take a look at the head (or tail) of the data to get a feel for what it looks like, the column names, and the data types I'm dealing with.

In [6]:
polls_2008 = pd.read_csv('homework3/data/2008-polls.csv')
polls_2008.head()

Unnamed: 0,State,Dem,Rep,Date,Pollster
0,AK,33,64,Sep 09 2008,Rasmussen
1,AK,35,54,Sep 02 2008,IvanMooreResearch
2,AK,37,55,Sep 21 2008,FairleighDickinsonU
3,AK,38,55,Oct 06 2008,IvanMooreResearch
4,AK,39,44,Jul 30 2008,Rasmussen


Now that I know what the columns are called, I can easily create a new column for the Democratic edge.

I call `head` again to make sure the calculation performs as expected. This method of spot checking a small number of rows isn't bomb proof, but it will let me know immediately if there is anything drastically unexpected or wrong.

In [7]:
polls_2008['dem_edge'] = polls_2008['Dem'] - polls_2008['Rep']
polls_2008.head()

Unnamed: 0,State,Dem,Rep,Date,Pollster,dem_edge
0,AK,33,64,Sep 09 2008,Rasmussen,-31
1,AK,35,54,Sep 02 2008,IvanMooreResearch,-19
2,AK,37,55,Sep 21 2008,FairleighDickinsonU,-18
3,AK,38,55,Oct 06 2008,IvanMooreResearch,-17
4,AK,39,44,Jul 30 2008,Rasmussen,-5


## Problem 2: Find the most recent poll per pollster per state

"Election sentiment ebbs and flows in the months leading up to an election, and as a result older polls are much less accurate than more recent polls. In order to prevent old poll data from unduly influencing our prediction of the 2012 election, our program will only consider the most recent poll from a pollster in a state."

To make comparing dates quick and easy, the first thing I do is turn the `Date` column, which is read in as text, into a `datetime` field. This will allow me to use built-in functions like `max` and `min`.

I very lazily call this new column `date`. Keep in mind that it's dangerous to use column names that differ only by case (`Date` vs. `date`), since they are easy to mix up.

In [9]:
polls_2008['date'] = pd.to_datetime(polls_2008['Date'])
polls_2008.head()

Unnamed: 0,State,Dem,Rep,Date,Pollster,dem_edge,date
0,AK,33,64,Sep 09 2008,Rasmussen,-31,2008-09-09
1,AK,35,54,Sep 02 2008,IvanMooreResearch,-19,2008-09-02
2,AK,37,55,Sep 21 2008,FairleighDickinsonU,-18,2008-09-21
3,AK,38,55,Oct 06 2008,IvanMooreResearch,-17,2008-10-06
4,AK,39,44,Jul 30 2008,Rasmussen,-5,2008-07-30
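As a side note, `pd.to_datetime` infers the layout of the date strings when no format is given. A minimal sketch of a stricter variant, assuming every value follows the `Mon DD YYYY` layout seen in the head above (the three sample strings are copied from that output, and `raw_dates` and `parsed` are illustrative names):

```python
import pandas as pd

# Sample values in the same 'Mon DD YYYY' layout as the Date column
raw_dates = pd.Series(['Sep 09 2008', 'Oct 06 2008', 'Jul 30 2008'])

# With an explicit format, a value in any other layout raises an error
# instead of being silently guessed at
parsed = pd.to_datetime(raw_dates, format='%b %d %Y')

print(parsed.max())  # latest date in the sample
print(parsed.min())  # earliest date in the sample
```

Whether the extra strictness is worth it depends on how much the input file is trusted.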


The `shape` attribute lets me know the original number of rows.

In [23]:
polls_2008.shape

(1185, 7)

I was also curious about the date range in the 'polls' file. I was surprised to learn that polling started as early as January and included 200 different days from January 6 through November 2.

Also of interest is that no single date has more than 51 results. There are 101 different pollsters in the file (although only a handful do a meaningful number of polls) and 51 "states" (the 50 states plus the District of Columbia), meaning that there could be up to 5,151 polls on a single day. In all fairness, if that many polls were conducted on a single day, the general population might revolt.

In [10]:
polls_2008['date'].value_counts()

2008-02-28    51
2008-10-26    42
2008-10-28    29
2008-10-22    27
2008-10-30    25
              ..
2008-08-27     1
2008-05-02     1
2008-07-20     1
2008-02-11     1
2008-04-05     1
Name: date, Length: 200, dtype: int64

In [11]:
print('Most recent date:', polls_2008['date'].max())
print('Oldest date:', polls_2008['date'].min())

Most recent date: 2008-11-02 00:00:00
Oldest date: 2008-01-06 00:00:00


In [32]:
polls_2008['Pollster'].value_counts()

Rasmussen                359
SurveyUSA                243
ARG                       78
QuinnipiacU               60
OpinionResearch           51
                        ... 
TelOpinionResearch         1
DFMResearch                1
MinnesotaStateU            1
DartmouthColl              1
WestVirginiaWesleyanU      1
Name: Pollster, Length: 101, dtype: int64

Now that I have the date as a `datetime` field and a general feel for the data, I get just the most recent polls. Per the instructions, I want the most recent poll for each pollster in every state.

I turn the results into a dataframe to facilitate easily pulling the rest of the information out of the original dataframe.

In [19]:
most_recent = pd.DataFrame(polls_2008.groupby(['State', 'Pollster'])['date'].max().reset_index())
most_recent.head()

Unnamed: 0,State,Pollster,date
0,AK,ARG,2008-09-11
1,AK,FairleighDickinsonU,2008-09-21
2,AK,IvanMooreResearch,2008-10-19
3,AK,Rasmussen,2008-10-28
4,AK,SurveyUSA,2008-02-28
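An alternative to computing the latest date per group and then joining it back to the full rows is to sort by date and keep the last row of each State/Pollster pair. The sketch below uses a small made-up frame (`toy_polls` is an illustrative name, not part of the homework data) and is not the approach this notebook takes:

```python
import pandas as pd

# Made-up polls; column names mirror the notebook's
toy_polls = pd.DataFrame({
    'State': ['AK', 'AK', 'AK', 'WY'],
    'Pollster': ['Rasmussen', 'Rasmussen', 'ARG', 'ARG'],
    'dem_edge': [-31, -16, -16, -38],
    'date': pd.to_datetime(['2008-09-09', '2008-10-28', '2008-09-11', '2008-09-11']),
})

# Sort oldest to newest, then keep only the last (newest) row per State/Pollster pair
latest = (
    toy_polls.sort_values('date')
             .drop_duplicates(subset=['State', 'Pollster'], keep='last')
             .sort_values(['State', 'Pollster'])
             .reset_index(drop=True)
)
print(latest)
```

One difference to be aware of: if a pollster has two polls in the same state on the same latest date, this version keeps exactly one of them, while a join on the maximum date keeps both.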


An inner merge between the original dataframe and the one with only the most recent polls gives just the 374 most recent polls. However, from the returned head/tail data rows, we can see that these aren't all that recent; SurveyUSA's most recent poll in Alaska was in February.

If I were doing this independently, I would probably filter this further to include only the most recent X days. For example, only polls taken within the last month (28 or 30 days depending on how I want to define 'month'). Another option would be to weight the polls based on recency as discussed in the [directions](https://courses.cs.washington.edu/courses/cse140/13wi/homework/hw3/homework3.html#assignment-overview).

In [24]:
recent_polls = polls_2008.merge(most_recent)
recent_polls

Unnamed: 0,State,Dem,Rep,Date,Pollster,dem_edge,date
0,AK,37,55,Sep 21 2008,FairleighDickinsonU,-18,2008-09-21
1,AK,39,55,Sep 11 2008,ARG,-16,2008-09-11
2,AK,41,57,Oct 28 2008,Rasmussen,-16,2008-10-28
3,AK,42,53,Oct 19 2008,IvanMooreResearch,-11,2008-10-19
4,AK,43,48,Feb 28 2008,SurveyUSA,-5,2008-02-28
...,...,...,...,...,...,...,...
369,WV,50,42,Oct 08 2008,ARG,8,2008-10-08
370,WY,28,66,Sep 11 2008,ARG,-38,2008-09-11
371,WY,32,58,Oct 14 2008,MasonDixon,-26,2008-10-14
372,WY,37,58,Oct 19 2008,SurveyUSA,-21,2008-10-19
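The idea of keeping only polls from the last X days could look something like the sketch below. The 30-day window, the choice to measure it back from the newest poll in the data, and the names `toy_recent`, `window`, and `cutoff` are all assumptions for illustration; the three rows reuse Alaska dates from the output above.

```python
import pandas as pd

# Three Alaska rows with dates taken from the output above
toy_recent = pd.DataFrame({
    'State': ['AK', 'AK', 'AK'],
    'Pollster': ['SurveyUSA', 'ARG', 'Rasmussen'],
    'date': pd.to_datetime(['2008-02-28', '2008-09-11', '2008-10-28']),
})

# Assumption: a 30-day window counted back from the newest poll in the data
window = pd.Timedelta(days=30)
cutoff = toy_recent['date'].max() - window

# Keep only polls on or after the cutoff date
within_window = toy_recent[toy_recent['date'] >= cutoff]
print(within_window)
```

A stricter window drops more stale polls but can also leave some states with no polls at all, so the result would need a check for missing states.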


## Problem 3: Pollster predictions

"An election poll has two pieces of identifying information (“keys”): a state and a pollster. Thus, our Python representation should make it easy to look up a poll given both keys. Nesting a dictionary within another dictionary is a common way to facilitate lookup with multiple keys."

\* This step is unnecessary since I'm using a dataframe. One major advantage of using dataframes when working with large volumes of data is that you can use any column to look up values.
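As a rough illustration of looking up a poll by both keys without a nested dictionary, here are two common options on a small made-up frame (`toy_polls` and the other variable names are illustrative):

```python
import pandas as pd

# Made-up frame with the two "key" columns and one value column
toy_polls = pd.DataFrame({
    'State': ['AK', 'AK', 'WY'],
    'Pollster': ['ARG', 'Rasmussen', 'ARG'],
    'dem_edge': [-16, -16, -38],
})

# Option 1: a boolean mask on both key columns
mask = (toy_polls['State'] == 'WY') & (toy_polls['Pollster'] == 'ARG')
edge_from_mask = toy_polls.loc[mask, 'dem_edge'].iloc[0]

# Option 2: a two-level index, which reads much like a nested dictionary lookup
indexed = toy_polls.set_index(['Pollster', 'State'])['dem_edge'].sort_index()
edge_from_index = indexed.loc[('ARG', 'WY')]

print(edge_from_mask, edge_from_index)
```

The mask works on any column without preparation; the indexed version is more convenient when the same two keys are looked up repeatedly.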

## Problem 4: Pollster errors

"Now that we can represent election results and polling data in Python, we can begin to implement Nate Silver's algorithm. A first step is to write a function that computes the rank (average error) of a pollster's predictions."

The first thing we need for this is the actual results.

In [28]:
results_2008 = pd.read_csv('homework3/data/2008-results.csv')
results_2008.head()

Unnamed: 0,State,Dem,Rep
0,AK,37.9,59.4
1,AL,38.7,60.3
2,AR,38.9,58.7
3,AZ,45.1,53.6
4,CA,61.0,37.0


The first few rows tell me that there is only one row per state and that the `Dem` and `Rep` columns give me the actual percentages. I need to create a `dem_edge` column again so I can compare it to the `polls_2008` dataframe.

In [30]:
results_2008['dem_edge'] = results_2008['Dem'] - results_2008['Rep']
results_2008.head()

Unnamed: 0,State,Dem,Rep,dem_edge
0,AK,37.9,59.4,-21.5
1,AL,38.7,60.3,-21.6
2,AR,38.9,58.7,-19.8
3,AZ,45.1,53.6,-8.5
4,CA,61.0,37.0,24.0


#### Number of States

To ensure I can combine the two dataframes, I checked the number of states. There are 51, which is the expected count: the 50 states in the United States plus the District of Columbia.

In [29]:
results_2008.shape

(51, 3)

In [31]:
polls_2008['State'].nunique()

51

Since the number of states in the polls and in the results match, it *probably* won't be a problem. However, I'll want to check the number of states again after I combine the datasets. A number less than 51 would tell me that the two files don't contain the same set of states.

For example, let's consider the following two files

- file 1 contains Alaska, California, Washington, Oregon, and Hawaii
- file 2 contains Nevada, California, Washington, Oregon, and Idaho

When the two files are combined you can get one of two results depending on how the files are combined:

- You get 7 states, where there are missing values for 4 of those states (outer join)
- You get 3 states, which is only the states shared between the two files (inner join)

Since I'll be doing an inner join, I'd get fewer states instead of missing values.
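The two-file example above can be reproduced directly. The edge values below are made up; only the state lists come from the example, and `file_1`, `file_2`, `outer`, and `inner` are illustrative names.

```python
import pandas as pd

# State lists from the example above; the edge values are invented
file_1 = pd.DataFrame({'State': ['AK', 'CA', 'WA', 'OR', 'HI'],
                       'poll_edge': [-16, 20, 15, 12, 30]})
file_2 = pd.DataFrame({'State': ['NV', 'CA', 'WA', 'OR', 'ID'],
                       'actual_edge': [12.5, 24.0, 17.2, 16.4, -25.3]})

# Outer join: every state from either file, with NaN where one side is missing
outer = file_1.merge(file_2, on='State', how='outer', indicator=True)

# Inner join: only the states present in both files
inner = file_1.merge(file_2, on='State', how='inner')

print(len(outer))  # 7
print(len(inner))  # 3
print(outer[outer['_merge'] != 'both'])  # the 4 states found in only one file
```

The `indicator=True` option adds a `_merge` column showing which side each row came from, which is a quick way to find out which states are missing from which file.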

Now that I have the predicted and actual results, I can address the problem: compute the rank (average error) of a pollster's predictions.

"compute the average error of pollster edges. In each state, the error of a predicted edge is the absolute value of the difference between the predicted edge and actual edge. The average error of a collection of pollster edges is the average of these individual errors.

Hint: Not all pollsters conduct polls in every state. When computing an average error, be sure to divide by the number of states in which a pollster made a prediction, not by the total number of states.

Next, use average_error to implement the function pollster_errors. Again, refer to the data type reference below for more information about parameter and return types. Once completed, pollster_errors provides a quantitative method for measuring the accuracy of a pollster, based on their past predictions."

It took a couple readings to understand what's being asked: the average error per pollster across all states using their most recent polls.

- the error of a predicted edge is the absolute value of the difference between the predicted edge and actual edge:
  - absolute value of ('dem_edge' from `recent_polls` - 'dem_edge' from `results_2008`)
  - let's call this 'abs_error'
- computing an average error, be sure to divide by the number of states in which a pollster made a prediction, not by the total number of states
  - $\frac{\sum(\text{abs_error})\text{ by Pollster}}{\text{count of Pollster predictions}}$
  - let's call this 'avg_error'
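
The two definitions above can be sketched on three rows. The edges are taken from rows that appear in this notebook's outputs (ARG in AK and WY, Rasmussen in AK); `toy` and `avg_error` are illustrative names.

```python
import pandas as pd

# Three pollster/state rows with predicted and actual edges
toy = pd.DataFrame({
    'Pollster': ['ARG', 'ARG', 'Rasmussen'],
    'State': ['AK', 'WY', 'AK'],
    'predicted_edge': [-16, -38, -16],
    'actual_edge': [-21.5, -32.0, -21.5],
})

# Error per prediction: absolute difference between predicted and actual edge
toy['abs_error'] = (toy['predicted_edge'] - toy['actual_edge']).abs()

# Average per pollster; mean() divides by the number of rows in each group,
# i.e. the number of states the pollster actually polled
avg_error = toy.groupby('Pollster')['abs_error'].mean()
print(avg_error)
```

For ARG this is (5.5 + 6.0) / 2 = 5.75, and for Rasmussen it is 5.5 / 1 = 5.5, so the hint about dividing by the number of predictions is handled by the grouping itself.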

I'm not sure why the first function in the `election.py` file is called `average_error`. The first thing I need to calculate is the error between predicted and actual results; only then can I group by Pollster to get the average. An average is always taken over a specific grouping (by Pollster, by State, across all results, etc.), so the grouping has to be chosen before the average is calculated, not after.

To get this information I need the following:

- 'State', 'Pollster', and 'dem_edge' columns from the `recent_polls` dataframe
- 'State' and 'dem_edge' from the `results_2008` dataframe
- These dataframes need to be merged using the 'State' column
- Error between predicted and actual results

Note that the 'dem_edge' column has the same name in both dataframes. The `merge` function adds a suffix to each so they can be told apart (`_x` and `_y` by default), but I prefer more descriptive names, so I'll rename just these columns before the join.

In [34]:
recent_polls.merge(results_2008, on='State')

Unnamed: 0,State,Dem_x,Rep_x,Date,Pollster,dem_edge_x,date,Dem_y,Rep_y,dem_edge_y
0,AK,37,55,Sep 21 2008,FairleighDickinsonU,-18,2008-09-21,37.9,59.4,-21.5
1,AK,39,55,Sep 11 2008,ARG,-16,2008-09-11,37.9,59.4,-21.5
2,AK,41,57,Oct 28 2008,Rasmussen,-16,2008-10-28,37.9,59.4,-21.5
3,AK,42,53,Oct 19 2008,IvanMooreResearch,-11,2008-10-19,37.9,59.4,-21.5
4,AK,43,48,Feb 28 2008,SurveyUSA,-5,2008-02-28,37.9,59.4,-21.5
...,...,...,...,...,...,...,...,...,...,...
369,WV,50,42,Oct 08 2008,ARG,8,2008-10-08,42.6,55.7,-13.1
370,WY,28,66,Sep 11 2008,ARG,-38,2008-09-11,32.4,64.4,-32.0
371,WY,32,58,Oct 14 2008,MasonDixon,-26,2008-10-14,32.4,64.4,-32.0
372,WY,37,58,Oct 19 2008,SurveyUSA,-21,2008-10-19,32.4,64.4,-32.0
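
The `_x` and `_y` suffixes in the output above are `merge` defaults. Another way to get more descriptive names, instead of renaming before the join, is the `suffixes` parameter. A minimal sketch on made-up two-row frames (`polls`, `results`, and `labeled` are illustrative names):

```python
import pandas as pd

# Two small frames that share the 'State' key and a 'dem_edge' column
polls = pd.DataFrame({'State': ['AK', 'WY'],
                      'Pollster': ['ARG', 'ARG'],
                      'dem_edge': [-16, -38]})
results = pd.DataFrame({'State': ['AK', 'WY'],
                        'dem_edge': [-21.5, -32.0]})

# suffixes replaces the default ('_x', '_y') for overlapping non-key columns
labeled = polls.merge(results, on='State', suffixes=('_predicted', '_actual'))
print(labeled.columns.tolist())
```

This keeps the original column name as a prefix, so renaming beforehand is still the better fit when entirely new names such as 'predicted_edge' are wanted.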


In [39]:
# Reduce each dataframe to just the columns needed and give 'dem_edge' a descriptive name
predicted = recent_polls[['State', 'Pollster', 'dem_edge']].rename(columns={'dem_edge': 'predicted_edge'})
actual = results_2008[['State', 'dem_edge']].rename(columns={'dem_edge': 'actual_edge'})

# Merge the two dataframes on their only shared column
merged_2008 = predicted.merge(actual, on='State')
merged_2008

Unnamed: 0,State,Pollster,predicted_edge,actual_edge
0,AK,FairleighDickinsonU,-18,-21.5
1,AK,ARG,-16,-21.5
2,AK,Rasmussen,-16,-21.5
3,AK,IvanMooreResearch,-11,-21.5
4,AK,SurveyUSA,-5,-21.5
...,...,...,...,...
369,WV,ARG,8,-13.1
370,WY,ARG,-38,-32.0
371,WY,MasonDixon,-26,-32.0
372,WY,SurveyUSA,-21,-32.0


In [41]:
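# Create an absolute error column
# (the 'error' column in the output below is not created in this cell;
# it appears to be left over from an earlier run)

merged_2008['abs_error'] = abs(merged_2008['predicted_edge'] - merged_2008['actual_edge'])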
merged_2008

Unnamed: 0,State,Pollster,predicted_edge,actual_edge,error,abs_error
0,AK,FairleighDickinsonU,-18,-21.5,3.5,3.5
1,AK,ARG,-16,-21.5,5.5,5.5
2,AK,Rasmussen,-16,-21.5,5.5,5.5
3,AK,IvanMooreResearch,-11,-21.5,10.5,10.5
4,AK,SurveyUSA,-5,-21.5,16.5,16.5
...,...,...,...,...,...,...
369,WV,ARG,8,-13.1,21.1,21.1
370,WY,ARG,-38,-32.0,6.0,6.0
371,WY,MasonDixon,-26,-32.0,6.0,6.0
372,WY,SurveyUSA,-21,-32.0,11.0,11.0


In [45]:
# Check our state count
merged_2008['State'].nunique()

51

In [46]:
# Get average by Pollster
pollster_weights = pd.DataFrame(merged_2008.groupby('Pollster')['abs_error'].mean()).reset_index()
pollster_weights

Unnamed: 0,Pollster,abs_error
0,ABCNews,2.350000
1,ARG,8.325490
2,ArizonaStateU,6.500000
3,BrownU,14.900000
4,CapitalSurvey,1.600000
...,...,...
96,WestVirginiaWesleyanU,8.100000
97,WinthropU,5.666667
98,WisconsinPolicyResInst,7.900000
99,Zimmerman,6.500000


## Problem 5: Pivot a nested dictionary

"Recall that in Problem 3, we implemented a function that produces a pollster prediction (a nested dictionary from pollster to state to edge). This nesting was ideal when implementing pollster_errors, however future problems will require the opposite nesting. Implement and test the function pivot_nested_dict, which converts pollster predictions to state predictions."

\* Since I used the dataframe structure, this translation/reversal between keys and values is unnecessary.
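
If a pollster-by-state or state-by-pollster view is ever wanted, a dataframe can produce either one from the same long table. A minimal sketch on made-up rows (`toy`, `by_pollster`, and `by_state` are illustrative names):

```python
import pandas as pd

# Long format: one row per pollster/state prediction
toy = pd.DataFrame({
    'Pollster': ['ARG', 'ARG', 'Rasmussen'],
    'State': ['AK', 'WY', 'AK'],
    'predicted_edge': [-16, -38, -16],
})

# Pollster -> State -> edge (rows are pollsters, columns are states)
by_pollster = toy.pivot(index='Pollster', columns='State', values='predicted_edge')

# State -> Pollster -> edge is simply the transpose
by_state = by_pollster.T

print(by_state)
```

Combinations that were never polled show up as NaN, which plays the role of a missing key in the nested dictionary.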

## Problem 6: Average the edges in a single state

"So far, we have focused on representing election and polling data in Python, and analyzing the accuracy of predictions from past elections. Now, we look to the future!

definition of weighted average

After implementing and testing weighted_average, use it to implement average_edge. We provide the function pollster_to_weight, which you should use to compute the weight of a pollster, based on their average error."

This problem is asking me to convert 'abs_error' into a weight that can be applied to a pollster's prediction. Based on the code in `election.py`, this conversion is:

- 'abs_error'**(-2)

If a pollster hasn't been seen before, the code specifies `DEFAULT_AVERAGE_ERROR = 5.0` (in my code this becomes a default 'abs_error' of 5.0). This means I need to pull in the 2012 data and set the 'abs_error' column before I calculate the weights.
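
The conversion can be sketched in isolation. The two known errors below are values that appear in this notebook's outputs; `DEFAULT_ABS_ERROR`, `known_errors`, and `pollsters` are illustrative names, and 'PPP' stands in for a pollster with no 2008 history.

```python
import pandas as pd

DEFAULT_ABS_ERROR = 5.0  # same value as DEFAULT_AVERAGE_ERROR quoted above

# Average 2008 errors for pollsters that have a track record
known_errors = pd.Series({'CapitalSurvey': 1.6, 'ArizonaStateU': 6.5})

# Pollsters to weight; 'PPP' has no entry in known_errors
pollsters = pd.Series(['CapitalSurvey', 'PPP', 'ArizonaStateU'])

# Look up each pollster's error, falling back to the default when unknown
abs_error = pollsters.map(known_errors).fillna(DEFAULT_ABS_ERROR)

# Weight = error ** -2, so a smaller past error gives a much larger weight
weights = abs_error ** -2
print(weights)
```

Because of the inverse square, an error of 1.6 yields a weight of about 0.39, nearly ten times the 0.04 given to a pollster at the default error of 5.0.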

In [49]:
# Out of curiosity, here is the actual mean of the 2008 abs_errors, both from the full dataset and after grouping by pollster
print('Absolute error mean:', merged_2008['abs_error'].mean())
print('Absolute error mean by pollster:', pollster_weights['abs_error'].mean())
# To match the course code, I'll use the specified default of 5.0

Absolute error mean: 5.110695187165775
Absolute error mean by pollster: 5.102763975791845


In [54]:
# To make processing easier, I turned the steps above into a function

# Reads a polls file, calculates the edge, and keeps only the most recent poll per pollster per state

def poll_processing(filepath):
    df = pd.read_csv(filepath)
    df['predicted_edge'] = df['Dem'] - df['Rep']
    df['date'] = pd.to_datetime(df['Date'])
    most_recent = df.groupby(['State', 'Pollster'])['date'].max().reset_index()
    recent_polls = df.merge(most_recent)
    
    return recent_polls

In [68]:
polls_2012 = poll_processing('homework3/data/2012-polls.csv')
polls_2012

Unnamed: 0,State,Dem,Rep,Date,Pollster,predicted_edge,date
0,AK,37,59,Nov 04 2008,2008Election,-22,2008-11-04
1,AL,36,54,Aug 16 2012,CapitalSurvey,-18,2012-08-16
2,AR,31,58,Oct 14 2012,UofArkansas,-27,2012-10-14
3,AR,35,56,Sep 17 2012,HendrixColl,-21,2012-09-17
4,AZ,40,42,Apr 20 2012,ArizonaStateU,-2,2012-04-20
...,...,...,...,...,...,...,...
261,WI,51,43,Oct 28 2012,MarquetteLawSchool,8,2012-10-28
262,WI,51,46,Oct 20 2012,Angus,5,2012-10-20
263,WI,51,48,Nov 03 2012,PPP,3,2012-11-03
264,WV,38,52,Aug 25 2012,RLRepass,-14,2012-08-25


In [60]:
# Examine which pollsters are new and will need a default value

# Weird findings:
# Pollster '2008Election' is in the 2012 data
# FoxNews doesn't have any polls in 2008
# ABCNews doesn't have any polls in 2012

pollsters_2012 = set(polls_2012['Pollster'])
pollsters_2008 = set(polls_2008['Pollster'])
new_pollsters = pollsters_2012 - pollsters_2008
new_pollsters

{'2008Election',
 'AbtSRBI',
 'Angus',
 'AtlantaJournal',
 'CallFire',
 'CastletonStateColl',
 'ChilenskiStrategies',
 'DartmouthCollege',
 'DennoResearch',
 'EssmanResearch',
 'FlemingandAssocs',
 'FoxNews',
 'GlengariffGroup',
 'GlobalStrategy',
 'HendrixColl',
 'HighPointUniversity',
 'Howey',
 'IPSOS',
 'Jayhawk',
 'KeyResearch',
 'LosAngelesTimes',
 'MainePeopleResCtr',
 'MainePeoplesResCtr',
 'MarquetteLawSchool',
 'MassINC',
 'MercyhurstUniversity',
 'MerrimanRiverGroup',
 'MitchellResearch',
 'MooreConsulting',
 'NewEnglandCollege',
 'NielsonBros',
 'ORCInternational',
 'OldDominionU',
 'OldDominionUniv',
 'OpinionDynamics',
 'OpinionWorks',
 'PPP',
 'PepperdineU',
 'PharosResearchGroup',
 'ProjectNewAmerica',
 'PulseOpinionResearch',
 'PurpleStrategies',
 'RLRepass',
 'RichardStocktonCollege',
 'Rutgers',
 'SouthernIllinoisU',
 'SouthernMediaOpinionResearch',
 'TheWashingtonPoll',
 'TulchinResearch',
 'UofMass',
 'UofNorthFlorida',
 'VoterSurveyService',
 'WesternNewEnglandU',

In [61]:
# Pollsters that didn't return in 2012
pollsters_2008 - pollsters_2012

{'ABCNews',
 'ChristopherNewportU',
 'CiruliAssoc',
 'DFMResearch',
 'DakotaWesleyanU',
 'DartmouthColl',
 'Datamar',
 'DavisHibbittsMidghall',
 'FinancialDynamics',
 'GfKRoper',
 'GonzalesRes',
 'GregSmith',
 'HoweyGauge',
 'IPFWU',
 'IndLegisInsight',
 'IvanMooreResearch',
 'LATimes',
 'LoyolaU',
 'MarkBlankenship',
 'MarshallMarketing',
 'MinnesotaStateU',
 'MontanaStateU',
 'NYTimes',
 'NorthernArizonaU',
 'OhioU',
 'OpinionFactor',
 'OpinionRes',
 'OpinionResearch',
 'PollingCompany',
 'PubPolicyInstofCalif',
 'PublicOpinionStrat',
 'PulsarResearch',
 'RhodeIslandColl',
 'RileyRes',
 'RileyResearch',
 'SaintAnselmColl',
 'SchrothPollingCo',
 'SoutheasternLaU',
 'StCloudStateU',
 'SusquehannaPolling',
 'TVPoll',
 'TelOpinionResearch',
 'TempleU',
 'USAPollingGroup',
 'UofAkron',
 'UofMinnesota',
 'UofSouthAlabama',
 'UofWisconsin',
 'VirgCommonwealthU',
 'WestChesterU',
 'WestVirginiaWesleyanU',
 'WinthropU',
 'WisconsinPolicyResInst',
 'Zimmerman'}

In [69]:
# Bring in the 'abs_error' for our known pollsters and set new pollsters to a default 'abs_error' of 5.0
# Calculate the weight for all 2012 pollsters
polls_2012 = polls_2012.merge(pollster_weights[['Pollster', 'abs_error']], on='Pollster', how='left')
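# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in newer pandas versions
polls_2012['abs_error'] = polls_2012['abs_error'].fillna(5.0)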
polls_2012['weights'] = polls_2012['abs_error']**(-2)
polls_2012

Unnamed: 0,State,Dem,Rep,Date,Pollster,predicted_edge,date,abs_error,weights
0,AK,37,59,Nov 04 2008,2008Election,-22,2008-11-04,5.0,0.040000
1,AL,36,54,Aug 16 2012,CapitalSurvey,-18,2012-08-16,1.6,0.390625
2,AR,31,58,Oct 14 2012,UofArkansas,-27,2012-10-14,4.8,0.043403
3,AR,35,56,Sep 17 2012,HendrixColl,-21,2012-09-17,5.0,0.040000
4,AZ,40,42,Apr 20 2012,ArizonaStateU,-2,2012-04-20,6.5,0.023669
...,...,...,...,...,...,...,...,...,...
261,WI,51,43,Oct 28 2012,MarquetteLawSchool,8,2012-10-28,5.0,0.040000
262,WI,51,46,Oct 20 2012,Angus,5,2012-10-20,5.0,0.040000
263,WI,51,48,Nov 03 2012,PPP,3,2012-11-03,5.0,0.040000
264,WV,38,52,Aug 25 2012,RLRepass,-14,2012-08-25,5.0,0.040000


## Problem 7: Predict the 2012 election

"Finally, implement the function predict_state_edges, which predicts the result of the 2012 election in each state. Make use of the average_edge function from Problem 6.

Once your implementation predict_state_edges passes our tests, run election.py as a Python program to predict the outcome of the 2012 election! Your result should match the actual outcome of the 2012 election:

Dem 332.0
Rep 206.0

More details about the actual election appear in file data/2012-results.csv and in the Wikipedia article on the 2012 election."

Since I already pulled in the 2012 data and set the weights, I need five additional steps for this problem:

- Calculate the weighted predictions ('predicted_edge' x 'weights' for each row)
- Calculate the mean weighted prediction for each state
- Read in and join the Electoral College information
- Assign each state's electoral votes based on the sign of its prediction
- Sum the results for a final count

In [71]:
# Calculate the weighted prediction for each row
polls_2012['weighted_prediction'] = polls_2012['weights'] * polls_2012['predicted_edge']
polls_2012

Unnamed: 0,State,Dem,Rep,Date,Pollster,predicted_edge,date,abs_error,weights,weighted_prediction
0,AK,37,59,Nov 04 2008,2008Election,-22,2008-11-04,5.0,0.040000,-0.880000
1,AL,36,54,Aug 16 2012,CapitalSurvey,-18,2012-08-16,1.6,0.390625,-7.031250
2,AR,31,58,Oct 14 2012,UofArkansas,-27,2012-10-14,4.8,0.043403,-1.171875
3,AR,35,56,Sep 17 2012,HendrixColl,-21,2012-09-17,5.0,0.040000,-0.840000
4,AZ,40,42,Apr 20 2012,ArizonaStateU,-2,2012-04-20,6.5,0.023669,-0.047337
...,...,...,...,...,...,...,...,...,...,...
261,WI,51,43,Oct 28 2012,MarquetteLawSchool,8,2012-10-28,5.0,0.040000,0.320000
262,WI,51,46,Oct 20 2012,Angus,5,2012-10-20,5.0,0.040000,0.200000
263,WI,51,48,Nov 03 2012,PPP,3,2012-11-03,5.0,0.040000,0.120000
264,WV,38,52,Aug 25 2012,RLRepass,-14,2012-08-25,5.0,0.040000,-0.560000


In [77]:
poll_predictions = polls_2012.groupby('State')['weighted_prediction'].mean().reset_index()
poll_predictions

Unnamed: 0,State,weighted_prediction
0,AK,-0.88
1,AL,-7.03125
2,AR,-1.005937
3,AZ,-0.188513
4,CA,3.20339
5,CO,0.070716
6,CT,0.94522
7,DC,3.2
8,DE,1.0
9,FL,0.040542
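
One caveat about the `weighted_prediction` values above: they are the mean of the `weights x predicted_edge` products, which is not the same thing as a weighted average edge (the sum of the products divided by the sum of the weights). Because every weight is positive, both quantities always have the same sign, so the predicted winner of each state is unaffected. A sketch of the difference, using the two Arkansas rows from the output above (`toy`, `sums`, `mean_of_products`, and `weighted_avg_edge` are illustrative names):

```python
import pandas as pd

# The two Arkansas polls: edges of -27 and -21 with abs_errors of 4.8 and 5.0
toy = pd.DataFrame({
    'State': ['AR', 'AR'],
    'predicted_edge': [-27, -21],
    'weights': [4.8 ** -2, 5.0 ** -2],
})
toy['weighted_prediction'] = toy['weights'] * toy['predicted_edge']

# What the notebook computes: the mean of the products
mean_of_products = toy.groupby('State')['weighted_prediction'].mean()

# A conventional weighted average: sum of products divided by sum of weights
sums = toy.groupby('State')[['weighted_prediction', 'weights']].sum()
weighted_avg_edge = sums['weighted_prediction'] / sums['weights']

print(mean_of_products['AR'])   # about -1.006, matching the AR row above
print(weighted_avg_edge['AR'])  # between -27 and -21, in percentage points
```

The normalized version is the one to use if the size of the predicted edge matters, not just its sign.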


In [83]:
electors = pd.read_csv('homework3/data/2012-electoral-college.csv', usecols=['State', 'Electors'])
electors.head()

Unnamed: 0,State,Electors
0,AK,3
1,AL,9
2,AR,6
3,AZ,11
4,CA,55


In [84]:
poll_predictions = poll_predictions.merge(electors, on='State', how='left')

In [85]:
poll_predictions.head()

Unnamed: 0,State,weighted_prediction,Electors
0,AK,-0.88,3
1,AL,-7.03125,9
2,AR,-1.005937,6
3,AZ,-0.188513,11
4,CA,3.20339,55


In [89]:
edge = poll_predictions['weighted_prediction']
votes = poll_predictions['Electors']

# A positive edge gives the state to the Democrats, a negative edge to the Republicans;
# an exact tie splits the state's electors evenly
poll_predictions['dem_electoral_votes'] = np.where(edge > 0, votes,
                                          np.where(edge == 0, votes / 2, 0))

poll_predictions['rep_electoral_votes'] = np.where(edge < 0, votes,
                                          np.where(edge == 0, votes / 2, 0))

poll_predictions

Unnamed: 0,State,weighted_prediction,Electors,dem_electoral_votes,rep_electoral_votes
0,AK,-0.88,3,0.0,3.0
1,AL,-7.03125,9,0.0,9.0
2,AR,-1.005937,6,0.0,6.0
3,AZ,-0.188513,11,0.0,11.0
4,CA,3.20339,55,55.0,0.0
5,CO,0.070716,9,9.0,0.0
6,CT,0.94522,7,7.0,0.0
7,DC,3.2,3,3.0,0.0
8,DE,1.0,3,3.0,0.0
9,FL,0.040542,29,29.0,0.0
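
The nested `np.where` calls can also be written with `np.select`, which takes a list of conditions and a matching list of values. This is an alternative sketch, not the code used above; the frame is made up, and 'XX' is a hypothetical tied state included only to exercise the split case.

```python
import numpy as np
import pandas as pd

# Made-up frame; 'XX' is a hypothetical state with an exact tie
toy = pd.DataFrame({
    'State': ['AK', 'CA', 'XX'],
    'weighted_prediction': [-0.88, 3.2, 0.0],
    'Electors': [3, 55, 10],
})
edge = toy['weighted_prediction']
votes = toy['Electors']

# Conditions are checked in order; rows matching none get the default of 0
toy['dem_electoral_votes'] = np.select([edge > 0, edge == 0], [votes, votes / 2], default=0)
toy['rep_electoral_votes'] = np.select([edge < 0, edge == 0], [votes, votes / 2], default=0)

print(toy)
```

Every elector ends up assigned to exactly one side (or split evenly), so the two new columns should always sum to the total number of electors.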


In [90]:
# Results match expected results
print('Dem:', poll_predictions['dem_electoral_votes'].sum())
print('Rep:', poll_predictions['rep_electoral_votes'].sum())

Dem: 332.0
Rep: 206.0
