# First Senate Predictions

After cleaning the data (in data-cleaning.ipynb), we now use it to train some models and make predictions. Initially, we will see how accurate we can get without state-level polls, because I think it will be interesting to see how well we can predict from state characteristics and the national mood alone (measured by the generic ballot and presidential popularity). Omitting state-level polls also gives us a useful baseline accuracy should we decide to add polling data later.

In [191]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.options.display.max_columns = None
%matplotlib inline
import datetime as dt

## Data

In [192]:
elections = pd.read_csv('combined_election_data.csv')

In [193]:
elections.head()

Unnamed: 0,state,year,partisan_score,old_score_avg,generic_ballot,pres_approval,dem_pres,pres_approval_int,unemployment_rate,unemp_rate_int,native_amer_perc,asian_perc,black_perc
0,ALABAMA,2008,-0.268418,-0.269701,0.1,-0.422,0,-0.0,0.057,0.0,0.007104,0.012499,0.266876
1,ALASKA,2008,0.012442,-0.030323,0.1,-0.422,0,-0.0,0.067,0.0,0.17074,0.06697,0.043748
2,COLORADO,2008,0.103035,-0.000784,0.1,-0.422,0,-0.0,0.048,0.0,0.017203,0.032577,0.047404
3,DELAWARE,2008,0.29373,0.29458,0.1,-0.422,0,-0.0,0.049,0.0,0.006676,0.033771,0.223802
4,GEORGIA,2008,-0.089013,-0.123685,0.1,-0.422,0,-0.0,0.062,0.0,0.005107,0.034483,0.311697


In [194]:
elections.shape

(218, 13)

In [195]:
elections.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   state              218 non-null    object 
 1   year               218 non-null    int64  
 2   partisan_score     218 non-null    float64
 3   old_score_avg      218 non-null    float64
 4   generic_ballot     218 non-null    float64
 5   pres_approval      218 non-null    float64
 6   dem_pres           218 non-null    int64  
 7   pres_approval_int  218 non-null    float64
 8   unemployment_rate  218 non-null    float64
 9   unemp_rate_int     218 non-null    float64
 10  native_amer_perc   218 non-null    float64
 11  asian_perc         218 non-null    float64
 12  black_perc         218 non-null    float64
dtypes: float64(10), int64(2), object(1)
memory usage: 22.3+ KB


## Dividing the Data into Train and Test Sets

We first divide the data into a training set with 80% of the rows, and a testing set with the remaining 20%. 

In [196]:
train = elections.sample(frac=.8, random_state=1)
test = elections.loc[~elections.index.isin(train.index)]

In [197]:
train.shape

(174, 13)

In [198]:
test.shape

(44, 13)
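
The split above works because `sample` keeps the original index labels, so the test set is simply every row whose label was not drawn for training. A minimal sketch on a synthetic frame (the column `value` and the helper `split_frame` are made up for illustration) checks that the two pieces are disjoint and together cover every row:

```python
import numpy as np
import pandas as pd


def split_frame(df, frac=0.8, seed=1):
    """Return (train, test): a random `frac` sample and the remaining rows."""
    train = df.sample(frac=frac, random_state=seed)
    # Rows whose index label was not sampled form the test set
    test = df.loc[~df.index.isin(train.index)].copy()
    return train, test


# Synthetic stand-in for the election data: 218 rows, one made-up column
demo = pd.DataFrame({"value": np.arange(218)})
demo_train, demo_test = split_frame(demo)

# No index label appears in both pieces, and no row is lost
overlap = demo_train.index.intersection(demo_test.index)
print(demo_train.shape, demo_test.shape, len(overlap))
```

With 218 rows and `frac=0.8`, this reproduces the 174/44 row counts shown above.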

In [199]:
X = train.copy()
X.drop(['state', 'year', 'partisan_score'], axis=1, inplace=True)
y = train['partisan_score']

## Linear Regression

In [200]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFECV

In [201]:
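def get_cross_val_score(model, features):
    # Mean negative MSE over 3 shuffled folds of the training data
    # (closer to zero is better); uses the training target y defined above
    kf = KFold(n_splits=3, shuffle=True, random_state=1)
    scores = cross_val_score(model, features, y, scoring='neg_mean_squared_error', cv=kf)
    return scores.mean()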

In [202]:
get_cross_val_score(LinearRegression(), X)

-0.0227495428875895

In [203]:
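def select_features(model):
    # Recursive feature elimination with 3-fold CV on the training data;
    # returns the names of the columns the selector keeps
    selector = RFECV(model, cv=3)
    selector.fit(X, y)
    features = X.columns[selector.support_]
    return features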

In [204]:
features = select_features(LinearRegression())

In [205]:
features

Index(['old_score_avg', 'generic_ballot', 'unemp_rate_int', 'native_amer_perc',
       'asian_perc'],
      dtype='object')

In [206]:
get_cross_val_score(LinearRegression(), X[features])

-0.02083462759778859
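
The cross-validation scores are negative mean squared errors, while the final evaluation later in the notebook reports RMSE. To put the two on the same scale, negate the score and take the square root. The sketch below does this for the two linear regression scores above. The helper name `neg_mse_to_rmse` is made up for illustration, and the result is the root of the fold-averaged MSE rather than an average of per-fold RMSEs.

```python
import math


def neg_mse_to_rmse(score):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    return math.sqrt(-score)


# Scores copied from the two linear regression cells above
all_features_rmse = neg_mse_to_rmse(-0.0227495428875895)
selected_rmse = neg_mse_to_rmse(-0.02083462759778859)

print(round(all_features_rmse, 3), round(selected_rmse, 3))
```

Both values come out near 0.15 and 0.14 respectively, which is the scale to keep in mind when reading the test-set RMSEs.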

## K-Nearest Neighbors

In [207]:
from sklearn.neighbors import KNeighborsRegressor

In [208]:
kn = KNeighborsRegressor()
get_cross_val_score(kn, X)

-0.03177287676624391

In [209]:
for n in range(1, 20):
    print(n, get_cross_val_score(KNeighborsRegressor(n_neighbors=n), X))

1 -0.05277729784219224
2 -0.038100586735982316
3 -0.03507304593827619
4 -0.032326237553448066
5 -0.03177287676624391
6 -0.031026628506679855
7 -0.029778201489366874
8 -0.029178876282217426
9 -0.02920521422398846
10 -0.02886810285636729
11 -0.028741347796891387
12 -0.029148913603975393
13 -0.029224521829889234
14 -0.02936731832233667
15 -0.029788214935260366
16 -0.02934784632135613
17 -0.02959674436036423
18 -0.02933331365676864
19 -0.029925307013890817


In [210]:
get_cross_val_score(KNeighborsRegressor(n_neighbors=10), X)

-0.02886810285636729
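
K-nearest neighbors is distance-based, so columns with larger numeric ranges dominate the neighbor search. The features here sit on different scales (for example `pres_approval` versus `native_amer_perc`), so standardizing them first may be worth trying. The sketch below uses synthetic data and made-up names to illustrate the idea; it is not a result from this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features on very different scales (assumption, not the real data)
X_demo = np.column_stack([
    rng.normal(0, 0.3, 174),    # wide-ranging feature, pure noise here
    rng.normal(0, 0.01, 174),   # narrow-ranging feature that drives the target
])
y_demo = 20 * X_demo[:, 1] + rng.normal(0, 0.02, 174)

kf = KFold(n_splits=3, shuffle=True, random_state=1)
raw_knn = KNeighborsRegressor(n_neighbors=10)
# The pipeline refits the scaler inside each fold, so no test-fold
# statistics leak into training
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))

raw_score = cross_val_score(
    raw_knn, X_demo, y_demo, scoring="neg_mean_squared_error", cv=kf
).mean()
scaled_score = cross_val_score(
    scaled_knn, X_demo, y_demo, scoring="neg_mean_squared_error", cv=kf
).mean()

print(raw_score, scaled_score)
```

Whether scaling helps on the real election data would need to be checked with `get_cross_val_score`.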

## Lasso Regression

In [211]:
from sklearn.linear_model import Lasso

In [212]:
lasso = Lasso(random_state=1)
get_cross_val_score(lasso, X)

-0.058043802096115815
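
One possible reason for the weak Lasso score is the default penalty `alpha=1.0`, which is large relative to a target that lies roughly between -0.3 and 0.3 and can shrink every coefficient to zero. This is an assumption, not something checked in this notebook. The sketch below uses synthetic data (all names are made up) to show the effect and how `LassoCV` can choose a smaller penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(1)

# Synthetic small-scale features and target (assumption, not the real data)
X_demo = rng.uniform(0, 0.3, size=(174, 3))
y_demo = 1.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(0, 0.01, 174)

# Default penalty: far larger than the feature/target covariances here
default_lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# LassoCV searches a grid of smaller penalties by cross-validation
tuned_lasso = LassoCV(cv=3, random_state=1).fit(X_demo, y_demo)

print("default coefficients:", default_lasso.coef_)
print("chosen alpha:", tuned_lasso.alpha_)
print("tuned coefficients:", tuned_lasso.coef_)
```

If the same thing happens on the real data, the default Lasso is effectively predicting the training mean, which would explain a score close to the variance of `partisan_score`.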

## Random Forest

In [213]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [214]:
rf_features = select_features(RandomForestRegressor(random_state=1))

In [215]:
rf_features

Index(['old_score_avg', 'generic_ballot', 'unemployment_rate',
       'native_amer_perc', 'asian_perc', 'black_perc'],
      dtype='object')

In [216]:
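# Note: the search is fit on all columns of X and the forest is unseeded,
# so the selected parameters can vary between runs
param_grid = {
    "n_estimators": [4, 6, 9],
    "max_depth": [2, 5, 10],
    "max_features": [.5, .8, 1],  # floats are fractions; the int 1 means one feature per split
    "min_samples_leaf": [1, 5, 8],
    "min_samples_split": [2, 3, 5],
}
searcher = GridSearchCV(RandomForestRegressor(), param_grid=param_grid, cv=4)
searcher.fit(X, y)
rf_best_estimator = searcher.best_estimator_
rf_best_params = searcher.best_params_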

In [217]:
rf_best_estimator

RandomForestRegressor(max_depth=5, max_features=0.8, min_samples_leaf=8,
                      n_estimators=4)

In [218]:
rf_best_params

{'max_depth': 5,
 'max_features': 0.8,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'n_estimators': 4}

In [219]:
get_cross_val_score(rf_best_estimator, X[rf_features])

-0.02600919170948179

## Final Predictions

In [220]:
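def get_final_predictions(model, features):
    # Fit on the training rows, then report RMSE on the held-out test rows
    model.fit(features, y)
    predictions = model.predict(test[features.columns])
    # np.sqrt of the MSE works across scikit-learn versions; the
    # squared=False argument is deprecated and removed in newer releases
    rmse = np.sqrt(mean_squared_error(test['partisan_score'], predictions))
    return rmse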

In [221]:
get_final_predictions(LinearRegression(), X)

0.12120906939944055

In [222]:
get_final_predictions(KNeighborsRegressor(n_neighbors=10), X)

0.12118281538185044

In [223]:
get_final_predictions(rf_best_estimator, X[rf_features])

0.12740429488482205

In [224]:
get_final_predictions(LinearRegression(), X[features])

0.11840618718751311

In [225]:
# predictions is local to get_final_predictions, so refit here (assumed final model:
# the selected-feature linear regression); assign avoids SettingWithCopyWarning
test = test.assign(predict_vals=LinearRegression().fit(X[features], y).predict(test[features]))


In [226]:
test[['state','year','partisan_score', 'predict_vals']]

Unnamed: 0,state,year,partisan_score,predict_vals
1,ALASKA,2008,0.012442,0.007044
7,IOWA,2008,0.253945,-0.041923
8,KANSAS,2008,-0.235999,-0.234311
20,NEW MEXICO,2008,0.22656,0.078714
22,OKLAHOMA,2008,-0.17496,-0.081354
25,SOUTH CAROLINA,2008,-0.152778,0.001694
30,WEST VIRGINIA,2008,0.27471,0.373667
37,CONNECTICUT,2010,0.092673,0.118947
50,NEW HAMPSHIRE,2010,-0.232396,-0.327053
57,PENNSYLVANIA,2010,-0.02017,-0.130874


In [227]:
abs(test['predict_vals'] - test['partisan_score']).mean()

0.09161619188634354

Above, we see that our final model was off by around 9 percentage points on average (a mean absolute error of about 0.092 in partisan score). Clearly, our model could be improved.

## Comparison with Baseline Estimate

Here, we compare our model with a baseline estimate: the old score average plus the generic ballot.

In [228]:
compare_df = elections.copy()

In [229]:
compare_df['baseline_estimate'] = compare_df['old_score_avg'] + compare_df['generic_ballot']

In [230]:
abs(compare_df['baseline_estimate'] - compare_df['partisan_score']).mean()

0.10451792391650988

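Our baseline estimate was off by around 10.5 percentage points on average, so our linear regression model (about 9.2 points) represents a slight improvement. Note that the baseline error is computed over all 218 rows, while the model's error is measured on the 44 test rows only, so the two figures are not strictly comparable.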
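
To compare the two on equal footing, the baseline can be scored on the same held-out rows as the model. The sketch below shows the computation on a small synthetic frame; the numbers are made up for illustration, and only the column names follow the notebook.

```python
import pandas as pd

# Made-up held-out rows; column names follow the notebook
demo_test = pd.DataFrame({
    "partisan_score": [0.10, -0.20, 0.05, 0.30],
    "old_score_avg":  [0.05, -0.25, 0.00, 0.20],
    "generic_ballot": [0.10, 0.10, -0.05, -0.05],
    "predict_vals":   [0.12, -0.15, 0.02, 0.22],
})

# Baseline: old score average shifted by the generic ballot
baseline = demo_test["old_score_avg"] + demo_test["generic_ballot"]

# Mean absolute error of each estimate on the same rows
baseline_mae = (baseline - demo_test["partisan_score"]).abs().mean()
model_mae = (demo_test["predict_vals"] - demo_test["partisan_score"]).abs().mean()

print(f"baseline MAE: {baseline_mae:.4f}, model MAE: {model_mae:.4f}")
```

In the notebook itself, the same two lines applied to `test` instead of `demo_test` would give the like-for-like figures.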