# First Senate Predictions

After cleaning the data (in data-cleaning.ipynb), we will now use it to train some models and make predictions. Initially, we will see how accurate we can get without state-level polls, as I think it will be interesting to see how well we can predict based only on state characteristics and the national mood (measured by the generic ballot and presidential popularity). Omitting state-level polls also gives us a useful baseline accuracy should we decide to add more polling data later.

In [170]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.options.display.max_columns = None
%matplotlib inline
import datetime as dt

## Data

In [171]:
elections = pd.read_csv('combined_election_data.csv')

In [172]:
elections.head()

Unnamed: 0,state,year,partisan_score,old_score_avg,generic_ballot,pres_approval,dem_pres,pres_approval_int,unemployment_rate,unemp_rate_int,native_amer_perc,asian_perc,black_perc
0,ALABAMA,2008,-0.268418,-0.269701,0.1,-0.422,0,-0.0,0.057,0.0,0.007104,0.012499,0.266876
1,ALASKA,2008,0.012442,-0.030323,0.1,-0.422,0,-0.0,0.067,0.0,0.17074,0.06697,0.043748
2,COLORADO,2008,0.103035,-0.000784,0.1,-0.422,0,-0.0,0.048,0.0,0.017203,0.032577,0.047404
3,DELAWARE,2008,0.29373,0.29458,0.1,-0.422,0,-0.0,0.049,0.0,0.006676,0.033771,0.223802
4,GEORGIA,2008,-0.089013,-0.123685,0.1,-0.422,0,-0.0,0.062,0.0,0.005107,0.034483,0.311697


In [173]:
elections.shape

(218, 13)

In [174]:
elections.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   state              218 non-null    object 
 1   year               218 non-null    int64  
 2   partisan_score     218 non-null    float64
 3   old_score_avg      218 non-null    float64
 4   generic_ballot     218 non-null    float64
 5   pres_approval      218 non-null    float64
 6   dem_pres           218 non-null    int64  
 7   pres_approval_int  218 non-null    float64
 8   unemployment_rate  218 non-null    float64
 9   unemp_rate_int     218 non-null    float64
 10  native_amer_perc   218 non-null    float64
 11  asian_perc         218 non-null    float64
 12  black_perc         218 non-null    float64
dtypes: float64(10), int64(2), object(1)
memory usage: 22.3+ KB


## Dividing the Data into Train and Test Sets

We first divide the data into a training set with 80% of the rows, and a testing set with the remaining 20%. 

In [175]:
train = elections.sample(frac=.8, random_state=1)
test = elections.loc[~elections.index.isin(train.index)]

In [176]:
train.shape

(174, 13)

In [177]:
test.shape

(44, 13)
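
As a rough cross-check of the split above, the sketch below repeats the same index-based approach on a synthetic 218-row table (the values are made up, and the `toy_*` and `alt_*` names are hypothetical) and compares it with scikit-learn's `train_test_split`. Both yield 174 training rows and 44 test rows, although they will generally not pick the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the elections table: 218 rows of made-up values.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "partisan_score": rng.normal(0, 0.2, size=218),
    "old_score_avg": rng.normal(0, 0.2, size=218),
})

# Same approach as above: sample 80% of the rows, keep the rest for testing.
toy_train = toy.sample(frac=.8, random_state=1)
toy_test = toy.loc[~toy.index.isin(toy_train.index)]

# Alternative: scikit-learn's helper does a comparable shuffle-and-split.
alt_train, alt_test = train_test_split(toy, test_size=0.2, random_state=1)

print(toy_train.shape, toy_test.shape)
print(alt_train.shape, alt_test.shape)
```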

In [178]:
X = train.copy()
X.drop(['state', 'year', 'partisan_score'], axis=1, inplace=True)
y = train['partisan_score']

## Linear Regression

In [179]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFECV

In [180]:
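def get_cross_val_score(model, features):
    # 3-fold cross-validation on the training data; relies on the global target y.
    kf = KFold(n_splits=3, shuffle=True, random_state=1)
    scores = cross_val_score(model, features, y, scoring='neg_mean_squared_error', cv=kf)
    # Mean negative MSE across folds: values closer to 0 are better.
    return scores.mean()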

In [181]:
get_cross_val_score(LinearRegression(), X)

-0.0227495428875895
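
The score above is a negative mean squared error, so it is easier to read after converting it to a root mean squared error in the same units as `partisan_score`. A minimal sketch of that arithmetic follows (the variable names are illustrative). Note that the square root of the mean fold MSE is not identical to the mean of the per-fold RMSEs, so treat the result as approximate.

```python
import math

# Value returned by get_cross_val_score(LinearRegression(), X) above.
neg_mse = -0.0227495428875895

# Flip the sign to recover the MSE, then take the square root for the RMSE.
rmse = math.sqrt(-neg_mse)
print(f"approximate cross-validated RMSE: {rmse:.3f}")
```

This works out to roughly 0.15, or about 15 percentage points if `partisan_score` is a vote-share margin expressed as a fraction.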

In [182]:
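def select_features(model):
    # Recursive feature elimination with 3-fold CV; uses the estimator's default score (R^2 for regressors).
    selector = RFECV(model, cv=3)
    selector.fit(X, y)
    features = X.columns[selector.support_]
    return features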

In [188]:
features = select_features(LinearRegression())

In [189]:
features

Index(['old_score_avg', 'generic_ballot', 'unemp_rate_int', 'native_amer_perc',
       'asian_perc'],
      dtype='object')

In [186]:
# Pass an estimator instance, not the LinearRegression class, to select_features.
get_cross_val_score(LinearRegression(), X[select_features(LinearRegression())])

## K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
kn = KNeighborsRegressor()
get_cross_val_score(kn, X)

In [None]:
for n in range(1, 20):
    print(n, get_cross_val_score(KNeighborsRegressor(n_neighbors=n), X))

In [None]:
get_cross_val_score(KNeighborsRegressor(n_neighbors=10), X)

## Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Lasso(random_state=1)
get_cross_val_score(lasso, X)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [None]:
rf_features = select_features(RandomForestRegressor(random_state=1))

In [None]:
rf_features

In [None]:
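# random_state=1 is added here so that the search is reproducible.
searcher = GridSearchCV(RandomForestRegressor(random_state=1), cv=4, param_grid={
    "n_estimators": [4, 6, 9],
    "max_depth": [2, 5, 10],
    "max_features": [.5, .8, 1],  # floats are fractions of the features; the int 1 means a single feature
    "min_samples_leaf": [1, 5, 8],
    "min_samples_split": [2, 3, 5]})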
searcher.fit(X, y)
rf_best_estimator = searcher.best_estimator_
rf_best_params = searcher.best_params_

In [None]:
rf_best_estimator

In [None]:
rf_best_params

In [None]:
get_cross_val_score(rf_best_estimator, X[rf_features])

## Final Predictions

In [None]:
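def get_final_predictions(model, features):
    # Fit on the training data, then score on the held-out test set.
    model.fit(features, y)
    predictions = model.predict(test[features.columns])
    # np.sqrt of the MSE gives the RMSE without the `squared` argument, which newer scikit-learn releases deprecate.
    rmse = np.sqrt(mean_squared_error(test['partisan_score'], predictions))
    return rmse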

In [None]:
get_final_predictions(LinearRegression(), X)

In [None]:
get_final_predictions(KNeighborsRegressor(n_neighbors=10), X)

In [None]:
get_final_predictions(rf_best_estimator, X[rf_features])

In [None]:
get_final_predictions(LinearRegression(), X[features])

In [None]:
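# Assumes the final model is the linear regression on the RFECV-selected features (the last model scored above).
test = test.assign(predict_vals=LinearRegression().fit(X[features], y).predict(test[features]))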

In [None]:
test[['state','year','partisan_score', 'predict_vals']]

In [None]:
abs(test['predict_vals'] - test['partisan_score']).mean()

Above, we see that our final model was off by around 9 percentage points on average (mean absolute error on the test set). Clearly, our model could be improved.

## Comparison with Baseline Estimate

Here, we will compare our model with the baseline estimate, which is the old score average added to the generic ballot. 

In [None]:
compare_df = elections.copy()

In [None]:
compare_df['baseline_estimate'] = compare_df['old_score_avg'] + compare_df['generic_ballot']

In [None]:
abs(compare_df['baseline_estimate'] - compare_df['partisan_score']).mean()

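Our baseline estimate was off by around 10.5 percentage points, so our linear regression model represents a slight improvement. Note that the baseline error is computed over all 218 rows, whereas the model error comes from the 44 test rows, so the two figures are not strictly comparable.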
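
A minimal sketch of a like-for-like comparison, computing both mean absolute errors on the same held-out rows. The toy values below are made up for illustration and `toy_test` is a hypothetical stand-in for the real test set; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the held-out test set; all values are invented.
toy_test = pd.DataFrame({
    "partisan_score": [0.10, -0.20, 0.05, 0.30],
    "old_score_avg": [0.05, -0.25, 0.10, 0.20],
    "generic_ballot": [0.02, 0.02, -0.03, -0.03],
    "predict_vals": [0.08, -0.18, 0.02, 0.27],
})

# Baseline: previous partisan lean shifted by the national generic ballot.
baseline = toy_test["old_score_avg"] + toy_test["generic_ballot"]

# Mean absolute error of each estimate, measured on the same rows.
baseline_mae = (baseline - toy_test["partisan_score"]).abs().mean()
model_mae = (toy_test["predict_vals"] - toy_test["partisan_score"]).abs().mean()

print(f"baseline MAE: {baseline_mae:.4f}, model MAE: {model_mae:.4f}")
```

With the real data, the same two lines could be applied to `test` once `predict_vals` has been filled in, which would put the two errors on the same footing.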