# Final House Prediction - 11/7/22

## Model Overviews & Innovations

For my final prediction, I created two separate predictive models:

1. A "lite" model which produces a national level prediction based on incumbency, economic, and polling factors only
2. A "deluxe" model which produces a district-level prediction that attempts to improve on the gold-standard 538 prediction

Before jumping into the details of each of the models, I want to provide what I believe are the main innovations of my models, and whether they are predictive. 

1. Lite: Consumer Sentiment Economic Data -- this seems predictive! 
- Many have described the 2022 economy as unique because it combines low unemployment with high inflation, with wages and prices rising alike. Previously, to understand this, people have used Real Disposable Income (RDI) as a proxy for inflation-adjusted tangible growth, but based on my prior economic work, this seems only marginally better than GDP. Others have used gas prices as a proxy for tangible consumer perception of the economy, with more mixed results. Therefore, I use the University of Michigan's consumer sentiment survey, which I found to be more predictive than the other economic indicators.
2. Deluxe: District-Level Donation Data -- so far, not predictive, but I believe it to be promising. 
- For my deluxe model, I came in with the presumption that 538's model would be a gold-standard baseline, with only two potential areas for improvement. The first is their CANTOR district similarity. 538 does not specify how the similarity is calculated, but I presume it is a cosine similarity based on the features of each district. I believe that a neural embedding trained to minimize reconstruction loss would perform better, but I did not implement this. The second area of improvement is district-specific donation data. 538 noted that they had improved their model since its launch to weight state-level donation data 5x more than national-level donation data. This is likely because the more targeted data better captures how many constituents are likely to vote, as opposed to outside financing hoping to sway the race. I believe the true goal should be to limit the data solely to constituents by using district-level data. However, this data is not readily accessible from the FEC, so I created the first approximation (by associating donations with zip codes and zip codes with congressional districts). While I was unable to improve on the 538 model, I believe this data will ultimately be more predictive than state-level data.
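
The zip-code approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the column names (`zip`, `party`, `amount`, `district`, `weight`), the sample rows, and the idea of splitting a zip code across districts by a `weight` share are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical itemized donations: one row per contribution, keyed by donor zip code.
donations = pd.DataFrame({
    "zip": ["02138", "02138", "02139", "10001"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "amount": [100.0, 50.0, 200.0, 300.0],
})

# Hypothetical zip-to-district crosswalk. A zip code can straddle districts,
# so `weight` is the assumed share of the zip code that falls in each district.
crosswalk = pd.DataFrame({
    "zip": ["02138", "02139", "02139", "10001"],
    "district": ["MA-05", "MA-05", "MA-07", "NY-12"],
    "weight": [1.0, 0.4, 0.6, 1.0],
})

def donations_by_district(donations: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Apportion each donation to districts and total the amounts by party."""
    merged = donations.merge(crosswalk, on="zip", how="inner")
    merged["weighted_amount"] = merged["amount"] * merged["weight"]
    return (
        merged.groupby(["district", "party"])["weighted_amount"]
        .sum()
        .unstack(fill_value=0.0)
    )

district_totals = donations_by_district(donations, crosswalk)
print(district_totals)
```

The inner merge drops donations whose zip code is missing from the crosswalk, which is one of several places where an approximation like this can lose data.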

With those main innovations out of the way, let's jump into the details of each model.

## Prediction

My predictions come from my Lite Model, as my Deluxe model does not improve upon 538.

**Republican Two Party Seat Share:** 52.13\% -> 226.76 Seats -> 227 Seats (40.92\% - 56.55\% / 178 - 246 Seats 95\% Prediction Interval)

**Republican Two Party Popular Vote Share:** 50.58\% (44.43\% - 53.47\% 95\% Prediction Interval)
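
As a quick check on the seat numbers, the conversion from seat share to seats can be sketched as below. It assumes that every one of the 435 House seats is held by one of the two major parties, so that the two-party seat share can be multiplied directly by 435. The function name is just for illustration.

```python
HOUSE_SEATS = 435  # total voting seats in the U.S. House

def share_to_seats(seat_share: float, total_seats: int = HOUSE_SEATS) -> int:
    """Convert a two-party seat share (between 0 and 1) to a whole number of seats."""
    return round(seat_share * total_seats)

point = share_to_seats(0.5213)
low, high = share_to_seats(0.4092), share_to_seats(0.5655)
print(point, low, high)  # 227 178 246
```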

## Lite Model

### Features

For my lite model, I sought to predict the national two-party seat share (of the Republican party) and the national two-party vote share (of the Republican party). I used the following features:

1. Incumbency
- Year of Election
- Midterm Year Indicator
- Party of the President
- 2 Party seat share from the prior election (GOP)
- 2 party vote share from the prior election (GOP)
2. Economy
- Consumer Sentiment Index (absolute value at time of election)
- Consumer Sentiment Index (change from start of Congress (Q1 -> Q7 Change))
- GDP Growth (annual growth at time of election)
- GDP Growth (change from start of Congress (Q1 -> Q7 Change))
- RDI Growth (annual growth at time of election)
- RDI Growth (change from start of Congress (Q1 -> Q7 Change))
3. Polling
- Presidential Approval (election year average)
- Presidential Approval (Election October & November average)
- Presidential Disapproval (election year average)
- Presidential Disapproval (Election October & November average)
- Generic Congressional Ballot (election year average)
- Generic Congressional Ballot (Election October & November average)

#### Incumbency

* Year of Election
    - My model is based on data since 1978 (due to the availability of consumer sentiment data), during which, the electoral dynamics have changed greatly. Accordingly, this should allow the model to predict the tightening elections of late.
    - Example: 2022
    - Source: [1948-2022 District-Level Results](lite_data/incumb_dist_1948-2022%20(2).csv)
* Midterm Year Indicator
    - This is a binary indicator for whether the election is a midterm. Turnout is often lower in midterms, and the president's party tends to lose seats (in every midterm in the training window except 1998 and 2002).
    - Example: 1 (if Midterm)
    - Source: [1948-2022 District-Level Results](lite_data/incumb_dist_1948-2022%20(2).csv)
* Party of the President
    - I imagine this to be mostly useful as an interaction term with other variables (e.g. midterm year, presidential approval) to contextualize whether their effect is positive or negative.
    - Example: 1 (if Republican)
    - Source: [Wikipedia President List](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States), with some [post-processing](./final_prediction_lite.ipynb)
* 2 Party seat share from the prior election (GOP)
    - Often, elections are viewed more appropriately as seat change given the power of incumbency to set the norm. This should allow the model to account for that baseline.
    - Example: 0.52 (if the GOP won 52% of seats in the prior election)
    - Source: [1948-2022 District-Level Results](lite_data/incumb_dist_1948-2022%20(2).csv)
* 2 party vote share from the prior election (GOP)
    - Same reasoning as above, but vote share.
    - Example: 0.51 (if the GOP won 51% of votes in the prior election)
    - Source: [1948-2022 District-Level Results](lite_data/incumb_dist_1948-2022%20(2).csv)

#### Economy

* Consumer Sentiment Index (absolute value at time of election)
    - As discussed in the introduction, I believe this is a better proxy for consumer perception of the economy than GDP or RDI. I use the latest absolute value at the election's time because of recency bias, and because unlike GDP or RDI, consumer sentiment doesn't need to be normalized as a percent change. 
    - Example: 78
    - Source: [FRED University of Michigan: Consumer Sentiment](https://fred.stlouisfed.org/series/UMCSENT)
* Consumer Sentiment Index (change from start of Congress (Q1 -> Q7 Change))
    - I view this as less likely to be predictive than the absolute value, but it allows voters to judge based on the overall direction of the economy.
    - Example: -2.5
    - Source: [FRED University of Michigan: Consumer Sentiment](https://fred.stlouisfed.org/series/UMCSENT) 
* GDP Growth (annual growth at time of election)
    - This is a standard economic indicator, and I believe it is useful to include as a control. This allows for the most recent reading.
    - Example: 4.5
    - Source: [FRED Real GDP Per Capita](https://fred.stlouisfed.org/series/A939RX0Q048SBEA)
* GDP Growth (change from start of Congress (Q1 -> Q7 Change))
    - Allows for directional reading of GDP.
    - Example: 2.5
    - Source: [FRED Real GDP Per Capita](https://fred.stlouisfed.org/series/A939RX0Q048SBEA)
* RDI Growth (annual growth at time of election)
    - This is a standard economic indicator, and I believe it is useful to include as a control. This allows for the most recent reading.
    - Example: 4.5
    - Source: [FRED Real Disposable Personal Income: Per Capita](https://fred.stlouisfed.org/series/A229RX0)
* RDI Growth (change from start of Congress (Q1 -> Q7 Change))
    - Allows for directional reading of RDI.
    - Example: 2.5
    - Source: [FRED Real Disposable Personal Income: Per Capita](https://fred.stlouisfed.org/series/A229RX0)
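
To make the two flavors of each economic feature concrete, here is a minimal sketch of how the "absolute" and "Q1 -> Q7 change" values could be derived from a quarterly series. It assumes the series holds one value per quarter of the Congress, ordered from Q1 to Q7. The function name and the sample readings are made up for illustration and are not real UMCSENT values.

```python
import pandas as pd

def economy_features(quarterly: pd.Series) -> dict:
    """Return the latest reading and the Q1 -> Q7 change for one Congress.

    `quarterly` is assumed to hold one value per quarter, ordered from the
    first quarter of the Congress (Q1) to the last quarter before the
    election (Q7).
    """
    first, latest = quarterly.iloc[0], quarterly.iloc[-1]
    return {"absolute": float(latest), "change": float(latest - first)}

# Made-up index values, not real consumer sentiment readings.
sentiment = pd.Series(
    [80.5, 82.0, 79.0, 77.5, 76.0, 78.5, 78.0],
    index=[f"Q{i}" for i in range(1, 8)],
)
features = economy_features(sentiment)
print(features)  # {'absolute': 78.0, 'change': -2.5}
```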

#### Polling

* Presidential Approval (election year average)
    - This allows for general sentiment of the president to play into the model.
    - Example: 0.6
    - Source: For training [Gallup Presidential Approval](lite_data/pres_approval_gallup_1941-2022.csv), For 2022 [538 Biden Approval](https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv) 
* Presidential Approval (Election October & November average)
    - This allows for more recent sentiment of the president to play into the model.
    - Example: 0.6
    - Source: For training [Gallup Presidential Approval](lite_data/pres_approval_gallup_1941-2022.csv), For 2022, using latest reading on [538 Biden Ratings](https://projects.fivethirtyeight.com/biden-approval-rating/)
* Presidential Disapproval (election year average)
    - This allows for general sentiment of the president to play into the model.
    - Example: 0.4
    - Source: For training [Gallup Presidential Approval](lite_data/pres_approval_gallup_1941-2022.csv), For 2022 [538 Biden Approval](https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv) 
* Presidential Disapproval (Election October & November average)
    - This allows for more recent sentiment of the president to play into the model.
    - Example: 0.4
    - Source: For training [Gallup Presidential Approval](lite_data/pres_approval_gallup_1941-2022.csv), For 2022, using latest reading on [538 Biden Ratings](https://projects.fivethirtyeight.com/biden-approval-rating/)
* Generic Congressional Ballot (election year average)
    - This allows for general sentiment of the parties to play into the model.
    - Example: Dem: 0.41, Rep: 0.42
    - Source: For training [Generic Ballot Historical (1942-Present)](lite_data/GenericPolls1942_2020.csv), For 2022 [538 Generic Congressional Ballot](https://projects.fivethirtyeight.com/generic-ballot-data/generic_polllist.csv)
* Generic Congressional Ballot (Election October & November average)
    - This allows for more recent sentiment of the parties to play into the model.
    - Example: Dem: 0.41, Rep: 0.42
    - Source: For training [Generic Ballot Historical (1942-Present)](lite_data/GenericPolls1942_2020.csv), For 2022 [538 Generic Congressional Ballot](https://projects.fivethirtyeight.com/polls/data/generic_ballot_averages.csv)
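
Each polling feature comes in an election-year average and an October & November average. A minimal sketch of that aggregation is below. The `date` and `approve` column names and the sample readings are assumptions for illustration, not the actual layout of the Gallup or 538 files.

```python
import pandas as pd

def poll_averages(polls: pd.DataFrame, election_year: int, value_col: str) -> dict:
    """Average a poll series over the election year and over October-November."""
    year_polls = polls[polls["date"].dt.year == election_year]
    late_polls = year_polls[year_polls["date"].dt.month.isin([10, 11])]
    return {
        "annual": year_polls[value_col].mean(),
        "two_month": late_polls[value_col].mean(),
    }

# Made-up approval readings; column names are hypothetical.
polls = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-01", "2022-06-15", "2022-10-10", "2022-11-01"]),
    "approve": [0.44, 0.40, 0.42, 0.42],
})
averages = poll_averages(polls, 2022, "approve")
print(averages)
```

The same function could be applied to disapproval and to each party's generic ballot share by changing `value_col`.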

### Model

I used 4 different model types, each for a different purpose:

* Linear Regression
    - A baseline model, good for understanding relative and interpretable coefficients
* Lasso Regression 
    - Performs feature selection, good for understanding the most important coefficients.
* Polynomial Regression
    - Allows for interactions, particularly important for the incumbency features
* Random Forest (Ensemble)
    - Allows for non-linear interactions, as well as state of the art performance for tabular data (which this is).
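
For reference, the four model types could be set up in scikit-learn roughly as follows. This is a sketch on synthetic data: the hyperparameters (`alpha`, `degree`, `n_estimators`) and the random features are assumptions, not the settings or data used for the actual prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the training data: rows are elections, columns are features.
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 5))
y = 0.5 + 0.03 * X[:, 0] + rng.normal(scale=0.01, size=22)

# Hyperparameters below are illustrative assumptions.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.01),
    "Polynomial Regression (2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Random Forest Regressor (100)": RandomForestRegressor(
        n_estimators=100, random_state=0
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    # Predict the last row as a stand-in for the election being forecast.
    predictions[name] = float(model.predict(X[-1:])[0])
    print(f"{name}: {predictions[name]:.3f}")
```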

For an overview of their performance, see the following table, which reports in-sample and out-of-sample evaluation, including the error on 2020 as an easy-to-understand reference:

```python
import pandas as pd
model_performance_df = pd.read_csv("lite_data/all_model_data.csv")
model_performance_df[["name", "R^2", "CV Scores", "2020 Seat Share Error", "2020 Vote Share Error"]]
```

| name | R^2 | CV Scores | 2020 Seat Share Error | 2020 Vote Share Error |
| --- | --- | --- | --- | --- |
| Linear Regression | 0.968144 | [-240.07586988 -3.82854943 -2.22217343] | -0.071126 | -0.081332 |
| Lasso Regression | 0.362444 | [-23.01198633 -4.22348342 0.04777647] | 0.032415 | 0.000403 |
| Ridge Regression | 0.933815 | [-55.68251395 -3.0316546 -1.31870434] | 0.019496 | -0.028757 |
| Random Forest Regressor (10) | 0.848819 | [-22.27635859 -8.46810079 0.24039067] | -0.050625 | -0.022352 |
| Random Forest Regressor (100) | 0.905489 | [-19.87034581 -5.07148077 0.13054906] | -0.00532 | 0.002136 |
| Random Forest Regressor (1000) | 0.909554 | [-20.4300514 -4.73017352 0.16471045] | -0.009174 | -2.9e-05 |
| Polynomial Regression (2) | 1.0 | [-8.47923779 -2.15374549 -6.24067355] | -0.03532 | -0.052864 |
| Polynomial Regression (3) | 1.0 | [-44.29844793 -2.06639805 -1.49887154] | -0.036479 | -0.05357 |
| Polynomial Lasso Regression (2) | 0.922338 | [-28.30783143 -1.33919337 -2.14168412] | 0.020192 | -0.013601 |
| Polynomial Lasso Regression (3) | 0.999929 | [-9.67270879 -2.88299644 -10.81933603] | -0.056378 | -0.065906 |
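
The columns in this table could be produced along the following lines. This is only a sketch under several assumptions: that the three CV scores are R^2 values from 3-fold cross-validation (scikit-learn's default scoring for regressors), that the most recent election is held out for the error column, and that the error is computed as predicted minus actual. The data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 4))
y = 0.5 + 0.03 * X[:, 0] + rng.normal(scale=0.01, size=22)

# Assumption: hold out the most recent election as the reference year.
X_train, y_train = X[:-1], y[:-1]
X_last, y_last = X[-1:], y[-1]

model = LinearRegression().fit(X_train, y_train)
in_sample_r2 = model.score(X_train, y_train)  # in-sample R^2
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=3)  # R^2 per fold
holdout_error = float(model.predict(X_last)[0] - y_last)  # predicted minus actual

print(in_sample_r2, cv_scores, holdout_error)
```

With so few rows per fold, per-fold R^2 can be strongly negative even when the in-sample fit looks good.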


#### Linear Regression

Since the goal of linear regression is to get a rough understanding of the model coefficients, here are the coefficients I found:

year 0.0037737428021888198

is_midterm_year 0.034921969446327195

rep_pres -0.02982138361770056

RepSeatShare_prior -0.7186283376330896

RepVoteShare_prior 1.9695919576382186

absolute_sentiment -0.0013689150478568136

sentiment_change 0.0013559212685174777

absolute_gdp 0.0011535946478451894

gdp_change 0.5009893745445565

rdi 0.012444702538206487

rdi_change 0.012444702538211411

annual_poll_approval 0.024132311200005422

two_month_poll_approval -0.015353670135328549

annual_poll_disapproval 0.020252281353538407

two_month_poll_disapproval -0.013470520027339098

annual_dem 0.014538921653516185

two_month_dem -0.01848914678506335

annual_rep 0.002119113076802239

two_month_rep -0.0012655571198566043

Seemingly, the most important variables are the prior-election results (RepVoteShare_prior and RepSeatShare_prior), which makes intuitive sense, followed by the economic factors, where gdp_change has the largest coefficient. Several closely related features have coefficients of opposite sign, which suggests collinearity. To reduce these collinearities, we can use Lasso Regression.

#### Lasso Regression

For Lasso Regression, we get the following coefficients (for seat share):

year 0.0030986477539816475

absolute_sentiment 0.0001432172822153464

sentiment_change 0.001029068985272151

annual_poll_approval 0.0006727147220908171

Interestingly, the only economic variables that Lasso retains are the sentiment scores, confirming my suspicion, and presidential approval is retained while the generic ballot is dropped. This is a promising sign that sentiment was a valuable addition.

For the vote share, we get the following coefficients:

year 0.00026173713404008026

sentiment_change 0.0003507946764736049

Once again, this shows how powerful sentiment is as a predictor.
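
The feature-selection behavior of Lasso can be illustrated with a small synthetic example. The data, the `alpha` value, and the choice of which feature drives the target are assumptions made for the sketch. Only the feature names are borrowed from the model above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
feature_names = ["year", "sentiment_change", "gdp_change", "rdi_change"]
X = rng.normal(size=(200, 4))
# By construction, only sentiment_change influences the synthetic target.
y = 0.5 + 0.05 * X[:, 1] + rng.normal(scale=0.01, size=200)

lasso = Lasso(alpha=0.01).fit(X, y)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so the surviving features are the ones with non-zero coefficients.
selected = {
    name: float(coef)
    for name, coef in zip(feature_names, lasso.coef_)
    if coef != 0.0
}
print(selected)
```

The surviving coefficient is shrunk below its true value of 0.05, which is why Lasso coefficients are better read as a ranking than as effect sizes.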

#### Polynomial Regression

For polynomial regression, the coefficients are less interpretable, but importantly, we would expect to see a marked improvement over the linear regression due to the deliberate inclusion of features that allow for interactions (e.g. president party).

Indeed, we find that the degree-2 polynomial Lasso regression greatly improves performance on 2020 relative to linear regression, presumably because it keeps only the most important interaction terms.

#### Random Forests

Finally, our random forest models give us state-of-the-art performance given their high degree of non-linearity. Unfortunately, they are much less interpretable, but we can pull generalized feature importances (roughly, how often a feature was used as a deciding factor).

year 0.1675186669850853

is_midterm_year 0.0026796887176913833

rep_pres 0.1353620017137392

RepSeatShare_prior 0.06888020563584256

RepVoteShare_prior 0.0835810571158828

absolute_sentiment 0.013916784642159823

sentiment_change 0.01591892960535057

absolute_gdp 0.010938538229963034

gdp_change 0.016124365325747004

rdi 0.011846376143070273

rdi_change 0.011413370424025366

annual_poll_approval 0.0247650251359856

two_month_poll_approval 0.022372655026794724

annual_poll_disapproval 0.02987771694575475

two_month_poll_disapproval 0.034254482187393964

annual_dem 0.033116011748911905

two_month_dem 0.025373972785702224

annual_rep 0.18500667395979925

two_month_rep 0.10705347767110034

In the random forest model, there still appear to be some collinearity or confounding factors, as shown by the high importance of the Republican generic polling average and the low importance of the Democratic polling average. Importantly, though, the model picks up on the recent moderation of races and on the importance of incumbency.
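
Importances like those listed above can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute, which is normalized to sum to 1 (as the listed values roughly do). The sketch below uses synthetic data in which one feature drives the target by construction. The data and the number of trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["annual_rep", "annual_dem", "gdp_change", "rdi_change"]
X = rng.normal(size=(200, 4))
# By construction, only the first feature influences the synthetic target.
y = 0.5 + 0.05 * X[:, 0] + rng.normal(scale=0.005, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its impurity-based importance, largest first.
importances = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances:
    print(f"{name} {value:.4f}")
```

When two features are strongly correlated, the forest can split on either one, so importance may concentrate on one of them somewhat arbitrarily.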

Furthermore, we can use the random forest to get a prediction interval: the model is made up of 1000 trees, each trained on a random subset of the data, so we can treat the individual trees' outputs as a set of predictions. The distribution of their outcomes is as follows:

![](lite_data/seat_share_hist.png)

Random forest models essentially perform a nearest neighbor search, so you can think of this as showing which prior elections are the most similar to the current one -- with the mode notably at a Republican sweep, not at the middle. 

I used this distribution to create my 95\% prediction interval.
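
One way to turn the per-tree predictions into an interval is to take percentiles of the individual trees' outputs. The sketch below does this on synthetic data. The use of the 2.5th and 97.5th percentiles is an assumption about how the 95\% bounds were derived.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data and the election to predict.
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 4))
y = 0.5 + 0.03 * X[:, 0] + rng.normal(scale=0.01, size=22)
x_new = rng.normal(size=(1, 4))

rf = RandomForestRegressor(n_estimators=1000, random_state=0).fit(X, y)

# Each fitted tree is exposed through `estimators_`; collect one prediction per tree.
tree_preds = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])

# Assumed interval: the central 95% of the per-tree predictions.
low, high = np.percentile(tree_preds, [2.5, 97.5])
point = rf.predict(x_new)[0]  # the forest's prediction is the mean over trees
print(f"{point:.4f} ({low:.4f} - {high:.4f})")
```

The spread here reflects disagreement among trees rather than a formal statistical guarantee, so it is best read as a rough range.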

## Deluxe Model