# Project 3

## GDP per capita change and General Election results

- **Dataset(s) to be used:** [United States General Election Presidential Results by District and County from 2008 to 2024](https://github.com/tonmcg/US_County_Level_Election_Results_08-24)
- **Second dataset:** [Per capita personal income by County from 2020 to 2023](https://apps.bea.gov/itable/?ReqID=70&step=1&_gl=1*1q1ezha*_ga*MTYwMzEwNjAxOC4xNzMzNDIyODE0*_ga_J4698JNNFT*MTczMzQyMjgxMy4xLjEuMTczMzQyMjg5NC40Mi4wLjA.#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCIyMCJdLFsiTWFqb3JfQXJlYSIsIjQiXSxbIlN0YXRlIixbIlhYIl1dLFsiQXJlYSIsWyJYWCJdXSxbIlN0YXRpc3RpYyIsIjMiXSxbIlVuaXRfb2ZfbWVhc3VyZSIsIkFBR1IiXSxbIlllYXIiLFsiLTEiXV0sWyJZZWFyQmVnaW4iLCIyMDIwIl0sWyJZZWFyX0VuZCIsIjIwMjMiXV1)
- **Analysis question:** What was the effect of growth rates in GDP per capita on the change in preference for Republican candidates between 2020 and 2024 at a County level? 
- **Columns that will (likely) be used:**
  - per_gop: Percentage of vote for Republican candidates
  - diff_gop (calculated column): Change in the percentage of vote for Republican candidates from 2020 to 2024
  - gdppc: Per capita personal income compound annual growth rate from 2020 to 2023 (used throughout as the measure of GDP per capita growth)
- **Columns to be used to merge/join them:**
  - General Election results data, merge on county fips code [county_fips]
  - Per capita personal income data, merge on county fips code [GeoFips]
- **Hypothesis**: 
  - [H0]: There is no effect of GDP per capita compound annual growth rates on the change in preference for Republican candidates between 2020 and 2024
  - [H1]: There is a significant negative relationship between GDP per capita compound annual growth rates and the change in preference for Republican candidates between 2020 and 2024
- **Site URL:** [Jali's website](https://project-3-jalipacker.readthedocs.io/en/latest/)

In [None]:
import pandas as pd
import plotly.express as px
import statsmodels.api as sm
import matplotlib.pyplot as plt

import plotly.io as pio
pio.renderers.default = "vscode+jupyterlab+notebook_connected"

### Introduction - Why is economic growth important to election results?
In this project, I explore the relationship between GDP per capita change and voting preference in the United States at the county level. My hypothesis is that the two are related, because the economy was found to be the most important issue for Republican voters, and far more important to them than to Democratic voters.

Given the Republican win in the General Election, the relationship is expected to be negative: that is to say, counties which have experienced lower rates of growth in GDP per capita over the past four years are expected to show a larger shift towards Republican candidates.

This is assumed to be due to the following mechanism: the salience of lower GDP per capita growth and the higher importance of the economy as an issue to Republican voters leads to an increased share of vote towards Republican candidates in these lower GDP per capita growth counties. However, this mechanism is not explored empirically in the current project. 

The following interactive chart, sourced from Gallup, gives the importance of different issues to registered voters, split by political party. Hovering over dots on the chart will bring up the specific percentage of poll respondents who ranked an issue as extremely important.

<iframe src="https://datawrapper.dwcdn.net/mlBuq/26/" width="800" height="900" frameborder="0" scrolling="no"></iframe>

### Step 1 - Import election data
In this stage, I import county-level data on the 2024 and 2020 elections, sourced from this [GitHub repository](https://github.com/tonmcg/US_County_Level_Election_Results_08-24). After importing and inspecting the data, I clean it and merge the two datasets to calculate the change in the proportion of votes for Republican candidates.

In [38]:
election_2020 = pd.read_csv("2020_US_County_Level_Presidential_Results.csv")

In [39]:
election_2020.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff
0,Alabama,1001,Autauga County,19838,7503,27770,12335,0.714368,0.270184,0.444184
1,Alabama,1003,Baldwin County,83544,24578,109679,58966,0.761714,0.22409,0.537623
2,Alabama,1005,Barbour County,5622,4816,10518,806,0.534512,0.457882,0.076631
3,Alabama,1007,Bibb County,7525,1986,9595,5539,0.784263,0.206983,0.57728
4,Alabama,1009,Blount County,24711,2640,27588,22071,0.895716,0.095694,0.800022


We now want to match this data, on county_fips, to the 2024 election results. Our dependent variable will be the percentage point change from 2020 to 2024 in the percentage of votes for the Republican party.

In [40]:
election_2024 = pd.read_csv("2024_US_County_Level_Presidential_Results.csv")

In [41]:
election_2024.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff
0,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632
1,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791
2,Alabama,1005,Barbour County,5578,4120,9766,1458.0,0.571165,0.421872,0.149293
3,Alabama,1007,Bibb County,7563,1617,9230,5946.0,0.819393,0.17519,0.644204
4,Alabama,1009,Blount County,25271,2569,28024,22702.0,0.901763,0.091671,0.810091


Before merging, we will inspect our columns of interest - county_fips and per_gop - in both datasets to ensure that there are no missing values or other errors in the data reporting.
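
A compact version of this kind of check could look like the sketch below. It runs on a small toy frame rather than the real election files, and the helper name `column_report` is hypothetical. It counts missing values, duplicate FIPS codes, and vote shares outside the 0 to 1 range.

```python
import pandas as pd

# Toy frame standing in for an election table; the values are illustrative only.
toy = pd.DataFrame(
    {
        "county_fips": [1001, 1003, 1003, 1005],
        "per_gop": [0.71, 0.76, 0.76, None],
    }
)


def column_report(df, key, value):
    """Summarise missing values, duplicate keys, and out-of-range shares."""
    return {
        "missing_key": int(df[key].isna().sum()),
        "missing_value": int(df[value].isna().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
        # A vote share should lie between 0 and 1; missing values are not counted here.
        "out_of_range": int(((df[value] < 0) | (df[value] > 1)).sum()),
    }


report = column_report(toy, "county_fips", "per_gop")
print(report)
```

The same helper could be called on `election_2020` and `election_2024` with `"county_fips"` and `"per_gop"` as arguments.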

In [42]:
election_2020["county_fips"].unique()

array([ 1001,  1003,  1005, ..., 56041, 56043, 56045])

In [43]:
election_2020["county_fips"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 3152 entries, 0 to 3151
Series name: county_fips
Non-Null Count  Dtype
--------------  -----
3152 non-null   int64
dtypes: int64(1)
memory usage: 24.8 KB


In [44]:
election_2024["county_fips"].unique()

array([ 1001,  1003,  1005, ..., 56041, 56043, 56045])

In [45]:
election_2024["county_fips"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 3160 entries, 0 to 3159
Series name: county_fips
Non-Null Count  Dtype
--------------  -----
3160 non-null   int64
dtypes: int64(1)
memory usage: 24.8 KB


In both cases, our county FIPS codes are a consistent data type (integer) which we can work with. We will now inspect the column "per_gop".

In [46]:
election_2020["per_gop"].unique()

array([0.71436802, 0.76171373, 0.53451226, ..., 0.79727718, 0.80882353,
       0.87718803])

In [47]:
election_2020["per_gop"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 3152 entries, 0 to 3151
Series name: per_gop
Non-Null Count  Dtype  
--------------  -----  
3152 non-null   float64
dtypes: float64(1)
memory usage: 24.8 KB


In [48]:
election_2024["per_gop"].unique()

array([0.72664274, 0.78648955, 0.57116527, ..., 0.81055209, 0.81359021,
       0.87836291])

In [49]:
election_2024["per_gop"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 3160 entries, 0 to 3159
Series name: per_gop
Non-Null Count  Dtype  
--------------  -----  
3160 non-null   float64
dtypes: float64(1)
memory usage: 24.8 KB


Again, in both cases, our per_gop columns are a consistent data type (float) which we can work with. Next, we will merge the 2020 and 2024 datasets, for our columns of interest.

In [50]:
election_2020 = election_2020.rename(columns={'per_gop': 'per_gop_2020'})
election_2024 = election_2024.rename(columns={'per_gop': 'per_gop_2024'})

In [51]:
election_2020_selected = election_2020[['county_fips', 'county_name', 'per_gop_2020']]
election_2024_selected = election_2024[['county_fips', 'county_name', 'per_gop_2024']]


In [52]:
merged_data = pd.merge(election_2020_selected, election_2024_selected, on=['county_fips', 'county_name'])


In [53]:
merged_data.head()

Unnamed: 0,county_fips,county_name,per_gop_2020,per_gop_2024
0,1001,Autauga County,0.714368,0.726643
1,1003,Baldwin County,0.761714,0.78649
2,1005,Barbour County,0.534512,0.571165
3,1007,Bibb County,0.784263,0.819393
4,1009,Blount County,0.895716,0.901763


With our merged dataset, we can now calculate the difference in proportion of votes towards Republican candidates, from 2020 to 2024.

In [54]:
merged_data["diff_gop"] = merged_data["per_gop_2024"] - merged_data["per_gop_2020"]

In [55]:
merged_data.head()

Unnamed: 0,county_fips,county_name,per_gop_2020,per_gop_2024,diff_gop
0,1001,Autauga County,0.714368,0.726643,0.012275
1,1003,Baldwin County,0.761714,0.78649,0.024776
2,1005,Barbour County,0.534512,0.571165,0.036653
3,1007,Bibb County,0.784263,0.819393,0.035131
4,1009,Blount County,0.895716,0.901763,0.006047


In [56]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3103 entries, 0 to 3102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   county_fips   3103 non-null   int64  
 1   county_name   3103 non-null   object 
 2   per_gop_2020  3103 non-null   float64
 3   per_gop_2024  3103 non-null   float64
 4   diff_gop      3103 non-null   float64
dtypes: float64(3), int64(1), object(1)
memory usage: 121.3+ KB
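
The merged frame has 3103 rows, fewer than either input (3152 rows for 2020 and 3160 for 2024). Because the merge keys include `county_name`, a county whose name is written differently in the two files would be dropped, as would a county present in only one year. The sketch below shows one way such rows could be surfaced, using an outer merge on the FIPS code with `indicator=True`. The toy frames and county names are hypothetical, not taken from the real data.

```python
import pandas as pd

# Hypothetical toy tables; county names and shares are made up for illustration.
left = pd.DataFrame(
    {
        "county_fips": [1, 2, 3],
        "county_name": ["Alpha County", "Beta County", "Gamma County"],
        "per_gop_2020": [0.50, 0.60, 0.70],
    }
)
right = pd.DataFrame(
    {
        "county_fips": [1, 2, 4],
        "county_name": ["Alpha County", "Beta Parish", "Delta County"],
        "per_gop_2024": [0.52, 0.61, 0.55],
    }
)

# Outer merge on the FIPS code only, keeping both name columns for comparison.
check = pd.merge(
    left,
    right,
    on="county_fips",
    how="outer",
    suffixes=("_2020", "_2024"),
    indicator=True,
)

# Counties that appear in only one of the two years.
only_one_year = check[check["_merge"] != "both"]

# Counties present in both years whose names differ between the files.
name_mismatch = check[
    (check["_merge"] == "both")
    & (check["county_name_2020"] != check["county_name_2024"])
]

print(only_one_year[["county_fips", "_merge"]])
print(name_mismatch[["county_fips", "county_name_2020", "county_name_2024"]])
```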


### Step 2 - Import GDP per capita data

Now we add data on county-level growth in GDP per capita from 2020 to 2023 (no 2024 data was available), sourced from the [Bureau of Economic Analysis](https://apps.bea.gov/itable/?ReqID=70&step=1&_gl=1*1q1ezha*_ga*MTYwMzEwNjAxOC4xNzMzNDIyODE0*_ga_J4698JNNFT*MTczMzQyMjgxMy4xLjEuMTczMzQyMjg5NC40Mi4wLjA.#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCIyMCJdLFsiTWFqb3JfQXJlYSIsIjQiXSxbIlN0YXRlIixbIlhYIl1dLFsiQXJlYSIsWyJYWCJdXSxbIlN0YXRpc3RpYyIsIjMiXSxbIlVuaXRfb2ZfbWVhc3VyZSIsIkFBR1IiXSxbIlllYXIiLFsiLTEiXV0sWyJZZWFyQmVnaW4iLCIyMDIwIl0sWyJZZWFyX0VuZCIsIjIwMjMiXV19).
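
For reference, a compound annual growth rate is conventionally computed as (end / start)^(1 / years) - 1. The sketch below applies that textbook formula to hypothetical dollar figures. It assumes the downloaded column follows this definition, which is not verified here, and the function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values, returned in percent."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return ((end_value / start_value) ** (1 / years) - 1) * 100


# Hypothetical per capita figures for 2020 and 2023 (three annual steps).
rate = cagr(50_000, 57_881.25, years=3)
print(round(rate, 1))
```

A value of 5.6 in the table would then mean that the per capita figure grew by 5.6% per year, compounded, over the three years.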

In [57]:
gdppc = pd.read_csv("Table.csv")

In [58]:
gdppc.head()

Unnamed: 0,GeoFips,GeoName,2020-2023
0,1001,"Autauga, AL",5.6
1,1003,"Baldwin, AL",6.2
2,1005,"Barbour, AL",3.9
3,1007,"Bibb, AL",4.6
4,1009,"Blount, AL",6.0


In [59]:
gdppc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   GeoFips    3140 non-null   int64 
 1   GeoName    3140 non-null   object
 2   2020-2023  3140 non-null   object
dtypes: int64(1), object(2)
memory usage: 73.7+ KB


In [60]:
gdppc["2020-2023"].unique()

array(['5.6', '6.2', '3.9', '4.6', '6.0', '5.5', '7.6', '4.2', '6.4',
       '6.8', '4.8', '3.8', '5.4', '5.9', '5.3', '5.1', '4.4', '7.0',
       '6.1', '5.2', '4.5', '3.5', '5.0', '2.8', '7.1', '4.1', '5.7',
       '6.5', '5.8', '4.7', '4.0', '(NA)', '4.9', '2.3', '12.9', '6.3',
       '9.8', '3.6', '4.3', '8.9', '7.9', '12.5', '3.4', '9.9', '8.2',
       '6.7', '7.2', '14.0', '2.2', '7.3', '7.4', '6.6', '3.7', '8.0',
       '0.6', '1.8', '2.6', '1.3', '3.0', '2.4', '1.5', '2.1', '3.1',
       '2.0', '7.5', '1.9', '6.9', '3.2', '2.9', '10.6', '22.6', '10.4',
       '2.5', '8.1', '9.7', '12.6', '8.3', '10.1', '16.1', '13.1', '1.4',
       '16.0', '10.7', '11.0', '7.7', '11.3', '8.8', '7.8', '3.3', '0.0',
       '0.9', '2.7', '-0.1', '1.7', '14.5', '-1.7', '0.3', '-2.0', '1.1',
       '12.4', '9.3', '8.5', '1.6', '8.4', '14.8', '20.6', '23.4', '11.8',
       '16.5', '8.7', '11.7', '24.4', '11.4', '11.1', '0.7', '12.2',
       '9.2', '8.6', '9.0', '-0.8', '0.8', '15.4', '9.4', '-0.2', '

In [61]:
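# Drop rows reported as "(NA)"; .copy() avoids a SettingWithCopyWarning when the column is converted later
gdppc = gdppc[~gdppc["2020-2023"].str.contains(r'\(NA\)', na=False)].copy()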

In [62]:
gdppc["2020-2023"].unique()

array(['5.6', '6.2', '3.9', '4.6', '6.0', '5.5', '7.6', '4.2', '6.4',
       '6.8', '4.8', '3.8', '5.4', '5.9', '5.3', '5.1', '4.4', '7.0',
       '6.1', '5.2', '4.5', '3.5', '5.0', '2.8', '7.1', '4.1', '5.7',
       '6.5', '5.8', '4.7', '4.0', '4.9', '2.3', '12.9', '6.3', '9.8',
       '3.6', '4.3', '8.9', '7.9', '12.5', '3.4', '9.9', '8.2', '6.7',
       '7.2', '14.0', '2.2', '7.3', '7.4', '6.6', '3.7', '8.0', '0.6',
       '1.8', '2.6', '1.3', '3.0', '2.4', '1.5', '2.1', '3.1', '2.0',
       '7.5', '1.9', '6.9', '3.2', '2.9', '10.6', '22.6', '10.4', '2.5',
       '8.1', '9.7', '12.6', '8.3', '10.1', '16.1', '13.1', '1.4', '16.0',
       '10.7', '11.0', '7.7', '11.3', '8.8', '7.8', '3.3', '0.0', '0.9',
       '2.7', '-0.1', '1.7', '14.5', '-1.7', '0.3', '-2.0', '1.1', '12.4',
       '9.3', '8.5', '1.6', '8.4', '14.8', '20.6', '23.4', '11.8', '16.5',
       '8.7', '11.7', '24.4', '11.4', '11.1', '0.7', '12.2', '9.2', '8.6',
       '9.0', '-0.8', '0.8', '15.4', '9.4', '-0.2', '0.2', '9

In [63]:
gdppc["2020-2023"].info()

<class 'pandas.core.series.Series'>
Index: 3114 entries, 0 to 3139
Series name: 2020-2023
Non-Null Count  Dtype 
--------------  ----- 
3114 non-null   object
dtypes: object(1)
memory usage: 48.7+ KB


In [64]:
gdppc["2020-2023"] = gdppc["2020-2023"].astype(float)

In [65]:
gdppc["2020-2023"].info()

<class 'pandas.core.series.Series'>
Index: 3114 entries, 0 to 3139
Series name: 2020-2023
Non-Null Count  Dtype  
--------------  -----  
3114 non-null   float64
dtypes: float64(1)
memory usage: 48.7 KB
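
The `(NA)` filter and the `astype(float)` conversion above could also be written as a single coercion step with `pd.to_numeric`, which turns any non-numeric string into a missing value. This is a minimal sketch on a toy frame with illustrative values, offered as an alternative rather than the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for the BEA table; "(NA)" marks a suppressed value.
raw = pd.DataFrame(
    {
        "GeoFips": [1001, 1003, 1005],
        "2020-2023": ["5.6", "(NA)", "-0.1"],
    }
)

clean = raw.copy()
# errors="coerce" converts anything that is not a number into NaN.
clean["2020-2023"] = pd.to_numeric(clean["2020-2023"], errors="coerce")
# Drop the rows that could not be converted.
clean = clean.dropna(subset=["2020-2023"])

print(clean)
print(clean["2020-2023"].dtype)
```

One trade-off of this approach is that it silently drops any unexpected string, not only `(NA)`, so it may be worth counting the dropped rows.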


In [66]:
gdppc.head()

Unnamed: 0,GeoFips,GeoName,2020-2023
0,1001,"Autauga, AL",5.6
1,1003,"Baldwin, AL",6.2
2,1005,"Barbour, AL",3.9
3,1007,"Bibb, AL",4.6
4,1009,"Blount, AL",6.0


In [67]:
gdppc.rename(columns={"2020-2023": "gdppc_20-23"}, inplace=True)

### Step 3 - Merge election and GDP per capita data

With the gdppc dataset cleaned, let's now merge with our dataset on the change in share of vote to Republicans.

In [68]:
rep_gdppc = pd.merge(merged_data, gdppc, left_on=['county_fips'], right_on=['GeoFips'])


In [69]:
rep_gdppc.head()

Unnamed: 0,county_fips,county_name,per_gop_2020,per_gop_2024,diff_gop,GeoFips,GeoName,gdppc_20-23
0,1001,Autauga County,0.714368,0.726643,0.012275,1001,"Autauga, AL",5.6
1,1003,Baldwin County,0.761714,0.78649,0.024776,1003,"Baldwin, AL",6.2
2,1005,Barbour County,0.534512,0.571165,0.036653,1005,"Barbour, AL",3.9
3,1007,Bibb County,0.784263,0.819393,0.035131,1007,"Bibb, AL",4.6
4,1009,Blount County,0.895716,0.901763,0.006047,1009,"Blount, AL",6.0
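
The regression in Step 5 is fitted on 3051 observations, while `merged_data` has 3103 rows, so the default inner join drops counties that have no growth value. The sketch below shows one way to count them, using a left join with `validate="one_to_one"`, which raises an error if a FIPS code is duplicated on either side. The toy frames and the FIPS code 9001 are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; 9001 has vote data but no growth figure.
votes = pd.DataFrame(
    {"county_fips": [1001, 1003, 9001], "diff_gop": [0.012, 0.025, 0.010]}
)
growth = pd.DataFrame(
    {"GeoFips": [1001, 1003, 1005], "gdppc_20-23": [5.6, 6.2, 3.9]}
)

# Left join keeps every county with vote data; validate checks key uniqueness.
joined = pd.merge(
    votes,
    growth,
    left_on="county_fips",
    right_on="GeoFips",
    how="left",
    validate="one_to_one",
)

# Counties with vote data but no matching growth figure.
unmatched = joined.loc[joined["gdppc_20-23"].isna(), "county_fips"].tolist()
print(unmatched)
```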


### Step 4 - Conduct data visualization 

Now we will perform comparative analysis, at a county-level, for our two variables of interest. First, we start off with a scatter plot to visualise the data.


In [70]:
px.scatter(
    rep_gdppc,
    x="gdppc_20-23",
    y="diff_gop",
    title="Change in Republican vote share (2020 to 2024) vs. GDP per capita growth (2020 to 2023)",
    hover_data = {"county_name": True},
)

There appears to be a weak negative relationship between growth and change in Republican vote share, but let's analyse this further. 

### Step 5 - Perform data analysis

Now we will assess the strength and significance of the relationship between gdppc growth and change in vote share for Republican candidates. We will fit a trend line to our previous scatter plot and then perform an OLS regression.


In [71]:
px.scatter(
    rep_gdppc,
    x="gdppc_20-23",
    y="diff_gop",
    title="Change in Republican vote share (2020 to 2024) vs. GDP per capita growth (2020 to 2023)",
    hover_data = {"county_name": True},
    trendline="ols",
)

From our trendline, we can see that there is indeed a weak negative relationship between gdppc growth and the change in proportion of votes towards Republican candidates. As the growth rate increases, Republican vote proportion decreases. However, this is a weak relationship, with a coefficient of -0.0006. Is this relationship significant? How do we interpret it? Let's perform an OLS regression.

In [72]:
df = rep_gdppc

X = df['gdppc_20-23']  # Independent variable: GDPPC change
y = df['diff_gop']     # Dependent variable: Change in GOP percentage of vote

# Add a constant to the independent variable for statsmodels
X_with_const = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X_with_const).fit()

# Print the regression summary
regression_summary = model.summary()

regression_summary

0,1,2,3
Dep. Variable:,diff_gop,R-squared:,0.006
Model:,OLS,Adj. R-squared:,0.006
Method:,Least Squares,F-statistic:,19.66
Date:,"Fri, 06 Dec 2024",Prob (F-statistic):,9.59e-06
Time:,14:36:27,Log-Likelihood:,8202.9
No. Observations:,3051,AIC:,-16400.0
Df Residuals:,3049,BIC:,-16390.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0225,0.001,30.153,0.000,0.021,0.024
gdppc_20-23,-0.0006,0.000,-4.434,0.000,-0.001,-0.000

0,1,2,3
Omnibus:,667.925,Durbin-Watson:,1.515
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3157.58
Skew:,0.973,Prob(JB):,0.0
Kurtosis:,7.588,Cond. No.,15.1


### Conclusion

From our regression, we can see a few interesting findings. Firstly, the intercept of the OLS regression (the coefficient 'const') is 0.0225. This can be interpreted as follows: if GDP per capita growth is 0, the predicted change in Republican vote share is an increase of 2.25 percentage points, from 2020 to 2024.

Now, let's look at the slope - the coefficient of gdppc growth on change in Republican vote share. We see that the weak negative coefficient of -0.0006 is indeed significant at p < 0.001. Because diff_gop is measured as a proportion, this coefficient corresponds to -0.06 percentage points: for every 1 percentage point increase in compound annual GDP per capita growth, the predicted change in the Republican vote share is about 0.06 percentage points lower. Given the statistical significance, this relationship is unlikely to be due to random chance.
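
The unit conversion behind this reading can be sketched as follows. It assumes, as the tables above show, that `diff_gop` is a proportion between 0 and 1 and that `gdppc_20-23` is in percent. The coefficients are the rounded values from the regression table, and the helper name `predicted_change_pp` is illustrative.

```python
# Rounded coefficients copied from the regression summary table.
intercept = 0.0225  # const
slope = -0.0006     # gdppc_20-23

# diff_gop is a proportion, so multiply by 100 to express it in percentage points.
slope_pp = slope * 100


def predicted_change_pp(growth_pct):
    """Predicted change in Republican vote share, in percentage points."""
    return (intercept + slope * growth_pct) * 100


print(round(slope_pp, 2))                 # slope in percentage points
print(round(predicted_change_pp(0), 2))   # prediction at zero growth
print(round(predicted_change_pp(5), 2))   # prediction at 5% annual growth
```

With these rounded inputs, a county at 5% annual growth is predicted to shift by about 1.95 percentage points, against 2.25 at zero growth.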

Finally, let's interpret the R-squared, of 0.006, which gives the 'model fit'. This can be interpreted as follows: only 0.6% of the variation in change in Republican vote share is explained by variation in GDP per capita growth rates. This means that, although significant, GDP per capita alone is not a strong predictor of the change in Republican vote share. 
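
As a consistency check, in a regression with a single predictor the R-squared can be recovered from the F-statistic and the residual degrees of freedom, and the F-statistic equals the square of the slope's t-statistic. The sketch below applies these identities to the rounded figures in the summary table, so small rounding differences are expected.

```python
# Values copied from the regression summary table.
f_stat = 19.66
df_resid = 3049
t_slope = -4.434

# With one predictor, F = (R^2 / (1 - R^2)) * df_resid, so R^2 = F / (F + df_resid).
r_squared = f_stat / (f_stat + df_resid)

# The implied correlation takes the sign of the slope.
implied_r = -(r_squared ** 0.5)

print(round(r_squared, 4))
print(round(implied_r, 3))
print(round(t_slope ** 2, 2))  # should be close to the F-statistic
```

The implied correlation of roughly -0.08 is another way of seeing that the relationship, while statistically significant, is weak.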

In conclusion, we can reject the null hypothesis and state that there is indeed a significant negative relationship between GDP per capita growth and change in Republican vote share. However, our model has weak explanatory power and so further research should include additional independent variables to understand their impact.
