### Influence of Ethnicity Populations on Estimated Vaccine Hesitancy in American Counties

In [1]:
%load_ext nb_black

Our goal for this notebook is to build a model that predicts the estimated vaccine hesitancy of a county from the ethnic/racial makeup of that county. The model takes the following form:
$y_{esthes} = \beta_0 + \beta_1x_{nhamer}+ \beta_2x_{asian} + \beta_3x_{hispan} + \beta_4x_{black} + \beta_5x_{natam} + \beta_6x_{white}$

We hope that this model lets us examine the relationship between multiple ethnicities and vaccine hesitancy, and evaluate whether certain communities are more hesitant than others.

In [2]:
import requests
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [3]:
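# Request hesitancy data from the CDC API (up to 4000 rows)
hesitancy_endpoint = "https://data.cdc.gov/resource/q9mh-h2tw.json?$limit=4000"
r = requests.get(hesitancy_endpoint, timeout=60)
r.raise_for_status()  # Stop early if the request failed
hesitancy_df = pd.DataFrame(r.json())
# hesitancy_df.head()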

In [4]:
# Load political party data from our public GitHub repository
url = "https://raw.githubusercontent.com/gspiga/STAT141Bfinal/main/data/raw_data.csv"
poli_df = pd.read_csv(url)
poli_df.head()

| Unnamed: 0 | Location | Governor Political Affiliation | State Senate Majority Political Affiliation | State House Majority Political Affiliation | State Attorney General Political Affiliation | State Insurance Commissioner Political Affiliation | Footnotes |
|---|---|---|---|---|---|---|---|
| 0 | Alabama | Republican | Republican | Republican | Republican | Appointed | |
| 1 | Alaska | Republican | Republican | Republican | Appointed | Appointed | |
| 2 | Arizona | Republican | Republican | Republican | Republican | Appointed | |
| 3 | Arkansas | Republican | Republican | Republican | Republican | Appointed | |
| 4 | California | Democrat | Democrat | Democrat | Democrat | Democrat | |


In [5]:
# Uppercase the state names in poli_df so they match hesitancy_df (pandas joins are case sensitive).
poli_df["Location"] = poli_df["Location"].str.upper()
poli_df.head()

| Unnamed: 0 | Location | Governor Political Affiliation | State Senate Majority Political Affiliation | State House Majority Political Affiliation | State Attorney General Political Affiliation | State Insurance Commissioner Political Affiliation | Footnotes |
|---|---|---|---|---|---|---|---|
| 0 | ALABAMA | Republican | Republican | Republican | Republican | Appointed | |
| 1 | ALASKA | Republican | Republican | Republican | Appointed | Appointed | |
| 2 | ARIZONA | Republican | Republican | Republican | Republican | Appointed | |
| 3 | ARKANSAS | Republican | Republican | Republican | Republican | Appointed | |
| 4 | CALIFORNIA | Democrat | Democrat | Democrat | Democrat | Democrat | |
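
Because the join matches state names as exact strings, it can be worth confirming that every row finds a partner. The following is a minimal sketch on made-up data: `left` and `right` are hypothetical stand-ins for `hesitancy_df` and `poli_df`, and the `Governor` column and its values are invented for illustration. It uses the `indicator=True` option of `merge` to flag rows that fail to match.

```python
import pandas as pd

# Toy stand-ins for hesitancy_df and poli_df (all values are made up)
left = pd.DataFrame({"state": ["ALABAMA", "ALASKA", "PUERTO RICO"]})
right = pd.DataFrame({"Location": ["Alabama", "Alaska"], "Governor": ["R", "R"]})

# Without normalizing case, an inner join finds no matching rows
unmatched_case = left.merge(right, left_on="state", right_on="Location")

# After uppercasing, matching rows join; indicator=True labels the rest
right["Location"] = right["Location"].str.upper()
checked = left.merge(
    right, left_on="state", right_on="Location", how="left", indicator=True
)

# Rows present only on the left side had no partner in the right frame
missing = checked.loc[checked["_merge"] == "left_only", "state"].tolist()
print(len(unmatched_case), missing)
```

With the default inner join, rows like the unmatched one here would be dropped silently, so a check of this kind shows how many counties are lost in the merge.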


In [6]:
# Join the hesitancy data to the political data on state name
polihes_df = hesitancy_df.merge(poli_df, left_on="state", right_on="Location")

# The API returns decimal values as strings, so convert them to numeric
cols = [
    "estimated_hesitant",
    "estimated_hesitant_or_unsure",
    "estimated_strongly_hesitant",
    "social_vulnerability_index",
    "ability_to_handle_a_covid",
    "percent_hispanic",
    "percent_non_hispanic_american",
    "percent_non_hispanic_asian",
    "percent_non_hispanic_black",
    "percent_non_hispanic_native",
    "percent_non_hispanic_white",
]
# Convert column by column; values that cannot be parsed become NaN
polihes_df[cols] = polihes_df[cols].apply(pd.to_numeric, errors="coerce")
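
Since `errors="coerce"` turns unparseable values into `NaN` without raising, it may help to count how many values were affected before fitting a model. This is a small sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the CDC data):

```python
import pandas as pd

# Toy frame mimicking JSON output, where numbers arrive as strings (made-up values)
toy = pd.DataFrame(
    {
        "estimated_hesitant": ["0.18", "0.21", "N/A"],
        "percent_hispanic": ["0.05", "unknown", "0.30"],
    }
)

# Column-wise conversion; anything that cannot be parsed becomes NaN
converted = toy.apply(pd.to_numeric, errors="coerce")

# Count the values that were turned into NaN in each column
n_coerced = converted.isna().sum()
print(n_coerced)
```

Running the same count on `polihes_df[cols]` would show whether any counties carry missing values into the regression.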

We will now fit a linear regression model that predicts estimated hesitancy from the six ethnicity/race percentage columns.

In [7]:
X_eth = polihes_df[
    [
        "percent_non_hispanic_american",
        "percent_non_hispanic_asian",
        "percent_hispanic",
        "percent_non_hispanic_black",
        "percent_non_hispanic_native",
        "percent_non_hispanic_white",
    ]
]
X_eth = sm.add_constant(X_eth)
Y = polihes_df["estimated_hesitant"]

model = sm.OLS(Y, X_eth).fit()
predictions = model.predict(X_eth)

model.summary()

| Dep. Variable: | estimated_hesitant | R-squared: | 0.165 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.164 |
| Method: | Least Squares | F-statistic: | 103.5 |
| Date: | Wed, 08 Dec 2021 | Prob (F-statistic): | 2.86e-119 |
| Time: | 23:53:24 | Log-Likelihood: | 5477.9 |
| No. Observations: | 3142 | AIC: | -10940.0 |
| Df Residuals: | 3135 | BIC: | -10900.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3897 | 0.051 | 7.702 | 0.000 | 0.290 | 0.489 |
| percent_non_hispanic_american | -0.1629 | 0.055 | -2.962 | 0.003 | -0.271 | -0.055 |
| percent_non_hispanic_asian | -0.7552 | 0.067 | -11.246 | 0.000 | -0.887 | -0.624 |
| percent_hispanic | -0.3045 | 0.051 | -5.944 | 0.000 | -0.405 | -0.204 |
| percent_non_hispanic_black | -0.2031 | 0.052 | -3.942 | 0.000 | -0.304 | -0.102 |
| percent_non_hispanic_native | -0.0342 | 0.163 | -0.210 | 0.833 | -0.353 | 0.285 |
| percent_non_hispanic_white | -0.2583 | 0.052 | -5.011 | 0.000 | -0.359 | -0.157 |

| Omnibus: | 113.032 | Durbin-Watson: | 0.296 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 152.412 |
| Skew: | 0.377 | Prob(JB): | 8.02e-34 |
| Kurtosis: | 3.772 | Cond. No. | 312.0 |


In [8]:
# Y_sh = polihes_df["estimated_strongly_hesitant"]

# model2 = sm.OLS(Y_sh, X_eth).fit()
# predictions = model2.predict(X_eth)

# model2.summary()

From the model summary, the fitted equation for predicting hesitancy is:

$y_{esthes} = 0.3897 - 0.1629x_{nhamer} - 0.7552x_{asian} - 0.3045x_{hispan} - 0.2031x_{black} - 0.0342x_{natam} - 0.2583x_{white}$

The ethnicity variable with the largest coefficient in magnitude is the non-Hispanic Asian share, at -0.7552. Every coefficient in this model is negative, meaning that, holding the other shares fixed, an increase in the share of any listed ethnicity/race is associated with lower estimated hesitancy. Of the groups listed, only one is not statistically significant in our model: `percent_non_hispanic_native` (p = 0.833). This is probably due to the extremely low population of this group in most counties.
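
To make the coefficients concrete, the sketch below plugs a hypothetical county into the fitted equation. The coefficients are copied from the model summary; the county's shares are invented, and the code assumes the percentage columns are stored as proportions between 0 and 1, which is not verified here. Because the shares of a county are bounded together, raising one group's share usually means lowering another's, so the example also shifts 10 percentage points from one group to another.

```python
# Coefficients copied from the model summary
coef = {
    "const": 0.3897,
    "percent_non_hispanic_american": -0.1629,
    "percent_non_hispanic_asian": -0.7552,
    "percent_hispanic": -0.3045,
    "percent_non_hispanic_black": -0.2031,
    "percent_non_hispanic_native": -0.0342,
    "percent_non_hispanic_white": -0.2583,
}

# Hypothetical county (made-up shares, assumed to be proportions in [0, 1])
county = {
    "percent_non_hispanic_american": 0.01,
    "percent_non_hispanic_asian": 0.05,
    "percent_hispanic": 0.15,
    "percent_non_hispanic_black": 0.10,
    "percent_non_hispanic_native": 0.00,
    "percent_non_hispanic_white": 0.65,
}


def predict_hesitancy(shares):
    """Evaluate the fitted linear equation for one county."""
    return coef["const"] + sum(coef[name] * value for name, value in shares.items())


base = predict_hesitancy(county)

# Move 10 percentage points from the white share to the Asian share
shifted = dict(
    county,
    percent_non_hispanic_asian=county["percent_non_hispanic_asian"] + 0.10,
    percent_non_hispanic_white=county["percent_non_hispanic_white"] - 0.10,
)
change = predict_hesitancy(shifted) - base
print(round(base, 4), round(change, 4))
```

The predicted change is 0.10 × (-0.7552 + 0.2583) ≈ -0.0497, so under this kind of shift the net effect depends on the difference between two coefficients rather than on a single coefficient.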

If we look at the groups with smaller coefficients, such as `percent_non_hispanic_american` (-0.1629) and non-Hispanic Black (-0.2031), we can see that these communities have a weaker association with hesitancy. So while an increase in either share is associated with lower hesitancy overall, the decrease is much slower than for groups such as Asian and Hispanic. This leads us to ask whether larger proportions of these two populations tend to be hesitant.

Another factor to note in the model is the adjusted R-squared value, which is only 0.164. This tells us that ethnic makeup explains only about 16% of the variation in estimated vaccine hesitancy across counties, so its explanatory power is low.
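
For reference, the adjusted R-squared can be recovered from the other quantities in the summary with the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. A minimal sketch using the rounded values printed in the summary:

```python
# Values read from the model summary (R-squared is already rounded there)
r_squared = 0.165
n_obs = 3142
n_predictors = 6  # Df Model

# Standard adjustment for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(adj_r_squared)
```

The result differs slightly from the reported 0.164 because the input R-squared is rounded to three decimals. With over 3,000 counties and only six predictors, the adjustment is very small, so the low value reflects the model fit rather than a penalty for model size.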


In conclusion, the ethnic makeup of a population can be used to predict the estimated hesitancy of a county. However, we should not expect ethnicity to have high explanatory power for why a given county shows a certain level of hesitancy.