# POLSCI 3 Spring 2024 

## Week 11 Bivariate and Multivariate Regression (I)

We will look at geographical data in the US. Let's see what variables predict county-level votes for the 2020 US presidential election.

In [None]:
usvote <- readRDS("Data/project_24S/jw_us.RData")
usvote <- usvote[c("county_fips","county name","state","vshare_d20","vshare_r20","vshare_d16","vshare_r16","ineq_15","frac_religious","college_15","hispanic_15","fdi_job","manuf_empl_sh_chg_0015","imm_share_rich_2016","imm_share_high_2016","imm_share_low_2016","pop2000")]
head(usvote)

# The data are compiled from a variety of sources, including B. Enke (2020); immigration variables are from Mayda et al., trade and industry-share variables are from Autor et al. (2020), and FDI is from the Bureau of Labor Statistics.

The variables include a wide range of economic, demographic, and social measures of US counties.
For example,

Potential independent variables:
- `ineq_15` is county-level income inequality in 2015
- `frac_religious` is the proportion of the population that is religious
- `college_15` is the share of college-educated citizens in 2015
- `hispanic_15` is the share of Hispanic citizens in 2015
- `fdi_job` is the share of employment at foreign-owned companies in a county
- `manuf_empl_sh_chg_0015` is the change in manufacturing employment share between 2000 and 2015, which captures the decline in the manufacturing sector in various US regions, such as the manufacturing belt in Illinois, Pennsylvania, Michigan, and Wisconsin
- `imm_share_rich_2016` is the population share of high-income immigrants in 2016
- `imm_share_high_2016` is the population share of high-skilled (college-educated) immigrants in 2016
- `imm_share_low_2016` is the population share of low-skilled immigrants in 2016

Potential dependent variables (the voting-related variables):
- `vshare_d20` indicates the vote share for Joe Biden in the 2020 election,
- `vshare_r20` indicates the vote share for Donald Trump in the 2020 election,
- `vshare_d16` indicates the vote share for Hillary Clinton in 2016,
- `vshare_r16` indicates the vote share for Trump in 2016,
and so on.

Weights:

- For the analysis, we may want to upweight US counties with larger populations, so that the results are more representative of the entire US population. To do so, pass `weights=pop2000` to `lm()`.
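To see what weighting does, here is a minimal sketch on a toy data frame. The numbers and the column names `vshare`, `college`, and `pop` are made up for illustration and are not taken from `usvote`. It fits the same model with and without the `weights` argument of `lm()`.

```r
# Toy data: made-up numbers for illustration only, NOT from the course dataset
toy <- data.frame(
  vshare  = c(0.30, 0.45, 0.60, 0.80),   # hypothetical vote shares
  college = c(15, 25, 35, 50),           # hypothetical college shares
  pop     = c(1000, 5000, 20000, 100000) # hypothetical populations
)

# Every county counts equally
unweighted <- lm(vshare ~ college, data = toy)

# Counties with larger populations count more
weighted <- lm(vshare ~ college, data = toy, weights = pop)

coef(unweighted)
coef(weighted)
```

The two sets of coefficients differ because the weighted fit is pulled toward the most populous rows.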

## **Bivariate Regression**

Take a minute to look at the data. Try different variables. What variable would have a statistically significant positive relationship with vote share for Joe Biden?

Think for 5 minutes and write which variable you expect to have a positive or negative relationship with vote share and why.

In [None]:
#I expect:  WRITE HERE
#Because:   WRITE HERE

Now test your expectation.

In [None]:
#try different variables! 


<font color=white>*sample code: summary(lm(data= usvote, vshare_d20 ~ college_15, weights=pop2000))*</font> 


**Interpreting regression models**

Using your results, try to format it in the standard way of interpreting linear regression:

<font color=blue>*On average, a one-unit increase in the `independent variable` is associated with a `0000` increase in the `dependent variable`.*</font> 


(Replace the blanks with actual variable names and coefficients.)
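As a hedged illustration of filling in the template, the sketch below fits a model on made-up toy data (the columns `x` and `y` are hypothetical) and pastes the estimated slope into the sentence. With your own model, you would swap in your variable names and coefficient.

```r
# Toy data: made-up numbers for illustration only
toy <- data.frame(
  y = c(0.30, 0.45, 0.60, 0.80),
  x = c(15, 25, 35, 50)
)

fit <- lm(y ~ x, data = toy)
slope <- unname(coef(fit)["x"])   # coefficient on the independent variable

# Say "increase" for a positive slope and "decrease" for a negative one
direction <- if (slope >= 0) "increase" else "decrease"

sentence <- sprintf(
  "On average, a one-unit increase in %s is associated with a %.3f %s in %s.",
  "x", abs(slope), direction, "y"
)
cat(sentence, "\n")
```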

In [None]:
# Your response:   WRITE HERE


**Application to Term Projects**

On your term projects, write the code that you would run for the analysis, and provide a statement interpreting the result (e.g., "a one-unit increase in ... is associated with ...").
- This will usually be `difference_in_means()`, but it can also be `summary(lm())`.
- If you can, and if it is applicable, create a plot with `qplot() + geom_smooth()`.
- Project notebook can be found here:
https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdbroockman%2FPS3-SP24-Public&branch=main&urlpath=tree%2FPS3-SP24-Public%2FFinal+Project%2FPS3_Final_Project.ipynb
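The two options in the list are closely related. As a sketch under stated assumptions, the example below uses base R only (not `difference_in_means()`) and a made-up toy data frame with hypothetical columns `treated` and `outcome`. It shows that when the independent variable is a 0/1 indicator, the `lm()` slope equals the difference in group means.

```r
# Toy data: made-up numbers for illustration only
toy <- data.frame(
  treated = c(0, 0, 0, 1, 1, 1),   # hypothetical 0/1 group indicator
  outcome = c(2, 4, 3, 6, 7, 8)    # hypothetical outcome
)

# Difference in means computed by hand
diff_means <- mean(toy$outcome[toy$treated == 1]) -
  mean(toy$outcome[toy$treated == 0])

# Same quantity as the slope of a bivariate regression
fit <- lm(outcome ~ treated, data = toy)

diff_means
coef(fit)["treated"]
```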

## Multivariate Regression 


You may have noticed that the predicted Y and the actual Y can differ in the bivariate regression above.

How do we make the prediction better? 
Multivariate regression simply means adding one or more independent variables to the lm() formula. Now let's add one or more variables!

**Is Bivariate Regression accurate in predicting the outcome variable?**

Recall that the line of best fit is the line that minimizes the sum of squared residuals, where a residual is the difference between the actual Y and the predicted Y. Can we make this difference even smaller?
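A small sketch of what "line of best fit" means, using made-up toy data (the columns `x` and `y` and the helper `ssr()` are hypothetical names introduced for illustration). It compares the sum of squared residuals of the fitted line with that of two slightly different lines.

```r
# Toy data: made-up numbers for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)
b <- unname(coef(fit)[1])   # intercept
m <- unname(coef(fit)[2])   # slope

# Hypothetical helper: sum of squared residuals for any candidate line
ssr <- function(intercept, slope) {
  sum((toy$y - (intercept + slope * toy$x))^2)
}

ssr_ols     <- ssr(b, m)         # the fitted line
ssr_steeper <- ssr(b, m + 0.1)   # same intercept, steeper slope
ssr_shifted <- ssr(b + 0.1, m)   # same slope, shifted up

c(ols = ssr_ols, steeper = ssr_steeper, shifted = ssr_shifted)
```

Any other line through the same data gives a larger sum of squared residuals than the one `lm()` returns.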

Let's look at just one example: Alameda County

In [None]:
#This subset contains Alameda County, CA (06001) plus two comparison counties: Wayne County, MI (26163) and Calhoun County, AL (01015). If you are interested in another county, search for its FIPS code and add it to the subset.
alameda <- subset(usvote, county_fips=='06001'|county_fips=='26163'|county_fips=='01015')
alameda
alameda$vshare_d20
alameda$imm_share_low_2016
alameda$imm_share_high_2016

In [None]:
#what is the vote support for Biden in 2020?
0.802016

In [None]:
#What vote share does your model predict?  (hint: coefficient*actual data + intercept = predicted Y)
0.139 + 0.009*44.7

The difference between the actual Y and the predicted Y for a bivariate case is quite large! 

Let's do the same thing for the two other counties.

In [None]:
mod <- lm(data=usvote, formula= vshare_d20 ~ college_15) # bivariate linear regression model
coef(mod) # intercept (first element) and slope on college_15 (second element)
predicted_y <- alameda$college_15*coef(mod)[2] + coef(mod)[1]  # y = mx + b for each county in the subset
predicted_y

actual_y <- alameda$vshare_d20  # the actual outcome variable
actual_y


average_resid1 <- mean(predicted_y - actual_y) # average of (predicted - actual) across the counties
average_resid1
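The hand calculation `coefficient * x + intercept` is the same computation that R's built-in `predict()` performs. The sketch below checks this on made-up toy data (the columns `x` and `y` and the values in `new_rows` are hypothetical). With the course data, the analogous call would pass the fitted model and the county subset as `newdata`.

```r
# Toy data: made-up numbers for illustration only
toy <- data.frame(
  y = c(0.30, 0.45, 0.60, 0.80),
  x = c(15, 25, 35, 50)
)
fit <- lm(y ~ x, data = toy)

# Hypothetical new observations to predict for
new_rows <- data.frame(x = c(20, 40))

# Prediction by hand: intercept + slope * x
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * new_rows$x)

# Prediction with predict()
by_predict <- unname(predict(fit, newdata = new_rows))

by_hand
by_predict
```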

Now let's see the residuals for a multivariate model! 

In [None]:
mod2 <- lm(data=usvote, formula=vshare_d20 ~ college_15 + imm_share_high_2016) #this model adds 'imm_share_high_2016'
summary(mod2)
mod2$coef

In [None]:
predicted_y_2 <- alameda$college_15*mod2$coef[2] + alameda$imm_share_high_2016*mod2$coef[3] + mod2$coef[1]
predicted_y_2
actual_y_2 <- alameda$vshare_d20
actual_y_2

average_resid2<- mean(predicted_y_2 - actual_y_2)
average_resid2

**Improving Prediction**

Compare `average_resid1` and `average_resid2`.


Which model has the smaller average residual? Do you think bivariate or multivariate models are better at predicting?

In [None]:
average_resid1
average_resid2
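One caveat when comparing averages of signed differences: positive and negative errors can cancel each other out. The sketch below uses made-up numbers (not model output) to contrast the mean signed error with the mean absolute error, which is one common alternative.

```r
# Made-up numbers for illustration only
actual      <- c(0.80, 0.68, 0.30)
predicted_a <- c(0.60, 0.88, 0.30)   # large misses in opposite directions
predicted_b <- c(0.75, 0.63, 0.25)   # small misses, all in the same direction

# Mean signed error: opposite-signed misses cancel
mean_signed_a <- mean(actual - predicted_a)
mean_signed_b <- mean(actual - predicted_b)

# Mean absolute error: every miss counts, whatever its sign
mae_a <- mean(abs(actual - predicted_a))
mae_b <- mean(abs(actual - predicted_b))

c(signed_a = mean_signed_a, signed_b = mean_signed_b)
c(mae_a = mae_a, mae_b = mae_b)
```

Here the first set of predictions looks better on the signed average but worse on the absolute average, so it can help to check both.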

In [None]:
# Your Answer: WRITE HERE

More in Week 12: **Reducing omitted variable bias**

