# Week 11: Data Visualization and Bivariate Regression (Solutions)

This week we focus on visualizing relationships between two variables and then running regressions to express those relationships in numbers. Visualizing data well is an important skill: you will often present results to lay audiences who lack the expertise to interpret the numbers the way you do, and plots are often the best way to convey important results.

The R package we use to create plots is `ggplot2`, a versatile package for making many different types of plots. We focus on the scatterplot, to which we add a line of best fit. You have already seen other types of plots in previous lectures, such as histograms, which can also be created with `ggplot2`.

In [None]:
# Load the ggplot2 library and the data set

library(ggplot2)

data <- read.csv("ps3_w11.csv")
head(data)  # preview the first six rows

Here is a quick rundown of what each column means:

- `state`: State (e.g., for CA-13, "CA")
- `district`: District number (e.g., for CA-13, 13)
- `name_dem_cand`: Democratic candidate's name in the 2020 US House election
- `name_rep_cand`: Republican candidate's name in the 2020 US House election
- `dem_us_house_percent_2020`: Democratic candidate's vote share in the 2020 election (percent)
- `dem_us_house_percent_2018`: Democratic candidate's vote share in the 2018 election (percent)
- `dem_won_ushouse_2018`: Whether a Democrat won the US House election in 2018, and so is running for re-election in 2020 (0 = lost, 1 = won)
- `clinton_percent_2016`: Clinton vote share in 2016 in the district (percent)
- `spending_dem_ushouse_2020`: Democratic US House candidate's spending in 2020, in millions of dollars 
- `spending_rep_ushouse_2020`: Republican US House candidate's spending in 2020, in millions of dollars 

First, create a scatterplot to show the relationship between how much a Democratic candidate spent in 2020 (`spending_dem_ushouse_2020`) and their vote share in 2020 (`dem_us_house_percent_2020`). 

In [None]:
qplot(data = data, x = spending_dem_ushouse_2020, y = dem_us_house_percent_2020)
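
`qplot()` is a shortcut, and it has been marked as deprecated since ggplot2 3.4.0, so it is worth seeing the same scatterplot written with the full `ggplot()` syntax. The block below is a minimal sketch rather than the required solution; the `toy` data frame holds made-up values (not taken from `ps3_w11.csv`) so that the code runs on its own.

```r
library(ggplot2)

# Made-up values for illustration only; NOT taken from ps3_w11.csv.
toy <- data.frame(
  spending_dem_ushouse_2020 = c(0.5, 1.2, 2.0, 3.5, 5.0),
  dem_us_house_percent_2020 = c(42, 48, 51, 55, 60)
)

# ggplot() sets the data and the x/y mapping; geom_point() draws the points.
p <- ggplot(toy, aes(x = spending_dem_ushouse_2020,
                     y = dem_us_house_percent_2020)) +
  geom_point()
p
```

To reproduce the plot from the cell above, replace `toy` with `data`.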

Now on this plot, add a line of best fit that shows the relationship between the two variables.

In [None]:
# Scatterplot plus a line of best fit from a linear model (method = "lm")
qplot(data = data, x = spending_dem_ushouse_2020, y = dem_us_house_percent_2020) + 
  geom_smooth(method = "lm") + 
  # You don't need to know the lines below; they just make the plot prettier!
  theme_bw() + 
  labs(title = "Democratic Candidates' Spending and Vote Share, 2020 US House Elections", 
       x = "Democratic candidate's spending (millions of dollars)", 
       y = "Democratic candidate's vote share (percent)")

What is your plot telling you? Provide a brief interpretation of what you see visually in this plot.

The plot shows a positive relationship: Democratic candidates who spent more during the 2020 election tended to receive a larger vote share, as the upward-sloping line of best fit indicates.

Now let's see what we get if we run a regression with a Democratic candidate's vote share in 2020 as the outcome and the Democratic candidate's expenditure in 2020 as the predictor.

In [None]:
summary(lm(data = data, formula = dem_us_house_percent_2020 ~ spending_dem_ushouse_2020)) 

The estimated slope coefficient on `spending_dem_ushouse_2020` is $1.0491$. How do we interpret this number? Remember, this is not a causal relationship, just an association.

A $1 million increase in the Democratic candidate's spending during the 2020 election was associated with a 1.05 percentage point increase in their vote share.
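
To see why the slope reads as a per-unit change, write the fitted line as $\hat{y} = b + mx$ (notation assumed here: $b$ is the intercept, $m$ is the slope, $x$ is spending in millions of dollars, and $\hat{y}$ is the predicted vote share in percent). Comparing the predictions at $x$ and $x + 1$ gives:

$$\hat{y}(x+1) - \hat{y}(x) = \bigl[b + m(x+1)\bigr] - \bigl[b + mx\bigr] = m = 1.0491$$

The intercept cancels, so the difference is the slope alone, measured in percentage points per additional million dollars.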

If I asked you to predict the Democratic candidate's vote share if they spent $9.8 million, how would you get that number? Remember that the formula for a line is y = mx + b, where m is the slope and b is the y-intercept.

In [None]:
# Intercept (47.1213) and slope (1.0491) are read from the regression summary above
pred.vote.share <- 47.1213 + 1.0491 * 9.8
pred.vote.share
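
Rather than retyping rounded coefficients, the same kind of prediction can be computed from the fitted model object. The following is a minimal sketch using base R's `coef()` and `predict()`; the `toy` data frame holds made-up values (not taken from `ps3_w11.csv`) so that the block runs on its own, and the prediction point of 3.0 is chosen only because it lies inside the toy data.

```r
# Made-up values for illustration only; NOT taken from ps3_w11.csv.
toy <- data.frame(
  spending_dem_ushouse_2020 = c(0.5, 1.2, 2.0, 3.5, 5.0),
  dem_us_house_percent_2020 = c(42, 48, 51, 55, 60)
)

fit <- lm(dem_us_house_percent_2020 ~ spending_dem_ushouse_2020, data = toy)

# By hand: intercept + slope * x, using the stored (unrounded) coefficients.
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["spending_dem_ushouse_2020"]]
by_hand <- b + m * 3.0

# Same prediction via predict(); newdata must use the predictor's column name.
via_predict <- predict(fit, newdata = data.frame(spending_dem_ushouse_2020 = 3.0))

# Both routes should agree.
all.equal(by_hand, unname(via_predict))
```

With the notebook's data, the analogous call would fit the model on `data` and pass `spending_dem_ushouse_2020 = 9.8` in `newdata`.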

What about if they spent $45 million? Would this be a reasonable prediction? Why or why not?

In [None]:
# Same intercept and slope as before, evaluated at 45 (million dollars)
extrap.vote.share <- 47.1213 + 1.0491 * 45
extrap.vote.share

This would not be a reasonable prediction. The formula returns a vote share of about 94.3 percent, but our data contain no Democratic candidate who spent $45 million, so we have no evidence that the linear relationship holds at that level of spending. We would be extrapolating beyond the range of our data, which is generally not good practice.
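
One way to check for extrapolation before trusting a prediction is to compare the new predictor value with the range of observed values. The sketch below assumes a hypothetical helper, `is_extrapolation()`, which is not part of base R or ggplot2; the spending values are made up for illustration and are not taken from `ps3_w11.csv`.

```r
# Hypothetical helper: TRUE when x_new lies outside the observed range.
is_extrapolation <- function(x_new, x_observed) {
  observed_range <- range(x_observed, na.rm = TRUE)
  x_new < observed_range[1] | x_new > observed_range[2]
}

# Made-up spending values (millions of dollars), for illustration only.
toy_spending <- c(0.5, 1.2, 2.0, 3.5, 5.0)

is_extrapolation(3.0, toy_spending)  # inside the observed range
is_extrapolation(45, toy_spending)   # outside the observed range
```

With the notebook's data, the equivalent check would be `is_extrapolation(45, data$spending_dem_ushouse_2020)`.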