# W12: Multivariate Regression (Solutions)

This week we will focus on two uses of multivariate regression: observational analysis and experimental analysis. We will use a dataset from a study I recently conducted on peer-to-peer correction of online misinformation (Yadav and Xu, working paper). 

The study examined whether social norm nudges can increase peer-to-peer correction of online misinformation. Two types of nudges were used: one that emphasized the acceptability of correction and one that emphasized users' responsibility to correct. In the experiment, respondents were assigned to one of three treatment conditions: a control with no nudge, an acceptability nudge, or a responsibility nudge. These nudges were embedded in social media posts that contained misinformation about climate change or microwaving a penny. We additionally collected covariates such as gender, age, and race.

We will use a reduced set of covariates from the experimental data. Some details about these covariates are included below: 
- `age`: Age of the respondent (numeric variable)
- `gender`: Gender of the respondent (Male, Female)
- `employment`: 1 if the respondent is employed and 0 if they are not (binary variable)
- `marital_status`: 1 if the respondent is married and 0 if not (binary variable)
- `treatment`: the treatment condition to which the respondent was assigned
    - "control": assigned to social media posts that have no nudges
    - "acceptability": assigned to social media posts that have the acceptability nudge
    - "responsibility": assigned to social media posts that have the responsibility nudge
- `correction`: the outcome variable that is 1 if the respondent corrected at least one of the social media posts and 0 if they corrected neither.


In [9]:
#load library and data 
library(estimatr)

data <- read.csv("ps3_w12.csv")
head(data)

| | X | age | gender | employment | marital_status | treatment | correction |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<int>` |
| 1 | 1 | 40 | Male | 1 | 0 | acceptability | 0 |
| 2 | 2 | 79 | Male | 0 | 1 | acceptability | 1 |
| 3 | 3 | 36 | Female | 1 | 0 | control | 1 |
| 4 | 4 | 22 | Female | 1 | 0 | acceptability | 0 |
| 5 | 5 | 34 | Female | 1 | 0 | acceptability | 0 |
| 6 | 6 | 71 | Female | 0 | 0 | acceptability | 1 |


## Observational Analysis 

First, run a multivariate regression to test the association between age, gender, and correction of misinformation. This means you regress `correction` (outcome) on `age` and `gender` (predictors). Then answer the following questions: 

- What is the baseline condition for `gender`?
- How do we interpret the coefficients for gender and age?
- Are these relationships causal?
- Could there be omitted variables, not included here, that would affect someone's likelihood of correcting misinformation?

In [10]:
summary(lm(data = data, formula = correction ~ gender + age))


Call:
lm(formula = correction ~ gender + age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5301 -0.4795 -0.4342  0.5153  0.5862 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.5499910  0.0392802  14.002   <2e-16 ***
genderMale   0.0082930  0.0258420   0.321    0.748    
age         -0.0015653  0.0007537  -2.077    0.038 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4993 on 1507 degrees of freedom
Multiple R-squared:  0.002862,	Adjusted R-squared:  0.001538 
F-statistic: 2.162 on 2 and 1507 DF,  p-value: 0.1154
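
Because `correction` is binary, this regression is a linear probability model, so each coefficient can be read as a change in the predicted probability of correcting. The sketch below plugs the estimates printed above into the fitted equation by hand. The helper `predict_correction()` and the example ages are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the regression output above
b_intercept <- 0.5499910   # fitted value for a Female respondent at age 0
b_male      <- 0.0082930   # Male minus Female (Female is the baseline), holding age fixed
b_age       <- -0.0015653  # change per additional year of age, holding gender fixed

# Hypothetical helper: fitted value from the linear probability model
predict_correction <- function(age, male) {
  b_intercept + b_male * male + b_age * age
}

p_female_40 <- predict_correction(age = 40, male = 0)
p_female_50 <- predict_correction(age = 50, male = 0)
p_male_40   <- predict_correction(age = 40, male = 1)

round(p_female_40, 4)                # 0.4874
round(p_female_50 - p_female_40, 4)  # -0.0157
round(p_male_40 - p_female_40, 4)    # 0.0083
```

Ten additional years of age are associated with a predicted probability of correcting that is about 1.6 percentage points lower. These are associations rather than causal effects, since neither age nor gender was randomly assigned.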


## Experimental Analysis 

First, run a bivariate regression, regressing `correction` (outcome) on `treatment`. 

- What is the baseline condition for `treatment`?
- How do you interpret the coefficients in your regression? 

In [12]:
# Setup cell: you do not need to modify this code, but make sure you run it.
# It sets "control" as the baseline (first) level of `treatment`.
library(tidyverse)

data <- data %>%
  mutate(
    treatment = factor(treatment),
    treatment = fct_relevel(treatment, c("control", "acceptability", "responsibility"))
  ) %>%
  arrange(treatment)
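
The cell above matters because R orders factor levels alphabetically by default, which would make "acceptability" rather than "control" the baseline in a regression. A minimal base R sketch of the same idea, using a small hypothetical vector `toy_treatment` in place of the real data:

```r
# Hypothetical stand-in for the treatment column
toy_treatment <- c("control", "responsibility", "acceptability", "control")

# Default: levels are sorted alphabetically, so "acceptability" comes first
default_factor <- factor(toy_treatment)
levels(default_factor)

# Explicit level order: "control" becomes the first (baseline) level
releveled_factor <- factor(
  toy_treatment,
  levels = c("control", "acceptability", "responsibility")
)
levels(releveled_factor)

# Base R alternative that only moves the reference level to the front
levels(relevel(default_factor, ref = "control"))
```

Whichever level comes first is the one `lm()` omits, so every treatment coefficient is then a comparison against that level.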

In [13]:
summary(lm(data = data, formula = correction ~ treatment))


Call:
lm(formula = correction ~ treatment, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5029 -0.4818 -0.4503  0.5182  0.5497 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.48178    0.02248  21.434   <2e-16 ***
treatmentacceptability   0.02109    0.03134   0.673    0.501    
treatmentresponsibility -0.03148    0.03180  -0.990    0.322    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4996 on 1507 degrees of freedom
Multiple R-squared:  0.001878,	Adjusted R-squared:  0.0005533 
F-statistic: 1.418 on 2 and 1507 DF,  p-value: 0.2426
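
With a single categorical predictor, the intercept is the mean of the outcome in the baseline group, and each coefficient is the difference between that group's mean and the baseline mean. The sketch below checks this on a small hypothetical data frame `toy`, not the study data.

```r
# Hypothetical toy data: 4 respondents per condition
toy <- data.frame(
  treatment = factor(
    rep(c("control", "acceptability", "responsibility"), each = 4),
    levels = c("control", "acceptability", "responsibility")
  ),
  correction = c(1, 0, 1, 0,   # control: mean 0.50
                 1, 1, 1, 0,   # acceptability: mean 0.75
                 0, 0, 1, 0)   # responsibility: mean 0.25
)

fit <- lm(correction ~ treatment, data = toy)
group_means <- tapply(toy$correction, toy$treatment, mean)

# Intercept equals the control mean
coef(fit)[["(Intercept)"]]
group_means[["control"]]

# Each coefficient equals that group's mean minus the control mean
coef(fit)[["treatmentacceptability"]]
group_means[["acceptability"]] - group_means[["control"]]

coef(fit)[["treatmentresponsibility"]]
group_means[["responsibility"]] - group_means[["control"]]
```

Applied to the output above, the implied group means are 0.48178 for control, 0.48178 + 0.02109 = 0.50287 for acceptability, and 0.48178 - 0.03148 = 0.45030 for responsibility.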


Now, confirm the regression results by calculating the difference in means of the outcome (`correction`) between (1) the control and acceptability conditions, and (2) the control and responsibility conditions.

In [8]:
#compare control to acceptability
difference_in_means(correction ~ treatment, data = data, condition1 = "control", condition2 = "acceptability")

#compare control to responsibility 
difference_in_means(correction ~ treatment, data = data, condition1 = "control", condition2 = "responsibility")

Design:  Standard 
                         Estimate Std. Error   t value  Pr(>|t|)    CI Lower
treatmentacceptability 0.02108669 0.03139009 0.6717628 0.5018881 -0.04051043
                         CI Upper       DF
treatmentacceptability 0.08268382 1011.768

Design:  Standard 
                           Estimate Std. Error    t value  Pr(>|t|)    CI Lower
treatmentresponsibility -0.03147712 0.03177321 -0.9906812 0.3220846 -0.09382807
                          CI Upper       DF
treatmentresponsibility 0.03087384 984.9949
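
The estimates match the regression coefficients up to rounding. The standard errors differ slightly (for example, 0.03139 here versus 0.03134 in the regression), which is consistent with `difference_in_means` not assuming equal variances across groups, as the non-integer degrees of freedom suggest. The confidence interval can be reconstructed from the printed estimate, standard error, and degrees of freedom. A short sketch using only numbers copied from the output above:

```r
# Values copied from the acceptability vs. control output above
estimate  <- 0.02108669
std_error <- 0.03139009
df        <- 1011.768

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df)
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

round(c(ci_lower, ci_upper), 4)  # -0.0405  0.0827

# The t statistic is the estimate divided by its standard error
round(estimate / std_error, 3)   # 0.672
```

Because the interval contains zero, we cannot reject the null hypothesis of no difference between the acceptability and control conditions at the 5% level.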

Now, let's add some covariates to our previous regression. These covariates must be pre-treatment; that is, they must be unaffected by treatment. Add covariates for gender, age, and employment to your regression.

- Which coefficients are causal and which are not?
- How would you interpret the estimate for acceptability now?
- Are any of the variables predictive of the outcome?
- Why did we include these covariates in the regression?
- Did including covariates reduce the standard error of our treatment coefficients?

In [14]:
summary(lm(data = data, formula = correction ~ treatment + gender + age + employment))


Call:
lm(formula = correction ~ treatment + gender + age + employment, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5678 -0.4797 -0.4158  0.5161  0.6198 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.5660240  0.0492606  11.490   <2e-16 ***
treatmentacceptability   0.0217648  0.0313490   0.694    0.488    
treatmentresponsibility -0.0305093  0.0318404  -0.958    0.338    
genderMale               0.0104397  0.0260816   0.400    0.689    
age                     -0.0016905  0.0007874  -2.147    0.032 *  
employment              -0.0167245  0.0270330  -0.619    0.536    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4993 on 1504 degrees of freedom
Multiple R-squared:  0.004936,	Adjusted R-squared:  0.001628 
F-statistic: 1.492 on 5 and 1504 DF,  p-value: 0.1893
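
One way to check whether covariate adjustment improved precision is to put the treatment standard errors from the two regressions side by side. The sketch below copies them from the outputs above; the unadjusted values are only printed to five decimal places, so the comparison is approximate.

```r
# Standard errors of the treatment coefficients, copied from the outputs above
se_unadjusted <- c(acceptability = 0.03134,   responsibility = 0.03180)
se_adjusted   <- c(acceptability = 0.0313490, responsibility = 0.0318404)

# A ratio above 1 means the adjusted standard error is larger
se_ratio <- se_adjusted / se_unadjusted
round(se_ratio, 4)

# TRUE for a coefficient if adding covariates shrank its standard error
se_adjusted < se_unadjusted
```

Here the standard errors are essentially unchanged, which is what we would expect when the covariates explain very little of the variation in the outcome (R-squared below 0.005). The treatment coefficients remain the only ones with a causal interpretation, because only treatment was randomly assigned.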
