# ECON 326: Multiple Regression using R and Jupyter

## Authors
* Jonathan Graves (jonathan.graves@ubc.ca)
* Devan Rawlings (rawling5@student.ubc.ca)

## Prerequisites
* Simple regression
* Data analysis and introduction

## Outcomes

* Understand how the theory of multiple regression models works in practice
* Be able to estimate multiple regression models using R
* Interpret and explain the estimates from multiple regression models
* Understand the relationship between simple linear regressions and similar multiple regressions
* Describe what a control variable is and how it affects a regression relationship
* Explore the relationship between controls and causal interpretations of regression model estimates.

### Notes

<span id="fn1">[<sup>1</sup>](#fn1s)Data is provided under the Statistics Canada Open License.  Adapted from Statistics Canada, 2016 Census Public Use Microdata File (PUMF). Individuals File, 2020-08-29. This does not constitute an endorsement by Statistics Canada of this product.</span>

<span id="fn2">[<sup>2</sup>](#fn2s)Stargazer package is due to: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.2. https://CRAN.R-project.org/package=stargazer </span>



In [None]:
library(tidyverse)
library(haven)

#install.packages("stargazer")  #un-comment this and run this line if you don't have stargazer installed
library(stargazer)
source("hands_on_tests_3.r")

In [None]:
census_data <- read_dta("02_census2016.dta")

census_data <- as_factor(census_data)
census_data <- census_data %>%
               mutate(lnwages = log(wages))

census_data <- filter(census_data, !is.na(wages))  # drop observations with missing wages
census_data <- filter(census_data, !is.na(mrkinc)) # drop observations with missing market income

glimpse(census_data)

# Part 1: Introducing Multiple Regressions

At this point, you are familiar with the simple regression model and its relationship to the comparison-of-means $t$-test.  However, most econometric analysis doesn't use simple regression - this is because, in general, economic data and models are far too complicated to be summarized with a single relationship.  One of the features of most economic datasets is a complex, multi-dimensional relationship between different variables.  This leads to the two key motivations for **multiple regression**:

* First, it can improve the *predictive* properties of a regression model, by introducing other variables that play an important econometric role in the relationship being studied.
* Second, it allows the econometrician to *differentiate* the importance of different variables in a relationship.

This second motivation is usually part of **causal analysis** when we believe that our model has an interpretation as a cause-and-effect.  However, even if it does not, it is still useful to understand which variables are "driving" the relationship in the data.

Let's look at the following plots, which depict the relationships between ``wages``, ``immstat`` and ``sex``.  In the top panel, the colour of each cell is the (average) log of ``wages``.  In the bottom panel, the size of each circle is the number of people in that combination of categories.


In [None]:
options(repr.plot.width=6,repr.plot.height=4) #controls the image size

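# average log wages within each immstat-sex cell, so that each tile shows a mean
# (non-finite values, such as log(0), are left out of the average)
cell_means <- census_data %>%
              group_by(immstat, sex) %>%
              summarize(lnwages = mean(lnwages[is.finite(lnwages)]), .groups = "drop")

f <- ggplot(data = cell_means, aes(x = immstat, y = sex)) + xlab("Immigration Status") + ylab("Sex")
f + geom_tile(aes(fill=lnwages)) + scale_fill_distiller(palette="Set1") #this gives us fancier colours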

f <- ggplot(data = census_data, aes(x = immstat,y = sex))
f + geom_count()

You can see immediately that there are *three* relationships happening at the same time:

1. There is a relationship between ``wages`` and ``sex``
2. There is a relationship between ``wages`` and ``immstat``
3. There is a relationship between ``sex`` and ``immstat``

A simple regression can analyze any _one_ of these relationships in isolation, but it cannot assess more than one of them at a time.  For instance, let's look at these regressions.

In [None]:
regression1 <- lm(data = census_data, wages ~ immstat)
regression2 <- lm(data = census_data, wages ~ sex)

dummy_immstat = as.numeric(census_data$immstat) - 1 #what is this line of code doing?  
# hint, the as.numeric function treats a factor as a number

regression3 <- lm(data = census_data, dummy_immstat ~ sex) 
#this is actually a very important regression model called "linear probability"
#we will learn more about it later in the course

stargazer(regression1, regression2, regression3, title="Comparison of Regression Results",
          align = TRUE, type="text", keep.stat = c("n","rsq")) #we will learn more about this command later on!

The problem here is that these results tell us:

* Men earn higher wages than women (positive coefficient on ``sexmale`` in (2))
* Men are also (slightly) more likely to be immigrants (positive coefficient on ``sexmale`` in (3))
* However, immigrants earn less than non-immigrants. (negative coefficient on ``immstatimmigrants`` in (1))

This implies that when we measure the immigrant wage gap alone, we are *indirectly* including part of the gender wage gap as well.  This is a problem: the "true" immigrant wage gap is probably larger than the simple regression suggests, because it is partly offset by the fact that men, who earn more on average, are slightly more likely to be immigrants.

This is both a practical and a theoretical problem.  It's not just about the model, it's also about what we mean when we say "the immigrant wage gap".
* If we mean "the difference in wages between immigrants and non-immigrants", then the simple regression result is what we want.
* However, this ignores all the other reasons immigrants could have a different wage (education, gender, age, etc.)
* If we mean "the difference in wages between immigrants and non-immigrants, holding other factors equal", then the simple regression result is not suitable.

The problem is that "holding other factors equal" is a debatable proposition.  Which factors?  Why?  This is why you see so much debate, particularly in the media, about these kinds of gaps (e.g. the gender wage gap).  We will revisit this in the exercises.

### Multiple Regression Models

I think most people, when they discuss the immigrant wage gap, would agree that we do not want to conflate the immigrant wage gap measurement with the gender wage gap.  This implies that we *must* add in some other variables.

A multiple regression model simply adds more explanatory ($X_i$) variables to the model.  In our case, we would take our simple regression model:

$$W_i = \beta_0 + \beta_1 I_i + \epsilon_i$$

and augment it with a variable which captures ``sex``:

$$W_i = \beta_0 + \beta_1 I_i + \color{red}{\beta_2 S_i} + \epsilon_i$$

Just as in a simple regression, the goal of estimating a multiple regression model using OLS is to solve the problem:

$$(\hat{\beta_0},\hat{\beta_1},\hat{\beta_2}) = \arg \min_{b_0,b_1,b_2} \sum_{i=1}^{n} (W_i - b_0 - b_1 I_i - b_2 S_i)^2 = \arg \min_{b_0,b_1,b_2} \sum_{i=1}^{n} (e_i)^2$$

In general, you can have any number of explanatory variables in a multiple regression model (as long as that number is not larger than $n-1$, where $n$ is your sample size).  However, there are costs to including more variables, which we will learn more about later.  For now, we want to focus on building an appropriate *model*, and worry about its properties later.
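
To make the minimization problem concrete, here is a minimal sketch on simulated data (the variables `W`, `I` and `S` and all of the numbers are made up for illustration; they are not the census variables).  It assumes the standard result that the least-squares solution satisfies the normal equations $(X'X)b = X'W$, where $X$ is the matrix of regressors including a column of ones, and checks that solving them directly gives the same estimates as `lm`.

```r
# Simulated data: every name and number here is hypothetical
set.seed(326)
n <- 500
S <- rbinom(n, 1, 0.5)               # a dummy regressor
I <- rbinom(n, 1, 0.3 + 0.2 * S)     # a second dummy, correlated with the first
W <- 50000 - 8000 * I + 12000 * S + rnorm(n, sd = 10000)

# Matrix of regressors, with a column of ones for the intercept
X <- cbind(intercept = 1, I = I, S = S)

# Solve the normal equations (X'X) b = X'W for b
b_manual <- solve(crossprod(X), crossprod(X, W))

# The same estimates from lm()
b_lm <- coef(lm(W ~ I + S))

# Side-by-side comparison of the two sets of estimates
print(cbind(manual = drop(b_manual), lm = b_lm))
```

The two columns should agree up to numerical precision; `lm` uses a more stable algorithm internally, so in practice you should always use `lm` rather than solving the equations by hand.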

A regression model like this is easy to estimate in R; you use the same command as in simple regression, and just add the new variable to the model:

``wages ~ immstat + sex``

Let's see it in action:

In [None]:
multiple_model_1 <- lm(data = census_data, wages ~ immstat + sex)

summary(multiple_model_1)

As you can see, there are now three coefficients: one for ``male``, one for ``immigrants`` and one for the intercept.  The important thing to remember is that these relationships are being calculated *jointly*.  Compare the result above to the two simple regressions we saw earlier:

In [None]:
stargazer(regression1, regression2, multiple_model_1, title="Comparison of Multiple and Simple Regression Results",
          align = TRUE, type="text", keep.stat = c("n","rsq"))

# which column is the multiple regression?

Notice the difference in the coefficients: *all* of them are different.

> _Think Deeper_: Why would all of these coefficients change?  Why not just the coefficient on ``immstat``?

You will also notice that the standard errors are different.  This is an important lesson: including (or not including) variables can change the statistical significance of a result.  This is why it is so important to be very careful when designing regression models and thinking them through: a coefficient estimate is a consequence of the *whole model*, and should not be considered in isolation.



## Interpreting Multiple Regression Coefficients

If you don't think about it too carefully, interpreting coefficients in a multiple regression is nearly the same as in a simple regression.  After all, our regression equation is:

$$W_i = \beta_0 + \beta_1 I_i + \beta_2 S_i + \epsilon_i$$

You could (let's pretend for a moment that $S_i$ were continuous) calculate:

$$\frac{\partial W_i}{\partial S_i} = \beta_2$$

This is the same interpretation as in a simple regression model:
* $\beta_2$ is the change in $W_i$ for a 1-unit change in $S_i$.
* As you will see in the exercises, when $S_i$ is a dummy, we have the same interpretation as in a simple regression model: the (average) difference in the dependent variable between the two levels of the dummy variable.

However, there is an important difference: we are *holding constant* the other explanatory variables.  That's what the $\partial$ means when we take a derivative.  This was actually always there (since we were holding constant the residual), but now this is something that is directly observable in our data (and in the model we are building).
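
One way to see what "holding constant" means is to compare the model's predictions for two hypothetical people who share the same value of one regressor and differ only in the other.  The sketch below uses simulated data (the names `W`, `I`, `S` and the numbers are illustrative assumptions, not the census data): the difference between the two predictions is exactly the estimated coefficient on the variable that changed.

```r
# Simulated data: every name and number here is hypothetical
set.seed(1)
n <- 400
I <- rbinom(n, 1, 0.4)
S <- rbinom(n, 1, 0.5)
W <- 40000 - 5000 * I + 9000 * S + rnorm(n, sd = 8000)

fit <- lm(W ~ I + S)

# Two hypothetical people with the same value of I who differ only in S
same_I <- data.frame(I = c(1, 1), S = c(0, 1))
pred <- predict(fit, newdata = same_I)

# Difference in predicted W when S goes from 0 to 1, holding I fixed
diff_pred <- unname(pred[2] - pred[1])

print(diff_pred)
print(coef(fit)["S"])
```

Repeating the comparison with `I = c(0, 0)` gives the same difference, because this model does not let the effect of `S` depend on `I`.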

In [None]:
summary(multiple_model_1)

> **Test your knowledge:**  Based on the results above, how much more money do men make, relative to women, once we hold fixed the immigration status?

In [None]:
# answer the question above by filling in the number (to 1 decimal place)
answer1 <- #fill me in

test_1()

## Control Variables: What Do They Mean?

One very common term you may have heard, especially in the context of a multiple regression model, is the idea of a **control variable**.  In a multiple regression model, control variables are just explanatory variables - there is nothing special about how they are included.  However, there *is* something special about how we think about them.

The idea of a control variable refers to how we *think about* a regression model, and in particular the different variables.  Recall that the interpretation of a coefficient in a multiple regression model is the effect of that variable *holding constant* the other variables.  This is often referred to as **controlling** for the values of those other variables - we are not allowing their relationship with the variable in question, and the outcome variable, to affect our measurement of the result.  This is very common when we are discussing a *cause and effect* relationship - control is essential to these kinds of models.  However, it is also valuable even when we are just thinking about a predictive model.

You can see how this works directly if you think about a multiple regression as a series of "explanations" for the outcome variable.  Each variable, one-by-one "explains" part of the outcome variable.  When we "control" for a variable, we remove the part of the outcome that can be explained by that variable alone.  In terms of our model, this refers to the residual.

However, we must remember that our control variable *also* explains part of the other variables, so we must "control" for it as well.

For instance, our multiple regression:

$$W_i = \beta_0 + \beta_1 I_i + \beta_2 S_i + \epsilon_i$$

can be thought of as three sequential simple regressions:

$$W_i = \gamma_0 + \gamma_1 S_i + u_i$$
$$I_i = \alpha_0 + \alpha_1 S_i + v_i$$

* These two regressions say "Explain ``wages`` and ``immstat`` using ``sex`` (in simple regressions)"

$$\hat{u_i} = \delta_0 + \delta_1 \hat{v_i} + \eta_i$$

* Then, explain whatever is left over from the ``sex-wage`` relationship ($\hat{u_i}$) with whatever is left over from the ``sex-immstat`` relationship ($\hat{v_i}$).

This has effectively "isolated" the variation in the data which has to do with ``sex`` from the result of the model.

Let's see this in action:

In [None]:
regression1 <- lm(wages ~ sex, data = census_data)
# regress wages on sex


regression2 <- lm(dummy_immstat ~ sex, data = census_data)
# regress immigration status on sex

temp_data <-  tibble(wage_leftovers = regression1$residuals, immstat_leftovers = regression2$residuals)
# take whatever is left-over from those regressions, save it

In [None]:
regression3 <- lm(wage_leftovers ~ immstat_leftovers, data = temp_data)
#regress the wage leftovers on the immigration status leftovers

#compare the results with the multiple regression

stargazer(regression1, regression2, regression3, multiple_model_1, title="Comparison of Multiple and Simple Regression Results",
          align = TRUE, type="text", keep.stat = c("n","rsq"))

Look closely at these results.  You will notice that the coefficients on ``immstat_leftovers`` in the "control" regression and ``immstat`` in the multiple regression are *exactly the same*.


> *Think Deeper*: What if we had done this the other way around, first regressing ``wages`` and ``sex`` on ``immstat``?  Which coefficients would match?  Why?

This result is a consequence of the **Frisch-Waugh-Lovell theorem** about OLS, and a variant of it is referred to as the "regression anatomy" equation.
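
For reference, one common way to write the regression anatomy equation for the coefficient on $I_i$ is sketched here; the notation is borrowed from the sequential regressions above, with $\hat{v_i}$ the residual from the simple regression of $I_i$ on $S_i$:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} \hat{v_i} W_i}{\sum_{i=1}^{n} \hat{v_i}^2}$$

In words: the multiple regression coefficient on $I_i$ is the slope from a simple regression of the outcome on the part of $I_i$ that $S_i$ cannot explain.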

For our purposes, it does a very useful thing: it gives us a concrete way of thinking about what "controls" are doing: they are "subtracting" part of the variation from both the outcome and other explanatory variables.  In OLS, this is *exactly* what is happening - but for all variables at once!
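
The "all variables at once" point can be checked with more than one control.  The following sketch uses simulated data (the variables `x`, `z1`, `z2`, `y` and their coefficients are invented for illustration): it removes the part of both the outcome and the regressor of interest that two controls explain, and then compares the resulting slope with the coefficient from the full multiple regression.

```r
# Simulated data: every name and number here is hypothetical
set.seed(42)
n <- 1000
z1 <- rnorm(n)                        # first control (continuous)
z2 <- rbinom(n, 1, 0.5)               # second control (dummy)
x  <- 0.5 * z1 - 0.8 * z2 + rnorm(n)  # regressor of interest, related to both controls
y  <- 1 + 2 * x + 3 * z1 - 1 * z2 + rnorm(n)

# Full multiple regression
full <- lm(y ~ x + z1 + z2)

# Leftovers after explaining y and x with BOTH controls
y_left <- residuals(lm(y ~ z1 + z2))
x_left <- residuals(lm(x ~ z1 + z2))

# Regress leftovers on leftovers
partial <- lm(y_left ~ x_left)

print(coef(full)["x"])
print(coef(partial)["x_left"])
```

The two printed coefficients should be identical up to numerical precision, no matter how many controls are partialled out.  The standard errors reported by the two regressions will differ slightly, because the leftover regression does not account for the degrees of freedom used by the controls.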

## Part 2: Hands-On

Now, it's time to continue our investigation of the immigrant-wage gap, this time using our multiple regression tools.  As we discussed before, when we investigate a wage gap, we usually want to "hold fixed" different kinds of variables.  We have already seen this, using the ``sex`` variable to keep the gender-wage gap out of our measurement of the immigrant-wage gap.  However, there are many more variables we might want to include.

For example, immigrants are typically younger than the average Canadian.  This implies that we may want to control for their age in the analysis, since younger workers typically earn less than older workers.

Let's try that now:

In [None]:
age_regression1 <- lm(wages ~ immstat + sex + agegrp, data = census_data)

summary(age_regression1)

Once we control for age, what do you see?  How has the immigrant-wage gap changed?

Another possible explanation could be the age at immigration; this would be very relevant, since we would expect that individuals who immigrated very early in life would be much more similar to non-immigrants. 

In [None]:
age_regression2 <- lm(wages ~ immstat + sex + ageimm, data = census_data)

summary(age_regression2)

Look closely at this result.  Do you see anything odd or problematic here?

This is a topic we will revisit later in this course, but the problem here is **multicollinearity**.  Essentially, what this means is that one of the variables we have added to our model does not add any new information.

In other words, once we control for the other variables, there's nothing left to explain.  Can you guess what variables are interacting to cause this problem?   R has auto-magically excluded one so that the regression will still run, but this is a big problem for our model.

Let's dig deeper to see what is happening here:

In [None]:
age_reg1 <- lm(wages ~ sex + ageimm, data = census_data)
# regress wages on sex and ageimm

print("Leftovers from wage ~ sex + ageimm")
head(round(age_reg1$residuals,2))
#peek at the leftover part of wage

age_reg2 <- lm(dummy_immstat ~ sex + ageimm, data = census_data)
# regress immstat on sex and ageimm

print("Leftovers from immstat ~ sex + ageimm")
head(round(age_reg2$residuals,5))
#peek at the leftover part of immstat

print("Average Leftovers from immstat ~ sex + ageimm")
round(mean(age_reg2$residuals),5)
#look at the average residual!

As you can see, the residuals from regressing ``dummy_immstat ~ sex + ageimm`` are all (to machine precision) zero.  In other words, once you "control" for ``sex`` and ``ageimm``, *there's nothing left to explain* about ``immstat``.

If we think about this, it makes sense: you can only have an "age at immigration" if you immigrated!  So, if I tell you what value "age at immigration" takes on, you will immediately know whether or not I am an immigrant.  This means the multiple regression, in the final step, would be trying to solve this equation:

$$\hat{u_i} = \delta_0 + \delta_1 \cdot 0 + \eta_i$$

Which does not have a unique solution for $\delta_1$, meaning the regression model isn't well-posed.  R tries to "fix" this problem by getting rid of some variables, but this usually indicates that our model wasn't set up properly in the first place.

The lesson is that we can't just include controls without thinking about them; we have to pay close attention to their role in our model, and their relationship to other variables.
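
The same failure can be reproduced on a small simulated dataset, which makes the mechanics easier to see.  In this sketch, the variable names, the category labels and the numbers are all hypothetical (they only mimic the structure of ``immstat`` and ``ageimm``): the dummy `immigrant` is completely determined by the factor `age_at_imm`, so `lm` cannot estimate a separate coefficient for every term and reports `NA` for one of them.

```r
# Simulated data: every name, label and number here is hypothetical
set.seed(7)
n <- 300
groups <- c("not applicable", "0 to 9 years", "10 years or more")
age_at_imm <- factor(sample(groups, n, replace = TRUE), levels = groups)

# The immigrant dummy is fully determined by age at immigration
immigrant <- as.numeric(age_at_imm != "not applicable")

wage <- 45000 -
        4000 * (age_at_imm == "0 to 9 years") -
        9000 * (age_at_imm == "10 years or more") +
        rnorm(n, sd = 7000)

# lm() still runs, but one coefficient cannot be estimated and is NA
fit <- lm(wage ~ immigrant + age_at_imm)
print(coef(fit))

# Nothing is left of the dummy once we control for age at immigration
leftover <- residuals(lm(immigrant ~ age_at_imm))
print(max(abs(leftover)))
```

Which coefficient ends up as `NA` depends on the order of the terms in the formula, which is another sign that the model itself, and not the estimate, is the problem.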

For example, a *better* way to do this would be to just include ``ageimm`` instead of ``immstat``.  Now we will have a whole bunch of different "immigrant wage gaps" - one for each age of immigration:

In [None]:
regression3 <- lm(wages ~ ageimm + sex, data = census_data)

summary(regression3)

> **Test Your Knowledge**: which group of immigrants has the largest immigrant wage gap?  Answer in terms of the group category; e.g. "5 to 9 years"

In [None]:
#answer by replacing the appropriate label based on the table

answer2 <- "5 to 9 years" #it's not this one; change me

test_2()

> *Think Deeper:*  Why do you think this group might have the largest immigrant-wage gap?  What is going on here?  Think about what you know about Canadian immigration policy.

You can also include different sets of controls in your model; often adding different "layers" of controls is a very good way to understand how different variables interact and affect your conclusions.  Here's an example, adding on several different "layers" of controls:

In [None]:
regression1 <- lm(wages ~ ageimm, data = census_data)
regression2 <- lm(wages ~ ageimm + sex, data = census_data)
regression3 <- lm(wages ~ ageimm + sex + vismin, data = census_data)
regression4 <- lm(wages ~ ageimm + sex + vismin + pr, data = census_data)

stargazer(regression1, regression2, regression3, regression4, title="Comparison of Controls",
          align = TRUE, type="text", keep.stat = c("n","rsq"))

A pretty big table!  Often, when we want to focus on just a single variable, we will simplify the table by just explaining which controls are included.  Here's an example which is much easier to read; it uses some formatting tricks which you don't need to worry about right now:

In [None]:
var_omit = c("(pr)\\w+","(vismin)\\w+", "(sex)\\w+") #don't worry about this right now!

stargazer(regression1, regression2, regression3, regression4, title="Comparison of Controls",
          align = TRUE, type="text", keep.stat = c("n","rsq"), 
          omit = var_omit,
          add.lines = list(c("Sex Controls", "No", "Yes", "Yes", "Yes"),
                           c("Visible Minority Controls", "No", "No", "Yes", "Yes"),
                           c("Province Controls", "No", "No", "No", "Yes")))

#this is very advanced code; don't worry about it right now; we will come back to it at the end of the course

Notice in the above how the coefficients change when we change the included control variables.  Understanding this kind of variation is really important to interpreting a model, and to judging whether or not the results are credible.  For example, ask yourself whether the group with the biggest gap changes from model to model.  What do you think?

### Omitted Variables

Another important topic comes up in the context of multiple regression: **omitted variables**.  In a simple regression, this didn't really mean anything, but now it does.  When we have a large number of variables in a dataset, which ones do we include in our regression?  All of them?  Some of them?

This is actually a very important problem, since it has crucial implications for the interpretation of our model.  For example, remember Assumption 1?  This is a statement about the "true" model - not what you are actually running.  It can very easily be violated when variables aren't included.
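
A small simulation can illustrate what is at stake.  The sketch below is built on assumptions that are not part of the census analysis: the "true" model is known because we generate the data ourselves, and the variable names and coefficients are invented.  It compares a "long" regression that includes a relevant variable `z` with a "short" regression that leaves it out, and checks a standard algebraic identity: the short coefficient equals the long coefficient plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one.

```r
# Simulated data: every name and number here is hypothetical
set.seed(2016)
n <- 2000
z <- rnorm(n)                       # the variable we will omit
x <- 0.6 * z + rnorm(n)             # included regressor, correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)   # "true" model: the coefficient on x is 2

long  <- lm(y ~ x + z)   # includes the relevant variable
short <- lm(y ~ x)       # omits it
aux   <- lm(z ~ x)       # how the omitted variable relates to x

# Short coefficient implied by the long regression and the auxiliary regression
implied <- coef(long)["x"] + coef(long)["z"] * coef(aux)["x"]

print(c(short = unname(coef(short)["x"]),
        implied = unname(implied),
        long = unname(coef(long)["x"])))
```

The `short` and `implied` values match exactly, while the `long` coefficient is close to the value of 2 used to generate the data.  The gap between the short and long coefficients is what leaving out `z` does to our estimate.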

We will revisit this later in the course, since it only really makes sense in the context of causal models, but for now we should pay close attention to which variables we are including and why.  Let's explore this, using the exercises.

## Part 3: Exercises


### Theoretical Activity 1

Suppose you have a regression model that looks like:

$$Y_i = \beta_0 + \beta_1 X_{i} + \beta_2 D_{i} + \epsilon_i$$

Where $D_i$ is a dummy variable.  Recall that Assumption 1 implies that $E[\epsilon_i|D_{i}, X_{i}] = 0$.  Suppose this assumption holds true.  Answer the following:

1.  Compute $E[Y_i|X_i,D_i=1]$ and $E[Y_i|X_i,D_i=0]$
2.  What is the difference between these two terms?
3.  Interpret what the coefficient $\beta_2$ means in this regression, using your answers above.

#### Theoretical Answer 1

**Complete the Exercise**: Carefully write your solutions in the box below.  Use mathematical notation where appropriate, and explain your results.

**TA 1 Answer**: <font color="red">Answer in red here</font>

### Practical Activity 1

To explore the mechanics of multiple regressions, let's return to the analysis that we did in Worksheet 1; that is, let's re-examine the relationship between the immigrant wage gap and education. Recall that we simplified the education levels using the following code:

In [None]:
#Just run this!
census_data <- 
        census_data %>%
        mutate(educ = case_when(
              hdgree == "no certificate, diploma or degree" ~ "Less than high school",
              hdgree == "secondary (high) school diploma or equivalency certificate" ~ "High school diploma",
              hdgree == "trades certificate or diploma other than certificate of apprenticeship or certificate of qualification" ~ "Some college",
              hdgree == "certificate of apprenticeship or certificate of qualification" ~ "Some college",
              hdgree == "program of 3 months to less than 1 year (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "program of 1 to 2 years (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "program of more than 2 years (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "university certificate or diploma below bachelor level" ~ "Some college",
              hdgree == "bachelor's degree" ~ "Bachelor's degree",              
              hdgree == "university certificate or diploma above bachelor level" ~ "Graduate school",
              hdgree == "degree in medicine, dentistry, veterinary medicine or optometry" ~ "Graduate school",
              hdgree == "master's degree" ~ "Graduate school",
              hdgree == "earned doctorate" ~ "Graduate school",
              hdgree == "not available" ~ "not available"
              )) %>%
        mutate(educ = as_factor(educ))

census_data$educ <- relevel(census_data$educ, ref = "Less than high school") #Set "Less than high school" as default factor level

Run a simple regression for the immigrant wage gap (with a single regressor) for each education level. Then, run a multiple regression for the immigrant wage gap that includes education as a control.

<em>Tested objects:</em> ``reg_LESS`` (simple regression; less than high school), ``reg_HS`` (high school diploma), ``reg_SC`` (some college), ``reg_BACH`` (bachelor's degree), ``reg_GRAD`` (graduate school), ``reg2`` (multiple regression).

In [None]:

#Less than high school
reg_LESS <- lm(wages ~ immstat, data = filter(census_data, educ == ???))

#High school diploma
reg_HS <- lm(wages ~ immstat, data = filter(census_data, educ == ???))

#Some college
reg_SC <- lm(wages ~ immstat, data = filter(census_data, educ == ???))

#Bachelor's degree
reg_BACH <- lm(wages ~ immstat, data = filter(census_data, educ == ???))

#Graduate school
reg_GRAD <- lm(wages ~ immstat, data = filter(census_data, educ == ???))

#Multiple regression
reg2 <- lm(wages ~ immstat + ???, data = census_data)

#Table comparing regressions
stargazer(reg_LESS, reg_HS, reg_SC, reg_BACH, reg_GRAD, reg2, 
          title = "Comparing Conditional Regressions with Multiple Regression", align = TRUE, type = "text", keep.stat = c("n","rsq")) 
# uncomment the other tests if you want more
test_3() #For reg_LESS
#test_4() #For reg_HS 
test_5() #For reg_SC (QUIZ1)
#test_6() #For reg_BACH
#test_7() #For reg_GRAD
test_8() #For reg2

#### Short Answer 1
**Prompt**: What variable "value" appears to be missing from the multiple regression in the table? How can we interpret the average wage for the group associated with that value?

**SA 1 Answer**: <font color="red">Answer in red here</font>

#### Short Answer 2
Prompt: Compare the coefficient estimates for `immstat` across each of the simple regressions. How does the immigrant wage gap appear to vary across education levels? How should we interpret this variation?

**SA 2 Answer**: <font color="red">Answer in red here</font>

#### Short Answer 3
Prompt: Compare the simple regressions' estimates with those of the multiple regression. How does the multiple regression's coefficient estimate on `immstat` compare to those estimates in the simple regressions? How can we interpret this? Further, how do we interpret the coefficient estimates on the other regressors in the multiple regression?

**SA 3 Answer**: <font color="red">Answer in red here</font>

### Practical Activity 2
Consider the multiple regression that we estimated in the previous activity:

$$W_i = \beta_0 + \beta_1 I_i + \beta_2 S_i + \epsilon_i$$

Note that $I_i$ is `immstat` and $S_i$ is `educ`.

#### Short Answer 4
Prompt: Why might we be skeptical of the argument that $\beta_1$ captures the immigrant wage gap (i.e., the effect of being an immigrant on one's wages, all else being equal)? What can we do to address these concerns?

**SA 4 Answer**: <font color="red">Answer in red here</font>

#### Short Answer 5
Prompt: Suppose that a member of your research team suggests that we should add `agegrp` as a control in the regression. Do you agree with this team member that this variable would be a good control? Why or why not?

**SA 5 Answer**: <font color="red">Answer in red here</font>

Add `agegrp` to the given multiple regression and compare it with the model that we estimated in the previous activity.

<em>Tested Objects:</em> `reg3` (the same multiple regression that we estimated before, but with age added as a control).

In [None]:
#Quiz 2
#Add Age as Control
#Add them in the order: immigration, education, age
reg3 <- lm(???)

#Compare the regressions with and without this control
stargazer(reg2, reg3, 
          title = "Multiple Regressions with and without Age Controls", align = TRUE, type = "text", keep.stat = c("n","rsq")) 

test_9() #For reg3 (#QUIZ2)

#### Short Answer 6
Prompt: Compare the two regressions in the table above. What happens to the estimated immigrant wage gap when we add age as a control? What might explain this effect?

**SA 6 Answer**: <font color="red">Answer in red here</font>

#### Short Answer 7
Prompt: Suppose that one of your fellow researchers argues that `lfact` (employment status) should be added to the multiple regression as a control. That way, they reason, we can account for differences between employed and unemployed workers. Do you agree with their reasoning? Why or why not?

**SA 7 Answer**: <font color="red">Answer in red here</font>

Let's test this argument directly. Add `lfact` as a control to the multiple regression with all previous controls. Estimate this new regression (`reg4`).

In [None]:
#this will crash!
reg4 <- lm(wages ~ immstat + educ + agegrp + lfact, data = census_data)

summary(reg4)

Huh? That's odd. Let's investigate this by looking at the factor levels of `lfact`:

In [None]:
levels(census_data$lfact) #Run this!

filter(census_data, lfact == 'unemployed - temporary layoff - looked for full-time work') #If we try looking for observations that are unemployed...
filter(census_data, lfact == 'not in the labour force - last worked in 2016') #Observations not in the labour force...

#### Short Answer 8
Prompt: What happened when we tried to run the regression with `lfact`? Does this "result" agree or disagree with your explanation in Short Answer 7?

**SA 8 Answer**: <font color="red">Answer in red here</font>

### Practical Activity 3

In the middle of your team's discussion of which controls they should add to the multiple regression (the same one as the previous activity), your roommate bursts into the room and yells "Just add them all!" After a moment of confused silence, the roommate elaborates that it never hurts to add controls as long as they don't "break" the regression (like `lfact` and `ageimm`). "Data is hard to come by, so we should use as much of it as we can get," he says.

Recall: Below are all of the variables in the dataset.

In [None]:
glimpse(census_data) #Run Me!

#### Short Answer 9
Prompt: Do you agree with your roommate's argument? Why or why not?

**SA 9 Answer**: <font color="red">Answer in red here</font>

Let's back up our argument with regression analysis. Estimate a regression that has the same controls as `reg3` from the previous activity, but add `ppsort` as a control as well.

<em>Tested Objects:</em> `reg5`.

In [None]:
#Quiz 4
#Add ppsort to regression
#Keep the order (immigration, education, age, ppsort)
reg5 <- ???

#Table comparing regressions with and without ppsort
stargazer(reg3, reg5,
          title = "Multiple Regressions with and without ppsort", align = TRUE, type = "text", keep.stat = c("n","rsq")) 

test_10() #For reg5 $QUIZ3

#### Short Answer 10
Prompt: Does the table above suggest that we should add `ppsort` as a control?

**SA 10 Answer**: <font color="red">Answer in red here</font>

In [None]:
#Quiz 5
#Add cip2011 as controls 
#Keep the order immigration, education, age, CIP
reg6 <- ???

#Table that compares reg3 and reg6
stargazer(reg3, reg6,
          title = "Multiple Regressions with and without cip2011", align = TRUE, type = "text", keep.stat = c("n","rsq"))

test_11() #For reg6 ($QUIZ4)