 ## Multiple Predictors - multiple linear regression
 Similar to the examples with regression trees, linear regression can include multiple predictor attributes. These can be all continuous, all categorical, or a mix of continuous and categorical predictors. When more than one attribute is included as a predictor, the model is commonly referred to as multiple linear regression.
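
 In symbols, a multiple linear regression with $p$ predictors can be sketched in the general form below. The notation ($Y$ for the outcome, $X_1, \ldots, X_p$ for the predictors, $\beta_0, \ldots, \beta_p$ for the coefficients) is introduced here for illustration and is not taken from the course materials.

 \begin{equation}
  Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
 \end{equation}

 Each $\beta_j$ describes the change in the outcome for a one unit change in $X_j$ while the other predictors are held constant.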

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)
library(mosaic)
library(broom)

theme_set(theme_bw())

baby <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/baby.csv") %>%
  filter(gestational_days > 200)
head(baby)


 In this example, I want to go back to the baby data used earlier in the course. Previously, birth weight in ounces was used as the outcome, and separate analyses used gestational days and maternal smoker status to predict it. Below, each of those analyses is repeated along with the bootstrapped distribution of its estimated effect.

 ### Continuous Predictor

In [0]:
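# Simple regression on the observed data
baby_reg <- lm(birth_weight ~ gestational_days, data = baby)

# Resample the rows with replacement, refit the model, and keep the slope
resample_baby <- function(...) {
  baby_resample <- baby %>%
    sample_n(nrow(baby), replace = TRUE)

  baby_resample %>%
    lm(birth_weight ~ gestational_days, data = .) %>%
    coef(.) %>%
    .[2] %>%  # second coefficient is the gestational_days slope
    data.frame()
}

# Repeat the resampling 10,000 times and stack the slopes
baby_coef <- map(1:10000, resample_baby) %>%
  bind_rows()
names(baby_coef) <- 'slope'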


In [0]:
gf_density(~ slope, data = baby_coef)
baby_coef %>%
  df_stats(~ slope, quantile(c(0.05, 0.5, 0.95)))


 ### Categorical Predictor

In [0]:
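# Simple regression with the categorical predictor on the observed data
smoker_reg <- lm(birth_weight ~ maternal_smoker, data = baby)

# Resample the rows with replacement, refit the model, and keep the effect
resample_baby <- function(...) {
  baby_resample <- baby %>%
    sample_n(nrow(baby), replace = TRUE)

  baby_resample %>%
    lm(birth_weight ~ maternal_smoker, data = .) %>%
    coef(.) %>%
    .[2] %>%  # second coefficient is the smoker vs. non-smoker mean difference
    data.frame()
}

# Repeat the resampling 10,000 times and stack the estimates
baby_coef <- map(1:10000, resample_baby) %>%
  bind_rows()
names(baby_coef) <- 'slope'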

In [0]:
gf_density(~ slope, data = baby_coef)
baby_coef %>%
  df_stats(~ slope, quantile(c(0.05, 0.5, 0.95)))


 ## Combine the two predictors
 What happens if we combine the two predictors? As shown above, the number of gestational days has a moderate relationship with birth weight. When exploring the effect of smoking, it would therefore be useful to remove the effect of gestational days from birth weight. This allows us to compare smokers and non-smokers at the **same** number of gestational days. One way to think about this is through conditional means. Exploring these visually first can be particularly helpful.

In [0]:
gf_point(birth_weight ~ gestational_days, data = baby, size = 3) %>%
  gf_smooth() %>%
  gf_facet_wrap(~ maternal_smoker)
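
 The conditional means idea can also be checked numerically. The sketch below uses a small hypothetical toy data set (not the baby data) in which birth weight follows a known rule with no noise, so the smoker effect at each number of gestational days is known in advance. The names `toy`, `cond_means`, and `smoker_diff` are illustrative only, and base R is used so the sketch runs without extra packages.

```r
# Hypothetical toy data (not the baby data): birth weights follow a known rule,
# 120 + 0.5 * (days - 280) - 8 * smoker, with no noise added.
toy <- expand.grid(
  gestational_days = c(270, 280, 290),
  maternal_smoker = c(FALSE, TRUE)
)
toy$birth_weight <- 120 + 0.5 * (toy$gestational_days - 280) -
  8 * toy$maternal_smoker

# Conditional means: mean birth weight for each combination of
# gestational days (rows) and smoker status (columns).
cond_means <- tapply(
  toy$birth_weight,
  list(days = toy$gestational_days, smoker = toy$maternal_smoker),
  mean
)
cond_means

# Smoker minus non-smoker difference at the same number of gestational days.
smoker_diff <- cond_means[, "TRUE"] - cond_means[, "FALSE"]
smoker_diff
```

 In this toy case the difference is the same at every number of gestational days, which is the pattern a model with both predictors and no interaction assumes.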


In [0]:
baby_reg_smoker <- lm(birth_weight ~ I(gestational_days - mean(gestational_days)) + maternal_smoker, data = baby)
coef(baby_reg_smoker)


 We can write out the regression equation similar to before:

 \begin{equation}
  birth\_weight = 122.67 + 0.49 (gestational\_days - mean(gestational\_days)) - 8.17 maternal\_smoker + \epsilon
 \end{equation}

 Let's explore how these are interpreted.
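
 As a starting point, the rounded coefficients from the equation above can be plugged into a small helper to see what each term does. This is a minimal sketch: `predict_weight` is a hypothetical helper, and it assumes `maternal_smoker` is coded 1 (or `TRUE`) for smokers and 0 (or `FALSE`) for non-smokers.

```r
# Rounded coefficients copied from the regression equation above
b0 <- 122.67        # intercept
b_days <- 0.49      # slope for centered gestational days
b_smoker <- -8.17   # shift for smokers (assumed coding: smoker = 1)

# Hypothetical helper: predicted birth weight in ounces
predict_weight <- function(days_from_mean, smoker) {
  b0 + b_days * days_from_mean + b_smoker * as.numeric(smoker)
}

# Non-smoker at the mean number of gestational days: the intercept
predict_weight(0, FALSE)

# One additional gestational day, same smoker status: the slope
predict_weight(1, FALSE) - predict_weight(0, FALSE)

# Smoker vs. non-smoker at the same gestational days: the smoker coefficient
predict_weight(10, TRUE) - predict_weight(10, FALSE)
```

 Because gestational days are centered at their mean, the intercept is the predicted weight for a non-smoker with an average gestation length, and the smoker coefficient is the predicted difference between smokers and non-smokers at any fixed number of gestational days.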

 ### Distribution of Effects
 Similar to before, the distribution of effects can be obtained with the following steps:
 1. Resample the observed data, with replacement.
 2. Estimate the linear model coefficients.
 3. Save the terms of interest.
 4. Repeat steps 1 - 3 many times.
 5. Explore the distribution of the saved coefficient estimates from the many resampled data sets.

In [0]:
resample_baby <- function(...) {
  baby_resample <- baby %>%
    sample_n(nrow(baby), replace = TRUE)

  baby_resample %>%
    lm(birth_weight ~ I(gestational_days - mean(gestational_days)) + maternal_smoker, data = .) %>%
    tidy(.) %>%
    select(term, estimate)
}
resample_baby()


In [0]:
coef_baby <- map(1:10000, resample_baby) %>%
  bind_rows()


In [0]:
coef_baby %>%
  gf_density(~ estimate) %>% 
  gf_facet_wrap(~ term, scales = 'free')
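
 The densities can be complemented with percentile intervals for each term. The sketch below shows one way to do this in base R. It uses the built-in `mtcars` data as a stand-in (not the baby data), with `mpg` as the outcome, `wt` as the continuous predictor, and `am` as the 0/1 categorical predictor. The names `boot_once`, `boot_coefs`, and `intervals` are illustrative, and 500 resamples are used only to keep the example fast.

```r
set.seed(1)  # make the resampling reproducible

# One bootstrap replicate: resample rows with replacement, refit, keep all terms
boot_once <- function(...) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + am, data = mtcars[idx, ]))
}

# One row per replicate, one column per model term
boot_coefs <- t(sapply(1:500, boot_once))

# 5th, 50th, and 95th percentiles of each term's bootstrap distribution
intervals <- apply(boot_coefs, 2, quantile, probs = c(0.05, 0.5, 0.95))
intervals
```

 Each column of `intervals` summarizes one term, in the same spirit as the quantile summaries used for the single-predictor models earlier.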


 ## Interactions
 Interactions are an additional idea that can be quite powerful. They were shown indirectly earlier in the course with classification and regression trees, where each split re-evaluated which attributes were most helpful, so the same attribute could be used in different places with different scores identifying the split. A similar idea can be explored in the regression framework: the effect of one attribute can differ across groups. This can be shown visually:

In [0]:
gf_point(birth_weight ~ gestational_days, data = baby, size = 3) %>%
  gf_smooth() %>%
  gf_facet_wrap(~ maternal_smoker)


In [0]:
baby_reg_int <- lm(birth_weight ~ I(gestational_days - mean(gestational_days)) * maternal_smoker, data = baby)
coef(baby_reg_int)
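
 One way to write the interaction model uses generic coefficient symbols in place of the estimates. The $\beta$ notation is an assumption for illustration, and it assumes $maternal\_smoker$ is coded 1 for smokers and 0 for non-smokers.

 \begin{equation}
  birth\_weight = \beta_0 + \beta_1 (gestational\_days - mean(gestational\_days)) + \beta_2 maternal\_smoker + \beta_3 (gestational\_days - mean(gestational\_days)) \times maternal\_smoker + \epsilon
 \end{equation}

 For non-smokers the slope for gestational days is $\beta_1$. For smokers it is $\beta_1 + \beta_3$, so $\beta_3$ captures how much the slope differs between the two groups.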


In [0]:
resample_baby <- function(...) {
  baby_resample <- baby %>%
    sample_n(nrow(baby), replace = TRUE)

  baby_resample %>%
    lm(birth_weight ~ I(gestational_days - mean(gestational_days)) * maternal_smoker, data = .) %>%
    tidy(.) %>%
    select(term, estimate)
}
resample_baby()


In [0]:
coef_baby <- map(1:10000, resample_baby) %>%
  bind_rows()


In [0]:
coef_baby %>%
  gf_density(~ estimate) %>% 
  gf_facet_wrap(~ term, scales = 'free')
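
 To read an interaction term, it can help to turn the coefficients into one slope per group. The sketch below uses hypothetical toy data (not the baby data) built from a known rule with no noise, so the recovered slopes can be checked by eye. It also shows that `*` in a formula is shorthand for both main effects plus their interaction. The names `toy`, `days_c`, `fit_star`, and `fit_long` are illustrative.

```r
# Hypothetical toy data with a known interaction (no noise):
# the slope is 0.5 for non-smokers and 0.5 - 0.2 = 0.3 for smokers.
toy <- expand.grid(days_c = c(-10, 0, 10), smoker = c(FALSE, TRUE))
toy$weight <- 120 + 0.5 * toy$days_c - 8 * toy$smoker -
  0.2 * toy$days_c * toy$smoker

# `*` expands to both main effects plus their interaction
fit_star <- lm(weight ~ days_c * smoker, data = toy)
fit_long <- lm(weight ~ days_c + smoker + days_c:smoker, data = toy)
all.equal(coef(fit_star), coef(fit_long))

# Group-specific slopes recovered from the coefficients
b <- coef(fit_star)
slope_nonsmoker <- unname(b["days_c"])
slope_smoker <- unname(b["days_c"] + b["days_c:smokerTRUE"])
c(nonsmoker = slope_nonsmoker, smoker = slope_smoker)
```

 The same arithmetic applies to the fitted baby model: the slope for one group is the gestational days coefficient, and the slope for the other group adds the interaction coefficient to it.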


 ### Evaluating model fit
 As discussed earlier, R-squared is a measure of overall model fit. It can be compared across the different models to see which one explains the most variation in the baby's birth weight.

In [0]:
# R-squared for each model, from simplest to most complex
summary(baby_reg)$r.squared
summary(smoker_reg)$r.squared
summary(baby_reg_smoker)$r.squared
summary(baby_reg_int)$r.squared
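
 When comparing these values, keep in mind that R-squared cannot decrease when predictors are added to a least-squares model, so the largest model will always have the highest value. A minimal sketch of collecting R-squared and adjusted R-squared side by side is shown below. It uses the built-in `mtcars` data as a stand-in (not the baby data), and the names `models` and `fit_table` are illustrative.

```r
# Stand-in data: the built-in mtcars data set, NOT the baby data.
# mpg plays the role of the outcome, wt the continuous predictor,
# and am (0/1) the categorical predictor.
models <- list(
  continuous  = lm(mpg ~ wt, data = mtcars),
  categorical = lm(mpg ~ factor(am), data = mtcars),
  additive    = lm(mpg ~ wt + factor(am), data = mtcars),
  interaction = lm(mpg ~ wt * factor(am), data = mtcars)
)

# One row per model with both fit measures
fit_table <- data.frame(
  model = names(models),
  r_squared = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)
fit_table
```

 Adjusted R-squared applies a penalty for each added term, which makes it a somewhat fairer basis for comparing models of different sizes.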