# Multiple linear regression  

In many data sets there may be several predictor variables that have an effect on a response variable.
 In fact, the *interaction* between variables may also be used to predict the response.
 When we incorporate these additional predictor variables into the analysis, the model is called *multiple regression*.
 The multiple regression model builds on the simple linear regression model by adding more predictors with corresponding parameters.

## Multiple Regression Model
Let's suppose we are interested in determining what factors might influence a baby's birth weight.
 In our data set we have information on birth weight, our response, and predictors: mother's age, weight and height and gestation period.
 A *main effects model*  includes each of the possible predictors but no interactions.
 Suppose we name these features as in the chart below.
 
| Variable | Description       |
|----------|:-------------------|
| BW       | baby birth weight |
| MA       | mother's age      |
| MW       | mother's weight   |
| MH       | mother's height   |
| GP       | gestation period  |

Then the theoretical main effects multiple regression model is 

$$BW = \beta_0 + \beta_1 MA + \beta_2 MW + \beta_3 MH + \beta_4 GP+ \epsilon.$$ 

Now we have five parameters to estimate from the data, $\beta_0, \beta_1, \beta_2, \beta_3$ and $\beta_4$.
 The random error term, $\epsilon$, has the same interpretation as in simple linear regression and is assumed to come from a normal distribution with mean equal to zero and variance equal to $\sigma^2$.
 Note that multiple regression also includes the polynomial models discussed in the simple linear regression notebook.
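
As a minimal sketch of what estimating these five parameters looks like, the snippet below simulates data from a main effects model and recovers the coefficients by ordinary least squares. The data, the "true" coefficient values, and the predictor ranges are all made up for illustration; only the variable names come from the chart above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (made-up ranges, for illustration only)
MA = rng.uniform(18, 40, n)     # mother's age
MW = rng.uniform(45, 90, n)     # mother's weight
MH = rng.uniform(150, 185, n)   # mother's height
GP = rng.uniform(35, 42, n)     # gestation period

# Hypothetical "true" parameters beta_0..beta_4 and error standard deviation
beta_true = np.array([1.0, 0.02, 0.03, 0.01, 0.05])
sigma = 0.3

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones(n), MA, MW, MH, GP])
BW = X @ beta_true + rng.normal(0.0, sigma, n)

# Ordinary least squares estimate of the five parameters
beta_hat, *_ = np.linalg.lstsq(X, BW, rcond=None)
print(beta_hat)
```

The estimated slopes should land close to the values used to simulate the data, while the intercept is usually estimated less precisely because the predictors are far from zero.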
 
One of the most important things to notice about the equation above is that each variable makes a contribution **independently** of the other variables.
This is sometimes called **additivity**: the effects of the predictor variables are added together to get the total effect on `BW`.

## Interaction Effects

Suppose in the example, through exploratory data analysis, we discover that younger mothers with long gestational times tend to have heavier babies, but older mothers with short gestational times tend to have lighter babies.
 This could indicate an interaction effect on the response.
 When there is an interaction effect, the effects of the variables involved are not additive.
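
To make the loss of additivity concrete, here is a small sketch with hypothetical coefficients (none of these numbers come from the data). With an `MA` x `GP` product term in the model, the change in predicted birth weight for one extra unit of gestation depends on the mother's age; without it, the change is the same at every age.

```python
# Hypothetical coefficients, chosen only to illustrate the idea
b0, b_ma, b_gp, b_int = 2.0, 0.05, 0.10, -0.002

def predict(ma, gp, interaction=True):
    """Predicted birth weight from mother's age and gestation period."""
    y = b0 + b_ma * ma + b_gp * gp
    if interaction:
        y += b_int * ma * gp   # two-way interaction term MA x GP
    return y

def gp_effect(ma, interaction):
    """Change in prediction when GP increases by one unit at a given MA."""
    return predict(ma, 40, interaction) - predict(ma, 39, interaction)

# Additive model: the GP effect is the same at every age
print(gp_effect(20, False), gp_effect(40, False))
# Interaction model: the GP effect differs between ages 20 and 40
print(gp_effect(20, True), gp_effect(40, True))
```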
 
 Different numbers of variables can be involved in an interaction.
 When two features are involved in the interaction it is called a *two-way interaction* .
 There are three-way and higher interactions possible as well, but they are less common in practice.
 The *full model*  includes main effects and all interactions.
 For the example given here there are 6 two-way interactions possible between the variables, 4 possible three-way, and 1 four-way interaction in the full model.
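
The counts above can be checked by enumerating the combinations directly. This is only a counting sketch using the four predictor names from the chart.

```python
from itertools import combinations

predictors = ["MA", "MW", "MH", "GP"]

# For each order k, list every k-way interaction as a tuple of predictor names
interactions = {
    k: list(combinations(predictors, k))
    for k in range(2, len(predictors) + 1)
}

for k, terms in interactions.items():
    print(k, len(terms), [":".join(t) for t in terms])
```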
 
 Often in practice we fit the full model to check for significant interaction effects.
 If there are no interactions that are significantly different from zero, we can drop the interaction terms and fit the main effects model to see which of those effects are significant.
 If interaction effects are significant (important in predicting the behavior of the response) then we will interpret the effects of the model in terms of the interaction.
 
<!--  NOTE: not sure if correction for multiple comparisons is outside the scope here; I would in general not recommend to students that they test all possible interactions unless they had a theoretical reason to, or unless they were doinging something exploratory and then would collect new data to test any interaction found. -->

## Feature Selection

Suppose we run a full model for the four variables in our example and none of the interaction terms are significant.
 We then run a main effects model and we get parameter estimates as shown in the table below.
 
| Coefficients | Estimate | Std. Error | p-value |
|--------------|----------|------------|---------|
| Intercept    | 36.69    | 5.97       | 1.44e-6 |
| MA           | 0.36     | 1.00       | 0.7197  |
| MW           | 3.02     | 0.85       | 0.0014  |
| MH           | -0.02    | 0.01       | 0.1792  |
| GP           | -0.81    | 0.66       | 0.2311  |

Recall that the p-value is the probability, assuming the true parameter is zero, of getting an estimate at least as extreme (as far from zero) as the one we got from the data.
 Small p-values (typically less than 0.05) indicate that the associated parameter is significantly different from zero, suggesting that the associated covariate is important for predicting the response.
 In our birth weight example, the p-value for the intercept is very low ($1.44 \times 10^{-6}$), so the intercept is significantly different from zero.
 The mother's weight (MW) has p-value 0.0014, which is very small, indicating that mother's weight has an important (significant) impact on her baby's birth weight.
 The p-values from all the other Wald tests are large (0.7197, 0.1792, and 0.2311), so we have no evidence that these variables are important when predicting birth weight in this model.
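
For reference, the Wald statistic behind each of these p-values is the estimate divided by its standard error. The notation $\hat\beta$ and $\mathrm{SE}$ is introduced here for illustration and is not used elsewhere in this notebook. Using the rounded table values for MW:

$$t_{MW} = \frac{\hat\beta_{MW}}{\mathrm{SE}(\hat\beta_{MW})} = \frac{3.02}{0.85} \approx 3.55$$

In least squares software this ratio is typically compared to a $t$ distribution with the residual degrees of freedom to get a two-sided p-value. The sample size is not stated here, so the p-values cannot be reproduced exactly, and because the table entries are rounded, ratios computed from them are only approximate.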
 
  We can modify the coefficient of determination to account for having more than one predictor in the model; the result is called the *adjusted R-square*.
 R-square has the property that it never decreases as you add more terms, even when those terms are unrelated to the response.
 The adjustment penalizes additional terms to take this into consideration.
 For this data the adjusted R-square is 0.8208, indicating a reasonably good fit.
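
A common form of the adjustment is $1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of predictors (not counting the intercept). The sketch below uses hypothetical values of $R^2$, $n$, and $p$, since the sample size for the birth weight data is not given.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for a model with p predictors (intercept excluded)
    fit to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: the same R-square is penalized more with more predictors
print(adjusted_r2(0.85, n=30, p=1))
print(adjusted_r2(0.85, n=30, p=4))
```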

  Different combinations of the variables included in the model may give better or worse fits to the data.
 We can use several methods to select the "best" model for the data.
 One example is called *forward selection* .
 This method begins with an empty model (intercept only) and adds variables to the model one by one until the full main effects model is reached.
 In each forward step, you add the one variable that gives the best improvement to the fit.
 There is also *backward selection*  where you start with the full model and then drop the least important variables one at a time until you are left with the intercept only.
 If there are not too many features, you can also look at all possible models.
 Typically these models are compared using the AIC (Akaike information criterion) which measures the relative quality of models.
 Given a set of models, the preferred model is the one with the minimum AIC value.
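
The forward procedure can be sketched as follows. This is an illustrative implementation on synthetic data, not the routine of any particular package: the helper names `aic` and `forward_selection` are hypothetical, the AIC is the Gaussian form up to an additive constant, and the data are simulated so that only `MW` and `GP` truly matter.

```python
import numpy as np

def aic(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of columns)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_selection(features, y):
    """features: dict of name -> 1-D array. Returns the list of (names, AIC) visited."""
    n = len(y)
    chosen = []
    X = np.ones((n, 1))                      # start from the intercept-only model
    path = [((), aic(X, y))]
    remaining = list(features)
    while remaining:
        # Try adding each remaining variable and keep the one with the lowest AIC
        scores = {name: aic(np.column_stack([X, features[name]]), y)
                  for name in remaining}
        best = min(scores, key=scores.get)
        X = np.column_stack([X, features[best]])
        chosen.append(best)
        remaining.remove(best)
        path.append((tuple(chosen), scores[best]))
    return path

# Synthetic data: only MW and GP affect the response in this toy example
rng = np.random.default_rng(1)
n = 150
features = {name: rng.normal(size=n) for name in ["MA", "MW", "MH", "GP"]}
y = 3.0 + 2.0 * features["MW"] + 1.0 * features["GP"] + rng.normal(scale=0.5, size=n)

path = forward_selection(features, y)
best_names, best_aic = min(path, key=lambda step: step[1])
print(best_names, best_aic)
```

The path runs from the intercept-only model to the full main effects model, and the preferred model is the step with the minimum AIC.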
 
Previously we talked about splitting the data into training and test sets.
In statistics, this is not common, and the models are trained with all the data.
This is because statistics is generally more interested in the effect of a particular variable *across the entire dataset* than it is about using that variable to make a prediction about a particular datapoint.
Because of this, we typically have concerns about how well linear regression will work with new data, i.e. will it have the same $r^2$ for new data or a lower $r^2$?
Both forward and backward selection can make this problem worse because they tune the model to the data even more closely by removing variables that aren't "important."
You should always be very careful with such variable selection methods and their implications for model generalization.
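
One way to see the generalization concern is to hold out part of the data and compare the fit on the two parts. The sketch below is a toy example on simulated data with many irrelevant predictors, which is an assumption made to exaggerate the effect; the helper `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(X, y, beta):
    """Coefficient of determination of the predictions X @ beta for y."""
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n, n_noise = 60, 20
signal = rng.normal(size=n)
noise_features = rng.normal(size=(n, n_noise))   # predictors unrelated to y
y = 1.0 + 2.0 * signal + rng.normal(size=n)

X = np.column_stack([np.ones(n), signal, noise_features])
train, test = np.arange(0, 40), np.arange(40, n)

# Fit on the training rows only, then evaluate on both parts
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
train_r2 = r_squared(X[train], y[train], beta)
test_r2 = r_squared(X[test], y[test], beta)
print("train R^2:", train_r2)
print("test  R^2:", test_r2)
```

In this setup the training value is typically much higher than the held-out value, because the model has partly fit the noise.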

# Categorical Variables

In the birth weight example, there is also information available about the mother's activity level during her pregnancy.
 Values for this categorical variable are: low, moderate, and high.
 How can we incorporate these into the model? 
 Since they are not numeric, we have to create *dummy variables*  that are numeric to use.
 A dummy variable represents the presence of a level of the categorical variable by a 1 and its absence by a 0.
  Fortunately, most software packages that do multiple regression do this for us automatically.
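
To show what the software does behind the scenes, here is a hand-rolled sketch of the encoding for the activity level. It assumes one level (here `low`) is left out as a reference, so three levels need only two dummy variables; the column names `MALmoderate` and `MALhigh` follow a common naming pattern, and the function `dummy_encode` is hypothetical.

```python
levels = ["low", "moderate", "high"]   # first level is treated as the reference
reference, others = levels[0], levels[1:]

def dummy_encode(value):
    """Return one 0/1 indicator per non-reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {f"MAL{level}": int(value == level) for level in others}

for mal in levels:
    print(mal, dummy_encode(mal))
```

A mother in the reference group gets zeros in both columns, so her activity level is represented by the intercept alone.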
 
Often, one of the levels of the categorical variable is considered the "baseline" and the contributions to the response of the other levels are in relation to baseline.
Let's look at the data again. 
 In the table below, the mother's age is dropped and the mother's activity level (MAL) is included.
 
| Coefficients | Estimate | Std. Error | p-value  |
|--------------|----------|------------|----------|
| Intercept    | 31.35    | 4.65       | 3.68e-07 |
| MW           | 2.74     | 0.82       | 0.0026   |
| MH           | -0.04    | 0.02       | 0.0420   |
| GP           | 1.11     | 1.03       | 0.2917   |
| MALmoderate  | -2.97    | 1.44       | 0.049    |
| MALhigh      | -1.45    | 2.69       | 0.5946   |
 

For the categorical variable MAL, the low level has been chosen as the baseline.
 The other two levels have parameter estimates that we can use to determine which are significantly different from the low level.
 One way to think about this is that the low activity level is absorbed into the intercept, and the coefficients for the two dummy variables `MALmoderate` and `MALhigh` get added on top of that for mothers in those groups.
 
 We can see that MAL moderate level is significantly different from the low level (p-value < 0.05).
 The parameter estimate for the moderate level of MAL is -2.97.
 This can be interpreted as: with the other predictors held fixed, babies of mothers in the moderately active group are predicted to have a birth weight 2.97 units lower than babies in the low activity group.
 We also see that for babies with mothers in the high activity group, their birth weights are not significantly different from birth weights in the low group, since the p-value is not low (0.5946 > 0.05) and so this term does not have a significant effect on the response (birth weight).
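
The interpretation of the dummy coefficients can be checked by plugging the estimates from the table above into the fitted equation. The mother's values used below are hypothetical, and the units of the predictors are not given in the text, so only the difference between the two predictions is meaningful here.

```python
# Estimates copied from the table above
coef = {"Intercept": 31.35, "MW": 2.74, "MH": -0.04, "GP": 1.11,
        "MALmoderate": -2.97, "MALhigh": -1.45}

def predict_bw(mw, mh, gp, mal):
    """Predicted birth weight; `mal` is 'low', 'moderate' or 'high'."""
    y = coef["Intercept"] + coef["MW"] * mw + coef["MH"] * mh + coef["GP"] * gp
    if mal != "low":                 # low is the baseline: no extra term
        y += coef[f"MAL{mal}"]
    return y

# Hypothetical mother: same MW, MH and GP, different activity level
mw, mh, gp = 10.0, 160.0, 5.0
diff = predict_bw(mw, mh, gp, "moderate") - predict_bw(mw, mh, gp, "low")
print(round(diff, 2))   # -2.97
```

Whatever values are chosen for MW, MH, and GP, the difference between the moderate and low predictions is the `MALmoderate` estimate.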
 
  This example highlights a phenomenon that often happens in multiple regression.
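 Whether a predictor appears significant depends on which other variables are in the model: when we drop the variable MA (mother's age) and include the categorical variable, MW (mother's weight) and MH (mother's height) are both significant predictors of birth weight (p-values 0.0026 and 0.0420 respectively), whereas MH was not significant in the earlier model (p-value 0.1792).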
 This is why it is important to perform some systematic model selection (forward or backward or all possible) to find an optimum set of features.
 
# Diagnostics

As in the simple linear regression case, we can use the residuals to check the fit of the model.
 Recall that the residuals are the observed response minus the predicted response.
 
  - Plot the residuals against each independent variable to check whether higher order terms are needed  
  - Plot the residuals versus the predicted values to check whether the variance is constant  
  - Plot a qq-plot of the residuals to check for normality  
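
The quantities behind these three plots can be computed as in the sketch below. It uses simulated data with two made-up predictors `x1` and `x2`, and it stops short of drawing the plots so that it does not depend on a plotting library; each pair of arrays named in the comments can be passed to whatever scatter plot function is available.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
residuals = y - fitted            # observed minus predicted

# Points for a normal qq-plot: sorted residuals vs. theoretical normal quantiles
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
sample = np.sort(residuals)

# (x1, residuals), (x2, residuals)  -> look for curvature (higher order terms)
# (fitted, residuals)               -> look for a funnel shape (non-constant variance)
# (theoretical, sample)             -> should fall close to a straight line
qq_corr = np.corrcoef(theoretical, sample)[0, 1]
print(qq_corr)
```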
  

# Multicollinearity

Multicollinearity occurs when two variables or features are linearly related, i.e.
 they have very strong correlation between them (close to -1 or 1).
 Practically this means that some of the independent variables are measuring the same thing and are not needed.
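 When the correlation is strong but not perfect, the parameter estimates can still be computed, but they are unstable and have large standard errors.
 In the extreme case (correlation exactly -1 or 1), the estimates of the parameters of the model cannot be obtained, because there is no unique solution for OLS when perfect multicollinearity occurs.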
 As a result, multicollinearity makes conclusions about which features should be used questionable.
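
The extreme case can be demonstrated with a toy example in which the same measurement appears twice in different units. The data are simulated, and the variable names `height_cm` and `height_in` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
height_cm = rng.normal(165, 7, n)
height_in = height_cm / 2.54          # same information, different units
weight = rng.normal(65, 10, n)

# The two height columns are perfectly correlated (up to rounding)
corr = np.corrcoef(height_cm, height_in)[0, 1]
print(corr)

X = np.column_stack([np.ones(n), height_cm, height_in, weight])
# Four columns but only three linearly independent ones:
# X'X is singular, so OLS has no unique solution
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

Dropping one of the two height columns restores a full-rank design matrix and a unique least squares solution.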