# Section 4.2 — Multiple linear regression

This notebook contains the code examples from [Section 4.2 Multiple linear regression]() from the **No Bullshit Guide to Statistics**.

#### Notebook setup

In [1]:
# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
# RCPARAMS.update({"figure.figsize": (10, 4)})   # good for screen
RCPARAMS.update({"figure.figsize": (5, 2.3)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = "retina"


In [None]:
# simple float __repr__
np.set_printoptions(legacy='1.25')

# set random seed for repeatability
np.random.seed(42)

In [None]:
# Download datasets/ directory if necessary
from ministats import ensure_datasets
ensure_datasets()

## Definitions

TODO

## Doctors dataset

In [None]:
doctors = pd.read_csv("datasets/doctors.csv")
doctors.shape

In [None]:
doctors.head()

## Multiple linear regression model
$\newcommand{\Err}{ {\Large \varepsilon}}$

$$
   Y  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \Err,
$$

where $p$ is the number of predictors
and $\Err$ represents Gaussian noise $\Err \sim \mathcal{N}(0,\sigma)$.
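
As a rough illustration of what this model says, the sketch below simulates data from a model with $p=2$ predictors and then recovers the parameters by least squares. The parameter values, the sample size, and the variable names (`beta0`, `x1`, `betahat`, etc.) are assumptions chosen for the illustration and have nothing to do with the doctors dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical "true" parameters, chosen only for this illustration
beta0, beta1, beta2, sigma = 60.0, -2.0, 1.5, 1.0

# Two predictors and Gaussian noise with mean 0 and standard deviation sigma
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
eps = rng.normal(0, sigma, size=n)

# Generate outcomes according to Y = beta0 + beta1*x1 + beta2*x2 + noise
y = beta0 + beta1 * x1 + beta2 * x2 + eps

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of (beta0, beta1, beta2)
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betahat)
```

The estimates should land close to the values used to generate the data, with the gap shrinking as `n` grows.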


### Model assumptions

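- **(LIN)**: the mean of the outcome is a linear function of the predictors
- **(INDEPɛ)**: the error terms are independent
- **(NORMɛ)**: the error terms are normally distributed
- **(EQVARɛ)**: the error terms have equal variance
- **(NOCOL)**: the predictors are not collinear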
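
Reading (NOCOL) as "no collinearity among the predictors" (an interpretation of the abbreviation), the sketch below shows what goes wrong when it fails: if one predictor is an exact linear combination of the others, the design matrix loses rank and the least-squares parameters are no longer uniquely determined. The data and the names `x3_ok` and `x3_bad` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3_ok = rng.normal(size=n)   # unrelated to x1 and x2
x3_bad = 2 * x1 + x2         # exact linear combination of x1 and x2

# Design matrices with an intercept column and three predictors
X_ok = np.column_stack([np.ones(n), x1, x2, x3_ok])
X_bad = np.column_stack([np.ones(n), x1, x2, x3_bad])

# Full column rank (4) means the parameters are uniquely determined
rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
print(rank_ok, rank_bad)
```

In real datasets exact collinearity is rare. Strongly correlated predictors are the more common problem, and they make the parameter estimates unstable rather than undefined.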


## Example: linear model for doctors' sleep scores

We want to know how drinking alcohol, smoking weed, and exercising influence doctors' sleep scores.

In [None]:
import statsmodels.formula.api as smf

formula = "score ~ 1 + alc + weed + exrc"
lm2 = smf.ols(formula, data=doctors).fit()
lm2.params

In [None]:
sns.barplot(x=lm2.params.values[1:],
            y=lm2.params.index[1:]);

### Partial regression plots

#### Partial regression plot for the predictor `alc` 

**Step 1**: Obtain the data for the x-axis,
residuals of the model `alc ~ 1 + others`

In [None]:
#######################################################
lm_alc = smf.ols("alc~1+weed+exrc", data=doctors).fit()
xrs = lm_alc.resid

**Step 2**: Obtain the data for the y-axis,
residuals of the model `score ~ 1 + others`

In [None]:
#######################################################
lm_score=smf.ols("score~1+weed+exrc",data=doctors).fit()
yrs = lm_score.resid

**Step 3**: Fit a linear model for the y-residuals versus the x-residuals.

In [None]:
dfrs = pd.DataFrame({"xrs": xrs, "yrs": yrs})
lm_resids = smf.ols("yrs ~ 1 + xrs", data=dfrs).fit()

**Step 4**: Draw a scatter plot of the residuals
and the best-fitting linear model to the residuals.

In [None]:
from ministats import plot_reg

ax = sns.scatterplot(x=xrs, y=yrs, color="C0")
plot_reg(lm_resids, ax=ax);
ax.set_xlabel("alc~1+weed+exrc  residuals")
ax.set_ylabel("score~1+weed+exrc  residuals")
ax.set_title("Partial regression plot for `alc`");

The slope parameter for the residuals model
is the same as the slope parameter of the `alc` predictor in the original model `lm2`.

In [None]:
lm_resids.params["xrs"], lm2.params["alc"]
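
This equality is not specific to the doctors dataset. As a sanity check, the sketch below repeats Steps 1 to 3 on synthetic data using only NumPy. The data-generating coefficients and the helper `ols` are assumptions introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors; x2 is deliberately correlated with x1
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

def ols(X, y):
    """Return the least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)

# Full model: y ~ 1 + x1 + x2 + x3
b_full = ols(np.column_stack([ones, x1, x2, x3]), y)

# Step 1: residuals of x1 ~ 1 + others
X_others = np.column_stack([ones, x2, x3])
x_res = x1 - X_others @ ols(X_others, x1)

# Step 2: residuals of y ~ 1 + others
y_res = y - X_others @ ols(X_others, y)

# Step 3: regress the y-residuals on the x-residuals
b_res = ols(np.column_stack([ones, x_res]), y_res)

# The slope of the residuals model matches the x1 slope of the full model
print(b_res[1], b_full[1])
```

The two printed slopes agree up to floating-point error, which is why the partial regression plot can be read as the effect of one predictor after accounting for the others.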

#### Partial regression plot for the predictor `weed` 

In [None]:
from ministats import plot_partreg
plot_partreg(lm2, pred="weed");

#### Partial regression plot for the predictor `exrc`

In [None]:
plot_partreg(lm2, pred="exrc");

#### (BONUS TOPIC) Partial regression plots using `statsmodels`

The function `plot_partregress` defined in `statsmodels.graphics.api`
performs the same steps as the function `plot_partreg` we used above,
but is a little more awkward to use.

When calling the function `plot_partregress`,
you must provide the following arguments:
- the outcome variable (`endog`)
- the predictor you're interested in (`exog_i`)
- the predictors you want to *regress out* (`exog_others`)
- the data frame that contains all these variables (`data`)


In [None]:
from statsmodels.graphics.api import plot_partregress

with plt.rc_context({"figure.figsize":(12,3)}):
    fig, (ax1, ax2, ax3) = plt.subplots(1,3, sharey=True)
    plot_partregress("score", "alc",  exog_others=["weed", "exrc"], data=doctors, obs_labels=False, ax=ax1)
    plot_partregress("score", "weed", exog_others=["alc",  "exrc"], data=doctors, obs_labels=False, ax=ax2)
    plot_partregress("score", "exrc", exog_others=["alc",  "weed"], data=doctors, obs_labels=False, ax=ax3)

The notation $|\textrm{X}$ you see in the axis labels stands for "given all other predictors,"
and is different for each subplot.
For example,
in the leftmost partial regression plot,
the predictor we are focussing on is `alc`,
so the other variables ($|\textrm{X}$) are `weed` and `exrc`.

### Plot residuals

In [None]:
from ministats import plot_resid
plot_resid(lm2);

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1, 3, sharey=True, figsize=(9,2.5))
plot_resid(lm2, pred="alc",  ax=ax1)
plot_resid(lm2, pred="weed", ax=ax2)
plot_resid(lm2, pred="exrc", ax=ax3);

### Model summary table

In [None]:
lm2.summary()

## Explanations

### Nonlinear terms in linear regression

#### Example: polynomial regression

In [None]:
howell30 = pd.read_csv("datasets/howell30.csv")
len(howell30)

In [None]:
# Fit quadratic model
formula2 = "height ~ 1 + age + np.square(age)"
lmq = smf.ols(formula2, data=howell30).fit()
lmq.params

In [None]:
# Plot the data
sns.scatterplot(data=howell30, x="age", y="height");

# Plot the best-fit quadratic model
intercept, b_lin, b_quad = lmq.params
ages = np.linspace(0.1, howell30["age"].max())
heighthats = intercept + b_lin*ages + b_quad*ages**2
sns.lineplot(x=ages, y=heighthats, color="b");

In [None]:
# ALT. using `plot_reg` function
from ministats import plot_reg
plot_reg(lmq);
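
The quadratic model still counts as a linear model because it is linear in the parameters: the squared term is just one more column in the design matrix. The sketch below illustrates this on synthetic data (the coefficients are made up and are not estimates from `howell30`) by fitting the quadratic with ordinary least squares and comparing the result to `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150

# Synthetic age/height data with a made-up quadratic trend
age = rng.uniform(0, 30, size=n)
height = 70.0 + 6.0 * age - 0.1 * age**2 + rng.normal(0, 5, size=n)

# Design matrix with columns 1, age, age^2
X = np.column_stack([np.ones(n), age, age**2])

# Ordinary least squares on the transformed predictors
coefs, *_ = np.linalg.lstsq(X, height, rcond=None)

# np.polyfit returns coefficients from highest to lowest degree,
# so reverse them to get (intercept, linear, quadratic)
coefs_polyfit = np.polyfit(age, height, deg=2)[::-1]

print(coefs)
print(coefs_polyfit)
```

Both approaches solve the same least-squares problem, so the two coefficient vectors should agree up to numerical error.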

### Feature engineering and transformed variables

#### Bonus example: polynomial regression up to degree 3

In [None]:
formula3 = "height ~ 1 + age + np.power(age,2) + np.power(age,3)"
exlm3 = smf.ols(formula3, data=howell30).fit()
exlm3.params

In [None]:
sns.scatterplot(data=howell30, x="age", y="height")
sns.lineplot(x=ages, y=exlm3.predict({"age":ages}));

## Discussion

## Exercises

### Exercise E??: marketing dataset

In [None]:
marketing = pd.read_csv("datasets/exercises/marketing.csv")
formula_mkt = "sales ~ 1 + youtube + facebook + newspaper"
lm_mkt2 = smf.ols(formula_mkt, data=marketing).fit()
lm_mkt2.params

## Links