Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Multiple linear regression: Problem solving

In this session, you'll apply multiple linear regression to a new dataset, `ToothGrowth`, which has the following variables:

| Variable | Type    | Description                |
|----------|---------|:----------------------------|
| len      | Ratio   | Tooth length               |
| supp     | Nominal | Supplement type (VC or OJ) |
| dose     | Ratio   | Dose in milligrams/day     |

These data were collected in an experiment measuring the effect of vitamin C supplements (`supp`; either orange juice, `OJ`, or ascorbic acid, `VC`) on tooth length (`len`) at three different doses (`dose`; 0.5, 1, and 2 mg/day) in guinea pigs.

The outcome variable that we'd like to predict is `len`.

**QUESTION:**

What relationship do you expect between `dose` and `len`?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What relationship do you expect between `supp` and `len`?

**ANSWER: (click here to edit)**


<hr>

## Load data

Start with importing `pandas`.

Load a dataframe with `"datasets/toothgrowth.csv"` and display it.

Before converting `supp` into a dummy variable, save it to a variable so you can put it back into the dataframe later.

Now convert the nominal variables in the dataframe to dummies and save the result.

**QUESTION:**

Which supplement is the base level (or reference level) after converting `supp` to dummies?

**ANSWER: (click here to edit)**


<hr>

Next put the `supp` variable you saved back into the dataframe.
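The save, convert, and restore steps above can be sketched as follows. This is a minimal illustration on a small made-up dataframe, not the real CSV; the names `toy` and `saved_supp` and the choice of `drop_first=True` are assumptions, and your notebook may do this differently.

```python
import pandas as pd

# Made-up rows for illustration only (not the real ToothGrowth values).
toy = pd.DataFrame({
    "len": [5.0, 7.0, 12.0, 14.0, 20.0, 22.0],
    "supp": ["VC", "OJ", "VC", "OJ", "VC", "OJ"],
    "dose": [0.5, 0.5, 1.0, 1.0, 2.0, 2.0],
})

# 1. Save the nominal column before it is replaced by dummies.
saved_supp = toy["supp"]

# 2. Convert nominal columns to dummies. drop_first=True drops one level
#    of each nominal variable; the dropped level is the reference level.
toy = pd.get_dummies(toy, drop_first=True)

# 3. Put the saved column back so it is available for grouping and plots.
toy["supp"] = saved_supp

print(toy.columns.tolist())
```

Keeping the original `supp` column alongside its dummy is convenient because grouping and plot coloring read more naturally with the labels than with 0/1 values.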

## Explore data

### Descriptive statistics

Display the overall descriptive statistics.

**QUESTION:**

If the mean of `supp_VC` is .50, what does that tell you?

**ANSWER: (click here to edit)**


<hr>

Group the dataframe by `supp`.

Display descriptive statistics for the groups.
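As a rough sketch of grouped descriptive statistics, the example below groups a small made-up dataframe by `supp` and summarizes each group. The dataframe `toy` and its values are invented for illustration and are not the real data.

```python
import pandas as pd

# Made-up rows for illustration only (not the real ToothGrowth values).
toy = pd.DataFrame({
    "len": [5.0, 7.0, 12.0, 14.0, 20.0, 22.0],
    "supp": ["VC", "OJ", "VC", "OJ", "VC", "OJ"],
    "dose": [0.5, 0.5, 1.0, 1.0, 2.0, 2.0],
})

# Group the rows by supplement type.
groups = toy.groupby("supp")

# describe() returns count, mean, std, min, quartiles, and max
# for every numeric column, separately for each group.
summary = groups.describe()

# The result has two-level column names: (variable, statistic).
print(summary["len"][["count", "mean", "std"]])
```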

**QUESTION:**

Which `supp` had a higher mean `len`?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Why is the mean `dose` the same for OJ and VC?

**ANSWER: (click here to edit)**


<hr>

### Plots

Import `plotly.express`.

Create a **boxplot** (`with px do box`) with `dose` as X, `len` as Y, and color as `supp`.

The box plot matches the descriptive statistics in a useful way:

- The line in the middle of the boxplot is the **median**
- The top and bottom of each box are the **75th percentile** and **25th percentile**, respectively
- The "whiskers" or bars above and below the box stop at the most extreme point that lies within 1.5 times the interquartile range (the difference between the 75th and 25th percentiles) of the box's edge
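To make the pieces of a boxplot concrete, here is a small sketch that computes them by hand with `numpy` on made-up numbers. It assumes the common convention in which the 1.5 × interquartile-range limit is measured from the edges of the box; exact quartile values can differ slightly between libraries because they interpolate percentiles in different ways.

```python
import numpy as np

# Made-up values for illustration only.
values = np.array([4.0, 7.0, 8.0, 10.0, 11.0, 12.0, 15.0, 30.0])

# Box: 25th percentile, median, and 75th percentile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Limits beyond which points are drawn individually as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observed points inside the limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Anything outside the limits is plotted as a separate point.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```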

Boxplots are a good way to get a feel for the distribution of data, especially when you are comparing two groups like this.

**QUESTION:**

Why didn't we use scatterplots instead? Try it and see.

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Which `supp` has a higher median `len`? 
Is it the same across `dose`? 
What does this tell you?

**ANSWER: (click here to edit)**


<hr>

## Modeling

### Model 1

Start with a model that predicts `len` based on `dose`.

Import `sklearn.linear_model` and `numpy`.

Create a linear regression model.

Train the model to predict `len` based on `dose` using all the data.

Get the $r^2$.
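The create, train, and score steps above might look like the sketch below. It runs on a small made-up dataframe (`toy`) rather than the real data, and the variable names `X`, `y`, `lm`, and `r2` are assumptions for illustration.

```python
import pandas as pd
from sklearn import linear_model

# Made-up rows for illustration only (not the real ToothGrowth values).
toy = pd.DataFrame({
    "dose": [0.5, 0.5, 1.0, 1.0, 2.0, 2.0],
    "len": [6.0, 8.0, 13.0, 15.0, 20.0, 24.0],
})

# scikit-learn expects a 2-D table of predictors, so select the
# single predictor with double brackets to keep it as a dataframe.
X = toy[["dose"]]
y = toy["len"]

# Create the model and train it on all of the rows.
lm = linear_model.LinearRegression()
lm.fit(X, y)

# For a regression model, score() returns r squared on the data given.
r2 = lm.score(X, y)
print(r2)
```

Because the model is scored on the same rows it was trained on, this $r^2$ describes fit to the training data, not performance on new data.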

**QUESTION:**

Do you think this is a good $r^2$?
Why or why not?

**ANSWER: (click here to edit)**


<hr>

### Diagnostics 1

Get the predictions from the model and put them in the dataframe.

**QUESTION:**

How many predictions did you get? Why?

**ANSWER: (click here to edit)**


<hr>

Add the residuals to `dataframe`.
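A minimal sketch of storing predictions and residuals is shown below, again on a made-up dataframe. The column names `predictions1` and `residuals1` are hypothetical, and the residual is taken as observed minus predicted.

```python
import pandas as pd
from sklearn import linear_model

# Made-up rows for illustration only (not the real ToothGrowth values).
toy = pd.DataFrame({
    "dose": [0.5, 0.5, 1.0, 1.0, 2.0, 2.0],
    "len": [6.0, 8.0, 13.0, 15.0, 20.0, 24.0],
})

lm = linear_model.LinearRegression()
lm.fit(toy[["dose"]], toy["len"])

# predict() returns one value per row of the predictors passed in.
toy["predictions1"] = lm.predict(toy[["dose"]])

# Residual = observed value minus predicted value.
toy["residuals1"] = toy["len"] - toy["predictions1"]

print(toy)
```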

**QUESTION:**

How do the residuals compare to the predictions? Why?

**ANSWER: (click here to edit)**


<hr>

Make a figure to check linearity and equal variance **with boxplots**.

**QUESTION:**

Do we have linearity and equal variance? Why?

**ANSWER: (click here to edit)**


<hr>

### Model 2

Create an interaction variable `ds` = `dose` * `supp_VC` and add it to the dataframe.
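An interaction variable is just the row-by-row product of two columns. The sketch below shows this on a made-up dataframe in which `supp_VC` is written out directly as 0/1 values; that is an assumption for illustration, since in your dataframe the dummy comes from the earlier conversion step.

```python
import pandas as pd

# Made-up rows for illustration only; supp_VC stands in for the dummy column.
toy = pd.DataFrame({
    "dose": [0.5, 1.0, 2.0, 0.5, 1.0, 2.0],
    "supp_VC": [0, 0, 0, 1, 1, 1],
})

# Multiply the two columns element by element to form the interaction.
toy["ds"] = toy["dose"] * toy["supp_VC"]

print(toy)
```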

**QUESTION:**

What is the value of `ds` across the dataframe? Why?

**ANSWER: (click here to edit)**


<hr>

Fit the model using `dose`, `supp_VC`, and the interaction `ds`.
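Fitting with several predictors works the same way as with one; only the list of columns changes. The sketch below uses a made-up dataframe and hypothetical names (`predictors`, `lm2`), and also shows how the coefficients of an interaction model can be read as a separate slope for each supplement.

```python
import pandas as pd
from sklearn import linear_model

# Made-up rows for illustration only (not the real ToothGrowth values).
toy = pd.DataFrame({
    "len": [8.0, 15.0, 24.0, 6.0, 13.0, 26.0],
    "dose": [0.5, 1.0, 2.0, 0.5, 1.0, 2.0],
    "supp_VC": [0, 0, 0, 1, 1, 1],
})
toy["ds"] = toy["dose"] * toy["supp_VC"]

# Fit len on dose, the dummy, and their interaction.
predictors = ["dose", "supp_VC", "ds"]
lm2 = linear_model.LinearRegression()
lm2.fit(toy[predictors], toy["len"])

# Coefficients come back in the same order as the predictor columns.
b_dose, b_supp, b_ds = lm2.coef_

# When supp_VC is 0, the slope for dose is b_dose alone.
slope_oj = b_dose
# When supp_VC is 1, the interaction term adds b_ds to the slope.
slope_vc = b_dose + b_ds

print(lm2.intercept_, slope_oj, slope_vc)
```

Reading the model this way makes the purpose of the interaction clear: it lets the effect of `dose` differ between the two supplements instead of forcing a single shared slope.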

### Diagnostics 2

Save the predictions from the model to the dataframe.

**QUESTION:**

What are the values of the new predictions across the dataframe? 
How do they differ between OJ and VC?

**ANSWER: (click here to edit)**


<hr>

Save the residuals from the model to the dataframe.

Plot the predictions vs. the residuals **as a boxplot** to check linearity and equal variance.

**QUESTION:**

Do we have linearity and equal variance? Why?

**ANSWER: (click here to edit)**


<hr>

With this new model, calculate $r^2$.

**QUESTION:**

How does this $r^2$ compare to the model without the interaction?

**ANSWER: (click here to edit)**


<hr>
