## Overview

In the previous module, we fit a linear regression model to two variables in our car crash data set: total accidents and alcohol impairment. We found that there was a significant relationship between the two variables, and could reject the null hypothesis.

In this module, we're going to look at how adding multiple predictor variables to a linear regression affects the outcome. Can we improve the linear regression model by adding in more predictor variables? Let's load in the data, fit the model, and look at the results.

## Follow Along

For this module, we'll look at the whole data set again, instead of just focusing on two variables.

In [1]:
import pandas as pd
import seaborn as sns

# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")

crashes.head()

| | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses | abbrev |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.8 | 7.332 | 5.64 | 18.048 | 15.04 | 784.55 | 145.08 | AL |
| 1 | 18.1 | 7.421 | 4.525 | 16.29 | 17.014 | 1053.48 | 133.93 | AK |
| 2 | 18.6 | 6.51 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 | AZ |
| 3 | 22.4 | 4.032 | 5.824 | 21.056 | 21.28 | 827.34 | 142.39 | AR |
| 4 | 12.0 | 4.2 | 3.36 | 10.92 | 10.68 | 878.41 | 165.63 | CA |


We'll fit our model using `alcohol` as the independent variable and `total` as the dependent variable. 

In [2]:
# Import the OLS model from statsmodels
from statsmodels.formula.api import ols

# Set-up and fit the model in one step
# (format Y ~ X)
model = ols('total ~ alcohol', data=crashes).fit()

# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  total   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     130.5
Date:                Wed, 21 Apr 2021   Prob (F-statistic):           2.04e-15
Time:                        13:48:00   Log-Likelihood:                -110.99
No. Observations:                  51   AIC:                             226.0
Df Residuals:                      49   BIC:                             229.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.8578      0.921      6.357      0.0

### R-squared

Now we're going to look at a new result in our model summary: *R-squared*. This term is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable (or variables) in a regression model. For our data, the R-squared value is the proportion of the variance for our variable `total` that is explained by our independent variable `alcohol`. 

Reading from the table, we have an R-squared value of 0.727, or about 73% (this value is a proportion, so we can express it as a percentage). So 73% of the variance in total accidents is explained by alcohol impairment, but what about the other 27%? Looking at the data we loaded, we can see there are other variables, including `speeding`, `not_distracted`, and `ins_premium`. Let's add in one of these other variables and see how it impacts the model and R-squared.
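To see where this number comes from, here is a minimal sketch that computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses only the five rows displayed by `crashes.head()` above, so it will not reproduce the 0.727 from the full 51-row fit; the variable names are chosen just for illustration.

```python
# First five rows shown by crashes.head() above (not the full data set)
alcohol = [5.640, 4.525, 5.208, 5.824, 3.360]
total = [18.8, 18.1, 18.6, 22.4, 12.0]

n = len(total)
mean_x = sum(alcohol) / n
mean_y = sum(total) / n

# Least-squares slope and intercept for total ~ alcohol
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, total))
    / sum((x - mean_x) ** 2 for x in alcohol)
)
intercept = mean_y - slope * mean_x

# Values the fitted line predicts for each observation
predicted = [intercept + slope * x for x in alcohol]

# Variation left unexplained by the line vs. total variation in y
ss_res = sum((y - p) ** 2 for y, p in zip(total, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in total)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared on these five rows: {r_squared:.3f}")
```

For a fitted statsmodels model, the same quantity is available directly as `model.rsquared`.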

### Multiple Linear Regression

For a single variable linear regression the equation was:

Single variable regression model: $$ y = \beta_0 + \beta_1x $$

To add in other variables, we add additional terms:

Multiple variable regression model: $$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \dots $$
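To make the extra terms concrete, the sketch below estimates the coefficients of a two-predictor model with plain least squares. It swaps in NumPy's `np.linalg.lstsq` rather than the statsmodels `ols` call used in this module, and it again uses only the five rows shown by `crashes.head()`, so the coefficients are illustrative and will differ from a fit on the full data set.

```python
import numpy as np

# First five rows shown by crashes.head() above (not the full data set)
total = np.array([18.8, 18.1, 18.6, 22.4, 12.0])
alcohol = np.array([5.640, 4.525, 5.208, 5.824, 3.360])
speeding = np.array([7.332, 7.421, 6.510, 4.032, 4.200])

# Design matrix: a column of ones for the intercept,
# then one column per predictor variable
X = np.column_stack([np.ones_like(total), alcohol, speeding])

# Least-squares estimates of beta_0, beta_1, beta_2
betas, *_ = np.linalg.lstsq(X, total, rcond=None)

# Predictions from the matrix form ...
predicted = X @ betas
# ... and from the equation written out term by term
by_hand = betas[0] + betas[1] * alcohol + betas[2] * speeding

print("coefficients:", betas)
print("same predictions:", np.allclose(predicted, by_hand))
```

Each additional predictor simply adds one more column to the design matrix and one more coefficient to estimate.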

Let's look at a scatter plot where we visualize another variable. For this data, it makes sense to also look at the `ins_premium` variable, which is the amount that drivers paid for their car insurance premiums. If a driver has a lot of accidents, we would expect an increase in insurance premiums.

In [3]:
import matplotlib.pyplot as plt

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(14,6))

# Compare the two independent variables to each other - are they related?
sns.scatterplot(x='alcohol', y='ins_premium', data=crashes, s=50, ax=ax1)
# The color now represents the insurance premium (ins_premium)
sns.scatterplot(x='alcohol', y='total', hue='ins_premium', data=crashes, s=50, palette='magma', ax=ax2);

#plt.show()
plt.clf()

<Figure size 1008x432 with 0 Axes>

![mod3_obj1_2vars_new.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_1/sprint_3/new/mod3_obj1_2vars_new.png)

In the plot on the left, there isn't much of a relationship between our two independent variables: alcohol impairment and insurance premiums don't appear to be strongly correlated. In the plot on the right, we have our independent variable (`alcohol`) on the x-axis and the dependent variable (`total`) on the y-axis. The insurance premium variable (`ins_premium`) is shown on the same axes as the color of each point, so we can look for any pattern.

Here's what the equation would look like with an additional variable:

$$ y = \beta_0 + \beta_1 \cdot \text{alcohol} + \beta_2 \cdot \text{ins_premium} $$

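Now, let's fit the model with two independent variables. The code below adds `speeding` as the second predictor; swapping in `ins_premium` only requires changing the name in the formula.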

In [4]:
# Set-up and fit the model in one step
# (format Y ~ X1 + X2)
model = ols('total ~ alcohol + speeding', data=crashes).fit()

# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  total   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.719
Method:                 Least Squares   F-statistic:                     64.87
Date:                Wed, 21 Apr 2021   Prob (F-statistic):           2.27e-14
Time:                        13:48:00   Log-Likelihood:                -110.71
No. Observations:                  51   AIC:                             227.4
Df Residuals:                      48   BIC:                             233.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.6807      0.957      5.934      0.0

Now that we've added another variable, the model summary has an additional line for `speeding`, which includes the value of the coefficient (remember, this is the slope parameter), the standard error, the t-statistic, and the p-value. R-squared rose only slightly, from 0.727 to 0.730.

In the next objective, we're going to explore the t-value and p-value for the additional variables in our model.

## Challenge

There are more variables that could be added to this model! We still haven't explored the `no_previous`, `not_distracted`, and `ins_premium` variables. Try adding a different variable in place of `speeding` and then look at the R-squared value. How does it change? In the next objective in this module, we'll look more closely at the p-value.
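When comparing models with different numbers of predictors, it can also help to look at the `Adj. R-squared` line of each summary. The sketch below applies the standard adjusted R-squared formula to the R-squared values and observation count printed in the two summaries above; the function name `adjusted_r_squared` is just for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the two summaries above (51 observations in both)
one_predictor = adjusted_r_squared(0.727, n_obs=51, n_predictors=1)   # total ~ alcohol
two_predictors = adjusted_r_squared(0.730, n_obs=51, n_predictors=2)  # total ~ alcohol + speeding

print(f"one predictor:  {one_predictor:.4f}")
print(f"two predictors: {two_predictors:.4f}")
```

Because the inputs are already rounded to three decimals, the results agree with the printed 0.721 and 0.719 only approximately. The adjusted value drops slightly for the second model even though R-squared rises, since the adjustment penalizes each added predictor. For a fitted statsmodels model, this value is available as `model.rsquared_adj`.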

## Additional Resources

* [Kaggle: Bad drivers dataset](https://www.kaggle.com/fivethirtyeight/fivethirtyeight-bad-drivers-dataset?select=README.md)