# Multiple Linear Regression

<img src="img/mult_reg_dog.jpg" width="400">

## Objectives: 


- Create a multiple linear regression model using statsmodels
- Interpret the output for multiple linear regression
- Explain why multicollinearity is an issue in multiple linear regression
- Evaluate if our variables are showing multicollinearity

### Turn and Talk:

<img src="img/talking.jpeg" width="60" align='left'>

<br />

You are a data scientist for the WMATA. For your first project, they want you to predict the number of metro riders for each day. You decide to build a linear regression model to predict ridership, but you need to gather data first. With a partner, brainstorm a list of variables you think would explain the number of daily riders.


## Predicting MPG

You are working for a car company.  They know that their customers are interested in fuel efficient vehicles.  They have asked you to determine the characteristics of a car that lead to fuel efficiency so they can develop their next car model.  To do your analysis they have given you the MPG dataset which contains data about previous cars.  Use this data to help the car company understand what characteristics their new car model should have.

<img src="img/efficient.jpg" width="500">


In [None]:
# importing modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
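# the bare 'seaborn' matplotlib style name is deprecated (renamed 'seaborn-v0_8'),
# so rely on seaborn's own styling call instead
sns.set(style="white")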

In [None]:
#read in car data
df = sns.load_dataset('mpg')
#examine the head of the dataframe
df.head()

In [None]:
#checking info
df.info()

In [None]:
# some descriptive analysis
df.describe()

## Starting with a Simple Linear Regression Model

In [None]:
# building a simple linear regression model using statsmodels
from statsmodels.formula.api import ols

slr_model = ols(formula='mpg~weight', data=df).fit()
slr_model.summary()

### Turn and Talk:

<p><img src="img/talking.jpeg" width="60" align='left' ></p>
<br />
<br />
<br />


**1. Describe what you think the following things are doing:**

`ols()` 

`formula = 'mpg~weight'` 

`data=df`

`fit()` 

**2.  Is weight a significant predictor of mpg?  How do you know?**

Answer:  

**3.  Describe the impact of weight on mpg using the weight coefficient.**

Answer: 

## Multiple Linear Regression
Multiple linear regression is simply a linear regression with more than one predictor (independent variable). Recall that in simple linear regression, $R^2$ represents the proportion of variance in the outcome explained by the model. What if we make the model more complex by including more predictors so that it accounts for even more of the variance in the outcome?


$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \epsilon$
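
To make the question about $R^2$ concrete, here is a minimal sketch that fits the equation above by least squares with NumPy. The data are synthetic and the helper `fit_ols` is a hypothetical name introduced only for this illustration; none of it comes from the MPG dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic data (illustrative assumption): two predictors and a noisy outcome
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

def fit_ols(predictors, y):
    """Return (coefficients, R^2) for a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2

beta_one, r2_one = fit_ols([x1], y)        # one predictor
beta_two, r2_two = fit_ols([x1, x2], y)    # two predictors
print(f"R^2 with x1 only:   {r2_one:.3f}")
print(f"R^2 with x1 and x2: {r2_two:.3f}")
```

Adding a predictor to a least-squares model can never lower $R^2$ on the data used for fitting, which is one reason a higher $R^2$ alone does not show that the extra predictor belongs in the model.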

### Predicting income example

#### One predictor

In [None]:
np.random.seed(1234)
sen = np.random.uniform(18, 65, 100)
income = np.random.normal((sen/10), 0.5)
sen = sen.reshape(-1,1)

fig = plt.figure(figsize=(7,5))
fig.suptitle('seniority vs. income', fontsize=16)
plt.scatter(sen, income)
plt.plot(sen, sen/10, c = "black")
plt.xlabel("seniority", fontsize=14)
plt.ylabel("monthly income", fontsize=14)
plt.show()

#### Two Predictors

What if we include another factor, such as years of education? Adding a second predictor adds a dimension to the picture: instead of fitting a line through points in two dimensions, we fit a plane through points in three dimensions.
<img src="./img/multi_reg_graph.png" style="width:300px;">

### Multiple regression on our MPG data

In [None]:
df.columns

In [None]:
mlr_model = ols(formula='mpg~weight+horsepower+displacement+cylinders+acceleration', data=df).fit()
mlr_model.summary()

## Interpretation of the Model Parameters
- Each β parameter represents the change in the mean target, E(y), per unit increase in the associated predictor variable **when all the other predictors are held constant**.
- For example, the β for weight indicates that with every one-unit increase in weight, mpg decreases by .0052 on average when horsepower, displacement, cylinders, and acceleration are held constant.
- The intercept term, β0, represents the estimated mean response, E(y), when all the predictors x1, x2, ..., xk are zero (which may or may not have any practical meaning).
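
The "held constant" reading can be checked directly. The sketch below uses a small synthetic dataset (the column names mirror the MPG data, but the numbers are made up for illustration) and compares the predictions for two hypothetical cars that differ by one unit of weight and share the same horsepower.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
# Synthetic stand-in for the car data (assumed values, not the real dataset)
toy = pd.DataFrame({
    "weight": rng.uniform(1500, 5000, n),
    "horsepower": rng.uniform(50, 230, n),
})
toy["mpg"] = 45 - 0.005 * toy["weight"] - 0.04 * toy["horsepower"] + rng.normal(0, 1, n)

toy_model = ols("mpg ~ weight + horsepower", data=toy).fit()

# Two hypothetical cars: weight differs by one unit, horsepower is identical
cars = pd.DataFrame({"weight": [3000.0, 3001.0], "horsepower": [100.0, 100.0]})
pred = toy_model.predict(cars)
diff = pred.iloc[1] - pred.iloc[0]

print("difference in predicted mpg:", diff)
print("weight coefficient:         ", toy_model.params["weight"])
```

The two printed numbers should match up to floating-point error, because the coefficient is exactly the change in the prediction when only that predictor moves by one unit.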


### Turn and Talk:

<p><img src="img/talking.jpeg" width="60" align='left' ></p>
<br />
<br />
<br />

With a classmate interpret the following:

**1. The coefficient and p-value associated with horsepower**

Answer:

**2. The coefficient and p-value associated with cylinders**

Answer: 

**3. The r-squared value**

Answer: 

___

## Inference vs Prediction

### Inference

- Goal: explain the association between outcome and predictors
- Focus on subset of features 
- Emphasis is on coefficients
- Simple models that are easily interpreted are preferred

Example question:  "How do years of education and IQ impact your adult salary level?"

### Prediction

- Goal: Develop a model that best predicts an outcome
- Use all available features
- Emphasis is on overall model accuracy
- More complex, less interpretable models are acceptable

Example question:  "How can I use education and IQ to best predict adult salaries?"

## Multicollinearity 

**Multicollinearity** occurs when predictor variables in a regression model are very highly correlated with one another. This correlation is a problem because the model estimates the effect of each predictor while holding the others constant, which is hard to do when predictors move together. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

### What Problems Does Multicollinearity Cause?

Multicollinearity causes the following two basic types of problems:

- The **coefficient estimates can swing wildly** based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
- Multicollinearity **reduces the precision of the estimated coefficients**, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.
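
The loss of precision described above can be seen in a small simulation. This is a sketch on synthetic data, with `coef_and_se` as a hypothetical helper that computes the usual OLS standard errors; it compares the standard error of the same coefficient when its partner predictor is unrelated versus nearly identical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)                      # unrelated to x1
x_collin = x1 + rng.normal(scale=0.05, size=n)    # nearly a copy of x1
noise = rng.normal(size=n)

def coef_and_se(x_a, x_b, y):
    """OLS coefficients and standard errors for y ~ 1 + x_a + x_b."""
    X = np.column_stack([np.ones(len(y)), x_a, x_b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

_, se_indep = coef_and_se(x1, x_indep, 1 + 2 * x1 + 2 * x_indep + noise)
_, se_collin = coef_and_se(x1, x_collin, 1 + 2 * x1 + 2 * x_collin + noise)

print("SE of x1 coefficient, independent partner:", se_indep[1])
print("SE of x1 coefficient, collinear partner:  ", se_collin[1])
```

With the nearly duplicated predictor, the standard error of the `x1` coefficient is many times larger, which is why its p-value becomes hard to trust.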

#### Detecting Multicollinearity

**1.  Review scatterplot and correlation matrix**

Review a scatterplot matrix and a correlation matrix to see which predictors are highly correlated with one another.  NOTE:  We want the predictors to be related to the target, not one another!

**2.  Review the variance inflation factor (VIF)**

VIF measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related.
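
The standard definition is $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes it by hand with NumPy on synthetic data; `vif_by_hand` is a hypothetical helper written for this illustration, not a statsmodels function.

```python
import numpy as np

def vif_by_hand(X):
    """VIF for each column of X (n_samples x n_features, no intercept column)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.3, size=500)   # strongly related to a
c = rng.normal(size=500)                  # unrelated to a and b
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

A VIF near 1 means a predictor is essentially unrelated to the others. Common rules of thumb treat values above roughly 5 or 10 as a warning sign, though the cutoff is a convention rather than a hard rule.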

<img src="img/vif.png" width="400" align='left' >

In [None]:
#scatterplot matrix
sns.pairplot(df)

In [None]:
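# Compute the correlation matrix on the numeric columns only
# (recent pandas versions raise an error if text columns such as 'origin' and 'name' are included)
corr = df.select_dtypes('number').corr()
corr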

In [None]:
#create a heatmap to visualize the correlations between the numeric columns
sns.heatmap(df.select_dtypes('number').corr(), cmap='bwr', center=0, annot=True)

Even more examples to make your correlation heatmap look good
https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec
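
Reading a heatmap by eye gets harder as the number of columns grows. As a complement, this sketch lists the pairs of numeric columns whose absolute correlation exceeds a cutoff. The helper name `high_corr_pairs`, the toy data, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(4)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),   # nearly a copy of a
    "c": rng.normal(size=200),                  # unrelated
    "label": ["x"] * 200,                       # text column, ignored
})
pairs = high_corr_pairs(toy)
print(pairs)
```

The same function could be called on `df` from this notebook to get a short list of candidate pairs to look at more closely.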

In [None]:
#examining VIF scores
from statsmodels.stats.outliers_influence import variance_inflation_factor

# keep only the numeric predictors used in the model
predictors = df.drop(columns=['mpg', 'model_year', 'origin', 'name'])
# horsepower has missing values, and VIF cannot be computed with NaNs
predictors = predictors.dropna().astype(float)
# add a constant column so each VIF comes from a regression with an intercept
predictors['Intercept'] = 1.0
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])]
vif['features'] = predictors.columns
vif

### Do I Have to Fix Multicollinearity?

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:

- The severity of the problems increases with the **degree of the multicollinearity**. Therefore, if you have only moderate multicollinearity, you may not need to resolve it.
- Multicollinearity affects only the **specific predictor variables** that are correlated. Therefore, if multicollinearity is not present for the predictor variables that you are particularly interested in, you may not need to resolve it. 
- Multicollinearity **affects the coefficients and p-values, but it does not influence the predictions**, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
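
The last point can be illustrated with a short simulation. This sketch uses synthetic data in which `x2` is nearly a copy of `x1`, fits the model with and without `x2`, and compares both the `x1` coefficient and the fitted values. The variable names and the data are assumptions for illustration, and the comparison is made on the data used for fitting.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 3 + 2 * x1 + 2 * x2 + rng.normal(size=n)

def fit_predict(X, y):
    """Least-squares coefficients and fitted values for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

ones = np.ones(n)
beta_full, pred_full = fit_predict(np.column_stack([ones, x1, x2]), y)
beta_drop, pred_drop = fit_predict(np.column_stack([ones, x1]), y)
pred_corr = np.corrcoef(pred_full, pred_drop)[0, 1]

print("x1 coefficient with both predictors:", beta_full[1])
print("x1 coefficient with x2 dropped:     ", beta_drop[1])
print("correlation between the two sets of fitted values:", pred_corr)
```

The `x1` coefficient changes a great deal between the two fits, since it absorbs the effect of `x2` once `x2` is removed, while the fitted values stay almost identical.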

***That being said, the easiest way to deal with multicollinearity is just to remove one of the variables from your model***

### Rerun the model after removing the highly correlated variables

Let's take out displacement since it was highly correlated to other predictor variables and it had the highest VIF score of the predictor variables.

In [None]:
mlr_model = ols(formula='mpg~weight+horsepower+cylinders+acceleration', data=df).fit()
mlr_model.summary()

### Turn and Talk:

<p><img src="img/talking.jpeg" width="60" align='left' ></p>
<br />
<br />
<br />


With a classmate, answer the following questions:

**1.  Did the r-squared values change from our previous model?**

**2.  Re-examine the coefficient and p-value for cylinders.  Are they different from our previous model?  How different?**

---

## Categorical Predictors

What if we were also interested in knowing if cars made in the USA were more fuel efficient than those made elsewhere???

Well we can **dummy code** that variable and use it as a predictor in addition to our continuous variables!!  Dummy coding is when we convert each category into a new column, and assign a 1 or 0 to the column.  

Since we only care whether a car was made in the USA or elsewhere, we can create a single new column, USA, containing 1 when the car was made in the USA and 0 when it was made elsewhere.

In [None]:
df['USA'] = df['origin'].apply(lambda x:  1 if x == "usa"  else 0)
df.sample(5)

**Now that we have our dummy coding done let's use our USA variable in our model.**

In [None]:
#model including USA
mlr_model = ols(formula='mpg~weight+horsepower+cylinders+acceleration+USA', data=df).fit()
mlr_model.summary()

#### Dummy variable coefficient interpretation

You can see that we have a coefficient and a p-value associated with USA, just like our other predictor variables. The coefficient for a dummy variable tells us the average difference in the target between the group coded 1 and the **reference group**. The reference group is the category that is not represented explicitly by a dummy variable. In this case, our reference group is cars **not made in the USA**.

The coefficient of -2.0445 indicates that cars made in the USA get 2.0445 fewer mpg on average than those not made in the USA, **holding all other predictors constant**.  The p-value of .001 indicates that this is a statistically significant difference.
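
To see the reference-group idea in isolation, here is a minimal sketch with a tiny made-up dataset (the mpg values are invented for illustration). When the dummy variable is the only predictor, the intercept is the reference group's mean and the dummy coefficient is the difference between the two group means.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Tiny invented dataset, purely for illustration
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "europe", "europe", "europe"],
    "mpg":    [18.0, 20.0, 22.0, 30.0, 32.0, 26.0, 27.0, 28.0],
})
toy["USA"] = (toy["origin"] == "usa").astype(int)

dummy_model = ols("mpg ~ USA", data=toy).fit()

mean_usa = toy.loc[toy["USA"] == 1, "mpg"].mean()
mean_other = toy.loc[toy["USA"] == 0, "mpg"].mean()

print("intercept:      ", dummy_model.params["Intercept"], "| reference-group mean:", mean_other)
print("USA coefficient:", dummy_model.params["USA"], "| difference in means:", mean_usa - mean_other)
```

Once other predictors are in the model, as in the notebook, the dummy coefficient is no longer the raw difference in means but the difference after adjusting for those predictors.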

---

## Back to the business problem

Remember, you were hired by the car company to determine the characteristics of a car that lead to fuel efficiency so they can develop their next car model.  Based on our modeling, what might you recommend to the company?

---

## Practice time!

Using `cleaned_movie_data.csv`, run a multiple linear regression model to predict gross revenue.  Start with 3 continuous variables and then add a categorical predictor to LEVEL UP.

## Resources

Everything about regression:  https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-tutorial-and-examples

Statsmodels example: https://datatofish.com/statsmodels-linear-regression/