# SECTION 19: MULTIPLE LINEAR REGRESSION

- online-ds-ft-070620
- 08/20/20

## Announcements

- **Questions re: YouTube Playlist Video Changes to reduce delay**
    - No title screens (just started)
    - 720p Video (have been doing for a while. Has anyone had issues when watching study groups?)
    

    

## Resources:

- **[OSEMN Data Science Workflow Notebook](https://github.com/jirvingphd/fsds-osemn-workflow)**
    - `student_OSEMN.ipynb`: also included in notes repo

## LEARNING OBJECTIVES

- Learn how to expand our last lesson to include multiple independent variables.
- Learn ways to deal with categorical variables.
- Learn about multicollinearity of features
- Learn about how to improve a baseline model based on results
- Learn how to run a multiple regression using statsmodels

<!-- ### TOPICS:

#### Part 1 
- Multiple Linear Regression
- Dealing with Categorical Variables
- Multicollinearity of Features
- Multiple Linear Regression in Statsmodels

#### Part 2
- Feature Scaling & Normalization
- Model Fit and Validation/Cross Validation -->

## Questions?



- Introduction to Cross-Validation - Lab - Cross-Validation using Scikit-Learn solution

- Multiple Linear Regression in Statsmodels - Lab 
    - is there a way to easily create the regression equation when there are so many coefficients?


- Can we possibly work through a full practical linear regression workflow from start to finish? 
    - Stats testing the data for assumptions, splitting, training, testing, validation, and inference or prediction results to get a fuller understanding of the realistic linear regression workflow as it pertains to the job.
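
On the question about writing out the regression equation when there are many coefficients: one option is to build the equation string programmatically instead of typing it by hand. The sketch below assumes the coefficients are available as a name-to-value mapping (for a fitted statsmodels model, `model.params.to_dict()` should provide one). The helper name `make_equation` and the example values are hypothetical.

```python
def make_equation(params, target="y", intercept_names=("const", "Intercept"), precision=2):
    """Build a readable regression equation from a {name: coefficient} mapping."""
    # Start with the intercept (0 if the mapping has none)
    intercept = 0.0
    for name in intercept_names:
        if name in params:
            intercept = params[name]
    terms = [f"{intercept:.{precision}f}"]

    # Append one signed term per remaining coefficient
    for name, coef in params.items():
        if name in intercept_names:
            continue
        sign = "+" if coef >= 0 else "-"
        terms.append(f"{sign} {abs(coef):.{precision}f}*{name}")

    return f"{target} = " + " ".join(terms)


# Hypothetical coefficients, for illustration only (not fitted values)
example_params = {"const": 18569.03, "GrLivArea": 107.13, "LotArea": -0.5}
equation = make_equation(example_params, target="SalePrice")
print(equation)
```

The intercept is looked up under both `const` (the column added by `sm.add_constant`) and `Intercept` (the name used by the formula API), so the same helper should work with either style of model.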


# PREVIOUSLY ON...

## Single Linear Regression

- We discussed how always predicting the mean of y is our baseline (worst-case) model.

- We discussed the assumptions for a linear regression:
    - Linear relationship between predictor and target variable.
    - Residuals (error terms) are normally distributed.
    - Homoskedasticity (the variance of the residuals is constant).
    
- We learned how to run a single regression in statsmodels

In [1]:
!pip install -U fsds
from fsds.imports import *

fsds v0.2.23 loaded.  Read the docs: https://fs-ds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


[i] Pandas .iplot() method activated.


In [2]:
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [10,6]

In [3]:
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import scipy.stats as stats

import warnings
# warnings.filterwarnings('ignore')

In [4]:
## Load in ames dataset
df = fs.datasets.load_ames_train(subset=True)

## Preview Data
# display(df.head());

## Save Columns of Interest
X = df['GrLivArea'].copy()
y = df['SalePrice'].copy()

In [None]:

## Formatting 
priceFmt = mpl.ticker.StrMethodFormatter("${x:,.0f}")

## Scatter Plots for Linearity Check
def plot_data(X,y,xlabel='GrLivArea',ylabel='SalePrice'):
    
    fig, ax = plt.subplots()
    
    ax.scatter(X,y,marker='.')
    
    ax.set(xlabel=xlabel,ylabel=ylabel)
    ax.set_title(f'{xlabel} vs {ylabel}')
    
    ax.yaxis.set_major_formatter(priceFmt)
    return fig,ax

plot_data(X,y)

In [None]:
## Check Linearity 
ax = sns.regplot(x=X, y=y, line_kws={'color':'green','ls':'--'})

In [None]:
## Check for outliers
from scipy import stats
def find_outliers_z(data):
    zFP = np.abs(stats.zscore(data))
    zFP = pd.Series(zFP, index=data.index)
    idx_outliers = zFP > 3
    return idx_outliers

In [None]:
## Get X outliers
X_outliers = find_outliers_z(X)
X_outliers.sum()

In [None]:
## Get y outliers
y_outliers = find_outliers_z(y)
y_outliers.sum()

In [None]:
# ## Make a DataFrame of Outliers
# df_outliers = pd.DataFrame({'X':X_outliers,
#                            'y':y_outliers})
# df_outliers['any'] = df_outliers.any(axis=1)
# ## Add column of any outliers
# df_outliers['any'].value_counts()

# idx_outliers = df_outliers['any'].copy()
# ~idx_outliers

In [None]:
len(X)

In [None]:
idx_outliers = np.any(np.stack([X_outliers,y_outliers],axis=1), axis=1)
idx_outliers.shape

In [None]:
## Create X_clean and y_clean without outliers
X_clean = X[~idx_outliers].copy()
y_clean = y[~idx_outliers].copy()

## Check data with plot_data
plot_data(X_clean,y_clean)

In [None]:
## Turn code above for checking normality into a function 
def check_normality(X,y):
    ## Visualize Distributions of X and y and Check Normality
    fig, axes = plt.subplots(ncols=2,figsize=(10,5))

    sns.distplot(X, ax=axes[0], kde=False,bins='auto')
    axes[0].set(xlabel='X')

    sns.distplot(y, ax=axes[1], kde=False,bins='auto',color='orange')
    axes[1].set(xlabel='y')

    print("X Normality: ",stats.normaltest(X))
    print("y Normality: ",stats.normaltest(y))
    return fig, axes


In [None]:
check_normality(X_clean,y_clean)

### OLS with Statsmodels (non-formula version)

In [None]:
from statsmodels.regression.linear_model import OLS
## Add a constant to X to include an intercept in our regression
X_clean = sm.add_constant(X_clean)
display(X_clean)

In [None]:
## Make an OLS linear model using the cleaned X and y (outliers removed)
model = OLS(y_clean,X_clean).fit()
## Check model summary
model.summary()

In [None]:
fig = sm.graphics.qqplot(model.resid,dist=stats.norm,fit=True,line='45')

### Not Covered Previously

In [None]:
fig = sm.graphics.plot_regress_exog(model, "GrLivArea", fig=plt.figure(figsize=(12,8)))

## Statsmodels OLS - Formula Version

In [None]:
## Make our formula-based Regression
import statsmodels.formula.api as smf
df_clean = df[~idx_outliers].copy()
df_clean

# Multiple Linear Regression

## Single Regression
 $$y=mx+b$$

 $$y = \beta_1 x_1 + \beta_0 $$

<br><br>
## Multiple Predictor/X Variables

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\ldots + \beta_n x_n $$

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-multiple-linear-regression-online-ds-ft-100719/master/images/multiple_reg.png" width=400>

#### $\hat Y$ vs $Y$


- Y: Actual value corresponding to a specific X value

- "Y hat" ($\hat Y$): Value predicted by the model for a specific X value.


$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 

where $n$ is the number of predictors, $\beta_0$ is the intercept, and $\hat y$ is the so-called "fitted line" or the predicted value associated with the dependent variable.
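
As a quick worked example of the equation above, the sketch below computes $\hat y$ for a single house with two predictors. The coefficient values are made up for illustration and are not fitted to the Ames data.

```python
# Hypothetical coefficients (NOT fitted values), for illustration only
intercept = 10_000.0                            # beta_0
coefs = {"GrLivArea": 100.0, "LotArea": 2.0}    # beta_1, beta_2

# One observation: the x values for a single house
house = {"GrLivArea": 1_500, "LotArea": 8_000}

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2
y_hat = intercept + sum(coefs[name] * house[name] for name in coefs)
print(f"Predicted SalePrice: ${y_hat:,.0f}")
```

Each coefficient is the change in the predicted target for a one-unit increase in its predictor, holding the other predictors fixed.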

In [None]:
pd.set_option('display.max_columns',0)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [None]:
# Load in ames dataset
df = fs.datasets.load_ames_train(subset=False)
columns = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice',
           'BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
df = df[columns].copy()
df.head()

# DEALING WITH CATEGORICAL VARIABLES

- What are categorical variables?
- Understand creating dummy variables for predictors.
- Use pandas and Scikit-Learn to create dummies
- Understand and avoid the "dummy variable trap"

## What are categorical variables?
- Variables whose values are group labels rather than continuous numeric quantities (e.g. `BldgType`, `Neighborhood`).

## Identifying categorical variables:
What to look for?
1. Column dtype is 'object'
2. Use `df.describe()` -  check for min/max. Are they integers?
3. Use scatterplots & histograms - look for datapoints stacked in vertical columns (a sign of discrete values)
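
The first two checks can be sketched with pandas as follows. The small DataFrame is made up for illustration (its column names are borrowed from the Ames data), and the cutoff of 3 unique values is an arbitrary assumption chosen to suit this 5-row toy example.

```python
import pandas as pd

# Tiny made-up DataFrame, for illustration only
demo = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "BldgType": ["1Fam", "1Fam", "Duplex", "TwnhsE", "1Fam"],
    "OverallQual": [7, 6, 7, 7, 8],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# 1. Non-numeric columns (strings, i.e. dtype 'object') are categorical
str_cols = demo.select_dtypes(exclude="number").columns.tolist()
num_cols = demo.select_dtypes(include="number").columns.tolist()

# 2. Numeric columns with very few unique values may be categories in disguise
max_unique = 3  # arbitrary cutoff for this toy data; use a larger one on real data
maybe_categorical = [col for col in num_cols if demo[col].nunique() <= max_unique]

print("String columns:", str_cols)
print("Numeric columns:", num_cols)
print("Numeric but possibly categorical:", maybe_categorical)
```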

In [None]:
## Separate List of Numeric vs Str Columns


In [None]:
## Inspect the Value Counts for Each Str Col


In [None]:
## Visualize Num Cols vs Target and Num Col Distributions


## Transforming Categorical Variables

To use categorical variables in a regression, they must first be converted to numbers.
There are two methods for dealing with them:
1. Label Encoding
    - Replace string categories with integer codes (0 to n-1 for n categories)
    - Can be done with:
        1. Pandas 
        2. Scikit Learn

2. One-hot / dummy encoding
    - Turn each category of a categorical variable into its own column containing only 0s and 1s: 1 for rows that belong to that category, 0 for rows that do not.
    - Can be done with:
        1. Pandas
        2. Scikit Learn
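
Both methods can be sketched with pandas on a small made-up column (the `BldgType` values below are illustrative, not the real value counts):

```python
import pandas as pd

# Made-up example column, for illustration only
bldg = pd.Series(["1Fam", "Duplex", "1Fam", "TwnhsE"], name="BldgType")

# 1. Label encoding: each category becomes an integer code (0 to n-1)
codes = bldg.astype("category").cat.codes

# 2. One-hot / dummy encoding: each category becomes its own 0/1 column
dummies = pd.get_dummies(bldg, prefix="BldgType", dtype=int)

print(codes.tolist())
print(dummies)
```

Label encoding imposes an order on the categories (here alphabetical), so it suits ordinal variables best. One-hot encoding makes no such assumption, at the cost of adding one column per category.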


### Label Encoding

In [None]:
## Check the Value Counts for our test column - "BldgType"


#### Via pandas.cat.codes

In [None]:
## Label Encode with .cat.codes


#### Via Sklearn's LabelEncoder

In [None]:
## Using sklearn LabelEncoder
from sklearn.preprocessing import LabelEncoder


### Dummy Encoding / One-Hot Encoding

#### Via Pandas.get_dummies()

#### Via Scikit-Learn's LabelBinarizer

In [None]:
from sklearn.preprocessing import LabelBinarizer


### The Dummy Variable Trap
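
When the model includes an intercept, the full set of dummy columns for one variable always sums to 1, which duplicates the constant column (perfect multicollinearity). Dropping one category per variable avoids this. A minimal sketch, using a made-up column and NumPy's matrix rank to show the redundancy:

```python
import numpy as np
import pandas as pd

# Made-up example column, for illustration only
bldg = pd.Series(["1Fam", "Duplex", "1Fam", "TwnhsE", "Duplex"], name="BldgType")

# All k dummy columns plus a constant: the dummies add up to the constant column
full = pd.get_dummies(bldg, dtype=float)
full.insert(0, "const", 1.0)
rank_full = np.linalg.matrix_rank(full.to_numpy())

# Dropping the first category removes the redundancy
reduced = pd.get_dummies(bldg, drop_first=True, dtype=float)
reduced.insert(0, "const", 1.0)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(f"full:    {full.shape[1]} columns, rank {rank_full}")
print(f"reduced: {reduced.shape[1]} columns, rank {rank_reduced}")
```

A rank lower than the number of columns means at least one column is a linear combination of the others, so the coefficients cannot be uniquely estimated. The dropped category becomes the reference level that the remaining dummy coefficients are compared against.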


## Testing Out Our Model With Encoded Categorical Data

In [None]:
## Dummy Encode categorical vars


In [None]:
## EITHER USE FORMULA OLS or normal OLS 


In [None]:
## Non-Formula OLS


In [None]:
## View Model Summary

# Multicollinearity
- An additional concern to check for.
- Rule of thumb: an absolute correlation above 0.70 between two predictors is too high.
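
A minimal sketch of applying this rule of thumb: compute the absolute correlation matrix and list the feature pairs above 0.70. The data below is made up for illustration; only the column names come from the Ames data.

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "1stFlrSF": [900, 1400, 2100, 2400, 3100],
    "LotArea": [9000, 7000, 12000, 6000, 10000],
})

# Absolute correlations, since strong negative correlations matter too
corr = demo.corr().abs()

# Collect each pair once (upper triangle only, skipping the diagonal)
threshold = 0.70
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], round(corr.iloc[i, j], 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

For any pair that is flagged, a common remedy is to keep only one of the two features in the model.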


In [None]:
## Get the correlation matrix for our model_df


In [None]:
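# Checking Visually 

def multiplot(df, figsize=(10, 8)):
    """Heatmap of absolute pairwise correlations (one possible implementation)."""
    corr = df.select_dtypes('number').corr().abs()
    ## Hide the redundant upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(corr, mask=mask, annot=True, cmap='Reds', ax=ax)
    return fig, ax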

