# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log and scale the selected continuous features

In [53]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
ames.head()

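| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |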


## Continuous Features

In [28]:
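normalized = []
for col in continuous:
    # Natural log of each continuous feature
    log_col = np.log(ames[col])
    # Operator precedence: this evaluates as log_col - (min / max) - min,
    # so each logged column is shifted by a constant instead of being
    # min-max scaled to [0, 1]; the output below shows values above 1
    scaled = log_col - log_col.min() / log_col.max() - log_col.min()
    normalized.append(scaled)
norm = pd.concat(normalized, axis=1)
norm.head()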

| | LotArea | 1stFlrSF | GrLivArea | SalePrice |
|---|---|---|---|---|
| 0 | 1.287894 | 0.253714 | 0.960366 | 1.014593 |
| 1 | 1.415491 | 0.641897 | 0.65657 | 0.87591 |
| 2 | 1.574096 | 0.325818 | 1.003851 | 1.084065 |
| 3 | 1.410269 | 0.369418 | 0.964451 | 0.616296 |
| 4 | 1.811186 | 0.544604 | 1.21142 | 1.196115 |
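
For reference, min-max scaling of a log-transformed column is usually written as (x - min) / (max - min), with the numerator and the denominator each in parentheses. The sketch below shows one way to do that, alongside standardization; it is not the lab's reference solution. The helper names `log_min_max` and `log_standardize` are made up for illustration, and the toy frame reuses the five `LotArea` and `SalePrice` values shown in the `ames.head()` output.

```python
import numpy as np
import pandas as pd

def log_min_max(df, columns):
    """Log-transform the given columns, then rescale each to the [0, 1] range."""
    logged = np.log(df[columns])
    # Parentheses matter: subtract the minimum first, then divide by the range
    return (logged - logged.min()) / (logged.max() - logged.min())

def log_standardize(df, columns):
    """Log-transform the given columns, then center to mean 0 and unit sample std."""
    logged = np.log(df[columns])
    return (logged - logged.mean()) / logged.std()

# Toy data: the first five LotArea and SalePrice values from ames.head()
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})

scaled = log_min_max(toy, ['LotArea', 'SalePrice'])
standardized = log_standardize(toy, ['LotArea', 'SalePrice'])
print(scaled.round(3))
```

On the full dataset, the same helper could be called as `log_min_max(ames, continuous)`.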


## Categorical Features

In [20]:
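dummies = []
for col in categoricals:
    # One-hot encode, dropping the first level to avoid perfect
    # collinearity with the intercept
    encoded = pd.get_dummies(ames[col], prefix=col, drop_first=True)
    dummies.append(encoded)
    # In-place increment: the frame already stored in `dummies` is shifted
    # as well, so the indicators below are coded 1/2 instead of the usual 0/1
    encoded += 1
dummified = pd.concat(dummies, axis=1)
dummified.head()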

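| | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | KitchenQual_Fa | KitchenQual_Gd | KitchenQual_TA | SaleType_CWD | SaleType_Con | SaleType_ConLD | ... | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |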
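
`pd.get_dummies` can also encode several columns in one call through its `columns` argument, which avoids the explicit loop. The sketch below uses a small made-up frame (the rows are illustrative, not taken from `ames.csv`) and passes `dtype=int` so the indicators come out as 0/1 integers.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'BldgType': ['1Fam', 'Duplex', '1Fam', 'TwnhsE'],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Encode both columns at once; drop_first removes one reference level per feature
dummified_toy = pd.get_dummies(
    toy, columns=['BldgType', 'Street'], drop_first=True, dtype=int
)
print(dummified_toy)
```

With the lab's data, the equivalent call would be `pd.get_dummies(ames[categoricals], drop_first=True, dtype=int)`.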


## Combine Categorical and Continuous Features

In [59]:
preprocessed = norm.join(dummified)
preprocessed.columns

Index(['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice', 'BldgType_2fmCon',
       'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
       'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'SaleType_CWD',
       'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw',
       'SaleType_New', 'SaleType_Oth', 'SaleType_WD', 'MSZoning_FV',
       'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'Street_Pave',
       'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
       'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
       'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
       'Neighborhood_StoneB

## Run a linear model with SalePrice as the target variable in statsmodels

In [62]:
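# Separate the predictors from the target
X = preprocessed.drop('SalePrice', axis=1)
y = preprocessed['SalePrice']
# sm.OLS does not add an intercept on its own, so add a constant column
X_int = sm.add_constant(X)
model = sm.OLS(y, X_int).fit()
model.summary()

# Alternative using the formula API (left commented out). Column names such as
# 1stFlrSF are not valid Python identifiers and would need quoting with Q('1stFlrSF').
# outcome = 'SalePrice'
# predictors = preprocessed.drop('SalePrice', axis=1)
# pred_sum = '+'.join(predictors.columns)
# formula = outcome + '~' + pred_sum
# model = ols(formula=formula, data=preprocessed).fit()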

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.839 |
| Model: | OLS | Adj. R-squared: | 0.834 |
| Method: | Least Squares | F-statistic: | 156.5 |
| Date: | Thu, 01 Jul 2021 | Prob (F-statistic): | 0.0 |
| Time: | 12:51:45 | Log-Likelihood: | 601.65 |
| No. Observations: | 1460 | AIC: | -1107.0 |
| Df Residuals: | 1412 | BIC: | -853.6 |
| Df Model: | 47 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.4329 | 1.273 | 1.126 | 0.261 | -1.064 | 3.930 |
| LotArea | 0.0797 | 0.015 | 5.475 | 0.000 | 0.051 | 0.108 |
| 1stFlrSF | 0.1724 | 0.020 | 8.584 | 0.000 | 0.133 | 0.212 |
| GrLivArea | 0.4513 | 0.019 | 24.114 | 0.000 | 0.415 | 0.488 |
| BldgType_2fmCon | -0.0685 | 0.032 | -2.173 | 0.030 | -0.130 | -0.007 |
| BldgType_Duplex | -0.1679 | 0.025 | -6.813 | 0.000 | -0.216 | -0.120 |
| BldgType_Twnhs | -0.0561 | 0.037 | -1.513 | 0.130 | -0.129 | 0.017 |
| BldgType_TwnhsE | -0.0205 | 0.024 | -0.858 | 0.391 | -0.067 | 0.026 |
| KitchenQual_Fa | -0.3994 | 0.035 | -11.315 | 0.000 | -0.469 | -0.330 |

| | | | |
|---|---|---|---|
| Omnibus: | 289.988 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1242.992 |
| Skew: | -0.886 | Prob(JB): | 1.22e-270 |
| Kurtosis: | 7.159 | Cond. No. | 2360.0 |
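
The adjusted R-squared in the summary can be reproduced from the other reported quantities, which is a quick way to see how it penalizes the number of predictors. The snippet below applies the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) to the rounded values printed in the summary, so it can only be expected to agree to about three decimals.

```python
n_obs = 1460       # No. Observations
df_model = 47      # Df Model (predictors, excluding the constant)
r_squared = 0.839  # R-squared, as rounded in the summary

# Residual degrees of freedom: observations minus predictors minus the intercept
df_resid = n_obs - df_model - 1

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(df_resid)                 # 1412, matching "Df Residuals"
print(round(adj_r_squared, 3))  # 0.834, matching "Adj. R-squared"
```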


## Run the same model in scikit-learn

In [66]:
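from sklearn.linear_model import LinearRegression
# scikit-learn fits its own intercept (fit_intercept=True by default),
# so pass X without the added constant column
linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# Coefficients, in the same order as the columns of X
linreg.coef_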

array([ 0.07972232,  0.17239913,  0.45127205, -0.06849094, -0.16790514,
       -0.05605952, -0.02045271, -0.39939595, -0.15259939, -0.2673328 ,
        0.09126571,  0.23411019,  0.12586955,  0.0132195 ,  0.00642584,
        0.11977699,  0.04707234,  0.06982549,  0.42606959,  0.35024342,
        0.39789053,  0.4403098 , -0.08512761,  0.02114409, -0.18483139,
       -0.25957286, -0.08396174, -0.0303953 , -0.0328894 , -0.30408946,
       -0.03914605, -0.3842061 , -0.27635109, -0.10198873, -0.17602786,
       -0.00637144, -0.10690515,  0.14505362,  0.14483992, -0.37350736,
       -0.27952174, -0.18991196, -0.09311116,  0.03795979,  0.17159285,
        0.00227384,  0.0509805 ])
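
Both libraries solve the same least-squares problem, so their coefficients should agree up to rounding. As a quick check, the snippet below compares the first eight values of `linreg.coef_` with the `coef` column of the statsmodels summary, both copied by hand from the outputs in this lab (the list names are arbitrary).

```python
# First eight entries of linreg.coef_ (scikit-learn)
sklearn_coefs = [0.07972232, 0.17239913, 0.45127205, -0.06849094,
                 -0.16790514, -0.05605952, -0.02045271, -0.39939595]

# Matching rows of the statsmodels summary: LotArea through KitchenQual_Fa
statsmodels_coefs = [0.0797, 0.1724, 0.4513, -0.0685,
                     -0.1679, -0.0561, -0.0205, -0.3994]

# The summary prints four decimals, so allow half a unit in the fourth decimal
matches = [abs(sk - st) < 5e-5 for sk, st in zip(sklearn_coefs, statsmodels_coefs)]
print(all(matches))
```

The intercept is stored separately in `linreg.intercept_`, which corresponds to the `const` row of the summary.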

## Predict the house price given the following characteristics (raw values, before any transformation)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
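
One way to approach this is to push the new house through the same steps as the training data: log-transform and scale the continuous features with the statistics computed on the training set, then switch on the dummy columns that match its categories. The sketch below is not a reference solution. The helper `make_feature_row`, the shortened column list, and the `log_min` / `log_max` values are all placeholders for illustration; in practice they would come from your own preprocessing, and the scaling step should mirror whatever you applied there.

```python
import numpy as np
import pandas as pd

def make_feature_row(house, log_min, log_max, feature_columns):
    """Build a one-row frame laid out like the training design matrix."""
    row = pd.Series(0.0, index=feature_columns)
    # Continuous features: log, then min-max scale with the training statistics
    for name in log_min.index:
        logged = np.log(house[name])
        row[name] = (logged - log_min[name]) / (log_max[name] - log_min[name])
    # Categorical features: set the matching dummy to 1 when that column exists;
    # a dropped reference level (such as BldgType '1Fam') leaves its dummies at 0
    for name, level in house.items():
        dummy = f'{name}_{level}'
        if dummy in row.index:
            row[dummy] = 1.0
    return row.to_frame().T

house = {'LotArea': 14977, '1stFlrSF': 1976, 'GrLivArea': 1976,
         'BldgType': '1Fam', 'KitchenQual': 'Gd', 'SaleType': 'New',
         'MSZoning': 'RL', 'Street': 'Pave', 'Neighborhood': 'NridgHt'}

# Placeholder training statistics (NOT computed from ames.csv)
log_min = pd.Series({'LotArea': 7.0, '1stFlrSF': 5.5, 'GrLivArea': 5.5})
log_max = pd.Series({'LotArea': 12.5, '1stFlrSF': 8.5, 'GrLivArea': 9.0})

# Shortened column list for illustration; the real one would be X.columns
feature_columns = ['LotArea', '1stFlrSF', 'GrLivArea', 'BldgType_Duplex',
                   'KitchenQual_Gd', 'SaleType_New', 'MSZoning_RL',
                   'Street_Pave', 'Neighborhood_NridgHt']

new_row = make_feature_row(house, log_min, log_max, feature_columns)
print(new_row.T)
```

With the full column list, `model.predict(sm.add_constant(new_row, has_constant='add'))` would return a prediction on the transformed `SalePrice` scale, which then has to be mapped back to dollars by inverting the scaling and applying `np.exp`.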

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformations, scaling, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using both statsmodels and scikit-learn!