# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for individual parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one hot encode the categorical features of interest
* Log and scale the selected continuous features

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [42]:
# Log-transform the continuous features
ames_cont = np.log(ames[continuous])


def min_max(x):
    """Min-max scale a Series to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


# Scale every logged column, then drop the target so only predictors remain
ames_min_max = ames_cont.apply(min_max)
ames_min_max.drop(columns=['SalePrice'], inplace=True)
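
To make the log-then-scale step concrete, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are assumptions for illustration only, not rows read from `ames.csv`. It shows that after min-max scaling each column spans exactly 0 to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two continuous Ames columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 14260],
    "GrLivArea": [1710, 1262, 1786, 2198],
})


def min_max(x):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())


# Log-transform first, then min-max scale each column independently
toy_scaled = np.log(toy).apply(min_max)
print(toy_scaled.round(3))
```

Because the scaling uses each column's own minimum and maximum, any new observation has to be scaled with those same training values, not with its own.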
    

## Categorical Features

In [43]:
# One-hot encode each categorical feature; drop_first=True drops one level
# per feature, which then serves as the reference category
bldg_dum = pd.get_dummies(ames['BldgType'], drop_first=True)
kitchen_dum = pd.get_dummies(ames['KitchenQual'], drop_first=True)
sale_dum = pd.get_dummies(ames['SaleType'], drop_first=True)
zone_dum = pd.get_dummies(ames['MSZoning'], drop_first=True)
street_dum = pd.get_dummies(ames['Street'], drop_first=True)
neighborhood_dum = pd.get_dummies(ames['Neighborhood'], drop_first=True)
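
As an alternative sketch, `pd.get_dummies` can encode several columns in one call through its `columns` argument. The toy rows below are assumptions for illustration. Note one difference from the cell above: encoding this way prefixes each dummy with its source column name (for example `Street_Pave`), so later references to bare names such as `Gd` or `New` would need to change.

```python
import pandas as pd

# Hypothetical toy rows; the category levels mirror a few seen in the lab
toy = pd.DataFrame({
    "BldgType": ["1Fam", "Duplex", "1Fam", "TwnhsE"],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
})

# One call encodes every listed column; drop_first=True drops the first
# (sorted) level of each feature, which becomes the reference category
dummies = pd.get_dummies(toy, columns=["BldgType", "Street"], drop_first=True)
print(list(dummies.columns))
```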



## Combine Categorical and Continuous Features

In [44]:
# combine features into a single dataframe called preprocessed
sales = ames['SalePrice']
preprocessed = pd.concat([sales, ames_min_max, bldg_dum, kitchen_dum, sale_dum, zone_dum, street_dum, neighborhood_dum],axis=1)
preprocessed.head()

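| | SalePrice | LotArea | 1stFlrSF | GrLivArea | 2fmCon | Duplex | Twnhs | TwnhsE | Fa | Gd | ... | NoRidge | NridgHt | OldTown | SWISU | Sawyer | SawyerW | Somerst | StoneBr | Timber | Veenker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 208500 | 0.366344 | 0.356155 | 0.577712 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 181500 | 0.391317 | 0.503056 | 0.470245 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 223500 | 0.422359 | 0.383441 | 0.593095 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 140000 | 0.390295 | 0.399941 | 0.579157 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 250000 | 0.468761 | 0.466237 | 0.666523 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |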


## Run a linear model with SalePrice as the target variable in statsmodels

In [53]:
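# The formula interface cannot parse unquoted column names that start with a digit, so rename them
preprocessed.rename(columns={'1stFlrSF': 'FirstFlrSF', '2fmCon': 'TwoFmCon'}, inplace=True)
# Every column except the target (the first column) is a predictor
predictors = list(preprocessed.columns[1:])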
predictors

['LotArea',
 'FirstFlrSF',
 'GrLivArea',
 'TwoFmCon',
 'Duplex',
 'Twnhs',
 'TwnhsE',
 'Fa',
 'Gd',
 'TA',
 'CWD',
 'Con',
 'ConLD',
 'ConLI',
 'ConLw',
 'New',
 'Oth',
 'WD',
 'FV',
 'RH',
 'RL',
 'RM',
 'Pave',
 'Blueste',
 'BrDale',
 'BrkSide',
 'ClearCr',
 'CollgCr',
 'Crawfor',
 'Edwards',
 'Gilbert',
 'IDOTRR',
 'MeadowV',
 'Mitchel',
 'NAmes',
 'NPkVill',
 'NWAmes',
 'NoRidge',
 'NridgHt',
 'OldTown',
 'SWISU',
 'Sawyer',
 'SawyerW',
 'Somerst',
 'StoneBr',
 'Timber',
 'Veenker']

In [54]:
# Build the formula "SalePrice ~ LotArea+FirstFlrSF+..." and fit it by ordinary least squares
import statsmodels.api as sm
from statsmodels.formula.api import ols
pred_sum = '+'.join(predictors)
formula = f"SalePrice ~ {pred_sum}"
model = ols(formula=formula, data=preprocessed).fit()

In [55]:
model.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | SalePrice | R-squared: | 0.806 |
| Model: | OLS | Adj. R-squared: | 0.799 |
| Method: | Least Squares | F-statistic: | 124.6 |
| Date: | Sat, 22 Aug 2020 | Prob (F-statistic): | 0.0 |
| Time: | 16:45:33 | Log-Likelihood: | -17348.0 |
| No. Observations: | 1460 | AIC: | 34790.0 |
| Df Residuals: | 1412 | BIC: | 35050.0 |
| Df Model: | 47 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 5.677e+04 | 2.46e+04 | 2.309 | 0.021 | 8548.976 | 1.05e+05 |
| LotArea | 1.003e+05 | 1.63e+04 | 6.169 | 0.000 | 6.84e+04 | 1.32e+05 |
| FirstFlrSF | 8.894e+04 | 1.16e+04 | 7.673 | 0.000 | 6.62e+04 | 1.12e+05 |
| GrLivArea | 2.089e+05 | 1.16e+04 | 18.080 | 0.000 | 1.86e+05 | 2.32e+05 |
| TwoFmCon | -1.547e+04 | 6885.909 | -2.247 | 0.025 | -2.9e+04 | -1963.124 |
| Duplex | -2.946e+04 | 5383.318 | -5.473 | 0.000 | -4e+04 | -1.89e+04 |
| Twnhs | -1.903e+04 | 8093.233 | -2.351 | 0.019 | -3.49e+04 | -3151.964 |
| TwnhsE | -1.795e+04 | 5204.902 | -3.449 | 0.001 | -2.82e+04 | -7741.004 |
| Fa | -8.558e+04 | 7710.123 | -11.100 | 0.000 | -1.01e+05 | -7.05e+04 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 513.727 | Durbin-Watson: | 1.93 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 11182.821 |
| Skew: | 1.106 | Prob(JB): | 0.0 |
| Kurtosis: | 16.377 | Cond. No. | 118.0 |
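
The numbers in a summary like this can also be read programmatically from the fitted results object, which helps when evaluating the overall model (R-squared, F-test) and individual parameters (p-values, confidence intervals). The following is a minimal sketch on synthetic data; the `toy` frame and `toy_model` name are assumptions for illustration, and only `x1` truly affects `y` here.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (illustrative assumption): x2 has no real effect on y
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
toy["y"] = 1.5 + 4.0 * toy["x1"] + rng.normal(size=300)

toy_model = ols("y ~ x1 + x2", data=toy).fit()

# Overall fit: R-squared, adjusted R-squared, and the F-test p-value
print(toy_model.rsquared, toy_model.rsquared_adj, toy_model.f_pvalue)

# Per-parameter evidence: terms whose p-value falls below 0.05
print(toy_model.pvalues[toy_model.pvalues < 0.05].index.tolist())

# 95% confidence interval for every coefficient
print(toy_model.conf_int())
```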


## Run the same model in scikit-learn

In [58]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
y = ames['SalePrice']
linereg = LinearRegression()
linereg.fit(preprocessed.drop('SalePrice', axis=1), y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [59]:
linereg.coef_

array([100257.13305702,  88944.99711967, 208921.46884228, -15470.83603067,
       -29460.33246996, -19028.0179233 , -17951.17681979, -85579.02251041,
       -58800.96411711, -76544.23263002,  23280.88887463,  60558.1152778 ,
        17103.56395132,  13950.5699292 ,   5959.6242822 ,  29931.67575434,
        13326.06280712,  12145.11974576,  31479.11788169,  16953.57389195,
        19259.46102664,  33387.30213005,  -1652.53991566,   7415.58904673,
       -15323.61368684, -50437.57552947, -30141.74062858, -17247.66140545,
       -13431.92858168, -58033.53728179, -22322.07925776, -68271.5835229 ,
       -26911.16971068, -31844.84258383, -44157.95052307,   3043.55309964,
       -34455.0474652 ,  60277.42825136,  37668.51286882, -71877.55831747,
       -59817.82894761, -45883.59107202, -24007.99382195,  -5124.00673102,
        52416.37278963,  -7977.65007607,   1624.479234  ])

In [60]:
linereg.intercept_

56768.05855065142
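
Rather than comparing the two outputs by eye, the agreement can be checked numerically. The sketch below is one way to do it, fitted on synthetic data (an assumption for illustration; the names `sm_fit` and `sk_fit` are hypothetical). Both libraries solve the same least-squares problem, so the estimates should agree up to floating-point tolerance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = 10 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Fit the same model with both libraries
sm_fit = ols("y ~ a + b + c", data=X.assign(y=y)).fit()
sk_fit = LinearRegression().fit(X, y)

# Compare slopes in the same column order, then the intercepts
coefs_match = np.allclose(sm_fit.params[["a", "b", "c"]].to_numpy(), sk_fit.coef_)
intercept_match = np.isclose(sm_fit.params["Intercept"], sk_fit.intercept_)
print(coefs_match, intercept_match)
```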

## Predict the house price given the following characteristics (raw values, before any transformation)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [87]:
def transform(var1, var2, var3):
    """Log-transform raw LotArea, 1stFlrSF, and GrLivArea values, then
    min-max scale them with the minima and maxima of the training data."""
    transform1 = (np.log(var1) - ames_cont['LotArea'].min()) / (ames_cont['LotArea'].max() - ames_cont['LotArea'].min())
    transform2 = (np.log(var2) - ames_cont['1stFlrSF'].min()) / (ames_cont['1stFlrSF'].max() - ames_cont['1stFlrSF'].min())
    transform3 = (np.log(var3) - ames_cont['GrLivArea'].min()) / (ames_cont['GrLivArea'].max() - ames_cont['GrLivArea'].min())

    return transform1, transform2, transform3


x_one, x_two, x_three = transform(14977, 1976, 1976)
df = preprocessed.drop(columns=['SalePrice'])
test_df = pd.DataFrame(columns=df.columns)

# Start from a single row of zeros, then fill in the non-zero features
test_df.loc[0] = 0

test_df['LotArea'] = x_one
test_df['FirstFlrSF'] = x_two
test_df['GrLivArea'] = x_three
# BldgType 1Fam is the dropped reference level, so every BldgType dummy stays 0
test_df['Gd'] = 1
test_df['New'] = 1
test_df['RL'] = 1
test_df['Pave'] = 1
test_df['NridgHt'] = 1
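
Another way to build the single-row input, sketched here under assumptions, is to list only the non-zero features in a dictionary and let `reindex` supply the remaining columns as zeros in training order. The `train_columns` list below is a hypothetical, shortened stand-in for the lab's full predictor list, and the scaled values are the rounded ones from the lab's output.

```python
import pandas as pd

# Hypothetical subset of the training columns, in training order
train_columns = ["LotArea", "FirstFlrSF", "GrLivArea", "Duplex", "Gd", "New", "NridgHt"]

# Only non-zero entries are spelled out (scaled values rounded from the lab output)
new_house = {
    "LotArea": 0.478363,
    "FirstFlrSF": 0.672737,
    "GrLivArea": 0.628858,
    "Gd": 1,
    "New": 1,
    "NridgHt": 1,
}

# reindex orders the columns like the training data and fills absent dummies with 0
row = pd.DataFrame([new_house]).reindex(columns=train_columns, fill_value=0)
print(row)
```

Matching the training column order matters whenever the row is later multiplied positionally against a coefficient array.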



In [88]:
test_df


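| | LotArea | FirstFlrSF | GrLivArea | TwoFmCon | Duplex | Twnhs | TwnhsE | Fa | Gd | TA | ... | NoRidge | NridgHt | OldTown | SWISU | Sawyer | SawyerW | Somerst | StoneBr | Timber | Veenker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.478363 | 0.672737 | 0.628858 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |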


In [91]:
prediction = sum(test_df.loc[0]*linereg.coef_) + linereg.intercept_
prediction


322351.9511551196
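
The manual calculation is the linear model written out: the intercept plus the sum of each feature value times its coefficient. As a sanity check, the sketch below (synthetic data and hypothetical names, not the Ames model) confirms that this agrees with scikit-learn's own `predict` method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (illustrative assumption)
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = 5 + X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(scale=0.1, size=50)
reg = LinearRegression().fit(X, y)

# A single new observation, with features in the same order as the training columns
x_new = np.array([0.5, 1.0, 0.0, 1.0])

# Manual prediction: intercept + dot product of features and coefficients
manual = reg.intercept_ + x_new @ reg.coef_

# predict expects a 2-D array with one row per observation
via_predict = reg.predict(x_new.reshape(1, -1))[0]
print(np.isclose(manual, via_predict))
```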

In [None]:
# The model predicts that a house with the given features will sell for about $322,351.95

## Summary
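Congratulations! You preprocessed the Ames Housing data by log-transforming and min-max scaling the continuous features and one-hot encoding the categorical ones. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn, and used it to predict the price of a new house!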