# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model by using statistical performance metrics pertaining to overall model and specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one hot encode the categorical features of interest
* Log and scale the selected continuous features

In [79]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')
ames['1fam'] = 0
ames.loc[ames['BldgType'] =='1Fam', '1fam'] = 1

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

## Continuous Features

In [80]:
# Log transform and normalize
cont_df = np.log(ames[continuous])
cont_df

Unnamed: 0,LotArea,1stFlrSF,GrLivArea,SalePrice
0,9.041922,6.752270,7.444249,12.247694
1,9.169518,7.140453,7.140453,12.109011
2,9.328123,6.824374,7.487734,12.317167
3,9.164296,6.867974,7.448334,11.849398
4,9.565214,7.043160,7.695303,12.429216
...,...,...,...,...
1455,8.976768,6.859615,7.406711,12.072541
1456,9.486076,7.636752,7.636752,12.254863
1457,9.109636,7.080026,7.757906,12.493130
1458,9.181632,6.982863,6.982863,11.864462


In [81]:
def normalize(feature):
    return (feature - feature.mean()) / feature.std()
ames_log_norm = cont_df.apply(normalize)
ames_log_norm 

Unnamed: 0,LotArea,1stFlrSF,GrLivArea,SalePrice
0,-0.133185,-0.803295,0.529078,0.559876
1,0.113403,0.418442,-0.381715,0.212692
2,0.419917,-0.576363,0.659449,0.733795
3,0.103311,-0.439137,0.541326,-0.437232
4,0.878108,0.112229,1.281751,1.014303
...,...,...,...,...
1455,-0.259100,-0.465447,0.416538,0.121392
1456,0.725171,1.980456,1.106213,0.577822
1457,-0.002324,0.228260,1.469438,1.174306
1458,0.136814,-0.077546,-0.854179,-0.399519


## Categorical Features

In [62]:
# One hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)


## Combine Categorical and Continuous Features

In [63]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.concat([ames_log_norm, ames_ohe, ames['1fam']], axis=1)
preprocessed.head()

Unnamed: 0,LotArea,1stFlrSF,GrLivArea,SalePrice,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,...,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,1fam
0,-0.133185,-0.803295,0.529078,0.559876,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
1,0.113403,0.418442,-0.381715,0.212692,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
2,0.419917,-0.576363,0.659449,0.733795,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
3,0.103311,-0.439137,0.541326,-0.437232,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
4,0.878108,0.112229,1.281751,1.014303,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1


## Run a linear model with SalePrice as the target variable in statsmodels

In [64]:
# Your code here
X = preprocessed.drop('SalePrice', axis=1)
y = preprocessed['SalePrice']

import statsmodels.api as sm
X_int = sm.add_constant(X)
model = sm.OLS(y,X_int).fit()
model.summary()
X

Unnamed: 0,LotArea,1stFlrSF,GrLivArea,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,...,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,1fam
0,-0.133185,-0.803295,0.529078,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1,0.113403,0.418442,-0.381715,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1
2,0.419917,-0.576363,0.659449,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
3,0.103311,-0.439137,0.541326,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
4,0.878108,0.112229,1.281751,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,-0.259100,-0.465447,0.416538,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
1456,0.725171,1.980456,1.106213,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
1457,-0.002324,0.228260,1.469438,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1458,0.136814,-0.077546,-0.854179,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


## Run the same model in scikit-learn

In [65]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
linreg.coef_

array([ 0.10327192,  0.1371289 ,  0.37682133, -0.01479346, -0.26367   ,
        0.01632772,  0.1054669 , -0.99986001, -0.38202198, -0.66924909,
        0.22847737,  0.5860786 ,  0.31510567,  0.0330941 ,  0.01608664,
        0.29985338,  0.11784232,  0.17480326,  1.06663561,  0.87681007,
        0.99609131,  1.10228499, -0.21311107,  0.05293276, -0.46271253,
       -0.64982261, -0.21019239, -0.07609253, -0.08233633, -0.76126683,
       -0.09799942, -0.96183328, -0.69182575, -0.2553217 , -0.44067351,
       -0.01595046, -0.26762962,  0.36313165,  0.36259667, -0.93504972,
       -0.69976325, -0.47543141, -0.23309732,  0.09502969,  0.42957077,
        0.0056924 ,  0.12762613,  0.15666884])

In [66]:
linreg.intercept_


-0.2883662134305029

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [67]:
index_par={}
x_col = X.columns
for r in range(1,len(model.params)):
    index_par[x_col[r-1]] = model.params[r]
index_par


{'LotArea': 0.10327191688873588,
 '1stFlrSF': 0.13712890356616708,
 'GrLivArea': 0.376821331345217,
 'BldgType_2fmCon': -0.06285449440094518,
 'BldgType_Duplex': -0.3117310396444894,
 'BldgType_Twnhs': -0.0317333166072203,
 'BldgType_TwnhsE': 0.057405864101866495,
 'KitchenQual_Fa': -0.9998600116793104,
 'KitchenQual_Gd': -0.3820219812168999,
 'KitchenQual_TA': -0.6692490885998973,
 'SaleType_CWD': 0.2284773657592834,
 'SaleType_Con': 0.5860785989727194,
 'SaleType_ConLD': 0.3151056675919964,
 'SaleType_ConLI': 0.03309410012488634,
 'SaleType_ConLw': 0.016086639707504125,
 'SaleType_New': 0.29985337594330613,
 'SaleType_Oth': 0.11784232044843618,
 'SaleType_WD': 0.1748032603198253,
 'MSZoning_FV': 1.0666356053634938,
 'MSZoning_RH': 0.8768100657677511,
 'MSZoning_RL': 0.9960913081755349,
 'MSZoning_RM': 1.102284988238092,
 'Street_Pave': -0.2131110683918434,
 'Neighborhood_Blueste': 0.05293275961390154,
 'Neighborhood_BrDale': -0.46271253394630063,
 'Neighborhood_BrkSide': -0.649822608

In [70]:
cont_vrs = {'LotArea':14977, '1stFlrSF':1976, 'GrLivArea':1976}
cat_vrs= ['1fam', 'KitchenQual_Gd', 'SaleType_New', 'MSZoning_RL', 'Street_Pave', 'Neighborhood_NridgHt']
s=0
for k in cont_vrs:
    tr = np.log(cont_vrs[k])
    ser = cont_df[k]
    normald = (tr - ser.mean()) / ser.std()
    coef_ = index_par[k]
    mult = normald*coef
    s+=mult
for v in cat_vrs:
    s+=index_par[v]
price = s+linreg.intercept_
price


1.272477312953693

In [83]:
sale_mean = np.mean(cont_df['SalePrice'])
sale_std = np.std(cont_df['SalePrice'])

cont_df = np.log(ames[continuous])
def normalize(feature):
    return (feature - feature.mean()) / feature.std()

un_st_sale_price = (sale_std * price) + sale_mean
un_st_sale_price
full_price = np.exp(un_st_sale_price)
full_price

277110.11636427284

## Summary
Congratulations! You pre-processed the Ames Housing data using scaling and standardization. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!