# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model by using statistical performance metrics pertaining to overall model and specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one hot encode the categorical features of interest
* Log and scale the selected continuous features

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [21]:
# Log transform and normalize
col_name = [f'{column}_log' for column in continuous]
ames_log = np.log(ames[continuous])
ames_log.columns = col_name

scaler = StandardScaler()
ames_log_scaled = scaler.fit_transform(ames_log.values)
ames_log_scaled = pd.DataFrame(data = ames_log_scaled, columns = col_name)

## Categorical Features

In [22]:
# One hot encode categoricals
ames_cat = pd.get_dummies(ames[categoricals],prefix=categoricals, drop_first = True)

## Combine Categorical and Continuous Features

In [23]:
# combine features into a single dataframe called preprocessed
ames_preprocessed = pd.concat([ames_log_scaled, ames_cat], axis = 1)
ames_preprocessed.head()

Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log,SalePrice_log,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,-0.133231,-0.80357,0.52926,0.560068,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0.113442,0.418585,-0.381846,0.212764,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0.420061,-0.57656,0.659675,0.734046,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.103347,-0.439287,0.541511,-0.437382,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0.878409,0.112267,1.282191,1.014651,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0


## Run a linear model with SalePrice as the target variable in statsmodels

In [25]:
# Your code here
y = ames_preprocessed['SalePrice_log']
X = ames_preprocessed.drop(columns = 'SalePrice_log')

In [26]:
X_const = sm.add_constant(X)
models = sm.OLS(y, X, hasconst = True)
result = models.fit()
result.summary()

  return ptp(axis=axis, out=out, **kwargs)


0,1,2,3
Dep. Variable:,SalePrice_log,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.834
Method:,Least Squares,F-statistic:,160.0
Date:,"Tue, 30 Jun 2020",Prob (F-statistic):,0.0
Time:,20:46:02,Log-Likelihood:,-738.77
No. Observations:,1460,AIC:,1572.0
Df Residuals:,1413,BIC:,1820.0
Df Model:,46,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
LotArea_log,0.1030,0.019,5.464,0.000,0.066,0.140
1stFlrSF_log,0.1363,0.016,8.580,0.000,0.105,0.167
GrLivArea_log,0.3769,0.016,24.128,0.000,0.346,0.408
BldgType_2fmCon,-0.1712,0.079,-2.169,0.030,-0.326,-0.016
BldgType_Duplex,-0.4224,0.062,-6.858,0.000,-0.543,-0.302
BldgType_Twnhs,-0.1464,0.092,-1.591,0.112,-0.327,0.034
BldgType_TwnhsE,-0.0591,0.058,-1.028,0.304,-0.172,0.054
KitchenQual_Fa,-1.0060,0.088,-11.481,0.000,-1.178,-0.834
KitchenQual_Gd,-0.3870,0.049,-7.856,0.000,-0.484,-0.290

0,1,2,3
Omnibus:,290.089,Durbin-Watson:,1.966
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1241.226
Skew:,-0.887,Prob(JB):,2.9599999999999998e-270
Kurtosis:,7.154,Cond. No.,82.3


## Run the same model in scikit-learn

In [28]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
linreg

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

## Summary
Congratulations! You pre-processed the Ames Housing data using scaling and standardization. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!