# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for individual parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and standardize the selected continuous features

In [19]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 
                'Neighborhood']


## Continuous Features

In [39]:
# Log transform and standardize the continuous features

def log_norm(data):
    """Log-transform a column, then standardize it (subtract the mean, divide by the std)."""
    data_log = np.log(data)
    return (data_log - np.mean(data_log)) / np.std(data_log)


# Apply the transformation column by column
ames_log_norm = ames[continuous].apply(log_norm)

# Suffix each column name to mark it as transformed
ames_log_norm.columns = [item + "_log" for item in continuous]
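
A quick way to confirm the transformation behaves as intended is to check that each transformed column has a mean close to 0 and a standard deviation close to 1. The sketch below reuses the `log_norm` logic on a small synthetic frame; the `toy` data is made up for illustration and is not the Ames data.

```python
import numpy as np
import pandas as pd


def log_norm(data):
    """Log-transform a column, then standardize it."""
    data_log = np.log(data)
    return (data_log - np.mean(data_log)) / np.std(data_log)


# Synthetic, strictly positive stand-in columns (not the real Ames values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=200),
    "GrLivArea": rng.lognormal(mean=7.3, sigma=0.3, size=200),
})

transformed = toy.apply(log_norm)

# Per-column summary of the transformed data
means = transformed.mean()
stds = transformed.std(ddof=0)
print(means.round(6))
print(stds.round(6))
```

Note that `np.log` requires strictly positive inputs, so the same check on real data is only meaningful for features without zeros or negative values.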

## Categorical Features

In [42]:
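# One-hot encode the categoricals, dropping the first level of each feature
# so that it serves as the reference category.
# dtype=int keeps the dummies as 0/1 integers (newer pandas versions
# default to booleans).
ames_cat = pd.get_dummies(ames[categoricals], prefix=categoricals,
                          drop_first=True, dtype=int)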
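
To see what `drop_first` does, here is a minimal sketch on a made-up single-column frame (the `toy` values are illustrative, not rows from the Ames data). The first level in sorted order gets no column of its own and becomes the reference category, which is why a row with that level shows up as all zeros.

```python
import pandas as pd

# Made-up example with three kitchen quality levels
toy = pd.DataFrame({"KitchenQual": ["Ex", "Gd", "TA", "Gd"]})

dummies = pd.get_dummies(toy, prefix=["KitchenQual"], drop_first=True, dtype=int)

# "Ex" sorts first, so it is dropped and acts as the reference level
print(list(dummies.columns))
print(dummies)
```

Each remaining coefficient in the fitted model is then read relative to the dropped level of its feature.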

## Combine Categorical and Continuous Features

In [45]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.concat([ames_log_norm, ames_cat], axis = 1)
preprocessed.head()

| | LotArea_log | 1stFlrSF_log | GrLivArea_log | SalePrice_log | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | KitchenQual_Fa | KitchenQual_Gd | ... | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.133231 | -0.80357 | 0.52926 | 0.560068 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.113442 | 0.418585 | -0.381846 | 0.212764 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0.420061 | -0.57656 | 0.659675 | 0.734046 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.103347 | -0.439287 | 0.541511 | -0.437382 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.878409 | 0.112267 | 1.282191 | 1.014651 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Run a linear model with SalePrice as the target variable in statsmodels

In [53]:
# Your code here
import statsmodels.api as sm

predictors = list(preprocessed.columns)
predictors.remove("SalePrice_log")

In [57]:
X = preprocessed[predictors]
X_int = sm.add_constant(X)
Y = preprocessed["SalePrice_log"]


In [60]:
model = sm.OLS(Y,X_int).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice_log | R-squared: | 0.839 |
| Model: | OLS | Adj. R-squared: | 0.834 |
| Method: | Least Squares | F-statistic: | 156.5 |
| Date: | Tue, 25 Jan 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:49:11 | Log-Likelihood: | -738.64 |
| No. Observations: | 1460 | AIC: | 1573.0 |
| Df Residuals: | 1412 | BIC: | 1827.0 |
| Df Model: | 47 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.1317 | 0.263 | -0.500 | 0.617 | -0.648 | 0.385 |
| LotArea_log | 0.1033 | 0.019 | 5.475 | 0.000 | 0.066 | 0.140 |
| 1stFlrSF_log | 0.1371 | 0.016 | 8.584 | 0.000 | 0.106 | 0.168 |
| GrLivArea_log | 0.3768 | 0.016 | 24.114 | 0.000 | 0.346 | 0.407 |
| BldgType_2fmCon | -0.1715 | 0.079 | -2.173 | 0.030 | -0.326 | -0.017 |
| BldgType_Duplex | -0.4205 | 0.062 | -6.813 | 0.000 | -0.542 | -0.299 |
| BldgType_Twnhs | -0.1404 | 0.093 | -1.513 | 0.130 | -0.322 | 0.042 |
| BldgType_TwnhsE | -0.0512 | 0.060 | -0.858 | 0.391 | -0.168 | 0.066 |
| KitchenQual_Fa | -1.0002 | 0.088 | -11.315 | 0.000 | -1.174 | -0.827 |

| | | | |
|---|---|---|---|
| Omnibus: | 289.988 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1242.992 |
| Skew: | -0.886 | Prob(JB): | 1.22e-270 |
| Kurtosis: | 7.159 | Cond. No. | 109.0 |
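
The overall fit statistics in the summary are tied together by simple arithmetic. As a worked check using only the numbers reported above (1460 observations, 47 predictors, R-squared of 0.839), the residual degrees of freedom and the adjusted R-squared can be reproduced by hand. Because the R-squared used here is already rounded to three decimals, the result is only expected to agree after rounding.

```python
# Values copied from the regression summary
n_obs = 1460          # No. Observations
n_predictors = 47     # Df Model (excludes the constant)
r_squared = 0.839     # R-squared, as rounded in the summary

# Residual degrees of freedom: observations minus predictors minus the intercept
df_resid = n_obs - n_predictors - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(df_resid)                 # 1412
print(round(adj_r_squared, 3))  # 0.834
```

The small gap between 0.839 and 0.834 suggests that the 47 predictors are not inflating the fit much relative to the sample size.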


## Run the same model in scikit-learn

In [69]:
# Your code here - Check that the coefficients and intercept are 
# the same as those from Statsmodels

from sklearn.linear_model import LinearRegression

model_sk = LinearRegression()
model_sk.fit(X,Y)
print("Model Coefficients: ", model_sk.coef_)
print("\nModel Intercept: ", model_sk.intercept_)

Model Coefficients:  [ 0.10327192  0.1371289   0.37682133 -0.17152105 -0.42048287 -0.14038921
 -0.05121949 -1.00020261 -0.38215288 -0.6694784   0.22855565  0.58627941
  0.31521364  0.03310544  0.01609215  0.29995612  0.1178827   0.17486316
  1.06700108  0.8771105   0.99643261  1.10266268 -0.21318409  0.0529509
 -0.46287108 -0.65004527 -0.21026441 -0.0761186  -0.08236455 -0.76152767
 -0.09803299 -0.96216285 -0.6920628  -0.25540919 -0.4408245  -0.01595592
 -0.26772132  0.36325607  0.36272091 -0.93537011 -0.70000301 -0.47559431
 -0.23317719  0.09506225  0.42971796  0.00569435  0.12766986]

Model Intercept:  -0.1317424941874447
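
The two outputs agree: the scikit-learn intercept of -0.1317424941874447 matches the statsmodels `const` of -0.1317, and the first coefficients (0.10327192, 0.1371289, 0.37682133) match the rounded statsmodels values for `LotArea_log`, `1stFlrSF_log`, and `GrLivArea_log`. This is expected because both libraries solve the same ordinary least squares problem; the intercept is either the coefficient on an explicit column of ones (what `sm.add_constant` adds) or a separately reported value. The NumPy-only sketch below illustrates that equivalence on synthetic data. It is an illustration of the math under that assumption, not either library's actual implementation.

```python
import numpy as np

# Synthetic regression problem (made-up data)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=100)

# Route 1: add an explicit column of ones and fit everything at once
X_const = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

# Route 2: center the data, fit the slopes, then recover the intercept
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()
slopes, *_ = np.linalg.lstsq(X_centered, y_centered, rcond=None)
intercept = y.mean() - X.mean(axis=0) @ slopes

# Compare numerically instead of reading long printed arrays
print(np.allclose(beta[1:], slopes), np.isclose(beta[0], intercept))
```

The same `np.allclose` pattern is a convenient way to compare two fitted coefficient vectors without scanning the printed arrays by eye.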


## Predict the house price given the following characteristics (raw values, before any transformation)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
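
One way to approach this is sketched below, under stated assumptions. The raw continuous values need the same log and standardization used in training (with the training columns' log mean and standard deviation), the categorical values need to be mapped onto the existing dummy columns, and the model's output is a standardized log price that has to be converted back to dollars. The helper names (`log_standardize`, `to_price`, `make_feature_row`) and the `toy` data are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd


def log_standardize(value, train_column):
    """Transform one raw value using the log mean/std of the training column."""
    logged = np.log(train_column)
    # ddof=0 matches NumPy's default population standard deviation
    return (np.log(value) - logged.mean()) / logged.std(ddof=0)


def to_price(z, train_price):
    """Invert the transform: standardized log price back to a price."""
    logged = np.log(train_price)
    return float(np.exp(z * logged.std(ddof=0) + logged.mean()))


def make_feature_row(columns, continuous_z, categorical_levels):
    """Build a one-row frame whose columns match the training predictors."""
    row = pd.DataFrame(0.0, index=[0], columns=list(columns))
    for name, value in continuous_z.items():
        row.loc[0, name] = value
    for feature, level in categorical_levels.items():
        dummy = f"{feature}_{level}"
        # A level without a dummy column is the dropped reference category
        if dummy in row.columns:
            row.loc[0, dummy] = 1.0
    return row


# Made-up stand-in for the training data and predictor list
toy = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0, 16000.0],
    "SalePrice": [150000.0, 180000.0, 220000.0, 300000.0],
})
toy_columns = ["LotArea_log", "BldgType_2fmCon", "KitchenQual_Gd"]

row = make_feature_row(
    toy_columns,
    {"LotArea_log": log_standardize(14977, toy["LotArea"])},
    {"BldgType": "1Fam", "KitchenQual": "Gd"},
)
print(row)

# Round trip: a price survives the forward and inverse transforms
z_price = log_standardize(200000.0, toy["SalePrice"])
recovered = to_price(z_price, toy["SalePrice"])
print(recovered)
```

With the lab's own objects, the row would be built from `predictors` and the `ames` columns instead of the toy data. For the statsmodels model, the constant would likely need to be added with `sm.add_constant(row, has_constant='add')`, since a single row makes every column look constant and the default setting skips adding one. The prediction would then be passed through the inverse transform to obtain a price.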

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fit your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!