# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and standardize the selected continuous features

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [16]:
# Log-transform the continuous features
col_name = [f'{column}_log' for column in continuous]
ames_log = np.log(ames[continuous])
ames_log.columns = col_name

# Standardize each logged column to mean 0 and standard deviation 1
scaler = StandardScaler()
ames_log_norm = scaler.fit_transform(ames_log.values)
ames_log_norm


array([[-0.13323102, -0.80357042,  0.52925951,  0.5600676 ],
       [ 0.11344181,  0.41858529, -0.38184621,  0.21276441],
       [ 0.42006074, -0.57656015,  0.65967461,  0.73404645],
       ...,
       [-0.00232484,  0.22833815,  1.46994196,  1.17470811],
       [ 0.13686055, -0.07757255, -0.85447127, -0.39965604],
       [ 0.180142  ,  0.40358096, -0.39613889, -0.30669379]])
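
`fit_transform` returns a bare NumPy array, so the column names are lost. For the later combine step it can help to wrap the result back into a DataFrame. The following is a minimal sketch on a small made-up frame rather than `ames.csv`; the `toy` values and the `toy_log_norm` name are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for two continuous Ames columns (not real data)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'GrLivArea': [1710, 1262, 1786, 1717],
})

# Log-transform and rename the columns
toy_log = np.log(toy)
toy_log.columns = [f'{column}_log' for column in toy.columns]

# Standardize, then restore the column names and index
scaler = StandardScaler()
toy_log_norm = pd.DataFrame(
    scaler.fit_transform(toy_log),
    columns=toy_log.columns,
    index=toy_log.index,
)

# StandardScaler uses the population standard deviation (ddof=0)
print(toy_log_norm.mean().round(6))
print(toy_log_norm.std(ddof=0).round(6))
```

Keeping the original index matters because pandas aligns rows on the index when frames are joined later.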

## Categorical Features

In [7]:
# One-hot encode the categoricals, dropping the first level of each feature
ames_ohe = pd.get_dummies(ames[categoricals], drop_first=True)
ames_ohe

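| | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | KitchenQual_Fa | KitchenQual_Gd | KitchenQual_TA | SaleType_CWD | SaleType_Con | SaleType_ConLD | ... | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1456 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1457 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1458 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |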
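
With `drop_first=True`, the first level of each feature (in sorted order) gets no column of its own and serves as the reference category, which is why the output above has no `BldgType_1Fam` column. A minimal sketch on made-up rows, assuming only two of the categorical columns; `dtype=int` is an optional choice here, since recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Made-up stand-in for two of the categorical Ames columns
toy = pd.DataFrame({
    'BldgType': ['1Fam', '2fmCon', 'Duplex', '1Fam'],
    'KitchenQual': ['Gd', 'TA', 'Ex', 'Gd'],
})

# dtype=int requests 0/1 columns instead of booleans
toy_ohe = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_ohe)
```

A row with `BldgType` of `1Fam` and `KitchenQual` of `Ex` is encoded as all zeros, because both values are reference levels.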


## Combine Categorical and Continuous Features

In [None]:
# Combine the features into a single DataFrame called preprocessed
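
One possible way to combine the two pieces is `pd.concat` with `axis=1`. The sketch below uses two small made-up frames (`toy_log_norm` and `toy_ohe` are hypothetical names, not objects defined by the lab) and assumes both share the same row index.

```python
import pandas as pd

# Made-up stand-ins for the two preprocessed pieces (not real data)
toy_log_norm = pd.DataFrame({
    'LotArea_log': [-0.5, 1.2, -0.7],
    'SalePrice_log': [0.1, 0.9, -1.0],
})
toy_ohe = pd.DataFrame({
    'BldgType_Duplex': [0, 1, 0],
    'Street_Pave': [1, 1, 1],
})

# axis=1 places the frames side by side; rows are matched on the index
preprocessed = pd.concat([toy_log_norm, toy_ohe], axis=1)
print(preprocessed.shape)
```

If the scaled values are still a NumPy array, they need to be wrapped in a DataFrame with the original index first; otherwise there is nothing for pandas to align on.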

## Run a linear model with SalePrice as the target variable in statsmodels

In [None]:
# Your code here

## Run the same model in scikit-learn

In [None]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels

## Predict the house price given the following raw characteristics (before any transformation)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
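
Because the model was trained on logged and standardized values, a new observation has to go through the same steps using the training statistics, and the prediction has to be mapped back to dollars afterwards. The sketch below shows that round trip for a single continuous predictor on made-up training rows; it is an illustration of the idea, not the lab's solution. For the categorical inputs, the dummy columns for the matching levels (for example `KitchenQual_Gd`) would be set to 1 and all others to 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training rows standing in for the Ames columns
raw = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786, 1717, 2198, 1362],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000],
})
logged = np.log(raw)

scaler = StandardScaler().fit(logged)
scaled = pd.DataFrame(scaler.transform(logged), columns=raw.columns)

model = LinearRegression().fit(scaled[['GrLivArea']], scaled['SalePrice'])

# Per-column training statistics of the logged data
mean = pd.Series(scaler.mean_, index=raw.columns)
std = pd.Series(scaler.scale_, index=raw.columns)

# Transform the new raw value exactly as the training data was transformed
new_area = 1976
new_scaled = (np.log(new_area) - mean['GrLivArea']) / std['GrLivArea']
new_X = pd.DataFrame({'GrLivArea': [new_scaled]})

# Undo the standardization, then the log, to get back to dollars
pred_scaled = model.predict(new_X)[0]
pred_price = np.exp(pred_scaled * std['SalePrice'] + mean['SalePrice'])
print(round(pred_price))
```

The key point is that the mean and standard deviation come from the training data, never from the new observation itself.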

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!