# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and scale the selected continuous features

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


In [24]:
# ames is already a DataFrame; work on a copy so the original stays untouched
df = ames.copy()
df[categoricals]

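|      | BldgType | KitchenQual | SaleType | MSZoning | Street | Neighborhood |
|------|----------|-------------|----------|----------|--------|--------------|
| 0    | 1Fam     | Gd          | WD       | RL       | Pave   | CollgCr      |
| 1    | 1Fam     | TA          | WD       | RL       | Pave   | Veenker      |
| 2    | 1Fam     | Gd          | WD       | RL       | Pave   | CollgCr      |
| 3    | 1Fam     | Gd          | WD       | RL       | Pave   | Crawfor      |
| 4    | 1Fam     | Gd          | WD       | RL       | Pave   | NoRidge      |
| ...  | ...      | ...         | ...      | ...      | ...    | ...          |
| 1455 | 1Fam     | TA          | WD       | RL       | Pave   | Gilbert      |
| 1456 | 1Fam     | TA          | WD       | RL       | Pave   | NWAmes       |
| 1457 | 1Fam     | Gd          | WD       | RL       | Pave   | Crawfor      |
| 1458 | 1Fam     | Gd          | WD       | RL       | Pave   | NAmes        |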


## Continuous Features

In [30]:
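# Log-transform, then standardize the continuous features
# (np.log1p computes log(1 + x), so zero values are handled safely)
df_continuous_log = df[continuous].apply(np.log1p)
scaler = StandardScaler()
# fit_transform returns a NumPy array, not a DataFrame
df_continuous_scaler = scaler.fit_transform(df_continuous_log)
print(df_continuous_scaler)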


[[-0.13327022 -0.80364504  0.52919393  0.56006699]
 [ 0.11341289  0.41847858 -0.38196548  0.21276333]
 [ 0.42004913 -0.57667677  0.65963119  0.73404616]
 ...
 [-0.00235902  0.22820766  1.47010236  1.17470887]
 [ 0.13683278 -0.07772073 -0.8545358  -0.39965728]
 [ 0.18011644  0.40347203 -0.39625742 -0.30669507]]
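
The printed result above is a plain NumPy array, so the column names are gone. Below is a minimal sketch, using a small made-up DataFrame (`toy`, not the Ames data), of one way to wrap the scaler output back into a DataFrame and check that each column ends up with mean 0 and unit standard deviation. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the continuous Ames columns
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198],
})

# log1p computes log(1 + x), which is safe when a value is 0
toy_log = toy.apply(np.log1p)

scaler = StandardScaler()
toy_scaled = pd.DataFrame(
    scaler.fit_transform(toy_log),  # NumPy array of z-scores
    columns=toy_log.columns,        # restore the column names
    index=toy_log.index,            # keep the original row labels
)

# StandardScaler uses the population standard deviation (ddof=0)
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

Keeping the original index matters later: `pd.concat(..., axis=1)` aligns rows by index label, not by position.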


## Categorical Features

In [26]:
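# One-hot encode the categoricals, dropping the first level of each to avoid the dummy variable trap
df[categoricals] = df[categoricals].astype('category')
# Passing the full DataFrame keeps all other columns alongside the new dummy columns
df_encoded = pd.get_dummies(df, columns=categoricals, drop_first=True)
print(df_encoded)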

        Id  MSSubClass  LotFrontage  LotArea Alley LotShape LandContour  \
0        1          60         65.0     8450   NaN      Reg         Lvl   
1        2          20         80.0     9600   NaN      Reg         Lvl   
2        3          60         68.0    11250   NaN      IR1         Lvl   
3        4          70         60.0     9550   NaN      IR1         Lvl   
4        5          60         84.0    14260   NaN      IR1         Lvl   
...    ...         ...          ...      ...   ...      ...         ...   
1455  1456          60         62.0     7917   NaN      Reg         Lvl   
1456  1457          20         85.0    13175   NaN      Reg         Lvl   
1457  1458          70         66.0     9042   NaN      Reg         Lvl   
1458  1459          20         68.0     9717   NaN      Reg         Lvl   
1459  1460          20         75.0     9937   NaN      Reg         Lvl   

     Utilities LotConfig LandSlope  ... Neighborhood_NoRidge  \
0       AllPub    Inside       Gtl 
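
To make the effect of `drop_first=True` easier to see than in the wide output above, here is a small sketch on a made-up single-column DataFrame (the rows are hypothetical, not taken from the dataset). It shows which dummy columns survive and which category becomes the baseline.

```python
import pandas as pd

# Hypothetical toy column; the rows are made up for illustration
toy = pd.DataFrame({'BldgType': ['1Fam', 'Duplex', 'TwnhsE', '1Fam']})

# drop_first=True drops the first category in sorted order ('1Fam'),
# which becomes the baseline absorbed by the intercept
dummies = pd.get_dummies(toy, columns=['BldgType'], drop_first=True)
print(dummies.columns.tolist())
```

A row with all dummy columns equal to 0 therefore represents the dropped baseline category.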

## Combine Categorical and Continuous Features

In [28]:
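# Combine features into a single DataFrame called df_preprocessed.
# pd.concat raises "TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'" on the raw scaler output, so wrap it in a DataFrame first.
df_continuous_scaled = pd.DataFrame(df_continuous_scaler, columns=continuous, index=df.index)
df_preprocessed = pd.concat([df_encoded.drop(columns=continuous), df_continuous_scaled], axis=1)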

## Run a linear model with SalePrice as the target variable in statsmodels

In [None]:
# Your code here

## Run the same model in scikit-learn

In [None]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
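
The check described above can be sketched on synthetic data. Here `LinearRegression` is compared against a plain least-squares solve with `np.linalg.lstsq`, which computes the same estimate that an OLS fit with an intercept produces. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the Ames dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

linreg = LinearRegression()  # fit_intercept=True by default
linreg.fit(X, y)

# Reference solution: least squares on the design matrix [1, X]
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(linreg.intercept_, linreg.coef_)
print(beta)
```

Unlike statsmodels, scikit-learn fits the intercept by default, so no constant column should be added to `X` before calling `fit`.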

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [None]:
# Your code here - predict the house price given the following characteristics
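
Because `SalePrice` is in the `continuous` list, the target is log-transformed and standardized along with the features, so a prediction has to be mapped back to dollars. The sketch below walks through that round trip on synthetic one-feature data (made-up values, not the Ames dataset). It assumes a single `StandardScaler` fit on all logged columns, as in the preprocessing cell earlier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Ames dataset)
rng = np.random.default_rng(0)
area = rng.uniform(5000, 20000, size=300)
price = 20 * area ** 0.9 * np.exp(rng.normal(scale=0.05, size=300))
raw = pd.DataFrame({'LotArea': area, 'SalePrice': price})

# Same preprocessing as in the lab: log1p, then standardize every column
logged = np.log1p(raw)
scaler = StandardScaler().fit(logged)
scaled = pd.DataFrame(scaler.transform(logged), columns=raw.columns)

model = LinearRegression().fit(scaled[['LotArea']], scaled['SalePrice'])

# Per-column statistics learned by the scaler, labeled by column name
means = pd.Series(scaler.mean_, index=raw.columns)
stds = pd.Series(scaler.scale_, index=raw.columns)

# 1. Transform the new raw feature value exactly as the training data
new_area = 14977
z_area = (np.log1p(new_area) - means['LotArea']) / stds['LotArea']

# 2. Predict on the transformed scale
z_price = model.predict(pd.DataFrame({'LotArea': [z_area]}))[0]

# 3. Undo the standardization, then undo the log1p
log_price = z_price * stds['SalePrice'] + means['SalePrice']
predicted_price = np.expm1(log_price)
print(predicted_price)
```

For the categorical characteristics, the new row would carry a 1 in the dummy column matching each of its categories and 0 elsewhere. A category that was dropped as the baseline has no column and is represented by all zeros.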

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!