# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for individual parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and standardize the selected continuous features
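
As a rough illustration of why a log transform is often applied to size and price features, the sketch below compares skewness before and after taking logs. It uses a synthetic log-normal sample, not the Ames data, and the name `toy_area` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "areas" (NOT the Ames data): a log-normal sample
rng = np.random.default_rng(0)
toy_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=5000), name="toy_area")

skew_raw = toy_area.skew()          # clearly positive for log-normal data
skew_log = np.log(toy_area).skew()  # close to 0 after the log transform

print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A feature whose skewness drops toward zero after the transform is a reasonable candidate for logging; features that contain zeros or negative values cannot be logged directly.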

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

## Continuous Features

In [None]:
# Log transform and normalize
ames_cont = ames[continuous]

# Log transform
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

ames_log

# Other way to log transform
#ames_log = ames[continuous].apply(lambda x: np.log(x))

# Normalize (subtract mean and divide by std)
def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)
ames_log_norm

# Alternative: standardize with scikit-learn's StandardScaler
# (uses the population std, ddof=0, so values differ slightly from `normalize`)
#ss = StandardScaler()

#log_std = pd.DataFrame(ss.fit_transform(ames_log), columns = ames_log.columns)
#log_std.head()
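
The two standardization routes above do not give bit-identical numbers: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`, the NumPy default). A minimal sketch on a toy series (values chosen only for illustration) shows the size of the gap:

```python
import numpy as np
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(toy)

std_sample = toy.std()                    # pandas default: ddof=1
std_population = np.std(toy.to_numpy())   # NumPy default: ddof=0

# The two differ by a constant factor sqrt(n / (n - 1))
ratio = std_sample / std_population
print(ratio, np.sqrt(n / (n - 1)))
```

For a dataset with well over a thousand rows the factor is very close to 1, so either convention works as long as the same one is used for training data and new observations.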

## Categorical Features

In [None]:
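# One-hot encode categoricals
# dtype=float keeps the dummies numeric (recent pandas versions return booleans by default)
ames_cat = ames[categoricals]
ames_ohe = pd.get_dummies(ames_cat, prefix=categoricals, drop_first=True, dtype=float)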
ames_ohe

# Alternative: one-hot encode with scikit-learn
#ohe_method2 = OneHotEncoder(drop = 'first')
#ohe_method2_df = pd.DataFrame(ohe_method2.fit_transform(ames[categoricals]).toarray(), columns = ohe_method2.get_feature_names_out(categoricals))

#ohe_method2_df.head()

In [None]:
# sanity check for method2
ames[categoricals].head()
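
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (hypothetical values, not taken from the dataset). The first level in sorted order is dropped and becomes the baseline, which is encoded as a row of zeros.

```python
import pandas as pd

toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Gd"]})

# drop_first=True removes the first level in sorted order ("Ex" here),
# which becomes the baseline encoded as all zeros
dummies = pd.get_dummies(toy, prefix=["KitchenQual"], drop_first=True, dtype=float)
print(dummies)
```

Each remaining coefficient in the regression is then read relative to that dropped baseline level.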

## Combine Categorical and Continuous Features

In [None]:
# combine features into a single dataframe called preprocessed_ames
preprocessed_ames = pd.concat([ames_log_norm, ames_ohe], axis=1)
preprocessed_ames.head()

## Run a linear model with SalePrice as the target variable in statsmodels

In [None]:
# Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
Y = preprocessed_ames['SalePrice_log']
X = preprocessed_ames.drop('SalePrice_log', axis = 1)
X_int = sm.add_constant(X)

model = sm.OLS(Y, X_int).fit()
model.summary()

## Run the same model in scikit-learn

In [None]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X,Y)
#linreg.score(X,Y)  R-squared

In [None]:
linreg.coef_

In [None]:
linreg.intercept_
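
One way to see why the two libraries should agree: scikit-learn estimates the intercept separately, while an ordinary least squares fit with an explicit column of ones (which is what `sm.add_constant` builds) returns it as the first coefficient. The sketch below checks this on synthetic data, using NumPy's least squares solver as a stand-in for the explicit-constant fit; it is not the lab's own comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the Ames data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = 0.5 + X_toy @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

# scikit-learn fits the intercept separately from the coefficients
sk_model = LinearRegression().fit(X_toy, y_toy)

# Ordinary least squares with an explicit column of ones
X_with_const = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(X_with_const, y_toy, rcond=None)

same_intercept = np.isclose(sk_model.intercept_, beta[0])
same_coefs = np.allclose(sk_model.coef_, beta[1:])
print(same_intercept, same_coefs)
```

In the lab, the analogous check would compare `linreg.intercept_` with the `const` entry of `model.params`, and `linreg.coef_` with the remaining entries.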

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [None]:
# Your code here - predict the house price given the following characteristics

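# drop the target without mutating the list in place, so the cell can be re-run safely
continuous = [col for col in continuous if col != 'SalePrice']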

used_cols = [*continuous, *categoricals]
used_cols

In [None]:
# create empty dataframe for the new row of sample data
new_row = pd.DataFrame(columns=used_cols)

In [None]:
# build a one-row dataframe from the details provided
# (DataFrame.append was removed in pandas 2.0, so construct the row directly)

new_row = pd.DataFrame([{'LotArea': 14977,
                         '1stFlrSF': 1976,
                         'GrLivArea': 1976,
                         'BldgType': '1Fam',
                         'KitchenQual': 'Gd',
                         'SaleType': 'New',
                         'MSZoning': 'RL',
                         'Street': 'Pave',
                         'Neighborhood': 'NridgHt'}],
                       columns=used_cols)
new_row

In [None]:
# continuous columns in df
new_row_cont = new_row[continuous]

# log features
log_names = [f'{column}_log' for column in new_row_cont.columns]

new_row_log = np.log(new_row_cont.astype(float))  #won't work unless float
new_row_log.columns = log_names

# normalize
for col in continuous:
    # normalize using the mean and std of the log-transformed training data
    new_row_log[f'{col}_log'] = (new_row_log[f'{col}_log'] - ames_log[f'{col}_log'].mean()) / ames_log[f'{col}_log'].std()
new_row_log



In [None]:
# categoricals in df
new_row_cat = new_row[categoricals]

# start from an all-zero row with the training dummy columns, then switch on the
# column that matches each category (dropped baseline levels stay all zero)
new_row_ohe = pd.DataFrame(0.0, index=[0], columns=ames_ohe.columns)
for col_type in categoricals:
    dummy_col = f'{col_type}_{new_row_cat[col_type].iloc[0]}'
    if dummy_col in new_row_ohe.columns:
        new_row_ohe.loc[0, dummy_col] = 1.0

new_row_ohe
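
A more compact way to line up a new observation with the training dummy columns is to encode it without dropping anything and then `reindex` to the training columns. The sketch below uses a tiny made-up training frame (the values are hypothetical) to show the idea.

```python
import pandas as pd

# Toy training data (hypothetical values) and its dummy columns
train = pd.DataFrame({"Street": ["Grvl", "Pave", "Pave"],
                      "MSZoning": ["RL", "RM", "FV"]})
train_ohe = pd.get_dummies(train, drop_first=True, dtype=float)

# Encode the new row without dropping anything, then align to the training columns:
# baseline or unseen columns are discarded, missing ones are filled with 0
new = pd.DataFrame({"Street": ["Pave"], "MSZoning": ["FV"]})
new_ohe = pd.get_dummies(new, dtype=float).reindex(columns=train_ohe.columns, fill_value=0.0)
print(new_ohe)
```

Here `MSZoning` is `FV`, the dropped baseline, so both `MSZoning` dummies end up as 0, while `Street_Pave` is 1.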


In [None]:
new_row_processed = pd.concat([new_row_log, new_row_ohe], axis = 1)
new_row_processed

In [None]:
# Regression model

new_row_pred_log = linreg.predict(new_row_processed)
new_row_pred_log

In [None]:
# Undo the preprocessing in reverse order: unscale in log space, then exponentiate
np.exp(new_row_pred_log * ames_log['SalePrice_log'].std() + ames_log['SalePrice_log'].mean())
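
The order of the back-transformation matters: the target was logged first and standardized second, so a prediction has to be unscaled in log space before it is exponentiated. A quick round-trip check on toy prices (made-up values) confirms the formula:

```python
import numpy as np
import pandas as pd

toy_price = pd.Series([100_000.0, 150_000.0, 220_000.0, 400_000.0])

log_price = np.log(toy_price)
mu, sigma = log_price.mean(), log_price.std()

z = (log_price - mu) / sigma        # forward: log, then standardize
recovered = np.exp(z * sigma + mu)  # inverse: unscale in log space, then exponentiate

print(np.allclose(recovered, toy_price))
```

Applying the mean and standard deviation of the raw prices, or exponentiating before unscaling, would not recover the original values.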

In [None]:
# Other way

In [None]:
# Make 2 df: 1 with continuous features, 1 with categorical features. Then perform transformations on them


In [None]:
continuous  # will build df with log transform and standardize continuous features

In [None]:
# SalePrice is only a placeholder so the column layout matches ames_log; it is dropped later
cont_test = pd.DataFrame({'LotArea': [14977],
                         '1stFlrSF': [1976],
                         'GrLivArea': [1976],
                         'SalePrice': [0.1]})
cont_test

In [None]:
log_test = cont_test.apply(lambda x: np.log(x))
log_test

In [None]:
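# Fit the scaler on the log-transformed training data, then apply it to the new row
# (StandardScaler uses the population std, so results differ slightly from `normalize`)
log_test.columns = ames_log.columns
ss = StandardScaler().fit(ames_log)
cont_test = pd.DataFrame(ss.transform(log_test), columns = log_test.columns)

In [None]:
cont_test.drop(columns = 'SalePrice_log', inplace = True)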

In [None]:
# will build df with categoricals
cat_df = pd.DataFrame({'BldgType': '1Fam',
                      'KitchenQual': 'Gd',
                      'SaleType': ['New'],
                      'MSZoning': ['RL'],
                      'Street': ['Pave'],
                      'Neighborhood': 'NridgHt'})
cat_df

In [None]:
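# Fit the encoder on the training categoricals, then encode the new row with it
ohe_method2 = OneHotEncoder(drop = 'first').fit(ames[categoricals])
cat_ohe = pd.DataFrame(ohe_method2.transform(cat_df).toarray(), columns = ohe_method2.get_feature_names_out(categoricals))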
cat_ohe

In [None]:
test_df = cont_test.join(cat_ohe)
test_df.head()

In [None]:
X.shape

In [None]:
#Prediction
pred = linreg.predict(test_df)

In [None]:
# convert the standardized log prediction back to a sale price
np.exp((pred * ames_log['SalePrice_log'].std()) + ames_log['SalePrice_log'].mean())

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!