## Problem Statement

What should people look out for when investing in housing, and where? [The Ames Housing Dataset]

TL;DR
Linear Regression, LASSO and Ridge models will be developed to see which features have the most significant effect on the price of a house.
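
As a rough illustration of the plan above, the sketch below fits the three model types on synthetic data and records each model's test R-squared and number of non-zero coefficients. The data, the random seeds, the `alphas` grid and the `results` dictionary are all assumptions made for this example; none of it comes from the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only the first three drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale with statistics from the training split only, so penalties treat features equally.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    'linear': LinearRegression(),
    'lasso': LassoCV(cv=5),
    'ridge': RidgeCV(alphas=np.logspace(-3, 3, 13)),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    results[name] = {
        'r2': model.score(X_test_s, y_test),             # R-squared on held-out data
        'n_nonzero': int(np.sum(model.coef_ != 0)),      # features the model actually uses
    }

for name, res in results.items():
    print(name, round(res['r2'], 3), res['n_nonzero'])
```

LASSO is the only one of the three that can set coefficients exactly to zero, which is why the non-zero count is tracked alongside R-squared.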


We will compare the linear regression model against the LASSO model to determine which fares better on our data. A better model is one with fewer features (each with a large coefficient) and, ideally, a high R-squared. R-squared can be interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables.
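
To make the R-squared definition concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the numbers are hypothetical; `sklearn.metrics.r2_score` computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical sale prices (in $1000s) and predictions, for illustration only.
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]
score = r_squared(y_true, y_pred)
print(round(score, 3))  # 0.968
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.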

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

from scipy.stats import skew

In [None]:
pd.set_option('display.width', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 120)

#### Feature Engineering

- Features that correlate very weakly with saleprice, or very strongly with other features, and were NOT dropped are combined into new features to see whether their correlation with saleprice can be improved.
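
One way to check whether a combined feature helps is to compare its correlation with the target against that of its components. The sketch below does this on a small made-up frame; the column names mirror the notebook's, but the values are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data: a count column, an ordinal quality column and a target.
toy = pd.DataFrame({
    'fireplaces':   [0, 1, 1, 2, 2, 0],
    'fireplace_qu': [0, 2, 4, 3, 5, 0],
    'saleprice':    [120, 150, 210, 230, 320, 110],
})

# Combined feature: count multiplied by quality.
toy['fireplace'] = toy['fireplaces'] * toy['fireplace_qu']

# Correlation of every candidate feature with the target, strongest first.
corr_with_price = (
    toy.corr()['saleprice']
       .drop('saleprice')
       .sort_values(ascending=False)
)
print(corr_with_price)
```

If the combined column does not rank above its components, keeping the originals may be the better choice.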

In [None]:
# Property age at sale, to determine whether the age of the house affects sale price
train_test_cleaned['prop_age'] = train_test_cleaned['yr_sold'] - train_test_cleaned['year_built']
train_test_cleaned.drop(columns=['yr_sold', 'year_built'], inplace=True)

In [None]:
# Pool area weighted by pool quality, to determine whether the pool affects sale price
train_test_cleaned['pool'] = train_test_cleaned['pool_area'] * train_test_cleaned['pool_qc']
train_test_cleaned.drop(columns=['pool_area', 'pool_qc'], inplace=True)

In [None]:
# Basement quality times condition, to determine whether the two together affect sale price
train_test_cleaned['bsmt'] = train_test_cleaned['bsmt_qual'] * train_test_cleaned['bsmt_cond']
train_test_cleaned.drop(columns=['bsmt_qual', 'bsmt_cond'], inplace=True)

In [None]:
# Fireplace count weighted by fireplace quality, to determine whether fireplaces affect sale price
train_test_cleaned['fireplace'] = train_test_cleaned['fireplaces'] * train_test_cleaned['fireplace_qu']
train_test_cleaned.drop(columns=['fireplaces', 'fireplace_qu'], inplace=True)

In [None]:
# Garage finish, quality and condition combined into a single garage score
train_test_cleaned['garage_aes'] = train_test_cleaned['garage_finish'] * train_test_cleaned['garage_qual'] * train_test_cleaned['garage_cond']
train_test_cleaned.drop(columns=['garage_finish', 'garage_cond', 'garage_qual'], inplace=True)

In [None]:
train_test_cleaned.describe().T

#### Dummies for nominal data

In [None]:
train_test_c_nom = train_test_cleaned.select_dtypes(include=object)
train_test_c_nom.info()

In [None]:
x = pd.get_dummies(train_test_c_nom, drop_first=True)
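
For reference, here is a small sketch of what `drop_first=True` does: for each nominal column, the first category (in sorted order) becomes the baseline and gets no dummy column. The frame below is a hypothetical example, not the notebook's data.

```python
import pandas as pd

# Hypothetical nominal columns with two and three categories respectively.
toy = pd.DataFrame({
    'central_air': ['Y', 'N', 'Y'],
    'garage_type': ['Attchd', 'Detchd', 'BuiltIn'],
})

dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
# ['central_air_Y', 'garage_type_BuiltIn', 'garage_type_Detchd']
```

Dropping one level per column avoids perfectly collinear dummy columns, which matters for the unregularised linear regression.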

In [None]:
x['id'] = train_test_cleaned.id

In [None]:
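# Compute the correlation matrix once and hide the redundant upper triangle.
# numeric_only=True (pandas >= 1.5) skips the object columns, which newer pandas no longer drops silently.
corr = train_test_cleaned.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(150,30))
sns.heatmap(corr, mask=mask, annot=True, square=True, cmap='Blues')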

In [None]:
train_test_cleaned.drop('pool', axis=1, inplace=True) 

In [None]:
train_test_fin = pd.merge(train_test_cleaned, x, how='outer', on='id')
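
An outer merge on `id` silently adds rows if either side has duplicate or unmatched ids. A hedged sketch of how this could be checked on toy frames follows; `validate` and `indicator` are standard `pd.merge` parameters, while the frames themselves are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned frame and the dummy frame.
left = pd.DataFrame({'id': [1, 2, 3], 'gr_liv_area': [1500, 1200, 2000]})
right = pd.DataFrame({'id': [1, 2, 3], 'central_air_Y': [1, 0, 1]})

# validate raises pandas.errors.MergeError if ids are duplicated on either side;
# indicator adds a '_merge' column recording where each row came from.
merged = pd.merge(left, right, how='outer', on='id',
                  validate='one_to_one', indicator=True)

all_matched = bool(merged['_merge'].eq('both').all())
print(len(merged), all_matched)  # 3 True
```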

In [None]:
train_test_fin.drop(columns=['ms_subclass', 'ms_zoning', 'condition_2', 'bldg_type', 'house_style',
                             'roof_style', 'exterior_2nd', 'bsmtfin_type_1', 'bsmtfin_type_2', 'heating', 'central_air',
                             'garage_type', 'sale_type'], inplace=True)

In [None]:
train_test_fin.describe().T


In [None]:
# Rows without a saleprice form the test set
test_final = train_test_fin[train_test_fin.saleprice.isnull()]
test_final = test_final.drop(columns='saleprice')

In [None]:
# dropna() removes every row with a missing value in any column,
# which includes the test rows (saleprice is null); the result is saved as train_final
train_nfinal = train_test_fin.dropna()
train_final = train_nfinal.drop(columns='neighborhood')
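
Note that `dropna()` with no arguments drops a row if any column is missing, not only `saleprice`. The toy sketch below (hypothetical values) contrasts it with `dropna(subset=['saleprice'])`, which removes only the rows lacking a target.

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: row 2 lacks lot_frontage, row 3 lacks saleprice.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'lot_frontage': [60.0, np.nan, 80.0, 70.0],
    'saleprice': [200000.0, 150000.0, np.nan, 180000.0],
})

test_rows = toy[toy['saleprice'].isnull()]          # rows with no target
train_any = toy.dropna()                            # drops rows with ANY missing value
train_target = toy.dropna(subset=['saleprice'])     # drops only rows missing the target

print(len(test_rows), len(train_any), len(train_target))  # 1 2 3
```

Which variant is appropriate depends on whether missing predictor values were already imputed upstream.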

In [None]:
test_final.to_csv('./test_final.csv')
train_final.to_csv('./train_final.csv')

##### Amending outliers in train_final

In [None]:
# Keep only rows below each feature's outlier threshold
train_final = train_final[train_final.open_porch_sf < 600]
train_final = train_final[train_final.screen_porch < 550]
train_final = train_final[train_final.lot_frontage < 250]
train_final = train_final[train_final.enclosed_porch < 800]
train_final = train_final[train_final.wood_deck_sf < 1250]
train_final = train_final[train_final.total_bsmt_sf < 4000]
train_final = train_final[train_final.mas_vnr_area < 1500]
train_final = train_final[train_final.gr_liv_area < 3800]
train_final = train_final[train_final['3ssn_porch'] < 450]
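
The chained filters above could also be expressed with a dictionary of upper bounds, which keeps the thresholds in one place. This is a sketch under assumptions: the helper `drop_above` and the toy frame are hypothetical, and only two of the thresholds are reproduced.

```python
import pandas as pd

# Subset of the upper bounds used above (column name -> exclusive upper limit).
upper_bounds = {'gr_liv_area': 3800, 'lot_frontage': 250}

def drop_above(df, bounds):
    """Keep only rows where every listed column is strictly below its limit."""
    keep = pd.Series(True, index=df.index)
    for col, limit in bounds.items():
        # Comparisons with NaN are False, so rows with missing values are dropped too,
        # exactly as with the chained filters.
        keep &= df[col] < limit
    return df[keep]

# Hypothetical rows: the second exceeds gr_liv_area, the third exceeds lot_frontage.
toy = pd.DataFrame({'gr_liv_area': [1500, 4200, 2000],
                    'lot_frontage': [60, 80, 300]})
filtered = drop_above(toy, upper_bounds)
print(len(filtered))  # 1
```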

In [None]:
train_final.drop(columns='id', inplace=True)

In [None]:
train_final.head()