# Project 2 - Ames Housing Data and Kaggle Challenge
## Revisited

When reviewing the projects I completed during my time with General Assembly, I found that Project 2, based on an existing Kaggle challenge, offered the most room for improvement. My notebooks were a mess: the repository had no clear organization, and each notebook ran the full 'Data Science Process' (cleaning through model deployment) for a different list of features.

The project itself was not particularly complicated, but my original approach to it was. Revisiting it gave me a chance to redo the work in a cleaner, more succinct way.

**Imports:**

In [44]:
import pandas as pd
pd.set_option("display.max_columns", None)

import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "darkgrid")

from sklearn import metrics 
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, HuberRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV

**Reading in Model-Ready Data**

In [2]:
#reading in training and testing data
ames_df = pd.read_csv('../datasets/train_model_ready.csv')
ames_test_df = pd.read_csv('../datasets/test_model_ready.csv')

In [3]:
ames_df.head()

Unnamed: 0,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,60,3,68.0,13517,1,1,1,1,1,1,1,0,0,1,1,2,6,8,1976,2005,2,1,2,1,1,289.0,4,3,2,3,3,1,6,533.0,1,0.0,192.0,725.0,1,5,1,2,725,754,0,1479,0.0,0.0,2,1,3,1,4,6,2,0,0,5,1976.0,2,2.0,475.0,3,3,2,0,44,0,0,0,0,0,0,0,0,3,2010,WD,130500
1,60,3,43.0,11492,1,1,1,1,1,1,1,0,1,1,1,2,7,5,1996,1997,2,1,3,3,1,132.0,4,3,2,4,3,1,6,637.0,1,0.0,276.0,913.0,1,5,1,2,913,1209,0,2122,1.0,0.0,2,1,4,1,4,8,2,1,3,5,1997.0,2,2.0,559.0,3,3,2,0,74,0,0,0,0,0,0,0,0,4,2009,WD,220000
2,20,3,68.0,7922,1,1,2,1,1,2,1,0,1,1,1,2,5,7,1953,2007,2,1,3,3,2,0.0,3,4,2,3,3,1,6,731.0,1,0.0,326.0,1057.0,1,3,1,2,1057,0,0,1057,1.0,0.0,1,0,3,1,4,5,2,0,0,1,1953.0,1,1.0,246.0,3,3,2,0,52,0,0,0,0,0,0,0,0,1,2010,WD,109000
3,60,3,73.0,9802,1,1,2,1,1,2,1,1,1,1,1,2,5,5,2006,2007,2,1,3,3,2,0.0,3,3,2,4,3,1,1,0.0,1,0.0,384.0,384.0,1,4,1,2,744,700,0,1444,0.0,0.0,2,1,3,1,3,7,2,0,0,3,2007.0,3,2.0,400.0,3,3,2,100,0,0,0,0,0,0,0,0,0,4,2010,WD,174000
4,50,3,82.0,14235,1,1,1,1,1,2,1,0,1,1,1,1,6,8,1900,1993,2,1,1,1,2,0.0,3,3,2,2,4,1,1,0.0,1,0.0,676.0,676.0,1,3,1,2,831,614,0,1445,0.0,0.0,2,0,3,1,3,6,2,0,0,1,1957.0,1,2.0,484.0,3,3,0,0,59,0,0,0,0,0,0,0,0,3,2010,WD,138500


In [4]:
ames_test_df.head()

Unnamed: 0,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type
0,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,NoFrPl,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,NoPl,NoFnc,NoMisc,0,4,2006,WD
1,90,RL,68.0,9662,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,NoFrPl,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,NoPl,NoFnc,NoMisc,0,8,2006,WD
2,60,RL,58.0,17104,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,NoPl,NoFnc,NoMisc,0,9,2006,New
3,30,RM,60.0,8520,Pave,NoAlley,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,NoFrPl,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,NoPl,NoFnc,NoMisc,0,7,2007,WD
4,20,RL,68.0,9500,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,NoPl,NoFnc,NoMisc,0,7,2009,WD


In [5]:
ames_df.shape

(2051, 79)

In [6]:
ames_test_df.shape

(878, 78)

**Setting Features**

In [7]:
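# correlation of each numeric column with saleprice; [:-1] drops saleprice itself (the last entry)
# numeric_only = True skips text columns such as sale_type (needs pandas 1.5 or newer)
ames_corr = ames_df.corr(numeric_only = True)['saleprice'][:-1]
# keep features with |r| > 0.10, ordered from most positive to most negative
list_feats = ames_corr[(ames_corr > 0.10) | (ames_corr < -0.10)].sort_values(ascending = False).keys()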
list_feats

Index(['overall_qual', 'exter_qual', 'gr_liv_area', 'kitchen_qual',
       'garage_area', 'garage_cars', 'total_bsmt_sf', '1st_flr_sf',
       'bsmt_qual', 'year_built', 'garage_finish', 'year_remod/add',
       'fireplace_qu', 'full_bath', 'garage_yr_blt', 'totrms_abvgrd',
       'mas_vnr_area', 'fireplaces', 'heating_qc', 'garage_type',
       'bsmtfin_sf_1', 'bsmt_exposure', 'bsmtfin_type_1', 'open_porch_sf',
       'wood_deck_sf', 'lot_frontage', 'lot_area', 'paved_drive',
       'garage_qual', 'bsmt_full_bath', 'half_bath', 'central_air',
       'garage_cond', 'foundation', 'electrical', '2nd_flr_sf', 'ms_zoning',
       'bsmt_cond', 'exterior_1st', 'bldg_type', 'bsmt_unf_sf', 'house_style',
       'bedroom_abvgr', 'alley', 'screen_porch', 'condition_1',
       'kitchen_abvgr', 'enclosed_porch', 'roof_style', 'lot_shape',
       'mas_vnr_type'],
      dtype='object')
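
The filter above keeps every column whose Pearson correlation with `saleprice` is larger than 0.10 in absolute value. A minimal sketch of the same idea on a toy frame follows. The helper name `select_by_correlation` and the toy values are illustrative assumptions, not part of the original notebook, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd


def select_by_correlation(df, target, threshold=0.10):
    """Return feature names whose |Pearson r| with `target` exceeds `threshold`."""
    # Correlate numeric columns only, then drop the target's correlation with itself.
    corr = df.corr(numeric_only=True)[target].drop(target)
    keep = corr[corr.abs() > threshold]
    # Order from most positive to most negative, as in the notebook.
    return keep.sort_values(ascending=False).index.tolist()


# Hypothetical toy data: one strong positive, one strong negative, one weak feature.
toy = pd.DataFrame({
    "gr_liv_area": [1000, 1500, 2000, 2500, 3000],
    "noise": [1, -1, 0, -1, 1],
    "age": [50, 40, 30, 20, 10],
    "sale_type": ["WD", "WD", "New", "WD", "New"],  # text column, ignored
    "saleprice": [100_000, 150_000, 210_000, 240_000, 310_000],
})

selected = select_by_correlation(toy, "saleprice")
print(selected)
```

Using `corr.abs()` is equivalent to the two-sided comparison in the cell above, and dropping the target by name avoids relying on it being the last column.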

**Train-Test Split**

In [8]:
#setting features and target:
X = ames_df[list_feats]
y = ames_df[['saleprice']]

In [9]:
# Train/test split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.20,
                                                    random_state = 42)

In [10]:
# Check train/test shape
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1640, 51), (411, 51), (1640, 1), (411, 1))

## Modeling

**Linear Regression**

In [11]:
lr_1 = LinearRegression()

lr_1.fit(X_train, y_train)
print(f'{lr_1.score(X_train, y_train)}, {lr_1.score(X_test, y_test)}')
print(abs(lr_1.score(X_train, y_train) - lr_1.score(X_test, y_test)))

0.8411258777638544, 0.8752960624925732
0.0341701847287188


In [12]:
cross_val_score(lr_1, X_train, y_train, cv = 5), cross_val_score(lr_1, X_test, y_test, cv = 5)

(array([0.87166792, 0.82577027, 0.85289895, 0.80651756, 0.63037171]),
 array([0.85978705, 0.87807926, 0.90490501, 0.84530776, 0.86406725]))

In [13]:
cross_val_score(lr_1, X_train, y_train, cv = 5).mean(), cross_val_score(lr_1, X_test, y_test, cv = 5).mean()

(0.7974452812656244, 0.8704292654118984)

In [14]:
cross_val_score(lr_1, X, y, cv = 5)

array([0.84359525, 0.87262677, 0.76373762, 0.87319694, 0.78912717])

In [15]:
cross_val_score(lr_1, X, y, cv = 5).mean()

0.8284567490108949

In [19]:
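def model_eval(true_val, pred_val):
    # RMSE as the square root of MSE; the `squared = False` shortcut was
    # deprecated in scikit-learn 1.4 and later removed
    rmse = np.sqrt(mean_squared_error(true_val, pred_val))
    r2 = r2_score(true_val, pred_val)
    
    return f'RMSE: {rmse}', f'R2 Score: {r2}'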

In [21]:
lr_1_preds = lr_1.predict(X_test)

In [22]:
model_eval(y_test, lr_1_preds)

('RMSE: 27220.036312363693', 'R2 Score: 0.8752960624925732')
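
To make the two numbers returned by `model_eval` concrete, here is a small sketch that computes RMSE and R2 by hand with plain Python. The four prices and predictions are made up for illustration. The function names `rmse` and `r2` are local helpers, not scikit-learn calls.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of an error, in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical sale prices and predictions, each off by exactly $10,000.
y_true = [100_000, 150_000, 200_000, 250_000]
y_pred = [110_000, 140_000, 190_000, 260_000]

example_rmse = rmse(y_true, y_pred)
example_r2 = r2(y_true, y_pred)
print(example_rmse, example_r2)
```

Read this way, the linear regression's RMSE of roughly 27,220 means its predictions on the held-out split are off by about $27,000 in the root-mean-square sense.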

In [16]:
#sorting features by the absolute size of their coefficient (ignoring sign)
lr_features = pd.DataFrame(X_train.columns, columns = ['feature'])
lr_features['abscoef'] = np.abs(lr_1.coef_)[0]
lr_features.sort_values(by='abscoef', ascending = False).head(10)

|    | feature | abscoef |
|---|---|---|
| 1 | exter_qual | 14649.656384 |
| 0 | overall_qual | 12490.376993 |
| 39 | bldg_type | 11860.975836 |
| 3 | kitchen_qual | 10814.863847 |
| 29 | bsmt_full_bath | 9128.733635 |
| 43 | alley | 8348.804226 |
| 32 | garage_cond | 7824.940568 |
| 8 | bsmt_qual | 6880.091449 |
| 37 | bsmt_cond | 6034.913028 |
| 45 | condition_1 | 5809.968673 |
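
One caveat about the ranking above: `lr_1` was fit on unscaled features at this point, so each coefficient is expressed per unit of its own feature. A 1-to-10 quality rating can therefore outrank square footage simply because its units are coarser. The sketch below illustrates this on synthetic data by comparing coefficient magnitudes before and after `StandardScaler`. The `area` and `quality` columns and their effect sizes are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Hypothetical features on very different scales.
area = rng.normal(1500, 500, n)                  # square feet
quality = rng.integers(1, 11, n).astype(float)   # rating from 1 to 10
price = 100 * area + 5_000 * quality + rng.normal(0, 1_000, n)

X = np.column_stack([area, quality])
names = ["area", "quality"]

# Raw fit: coefficients are "dollars per unit", so quality looks larger.
raw = LinearRegression().fit(X, price)

# Standardized fit: coefficients are "dollars per standard deviation".
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), price)

top_raw = names[int(np.argmax(np.abs(raw.coef_)))]
top_scaled = names[int(np.argmax(np.abs(scaled.coef_)))]
print(top_raw, top_scaled)
```

In this toy setup the raw ranking puts `quality` first while the standardized ranking puts `area` first, which is why coefficient magnitudes are usually compared only after scaling.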


**Huber Regressor**

In [17]:
# Reference on robust regression: https://machinelearningmastery.com/robust-regression-for-machine-learning-in-python/
huber = HuberRegressor(max_iter = 50_000)

huber.fit(X_train, y_train['saleprice'])
huber.score(X_train, y_train['saleprice']), huber.score(X_test, y_test['saleprice'])

(0.820107501641, 0.886503652227477)

In [18]:
huber_preds = huber.predict(X_test)

In [20]:
model_eval(y_test, huber_preds)

('RMSE: 25968.063246757178', 'R2 Score: 0.886503652227477')
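
A likely reason `HuberRegressor` edges out ordinary least squares on the test split is that its loss grows linearly rather than quadratically for large residuals, so a few unusual sales pull the fit less. The sketch below shows that behaviour on synthetic data. The true slope of 3, the noise level, and the five inflated targets are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(42)

# Synthetic line y = 3x plus unit noise.
x = np.linspace(0, 10, 100)
y = 3.0 * x + rng.normal(0, 1.0, size=x.size)

# Contaminate the five largest-x points with a big upward shift.
y[-5:] += 60.0

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1_000).fit(X, y)

# Distance of each estimated slope from the true slope of 3.
ols_error = abs(ols.coef_[0] - 3.0)
huber_error = abs(huber.coef_[0] - 3.0)
print(f"OLS slope error: {ols_error:.3f}, Huber slope error: {huber_error:.3f}")
```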

**Scaling with Standard Scaler**

In [23]:
sc = StandardScaler()

X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [24]:
lr_1.fit(X_train_sc, y_train)
lr_1.score(X_train_sc, y_train), lr_1.score(X_test_sc, y_test)

(0.8411258777638544, 0.8752960624925663)

In [25]:
cross_val_score(lr_1, X, y, cv = 5)

array([0.84359525, 0.87262677, 0.76373762, 0.87319694, 0.78912717])

In [26]:
cross_val_score(lr_1, X, y, cv = 5).mean()

0.8284567490108949

**LASSO**

In [27]:
y_train.shape

(1640, 1)

In [28]:
y_train['saleprice'].shape

(1640,)

In [29]:
l_alphas = np.logspace(-3, 3, 100)

lasso = LassoCV(alphas = l_alphas,
                cv = 5,
                max_iter = 50_000)

lasso.fit(X_train_sc, y_train['saleprice'])
lasso.score(X_train_sc, y_train['saleprice']), lasso.score(X_test_sc, y_test['saleprice'])

(0.8380414073383797, 0.8767392063265569)

In [30]:
lasso.alpha_

657.9332246575682

In [31]:
cross_val_score(lasso, X, y['saleprice'], cv = 5)

array([0.84457438, 0.87674522, 0.75803054, 0.87720662, 0.7903267 ])

In [32]:
cross_val_score(lasso, X, y['saleprice'], cv = 5).mean()

0.8293766924276179
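
Note that the two `cross_val_score` calls above receive the unscaled `X`, so the LASSO penalty is applied to features in their original units. One way to have scaling happen inside every fold is to wrap the scaler and the model in a pipeline. The sketch below does this on synthetic data. The data, the column scales, and the variable names are assumptions for illustration, and `LassoCV` is left on its default alpha grid rather than `l_alphas`.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Ten standard-normal features; only the first three drive the target.
Z = rng.normal(size=(200, 10))
y = 50 * Z[:, 0] + 30 * Z[:, 1] - 20 * Z[:, 2] + rng.normal(0, 5, size=200)

# Put the columns on very different scales, as raw housing features are.
scales = np.array([1, 10, 100, 1_000, 1, 10, 100, 1_000, 1, 10], dtype=float)
X = Z * scales

# The scaler is re-fit on the training part of each fold, so no
# information from the validation part leaks into the scaling.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=50_000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# Fit once on all data to see which features LASSO kept.
pipe.fit(X, y)
coefs = pipe[-1].coef_
kept = np.flatnonzero(coefs != 0)
print(kept)
```

The same pattern applies to `RidgeCV`: pass the pipeline, not the bare estimator, to `cross_val_score`.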

**Ridge**

In [33]:
ridge = RidgeCV(alphas = l_alphas)

ridge.fit(X_train_sc, y_train['saleprice'])
ridge.score(X_train_sc, y_train['saleprice']), ridge.score(X_test_sc, y_test['saleprice'])

(0.8356723959286619, 0.8762639311324675)

In [34]:
ridge.alpha_

432.87612810830615

**DecisionTreeRegressor**

- With default values:

In [49]:
dtr = DecisionTreeRegressor()

dtr.fit(X_train, y_train)
dtr.score(X_train, y_train['saleprice']), dtr.score(X_test, y_test['saleprice'])

(1.0, 0.8116214019827525)

- With GridSearch:

In [47]:
grid = GridSearchCV(estimator = dtr,
                    param_grid = {'max_depth': [None, 2, 3, 5, 7],
                                 'min_samples_split': [2, 5, 10, 15, 20],
                                 'min_samples_leaf': [1, 2, 3, 4, 5, 6],
                                 'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10]},
                    cv = 5,
                    verbose = 1)

In [48]:
import time
t0 = time.time()

grid.fit(X_train, y_train)

print(time.time() - t0)

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
49.40610980987549


In [50]:
grid.score(X_train, y_train), grid.score(X_test, y_test)

(0.9176567695807675, 0.811587995154812)

In [51]:
grid.best_params_

{'ccp_alpha': 0,
 'max_depth': None,
 'min_samples_leaf': 6,
 'min_samples_split': 20}

In [52]:
grid_preds = grid.predict(X_test)

In [53]:
model_eval(y_test, grid_preds)

('RMSE: 33458.22538656117', 'R2 Score: 0.811587995154812')