# Project 2 - Ames Housing Data and Kaggle Challenge
## Revisited

When reviewing the projects I completed during my time with General Assembly, I found that Project 2, based on an existing Kaggle challenge, offered the most room for improvement. My notebooks were a mess: the repository as a whole had no clear organization, and each individual notebook ran through the full 'Data Science Process' (cleaning through model deployment) for a different list of features.

While the project itself was not particularly complicated, the way I originally approached it was. Revisiting it gave me a chance to redo the work in a cleaner, more succinct manner.

**Imports:**

In [1]:
import pandas as pd
pd.set_option("display.max_columns", None)

import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "darkgrid")

from sklearn import metrics 
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, HuberRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

**Reading in Model-Ready Data**

In [2]:
#reading in training and testing data
ames_df = pd.read_csv('../datasets/train_model_ready.csv')
ames_test_df = pd.read_csv('../datasets/test_model_ready.csv')

In [3]:
ames_df.head()

Unnamed: 0,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,60,RL,68.0,13517,Pave,NoAlley,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,NoFrPl,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,NoPl,NoFnc,NoMisc,0,3,2010,WD,130500
1,60,RL,43.0,11492,Pave,NoAlley,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,NoPl,NoFnc,NoMisc,0,4,2009,WD,220000
2,20,RL,68.0,7922,Pave,NoAlley,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,NoFrPl,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,NoPl,NoFnc,NoMisc,0,1,2010,WD,109000
3,60,RL,73.0,9802,Pave,NoAlley,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,NoFrPl,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,NoPl,NoFnc,NoMisc,0,4,2010,WD,174000
4,50,RL,82.0,14235,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,NoFrPl,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,NoPl,NoFnc,NoMisc,0,3,2010,WD,138500


In [4]:
ames_test_df.head()

Unnamed: 0,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type
0,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,NoFrPl,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,NoPl,NoFnc,NoMisc,0,4,2006,WD
1,90,RL,68.0,9662,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,NoFrPl,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,NoPl,NoFnc,NoMisc,0,8,2006,WD
2,60,RL,58.0,17104,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,NoPl,NoFnc,NoMisc,0,9,2006,New
3,30,RM,60.0,8520,Pave,NoAlley,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,NoFrPl,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,NoPl,NoFnc,NoMisc,0,7,2007,WD
4,20,RL,68.0,9500,Pave,NoAlley,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,NoPl,NoFnc,NoMisc,0,7,2009,WD


In [5]:
ames_df.shape

(2051, 79)

In [6]:
ames_test_df.shape

(878, 78)

**Setting Features**

In [7]:
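# restrict to numeric columns so .corr() also works on pandas versions that no longer skip object columns
ames_corr = ames_df.select_dtypes('number').corr()['saleprice'][:-1]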
list_feats = ames_corr[(ames_corr > 0.10) | (ames_corr < -0.10)].sort_values(ascending = False).keys()
list_feats

Index(['overall_qual', 'gr_liv_area', 'garage_area', 'garage_cars',
       'total_bsmt_sf', '1st_flr_sf', 'year_built', 'year_remod/add',
       'full_bath', 'garage_yr_blt', 'totrms_abvgrd', 'mas_vnr_area',
       'fireplaces', 'bsmtfin_sf_1', 'open_porch_sf', 'wood_deck_sf',
       'lot_frontage', 'lot_area', 'bsmt_full_bath', 'half_bath', '2nd_flr_sf',
       'bsmt_unf_sf', 'bedroom_abvgr', 'screen_porch', 'kitchen_abvgr',
       'enclosed_porch'],
      dtype='object')
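The filter in In [7] keeps every numeric column whose correlation with `saleprice` is above 0.10 or below -0.10. The same idea can be written as a single comparison on the absolute value. A minimal sketch on a made-up toy frame (`toy_df` and its columns are illustrative assumptions, not part of the Ames data):

```python
import pandas as pd

# Toy stand-in for ames_df; the column names and values are invented for illustration.
toy_df = pd.DataFrame({
    "area": [1, 2, 3, 4, 5, 6],
    "age": [6, 5, 4, 3, 2, 1],
    "noise": [1, 0, 0, 0, 0, 1],
    "label": list("abcabc"),
    "saleprice": [10, 20, 30, 40, 50, 60],
})

# Correlate numeric columns only, then drop the target's correlation with itself.
toy_corr = toy_df.select_dtypes("number").corr()["saleprice"].drop("saleprice")

# One absolute-value comparison replaces the two-sided (> 0.10) | (< -0.10) test.
toy_feats = toy_corr[toy_corr.abs() > 0.10].sort_values(ascending=False).index.tolist()
print(toy_feats)
```

Dropping the target by name, rather than slicing with `[:-1]`, does not rely on `saleprice` being the last column.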

In [8]:
list_feats_obj = ames_df.select_dtypes(exclude = 'number').columns
list_feats_obj

Index(['ms_zoning', 'street', 'alley', 'lot_shape', 'land_contour',
       'utilities', 'lot_config', 'land_slope', 'neighborhood', 'condition_1',
       'condition_2', 'bldg_type', 'house_style', 'roof_style', 'roof_matl',
       'exterior_1st', 'exterior_2nd', 'mas_vnr_type', 'exter_qual',
       'exter_cond', 'foundation', 'bsmt_qual', 'bsmt_cond', 'bsmt_exposure',
       'bsmtfin_type_1', 'bsmtfin_type_2', 'heating', 'heating_qc',
       'central_air', 'electrical', 'kitchen_qual', 'functional',
       'fireplace_qu', 'garage_type', 'garage_finish', 'garage_qual',
       'garage_cond', 'paved_drive', 'pool_qc', 'fence', 'misc_feature',
       'sale_type'],
      dtype='object')

**Train-Test Split**

In [9]:
#setting features and target:
X = pd.get_dummies(ames_df, columns = list_feats_obj, drop_first = True).drop(columns = 'saleprice')
y = ames_df[['saleprice']]
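
Because `X` is built with `pd.get_dummies`, any other frame passed to the fitted models (for example `ames_test_df`) needs the same dummy columns in the same order. The notebook does not show that step here, so the following is only a sketch of one common approach, using hypothetical toy frames (`toy_train`, `toy_test`) rather than the Ames data:

```python
import pandas as pd

# Hypothetical toy frames; "zone" plays the role of a categorical Ames column.
toy_train = pd.DataFrame({"area": [1, 2, 3], "zone": ["A", "B", "C"]})
toy_test = pd.DataFrame({"area": [4, 5], "zone": ["B", "D"]})

# Training features: one-hot encode and drop the first level, as in In [9].
train_X = pd.get_dummies(toy_train, columns=["zone"], drop_first=True)

# Holdout features: encode every level, then align to the training columns.
# Levels unseen in training ("D") are dropped; levels missing from the holdout are filled with 0.
test_X = pd.get_dummies(toy_test, columns=["zone"]).reindex(columns=train_X.columns, fill_value=0)

print(test_X.columns.tolist())
```

Encoding the holdout frame without `drop_first` and letting `reindex` pick the columns avoids dropping a different baseline level when the two frames do not share the same first category.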

In [10]:
# Train/test split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.20,
                                                    random_state = 42)

In [11]:
# Check train/test shape
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1640, 260), (411, 260), (1640, 1), (411, 1))

In [12]:
X_train.isna().sum().sort_values(ascending = False)

ms_subclass          0
foundation_Stone     0
bsmt_qual_Fa         0
bsmt_qual_Gd         0
bsmt_qual_NoBsmt     0
                    ..
condition_1_RRAn     0
condition_1_RRNe     0
condition_1_RRNn     0
condition_2_Feedr    0
sale_type_WD         0
Length: 260, dtype: int64

## Modeling

**Linear Regression**

In [13]:
lr_1 = LinearRegression()

lr_1.fit(X_train, y_train)
print(f'{lr_1.score(X_train, y_train)}, {lr_1.score(X_test, y_test)}')
print(abs(lr_1.score(X_train, y_train) - lr_1.score(X_test, y_test)))

0.9442018393894729, 0.9249765807891271
0.01922525860034574


In [14]:
cross_val_score(lr_1, X_train, y_train, cv = 5), cross_val_score(lr_1, X_test, y_test, cv = 5)

(array([0.92228758, 0.61676296, 0.58445918, 0.79351056, 0.50253562]),
 array([0.84426306, 0.79798976, 0.82320348, 0.81954553, 0.8293604 ]))

In [15]:
cross_val_score(lr_1, X_train, y_train, cv = 5).mean(), cross_val_score(lr_1, X_test, y_test, cv = 5).mean()

(0.6839111802119382, 0.822872444560333)

In [16]:
cross_val_score(lr_1, X, y, cv = 5)

array([0.71553983, 0.91891329, 0.68944041, 0.89233223, 0.81040976])

In [17]:
cross_val_score(lr_1, X, y, cv = 5).mean()

0.805327101622406

**Huber Regressor**

In [33]:
#https://machinelearningmastery.com/robust-regression-for-machine-learning-in-python/
huber = HuberRegressor(max_iter = 50_000)

huber.fit(X_train, y_train['saleprice'])
huber.score(X_train, y_train['saleprice']), huber.score(X_test, y_test['saleprice'])

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.8491797363732768, 0.9191667818260718)
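
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route, assuming synthetic data from `make_regression` in place of the Ames features (the names `X_demo`, `y_demo`, and `huber_pipe` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem; the Ames data is not used here.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
# Put the columns on very different scales, as raw housing features often are.
X_demo = X_demo * np.array([1, 10, 100, 1_000, 10_000])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# The scaler is fitted on the training rows only, then applied before HuberRegressor.
huber_pipe = make_pipeline(StandardScaler(), HuberRegressor(max_iter=1_000))
huber_pipe.fit(X_tr, y_tr)

huber_r2 = huber_pipe.score(X_te, y_te)
print(f"test R2: {huber_r2:.3f}")
```

Whether this removes the warning on the Ames features is untested here; it only illustrates the suggestion made in the warning text.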

In [34]:
huber_preds = huber.predict(X_test)

In [35]:
# model_eval is defined in cell In [18] below; the Huber cells (In [33]-[35]) were run after it
model_eval(y_test, huber_preds)

('RMSE: 21915.104795537616', 'R2 Score: 0.9191667818260718')

In [18]:
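def model_eval(true_val, pred_val):
    # RMSE as the square root of MSE; avoids the `squared` argument, which newer scikit-learn releases no longer accept
    rmse = np.sqrt(mean_squared_error(true_val, pred_val))
    r2 = r2_score(true_val, pred_val)

    return f'RMSE: {rmse}', f'R2 Score: {r2}'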

In [19]:
lr_1_preds = lr_1.predict(X_test)

In [20]:
model_eval(y_test, lr_1_preds)

('RMSE: 21112.858822152903', 'R2 Score: 0.9249765807891271')
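
For reference, the two numbers reported by `model_eval` can be reproduced by hand. A small worked sketch on made-up values (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the Ames data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE: square root of the mean squared residual.
residuals = y_true_demo - y_pred_demo
rmse_by_hand = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

# Both should agree with scikit-learn's implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
r2_sklearn = r2_score(y_true_demo, y_pred_demo)
print(rmse_by_hand, r2_by_hand)
```

RMSE is in the units of the target (dollars for `saleprice`), while R2 is unitless, which is why the notebook reports both.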

**Scaling with Standard Scaler**

In [21]:
sc = StandardScaler()

X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [22]:
lr_1.fit(X_train_sc, y_train)
lr_1.score(X_train_sc, y_train), lr_1.score(X_test_sc, y_test)

(0.9442013257834327, -3.790913846045648e+20)

In [23]:
# note: X is the unscaled feature matrix, so these scores are identical to In [16]
cross_val_score(lr_1, X, y, cv = 5)

array([0.71553983, 0.91891329, 0.68944041, 0.89233223, 0.81040976])

In [24]:
cross_val_score(lr_1, X, y, cv = 5).mean()

0.805327101622406

**LASSO**

In [25]:
y_train.shape

(1640, 1)

In [26]:
y_train['saleprice'].shape

(1640,)

In [27]:
l_alphas = np.logspace(-3, 3, 100)

lasso = LassoCV(alphas = l_alphas,
                cv = 5,
                max_iter = 50_000)

lasso.fit(X_train_sc, y_train['saleprice'])
lasso.score(X_train_sc, y_train['saleprice']), lasso.score(X_test_sc, y_test['saleprice'])

(0.9089581949963439, 0.9059163448698833)

In [28]:
lasso.alpha_

657.9332246575682

In [29]:
# note: X is the unscaled feature matrix, unlike the X_train_sc used to fit lasso in In [27]
cross_val_score(lasso, X, y['saleprice'], cv = 5)

array([0.86483147, 0.89954753, 0.77503515, 0.90555985, 0.82648217])

In [30]:
cross_val_score(lasso, X, y['saleprice'], cv = 5).mean()

0.8542912319417914
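
The cross-validation calls above pass the unscaled `X`, while the LASSO model itself was fitted on `X_train_sc`. One way to cross-validate on scaled features is to wrap the scaler and the model in a pipeline, so the scaler is refitted on the training part of every fold. A hedged sketch on synthetic data (`X_demo`, `y_demo`, and `lasso_pipe` are assumptions for illustration; the alpha grid is shortened to keep the example quick):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the Ames features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Scaling happens inside the pipeline, so no fold's validation rows influence the scaler.
lasso_pipe = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 3, 20), cv=5, max_iter=50_000),
)

cv_scores = cross_val_score(lasso_pipe, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```

The same pattern would apply to `RidgeCV`; the scores it would give on the Ames data are not shown here because they were not computed in the original notebook.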

**Ridge**

In [31]:
ridge = RidgeCV(alphas = l_alphas)

ridge.fit(X_train_sc, y_train['saleprice'])
ridge.score(X_train_sc, y_train['saleprice']), ridge.score(X_test_sc, y_test['saleprice'])

(0.9018617525737664, 0.8949899014488097)

In [32]:
ridge.alpha_

657.9332246575682