# Modeling Benchmarks

This notebook runs several models over several different feature sets and compares each of them to the null (mean-only) model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# Import CSV
clean_train = pd.read_csv('../data/cleaned_train.csv')
clean_test = pd.read_csv('../data/cleaned_test.csv')

# Creating a baseline

In [4]:
y = clean_train['saleprice']

In [5]:
baseline = clean_train['saleprice'].mean()
baseline

181469.70160897123

We can't make model-based predictions yet; the features are chosen in the [selection notebook](https://git.generalassemb.ly/laternader/project_2/blob/master/deliverables/code/1.5%20-%20Select-Features.ipynb).

For now we can make a null prediction: `baseline`, the mean of all sale prices, is predicted for every house. Subtracting it from each value in the 'saleprice' column gives us the null residuals.

In [6]:
null_resids = y - baseline
null_resids[:5]

0   -50969.701609
1    38530.298391
2   -72469.701609
3    -7469.701609
4   -42969.701609
Name: saleprice, dtype: float64

In [7]:
np.mean(null_resids**2)

6278872217.837828

In [8]:
# This is the null model before the split
clean_train['null_pred'] = baseline
#                                     y before the split  , calculated mean of y
mse_null = metrics.mean_squared_error(clean_train['saleprice'], clean_train['null_pred'])
mse_null

6278872217.837828

In [9]:
rmse_null = mse_null**.5
rmse_null

79239.33504161824

Now we can compare with our models.
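As a sanity check on the baseline, the RMSE of a mean-only prediction should equal the population standard deviation of the target. Here is a minimal sketch of that identity. It uses five prices recovered from the residuals printed above by adding the baseline back, purely as a small example; the names `y_demo`, `rmse_null_demo` and `std_demo` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Five sale prices used only as a small example (not the full training set)
y_demo = pd.Series([130500.0, 220000.0, 109000.0, 174000.0, 138500.0])

# Null model: predict the mean for every row
null_pred = np.full(len(y_demo), y_demo.mean())
rmse_null_demo = np.sqrt(np.mean((y_demo - null_pred) ** 2))

# The same number falls out of the population standard deviation (ddof=0)
std_demo = y_demo.std(ddof=0)
print(rmse_null_demo, std_demo)
```

So `rmse_null` can be read as "how far a typical sale price sits from the average price", which is the bar every model below has to beat.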

---
# Feature Set 1

The features here are the ones that had a high positive correlation with 'saleprice' in the training data. This section fits LinearRegression, Lasso, and Ridge models on them.

In [10]:
features1 = ['overall_qual',
  'year_built',
  'year_remod/add',
  'mas_vnr_area',
  'total_bsmt_sf',
  '1st_flr_sf',
  'gr_liv_area',
  'full_bath',
  'garage_yr_blt',
  'garage_cars',
  'garage_area']

In [11]:
X_train = clean_train[features1]
X_test = clean_test[features1]
y_train = clean_train['saleprice']

This code helps select any columns that need to be dummified.

In [12]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features1].select_dtypes(include='object'))
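To show what the `select_dtypes` call in the cell above returns, here is a small sketch on a toy frame. The values are hypothetical and the text column is forced to `object` dtype to mirror how the notebook detects categorical columns. Since `features1` goes straight into `LinearRegression` without dummies, the list is presumably empty for that set.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset's style
demo = pd.DataFrame({
    'overall_qual': [6, 7, 5],
    'gr_liv_area': [1479, 2122, 1057],
    'neighborhood': pd.Series(['Sawyer', 'SawyerW', 'NAmes'], dtype=object),
})

# Columns with object dtype are the ones that would need dummies
object_cols = list(demo.select_dtypes(include='object'))
numeric_cols = list(demo.select_dtypes(include='number'))
print(object_cols)
print(numeric_cols)
```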

We will first fit the model on a train/validation split of `train` before applying it to the full data.

In [13]:
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69420)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train_validation, y_train_validation)

LinearRegression()

In [14]:
lr.score(X_train_validation,y_train_validation), lr.score(X_test_validation,y_test_validation)

(0.7925242267944131, 0.8224348663093287)

In [15]:
preds = lr.predict(X_test_validation)
preds[:5]

array([164343.0559405 , 343233.76737602, 164352.24749496, 159379.24111704,
       183737.62333763])

Now that we have the R2 scores, we need the other metrics to compare the models. The function below computes them.

In [16]:
# The idea came from lab 3.01
def big_metrics(y, predictions, features):
    # 1. Calculate Sum Squared Error; SSE
    SSE = ((y - predictions)**2).sum()
    
    # 2. Calculate Mean Squared Error; MSE
    MSE = SSE/len(y)
    
    # 3. Calculate Root Mean Squared Error; RMSE
    RMSE = np.sqrt(MSE)
    
    # 4. Calculate Mean Absolute Error; MAE
    MAE = (abs(y - predictions)).sum()/len(y)
    
    # 5. Calculate R2 Score; R2
    Null_SSE = ((y - y.mean())**2).sum() # Need null to calculate R2
    R2 = 1 - SSE/Null_SSE
    
    # 6. Calculate R2 Adjusted
    R2_adj = 1-(((1-R2)*(len(y)-1))/(len(y)-len(features)-1))
    
    return SSE, MSE, RMSE, MAE, R2, R2_adj

metric_names = ['SSE', 'MSE', 'RMSE', 'MAE', 'R2', 'R2_adj']
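The hand-written formulas can be cross-checked against `sklearn.metrics`. The sketch below uses made-up targets and predictions, and `n_predictors` is an assumed name for the number of columns in the design matrix. For the dummified feature sets later on, the column count of `X` after `get_dummies` may be the more appropriate value for the adjusted R2 than the length of the original feature list.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0, 250000.0, 120000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 185000.0, 240000.0, 135000.0])
n_predictors = 2  # assumed number of columns in X for this toy case

mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
mae = metrics.mean_absolute_error(y_true, y_hat)
r2 = metrics.r2_score(y_true, y_hat)

# Adjusted R2 penalizes R2 by the number of predictors
n = len(y_true)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
print(mse, rmse, mae, r2, r2_adj)
```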

In [17]:
f1_metrics = pd.Series(big_metrics(y_test_validation, preds, features1),index=metric_names)
f1_metrics

SSE       5.234070e+11
MSE       1.020287e+09
RMSE      3.194193e+04
MAE       2.297774e+04
R2        8.224349e-01
R2_adj    8.185362e-01
dtype: float64

In [18]:
f1_metrics['RMSE'], rmse_null

(31941.92581264128, 79239.33504161824)

Now we can compare against the null model.

The RMSE of my first model on `features1` (about 31,942) is well below `rmse_null` (about 79,239), so the model is a clear improvement over the baseline.

In [19]:
pd.Series(lr.coef_,index=features1)

overall_qual      19943.446696
year_built          299.341674
year_remod/add      368.118239
mas_vnr_area         42.652246
total_bsmt_sf        14.067156
1st_flr_sf           17.368340
gr_liv_area          40.173878
full_bath         -4996.950326
garage_yr_blt      -121.236634
garage_cars        8341.460468
garage_area          37.286706
dtype: float64

Two features, `full_bath` and `garage_yr_blt`, have negative coefficients, so the predicted price drops as they increase with the other features held fixed. I think we can build a better model.

# Let's fit the first set of features with Ridge and RidgeCV

In order to do that, we need to standardize the features with `StandardScaler`.

In [20]:
# Re-create the earlier split (same random_state, so the same rows)
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69420)
# Standardize: fit the scaler on the training rows only, then apply it to the validation rows
ss = StandardScaler()
Z_train = ss.fit_transform(X_train_validation)
Z_test = ss.transform(X_test_validation)

ridge_model = Ridge(alpha=5)
ridge_model.fit(Z_train, y_train_validation)

print('Ridge Training score:', ridge_model.score(Z_train, y_train_validation))
print('Ridge Test score:', ridge_model.score(Z_test, y_test_validation)) 

Ridge Training score: 0.7925207788470631
Ridge Test score: 0.8223824929134431
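Scaling matters here because Ridge and Lasso penalize coefficient size, and unscaled features on large units would be penalized unevenly. The following is a minimal sketch of the fit-on-train, transform-on-test pattern with made-up numbers; `X_tr`, `X_te` and `scaler` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (a quality score and square feet)
X_tr = np.array([[5.0, 1000.0], [7.0, 2000.0], [6.0, 1500.0], [8.0, 2500.0]])
X_te = np.array([[6.0, 1800.0]])

scaler = StandardScaler()
Z_tr = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
Z_te = scaler.transform(X_te)      # reuse those statistics on held-out rows

print(Z_tr.mean(axis=0), Z_tr.std(axis=0))
print(Z_te)
```

After scaling, each training column has mean 0 and standard deviation 1, which is also why the Ridge coefficients further down are on a different scale from the plain LinearRegression coefficients.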


In [21]:
r_alphas = np.logspace(0, 5, 100)
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2',cv=5)
ridge_cv.fit(Z_train, y_train_validation);

In [22]:
print('RidgeCV Training score:',ridge_cv.score(Z_train, y_train_validation))
print('RidgeCV Test score:',ridge_cv.score(Z_test, y_test_validation))

RidgeCV Training score: 0.7917988595805897
RidgeCV Test score: 0.8209229268500502
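It can help to look at which penalty `RidgeCV` actually picked from the grid. Below is a sketch on synthetic data (not the housing data), using the same `alphas`, `scoring` and `cv` settings as above; the fitted `alpha_` attribute holds the selected value.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic standardized data, for illustration only
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))
y_syn = Z @ np.array([3.0, 0.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=200)

alphas = np.logspace(0, 5, 100)
cv_model = RidgeCV(alphas=alphas, scoring='r2', cv=5).fit(Z, y_syn)

# alpha_ is the penalty strength chosen by cross-validation
print(cv_model.alpha_)
print(cv_model.score(Z, y_syn))
```

If the chosen value sits at either end of the grid, that is usually a hint to widen the grid in that direction.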


Calculate the RMSE on `Ridge` and `RidgeCV` models.

In [23]:
Z_test_preds = ridge_model.predict(Z_test)
Z_test_preds[:5]

array([164696.09016549, 342886.16306817, 164328.88212282, 159396.09397223,
       183784.30335765])

In [24]:
f1_metrics_r = pd.Series(big_metrics(y_test_validation, Z_test_preds, features1),index=metric_names)
f1_metrics_r

SSE       5.235614e+11
MSE       1.020588e+09
RMSE      3.194664e+04
MAE       2.296204e+04
R2        8.223825e-01
R2_adj    8.184827e-01
dtype: float64

In [25]:
f1_metrics['RMSE'], f1_metrics_r['RMSE'], rmse_null

(31941.92581264128, 31946.636151185918, 79239.33504161824)

In [26]:
pd.Series(ridge_model.coef_,index=features1)

overall_qual      28259.620523
year_built         8963.330739
year_remod/add     7774.782363
mas_vnr_area       7747.665538
total_bsmt_sf      6432.741516
1st_flr_sf         7071.773072
gr_liv_area       20124.262765
full_bath         -2648.375006
garage_yr_blt     -2944.313413
garage_cars        6377.296810
garage_area        6799.344874
dtype: float64

In [27]:
Z_test_preds = ridge_cv.predict(Z_test)
Z_test_preds[:5]

array([169435.09918185, 338122.82168343, 164146.53128715, 159534.31301852,
       184368.40156621])

In [28]:
f1_metrics_rcv = pd.Series(big_metrics(y_test_validation, Z_test_preds, features1),index=metric_names)
f1_metrics_rcv

SSE       5.278638e+11
MSE       1.028974e+09
RMSE      3.207763e+04
MAE       2.277429e+04
R2        8.209229e-01
R2_adj    8.169911e-01
dtype: float64

In [29]:
f1_metrics['RMSE'], f1_metrics_r['RMSE'], f1_metrics_rcv['RMSE'], rmse_null

(31941.92581264128, 31946.636151185918, 32077.627816421944, 79239.33504161824)

In [30]:
pd.Series(ridge_cv.coef_,index=features1)

overall_qual      26205.448081
year_built         7935.802368
year_remod/add     7773.871003
mas_vnr_area       8064.937525
total_bsmt_sf      6888.817077
1st_flr_sf         7364.787214
gr_liv_area       18819.901729
full_bath         -1361.541544
garage_yr_blt     -1583.584147
garage_cars        6972.047774
garage_area        6433.793926
dtype: float64

# We can also use LassoCV

In [31]:
l_alphas = np.logspace(1, 0, 100)
# Instantiate
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=50000)
# Fit
lasso_cv.fit(Z_train, y_train_validation);

In [32]:
lasso_cv.alpha_

9.54548456661834

In [33]:
print(lasso_cv.score(Z_train, y_train_validation))
print(lasso_cv.score(Z_test, y_test_validation))

0.7925239474551466
0.8224310871378363


In [34]:
Z_test_preds = lasso_cv.predict(Z_test)
Z_test_preds[:5]

array([164369.05259128, 343167.8241581 , 164323.20754674, 159374.80568804,
       183747.67484936])

In [35]:
lasso_cv.coef_

array([28407.57674559,  9009.3938593 ,  7752.6913927 ,  7721.01169798,
        6396.21145898,  7052.61505885, 20202.16313906, -2713.92777713,
       -2998.62290198,  6328.74151544,  6814.20418289])

In [36]:
f1_metrics_lasso = pd.Series(big_metrics(y_test_validation, Z_test_preds, features1),index=metric_names)
f1_metrics_lasso

SSE       5.234182e+11
MSE       1.020308e+09
RMSE      3.194227e+04
MAE       2.297518e+04
R2        8.224311e-01
R2_adj    8.185324e-01
dtype: float64

In [37]:
f1_metrics_lasso['RMSE'], f1_metrics['RMSE'], f1_metrics_r['RMSE'], f1_metrics_rcv['RMSE'], rmse_null

(31942.265725580957,
 31941.92581264128,
 31946.636151185918,
 32077.627816421944,
 79239.33504161824)

In [38]:
pd.Series(lasso_cv.coef_,index=features1)

overall_qual      28407.576746
year_built         9009.393859
year_remod/add     7752.691393
mas_vnr_area       7721.011698
total_bsmt_sf      6396.211459
1st_flr_sf         7052.615059
gr_liv_area       20202.163139
full_bath         -2713.927777
garage_yr_blt     -2998.622902
garage_cars        6328.741515
garage_area        6814.204183
dtype: float64

In [39]:
X_train_validation.shape

(1538, 11)

All of the models did better than the null, but the best one by RMSE was the LinearRegression model, with LassoCV less than a dollar behind.
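Putting the reported RMSEs side by side makes the ranking easier to read. The sketch below copies the values from the cell outputs above into a small table; the labels and the `pct_of_null` column are my own additions for illustration.

```python
import pandas as pd

# RMSE values copied from the cell outputs above
rmse_by_model = pd.Series({
    'LinearRegression': 31941.92581264128,
    'Ridge(alpha=5)': 31946.636151185918,
    'RidgeCV': 32077.627816421944,
    'LassoCV': 31942.265725580957,
    'Null (mean)': 79239.33504161824,
})

# Express each RMSE as a percentage of the null model's RMSE
comparison = pd.DataFrame({
    'RMSE': rmse_by_model,
    'pct_of_null': 100 * rmse_by_model / rmse_by_model['Null (mean)'],
}).sort_values('RMSE')
print(comparison.round(2))
```

The four fitted models are within a few hundred dollars of each other on this single split, so the ranking among them should be read with some caution.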

Now that we have computed that for our first set of features, the next set of features will follow the same route.

---
# Feature Set 2

In [40]:
# Features
features2 = ['ms_subclass',

  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_cond',
  'year_built',

  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [41]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features2].select_dtypes(include='object'))
dummies_columns

['neighborhood',
 'bldg_type',
 'house_style',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [42]:
X_train = clean_train[features2]
X_test = clean_test[features2]
y_train = clean_train['saleprice']

In [43]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)
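One thing to watch when dummifying train and test separately: a category that appears in only one of them produces mismatched columns, and a model fit on one frame cannot predict on the other. A common way around this is to reindex the test frame to the training columns. This is a sketch on toy frames with hypothetical values, not something this notebook does.

```python
import pandas as pd

# Toy frames with hypothetical values; 'Greens' appears only in the training rows
train_demo = pd.DataFrame({'neighborhood': ['NAmes', 'Greens', 'Sawyer'],
                           'full_bath': [1, 2, 2]})
test_demo = pd.DataFrame({'neighborhood': ['NAmes', 'Sawyer'],
                          'full_bath': [1, 1]})

train_d = pd.get_dummies(train_demo, columns=['neighborhood'])
test_d = pd.get_dummies(test_demo, columns=['neighborhood'])

# Give the test frame exactly the training columns, filling absent levels with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns))
print(list(test_aligned.columns))
```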

In [44]:
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69420)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train_validation, y_train_validation)

LinearRegression()

In [45]:
lr.score(X_train_validation,y_train_validation), lr.score(X_test_validation,y_test_validation)

(0.8287125336393979, 0.8345724494443683)

In [46]:
preds = lr.predict(X_test_validation)
preds[:5]

array([186833.12661567, 403152.85403427, 168639.78656847, 142373.98035003,
       146796.53765383])

In [47]:
f2_metrics = pd.Series(big_metrics(y_test_validation, preds, features2),index=metric_names)
f2_metrics

SSE       4.876292e+11
MSE       9.505443e+08
RMSE      3.083090e+04
MAE       2.258889e+04
R2        8.345724e-01
R2_adj    8.271451e-01
dtype: float64

In [48]:
f1_metrics['RMSE'], f2_metrics['RMSE'], rmse_null

(31941.92581264128, 30830.897683095838, 79239.33504161824)

In [49]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass               -93.871708
overall_cond             6062.117989
year_built                461.881873
full_bath               17391.830566
half_bath               11053.458720
bedroom_abvgr           -2675.642632
totrms_abvgrd            8636.575640
fireplaces              16950.190376
garage_cars             17887.167258
mo_sold                   -48.719597
yr_sold                  -366.154345
neighborhood_Blmngtn   -15214.129744
neighborhood_Blueste    -5091.724108
neighborhood_BrDale     -1000.533649
neighborhood_BrkSide   -17594.655101
neighborhood_ClearCr    11210.790617
neighborhood_CollgCr    -8910.520583
neighborhood_Crawfor    14775.911623
neighborhood_Edwards   -24211.112455
neighborhood_Gilbert   -29237.040298
neighborhood_Greens     36122.375018
neighborhood_IDOTRR    -24042.155727
neighborhood_MeadowV   -14308.319776
neighborhood_Mitchel   -16399.435914
neighborhood_NAmes     -16915.930058
neighborhood_NPkVill   -11455.321313
neighborhood_NWAmes    -14401.476128
...

In [50]:
X_train_validation.shape

(1538, 85)

The second model improved on the first, but not by much: the RMSE fell from about 31,942 to about 30,831.

---

# Feature Set 3

In [51]:
features3 = ['ms_subclass',
  'street',
  'alley',
  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_qual',
  'overall_cond',
  'year_built',
  'bsmt_qual',
  'bsmt_cond',
  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [52]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features3].select_dtypes(include='object'))
dummies_columns

['street',
 'alley',
 'neighborhood',
 'bldg_type',
 'house_style',
 'bsmt_qual',
 'bsmt_cond',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [53]:
X_train = clean_train[features3]
X_test = clean_test[features3]
y_train = clean_train['saleprice']

In [54]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)

In [55]:
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train_validation, y_train_validation)

LinearRegression()

In [56]:
lr.score(X_train_validation,y_train_validation), lr.score(X_test_validation, y_test_validation)

(0.8674697211050928, 0.8172156954875349)

In [57]:
preds = lr.predict(X_test_validation)
preds[:5]

array([147452.75, 109926.25, 176755.25, 138815.75,  97104.25])

In [58]:
f3_metrics = pd.Series(big_metrics(y_test_validation, preds, features3),index=metric_names)
f3_metrics

SSE       5.946239e+11
MSE       1.159111e+09
RMSE      3.404572e+04
MAE       2.279685e+04
R2        8.172157e-01
R2_adj    8.070401e-01
dtype: float64

In [59]:
f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(34045.718222792704, 30830.897683095838, 31941.92581264128, 79239.33504161824)

In [60]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass            -1.146746e+02
overall_qual            1.307566e+04
overall_cond            4.903706e+03
year_built              2.696322e+02
full_bath               1.395788e+04
half_bath               1.047157e+04
bedroom_abvgr          -1.052034e+03
totrms_abvgrd           6.845065e+03
fireplaces              1.314588e+04
garage_cars             1.593366e+04
mo_sold                -2.487169e+02
yr_sold                -1.933224e+02
street_Grvl             1.582520e+13
street_Pave             1.582520e+13
alley_Grvl             -1.857903e+02
alley_Pave              5.050988e+03
neighborhood_Blmngtn   -4.242468e+13
neighborhood_Blueste   -4.242468e+13
neighborhood_BrDale    -4.242468e+13
neighborhood_BrkSide   -4.242468e+13
neighborhood_ClearCr   -4.242468e+13
neighborhood_CollgCr   -4.242468e+13
neighborhood_Crawfor   -4.242468e+13
neighborhood_Edwards   -4.242468e+13
neighborhood_Gilbert   -4.242468e+13
neighborhood_Greens    -4.242468e+13
neighborhood_IDOTRR    -4.242468e+13
...

In [61]:
X_train_validation.shape

(1538, 97)

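My third model is the worst so far: its RMSE (about 34,046) is above both Feature Set 1 and Feature Set 2, and the gap between the training R2 (0.867) and the validation R2 (0.817) suggests overfitting. The coefficients of about 1e13 on the `street_*` and `neighborhood_*` dummies likely come from perfectly collinear dummy columns. This split also uses `random_state=69` rather than `69420`, so the comparison with the earlier sets is not on identical rows.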
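Coefficients as large as 1e13 on every level of a dummified column are a typical symptom of the "dummy variable trap": with an intercept in the model, the full set of dummies for one category is perfectly collinear with it. The sketch below shows the rank deficiency on a toy column and how `drop_first=True` in `pd.get_dummies` removes it. This is an illustration with hypothetical values, not a change made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy categorical column with hypothetical values
demo = pd.DataFrame({'street': ['Pave', 'Grvl', 'Pave', 'Pave', 'Grvl']})

full = pd.get_dummies(demo, columns=['street'], dtype=float)
reduced = pd.get_dummies(demo, columns=['street'], drop_first=True, dtype=float)

# With an intercept column added, the full encoding is rank-deficient:
# street_Grvl + street_Pave equals the intercept in every row
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

Predictions from a rank-deficient fit can still be reasonable, but the individual coefficients stop being interpretable.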

---

# Feature Set 4

In [62]:
features4 = ['ms_subclass',
  'street',
  'alley',
  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_qual',
  'overall_cond',
  'year_built',
  'year_remod/add',
  'bsmt_qual',
  'bsmt_cond',
  'total_bsmt_sf',
  'heating_qc',
  'central_air',
  'electrical',
  '1st_flr_sf',
  'gr_liv_area',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_yr_blt',
  'garage_cars',
  'garage_area',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [63]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features4].select_dtypes(include='object'))
dummies_columns

['street',
 'alley',
 'neighborhood',
 'bldg_type',
 'house_style',
 'bsmt_qual',
 'bsmt_cond',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [64]:
X_train = clean_train[features4]
X_test = clean_test[features4]
y_train = clean_train['saleprice']

In [65]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)

In [66]:
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train_validation, y_train_validation)

LinearRegression()

In [67]:
lr.score(X_train_validation,y_train_validation), lr.score(X_test_validation, y_test_validation)

(0.8898429449547527, 0.8324550014648933)

In [68]:
preds = lr.predict(X_test_validation)
preds[:5]

array([173492.07602443, 105405.20246085, 175318.17733791, 185332.60473552,
        97978.29929928])

In [69]:
f4_metrics = pd.Series(big_metrics(y_test_validation, preds, features4),index=metric_names)
f4_metrics

SSE       5.450482e+11
MSE       1.062472e+09
RMSE      3.259559e+04
MAE       1.973766e+04
R2        8.324550e-01
R2_adj    8.209122e-01
dtype: float64

In [70]:
f4_metrics['RMSE'], f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(32595.58532312685,
 34045.718222792704,
 30830.897683095838,
 31941.92581264128,
 79239.33504161824)

In [71]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass              -168.447166
overall_qual             9576.490185
overall_cond             5586.831510
year_built                374.672628
year_remod/add             80.150149
total_bsmt_sf              14.624137
1st_flr_sf                 -2.965972
gr_liv_area                49.146851
full_bath                4499.193567
half_bath                5146.743645
bedroom_abvgr           -3535.562464
totrms_abvgrd            -163.833591
fireplaces               7051.975308
garage_yr_blt            -116.274295
garage_cars              9475.924157
garage_area                16.384405
mo_sold                   -32.446462
yr_sold                  -418.633301
street_Grvl             -5426.056922
street_Pave              5426.056922
alley_Grvl               2227.058808
alley_Pave               2541.547685
neighborhood_Blmngtn     1745.555244
neighborhood_Blueste     -408.481653
neighborhood_BrDale      7538.967933
neighborhood_BrkSide   -12037.184229
neighborhood_ClearCr    15898.467102
...

In [72]:
X_train_validation.shape

(1538, 103)

# Feature Set 2, with `overall_cond` treated as categorical

In [73]:
features2s = ['ms_subclass',

  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_cond',
  'year_built',

  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [74]:
dummy_columns2s = [
                'neighborhood',
                'bldg_type',
                'house_style',
                'overall_cond',
                'heating_qc',
                'central_air',
                'electrical',
                'kitchen_qual',
                'functional',
                'garage_qual',
                'garage_cond',
                'sale_type']

In [75]:
X_train = clean_train[features2s]
X_test = clean_test[features2s]
y_train = clean_train['saleprice']

In [76]:
# Make dummies
X_train = pd.get_dummies(data=X_train, columns=dummy_columns2s)
X_test = pd.get_dummies(data=X_test, columns=dummy_columns2s)
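The difference from the original Feature Set 2 is that a numeric column is named explicitly in `columns=`, so `get_dummies` turns each distinct value into its own indicator instead of leaving it as one numeric column. A small sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values
demo = pd.DataFrame({'overall_cond': [5, 7, 5, 3],
                     'kitchen_qual': ['TA', 'Gd', 'TA', 'Fa']})

# Without `columns=`, get_dummies leaves numeric columns untouched
default_d = pd.get_dummies(demo)
# Naming the numeric column explicitly encodes each distinct value as its own level
explicit_d = pd.get_dummies(demo, columns=['overall_cond', 'kitchen_qual'])
print(list(default_d.columns))
print(list(explicit_d.columns))
```

This lets the model fit a separate effect per condition level rather than a single slope, at the cost of more columns.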

In [77]:
X_train_validation, X_test_validation, y_train_validation, y_test_validation = \
                train_test_split(X_train, y_train, random_state=69)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train_validation, y_train_validation)

LinearRegression()

In [78]:
lr.score(X_train_validation,y_train_validation), lr.score(X_test_validation, y_test_validation)

(0.843551729993486, 0.7914930814336835)

In [79]:
# Predict on the validation split
preds = lr.predict(X_test_validation)
preds[:10]

array([154309.56781076, 108952.06234413, 179649.88032277, 151666.87657353,
        96817.13266452,  85781.79616661, 242343.46362064, 156902.11484486,
       128501.56029976, 192426.85435061])

In [80]:
f1s_metrics = pd.Series(big_metrics(y_test_validation, preds, features2s),index=metric_names)
f1s_metrics

SSE       6.783033e+11
MSE       1.322229e+09
RMSE      3.636246e+04
MAE       2.494088e+04
R2        7.914931e-01
R2_adj    7.821315e-01
dtype: float64

In [81]:
f1s_metrics['RMSE'], f4_metrics['RMSE'], f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(36362.46239746164,
 32595.58532312685,
 34045.718222792704,
 30830.897683095838,
 31941.92581264128,
 79239.33504161824)

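It seems my model has worsened: its RMSE (about 36,362) is the highest of the models so far. Note that this split uses `random_state=69` while the original Feature Set 2 split used `random_state=69420`, so the two are not evaluated on the same rows.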

In [82]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass              -161.353332
year_built                487.845978
full_bath               16466.220281
half_bath               11734.572141
bedroom_abvgr           -2313.964209
totrms_abvgrd            8502.163736
fireplaces              17449.625002
garage_cars             20979.553431
mo_sold                  -199.573963
yr_sold                  -362.462964
neighborhood_Blmngtn   -11646.357949
neighborhood_Blueste    -2095.459710
neighborhood_BrDale      4015.147291
neighborhood_BrkSide   -14590.512852
neighborhood_ClearCr    17800.171382
neighborhood_CollgCr   -13757.449421
neighborhood_Crawfor     9047.027887
neighborhood_Edwards   -18882.069859
neighborhood_Gilbert   -34423.231345
neighborhood_Greens     37498.250639
neighborhood_IDOTRR    -20103.668654
neighborhood_MeadowV    -1295.647748
neighborhood_Mitchel   -16610.515431
neighborhood_NAmes     -15043.093330
neighborhood_NPkVill    -8270.319328
neighborhood_NWAmes    -18645.260742
neighborhood_NoRidge    45014.818255
...

In [83]:
X_train_validation.shape

(1538, 93)

I wasn't satisfied with these numbers so I decided to do the same thing but on the logged `saleprice`.
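For context, here is a rough sketch of what modeling the logged target looks like: fit on `log(price)`, then exponentiate the predictions so the RMSE is back in dollars and comparable to the numbers above. The data is synthetic and the variable names are made up; the actual approach lives in the linked notebook and may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a multiplicative price structure, for illustration only
rng = np.random.default_rng(42)
X_syn = rng.uniform(500, 3000, size=(300, 1))  # stand-in for living area
price = 50000 * np.exp(0.0006 * X_syn[:, 0] + rng.normal(scale=0.1, size=300))

# Fit on log(price), then exponentiate predictions to get back to dollars
log_model = LinearRegression().fit(X_syn, np.log(price))
pred_dollars = np.exp(log_model.predict(X_syn))

rmse_dollars = np.sqrt(np.mean((price - pred_dollars) ** 2))
print(rmse_dollars)
```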

---



Go to the Log Benchmarks [here](https://git.generalassemb.ly/laternader/project_2/blob/master/deliverables/code/4.5%20-%20Log-Benchmarks.ipynb).