# Modeling Benchmarks, but with a Logged SalePrice

This notebook fits several models over several different feature sets and compares each of them to the null model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# Import CSV
clean_train =  pd.read_csv('../data/clean-train-engineered.csv')
clean_test = pd.read_csv('../data/clean-test-engineered.csv')

# Creating a baseline

In [4]:
y = clean_train['saleprice']

In [5]:
baseline = clean_train['saleprice'].mean()
baseline

181469.70160897123

We can't make any predictions yet, because the features first have to be chosen in the [selection notebook](https://git.generalassemb.ly/laternader/project_2/blob/master/deliverables/code/1.5%20-%20Select-Features.ipynb). The models in this notebook are fit on the log of `saleprice`; the original, unlogged benchmarks are [here](https://git.generalassemb.ly/laternader/project_2/blob/master/deliverables/code/4%20-%20Modeling-Benchmarks.ipynb).

For now we can make a null prediction (`baseline` is our null prediction). We calculate the difference between each sale price in the `saleprice` column and the mean of all sale prices. This gives us the null residuals.

In [6]:
null_resids = y - baseline
null_resids[:5]

0   -50969.701609
1    38530.298391
2   -72469.701609
3    -7469.701609
4   -42969.701609
Name: saleprice, dtype: float64

In [7]:
np.mean(null_resids**2)

6278872217.837828

In [8]:
# This is the null model before the split
clean_train['null_pred'] = baseline
#                                     y before the split  , calculated mean of y
mse_null = metrics.mean_squared_error(clean_train['saleprice'], clean_train['null_pred'])
mse_null

6278872217.837828

In [9]:
rmse_null = mse_null**.5
rmse_null

79239.33504161824

Now we can compare with our models.
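The same null model can also be written with scikit-learn's `DummyRegressor`, which makes it easy to score the baseline on any split. This is a minimal sketch, not part of the original notebook: `y_toy` holds the first five sale prices implied by the residuals printed above, and `X_toy`, `null_model`, and `rmse_null_toy` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# A handful of sale prices standing in for clean_train['saleprice'] (illustrative)
y_toy = np.array([130500.0, 220000.0, 109000.0, 174000.0, 138500.0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# strategy='mean' always predicts the mean of the y it was fit on
null_model = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
null_preds = null_model.predict(X_toy)

# RMSE of the null model, same idea as mse_null**.5 above
rmse_null_toy = mean_squared_error(y_toy, null_preds) ** 0.5
print(rmse_null_toy)
```

Fitting the dummy model on the training portion of a split and scoring it on the validation portion would give a null RMSE measured on the same rows as the other models.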

---
# Feature Set 1

The features here are the ones that had a high positive correlation with `saleprice` in the training data. This section fits LinearRegression, Ridge, and Lasso models.

In [10]:
features1 = ['overall_qual',
  'year_built',
  'year_remod/add',
  'mas_vnr_area',
  'total_bsmt_sf',
  '1st_flr_sf',
  'gr_liv_area',
  'full_bath',
  'garage_yr_blt',
  'garage_cars',
  'garage_area']

In [11]:
X_train = clean_train[features1]
X_test = clean_test[features1]
y_train = clean_train['saleprice']

This code helps select any columns that need to be dummified.

In [12]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features1].select_dtypes(include='object'))

We now split `train` into training and validation portions and fit a model on the training portion before applying it to the full data.

In [13]:
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69420)

y_trainv_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)

# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train, y_trainv_log)

LinearRegression()

In [14]:
lr.score(X_train, y_trainv_log), lr.score(X_validation, y_valid_log)

(0.8172897846675002, 0.8319308459763215)

In [15]:
preds = lr.predict(X_validation)
preds[:5]

array([12.0010011 , 12.84888129, 11.95654211, 11.85926903, 12.02173559])

Now that we have the R² scores, we need the other metrics to compare models. To get them, we bring in the function below.

In [16]:
# The idea came from lab 3.01
def big_metrics(y, predictions, features):
    # 1. Calculate Sum Squared Error; SSE
    SSE = ((y - predictions)**2).sum()
    
    # 2. Calculate Mean Squared Error; MSE
    MSE = SSE/len(y)
    
    # 3. Calculate Root Mean Squared Error; RMSE
    RMSE = np.sqrt(MSE)
    
    # 4. Calculate Mean Absolute Error; MAE
    MAE = (abs(y - predictions)).sum()/len(y)
    
    # 5. Calculate R2 Score Error; R2
    Null_SSE = ((y - y.mean())**2).sum() # Need null to calculate R2
    R2 = 1 - SSE/Null_SSE
    
    # 6. Calculate R2 Adjusted
    R2_adj = 1-(((1-R2)*(len(y)-1))/(len(y)-len(features)-1))
    
    return SSE, MSE, RMSE, MAE, R2, R2_adj

metric_names = ['SSE', 'MSE', 'RMSE', 'MAE', 'R2', 'R2_adj']
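As a cross-check, most of these metrics are also available in `sklearn.metrics`. The sketch below uses made-up numbers, and `adjusted_r2` is a hypothetical helper rather than something from the notebook. One point worth noting: adjusted R² should use the number of columns actually passed to the model, so after `pd.get_dummies` that is `X_train.shape[1]`, which can be much larger than `len(features)`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def adjusted_r2(r2, n_rows, n_columns):
    """Adjusted R2 for a model fit on n_columns predictors (hypothetical helper)."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_columns - 1)

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5, 10.0, 13.5])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

# n_columns is the count of model inputs, e.g. X_train.shape[1] after dummies
r2_adj = adjusted_r2(r2, n_rows=len(y_true), n_columns=2)
print(mse, rmse, mae, r2, r2_adj)
```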

In [17]:
f1_metrics = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(preds), features1),index=metric_names)
f1_metrics

SSE       3.844577e+11
MSE       7.494303e+08
RMSE      2.737572e+04
MAE       1.941165e+04
R2        8.695732e-01
R2_adj    8.667096e-01
dtype: float64

In [18]:
f1_metrics['RMSE'], rmse_null

(27375.72404094178, 79239.33504161824)

Now we can compare to the null model.

The RMSE of my first model based on `features1` (about 27,376) is far lower than `rmse_null` (about 79,239), so the model is a clear improvement over the baseline.

In [19]:
pd.Series(lr.coef_,index=features1)

overall_qual      0.103465
year_built        0.002500
year_remod/add    0.002369
mas_vnr_area      0.000021
total_bsmt_sf     0.000079
1st_flr_sf        0.000072
gr_liv_area       0.000221
full_bath        -0.010683
garage_yr_blt    -0.000785
garage_cars       0.078310
garage_area      -0.000036
dtype: float64

In [20]:
pd.Series(np.exp(lr.coef_-1)*100,index=X_train.columns)

overall_qual      40.798099
year_built        36.880047
year_remod/add    36.875194
mas_vnr_area      36.788732
total_bsmt_sf     36.790858
1st_flr_sf        36.790579
gr_liv_area       36.796058
full_bath         36.397037
garage_yr_blt     36.759070
garage_cars       39.784595
garage_area       36.786611
dtype: float64

Three features (`full_bath`, `garage_yr_blt`, and `garage_area`) have negative coefficients, meaning the predicted price drops as they increase. I think we can build a better model.
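A note on reading these coefficients: with a logged target, the usual percent-change interpretation of a coefficient `b` is `(exp(b) - 1) * 100`. The cell above computes `exp(b - 1) * 100` instead, which is why every value sits near `exp(-1) * 100 ≈ 36.79`. The sketch below applies the usual form to three coefficients copied from the LinearRegression output; `coefs` and `pct_change` are illustrative names.

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the LinearRegression output above
coefs = pd.Series({'overall_qual': 0.103465,
                   'garage_cars': 0.078310,
                   'full_bath': -0.010683})

# Approximate percent change in predicted price for a one-unit increase,
# holding the other features fixed
pct_change = (np.exp(coefs) - 1) * 100
print(pct_change.round(2))
```

Read this way, one extra point of `overall_qual` corresponds to a price roughly 11% higher, and a negative coefficient shows up as a negative percentage.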

# Let's model the first set of features to Ridge and RidgeCV

In order to do that, we need to utilize `StandardScaler`

In [21]:
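# Note: X_train and y_train were already split in In [13], so this call splits
# that training portion a second time. The validation rows here are therefore
# not the same rows used to score LinearRegression above.
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69420)

y_trainv_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)
# Standardize the features before fitting the Ridge model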
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_validation)

ridge_model = Ridge(alpha=5)
ridge_model.fit(Z_train, y_trainv_log)

print('Ridge Training score:', ridge_model.score(Z_train, y_trainv_log))
print('Ridge Test score:', ridge_model.score(Z_test, y_valid_log)) 

Ridge Training score: 0.8211060240636552
Ridge Test score: 0.7990267195796201
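One way to keep the scaling and the Ridge fit together, and to score every model on the same validation rows, is to split once and wrap the scaler in a pipeline. This is a sketch on synthetic data under those assumptions, not the notebook's own code; `X`, `log_y`, `lr_toy`, and `ridge_toy` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the logged sale price
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
log_y = 12 + X @ np.array([0.3, 0.1, -0.05]) + rng.normal(scale=0.1, size=200)

# Split once; every model below is trained and scored on the same rows
X_tr, X_val, y_tr, y_val = train_test_split(X, log_y, random_state=0)

lr_toy = LinearRegression().fit(X_tr, y_tr)

# The pipeline fits the scaler on the training rows only, then fits Ridge
ridge_toy = make_pipeline(StandardScaler(), Ridge(alpha=5)).fit(X_tr, y_tr)

print(lr_toy.score(X_val, y_val), ridge_toy.score(X_val, y_val))
```

Because both models share `X_val` and `y_val`, their validation scores and RMSE values can be compared directly.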


In [22]:
r_alphas = np.logspace(0, 5, 100)
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2',cv=5)
ridge_cv.fit(Z_train, y_trainv_log);

In [23]:
print('RidgeCV Training score:',ridge_cv.score(Z_train, y_trainv_log))
print('RidgeCV Test score:',ridge_cv.score(Z_test, y_valid_log))

RidgeCV Training score: 0.8184368142549853
RidgeCV Test score: 0.7994493832365643


Calculate the RMSE on `Ridge` and `RidgeCV` models.

In [24]:
Z_test_preds = ridge_model.predict(Z_test)
Z_test_preds[:5]

array([11.5953043 , 11.9961983 , 11.64314539, 12.37056958, 11.96181749])

In [25]:
f1_metrics_r = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(Z_test_preds), features1),index=metric_names)
f1_metrics_r

SSE       1.899319e+12
MSE       4.933297e+09
RMSE      7.023744e+04
MAE       2.508507e+04
R2        3.188588e-01
R2_adj    2.987716e-01
dtype: float64

In [26]:
pd.Series(ridge_model.coef_,index=features1)

overall_qual      0.144390
year_built        0.072067
year_remod/add    0.050643
mas_vnr_area      0.010812
total_bsmt_sf     0.027336
1st_flr_sf        0.041736
gr_liv_area       0.114328
full_bath        -0.013431
garage_yr_blt    -0.004776
garage_cars       0.051838
garage_area      -0.011827
dtype: float64

In [27]:
pd.Series(np.exp(ridge_model.coef_-1)*100,index=X_train.columns)

overall_qual      42.502374
year_built        39.537020
year_remod/add    38.698960
mas_vnr_area      37.187870
total_bsmt_sf     37.807433
1st_flr_sf        38.355812
gr_liv_area       41.243712
full_bath         36.297162
garage_yr_blt     36.612648
garage_cars       38.745234
garage_area       36.355408
dtype: float64

In [28]:
f1_metrics['RMSE'], f1_metrics_r['RMSE'], rmse_null

(27375.72404094178, 70237.43528484975, 79239.33504161824)

In [29]:
Z_test_preds = ridge_cv.predict(Z_test)
Z_test_preds[:5]

array([11.6113851 , 12.0138693 , 11.6479125 , 12.37737574, 11.96689498])

In [30]:
f1_metrics_rcv = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(Z_test_preds), features1),index=metric_names)
f1_metrics_rcv

SSE       1.728542e+12
MSE       4.489718e+09
RMSE      6.700536e+04
MAE       2.501389e+04
R2        3.801038e-01
R2_adj    3.618227e-01
dtype: float64

In [31]:
pd.Series(ridge_cv.coef_,index=features1)

overall_qual      0.128049
year_built        0.061520
year_remod/add    0.050677
mas_vnr_area      0.016023
total_bsmt_sf     0.031445
1st_flr_sf        0.041588
gr_liv_area       0.099092
full_bath         0.000436
garage_yr_blt     0.003485
garage_cars       0.049498
garage_area      -0.003507
dtype: float64

In [32]:
pd.Series(np.exp(ridge_cv.coef_-1)*100,index=X_train.columns)

overall_qual      41.813513
year_built        39.122210
year_remod/add    38.700303
mas_vnr_area      37.382147
total_bsmt_sf     37.963117
1st_flr_sf        38.350145
gr_liv_area       40.620073
full_bath         36.804004
garage_yr_blt     36.916365
garage_cars       38.654683
garage_area       36.659137
dtype: float64

In [33]:
f1_metrics['RMSE'], f1_metrics_r['RMSE'], f1_metrics_rcv['RMSE'], rmse_null

(27375.72404094178, 70237.43528484975, 67005.36052237834, 79239.33504161824)

# We can also try LassoCV

In [34]:
l_alphas = np.logspace(1, 0, 100)
# Instantiate
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=50000)
# Fit
lasso_cv.fit(Z_train, y_trainv_log);

In [35]:
lasso_cv.alpha_

10.0

In [36]:
print(lasso_cv.score(Z_train, y_trainv_log))
print(lasso_cv.score(Z_test, y_valid_log))

0.0
-0.008906686074497516


In [37]:
Z_test_preds = lasso_cv.predict(Z_test)
Z_test_preds[:5]

array([12.01529025, 12.01529025, 12.01529025, 12.01529025, 12.01529025])

In [38]:
lasso_cv.coef_

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [39]:
f1_metrics_lasso = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(Z_test_preds), features1),index=metric_names)
f1_metrics_lasso

SSE       2.985794e+12
MSE       7.755309e+09
RMSE      8.806423e+04
MAE       6.063651e+04
R2       -7.077685e-02
R2_adj   -1.023547e-01
dtype: float64

In [40]:
f1_metrics_lasso['RMSE'], f1_metrics['RMSE'], f1_metrics_r['RMSE'], f1_metrics_rcv['RMSE'], rmse_null

(88064.23468114945,
 27375.72404094178,
 70237.43528484975,
 67005.36052237834,
 79239.33504161824)

In [41]:
pd.Series(lasso_cv.coef_,index=features1)

overall_qual      0.0
year_built        0.0
year_remod/add    0.0
mas_vnr_area      0.0
total_bsmt_sf     0.0
1st_flr_sf        0.0
gr_liv_area       0.0
full_bath         0.0
garage_yr_blt     0.0
garage_cars       0.0
garage_area       0.0
dtype: float64

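The LinearRegression, Ridge, and RidgeCV models all had a lower RMSE than the null model, but LassoCV did not (about 88,064 versus 79,239): it shrank every coefficient to zero, so it predicts the same value for every house. The best one was the LinearRegression model.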
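The all-zero Lasso coefficients are probably a consequence of the alpha grid: `np.logspace(1, 0, 100)` only covers 1 to 10, which is a heavy penalty for a logged target that ranges over roughly 11 to 13. A sketch with a wider grid on synthetic data is shown below; the grid bounds and the names `wide_alphas` and `lasso_toy` are assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features and the logged sale price
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
log_y = 12 + X @ np.array([0.15, 0.10, 0.0, 0.05]) + rng.normal(scale=0.1, size=300)
Z = StandardScaler().fit_transform(X)

# Assumed grid: 0.0001 up to 1, so much smaller penalties are also tried
wide_alphas = np.logspace(-4, 0, 50)
lasso_toy = LassoCV(alphas=wide_alphas, cv=5, max_iter=50000).fit(Z, log_y)

print(lasso_toy.alpha_)  # alpha chosen by cross-validation
print(lasso_toy.coef_)   # at least some coefficients should survive
```

If the selected alpha lands on the edge of the grid, as the `10.0` above does, that is usually a hint the grid should be extended in that direction.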

Now that we have computed that for our first set of features, the next set of features will follow the same route.

---
# Feature Set 2

In [42]:
# Features
features2 = ['ms_subclass',

  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_cond',
  'year_built',

  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [43]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features2].select_dtypes(include='object'))
dummies_columns

['neighborhood',
 'bldg_type',
 'house_style',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [44]:
X_train = clean_train[features2]
X_test = clean_test[features2]
y_train = clean_train['saleprice']

In [45]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)
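Calling `pd.get_dummies` separately on the train and test frames can produce different columns when a category appears in only one of them. A common fix is to reindex the test frame to the training columns. The sketch below uses toy frames, and the category values are illustrative only.

```python
import pandas as pd

# Toy frames: 'Greens' appears only in train, 'Landmrk' only in test
train_toy = pd.DataFrame({'neighborhood': ['NAmes', 'Greens', 'NAmes'],
                          'garage_cars': [2, 1, 2]})
test_toy = pd.DataFrame({'neighborhood': ['NAmes', 'Landmrk'],
                         'garage_cars': [1, 2]})

train_d = pd.get_dummies(train_toy, columns=['neighborhood'])
test_d = pd.get_dummies(test_toy, columns=['neighborhood'])

# Give the test frame exactly the training columns:
# columns missing from test are filled with 0, extra test columns are dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

This matters at prediction time, since a fitted model expects the same columns in the same order it saw during training.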

In [46]:
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69420)

y_train_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train, y_train_log)

LinearRegression()

In [47]:
lr.score(X_train,y_train_log), lr.score(X_validation ,y_valid_log)

(0.8421271184937242, 0.8251206541735754)

In [48]:
preds = lr.predict(X_validation)
preds[:5]

array([12.09392878, 12.95277291, 12.02121379, 11.86000032, 11.90328118])

In [49]:
f2_metrics = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(preds), features2),index=metric_names)
f2_metrics

SSE       4.238242e+11
MSE       8.261680e+08
RMSE      2.874314e+04
MAE       2.095717e+04
R2        8.562182e-01
R2_adj    8.497627e-01
dtype: float64

In [50]:
f1_metrics['RMSE'], f2_metrics['RMSE'], rmse_null

(27375.72404094178, 28743.13855801938, 79239.33504161824)

In [51]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass             0.000130
overall_cond            0.051980
year_built              0.002935
full_bath               0.097882
half_bath               0.043754
bedroom_abvgr           0.005230
totrms_abvgrd           0.043530
fireplaces              0.091288
garage_cars             0.084372
mo_sold                -0.000417
yr_sold                -0.006464
neighborhood_Blmngtn   -0.063642
neighborhood_Blueste   -0.047034
neighborhood_BrDale    -0.119671
neighborhood_BrkSide   -0.078382
neighborhood_ClearCr    0.119558
neighborhood_CollgCr    0.006519
neighborhood_Crawfor    0.128389
neighborhood_Edwards   -0.106609
neighborhood_Gilbert   -0.086666
neighborhood_Greens     0.226709
neighborhood_IDOTRR    -0.187502
neighborhood_MeadowV   -0.237219
neighborhood_Mitchel   -0.030551
neighborhood_NAmes     -0.039268
neighborhood_NPkVill   -0.086677
neighborhood_NWAmes    -0.018572
neighborhood_NoRidge    0.199053
neighborhood_NridgHt    0.196658
neighborhood_OldTown   -0.123370
neighborho

In [52]:
pd.Series(np.exp(lr.coef_-1)*100,index=X_train.columns)

ms_subclass             36.792715
overall_cond            38.750768
year_built              36.896087
full_bath               40.570963
half_bath               38.433297
bedroom_abvgr           36.980840
totrms_abvgrd           38.424703
fireplaces              40.304294
garage_cars             40.026503
mo_sold                 36.772621
yr_sold                 36.550908
neighborhood_Blmngtn    34.519617
neighborhood_Blueste    35.097725
neighborhood_BrDale     32.638730
neighborhood_BrkSide    34.014541
neighborhood_ClearCr    41.459966
neighborhood_CollgCr    37.028547
neighborhood_Crawfor    41.827725
neighborhood_Edwards    33.067825
neighborhood_Gilbert    33.733948
neighborhood_Greens     46.149175
neighborhood_IDOTRR     30.498207
neighborhood_MeadowV    29.019006
neighborhood_Mitchel    35.681034
neighborhood_NAmes      35.371348
neighborhood_NPkVill    33.733573
neighborhood_NWAmes     36.111030
neighborhood_NoRidge    44.890354
neighborhood_NridgHt    44.782983
neighborhood_O

The second feature set does slightly worse than the first: its validation RMSE is about 28,743 versus 27,376.

---

# Feature Set 3

In [53]:
features3 = ['ms_subclass',
  'street',
  'alley',
  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_qual',
  'overall_cond',
  'year_built',
  'bsmt_qual',
  'bsmt_cond',
  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [54]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features3].select_dtypes(include='object'))
dummies_columns

['street',
 'alley',
 'neighborhood',
 'bldg_type',
 'house_style',
 'bsmt_qual',
 'bsmt_cond',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [55]:
X_train = clean_train[features3]
X_test = clean_test[features3]
y_train = clean_train['saleprice']

In [56]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)

In [57]:
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69)

y_train_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train, y_train_log)

LinearRegression()

In [58]:
lr.score(X_train,y_train_log), lr.score(X_validation ,y_valid_log)

(0.8803199626662463, 0.8227166111856092)

In [59]:
preds = lr.predict(X_validation)
preds[:5]

array([11.85280371, 11.65089941, 12.07013655, 11.80814981, 11.48225451])

In [60]:
f3_metrics = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(preds), features3),index=metric_names)
f3_metrics

SSE       5.630839e+11
MSE       1.097629e+09
RMSE      3.313049e+04
MAE       2.135747e+04
R2        8.269109e-01
R2_adj    8.172750e-01
dtype: float64

In [61]:
f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(33130.49181361535, 28743.13855801938, 27375.72404094178, 79239.33504161824)

In [62]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass             2.539300e-04
overall_qual            8.017974e-02
overall_cond            4.405507e-02
year_built              1.707899e-03
full_bath               8.018858e-02
half_bath               4.563396e-02
bedroom_abvgr           1.353998e-02
totrms_abvgrd           3.746105e-02
fireplaces              6.250780e-02
garage_cars             7.161518e-02
mo_sold                -1.376919e-03
yr_sold                -3.516148e-03
street_Grvl             2.641620e+07
street_Pave             2.641620e+07
alley_Grvl              1.118328e-02
alley_Pave              5.221639e-03
neighborhood_Blmngtn   -7.081736e+07
neighborhood_Blueste   -7.081736e+07
neighborhood_BrDale    -7.081736e+07
neighborhood_BrkSide   -7.081736e+07
neighborhood_ClearCr   -7.081736e+07
neighborhood_CollgCr   -7.081736e+07
neighborhood_Crawfor   -7.081736e+07
neighborhood_Edwards   -7.081736e+07
neighborhood_Gilbert   -7.081736e+07
neighborhood_Greens    -7.081736e+07
neighborhood_IDOTRR    -7.081736e+07
n

In [63]:
pd.Series(np.exp(lr.coef_-1)*100,index=X_train.columns)


ms_subclass             36.797287
overall_qual            39.859068
overall_cond            38.444870
year_built              36.850828
full_bath               39.859420
half_bath               38.505618
bedroom_abvgr           37.289440
totrms_abvgrd           38.192197
fireplaces              39.160868
garage_cars             39.519150
mo_sold                 36.737325
yr_sold                 36.658819
street_Grvl                   inf
street_Pave                   inf
alley_Grvl              37.201663
alley_Pave              36.980540
neighborhood_Blmngtn     0.000000
neighborhood_Blueste     0.000000
neighborhood_BrDale      0.000000
neighborhood_BrkSide     0.000000
neighborhood_ClearCr     0.000000
neighborhood_CollgCr     0.000000
neighborhood_Crawfor     0.000000
neighborhood_Edwards     0.000000
neighborhood_Gilbert     0.000000
neighborhood_Greens      0.000000
neighborhood_IDOTRR      0.000000
neighborhood_MeadowV     0.000000
neighborhood_Mitchel     0.000000
neighborhood_N

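My third model is worse than both the first and the second (RMSE of about 33,130 versus 27,376 and 28,743). Note that this split uses `random_state=69` rather than `69420`, so it is not scored on the same validation rows as the earlier models.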
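The huge, identical coefficients on the `street_*` and `neighborhood_*` dummies above are a typical symptom of perfectly collinear columns: when every level of a category gets its own dummy, those dummies sum to 1 in every row and duplicate the intercept. One common remedy, shown here as a sketch on a toy frame rather than as the notebook's method, is `drop_first=True`.

```python
import pandas as pd

toy = pd.DataFrame({'street': ['Pave', 'Grvl', 'Pave', 'Pave'],
                    'overall_qual': [7, 5, 6, 8]})

# Keeping every level: the street dummies add up to 1 in each row
full = pd.get_dummies(toy, columns=['street'])
print(full.filter(like='street_').sum(axis=1).tolist())

# Dropping the first level removes the redundancy; the dropped level
# becomes the reference category for the remaining dummy
reduced = pd.get_dummies(toy, columns=['street'], drop_first=True)
print(list(reduced.columns))
```

Regularized models such as Ridge are another way to keep collinear dummies from producing runaway coefficients.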

---

# Feature Set 4

In [64]:
features4 = ['ms_subclass',
  'street',
  'alley',
  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_qual',
  'overall_cond',
  'year_built',
  'year_remod/add',
  'bsmt_qual',
  'bsmt_cond',
  'total_bsmt_sf',
  'heating_qc',
  'central_air',
  'electrical',
  '1st_flr_sf',
  'gr_liv_area',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_yr_blt',
  'garage_cars',
  'garage_area',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [65]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
dummies_columns = list(clean_train[features4].select_dtypes(include='object'))
dummies_columns

['street',
 'alley',
 'neighborhood',
 'bldg_type',
 'house_style',
 'bsmt_qual',
 'bsmt_cond',
 'heating_qc',
 'central_air',
 'electrical',
 'kitchen_qual',
 'functional',
 'garage_qual',
 'garage_cond',
 'sale_type']

In [66]:
X_train = clean_train[features4]
X_test = clean_test[features4]
y_train = clean_train['saleprice']

In [67]:
# Make dummies because we have categorical variables
X_train = pd.get_dummies(data=X_train, columns=dummies_columns)
X_test = pd.get_dummies(data=X_test, columns=dummies_columns)

In [68]:
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69)

y_train_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train, y_train_log)

LinearRegression()

In [69]:
lr.score(X_train,y_train_log), lr.score(X_validation ,y_valid_log)

(0.89842194218709, 0.8389604201202638)

In [70]:
preds = lr.predict(X_validation)
preds[:5]

array([11.9851361 , 11.62562413, 12.06712433, 12.01603518, 11.48697277])

In [71]:
f4_metrics = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(preds), features4),index=metric_names)
f4_metrics

SSE       8.825717e+11
MSE       1.720413e+09
RMSE      4.147786e+04
MAE       1.866575e+04
R2        7.287020e-01
R2_adj    7.100114e-01
dtype: float64

In [72]:
f4_metrics['RMSE'], f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(41477.858180177944,
 33130.49181361535,
 28743.13855801938,
 27375.72404094178,
 79239.33504161824)

In [73]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass             0.000035
overall_qual            0.062536
overall_cond            0.046492
year_built              0.002145
year_remod/add          0.000563
total_bsmt_sf           0.000062
1st_flr_sf              0.000042
gr_liv_area             0.000198
full_bath               0.035720
half_bath               0.023973
bedroom_abvgr           0.003055
totrms_abvgrd           0.003731
fireplaces              0.033891
garage_yr_blt          -0.000461
garage_cars             0.057468
garage_area            -0.000004
mo_sold                -0.000411
yr_sold                -0.004632
street_Grvl            -0.053173
street_Pave             0.053173
alley_Grvl              0.021225
alley_Pave             -0.007194
neighborhood_Blmngtn   -0.001375
neighborhood_Blueste   -0.033101
neighborhood_BrDale    -0.082208
neighborhood_BrkSide   -0.053993
neighborhood_ClearCr    0.140095
neighborhood_CollgCr    0.007801
neighborhood_Crawfor    0.091478
neighborhood_Edwards   -0.074492
neighborho

In [74]:
pd.Series(np.exp(lr.coef_-1)*100,index=X_train.columns)

ms_subclass             36.789242
overall_qual            39.161984
overall_cond            38.538678
year_built              36.866938
year_remod/add          36.808648
total_bsmt_sf           36.790229
1st_flr_sf              36.789494
gr_liv_area             36.795244
full_bath               38.125777
half_bath               37.680512
bedroom_abvgr           36.900490
totrms_abvgrd           36.925473
fireplaces              38.056078
garage_yr_blt           36.771001
garage_cars             38.964009
garage_area             36.787809
mo_sold                 36.772832
yr_sold                 36.617941
street_Grvl             34.882909
street_Pave             38.797018
alley_Grvl              37.577121
alley_Pave              36.524243
neighborhood_Blmngtn    36.737411
neighborhood_Blueste    35.590176
neighborhood_BrDale     33.884663
neighborhood_BrkSide    34.854322
neighborhood_ClearCr    42.320246
neighborhood_CollgCr    37.076036
neighborhood_Crawfor    40.311967
neighborhood_E

The RMSE kept getting worse as features were added, so I decided to try something different with an earlier feature set.

---

# Feature Set 2, with Some Numeric Features Treated as Categorical

In [75]:
features2s = ['ms_subclass',

  'neighborhood',
  'bldg_type',
  'house_style',
  'overall_cond',
  'year_built',

  'heating_qc',
  'central_air',
  'electrical',
  'full_bath',
  'half_bath',
  'bedroom_abvgr',
  'kitchen_qual',
  'totrms_abvgrd',
  'functional',
  'fireplaces',
  'garage_cars',
  'garage_qual',
  'garage_cond',
  'mo_sold',
  'yr_sold',
  'sale_type']

In [76]:
dummy_columns2s = [
                'neighborhood',
                'bldg_type',
                'house_style',
                'overall_cond',
                'heating_qc',
                'central_air',
                'electrical',
                'kitchen_qual',
                'functional',
                'garage_qual',
                'garage_cond',
                'sale_type']

In [77]:
X_train = clean_train[features2s]
X_test = clean_test[features2s]
y_train = clean_train['saleprice']

In [78]:
# Make dummies
X_train = pd.get_dummies(data=X_train, columns=dummy_columns2s)
X_test = pd.get_dummies(data=X_test, columns=dummy_columns2s)

In [79]:
X_train, X_validation, y_train, y_validation = \
                train_test_split(X_train, y_train, random_state=69)

y_train_log = y_train.map(np.log)
y_valid_log = y_validation.map(np.log)
# Instantiate and fit the model
lr = LinearRegression()
lr.fit(X_train, y_train_log)

LinearRegression()

In [80]:
lr.score(X_train,y_train_log), lr.score(X_validation ,y_valid_log)

(0.8579903517463012, 0.7850558247699848)

In [81]:
preds = lr.predict(X_validation)
preds[:5]

array([11.905755  , 11.64061438, 12.08620438, 11.94123568, 11.48846181])

In [82]:
f2s_metrics = pd.Series(big_metrics(np.exp(y_valid_log), np.exp(preds), features2s),index=metric_names)
f2s_metrics

SSE       6.316134e+11
MSE       1.231215e+09
RMSE      3.508868e+04
MAE       2.305905e+04
R2        8.058453e-01
R2_adj    7.971282e-01
dtype: float64

In [83]:
f2s_metrics['RMSE'], f4_metrics['RMSE'], f3_metrics['RMSE'], f2_metrics['RMSE'], f1_metrics['RMSE'], rmse_null

(35088.67742962722,
 41477.858180177944,
 33130.49181361535,
 28743.13855801938,
 27375.72404094178,
 79239.33504161824)

In [84]:
pd.Series(lr.coef_,index=X_train.columns)

ms_subclass             0.000093
year_built              0.002693
full_bath               0.093322
half_bath               0.051114
bedroom_abvgr           0.007849
totrms_abvgrd           0.046110
fireplaces              0.088209
garage_cars             0.096771
mo_sold                -0.000468
yr_sold                -0.003472
neighborhood_Blmngtn   -0.051582
neighborhood_Blueste   -0.029679
neighborhood_BrDale    -0.086740
neighborhood_BrkSide   -0.072830
neighborhood_ClearCr    0.144287
neighborhood_CollgCr   -0.013748
neighborhood_Crawfor    0.100140
neighborhood_Edwards   -0.097809
neighborhood_Gilbert   -0.110554
neighborhood_Greens     0.238416
neighborhood_IDOTRR    -0.173156
neighborhood_MeadowV   -0.157891
neighborhood_Mitchel   -0.036983
neighborhood_NAmes     -0.036262
neighborhood_NPkVill   -0.064421
neighborhood_NWAmes    -0.035259
neighborhood_NoRidge    0.150661
neighborhood_NridgHt    0.151865
neighborhood_OldTown   -0.115675
neighborhood_SWISU     -0.006668
neighborho

In [85]:
pd.Series(np.exp(lr.coef_-1)*100,index=X_train.columns)

ms_subclass             36.791358
year_built              36.887154
full_bath               40.386350
half_bath               38.717198
bedroom_abvgr           37.077827
totrms_abvgrd           38.523968
fireplaces              40.180393
garage_cars             40.525880
mo_sold                 36.770724
yr_sold                 36.660428
neighborhood_Blmngtn    34.938443
neighborhood_Blueste    35.712170
neighborhood_BrDale     33.731450
neighborhood_BrkSide    34.203928
neighborhood_ClearCr    42.498024
neighborhood_CollgCr    36.285631
neighborhood_Crawfor    40.662667
neighborhood_Edwards    33.360122
neighborhood_Gilbert    32.937633
neighborhood_Greens     46.692636
neighborhood_IDOTRR     30.938906
neighborhood_MeadowV    31.414816
neighborhood_Mitchel    35.452275
neighborhood_NAmes      35.477821
neighborhood_NPkVill    34.492759
neighborhood_NWAmes     35.513423
neighborhood_NoRidge    42.769759
neighborhood_NridgHt    42.821265
neighborhood_OldTown    32.769414
neighborhood_S

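This was worse than the original set 2 model (RMSE of about 35,089 versus 28,743), although the two runs use different `random_state` values, so their validation sets are not the same.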

I wasn't satisfied with these numbers so I decided to do the same thing but on the logged `saleprice`.

---

