# Modeling

In this notebook, I compare Linear Regression, Lasso Regression, and Ridge Regression to find the model that best predicts housing prices in Ames, Iowa from house features.

Success is evaluated using R^2 and Root Mean Squared Error (RMSE). Because the models are fit to the log of sale price, RMSE is expressed in log-price units.



In [65]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error 
from sklearn.dummy import DummyRegressor

In [66]:
house = pd.read_csv('./datasets/house_clean.csv')
test = pd.read_csv('./datasets/test_clean.csv')

In [67]:
house = house.drop('Unnamed: 0', axis=1)
test = test.drop('Unnamed: 0', axis=1)

# Split the house dataset into another training set and test set, then scale

In [68]:
X = house.drop(['id', 'pid', 'saleprice', 'logsaleprice'], axis=1)
y = house['logsaleprice']

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [70]:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Dummy Model:

In [71]:
dummodel = DummyRegressor()

In [72]:
dummodel.fit(X_train_sc, y_train)

DummyRegressor()

In [73]:
dummodel.score(X_train_sc, y_train)

0.0

In [74]:
dummodel.score(X_test_sc, y_test)

-0.0024133337764569163

In [75]:
predictions = dummodel.predict(X_test_sc)

In [76]:
np.sqrt(mean_squared_error(y_test, predictions))

0.40868562345586157

The dummy model has a testing R^2 of about -0.002, meaning that it explains essentially none of the variance.
The RMSE of the dummy model is about 0.409. Because the target is `logsaleprice`, this error is in log-price units, not dollars or cents.
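
To make the baseline concrete: with its default strategy, `DummyRegressor` predicts the training mean for every row, which is why its training R^2 is exactly 0. The sketch below uses made-up arrays (`X_demo`, `y_demo`), not the Ames data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))                   # hypothetical features (ignored by the dummy model)
y_demo = rng.normal(loc=12.0, scale=0.4, size=100)   # hypothetical log prices

baseline = DummyRegressor()                          # default strategy="mean"
baseline.fit(X_demo, y_demo)

preds = baseline.predict(X_demo)                     # every prediction is the training mean
train_r2 = baseline.score(X_demo, y_demo)

# For a mean-only predictor, training RMSE equals the standard deviation of y
train_rmse = np.sqrt(np.mean((y_demo - preds) ** 2))
print(train_r2, train_rmse, y_demo.std())
```

Any real model should beat this RMSE on held-out data; otherwise it is no better than guessing the average log price.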

# Model 1: Linear Regression

In [77]:
linear = LinearRegression()
linear.fit(X_train_sc, y_train)

LinearRegression()

In [78]:
linear.score(X_train_sc, y_train)

0.9528116564031468

In [79]:
linear.score(X_test_sc, y_test)

-9.841086528931519e+19

In [80]:
y_pred = linear.predict(X_test_sc)

In [81]:
np.sqrt(mean_squared_error(y_test, y_pred))

4049370068.1638203

The linear regression model did very poorly.

At first glance, the training R^2 of 0.953 looked good.
But the hugely negative testing R^2 (about -9.8e19) shows that the linear model does not generalize at all.

Additionally, the testing RMSE of about 4.05e9 is in log-price units, so the predictions are off by many orders of magnitude.
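
One plausible cause of a blow-up like this, which this notebook does not verify, is nearly collinear columns (for example, redundant dummy variables) producing enormous unregularized coefficients. The sketch below illustrates the effect on synthetic data; `x1`, `x2`, and the noise levels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-8 * rng.normal(size=n)      # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = x1 + 0.1 * rng.normal(size=n)   # y really depends on x1 only

ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# OLS can split the effect into huge offsetting coefficients;
# the ridge penalty keeps both coefficients small.
ols_max = np.abs(ols.coef_).max()
ridge_max = np.abs(ridge_demo.coef_).max()
print(ols_max, ridge_max)
```

Offsetting coefficients of this size cancel on the training rows but can produce absurd predictions on new rows, which is the pattern regularized models such as Lasso and Ridge are designed to avoid.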

# Model 2: Lasso, with cross validation to find best alpha

In [82]:
l_alphas = np.logspace(-3, 0, 100)
lasso = LassoCV(alphas = l_alphas, cv =5, max_iter = 50000)
lasso.fit(X_train_sc, y_train)

LassoCV(alphas=array([0.001     , 0.00107227, 0.00114976, 0.00123285, 0.00132194,
       0.00141747, 0.00151991, 0.00162975, 0.00174753, 0.00187382,
       0.00200923, 0.00215443, 0.00231013, 0.00247708, 0.00265609,
       0.00284804, 0.00305386, 0.00327455, 0.00351119, 0.00376494,
       0.00403702, 0.00432876, 0.00464159, 0.00497702, 0.0053367 ,
       0.00572237, 0.00613591, 0.00657933, 0.0070548 , 0.00756463,
       0.008...
       0.09326033, 0.1       , 0.10722672, 0.1149757 , 0.12328467,
       0.13219411, 0.14174742, 0.15199111, 0.16297508, 0.17475284,
       0.18738174, 0.2009233 , 0.21544347, 0.23101297, 0.24770764,
       0.26560878, 0.28480359, 0.30538555, 0.32745492, 0.35111917,
       0.37649358, 0.40370173, 0.43287613, 0.46415888, 0.49770236,
       0.53366992, 0.57223677, 0.61359073, 0.65793322, 0.70548023,
       0.75646333, 0.81113083, 0.869749  , 0.93260335, 1.        ]),
        cv=5, max_iter=50000)

In [83]:
lasso.score(X_train_sc, y_train)

0.9374870409272786

In [84]:
lasso.score(X_test_sc, y_test)

0.7886324182211444

In [85]:
y_pred = lasso.predict(X_test_sc)
np.sqrt(mean_squared_error(y_test, y_pred))

0.18766579914115603

In [86]:
X_test_2 = test.drop(['id', 'pid'], axis=1)
X_test_2_sc = scaler.transform(X_test_2)

In [87]:
test_lasso_pred = lasso.predict(X_test_2_sc)

In [88]:
test['log_preds'] = test_lasso_pred
test['SalePrice'] = np.exp(test['log_preds'])  # undo the log transform

In [89]:
lasso_sub = test[['id', 'SalePrice']]
lasso_sub = lasso_sub.rename(columns={'id':'Id'})

In [90]:
lasso_sub.to_csv('./datasets/lasso_sub.csv', index=False)

The Lasso model did much better than both the Linear Regression model and the dummy model.
It has a testing R^2 of 0.789, meaning that the model accounts for about 78.9% of the variance in log sale price.
The RMSE of 0.188 is in log-price units, so it describes a relative error rather than an amount in dollars or cents.
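
Because the target is the natural log of sale price, an RMSE on that scale is easier to read as a multiplicative error. As a rough rule of thumb (not an exact dollar RMSE), a log-scale error of `r` means predictions are typically off by a factor of about `exp(r)`:

```python
import numpy as np

log_rmse = 0.18766579914115603   # lasso test RMSE from the cell output above

factor = np.exp(log_rmse)        # typical multiplicative error
pct = (factor - 1) * 100         # the same error expressed as a percentage
print(f"typical error: about {pct:.1f}% of the sale price")
```

An exact dollar-scale RMSE would require exponentiating both the predictions and the true values before computing the error.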

# Model 3: Ridge, with cross validation to find best alpha

In [91]:
r_alphas = np.logspace(0, 5, 100)
ridge = RidgeCV(alphas = r_alphas, scoring = 'r2', cv = 5)
ridge.fit(X_train_sc, y_train)

RidgeCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]),
        cv=5, scoring='r2')

In [92]:
ridge.score(X_train_sc, y_train)

0.9391002581009751

In [93]:
ridge.score(X_test_sc, y_test)

0.8082558756549979

In [94]:
ridge_pred = ridge.predict(X_test_sc)
np.sqrt(mean_squared_error(y_test, ridge_pred))

0.17874214914763062

In [95]:
test_ridge_pred = ridge.predict(X_test_2_sc)
test['log_preds'] = test_ridge_pred
test['SalePrice'] = np.exp(test['log_preds'])  # undo the log transform
ridge_sub = test[['id', 'SalePrice']]
ridge_sub = ridge_sub.rename(columns={'id':'Id'})
ridge_sub.to_csv('./datasets/ridge_sub.csv', index=False)

The lasso model and the ridge model performed similarly: testing R^2 of 0.789 versus 0.808, and testing RMSE of 0.188 versus 0.179.
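
Even when the scores are close, the two penalties behave differently: lasso can set coefficients exactly to zero, while ridge only shrinks them. A minimal sketch of how the chosen alpha and the number of surviving coefficients could be inspected, using synthetic data from `make_regression` rather than the Ames features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Hypothetical data: 30 features, only 5 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=10.0, random_state=0)
y_demo = (y_demo - y_demo.mean()) / y_demo.std()   # put y on a unit scale

lasso_demo = LassoCV(alphas=np.logspace(-3, 0, 20), cv=5, max_iter=50000).fit(X_demo, y_demo)
ridge_demo = RidgeCV(alphas=np.logspace(0, 5, 20), scoring='r2', cv=5).fit(X_demo, y_demo)

# alpha_ holds the penalty strength selected by cross-validation
n_lasso = int(np.sum(lasso_demo.coef_ != 0))
n_ridge = int(np.sum(ridge_demo.coef_ != 0))
print("lasso alpha:", lasso_demo.alpha_, "nonzero coefs:", n_lasso)
print("ridge alpha:", ridge_demo.alpha_, "nonzero coefs:", n_ridge)
```

If the selected alpha sits at either end of its grid, widening the grid is usually worth trying.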

# Lasso on the reduced datasets, with features not strongly correlated with sale price removed

In [96]:
corr_train = pd.read_csv('./datasets/house_corr_features.csv')
corr_test = pd.read_csv('./datasets/test_corr_features.csv')

In [97]:
X = corr_train.drop(['id', 'pid', 'saleprice', 'logsaleprice'], axis=1)
y = corr_train['logsaleprice']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [98]:
Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)

In [99]:
lasso.fit(Z_train, y_train)

LassoCV(alphas=array([0.001     , 0.00107227, 0.00114976, 0.00123285, 0.00132194,
       0.00141747, 0.00151991, 0.00162975, 0.00174753, 0.00187382,
       0.00200923, 0.00215443, 0.00231013, 0.00247708, 0.00265609,
       0.00284804, 0.00305386, 0.00327455, 0.00351119, 0.00376494,
       0.00403702, 0.00432876, 0.00464159, 0.00497702, 0.0053367 ,
       0.00572237, 0.00613591, 0.00657933, 0.0070548 , 0.00756463,
       0.008...
       0.09326033, 0.1       , 0.10722672, 0.1149757 , 0.12328467,
       0.13219411, 0.14174742, 0.15199111, 0.16297508, 0.17475284,
       0.18738174, 0.2009233 , 0.21544347, 0.23101297, 0.24770764,
       0.26560878, 0.28480359, 0.30538555, 0.32745492, 0.35111917,
       0.37649358, 0.40370173, 0.43287613, 0.46415888, 0.49770236,
       0.53366992, 0.57223677, 0.61359073, 0.65793322, 0.70548023,
       0.75646333, 0.81113083, 0.869749  , 0.93260335, 1.        ]),
        cv=5, max_iter=50000)

In [100]:
lasso.score(Z_train, y_train)

0.856830659855403

In [101]:
lasso.score(Z_test, y_test)

0.8968763885824291

In [102]:
y_pred = lasso.predict(Z_test)

In [103]:
np.sqrt(mean_squared_error(y_test, y_pred))

0.12972871773498984

In [104]:
X_test_2 = corr_test.drop(['id', 'pid'], axis=1)
Z_test_2 = scaler.transform(X_test_2)

In [105]:
preds = lasso.predict(Z_test_2)

In [106]:
test_corr_pred = preds
corr_test['log_preds'] = test_corr_pred
corr_test['SalePrice'] = np.exp(corr_test['log_preds'])  # undo the log transform
corr_sub = corr_test[['id', 'SalePrice']]
corr_sub = corr_sub.rename(columns={'id':'Id'})
corr_sub.to_csv('./datasets/corr_sub.csv', index=False)

In [107]:
# Pair each feature name with its fitted lasso coefficient
coef = pd.DataFrame({'feature': X.columns, 'coef': lasso.coef_})

In [108]:
coef.sort_values(by='coef', ascending=True).head(25)

| | feature | coef |
|---|---|---|
| 24 | paved_drive_N | -0.026706 |
| 16 | fireplace_qu_no_fire | -0.018508 |
| 1 | ms_subclass_30 | -0.017768 |
| 2 | ms_zoning_RM | -0.01748 |
| 15 | kitchen_qual_TA | -0.009171 |
| 12 | heating_qc_TA | -0.004401 |
| 13 | central_air_N | -0.004116 |
| 20 | garage_finish_Unf | -0.003354 |
| 76 | garage_yr_blt_2008.0 | -0.002372 |
| 3 | lot_shape_Reg | -0.002287 |


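This model, with 80% fewer features, performed at least as well as the original models with all of the features (testing R^2 of 0.897 and RMSE of 0.130), which suggests that many of the housing features add little predictive value for sale price. The comparison is approximate, because this model was scored on a different random train/test split than the earlier ones.

If I were a homeowner looking to renovate my home, I would focus on the features that influenced the model, and not the other 318 features.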
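
To see which features influenced a lasso model, one option is to keep only the nonzero coefficients and rank them by absolute size, so that strong negative and strong positive effects both surface. A minimal sketch on synthetic data; the `feat_*` columns and the fixed `alpha` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)),
                      columns=[f"feat_{i}" for i in range(6)])
# Hypothetical target: only feat_0 and feat_3 matter
y_demo = 2.0 * X_demo["feat_0"] - 1.0 * X_demo["feat_3"] + 0.1 * rng.normal(size=300)

model = Lasso(alpha=0.1).fit(X_demo, y_demo)

coef_demo = pd.Series(model.coef_, index=X_demo.columns)
kept = coef_demo[coef_demo != 0]                                    # features the lasso retained
kept = kept.reindex(kept.abs().sort_values(ascending=False).index)  # rank by magnitude
print(kept)
```

With standardized inputs, as in this notebook, coefficient magnitudes are comparable across features, which is what makes a ranking like this meaningful.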