# Part 3 - Modelling
---


The objective of this notebook is to compare Ordinary Least Squares (OLS) linear regression, LASSO regression and Ridge regression, and to determine the best-performing model for submission to Kaggle (Part 4) and for answering our business question (Part 5).


The methodology for modelling is as follows:


1. The regression models will be run with the full set of 90 columns to obtain the coefficients, the RMSE values and the names of the top 5, 10, 15, 20, 25 and 30 features.



2. The regression models will then be run a second time on each of these feature subsets and scored using RMSE.



3. The best-performing model at each feature count will be submitted to Kaggle to obtain an RMSE for the test set.
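
Steps 1 and 2 can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual code: `make_regression` stands in for the housing data, a single `RidgeCV` model stands in for the three models, and `rank_features` and `cv_rmse` are hypothetical helper names. Step 3 (the Kaggle submission) is not covered.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 200 rows, 30 features, 5 informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
alphas = np.logspace(-3, 3, 20)


def rank_features(X, y):
    """Step 1: fit on all standardized features and rank columns by |coefficient|."""
    Z = StandardScaler().fit_transform(X)
    model = RidgeCV(alphas=alphas).fit(Z, y)
    return np.argsort(-np.abs(model.coef_))


def cv_rmse(X, y, columns):
    """Step 2: refit on the chosen columns and score with cross-validated RMSE."""
    Z = StandardScaler().fit_transform(X[:, columns])
    scores = cross_val_score(RidgeCV(alphas=alphas), Z, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()


ranking = rank_features(X_demo, y_demo)
results = {k: cv_rmse(X_demo, y_demo, ranking[:k]) for k in (5, 10, 15)}
for k, rmse in results.items():
    print(f"top {k:>2} features: CV RMSE = {rmse:.3f}")
```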

# Housekeeping


Importing libraries and DataFrame.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# remove warnings
warnings.filterwarnings("ignore")

pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

df = pd.read_csv("df_EDA_transformed.csv")
df

Unnamed: 0,Id,PID,Lot Area,Lot Shape,Land Slope,Overall Qual,Overall Cond,Mas Vnr Area,Exter Qual,Exter Cond,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating QC,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Fence,Misc Val,SalePrice,total_porch,has_bsmt,has_garage,has_fireplace,has_2f,recently_built,recently_remoded,foundation_BrkTil,foundation_CBlock,foundation_PConc,sale_New,neighborhood_Blmngtn,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,neighborhood_CollgCr,neighborhood_Crawfor,neighborhood_Edwards,neighborhood_Gilbert,neighborhood_Greens,neighborhood_GrnHill,neighborhood_IDOTRR,neighborhood_Landmrk,neighborhood_MeadowV,neighborhood_Mitchel,neighborhood_NAmes,neighborhood_NPkVill,neighborhood_NWAmes,neighborhood_NoRidge,neighborhood_NridgHt,neighborhood_OldTown,neighborhood_SWISU,neighborhood_Sawyer,neighborhood_SawyerW,neighborhood_Somerst,neighborhood_StoneBr,neighborhood_Timber,neighborhood_Veenker
0,109,533352170,9.511777,1.098612,0.0,6,2.197225,5.669881,1.098612,0.693147,1.386294,1.386294,0.693147,6,6.280396,0.693147,0.000000,5.262690,6.587550,3,0.693147,6.587550,6.626718,0.0,7.299797,0.000000,0.0,2,0.693147,3,0.693147,1.098612,1.945910,2.079442,0.000000,0,7.589336,2,2.0,475.0,1.386294,1.386294,1.098612,0.000000,3.806662,0.00000,0.0,0.0,0.0,0.0,11.779136,3.806662,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,544,531379050,9.349493,1.098612,0.0,7,1.791759,4.890349,1.098612,0.693147,1.609438,1.386294,0.693147,6,6.458338,0.693147,0.000000,5.624018,6.817831,3,0.693147,6.817831,7.098376,0.0,7.660585,0.693147,0.0,2,0.693147,4,0.693147,1.098612,2.197225,2.079442,0.693147,3,7.599902,2,2.0,559.0,1.386294,1.386294,1.098612,0.000000,4.317488,0.00000,0.0,0.0,0.0,0.0,12.301387,4.317488,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,153,535304180,8.977525,1.386294,0.0,5,2.079442,0.000000,0.693147,1.098612,1.386294,1.386294,0.693147,6,6.595781,0.693147,0.000000,5.789960,6.964136,1,0.693147,6.964136,0.000000,0.0,6.964136,0.693147,0.0,1,0.000000,3,0.693147,1.098612,1.791759,2.079442,0.000000,0,7.577634,1,1.0,246.0,1.386294,1.386294,1.098612,0.000000,3.970292,0.00000,0.0,0.0,0.0,0.0,11.599112,3.970292,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,318,916386060,9.190444,1.386294,0.0,5,1.791759,0.000000,0.693147,0.693147,1.609438,1.386294,0.693147,1,0.000000,0.693147,0.000000,5.953243,5.953243,2,0.693147,6.613384,6.552508,0.0,7.275865,0.000000,0.0,2,0.693147,3,0.693147,0.693147,2.079442,2.079442,0.000000,0,7.604894,3,2.0,400.0,1.386294,1.386294,1.098612,4.615121,0.000000,0.00000,0.0,0.0,0.0,0.0,12.066816,0.000000,1,1,0,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,255,906425045,9.563529,1.098612,0.0,6,2.197225,0.000000,0.693147,0.693147,1.098612,1.609438,0.693147,1,0.000000,0.693147,0.000000,6.517671,6.517671,1,0.693147,6.723832,6.421622,0.0,7.276556,0.000000,0.0,2,0.000000,3,0.693147,0.693147,1.945910,2.079442,0.000000,0,7.579679,1,2.0,484.0,1.386294,1.386294,0.000000,0.000000,4.094345,0.00000,0.0,0.0,0.0,0.0,11.838633,4.094345,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,9.345745,1.098612,0.0,8,1.791759,0.000000,1.098612,0.693147,1.609438,1.386294,1.386294,6,6.919684,0.693147,0.000000,6.773080,7.541683,3,0.693147,7.455298,0.000000,0.0,7.455298,0.693147,0.0,2,0.000000,3,0.693147,1.098612,2.079442,2.079442,0.693147,4,7.604894,3,2.0,520.0,1.386294,1.386294,1.098612,0.000000,5.624018,0.00000,0.0,0.0,0.0,0.0,12.607369,5.624018,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2047,785,905377130,9.420844,1.098612,0.0,4,1.791759,0.000000,0.693147,0.693147,1.386294,1.386294,0.693147,4,5.572154,0.693147,0.000000,6.396930,6.759255,3,0.693147,6.759255,0.000000,0.0,6.759255,0.000000,0.0,1,0.000000,1,0.693147,0.693147,1.609438,2.079442,0.000000,0,7.581720,1,2.0,539.0,1.386294,1.386294,1.098612,5.068904,0.000000,0.00000,0.0,0.0,0.0,0.0,11.320566,0.000000,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2048,916,909253010,8.930494,1.386294,0.0,6,1.945910,0.000000,0.693147,0.693147,1.386294,1.386294,0.693147,1,0.000000,0.693147,0.000000,6.799056,6.799056,2,0.693147,7.067320,6.609349,0.0,7.556951,0.000000,0.0,1,0.693147,3,0.693147,0.693147,2.302585,2.079442,0.693147,3,7.565275,1,2.0,342.0,1.098612,1.098612,1.098612,0.000000,0.000000,0.00000,0.0,0.0,0.0,0.0,12.083911,0.000000,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2049,639,535179160,9.249657,1.386294,0.0,4,1.791759,0.000000,0.693147,0.693147,1.386294,1.386294,0.693147,3,5.049856,1.098612,6.621406,5.690359,7.090910,1,0.693147,7.090910,0.000000,0.0,7.090910,0.693147,0.0,1,0.000000,3,0.693147,0.693147,1.945910,2.079442,1.098612,4,7.579168,1,1.0,294.0,1.386294,1.386294,1.098612,0.000000,5.247024,4.94876,0.0,0.0,0.0,0.0,11.877576,10.195784,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


## Train-Test-Split and Modelling


**Summary**
---


In this section, we have instantiated the OLS, LASSO and Ridge models, extracted the top features and scored them on those features. 


From modelling, we observe that the LASSO model achieved better cross-validated RMSE scores than the other models when using the top 10 or more features, whereas the Ridge regression model performed best with the top 5 features. We will use the respective models for submission to Kaggle in **Part 4**.
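
The selection rule behind this summary (lowest cross-validated RMSE at each feature count) can be written out in a few lines. The values below are the log-scale cross-validated RMSE scores printed by the cells further down, rounded to four decimals; only the 5 to 25 feature runs are included, and the dictionary layout is an illustrative choice.

```python
# Cross-validated RMSE (log scale), rounded to 4 decimals from the cell outputs.
cv_rmse = {
    5:  {"ols": 0.2234, "lasso": 0.1807, "ridge": 0.1751},
    10: {"ols": 0.1495, "lasso": 0.1440, "ridge": 0.1494},
    15: {"ols": 0.1620, "lasso": 0.1389, "ridge": 0.1394},
    20: {"ols": 0.1524, "lasso": 0.1295, "ridge": 0.1329},
    25: {"ols": 0.1348, "lasso": 0.1236, "ridge": 0.1307},
}

# The model with the lowest RMSE wins at each feature count.
best_model = {k: min(scores, key=scores.get) for k, scores in cv_rmse.items()}
print(best_model)
# {5: 'ridge', 10: 'lasso', 15: 'lasso', 20: 'lasso', 25: 'lasso'}
```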


In [2]:
X = df.drop(columns=["SalePrice", "Id", "PID"])
y = df["SalePrice"]

### Modelling with Full Set of Features to Obtain Coefficients & Names of Top Features

In [3]:
# run OLS, LASSO and Ridge regressions on the full feature set (LASSO and Ridge are fitted on standardized X)

ols_lasso_ridge_coef ={}
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X)
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# instantiate the models (OLS is cross-validated on standardized X but fitted on unscaled X below)
lr = LinearRegression()
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

# 10-fold cross-validation on the standardized X for all three models
r2_ols = cross_val_score(lr, sc_X, y, cv=10).mean()
rmse_ols = cross_val_score(lr,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X, y,cv=10, scoring="neg_root_mean_squared_error").mean()

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X, y,cv=10, scoring="neg_root_mean_squared_error").mean()

# fitting and scoring on test set
lr.fit(X_train,y_train)
lasso_cv.fit(Z_train, y_train)
ridge_cv.fit(Z_train, y_train)

lr_pred = lr.predict(X_test)
lasso_pred = lasso_cv.predict(Z_test)
ridge_pred = ridge_cv.predict(Z_test)

# Printing the results
print("OLS ".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()
print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()
print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

ols_lasso_ridge_coef["ols"] = lr.coef_
ols_lasso_ridge_coef["lasso"] = lasso_cv.coef_
ols_lasso_ridge_coef["ridge"] = ridge_cv.coef_

=======OLS =======
Cross Val Score (R2):  0.9153898972555584
Cross Val Score (RMSE):  0.11849445580394635
Test Score(R2):  0.9207770194841021
Test Score (RMSE): 0.1113607438311811, 20229.20955975009

======LASSO=======
Cross Val Score (R2):  0.9158820730830273
Cross Val Score (RMSE):  0.1181269916303029
Test Score(R2):  0.9175256897256341
Test Score (RMSE): 0.11362290287922454, 20567.415416396707
Lasso Alpha:  0.0008302175681319744

======RIDGE=======
Cross Val Score (R2):  0.9158494879939582
Cross Val Score (RMSE):  0.11811251230200906
Test Score(R2):  0.9201514474232736
Test Score (RMSE): 0.11179955077056015, 20290.00426003857
Ridge Alpha:  14.508287784959402
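
Each "Test Score (RMSE)" line prints two numbers: the RMSE on the transformed target, followed by the RMSE after both the true and predicted values are passed through `np.expm1`. The sketch below illustrates the difference with made-up prices. It assumes that `SalePrice` was transformed with `np.log1p` in an earlier part (inferred from the use of `np.expm1` here), so the second number is in dollars; the `rmse` helper and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical sale prices in dollars and a model's predictions for them.
price_true = np.array([130_000.0, 220_000.0, 109_000.0, 174_000.0])
price_pred = np.array([125_000.0, 240_000.0, 112_000.0, 170_000.0])

# Assumption: the model is trained on log1p(price), so targets and predictions are on the log scale.
y_true = np.log1p(price_true)
y_pred = np.log1p(price_pred)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


rmse_log = rmse(y_true, y_pred)                          # first printed value
rmse_dollars = rmse(np.expm1(y_true), np.expm1(y_pred))  # second printed value

print(f"log-scale RMSE: {rmse_log:.4f}")
print(f"dollar RMSE:    {rmse_dollars:.2f}")
```

The log-scale value measures relative error and is what the models optimise, while the dollar value is easier to interpret but is dominated by errors on expensive houses.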


### Extracting Names of Top Features

In [4]:
# Extract top 5 coefficients for OLS
ols_coefs = ols_lasso_ridge_coef["ols"].reshape(1,-1)
df_ols = pd.DataFrame(ols_coefs, columns=X.columns)
df_ols = df_ols.T
df_ols = df_ols.rename(columns={0:"Coefficient"})
df_ols["abs_coef"] = df_ols["Coefficient"].apply(abs)
ols_top5 = df_ols["abs_coef"].sort_values(ascending=False).head().index

# Extract top 5 coefficients for lasso
lasso_coefs = ols_lasso_ridge_coef["lasso"].reshape(1,-1)
df_lasso = pd.DataFrame(lasso_coefs, columns=X.columns)
df_lasso = df_lasso.T
df_lasso = df_lasso.rename(columns={0:"Coefficient"})
df_lasso["abs_coef"] = df_lasso["Coefficient"].apply(abs)
lasso_top5 = df_lasso["abs_coef"].sort_values(ascending=False).head().index

# Extract top 5 coefficients for ridge
ridge_coefs = ols_lasso_ridge_coef["ridge"].reshape(1,-1)
df_ridge = pd.DataFrame(ridge_coefs, columns=X.columns)
df_ridge = df_ridge.T
df_ridge = df_ridge.rename(columns={0:"Coefficient"})
df_ridge["abs_coef"] = df_ridge["Coefficient"].apply(abs)
ridge_top5 = df_ridge["abs_coef"].sort_values(ascending=False).head().index

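# Extract top 10 coefficients
# The OLS ranking comes from df_ols. The Top 10 OLS scores recorded further down were
# produced with the ridge ranking (df_ridge) and will differ once this cell is re-run.
ols_top10 = df_ols["abs_coef"].sort_values(ascending=False).head(10).index
lasso_top10 = df_lasso["abs_coef"].sort_values(ascending=False).head(10).index
ridge_top10 = df_ridge["abs_coef"].sort_values(ascending=False).head(10).index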

# Extract top 15 coefficients
ols_top15 = df_ols["abs_coef"].sort_values(ascending=False).head(15).index
lasso_top15 = df_lasso["abs_coef"].sort_values(ascending=False).head(15).index
ridge_top15 = df_ridge["abs_coef"].sort_values(ascending=False).head(15).index

# Extract top 20 coefficients
ols_top20 = df_ols["abs_coef"].sort_values(ascending=False).head(20).index
lasso_top20 = df_lasso["abs_coef"].sort_values(ascending=False).head(20).index
ridge_top20 = df_ridge["abs_coef"].sort_values(ascending=False).head(20).index

# Extract top 25 coefficients
ols_top25 = df_ols["abs_coef"].sort_values(ascending=False).head(25).index
lasso_top25 = df_lasso["abs_coef"].sort_values(ascending=False).head(25).index
ridge_top25 = df_ridge["abs_coef"].sort_values(ascending=False).head(25).index

# Extract top 30 coefficients
ols_top30 = df_ols["abs_coef"].sort_values(ascending=False).head(30).index
lasso_top30 = df_lasso["abs_coef"].sort_values(ascending=False).head(30).index
ridge_top30 = df_ridge["abs_coef"].sort_values(ascending=False).head(30).index
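
The extraction above repeats one pattern for every model and feature count: take absolute coefficients, sort in descending order, keep the first k names. A small helper can express that pattern once. This is a sketch only; `top_k_features` is a hypothetical name that is not used elsewhere in the notebook, and the toy coefficients are invented for the demonstration.

```python
import numpy as np
import pandas as pd


def top_k_features(coefs, feature_names, k):
    """Return the names of the k features with the largest absolute coefficients."""
    ranked = pd.Series(np.abs(coefs), index=feature_names).sort_values(ascending=False)
    return ranked.head(k).index


# Toy coefficient vector; in the notebook this would be lr.coef_, lasso_cv.coef_ or ridge_cv.coef_.
toy_coefs = np.array([0.5, -2.0, 0.1, 1.5])
toy_names = ["a", "b", "c", "d"]

top2 = top_k_features(toy_coefs, toy_names, 2)
print(list(top2))  # ['b', 'd']
```

With such a helper, a call like `top_k_features(ols_lasso_ridge_coef["ols"], X.columns, 10)` names the model explicitly at each call site, which makes it harder to pick up the wrong model's coefficients by accident.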

### Running Models with the Different Feature Numbers


After obtaining the top N features (N = 5, 10, 15, 20, 25 and 30), we will run the respective models again using only those features.
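
The per-feature-count cells that follow share the same structure, which could be condensed into one function. The sketch below is an illustration under stated assumptions, not the notebook's code: `score_models` is a hypothetical helper, the data is synthetic, and the alpha grid is smaller than the notebook's. It also places the scaler inside a pipeline so that each fold is scaled using only its own training portion, which differs slightly from the cells below, where the scaler is fitted on the full feature matrix before cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alphas = np.logspace(-3, 3, 20)


def score_models(X, y, cv=5):
    """Return the cross-validated RMSE of OLS, LASSO and Ridge on the given features."""
    models = {
        "ols": LinearRegression(),
        "lasso": LassoCV(alphas=alphas, max_iter=50000, cv=cv),
        "ridge": RidgeCV(alphas=alphas),
    }
    scores = {}
    for name, model in models.items():
        # The scaler sits inside the pipeline, so it is refitted on each training fold.
        pipe = make_pipeline(StandardScaler(), model)
        neg_rmse = cross_val_score(pipe, X, y, cv=cv,
                                   scoring="neg_root_mean_squared_error")
        scores[name] = -neg_rmse.mean()
    return scores


# Synthetic stand-in for the top-N feature columns and the log sale price.
X_demo, y_demo = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=1)
for n_features in (5, 10):
    print(n_features, score_models(X_demo[:, :n_features], y_demo))
```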

#### Top 5 Features

In [5]:
# use the respective model top5 coefficients to run model again to check results

# OLS
X_ols = df[ols_top5]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, sc_X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,sc_X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top5]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top5]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

top5_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False),3)
top5_best_intercept = ridge_cv.intercept_
top5_coef = ridge_cv.coef_

=======OLS========
Cross Val Score (R2):  0.7028251814854176
Cross Val Score (RMSE):  0.22342861175777168
Test Score(R2):  0.7064518261809791
Test Score (RMSE): 0.21436122288640636, 47455.92016990079

======LASSO=======
Cross Val Score (R2):  0.8054664374160246
Cross Val Score (RMSE):  0.18070402011970682
Test Score(R2):  0.8047898060751233
Test Score (RMSE): 0.17480658191649623, 32278.834561973545
Lasso Alpha:  1e-05

======RIDGE=======
Cross Val Score (R2):  0.8172901769668135
Cross Val Score (RMSE):  0.17513404104312935
Test Score(R2):  0.8160441973258399
Test Score (RMSE): 0.16969274707690885, 30273.521441578872
Ridge Alpha:  0.7054802310718645


#### Top 10 Features

In [6]:
# use the respective model top10 coefficients to run model again to check results

# OLS
X_ols = df[ols_top10]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top10]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top10]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

top10_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False),3)
top10_best_intercept = lasso_cv.intercept_
top10_coef = lasso_cv.coef_

=======OLS========
Cross Val Score (R2):  0.8665265434318655
Cross Val Score (RMSE):  0.1494520208890749
Test Score(R2):  0.8611388078883109
Test Score (RMSE): 0.14743377277884578, 25969.13608857167

======LASSO=======
Cross Val Score (R2):  0.8763263721235074
Cross Val Score (RMSE):  0.1439870925929026
Test Score(R2):  0.8679316876551789
Test Score (RMSE): 0.14378243973549026, 25563.94452526712
Lasso Alpha:  5.0941380148163754e-05

======RIDGE=======
Cross Val Score (R2):  0.86651950804685
Cross Val Score (RMSE):  0.149446912096125
Test Score(R2):  0.8610873400749964
Test Score (RMSE): 0.1474610928341889, 25979.67718440352
Ridge Alpha:  0.5590810182512223


#### Top 15 Features

In [7]:
# use the respective model top15 coefficients to run model again to check results

# OLS
X_ols = df[ols_top15]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top15]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top15]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

top15_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False),3)
top15_best_intercept = lasso_cv.intercept_
top15_coef = lasso_cv.coef_

=======OLS========
Cross Val Score (R2):  0.8434490405818655
Cross Val Score (RMSE):  0.16200635930212023
Test Score(R2):  0.8371975950775896
Test Score (RMSE): 0.15963824418930983, 31802.442905181586

======LASSO=======
Cross Val Score (R2):  0.8850136381345012
Cross Val Score (RMSE):  0.13892215129022947
Test Score(R2):  0.8704926678974425
Test Score (RMSE): 0.14238154882809198, 25613.451007981483
Lasso Alpha:  1.5922827933410938e-05

======RIDGE=======
Cross Val Score (R2):  0.8840324157267746
Cross Val Score (RMSE):  0.139415963373866
Test Score(R2):  0.8825559688348695
Test Score (RMSE): 0.13558823558204808, 24384.277708800797
Ridge Alpha:  1.7886495290574351


#### Top 20 Features

In [8]:
# use the respective model top20 coefficients to run model again to check results

# OLS
X_ols = df[ols_top20]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top20]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top20]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

top20_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False),3)
top20_best_intercept = lasso_cv.intercept_
top20_coef = lasso_cv.coef_

=======OLS========
Cross Val Score (R2):  0.8617236687237846
Cross Val Score (RMSE):  0.15236854276311274
Test Score(R2):  0.8624593273889184
Test Score (RMSE): 0.1467310773791218, 30345.373536062434

======LASSO=======
Cross Val Score (R2):  0.8999989189523487
Cross Val Score (RMSE):  0.1294757478600141
Test Score(R2):  0.8872350074842132
Test Score (RMSE): 0.1328598268693701, 24522.633733881412
Lasso Alpha:  1e-05

======RIDGE=======
Cross Val Score (R2):  0.8947890320374998
Cross Val Score (RMSE):  0.13292890433543325
Test Score(R2):  0.8871279150573375
Test Score (RMSE): 0.1329229001000119, 24885.350139551076
Ridge Alpha:  0.00025950242113997375


#### Top 25 Features

In [9]:
# use the respective model top25 coefficients to run model again to check results

# OLS
X_ols = df[ols_top25]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top25]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top25]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

top25_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False),3)
top25_best_intercept = lasso_cv.intercept_
top25_coef = lasso_cv.coef_

=======OLS========
Cross Val Score (R2):  0.8915839787321976
Cross Val Score (RMSE):  0.13482520386246027
Test Score(R2):  0.888773568040735
Test Score (RMSE): 0.13195034719723103, 25323.798041203634

======LASSO=======
Cross Val Score (R2):  0.9085754512655246
Cross Val Score (RMSE):  0.12361296293420394
Test Score(R2):  0.9021914015991599
Test Score (RMSE): 0.12373570930218694, 22861.813863341886
Lasso Alpha:  1e-05

======RIDGE=======
Cross Val Score (R2):  0.8982631687160358
Cross Val Score (RMSE):  0.13066055687246775
Test Score(R2):  0.8892720901296971
Test Score (RMSE): 0.1316543112880836, 24161.739842637697
Ridge Alpha:  0.00041320124001153346


#### Top 30 Features

In [10]:
# use the respective model top30 coefficients to run model again to check results

# OLS
X_ols = df[ols_top30]
sc_X_ols = X_ols
X_train_ols, X_test_ols, y_train, y_test = train_test_split(X_ols,y,test_size=0.2, random_state=99)

lr = LinearRegression()

r2_ols = cross_val_score(lr, X_ols, y, cv=10).mean()
rmse_ols = cross_val_score(lr,X_ols,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lr.fit(X_train_ols,y_train)
lr_pred = lr.predict(X_test_ols)

print("OLS".center(18, "="))
print("Cross Val Score (R2): ", r2_ols)
print("Cross Val Score (RMSE): ", -(rmse_ols))
print("Test Score(R2): ", lr.score(X_test_ols, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lr_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lr_pred), squared=False)))
print()

# Lasso
X_lasso = df[lasso_top30]
X_train_lasso, X_test_lasso, y_train, y_test = train_test_split(X_lasso,y,test_size=0.2, random_state=99)
iter_alpha = np.logspace(-5,5,100)

sc = StandardScaler()
sc_X = sc.fit_transform(X_lasso)
Z_train_lasso = sc.fit_transform(X_train_lasso)
Z_test_lasso = sc.transform(X_test_lasso)
lasso_cv = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)

r2_lasso = cross_val_score(lasso_cv, sc_X, y, cv=10).mean()
rmse_lasso = cross_val_score(lasso_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

lasso_cv.fit(Z_train_lasso, y_train)
lasso_pred = lasso_cv.predict(Z_test_lasso)

print("LASSO".center(18, "="))
print("Cross Val Score (R2): ", r2_lasso)
print("Cross Val Score (RMSE): ", -(rmse_lasso))
print("Test Score(R2): ", lasso_cv.score(Z_test_lasso, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, lasso_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False)))
print("Lasso Alpha: ", lasso_cv.alpha_)
print()

# Ridge
X_ridge = df[ridge_top30]
X_train_ridge, X_test_ridge, y_train, y_test = train_test_split(X_ridge,y,test_size=0.2, random_state=99)

sc_X = sc.fit_transform(X_ridge)
Z_train_ridge = sc.fit_transform(X_train_ridge)
Z_test_ridge = sc.transform(X_test_ridge)
ridge_cv = RidgeCV(alphas=iter_alpha, cv=10)

r2_ridge = cross_val_score(ridge_cv, sc_X, y, cv=10).mean()
rmse_ridge = cross_val_score(ridge_cv,sc_X,y,cv=10, scoring="neg_root_mean_squared_error").mean()

ridge_cv.fit(Z_train_ridge, y_train)
ridge_pred = ridge_cv.predict(Z_test_ridge)

print("RIDGE".center(18, "="))
print("Cross Val Score (R2): ", r2_ridge)
print("Cross Val Score (RMSE): ", -(rmse_ridge))
print("Test Score(R2): ", ridge_cv.score(Z_test_ridge, y_test))
print("Test Score (RMSE): " + str(mean_squared_error(y_test, ridge_pred, squared=False)) + ", " + str(mean_squared_error(np.expm1(y_test), np.expm1(ridge_pred), squared=False)))
print("Ridge Alpha: ", ridge_cv.alpha_)

# LASSO has the lowest test RMSE at 30 features, so its results are stored
top30_best_rmse = round(mean_squared_error(np.expm1(y_test), np.expm1(lasso_pred), squared=False),3)
top30_best_intercept = lasso_cv.intercept_
top30_coef = lasso_cv.coef_

=======OLS========
Cross Val Score (R2):  0.8969342722188183
Cross Val Score (RMSE):  0.13129960486119474
Test Score(R2):  0.8943515943618383
Test Score (RMSE): 0.12859912338603918, 24286.88230013631

======LASSO=======
Cross Val Score (R2):  0.9133744407274544
Cross Val Score (RMSE):  0.12008661350585596
Test Score(R2):  0.9112582203836066
Test Score (RMSE): 0.11786113020081232, 21400.976332553208
Lasso Alpha:  1e-05

======RIDGE=======
Cross Val Score (R2):  0.9067377766269423
Cross Val Score (RMSE):  0.1248661534413988
Test Score(R2):  0.9004968462831792
Test Score (RMSE): 0.12480298056166736, 23239.015942769995
Ridge Alpha:  0.001047615752789665
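
In the cells above, the scaler is fitted on the full feature matrix before `cross_val_score` is called, so every validation fold contributes to the scaling statistics. One common alternative, shown here as a sketch rather than as the notebook's method, is to wrap the scaler and the estimator in a scikit-learn pipeline so that the scaler is refitted on the training folds only. The data below comes from `make_regression`, and the alpha grid and fold counts are reduced placeholders chosen for speed, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[<top-k features>] and the transformed target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=99
)

alphas = np.logspace(-3, 3, 20)  # assumption: smaller grid than the notebook's

# Each pipeline standardises the features inside every CV split
models = {
    "LASSO": make_pipeline(
        StandardScaler(), LassoCV(alphas=alphas, max_iter=50000, cv=5)
    ),
    "RIDGE": make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5)),
}

cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()  # scorer returns negated RMSE
    print(f"{name}: cross-validated RMSE = {cv_rmse[name]:.3f}")
```

With a pipeline, the same object can also be fitted on the training split and scored on the test split, which removes the need to keep separate `sc_X`, `Z_train_*` and `Z_test_*` arrays in sync.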


#### Storing the results in a DataFrame

In [11]:
# feature names of the best model at each feature count
ridge_top5 = list(ridge_top5)
lasso_top10 = list(lasso_top10)
lasso_top15 = list(lasso_top15)
lasso_top20 = list(lasso_top20)
lasso_top25 = list(lasso_top25)
lasso_top30 = list(lasso_top30)

# number of features in each set
n_features = [5, 10, 15, 20, 25, 30]

# fitted coefficient weights of the best model at each feature count
ridge_top5_coefs = top5_coef
lasso_top10_coefs = top10_coef
lasso_top15_coefs = top15_coef
lasso_top20_coefs = top20_coef
lasso_top25_coefs = top25_coef
lasso_top30_coefs = top30_coef

# "coefficients" holds the feature names, "coef_weights" the fitted weights;
# the names must come from the same model as the weights (Ridge for top 5)
rmse = {
    "number_of_features": n_features,
    "RMSE": [top5_best_rmse, top10_best_rmse, top15_best_rmse,
             top20_best_rmse, top25_best_rmse, top30_best_rmse],
    "coef_weights": [ridge_top5_coefs, lasso_top10_coefs, lasso_top15_coefs,
                     lasso_top20_coefs, lasso_top25_coefs, lasso_top30_coefs],
    "coefficients": [ridge_top5, lasso_top10, lasso_top15,
                     lasso_top20, lasso_top25, lasso_top30],
    "intercept": [top5_best_intercept, top10_best_intercept, top15_best_intercept,
                  top20_best_intercept, top25_best_intercept, top30_best_intercept],
}

df_rmse = pd.DataFrame(data=rmse)

df_rmse.to_csv("df_production.csv", index_label=False)
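
`df_rmse` stores Python lists and NumPy arrays inside its cells, and `to_csv` writes such cells as their string representations, so any notebook that reads `df_production.csv` has to parse them back. The sketch below shows one possible way to make that round trip explicit by encoding the list-like columns as JSON before writing. The frame, its values and the in-memory buffer are illustrative stand-ins, not the notebook's results or file.

```python
import io
import json

import numpy as np
import pandas as pd

# Illustrative stand-in for df_rmse (values are made up)
demo = pd.DataFrame({
    "number_of_features": [2, 3],
    "RMSE": [25000.0, 23000.0],
    "coef_weights": [np.array([0.12, 0.08]), np.array([0.11, 0.07, 0.03])],
    "coefficients": [
        ["Overall Qual", "Gr Liv Area"],
        ["Overall Qual", "Gr Liv Area", "Garage Cars"],
    ],
    "intercept": [12.02, 12.02],
})

list_columns = ["coef_weights", "coefficients"]

# Encode list-like cells as JSON text so they survive the CSV round trip
encoded = demo.copy()
for col in list_columns:
    encoded[col] = encoded[col].apply(
        lambda cell: json.dumps(np.asarray(cell).tolist())
    )

buffer = io.StringIO()  # stands in for a file on disk
encoded.to_csv(buffer, index=False)
buffer.seek(0)

# Decode the JSON text back into Python lists after reading
restored = pd.read_csv(buffer)
for col in list_columns:
    restored[col] = restored[col].apply(json.loads)

print(restored.loc[1, "coefficients"])
print(restored.loc[0, "coef_weights"])
```

If the CSV format written above is kept unchanged, the reading notebook needs an equivalent parsing step for the `coef_weights` and `coefficients` columns.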

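## Notebook Summary

Using the three regression models (OLS, LASSO and Ridge), we modelled feature sets of 5, 10, 15, 20, 25 and 30 features and obtained RMSE scores for each set. Ridge was the best-performing model for the top 5 features and LASSO for the 10 to 30 feature sets. With 30 features, for example, LASSO reached a test RMSE of about 21,401 compared with 23,239 for Ridge and 24,287 for OLS. The best model at each feature count will be used for submission to Kaggle in **Part 4: Kaggle Submission** and for further analysis in **Part 5: Production Model & Insights**.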