# Part 4 - Kaggle Submission (OPTIONAL)


In Part 3, we established that the LASSO regression model gave the best RMSE scores (via 10-fold cross-validation) across the top 5 to 30 coefficients of each model.


In this notebook, we will apply the same transformations and feature engineering to the test set to enable submission to Kaggle.


The test set will be processed to mirror the training set: its missing values will be cleaned, its ordinal variables encoded, the log1p transform applied to its skewed features, and the same engineered features added. It can then serve as input to our LASSO regression model to generate predictions.
<br>
<br>



*Readers can also skip this notebook and go straight to Part 5, where insights on the business question are examined in greater detail.*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# remove warnings
warnings.filterwarnings("ignore")


pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

## Preparing the Test Set for Submission


The data will be cleaned in this section. Missing values will be examined and imputed with the appropriate values.

In [2]:
# Preprocess test set & get it ready for submission

df = pd.read_csv("df_EDA_transformed.csv")
df_test = pd.read_csv("datasets/test.csv")
display(df_test.head())
display(df_test.shape)

df_test_id = df_test["Id"]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD


(878, 80)

In [3]:
display(df_test.isnull().sum())

Id                   0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       160
Lot Area             0
Street               0
Alley              820
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type         1
Mas Vnr Area         1
Exter Qual           0
Exter Cond           0
Foundation           0
Bsmt Qual           25
Bsmt Cond           25
Bsmt Exposure       25
BsmtFin Type 1      25
BsmtFin SF 1         0
BsmtFin Type 2      25
BsmtFin SF 2         0
Bsmt Unf SF          0
Total Bsmt SF        0
Heating              0
Heating QC           0
Central Air          0
Electrical 

### Electrical

Since the `Utilities` feature states that the property has all utilities, and almost all properties have "SBrkr" as their `Electrical` value (as observed during EDA), we will impute the missing value with "SBrkr".

In [4]:
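# show the row with the missing Electrical value (display is needed because this is not the last expression)
display(df_test[df_test["Electrical"].isnull()])
# row 634, column 43 is Electrical; "SBrkr" matches the spelling used in the dataset
df_test.iloc[634, 43] = "SBrkr"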
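
The imputation above relies on a fixed row and column position. As a minimal sketch of a label-based alternative, the cell below computes the mode and fills the missing value by condition. The toy frame and the name `electrical_mode` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_test; the values are made up for illustration.
toy = pd.DataFrame({
    "Utilities": ["AllPub", "AllPub", "AllPub", "AllPub"],
    "Electrical": ["SBrkr", "FuseA", np.nan, "SBrkr"],
})

# Most frequent category, computed from the data instead of typed by hand.
electrical_mode = toy["Electrical"].mode()[0]

# Select rows by condition and the column by name, so the code does not
# depend on a fixed row position or column order.
missing = toy["Electrical"].isnull()
toy.loc[missing, "Electrical"] = electrical_mode
```

Computing the mode from the data also avoids spelling mistakes in the imputed category.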

### Garage Yr Blt & Garage Finish

Investigation reveals that the row at index 764 is likely an erroneous entry: it has values for `Garage Cars` and `Garage Area` but nulls for the remaining garage features. We will impute `Garage Yr Blt` with the `Year Remod/Add` date, as the garage was likely renovated at that time, and impute `Garage Finish` with the mode.

In [5]:
display(df_test[df_test["Garage Finish"].isnull()])

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
29,1904,534451020,50,RL,51.0,3500,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrkSide,Feedr,Norm,1Fam,1.5Fin,3,5,1945,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,No,LwQ,144,Unf,0,226,370,GasA,TA,N,FuseA,442,228,0,670,1,0,1,0,2,1,Fa,4,Typ,0,,,,,0,0,,,N,0,21,0,0,0,0,,MnPrv,Shed,2000,7,2007,WD
45,979,923228150,160,RM,21.0,1533,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,2Story,4,6,1970,2008,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,TA,Y,SBrkr,798,546,0,1344,0,0,1,1,3,1,TA,6,Typ,1,TA,,,,0,0,,,Y,0,0,0,0,0,0,,,,0,5,2009,WD
66,2362,527403120,20,RL,,8125,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,4,4,1971,1971,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,TA,TA,No,BLQ,614,Unf,0,244,858,GasA,TA,Y,SBrkr,858,0,0,858,0,0,1,0,3,1,TA,5,Typ,0,,,,,0,0,,,Y,0,0,0,0,0,0,,,,0,6,2006,WD
68,2188,908226180,30,RH,70.0,4270,Pave,,Reg,Bnk,AllPub,Inside,Mod,Edwards,Norm,Norm,1Fam,1Story,3,6,1931,2006,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Rec,544,Unf,0,0,544,GasA,Ex,Y,SBrkr,774,0,0,774,0,0,1,0,3,1,Gd,6,Typ,0,,,,,0,0,,,Y,0,0,286,0,0,0,,,,0,5,2007,WD
105,1988,902207010,30,RM,40.0,3880,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,9,1945,1997,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,ALQ,329,Unf,0,357,686,GasA,Gd,Y,SBrkr,866,0,0,866,0,0,1,0,2,1,Gd,4,Typ,0,,,,,0,0,,,Y,58,42,0,0,0,0,,,,0,8,2007,WD
109,217,905101300,90,RL,72.0,10773,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,4,3,1967,1967,Gable,Tar&Grv,Plywood,Plywood,BrkFace,72.0,Fa,Fa,CBlock,TA,TA,No,ALQ,704,Unf,0,1128,1832,GasA,TA,N,SBrkr,1832,0,0,1832,2,0,2,0,4,2,TA,8,Typ,0,,,,,0,0,,,Y,0,58,0,0,0,0,,,,0,5,2010,WD
113,2908,923205120,20,RL,90.0,17217,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1Story,5,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,1140,1140,GasA,Ex,Y,SBrkr,1140,0,0,1140,0,0,1,0,3,1,TA,6,Typ,0,,,,,0,0,,,Y,36,56,0,0,0,0,,,,0,7,2006,WD
144,1507,908250040,50,RL,57.0,8050,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1.5Fin,5,8,1947,1993,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,Slab,,,,,0,,0,0,0,GasA,Gd,Y,SBrkr,929,208,0,1137,0,0,1,1,4,1,TA,8,Min1,0,,,,,0,0,,,Y,0,0,0,0,0,0,,,,0,4,2008,WD
152,1368,903476110,50,RM,60.0,5586,Pave,,IR1,Bnk,AllPub,Inside,Gtl,OldTown,Feedr,Norm,1Fam,1.5Fin,6,7,1920,1998,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,901,901,GasA,Gd,Y,SBrkr,1088,110,0,1198,0,0,1,0,4,1,TA,7,Typ,0,,,,,0,0,,,N,0,98,0,0,0,0,,MnPrv,,0,9,2008,ConLD
156,332,923228270,160,RM,21.0,1900,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,4,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,Ex,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,5,Typ,0,,,,,0,0,,,Y,0,0,0,0,0,0,,,,0,6,2010,WD


In [6]:
df_test.iloc[764, 60] = int(df_test.iloc[764, 21])  # Garage Yr Blt <- Year Remod/Add
df_test.iloc[764, 61] = df_test["Garage Finish"].mode()[0]  # mode() returns a Series, so take its first value
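
One way to locate such inconsistent rows programmatically, rather than by inspection, is sketched below on a toy frame. The data and the names `has_garage`, `inconsistent` and `finish_mode` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: row 1 has no garage at all, row 2 has a garage but no details.
toy = pd.DataFrame({
    "Garage Type": ["Attchd", np.nan, "Detchd"],
    "Garage Finish": ["Fin", np.nan, np.nan],
    "Garage Cars": [2, 0, 1],
    "Garage Area": [480, 0, 360],
    "Year Remod/Add": [2001, 1970, 1983],
    "Garage Yr Blt": [2001.0, np.nan, np.nan],
})

# A garage exists if it has any area or car capacity.
has_garage = (toy["Garage Area"] > 0) | (toy["Garage Cars"] > 0)

# Inconsistent rows: a garage exists but its finish is missing.
inconsistent = toy[has_garage & toy["Garage Finish"].isnull()]

# Impute only those rows; rows with no garage stay null for the later "none" fill.
finish_mode = toy["Garage Finish"].mode()[0]
toy.loc[inconsistent.index, "Garage Yr Blt"] = toy.loc[inconsistent.index, "Year Remod/Add"]
toy.loc[inconsistent.index, "Garage Finish"] = finish_mode
```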

### Imputing the Other Missing Values (as per Part 1) & Encoding Ordinal Features (as per Part 2.1)

In [7]:
# Impute "none" for absent categorical features and 0 for absent numeric ones.
# Assigning the result back avoids chained in-place fills, which newer pandas
# versions may not apply to the original frame.

none_cols = ["Fireplace Qu", "Garage Finish", "Garage Qual", "Garage Cond", "Pool QC", "Fence",
             "Bsmt Qual", "Bsmt Cond", "Bsmt Exposure", "BsmtFin Type 1", "BsmtFin Type 2",
             "Alley", "Misc Feature", "Garage Type"]
zero_cols = ["Bsmt Full Bath", "Bsmt Half Bath", "Garage Yr Blt", "Garage Cars",
             "Mas Vnr Area", "BsmtFin SF 1", "BsmtFin SF 2", "Bsmt Unf SF", "Total Bsmt SF", "Garage Area"]

df_test[none_cols] = df_test[none_cols].fillna("none")
df_test[zero_cols] = df_test[zero_cols].fillna(0)
df_test["Mas Vnr Type"] = df_test["Mas Vnr Type"].fillna("None")  # this column uses "None" (capitalized), unlike the others

df_test.drop("Lot Frontage", axis=1, inplace=True)

In [8]:
# Duplicate engineered features on test_set

lot_shape_encode = {"IR3":0, "IR2":1, "IR1":2, "Reg":3}
land_slope_encode = {"Gtl":0, "Mod":1, "Sev":2}
exter_qual_encode = {"Po":0, "Fa":1, "TA":1, "Gd":2, "Ex":3}
exter_cond_encode = {"Po":0, "Fa":1, "TA":1, "Gd":2, "Ex":3}
bsmt_qual_encode = {"none":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
bsmt_cond_encode = {"none":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
bsmt_exposure_encode = {"none":0, "No":1, "Mn":2, "Av":3, "Gd": 4}
bsmtfin_type1_encode = {"none":0, "Unf":1, "LwQ":2, "Rec":3, "BLQ":4, "ALQ":5, "GLQ":6}
bsmtfin_type2_encode = {"none":0, "Unf":1, "LwQ":2, "Rec":3, "BLQ":4, "ALQ":5, "GLQ":6}
heating_qc_encode = {"Po":0, "Fa":1, "TA":1, "Gd":2, "Ex":3}
central_air_encode = {"N":0, "Y":1}
kitchen_qual_encode = {"Po":0, "Fa":1, "TA":1, "Gd":2, "Ex":3}
functional_encode = {"Sal":0, "Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7}
fireplace_qu_encode = {"none":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
garage_finish_encode = {"none":0, "Unf":1, "RFn":2, "Fin":3}
garage_qual_encode = {"none":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
garage_cond_encode = {"none":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
paved_drive_encode = {"N":0, "P":1, "Y":2}
pool_qc_encode = {"none":0, "Fa":1, "TA":2, "Gd":3, "Ex":4}
fence_encode = {"none":0, "MnWw":1, "GdWo":2, "MnPrv":3, "GdPrv":4}

df_test["Lot Shape"] = df_test["Lot Shape"].map(lot_shape_encode)
df_test["Land Slope"] = df_test["Land Slope"].map(land_slope_encode)
df_test["Exter Qual"] = df_test["Exter Qual"].map(exter_qual_encode)
df_test["Exter Cond"] = df_test["Exter Cond"].map(exter_cond_encode)
df_test["Bsmt Qual"] = df_test["Bsmt Qual"].map(bsmt_qual_encode)
df_test["Bsmt Cond"] = df_test["Bsmt Cond"].map(bsmt_cond_encode)
df_test["Bsmt Exposure"] = df_test["Bsmt Exposure"].map(bsmt_exposure_encode)
df_test["BsmtFin Type 1"] = df_test["BsmtFin Type 1"].map(bsmtfin_type1_encode)
df_test["BsmtFin Type 2"] = df_test["BsmtFin Type 2"].map(bsmtfin_type2_encode)
df_test["Heating QC"] = df_test["Heating QC"].map(heating_qc_encode)
df_test["Central Air"] = df_test["Central Air"].map(central_air_encode)
df_test["Kitchen Qual"] = df_test["Kitchen Qual"].map(kitchen_qual_encode)
df_test["Functional"] = df_test["Functional"].map(functional_encode)
df_test["Fireplace Qu"] = df_test["Fireplace Qu"].map(fireplace_qu_encode)
df_test["Garage Finish"] = df_test["Garage Finish"].map(garage_finish_encode)
df_test["Garage Qual"] = df_test["Garage Qual"].map(garage_qual_encode)
df_test["Garage Cond"] = df_test["Garage Cond"].map(garage_cond_encode)
df_test["Paved Drive"] = df_test["Paved Drive"].map(paved_drive_encode)
df_test["Pool QC"] = df_test["Pool QC"].map(pool_qc_encode)
df_test["Fence"] = df_test["Fence"].map(fence_encode)
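
Note that `Series.map` with a dictionary returns NaN for any category missing from the dictionary, so a label that appears only in the test set would silently become a missing value. A minimal sketch of a guard against this is shown below; `encode_ordinal` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd

quality_encode = {"none": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy column containing one label ("Excellent") that the mapping does not cover.
toy = pd.Series(["TA", "Gd", "none", "Excellent"], name="Garage Qual")

def encode_ordinal(values, mapping):
    """Map categories to integers and fail loudly on unmapped labels."""
    unmapped = sorted(set(values.dropna()) - set(mapping))
    if unmapped:
        raise ValueError(f"{values.name}: unmapped categories {unmapped}")
    return values.map(mapping)

# The unexpected label is reported instead of being turned into NaN.
try:
    encode_ordinal(toy, quality_encode)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Once the label is corrected, the encoding goes through.
encoded = encode_ordinal(toy.replace("Excellent", "Ex"), quality_encode)
```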

### Additional Feature Engineering (as per Part 2.2)

In [9]:
df_test = pd.concat([df_test, pd.get_dummies(df_test["Neighborhood"], prefix="neighborhood")], axis=1)
df_test = pd.concat([df_test, pd.get_dummies(df_test["Foundation"], prefix="foundation")], axis=1)
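df_test = pd.concat([df_test, pd.get_dummies(df_test["Sale Type"], prefix="sale")], axis=1)

# derive has_bsmt from the test set's own (already encoded) Bsmt Qual, not from the training frame df
df_test["has_bsmt"] = df_test["Bsmt Qual"].apply(lambda x: 1 if x > 0 else 0)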
df_test["recently_built"] = df_test["Yr Sold"] - df_test["Year Built"]
df_test["recently_remoded"] = df_test["Yr Sold"] - df_test["Year Remod/Add"]
df_test["recently_built"] = df_test["recently_built"].apply(lambda x: 1 if x <= 10 else 0)
df_test["recently_remoded"] = df_test["recently_remoded"].apply(lambda x: 1 if x <= 10 else 0)


# neighborhood_GrnHill is in the train set but not in the test set, so insert it as an all-zero column to generate predictions for the test set
df_test["neighborhood_GrnHill"] = np.zeros(len(df_test))

# duplicate the log1p transformations for the continuous, ordinal and discrete features included in the regression model
%store -r skewed_columns

for col in skewed_columns:
    if col != "SalePrice":
        df_test[col] = df_test[col].apply(np.log1p)
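
Adding `neighborhood_GrnHill` by hand works when a single dummy column is missing. A more general option, sketched here on toy data, is to reindex the test dummies against the training dummy columns so that any absent category becomes an all-zero column. The frames `train_dummies` and `test_dummies` are illustrative assumptions.

```python
import pandas as pd

# Toy neighborhoods: "GrnHill" occurs in the training data only.
train_dummies = pd.get_dummies(
    pd.Series(["NAmes", "GrnHill", "OldTown"]), prefix="neighborhood", dtype=int
)
test_dummies = pd.get_dummies(
    pd.Series(["OldTown", "NAmes", "NAmes"]), prefix="neighborhood", dtype=int
)

# Reindex the test dummies to the training columns; categories absent from
# the test set are created and filled with 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

Reindexing also drops any dummy column that exists only in the test set, which keeps the two feature sets identical in both name and order.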

## Generating Predictions for the Test Set


In this section, we will generate predictions for submission to Kaggle.


In the cells below, we define the test sets corresponding to the different numbers of coefficients/predictors and generate predictions using the best-performing regression model (Lasso, with Ridge used for the top 5 features in the first cell). The generated predictions are saved as .csv files.

In [10]:
# preprocessing
import ast

df_rmse = pd.read_csv("df_production.csv") # import dataframe from Part 3 which holds data for our lasso models

df_rmse.iloc[0,3] = df_rmse.iloc[0,3].replace("Index","").replace(", dtype='object'", "")

ridge_top5 = ast.literal_eval(df_rmse.iloc[0,3]) # extracting the feature names
lasso_top10 = ast.literal_eval(df_rmse.iloc[1,3])
lasso_top15 = ast.literal_eval(df_rmse.iloc[2,3])
lasso_top20 = ast.literal_eval(df_rmse.iloc[3,3])
lasso_top25 = ast.literal_eval(df_rmse.iloc[4,3])
lasso_top30 = ast.literal_eval(df_rmse.iloc[5,3])

df_test_top5 = df_test[ridge_top5] # preparing the test sets using the respective number of features
df_test_top10 = df_test[lasso_top10]
df_test_top15 = df_test[lasso_top15]
df_test_top20 = df_test[lasso_top20]
df_test_top25 = df_test[lasso_top25]
df_test_top30 = df_test[lasso_top30]

iter_alpha = np.logspace(-5,5,100)
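
The string clean-up in the cell above can be illustrated in isolation. This sketch assumes the stored value looks like the printed form of a pandas `Index`; the example string `raw` and the set `available` are made up for illustration.

```python
import ast

# Assumed example of a feature list saved to CSV as the text form of an Index.
raw = "Index(['Gr Liv Area', 'Overall Qual', 'neighborhood_GrnHill'], dtype='object')"

# Strip the wrapper so that only a Python list literal (in parentheses) remains.
cleaned = raw.replace("Index", "").replace(", dtype='object'", "")

# literal_eval parses literals only, so no arbitrary code is executed.
features = ast.literal_eval(cleaned)

# Check which requested features are absent from a (toy) set of test columns.
available = {"Gr Liv Area", "Overall Qual"}
missing = [name for name in features if name not in available]
```

A check like `missing` reports absent columns by name before the test frame is indexed with the feature list.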

In [11]:
# predict using the best-performing regression model for this feature set (Ridge), trained on the entire training set
# for the top 5 coefficients

X = df[ridge_top5]
y = df["SalePrice"]

sc = StandardScaler()
ridge_cv_test = RidgeCV(alphas=iter_alpha, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top5)

ridge_cv_test.fit(Z_train,y)

final_pred = np.expm1(ridge_cv_test.predict(Z_test))
submission_top5 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top5["SalePrice"] = final_pred 
submission_top5.to_csv("submission_top5.csv", index_label=False, index=False)
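
The `np.expm1` call converts predictions back to the original price scale. This assumes, as the `skewed_columns` loop suggests, that `SalePrice` was log1p-transformed in the training set. A minimal round-trip sketch with made-up prices:

```python
import numpy as np

# Illustrative prices, not taken from the dataset.
prices = np.array([0.0, 125000.0, 310500.0])

# log1p(x) = log(1 + x), which is defined at x = 0.
log_prices = np.log1p(prices)

# expm1(y) = exp(y) - 1 is the exact inverse, so the prices are recovered.
recovered = np.expm1(log_prices)
```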

In [12]:
# predict using the best-performing regression model (Lasso), trained on the entire training set
# for the top 10 coefficients

X = df[lasso_top10]
y = df["SalePrice"]

sc = StandardScaler()
lasso_cv_test = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top10)

lasso_cv_test.fit(Z_train,y)

final_pred = np.expm1(lasso_cv_test.predict(Z_test))
submission_top10 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top10["SalePrice"] = final_pred
submission_top10.to_csv("submission_top10.csv", index_label=False, index=False)

In [13]:
# predict using the best-performing regression model (Lasso), trained on the entire training set
# for the top 15 coefficients

X = df[lasso_top15]
y = df["SalePrice"]

sc = StandardScaler()
lasso_cv_test = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top15)

lasso_cv_test.fit(Z_train,y)

final_pred = np.expm1(lasso_cv_test.predict(Z_test))
submission_top15 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top15["SalePrice"] = final_pred
submission_top15.to_csv("submission_top15.csv", index_label=False, index=False)

In [14]:
# predict using the best-performing regression model (Lasso), trained on the entire training set
# for the top 20 coefficients

X = df[lasso_top20]
y = df["SalePrice"]

sc = StandardScaler()
lasso_cv_test = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top20)

lasso_cv_test.fit(Z_train,y)

final_pred = np.expm1(lasso_cv_test.predict(Z_test))
submission_top20 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top20["SalePrice"] = final_pred
submission_top20.to_csv("submission_top20.csv", index_label=False, index=False)

In [15]:
# predict using the best-performing regression model (Lasso), trained on the entire training set
# for the top 25 coefficients

X = df[lasso_top25]
y = df["SalePrice"]

sc = StandardScaler()
lasso_cv_test = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top25)

lasso_cv_test.fit(Z_train,y)

final_pred = np.expm1(lasso_cv_test.predict(Z_test))
submission_top25 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top25["SalePrice"] = final_pred
submission_top25.to_csv("submission_top25.csv", index_label=False, index=False)

In [16]:
# predict using the best-performing regression model (Lasso), trained on the entire training set
# for the top 30 coefficients

X = df[lasso_top30]
y = df["SalePrice"]

sc = StandardScaler()
lasso_cv_test = LassoCV(alphas=iter_alpha, max_iter=50000, cv=10)
Z_train = sc.fit_transform(X)
Z_test = sc.transform(df_test_top30)

lasso_cv_test.fit(Z_train,y)

final_pred = np.expm1(lasso_cv_test.predict(Z_test))
submission_top30 = pd.DataFrame(df_test_id) # df_test_id was defined earlier during test set pre-processing
submission_top30["SalePrice"] = final_pred
submission_top30.to_csv("submission_top30.csv", index_label=False, index=False)
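
The Lasso cells above differ only in the feature list and the output filename, so they could be collapsed into a single helper. The sketch below is one possible refactoring, not the notebook's actual code: `make_submission` is a hypothetical function, and the synthetic frames stand in for `df`, `df_test` and `df_test_id`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def make_submission(train, test, test_ids, features, alphas, cv=10):
    """Fit LassoCV on log-scale prices and return an Id/SalePrice frame."""
    scaler = StandardScaler()
    Z_train = scaler.fit_transform(train[features])  # fit the scaler on train only
    Z_test = scaler.transform(test[features])        # reuse the train statistics
    model = LassoCV(alphas=alphas, max_iter=50000, cv=cv)
    model.fit(Z_train, train["SalePrice"])
    return pd.DataFrame({
        "Id": test_ids.to_numpy(),
        "SalePrice": np.expm1(model.predict(Z_test)),  # back to the price scale
    })

# Synthetic stand-ins; the target plays the role of a log1p-transformed price.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
train["SalePrice"] = 12 + 0.3 * train["a"] - 0.2 * train["b"]
test = pd.DataFrame(rng.normal(size=(5, 3)), columns=["a", "b", "c"])
test_ids = pd.Series([101, 102, 103, 104, 105], name="Id")

submission = make_submission(
    train, test, test_ids, ["a", "b", "c"], np.logspace(-5, 5, 20), cv=5
)
```

With such a helper, each submission would reduce to one call followed by `to_csv`, which removes the risk of mismatched variable names between cells.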