# Notebook Intro:

In this notebook, I start from my cleaned training data and use every available feature: all continuous and discrete features, the updated ordinal features (recoded as numerical values) imported from notebook 4, and dummy columns for all nominal features, dropping the first dummy column for each one.

I perform a train/test split on this data, fit a scaler on the train split and then transform both the training and test splits.

I then fit a lasso model on the train split and score it on both the training and test splits.  The scores are good, so I refit the scaler and the lasso model on the entire set of training data.  I then use this updated lasso model on the test data to create **prediction 3**.  Note that while applying the lasso model to the test data, I realized that some dummy columns differ between the training and test data, so I combine the two sets of dummy columns, then separate them again so that both the training and test data have the same columns.  All values that are null (because the column didn't previously exist in that dataset) are replaced with 0.

I export the fitted lasso coefficients from the training data for use in other notebooks.


I then fit a ridge model on the train split and run the model on both the training and test split.  However, it has similar performance to the lasso model, and a lot more features, so I don't use it to create any predictions with the test data.

In [154]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.pipeline import Pipeline

In [155]:
# import cleaned training data
filepath = '../datasets/interim_files/train_clean.csv'
df = pd.read_csv(filepath)

#import ordinal features
filepath = '../datasets/interim_files/training_updated_ordinal_features.csv'
df_ord = pd.read_csv(filepath)

In [156]:
# From Step_3- Notebook

# continuous features - per the data dictionary
continuous_features = ['Lot Frontage','Lot Area','Mas Vnr Area','BsmtFin SF 1','BsmtFin SF 2','Bsmt Unf SF','Total Bsmt SF','1st Flr SF','2nd Flr SF','Low Qual Fin SF','Gr Liv Area','Garage Area','Wood Deck SF','Open Porch SF','Enclosed Porch','3Ssn Porch','Screen Porch','Pool Area','Misc Val']

# nominal features - remove PID
nominal_features = ['MS SubClass','MS Zoning','Street','Alley','Land Contour','Lot Config','Neighborhood','Condition 1','Condition 2','Bldg Type','House Style','Roof Style','Roof Matl','Exterior 1st','Exterior 2nd','Mas Vnr Type','Foundation','Heating','Central Air','Garage Type','Misc Feature','Sale Type']

#'PID',

# discrete features 
discrete_features = ['Year Built','Year Remod/Add','Bsmt Full Bath','Bsmt Half Bath','Full Bath','Half Bath','Bedroom AbvGr','Kitchen AbvGr','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Mo Sold','Yr Sold']

# Ordinal Features
ordinal_features = ['Lot Shape','Utilities','Land Slope','Overall Qual','Overall Cond','Exter Qual','Exter Cond','Bsmt Qual','Bsmt Cond','Bsmt Exposure','BsmtFin Type 1','BsmtFin Type 2','Heating QC','Electrical','Kitchen Qual','Functional','Fireplace Qu','Garage Finish','Garage Qual','Garage Cond','Paved Drive','Pool QC','Fence']

In [157]:
df_nom = pd.get_dummies(df[nominal_features], columns = nominal_features, drop_first = True)

In [158]:
df_cont = df[continuous_features]
df_dis = df[discrete_features]


In [159]:
X = pd.concat((df_cont, df_dis, df_ord, df_nom), axis = 1)
y = df['SalePrice']

In [160]:
X.shape

(2026, 210)

In [161]:
# train test split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

In [162]:
# scale features: fit the scaler on the train split only
ss = StandardScaler()
ss.fit(X_train)

StandardScaler()

In [163]:
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
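
The imports at the top include `Pipeline`, which the cells above do not use. As a minimal sketch (not what this notebook actually runs), the scale-then-fit steps could be wrapped in a single pipeline so the scaler is always fit on the same rows as the model. The data below is synthetic, and the names `X_demo`, `y_demo`, and `pipe` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical data, not the housing set).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline fits the scaler on X_tr only, then fits the lasso on the scaled X_tr.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5)),
])
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler before scoring, so the test rows
# never influence the scaling parameters.
train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(train_r2, test_r2)
```

The same `pipe` object could later be refit on all of the training data with one `fit` call, which avoids keeping separate `ss`, `ss1`, and `ss2` scalers in sync by hand.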

## Lasso Model

In [164]:
alphas = np.linspace(0.05,100000, 1000)

In [165]:
lcv = LassoCV(alphas = alphas)
lcv.fit(X_train, y_train)
lcv.score(X_train, y_train), lcv.score(X_test, y_test)

  model = cd_fast.enet_coordinate_descent_gram(


(0.893756441672642, 0.8982359812052281)

In [166]:
lcv.alpha_

900.9504504504504
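
The selected alpha is one of the lowest points on the grid: `np.linspace(0.05, 100000, 1000)` steps by roughly 100, so only a handful of candidates fall below 1000. The sketch below checks that arithmetic and compares it with a log-spaced grid, which is an assumption on my part about what might be worth trying, not something this notebook does. The names `log_alphas`, `n_below`, and `n_below_log` are illustrative.

```python
import numpy as np

# The grid used in this notebook.
alphas = np.linspace(0.05, 100000, 1000)

# Spacing between neighbouring candidates (about 100.1).
step = alphas[1] - alphas[0]
print(step)

# The tenth grid point, which matches the alpha that LassoCV selected above.
print(alphas[9])

# How many candidates lie below 1000 on the linear grid.
n_below = int((alphas < 1000).sum())

# A log-spaced grid over a similar range puts far more candidates in the
# region where the best alpha was found.
log_alphas = np.logspace(-1, 5, 1000)
n_below_log = int((log_alphas < 1000).sum())
print(n_below, n_below_log)
```

With so few candidates near the optimum, the linear grid can only locate the best alpha to within about 100, whereas a log-spaced grid searches that region much more finely.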

In [167]:
sum(lcv.coef_ != 0)

88

In [168]:
# from kobe review 

cols_and_coef = pd.DataFrame({ 
    'var': X.columns,
    'coef val': lcv.coef_
}).set_index('var').sort_values('coef val', ascending=False)

In [169]:
cols_and_coef.head()

var,coef val
Gr Liv Area,21352.537746
Overall Qual,12662.646646
Neighborhood_NridgHt,9866.331154
Kitchen Qual,6307.499278
Neighborhood_StoneBr,6147.214647


In [170]:
cols_and_coef[cols_and_coef['coef val'] != 0].index

Index(['Gr Liv Area', 'Overall Qual', 'Neighborhood_NridgHt', 'Kitchen Qual',
       'Neighborhood_StoneBr', 'Exter Qual', 'Bsmt Exposure',
       'Neighborhood_NoRidge', '1st Flr SF', 'Year Built', 'Sale Type_New',
       'Misc Feature_Gar2', 'BsmtFin SF 1', 'Garage Cars',
       'Neighborhood_GrnHill', 'Overall Cond', 'Mas Vnr Area', 'Screen Porch',
       'Roof Matl_WdShngl', 'Bsmt Qual', 'Misc Feature_Othr', 'Bsmt Full Bath',
       'Exterior 1st_BrkFace', 'Fireplaces', 'Neighborhood_Crawfor',
       'Land Contour_HLS', 'Roof Style_Hip', 'Functional', 'BsmtFin Type 1',
       'Garage Area', 'Neighborhood_Somerst', 'House Style_1Story',
       'Condition 1_Norm', 'Fireplace Qu', 'Wood Deck SF', 'Roof Matl_CompShg',
       'Heating QC', 'Sale Type_Con', 'Year Remod/Add', 'Full Bath',
       'Condition 1_PosN', 'Misc Feature_Shed', 'Condition 2_PosA',
       'Lot Config_CulDSac', 'TotRms AbvGrd', 'Land Contour_Low',
       'BsmtFin SF 2', 'Foundation_PConc', 'Mas Vnr Type_Stone',
    

## Fit to all training data

In [171]:
ss1 = StandardScaler()
ss1.fit(X)

StandardScaler()

In [172]:
X1 = ss1.transform(X)

In [173]:
lcv2 = LassoCV(alphas = alphas)
lcv2.fit(X1, y)
lcv2.score(X1, y)

  model = cd_fast.enet_coordinate_descent_gram(


0.8964588456747216

In [174]:
lcv2.alpha_

800.8504004004003

In [175]:
sum(lcv2.coef_ != 0)

90

In [176]:
X.shape

(2026, 210)

In [177]:
cols_and_coef = pd.DataFrame({ 
    'var': X.columns,
    'coef val': lcv2.coef_
}).set_index('var').sort_values('coef val', ascending=False)

In [178]:
# export fitted lasso coefficients
filepath = '../datasets/interim_files/lasso_fitted_st6.csv'

cols_and_coef.to_csv(filepath)
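
As a rough sketch of how another notebook might consume this file, the coefficient table can be read back with `var` as the index and filtered to the features the lasso kept. The table below is a small made-up stand-in with hypothetical feature names and values, and an in-memory buffer replaces the real file path so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical coefficient table in the same layout as the exported CSV
# (index named 'var', one 'coef val' column). Values are illustrative only.
coefs = pd.DataFrame({
    "var": ["feature_a", "feature_b", "feature_c", "feature_d"],
    "coef val": [2100.0, 1300.0, 0.0, -340.0],
}).set_index("var").sort_values("coef val", ascending=False)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
coefs.to_csv(buffer)
buffer.seek(0)

# Read it back and keep only the features with a non-zero coefficient.
loaded = pd.read_csv(buffer, index_col="var")
selected = loaded.index[loaded["coef val"] != 0].tolist()
print(selected)
```

Note that the CSV stores only the coefficients, not the scaler or the intercept, so it is enough for feature selection but not for reproducing predictions.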

## Apply to test data

In [179]:
# import cleaned test info
filepath = '../datasets/interim_files/test_clean.csv'

testdata = pd.read_csv(filepath)
testdata.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,none,none,none,0,4,2006,WD
1,2718,905108090,90,RL,0.0,9662,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,none,Reg,Lvl,...,0,0,0,none,none,none,0,7,2007,WD
4,625,535105100,20,RL,0.0,9500,Pave,none,IR1,Lvl,...,0,185,0,none,none,none,0,7,2009,WD


In [180]:
# import ordinal features

filepath = '../datasets/interim_files/testdata_updated_ordinal_features.csv'

testdata_ord = pd.read_csv(filepath)

In [181]:
testdata_nom = pd.get_dummies(testdata[nominal_features], columns = nominal_features, drop_first = True)

In [182]:
testdata_cont = testdata[continuous_features].fillna(0)
testdata_dis = testdata[discrete_features]

In [183]:
Xtestdata = pd.concat((testdata_cont, testdata_dis, testdata_ord, testdata_nom), axis = 1)

In [184]:
df_nom.shape

(2026, 154)

In [185]:
testdata_nom.shape

(878, 143)

In [186]:
df_nom.head()

Unnamed: 0,MS SubClass_30,MS SubClass_40,MS SubClass_45,MS SubClass_50,MS SubClass_60,MS SubClass_70,MS SubClass_75,MS SubClass_80,MS SubClass_85,MS SubClass_90,...,Misc Feature_TenC,Misc Feature_none,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
3,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1


In [187]:
testdata_nom.head()

Unnamed: 0,MS SubClass_30,MS SubClass_40,MS SubClass_45,MS SubClass_50,MS SubClass_60,MS SubClass_70,MS SubClass_75,MS SubClass_80,MS SubClass_85,MS SubClass_90,...,Misc Feature_none,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1
2,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


### Because the test data and training data have different dummy columns, I stack the two sets of dummy columns together, separate them again, and fill the null values created by the missing columns with 0. This way the model has the same columns to operate on for the training and the test data.

In [188]:
overall_nom = pd.concat([df_nom,testdata_nom], join = 'outer', axis = 0)
overall_nom.head()

Unnamed: 0,MS SubClass_30,MS SubClass_40,MS SubClass_45,MS SubClass_50,MS SubClass_60,MS SubClass_70,MS SubClass_75,MS SubClass_80,MS SubClass_85,MS SubClass_90,...,Sale Type_Oth,Sale Type_WD,Roof Matl_Metal,Roof Matl_Roll,Exterior 1st_PreCast,Exterior 2nd_Other,Exterior 2nd_PreCast,Mas Vnr Type_CBlock,Heating_GasA,Sale Type_VWD
0,0,0,0,0,1,0,0,0,0,0,...,0,1,,,,,,,,
1,0,0,0,0,1,0,0,0,0,0,...,0,1,,,,,,,,
2,0,0,0,0,0,0,0,0,0,0,...,0,1,,,,,,,,
3,0,0,0,0,1,0,0,0,0,0,...,0,1,,,,,,,,
4,0,0,0,1,0,0,0,0,0,0,...,0,1,,,,,,,,


In [189]:
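# the first len(df_nom) rows are the training rows
df_nom_updated = overall_nom.iloc[:len(df_nom)].fillna(0)

In [190]:
# the remaining rows are the test rows
testdata_nom_updated = overall_nom.iloc[len(df_nom):].fillna(0)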
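
A different way to line up the columns, shown here only as a sketch, is to reindex the test dummies to the training columns. This is not the combine-and-split approach used above: it keeps only the columns seen in training, so a category that appears only in the test data is treated as the baseline instead of getting its own all-zero training column. The tiny frames and the names `train_demo`, `test_demo`, and `test_aligned` are made up for illustration.

```python
import pandas as pd

# Tiny stand-ins for one nominal feature (hypothetical rows).
train_demo = pd.DataFrame({"Sale Type": ["WD", "New", "Con", "CWD"]})
test_demo = pd.DataFrame({"Sale Type": ["WD", "VWD", "New"]})

# Training: drop the first dummy, as in this notebook.
train_dummies = pd.get_dummies(
    train_demo, columns=["Sale Type"], drop_first=True, dtype=int
)

# Test: keep every dummy, because the "first" category in the test data
# may not be the same one that was dropped in training.
test_dummies = pd.get_dummies(test_demo, columns=["Sale Type"], dtype=int)

# Align to the training columns: missing columns are filled with 0 and
# columns unseen in training (here 'Sale Type_VWD') are discarded.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

One thing this sketch makes visible: when `drop_first=True` is applied to each dataset separately, the dropped baseline category can differ between them, which is worth checking whichever alignment method is used.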

## Recalculate LassoCV after updating the columns in the training and test dataframes

In [191]:
X2 = pd.concat((df_cont, df_dis, df_ord, df_nom_updated), axis = 1)
y = df['SalePrice']

In [192]:
ss2 = StandardScaler()
ss2.fit(X2)

StandardScaler()

In [193]:
X3 = ss2.transform(X2)

In [194]:
lcv3 = LassoCV(alphas = alphas)
lcv3.fit(X3, y)
lcv3.score(X3, y)

  model = cd_fast.enet_coordinate_descent_gram(


0.8964588456747216

In [195]:
lcv3.alpha_

800.8504004004003

In [196]:
sum(lcv3.coef_ != 0)

90

In [197]:
var_and_coeff = pd.DataFrame({ 
    'var': X2.columns,
    'coef val': lcv3.coef_
}).set_index('var').sort_values('coef val', ascending=False)

In [198]:
var_and_coeff.head()

var,coef val
Gr Liv Area,20996.217885
Overall Qual,13686.671018
Neighborhood_NridgHt,9452.750353
Neighborhood_StoneBr,6360.891369
Exter Qual,6212.077268


In [199]:
var_and_coeff_nonzero = var_and_coeff[var_and_coeff['coef val'] != 0]

In [200]:
var_and_coeff_nonzero

var,coef val
Gr Liv Area,20996.217885
Overall Qual,13686.671018
Neighborhood_NridgHt,9452.750353
Neighborhood_StoneBr,6360.891369
Exter Qual,6212.077268
...,...
Pool QC,-2213.896430
Bldg Type_Twnhs,-2351.808886
MS SubClass_120,-2452.846176
Bldg Type_TwnhsE,-3436.274597


## Apply similar method to test data

In [201]:
Xtestdata2 = pd.concat((testdata_cont, testdata_dis, testdata_ord, testdata_nom_updated), axis = 1)

In [202]:
Xtestdata2 = ss2.transform(Xtestdata2)

In [203]:
y_pred = lcv3.predict(Xtestdata2)

In [204]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df = y_pred_df.rename(columns={0:'SalePrice'})
y_pred_df.head()

Unnamed: 0,SalePrice
0,122587.316209
1,152608.263643
2,221243.892562
3,105812.428456
4,177367.812003


In [205]:
final_preds = pd.concat([testdata['Id'],y_pred_df], axis=1 )
final_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         878 non-null    int64  
 1   SalePrice  878 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 13.8 KB


In [206]:
#export predictions
filepath = '../datasets/submissions/prediction3.csv'

final_preds.to_csv(filepath, index=False)

## Ridge Model

In [207]:
alphas = np.linspace(0.05,1000, 10)

In [208]:
rcv = RidgeCV(alphas=alphas)
rcv.fit(X_train,y_train)
rcv.score(X_train, y_train), rcv.score(X_test, y_test)

(0.8908486967201146, 0.8956070282561881)

In [209]:
rcv.alpha_

666.6833333333333

In [210]:
rcv.coef_

array([ 1.12565622e+03,  1.32973184e+03,  3.95582186e+03,  3.09019306e+03,
        1.08861557e+03, -5.44002989e+02,  3.01962929e+03,  5.97142671e+03,
        2.88565157e+03,  7.64420864e+00,  7.09127508e+03,  3.81024937e+03,
        1.94181579e+03,  1.01569111e+03,  3.14542360e+02,  5.55026078e+02,
        2.91008635e+03, -1.53027973e+03, -5.50772157e+03,  2.26057274e+03,
        2.41785502e+03,  2.66408516e+03, -3.19032975e+02,  3.40944500e+03,
        1.43485669e+03,  4.43204994e+02, -1.18659935e+03,  4.14943872e+03,
        2.79472340e+03, -9.06508676e+02,  3.50174161e+03, -7.90758790e+02,
       -1.04627818e+02, -4.02027259e+01,  7.87997916e+02, -1.19119180e+03,
        7.60501034e+03,  2.89861685e+03,  4.77744018e+03,  2.38031660e+02,
        3.31462845e+03, -6.01336207e+02,  4.15671129e+03,  2.57497691e+03,
        3.42783382e+02,  1.76801305e+03, -3.87461546e+02,  5.53855779e+03,
        2.08661250e+03,  3.00371618e+03,  1.22598479e+03,  4.63362884e+02,
        3.42537829e+01,  

In [None]:
# similar results to lasso, but ridge keeps many more features, so I am going to ignore it
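
The "many more features" point follows from how the two penalties behave: the lasso's L1 penalty can set coefficients exactly to zero, while the ridge's L2 penalty only shrinks them. The sketch below illustrates this on synthetic data where only three of twenty features carry signal. The data, the alpha values, and the names `X_demo`, `n_lasso`, and `n_ridge` are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, but only the first three affect the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = (
    5.0 * X_demo[:, 0]
    - 4.0 * X_demo[:, 1]
    + 3.0 * X_demo[:, 2]
    + rng.normal(scale=1.0, size=300)
)

X_scaled = StandardScaler().fit_transform(X_demo)

# Same nominal alpha for both models; the penalties are scaled differently,
# so this is a qualitative comparison only.
lasso = Lasso(alpha=0.5).fit(X_scaled, y_demo)
ridge = Ridge(alpha=0.5).fit(X_scaled, y_demo)

# Count coefficients that are not exactly zero, as done for lcv above.
n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))
print(n_lasso, n_ridge)
```

The lasso keeps only a few non-zero coefficients here, while the ridge keeps all twenty, which mirrors the trade-off described in the comment above.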

# Notebook Summary:

In this notebook, I build a feature matrix from all continuous and discrete features, the numerically recoded ordinal features from notebook 4, and dummy columns for all nominal features (dropping the first dummy column for each one).  This gives 210 columns for 2,026 training rows.

On a train/test split, with the scaler fit on the train split only, the lasso model scores an R² of about 0.894 on the train split and 0.898 on the test split.  It selects an alpha of about 901 and keeps 88 non-zero coefficients.

Because these scores are good, I refit the scaler and the lasso model on the entire set of training data (R² of about 0.896, alpha of about 801, 90 non-zero coefficients) and use that model on the test data to create **prediction 3**.  The training and test data do not produce the same dummy columns, so I combine the two sets of dummy columns, separate them again, and replace the resulting null values with 0 so that both datasets have the same columns.

I export the fitted lasso coefficients for use in other notebooks.

The ridge model performs similarly on the train/test split (R² of about 0.891 and 0.896, alpha of about 667) but keeps many more features, so I don't use it to create any predictions with the test data.