9/26

This is more code for the Kaggle Ames, Iowa Housing Prices Competition. Last time, I settled on a couple of models for the data that gave good CV error and only a small gap between training and validation error. However, I never managed to get back to my best score in the competition, and the solution behind that score used PCA (it also filtered features using Pearson's correlation coefficient, but further feature removal may not be a great idea after the manual pruning I already did). So, I'm going to combine dimensionality reduction (TruncatedSVD, in the pipeline step named "pca") with my self-made features and LassoCV and random forests.

In [31]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso, LassoCV, ElasticNetCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, learning_curve, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection

import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
import locale
from scipy.stats import randint
locale.setlocale( locale.LC_ALL, '' )

'English_United States.1252'

In [32]:
train_set = pd.read_csv("train.csv")
test_set = pd.read_csv("test.csv")
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [33]:
# dropping features with too many null values
nec_data = train_set.drop("Alley", axis=1)
nec_data = nec_data.drop("PoolQC", axis=1) 
nec_data = nec_data.drop("Fence", axis=1)  
nec_data = nec_data.drop("MiscFeature", axis=1)
nec_data = nec_data.drop("Id", axis=1) # too individual

# making new features
nec_data["has_porch"] = (nec_data["OpenPorchSF"] + nec_data["EnclosedPorch"] + nec_data["3SsnPorch"] + nec_data["ScreenPorch"] > 0)
nec_data["has_deck"] = (nec_data["WoodDeckSF"] > 0)
nec_data["has_pool"] = (nec_data["PoolArea"] > 0)
nec_data["TotalSF"] = nec_data["1stFlrSF"] + nec_data["2ndFlrSF"] + nec_data["TotalBsmtSF"]

# folded into TotalSF above
nec_data = nec_data.drop("1stFlrSF", axis=1)
nec_data = nec_data.drop("2ndFlrSF", axis=1)
nec_data = nec_data.drop("TotalBsmtSF", axis=1)
# not part of TotalSF itself, but it overlaps with the floor areas summed there
nec_data = nec_data.drop("GrLivArea", axis=1)

nec_data = nec_data.drop("WoodDeckSF", axis=1) 

nec_data = nec_data.drop("PoolArea", axis=1) 

nec_data = nec_data.drop("OpenPorchSF", axis=1) 
nec_data = nec_data.drop("EnclosedPorch", axis=1) 
nec_data = nec_data.drop("3SsnPorch", axis=1) 
nec_data = nec_data.drop("ScreenPorch", axis=1) 
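
The same feature engineering has to be repeated on the test set before predicting (the commented-out submission cells further down do part of it by hand). A minimal sketch of wrapping the steps above in one function, so train and test get identical treatment. The name `engineer_features`, the two column lists, and the toy `demo` frame are hypothetical, not part of the original notebook.

```python
import pandas as pd

PORCH_COLS = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
# Columns dropped in the cell above: sparse ones, Id, and those replaced by new features
DROP_COLS = ["Alley", "PoolQC", "Fence", "MiscFeature", "Id",
             "1stFlrSF", "2ndFlrSF", "TotalBsmtSF", "GrLivArea",
             "WoodDeckSF", "PoolArea", *PORCH_COLS]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add the self-made features, then drop their sources."""
    out = df.copy()
    # skipna=False mirrors plain "+" on columns: a missing value yields False
    out["has_porch"] = out[PORCH_COLS].sum(axis=1, skipna=False) > 0
    out["has_deck"] = out["WoodDeckSF"] > 0
    out["has_pool"] = out["PoolArea"] > 0
    out["TotalSF"] = out["1stFlrSF"] + out["2ndFlrSF"] + out["TotalBsmtSF"]
    # errors="ignore" lets the helper run on frames that lack some columns
    return out.drop(columns=DROP_COLS, errors="ignore")


# Toy two-row frame (made-up numbers) just to show the call
demo = pd.DataFrame({
    "Id": [1, 2],
    "OpenPorchSF": [0, 20], "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0], "ScreenPorch": [0, 0],
    "WoodDeckSF": [100, 0], "PoolArea": [0, 0],
    "1stFlrSF": [800, 900], "2ndFlrSF": [0, 700], "TotalBsmtSF": [800, 0],
    "GrLivArea": [800, 1600],
    "LotArea": [8000, 9000],
})
features = engineer_features(demo)
print(features)
```

With a helper like this, the training frame and `test_set` could both be passed through the same call before fitting and predicting.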

In [34]:
text_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown="ignore"))
num_norm_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
preprocess = ColumnTransformer([
    ("text", text_pipeline, make_column_selector(dtype_include=object))],
    remainder=num_norm_pipeline
)

In [35]:
corr_matrix = nec_data.corr(numeric_only=True)
print(corr_matrix["SalePrice"])

MSSubClass     -0.084284
LotFrontage     0.351799
LotArea         0.263843
OverallQual     0.790982
OverallCond    -0.077856
YearBuilt       0.522897
YearRemodAdd    0.507101
MasVnrArea      0.477493
BsmtFinSF1      0.386420
BsmtFinSF2     -0.011378
BsmtUnfSF       0.214479
LowQualFinSF   -0.025606
BsmtFullBath    0.227122
BsmtHalfBath   -0.016844
FullBath        0.560664
HalfBath        0.284108
BedroomAbvGr    0.168213
KitchenAbvGr   -0.135907
TotRmsAbvGrd    0.533723
Fireplaces      0.466929
GarageYrBlt     0.486362
GarageCars      0.640409
GarageArea      0.623431
MiscVal        -0.021190
MoSold          0.046432
YrSold         -0.028923
SalePrice       1.000000
has_porch       0.296678
has_deck        0.297662
has_pool        0.093708
TotalSF         0.782260
Name: SalePrice, dtype: float64
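
The correlation listing above is in column order, which makes the strongest relationships hard to spot. A small sketch of ranking features by the absolute value of their Pearson correlation with the target. The `toy` frame and its numbers are made up for illustration; only the column names are borrowed from the data set.

```python
import pandas as pd

# Made-up values; only the column names come from the housing data
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "TotalSF": [900, 1300, 1500, 2100, 2400],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "KitchenAbvGr": [2, 2, 1, 1, 1],
})

# Pearson correlation of every numeric column with the target, minus the target itself
corr_with_price = toy.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Order by strength of the relationship while keeping the sign visible
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

Sorting on the absolute value keeps strongly negative features near the top instead of burying them at the bottom of the list.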


In [36]:
prices = pd.DataFrame(train_set["SalePrice"].copy())
nec_data = nec_data.drop("SalePrice", axis=1) # labels
ids = test_set["Id"].copy()
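
A note on the reduction step used in the next cell: it is named "pca" but is a `TruncatedSVD`. `OneHotEncoder` returns a SciPy sparse matrix by default, and `TruncatedSVD` works on sparse input directly without centering it first. Whether the full `preprocess` output in this notebook is actually sparse is an assumption here, not something the notebook shows. The sketch below uses random made-up categories purely to illustrate the mechanics.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data: 60 rows, 5 columns, 4 possible labels each
rng = np.random.default_rng(446)
categories = rng.choice(["A", "B", "C", "D"], size=(60, 5))

# One-hot encoding yields a sparse matrix by default
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(categories)
print(sparse.issparse(encoded), encoded.shape)

# TruncatedSVD reduces the sparse matrix to a few dense components
svd = TruncatedSVD(n_components=3, random_state=446)
reduced = svd.fit_transform(encoded)
print(reduced.shape)
print(svd.explained_variance_ratio_.sum())
```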

In [37]:
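# the step is named "pca", but it uses TruncatedSVD for the dimensionality reduction
lin_reg = Pipeline([("pre", preprocess), ("pca", TruncatedSVD(n_components=2)), ("reg", LassoCV(max_iter=10000, random_state=446))])
# pass the target as a 1-D array; fitting on the one-column DataFrame triggers a DataConversionWarning
lin_reg.fit(nec_data, prices.values.ravel())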

housing_predictions = lin_reg.predict(nec_data)
for k in range(10):
    print(f"Real value: {locale.currency(prices.iloc[k].values[0], grouping=True)}, Prediction: {locale.currency(housing_predictions[k].round(2), grouping=True)}")

Real value: $208,500.00, Prediction: $228,918.52
Real value: $181,500.00, Prediction: $171,837.16
Real value: $223,500.00, Prediction: $245,414.08
Real value: $140,000.00, Prediction: $177,068.15
Real value: $250,000.00, Prediction: $301,092.50
Real value: $143,000.00, Prediction: $168,837.83
Real value: $307,000.00, Prediction: $260,258.74
Real value: $200,000.00, Prediction: $231,454.77
Real value: $129,900.00, Prediction: $165,484.99
Real value: $118,000.00, Prediction: $119,390.76



In [38]:
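# np.sqrt keeps this working on newer scikit-learn releases, where the squared=False argument is deprecated
lin_rmse = np.sqrt(mean_squared_error(prices, housing_predictions))
print(lin_rmse)

# a 1-D target avoids a DataConversionWarning on every fold
rmse = -cross_val_score(lin_reg, nec_data, prices.values.ravel(), scoring="neg_root_mean_squared_error", cv=5)
print(np.average(rmse))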

42333.184594367915



42177.78515657047



In [39]:
# test_set["has_porch"] = (test_set["OpenPorchSF"] + test_set["EnclosedPorch"] + test_set["3SsnPorch"] + test_set["ScreenPorch"] > 0)
# test_set["has_deck"] = (test_set["WoodDeckSF"] > 0)
# test_set["has_pool"] = (test_set["PoolArea"] > 0)
# test_set["TotalSF"] = test_set["1stFlrSF"] + test_set["2ndFlrSF"] + test_set["TotalBsmtSF"]

# test_pred = lin_reg.predict(test_set)

# with open('sacreddeer_house_new_submission_5.csv', 'w', newline='') as f:
#     writer = csv.writer(f)
#     writer.writerow(["Id", "SalePrice"])
#     for k in range(len(ids)):
#         writer.writerow([ids[k], test_pred[k]])

In [43]:
randcom = np.array([2, 3, 4, 5, 6, 7, 8, 9])
param_distribs = {"pca__n_components": randcom}
rmd_search = RandomizedSearchCV(
    lin_reg, param_distributions=param_distribs, 
    n_iter=10, cv=5, scoring="neg_root_mean_squared_error", random_state=446
)
rmd_search.fit(nec_data, np.array(prices).ravel())
final_rnd_model = rmd_search.best_estimator_
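# get_params is referenced without parentheses, so this prints the bound method's repr (the whole pipeline);
# rmd_search.best_params_ would report the selected pca__n_components directly
print(final_rnd_model.get_params)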

<bound method Pipeline.get_params of Pipeline(steps=[('pre',
                 ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer',
                                                              SimpleImputer(strategy='median')),
                                                             ('standardscaler',
                                                              StandardScaler())]),
                                   transformers=[('text',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000001A73BE674C0>)])),
            

In [45]:
final_rnd_model.fit(nec_data, np.array(prices).ravel())
rmse2 = -cross_val_score(final_rnd_model, nec_data, np.array(prices).ravel(), scoring="neg_root_mean_squared_error", cv=10)
print(np.average(rmse2))

37252.52656794961
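
The search cells print the pipeline's repr, which does not show at a glance which `n_components` won. A minimal sketch, on synthetic data from `make_regression` rather than the housing set, of reading the winner and the per-candidate scores from a fitted `RandomizedSearchCV`. The `candidates` list and the data sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data (assumption: sizes chosen arbitrarily)
X, y = make_regression(n_samples=120, n_features=12, noise=10.0, random_state=446)

pipe = Pipeline([
    ("pca", TruncatedSVD(n_components=2, random_state=446)),
    ("reg", LassoCV(max_iter=10000, random_state=446)),
])

candidates = [2, 4, 6, 8]
search = RandomizedSearchCV(
    pipe,
    param_distributions={"pca__n_components": candidates},
    n_iter=len(candidates),  # with a plain list, this tries every candidate once
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=446,
)
search.fit(X, y)

# The selected setting and its cross-validated RMSE (sign flipped back to positive)
print(search.best_params_)
print(-search.best_score_)

# Mean CV RMSE for each candidate that was tried
for params, score in zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]):
    print(params["pca__n_components"], round(-score, 2))
```

Since `best_estimator_` is already refit on the full data by default, it can be used for prediction without calling `fit` again.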


In [46]:
# test_set["has_porch"] = (test_set["OpenPorchSF"] + test_set["EnclosedPorch"] + test_set["3SsnPorch"] + test_set["ScreenPorch"] > 0)
# test_set["has_deck"] = (test_set["WoodDeckSF"] > 0)
# test_set["has_pool"] = (test_set["PoolArea"] > 0)
# test_set["TotalSF"] = test_set["1stFlrSF"] + test_set["2ndFlrSF"] + test_set["TotalBsmtSF"]

# test_pred = final_rnd_model.predict(test_set)

# with open('sacreddeer_house_new_submission_6.csv', 'w', newline='') as f:
#     writer = csv.writer(f)
#     writer.writerow(["Id", "SalePrice"])
#     for k in range(len(ids)):
#         writer.writerow([ids[k], test_pred[k]])

In [49]:
randcom2 = np.array([6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
param_distribs2 = {"pca__n_components": randcom2}
rmd_search2 = RandomizedSearchCV(
    lin_reg, param_distributions=param_distribs2, 
    n_iter=10, cv=5, scoring="neg_root_mean_squared_error", random_state=446
)
rmd_search2.fit(nec_data, np.array(prices).ravel())
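# take the best estimator from the second search (the original line read from rmd_search, the first one)
final_rnd_model2 = rmd_search2.best_estimator_
# as above, this prints the bound method's repr; rmd_search2.best_params_ shows the selected value directly
print(final_rnd_model2.get_params)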

<bound method Pipeline.get_params of Pipeline(steps=[('pre',
                 ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer',
                                                              SimpleImputer(strategy='median')),
                                                             ('standardscaler',
                                                              StandardScaler())]),
                                   transformers=[('text',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000001A73BE674C0>)])),
            

SVD with 9 components performed very well, bringing my test error down to 0.189. This is my best score yet; however, I'm still ranked below 3000th place. As I write this, school starts tomorrow, so I'll probably need to come back to this project much later.
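
One thing worth keeping in mind when comparing numbers: the CV errors above are RMSE in dollars, while the 0.189 is a leaderboard score. As far as I understand the competition's evaluation rules, the leaderboard uses RMSE between the logarithm of the predicted price and the logarithm of the actual price, so the two are not on the same scale. A minimal sketch of that metric follows. The function name `rmse_of_logs` is hypothetical, and the sample values are the first three rows of the training-set comparison printed earlier, so the result is not a leaderboard score.

```python
import math


def rmse_of_logs(actual, predicted):
    """Hypothetical helper: RMSE between log(predicted) and log(actual)."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First three real/predicted pairs from the training-set printout earlier
actual = [208500.0, 181500.0, 223500.0]
predicted = [228918.52, 171837.16, 245414.08]

log_rmse = rmse_of_logs(actual, predicted)
print(round(log_rmse, 4))
```

Working on the log scale means a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one, which the dollar RMSE does not capture.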