# Housing Prices Competition for Kaggle Learn Users
## Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

## Goal

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

## Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
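
As a minimal sketch of the metric described above (the function name `log_rmse` and the sample prices are illustrative assumptions, not part of the competition code), a 10% overestimate costs the same for a cheap house and an expensive one:

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# Hypothetical prices: both predictions are 10% too high.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([500_000], [550_000])
```

Both values equal `log(1.1)`, because the log difference depends only on the ratio between prediction and observation.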

In [182]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_log_error
from sklearn.linear_model import SGDRegressor

In [209]:
def save_submission(X, pred):
    pd.DataFrame({'SalePrice': pred}, index=X['Id']).to_csv('submission.csv')

In [13]:
df = pd.read_csv('data/train.csv')
df.shape

(1460, 81)

In [205]:
df_submission = pd.read_csv('data/test.csv')
df_submission.head()

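| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |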


# Data cleaning
All information about handling the data is provided in 'exploratory_analysis.ipynb'.

Filter unwanted outliers

In [106]:
df_filtered_outliers = df[((df['LotFrontage'] < 250) | df['LotFrontage'].isna()) &
                           (df['LotArea'] < 100000) &
                           (df['MasVnrArea'] < 1200) &
                           (df['BsmtFinSF1'] < 5000) &
                           (df['BsmtFinSF2'] < 1400) &
                           (df['TotalBsmtSF'] < 6000) &
                           (df['1stFlrSF'] < 4000) &
                           (df['2ndFlrSF'] < 2000) &
                           (df['GrLivArea'] < 4500) &
                           (df['BedroomAbvGr'] < 7) &
                           (df['KitchenAbvGr'].isin([1, 2])) &
                           (df['WoodDeckSF'] < 700) &
                           (df['OpenPorchSF'] < 450) &
                           (df['EnclosedPorch'] < 350)]
df_filtered_outliers.shape

(1429, 81)
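
The `| df['LotFrontage'].isna()` clause in the filter matters because a comparison with a missing value evaluates to False, so a column filtered without such a clause also loses its rows with missing values. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column: one normal value, one missing value, one outlier.
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 300.0]})

strict = toy['LotFrontage'] < 250              # NaN < 250 evaluates to False
lenient = strict | toy['LotFrontage'].isna()   # explicitly keep missing values

kept_strict = toy[strict]     # only the row with 60.0
kept_lenient = toy[lenient]   # the rows with 60.0 and NaN
```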

Drop inconsistent rows
1. Drop rows that have 'BsmtCond' but are missing 'BsmtExposure'.
1. Drop rows that have 'BsmtFinType1' but are missing 'BsmtFinType2'.

In [149]:
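# XOR (^) is True when exactly one of the two fields is missing, i.e. the row is inconsistent.
bsmt_exposure_broken = df_filtered_outliers['BsmtExposure'].isna() ^ df_filtered_outliers['BsmtCond'].isna()
bsmt_fin_type_broken = df_filtered_outliers['BsmtFinType2'].isna() ^ df_filtered_outliers['BsmtFinType1'].isna()
drop_mask = ~(bsmt_exposure_broken | bsmt_fin_type_broken)  # True for the rows to keep
df_droped_broken = df_filtered_outliers[drop_mask]
df_droped_broken.shape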

(1427, 81)
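
The XOR test used in the mask can be checked on a tiny example. This is only an illustrative sketch with made-up rows (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BsmtCond':     ['TA', np.nan, 'TA'],
    'BsmtExposure': ['No', np.nan, np.nan],
})

# True only when exactly one of the two fields is missing.
inconsistent = toy['BsmtExposure'].isna() ^ toy['BsmtCond'].isna()
consistent_rows = toy[~inconsistent]
```

Row 0 (both present) and row 1 (both missing, i.e. no basement) are kept; row 2 is dropped.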

Handle missing data and encode categorical data

In [192]:
num_transformer = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value=0)),
                            ('scaler', StandardScaler())])
cat_transformer = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='No')),
                            ('encode', OneHotEncoder(handle_unknown='ignore'))])
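
To illustrate what the categorical transformer does, here is a small sketch on made-up data (the `train` and `test` frames and the 'Dirt' category are hypothetical). Missing values become their own 'No' category, and a category not seen during fitting is encoded as all zeros because of `handle_unknown='ignore'`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipe = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='No')),
                     ('encode', OneHotEncoder(handle_unknown='ignore'))])

train = pd.DataFrame({'Alley': ['Grvl', np.nan, 'Pave']}, dtype=object)
test = pd.DataFrame({'Alley': ['Grvl', 'Dirt']}, dtype=object)  # 'Dirt' is unseen

cat_pipe.fit(train)
categories = cat_pipe.named_steps['encode'].categories_[0].tolist()
encoded = cat_pipe.transform(test).toarray()  # one row per test sample
```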

Generate 'BsmtFinPer' feature.

In [193]:
class BsmtFinPerGenerator(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return pd.DataFrame({'BsmtFinPer': 1 - X['BsmtUnfSF'] / X['TotalBsmtSF']}).fillna(0)

In [194]:
bsmt_generator = Pipeline([('generate', BsmtFinPerGenerator())])
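
'BsmtFinPer' is the finished share of the basement. The sketch below applies the same formula to made-up values (the `toy` frame is hypothetical) to show how a house without a basement is handled:

```python
import pandas as pd

toy = pd.DataFrame({'BsmtUnfSF':   [200, 0, 0],
                    'TotalBsmtSF': [800, 1000, 0]})

# Same formula as BsmtFinPerGenerator.transform.
# No basement gives 0 / 0 = NaN, which fillna(0) turns into 0.
bsmt_fin_per = (1 - toy['BsmtUnfSF'] / toy['TotalBsmtSF']).fillna(0)
```

The three rows give 0.75 (a quarter unfinished), 1.0 (fully finished) and 0.0 (no basement).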

Build final data transformation pipeline.

In [195]:
num_features = ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
                'MasVnrArea', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
                'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageCars',
                'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch', 'MiscVal']
cat_features = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
                'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
                'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
                'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
                'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
                'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
                'SaleCondition']
bsmt_features = ['BsmtUnfSF', 'TotalBsmtSF']

In [196]:
preprocessor = ColumnTransformer([('num', num_transformer, num_features),
                                  ('cat', cat_transformer, cat_features),
                                  ('bsmt', bsmt_generator, bsmt_features)])
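
Note that `ColumnTransformer` drops columns that are not listed in any transformer (its default is `remainder='drop'`), so 'Id' and any other unlisted feature never reach the model. A minimal sketch on made-up data (the `toy` frame and `toy_preprocessor` are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({'Id': [1, 2, 3],
                    'LotArea': [8000.0, 9000.0, 10000.0],
                    'Street': ['Pave', 'Grvl', 'Pave']})

toy_preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['LotArea']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Street']),
])

# 1 scaled column + 2 one-hot columns; 'Id' is not listed and is dropped.
transformed = toy_preprocessor.fit_transform(toy)
n_rows, n_cols = transformed.shape
```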

# Baseline
Build a baseline with SGD regression. The target metric is LogRMSE.

Split the data into train and holdout subsets.

In [197]:
X = df_droped_broken.drop('SalePrice', axis='columns')
y = df_droped_broken['SalePrice']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2)
X_train.shape, X_holdout.shape, y_train.shape, y_holdout.shape

((1141, 80), (286, 80), (1141,), (286,))
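
The split above is drawn at random on every run, so the holdout score will vary between runs. A sketch of a reproducible split, assuming a fixed seed is acceptable (the seed 42 and the toy arrays are arbitrary illustrations):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# random_state fixes the shuffle, so repeated calls return the same split.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_holdout = np.array_equal(split_a[3], split_b[3])
```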

Build the baseline.

In [201]:
model = SGDRegressor()
baseline = Pipeline([('preprocessing', preprocessor),
                     ('model', model)])

In [202]:
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_holdout)
np.sqrt(mean_squared_log_error(y_holdout, baseline_predictions))

0.1491187026399476
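
Note that `mean_squared_log_error` compares `log(1 + x)` values rather than `log(x)`. For prices of this magnitude the two are practically identical, as the following sketch with made-up prices suggests:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical observed and predicted prices.
y_true = np.array([150_000.0, 200_000.0, 320_000.0])
y_pred = np.array([160_000.0, 185_000.0, 300_000.0])

rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))         # log(1 + x)
rmsle_plain = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))  # log(x)
```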

In [210]:
save_submission(df_submission, baseline.predict(df_submission))

# Improvements of the baseline