# Ames Housing Dataset - Stacking with Deep Learning Model - Top 3.5%  

This notebook can achieve a reproducible kaggle accuracy of 0.118x. Uses stacking and blending that includes a deep learning model

### Introduction

1. This notebook is a continuation of my previous kaggle notebook - https://www.kaggle.com/code/murugesann/ames-housingprices-deep-learning-model-top-4 where I have explained the model in depth. That notebook could achieve an accuracy of top 4.5%  with Kaggle score of 0.11946 (submission - nm-dl-final9.csv) <span style="color:blue">without stacking and blending


2. Objective of this notebook is find the effect and role of randomness vis-a-vis stacking and blending. How much does stacking improves the accuracy, over and above the previous notebook's 0.11946?


3. In my previous notebook above, I had noted, based on several trials and experimentations, that the accuracies of machine learning models depends on not only the content of input data but also order of columns. However, by using python hashseed and random state, we can very well reproduce any result in any computer. But if we change the random state (row order) or the order of columns (column order), the accuracies will change. These accuracies have been found to vary leading to Kaggle scores of 0.10 to 0.13. It is a pure random luck whether you get a kaggle score of 0.10 or 0.13 - it depends on random state and the column tranformation order! 
    

4. But how relevant is above kind of accuracy of 0.10 or 0.13 for the real world production dataset? Can we say that the confidence level, with which the sales prices are predicted, will be same BOTH for the model with accruacy of 0.10 and model with accuracy of 0.13?


4. Also, with respect to the methodology of stacking, another important question arises: Given the well established fact that Stacking improves accuracy, How relevant is any improvement through stacking given the fact that I can achieve the maximum score of, say, 0.11 even without stacking? 


5. Does stacking does nothing more than solve the above "random luck phenomenon" by allowing multiple random states to work together through several models, thereby improving accuracy? - 


6. OR How relevant is the accuracy of 0.11 achieved by 'random luck' vs 0.11 achieved through a stable stacking procedure? Does both the models can be used with same confidence level?


7. How confident we can be that our Kaggle score will still be valid for a Ames dataset (2006-2010) of subequent period say 2010-2015 assuming that factors that affect the prices remain same? 

### Model Results

1. XGBoost Cross Validation - Average - 11.15%, Minimum - 10.35% and Kaggle Score for Minimum - 12.35%
2. Deep Learning Model Cross Validation - Average - 11.16%, Minimum - 10.2% 
3. SKlearn Stacking Regressor (without Deep learning Model) - Kaggle Score - 12.02%
4. MLXtend StackingCV Regressor (Without Deep Learning Model) - Kaggle Score - 11.96%
5. MLXtend StackingCV Regressor (With deep learning model) - Kaggle Score - 11.98%
6. Stacking and Blending along with Deep Learning Model - Kaggle Score - 11.95%

**An important observation with respect to Stacking as a methodology - As like cross validation of any machine learning model, the cross validation of stacked regression also gives a wide range of 10% to 11x% implying that the accuracy gained through stacking is minimal and that it is also dependent on random state and input data order**

## Conclusions & Insights

1. The cross validation accuracies for XGBoost and Deep Learning are nearly same. The range of accuracies obtained in cross validation are the examples of the range we may get on accuracies depending on random state or order of input data etc

2. The accuracies obtained through Stacking is higher than the accuracy obtained for basic model without stacking, a given random state.

3. The accuracy obtained through stacking is still within the range of cross validation scores of basic model. This means a better basic model (depending on random state) can achieve higher accuracy than stacked model

4. Best accuracy obtainable will be when the model is stacked with a random state / hashseed that gives lowest accuracy. For example, the best accuracy possible is near 10% as per cross validation scores. We don't know the input data / content oder for that particular cross validated data (though it can be saved and reused). If we use that input data and same random state numbers, and then use stacking, we will get the best possible accuracy.

5. However, we cannot conclude which of the model should be used in production: whether the model with the lowest accuracy obtained without stacking (11.9) vs model with lowest accuracy obtained with stacking (11.8)? 

6. It may even be that one of the above two models might be overfitting the current kaggle test data and may fail in regularizing the actual production data. In fact, most of the competition models with large number of models stacked together giving high accuracy might be overfitting the given test data

7. In otherwords, each accuracy of particular random state is equivalent to a particular sample in overall population. That sample may or may not reflect the overall population. The field of statistics helps decide which of the sample / model might work better in production by giving confidence level based on standard error (IN deep learning section, I have mentioned that we don't need to bother about general assumptions behind linear regression like normality, homoskedacity etc. They are however applicable when we want to do inferential statistics

8. We do all these model design and analyses to get better accuracies, in order to better the data science models.  In reality, no housing price can be predicted with any such accuracy, we are fighting here for! The prices are not determined solely by these quantitative factors and several other subjective factors might determine the price which are applicable for individual purchase and the range of prices could be wide. Hence, we can conclude that any of the above models with reasonable accuracy should be fine rather than striving for lowest accuracy which may neither be achievable in real world real estate market nor be applicable for production data

## Import Libraries and Data

In [1]:
!pip3 install --upgrade tensorflow

Collecting tensorflow
  Downloading tensorflow-2.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m588.3/588.3 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m42.1 MB/s[0m eta [36m0:00:00[0m
Collecting tensorflow-io-gcs-filesystem>=0.23.1
  Downloading tensorflow_io_gcs_filesystem-0.29.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m75.5 MB/s[0m eta [36m0:00:00[0m
Collecting protobuf<3.20,>=3.9.2
  Downloading protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m40.2 MB/s[0m eta [36m0:00:00[0m

In [2]:
!pip3 install scikeras[tensorflow]

Collecting scikeras[tensorflow]
  Downloading scikeras-0.10.0-py3-none-any.whl (27 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.10.0
[0m

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LinearRegression
import math
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
import seaborn as sns

import xgboost as xg 
import pickle
import category_encoders as ce
from scipy.special import boxcox1p

In [4]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from keras.models import load_model
import absl.logging
import logging

2023-01-18 02:03:34.317445: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-18 02:03:35.358286: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-01-18 02:03:35.358538: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Co

In [5]:
pip show tensorflow

Name: tensorflow
Version: 2.11.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: explainable-ai-sdk, tensorflow-cloud, tensorflow-decision-forests, tensorflow-io, tensorflow-serving-api, tensorflow-transform, tfx-bsl, witwidget
Note: you may need to restart the kernel to use updated packages.


In [6]:
from scikeras.wrappers import KerasRegressor
from mlxtend.regressor import StackingCVRegressor

In [7]:
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV


from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
 
import lightgbm as lgb

## Pre-processing

Random numbers play a significant role in machine learning. If we need to get same results, everytime we run the functions,  and also everytime we open the Jupyter Notebook, we need to use the same set of random numbers. In order to facilitate this, we need to set the random seed number. The seed number used here 1123 shall be used everywhere random states are applicable.

In [8]:
# I am using same random_state of "1123" all through the notebook here. However, each of the model 
# can have different random_states. As long as we use the respective random_state number in each model, we
# will be able to reproduce the accuracies exactly (the accuracies in commented are obtained with different 
# random_state numbers in my laptop) 

from numpy.random import seed
seed(1123) 

In [9]:
import random as pyrandom
pyrandom.seed(1123)

In [10]:
# If you want to reproduce exact accuracies as given by this Kaggle notebook, the hash for "123" should be 
# same as the outout of this cell. Else set "pythonhashseed" of "1234" in kernel.json file to get the accurcies 
# given in the comments of respective submissions

print(hash("123"))
# -1713696605291986919

-6436572149221465873


**The following section on pre-processing of training data is same as that given in my previous detailed notebook. However, I have incorporated the deletion of outliers as part of the preprocessing itself in this notebok. The outliers were as found in the previous notebook**

#### Listing of different column types

In [11]:
train=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

idx=np.where(train[train.columns].count()<train.shape[0])
nullcol=train.columns[idx].values.tolist()
colcat1=train.select_dtypes(include='O').columns.tolist()
colint1=train.select_dtypes(exclude='O').columns.tolist()
nullcat=set(colcat1).intersection(nullcol)

print("The categorical feature columns with null values are :","\n",nullcat,'\n')

nullint=set(nullcol).intersection(colint1)

print("The numerical feature columns with null values are :","\n",nullint,'\n')

The categorical feature columns with null values are : 
 {'BsmtExposure', 'MiscFeature', 'GarageQual', 'GarageFinish', 'BsmtCond', 'BsmtFinType2', 'MasVnrType', 'BsmtFinType1', 'GarageType', 'Fence', 'Electrical', 'PoolQC', 'Alley', 'BsmtQual', 'GarageCond', 'FireplaceQu'} 

The numerical feature columns with null values are : 
 {'LotFrontage', 'MasVnrArea', 'GarageYrBlt'} 



#### Imputation of Numerical Variables

There are two numeric features that can be imputed by regressing with other features viz., LotFrontage and MasVnrArea. LotFrontage is related to LotArea. A regression of LotFrontage with LotArea shows a linear relationship. Hence, the missing values of LotFrontage are imputed through regression.

While regressing the values, the outliers are removed before fitting the model. To facilitate that another copy of training data is used.

In [12]:
train1=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")  # Another copy of training data allows dropping of outliers

#Detected outliers in LotArea and LotFrontage are removed - this analysis is not part of this notebook

train1.drop(train1[train1.LotArea>200000].index,inplace=True)
train1.drop(train1[(train1.LotFrontage>300)].index,inplace=True)

idx2=train1[(train1.LotFrontage>100) & (train1.SalePrice>700000)].loc[:,['LotFrontage','SalePrice']].index 
idx3=train1[(train1.LotFrontage>150) & (train1.SalePrice<100000)].loc[:,['LotFrontage','SalePrice']].index 
 
train1.drop(index=idx2,inplace=True)
train1.drop(index=idx3,inplace=True)


lm1=LinearRegression()
x=train1[['LotArea']].copy()
x['LotFrontage']=train1['LotFrontage'].copy()

x.dropna(inplace=True)

xt=x.iloc[:,:-1].values
yt=x['LotFrontage'].values
lm1.fit(xt,yt)

import math 
for item in train[['LotFrontage','LotArea']].itertuples(): 
#     print(item)

    if math.isnan(item[1]):
#         print(np.array([[ train.iloc[item[0]].loc['LotArea'],train.iloc[item[0]].loc['SalePrice'] ]]))
        value=np.round(np.squeeze(lm1.predict(np.array([[ train.iloc[item[0]].loc['LotArea']]] ))).tolist())
        train.loc[item[0],'LotFrontage']=value
        
import warnings
warnings.filterwarnings('ignore')


The second feature/variable that can be imputed using regression is MasVnrArea. 

An analysis of variable features shows that MasVnrArea is linearly related to GrLivArea. Hence, missing values of MasVnrArea are imputed by regressing against GrLivArea.

Based on analysis of the columns, as mentioned before, some preprocessing of the column is done before regression

In [13]:
train.MasVnrType=train.MasVnrType.fillna('Stone')

train.loc[(train.MasVnrType=='None') & (train.MasVnrArea>0),'MasVnrType']='BrkFace'

idx1=train1[train1.MasVnrArea>1400].index
idx2=train1[(train1.MasVnrArea>700) & (train1.SalePrice>600000)].index 
train1.drop(idx1,inplace=True)
train1.drop(idx2,inplace=True)
train1.drop(train1[train1.MasVnrArea==0].index,inplace=True)

train1.drop(train1[train1.GrLivArea>4000].index,inplace=True)

from sklearn.linear_model import LinearRegression
lmmv=LinearRegression()
x=train1[['GrLivArea']].copy()
x['MasVnrArea']=train1['MasVnrArea'].copy()

x.dropna(inplace=True)

xt=x.iloc[:,:-1].values
yt=x['MasVnrArea'].values
lmmv.fit(xt,yt)

import math 
for item in train[['MasVnrArea','GrLivArea']].itertuples(): 

    if math.isnan(item[1]):
        
        value=np.round(np.squeeze(lmmv.predict(np.array([[ train.iloc[item[0]].loc['GrLivArea']]] ))).tolist())
        train.loc[item[0],'MasVnrArea']=value
        
import warnings
warnings.filterwarnings('ignore')


##### Missing Values of Object Columns

* The missing values of object columns could indicate either the data is not available or the value is not applicable for the feature. 


* For most of the object columns, the missing values imply that the value is not relevant. For example, if there are no garages, then all related fields will be None.  


* Most of the object columns are thus filled with 'None' value. The elaborate study of the columns enabled identification of specific values for some columns. For example, in the case of 'Electrical' feature, an analysis of the values shows that the likely missing value is 'skbr'


* The GarageYearBuilt will be zero where there are no garages and hence can be filled with zero values


* The MSSubClass feature is a categorical variable but has numerical values. Hence, it is converted into string type 

In [14]:
# Drop the "Id" column as index is sufficient
train.drop(columns=['Id'],inplace=True)     

# Object columns
train['Alley'].fillna('None',inplace=True)
train['BsmtQual'].fillna('None',inplace=True)
train['BsmtCond'].fillna('None',inplace=True)
train['BsmtExposure'].fillna('None',inplace=True)
train['BsmtFinType1'].fillna('None',inplace=True)
train['BsmtFinType2'].fillna('None',inplace=True)
train['Electrical'].fillna('SBrkr',inplace=True)
train['FireplaceQu'].fillna('None',inplace=True)
train['GarageType'].fillna('None',inplace=True)
train['GarageFinish'].fillna('None',inplace=True)
train['GarageQual'].fillna('None',inplace=True)
train['GarageCond'].fillna('None',inplace=True)
train['PoolQC'].fillna('None',inplace=True)
train['Fence'].fillna('None',inplace=True)
train['MiscFeature'].fillna('None',inplace=True)

 
# GarageYrBlt is not relevant for houses where there are no garages. 
#This column will also be replaced later with garage age. Hence, it is filled with zero values

train['GarageYrBlt'].fillna(0,inplace=True)
 
# Type Conversion - MSSubClass is not a numerical feature
train['MSSubClass']=train['MSSubClass'].astype(str)


# Dropping these outliers did not improve the Kaggle accuracy - later found several such outliers in test data also!!
    # train.drop(train[train.BedroomAbvGr==8].index,inplace=True)
    # train.drop(index=train[train.GarageCars==4].index,inplace=True)
    # idx1=train[train.MasVnrArea>1400].index
    # idx2=train[(train.MasVnrArea>700) & (train.SalePrice>600000)].index 
    # train.drop(idx1,inplace=True)
    # train.drop(idx2,inplace=True)  
#     train.drop(train[train.GrLivArea>4000].index,inplace=True) 
    
# Processing based on test data - while processing test data, several other missing values and anamolies are found
# These are incorporated for training data also:
train.loc[train[train.GarageCars.isna()].index,"GarageType"]='None'
train.YrSold=np.where(train.YrSold<train.YearBuilt,train.YearBuilt,train.YrSold)
train.YearRemodAdd=np.where(train.YrSold<train.YearRemodAdd,train.YrSold,train.YearRemodAdd)
train.GarageYrBlt=np.where(train.GarageYrBlt>train.YrSold,train.YearBuilt,train.GarageYrBlt)


print("All null columns processed ",sum(train.isna().sum())==0)

All null columns processed  True


### Feature Selection and Engineering

* Do we need to select only important features? With just 80 columns, we can afford to use all the columns for training. In the case of deep learning, superfluous and redundant columns does not affect accuracy

*** Should we bother about the following?**

 - Linear relationship?
 - Multivariate normality?
 - Multicollinearity?
 - Auto-correlation?
 - Homoscedasticity?
 
 Thankfully, for Deep Learning no such assumptions are relevant. Hence, no such analysis has been done. 
 
**Note: As mentioned in the first section of this notebook, we don't do any statistical inference analysis about suitability or statistical significance of the sample input data with respect to production data. Hence, we don't need to bother about the above assumptions. The accuracies obtained will be valid irrespective of the applicability of the above assumptions. However, we cannot set any confidence level for the predictions if above assumptions are not applicable**
 
 
 * Should we bother about skewed columns?
 
     - Transformation of skewed columns slightly improves the accuracy of the model. Hence, boxcox transformation is done for skewed columns
     - It was found that the log transformation of sale price, for the deep learning model, does not improve accuracy much and hence not done - this could also be because I was anyway using Keras' meansquaredlogarithmicerror as loss function, though predicted prices will not be log transformed. However, log transformation of target variable does bring in some additional accuracy in the case of sklearn models.

#### Ranking of features of importance

* A rough idea of important features and their ranking can be obtained through sklearn.feature_selection.RFE function. 

In [15]:
# Study of features in terms of ranking of thier importance
train1=train.copy()
y=train['SalePrice'] 
train1.drop(columns='SalePrice',inplace=True)
estimator = LinearRegression()
rfe = RFE(estimator, n_features_to_select=1, step=1)
selector = rfe.fit(train1.select_dtypes(exclude='object'), y)
selectedFeatures = list(train1.select_dtypes(exclude='object').columns[selector.support_])
print(selectedFeatures,'\n',selector.ranking_)
rank=selector.ranking_
# selector.support_
topcolorder=[x for _, x in sorted(zip(rank, train1.select_dtypes(exclude='O').columns.values))]
topcolorder

['OverallQual'] 
 [13 32  1 11 14 15 19 29 33 35 24 17 18 25 23  3  7  2 10  9  6  8  5 26
  4 30 21 31 28 22 16 20 34 27 12]


['OverallQual',
 'FullBath',
 'BsmtFullBath',
 'GarageCars',
 'Fireplaces',
 'KitchenAbvGr',
 'BsmtHalfBath',
 'TotRmsAbvGrd',
 'BedroomAbvGr',
 'HalfBath',
 'OverallCond',
 'YrSold',
 'LotFrontage',
 'YearBuilt',
 'YearRemodAdd',
 'ScreenPorch',
 '1stFlrSF',
 '2ndFlrSF',
 'MasVnrArea',
 'PoolArea',
 'WoodDeckSF',
 '3SsnPorch',
 'GrLivArea',
 'TotalBsmtSF',
 'LowQualFinSF',
 'GarageYrBlt',
 'MoSold',
 'EnclosedPorch',
 'BsmtFinSF1',
 'GarageArea',
 'OpenPorchSF',
 'LotArea',
 'BsmtFinSF2',
 'MiscVal',
 'BsmtUnfSF']

* The above study shows that "OverallQual","FullBath" and "GarageCars" are three most important features in determining the sale price. 


* The columns like "BsmtFinSF2", "MiscVal","BsmtUnfSF" etc are least important features. If these are negatively correlated, then we can remove these columns as "Data Anomaly"


* We can now check the correlation of various columns with sale price

The correlation of various features with saleprice can be obtained using pd.corr function

In [16]:
train.corr().sort_values(by='SalePrice',ascending=False)

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
SalePrice,0.317984,0.263843,0.790982,-0.077856,0.522897,0.507118,0.478273,0.38642,-0.011378,0.214479,...,0.324413,0.315856,-0.128578,0.044584,0.111447,0.092404,-0.02119,0.046432,-0.028923,1.0
OverallQual,0.180773,0.105806,1.0,-0.091932,0.572323,0.55061,0.413362,0.239666,-0.059119,0.308159,...,0.238923,0.308819,-0.113937,0.030371,0.064886,0.065166,-0.031406,0.070815,-0.027347,0.790982
GrLivArea,0.350597,0.263116,0.593007,-0.079686,0.19901,0.287199,0.392902,0.208171,-0.00964,0.240257,...,0.247433,0.330224,0.009113,0.020643,0.10151,0.170205,-0.002416,0.05024,-0.036526,0.708624
GarageCars,0.245384,0.154871,0.600671,-0.185758,0.53785,0.420583,0.365436,0.224054,-0.038264,0.214175,...,0.226342,0.213569,-0.151434,0.035765,0.050494,0.020934,-0.04308,0.040522,-0.039117,0.640409
GarageArea,0.293183,0.180403,0.562022,-0.151521,0.478954,0.37155,0.37382,0.29697,-0.018227,0.183303,...,0.224666,0.241435,-0.121777,0.035087,0.051412,0.061047,-0.0274,0.027974,-0.027378,0.623431
TotalBsmtSF,0.343854,0.260833,0.537808,-0.171098,0.391452,0.290919,0.365767,0.522396,0.10481,0.41536,...,0.232019,0.247264,-0.095478,0.037384,0.084489,0.126053,-0.018479,0.013196,-0.014969,0.613581
1stFlrSF,0.406076,0.299475,0.476224,-0.144203,0.281986,0.240218,0.346115,0.445863,0.097117,0.317987,...,0.235459,0.211671,-0.065292,0.056104,0.088758,0.131525,-0.021096,0.031372,-0.013604,0.605852
FullBath,0.176306,0.126031,0.5506,-0.194149,0.468271,0.438976,0.280255,0.058543,-0.076444,0.288886,...,0.187703,0.259977,-0.115093,0.035353,-0.008106,0.049604,-0.01429,0.055872,-0.019669,0.560664
TotRmsAbvGrd,0.289863,0.190015,0.427452,-0.057583,0.095589,0.191655,0.281377,0.044316,-0.035227,0.250647,...,0.165984,0.234192,0.004151,-0.006683,0.059383,0.083757,0.024763,0.036907,-0.034516,0.533723
YearBuilt,0.08115,0.014228,0.572323,-0.375983,1.0,0.592837,0.317637,0.249503,-0.049107,0.14904,...,0.22488,0.188686,-0.387268,0.031355,-0.050364,0.00495,-0.034383,0.012398,-0.013618,0.522897


#### Addition and Deletion of some features

Based on the above, addition and deletion of features are decided

1. The price of the houses will depend on the age of the house. This information is available through year sold and year built columns. However, it would be better if we could create a separate feature 'Age' of the house and then delete the yearbuilt column. A new feature Age is created.


2. Similarly, RemodelAge and GarageAge are newly created as additional features


3. An analysis of the features showed that presence of Garage is an important factor in determining prices. Hence, it would be better if we could create a GarageFlag that says yes or no about availability of Garage thereby we can delete GarageYearBuilt column which has many (81) missing values


4. Similarly, flags to indicate presence or absence of pools, fence, basement, secondfloor are created as new columns


5. The features like halfbaths can be converted into number of full baths and those columns can also be deleted

It was found that the final model does not predict sale prices above 500000!! An EDA analysis shows that the house prices depend on Neighborhood. 

In [17]:
train[train.SalePrice>400000][['Neighborhood','SalePrice']].sort_values('SalePrice',ascending=False)

Unnamed: 0,Neighborhood,SalePrice
691,NoRidge,755000
1182,NoRidge,745000
1169,NoRidge,625000
898,NridgHt,611657
803,NridgHt,582933
1046,StoneBr,556581
440,NridgHt,555000
769,StoneBr,538000
178,StoneBr,501837
798,NridgHt,485000


Based on the above, a new feature premium flag is introduced to indicate the neighborhood (though however it did not improve accuracy much)

#### New Columns Added

In [18]:
train['Age']=train['YrSold']+train['MoSold']/12-train['YearBuilt']
train['ReModelAge']=train['YrSold']+train['MoSold']/12-train['YearRemodAdd']
train['GarageFlag']=np.where(train.GarageArea!=0,1,0)
train['GarageAge']=np.where(train.GarageArea!=0,train['YrSold']+train['MoSold']/12-train['GarageYrBlt'],0)
train['Baths']=train['FullBath']+0.5*train['HalfBath']
train['BsmtBaths']=train['BsmtFullBath']+0.5*train['BsmtHalfBath']

train['haspool'] = train['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
train['has2ndfloor'] = train['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0) 
train['hasbsmt'] = train['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
train['hasfence'] = train['Fence'].apply(lambda x: 1 if x=="None" else 0)

train['Total_sqr_footage'] = (train['TotalBsmtSF']+train['GrLivArea']+train['LotArea'])
train['Total_porch_sf'] = (train['OpenPorchSF'] + train['3SsnPorch'] +
                              train['EnclosedPorch'] + train['ScreenPorch'])

p=['StoneBr','NrdigHt','NoRidge']

train['PremiumFlag']=train['Neighborhood'].apply(lambda x: 1 if x in p else 0)

We have to change the new features with appropriate dtypes as these will be one-hot encoded

In [19]:
train['haspool']=train['haspool'].astype(str)
train['has2ndfloor']=train['has2ndfloor'].astype(str)
train['hasbsmt']=train['hasbsmt'].astype(str)
train['hasfence']=train['hasfence'].astype(str)
train['PremiumFlag']=train['PremiumFlag'].astype(str)
train["GarageFlag"]=train["GarageFlag"].astype(str)
train["CentralAir"]=train["CentralAir"].astype(str)

#### Columns deleted

A correlation of sales price with various variables show that some of the columns have negative impact on the prediction. This is not a feature characteristic (for example, age will be inversely related) but may have arisen due to data anomaly. The columns identified to impact prediction are: BsmtFinSF2, MiscVal, LowQualFinSF and EncolsedPorch. Also, the columns alike Yearbuilt are superfluous owing to above addition of features. Hence, these columns are deleted

In [20]:
coldel=['BsmtFinSF2','MiscVal','LowQualFinSF','YearBuilt','YrSold','MoSold','GarageYrBlt',
        'EnclosedPorch','BsmtHalfBath']
 
train.drop(columns=coldel,inplace=True)

print(len(coldel)," columns deleted")

9  columns deleted


We can check again the ranking of features and verify that everthing is in order

In [21]:
# STudy of features in terms of ranking of thier importance
train1=train.copy()
y=train['SalePrice'] 
train1.drop(columns='SalePrice',inplace=True)
estimator = LinearRegression()
rfe = RFE(estimator, n_features_to_select=1, step=1)
selector = rfe.fit(train1.select_dtypes(exclude='object'), y)
selectedFeatures = list(train1.select_dtypes(exclude='object').columns[selector.support_])
print(selectedFeatures,'\n',selector.ranking_)
rank=selector.ranking_
# selector.support_
topcolorder=[x for _, x in sorted(zip(rank, train1.select_dtypes(exclude='O').columns.values))]
topcolorder

['Baths'] 
 [29 25  4 11 20 19 28 32 30 15 16 31 12  2  3 10  8  9  7  6 33 22 26 21
 17 23 13 18 14  1  5 24 27]


['Baths',
 'FullBath',
 'HalfBath',
 'OverallQual',
 'BsmtBaths',
 'GarageCars',
 'Fireplaces',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'BedroomAbvGr',
 'OverallCond',
 'BsmtFullBath',
 'Age',
 'GarageAge',
 '1stFlrSF',
 '2ndFlrSF',
 'ScreenPorch',
 'ReModelAge',
 'MasVnrArea',
 'YearRemodAdd',
 '3SsnPorch',
 'WoodDeckSF',
 'PoolArea',
 'Total_sqr_footage',
 'LotArea',
 'OpenPorchSF',
 'Total_porch_sf',
 'BsmtFinSF1',
 'LotFrontage',
 'TotalBsmtSF',
 'GrLivArea',
 'BsmtUnfSF',
 'GarageArea']

In [22]:
train.corr().sort_values(by='SalePrice',ascending=False)

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,1stFlrSF,...,ScreenPorch,PoolArea,SalePrice,Age,ReModelAge,GarageAge,Baths,BsmtBaths,Total_sqr_footage,Total_porch_sf
SalePrice,0.317984,0.263843,0.790982,-0.077856,0.507118,0.478273,0.38642,0.214479,0.613581,0.605852,...,0.111447,0.092404,1.0,-0.523063,-0.50873,-0.387809,0.597966,0.224953,0.319082,0.195739
OverallQual,0.180773,0.105806,1.0,-0.091932,0.55061,0.413362,0.239666,0.308159,0.537808,0.476224,...,0.064886,0.065166,0.790982,-0.572166,-0.551877,-0.426322,0.585038,0.10264,0.156239,0.171172
GrLivArea,0.350597,0.263116,0.593007,-0.079686,0.287199,0.392902,0.208171,0.240257,0.454868,0.566024,...,0.10151,0.170205,0.708624,-0.199951,-0.289148,-0.173501,0.710087,0.030717,0.326509,0.272853
GarageCars,0.245384,0.154871,0.600671,-0.185758,0.420583,0.365436,0.224054,0.214175,0.434585,0.439317,...,0.050494,0.020934,0.640409,-0.538486,-0.422872,-0.308172,0.493479,0.128046,0.193102,0.083265
GarageArea,0.293183,0.180403,0.562022,-0.151521,0.37155,0.37382,0.29697,0.183303,0.486665,0.489782,...,0.051412,0.061047,0.623431,-0.479253,-0.373195,-0.311985,0.416037,0.174871,0.220247,0.118346
TotalBsmtSF,0.343854,0.260833,0.537808,-0.171098,0.290919,0.365767,0.522396,0.41536,1.0,0.81953,...,0.084489,0.126053,0.613581,-0.391443,-0.291886,-0.256237,0.261114,0.309627,0.31968,0.155471
1stFlrSF,0.406076,0.299475,0.476224,-0.144203,0.240218,0.346115,0.445863,0.317987,0.81953,1.0,...,0.088758,0.131525,0.605852,-0.281941,-0.240874,-0.176382,0.2823,0.246994,0.355234,0.158072
Baths,0.16218,0.114805,0.585038,-0.192197,0.452705,0.32294,0.052395,0.233977,0.261114,0.2823,...,0.021611,0.051815,0.597966,-0.501413,-0.453804,-0.424965,1.0,-0.0807,0.159158,0.131201
FullBath,0.176306,0.126031,0.5506,-0.194149,0.438976,0.280255,0.058543,0.288886,0.323722,0.380637,...,-0.008106,0.049604,0.560664,-0.468039,-0.439855,-0.419795,0.920116,-0.077647,0.168651,0.102435
TotRmsAbvGrd,0.289863,0.190015,0.427452,-0.057583,0.191655,0.281377,0.044316,0.250647,0.285573,0.409516,...,0.059383,0.083757,0.533723,-0.096691,-0.193571,-0.112686,0.616319,-0.059208,0.23925,0.179536


#### Ordinal Encoding

* Some of the numerical features gives ratings of quality in ordinal scale. 


* These variables should be ordinal encoded instead of categorical one-hot encoded. For example, a categorical feature's presence or absence affects the predictions but for the ordinal scaled variable it is the degree of the presence of values.


* In the usual sklearn.OrdinalScale function, it is not possible to give the order of the values - which is necessary for any ordinal encoding. However, this is possible with another sklearn library. We can use thecategory_encoders library of sklearn


* The mapping of the values have to be done manually and then provided for encoding

In [23]:
ordlabels=set(["ExterCond","ExterQual","BsmtQual","BsmtCond", "BsmtExposure","HeatingQC", 
                "FireplaceQu", "KitchenQual","GarageQual", "GarageCond", "PoolQC", 
                "OverallQual","OverallCond"           
              ])

# "GarageFlag",'haspool','has2ndfloor','hasbsmt','hasfence','PremiumFlag'

In [24]:
import category_encoders as ce

ordinal_mappings = {
    
    "ExterCond": ['None','Po','Fa','TA','Gd','Ex'],      
    "ExterQual": ['None','Po','Fa','TA','Gd','Ex'],     
    "BsmtQual": ['None','Po','Fa','TA','Gd','Ex'], 
    "BsmtCond": ['None','Po','Fa','TA','Gd','Ex'], 
    "BsmtExposure": ['None','No','Mn','Av','Gd'], 
    "HeatingQC": ['None','Po','Fa','TA','Gd','Ex'],     
    "FireplaceQu": ['None','Po','Fa','TA','Gd','Ex'], 
    "KitchenQual": ['None','Po','Fa','TA','Gd','Ex'], 
    "GarageQual": ['None','Po','Fa','TA','Gd','Ex'], 
    "GarageCond": ['None','Po','Fa','TA','Gd','Ex'], 
    "PoolQC": ['None','Fa','TA','Gd','Ex'],   
    "OverallCond": ['None',1,2,3,4,5,6,7,8,9,10],
    "OverallQual": ['None',1,2,3,4,5,6,7,8,9,10]
}

ce_ordinal_mappings = []
for col, unique_values in ordinal_mappings.items():
    local_mapping = {val:idx for idx, val in enumerate(unique_values)}
    ce_ordinal_mappings.append({"col":col, "mapping":local_mapping})
    
y=train['SalePrice']
train=train.drop('SalePrice',axis=1)

encoder = ce.OrdinalEncoder(cols=list(ordlabels),mapping=ce_ordinal_mappings, return_df=True)
encoder.fit(train)
train=encoder.transform(train)

#Add back the sale price column which was dropped before encoding
train['SalePrice']=y

#### Checking the encoding of variables

Total columns are original columns + columns added - columns deleted = 80-13-9=84

In [25]:
train.shape[1]

84

Total columns consists of ordinal encoded columns, numerical columns to be normalized and columns to be one-hot encoded

In [26]:
colint=set(train.select_dtypes(exclude='O').columns).difference(ordlabels)
colint.remove('SalePrice')
catlabels=set(train.select_dtypes(include='O').columns).difference(ordlabels)
len(ordlabels)+len(colint)+len(catlabels)==train.shape[1]-1

True

Total number of features other than target variable should be 83

In [27]:
normlabels=set(colint)

print('Features to be Ordinal Encoded: ',len(ordlabels),'\n','Features to be onehot encoded: ',len(catlabels),'\n',
      'Features to be normalized: ',len(normlabels),'\n','Features deleted: ',len(coldel),'\n',
      'Total Number of Features: ',(len(catlabels)+len(ordlabels)+len(normlabels)),'\n',
       
      'Sanity check: Target Variabe exlcuded: ',(92-len(catlabels)-len(ordlabels)-len(normlabels)
                                   -len(coldel))==0) 

Features to be Ordinal Encoded:  13 
 Features to be onehot encoded:  39 
 Features to be normalized:  31 
 Features deleted:  9 
 Total Number of Features:  83 
 Sanity check: Target Variabe exlcuded:  True


#### Transformation of Skewed Variables

In [28]:
skewness = train.skew().sort_values(ascending=False)
skewness = skewness[abs(skewness) > 0.75]

train[skewness.index.values.tolist()].info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PoolQC             1460 non-null   int64  
 1   PoolArea           1460 non-null   int64  
 2   haspool            1460 non-null   object 
 3   LotArea            1460 non-null   int64  
 4   Total_sqr_footage  1460 non-null   int64  
 5   3SsnPorch          1460 non-null   int64  
 6   KitchenAbvGr       1460 non-null   int64  
 7   PremiumFlag        1460 non-null   object 
 8   LotFrontage        1460 non-null   float64
 9   ScreenPorch        1460 non-null   int64  
 10  MasVnrArea         1460 non-null   float64
 11  OpenPorchSF        1460 non-null   int64  
 12  Total_porch_sf     1460 non-null   int64  
 13  SalePrice          1460 non-null   int64  
 14  BsmtFinSF1         1460 non-null   int64  
 15  WoodDeckSF         1460 non-null   int64  
 16  TotalBsmtSF        1460 

* Transformation of skewed variables results only in a very slight improvement in accuracy. I am transforming only the continuous numerical variables and excluding the discrete numerical variables and ordinal scaled variables.  

In [29]:
skewness = skewness[abs(skewness) > 0.5]
skewed_cols = skewness.index

rv=['MSSubClass','SalePrice','Age','BsmtBaths','BsmtFullBath','Fireplaces','GarageAge','KitchenAbvGr','ReModelAge',
   'TotRmsAbvGrd',"GarageFlag",'haspool','has2ndfloor','hasbsmt','hasfence','PremiumFlag','CentralAir']

sk_train=set(skewed_cols).difference(set(rv)).difference(set(ordlabels))

lamb = 0.15
for cols in sk_train:  
    train[cols] = boxcox1p(train[cols], lamb)
    

# Transforming with log gives slightly lesser accuracy    
# for columns in sk_train:
# #     print(np.log(1+train[columns]))
#     col=str(columns)
#     train[col] = np.log(1 + train[col])

We need to do transformation for the same set of variables in set data also. Hence, it is necessary to save the list of columns transformed

In [30]:
sk_train

{'1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'GrLivArea',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'PoolArea',
 'ScreenPorch',
 'TotalBsmtSF',
 'Total_porch_sf',
 'Total_sqr_footage',
 'WoodDeckSF'}

Several experimentations and trials are done. It is better to save the output of processed dataset which can be reloaded when required instead of running all cells again

In [31]:
train.to_csv("train2.csv")

#### Final Encoding and Transformation

In [32]:
# The outliers are detected based on regression - refer my previous notebook. 
# The values are directly being used here for the sake of simplicity
outliers=[ 523,  691,  803,  898, 1169, 1182, 1298, 1324]
train.drop(outliers, inplace=True)

In [33]:
y=train['SalePrice']
X=train.drop(columns='SalePrice')
train.shape,X.shape

((1452, 84), (1452, 83))

In [34]:
#Sanity check
len(catlabels)+len(ordlabels)+len(colint)==train.shape[1]-1

True

Normalizing the variables are important for deep learning models. Though the robustscaler results in better performance in the presence of skewed dataset, I found standard scaler to give better accuracy. Also, I chose not to scale the ordinal labels

In [35]:
# Note that here in this notebook, X_train is after deletion of outliers 
# Variables other than ordinal encoded and one-hot-encoded are normalized using Standard Scaler Function

ct=ColumnTransformer([('ohem1',OneHotEncoder(sparse=False,handle_unknown='ignore'),list(catlabels)),                                    
                      ('rbsm1',StandardScaler(),list(normlabels))
                     ], remainder='passthrough')
ct.fit(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,shuffle=True,random_state=1123)
X_train,X_test=ct.transform(X_train),ct.transform(X_test)

# It is better to save the output - found helpful when running several experimentations
X.to_csv("Xt.csv")
np.savetxt("y.csv",y,delimiter=',')
np.savetxt("X_test1.csv",X_test,delimiter=',')
np.savetxt("y_test1.csv",y_test,delimiter=',')
np.savetxt("X_train1.csv",X_train,delimiter=',')
np.savetxt("y_train1.csv",y_train,delimiter=',')
X1=ct.transform(X)
np.savetxt("x1.csv",X1,delimiter=',') 

X_train.shape

(1161, 281)

### Test data preprocessing script

In [36]:
testdata=pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
test_id=testdata["Id"]
def process_testdata2(file2,coldel,encoder,sk_train,ct):  

    import numpy as np
    import pandas as pd          
    from sklearn.model_selection import train_test_split
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import RobustScaler
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import OrdinalEncoder,StandardScaler,MinMaxScaler
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import LinearRegression
    import math
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import KFold
    
    import category_encoders as ce
    from scipy.special import boxcox1p

    import xgboost as xg 
    import pickle

    test_data1=pd.read_csv(file2)
    test_data=pd.read_csv(file2)

#     test_data1=pd.read_csv('test.csv')
#     test_data=pd.read_csv('test.csv')

    test_data1.drop(test_data1[test_data1.LotArea>200000].index,inplace=True)
    test_data1.drop(test_data1[(test_data1.LotFrontage>300)].index,inplace=True)


    from sklearn.linear_model import LinearRegression
    lm2=LinearRegression()
    x=test_data1[['LotArea']].copy()
    x['LotFrontage']=test_data1['LotFrontage'].copy()

    x.dropna(inplace=True)

    xt=x.iloc[:,:-1]
    yt=x['LotFrontage']
    lm2.fit(xt,yt)

    import math 
    for item in test_data[['LotFrontage','LotArea']].itertuples(): 
    #     print(item)

        if math.isnan(item[1]):
    #         print(np.array([[ test_data.iloc[item[0]].loc['LotArea'],test_data.iloc[item[0]].loc['SalePrice'] ]]))
            value=np.round(np.squeeze(lm2.predict(np.array([[ test_data.iloc[item[0]].loc['LotArea']]] ))).tolist())
            test_data.loc[item[0],'LotFrontage']=value

    import warnings
    warnings.filterwarnings('ignore') 


    test_data1.MasVnrType=test_data1.MasVnrType.fillna('Stone')
    test_data1.loc[(test_data1.MasVnrType=='None') & (test_data1.MasVnrArea>0),'MasVnrType']='BrkFace'

    idx1=test_data1[test_data1.MasVnrArea>1400].index
#     idx2=test_data1[(test_data1.MasVnrArea>700) & (test_data1.SalePrice>600000)].index 
    test_data1.drop(idx1,inplace=True)
#     test_data1.drop(idx2,inplace=True)
    test_data1.drop(test_data1[test_data1.MasVnrArea==0].index,inplace=True)

    test_data1.drop(test_data1[test_data1.GrLivArea>4000].index,inplace=True)

    from sklearn.linear_model import LinearRegression
    lmmvt=LinearRegression()
    x=test_data1[['GrLivArea']].copy()
    x['MasVnrArea']=test_data1['MasVnrArea'].copy()

    x.dropna(inplace=True)

    xt=x.iloc[:,:-1]
    yt=x['MasVnrArea']
    lmmvt.fit(xt,yt)

    import math 
    for item in test_data[['MasVnrArea','GrLivArea']].itertuples(): 
    #     print(item)

        if math.isnan(item[1]):
    #         print(np.array([[ test_data.iloc[item[0]].loc['LotArea'],test_data.iloc[item[0]].loc['SalePrice'] ]]))
            value=np.round(np.squeeze(lmmvt.predict(np.array([[ test_data.iloc[item[0]].loc['GrLivArea']]] ))).tolist())
            test_data.loc[item[0],'MasVnrArea']=value

    import warnings
    warnings.filterwarnings('ignore') 


#     test_id=test_data['Id']
    test_data.drop(columns=['Id'],inplace=True) 


    # #### 1. Preprocessing of Columns with Null values 

    # Object columns
    test_data['Alley'].fillna('None',inplace=True)
    test_data['BsmtQual'].fillna('None',inplace=True)
    test_data['BsmtCond'].fillna('None',inplace=True)
    test_data['BsmtExposure'].fillna('None',inplace=True) 

    test_data['BsmtFinType1'].fillna('None',inplace=True)
    test_data['BsmtFinType2'].fillna('None',inplace=True)
    test_data['Electrical'].fillna('SBrkr',inplace=True)
    test_data['FireplaceQu'].fillna('None',inplace=True)
    test_data['GarageType'].fillna('None',inplace=True)
    test_data['GarageFinish'].fillna('None',inplace=True)
    test_data['GarageQual'].fillna('None',inplace=True)
    test_data['GarageCond'].fillna('None',inplace=True)

    test_data['PoolQC'].fillna('None',inplace=True)
    test_data.loc[test_data[(test_data.PoolArea>0) & (test_data.PoolQC.isna())].index,'PoolQC']='Fa'

    test_data['Fence'].fillna('None',inplace=True)
    test_data['MiscFeature'].fillna('None',inplace=True)


    # Type Conversion
    test_data['MSSubClass']=test_data['MSSubClass'].astype(str)

    test_data.MasVnrType=test_data.MasVnrType.fillna('Stone')
    test_data.loc[(test_data.MasVnrType=='None') & (test_data.MasVnrArea>0),'MasVnrType']='BrkFace'

    test_data['GarageYrBlt']=test_data['GarageYrBlt'].fillna(0)

    test_data.loc[test_data[test_data.GarageCars.isna()].index,"GarageType"]='None'


    test_data.YrSold=np.where(test_data.YrSold<test_data.YearBuilt,test_data.YearBuilt,test_data.YrSold)
    test_data.YearRemodAdd=np.where(test_data.YrSold<test_data.YearRemodAdd,test_data.YrSold,test_data.YearRemodAdd)
    test_data.GarageYrBlt=np.where(test_data.GarageYrBlt>test_data.YrSold,test_data.YearBuilt,test_data.GarageYrBlt)


    test_data.loc[455,'MSZoning']="C (all)"
    test_data.loc[756,'MSZoning']="C (all)"
    test_data.loc[1444,'MSZoning']="C (all)"
    test_data.loc[790,'MSZoning']="FV"    
    test_data.loc[[455,485],'Utilities']='AllPub'
    test_data.loc[691,'Exterior1st']="Wd Sdng"
    test_data.loc[691,'Exterior2nd']="Wd Sdng"
    test_data.loc[95,'KitchenQual']='TA'
    test_data.loc[[756,1013],'Functional']='Typ'
    test_data.loc[666,'GarageYrBlt']=1910
    test_data['GarageArea']=np.where(test_data.GarageYrBlt==0,0,test_data.GarageArea)
    test_data.loc[1029,'SaleType']='WD' 


    idx=np.where(test_data[test_data.columns].count()<test_data.shape[0])
    nullcol=test_data.columns[idx].values.tolist()
    colcat1=test_data.select_dtypes(include='O').columns.tolist()
    colint1=test_data.select_dtypes(exclude='O').columns.tolist()
    nullcat=set(colcat1).intersection(nullcol)
    for i in nullcat:
        test_data[i].fillna('None',inplace=True)
    nullint=set(nullcol).intersection(colint1)

    for i in nullint:
        test_data[i].fillna(0,inplace=True)
    #     test_data[i].fillna(test_data[i].mean(),inplace=True)

    len(colcat1),len(colint1),len(colcat1)+len(colint1)

    print("Null columns: ",sum(test_data.isna().sum()))
    test_data['Age']=test_data['YrSold']+test_data['MoSold']/12-test_data['YearBuilt']
    test_data['ReModelAge']=test_data['YrSold']+test_data['MoSold']/12-test_data['YearRemodAdd']
    test_data['GarageFlag']=np.where(test_data.GarageArea!=0,1,0)
    test_data['GarageAge']=np.where(test_data.GarageArea!=0,test_data["YrSold"]+test_data["MoSold"]/12-test_data["GarageYrBlt"],0)
    test_data['Baths']=test_data['FullBath']+0.5*test_data['HalfBath']
    test_data['BsmtBaths']=test_data['BsmtFullBath']+0.5*test_data['BsmtHalfBath']


    test_data['haspool'] = test_data['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    test_data['has2ndfloor'] = test_data['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0) 
    test_data['hasbsmt'] = test_data['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
    test_data['hasfence'] = test_data['Fence'].apply(lambda x: 1 if x=="None" else 0)

    test_data['Total_sqr_footage'] = (test_data['TotalBsmtSF']+test_data['GrLivArea']+test_data['LotArea'])

    test_data['Total_porch_sf'] = (test_data['OpenPorchSF'] + test_data['3SsnPorch'] +
                              test_data['EnclosedPorch'] + test_data['ScreenPorch'])


    p=['StoneBr','NrdigHt','NoRidge']

    test_data['PremiumFlag']=test_data['Neighborhood'].apply(lambda x: 1 if x in p else 0)


    test_data["CentralAir"]=test_data["CentralAir"].astype(str)
    test_data['haspool']=test_data['haspool'].astype(str)
    test_data['has2ndfloor']=test_data['has2ndfloor'].astype(str)
    test_data['hasbsmt']=test_data['hasbsmt'].astype(str)
    test_data['hasfence']=test_data['hasfence'].astype(str)
    test_data['PremiumFlag']=test_data['PremiumFlag'].astype(str)
    test_data["GarageFlag"]=test_data["GarageFlag"].astype(str)

    test_data.drop(columns=coldel,inplace=True)

    print(len(coldel)," columns deleted")

    test_data2=encoder.transform(test_data)


#     skewness = skewness[abs(skewness) > 0.75]

    from scipy.special import boxcox1p
#     skewed_cols = skewness.index

#     rv=['MSSubClass','SalePrice','Age','BsmtBaths','HalfBath','BsmtFullBath','Fireplaces',
#         'GarageAge','KitchenAbvGr','ReModelAge','TotRmsAbvGrd','YearRemodAdd']

#     sk_train=set(skewed_cols).difference(set(rv)).difference(set(ordlabels))
    lamb = 0.15
    for cols in sk_train:  
        test_data2[cols] = boxcox1p(test_data2[cols], lamb)

    test_data3=ct.transform(test_data2)

    print("Shape of transformed test data :",test_data3.shape)

    return test_data3

In [37]:
# ames_preprocessing_script is based on appendix and uploaded to the notebook
# import ames_preprocessing_script as ames 
# test_data=ames.process_testdata2("test.csv",coldel,encoder,sk_train,ct)

test_data=process_testdata2("../input/house-prices-advanced-regression-techniques/test.csv",coldel,encoder,sk_train,ct)
np.savetxt("test_data1.csv",test_data,delimiter=",")

Null columns:  0
9  columns deleted
Shape of transformed test data : (1459, 281)


## Training the Models and Predictions

### Cross Validation for XGBoost

In [38]:
# Though it is not required to save and reload everytime. Once we have finalized the dataset - in terms of processing 
# null values, adding deleting new columns, encoding the data with column transformer etc., then we can load the 
# processed data and focus on experimenting with the machine learning model

X_test=np.loadtxt("X_test1.csv",delimiter=',')
y_test=np.loadtxt("y_test1.csv",delimiter=',')
X_train=np.loadtxt("X_train1.csv",delimiter=',')
y_train=np.loadtxt("y_train1.csv",delimiter=',') 
y=np.loadtxt("y.csv",delimiter=',')
 
X=pd.read_csv("Xt.csv")
X.drop(columns="Unnamed: 0",inplace=True)
X1=np.loadtxt("x1.csv",delimiter=',')
test_data=np.loadtxt("test_data1.csv",delimiter=',')

In [39]:
X.shape

(1452, 83)

In [40]:
# xgr=xg.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
#                              learning_rate=0.05, max_depth=3, 
#                              min_child_weight=1.7817, n_estimators=2200,
#                              reg_alpha=0.4640, reg_lambda=0.8571,
#                              subsample=0.5213, silent=1,
#                              random_state =1123, nthread = -1,verbosity=0)


# num_folds=5
# kfold = KFold(n_splits=num_folds,shuffle=True,random_state=1123)

# fold_no = 1
# acc_per_fold = []
# loss_per_fold = []
# ypred_testdata={}

# # test_data=preprocess_testdata("test.csv")
# ct.fit(X)
# for tra, tes in kfold.split(X, y):
    
#     X_train=ct.transform(X.iloc[tra])
#     y_train=y[tra]
#     X_test=ct.transform(X.iloc[tes])
#     y_test=y[tes]
    
#     xgr.fit(X_train,np.log(y_train))
    
# #     filename=r"F:\Kaggle\HousingPrices\xgboost\xgcvmodel_" + str(fold_no)    
# #     xgr.save_model(filename)    

#     ypred_xg=xgr.predict(X_test)  
    
#     ypred_testdata[fold_no]=np.exp(xgr.predict(test_data))    
    
#     scores = mean_squared_error((ypred_xg),np.log(y_test),squared=False)
    
#     print(f'Score for fold {fold_no}: {scores}')
#     acc_per_fold.append(scores) 
#     fold_no +=1
          
# print("Average RMSE over ",num_folds," folds :",sum(acc_per_fold)/len(acc_per_fold))

# # Result 1 - without removing outliers
# # Score for fold 1: 0.11046911788633561
# # Score for fold 2: 0.12659792674069162
# # Score for fold 3: 0.15770987377810577
# # Score for fold 4: 0.108230843149171
# # Score for fold 5: 0.12680772941599697
# # Average RMSE over  5  folds : 0.12596309819406018
 
# #Result 2 - after removing 8 outliers
# # Score for fold 1: 0.11102508831659731
# # Score for fold 2: 0.10350964965561603
# # Score for fold 3: 0.11602991923229042
# # Score for fold 4: 0.10807433147223645
# # Score for fold 5: 0.11883951688117621
# # Average RMSE over  5  folds : 0.11149570111158329

Note that in the original model above, the accuracy could have been anything from 0.10 to 0.157. We can test Kaggle score for the first model which has got lowest error, though it is not necessary that the model that gives lowest error will also give the lowest error for Kaggle test data!!

In [41]:
# ypred_testdata[2]

In [42]:
# testdata=pd.read_csv("test.csv")
# test_id=testdata["Id"]

# xgrsubmission=pd.DataFrame({'Id':test_id,'SalePrice':ypred_testdata[2]})
# xgrsubmission.to_csv('nm-final-xgbcv.csv',index=False) 
# xgrsubmission

#Kaggle score for above - 0.12349

**In the above cross validation, the accuracies range from 10.35% to 11.88%. The best prediction is from second cross validation at 10.35%. Submitting the result to Kaggle however gives only a Kaggle score of 0.12349 implying variance**

### We can now apply the model for the entire dataset to get the maximum possible accuracy 

### XGBoost with Full Dataset

In [43]:
#load the input data
y=np.loadtxt("y.csv",delimiter=',') 
X1=np.loadtxt("x1.csv",delimiter=',')
test_data=np.loadtxt("test_data1.csv",delimiter=',')

In [44]:
# X1=ct.transform(X)
xgr=xg.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =1123, nthread = -1,verbosity=0)
xgr.fit(X1,np.log(y))

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.4603,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0.0468, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0,
             min_child_weight=1.7817, missing=nan, monotone_constraints='()',
             n_estimators=2200, n_jobs=-1, nthread=-1, num_parallel_tree=1,
             predictor='auto', random_state=1123, reg_alpha=0.464, ...)

In [45]:
#0.077459
y_pred_xgf=xgr.predict(X1)
mean_squared_error((y_pred_xgf),np.log(y),squared=False)

0.07778072943625701

In [46]:
test_data=np.loadtxt("test_data1.csv",delimiter=',')
np.exp(xgr.predict(test_data))

array([126623.8 , 158878.03, 184267.28, ..., 151378.55, 119538.97,
       223126.8 ], dtype=float32)

In [47]:
# testdata=pd.read_csv("test.csv")
# test_id=testdata["Id"]
 
# xgrsubmission=pd.DataFrame({'Id':test_id,'SalePrice':np.exp(xgr.predict(test_data))})
# xgrsubmission.to_csv('nm-final-xgbfull.csv',index=False) 
# xgrsubmission
# #Kaggle Score - 0.12452

### Alternate XGBoost Model

In [48]:
# Another check - this model is from https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook#Models

xgr2 = xg.XGBRegressor(learning_rate=0.01,n_estimators=3460,
                                     max_depth=3, min_child_weight=0,
                                     gamma=0, subsample=0.7,
                                     colsample_bytree=0.7,
                                     objective='reg:linear', nthread=-1,
                                     scale_pos_weight=1, seed=27,
                                     reg_alpha=0.00006)

xgr2.fit(X1,np.log(y))

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=0,
             missing=nan, monotone_constraints='()', n_estimators=3460,
             n_jobs=-1, nthread=-1, num_parallel_tree=1, objective='reg:linear',
             predictor='auto', random_state=27, ...)

In [49]:
np.exp(xgr2.predict(test_data))

array([125243.16, 163625.7 , 183693.89, ..., 158435.9 , 117357.85,
       217811.58], dtype=float32)

In [50]:
y_pred_xgf=xgr2.predict(X1)
mean_squared_error((y_pred_xgf),np.log(y),squared=False)

0.048246621242634576

In [51]:
np.exp(xgr2.predict(test_data))

array([125243.16, 163625.7 , 183693.89, ..., 158435.9 , 117357.85,
       217811.58], dtype=float32)

In [52]:
# testdata=pd.read_csv("test.csv")
# test_id=testdata["Id"]
 
# xgrsubmission=pd.DataFrame({'Id':test_id,'SalePrice':np.exp(xgboost.predict(test_data))})
# xgrsubmission.to_csv('nm-xgboost-final-model2.csv',index=False) 

## Stacking

We shall now do stacked regressions. I have used both Sklearn's stacked regressor and also mlxtend's stackedCV regressor. Here are some insights:

1. Mlextend's StackedCVRegressor gives higher accuracy
2. Using the option "use_features_in_secondary" as "False" - implying that the meta regressor does not use actual dataset itself - gives better accuracy than otherwise
3. Adding deep learning model to a stacked regressor does not perceptibly improve accuracy
4. Higher number of stacked models does not automatically result in better accuracy. Some times 3 models gave better accuracy than 5-7 models
5. Changing the meta-regressor, say, from lasso to xgboost, changes the accuracy. But to really see the difference, we need to do cross validation of entire stacked regression model which takes quite sometime. I have given one output as commented cell

**6. As like cross validation of any machine learning model, the cross validation of stacked regression also gives a wide range of 10% to 11x% implying the accuracy gained through stacking is minimal and that it is also dependent on random state and input data order**

#### Import various models

**While using various models for stacking, I have not done any gridsearch. Instead, I have directly used parameters from some of the top accuracy notebooks from Kaggle - reducing clutter and saving time!**

In [53]:
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV

from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
 
import lightgbm as lgb

lgbm = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11,random_state=1123)

GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =1123)

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)


ENet = ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=1123)

lass = Lasso(alpha =0.0005, random_state=1123)

lasso = LassoCV(random_state=1123)


xgr=xg.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =1123, nthread = -1)


xgr2 = xg.XGBRegressor(learning_rate=0.01,n_estimators=3460,
                                     max_depth=3, min_child_weight=0,
                                     gamma=0, subsample=0.7,
                                     colsample_bytree=0.7,
                                     objective='reg:linear', nthread=-1,
                                     scale_pos_weight=1, random_state=1123,
#                        seed=27,
                                     reg_alpha=0.00006)


### Sklearn Stacking Regressor

In the first model, I am using training dataset without deleting the outliers

In [54]:
train2=pd.read_csv("train2.csv")
train2.drop(columns='Unnamed: 0', inplace=True)
 
y=train2['SalePrice']
X=train2.drop(columns='SalePrice')

train2.shape,X.shape,y.shape

((1460, 84), (1460, 83), (1460,))

In [55]:
ct=ColumnTransformer([('ohem1',OneHotEncoder(sparse=False,handle_unknown='ignore'),list(catlabels)),                                    
                      ('rbsm1',StandardScaler(),list(normlabels))
                     ], remainder='passthrough')
ct.fit(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,shuffle=True,random_state=1123)
X_train,X_test=ct.transform(X_train),ct.transform(X_test)

In [56]:
lgbm = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =1123)

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)


ENet = ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=1123)

lasso = Lasso(alpha =0.0005, random_state=1123)


xgr=xg.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =1123, nthread = -1)

 
estimators = [
              ('ls', lasso),
              ('lgbm',lgbm),
              ('gboost',GBoost),
              ('krr',KRR),
              ('enet',ENet),
              ('xg',xgr)]


stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=LassoCV(),verbose=0)
stacking_regressor.fit(X_train,np.log(y_train))
  



StackingRegressor(estimators=[('ls', Lasso(alpha=0.0005, random_state=1123)),
                              ('lgbm',
                               LGBMRegressor(bagging_fraction=0.8,
                                             bagging_freq=5, bagging_seed=9,
                                             feature_fraction=0.2319,
                                             feature_fraction_seed=9,
                                             learning_rate=0.05, max_bin=55,
                                             min_data_in_leaf=6,
                                             min_sum_hessian_in_leaf=11,
                                             n_estimators=720, num_leaves=5,
                                             objective='regression')),
                              ('gboost',
                               GradientBoost...
                                            importance_type=None,
                                            interaction_constraints=None,
       

In [57]:
y_pred=stacking_regressor.predict(X_test)
mean_squared_error(y_pred,np.log(y_test),squared=False)

0.10582794591780029

In [58]:
test_data2=process_testdata2("../input/house-prices-advanced-regression-techniques/test.csv",coldel,encoder,sk_train,ct)

Null columns:  0
9  columns deleted
Shape of transformed test data : (1459, 282)


In [59]:
ypred_testdata=stacking_regressor.predict(test_data2)
np.exp(ypred_testdata)

array([125115.74027351, 163239.57683883, 192827.22255182, ...,
       175352.24786657, 121367.35001446, 230052.18703478])

In [60]:
# testdata=pd.read_csv("test.csv")
# test_id=testdata["Id"]
 
# stacksubmission=pd.DataFrame({'Id':test_id,'SalePrice':np.exp(ypred_testdata)})
# stacksubmission.to_csv('nm-stacked2.csv',index=False) 
# stacksubmission
#Kaggle Score - 0.12093

#### With Full dataset - after deleting outliers

In [61]:
#load the input data
y=np.loadtxt("y.csv",delimiter=',') 
X1=np.loadtxt("x1.csv",delimiter=',')
test_data=np.loadtxt("test_data1.csv",delimiter=',')

In [62]:
stacking_regressor.fit(X1,np.log(y))



StackingRegressor(estimators=[('ls', Lasso(alpha=0.0005, random_state=1123)),
                              ('lgbm',
                               LGBMRegressor(bagging_fraction=0.8,
                                             bagging_freq=5, bagging_seed=9,
                                             feature_fraction=0.2319,
                                             feature_fraction_seed=9,
                                             learning_rate=0.05, max_bin=55,
                                             min_data_in_leaf=6,
                                             min_sum_hessian_in_leaf=11,
                                             n_estimators=720, num_leaves=5,
                                             objective='regression')),
                              ('gboost',
                               GradientBoost...
                                            importance_type=None,
                                            interaction_constraints=None,
       

In [63]:
y_pred=stacking_regressor.predict(X1)
mean_squared_error(y_pred,np.log(y),squared=False)

0.073020347391113

In [64]:
test_data=np.loadtxt("test_data1.csv",delimiter=',')

In [65]:
ypred_testdata=stacking_regressor.predict(test_data)
np.exp(ypred_testdata)

array([121756.92216155, 158977.87917177, 183847.388238  , ...,
       160896.82830394, 118178.06157849, 224662.75630614])

In [66]:
# testdata=pd.read_csv("test.csv")
# test_id=testdata["Id"]
 
# stacksubmission=pd.DataFrame({'Id':test_id,'SalePrice':np.exp(ypred_testdata)})
# stacksubmission.to_csv('nm-stacked3.csv',index=False) 
# stacksubmission
# Kaggle Score - 0.12023

### Deep Learning Model for Stacking

In [67]:
def create_model(): 
    
#     boundaries = [9600,19200,28800]
#     values = [0.001,0.001, 0.0005,0.0001]

    boundaries = [19201,28801]
    values = [0.001, 0.0005,0.0001]

 
    lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    model =Sequential([  
    Dense(80,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234),
          kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    Dense(60,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234),
          kernel_regularizer=tf.keras.regularizers.L2(0.00075)),
    Dense(40,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)),
    Dense(20,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)), 
    Dense(10,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)), 
    Dense(1,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234))
    ])

    model.compile(loss=tf.keras.losses.MeanSquaredLogarithmicError(),optimizer=optimizer, 
              metrics=[tf.keras.metrics.MeanSquaredLogarithmicError()])

    return model


In [68]:
#We shall also create a deep learning model using the basic mse loss function instead of logarithmic

def create_model2(): 
    
#     boundaries = [9600,19200,28800]
#     values = [0.001,0.001, 0.0005,0.0001]

    boundaries = [19201,28801]
    values = [0.001, 0.0005,0.0001]

 
    lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    model =Sequential([  
    Dense(80,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234),
          kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    Dense(60,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234),
          kernel_regularizer=tf.keras.regularizers.L2(0.00075)),
    Dense(40,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)),
    Dense(20,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)), 
    Dense(10,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234)), 
    Dense(1,activation='relu',kernel_initializer=tf.keras.initializers.HeNormal(seed=1234))
    ])

    model.compile(loss=tf.keras.losses.MeanSquaredError(),optimizer=optimizer, 
              metrics=[tf.keras.metrics.MeanSquaredError()])

    return model


In [69]:
dlmodel=create_model()

num_epochs=800

hist_model=dlmodel.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=32,
                      verbose=0,epochs=num_epochs)


You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

You may not need to update to CUDA 11.1; cherry-p

In [70]:
absl.logging.set_verbosity(absl.logging.ERROR)
logging.getLogger("tensorflow").setLevel(logging.WARNING)

dlmodel=create_model()

num_epochs=300

checkpoint_stackdl_path = '/stackdl'

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_stackdl_path,save_best_only=True,verbose=0,
                                                      monitor="val_loss")

hist_model=dlmodel.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=32,
                      verbose=0,epochs=num_epochs,callbacks=[model_checkpoint]) 

In [71]:
y_pred=dlmodel.predict(X_test)
rmsle=mean_squared_error(y_pred,y_test,squared=False)
print("Root mean squared log error  is ",rmsle)
print("R2 Score is ",r2_score(y_pred,y_test))

Root mean squared log error  is  36699.051252753925
R2 Score is  0.61970773527714


In [72]:
checkpoint_stackdl_path = '/stackdl'
dlmodel=tf.keras.models.load_model(checkpoint_stackdl_path)

yx_pred1=dlmodel.predict(X_test)
rmsle=mean_squared_error(np.log(yx_pred1),np.log(y_test),squared=False)
print("Root mean squared log error is ",rmsle)
print("R2 Score is ",r2_score(np.log(yx_pred1),np.log(y_test)))

Root mean squared log error is  0.14578878704165688
R2 Score is  0.817695675553403


In [73]:
# yx_pred=dlmodel.predict(test_data)
# yx_pred

### Cross Validation of Deep Learning Model

**This takes a lot of time, I have given the output and not running it in kaggle**

In [74]:
# tf.random.set_seed(1123)

# cv_model=create_model()

# absl.logging.set_verbosity(absl.logging.ERROR)
# logging.getLogger("tensorflow").setLevel(logging.ERROR)

# num_folds=5
# kfold = KFold(n_splits=num_folds, shuffle=True,random_state=1123)

# fold_no = 1
# acc_per_fold = []
# loss_per_fold = []

# cv_model=create_model() 

# ct.fit(X)

# for tra, tes in kfold.split(X, y):    
    
#     X_train=ct.transform(X.iloc[tra])
#     y_train=y[tra]
#     X_test=ct.transform(X.iloc[tes])
#     y_test=y[tes]

#     num_epochs=900

#     cv_model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=32,
#                       verbose=0,epochs=num_epochs) 
    
#     scores = cv_model.evaluate(X_test, y_test, verbose=0)
    
# #     ypred_testdata[fold_no] = cv_model.predict(test_data)
    
#     print(f'Score for fold {fold_no}:{cv_model.metrics_names[0]} of {scores[0]};{cv_model.metrics_names[1]} of {np.sqrt(scores[1])}')
#     acc_per_fold.append(np.sqrt(scores[1]))
#     loss_per_fold.append(scores[0])
    
#     fold_no += 1
          
# print("Average loss over ",num_folds," folds :",sum(loss_per_fold)/len(loss_per_fold),"\n",
#       "Average RMSLE over ",num_folds," folds :",sum(acc_per_fold)/len(acc_per_fold))


# # Result without removing outliers
# # # Score for fold 1:loss of 0.01851341500878334;mean_squared_logarithmic_error of 0.11922487031967988
# # # Score for fold 2:loss of 0.020068099722266197;mean_squared_logarithmic_error of 0.12776380806454743
# # # Score for fold 3:loss of 0.028314953669905663;mean_squared_logarithmic_error of 0.15836473734414347
# # # Score for fold 4:loss of 0.01694272831082344;mean_squared_logarithmic_error of 0.11727990242707094
# # # Score for fold 5:loss of 0.020004048943519592;mean_squared_logarithmic_error of 0.13101463928261295
# # # Average loss over  5  folds : 0.020768649131059646 
# # #  Average RMSLE over  5  folds : 0.13072959148761093

# #Result after removing 8 outliers
# # Score for fold 1:loss of 0.017953351140022278;mean_squared_logarithmic_error of 0.1176721208248425
# # Score for fold 2:loss of 0.014617688953876495;mean_squared_logarithmic_error of 0.10603736433876926
# # Score for fold 3:loss of 0.015291799791157246;mean_squared_logarithmic_error of 0.11097516609180755
# # Score for fold 4:loss of 0.013014090247452259;mean_squared_logarithmic_error of 0.1020088657284752
# # Score for fold 5:loss of 0.01701594702899456;mean_squared_logarithmic_error of 0.12146330785692717
# # Average loss over  5  folds : 0.015578575432300568 
# #  Average RMSLE over  5  folds : 0.11163136496816435

#### Using Scikeras to wrap Deep Learning Model

In [75]:
from scikeras.wrappers import KerasRegressor

dlmodel2=KerasRegressor(model=dlmodel, warm_start=True, random_state=1123) 
#                                        optimizer=None, loss=None, metrics=None, batch_size=None, 
#                                        validation_batch_size=None, verbose=1, callbacks=None, validation_split=0.0, 
#                                        shuffle=True, run_eagerly=False, epochs=900)

### Stacking including Deep Learnng Model - Cross Validation

In [76]:
# stackdlmodel = StackingCVRegressor(regressors=(dlmodel2,model_lgb,GBoost,ENet,KRR,xgr2),
#                             meta_regressor=lasso, cv=5,
#                             use_features_in_secondary=False,
#                             store_train_meta_features=True,
#                             shuffle=True,
#                             random_state=89765)

In [77]:
def rmsle_stack(model):
     
    rmse= np.sqrt(-cross_val_score(model, X_train, np.log(y_train), scoring="neg_mean_squared_error", cv = 5))
    return(rmse)

In [78]:
# Dont run this cell.. takes a long time - result for given pythonseed and random state noted here

# rmsle_stack(stackdlmodel)

#Results - array([0.11673763, 0.11025071, 0.09852142, 0.10274116, 0.09937728]) - Mean --> 0.1055

### Stacking without a deep learning model

In [79]:
y=np.loadtxt("y.csv",delimiter=',')

X1=np.loadtxt("x1.csv",delimiter=',')

test_data=np.loadtxt("test_data1.csv",delimiter=',')
X1.shape,y.shape,test_data.shape

((1452, 281), (1452,), (1459, 281))

In [80]:
stackmodel = StackingCVRegressor(regressors=(lgbm,GBoost,ENet,KRR,xgr2),
                            meta_regressor=lass, cv=5,
                            use_features_in_secondary=False,
                            store_train_meta_features=True,
                            shuffle=True,
                            random_state=1123)
stackmodel.fit(X1,np.log(y))
y_pred=stackmodel.predict(X1)
mean_squared_error(y_pred,np.log(y),squared=False)


#0.07127655369326556



0.06721123376076008

In [81]:
np.exp(stackmodel.predict(test_data))

array([121760.83270956, 159603.44327575, 183209.76886411, ...,
       161636.42357817, 117000.43834318, 222706.26696641])

In [82]:
# submission

ypred_testdata=np.exp(stackmodel.predict(test_data))
 
stacksubmission=pd.DataFrame({'Id':test_id,'SalePrice':ypred_testdata})
stacksubmission.to_csv('nm-stacked-stack1.csv',index=False) 

In [83]:
stacksubmission
#Kaggle Score - nm-stacked-stack1.csv - 0.1196

Unnamed: 0,Id,SalePrice
0,1461,121760.832710
1,1462,159603.443276
2,1463,183209.768864
3,1464,196815.707905
4,1465,194208.873965
...,...,...
1454,2915,80874.423627
1455,2916,81721.242318
1456,2917,161636.423578
1457,2918,117000.438343


### Stacking along with a deep learning model

In [84]:
dlmodel=create_model2()

num_epochs=800

hist_model=dlmodel.fit(X1,np.log(y),batch_size=32,verbose=0,epochs=num_epochs)


You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.


In [85]:
absl.logging.set_verbosity(absl.logging.ERROR)
logging.getLogger("tensorflow").setLevel(logging.ERROR)

num_epochs=200

checkpoint_stackdl_path = '/stackdl'

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_stackdl_path,save_best_only=True,verbose=0,
                                                      monitor="loss")

hist_model=dlmodel.fit(X1,np.log(y),batch_size=32,verbose=0,epochs=num_epochs,callbacks=[model_checkpoint]) 

In [86]:
y_pred=dlmodel.predict(X1)
mean_squared_error(y_pred,np.log(y),squared=False)



0.06875906103096573

In [87]:
checkpoint_stackdl_path = '/stackdl'
model=tf.keras.models.load_model(checkpoint_stackdl_path)
y_pred=model.predict(X1)
mean_squared_error(y_pred,np.log(y),squared=False)



0.06921012803688972

In [88]:
checkpoint_stackdl_path = '/stackdl'
model=tf.keras.models.load_model(checkpoint_stackdl_path)
ypred_testdata_dl=np.exp(model.predict(test_data).ravel())
ypred_testdata_dl



array([117826.85, 160807.11, 193592.08, ..., 157691.4 , 108558.25,
       228204.28], dtype=float32)

In [89]:
# Submission for deep learning model
 
stacksubmission=pd.DataFrame({'Id':test_id,'SalePrice':ypred_testdata_dl})
stacksubmission.to_csv('nm-stacked-dl2.csv',index=False) 

In [90]:
stacksubmission
#Kaggle score - 0.12547

Unnamed: 0,Id,SalePrice
0,1461,117826.851562
1,1462,160807.109375
2,1463,193592.078125
3,1464,202378.796875
4,1465,192238.390625
...,...,...
1454,2915,81308.242188
1455,2916,79874.421875
1456,2917,157691.406250
1457,2918,108558.250000


In [91]:
from scikeras.wrappers import KerasRegressor

checkpoint_stackdl_path = '/stackdl'
dlmodel=tf.keras.models.load_model(checkpoint_stackdl_path)


dlmodel2=KerasRegressor(model=dlmodel,random_state=1123)

# , warm_start=True, random_state=1123)
#                                        optimizer='Adam', loss=tf.keras.losses.MeanSquaredLogarithmicError,
#                         metrics=tf.keras.metrics.MeanSquaredLogarithmicError, batch_size=32, 
#                                        validation_batch_size=None, verbose=1, callbacks=None, validation_split=0.0, 
#                                        shuffle=True, run_eagerly=False, epochs=1)

stackdlmodel = StackingCVRegressor(regressors=(dlmodel2,lgbm,GBoost,xgr2,ENet,KRR),
                            meta_regressor=lasso, cv=5,
                            use_features_in_secondary=False,
                            store_train_meta_features=True,
                            shuffle=True,
                            random_state=1123)

stackdlmodel.fit(X1,np.log(y))
y_pred=stackdlmodel.predict(X1)
mean_squared_error(y_pred,np.log(y),squared=False)

2023-01-18 02:24:57.324893: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://06f0b869868d4c87ae995fcff6fa24d0: INVALID_ARGUMENT: ram://06f0b869868d4c87ae995fcff6fa24d0 is a directory.
2023-01-18 02:24:58.655715: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://ac6323db6ee94a3a9eb3736a98ab7c0b: INVALID_ARGUMENT: ram://ac6323db6ee94a3a9eb3736a98ab7c0b is a directory.




2023-01-18 02:25:02.219371: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://443ecf79091843c591c43ce06de9828a: INVALID_ARGUMENT: ram://443ecf79091843c591c43ce06de9828a is a directory.




2023-01-18 02:25:06.702802: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://2c2e3b8801214fcf8b3d60bb6a4525ca: INVALID_ARGUMENT: ram://2c2e3b8801214fcf8b3d60bb6a4525ca is a directory.




2023-01-18 02:25:10.462269: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://8a76202b61eb4f07a2b6b721a7e0f0fb: INVALID_ARGUMENT: ram://8a76202b61eb4f07a2b6b721a7e0f0fb is a directory.




2023-01-18 02:25:14.180041: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ram://8e81d277c6cd47039f5aed2423128539: INVALID_ARGUMENT: ram://8e81d277c6cd47039f5aed2423128539 is a directory.




0.06812995267906787

In [92]:
stackdlmodel.train_meta_features_[:,:3]

array([[12.16743088, 12.22451264, 12.24017565],
       [12.17197037, 12.00082   , 11.98563599],
       [12.32175732, 12.27993671, 12.30058895],
       ...,
       [12.42737007, 12.48900015, 12.40267143],
       [11.8841753 , 11.82585388, 11.77797933],
       [12.03614235, 11.927258  , 11.90832438]])

In [93]:
ypred_testdata_stackdl=np.exp(stackdlmodel.predict(test_data))
ypred_testdata_stackdl



array([120587.23157643, 159179.23311223, 184551.33864951, ...,
       159434.67790744, 115549.67183055, 225551.62604887])

In [94]:
# submission
 
stacksubmission=pd.DataFrame({'Id':test_id,'SalePrice':ypred_testdata_stackdl})
stacksubmission.to_csv('nm-stacked-withdl3.csv',index=False) 

In [95]:
stacksubmission
#Kaggle Score for nm-stacked-withdl.csv --> 0.11982

Unnamed: 0,Id,SalePrice
0,1461,120587.231576
1,1462,159179.233112
2,1463,184551.338650
3,1464,197719.599475
4,1465,199057.146179
...,...,...
1454,2915,81605.743050
1455,2916,82590.307771
1456,2917,159434.677907
1457,2918,115549.671831


### Stacking & Blending

In [96]:
xgr2.fit(X1,y)
xgr_pred=xgr2.predict(X1)
xgr_rmsle=mean_squared_error(np.log(xgr_pred),np.log(y),squared=False)
print("Root Mean Square error for xgr :",xgr_rmsle)

lgbm.fit(X1,y)
lgbm_pred=lgbm.predict(X1)
lgbm_rmsle=mean_squared_error(np.log(lgbm_pred),np.log(y),squared=False)
print("Root Mean Square error for lgb :",lgbm_rmsle)

GBoost.fit(X1,y)
gb_pred=GBoost.predict(X1)
gb_rmsle=mean_squared_error(np.log(gb_pred),np.log(y),squared=False)
print("Root Mean Square error for gb :",gb_rmsle)

ENet.fit(X1,y)
enet_pred=ENet.predict(X1)
enet_rmsle=mean_squared_error(np.log(enet_pred),np.log(y),squared=False)
print("Root Mean Square error for enet :",enet_rmsle)

KRR.fit(X1,y)
krr_pred=KRR.predict(X1)
krr_rmsle=mean_squared_error(np.log(krr_pred),np.log(y),squared=False)
print("Root Mean Square error for krr :",krr_rmsle)

Root Mean Square error for xgr : 0.0541080682997864
Root Mean Square error for lgb : 0.0791277333122889
Root Mean Square error for gb : 0.04520287675210172
Root Mean Square error for enet : 0.1364357843173614
Root Mean Square error for krr : 0.09983571955920241


In [97]:
# some accuracies obtained in my laptop

# gb         - 0.04427760526602732  / 0.04369149712360021
# xgboost    - 0.0541               / 0.054171942067719185
# Stackmodel - 0.0709               / 0.07305513972444029
# lgb        - 0.07878933984199095  / 0.07925133713317101
# dlmodel    - 0.0911               / 0.09350974438083604
# krr        - 0.09983571955920889  / 0.09983571955920686
# enet       - 0.1370665505010631   / 0.13706640526876213

In [98]:
xgr_testpred=xgr2.predict(test_data)
xgr_testpred

array([127175.984, 160538.11 , 186192.77 , ..., 155076.11 , 122756.61 ,
       218190.98 ], dtype=float32)

In [99]:
dl=pd.read_csv("nm-stacked-dl2.csv")
st1=pd.read_csv("nm-stacked-stack1.csv")
st2=pd.read_csv("nm-stacked-withdl3.csv")

In [100]:
final_pred=(dl['SalePrice']*0.1+xgr_testpred*0.1+st1['SalePrice']*0.6+st2['SalePrice']*0.2)
final_pred

0       121674.229574
1       159732.435111
2       184814.614392
3       197141.679716
4       194036.643396
            ...      
1454     81088.661942
1455     81242.886640
1456    160145.541057
1457    116441.683505
1458    223373.612999
Name: SalePrice, Length: 1459, dtype: float64

In [101]:
# Final submission 
submission=pd.DataFrame({'Id':test_id,'SalePrice':final_pred.values})
submission.to_csv('submission.csv',index=False) 
#Kaggle 0.11951

In [102]:
submission

Unnamed: 0,Id,SalePrice
0,1461,121674.229574
1,1462,159732.435111
2,1463,184814.614392
3,1464,197141.679716
4,1465,194036.643396
...,...,...
1454,2915,81088.661942
1455,2916,81242.886640
1456,2917,160145.541057
1457,2918,116441.683505
