# Group Project – Predicting House Price
## Members: Shiqi Wang, Tao Tao, Ying Ben

## 1. Abstract

Purchasing a house is a major milestone that tops many people's lifetime to-do lists, and possibly their lists of financial fears too. When people tend to buy a house, there are too many elements to consider, e.g., the neighborhood, and the area. However, among these elements, house price is probably the one which nearly everyone cares the most about. There is a project in the Kaggle targeting such a problem, i.e., House Prices: Advanced Regression Techniques. In this project, our team will make a prediction of the house prices based on lots of features and the knowledge learnt from Machine Learning course.


## 2. Introduction

The dataset contains 79 explanatory variables describing almost every aspect of residential home in Ames, Iowa. 

Our goal is to predict price of each home. 

## 3. Related work

Previous work has indicated that the location of houses are important when predicting house price, and Steven C. Bourassa, Eva Cantoni and Martin Hoesli had developed spatial statistical models to improve the performance of conventional models, such as OLS model [1]. We will try to apply their models in our current work because our dataset has some spatial attributesand we hope it will have good performance.

Besides, we will use a hybrid regression model to predict house price. This idea is based on the work of Sifei Lu, who presented a regression method combined by Gradient boosting regression model and Lasso for house price regression [2]. According to the conclusion, this hybrid model has low rmse in their house dataset, so we will see whether this idea can perform well in our dataset. 


## 4.  Detailed problem 

### Data Overview
    - Large number of features (80 features).
    - Most data is categorical data which illustrates condition of houses.
    - Have “NA” values in many features.
### Problems
    - Lots of features are needed to process.
    - More features will be generated after transforming categorical data to numerical data.
    - The large amount of features may influence the efficiency of models and be difficult to interpret.
    - How to deal with “NA” values properly.

### Objective:
    - Reduce the number features by selecting or combining  proper features.
    - How to reduce features:
        Looking the distribution and variance of features.
        Looking the coefficient between features and target.
        By experience (e.g. business logic).

## 5. Approaches
### 5-1 train-data overview

In [13]:
import numpy as np
import pandas as pd
import math
import csv
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

In [14]:
# Read Data
dataset = pd.read_csv('train.csv')

# Extract data and target
target = np.log1p(dataset['SalePrice'])
data = dataset.drop(columns='SalePrice')
print(data[0:5])
print(target[0:5])

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities      ...       ScreenPorch PoolArea PoolQC Fence  \
0         Lvl    AllPub      ...                 0        0    NaN   NaN   
1         Lvl    AllPub      ...                 0        0    NaN   NaN   
2         Lvl    AllPub      ...                 0        0    NaN   NaN   
3         Lvl    AllPub      ...                 0        0    NaN   NaN   
4         Lvl    AllPub      ...                 0        0    NaN   NaN   

  MiscFeature MiscVal MoSold  YrSold  SaleType  SaleCondition  
0         NaN       0      2    20

### 5-2 Preprocessing

In [15]:

# Drop features, features have been selected in a file
col_list = []
f = open("FeatureEngineering.txt", "r")
line = f.readline()
for line in f:
    if line != "\n":
        col_list.append(line.split('\t')[0])
    else:
        break

col_list.append("MasVnrType")
# Temporarily drop Neighborhood for we have not had any insight now
# .append("Neighborhood")
print(col_list)

['SaleCondition', 'SaleType', 'WoodDeckSF', 'MiscFeature', 'MiscVal', 'Fence', 'PoolArea', 'PoolQC', 'ScreenPorch', '3SsnPorch', 'EnclosedPorch', 'PavedDrive', 'GarageCond', 'GarageQual', 'GarageYrBlt', 'Foundation', 'Functional', 'KitchenAbvGr', 'BsmtHalfBath', 'LowQualFinSF', 'Electrical', 'CentralAir', 'Heating', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtCond', 'ExterCond', 'Exterior1st', 'Exterior2nd', 'RoofMatl', 'RoofStyle', 'BldgType', 'Condition1', 'Condition2', 'LandSlope', 'Utilities', 'LandContour', 'Alley', 'Street', 'LotConfig', 'MSZoning', 'MSSubClass', 'Id', 'MasVnrType']


In [16]:
data = data.drop(columns = col_list)
print(data[1:5])

   LotFrontage  LotArea LotShape Neighborhood HouseStyle  OverallQual  \
1         80.0     9600      Reg      Veenker     1Story            6   
2         68.0    11250      IR1      CollgCr     2Story            7   
3         60.0     9550      IR1      Crawfor     2Story            7   
4         84.0    14260      IR1      NoRidge     2Story            8   

   OverallCond  YearBuilt  YearRemodAdd  MasVnrArea   ...   TotRmsAbvGrd  \
1            8       1976          1976         0.0   ...              6   
2            5       2001          2002       162.0   ...              6   
3            5       1915          1970         0.0   ...              7   
4            5       2000          2000       350.0   ...              9   

  Fireplaces FireplaceQu GarageType  GarageFinish  GarageCars  GarageArea  \
1          1          TA     Attchd           RFn           2         460   
2          1          TA     Attchd           RFn           2         608   
3          1          

In [17]:
pd.options.mode.chained_assignment = None
# Label Encoding
def labelEncode(column, score_set):
    na_table = column.isnull()
    for i in range(0, len(column)):
        for j in range(0, len(score_set)):
            if column[i] == score_set[j]:
                column[i] = j
                break
            if na_table[i] == True:
                column[i] = 0

# LotShape
score_set = ["IR3", "IR2", "IR1", "Reg"]
labelEncode(data["LotShape"], score_set)
# ExterQual
score_set = ["Po", "Fa", "TA", "Gd", "Ex"]
labelEncode(data["ExterQual"], score_set)
# BsmtQual
score_set = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
labelEncode(data["BsmtQual"], score_set)
# BsmtExposure
score_set = ["NA", "No", "Mn", "Av", "Gd"]
labelEncode(data["BsmtExposure"], score_set)
# BsmtFinType1
score_set = ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]
labelEncode(data["BsmtFinType1"], score_set)
# HeatingQC
score_set = ["Po", "Fa", "TA", "Gd", "Ex"]
labelEncode(data["HeatingQC"], score_set)
# KitchenQual
score_set = ["Po", "Fa", "TA", "Gd", "Ex"]
labelEncode(data["KitchenQual"], score_set)
# Fireplace
score_set = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]
labelEncode(data["FireplaceQu"], score_set)
# GarageFinish
score_set = ["NA", "Unf", "RFn", "Fin"]
labelEncode(data["GarageFinish"], score_set)

print(data[1:5])

   LotFrontage  LotArea LotShape Neighborhood HouseStyle  OverallQual  \
1         80.0     9600        3      Veenker     1Story            6   
2         68.0    11250        2      CollgCr     2Story            7   
3         60.0     9550        2      Crawfor     2Story            7   
4         84.0    14260        2      NoRidge     2Story            8   

   OverallCond  YearBuilt  YearRemodAdd  MasVnrArea   ...   TotRmsAbvGrd  \
1            8       1976          1976         0.0   ...              6   
2            5       2001          2002       162.0   ...              6   
3            5       1915          1970         0.0   ...              7   
4            5       2000          2000       350.0   ...              9   

  Fireplaces FireplaceQu GarageType  GarageFinish  GarageCars  GarageArea  \
1          1           3     Attchd             2           2         460   
2          1           3     Attchd             2           2         608   
3          1          

### 5-3 Data encoding

In [18]:
# Fill NA value
print(data.isnull().any())

LotFrontage      True
LotArea         False
LotShape        False
Neighborhood    False
HouseStyle      False
OverallQual     False
OverallCond     False
YearBuilt       False
YearRemodAdd    False
MasVnrArea       True
ExterQual       False
BsmtQual        False
BsmtExposure    False
BsmtFinType1    False
BsmtFinSF1      False
BsmtUnfSF       False
TotalBsmtSF     False
HeatingQC       False
1stFlrSF        False
2ndFlrSF        False
GrLivArea       False
BsmtFullBath    False
FullBath        False
HalfBath        False
BedroomAbvGr    False
KitchenQual     False
TotRmsAbvGrd    False
Fireplaces      False
FireplaceQu     False
GarageType       True
GarageFinish    False
GarageCars      False
GarageArea      False
OpenPorchSF     False
MoSold          False
YrSold          False
dtype: bool


In [19]:
pd.options.mode.chained_assignment = None
# NA Process
# Feature LotFrontagee
na_table = data["LotFrontage"].isnull()
# Use mean value to replace NA
mean = np.mean(data["LotFrontage"])
for i in range(0, len(na_table)):
    if na_table[i] == True:
        data["LotFrontage"][i] = mean
# Feature MasVnrArea
na_table = data["MasVnrArea"].isnull()
mean = np.mean(data["MasVnrArea"])
for i in range(0, len(na_table)):
    if na_table[i] == True:
        data["MasVnrArea"][i] = mean

In [20]:
# Combine Variable
# GarageType
na_table = data["GarageType"].isnull()
remainLevel = ["Attchd", "Detchd"]
length = len(remainLevel)
for i in range(0, len(data["GarageType"])):
    if na_table[i] == True:
        data["GarageType"][i] = "NoGarage"
        continue
    for j in range(0, length):
        if data["GarageType"][i] == remainLevel[j]:
            break
        if j == length - 1 and data["GarageType"][i] != remainLevel[j]:
            data["GarageType"][i] = "Others"

In [21]:
# HouseStyle
for i in range(0, len(data["HouseStyle"])):
    if data["HouseStyle"][i] == "1.5Fin" or data["HouseStyle"][i] == "1.5Unf":
        data['HouseStyle'][i] = "1.5Story"
        continue
    if data["HouseStyle"][i] == "2.5Fin" or data["HouseStyle"][i] == "2.5Unf":
        data["HouseStyle"][i] = "2.5Story"
        continue

In [22]:
# Dummy Variable
data = pd.get_dummies(data, columns=["Neighborhood", "GarageType", "HouseStyle"])

In [23]:
# Scale Data
scaler = StandardScaler()
scale_data = scaler.fit_transform(data)
print(scale_data[1:5])


[[ 0.4519361  -0.09188637  0.70129102 -0.07183611  2.17962776  0.15673371
  -0.42957697 -0.57441047 -0.68960393  0.58316783  2.22099903  0.69011514
   1.17199212 -0.64122799  0.46646492  0.89117944  0.25714043 -0.79516323
  -0.48251191 -0.81996437  0.78974052 -0.76162067  0.16377912 -0.77109084
  -0.31868327  0.60049493  0.64889007  0.31847458  0.31172464 -0.06073101
  -0.70448325 -0.48911005 -0.61443862 -0.10854037 -0.03703704 -0.10526316
  -0.20339487 -0.1398323  -0.33838413 -0.19025216 -0.27116307 -0.23917551
  -0.16124951 -0.10854037 -0.1863522  -0.42683279 -0.07875671 -0.22941573
  -0.16998114 -0.23595776 -0.28963792 -0.13199092 -0.23106504 -0.20521398
  -0.25018188 -0.13199092 -0.16347148 11.47725023  0.82350526 -0.60055892
  -0.24235968 -0.3019617  -0.36059806  1.00549455 -0.11482721 -0.66213567
  -0.16124951 -0.21585871]
 [-0.09311018  0.07347998 -1.01663664  0.65147924 -0.51719981  0.9847523
   0.83021457  0.32306034  1.05230219  0.58316783  0.34662991  1.16471151
   0.0929071

### 5-4 Train Data with different model

####  Linear Regression(Baseline Model)

In [24]:
# Train Data
# Baseline Model (Linear Regression)
model = LinearRegression()
y_pred = cross_val_predict(model, scale_data, target, cv=10)

print(np.sqrt(mean_squared_error(y_pred, target)))

0.14658811628248603


In [26]:
import pandas as pd
import numpy as np
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict


predictors = pd.read_csv('newtrain.csv')
predictors = predictors.iloc[:,1:-1]
target = pd.read_csv('target.csv',header=None)
target = target.iloc[:,1]

print(predictors.shape[0])
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


1460


In [27]:
## Linear regression
from sklearn.linear_model import LinearRegression

linreg= LinearRegression().fit(predictors, target)
target_predicted = cross_val_predict(linreg, predictors, target, cv=10)
rmse(target,target_predicted)

0.14658605782219736

#### Ridge & Lasso

In [28]:
## Ridge
from sklearn.linear_model import Ridge 

best_rmse = 1
for a in [0.0001,0.001,0.01,0.1,1,10,100,1000]:
    RidgeModel=Ridge(alpha = a)
    target_predicted = cross_val_predict(RidgeModel, predictors, target, cv=10)         
    score = rmse(target,target_predicted)
    
    if score < best_rmse:
        best_rmse = score
        best_parameter = a

print(best_rmse)
print(best_parameter)

0.14658309093979632
0.1


In [29]:
# Lasso
from sklearn.linear_model import Lasso
import warnings
warnings.filterwarnings('ignore')

best_rmse = 1
for a in [0.0001,0.001,0.01,0.1,1,10,100,1000]:
    LassoModel = Lasso(alpha = a)
    target_predicted = cross_val_predict(LassoModel, predictors, target, cv=10)         
    score = rmse(target,target_predicted)
    
    if score < best_rmse:
        best_rmse = score
        best_parameter = a

print(best_rmse)
print(best_parameter)

0.1468117255118387
0.0001


#### knn regression

In [30]:
## knn regression
from sklearn.neighbors import KNeighborsRegressor
best_rmse = 1

for n_neigh in [1,3,5,10]:
    knnreg = KNeighborsRegressor(n_neighbors = n_neigh)
    target_predicted = cross_val_predict(knnreg, predictors, target, cv=10)         
    score = rmse(target,target_predicted)
    
    if score < best_rmse:
        best_rmse = score
        best_parameter = n_neigh

print(best_rmse)
print(best_parameter)

0.22196999469826473
5


#### Dtree regression

In [31]:
## Dtree regression
from sklearn.tree import DecisionTreeRegressor
best_rmse = 1

for depth in [1,3,5,10,20,60]:
    dtreereg = DecisionTreeRegressor(max_depth = depth)
    target_predicted = cross_val_predict(dtreereg, predictors, target, cv=10)         
    score = rmse(target,target_predicted)
    
    if score < best_rmse:
        best_rmse = score
        best_parameter = depth

print(best_rmse)
print(best_parameter)

0.19763745470961303
5


#### Adaboost

In [32]:
## Adaboost
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

adareg = AdaBoostRegressor(LinearRegression(),n_estimators=100, random_state=0)
adareg.fit(predictors, target)
target_predicted = cross_val_predict(adareg, predictors, target, cv=10)
rmse(target,target_predicted)

0.15591714554189626

#### NN regression

In [34]:
## NN regression
from sklearn.neural_network import MLPRegressor

nnreg = MLPRegressor().fit(predictors, target)
target_predicted = cross_val_predict(nnreg, predictors, target, cv=10)
rmse(target,target_predicted)

11.589286735034715

#### SVM regression

In [38]:
## SVM regression
from sklearn.svm import SVR

svmreg = MLPRegressor().fit(predictors, target)
target_predicted = cross_val_predict(svmreg, predictors, target, cv=10)
rmse(target,target_predicted)

16.013821308758544

#### randomforest

In [36]:
## randomforest
from sklearn.ensemble import RandomForestRegressor

rfreg = RandomForestRegressor(n_estimators=100,max_depth=50).fit(predictors, target)
target_predicted = cross_val_predict(rfreg, predictors, target, cv=10)
rmse(target,target_predicted)

0.14221426306108007

 ## 6. Conclusions
 
Compared with different models, we find that randomforest has better performance than the others. But it can still imporve. Considering the model's parameters, we didn't get the best combination to predict th house pricing.There are two important limitations affecting the performance of our approach. 
<br><br>
First limitation is related to the large number of features. It makes the performance of our model hard to improve. A solution to this issue is that we try to select features that are more related to the predictor and introduce some dummy variables, which helps a little to final results. 
<br><br>
The second limitation is the ensemble methods. In this project, we do not find a good way to ensemble our models. We have learned that some ensemble methods such as stacking may work better in this problem.

## 7. Reference

[1] Bourassa, S. C., Cantoni, E., & Hoesli, M. (2007). Spatial dependence, housing submarkets, and house price prediction. The Journal of Real Estate Finance and Economics, 35(2), 143-160.

[2] Lu, S., Li, Z., Qin, Z., Yang, X., & Goh, R. S. M. (2017, December). A hybrid regression technique for house prices prediction. In Industrial Engineering and Engineering Management (IEEM), 2017 IEEE International Conference on(pp. 319-323). IEEE.

### responsibilities
Shiqi Wang：feature expansion,feature selection,report writing

Tao Tao: data preprocessing,feature selection,model training

Ben Ying: data preprocessing,feature selection,model training