# Introduction

Purchasing a home is one of the most important decisions people make. It is crucial that a prospective home buyer makes this purchase at a fair price. However, when facing a decision of such financial magnitude, buyers may worry that they are paying more for the house than it is worth. Previously, the complexity of this task led people to simply divide the price of the house by its total square footage, obtaining a price per square foot that could be compared to other houses in the neighborhood. Now, with more data and increased computational power, we can make the analysis more sophisticated by basing it on a house's features. However, for various reasons, it is still necessary to extrapolate to the house actually being purchased. This project builds models to find the best way of predicting housing prices in a given area and to make the process more transparent. The goal is to give home buyers a better way of estimating house prices and evaluating houses before purchasing.

# Data

The dataset under analysis is the House Prices: Advanced Regression Techniques set on Kaggle.com, compiled by Dean De Cock and published at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview. The houses were sold in Ames, Iowa, between 2006 and 2010. The training set contains 1460 houses with their sale prices and 79 features (e.g. the number of bathrooms, square footage, overall quality rating). The goal is to predict the prices of another set of houses for which the prices were not provided.
Of the 79 explanatory variables, we found that 51 are categorical and 28 are continuous. The predictor variables can be grouped into categories such as lot/land, location, age, appearance, external features, room/bathroom, kitchen, basement, roof, garage, and utilities.
Complete information about the variables can be found at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.

# Data Preprocessing 

There are 1460 observations in the train set and 1459 in the test set. Since Id is unnecessary for prediction, I remove it from both datasets. All other columns except SalePrice are used as features in the models. Imputation of missing values is discussed in the Feature Engineering section.

After removing Id, I look for outliers by plotting GrLivArea against SalePrice. As shown below, the overall relationship is positive, but two points have extremely large areas and very low prices. Since a larger living area usually leads to a higher price, these two points appear to have been recorded by mistake, so they are removed. <img src="pics/scatter1.png" width='400'>

This is the relationship after removing the two points: <img src="pics/scatter2.png" width='400'>
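The outlier filter described above might look like the following minimal sketch. The toy DataFrame and the two cutoff values are assumptions made for illustration; the source only says that two points with very large areas and very low prices were removed, not which thresholds were used.

```python
import pandas as pd

# Toy stand-in for the training data; the real values come from the Kaggle files.
train = pd.DataFrame({
    "GrLivArea": [1500, 2000, 2600, 4700, 5600],
    "SalePrice": [180000, 250000, 320000, 185000, 160000],
})

# Hypothetical cutoffs read off a scatter plot (assumed, not taken from the source).
AREA_CUTOFF = 4000
PRICE_CUTOFF = 300000

# Flag very large houses that sold for unusually low prices.
is_outlier = (train["GrLivArea"] > AREA_CUTOFF) & (train["SalePrice"] < PRICE_CUTOFF)

# Keep everything else and renumber the rows.
train_clean = train[~is_outlier].reset_index(drop=True)
print(len(train), "->", len(train_clean), "rows")
```

Filtering on both conditions at once keeps large houses that sold at high prices, which follow the overall trend and are not obviously errors.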

The next step is to look at the dependent variable, SalePrice. As the histogram shows, the distribution of prices is right-skewed with a large standard deviation. The probability plot confirms this observation, since the quantiles do not follow a straight line:
<img src="pics/saleprice1.png" width='400'>
<img src="pics/qq1.png" width='400'>

Many regression models work better when the target variable is approximately normally distributed, so a log transformation is applied as a remedy. As shown below, the log prices have a mean of 12.02 and a standard deviation of 0.4, and their histogram is close to a normal distribution. The quantile plot confirms this, so the log-transformed SalePrice is used as the target.
<img src="pics/saleprice2.png" width='400'>
<img src="pics/qq2.png" width='400'>
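As a small illustration of the log transformation, the sketch below applies the natural log to a handful of made-up prices and compares the skewness before and after. The price values and the helper `sample_skewness` are assumptions for this example only.

```python
import numpy as np

# Made-up sale prices in dollars (illustrative values only).
prices = np.array([120000.0, 155000.0, 180000.0, 210000.0, 340000.0, 755000.0])

# The natural log compresses the long right tail of the distribution.
log_prices = np.log(prices)

# Predictions made on the log scale are mapped back with the exponential.
recovered = np.exp(log_prices)


def sample_skewness(x):
    """Third standardized moment (biased estimator)."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


skew_before = sample_skewness(prices)
skew_after = sample_skewness(log_prices)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the models are trained on the log scale, their predictions have to be passed through `np.exp` before being reported as prices.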

# Feature Engineering

The first step is to investigate missing values. They are counted on the train and test datasets combined, ignoring the target SalePrice. The type of imputation largely depends on the share of missing values in each feature, so the missing rates are shown below:
<img src="pics/missing.png" width='400'>
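The missing rates in the chart above could be computed along these lines. This is a sketch on tiny made-up tables; the variable names `train`, `test`, `all_data` and `missing_pct` are assumptions, and only three columns are included.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test feature tables (target already dropped).
train = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450, 9600, 11250],
})
test = pd.DataFrame({
    "PoolQC": [np.nan, np.nan],
    "LotFrontage": [70.0, 60.0],
    "LotArea": [11622, 14267],
})

# Stack both tables so the rates reflect all available observations.
all_data = pd.concat([train, test], ignore_index=True)

# Percentage of missing values per column, largest first, complete columns dropped.
missing_pct = all_data.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```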

For most categorical features with a large share of missing values, such as PoolQC, MiscFeature, Alley and Fence, there is no specific reason behind the missing values, so we impute 'None' as a new level of the feature.

For a numeric feature like LotFrontage, we impute the column with its median. For ordinal features that are stored as numbers but are really categorical, we impute with the mode.

MSZoning and Electrical are categorical, and each is dominated by a single level ('RL' in the case of MSZoning), so we impute the most frequent value instead of 'None'.
For the feature Utilities, all records are 'AllPub' except for one 'NoSeWa' and two missing values in the training dataset. A feature that is almost constant won't help in predictive modelling, so it is safely removed.

According to the data description, missing values in Functional mean 'typical', so we impute that value.
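The imputation rules above can be sketched in pandas as follows. The toy table is made up, and the code 'Typ' for a typical Functional value is an assumption about how the dataset spells that level.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined train and test features.
all_data = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MSZoning": ["RL", "RL", np.nan, "RM"],
    "Functional": ["Typ", np.nan, "Min1", "Typ"],
    "Utilities": ["AllPub", "AllPub", np.nan, "AllPub"],
})

# Mostly-missing categorical feature: missing becomes its own level.
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")

# Numeric feature: fill with the column median.
all_data["LotFrontage"] = all_data["LotFrontage"].fillna(all_data["LotFrontage"].median())

# Categorical feature dominated by one level: fill with the most frequent value.
all_data["MSZoning"] = all_data["MSZoning"].fillna(all_data["MSZoning"].mode()[0])

# Functional: missing means typical ('Typ' is the assumed code for that level).
all_data["Functional"] = all_data["Functional"].fillna("Typ")

# Utilities is almost constant, so it is dropped.
all_data = all_data.drop(columns=["Utilities"])
print(all_data)
```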

After imputation, we convert some numerical variables to their proper type, for example changing the year and month variables to categorical, since their values are numbers but act as labels.

To make sure the features fit each model, we apply label encoding to the categorical variables whose ordering may carry information.

We also create a new feature, TotalSF, representing the total square footage of a house, calculated as the sum of the square footage of the 1st floor, the 2nd floor, and the basement.
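The type conversion, label encoding and TotalSF steps might be sketched as below. The column names (YrSold, MoSold, KitchenQual, TotalBsmtSF, 1stFlrSF, 2ndFlrSF) and the choice of which columns to encode are assumptions for illustration; the source does not list them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the combined features (assumed column names).
all_data = pd.DataFrame({
    "YrSold": [2008, 2007, 2010],
    "MoSold": [2, 5, 9],
    "KitchenQual": ["Gd", "TA", "Ex"],
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
})

# Year and month are labels rather than quantities, so store them as strings.
for col in ["YrSold", "MoSold"]:
    all_data[col] = all_data[col].astype(str)

# Label encoding maps each level to an integer code.
for col in ["YrSold", "MoSold", "KitchenQual"]:
    all_data[col] = LabelEncoder().fit_transform(all_data[col])

# Total square footage: basement plus first and second floors.
all_data["TotalSF"] = all_data["TotalBsmtSF"] + all_data["1stFlrSF"] + all_data["2ndFlrSF"]
print(all_data)
```

Note that `LabelEncoder` assigns codes in sorted order of the labels, which is alphabetical for strings. If the codes should follow a quality ranking instead, an explicit mapping would be needed.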

The last step for numeric features is to check skewness, a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Computing the skewness in Python gives a value for each feature; the larger its absolute value, the more skewed the data. Here is a preview of 10 highly skewed features:
<img src="pics/skew.png"  width='200'>

Features that are highly skewed (skewness > 0.75) need to be transformed. To decide which transformation to use, we apply the Box-Cox transformation. For each variable, it estimates the value of lambda between -5 and 5 that makes the transformed data closest to normal, using the equation:
<img src="pics/boxcox.png"  width='300'>

For negative values of lambda, the transformation is a variant of the reciprocal of the variable. At a lambda of zero, the variable is log-transformed, and for positive values of lambda, the variable is raised to the power of lambda. In this dataset, we transformed 59 skewed numerical features in total.
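To make the lambda search concrete, here is a minimal NumPy sketch that scans a grid from -5 to 5 and keeps the lambda with the highest Box-Cox log-likelihood under a normal model. It assumes strictly positive data (features containing zeros would need a shift first), uses synthetic log-normal values, and is not necessarily how the project estimated lambda. The function names are hypothetical.

```python
import numpy as np


def box_cox(x, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    if abs(lam) < 1e-12:
        return np.log(x)
    return (x ** lam - 1.0) / lam


def box_cox_loglik(x, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are normal."""
    y = box_cox(x, lam)
    return (lam - 1.0) * np.log(x).sum() - 0.5 * x.size * np.log(y.var())


def best_lambda(x, low=-5.0, high=5.0, steps=201):
    """Grid search for the lambda in [low, high] with the highest log-likelihood."""
    grid = np.linspace(low, high, steps)
    scores = [box_cox_loglik(x, lam) for lam in grid]
    return float(grid[int(np.argmax(scores))])


def skewness(x):
    """Third standardized moment (biased estimator)."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic right-skewed, strictly positive feature (log-normal).
rng = np.random.default_rng(0)
feature = np.exp(rng.normal(loc=0.0, scale=1.0, size=2000))

lam = best_lambda(feature)
transformed = box_cox(feature, lam)
print(f"lambda={lam:.2f}, skew before={skewness(feature):.2f}, after={skewness(transformed):.2f}")
```

For log-normal data the selected lambda should land close to zero, which corresponds to the log transform mentioned above.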

After feature engineering, we split the combined data back into train and test sets with their original numbers of observations and fit the models.

# Methods

To train the models, we use cross-validation: the train dataset is split into 5 folds, and a helper function uses each fold once as the held-out set and computes the root mean square error (RMSE) of each model. The RMSE measures the difference between the values predicted by a model and the values observed; details can be found here: https://en.wikipedia.org/wiki/Root-mean-square_deviation.
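A cross-validated RMSE helper of the kind described above could be written with scikit-learn as in this sketch. The function name `rmse_cv`, the shuffling seed, the synthetic data and the Lasso `alpha` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

N_FOLDS = 5


def rmse_cv(model, X, y, n_folds=N_FOLDS, seed=42):
    """Return the RMSE measured on each held-out fold."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # scikit-learn reports negated MSE so that larger is always better.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-neg_mse)


# Synthetic regression data standing in for the engineered features.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

scores = rmse_cv(Lasso(alpha=0.1), X, y)
print(f"RMSE: {scores.mean():.4f} (std {scores.std():.4f})")
```

Because the target in this project is the log of SalePrice, an RMSE computed this way is on the log scale.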

We first fit four models: lasso regression, kernel ridge, elastic net and gradient boosting. We adjust the tuning parameters of each model, fit it on the train dataset and compute the RMSE. The tuning parameters are shown below:
<img src="pics/tuningparameters.png">

Here are the RMSE scores for each model:

| Algorithm          | score    | 
|--------------------|---------|
| Lasso              | 0.1139    | 
| Kernel Ridge       | 0.1190    | 
| ElasticNet         | 0.1187    | 
| Gradient Boosting  | 0.1195    | 

Lasso has the lowest RMSE, so it performs best among the four models.

Since all of our models are regression models, we can average their predictions to get more accurate results. Model averaging is an approach to ensemble learning in which each ensemble member contributes an equal amount to the final prediction. In the case of regression, the ensemble prediction is the average of the member predictions. We first define a class that averages models: it clones the four original models, trains the clones on the data, and then averages the predictions of the cloned models. This gives the following RMSE score:

| Algorithm          | score    | 
|--------------------|---------|
| Averaged base models score              | 0.1103    | 

Here is the score from using the test set:

| Algorithm          | score    | 
|--------------------|---------|
| Averaged base models score              | 0.0773    | 
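The averaging class described in this section can be sketched as follows. The class name `AveragingModels`, the synthetic data and all hyperparameters are placeholders chosen for this example, not the tuned values used in the project.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso


class AveragingModels(RegressorMixin, BaseEstimator):
    """Equal-weight average of several regressors (illustrative class name)."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the original estimator objects stay untouched.
        self.models_ = [clone(model).fit(X, y) for model in self.models]
        return self

    def predict(self, X):
        # One column of predictions per model, then the row-wise mean.
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return predictions.mean(axis=1)


# Synthetic data and placeholder hyperparameters (assumptions, not tuned values).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
base_models = [
    Lasso(alpha=0.1),
    KernelRidge(alpha=1.0, kernel="linear"),
    ElasticNet(alpha=0.1, l1_ratio=0.5),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
]

averaged = AveragingModels(base_models).fit(X, y)
averaged_predictions = averaged.predict(X)
print(averaged_predictions[:3])
```

Since the class follows the scikit-learn estimator interface, it can be scored with the same cross-validation helper as the individual models.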

We also extend the averaged model with cross-validation to obtain a stacked averaged model. We define a class that works as follows. First, the base models are cloned and the clones are trained fold by fold to create the out-of-fold predictions needed to train the meta-model. Next, a clone of the meta-model is trained using the out-of-fold predictions as new features. Finally, all base models predict on the test data, and their averaged predictions serve as meta-features for the final prediction, which is made by the meta-model. The order of the models can be changed inside the class, and the final score is as follows:

| Algorithm          | score    | 
|--------------------|---------|
| Stacking averaged models score              | 0.0722    | 
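The stacking procedure described above might look like the sketch below. The class name `StackingAveragedModels`, the fold seed, the synthetic data and the choice of base and meta models are assumptions for illustration, and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import KFold


class StackingAveragedModels(RegressorMixin, BaseEstimator):
    """Meta-model trained on out-of-fold predictions of the base models."""

    def __init__(self, base_models, meta_model, n_folds=5, random_state=0):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
        self.random_state = random_state

    def fit(self, X, y):
        self.base_models_ = [[] for _ in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        folds = KFold(n_splits=self.n_folds, shuffle=True, random_state=self.random_state)

        # Each row gets a prediction from a clone that never saw that row.
        out_of_fold = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_idx, holdout_idx in folds.split(X, y):
                fitted = clone(model).fit(X[train_idx], y[train_idx])
                self.base_models_[i].append(fitted)
                out_of_fold[holdout_idx, i] = fitted.predict(X[holdout_idx])

        # The meta-model learns how to combine the base predictions.
        self.meta_model_.fit(out_of_fold, y)
        return self

    def predict(self, X):
        # Average the fold clones of each base model, one column per base model.
        meta_features = np.column_stack([
            np.mean([m.predict(X) for m in fold_models], axis=0)
            for fold_models in self.base_models_
        ])
        return self.meta_model_.predict(meta_features)


# Synthetic data and placeholder models (assumptions, not the tuned setup).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
stacked = StackingAveragedModels(
    base_models=[ElasticNet(alpha=0.1), GradientBoostingRegressor(n_estimators=50, random_state=0)],
    meta_model=Lasso(alpha=0.1),
)
stacked_predictions = stacked.fit(X, y).predict(X)
train_rmse = float(np.sqrt(np.mean((stacked_predictions - y) ** 2)))
print(f"training RMSE: {train_rmse:.3f}")
```

Training the meta-model only on out-of-fold predictions keeps it from being fitted to predictions that the base models made on their own training rows.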

# Conclusion

With the initial four algorithms (Lasso, Kernel Ridge, Elastic Net and Gradient Boosting), the RMSE values are close to one another, with Lasso the best. We then average the four models, not because our data represent the real distribution poorly, but to increase the accuracy of the predictions. The averaged model reaches an RMSE of 0.0773, which is better than any of the previous four. Stacking the averaged model lowers the score further to 0.0722.

# Discussion

Stacking uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms.

The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions that have better performance than any single model in the ensemble.

There are still more options to try, such as boosting and bagging, or majority voting combined with tree-based models.