kaclewtak/Pricing-Model-for-Residential-Properties

Pricing Model for Residential Properties - Project Summary

This study examined house sale prices in King County, USA, between May 2014 and May 2015. Its purpose was to build a predictive model that supports decision making and reduces risk in our automated underwriting process, which relies on automated valuation of properties. We analyzed data on more than 21,000 houses sold during the year across 17 variables, including the number of bedrooms, the number of bathrooms, the square footage of the house, and the zip code. We propose a Polynomial Regression Model in this study.

The model is intended as a tool to support decisions, reduce risk, and enable faster pre-approvals. It is not intended to replace a professional appraisal, because it has several limitations, including inconsistencies in the data that need further investigation. The model is also based on a limited geographical area and may not apply outside King County, USA. Because the model is based on a limited time frame, it should be continually monitored and updated with the latest pricing.

Description of Methods

The predictive model was developed through statistical analysis using multiple regression models. We carried out rigorous exploratory data analysis, built several models, including Linear, Polynomial, Weighted Least Squares and Regression Tree models, and used a variety of statistical tests to check that the model is robust and reliable.

We started with extensive exploratory analysis to understand the data and to identify missing values, outliers and inconsistencies. Based on this analysis, we excluded variables that were not relevant to the model because they were not characteristics of a property, as well as variables that were highly correlated with others, to avoid overfitting. Next, we split the data, using 70% of the observations to train the models and the remaining 30% to test their performance.
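The 70/30 split described above can be sketched as follows. This is a minimal illustration using only Python's standard library. The function name `split_train_test`, the fixed seed, and the toy records are assumptions made for illustration, not the project's actual code.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the house records (hypothetical values).
houses = [{"id": i, "price": 300_000 + 1_000 * i} for i in range(10)]
train, test = split_train_test(houses)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so every candidate model is trained and tested on the same observations.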

We built multiple candidate Linear models and used statistical techniques to identify and retain the property characteristics that were most significant in explaining the price of a house. We also conducted a variety of visual and statistical tests to check that the model complies with the assumptions of Normality, Linearity, Homoscedasticity and Independence. This helped us narrow the candidates down to a "champion" model, which formed our baseline for further analysis.

Next, we built some non-linear "challenger" models, including Polynomial, Weighted Least Squares and Regression Tree models, to compare the robustness, reliability and performance of the models. The most promising of these were a Polynomial model and the Regression Tree model. We again ran numerical statistical tests on the 30% test data to assess each model's predictive power. We concluded by picking a Polynomial model, while also suggesting that the Regression Tree could be an alternative because the performance of the two models was relatively close.
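The champion/challenger comparison on held-out data can be illustrated with a small sketch. It fits a straight line and a quadratic to toy training data by ordinary least squares and compares their RMSE on separate test points. The helper names, the single predictor, and the synthetic price curve are all assumptions made for illustration; the project's actual models use many variables and are not reproduced here.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) b = X^T y."""
    n = degree + 1
    # Entries of X^T X and X^T y are power sums of x.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs  # coeffs[k] multiplies x**k


def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))


def rmse(actual, predicted):
    return (sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)) ** 0.5


def toy_price(x):
    """Hypothetical price (in $1000s) as a function of living area (in 1000 sqft)."""
    return 100 + 50 * x + 40 * x ** 2


train_x = [0.8, 1.0, 1.3, 1.7, 2.0, 2.4, 3.0]
test_x = [1.1, 1.9, 2.7]
train_y = [toy_price(x) for x in train_x]
test_y = [toy_price(x) for x in test_x]

linear = fit_polynomial(train_x, train_y, degree=1)     # "champion" baseline
quadratic = fit_polynomial(train_x, train_y, degree=2)  # "challenger"

linear_rmse = rmse(test_y, [predict(linear, x) for x in test_x])
poly_rmse = rmse(test_y, [predict(quadratic, x) for x in test_x])
print(f"linear test RMSE: {linear_rmse:.3f}, quadratic test RMSE: {poly_rmse:.3f}")
```

Because the toy data are exactly quadratic, the quadratic fit wins by construction. With real data the gap would be smaller, which is why the comparison is made on held-out observations rather than on the training set.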

Conclusions

Based on both the out-of-sample and in-sample results for the tested models, we believe a regression tree model is fit for purpose, as its results are similar to those obtained from the regression models in earlier sections. Compared with the polynomial model developed in Section IV.IV.II, the regression tree model performs slightly worse out of sample. In sample, however, the regression tree performed better than the polynomial model on both R-squared and RMSE.

Future Work and Model Monitoring

Several aspects must be considered when monitoring the ongoing performance of the existing model.

To start, we must address data inconsistencies and missing context. In our initial data exploration we identified several variables that lacked clarity, such as bathrooms, sqft_lot, and grade. The bathrooms variable is a sum of several constituent components. For future data collection, we recommend recording the individual components of the bathroom count, as described in Section II, so that values such as 3.75 can be interpreted precisely. For sqft_lot, we identified inconsistencies in the measurements that better data documentation could have explained. The grade variable is largely subjective: it reflects the personal or arbitrary judgment of the individual(s) assessing the property. No information is given about the criteria used to determine the grade, nor is there any note of whether those criteria changed during the data collection process.

Our overall recommendation, therefore, is to note any changes in the approach to data collection as part of the ongoing monitoring of the model. More precise data collection would have substantial ramifications for the quality of the existing model. For example, if the factors determining the grade of homes were to change and all houses were reassessed, the predictive quality of that variable could change and force us to reconsider our models.

Next, in the OLS models, we note that the variables with the largest influence on sale price mostly consist of factors external to the house itself, that is, factors which either require substantial development of the home or are inherent to the location of the house. Square feet of living space and the grade of the home, for example, can be changed, but such changes are subject to regulatory policies. To maintain the future performance of this model, our next suggestion is to closely monitor regulations involving housing developments and renovations. New legislation could change new housing developments or mandate changes to existing ones. Any such changes should be noted and measured over time to determine whether they have a material impact on the model. If they do, additional variables should be considered to maintain model performance.

The quantitative thresholds to monitor are primarily centered on the existence of outliers and on the Root Mean Square Error (RMSE) of the model. The best performance we achieved with our models is $\mathrm{RMSE} = 163{,}999.0$ with $\text{Adj. } R^2 = 0.8110829$ for polynomial regression, and $\mathrm{RMSE} = 176{,}995.5$ with $\text{Adj. } R^2 = 0.7824997$ for the regression tree. These out-of-sample test results should be used as benchmarks going forward, with future models attempting to reduce RMSE or increase adjusted R-squared, preferably both. As for outliers, as can be seen in figure ###, their presence makes it harder to fit the model to the data correctly. Should further data compound these issues, additional remedial measures must be taken to adjust the model for the best fit.
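A monitoring check against these benchmarks could look like the sketch below. It uses the standard definitions $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and $\text{Adj. } R^2 = 1 - (1 - R^2)\tfrac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. The benchmark constants are the polynomial regression results quoted above; the function names and the toy prices are assumptions made for illustration.

```python
import math

# Out-of-sample polynomial regression benchmarks quoted in the text above.
BENCHMARK_RMSE = 163_999.0
BENCHMARK_ADJ_R2 = 0.8110829


def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def adjusted_r2(actual, predicted, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    n = len(actual)
    mean_y = sum(actual) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def compare_to_benchmark(model_rmse, model_adj_r2):
    """Report which of the two benchmark metrics a new model improves on."""
    return {
        "rmse_improved": model_rmse < BENCHMARK_RMSE,
        "adj_r2_improved": model_adj_r2 > BENCHMARK_ADJ_R2,
    }


# Hypothetical sale prices and predictions for a new candidate model.
actual = [300_000, 450_000, 520_000, 610_000, 750_000, 900_000]
predicted = [320_000, 430_000, 500_000, 640_000, 760_000, 870_000]

model_rmse = rmse(actual, predicted)
model_adj_r2 = adjusted_r2(actual, predicted, n_predictors=2)
print(compare_to_benchmark(model_rmse, model_adj_r2))
```

In practice both metrics would be computed on a fresh out-of-sample set, so that the comparison with the benchmarks is like for like.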

We can also look to existing recommendations from the Federal Reserve on model risk management. Per the supervisory guidance, we must first admit that "[our] model may have fundamental errors and may produce inaccurate outputs when viewed against the design objective and intended business uses"[2]. We believe this to be true given the nature of the issues we identified in the model, such as high multicollinearity between several variables, heteroskedasticity, and some slight departures from normality. However, we assume that despite these issues the performance of the model remains robust and that the model is viable for sale price prediction.
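One simple way to keep an eye on the multicollinearity mentioned above is to screen pairs of predictors for high correlation. The sketch below computes Pearson correlations and flags pairs above a cutoff. Pairwise correlation is only a first screen and is not necessarily the diagnostic used in this study; the 0.8 threshold, the function names, and the toy columns are all assumptions made for illustration.

```python
from itertools import combinations
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical predictor columns (toy values, not taken from the dataset).
columns = {
    "living_area": [1000, 1500, 2000, 2500, 3000],
    "bathrooms": [1.0, 1.5, 2.0, 2.75, 3.0],
    "sqft_lot": [5000, 4000, 9000, 3000, 7000],
}
flagged = flag_correlated_pairs(columns)
print(flagged)
```

Re-running such a screen whenever new data arrive would show whether relationships between predictors are drifting over time.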

Likewise, the National Association of Insurance Commissioners, a regulatory board overseeing matters of public and private insurance, recommends that model performance be assessed in combination with real-world feedback from businesses and users of the model [3]. Their white paper argues that only individuals with sufficient skills, knowledge and domain expertise should be responsible for assessing any particular model. This is because of the complex statistical nature of these models, which some may find difficult to understand and whose significance may be difficult to explain to other stakeholders.

Finally, it should be noted that a "challenge from model users may be weak if the model does not materially affect their results"[2]. In other words, if the model is taken on blind faith, there may be adverse business effects further down the line. This is why we present several challenger models for price prediction, using both traditional statistical methods for building OLS models and regression tree models. The ability to compare the performance of these models is crucial to understanding their individual strengths and weaknesses. Going forward, we expect the same approach to be taken in order to maintain consistent and relevant model performance.
