Housing is a basic need for everyone. Most people choose to buy a house when they start a family, which indicates that the demand for home ownership will keep increasing over time. Buying a house is commonly the most important financial transaction a person makes. Because most prices are negotiated individually (unlike in a stock exchange system), the market is inefficient. That is why many real estate agencies use house price prediction to make house pricing more efficient.
House price prediction can help a real estate agency determine the selling price of a house. It can also help customers make better purchasing decisions. The price of a house is influenced by factors such as physical condition, concept, location, and others, and prices vary from place to place and between communities. There are various techniques for predicting house prices, and regression is one of the efficient ones. Regression is an appealing area of data analysis because it lets us make specific numeric predictions while incorporating several variables simultaneously.
This project uses the Boston Housing Dataset, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. From this dataset, we want to predict the house price (medv) from the other features. The dataset columns / features are:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- BLACK - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's
This project compares several regression methods to find the best model for predicting the house price. The methods used are:
- Multiple linear regression
- Regularized linear regression with the best lambda (Lasso & Ridge Regression)
- Random forest regression
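The three families of methods listed above could be fitted side by side roughly as follows. This is a minimal sketch, not the project's actual code: it assumes scikit-learn, and it uses synthetic data from `make_regression` as a stand-in for the 13 Boston features and the medv target, because the dataset is no longer bundled with scikit-learn. The `alphas` grid, the 80/20 split, and the forest size are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 features and the medv target (assumption).
X, y = make_regression(
    n_samples=300, n_features=13, n_informative=6, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate penalty strengths; scikit-learn calls lambda "alpha".
alphas = np.logspace(-3, 2, 20)

models = {
    "Multiple Linreg": LinearRegression(),
    # LassoCV and RidgeCV pick the best alpha by 5-fold cross-validation.
    "Lasso Reg": LassoCV(alphas=alphas, cv=5, random_state=0),
    "Ridge Reg": RidgeCV(alphas=alphas, cv=5),
    "Random Forest Reg": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit every model on the training split, then score R2 on the held-out split.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
test_r2 = {name: model.score(X_test, y_test) for name, model in fitted.items()}

for name, r2 in test_r2.items():
    print(f"{name}: R2 = {r2:.3f}")
```

Scoring on a held-out split rather than on the training data keeps the comparison fair for flexible models such as random forests, which can fit the training set almost perfectly.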
Here is a comparison of the evaluation metrics for those methods:
Metric | Multiple Linreg | Lasso Reg | Ridge Reg | Random Forest Reg |
---|---|---|---|---|
R2 | 0.714963 | 0.484510 | 0.713695 | 0.971135 |
RMSE | 5.463073 | 5.315646 | 5.446738 | 4.061765 |
MAE | 3.376244 | 3.535684 | 3.375222 | 2.538490 |
MAPE | 0.180371 | 0.293518 | 0.252971 | 0.130002 |
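For reference, the four metrics in the table can be computed from true and predicted values as in the sketch below. The helper name `regression_metrics` and the three sample values are hypothetical and exist only for illustration. MAPE is returned as a fraction rather than a percentage, which matches the scale of the values in the table.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and MAPE (as a fraction) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
        "MAE": float(np.mean(np.abs(residuals))),
        # MAPE is undefined when a true value is zero.
        "MAPE": float(np.mean(np.abs(residuals / y_true))),
    }

# Made-up values, not taken from the dataset.
scores = regression_metrics([20.0, 25.0, 30.0], [22.0, 24.0, 27.0])
print(scores)  # R2 = 0.72, RMSE ≈ 2.160, MAE = 2.0, MAPE = 0.08
```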
From the metric scores in the table above, we can conclude that:
- The random forest regression model is the best model for predicting the house price (medv).
- The random forest regression model has the highest R2 score, 0.971135, which means that 97.11% of the total variability of medv is explained by the features used.
- The random forest regression model has the smallest RMSE, MAE, and MAPE values.
- Based on the random forest model, the three most important features affecting the house price (medv) are rm, ptratio, and zn.
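A feature ranking like the one in the last point can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below assumes scikit-learn and pandas and uses made-up data: the column names `x_strong`, `x_weak`, and `x_noise` and the relationship between them and the target are hypothetical, chosen so that the expected ranking is obvious. It does not reproduce the project's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400

# Hypothetical features: none of these columns come from the Boston dataset.
X = pd.DataFrame({
    "x_strong": rng.normal(6.0, 0.7, n),
    "x_weak": rng.normal(18.0, 2.0, n),
    "x_noise": rng.uniform(0.0, 100.0, n),
})
# Assumed toy target: driven mostly by x_strong, slightly by x_weak, not by x_noise.
y = 8.0 * X["x_strong"] - 0.5 * X["x_weak"] + rng.normal(0.0, 1.0, n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a permutation-based check on held-out data is a reasonable complement when the ranking matters.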
Libraries used: pandas, numpy, sklearn, statsmodels, seaborn, matplotlib