Goal: To predict the sale price of a 4-room flat in Yishun using machine learning models from the scikit-learn library in Python. Other supporting libraries include pandas, numpy, matplotlib and seaborn for EDA.
Steps:
- We collected the HDB resale transaction data, which can be obtained from the website maintained by the Singapore government here.
  - Only data from 2017 onwards was collected, for greater relevance to current conditions.
- We examined the features provided and determined that additional factors can potentially affect the sale price, e.g. proximity to:
  - MRT and/or LRT stations
  - Bus interchanges
  - Shopping Centres
- Calls were made to the OneMapSG API for the required proximity information.
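
The data preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the rows are made up, the column names (`month`, `lat`, `lon`, `resale_price`) and the helpers `haversine_km` and `nearest_station_km` are hypothetical, and the station coordinates are approximate. It assumes each flat's coordinates have already been retrieved (for example from the OneMapSG calls), and it uses the haversine great-circle formula as one possible way to turn coordinates into a proximity feature.

```python
import math

import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Toy stand-in for the downloaded resale data (made-up rows, assumed columns).
resale = pd.DataFrame({
    "month": ["2016-11", "2017-01", "2019-06", "2021-03"],
    "lat": [1.4250, 1.4180, 1.4300, 1.4320],
    "lon": [103.8400, 103.8330, 103.8360, 103.8300],
    "resale_price": [330000.0, 345000.0, 420000.0, 510000.0],
})

# Parse the year-month strings so the cutoff is a real date comparison.
resale["month"] = pd.to_datetime(resale["month"], format="%Y-%m")

# Keep only transactions from January 2017 onwards.
recent = resale[resale["month"] >= pd.Timestamp("2017-01-01")].copy()

# Approximate station coordinates, for illustration only.
stations = {"Yishun": (1.4294, 103.8350), "Khatib": (1.4173, 103.8329)}


def nearest_station_km(lat, lon):
    """Distance from a point to the closest station in `stations`."""
    return min(haversine_km(lat, lon, s_lat, s_lon)
               for s_lat, s_lon in stations.values())


# Add the proximity feature, one value per remaining transaction.
recent["dist_to_mrt_km"] = [
    nearest_station_km(lat, lon)
    for lat, lon in zip(recent["lat"], recent["lon"])
]
```

The same pattern would apply to bus interchanges and shopping centres by swapping in a different dictionary of coordinates.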
This project aims to come up with a Machine Learning (ML) model to predict the resale prices of HDB flats in Singapore using Python libraries such as scikit-learn, matplotlib, seaborn, pandas and numpy.
- The prepared data is decomposed with PCA. Enough components are retained to capture 95% of the variance in the features, and the rest are discarded.
- The transformed data is then split into two sets: training (80%) and validation (20%). A linear regression model is trained on the principal components retained from PCA.
- The mean squared error (MSE) of the fitted model is then calculated against the validation data.
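
As a rough sketch of the PCA, split, fit and scoring steps just described, the snippet below runs them on synthetic data with scikit-learn. The data, the variable names and the `random_state` are assumptions made for illustration, and standardising the features before PCA is also an assumption rather than something stated above. Passing a float to `n_components` with `svd_solver="full"` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the prepared data: three underlying signals,
# each duplicated with a little noise, giving six correlated features.
base = rng.normal(size=(500, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(500, 3))])
y = base @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=500)

# Standardise so no feature dominates purely because of its scale
# (an assumption; the text above does not say whether this was done).
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# 80% training, 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

# Fit linear regression on the retained components and score it.
first_model = LinearRegression().fit(X_train, y_train)
mse_first = mean_squared_error(y_val, first_model.predict(X_val))

print("components kept:", pca.n_components_)
print("validation MSE:", mse_first)
```

This follows the order given above (decompose, then split). A common alternative is to fit the scaler and PCA on the training split only, so that no information from the validation rows reaches the transform.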
- Selected relevant non-numerical features are imputed, with the understanding that doing so may introduce bias into the data.
- The modified dataset is then split into training (80%) and validation (20%) sets. All imputed and numerical features are used to train the model.
- The MSE of the fitted model is calculated against the validation data and compared with that of the First Model to assess which model is more suitable.
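
One possible reading of these steps is sketched below on synthetic data. Several things here are assumptions rather than the project's confirmed method: that "imputed" means filling missing categorical values with the most frequent category, that the filled categories are then one-hot encoded so a linear model can use them, that the second model is again a linear regression, and all column names, coefficients and the helper `lower_mse`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-in: two numerical features and one categorical feature.
data = pd.DataFrame({
    "floor_area_sqm": rng.uniform(80, 110, size=n),
    "dist_to_mrt_km": rng.uniform(0.1, 2.0, size=n),
    "flat_model": rng.choice(["Model A", "Improved", "Simplified"], size=n),
})
premium = data["flat_model"].map(
    {"Model A": 20000.0, "Improved": 10000.0, "Simplified": 0.0}
)
price = (3000.0 * data["floor_area_sqm"]
         - 15000.0 * data["dist_to_mrt_km"]
         + premium
         + rng.normal(scale=5000.0, size=n))

# Blank out some categories to mimic missing entries.
data.loc[data.index[::25], "flat_model"] = np.nan

# Fill missing categories with the most frequent one, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = ColumnTransformer(
    [("cat", categorical, ["flat_model"])],
    remainder="passthrough",  # numerical columns pass through unchanged
)
second_model = make_pipeline(preprocess, LinearRegression())

# 80% training, 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    data, price, test_size=0.2, random_state=42
)
second_model.fit(X_train, y_train)
mse_second = mean_squared_error(y_val, second_model.predict(X_val))


def lower_mse(mse_first, mse_second):
    """Return which model has the lower validation MSE."""
    return "first" if mse_first <= mse_second else "second"
```

Calling `lower_mse(mse_first, mse_second)` with the first model's validation MSE gives the comparison described above. The two MSEs are only comparable if both models are scored on the same target in the same units.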