## Abstract 


Singapore has emerged as one of the world’s most prosperous countries. In addition to being a financial center, it's an achievement in urban planning and serves as a model for developing nations. It is consistently ranked as one of the most livable cities in the world, and a large part of this is due to its intriguing housing development strategy.

Public housing in Singapore is subsidized, built, and managed by the Government of Singapore, and the country has one of the world’s highest home ownership rates. About 80% of the 5.8M population live in Housing and Development Board (HDB) apartments, **commonly known as HDB "flats"**. More than 90% of Singaporeans in public housing own the apartment they live in. The government subsidizes the cost of new homes, and buyers can finance a purchase with an HDB loan after a 10% down payment. Singapore’s housing estates are considered mixed-income developments.

<img align="left" width="700" height="100" src="https://www.researchgate.net/publication/329910161/figure/fig1/AS:707705744392193@1545741606532/Example-of-Singapores-public-housing-HDB-2.jpg">




With more than 1 million flats spread across 24 towns and 3 estates, Singapore’s public housing is distinctive. As a geographic reference, the mainland of Singapore measures 50 kilometers from east to west and 27 kilometers from north to south, with over 190 kilometers of coastline. As of 2020, a significant portion (78.7%) of Singapore residents lived in public housing ([ref](https://en.wikipedia.org/wiki/Public_housing_in_Singapore)).


**Background** - In 1960 the Singapore Housing and Development Board (HDB) was formed to provide affordable, high-quality housing for residents of the city-state. **Housing is issued by the state on 99-year leaseholds** (which we will dive into), and the value of a home depends on many factors: the inherent utility value of the property, flat size/floor area, flat type, flat model, flat age, overall region and location, geographical proximity to certain amenities, etc.


*We explore various factors (including a large number of engineered geospatial features as additional inputs) that can be used to predict flat resale value accurately, and we display this information in a clean and highly functional user interface.*


**Our Goal**: identify the most important factors/features/drivers of HDB flat resale price using machine learning techniques and algorithms.


<br>

## Motivation







While numerous studies look at real estate globally, we were drawn to analysing Singapore’s resale market for HDB flats, which we believe forms a natural experiment that lets us better isolate and study the effects of location and location features on housing values. Across flat types, HDB flats generally have similar sizes, layouts and features, yet we observe significantly varying transaction values. We also observed some of the more expensive resale transactions occurring on older units with shorter remaining leases, running counter to our expectations on leases and depreciation.





(Image scatter -  lease year/price)




(Image scatter - flat type/price)


Pricing thus becomes a key challenge for potential buyers and sellers of HDB resale flats, who have limited information beyond past transactions and the advice of property agents with potentially skewed incentives (commissions are typically a percentage of transaction value). Our project, ***HDBestimate***, aims to demystify pricing by creating location features and training machine-learning models that predict prices while accounting for these features.

We hypothesize that beyond inherent flat features, location features play a significant role in influencing prices. We hope not only to provide price predictions but also to identify how location and different location features actually influence the price of an HDB resale flat.

<br>

## Data Used

### Primary Baseline Data:

The primary data for the project is Singapore's HDB *Resale Flat Prices* dataset (resale transacted prices), which is published by the Singapore Housing and Development Board (HDB) and updated on a weekly basis.
* Source:  Data can be found here (https://data.gov.sg/dataset/resale-flat-prices)
* Range: Dataset extends from January 1990 through present-day. 
* Baseline dataset consisted of five core data files, covering five time series ranges:  1990-1999, 2000-2012, 2012-2014, 2015-2016, 2017-present day.  These files were merged into one consolidated master dataset and imported for analysis.  
* **Features from this consolidated set include**:  *month* (the month/year of the resale transaction), *town* (the Singapore town of the flat), *block* (the block component of the flat's address), *street name* (the physical street address), *flat type* (the category of flat, including room count), *floor area* (the flat's floor area in square meters), *flat model* (category of flat by model), *lease commence date* (the year the lease commenced), *storey range* (the range of storeys the flat falls in, e.g. '04 to 06' means the flat is between the fourth and sixth storey), *remaining lease* (the remaining years and months of the 99-year lease), and *resale price* (the sale price in Singapore dollars, which is the target variable we will be trying to predict).  Example data rows can be found [here](https://github.com/mcmanus-git/Singapore-HDB/blob/main/tom/blog/example_dataset.svg), and high-level column descriptions are included in this [table](https://github.com/mcmanus-git/Singapore-HDB/blob/main/tom/blog/data_chart.svg). 






* Remaining lease:  the number of years left before the lease ends, computed as of the resale flat application.
* Resale price:  these should be taken as indicative only, as resale prices are agreed between buyers and sellers and depend on many factors, which we will explore.
* Remember, remaining lease refers to the number of years until the 99-year lease expires, after which ownership of the flat returns to the government.  This is a very different concept from home ownership in the United States.

* **Total Transaction Observations**:  867,677 rows (with 11 features), covering 11,747 days. 
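
Several of the columns above arrive as text and need light parsing before they can be used numerically. The sketch below shows one way to do this for *storey range* and *remaining lease*; the function names are hypothetical, the `'YYYY-MM'` month format is an assumption about the raw files, and the lease arithmetic ignores months, so treat it as an approximation rather than our exact preprocessing.

```python
import re

LEASE_YEARS = 99  # HDB flats are sold on 99-year leaseholds


def storey_midpoint(storey_range: str) -> float:
    """Turn a range such as '04 to 06' into its midpoint storey."""
    low, high = (int(n) for n in re.findall(r"\d+", storey_range))
    return (low + high) / 2


def remaining_lease_years(lease_commence_year: int, sale_month: str) -> int:
    """Approximate whole years left on the lease at the time of sale.

    sale_month is assumed to look like '2019-07' (year-month).
    """
    sale_year = int(sale_month.split("-")[0])
    return LEASE_YEARS - (sale_year - lease_commence_year)


print(storey_midpoint("04 to 06"))             # 5.0
print(remaining_lease_years(1985, "2019-07"))  # 65
```

Using the midpoint keeps the ordering of storey ranges while giving the model a single numeric input.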

*Note:  A helpful map to get a feel for Singapore in general is provided [here](https://www.arcgis.com/home/webmap/viewer.html?webmap=90a6084f84134b13a463168b9dee30d9&extent=103.4982,1.2289,103.9586,1.4936)*

<br>

<table class="tg">
<thead>
<tr><th>No.</th><th>Name</th><th>Title</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>month</td><td>Month</td><td>Month and year of the resale transaction</td></tr>
<tr><td>2</td><td>town</td><td>Town</td><td>Name of Singapore town</td></tr>
<tr><td>3</td><td>flat_type</td><td>Flat type</td><td>Category of HDB flat type</td></tr>
<tr><td>4</td><td>block</td><td>Block</td><td>Address block of HDB flat</td></tr>
<tr><td>5</td><td>street_name</td><td>Street name</td><td>Singapore street name of flat</td></tr>
<tr><td>6</td><td>storey_range</td><td>Storey range</td><td>Range of storeys in which the flat is located</td></tr>
<tr><td>7</td><td>floor_area_sqm</td><td>Floor area sqm</td><td>Floor area of the flat (square meters)</td></tr>
<tr><td>8</td><td>flat_model</td><td>Flat model</td><td>Category of HDB flat model</td></tr>
<tr><td>9</td><td>lease_commence_date</td><td>Lease commence date</td><td>Year the lease commenced</td></tr>
<tr><td>10</td><td>remaining_lease</td><td>Remaining lease</td><td>Remaining lease time (years / months)</td></tr>
<tr><td>11</td><td>resale_price</td><td>Resale price</td><td>Resale price (Singapore dollars, $)</td></tr>
</tbody>
</table>

A dataset example below:

<br>

### Secondary Data

Additional sources of data included:

- HDB Resale Price Index (https://data.gov.sg/dataset/hdb-resale-price-index): This index tracks the overall price movement of the public residential market (based on stratified hedonic regression methods). We used it to scale and normalize resale prices, removing the effect of market-wide price movements such as inflation. This adjustment was essential for comparing resale values from year to year.
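
The index adjustment described above amounts to dividing each price by the index value for its period and rescaling to a base level. The sketch below illustrates the arithmetic only: the function names, the quarter label format, and the index values are all hypothetical, and it assumes the index is keyed by quarter.

```python
def quarter_of(month: str) -> str:
    """Map 'YYYY-MM' to a quarter label such as '2019-Q3' (assumed label format)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def adjust_price(resale_price: float, index_at_sale: float,
                 base_index: float = 100.0) -> float:
    """Rescale a transacted price to the price level of the base period."""
    return resale_price * base_index / index_at_sale


# Hypothetical index values for illustration only (not real RPI figures).
rpi = {"2010-Q1": 80.0, "2019-Q3": 125.0}

adjusted = adjust_price(500_000, rpi[quarter_of("2019-07")])
print(round(adjusted))  # 400000
```

After this step, two flats sold in different years can be compared on the same price level.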


*plotted resale price index placeholder:*

![resale_price_index.jpg](attachment:resale_price_index.jpg)

### Location Data

In addition to the baseline data, we added various location features that we believed would be useful in helping predict flat prices. Location data for each dataset was standardized to the following format:
- **SRID**: 4326
- **Geometries**: (Longitude, Latitude) 
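
With every dataset in SRID 4326 and stored as (longitude, latitude) pairs, the distance from a flat to a nearby feature can be computed with the haversine formula. This is a minimal sketch under that assumption, not necessarily the method used in our pipeline; the function names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_km(flat, features):
    """Distance from a flat to the closest of several feature locations."""
    return min(haversine_km(flat, p) for p in features)


# Hypothetical coordinates, in (longitude, latitude) order.
flat = (103.80, 1.30)
stations = [(103.80, 1.40), (103.85, 1.30)]
print(round(nearest_km(flat, stations), 2))
```

Note the argument order: longitude comes first, matching the standardized geometry format above.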

*List of additional location features shown below:*


#### Schools
- Preschools
  - In addition to providing early childhood education, many preschools also double up as childcare services providing full-day care for children aged 1 to 6 years

- Primary Schools  
  - Primary education is for children aged 7 to 12 and the government’s motto is “every school is a good school”.  In reality however, some primary schools are more popular than others and as a result are frequently oversubscribed. 
  - When balloting for a spot, priority is typically given to those who live closer to the preferred schools (Source: https://www.moe.gov.sg/primary/p1-registration/distance).
  - While there are no official school rankings, we adopt a wisdom of the crowd approach and use the subscription rate (number of applications/number of spots available) as a proxy for school rankings and create a primary school score in addition to the location data.
- Secondary Schools 
  - Secondary education is for children aged 13 to 16.  Admission is based on scores attained in the Primary School Leaving Examination (PSLE), and a school's admission scores are generally a good indication of its quality.
  - Distance does not factor into admission criteria, but we generally believe that people want to stay in areas with good school options.
  - There are no official rankings, so we use the average PSLE score to calculate a secondary school score in addition to the location data.
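
The primary school score described above is a simple ratio of applications to available spots. A minimal sketch with hypothetical school names and counts:

```python
def subscription_rate(applications: int, vacancies: int) -> float:
    """Applications per available spot; values above 1 mean oversubscribed."""
    if vacancies <= 0:
        raise ValueError("vacancies must be positive")
    return applications / vacancies


# Hypothetical registration figures: (applications, vacancies).
schools = {"School A": (180, 120), "School B": (90, 120)}

scores = {name: subscription_rate(a, v) for name, (a, v) in schools.items()}
print(scores)  # {'School A': 1.5, 'School B': 0.75}
```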
  
#### Transport

- Taxi Stands
  - While street hires are possible and occur frequently, taxi stands enable taxis to wait for passengers (and vice versa), and staying close to a taxi stand can improve one’s chances of getting a taxi.
- Bus Stops
  - Buses are an important mode of public transport in Singapore. They can only stop to let passengers board and alight at designated bus stops, which makes these locations vital for accessibility.
- Transit Stations
  - Mass Rapid Transit (MRT) and Light Rail Transit (LRT) are the inter-town and intra-town modes of public transportation within Singapore, respectively.
  - Each town typically has one main MRT station, while some towns are additionally served by an LRT system.
  - In addition to the location of each station, we create a station score based on the level of connectivity of each station.
  - Scores were created as follows:
    - All stations start with a score of 1
    - If station is an LRT station: +0
    - If station is an MRT station: +1
    - For every line the station connects to: +1
  - Downloadable rail map [here](https://www.lta.gov.sg/content/dam/ltagov/getting_around/public_transport/rail_network/pdf/tel2_sm-20-03-en-exp.pdf)






<img src="https://www.lta.gov.sg/content/dam/ltagov/getting_around/public_transport/rail_network/image/tel2_sm-20-03-en-exp.png" width="600" height="340">
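
The station scoring rules listed above can be sketched as a small function. The function name is hypothetical, and the sketch takes one reading of the last rule: every line serving the station adds one point, including the first.

```python
def station_score(kind: str, n_lines: int) -> int:
    """Connectivity score for a transit station, following the rules above."""
    if kind not in ("MRT", "LRT"):
        raise ValueError(f"unknown station kind: {kind!r}")
    score = 1              # all stations start with a score of 1
    if kind == "MRT":
        score += 1         # MRT stations: +1 (LRT stations: +0)
    score += n_lines       # +1 for every line the station connects to
    return score


print(station_score("LRT", 1))  # 2
print(station_score("MRT", 1))  # 3
print(station_score("MRT", 3))  # 5
```

Under this reading, a three-line MRT interchange scores 5 while a single-line LRT stop scores 2.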


- Transit Station Exits
  - As most transit stations are either underground or above ground, the locations of station entrances and exits are indicative of accessibility in addition to the actual station location.
- Major Expressways
  - We suspect that staying close to a major expressway could be a **double-edged sword**.  On one hand, it provides increased accessibility; on the other, it brings undesirable traits like noise and light pollution.  In addition to the expressway itself, we also add the locations of expressway entrances and exits to provide richer expressway-related features.


#### Government Services 

  - Police Stations
    - Singapore is small and enjoys a low crime rate. As it is essentially a city-state, police coverage is widespread and comprehensive.
  - Fire Stations 
    - Similar to police coverage, fire coverage is likewise widespread and comprehensive.
  - Hospitals
    - As Singapore is relatively small, we do not expect the distance to a hospital to factor too much on flat prices.



#### Amenities

  - Shopping Malls
    - A major source of food, shopping and entertainment in Singapore, providing an all-in-one destination for residents to spend their free time.
  - Supermarkets
    - Provide a wide variety of food, beverages and household products for day-to-day use.
  - Wet Markets
    - An alternative to supermarkets, providing fresh uncooked food such as seafood, meat and vegetables.
    - Includes only markets owned and run by the National Environment Agency.
  - Hawker Centers
    - A key source of local cooked food for residents. Includes only hawker centers owned and run by the National Environment Agency.
  - Parks
    - Includes both minor parks like neighbourhood playgrounds and major parks like the Botanic Gardens, a UNESCO World Heritage Site. Parks provide an alternative outlet for leisure activities.
  - Waterbodies
    - Locations of water bodies such as major reservoirs. A proxy for potential waterfront or riverfront living, which typically draws a premium.
  - Conservation Areas
    - Historic areas identified by the Urban Redevelopment Authority in Singapore for conservation purposes.  Typically home to cafes, restaurants, bars and other small shops, providing an alternative destination for residents to spend their leisure time.
  - Custom Points of Interest
    - Other key areas of interest.
    - Geocoded using ArcGIS.





Additional sources of information:  
  - [Singapore primary schools](https://en.wikipedia.org/wiki/List_of_schools_in_Singapore)
  - [Singapore MRT](https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations)
  - [Singapore Shopping Malls](https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore)
  - [Singapore LRT](https://en.wikipedia.org/wiki/List_of_Singapore_LRT_stations)
  - [Singapore postal sector and districts](https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm)


### Ethical Considerations:

Our core data is governed by the Singapore Open Data License (https://data.gov.sg/open-data-licence), which aims to "promote and enable easy reuse of Public Sector data to create value for the Singapore area community and businesses".  Under its terms, we are allowed to use, access, download, copy, distribute, transmit, modify and adapt the datasets, and any derived analyses or applications. 

We followed these terms **explicitly**: we may not use the datasets in a way that suggests any official status, or that a Singapore agency endorses us or our use of the datasets.  Following the license guidance, our application/website displays a conspicuous notice (at the bottom of the website) acknowledging the source of the datasets and linking to the most recent version of the license. 

Note that the location data (such as rail station information, locations of taxi stands, etc.) is all publicly available.  This data was scraped and imported for use in the modeling. 




<br>

## EDA



After importing the data, we performed a conventional deep-dive Exploratory Data Analysis (EDA) to get a feel for the dataset.  
High-level: we had **867,891** rows of observations encompassing 11 features (month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, resale_price, and remaining_lease). This encompassed 11,747 days of data.

Summarized, this was 27 unique towns, 577 unique street names, 9520 unique addresses, 5 unique regions, 7 unique flat types, 33 flat models, and 25 distinct storey ranges.  
Flat Types: The most common flat type was 4-room (38%), followed by 3-room (32%) and 5-room (21%). **4-room flats were the most transacted flat type in the resale market**.  

Number of raw rooms: Room counts ranged from 1 through 5. 37.6% of flats had 4 rooms, 32.4% had 3 rooms, and 28.6% had 5 rooms, with 1-room and 2-room flats making up less than 1.5%.

Lease commence year spanned from 1966 to 2019. 

Floor Space: Floor area in square meters (m²) ranged from a minimum of 28 to as large as 307 (a range of 279). The average floor area was 95.7, with a median of 93. The majority of values fell between 28 and approximately 160 square meters.  

Storey Range: The storey ranges contained 25 distinct values and were somewhat imbalanced, with floors 4-6 (25.2%), floors 7-9 (22.8%), floors 1-3 (20.3%), and floors 10-12 (19.3%) dominating, and the remainder split in small percentages across the other 21 ranges. Effectively, the vast majority of flats did not extend beyond about 20 floors, with outliers going as high as 51 storeys.  

Normalized resale price histogram was right skewed. 

Plots were created to show the distribution of the initial features and the value counts per feature, and summary statistics were investigated and plotted as well. We examined the features for anomalies in the data that were candidates for potential deletion (so as not to skew our model). Plotting the average resale price per square meter against the feature categories proved especially insightful.   

Certain towns appear to have higher value; we can attribute that to their location, for instance within the Central Region (which is closest to the city center). 

There was a high amount of correlation between features, so the Variance Inflation Factor (VIF) was used to analyze multicollinearity (show VIFs). 
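As a rough illustration of what a VIF measures: for feature $j$ it is defined as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. The sketch below covers only the special case of two features, where $R_j^2$ reduces to the squared Pearson correlation between them. The data values and the helper name `vif_two_features` are hypothetical and are not taken from our notebooks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_features(x1, x2):
    """VIF when only two predictors are present: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values: floor area (sq m) and room count move together.
floor_area_sqm = [45, 67, 68, 92, 93, 110, 121, 145]
n_rooms = [2, 3, 3, 4, 4, 5, 5, 5]

vif = vif_two_features(floor_area_sqm, n_rooms)
print(f"VIF for floor_area_sqm vs n_rooms: {vif:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; with more than two features, a full regression per feature is needed instead of this shortcut.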



**Correlations Observed:**
- One of the strongest correlations was seen between normalized resale price and floor area in sq_m. This was to be expected.  
- Floor area sq_m and n_rooms
- Floor area sq_m and flat_type (moderate)
- There appeared to be a moderate correlation between resale price and height (storey range).  This was confirmed with EDA.

Correlation analysis was run multiple times (on the initial data, the cleaned data, and the final data with features). The following types of correlation matrices were examined: Spearman, Pearson, Kendall, Cramér's V, and Phi_k. Some of these compare only true numeric values, while others can take categorical, ordinal, and interval variables into consideration, in addition to capturing non-linear dependence. Spearman appeared to catch nonlinear monotonic correlations better than Pearson.
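The difference between Pearson and Spearman on a monotonic but non-linear relationship can be sketched with a small standard-library example. Spearman is simply the Pearson correlation computed on ranks. The data below is hypothetical (a cubic relationship chosen for illustration), and the helper names are ours rather than part of any library.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: monotonic but strongly non-linear.
floor_area = [30, 45, 60, 75, 90, 105, 120, 150]
price = [a ** 3 for a in floor_area]

pearson_r = pearson(floor_area, price)
spearman_r = spearman(floor_area, price)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the ordering of the two variables agrees perfectly, the rank-based measure reports a perfect correlation while Pearson is pulled below 1 by the curvature.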


<br>




#### EDA Notebooks:


* EDA part I:  [github](https://github.com/mcmanus-git/Singapore-HDB/blob/main/tom/nb_EDA_pt_I_of_II.ipynb) [notebook](https://nbviewer.org/github/mcmanus-git/Singapore-HDB/blob/main/tom/nb_EDA_pt_I_of_II.ipynb) [html](https://htmlpreview.github.io/?https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/nb_EDA_pt_I_of_II.html)
* EDA part II:  [github](https://github.com/mcmanus-git/Singapore-HDB/blob/main/tom/nb_EDA_pt_I_of_II.ipynb) [notebook](https://nbviewer.org/github/mcmanus-git/Singapore-HDB/blob/main/tom/nb_EDA_pt_I_of_II.ipynb) [html](https://htmlpreview.github.io/?https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/nb_EDA_pt_I_of_II.html)
* Feature Correlation Matrix [square](https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/images/correlation_matrix_baseline.png) [triangular](https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/images/correlation_matrix_baseline_triangular.png) 

![correlation_matrix_baseline_triangular.png](attachment:correlation_matrix_baseline_triangular.png)

![correlation_with_price_per-sqm_normed.png](attachment:correlation_with_price_per-sqm_normed.png)

<br>

#### Observations:
* There was a high amount of correlation between features; VIF was used for analysis 
* The top 10 storey_ranges encompassed the vast majority of flats. In addition, there appeared to be a moderate correlation between resale price and height (storey range) 
* HDB flat floor area ranged from 28-307 square meters, in a trimodal distribution 
* Normalized resale price histogram was right skewed
* **Correlation** - one of the strongest correlations was seen between normalized resale price and floor area in sq_m. 
* Certain towns appear to have higher value; we can attribute that to their location, for instance within the Central Region (which is closest to the city center)

<br>

## Feature Engineering

#### Initial Dataset - Cleaning and Feature Creation:


* Initial cleaning of the dataset to consolidate repeating flat_types
* Merged the block and the street_name to create an address
* Created new storey_range_min and storey_range_max features, which were split from the storey_range text input (04 to 06 became 4 and 6) 
* n_rooms (number of rooms) was created via the flat_? (encompassing flats with 1 to 5 rooms)
* floor_area_sqm
* One-hot encoded features such as abc and def 
* remaining_lease_years was computed from the original lease_commence_date and the 99-year lease term
* remaining_lease_years was also created from remaining_lease, dropping out remaining_lease_months. 
* It was necessary to normalize the resale price values taking into consideration the HDB Resale Price Index, which can be found [here](https://data.gov.sg/dataset/hdb-resale-price-index); this currently covers the time frame from January 1990 to September 2021. This data allowed us to normalize our resale price to a comparable value.  Thus resale_price_norm was created from resale_price (converted via index). 
* We attempted to remove outliers

* Multiple original-price features were created mathematically:  price_per_sq_ft, price_per_sq_m, price_per_sq_ft_per_lease_yr, and price_per_sq_m_per_lease_yr.  
* Multiple normalized-price features were created mathematically:  price_per_sq_ft_norm, price_per_sq_ft_per_lease_yr_norm, price_per_sq_m_norm, and price_per_sq_m_per_lease_yr_norm. 
* End result was the following [features](https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/feature_engineering/initial_cleaned_features.txt), with the following as an [example observation](https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/blog/initial_features_normed.jpg). 
* Singapore towns were mapped to pertinent regions (for instance, Central Region)
* Data was periodically exported in parquet format for speed (pkl)
* flat_model was later identified as a candidate for removal (20 unique values), as it appeared to overlap with information already captured by other features  
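A few of the steps in the list above can be sketched in plain Python. This is a minimal illustration and not the code from our notebooks: the function names, the example record, and the index value are hypothetical, and the exact normalization formula (rescaling by the price index relative to a base of 100) is an assumption.

```python
import re

LEASE_TERM_YEARS = 99  # HDB flats are issued on 99-year leaseholds


def split_storey_range(storey_range):
    """Turn a range such as '04 TO 06' into integer bounds (4, 6)."""
    low, high = re.findall(r"\d+", storey_range)
    return int(low), int(high)


def remaining_lease_years(lease_commence_year, sale_year):
    """Whole years left on the 99-year lease at the time of sale."""
    return LEASE_TERM_YEARS - (sale_year - lease_commence_year)


def normalize_price(resale_price, index_at_sale, base_index=100.0):
    """Rescale a price to the index base period (assumed formula)."""
    return resale_price * base_index / index_at_sale


# Hypothetical transaction record.
record = {
    "storey_range": "04 TO 06",
    "lease_commence_date": 1985,
    "sale_year": 2015,
    "floor_area_sqm": 90.0,
    "resale_price": 400_000.0,
    "resale_price_index": 125.0,  # hypothetical index value
}

storey_min, storey_max = split_storey_range(record["storey_range"])
lease_left = remaining_lease_years(record["lease_commence_date"], record["sale_year"])
resale_price_norm = normalize_price(record["resale_price"], record["resale_price_index"])
price_per_sq_m_norm = resale_price_norm / record["floor_area_sqm"]

print(storey_min, storey_max, lease_left, resale_price_norm, round(price_per_sq_m_norm, 2))
```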




<br>

*(cleaned features - base dataset - example observation):*

![features_initial_normed.bmp](attachment:features_initial_normed.bmp)

<br>

#### Merging Geo-spatial features for more data


An *enormous* number of Singapore geospatial location features were consolidated and pushed to the database:
- Market /  Food Centers
- Road information 
- Fire stations, police stations, healthcare/hospitals
- MRT train station data (including number of lines, exit locations, etc.)
- Bus stops, taxi stands
- Schools (pre-schools, primary schools, secondary schools, high schools, etc.), along with their ranked scores 
- Conservation area information 
- Conservation area information 

With these in place, it was possible to calculate the distance from each flat to these nodes via the 'geometry' feature. 

Python code was used to query the OneMap API.

Coordinate reference system: WGS 84 (EPSG:4326), https://epsg.io/4326
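To make the distance calculation concrete, here is a minimal sketch of a great-circle (haversine) distance between latitude/longitude pairs in EPSG:4326. It is a stand-in for illustration and is not necessarily how the 'geometry' feature was used in the database; the coordinates, station names, and the helper `nearest` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two WGS 84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(flat, points):
    """Return (name, distance_km) of the closest point of interest."""
    candidates = (
        (name, haversine_km(flat[0], flat[1], lat, lon))
        for name, (lat, lon) in points.items()
    )
    return min(candidates, key=lambda item: item[1])


# Hypothetical coordinates (latitude, longitude).
flat = (1.3010, 103.8010)
stations = {
    "station_a": (1.3000, 103.8000),
    "station_b": (1.3500, 103.9000),
}

name, distance_km = nearest(flat, stations)
print(f"Nearest station: {name} at {distance_km:.3f} km")
```

The same pattern extends to any of the node types listed above (schools, bus stops, food centers), producing one "distance to nearest" feature per node type.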


<br>

## Machine Learning Modeling Approach

**Initial Dataset** - The initial dataset (prior to embedding location features) was split chronologically and was trained with Linear Regression, XGBoost, and Random Forest Regression models. 
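A chronological split keeps the most recent transactions out of training, so the model is always evaluated on data that is later than anything it has seen. The sketch below shows the idea on hypothetical rows; the 70/15/15 fractions and the function name are assumptions for illustration and not the proportions we necessarily used.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15, key="month"):
    """Sort rows by time and cut them into train / validation / test blocks."""
    ordered = sorted(rows, key=lambda r: r[key])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical observations: 24 months of sales, deliberately out of order.
# 'YYYY-MM' strings sort lexicographically in chronological order.
rows = [
    {"month": f"{2015 + m // 12}-{m % 12 + 1:02d}", "resale_price": 300_000 + 1_000 * m}
    for m in reversed(range(24))
]

train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
print(train[-1]["month"], "<", val[0]["month"], "<", test[0]["month"])
```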





**Evaluation** - In order to evaluate our model, we needed metrics to tell us how accurate our predictions were, and what was the amount of deviation from the actual values. In order to determine how well the model fit the data, we used $R^2$ (the proportion of variance explained), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE). 
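For reference, the four metrics can be computed from first principles as in the sketch below. The values are hypothetical and the function name is ours; in practice a library implementation would normally be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, RMSE and MAE for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.3f}")
```

RMSE is the square root of MSE, so it is expressed in the same units as the target, which makes it easier to read alongside MAE.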

The validation dataset was used both for an initial look at performance and for tuning the hyperparameters, while the test dataset was held out to evaluate the model's performance. 

After extensive tuning, the Random Forest Regression model on the initial dataset was able to obtain a training $R^2$ value of *insrt*, a validation $R^2$ value of *insrt*, and a final test $R^2$ value of *insrt*. 

For XGBoost, `objective='reg:squarederror'` specifies that, since we are faced with a regression problem, the objective is to minimize the squared error. As always, the goal is to minimize the error, so the lower the MSE and RMSE are, the better. 




**Final Dataset (including location features)** - After the location features were embedded, the final dataset was trained with XGBoost and Random Forest Regression models. 


**Interpretation:** - The ability to interpret our model's output and feature importances was important to us. A conventional Linear Regression model has standard coefficient outputs that are easy to understand, but tree-based models have no such coefficients, so given the complexity of our data we needed other ways to explain them.  

Feature importance outputs were available from the Random Forest Regression model (example here). 

[SHAP](https://shap.readthedocs.io/en/latest/index.html) (SHapley Additive exPlanations) is a graphical/numeric approach to explain the output of any machine learning model, and it is one of the items we chose to use to help explain our model outputs. An example of this output is provided [here](). 

**Overall Observations**:
- Overfitting in models is ***very*** common at the beginning and is critical to identify (if the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model has most likely overfit the training set). Tuning all models was necessary due to the large number of features we utilized.  


* Random Forest is a bagging technique that trains a number of decision trees on various subsets of the given data and combines their predictions to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree; for regression, the final output is the average of those predictions (majority voting is used for classification). It is an ensemble algorithm. A greater number of trees generally reduces variance and helps limit overfitting, with diminishing returns. Although random forest regression can lead to high accuracy, appears robust against outliers, and works well with non-linear data, there is a cost: forests take a fair amount of time to train, and they are not always a great fit for linear relationships or for data with a lot of sparse features.   


* XGBoost is hard to tune because of its many hyperparameters. The more flexible and powerful an algorithm is, the more design decisions and adjustable hyperparameters it tends to have. Even restricting attention to roughly six key hyperparameters (for instance learning rate, subsample, and min child weight), a grid search takes considerable time. Like Random Forest, XGBoost is decision-tree based.  


* One thing we could have done better was to spread the training over multiple compute nodes, as it was a time-consuming task. An exhaustive grid search can take a long time to produce results. 
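The cost of exhaustive grid search grows multiplicatively with each hyperparameter added, which the following sketch makes explicit. The grid values and the 5-fold setting are hypothetical and are not the grid we actually searched.

```python
import itertools

# Hypothetical search grid over four XGBoost hyperparameters.
param_grid = {
    "max_depth": [5, 7, 9],
    "min_child_weight": [1, 6],
    "subsample": [0.75, 1.0],
    "learning_rate": [0.05, 0.16, 0.3],
}
n_folds = 5  # assumed number of cross-validation folds

# Every combination of values is one candidate model configuration.
combinations = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} model fits")
print("First candidate:", combinations[0])
```

Adding a fifth hyperparameter with three candidate values would triple the number of fits, which is why randomized search or distributing the fits across compute nodes becomes attractive.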



<br>

**Overall Observations - Base**:

Base Model (not including location features):

![feature_importances_random_forest_regressor.png](attachment:feature_importances_random_forest_regressor.png)

<br>

![prediction_error_random_forest_regressor_baseline.png](attachment:prediction_error_random_forest_regressor_baseline.png)

<br>

![residuals_random_forest_regressor_baseline.png](attachment:residuals_random_forest_regressor_baseline.png)

<br>

## Overall Model Results

**Feature Importances** - The most important features appeared to be: 
 - a
 - b
 - c
 - d
 - e

**Initial Dataset** - Feature importances for our best baseline random forest regressor are found here, with the following key variables:

```python

# Final Dataset (including location features):  Current Model

# --- R2 Scores ---
# Train:       0.962
# Validation:  0.897
# Test:        0.795

# --- MSE ---
# Train:       11.153
# Validation:  40.907
# Test:        71.863


# New Model Output Results:

# --- Test Set ---
# Mean Absolute Error: ... 6.078747834396562
# Mean Squared Error:..... 67.18
# RMSE: .................. 8.196381335337657
# Coeff of det (R^2):..... 0.809 (1.4 % better)

# --- Val Set ---
# Mean Absolute Error: ... 4.498476271962494
# Mean Squared Error:..... 38.02
# RMSE: .................. 6.166026942280251
# Coeff of det (R^2):..... 0.904 (0.7 % better)

# --- Train Set ---
# Mean Absolute Error: ... 2.339212417385334
# Mean Squared Error:..... 10.24
# RMSE: .................. 3.200258043072929
# Coeff of det (R^2):..... 0.965


# New Model Hyperparameters (XGBoost-based)
xgb_params = dict(
    max_depth=7,
    min_child_weight=6,
    gamma=10,
    subsample=0.75,
    colsample_bytree=0.5,
    reg_alpha=100,
    reg_lambda=1,
    n_estimators=800,  # can add more if desired
    learning_rate=0.16,
    seed=42,
    tree_method='hist',
)
```

<br>

*hyperparameters that seemed to matter:*

![image.png](attachment:image.png)

<br>

*SHAP final values (re-insert):*

![shap_cut.png](attachment:shap_cut.png)

<br>

## Model Limitations

It should be understood that market forces were at play throughout our data's time range, so predictions of resale prices will not be perfect.   

The last two years were ...







<br>

## The Application / Front-End

### Database

We chose to host the project's raw data in a database on Amazon.  Amazon Relational Database Service (**RDS**) is a collection of managed cloud services that let us set up, operate, and scale our database instance in the cloud. 

Our choice for the database version was **PostgreSQL**, a powerful open source object-relational database system with many years of active development and a strong reputation for reliability, robustness, and performance.  

Our code interacted with the database via **SQLAlchemy** (a Python SQL toolkit and Object Relational Mapper library). This allowed an engine [connection](https://github.com/mcmanus-git/Singapore-HDB/raw/main/tom/blog/sqla_engine_arch.png) to upload, store, manipulate, merge, and join various database tables. The PostgreSQL dialect uses psycopg2 as the default Python DBAPI. We were also able to connect to the database remotely via the Postgres tool *pgAdmin*, which made it easy to view changes. 

We enabled **PostGIS** (an [extension](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.PostGIS.html) in PostgreSQL for storing and managing spatial information) in AWS to consolidate location features. Spatial tables were set up in the database. 



<br>


### Database Contents:

The view below shows the large number of database tables created:


![huge_data_view.svg](attachment:huge_data_view.svg)

- 28 separate database tables

<br>

### Applying the Algorithm:
Database:
insert a bunch of info here 

### Pipeline:


All data is housed in the Amazon RDS Postgres database, and updates to database tables are pushed periodically. 


<br>

## Further Investigation and Research


Some additional areas we plan on researching and investigating:
    
**Modified Distance** - Potentially replacing or supplementing the straight-line geo-spatial distance features with a 'Manhattan' form (i.e. [taxi-cab](https://en.wikipedia.org/wiki/Taxicab_geometry) / city-block geometry). Singapore is a very walkable city, and the trip from an HDB flat to a nearby feature location (such as a hospital) is often only possible via sidewalks or city streets. We have also considered adding, in the future, a feature for the total travel time (whether walking or driving) from the flat to the specific destination location.  
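A minimal sketch of how a Manhattan-style distance could sit alongside the straight-line one is shown below. It assumes a street grid aligned with the north-south and east-west axes and uses a local flat-earth (equirectangular) approximation, which is reasonable over the short distances involved; both are simplifying assumptions, and the coordinates and function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def local_offsets_m(lat1, lon1, lat2, lon2):
    """East-west and north-south offsets in meters (flat-earth approximation)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = EARTH_RADIUS_M * math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = EARTH_RADIUS_M * math.radians(lat2 - lat1)
    return dx, dy


def straight_line_m(lat1, lon1, lat2, lon2):
    """Euclidean distance between the two points."""
    dx, dy = local_offsets_m(lat1, lon1, lat2, lon2)
    return math.hypot(dx, dy)


def manhattan_m(lat1, lon1, lat2, lon2):
    """City-block distance: travel along the two axes separately."""
    dx, dy = local_offsets_m(lat1, lon1, lat2, lon2)
    return abs(dx) + abs(dy)


# Hypothetical flat and clinic coordinates (latitude, longitude).
flat = (1.3000, 103.8000)
clinic = (1.3030, 103.8040)

straight = straight_line_m(*flat, *clinic)
manhattan = manhattan_m(*flat, *clinic)
print(f"Straight-line: {straight:.0f} m, Manhattan: {manhattan:.0f} m")
```

The Manhattan distance is never shorter than the straight-line distance and is at most about 1.41 times longer, so it mainly changes the feature for destinations that lie diagonally from the flat.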


**Crime Data** - Overlaying another set of features associated with historical crime (similar to New York City's [statistics](https://www1.nyc.gov/site/nypd/stats/crime-statistics/citywide-crime-stats.page)) is an option. Although crime is extremely low in Singapore, it may be interesting to see whether it is a factor in resale prices. 


**PLH (Prime Location Housing)** - Scenarios covering areas where there are currently no HDB flats (with sales coming direct from HDB rather than through a resale transaction). PLH is a recently launched housing scheme that places more restrictions on resale. The [concept](https://www.channelnewsasia.com/singapore/keppel-club-bto-hdb-plh-model-how-much-property-analysts-2623251) is that the machine learning model could estimate the amount a resale might fetch, allowing a mapping from the HDB price to a projected appreciation. Under this new model for public flats in prime areas, owners of BTO flats face a 10-year minimum occupation period, the flats are priced with additional subsidies, and those who sell their BTO units must pay back a percentage of the resale price to HDB. The resale buyer criteria for these units will also be tighter than for typical resale units…  

**Unsupervised Clustering** - Deeper dive into clustering  

**Interactivity** - Plan to investigate adding more features to the front-end application, including additional drop-down menus



<br>

## Statement of Work

The work breakdown was as follows:

**Michael** - *insert*

**Stuart** - *insert*

**Tom** - *insert*


<br>

## Appendix

*ArcGIS view, if needed, allowing layers to be viewed for familiarity with the area: [LINK](https://www.arcgis.com/apps/mapviewer/index.html?webmap=90a6084f84134b13a463168b9dee30d9&extent=103.4982,1.2289,103.9586,1.4936)*

<br>

<br>

<br>