### Objective: Create the mallsWithin2kmRadius attribute
Steps:
1. Create the malls dataset by scraping Wikipedia's list of Singapore shopping malls
2. For each HDB resale record, find the malls within 2 km of the flat and append them to the training dataset

#### 1.1 Scraping Wikipedia mall information

In [55]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [72]:
url = 'https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore'
result = requests.get(url)
doc = BeautifulSoup(result.text, 'html.parser')


In [73]:
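def convert_mall(list_item):
    """Return (mall name, relative wiki link); the link is '' when the item has none."""
    for link in list_item.find_all('a'):
        # .get avoids a KeyError on anchors that have no href attribute
        if link.get('href', '').startswith("/wiki"):
            return (list_item.text, link['href'])
    return (list_item.text, "")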

In [74]:
mall_links = [convert_mall(list_item) 
            for div_section in doc.find_all('div', class_='div-col') 
            for list_item in div_section.find_all('li')
        ]
mall_links

        

[('100 AM[1]', ''),
 ('313@Somerset[2]', ''),
 ('Aperia', ''),
 ('Balestier Hill Shopping Centre', ''),
 ('Bugis Cube[3]', ''),
 ('Bugis Junction', '/wiki/Bugis_Junction'),
 ('Bugis+', '/wiki/Bugis%2B'),
 ('Capitol Piazza', '/wiki/Capitol_Piazza'),
 ('Cathay Cineleisure Orchard', '/wiki/Cathay_Cineleisure_Orchard'),
 ('Clarke Quay Central', ''),
 ('The Centrepoint', '/wiki/The_Centrepoint'),
 ('City Square Mall', '/wiki/City_Square_Mall_(Singapore)'),
 ('City Gate Mall[4]', ''),
 ('CityLink Mall', '/wiki/CityLink_Mall'),
 ('Duo', '/wiki/DUO'),
 ('Far East Plaza', '/wiki/Far_East_Plaza'),
 ('Funan', '/wiki/Funan,_Singapore'),
 ('Great World City', '/wiki/Great_World_City'),
 ('HDB Hub', '/wiki/HDB_Hub'),
 ('Holland Village Shopping Mall', '/wiki/Holland_Village,_Singapore'),
 ('ION Orchard', '/wiki/ION_Orchard'),
 ('Junction 8', '/wiki/Junction_8_Shopping_Centre'),
 ('Knightsbridge[5]', ''),
 ('Liat Towers', '/wiki/Liat_Towers'),
 ('Lucky Plaza', '/wiki/Lucky_Plaza'),
 ('Marina Bay Sand

In [75]:
df = pd.DataFrame(mall_links, columns=['Name', 'Link'])
df.head(10)

Unnamed: 0,Name,Link
0,100 AM[1],
1,313@Somerset[2],
2,Aperia,
3,Balestier Hill Shopping Centre,
4,Bugis Cube[3],
5,Bugis Junction,/wiki/Bugis_Junction
6,Bugis+,/wiki/Bugis%2B
7,Capitol Piazza,/wiki/Capitol_Piazza
8,Cathay Cineleisure Orchard,/wiki/Cathay_Cineleisure_Orchard
9,Clarke Quay Central,
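
The scraped names still carry Wikipedia footnote markers such as [1]. A minimal sketch of stripping them before the names are used for lookups is shown here; strip_footnote_markers is a hypothetical helper that is not part of the original notebook, and the sample names are copied from the output above.

```python
import re

import pandas as pd

# Hypothetical helper (not in the original notebook): removes Wikipedia
# footnote markers such as "[1]" from a scraped mall name.
FOOTNOTE_PATTERN = re.compile(r"\[\d+\]")


def strip_footnote_markers(name: str) -> str:
    return FOOTNOTE_PATTERN.sub("", name).strip()


sample = pd.DataFrame({"Name": ["100 AM[1]", "313@Somerset[2]", "Aperia"]})
sample["Name"] = sample["Name"].apply(strip_footnote_markers)
print(sample["Name"].tolist())
```

Cleaning the names first may also help the later geocoding step, since the marker is not part of the mall's real name.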


#### 1.2 Add Opening Date
Retrieve the opening date from each mall's Wikipedia page, if one is linked

In [76]:
def get_opening_date(url_ext):
    # Malls without a wiki link have nothing to look up, so skip the request
    if not url_ext:
        return ''

    url = f"https://en.wikipedia.org{url_ext}"
    result = requests.get(url)
    doc = BeautifulSoup(result.text, 'html.parser')

    # The infobox labels the row either 'Opened' or 'Opening date'
    opened_row = doc.find('th', string='Opened') or doc.find('th', string='Opening date')
    if opened_row:
        # The date is in the 'td' that follows the 'th' label
        opened_date_cell = opened_row.find_next_sibling('td')
        return opened_date_cell.text if opened_date_cell else ''
    return ''

In [77]:
df['Opening Date'] = df['Link'].apply(get_opening_date)
df.head(10)

Unnamed: 0,Name,Link,Opening Date
0,100 AM[1],,
1,313@Somerset[2],,
2,Aperia,,
3,Balestier Hill Shopping Centre,,
4,Bugis Cube[3],,
5,Bugis Junction,/wiki/Bugis_Junction,"September 8, 1995; 28 years ago (1995-09-08)"
6,Bugis+,/wiki/Bugis%2B,1 June 2009 (as Iluma)
7,Capitol Piazza,/wiki/Capitol_Piazza,March 2015
8,Cathay Cineleisure Orchard,/wiki/Cathay_Cineleisure_Orchard,1997
9,Clarke Quay Central,,
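
The scraped opening dates arrive in several formats (a full date with an ISO date in parentheses, day-month-year, month-year, or year only). The notebook later curates these by hand; as an alternative, a best-effort parser could look like the sketch below. parse_opening_date is a hypothetical helper, it assumes English month names, and it defaults a missing day or month to 1.

```python
import re
from datetime import datetime
from typing import Optional

# Patterns are tried from most to least specific
PATTERNS = (
    (re.compile(r"\((\d{4}-\d{2}-\d{2})\)"), "%Y-%m-%d"),      # "(1995-09-08)"
    (re.compile(r"\b(\d{1,2} [A-Z][a-z]+ \d{4})\b"), "%d %B %Y"),  # "1 June 2009"
    (re.compile(r"\b([A-Z][a-z]+ \d{4})\b"), "%B %Y"),          # "March 2015"
    (re.compile(r"\b(\d{4})\b"), "%Y"),                          # "1997"
)


def parse_opening_date(text: str) -> Optional[datetime]:
    """Hypothetical best-effort parser; returns None when nothing matches."""
    if not text:
        return None
    # Wikipedia infoboxes often use non-breaking spaces
    text = text.replace("\xa0", " ")
    for pattern, fmt in PATTERNS:
        match = pattern.search(text)
        if match:
            try:
                return datetime.strptime(match.group(1), fmt)
            except ValueError:
                # e.g. a capitalised word that is not a month name
                continue
    return None


for raw in ["September 8, 1995; 28 years ago (1995-09-08)",
            "1 June 2009 (as Iluma)", "March 2015", "1997", ""]:
    print(repr(raw), "->", parse_opening_date(raw))
```

Results from a parser like this would still need a manual check, because the infobox text can describe an earlier name or a reopening rather than the current mall.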


#### 1.3 Add Geocoordinates
Use the OneMap SG API to look up each mall's coordinates by name, if available

In [78]:
def get_geocoordinates_from_address(address: str) -> tuple:
    url = "https://www.onemap.gov.sg/api/common/elastic/search"
    # Pass the query via params so names such as 'Bugis+' are URL-encoded correctly
    params = {"searchVal": address, "returnGeom": "Y", "getAddrDetails": "Y", "pageNum": 1}

    response = requests.get(url, params=params)
    data = response.json()

    # Use the first match, if there is one
    if data.get('results'):
        first_result = data['results'][0]
        return (first_result['LATITUDE'], first_result['LONGITUDE'])
    return ("", "")

In [79]:
df[['Latitude', 'Longitude']] = df['Name'].apply(lambda address: pd.Series(get_geocoordinates_from_address(address)))
df.head(10)

Unnamed: 0,Name,Link,Opening Date,Latitude,Longitude
0,100 AM[1],,,1.28155949555229,103.847208361003
1,313@Somerset[2],,,1.30101436404056,103.838360664485
2,Aperia,,,1.3097112065077,103.864326436447
3,Balestier Hill Shopping Centre,,,1.32559594839311,103.842571612968
4,Bugis Cube[3],,,,
5,Bugis Junction,/wiki/Bugis_Junction,"September 8, 1995; 28 years ago (1995-09-08)",1.2991371723215,103.855450325604
6,Bugis+,/wiki/Bugis%2B,1 June 2009 (as Iluma),1.30095171530648,103.855172625542
7,Capitol Piazza,/wiki/Capitol_Piazza,March 2015,1.29307884763132,103.851261982149
8,Cathay Cineleisure Orchard,/wiki/Cathay_Cineleisure_Orchard,1997,1.30149264852924,103.836406753067
9,Clarke Quay Central,,,,
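
Mall names such as Bugis+ and 313@Somerset contain characters that have a special meaning in a query string, so they should be percent-encoded before being sent. The sketch below only builds the URL and makes no network call; build_search_url is a hypothetical helper, and how the OneMap server treats an unencoded + is an assumption rather than something verified here.

```python
from urllib.parse import urlencode

ONEMAP_SEARCH_URL = "https://www.onemap.gov.sg/api/common/elastic/search"


def build_search_url(address: str) -> str:
    """Hypothetical helper: returns the search URL with an encoded query string."""
    query = urlencode({
        "searchVal": address,
        "returnGeom": "Y",
        "getAddrDetails": "Y",
        "pageNum": 1,
    })
    return f"{ONEMAP_SEARCH_URL}?{query}"


# A literal '+' becomes %2B, '@' becomes %40, and spaces become '+'
print(build_search_url("Bugis+"))
print(build_search_url("313@Somerset"))
print(build_search_url("Clarke Quay Central"))
```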


In [80]:
df.describe()

Unnamed: 0,Name,Link,Opening Date,Latitude,Longitude
count,169,169.0,169.0,169.0,169.0
unique,168,91.0,69.0,143.0,143.0
top,Junction 8,,,,
freq,2,79.0,96.0,25.0,25.0
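
In this summary the most frequent value of Link, Opening Date, Latitude and Longitude is the empty string, which is how the scraping functions mark missing data. A small sketch of counting those gaps directly, using made-up rows shaped like the scraped table, could look like this:

```python
import pandas as pd

# Made-up rows shaped like the scraped table, for illustration only
sample = pd.DataFrame({
    "Name": ["Mall A", "Mall B", "Mall C"],
    "Link": ["", "/wiki/Mall_B", ""],
    "Latitude": ["1.30", "", ""],
})

# Treat empty strings as missing so that isna() can count them
missing_counts = sample.replace("", pd.NA).isna().sum()
print(missing_counts)
```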


#### 1.4 Save to CSV
This dataset is built from scraped data and is significantly incomplete: the summary above shows many rows with no link, opening date, or coordinates. The missing data will be added manually.

In [81]:
df.to_csv('../data/modified/malls_dataset_v1.csv')

#### 1.5 Conduct EDA + Data Cleaning
Check for missing data and add information manually
- Unable to find the coordinates of 17 malls via the API. They either do not exist or have closed down.

Of the above, the following malls have closed:
- The Verge
- City Vibe
- JCube
- Jurong Entertainment Centre
- Ellenborough Market
- Capitol Centre
- Amber Mansions
- Serangoon Plaza
- Specialist Shopping Centre

The coordinates of the following malls were added manually:
- Clarke Quay Central
- Scotts Shopping Centre -> Scotts Square
- Shaw House and Centre -> Shaw House
- Mandarin Gallery
- Cosford Container Park
- Change Alley


In [91]:
raw_df = pd.read_csv('../data/modified/malls_dataset_v1.csv')
raw_df = raw_df.drop(columns=['Unnamed: 0', 'Link'])
raw_df.head(10)

Unnamed: 0,Name,Opening Date,Latitude,Longitude
0,100 AM[1],,1.281559,103.847208
1,313@Somerset[2],,1.301014,103.838361
2,Aperia,,1.309711,103.864326
3,Balestier Hill Shopping Centre,,1.325596,103.842572
4,Bugis Cube[3],,,
5,Bugis Junction,"September 8, 1995; 28 years ago (1995-09-08)",1.299137,103.85545
6,Bugis+,1 June 2009 (as Iluma),1.300952,103.855173
7,Capitol Piazza,March 2015,1.293079,103.851262
8,Cathay Cineleisure Orchard,1997,1.301493,103.836407
9,Clarke Quay Central,,,


In [92]:
malls_no_coordinates = raw_df[raw_df['Latitude'].isna()]
malls_no_coordinates.head(5)

Unnamed: 0,Name,Opening Date,Latitude,Longitude
4,Bugis Cube[3],,,
9,Clarke Quay Central,,,
12,City Gate Mall[4],,,
19,Holland Village Shopping Mall,,,
22,Knightsbridge[5],,,


In [93]:
malls_no_opening_dates = raw_df[raw_df['Opening Date'].isna()]
malls_no_opening_dates.head(5)

Unnamed: 0,Name,Opening Date,Latitude,Longitude
0,100 AM[1],,1.281559,103.847208
1,313@Somerset[2],,1.301014,103.838361
2,Aperia,,1.309711,103.864326
3,Balestier Hill Shopping Centre,,1.325596,103.842572
4,Bugis Cube[3],,,


The rest of the data was updated manually and saved to malls_dataset.csv.

#### 2.1 Calculating mallsWithin2kmRadius

In [75]:
import pandas as pd

malls_df = pd.read_csv('../data/modified/malls_dataset.csv')
malls_df.head(10)

Unnamed: 0.1,Unnamed: 0,name,latitude,longitude,opening_date
0,0,100 AM,1.274683,103.8435,1/12/2014
1,1,313@Somerset,1.301014,103.8384,3/12/2009
2,2,Aperia,1.309711,103.8643,5/5/2014
3,3,Balestier Hill Shopping Centre,1.325596,103.8426,1/1/1974
4,4,Bugis Cube,1.298141,103.8556,1/1/2009
5,5,Bugis Junction,1.299137,103.8555,8/9/1995
6,6,Bugis+,1.300952,103.8552,28/6/2012
7,7,Capitol Piazza,1.293079,103.8513,1/3/2015
8,8,Cathay Cineleisure Orchard,1.301521,103.8364,1/5/1999
9,9,The Centrepoint,1.30145,103.84,1/1/1983


In [76]:
# reformat date
from datetime import datetime

malls_df['opening_date'] = malls_df['opening_date'].apply(lambda date: datetime.strptime(date, "%d/%m/%Y"))
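
The same day/month/year conversion can be done on the whole column at once with pd.to_datetime. This is a sketch on a small sample whose dates are copied from the table above, not a change to the notebook's own cell:

```python
import pandas as pd

# Dates copied from the first rows of the malls table (day/month/year)
sample = pd.DataFrame({"opening_date": ["1/12/2014", "3/12/2009", "8/9/1995"]})

# An explicit format prevents pandas from guessing month/day order
sample["opening_date"] = pd.to_datetime(sample["opening_date"], format="%d/%m/%Y")
print(sample["opening_date"].dt.strftime("%Y-%m-%d").tolist())
```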

In [77]:
malls_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Unnamed: 0    161 non-null    int64         
 1   name          161 non-null    object        
 2   latitude      161 non-null    float64       
 3   longitude     161 non-null    float64       
 4   opening_date  161 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 6.4+ KB


In [78]:
class Coordinate:
    def __init__(self, lon, lat):
        self.lon = lon
        self.lat = lat

    def get_lon(self):
        return self.lon
    
    def get_lat(self):
        return self.lat
    
    def get_lat_lon(self):
        return (self.lat, self.lon)

In [79]:
from geopy.distance import geodesic

def calculate_distance(start: Coordinate, end: Coordinate) -> float:
    return geodesic(start.get_lat_lon(), end.get_lat_lon()).km
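
geodesic measures distance on an ellipsoidal model of the Earth. For a quick cross-check without geopy, the haversine formula on a sphere gives a close approximation over distances of a few kilometres. The function below is a sketch of that swapped-in technique, not the method the notebook uses; haversine_km and the radius constant are names introduced here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

Because the two models differ slightly, a mall sitting almost exactly on the 2 km boundary could be classified differently by the two methods.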

In [80]:
from datetime import datetime

In [88]:
def get_names_of_malls_within_2km_of_hdb(hdb_coordinate: Coordinate, hdb_record_date: datetime, malls_df: pd.DataFrame) -> list:
    malls_within_2km = []

    for _, row in malls_df.iterrows():
        distance2hdb = calculate_distance(hdb_coordinate, Coordinate(lat=row['latitude'], lon=row['longitude']))
        # Keep only malls that had already opened when the transaction was recorded
        if distance2hdb <= 2 and row['opening_date'] < hdb_record_date:
            # Report the distance in metres, rounded to 2 decimal places
            malls_within_2km.append(f"{row['name']} ({round(distance2hdb * 100 * 1000) / 100}m)")

    return malls_within_2km

In [89]:
get_names_of_malls_within_2km_of_hdb(
    Coordinate(lat=1.37509746867904, lon=103.83761896123), 
    datetime(2024, 2, 4),
    malls_df
)


['AMK Hub (1365.57m)', 'Broadway Plaza (985.61m)', 'Jubilee Square (1190.72m)']
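
The function above loops over every mall for every flat, which is slow across roughly 211,000 transactions. One possible speed-up is to compute all mall distances for a flat in a single NumPy expression. The sketch below uses the spherical haversine approximation instead of geodesic, so it is a different technique from the notebook's; malls_within_radius and the toy malls are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # spherical approximation, not the geodesic model


def malls_within_radius(hdb_lat, hdb_lon, record_date, malls, radius_km=2.0):
    """Return names of malls within radius_km that opened before record_date."""
    lat1 = np.radians(hdb_lat)
    lat2 = np.radians(malls["latitude"].to_numpy())
    dlat = lat2 - lat1
    dlon = np.radians(malls["longitude"].to_numpy() - hdb_lon)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    distance_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    opened = (malls["opening_date"] < record_date).to_numpy()
    mask = (distance_km <= radius_km) & opened
    return malls.loc[mask, "name"].tolist()


# Made-up malls for illustration; these are not rows from malls_dataset.csv
toy_malls = pd.DataFrame({
    "name": ["Near and open", "Near but not yet open", "Far away"],
    "latitude": [1.3760, 1.3740, 1.3000],
    "longitude": [103.8380, 103.8370, 103.8500],
    "opening_date": pd.to_datetime(["2010-01-01", "2030-01-01", "2000-01-01"]),
})
nearby = malls_within_radius(1.3751, 103.8376, pd.Timestamp("2024-02-04"), toy_malls)
print(nearby)
```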

Next steps:
1. Read the resale transaction data
2. Convert the month column to a datetime
3. Apply the function to every transaction

In [90]:
hdb_transactions = pd.read_csv('../data/modified/hdb_working_data.csv')
print(hdb_transactions.shape)
hdb_transactions.head()

(211395, 24)


Unnamed: 0.1,Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,...,longitude,sora,mrt_stations_within_1km,nearest_mrt_station,bto_within_4km,bto_supply_within_4km,pri_schs_within_1km,count_pri_schs_within_1km,distance2cbd,mallsWithin2kmRadius
0,0,2015-01,ANG MO KIO,3 ROOM,174,ANG MO KIO AVE 4,07 TO 09,60.0,Improved,1986,...,103.837619,0.129019,[],Yio Chu Kang MRT Station (1099.56m),12,6587,"['Ang Mo Kio Primary School (676.95m)', ""CHIJ ...",3,9.764087,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ..."
1,1,2015-01,ANG MO KIO,3 ROOM,541,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1981,...,103.855621,0.129019,['Ang Mo Kio MRT Station (811.53m)'],Ang Mo Kio MRT Station (811.53m),39,23252,['Jing Shan Primary School (431.03m)'],1,9.515643,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ..."
2,2,2015-01,ANG MO KIO,3 ROOM,163,ANG MO KIO AVE 4,01 TO 03,69.0,New Generation,1980,...,103.838169,0.129019,[],Yio Chu Kang MRT Station (1183.8m),10,4941,"['Ang Mo Kio Primary School (495.36m)', ""CHIJ ...",3,9.585589,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ..."
3,3,2015-01,ANG MO KIO,3 ROOM,446,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1979,...,103.855357,0.129019,['Ang Mo Kio MRT Station (703.32m)'],Ang Mo Kio MRT Station (703.32m),34,20043,"['Jing Shan Primary School (611.18m)', 'Teck G...",3,8.833708,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ..."
4,4,2015-01,ANG MO KIO,3 ROOM,557,ANG MO KIO AVE 10,07 TO 09,68.0,New Generation,1980,...,103.857736,0.129019,['Ang Mo Kio MRT Station (939.42m)'],Ang Mo Kio MRT Station (939.42m),45,26356,['Jing Shan Primary School (627.43m)'],1,9.275781,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ..."


In [92]:
hdb_transactions['record_date'] = hdb_transactions['month'].apply(lambda date: datetime.strptime(date, "%Y-%m"))
hdb_transactions.head()

Unnamed: 0.1,Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,...,sora,mrt_stations_within_1km,nearest_mrt_station,bto_within_4km,bto_supply_within_4km,pri_schs_within_1km,count_pri_schs_within_1km,distance2cbd,mallsWithin2kmRadius,record_date
0,0,2015-01,ANG MO KIO,3 ROOM,174,ANG MO KIO AVE 4,07 TO 09,60.0,Improved,1986,...,0.129019,[],Yio Chu Kang MRT Station (1099.56m),12,6587,"['Ang Mo Kio Primary School (676.95m)', ""CHIJ ...",3,9.764087,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ...",2015-01-01
1,1,2015-01,ANG MO KIO,3 ROOM,541,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1981,...,0.129019,['Ang Mo Kio MRT Station (811.53m)'],Ang Mo Kio MRT Station (811.53m),39,23252,['Jing Shan Primary School (431.03m)'],1,9.515643,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ...",2015-01-01
2,2,2015-01,ANG MO KIO,3 ROOM,163,ANG MO KIO AVE 4,01 TO 03,69.0,New Generation,1980,...,0.129019,[],Yio Chu Kang MRT Station (1183.8m),10,4941,"['Ang Mo Kio Primary School (495.36m)', ""CHIJ ...",3,9.585589,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ...",2015-01-01
3,3,2015-01,ANG MO KIO,3 ROOM,446,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1979,...,0.129019,['Ang Mo Kio MRT Station (703.32m)'],Ang Mo Kio MRT Station (703.32m),34,20043,"['Jing Shan Primary School (611.18m)', 'Teck G...",3,8.833708,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ...",2015-01-01
4,4,2015-01,ANG MO KIO,3 ROOM,557,ANG MO KIO AVE 10,07 TO 09,68.0,New Generation,1980,...,0.129019,['Ang Mo Kio MRT Station (939.42m)'],Ang Mo Kio MRT Station (939.42m),45,26356,['Jing Shan Primary School (627.43m)'],1,9.275781,"AMK Hub (8128.9m), Broadway Plaza (8526.07m), ...",2015-01-01


#### 2.2 Run calculations by chunk for better tracking

In [93]:
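import numpy as np

# Split row positions rather than the DataFrame itself, which newer
# pandas versions warn about when it is passed to np.array_split
position_chunks = np.array_split(np.arange(len(hdb_transactions)), 5)

result_chunks = []

for i, positions in enumerate(position_chunks):
    print(f"Starting chunk: {i+1}")
    chunk = hdb_transactions.iloc[positions]
    result_chunk = chunk.copy()
    result_chunk['mallsWithin2kmRadius'] = chunk.apply(lambda row: get_names_of_malls_within_2km_of_hdb(
        Coordinate(lat=row['latitude'], lon=row['longitude']), 
        row['record_date'],
        malls_df
    ), axis=1)

    result_chunks.append(result_chunk)

    # Checkpoint everything computed so far in case a later chunk fails
    combined_df = pd.concat(result_chunks)
    combined_df.to_csv('../data/modified/temporary.csv', index=False)
    print(f"Finished chunk: {i+1}")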




Starting chunk: 1
Finished chunk: 1
Starting chunk: 2
Finished chunk: 2
Starting chunk: 3
Finished chunk: 3
Starting chunk: 4
Finished chunk: 4
Starting chunk: 5
Finished chunk: 5
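
Many resale records are likely to share the same block coordinates and month, so the mall lookup may be repeated needlessly. One option, sketched here under that assumption, is to run the lookup once per distinct (latitude, longitude, record_date) combination and merge the results back. apply_once_per_unique_key and stand_in_lookup are hypothetical names, and the toy rows are made up.

```python
import pandas as pd


def apply_once_per_unique_key(df, key_columns, func, result_column):
    """Evaluate func once per distinct key combination, then merge the results back."""
    unique_keys = df[key_columns].drop_duplicates().copy()
    unique_keys[result_column] = [
        func(*key) for key in unique_keys.itertuples(index=False, name=None)
    ]
    return df.merge(unique_keys, on=key_columns, how="left")


calls = []


def stand_in_lookup(latitude, longitude, record_date):
    # Stand-in for the real mall lookup; records each call so they can be counted
    calls.append((latitude, longitude, record_date))
    return [f"mall near ({latitude}, {longitude})"]


# Made-up transactions: the first two rows share a block and month
toy = pd.DataFrame({
    "latitude": [1.375, 1.375, 1.380],
    "longitude": [103.837, 103.837, 103.855],
    "record_date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01"]),
})
result = apply_once_per_unique_key(
    toy, ["latitude", "longitude", "record_date"], stand_in_lookup, "mallsWithin2kmRadius"
)
print(len(calls), len(result))
```

If the frame already contains the result column, as hdb_transactions does here, it would need to be dropped first so the merge does not produce suffixed duplicate columns.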


In [94]:
combined_df.head()

Unnamed: 0.1,Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,...,sora,mrt_stations_within_1km,nearest_mrt_station,bto_within_4km,bto_supply_within_4km,pri_schs_within_1km,count_pri_schs_within_1km,distance2cbd,mallsWithin2kmRadius,record_date
0,0,2015-01,ANG MO KIO,3 ROOM,174,ANG MO KIO AVE 4,07 TO 09,60.0,Improved,1986,...,0.129019,[],Yio Chu Kang MRT Station (1099.56m),12,6587,"['Ang Mo Kio Primary School (676.95m)', ""CHIJ ...",3,9.764087,"[AMK Hub (1365.57m), Broadway Plaza (985.61m),...",2015-01-01
1,1,2015-01,ANG MO KIO,3 ROOM,541,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1981,...,0.129019,['Ang Mo Kio MRT Station (811.53m)'],Ang Mo Kio MRT Station (811.53m),39,23252,['Jing Shan Primary School (431.03m)'],1,9.515643,"[AMK Hub (937.75m), Broadway Plaza (1103.92m),...",2015-01-01
2,2,2015-01,ANG MO KIO,3 ROOM,163,ANG MO KIO AVE 4,01 TO 03,69.0,New Generation,1980,...,0.129019,[],Yio Chu Kang MRT Station (1183.8m),10,4941,"['Ang Mo Kio Primary School (495.36m)', ""CHIJ ...",3,9.585589,"[AMK Hub (1238.44m), Broadway Plaza (878.7m), ...",2015-01-01
3,3,2015-01,ANG MO KIO,3 ROOM,446,ANG MO KIO AVE 10,01 TO 03,68.0,New Generation,1979,...,0.129019,['Ang Mo Kio MRT Station (703.32m)'],Ang Mo Kio MRT Station (703.32m),34,20043,"['Jing Shan Primary School (611.18m)', 'Teck G...",3,8.833708,"[AMK Hub (784.06m), Broadway Plaza (1149.31m),...",2015-01-01
4,4,2015-01,ANG MO KIO,3 ROOM,557,ANG MO KIO AVE 10,07 TO 09,68.0,New Generation,1980,...,0.129019,['Ang Mo Kio MRT Station (939.42m)'],Ang Mo Kio MRT Station (939.42m),45,26356,['Jing Shan Primary School (627.43m)'],1,9.275781,"[AMK Hub (1057.2m), Broadway Plaza (1317.66m),...",2015-01-01


In [95]:
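# Drop the helper date column and any index columns left over from earlier CSV round trips
combined_df = combined_df.drop(columns=['record_date'] + [c for c in combined_df.columns if c.startswith('Unnamed')])

In [96]:
# index=False stops pandas from writing the index as another 'Unnamed: 0' column
combined_df.to_csv('../data/modified/hdb_working_data.csv', index=False)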
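
Because each mallsWithin2kmRadius cell holds a Python list, the CSV stores it as text and it comes back as a string when the file is read again. A minimal sketch of restoring the lists with ast.literal_eval and deriving a count follows. The column name count_malls_within_2km is hypothetical, and the sample values are copied from the example output earlier in this notebook.

```python
import ast
import io

import pandas as pd

# Two sample rows; the first list is copied from the example output above
original = pd.DataFrame({
    "mallsWithin2kmRadius": [["AMK Hub (1365.57m)", "Broadway Plaza (985.61m)"], []],
})

# Round-trip through an in-memory CSV instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell is a string such as "['AMK Hub (1365.57m)', ...]"
reloaded["mallsWithin2kmRadius"] = reloaded["mallsWithin2kmRadius"].apply(ast.literal_eval)
reloaded["count_malls_within_2km"] = reloaded["mallsWithin2kmRadius"].apply(len)
print(reloaded["count_malls_within_2km"].tolist())
```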