### Obtaining the MRT Dataset

In this notebook, we use the `requests` and `BeautifulSoup` Python libraries to scrape MRT stations, their coordinates, and their opening dates from Wikipedia.

In [7]:
import requests
from bs4 import BeautifulSoup
import csv


Make a request to `https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations` to find all MRT stations and their respective Wikipedia links. We store this information in a dictionary, with the MRT station name as the key and the station's Wikipedia page as the value.

Since we will also add each station's coordinates and opening date, each value in the dictionary is a list that stores the Wikipedia link, the coordinates, and the opening date.
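
As a rough sketch of the layout described here, each list holds the link at index 0, the latitude and longitude at indices 1 and 2, and the opening date at index 3. The index constants and the name `example_station_dict` are hypothetical and only serve this illustration. The values mirror one station printed later in this notebook, and the link is assumed to follow Wikipedia's usual URL pattern.

```python
# Illustrative layout only: the index names below are not part of the notebook.
LINK, LATITUDE, LONGITUDE, OPENING_DATE = range(4)

example_station_dict = {
    "Jurong East": [
        "https://en.wikipedia.org/wiki/Jurong_East_MRT_station",  # assumed URL
        1.3333333333333333,   # latitude in decimal degrees
        103.74222222222222,   # longitude in decimal degrees
        "5 November 1988",    # opening date as scraped text
    ],
}

record = example_station_dict["Jurong East"]
print(record[LATITUDE], record[LONGITUDE], record[OPENING_DATE])
```

Named indices are optional. The cells below use the plain positions `data[1]`, `data[2]` and `data[3]`.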

In [8]:
# make request to Wikipedia page
res = requests.get('https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations')
# parse the page
soup = BeautifulSoup(res.text, 'html.parser')

td_elements = soup.find_all('td')

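# match links whose title attribute mentions "MRT station"
def is_mrt_title(title):
    return bool(title) and "MRT station" in title

# take the first matching link in each table cell, skipping cells without one
mrt_links = (td.find('a', title=is_mrt_title) for td in td_elements)
mrt_stations = [link for link in mrt_links if link]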

station_dict = {}
for station in mrt_stations:
    if station:  # Check to make sure the station is not None
        station_name = station.text.strip()  # .strip() removes leading/trailing whitespace
        if station_name in station_dict:
            continue  # Skip this station if it's already been processed
        station_link = "https://en.wikipedia.org" + station['href']
        station_dict[station_name] = [station_link]

print(len(station_dict))

185


Now, for each unique MRT station, we visit its Wikipedia page and extract the coordinates and opening date.

Wikipedia gives coordinates in degrees-minutes-seconds format, e.g. `1°20′00″N`, so we also need a function that converts them into decimal degrees.

In [9]:
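# convert a degrees-minutes-seconds string such as '1°20′00″N' into decimal degrees
def convert_coordinates(coordinate):
    # replace the degree, minute and second symbols with spaces so the parts can be split
    coordinate = coordinate.replace('°', ' ').replace('′', ' ').replace('″', ' ')
    parts = coordinate.split()
    degrees = float(parts[0])
    minutes = float(parts[1])
    seconds = float(parts[2])
    # the trailing hemisphere letter (parts[3]) is ignored, which is fine here
    # because Singapore lies in the northern and eastern hemispheres
    decimal = degrees + minutes / 60 + seconds / 3600
    return decimal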
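
`convert_coordinates` assumes that degrees, minutes and seconds are all present, and it does not look at the hemisphere letter. Should the scraper be reused for places south of the equator or west of Greenwich, a variant along the following lines might help. It is a minimal sketch rather than the notebook's method: `dms_to_decimal` and `DMS_PATTERN` are hypothetical names, and the pattern assumes the same `°`, `′` and `″` symbols shown in this notebook.

```python
import re

# Hypothetical pattern: degrees are required, minutes and seconds are optional,
# and the string must end with a hemisphere letter.
DMS_PATTERN = re.compile(
    r"""^\s*(?P<deg>\d+(?:\.\d+)?)°
        (?:(?P<min>\d+(?:\.\d+)?)′)?
        (?:(?P<sec>\d+(?:\.\d+)?)″)?
        \s*(?P<hemi>[NSEW])\s*$""",
    re.VERBOSE,
)

def dms_to_decimal(coordinate):
    """Convert e.g. '1°20′00″N' to signed decimal degrees."""
    match = DMS_PATTERN.match(coordinate)
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {coordinate!r}")
    degrees = float(match["deg"])
    minutes = float(match["min"] or 0)   # missing minutes count as 0
    seconds = float(match["sec"] or 0)   # missing seconds count as 0
    decimal = degrees + minutes / 60 + seconds / 3600
    # southern and western hemispheres are negative by convention
    return -decimal if match["hemi"] in "SW" else decimal

print(dms_to_decimal('1°20′00″N'))
```

For Singapore's stations the two functions should agree, since every coordinate there ends in `N` or `E`.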

In [10]:
# iterate through station_dict
for station in station_dict:
    # make request to station link
    res = requests.get(station_dict[station][0])
    # parse the page
    soup = BeautifulSoup(res.text, 'html.parser')
    # get the latitude and longitude
    latitude = convert_coordinates(soup.find('span', class_='latitude').text)
    longitude = convert_coordinates(soup.find('span', class_='longitude').text)

    # get the opening date
    opened_row = soup.find('th', string='Opened')
    if opened_row:
        # the 'td' next to the 'Opened' header holds the date
        opened_date_cell = opened_row.find_next_sibling('td')
        # .text already drops tags such as <br>, so only surrounding whitespace is stripped
        opened_date = opened_date_cell.text.strip()
        # replace non-breaking spaces (from '&nbsp;') with regular spaces
        opened_date = opened_date.replace('\xa0', ' ')
        # keep only the text before the first ';', which is the date itself
        opened_date = opened_date.split(';')[0]
    else:
        opened_date = 'Not opened yet'

    # add the latitude, longitude and opening date to the station_dict
    station_dict[station].append(latitude)
    station_dict[station].append(longitude)
    station_dict[station].append(opened_date)

# print out first 20 stations with their coordinates and opening dates
for station, data in list(station_dict.items())[:20]:
    print(f'{station}: {data[1]}, {data[2]} - {data[3]}')

Jurong East: 1.3333333333333333, 103.74222222222222 - 5 November 1988
Bukit Batok: 1.3491666666666666, 103.74972222222222 - 10 March 1990
Bukit Gombak: 1.3586111111111112, 103.75166666666667 - 10 March 1990
Brickland: 1.3686111111111112, 103.74944444444445 - Not opened yet
Yew Tee: 1.396986111111111, 103.74723888888889 - 10 February 1996
Sungei Kadut: 1.4133333333333333, 103.74888888888889 - Not opened yet
Kranji: 1.4250472222222224, 103.76185277777778 - 10 February 1996
Marsiling: 1.4326361111111112, 103.77428333333333 - 10 February 1996
Woodlands: 1.4370944444444445, 103.78648333333334 - 10 February 1996
Admiralty: 1.440688888888889, 103.80093333333333 - 10 February 1996
Sembawang: 1.449025, 103.82015277777778 - 10 February 1996
Canberra: 1.4430555555555555, 103.82972222222222 - 2 November 2019
Yishun: 1.4294638888888889, 103.83523888888888 - 20 December 1988
Khatib: 1.4171666666666667, 103.8329 - 20 December 1988
Yio Chu Kang: 1.3819055555555555, 103.84481666666666 - 7 November 1987
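
The opening dates printed above are plain strings such as `5 November 1988`, with `Not opened yet` as a placeholder. If real date values are needed later, for sorting or filtering, one option is to parse them with the standard library. This is a minimal sketch: `parse_opening_date` is a hypothetical helper, and it assumes the day, full English month name, year format seen in the output and an English (default) locale.

```python
from datetime import datetime

def parse_opening_date(text):
    """Return a datetime.date, or None if text is not a 'D Month YYYY' date."""
    try:
        return datetime.strptime(text.strip(), "%d %B %Y").date()
    except ValueError:
        # covers the 'Not opened yet' placeholder and any unexpected format
        return None

print(parse_opening_date('5 November 1988'))
print(parse_opening_date('Not opened yet'))
```

Returning `None` for unparsed values keeps unopened stations in the dataset while making them easy to filter out.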

Now we write `station_dict` to a CSV file with the columns `mrt_station_name`, `latitude`, `longitude`, and `opening_date`.

In [11]:
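# write station_dict to a CSV file
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('../data/modified/mrt_stations.csv', 'w', newline='', encoding='utf-8') as file: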
    writer = csv.writer(file)
    writer.writerow(['mrt_station_name', 'latitude', 'longitude', 'opening_date'])
    for station, data in station_dict.items():
        writer.writerow([station + ' MRT Station', data[1], data[2], data[3]])
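
To check the result, the file can be read back with `csv.DictReader`. The sketch below uses an in-memory sample with the same header and one row taken from the printed output, so it runs without the scraped file. `load_stations` is a hypothetical helper, not part of the original notebook.

```python
import csv
import io

def load_stations(file_obj):
    """Read rows with the header used above, converting coordinates to float."""
    rows = []
    for row in csv.DictReader(file_obj):
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        rows.append(row)
    return rows

# in-memory stand-in for the CSV file, with one row from the output above
sample = io.StringIO(
    "mrt_station_name,latitude,longitude,opening_date\n"
    "Jurong East MRT Station,1.3333333333333333,103.74222222222222,5 November 1988\n"
)
stations = load_stations(sample)
print(stations[0]["mrt_station_name"], stations[0]["latitude"])
```

With the real file, the same function can be given the handle returned by `open('../data/modified/mrt_stations.csv', newline='')`.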