# Data Fetching and Processing

In [44]:
import config_vars
import pandas as pd
import os
from datetime import datetime
import numpy as np


In [52]:
run_date = datetime.now().strftime('%Y-%m-%d')

In [53]:
data_output_path = f'./data/{run_date}'

In [56]:
os.makedirs(data_output_path,exist_ok=True)

## Accessing LTA Datamall

Singapore's Land Transport Authority (LTA) provides data via the LTA DataMall. For this exploration, I am using a [Dynamic Data Set](https://datamall.lta.gov.sg/content/datamall/en/dynamic-data.html) for **Bus Stops**. On their website, the documentation entry for Bus Stops is:

> **Bus Stops** - Returns detailed information for all bus stops currently being serviced by buses, including: Bus Stop Code, location coordinates.

Since I am keen to explore data visualisation of geospatial data, a dataset with location information is a prerequisite. There is also some cool factor in seeing just how many bus stops there are on this island. In a separate dataset from [Data.gov.sg: Commuter Facilities](https://data.gov.sg/dataset/commuter-facilities?view_id=d89f1357-8c3e-4feb-91fb-904104af4473&resource_id=2198da0e-e7fd-43f9-8b66-749255f430bf), the count of bus stops was 4,638. However, that dataset was last updated in 2017, with the latest entry being from 2013 😧. In any case, I am assuming that there are more bus stops now. Let's see.

In [10]:
import json
import urllib
import requests

> As part of an open data initiative (which is great BTW 💚), requesting an API key should be straightforward: [DataMall API Key Request](https://datamall.lta.gov.sg/content/datamall/en/request-for-api.html).

In [22]:
# For code hygiene, my personal API key is loaded from config_vars instead of being typed in here.

headers = { 'AccountKey' : config_vars.LTA_API_KEY, 'accept' : 'application/json'}
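
The `config_vars` module itself is not shown in this notebook. As a minimal sketch of one way to keep the key out of the code, the snippet below reads it from an environment variable. The variable name `LTA_API_KEY` and the helper `build_headers` are assumptions for illustration, not the actual contents of `config_vars`.

```python
import os


def build_headers(env=os.environ):
    """Build DataMall request headers from an environment mapping.

    The variable name LTA_API_KEY is an assumption made for this sketch.
    """
    api_key = env.get("LTA_API_KEY")
    if not api_key:
        raise RuntimeError("Set the LTA_API_KEY environment variable first.")
    return {"AccountKey": api_key, "accept": "application/json"}
```

Passing the environment as an argument keeps the helper easy to try out with a plain dictionary, without touching the real environment.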

In [13]:
datamall_url = 'http://datamall2.mytransport.sg/ltaodataservice/BusStops'

The API documentation notes that each query returns at most 500 records.

> To retrieve subsequent records of the dataset, you need to append the `$skip` operator to the API call (URL). For example, to retrieve the next 500 records (501st to the 1000th), the API call should be:
http://datamall2.mytransport.sg/ltaodataservice/BusRoutes?$skip=500
To retrieve the following set of 500 records, append `?$skip=1000`, and so on. _Just remember, each URL call returns only a **max of 500 records!**_
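
To make the `$skip` rule concrete, here is a small sketch that generates one URL per page of 500 records. The helper name `paged_urls` and the example total of 1,200 records are made up for illustration.

```python
def paged_urls(base_url, total_records, page_size=500):
    """Yield one URL per page, using $skip to offset into the dataset."""
    for skip in range(0, total_records, page_size):
        # The first page needs no offset; later pages append ?$skip=N.
        yield base_url if skip == 0 else f"{base_url}?$skip={skip}"


# Hypothetical total of 1,200 records, so three pages are needed.
urls = list(
    paged_urls("http://datamall2.mytransport.sg/ltaodataservice/BusStops", 1200)
)
```

In practice the total is not known in advance, so the page count has to come from an upper bound or from noticing when a page comes back short.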


### Processing First 500

We want to create the dataset for bus stops once instead of querying the API every time. For that, I need a parser that reformats the API response (which is JSON) into a "friendlier" format (which for me would be CSV).

In [30]:
resp = requests.post(datamall_url, headers= headers)

In [31]:
resp

<Response [200]>

Since we got a good response (`<Response [200]>`), we can assume the request succeeded. To get the bus stop details that we are interested in, we need the JSON body of the response. As a sanity check, I confirmed that the response does contain 500 records (the maximum per query according to the API docs).

In [23]:
response = resp.json()

In [29]:
len(response.get("value"))

500

Exploring the schema of the response, we see that each bus stop record has the fields `'BusStopCode', 'RoadName', 'Description', 'Latitude', 'Longitude'`.

|Attribute|Description|Sample|
|-|-|-|
|**BusStopCode**|A 5-digit unique identifier for the bus stop.|01012|
|**RoadName**|The road where the bus stop is located.|Victoria St|
|**Description**|Landmarks next to the bus stop to aid in identification.|Hotel Grand Pacific|
|**Latitude**|Location Coordinate.|1.29684825487647|
|**Longitude**|Location Coordinate.|103.85253591654006|

Source: Data Mall API Documentation (Section 2.4 Bus Stops) [Link](https://datamall.lta.gov.sg/content/dam/datamall/datasets/LTA_DataMall_API_User_Guide.pdf)


In [35]:
response.get("value")[0].keys()

dict_keys(['BusStopCode', 'RoadName', 'Description', 'Latitude', 'Longitude'])

In [59]:
BusStopCode_list = []
RoadName_list = []
Description_list = []
Latitude_list = []
Longitude_list = []
for bus_stop_row in response.get("value"):
    BusStopCode_list.append(bus_stop_row.get('BusStopCode',None))
    RoadName_list.append(bus_stop_row.get('RoadName',None))
    Description_list.append(bus_stop_row.get('Description',None))
    Latitude_list.append(bus_stop_row.get('Latitude',None))
    Longitude_list.append(bus_stop_row.get('Longitude',None))

In [61]:
pd_df_busstop = pd.DataFrame({
    'BusStopCode':BusStopCode_list,
    'RoadName':RoadName_list,
    'Description':Description_list,
    'Latitude':Latitude_list,
    'Longitude':Longitude_list
})
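
As an alternative to keeping five parallel lists, the same parsing step can be sketched as one list of row dictionaries. The helper `to_rows` and the second sample record are made up for illustration; the first sample uses the values from the attribute table above.

```python
COLUMNS = ["BusStopCode", "RoadName", "Description", "Latitude", "Longitude"]


def to_rows(records, columns=COLUMNS):
    """Keep only the expected fields of each record; missing fields become None."""
    return [{col: record.get(col) for col in columns} for record in records]


sample_records = [
    {
        "BusStopCode": "01012",
        "RoadName": "Victoria St",
        "Description": "Hotel Grand Pacific",
        "Latitude": 1.29684825487647,
        "Longitude": 103.85253591654006,
    },
    {"BusStopCode": "99999"},  # made-up record with missing fields
]
rows = to_rows(sample_records)
```

Passing `rows` to `pd.DataFrame(rows)` should give the same five columns as the cell above, and a single list is harder to get out of sync than five separate ones.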

#### Summary of Section

At this point in the section, we are able to:
* *Query the LTA DataMall API* - We can fetch our data from the API and confirm that the response looks correct.
* *Parse the JSON response* - We can parse the data and convert it from JSON into a tabular pandas DataFrame, which is one step away from CSV.
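
Since CSV is the target format, here is a minimal sketch of that last step using only the standard library. The helper `write_bus_stops_csv` is a hypothetical name, and the sketch writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

COLUMNS = ["BusStopCode", "RoadName", "Description", "Latitude", "Longitude"]


def write_bus_stops_csv(rows, fileobj):
    """Write bus stop dicts to an open text file as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


sample = [
    {
        "BusStopCode": "01012",
        "RoadName": "Victoria St",
        "Description": "Hotel Grand Pacific",
        "Latitude": 1.29684825487647,
        "Longitude": 103.85253591654006,
    }
]
buffer = io.StringIO()
write_bus_stops_csv(sample, buffer)
csv_text = buffer.getvalue()
```

To write a real file, the buffer could be replaced with a file opened under `data_output_path` using `open(path, "w", newline="", encoding="utf-8")`. With pandas, `pd_df_busstop.to_csv(path, index=False)` does the same job. When reading the file back, it helps to load `BusStopCode` as a string (for example `dtype={"BusStopCode": str}` in `pd.read_csv`) so that codes such as `01012` keep their leading zero.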

## Working with PyDeck

In [93]:
import pydeck as pdk

In [132]:

mid_lat = np.median(pd_df_busstop.Latitude.values)
mid_lon = np.median(pd_df_busstop.Longitude.values)

In [183]:
layer = pdk.Layer(
    "ScatterplotLayer",
    pd_df_busstop,
    pickable=True,
    opacity=0.8,
    stroked=True,
    filled=True,
    radius_scale=1,
    line_width_min_pixels=5,
    line_width_max_pixels=8,
    get_line_width=1,
    get_position=['Longitude', 'Latitude'],
    # get_fill_color=[255,32,110],
    get_line_color=[251,255,18],
    radius_units = 'meters'
)

In [184]:

view_state = pdk.ViewState(latitude=mid_lat, longitude=mid_lon, zoom=10, bearing=0, pitch=0)


In [188]:
# Render
r = pdk.Deck(layers=[layer], initial_view_state=view_state, map_style='dark',tooltip={'text':"{BusStopCode}\n{Description}"},height=1080,width=1920)
r.to_html("bus_stop.html")
# r.show()  # I have not yet figured out how to display this interactively; so far only the `.to_html()` method works.

## Full Data of Bus Stops

### Full Data Fetch

In [207]:
datamall_url = 'http://datamall2.mytransport.sg/ltaodataservice/BusStops'

BusStopCode_list = []
RoadName_list = []
Description_list = []
Latitude_list = []
Longitude_list = []

for index in np.arange(0,10000,500):
    print(index)
    new_url = f"{datamall_url}?$skip={index}"
    if index !=0:
        resp = requests.post(new_url, headers= headers)
    else:
        resp = requests.post(datamall_url, headers= headers)
    
    if resp.status_code == 200:
        response = resp.json()
        for bus_stop_row in response.get("value"):
            BusStopCode_list.append(bus_stop_row.get('BusStopCode',None))
            RoadName_list.append(bus_stop_row.get('RoadName',None))
            Description_list.append(bus_stop_row.get('Description',None))
            Latitude_list.append(bus_stop_row.get('Latitude',None))
            Longitude_list.append(bus_stop_row.get('Longitude',None))
    else:
        break  # Stop fetching as soon as the API returns a non-200 response

print("Completed Data Fetch and Parse")

0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
Completed Data Fetch and Parse
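
The printed offsets continue to 6500 even though the final dataset has 5,071 rows, so the loop keeps requesting pages past the last record. Below is a sketch of a loop that stops as soon as a page comes back short or empty. It assumes that a short page reliably marks the end of the dataset, which I have not confirmed against the API documentation. `fetch_page` is a hypothetical caller-supplied function, and a fake one is used here so the sketch runs without an API key.

```python
def fetch_all(fetch_page, page_size=500, max_pages=100):
    """Collect records page by page until a page comes back short or empty.

    fetch_page(skip) is a caller-supplied function (hypothetical here) that
    returns the list of records starting at offset `skip`.
    """
    records = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size)
        records.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means we reached the end
    return records


# Stand-in for the real API: 1,234 fake records served in slices of 500.
fake_data = [{"BusStopCode": f"{i:05d}"} for i in range(1234)]
calls = []


def fake_fetch(skip):
    calls.append(skip)
    return fake_data[skip:skip + 500]


all_records = fetch_all(fake_fetch)
```

For the real API, `fetch_page` could wrap the request from the cell above and return `resp.json().get("value", [])`. The `max_pages` cap is a safety net in case the short-page assumption turns out to be wrong.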


In [209]:
pd_df_busstop_full = pd.DataFrame({
    'BusStopCode':BusStopCode_list,
    'RoadName':RoadName_list,
    'Description':Description_list,
    'Latitude':Latitude_list,
    'Longitude':Longitude_list
})

### Full Data Visualisation

In [210]:

mid_lat = np.median(pd_df_busstop_full.Latitude.values)
mid_lon = np.median(pd_df_busstop_full.Longitude.values)

In [268]:
layer = pdk.Layer(
    "ScatterplotLayer",
    pd_df_busstop_full,
    pickable=True,
    opacity=0.8,
    stroked=True,
    filled=True,
    line_width_min_pixels=2,
    line_width_max_pixels=3,
    get_line_width=1,
    radius_min_pixels=2,
    radius_max_pixels=3,
    radius_scale=1,
    get_position=['Longitude', 'Latitude'],
    get_fill_color=[251,255,18],
    get_line_color=[251,255,18],
    radius_units = 'pixels'
)

In [279]:

view_state = pdk.ViewState(latitude=mid_lat, longitude=mid_lon, zoom=11, max_zoom=18, min_zoom=5, bearing=0, pitch=0)


In [280]:
# Render
r = pdk.Deck(layers=[layer], initial_view_state=view_state, map_style='dark',tooltip={'text':"{BusStopCode}\n{Description}"},height=1080,width=1920)
r.to_html("bus_stop_full_sg.html")
# r.show()  # I have not yet figured out how to display this interactively; so far only the `.to_html()` method works.

Out of curiosity, how many bus stops are there right now in Singapore? As it turns out, there are 5,071 (based on the shape of the final dataset), which is close to my rough guess of 5,000 and up from the 4,638 in the older Data.gov.sg dataset. That somewhat makes sense: we would not expect the count to grow dramatically so soon. Still, 5,000+ bus stops on an island this size is quite amazing, I would say.

In [281]:
pd_df_busstop_full.shape

(5071, 5)
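
The row count equals the number of bus stops only if no stop was fetched twice across pages. A quick sketch of that check, using a hypothetical helper `duplicate_codes` and made-up sample codes:

```python
from collections import Counter


def duplicate_codes(codes):
    """Return the bus stop codes that appear more than once, with their counts."""
    return {code: n for code, n in Counter(codes).items() if n > 1}


# Made-up sample in which "01012" appears twice.
dupes = duplicate_codes(["01012", "01013", "01012", "01019"])
```

Calling `duplicate_codes(pd_df_busstop_full["BusStopCode"])` on the full dataset should return an empty dictionary if every code is unique.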

### Section Summary

By now, we have extended our "pipeline" to fetch the data from LTA DataMall for ALL the available bus stops on record.

We also extended our visualisation to show all the 5,000+ bus stops around Singapore (and on into JB as well).
