# How Are Taxis Enhancing Singapore's Public Transport System?

## Introduction
Bus and rail systems have fixed routes and schedules, so they have deterministic coverage patterns that meet general transportation demand. Taxicabs and ride-sharing services have neither fixed routes nor fixed schedules, so in some sense they meet "ad hoc" demand. The goal of this project is to see whether the supply pattern of taxicabs fills the gaps in the public transport network, thereby enhancing it.
The study is done for Singapore, a city-state that provides plentiful data on its public transport network. While public transport modelling has been done before, to the best of our knowledge there has not been an analysis of taxis and public transport together.

## Existing Work

1. Singapore in Motion: Insights on Public Transport Service Level Through Farecard and Mobile Data Analytics, IBM 2016
http://www.kdd.org/kdd2016/papers/files/SingaporeInMotion_v3.pdf
2. Time-Series Data Mining in Transportation: A Case Study on Singapore Public Train Commuter Travel Patterns, SMU 2014 

We were inspired by two projects and by the papers listed above as existing work. Both projects were developed and open-sourced for commuters to use:
   
   
   1. [Taxi Router SG](https://github.com/cheeaun/taxirouter-sg) by [Lim Chee Aun](https://twitter.com/cheeaun). Its primary idea is to show the following:
       
       * The taxi stands in Singapore.
       * All available taxis in the whole of Singapore.
       * How many available taxis are around the commuter.
       * How far the nearest taxi stand is from the commuter.

   2. [TaxiSg](http://uzyn.github.io/taxisg/) by [U-Zyn Chua](https://twitter.com/uzyn). This app helps commuters understand the distribution of taxis over a historical window (ranging from 15 minutes to 2 weeks of historical data).
   
   
   Both apps get their data from the Land Transport Authority (LTA) of Singapore, a government organisation. LTA publishes a wide variety of transport-related datasets (static and dynamic/real-time) on its DataMall platform for enterprises, third-party developers, and other members of the public, to promote citizen co-creation of innovative and inclusive transport solutions. Detailed descriptions of the APIs are available on the DataMall platform.
  

## Problem Statement

   We want to answer the following questions and draw inferences from our results.
   
   1. Is there a difference between the density distribution of taxis (ad hoc requests made by commuters) and that of the bus/rail network (the planned transportation network) over time?
   
   2. Are taxis really filling the gaps in the public transport system (for example, where buses are fully loaded)?
   
   3. Does the distribution of taxis versus the public transport system change between weekdays and weekends? The plan is to select locations statistically (identify the top ten locations) and compute the KL divergence between the relative frequencies of buses and taxis, both per window and for the whole day.
   
   
   4. Can taxi demand be predicted from previous patterns observed in the taxi/bus density distributions? How precise can the prediction be in terms of time and location?
   One candidate method to try is [autoregression](http://machinelearningmastery.com/autoregression-models-time-series-forecasting-python/).
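
The KL divergence comparison in question 3 can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual analysis: the function names, the smoothing constant `eps`, and the zone counts are all hypothetical.

```python
import math


def relative_frequencies(counts):
    """Turn raw counts per zone into relative frequencies that sum to 1."""
    total = float(sum(counts))
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Return D_KL(P || Q) in nats for two discrete distributions.

    eps is an assumed smoothing constant that avoids division by zero
    when a zone has taxis but no buses.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same zones")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical counts of taxis and buses in four zones during one window.
taxi_counts = [120, 30, 40, 10]
bus_counts = [60, 50, 50, 40]

taxi_freq = relative_frequencies(taxi_counts)
bus_freq = relative_frequencies(bus_counts)

print(kl_divergence(taxi_freq, bus_freq))   # larger means more dissimilar
print(kl_divergence(taxi_freq, taxi_freq))  # identical distributions give 0.0
```

Note that the KL divergence is not symmetric, so the direction of the comparison (taxi versus bus, or bus versus taxi) has to be fixed before windows are compared.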
  
   

## Data Sets
Two dynamic data sets were collected using the API between 03/15/2017 and 03/19/2017.

1. Taxi Availability 
2. Bus Arrival

Two static data sets were collected:
1. Urban Redevelopment Authority (URA) Master Plan 2014 Subzones
2. MRT train schedule (work in progress)

The URA zone codes were used to determine land use, for example whether a zone is commercial or residential.
<img src="images/ura_2014.png" width="500px">

## Taxi Availability Dataset

### Description
Returns the location coordinates of all taxis that are currently available for hire. It does not include "Hired" or "Busy" taxis. We polled the API every *1 min* for this dataset. A total of **40982444** location records were collected.
                                                                                                             
| **Attributes** | **Description**                                         |
|----------------|---------------------------------------------------------|
| Latitude       | Latitude of the location where the taxi was available   |
| Longitude      | Longitude of the location where the taxi was available  |
| Date           | Date on which the taxi was available                    |
| Time           | Time at which the taxi was available                    |



## Bus Arrival Dataset

### Description
Returns real-time bus arrival information for the bus services at a queried bus stop, including the estimated time of arrival (ETA), estimated location, and load information. We polled the API for every bus stop in Singapore every *6 min* for this dataset. A total of **6394212** bus stop arrival records were collected.

                                                                                                                 
| **Attributes** | **Description**                                                                   |
|----------------|-----------------------------------------------------------------------------------|
| ServiceNo      | Bus service number                                                                |
| Status         | Bus status                                                                        |
| Latitude       | Estimated latitude of the bus                                                     |
| Longitude      | Estimated longitude of the bus                                                    |
| Load           | Bus occupancy / crowding: Seats Available, Standing Available, Limited Standing   |


## Data Collection
[hui han]
### LTA Datamall

### Dates of Collection
14/3/17 - 19/4/17 (Singapore time)

## Modelling transportation flow

### Projection from Lat Lon to UTM [hui han] 
   We changed the projection from regular (latitude, longitude) coordinates to Universal Transverse Mercator (UTM) for our project. You can find the basics of UTM [here](http://gisgeography.com/utm-universal-transverse-mercator-projection/). The primary reason for this decision is to avoid representing locations in a distorted manner: UTM coordinates are expressed in metres, so distances between points can be computed directly. UTM is also well suited to a narrow geographic area with densely packed detail.
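
As a rough illustration of the projection step, the sketch below converts WGS84 latitude/longitude to UTM easting/northing using the truncated Krüger series and only the standard library. The function names `utm_zone` and `latlon_to_utm` are hypothetical, the sketch covers the northern hemisphere only (no false northing is added), and it is not the code used in this project; in practice a projection library would normally handle this.

```python
import math

# WGS84 ellipsoid and UTM constants.
A_WGS84 = 6378137.0           # semi-major axis in metres
F_WGS84 = 1 / 298.257223563   # flattening
K0 = 0.9996                   # UTM scale factor on the central meridian
FALSE_EASTING = 500000.0      # metres


def utm_zone(lon_deg):
    """Return the UTM zone number (1-60) for a longitude in degrees."""
    return int((lon_deg + 180.0) // 6) % 60 + 1


def latlon_to_utm(lat_deg, lon_deg, zone=None):
    """Project WGS84 lat/lon to (easting, northing, zone), in metres.

    Northern-hemisphere sketch based on the truncated Krueger series.
    """
    if zone is None:
        zone = utm_zone(lon_deg)
    lon0 = math.radians(zone * 6 - 183)  # central meridian of the zone
    phi = math.radians(lat_deg)
    dlam = math.radians(lon_deg) - lon0

    n = F_WGS84 / (2 - F_WGS84)
    radius = A_WGS84 / (1 + n) * (1 + n ** 2 / 4 + n ** 4 / 64)
    alpha = [n / 2 - 2 * n ** 2 / 3 + 5 * n ** 3 / 16,
             13 * n ** 2 / 48 - 3 * n ** 3 / 5,
             61 * n ** 3 / 240]

    ecc = 2 * math.sqrt(n) / (1 + n)  # first eccentricity of the ellipsoid
    t = math.sinh(math.atanh(math.sin(phi))
                  - ecc * math.atanh(ecc * math.sin(phi)))
    xi = math.atan2(t, math.cos(dlam))
    eta = math.atanh(math.sin(dlam) / math.sqrt(1 + t * t))

    east_term, north_term = eta, xi
    for j, a_j in enumerate(alpha, start=1):
        east_term += a_j * math.cos(2 * j * xi) * math.sinh(2 * j * eta)
        north_term += a_j * math.sin(2 * j * xi) * math.cosh(2 * j * eta)

    easting = FALSE_EASTING + K0 * radius * east_term
    northing = K0 * radius * north_term
    return easting, northing, zone


# Approximate coordinates of central Singapore (illustrative input).
easting, northing, zone = latlon_to_utm(1.3521, 103.8198)
print(zone, round(easting), round(northing))
```

Singapore falls entirely within a single UTM zone, so every point can be projected with the same central meridian.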

### Density Estimation
       
   We collected data over a five-day period. To answer the questions in our problem statement, we start with a kernel density estimate (KDE) of taxis and buses over a 10-minute window slid across the five days. Kernel density estimation helps us identify the latent distribution from which the taxi and bus data originate. Knowing that distribution would also help us predict future taxi availability from the historical allocation of taxis and buses. We initially drew a random 10-minute sample of bus data and taxi data (for the same dates) and plotted the distribution of that sample. Before making the design decisions for the algorithm, we identified an outlier in the dataset.
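
The windowed density estimate can be sketched roughly as below. This is a simplified illustration with the standard library only: it uses consecutive (non-overlapping) 10-minute windows rather than a true sliding window, and the function names, the 500 m bandwidth, and the sample records are assumptions rather than values from this project.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 10 * 60  # 10-minute window, as described in the text
BANDWIDTH_M = 500.0       # assumed kernel bandwidth in metres


def group_by_window(records, window=WINDOW_SECONDS):
    """Bucket (timestamp_seconds, x, y) records into consecutive windows."""
    windows = defaultdict(list)
    for ts, x, y in records:
        windows[int(ts // window)].append((x, y))
    return dict(windows)


def kde_at(points, query, bandwidth=BANDWIDTH_M):
    """Evaluate a 2-D Gaussian kernel density estimate at one query point."""
    if not points:
        return 0.0
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for x, y in points:
        sq_dist = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


# Hypothetical records: (seconds since start, UTM easting, UTM northing).
records = [
    (0, 1000.0, 1000.0), (120, 1100.0, 900.0), (590, 950.0, 1050.0),
    (600, 5000.0, 5000.0), (900, 5100.0, 4900.0),
]

windows = group_by_window(records)
for index in sorted(windows):
    density = kde_at(windows[index], (1000.0, 1000.0))
    print(index, len(windows[index]), density)
```

Evaluating `kde_at` on a regular grid of query points for each window would give one density surface per window, which can then be compared between taxis and buses.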
   

### Outlier Removal 

  Changi Airport is one of the biggest taxi hubs in Singapore. The figure shows that the data is concentrated in the airport area. When we remove the airport area from the equation, many other hotspot locations become visible in the sampled dataset.
  
  images go here.
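
One simple way to exclude the airport area is a bounding-box filter, sketched below. The box coordinates are illustrative guesses around Changi Airport, not the bounds used in this analysis, and the function names and sample points are hypothetical.

```python
# Assumed bounding box around Changi Airport, in degrees.
# These bounds are illustrative and not taken from the original analysis.
AIRPORT_BOX = {
    "lat_min": 1.33, "lat_max": 1.39,
    "lon_min": 103.97, "lon_max": 104.01,
}


def in_box(lat, lon, box=AIRPORT_BOX):
    """Return True if the point lies inside the bounding box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lon_min"] <= lon <= box["lon_max"])


def remove_airport(points, box=AIRPORT_BOX):
    """Keep only the (lat, lon) points that fall outside the airport box."""
    return [(lat, lon) for lat, lon in points if not in_box(lat, lon, box)]


# Hypothetical taxi positions: one near the airport, two in the city.
taxis = [(1.3644, 103.9915), (1.3048, 103.8318), (1.2839, 103.8515)]
print(remove_airport(taxis))
```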


### Different Kernel Density Estimation
[Karthik]
  
  Before we show the distributions of taxis and buses, we walk through the problem that kernel density estimation solves. A kernel density estimate is essentially a generalised version of a histogram. It is non-parametric, which means that we make no assumption about the distribution the data came from.
  
  For example, take a simple histogram with the value on the x-axis and the frequency on the y-axis. It has the following problems:
  
  1. It is prone to change when the starting and ending points of the bins change.
  2. It is not smooth.
  3. It gives a different interpretation for different bandwidths (bin widths).
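
The first and third problems can be seen in a small sketch. The `histogram` helper and the sample values are hypothetical and only illustrate how the counts shift with the bin origin and width.

```python
def histogram(values, origin, width, n_bins):
    """Count values in n_bins equal-width bins starting at origin."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - origin) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


values = [1.9, 2.1, 2.2, 3.8, 4.1, 4.2]

# Same data and bin width, different starting point.
print(histogram(values, origin=0.0, width=2.0, n_bins=3))  # [1, 3, 2]
print(histogram(values, origin=1.0, width=2.0, n_bins=3))  # [3, 3, 0]

# Same data and starting point, narrower bins.
print(histogram(values, origin=0.0, width=1.0, n_bins=6))  # [0, 1, 2, 1, 2, 0]
```

With the origin at 0 the data looks like it peaks in the middle bin, while shifting the origin to 1 suggests two equally heavy groups.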
  
  
   <img src="images/hist1.png" width="300px" style="float:left"><img src="images/hist2.png" width="300px">
   
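   Kernel densities solve two of the three problems:
   
   1. Kernel densities are not prone to change when the starting and ending points change.
   2. Kernel densities are smooth.
   
   The third problem remains: the estimate still depends on the chosen bandwidth.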
   
   
  
  


### Generating large scale KDE 



## Inference
### Comparison of Transport Densities

## Demand Estimation
### Prediction Model 
[hui han]

## Recommendation

## References

## Code Appendix

## Taxi API Code

```python
#!/home/bks4line/anaconda2/bin/python
# Author : Karthik Balasubramanian
from __future__ import print_function

import json
import os
import time
from datetime import datetime

import httplib2 as http  # External library
import pandas as pd
from pytz import timezone

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Please get your account key and place it here
headers = {'AccountKey': 'XXXXX', 'accept': 'application/json'}

uri = 'http://datamall2.mytransport.sg/'  # Resource URL
path = 'ltaodataservice/Taxi-Availability?$skip='
fmt = '%Y-%m-%d_%H:%M:%S'
sg = timezone('Asia/Singapore')
my_path = os.getcwd()  # replaces the IPython magic %pwd
dir_path = my_path + "/data"


def get_data_from_LTA(filename):
    """Fetch all available-taxi records and write or append them to a CSV."""
    method = 'GET'
    body = ''

    # Get handle to http
    h = http.Http()

    # The API is paginated: request pages, skipping the records already
    # fetched, until an empty page is returned.
    final_list = []
    while True:
        target = urlparse(uri + path + str(len(final_list)))
        response, content = h.request(target.geturl(), method, body, headers)
        page = json.loads(content.decode('utf-8'))["value"]
        if not page:
            break
        final_list.extend(page)

    # Stamp every record with the collection date and time in Singapore
    time_now_in_sg = datetime.now(sg)
    date_and_time_ff = time_now_in_sg.strftime(fmt)
    date_in_sg, time_in_sg = date_and_time_ff.split("_")

    df = pd.DataFrame(final_list)
    df['date'] = date_in_sg
    df['time'] = time_in_sg

    # Make sure the output directory exists before writing
    if not os.path.isdir(dir_path):
        os.makedirs(dir_path)

    new_filename = dir_path + "/taxi_" + date_and_time_ff + ".csv"
    if not filename:
        filename = new_filename
        df.to_csv(filename)
    elif os.path.getsize(filename) > 5e+6:
        # Start a new file once the current one exceeds 5 MB
        print("file_size_exceed")
        filename = new_filename
        print("new file name {0}".format(filename))
        df.to_csv(filename)
    else:
        print("file size not exceeded")
        df.to_csv(filename, mode='a', header=False)

    return filename


# Run the code below to poll the API continuously

# starttime = time.time()
# filename = None
# # get_data_from_LTA(filename=None)
# while True:
#     filename = get_data_from_LTA(filename)
#     starttime = time.mktime(datetime.now().timetuple())
#     time.sleep(50.0 - ((time.time() - starttime) % 60.0))
```

