# Problem Statement

BMTC's route dataset lists the bus routes, the bus stops each route covers, and the planned duration of the trip from origin to destination. How good are these duration estimates? Can we suggest improvements to the schedule based on how long trips take at different hours of the day?

# Motivation

Have you ever tried to use Google to get bus timings and estimated travel duration using public transport in Bangalore? In many cases, it gives clearly incorrect results.

For instance, the images below show Google's travel time estimates from Jayanagar to Majestic.

Google estimates a (car) ride from 4th Block Jayanagar to Majestic to take 33 minutes. Also, notice the traffic and the route taken.


<img src="img/car.png" style="height: 400px" title="car" height=400 />

The same route, Google claims, can be covered in 35 minutes by bus, which includes a 9 minute walk. Also, notice that the traffic doesn't have any reds and oranges!

<img src="img/bus.png" style="height: 400px" title="bus" height=400 />

Google seems to be just using the planned duration of trips provided by BMTC to estimate the duration of a trip. This is **evidently wrong**, and discourages me from taking a bus when I have to get somewhere on time. Can we help improve this situation? 

Also, it is common knowledge that trips between the same origins and destinations take different times based on the time of the day. BMTC's single estimate of a trip duration isn't helping people plan their trips effectively!

# Approach

Uber's [movement dataset](https://movement.uber.com/) for Bangalore provides various statistics about travel time duration between pairs of [wards in Bangalore](https://en.wikipedia.org/wiki/List_of_wards_in_Bangalore), by the hour of day.

BMTC's bus route data provides the list of bus-stops on each route, along with latitude/longitude information for those bus stops. 

We attempt to use these two datasets to get a sense of how accurate BMTC's estimates of trip duration are, and to help them improve these estimates if possible.

## Details

To get a rough estimate of each route's trip duration, we do the following.

1. Pick a route, and get a list of bus stops on that route. 
1. For each bus stop, figure out the ward name/number in which it lies. Uber's data gives us mean travel times between wards. 
1. Simplify the route to be a hop between wards, instead of hops between bus stops. We group all bus stops within a ward as one hop. 
1. Use Uber's data to get the mean travel time between consecutive wards on the route and add it up to get a *very very* rough estimate of the duration of the trip. 
1. Repeat this for all the routes and get a sense of how accurate/inaccurate the BMTC trip durations are.
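
Step 3 above can be sketched as follows. This is a minimal illustration that assumes each bus stop has already been mapped to a `(ward_no, ward_name)` tuple. The helper name `collapse_to_ward_hops` is hypothetical, and `route_to_wards` in `utils` may well do this differently.

```python
from itertools import groupby

def collapse_to_ward_hops(stop_wards):
    """Collapse consecutive bus stops that fall in the same ward into one hop.

    stop_wards is a list of (ward_no, ward_name) tuples, one per bus stop,
    in the order the bus visits them.
    """
    # groupby only merges *adjacent* equal items, so a route that re-enters
    # a ward later on keeps that second visit as a separate hop.
    return [ward for ward, _ in groupby(stop_wards)]

# Hypothetical stop-to-ward mapping for the first few stops of a route.
stops = [
    (168, 'Pattabhiram Nagar'),
    (168, 'Pattabhiram Nagar'),
    (169, 'Byrasandra'),
    (167, 'Yediyur'),
    (167, 'Yediyur'),
]
hops = collapse_to_ward_hops(stops)
print(hops)
```

Merging only adjacent duplicates keeps the order of the hops intact, which matters because the travel times are looked up for consecutive ward pairs.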

## Code

In [1]:
import pandas as pd
from utils import get_bmtc_routes, get_uber_travel_time, get_ward, route_to_wards

data = get_bmtc_routes()

BMTC's data provides the bus stops on each route as a JSON list, the planned time for the route, and the route number. Below is a sample of what the data looks like.

In [2]:
data[:3][['route_no', 'time', 'map_json_content']]

Unnamed: 0,route_no,time,map_json_content
0,1,01:25 Min.,"[{""busstop"": ""Jayanagara 9th Block,JAYANAGARA ..."
1,1E,01:45 Min.,"[{""busstop"": ""JPNagara 6th Phase,JP NAGARA 6TH..."
2,1F,00:50 Min.,"[{""busstop"": ""BTM Layout,BTM Layout 2nd Stage,..."


In [3]:
travel_time_data = get_uber_travel_time()

Uber provides the mean travel times between two wards, at different hours of the day. 

In [4]:
travel_time_data[:3][['sourceid', 'dstid', 'hod', 'mean_travel_time']]

Unnamed: 0,sourceid,dstid,hod,mean_travel_time
0,1,1,23,371.44
1,1,2,2,233.39
2,1,2,7,333.99


In [5]:
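def mean_time(ttd, src, dst):
    """Return the mean travel time, in seconds, from the src ward to the dst ward.

    ttd is Uber's travel time data. src and dst are (ward_no, ward_name)
    tuples. The mean is taken over every hour of day present in ttd.
    """
    src_data = ttd[ttd.sourceid == src[0]]
    data = src_data[src_data.dstid == dst[0]]
    return data.mean_travel_time.mean()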
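
`mean_time` averages over every hour of the day. Since the problem statement also asks about different hours, here is a minimal sketch of an hour-specific lookup. The function name `mean_time_at_hour` is an assumption introduced for illustration; it takes plain ward numbers rather than `(ward_no, ward_name)` tuples, and the sample frame reuses the three rows shown earlier.

```python
import pandas as pd

def mean_time_at_hour(ttd, src_id, dst_id, hour):
    """Mean travel time in seconds from src_id to dst_id at a given hour of day.

    Returns NaN when the data has no row for that ward pair and hour.
    """
    mask = (ttd.sourceid == src_id) & (ttd.dstid == dst_id) & (ttd.hod == hour)
    return ttd.loc[mask, 'mean_travel_time'].mean()

# The three sample rows shown earlier in the notebook.
sample = pd.DataFrame({
    'sourceid': [1, 1, 1],
    'dstid': [1, 2, 2],
    'hod': [23, 2, 7],
    'mean_travel_time': [371.44, 233.39, 333.99],
})
at_7am = mean_time_at_hour(sample, 1, 2, 7)
at_noon = mean_time_at_hour(sample, 1, 2, 12)  # no row for this hour
print(at_7am, at_noon)
```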

In [6]:
def mean_route_time(ward_list, ttd):
    """Return the mean travel time, in seconds, for each consecutive ward pair."""
    pairs = zip(ward_list[:-1], ward_list[1:])
    return [mean_time(ttd, src, dst) for src, dst in pairs]

In [7]:
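def estimate_travel_time(route, travel_time_data):
    """Estimate a route's duration as (hours, minutes) from Uber's ward-to-ward means.

    Also returns a flag that is True when some ward pair had no data.
    """
    wards = route_to_wards(route)
    means = pd.Series(mean_route_time(wards, travel_time_data))
    missing_data = means.hasnans
    # Series.sum() skips NaN, so hops with missing data add nothing to the total.
    total_minutes = int(means.sum() / 60)
    hours, minutes = divmod(total_minutes, 60)
    return (hours, minutes), missing_data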

In [8]:
print('{:>10} -- {:^10} || {}'.format('Route no.', 'BMTC', 'Uber estimate'))
for index in range(15):
    r = data.loc[index]
    try:
        (hours, minutes), missing_data = estimate_travel_time(r, travel_time_data)
    except TypeError:
        continue
    end = '*\n' if missing_data else '\n'
    print('{:>10} -- {} || {:02}:{:02} Min.'.format(r.route_no.strip(), r.time.strip(), hours, minutes), end=end)
print('* implies data for travel time for some ward pairs is missing')

 Route no. --    BMTC    || Uber estimate
         1 -- 01:25 Min. || 04:27 Min.
        1E -- 01:45 Min. || 07:12 Min.
        1F -- 00:50 Min. || 02:46 Min.
     CCC-1 -- 00:55 Min. || 00:49 Min.
     FDR-1 -- 01:20 Min. || 04:48 Min.
       G-1 -- 01:35 Min. || 01:29 Min.
       K-1 -- 02:10 Min. || 04:45 Min.
     KHC-1 -- 00:50 Min. || 00:49 Min.*
      MF-1 -- 00:55 Min. || 00:38 Min.
    NLMF-1 -- 00:35 Min. || 01:02 Min.*
     WFS-1 -- 00:55 Min. || 01:01 Min.
         2 -- 00:45 Min. || 04:22 Min.
        2A -- 00:45 Min. || 04:03 Min.
        2B -- 00:50 Min. || 05:16 Min.
        2D -- 00:40 Min. || 04:32 Min.
* implies data for travel time for some ward pairs is missing


For the first few routes in the BMTC dataset, we compare the BMTC provided durations against our estimates using the Uber data. Wow! Some of our estimates are three to nearly seven times the estimates given by BMTC (route 1 is about 3 times, route 2D almost 7 times). What's going on here?
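
To put numbers on that gap, the ratio of the two estimates can be computed directly. This is a small sketch that assumes the BMTC `time` strings are in `HH:MM` format despite the `Min.` suffix; `to_minutes` is a helper introduced here for illustration, and the rows are copied from the output above.

```python
def to_minutes(duration):
    """Convert a string like '01:25 Min.' to minutes, assuming HH:MM format."""
    hours, minutes = duration.split()[0].split(':')
    return int(hours) * 60 + int(minutes)

# (route, BMTC duration, Uber-based estimate) copied from the output above.
rows = [
    ('1', '01:25 Min.', '04:27 Min.'),
    ('CCC-1', '00:55 Min.', '00:49 Min.'),
    ('2D', '00:40 Min.', '04:32 Min.'),
]
ratios = {route: to_minutes(uber) / to_minutes(bmtc) for route, bmtc, uber in rows}
for route, ratio in ratios.items():
    print('{:>6}: {:.1f}x'.format(route, ratio))
```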

Let's dig a little deeper, and see why our estimate for the first route looks the way it does!

In [9]:
route = data.loc[0]

In [10]:
route_1_wards = route_to_wards(route)
route_1_wards

[(168, 'Pattabhiram Nagar'),
 (169, 'Byrasandra'),
 (167, 'Yediyur'),
 (154, 'Basavanagudi'),
 (142, 'Sunkenahalli'),
 (140, 'Chamrajapet'),
 (139, 'K R Market'),
 (138, 'Chalavadipalya'),
 (95, 'Subhash Nagar'),
 (94, 'Gandhinagar'),
 (65, 'Kadu Malleshwar Ward'),
 (35, 'Aramane Nagara'),
 (45, 'Malleswaram'),
 (44, 'Marappana Palya'),
 (38, 'HMT Ward')]

We fetch the wards for each bus stop and get a unique list of ward-hops on the bus route.  They seem reasonable, on manual verification. 

In [11]:
route_1_mean_times = mean_route_time(route_1_wards, travel_time_data)
ward_pairs = list(zip(route_1_wards[:-1], route_1_wards[1:]))
print('{:>20}-->{:<20} : {}'.format('From', 'To', 'Time (secs)'))
for (src, dst), time in zip(ward_pairs, route_1_mean_times):
    print('{:>20}-->{:<20} : {:4.0f}'.format(src[1], dst[1], time))

                From-->To                   : Time (secs)
   Pattabhiram Nagar-->Byrasandra           :  567
          Byrasandra-->Yediyur              :  670
             Yediyur-->Basavanagudi         : 3062
        Basavanagudi-->Sunkenahalli         : 1282
        Sunkenahalli-->Chamrajapet          :  782
         Chamrajapet-->K R Market           :  352
          K R Market-->Chalavadipalya       :  188
      Chalavadipalya-->Subhash Nagar        : 1250
       Subhash Nagar-->Gandhinagar          :  738
         Gandhinagar-->Kadu Malleshwar Ward : 1461
Kadu Malleshwar Ward-->Aramane Nagara       : 2576
      Aramane Nagara-->Malleswaram          : 1607
         Malleswaram-->Marappana Palya      : 1037
     Marappana Palya-->HMT Ward             :  495


All the times are in seconds -- 5 minutes is 300 seconds, 30 minutes is 1800. 

**Since these are consecutive wards on a bus route, each ward lies next to the one that follows it**.

This implies that the average travel times between wards shouldn't be very long. Depending on the size of the two wards, the average travel time can vary between ward pairs, but anything between 30 minutes and an hour starts to look suspicious!

The above route has 2 such hops - `Kadu Malleshwar Ward-->Aramane Nagara` and `Yediyur-->Basavanagudi`. It has 6 hops which are over 20 minutes. 
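
These counts can be checked mechanically. A minimal sketch, using the hop times from the table above; the helper `hops_over` is hypothetical and only exists for this illustration.

```python
# Mean hop times in seconds for route 1, copied from the table above.
hop_times = {
    ('Pattabhiram Nagar', 'Byrasandra'): 567,
    ('Byrasandra', 'Yediyur'): 670,
    ('Yediyur', 'Basavanagudi'): 3062,
    ('Basavanagudi', 'Sunkenahalli'): 1282,
    ('Sunkenahalli', 'Chamrajapet'): 782,
    ('Chamrajapet', 'K R Market'): 352,
    ('K R Market', 'Chalavadipalya'): 188,
    ('Chalavadipalya', 'Subhash Nagar'): 1250,
    ('Subhash Nagar', 'Gandhinagar'): 738,
    ('Gandhinagar', 'Kadu Malleshwar Ward'): 1461,
    ('Kadu Malleshwar Ward', 'Aramane Nagara'): 2576,
    ('Aramane Nagara', 'Malleswaram'): 1607,
    ('Malleswaram', 'Marappana Palya'): 1037,
    ('Marappana Palya', 'HMT Ward'): 495,
}

def hops_over(hop_times, minutes):
    """Return the ward pairs whose mean travel time exceeds the given minutes."""
    limit = minutes * 60
    return [pair for pair, secs in hop_times.items() if secs > limit]

over_30 = hops_over(hop_times, 30)
over_20 = hops_over(hop_times, 20)
print(len(over_30), 'hops over 30 minutes:', over_30)
print(len(over_20), 'hops over 20 minutes')
```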

Let's look at what Uber's visualization of this data looks like.

Uber's visualisation shows that the mean time is about 5 minutes for Kadu Malleshwar Ward to Aramane Nagara.

![Kadu Malleshwar Ward to Aramane Nagara](img/uber-travel-time-kmw-anw.png)

Is the data provided for download somehow biased? Or broken?

Similarly, the mean time is about 4 minutes for Yediyur to Basavanagudi.

![Yediyur to Basavanagudi](img/uber-travel-time-yw-bw.png)

Let's see whether a different dump of the data helps us find out what's going on.

The above estimates were made using data from the **last quarter of 2018**. Let's check whether similar problems show up with other data, say the **second quarter of 2016**.

In [12]:
travel_time_data = pd.read_csv('data/bangalore-wards-2016-2-All-HourlyAggregate.csv')
print('{:>10} -- {:^10} || {}'.format('Route no.', 'BMTC', 'Uber estimate'))
for index in range(15):
    r = data.loc[index]
    try:
        (hours, minutes), missing_data = estimate_travel_time(r, travel_time_data)
    except TypeError:
        continue
    end = '*\n' if missing_data else '\n'
    print('{:>10} -- {} || {:02}:{:02} Min.'.format(r.route_no.strip(), r.time.strip(), hours, minutes), end=end)
print('* implies data for travel time for some ward pairs is missing')

 Route no. --    BMTC    || Uber estimate
         1 -- 01:25 Min. || 04:15 Min.*
        1E -- 01:45 Min. || 07:17 Min.*
        1F -- 00:50 Min. || 02:53 Min.
     CCC-1 -- 00:55 Min. || 00:52 Min.
     FDR-1 -- 01:20 Min. || 05:12 Min.
       G-1 -- 01:35 Min. || 01:41 Min.
       K-1 -- 02:10 Min. || 05:39 Min.
     KHC-1 -- 00:50 Min. || 00:54 Min.*
      MF-1 -- 00:55 Min. || 00:42 Min.
    NLMF-1 -- 00:35 Min. || 01:08 Min.*
     WFS-1 -- 00:55 Min. || 01:03 Min.
         2 -- 00:45 Min. || 04:48 Min.
        2A -- 00:45 Min. || 04:26 Min.
        2B -- 00:50 Min. || 05:44 Min.
        2D -- 00:40 Min. || 05:00 Min.
* implies data for travel time for some ward pairs is missing


We see that the same routes have similar errors in the estimates even with older data. Let's look at the same route no. 1 and see what the estimated times between wards look like. Yediyur to Basavanagudi is again over 50 minutes!

In [13]:
route_1_mean_times = mean_route_time(route_1_wards, travel_time_data)
ward_pairs = list(zip(route_1_wards[:-1], route_1_wards[1:]))
print('{:>20}-->{:<20} : {}'.format('From', 'To', 'Time (secs)'))
for (src, dst), time in zip(ward_pairs, route_1_mean_times):
    print('{:>20}-->{:<20} : {:4.0f}'.format(src[1], dst[1], time))

                From-->To                   : Time (secs)
   Pattabhiram Nagar-->Byrasandra           :  573
          Byrasandra-->Yediyur              :  740
             Yediyur-->Basavanagudi         : 3240
        Basavanagudi-->Sunkenahalli         : 1350
        Sunkenahalli-->Chamrajapet          :  860
         Chamrajapet-->K R Market           :  352
          K R Market-->Chalavadipalya       :  181
      Chalavadipalya-->Subhash Nagar        : 1638
       Subhash Nagar-->Gandhinagar          :  995
         Gandhinagar-->Kadu Malleshwar Ward : 1630
Kadu Malleshwar Ward-->Aramane Nagara       :  nan
      Aramane Nagara-->Malleswaram          : 1949
         Malleswaram-->Marappana Palya      : 1201
     Marappana Palya-->HMT Ward             :  623


And let's look at what Uber's visualization of this data tells us.

<img src="img/uber-travel-time-yw-bw-2016-2.png" style="height: 400px" title="Yediyur to Basavanagudi (2016 Q2)" height=400 />

Again, Uber gives us a range of 4 to 11 minutes, and gives us a daily average of about 6 and a half minutes. But this is what the data in the 2018 Q4 data dump looks like.

In [14]:
!cat data/bangalore_wards.json |jq ".features[].properties"| jq '{(.WARD_NAME): .WARD_NO}'|grep -P "Yediyur|Basavanagudi"
!grep ^167,154 data/bangalore-wards-2018-4-All-HourlyAggregate.csv|cut -d , -f 1,2,4

  "Basavanagudi": "154"
  "Yediyur": "167"
167,154,2031.17
167,154,4448.83
167,154,2863.98
167,154,4083.02
167,154,1843.87
167,154,2110.06
167,154,3261.44
167,154,4098.44
167,154,3019.62
167,154,1895.13
167,154,1962.89
167,154,3879.9
167,154,3418.7
167,154,3368.02
167,154,1758.77
167,154,3522.11
167,154,3193.04
167,154,4489.85
167,154,2493.99
167,154,1875.14
167,154,4473.59
167,154,2921.35
167,154,4399.07
167,154,2083.04


All the hourly mean times for Yediyur to Basavanagudi are in the thousands of seconds (roughly 29 to 75 minutes), and none of them fall in the range that Uber's site shows (4 to 11 minutes). There's definitely something fishy with the data, and Uber's site is apparently not using the dumps but fetching data directly from their servers based on the query.

The travel times from `Basavanagudi` to `Yediyur` (the opposite direction) look similar, and don't fall in the range shown in Uber's visualization either.
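
The same check can be done without the shell. A sketch that uses the 24 hourly values printed above for ward 167 to ward 154, and compares them against the upper end of the range shown on Uber's site; the variable names are chosen here for illustration.

```python
import pandas as pd

# Hourly mean travel times (seconds) for ward 167 -> 154, copied from the output above.
hourly = pd.Series([
    2031.17, 4448.83, 2863.98, 4083.02, 1843.87, 2110.06,
    3261.44, 4098.44, 3019.62, 1895.13, 1962.89, 3879.9,
    3418.7, 3368.02, 1758.77, 3522.11, 3193.04, 4489.85,
    2493.99, 1875.14, 4473.59, 2921.35, 4399.07, 2083.04,
])

site_max_secs = 11 * 60  # upper end of the 4 to 11 minute range on Uber's site

summary = {
    'min_minutes': hourly.min() / 60,
    'max_minutes': hourly.max() / 60,
    'mean_secs': hourly.mean(),
    'hours_within_site_range': int((hourly <= site_max_secs).sum()),
}
print(summary)
```

The mean of these values is the figure that `mean_time` returns for this hop in the 2018 Q4 estimate, since it averages over all hours of the day.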

In [15]:
!grep ^154,167 data/bangalore-wards-2018-4-All-HourlyAggregate.csv|cut -d , -f 1,2,4

154,167,1840.28
154,167,4107.67
154,167,2685.95
154,167,3819.25
154,167,1987.19
154,167,2303.5
154,167,3117.83
154,167,3486.45
154,167,2987.1
154,167,2016.32
154,167,1937.06
154,167,3320.76
154,167,3031.49
154,167,3293.29
154,167,1722.53
154,167,3266.7
154,167,2976.16
154,167,3852.05
154,167,2537.02
154,167,1937.61
154,167,4118.23
154,167,2738.16
154,167,4003.55
154,167,2158.65


# Summary

We attempted to audit BMTC's travel duration estimates using Uber's travel time data between wards. But the audit found problems in the data export files provided by Uber, so any further analysis based on them wouldn't be meaningful. Uber's [website](https://movement.uber.com/) itself seems to fetch data directly from their servers, and gives much better estimates.

Follow-up work on this could try to use Uber's APIs to fetch the data and use that to improve the BMTC estimates. There'd be hundreds of API calls to fetch data for travel times between pairs of wards. But, we would only need to get data for travel times between neighboring wards, and not wards that have other wards between them. This would reduce the size of data to be fetched drastically. 
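
As a rough sketch of that reduction, the ward pairs worth fetching can be collected from the routes themselves, since only consecutive hops are ever looked up. The function `pairs_to_fetch` and the second route are hypothetical, and the ward count of 198 is an assumption based on the linked list of wards.

```python
def pairs_to_fetch(routes_as_wards):
    """Collect the unique ordered (src, dst) ward-number pairs that appear
    as consecutive hops on at least one route."""
    needed = set()
    for wards in routes_as_wards:
        needed.update(zip(wards[:-1], wards[1:]))
    return needed

# Ward numbers for route 1 (from the notebook) plus a made-up second route
# that shares one hop with it.
route_1 = [168, 169, 167, 154, 142, 140, 139, 138, 95, 94, 65, 35, 45, 44, 38]
route_x = [167, 154, 155]  # hypothetical route, for illustration only
needed = pairs_to_fetch([route_1, route_x])

n_wards = 198  # assumption: number of wards in Bangalore
all_pairs = n_wards * (n_wards - 1)  # every ordered pair of distinct wards
print(len(needed), 'pairs to fetch instead of', all_pairs)
```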