# Pune ITMS data
In this notebook we will have a look at the ITMS transit data from Pune. The data consists of two parts: 1) the static data corresponding to the different trip details (stop sequences, timings), stop details (stop_id, stop location) etc, and 2) on field data from the buses having information on the current trip undertaken by the bus, the current location of the bus, whether a bus is in transit to a stop or stopped at a stop etc. 


### Import required modules

In [0]:
import numpy as np
import pandas as pd
#import matplotlib 
#matplotlib.use('nbagg')
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import folium
from folium import plugins
import os
import zipfile
import random
%matplotlib inline
import math

## Get GTFS static files and field data files
We download the required GTFS static files like trips.txt, stops.txt, stop_times.txt available at opendata.punecorporation.org
This static meta data will be useful in visualization of the bus routes/stops and schedule.


In [0]:
os.system("wget -N -O pune_ITMS_GTFS.zip 'http://opendata.punecorporation.org/Citizen/CitizenDatasets/Download/481?filepath=%2FDocuments%2F481%2FPMPML%20Bus%20Routes%20%20-%20July%202019.zip'")
with zipfile.ZipFile("pune_ITMS_GTFS.zip","r") as zip_ref:
    zip_ref.extractall("targetdir")

In addition, we also download the field data files (consisting of the bus transit data) which were provided by PMPML. This data is of bus live locations for the month of June.

In [0]:
os.system("wget -N -O targetdir/out.json --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B_MYQpITPUbGMzNLdkwydFhOcDhHMS1JRTllVXVYYTJjU0s0'")

0

Our directory should now look like the following

In [0]:
!ls -l targetdir

total 74352
-rw-r--r-- 1 root root      152 Nov 23 14:27 agency.txt
-rw-r--r-- 1 root root      372 Nov 23 14:27 calendar.txt
-rw-r--r-- 1 root root      164 Nov 23 14:27 feed_info.txt
-rw-r--r-- 1 root root  3388142 Nov 23 14:27 out.json
-rw-r--r-- 1 root root     8540 Nov 23 14:27 routes.txt
-rw-r--r-- 1 root root  5950145 Nov 23 14:27 shapes.txt
-rw-r--r-- 1 root root   307430 Nov 23 14:27 stops.txt
-rw-r--r-- 1 root root 64028482 Nov 23 14:27 stop_times.txt
-rw-r--r-- 1 root root       51 Nov 23 14:27 translations.txt
-rw-r--r-- 1 root root  2423483 Nov 23 14:27 trips.txt


Let's read these files as Pandas dataframes.

These are a bunch of static files which are GTFS compliant. The last file is a field data, collected for buses travelling on a particular day. The static files are csv files and the field data is a json file.

In [0]:
df_stops = pd.read_csv('targetdir/stops.txt')
df_trips = pd.read_csv('targetdir/trips.txt')
df_shapes = pd.read_csv('targetdir/shapes.txt')
df_routes = pd.read_csv('targetdir/routes.txt')
df_trips.sort_values('route_id',inplace=True)
df_stop_time = pd.read_csv('targetdir/stop_times.txt')

df_field_data = pd.DataFrame(field_data)


Let's take a peek at the field data, which was the live data on the particular day we chose.

In [244]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,CNG492,2019-11-21T20:59:59.827+05:30,SCHEDULED,4,,IN_TRANSIT_TO,180,18.499653,40149,2019-11-21T15:29:15,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-0815_0,73.86097
1,,CNG30,2019-11-21T20:59:59.731+05:30,SCHEDULED,1,,STOPPED_AT,158,18.534315,31013,2019-11-21T15:29:14,NORMAL_158_Ma Na Pa To Lohgaon_Up-0700_0,73.84843
2,,1177,2019-11-21T20:59:15.666+05:30,SCHEDULED,9,,IN_TRANSIT_TO,5,18.466667,40715,2019-11-21T15:28:54,NORMAL_5_Pune Station To Swargate_Down-0625_0,73.78061
3,,R503,2019-11-21T20:59:15.559+05:30,SCHEDULED,13,,IN_TRANSIT_TO,298,18.595135,40185,2019-11-21T15:28:46,NORMAL_298_Chinchwad Gaon To Katraj_Down-2040_0,73.78348
4,,R495,2019-11-21T20:59:15.515+05:30,SCHEDULED,27,,STOPPED_AT,235,18.541004,30106,2019-11-21T15:28:41,NORMAL_235_Katraj To Kharadi Gaon_Up-2000_0,73.88367


The POSITION_UPDATE_TIMESTAMP(string) is the time stamp when the latitude and longitude(string) were recorded. These need to bee converted to date-time objects and floats respectively.

In [0]:
df_field_data["POSITION_UPDATE_TIMESTAMP"] = pd.to_datetime(df_field_data["POSITION_UPDATE_TIMESTAMP"])
df_field_data["LATITUDE_STR"] = pd.to_numeric(df_field_data["LATITUDE_STR"])
df_field_data["LONGITUDE_STR"] = pd.to_numeric(df_field_data["LONGITUDE_STR"])
""" Other fields which need to be converted """
df_field_data["STOP_ID"] = pd.to_numeric(df_field_data["STOP_ID"])
df_field_data["CURRENT_STOP_SEQUENCE"] = pd.to_numeric(df_field_data["STOP_ID"])

## Understanding the GTFS static files

Pune's static GTFS files define stops, trips and routes as the following -

| Field Name 	| Field Description 	| Example 	| 
--- | --- | ---
| route_id 	| ID of the route. A route is a fixed path between  two points. Routes in the dataset has no name. 	| 42.  This number features in the trip_id field 	| 
| trip_id 	| There can be many trips for the same route. For e.g, one bus can go on the same route multiple times, but for each such trips, there will be a different trip_id 	| NORMAL_42_Katraj To Bhakti Shakti_Up-0740_0 {Route_ID}{Route_String} {Direction Up/Down}-{Time HHMM}{0}
| stop_id | ID of a given stop. A route has multiple stops. | 4036
| shape_id | Every route has a shape_id associated with it. This demarcates the path of the route| 3389

## Dataset deep dive

Let's see a few routes available in Pune.

In [246]:
df_routes.head()

Unnamed: 0,route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
0,100_18,PMPML,100,Ma. Na. Pa. to Hinjawadi Phase,,3,,,
1,101,PMPML,101,,,3,,,
2,102,PMPML,102,,,3,,,
3,103M,PMPML,103M,,,3,,,
4,103,PMPML,103,,,3,,,


Let's assume we want to see a specific route. Let's see the trips associated with that route.

In [247]:
chosen_route = "5"
df_trips.loc[df_trips["route_id"] == chosen_route].head()

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,bikes_allowed,duty,duty_sequence_number,run_sequence_number
20539,5,1,NORMAL_5_Swargate To Pune Station_Up-1850_0,Swargate To Pune Station,,0,,3388,,,5/7,24,10
20538,5,1,NORMAL_5_Pune Station To Swargate_Down-1815_0,Pune Station To Swargate,,0,,3389,,,5/7,23,9
20540,5,1,NORMAL_5_Pune Station To Swargate_Down-1955_0,Pune Station To Swargate,,0,,3389,,,5/7,25,11
20498,5,1,NORMAL_5_Pune Station To Swargate_Down-1245_0,Pune Station To Swargate,,0,,3389,,,5/6,11,11
20351,5,1,NORMAL_5_Pune Station To Swargate_Down-0620_0,Pune Station To Swargate,,0,,3389,,,5/1,3,3


What's important in this is the route_id, trip_id, the trip_headsign and the shape_id.
The trip_id is the id that ties a bus to a route. trip_id's has the following heirarchical naming structure -
{Normal/Special}_{route_id}_{trip_headsign}_{Direction Up/Down}_{Time HHMM}_{0}

Again, bear in mind, a route is a static path between two points (with no temporal connotations), a trip is a journey a bus takes along a certain route and during a certain time.

Let's take a peek at how the field data from buses moving around Pune looks like

In [248]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,CNG492,2019-11-21T20:59:59.827+05:30,SCHEDULED,4,,IN_TRANSIT_TO,180,18.499653,40149,2019-11-21 15:29:15,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-0815_0,73.86097
1,,CNG30,2019-11-21T20:59:59.731+05:30,SCHEDULED,1,,STOPPED_AT,158,18.534315,31013,2019-11-21 15:29:14,NORMAL_158_Ma Na Pa To Lohgaon_Up-0700_0,73.84843
2,,1177,2019-11-21T20:59:15.666+05:30,SCHEDULED,9,,IN_TRANSIT_TO,5,18.466667,40715,2019-11-21 15:28:54,NORMAL_5_Pune Station To Swargate_Down-0625_0,73.78061
3,,R503,2019-11-21T20:59:15.559+05:30,SCHEDULED,13,,IN_TRANSIT_TO,298,18.595135,40185,2019-11-21 15:28:46,NORMAL_298_Chinchwad Gaon To Katraj_Down-2040_0,73.78348
4,,R495,2019-11-21T20:59:15.515+05:30,SCHEDULED,27,,STOPPED_AT,235,18.541004,30106,2019-11-21 15:28:41,NORMAL_235_Katraj To Kharadi Gaon_Up-2000_0,73.88367


Let's see 10 of the most prominent trips in the field data. These trips have the best GPS data.

In [295]:
good_trips = df_field_data["TRIP_ID"].value_counts()[:50].index.tolist()
good_trips[:10]

['NORMAL_114_Ma Na Pa To Mhalungegaon_Up-1440_0',
 'NORMAL_101_Kothrud Depot To Kondhwa Kh_Up-0600_0',
 'NORMAL_87_Deccan Gymkhana To Girmepark_Up-0750_0',
 'NORMAL_323B_Nehrunagar Depo To Chikhali_Up-0550_1',
 'NORMAL_43_Katraj To Bhakti Shakti_Up-0625_0',
 'NORMAL_114_Ma Na Pa To Mhalungegaon_Up-0945_0',
 'NORMAL_5_Pune Station To Swargate_Down-0625_0',
 'NORMAL_158_Ma Na Pa To Lohgaon_Up-0700_0',
 'NORMAL_12_Bhakti Shakti To Upper Depot_Down-1225_0',
 'NORMAL_143A_Galinde Path To Pune Station (Vai Nal Stop)_Up-0810_0']

Unfortunately, a trip having most records in the dataset is not an indication of the bus actually moving. We'll need to choose trips with the most variance in latitude, longitude and stop_ids. Entroipy is one such measure. The larger the entropy, the greater the guarantee that the bus is moving and changing stops. Let's create a function that calculates the entroopy of stops and position and creates a score based on that. Taking the index of the trip with a maximum score will yield a good trip which we will use to analyze.

In [0]:
import numpy as np
from scipy.stats import entropy
from math import log, e

def column_entropy(column, base=None):
  vc = pd.Series(column).value_counts(normalize=True, sort=False)
  base = e if base is None else base
  return -(vc * np.log(vc)/np.log(base)).sum()

In [0]:
df_field_grouped = df_field_data.groupby(["TRIP_ID"])
score = []
""" Iterate through each trip """
for trip in good_trips:
  group = df_field_grouped.get_group(trip)
  entropy_pos = column_entropy(group["LONGITUDE_STR"])
  entropy_stop =  column_entropy(group["STOP_ID"])
  """ Compute score """
  score.append(entropy_pos * entropy_stop)

""" Get indexes of sorted scores """
good_trips_sorted_indexes = np.argsort(score)

""" The last one of the sorted indices is the one with the best score """
best_index = good_trips_sorted_indexes[-1]

We have chosen the trip with the best entropy score for our analysis. You may choose any other trip.
Now we can use this best_index and choose the trip name corresponding to that index.

In [318]:
chosen_trip = good_trips[best_index]
chosen_trip

'NORMAL_68S_Sutardara To Upper Depot_Up-0840_0'

We will proceed with out analysis for this trip

These are the first 5 stops of that trip with their respective times and stop_id. 

In [319]:
df_stop_time.loc[df_stop_time["trip_id"] == chosen_trip].head(5)["trip_id"].values

array(['NORMAL_68S_Sutardara To Upper Depot_Up-0840_0',
       'NORMAL_68S_Sutardara To Upper Depot_Up-0840_0',
       'NORMAL_68S_Sutardara To Upper Depot_Up-0840_0',
       'NORMAL_68S_Sutardara To Upper Depot_Up-0840_0',
       'NORMAL_68S_Sutardara To Upper Depot_Up-0840_0'], dtype=object)

Let's revisit the trips dataframe. Notice the shape_id, this is an identifier for all the stops that belong to a trip. Let's take a specific example of one particular trip.

In [320]:
route_of_trip = df_trips.loc[df_trips["trip_id"] == chosen_trip]
route_of_trip

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,bikes_allowed,duty,duty_sequence_number,run_sequence_number
19476,68S,1,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,Sutardara To Upper Depot,,0,,2210,,,68/1,2,2


We can use this trips shape_id = 3389 and find all the stops locations from the df_shapes, the shapes dataframe. The first 5 stops of this trip are -

In [321]:
shape_of_route = df_shapes.loc[df_shapes["shape_id"] == route_of_trip["shape_id"].values[0]]
shape_of_route.head(5)

Unnamed: 0,shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
185905,2210,18.51548,73.80779,1,
185906,2210,18.51548,73.80784,2,
185907,2210,18.51421,73.80782,3,
185908,2210,18.51349,73.80784,4,
185909,2210,18.51303,73.80783,5,


Let's try to visualize this trip.

In [322]:
""" Initialize the map """
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=route_of_trip["trip_id"].values[0], color="red").add_to(m)
m


Let's find all the stops that are encountered in this trip and overlay it on the route. 
Let's again select a random route from the routes table and then choose a random trip on that route.


Now, we can obtain the stop times for this particular trip from the df_stop_time data frame. We should sort this table with stop_sequence.

In [323]:
stops_on_route = df_stop_time.loc[df_stop_time["trip_id"] == chosen_trip].sort_values("stop_sequence")
stops_on_route.head(5)

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled,timepoint
679869,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,08:40:00,08:40:00,35666,1,,,,,
679870,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,08:40:34,08:40:54,35503,2,,,,,
679871,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,08:42:50,08:43:10,33459,3,,,,,
679872,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,08:43:49,08:44:05,39283,4,,,,,
679873,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,08:44:33,08:44:41,39282,5,,,,,


For each of these stops that are identified by a stop_id, we need to fetch it's name and location fro the df_stops data frame. Subsequent to which we may add that stop to the map and visualize. We will plot both the trip route and the stops together.

In [324]:
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=chosen_trip, color="red").add_to(m)

for index, row in stops_on_route.iterrows():
  stop_info = df_stops.loc[df_stops["stop_id"] == row["stop_id"]]
  folium.Marker(stop_info[["stop_lat", "stop_lon"]].values.tolist()[0],
                popup=stop_info["stop_name"].values[0]).add_to(m) 

m

## Visualizing Archived Field data

The first set of data corresponds to field data from around 06:00 AM to 03:00 PM on 04-11-2019.

In [325]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,CNG492,2019-11-21T20:59:59.827+05:30,SCHEDULED,4,,IN_TRANSIT_TO,180,18.499653,40149.0,2019-11-21 15:29:15,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-0815_0,73.86097
1,,CNG30,2019-11-21T20:59:59.731+05:30,SCHEDULED,1,,STOPPED_AT,158,18.534315,31013.0,2019-11-21 15:29:14,NORMAL_158_Ma Na Pa To Lohgaon_Up-0700_0,73.84843
2,,1177,2019-11-21T20:59:15.666+05:30,SCHEDULED,9,,IN_TRANSIT_TO,5,18.466667,40715.0,2019-11-21 15:28:54,NORMAL_5_Pune Station To Swargate_Down-0625_0,73.78061
3,,R503,2019-11-21T20:59:15.559+05:30,SCHEDULED,13,,IN_TRANSIT_TO,298,18.595135,40185.0,2019-11-21 15:28:46,NORMAL_298_Chinchwad Gaon To Katraj_Down-2040_0,73.78348
4,,R495,2019-11-21T20:59:15.515+05:30,SCHEDULED,27,,STOPPED_AT,235,18.541004,30106.0,2019-11-21 15:28:41,NORMAL_235_Katraj To Kharadi Gaon_Up-2000_0,73.88367


You may ignore the NAME field in this data frame. What is of importance is the TRIP_ID, LONGITUDE_STR, LATITUDE_STR, ROUTE_ID, POSITION_UPDATE_TIMESTAMP.

Let's follow the same procedure as before and plot the path traversed by the bus over the previous static route.

In [335]:
trip_on_field = df_field_data.loc[df_field_data["TRIP_ID"] == chosen_trip]
trip_on_field.head()


Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
1423,,714,2019-11-21T19:46:29.921+05:30,SCHEDULED,30999.0,,IN_TRANSIT_TO,68S,18.461142,30999.0,2019-11-21 14:15:58,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,73.87115
1448,,714,2019-11-21T19:45:31.892+05:30,SCHEDULED,30999.0,,IN_TRANSIT_TO,68S,18.46101,30999.0,2019-11-21 14:14:59,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,73.86829
1481,,714,2019-11-21T19:44:23.234+05:30,SCHEDULED,31001.0,,IN_TRANSIT_TO,68S,18.461205,31001.0,2019-11-21 14:13:56,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,73.86669
1497,,714,2019-11-21T19:43:27.866+05:30,SCHEDULED,31002.0,,STOPPED_AT,68S,18.462751,31002.0,2019-11-21 14:12:58,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,73.86522
1521,,714,2019-11-21T19:42:27.040+05:30,SCHEDULED,31002.0,,IN_TRANSIT_TO,68S,18.463676,31002.0,2019-11-21 14:11:50,NORMAL_68S_Sutardara To Upper Depot_Up-0840_0,73.864944


In [327]:
lat_lons_field = trip_on_field[["LATITUDE_STR", "LONGITUDE_STR"]].values.tolist()
stop_id_field = trip_on_field["STOP_ID"].values.tolist()

""" Initialize the map """
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=chosen_trip, color="red").add_to(m)

for index, row in stops_on_route.iterrows():
  stop_info = df_stops.loc[df_stops["stop_id"] == row["stop_id"]]
  folium.Marker(stop_info[["stop_lat", "stop_lon"]].values.tolist()[0],
                popup="Static StopID = " + stop_info["stop_code"].values[0]).add_to(m) 

""" Plot field data as green marker"""
for lls, stop_id in zip(lat_lons_field, stop_id_field):
  folium.Marker(lls, popup="Live StopID = " + str(stop_id), icon=folium.Icon(color='green')).add_to(m)

m

You can change the routes and see the plot for different routes.

