# Pune ITMS data
In this notebook we will have a look at the ITMS transit data from Pune. The data consists of two parts: 1) the static data corresponding to the different trip details (stop sequences, timings), stop details (stop_id, stop location) etc, and 2) on field data from the buses having information on the current trip undertaken by the bus, the current location of the bus, whether a bus is in transit to a stop or stopped at a stop etc. 


### Import required modules

In [0]:
import numpy as np
import pandas as pd
#import matplotlib 
#matplotlib.use('nbagg')
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import folium
from folium import plugins
import os
import zipfile
import random
%matplotlib inline
import math

Install pyIUDX from github.

In [0]:
!pip install git+https://github.com/iudx/pyIUDX --upgrade

Let's import pyIUDX to get field data of buses on a particular day. Let's say 21st November between 6AM and 12PM. Since there is a lot of data, this will take some time.
We will also need to find the id of the ITMS database from PUDX. Visit pudx.catalogue.iudx.org.in and search for ITMS to get the ID. The ID of the ITMS dataset is - 
rbccps.org/aa9d66a000d94a78895de8d4c0b3a67f3450e531/pudx-resource-server/pune-itms/pune-itms-live

In [0]:
from pyIUDX.rs import rs

rs = rs.ResourceServer("https://pudx.resourceserver.iudx.org.in/resource-server/pscdcl/v1")

id = "rbccps.org/aa9d66a000d94a78895de8d4c0b3a67f3450e531/pudx-resource-server/pune-itms/pune-itms-live"
startTime = "2019-11-21T06:00:00.000+05:30"
endTime = "2019-11-21T12:00:00.000+05:30"

""" The field data obtained here is a dictionary """
field_data = rs.getDataDuring(id, startTime, endTime)

## Get GTFS static files and field data files
We download the required GTFS static files like trips.txt, stops.txt, stop_times.txt available at opendata.punecorporation.org
This static meta data will be useful in visualization of the bus routes/stops and schedule.


In [0]:
os.system("wget -N -O pune_ITMS_GTFS.zip 'http://opendata.punecorporation.org/Citizen/CitizenDatasets/Download/481?filepath=%2FDocuments%2F481%2FPMPML%20Bus%20Routes%20%20-%20July%202019.zip'")
with zipfile.ZipFile("pune_ITMS_GTFS.zip","r") as zip_ref:
    zip_ref.extractall("targetdir")

Our directory should now look like the following

In [5]:
!ls -l targetdir

total 71040
-rw-r--r-- 1 root root      152 Nov 24 10:47 agency.txt
-rw-r--r-- 1 root root      372 Nov 24 10:47 calendar.txt
-rw-r--r-- 1 root root      164 Nov 24 10:47 feed_info.txt
-rw-r--r-- 1 root root     8540 Nov 24 10:47 routes.txt
-rw-r--r-- 1 root root  5950145 Nov 24 10:47 shapes.txt
-rw-r--r-- 1 root root   307430 Nov 24 10:47 stops.txt
-rw-r--r-- 1 root root 64028482 Nov 24 10:47 stop_times.txt
-rw-r--r-- 1 root root       51 Nov 24 10:47 translations.txt
-rw-r--r-- 1 root root  2423483 Nov 24 10:47 trips.txt


Let's read these files as Pandas dataframes.

These are a bunch of static files which are GTFS compliant. The last file is a field data, collected for buses travelling on a particular day. The static files are csv files and the field data is a json file.

In [0]:
df_stops = pd.read_csv('targetdir/stops.txt')
df_trips = pd.read_csv('targetdir/trips.txt')
df_shapes = pd.read_csv('targetdir/shapes.txt')
df_routes = pd.read_csv('targetdir/routes.txt')
df_trips.sort_values('route_id',inplace=True)
df_stop_time = pd.read_csv('targetdir/stop_times.txt')

df_field_data = pd.DataFrame(field_data)


Let's take a peek at the field data, which was the live data on the particular day we chose.

In [7]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,E014,2019-11-21T11:59:59.716+05:30,SCHEDULED,32,,IN_TRANSIT_TO,180,18.530954,153009,2019-11-21T06:29:30,NORMAL_180_Bhekrainagar To Na Ta Wadi_Down-1145_0,73.847534
1,,E001,2019-11-21T11:59:59.566+05:30,SCHEDULED,10,,IN_TRANSIT_TO,322,18.627222,152113,2019-11-21T06:29:22,NORMAL_322_Akurdi Railway Station To Ma Na Pa ...,73.78255
2,,E015,2019-11-21T11:59:59.439+05:30,SCHEDULED,3,,IN_TRANSIT_TO,180,18.529034,40003,2019-11-21T06:29:28,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-1730_0,73.852516
3,,E006,2019-11-21T11:59:59.381+05:30,SCHEDULED,11,,IN_TRANSIT_TO,372,18.593365,40474,2019-11-21T06:29:25,NORMAL_372_Hinjawadi Maan Phase 3 To Bhakti Sh...,73.7361
4,,680,2019-11-21T11:59:59.314+05:30,SCHEDULED,10,,IN_TRANSIT_TO,58,18.518957,30417,2019-11-21T06:29:29,NORMAL_58_Shanipar To Gokhalenagar_Up-1135_0,73.83155


The POSITION_UPDATE_TIMESTAMP(string) is the time stamp when the latitude and longitude(string) were recorded. These need to bee converted to date-time objects and floats respectively.

In [0]:
df_field_data["POSITION_UPDATE_TIMESTAMP"] = pd.to_datetime(df_field_data["POSITION_UPDATE_TIMESTAMP"])
df_field_data["LATITUDE_STR"] = pd.to_numeric(df_field_data["LATITUDE_STR"])
df_field_data["LONGITUDE_STR"] = pd.to_numeric(df_field_data["LONGITUDE_STR"])
""" Other fields which need to be converted """
df_field_data["STOP_ID"] = pd.to_numeric(df_field_data["STOP_ID"])
df_field_data["CURRENT_STOP_SEQUENCE"] = pd.to_numeric(df_field_data["STOP_ID"])

## Understanding the GTFS static files

Pune's static GTFS files define stops, trips and routes as the following -

| Field Name 	| Field Description 	| Example 	| 
--- | --- | ---
| route_id 	| ID of the route. A route is a fixed path between  two points. Routes in the dataset has no name. 	| 42.  This number features in the trip_id field 	| 
| trip_id 	| There can be many trips for the same route. For e.g, one bus can go on the same route multiple times, but for each such trips, there will be a different trip_id 	| NORMAL_42_Katraj To Bhakti Shakti_Up-0740_0 {Route_ID}{Route_String} {Direction Up/Down}-{Time HHMM}{0}
| stop_id | ID of a given stop. A route has multiple stops. | 4036
| shape_id | Every route has a shape_id associated with it. This demarcates the path of the route| 3389

## Dataset deep dive

Let's see a few routes available in Pune.

In [9]:
df_routes.head()

Unnamed: 0,route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
0,100_18,PMPML,100,Ma. Na. Pa. to Hinjawadi Phase,,3,,,
1,101,PMPML,101,,,3,,,
2,102,PMPML,102,,,3,,,
3,103M,PMPML,103M,,,3,,,
4,103,PMPML,103,,,3,,,


Let's assume we want to see a specific route. Let's see the trips associated with that route.

In [10]:
chosen_route = "5"
df_trips.loc[df_trips["route_id"] == chosen_route].head()

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,bikes_allowed,duty,duty_sequence_number,run_sequence_number
20539,5,1,NORMAL_5_Swargate To Pune Station_Up-1850_0,Swargate To Pune Station,,0,,3388,,,5/7,24,10
20538,5,1,NORMAL_5_Pune Station To Swargate_Down-1815_0,Pune Station To Swargate,,0,,3389,,,5/7,23,9
20540,5,1,NORMAL_5_Pune Station To Swargate_Down-1955_0,Pune Station To Swargate,,0,,3389,,,5/7,25,11
20498,5,1,NORMAL_5_Pune Station To Swargate_Down-1245_0,Pune Station To Swargate,,0,,3389,,,5/6,11,11
20351,5,1,NORMAL_5_Pune Station To Swargate_Down-0620_0,Pune Station To Swargate,,0,,3389,,,5/1,3,3


What's important in this is the route_id, trip_id, the trip_headsign and the shape_id.
The trip_id is the id that ties a bus to a route. trip_id's has the following heirarchical naming structure -
{Normal/Special}_{route_id}_{trip_headsign}_{Direction Up/Down}_{Time HHMM}_{0}

Again, bear in mind, a route is a static path between two points (with no temporal connotations), a trip is a journey a bus takes along a certain route and during a certain time.

Let's take a peek at how the field data from buses moving around Pune looks like

In [11]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,E014,2019-11-21T11:59:59.716+05:30,SCHEDULED,153009.0,,IN_TRANSIT_TO,180,18.530954,153009.0,2019-11-21 06:29:30,NORMAL_180_Bhekrainagar To Na Ta Wadi_Down-1145_0,73.847534
1,,E001,2019-11-21T11:59:59.566+05:30,SCHEDULED,152113.0,,IN_TRANSIT_TO,322,18.627222,152113.0,2019-11-21 06:29:22,NORMAL_322_Akurdi Railway Station To Ma Na Pa ...,73.78255
2,,E015,2019-11-21T11:59:59.439+05:30,SCHEDULED,40003.0,,IN_TRANSIT_TO,180,18.529034,40003.0,2019-11-21 06:29:28,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-1730_0,73.852516
3,,E006,2019-11-21T11:59:59.381+05:30,SCHEDULED,40474.0,,IN_TRANSIT_TO,372,18.593365,40474.0,2019-11-21 06:29:25,NORMAL_372_Hinjawadi Maan Phase 3 To Bhakti Sh...,73.7361
4,,680,2019-11-21T11:59:59.314+05:30,SCHEDULED,30417.0,,IN_TRANSIT_TO,58,18.518957,30417.0,2019-11-21 06:29:29,NORMAL_58_Shanipar To Gokhalenagar_Up-1135_0,73.83155


Let's see 10 of the most prominent trips in the field data. These trips have the best GPS data.

In [12]:
good_trips = df_field_data["TRIP_ID"].value_counts()[:50].index.tolist()
good_trips[:10]

['NORMAL_114_Ma Na Pa To Mhalungegaon_Up-1440_0',
 'NORMAL_87_Deccan Gymkhana To Girmepark_Up-0750_0',
 'NORMAL_43_Katraj To Bhakti Shakti_Up-0625_0',
 'NORMAL_12_Bhakti Shakti To Upper Depot_Down-1225_0',
 'NORMAL_323B_Nehrunagar Depo To Chikhali_Up-0550_1',
 'NORMAL_5_Pune Station To Swargate_Down-0625_0',
 'NORMAL_201_Bhekrainagar To Alandi_Up-0510_0',
 'NORMAL_143A_Galinde Path To Pune Station (Vai Nal Stop)_Up-0810_0',
 'NORMAL_101_Kothrud Depot To Kondhwa Kh_Up-0600_0',
 'NORMAL_148_Shewalwadi To Pimple Gurav_Up-0915_0']

Unfortunately, a trip having most records in the dataset is not an indication of the bus actually moving. We'll need to choose trips with the most variance in latitude, longitude and stop_ids. Entroipy is one such measure. The larger the entropy, the greater the guarantee that the bus is moving and changing stops. Let's create a function that calculates the entroopy of stops and position and creates a score based on that. Taking the index of the trip with a maximum score will yield a good trip which we will use to analyze.

In [0]:
import numpy as np
from scipy.stats import entropy
from math import log, e

def column_entropy(column, base=None):
  vc = pd.Series(column).value_counts(normalize=True, sort=False)
  base = e if base is None else base
  return -(vc * np.log(vc)/np.log(base)).sum()

In [0]:
df_field_grouped = df_field_data.groupby(["TRIP_ID"])
score = []
""" Iterate through each trip """
for trip in good_trips:
  group = df_field_grouped.get_group(trip)
  entropy_pos = column_entropy(group["LONGITUDE_STR"])
  entropy_stop =  column_entropy(group["STOP_ID"])
  """ Compute score """
  score.append(entropy_pos * entropy_stop)

""" Get indexes of sorted scores """
good_trips_sorted_indexes = np.argsort(score)

""" The last one of the sorted indices is the one with the best score """
best_index = good_trips_sorted_indexes[-1]

We have chosen the trip with the best entropy score for our analysis. You may choose any other trip.
Now we can use this best_index and choose the trip name corresponding to that index.

In [15]:
chosen_trip = good_trips[best_index]
chosen_trip

'NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0'

We will proceed with out analysis for this trip

These are the first 5 stops of that trip with their respective times and stop_id. 

In [16]:
df_stop_time.loc[df_stop_time["trip_id"] == chosen_trip].head(5)["trip_id"].values

array(['NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0',
       'NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0',
       'NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0',
       'NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0',
       'NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0'], dtype=object)

Let's revisit the trips dataframe. Notice the shape_id, this is an identifier for all the stops that belong to a trip. Let's take a specific example of one particular trip.

In [17]:
route_of_trip = df_trips.loc[df_trips["trip_id"] == chosen_trip]
route_of_trip

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,bikes_allowed,duty,duty_sequence_number,run_sequence_number
21377,298,1,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,Chinchwad Gaon To Katraj,,0,,3313,,,298/1,4,4


We can use this trips shape_id = 3389 and find all the stops locations from the df_shapes, the shapes dataframe. The first 5 stops of this trip are -

In [18]:
shape_of_route = df_shapes.loc[df_shapes["shape_id"] == route_of_trip["shape_id"].values[0]]
shape_of_route.head(5)

Unnamed: 0,shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
104321,3313,18.62751,73.78216,1,
104322,3313,18.62829,73.78222,2,
104323,3313,18.62871,73.78249,3,
104324,3313,18.62905,73.78245,4,
104325,3313,18.62925,73.7819,5,


Let's try to visualize this trip.

In [19]:
""" Initialize the map """
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=route_of_trip["trip_id"].values[0], color="red").add_to(m)
m


Let's find all the stops that are encountered in this trip and overlay it on the route. 
Let's again select a random route from the routes table and then choose a random trip on that route.


Now, we can obtain the stop times for this particular trip from the df_stop_time data frame. We should sort this table with stop_sequence.

In [20]:
stops_on_route = df_stop_time.loc[df_stop_time["trip_id"] == chosen_trip].sort_values("stop_sequence")
stops_on_route.head(5)

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled,timepoint
738025,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,10:40:00,10:40:00,39312,1,,,,,
738026,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,10:42:12,10:42:43,1034,2,,,,,
738027,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,10:43:31,10:43:41,1032,3,,,,,
738028,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,10:44:37,10:44:50,39311,4,,,,,
738029,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,10:46:00,10:46:28,39336,5,,,,,


For each of these stops that are identified by a stop_id, we need to fetch it's name and location fro the df_stops data frame. Subsequent to which we may add that stop to the map and visualize. We will plot both the trip route and the stops together.

In [21]:
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=chosen_trip, color="red").add_to(m)

for index, row in stops_on_route.iterrows():
  stop_info = df_stops.loc[df_stops["stop_id"] == row["stop_id"]]
  folium.Marker(stop_info[["stop_lat", "stop_lon"]].values.tolist()[0],
                popup=stop_info["stop_name"].values[0]).add_to(m) 

m

## Visualizing Archived Field data

The first set of data corresponds to field data from around 06:00 AM to 03:00 PM on 04-11-2019.

In [22]:
df_field_data.head(5)

Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
0,,E014,2019-11-21T11:59:59.716+05:30,SCHEDULED,153009.0,,IN_TRANSIT_TO,180,18.530954,153009.0,2019-11-21 06:29:30,NORMAL_180_Bhekrainagar To Na Ta Wadi_Down-1145_0,73.847534
1,,E001,2019-11-21T11:59:59.566+05:30,SCHEDULED,152113.0,,IN_TRANSIT_TO,322,18.627222,152113.0,2019-11-21 06:29:22,NORMAL_322_Akurdi Railway Station To Ma Na Pa ...,73.78255
2,,E015,2019-11-21T11:59:59.439+05:30,SCHEDULED,40003.0,,IN_TRANSIT_TO,180,18.529034,40003.0,2019-11-21 06:29:28,NORMAL_180_Na Ta Wadi To Bhekrainagar_Up-1730_0,73.852516
3,,E006,2019-11-21T11:59:59.381+05:30,SCHEDULED,40474.0,,IN_TRANSIT_TO,372,18.593365,40474.0,2019-11-21 06:29:25,NORMAL_372_Hinjawadi Maan Phase 3 To Bhakti Sh...,73.7361
4,,680,2019-11-21T11:59:59.314+05:30,SCHEDULED,30417.0,,IN_TRANSIT_TO,58,18.518957,30417.0,2019-11-21 06:29:29,NORMAL_58_Shanipar To Gokhalenagar_Up-1135_0,73.83155


You may ignore the NAME field in this data frame. What is of importance is the TRIP_ID, LONGITUDE_STR, LATITUDE_STR, ROUTE_ID, POSITION_UPDATE_TIMESTAMP.

Let's follow the same procedure as before and plot the path traversed by the bus over the previous static route.

In [23]:
trip_on_field = df_field_data.loc[df_field_data["TRIP_ID"] == chosen_trip]
trip_on_field.head()


Unnamed: 0,STOP_NAME,NAME,LASTUPDATEDATETIME,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,ROUTE_NAME,CURRENT_STATUS,ROUTE_ID,LATITUDE_STR,STOP_ID,POSITION_UPDATE_TIMESTAMP,TRIP_ID,LONGITUDE_STR
41,,R495,2019-11-21T11:58:58.251+05:30,SCHEDULED,40204.0,,STOPPED_AT,298,18.447329,40204.0,2019-11-21 06:28:19,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,73.85885
72,,R495,2019-11-21T11:57:56.761+05:30,SCHEDULED,40204.0,,IN_TRANSIT_TO,298,18.448462,40204.0,2019-11-21 06:27:16,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,73.8584
96,,R495,2019-11-21T11:56:36.614+05:30,SCHEDULED,40204.0,,IN_TRANSIT_TO,298,18.448597,40204.0,2019-11-21 06:26:23,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,73.85837
116,,R495,2019-11-21T11:55:49.630+05:30,SCHEDULED,40204.0,,IN_TRANSIT_TO,298,18.448608,40204.0,2019-11-21 06:25:22,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,73.85837
120,,R495,2019-11-21T11:54:56.277+05:30,SCHEDULED,40815.0,,IN_TRANSIT_TO,298,18.44977,40815.0,2019-11-21 06:24:23,NORMAL_298_Chinchwad Gaon To Katraj_Down-1040_0,73.85812


In [24]:
lat_lons_field = trip_on_field[["LATITUDE_STR", "LONGITUDE_STR"]].values.tolist()
stop_id_field = trip_on_field["STOP_ID"].values.tolist()

""" Initialize the map """
m = folium.Map(location=[18.5204,73.8567],zoom_start=13)

lat_lons = shape_of_route[["shape_pt_lat", "shape_pt_lon"]].values.tolist()
folium.PolyLine(lat_lons,popup=chosen_trip, color="red").add_to(m)

for index, row in stops_on_route.iterrows():
  stop_info = df_stops.loc[df_stops["stop_id"] == row["stop_id"]]
  folium.Marker(stop_info[["stop_lat", "stop_lon"]].values.tolist()[0],
                popup="Static StopID = " + stop_info["stop_code"].values[0]).add_to(m) 

""" Plot field data as green marker"""
for lls, stop_id in zip(lat_lons_field, stop_id_field):
  folium.Marker(lls, popup="Live StopID = " + str(stop_id), icon=folium.Icon(color='green')).add_to(m)

m

You can change the routes and see the plot for different routes.



## Downloading larger datasets
Because of the large size of the data availabe, we have restricted PUDX "during" queries to only work when the time is less than one day. If data for a longer period is required, you will be needing the download API.
We will however need the resourceServerGroup id instead of the id for this. To find this, you can go to pudx.catalogue.iudx.org.in and search for ITMS with tags. Once the item is shown in the list view, you can click "details" and obtain the group id.
The resourceServerGroup id for itms is "urn:iudx-catalogue-pune:pudx-resource-server/pune-itms"

This will now give us a Google Drive link which we can use to download files based on weeks of the year. 

In [53]:
from pyIUDX.rs import rs

rs = rs.ResourceServer("https://pudx.resourceserver.iudx.org.in/resource-server/pscdcl/v1")

groupId = "urn:iudx-catalogue-pune:pudx-resource-server/pune-itms"

data = rs.downloadData(groupId)
data


{'download_URL': 'https://drive.google.com/open?id=1V9bp8D5M9nhqITjMvLo-1IoDltWsJlKs',
 'resourceServerGroup': 'urn:iudx-catalogue-pune:pudx-resource-server/pune-itms'}

On opening that download_URL, you will find different files corresponding to different weeks of the year for ITMS. You can then use python pyDrive module to download that file.

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from pydrive.files import GoogleDriveFile
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authenticate with Google Drive.

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
files = GoogleDriveFile(auth=gauth)

In [63]:

folder_id = data["download_URL"].split("=")[-1]
file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % folder_id}).GetList()

for f in file_list:
  print(f["title"])

pune-itms-Week_40.json
pune-itms-Week_39.json


Currently only data for week 39 and 40 are made available. Let's download week 40's data.

In [0]:
fl = file_list[0]
fl.GetContentFile(fl["title"])


If we now see the directory, we will have the chosen file.json file, which we can read and proceed with our analysis as we had done before.

In [67]:
!ls

adc.json  pune_ITMS_GTFS.zip  pune-itms-Week_40.json  sample_data  targetdir


In [71]:
import json 

with open(fl["title"], "r") as f:
  df_json = json.load(f)
df_json[0:1]

[{'CURRENT_STATUS': 'IN_TRANSIT_TO',
  'CURRENT_STOP_SEQUENCE': 46.0,
  'LASTUPDATEDATETIME': '2019-09-30T18:30:03.000000+05:30',
  'LATITUDE_STR': 18.519592,
  'LONGITUDE_STR': 73.86258,
  'NAME': '1551',
  'POSITION_UPDATE_TIMESTAMP': '2019-09-30T12:59:59Z',
  'ROUTE_ID': '82',
  'SCHEDULE_RELATIONSHIP': 'SCHEDULED',
  'STOP_ID': 40411.0,
  'StartTime_IST': 1569888003999,
  'TRIP_ID': 'NORMAL_82_NDA 10 No Gate Bus Stand To MaNaPa To Kondhwagate_Round-1600_0',
  '__geoJsonLocation': {'coordinates': [73.86258, 18.519592], 'type': 'Point'},
  '__resource-group': 'pune-itms',
  '__resource-id': 'pudx-resource-server/pune-itms/pune-itms-live'}]

In [76]:
""" Convert it to a dataframe and convert string values of fields to their proper types"""

df_field_data = pd.DataFrame(df_json)
df_field_data["POSITION_UPDATE_TIMESTAMP"] = pd.to_datetime(df_field_data["POSITION_UPDATE_TIMESTAMP"])
df_field_data["LATITUDE_STR"] = pd.to_numeric(df_field_data["LATITUDE_STR"])
df_field_data["LONGITUDE_STR"] = pd.to_numeric(df_field_data["LONGITUDE_STR"])
""" Other fields which need to be converted """
df_field_data["STOP_ID"] = pd.to_numeric(df_field_data["STOP_ID"])
df_field_data["CURRENT_STOP_SEQUENCE"] = pd.to_numeric(df_field_data["STOP_ID"])


df_field_data.head(3)

Unnamed: 0,__resource-group,StartTime_IST,LASTUPDATEDATETIME,LONGITUDE_STR,SCHEDULE_RELATIONSHIP,CURRENT_STOP_SEQUENCE,CURRENT_STATUS,ROUTE_ID,__geoJsonLocation,LATITUDE_STR,POSITION_UPDATE_TIMESTAMP,STOP_ID,TRIP_ID,__resource-id,NAME
0,pune-itms,1569888003999,2019-09-30T18:30:03.000000+05:30,73.86258,SCHEDULED,40411.0,IN_TRANSIT_TO,82,"{'type': 'Point', 'coordinates': [73.86258, 18...",18.519592,2019-09-30 12:59:59+00:00,40411.0,NORMAL_82_NDA 10 No Gate Bus Stand To MaNaPa T...,pudx-resource-server/pune-itms/pune-itms-live,1551
1,pune-itms,1569888003999,2019-09-30T18:30:03.000000+05:30,73.87764,SCHEDULED,40819.0,IN_TRANSIT_TO,49,"{'type': 'Point', 'coordinates': [73.87764, 18...",18.526888,2019-09-30 12:59:59+00:00,40819.0,NORMAL_49_Pune Station To Khanapur_Up-0715_0,pudx-resource-server/pune-itms/pune-itms-live,1685
2,pune-itms,1569888005000,2019-09-30T18:30:05.000000+05:30,73.78259,SCHEDULED,150031.0,IN_TRANSIT_TO,87,"{'type': 'Point', 'coordinates': [73.78259, 18...",18.565727,2019-09-30 12:59:59+00:00,150031.0,NORMAL_87_Deccan Gymkhana To Sutarwadi Pashan ...,pudx-resource-server/pune-itms/pune-itms-live,CNG549


Remember, this dataframe has data for multiple days of the week. If you were to plot the trips of this data, you will have overlaps corresponding to the same trip taken the next day.