### Install dependencies

This notebook requires two dependencies, which can be installed with the following command: `pip install pandas open-bus-stride-client`.

(When running on mybinder.org, the dependencies are already installed.)


In [1]:
import pandas as PD
import stride

### Find a route to investigate

Because the GTFS data is not available yet, we have to use the GTFS `operator_ref` and `line_ref` numbers to find a route.

In [2]:
stride.get('/siri_routes/list', {'operator_refs': 14, 'line_refs': 28153})

[{'id': 684, 'line_ref': 28153, 'operator_ref': 14}]

The response shows that the route id is `684`. We can now use that id to get the rides for this route.

### Get rides data

We use the `stride.iterate` method to efficiently iterate over a possibly long list of results.

Behind the scenes it uses the offset/limit parameters so you don't have to worry about it.
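To make the offset/limit idea concrete, here is a minimal sketch of how such pagination can work in general. It is not the stride client's actual implementation: `iterate_pages`, `fake_fetch`, and `RECORDS` are hypothetical names invented for this illustration, and the "server" is just an in-memory list.

```python
from typing import Callable, Dict, Iterator, List


def iterate_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
    limit: int = 1000,
) -> Iterator[Dict]:
    """Yield up to `limit` items, requesting at most `page_size` items per call."""
    offset = 0
    yielded = 0
    while yielded < limit:
        # Never ask for more items than are still needed.
        page = fetch_page(offset, min(page_size, limit - yielded))
        if not page:
            break  # no more results available
        for item in page:
            yield item
            yielded += 1
        offset += len(page)


# Hypothetical in-memory "server" holding 250 records (illustration only).
RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    """Return one page of records, like an API accepting offset/limit."""
    return RECORDS[offset:offset + limit]


rows = list(iterate_pages(fake_fetch, page_size=100, limit=220))
print(len(rows), rows[0]["id"], rows[-1]["id"])  # 220 0 219
```

Because the function is a generator, the caller only triggers as many requests as it actually consumes, which is why an iterator of this kind can be handed straight to a DataFrame constructor.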

We pass the iterator directly to Pandas to create a DataFrame.

In [3]:
import datetime

df = PD.DataFrame(stride.iterate('/siri_rides/list', {
    # route_ids field can be a comma-separated list of ids, but we specify only a single one here
    'route_ids': '684', 
    # all date/time parameters must have a timezone
    'scheduled_start_time_to': datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1),
    # any field can be specified in order_by with asc or desc specifier, you can specify comma-separated multiple values
    'order_by': 'scheduled_start_time desc'
    # any number can be specified for the limit as we use pagination behind the scenes, default is 10,000
}, limit=20000))
df

Unnamed: 0,id,siri_route_id,journey_ref,scheduled_start_time,vehicle_ref,updated_first_last_vehicle_locations,first_vehicle_location_id,last_vehicle_location_id,updated_duration_minutes,duration_minutes
0,1868177,6473,2021-11-27-584640749,2021-11-27T17:12:00,44982402,2021-11-28 10:21:26.784728+00:00,102621964.0,102789948.0,2021-11-28 10:21:26.784797+00:00,61.0
1,1868160,2509,2021-11-27-584631185,2021-11-27T17:12:00,7663069,2021-11-28 12:09:35.590704+00:00,102619419.0,102947226.0,2021-11-28 12:09:35.590765+00:00,114.0
2,1868146,834,2021-11-27-29851672,2021-11-27T17:12:00,4962673,2021-11-28 12:09:35.558932+00:00,102618172.0,102842717.0,2021-11-28 12:09:35.558990+00:00,80.0
3,1868144,472,2021-11-27-27669043,2021-11-27T17:12:00,9298201,2021-11-28 12:09:35.548880+00:00,102617829.0,102715777.0,2021-11-28 12:09:35.548967+00:00,37.0
4,1868137,2636,2021-11-27-584631082,2021-11-27T17:12:00,7552469,2021-11-28 12:09:35.521329+00:00,102617251.0,102768545.0,2021-11-28 12:09:35.521388+00:00,55.0
...,...,...,...,...,...,...,...,...,...,...
19995,1847874,160,2021-11-26-58017518,2021-11-26T10:36:00,67546402,2021-11-28 12:00:08.959264+00:00,101434969.0,101627861.0,2021-11-28 12:00:08.959328+00:00,36.0
19996,1847942,930,2021-11-26-11860822,2021-11-26T10:36:00,44985402,2021-11-28 12:00:09.136702+00:00,101436689.0,101785623.0,2021-11-28 12:00:09.136765+00:00,67.0
19997,1847953,1046,2021-11-26-11864289,2021-11-26T10:36:00,7553069,2021-11-28 12:00:09.172092+00:00,101436908.0,101805913.0,2021-11-28 12:00:09.172136+00:00,71.0
19998,1848049,5563,2021-11-26-57838304,2021-11-26T10:36:00,7355352,2021-11-28 12:00:16.017079+00:00,101444790.0,101922119.0,2021-11-28 12:00:16.017139+00:00,92.0
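
The request above passes a timezone-aware datetime for `scheduled_start_time_to`. As a small illustration using only the standard library, the following sketch shows the difference between a naive and an aware `datetime`; the helper `is_aware` is a hypothetical name introduced here, not part of the stride client.

```python
import datetime

# A naive datetime carries no timezone information.
naive = datetime.datetime(2021, 11, 27, 17, 12)

# An aware datetime has tzinfo attached, here UTC.
aware = datetime.datetime(2021, 11, 27, 17, 12, tzinfo=datetime.timezone.utc)


def is_aware(dt: datetime.datetime) -> bool:
    """Return True if `dt` carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


print(is_aware(naive))    # False
print(is_aware(aware))    # True
print(aware.isoformat())  # 2021-11-27T17:12:00+00:00

# The same pattern as in the request: "now" in UTC, minus one day.
yesterday = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
print(is_aware(yesterday))  # True
```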


### Manipulate the data using Pandas

Now we can use Pandas to get some information from this data.

First, let's filter out results that don't have a duration (those are most likely recent rides for which we haven't yet calculated the duration).

In [4]:
df = df[df.duration_minutes.notnull()]
df

Unnamed: 0,id,siri_route_id,journey_ref,scheduled_start_time,vehicle_ref,updated_first_last_vehicle_locations,first_vehicle_location_id,last_vehicle_location_id,updated_duration_minutes,duration_minutes
0,1868177,6473,2021-11-27-584640749,2021-11-27T17:12:00,44982402,2021-11-28 10:21:26.784728+00:00,102621964.0,102789948.0,2021-11-28 10:21:26.784797+00:00,61.0
1,1868160,2509,2021-11-27-584631185,2021-11-27T17:12:00,7663069,2021-11-28 12:09:35.590704+00:00,102619419.0,102947226.0,2021-11-28 12:09:35.590765+00:00,114.0
2,1868146,834,2021-11-27-29851672,2021-11-27T17:12:00,4962673,2021-11-28 12:09:35.558932+00:00,102618172.0,102842717.0,2021-11-28 12:09:35.558990+00:00,80.0
3,1868144,472,2021-11-27-27669043,2021-11-27T17:12:00,9298201,2021-11-28 12:09:35.548880+00:00,102617829.0,102715777.0,2021-11-28 12:09:35.548967+00:00,37.0
4,1868137,2636,2021-11-27-584631082,2021-11-27T17:12:00,7552469,2021-11-28 12:09:35.521329+00:00,102617251.0,102768545.0,2021-11-28 12:09:35.521388+00:00,55.0
...,...,...,...,...,...,...,...,...,...,...
19995,1847874,160,2021-11-26-58017518,2021-11-26T10:36:00,67546402,2021-11-28 12:00:08.959264+00:00,101434969.0,101627861.0,2021-11-28 12:00:08.959328+00:00,36.0
19996,1847942,930,2021-11-26-11860822,2021-11-26T10:36:00,44985402,2021-11-28 12:00:09.136702+00:00,101436689.0,101785623.0,2021-11-28 12:00:09.136765+00:00,67.0
19997,1847953,1046,2021-11-26-11864289,2021-11-26T10:36:00,7553069,2021-11-28 12:00:09.172092+00:00,101436908.0,101805913.0,2021-11-28 12:00:09.172136+00:00,71.0
19998,1848049,5563,2021-11-26-57838304,2021-11-26T10:36:00,7355352,2021-11-28 12:00:16.017079+00:00,101444790.0,101922119.0,2021-11-28 12:00:16.017139+00:00,92.0


Get the mean duration, in minutes, for all those rides:

In [5]:
df.duration_minutes.mean()

53.42883772482241
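
To see what the filter and the mean do on a dataset small enough to check by hand, here is a sketch on made-up data. The `toy` DataFrame and its values are invented for illustration and are not taken from the API.

```python
import pandas as pd

# Made-up rides: one of them has no duration yet.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "duration_minutes": [60.0, None, 40.0, 50.0],
})

# Keep only rows where a duration is present.
with_duration = toy[toy.duration_minutes.notnull()]
print(len(with_duration))                      # 3
print(with_duration.duration_minutes.mean())   # 50.0

# Series.mean() skips missing values by default, so the result is the same
# without the filter; filtering first mainly keeps the row count honest.
print(toy.duration_minutes.mean())             # 50.0
```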