### Install dependencies

This notebook requires two dependencies, which can be installed with `pip install pandas open-bus-stride-client`.

You can also launch it online at [this URL](https://mybinder.org/v2/gh/hasadna/open-bus-stride-client/HEAD?labpath=notebooks%2FLoad%20route%20rides%20to%20dataframe.ipynb), where the dependencies are already installed.


In [1]:
import pandas as PD
import stride

### Find a route to investigate

Because the GTFS data is not available yet, we have to use the GTFS `operator_ref` and `line_ref` numbers to find a route.

In [2]:
siri_routes = stride.get('/siri_routes/list', {'operator_refs': 14, 'line_refs': 28153})
siri_routes

[{'id': 684, 'line_ref': 28153, 'operator_ref': 14}]

The SIRI route id is `684`; we can now use it to get the rides for this route.

In [3]:
siri_route_id = siri_routes[0]['id']
siri_route_id

684

### Get rides data

We use the stride `iterate` method to efficiently iterate over a possibly long list of results.

Behind the scenes it uses the offset/limit parameters so you don't have to worry about it.
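To make the offset/limit idea concrete, here is a minimal sketch of how such an iterator can be built. It is not the stride client's actual implementation: `iterate_pages`, `fetch_page`, `fake_fetch_page` and the `page_size` parameter are hypothetical names used only for illustration, and the "API" is an in-memory list.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Row = Dict[str, Any]


def iterate_pages(
    fetch_page: Callable[[int, int], List[Row]],  # hypothetical: (offset, limit) -> rows
    page_size: int = 1000,
    limit: Optional[int] = None,
) -> Iterator[Row]:
    """Yield rows one by one, requesting them page by page with offset/limit."""
    offset = 0
    yielded = 0
    while limit is None or yielded < limit:
        page = fetch_page(offset, page_size)
        for row in page:
            if limit is not None and yielded >= limit:
                return
            yield row
            yielded += 1
        # A short page means the server has no more results.
        if len(page) < page_size:
            return
        offset += page_size


# In-memory stand-in for a paginated API endpoint (assumption for the demo).
DATA = [{'id': i} for i in range(25)]


def fake_fetch_page(offset: int, limit: int) -> List[Row]:
    return DATA[offset:offset + limit]


rows = list(iterate_pages(fake_fetch_page, page_size=10, limit=22))
print(len(rows))       # 22
print(rows[-1]['id'])  # 21
```

Because the function is a generator, the caller only ever holds one page in memory and can stop early without fetching the remaining pages.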

We pass the iterator directly to Pandas to create a DataFrame.

In [4]:
import datetime

df = PD.DataFrame(stride.iterate('/siri_rides/list', {
    # the siri_route_ids field can be a comma-separated string containing a list of ids, but we specify only a single one here
    'siri_route_ids': siri_route_id,
    # all date/time parameters must have a timezone
    'scheduled_start_time_to': datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1),
    # any field can be specified in order_by with asc or desc specifier, you can specify comma-separated multiple values
    'order_by': 'scheduled_start_time desc'
    # any number can be specified for the limit as we use pagination behind the scenes, default is 10,000
}, limit=20000))
df

Unnamed: 0,id,siri_route_id,journey_ref,scheduled_start_time,vehicle_ref,updated_first_last_vehicle_locations,first_vehicle_location_id,last_vehicle_location_id,updated_duration_minutes,duration_minutes
0,3591760,684,2021-12-19-50064583,2021-12-19T07:20:00+00:00,2629239,2021-12-19 08:02:12.567125+00:00,194612801,194791474,2021-12-19 14:00:26.946997+00:00,33
1,3589670,684,2021-12-19-50064578,2021-12-19T07:00:00+00:00,80425201,2021-12-19 08:02:04.315582+00:00,194474123,194670254,2021-12-19 14:00:24.342569+00:00,34
2,3587779,684,2021-12-19-50064573,2021-12-19T06:40:00+00:00,37582901,2021-12-19 08:01:56.656238+00:00,194369807,194565661,2021-12-19 14:00:21.354314+00:00,32
3,3585742,684,2021-12-19-50064568,2021-12-19T06:20:00+00:00,80425301,2021-12-19 07:01:17.346984+00:00,194250420,194419854,2021-12-19 13:00:34.131482+00:00,28
4,3583725,684,2021-12-19-50064563,2021-12-19T06:00:00+00:00,37582801,2021-12-19 07:00:05.217346+00:00,194122249,194300837,2021-12-19 13:00:12.594131+00:00,29
...,...,...,...,...,...,...,...,...,...,...
1567,13060,684,2021-11-03-50064716,2021-11-03T18:00:00+00:00,37582901,2021-11-28 12:11:01.816118+00:00,426598,574828,2021-11-28 12:11:01.816188+00:00,35
1568,10615,684,2021-11-03-50064711,2021-11-03T17:30:00+00:00,2628639,2021-11-28 12:10:51.293463+00:00,288477,462959,2021-11-28 12:10:51.293527+00:00,37
1569,8172,684,2021-11-03-50064706,2021-11-03T17:00:00+00:00,80426601,2021-11-28 12:10:38.470238+00:00,142061,352609,2021-11-28 12:10:38.470295+00:00,41
1570,1596,684,2021-11-03-50064701,2021-11-03T16:30:00+00:00,80425301,2021-11-28 12:09:56.185948+00:00,1598,190735,2021-11-28 12:09:56.186014+00:00,34
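The query above passes a timezone-aware `datetime`, as the comment in the cell requires. The following sketch, using only the standard library, shows the difference between a naive and an aware value; the `is_aware` helper is a hypothetical name introduced here for illustration.

```python
import datetime

# A naive datetime carries no timezone information.
naive = datetime.datetime(2021, 12, 19, 7, 20)

# An aware datetime has an explicit timezone, here UTC.
aware = datetime.datetime(2021, 12, 19, 7, 20, tzinfo=datetime.timezone.utc)


def is_aware(dt: datetime.datetime) -> bool:
    """Return True if dt has a usable UTC offset (hypothetical helper)."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


print(is_aware(naive))    # False
print(is_aware(aware))    # True
print(aware.isoformat())  # 2021-12-19T07:20:00+00:00
```

The ISO 8601 string printed on the last line has the same shape as the `scheduled_start_time` values in the output above.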


### Manipulate the data using Pandas

Now we can use Pandas to get some information from this data.

First, let's filter out results that don't have a duration (most likely recent rides whose duration has not been calculated yet).

In [5]:
df = df[df.duration_minutes.notnull()]
df

Unnamed: 0,id,siri_route_id,journey_ref,scheduled_start_time,vehicle_ref,updated_first_last_vehicle_locations,first_vehicle_location_id,last_vehicle_location_id,updated_duration_minutes,duration_minutes
0,3591760,684,2021-12-19-50064583,2021-12-19T07:20:00+00:00,2629239,2021-12-19 08:02:12.567125+00:00,194612801,194791474,2021-12-19 14:00:26.946997+00:00,33
1,3589670,684,2021-12-19-50064578,2021-12-19T07:00:00+00:00,80425201,2021-12-19 08:02:04.315582+00:00,194474123,194670254,2021-12-19 14:00:24.342569+00:00,34
2,3587779,684,2021-12-19-50064573,2021-12-19T06:40:00+00:00,37582901,2021-12-19 08:01:56.656238+00:00,194369807,194565661,2021-12-19 14:00:21.354314+00:00,32
3,3585742,684,2021-12-19-50064568,2021-12-19T06:20:00+00:00,80425301,2021-12-19 07:01:17.346984+00:00,194250420,194419854,2021-12-19 13:00:34.131482+00:00,28
4,3583725,684,2021-12-19-50064563,2021-12-19T06:00:00+00:00,37582801,2021-12-19 07:00:05.217346+00:00,194122249,194300837,2021-12-19 13:00:12.594131+00:00,29
...,...,...,...,...,...,...,...,...,...,...
1567,13060,684,2021-11-03-50064716,2021-11-03T18:00:00+00:00,37582901,2021-11-28 12:11:01.816118+00:00,426598,574828,2021-11-28 12:11:01.816188+00:00,35
1568,10615,684,2021-11-03-50064711,2021-11-03T17:30:00+00:00,2628639,2021-11-28 12:10:51.293463+00:00,288477,462959,2021-11-28 12:10:51.293527+00:00,37
1569,8172,684,2021-11-03-50064706,2021-11-03T17:00:00+00:00,80426601,2021-11-28 12:10:38.470238+00:00,142061,352609,2021-11-28 12:10:38.470295+00:00,41
1570,1596,684,2021-11-03-50064701,2021-11-03T16:30:00+00:00,80425301,2021-11-28 12:09:56.185948+00:00,1598,190735,2021-11-28 12:09:56.186014+00:00,34
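In the output above every ride already has a duration, so the effect of the filter is not visible. As an illustration, this small sketch applies the same `notnull()` mask to a made-up three-row frame (the `toy` data is an assumption, not taken from the API):

```python
import pandas as pd

# Made-up data: the second ride has no duration yet.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'duration_minutes': [33.0, None, 28.0],
})

# notnull() returns a boolean Series; indexing with it keeps only the True rows.
mask = toy['duration_minutes'].notnull()
filtered = toy[mask]

print(mask.tolist())           # [True, False, True]
print(filtered['id'].tolist()) # [1, 3]
```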


Finally, get the mean duration of the remaining rides:

In [6]:
df.duration_minutes.mean()

34.729007633587784
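A single overall mean hides day-to-day variation. As one possible next step, the sketch below groups rides by calendar date and computes the mean per day. It uses three rows copied from the output above in a standalone `toy` frame (a hypothetical name) so that it runs without calling the API; the same two lines could be applied to `df`.

```python
import datetime

import pandas as pd

# Three rides taken from the output above.
toy = pd.DataFrame({
    'scheduled_start_time': [
        '2021-12-19T07:20:00+00:00',
        '2021-12-19T07:00:00+00:00',
        '2021-11-03T18:00:00+00:00',
    ],
    'duration_minutes': [33, 34, 35],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
toy['scheduled_start_time'] = pd.to_datetime(toy['scheduled_start_time'], utc=True)

# Group by the (UTC) calendar date and average the durations.
daily_mean = toy.groupby(toy['scheduled_start_time'].dt.date)['duration_minutes'].mean()

print(daily_mean.loc[datetime.date(2021, 12, 19)])  # 33.5
print(daily_mean.loc[datetime.date(2021, 11, 3)])   # 35.0
```

Note that the grouping uses UTC dates; converting to a local timezone first (for example with `.dt.tz_convert`) may be more appropriate when rides cross midnight locally.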