### Install dependencies

This notebook requires two dependencies, which can be installed with the following command: `pip install pandas open-bus-stride-client`.

You can also launch it online at [this URL](https://mybinder.org/v2/gh/hasadna/open-bus-stride-client/HEAD?labpath=notebooks%2FLoad%20route%20rides%20to%20dataframe.ipynb). When launched online, the dependencies are already installed.


In [1]:
import pandas as PD
import stride

### Find a route to investigate

Because the GTFS data is not available yet, we have to use the GTFS `operator_ref` and `line_ref` numbers to find a route.

In [2]:
siri_routes = stride.get('/siri_routes/list', {'operator_refs': 14, 'line_refs': 28153})
siri_routes

[{'id': 684, 'line_ref': 28153, 'operator_ref': 14}]

The SIRI route id is `684`. We can now use it to get the rides for this route.

In [3]:
siri_route_id = siri_routes[0]['id']
siri_route_id

684
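
Indexing `siri_routes[0]` assumes the query returned exactly one route. As a minimal sketch of a more defensive lookup, the hypothetical helper `single_route_id` below (it is not part of the stride client) raises an error when the result is empty or ambiguous. The sample list is copied from the output above.

```python
def single_route_id(routes):
    """Return the id of the only route in `routes`; raise if there is not exactly one."""
    if len(routes) != 1:
        raise ValueError(f"expected exactly one siri route, got {len(routes)}")
    return routes[0]['id']


# Sample data copied from the output above; in the notebook this would be `siri_routes`.
sample_routes = [{'id': 684, 'line_ref': 28153, 'operator_ref': 14}]
route_id = single_route_id(sample_routes)
print(route_id)
```

Failing early here avoids silently fetching rides for the wrong route when the operator and line numbers match more than one entry.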

### Get rides data

We use the `stride.iterate` method to efficiently iterate over a possibly long list of results.

Behind the scenes it uses the offset/limit parameters, so you don't have to handle pagination yourself.
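
The offset/limit pattern can be sketched roughly as follows. This is not the stride client's actual implementation: `fetch_page`, `iterate_pages`, and the page size of 3 are hypothetical names and values chosen for illustration, and an in-memory list stands in for the API.

```python
def fetch_page(items, offset, limit):
    """Stand-in for one API request: return at most `limit` items starting at `offset`."""
    return items[offset:offset + limit]


def iterate_pages(items, page_size=3, limit=None):
    """Yield items one at a time, requesting them page by page via offset/limit."""
    offset = 0
    yielded = 0
    while limit is None or yielded < limit:
        page = fetch_page(items, offset, page_size)
        if not page:
            return  # no more results
        for item in page:
            if limit is not None and yielded >= limit:
                return  # overall limit reached in the middle of a page
            yield item
            yielded += 1
        offset += len(page)


# Ten fake results, fetched three at a time, stopping after seven items.
fake_results = [{'id': i} for i in range(10)]
first_seven = list(iterate_pages(fake_results, page_size=3, limit=7))
print([row['id'] for row in first_seven])
```

Because the function is a generator, the caller only triggers as many page requests as it actually consumes.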

We pass the iterator directly to Pandas to create a DataFrame.

In [4]:
import datetime

df = PD.DataFrame(stride.iterate('/siri_rides/list', {
    # siri_route_ids can be a comma-separated string of several ids; here we pass a single id
    'siri_route_ids': siri_route_id,
    # all date/time parameters must be timezone-aware
    'scheduled_start_time_to': datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1),
    # order_by accepts any field followed by asc or desc; separate multiple fields with commas
    'order_by': 'scheduled_start_time desc',
    # any number can be specified for the limit, since pagination is handled behind the scenes; the default is 10,000
}, limit=20000))
df

| | id | siri_route_id | journey_ref | scheduled_start_time | vehicle_ref | updated_first_last_vehicle_locations | first_vehicle_location_id | last_vehicle_location_id | updated_duration_minutes | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4117000 | 684 | 2021-12-25-50064765 | 2021-12-25T08:00:00+00:00 | 63555101 | 2021-12-25 09:00:08.521121+00:00 | 222389336 | 222394885 | 2021-12-25 15:00:05.006431+00:00 | 31 |
| 1 | 4116716 | 684 | 2021-12-25-50064764 | 2021-12-25T07:00:00+00:00 | 2627939 | 2021-12-25 08:00:08.178133+00:00 | 222379871 | 222384367 | 2021-12-25 14:00:05.063735+00:00 | 29 |
| 2 | 4116591 | 684 | 2021-12-25-50064763 | 2021-12-25T06:30:00+00:00 | 37582901 | 2021-12-25 08:00:07.933933+00:00 | 222375912 | 222380144 | 2021-12-25 13:00:06.321446+00:00 | 30 |
| 3 | 4116477 | 684 | 2021-12-25-50064762 | 2021-12-25T06:00:00+00:00 | 80425301 | 2021-12-25 07:00:06.045216+00:00 | 222371587 | 222375221 | 2021-12-25 13:00:05.950113+00:00 | 26 |
| 4 | 4116341 | 684 | 2021-12-25-50064761 | 2021-12-25T05:30:00+00:00 | 63555101 | 2021-12-25 06:00:07.587638+00:00 | 222367239 | 222370854 | 2021-12-25 12:00:07.014119+00:00 | 28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1789 | 13060 | 684 | 2021-11-03-50064716 | 2021-11-03T18:00:00+00:00 | 37582901 | 2021-11-28 12:11:01.816118+00:00 | 426598 | 574828 | 2021-11-28 12:11:01.816188+00:00 | 35 |
| 1790 | 10615 | 684 | 2021-11-03-50064711 | 2021-11-03T17:30:00+00:00 | 2628639 | 2021-11-28 12:10:51.293463+00:00 | 288477 | 462959 | 2021-11-28 12:10:51.293527+00:00 | 37 |
| 1791 | 8172 | 684 | 2021-11-03-50064706 | 2021-11-03T17:00:00+00:00 | 80426601 | 2021-11-28 12:10:38.470238+00:00 | 142061 | 352609 | 2021-11-28 12:10:38.470295+00:00 | 41 |
| 1792 | 1596 | 684 | 2021-11-03-50064701 | 2021-11-03T16:30:00+00:00 | 80425301 | 2021-11-28 12:09:56.185948+00:00 | 1598 | 190735 | 2021-11-28 12:09:56.186014+00:00 | 34 |
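
The request above relies on two rules stated in its comments: several route ids are passed as one comma-separated string, and date/time values must carry a timezone. A minimal sketch of a helper that enforces both is shown below. `build_rides_params` is a hypothetical name, not part of the stride client, and the second id `685` is made up for illustration.

```python
import datetime


def build_rides_params(route_ids, until):
    """Build query parameters for '/siri_rides/list' (hypothetical helper)."""
    # A naive datetime has no tzinfo, or a tzinfo that returns no UTC offset.
    if until.tzinfo is None or until.utcoffset() is None:
        raise ValueError("date/time parameters must be timezone-aware")
    return {
        # several ids are joined into one comma-separated string
        'siri_route_ids': ','.join(str(route_id) for route_id in route_ids),
        'scheduled_start_time_to': until,
        'order_by': 'scheduled_start_time desc',
    }


yesterday = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
params = build_rides_params([684, 685], yesterday)  # 685 is an illustrative id
print(params['siri_route_ids'])
```

The resulting dictionary could then be passed as the second argument to `stride.iterate`, in place of the inline dictionary used above.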


### Manipulate the data using Pandas

Now we can use Pandas to extract some information from this data.

First, let's filter out results that don't have a duration. These are most likely recent rides for which the duration has not been calculated yet.

In [5]:
df = df[df.duration_minutes.notnull()]
df

| | id | siri_route_id | journey_ref | scheduled_start_time | vehicle_ref | updated_first_last_vehicle_locations | first_vehicle_location_id | last_vehicle_location_id | updated_duration_minutes | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4117000 | 684 | 2021-12-25-50064765 | 2021-12-25T08:00:00+00:00 | 63555101 | 2021-12-25 09:00:08.521121+00:00 | 222389336 | 222394885 | 2021-12-25 15:00:05.006431+00:00 | 31 |
| 1 | 4116716 | 684 | 2021-12-25-50064764 | 2021-12-25T07:00:00+00:00 | 2627939 | 2021-12-25 08:00:08.178133+00:00 | 222379871 | 222384367 | 2021-12-25 14:00:05.063735+00:00 | 29 |
| 2 | 4116591 | 684 | 2021-12-25-50064763 | 2021-12-25T06:30:00+00:00 | 37582901 | 2021-12-25 08:00:07.933933+00:00 | 222375912 | 222380144 | 2021-12-25 13:00:06.321446+00:00 | 30 |
| 3 | 4116477 | 684 | 2021-12-25-50064762 | 2021-12-25T06:00:00+00:00 | 80425301 | 2021-12-25 07:00:06.045216+00:00 | 222371587 | 222375221 | 2021-12-25 13:00:05.950113+00:00 | 26 |
| 4 | 4116341 | 684 | 2021-12-25-50064761 | 2021-12-25T05:30:00+00:00 | 63555101 | 2021-12-25 06:00:07.587638+00:00 | 222367239 | 222370854 | 2021-12-25 12:00:07.014119+00:00 | 28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1789 | 13060 | 684 | 2021-11-03-50064716 | 2021-11-03T18:00:00+00:00 | 37582901 | 2021-11-28 12:11:01.816118+00:00 | 426598 | 574828 | 2021-11-28 12:11:01.816188+00:00 | 35 |
| 1790 | 10615 | 684 | 2021-11-03-50064711 | 2021-11-03T17:30:00+00:00 | 2628639 | 2021-11-28 12:10:51.293463+00:00 | 288477 | 462959 | 2021-11-28 12:10:51.293527+00:00 | 37 |
| 1791 | 8172 | 684 | 2021-11-03-50064706 | 2021-11-03T17:00:00+00:00 | 80426601 | 2021-11-28 12:10:38.470238+00:00 | 142061 | 352609 | 2021-11-28 12:10:38.470295+00:00 | 41 |
| 1792 | 1596 | 684 | 2021-11-03-50064701 | 2021-11-03T16:30:00+00:00 | 80425301 | 2021-11-28 12:09:56.185948+00:00 | 1598 | 190735 | 2021-11-28 12:09:56.186014+00:00 | 34 |
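
Since the rows displayed above all have a duration, the effect of the filter is easier to see on a small frame with a missing value. The sketch below uses made-up toy data, and also shows `dropna(subset=...)`, which should give the same result here.

```python
import pandas as pd

# Toy data (values are made up): the second ride has no duration yet.
rides = pd.DataFrame({
    'id': [1, 2, 3],
    'duration_minutes': [31, None, 29],
})

# Boolean mask: keep only rows whose duration is present.
with_duration = rides[rides.duration_minutes.notnull()]

# Alternative spelling: drop rows where this column is missing.
dropped = rides.dropna(subset=['duration_minutes'])

print(len(rides), len(with_duration))
print(with_duration.equals(dropped))
```

Note that the filtered frame keeps the original index labels, which is why the index in the output above is not renumbered.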


Get the mean duration, in minutes, of all those rides:

In [6]:
df.duration_minutes.mean()

34.72185061315496
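
The mean alone can hide outliers, so it may help to look at a few more summary statistics. A minimal sketch using `Series.describe`, run here on only the first five durations shown in the output above rather than on the full dataset:

```python
import pandas as pd

# First five duration_minutes values from the output above (not the full dataset).
durations = pd.Series([31, 29, 30, 26, 28], name='duration_minutes')

# describe() returns count, mean, std, min, quartiles and max in one Series.
summary = durations.describe()
print(summary)
```

In the notebook itself, the equivalent call would be `df.duration_minutes.describe()`.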