### Install dependencies

This notebook requires two dependencies, which can be installed with the following command: `pip install pandas open-bus-stride-client`.

You can also launch it online at [this URL](https://mybinder.org/v2/gh/hasadna/open-bus-stride-client/HEAD?labpath=notebooks%2FLoad%20route%20rides%20to%20dataframe.ipynb). When launched online, the dependencies are already installed.


In [1]:
import pandas as PD
import stride

### Find a route to investigate

Because the GTFS data is not available yet, we have to use the SIRI `operator_ref` and `line_ref` numbers to find a route.

In [2]:
siri_routes = stride.get('/siri_routes/list', {'operator_refs': 14, 'line_refs': 28153})
siri_routes

[{'id': 733, 'line_ref': 28153, 'operator_ref': 14}]

We determine that the SIRI route id is `733`. We can now use it to get the rides for this route.

In [3]:
siri_route_id = siri_routes[0]['id']
siri_route_id

733
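Indexing `siri_routes[0]` assumes the query matched exactly one route. A minimal sketch of a more defensive version is shown here. `single_route_id` is a hypothetical helper, not part of the stride client, and the sample list is copied from the output above rather than fetched from the API.

```python
from typing import Dict, List


def single_route_id(routes: List[Dict]) -> int:
    """Return the id of the only route in the list, or fail loudly."""
    if len(routes) != 1:
        raise ValueError(f"expected exactly one route, got {len(routes)}")
    return routes[0]["id"]


# Sample copied from the output above (not a live API call)
sample_routes = [{"id": 733, "line_ref": 28153, "operator_ref": 14}]

route_id = single_route_id(sample_routes)

# Several ids can be joined into one comma-separated string
ids_param = ",".join(str(route["id"]) for route in sample_routes)
```

Failing early on an empty or ambiguous result avoids silently querying rides for the wrong route.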

### Get rides data

We use the `stride.iterate` method to efficiently iterate over a possibly long list of results.

Behind the scenes it uses the offset/limit parameters so you don't have to worry about it.

We pass the iterator directly to Pandas to create a DataFrame.
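To illustrate the general idea of offset/limit pagination, here is a minimal sketch. It is not the stride client's actual implementation: `iterate_pages`, `fake_fetch`, and the page size are assumptions made for this example, and an in-memory list stands in for the HTTP endpoint.

```python
from typing import Callable, Dict, Iterator, List, Optional


def iterate_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 1000,
    limit: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield items one at a time, requesting them page by page."""
    offset = 0
    yielded = 0
    while limit is None or yielded < limit:
        page = fetch_page(offset, page_size)
        if not page:
            return  # no more results
        for item in page:
            yield item
            yielded += 1
            if limit is not None and yielded >= limit:
                return  # overall limit reached
        if len(page) < page_size:
            return  # a short page means this was the last one
        offset += page_size


# Hypothetical in-memory "API" standing in for a real endpoint
fake_rows = [{"id": i} for i in range(25)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return fake_rows[offset:offset + limit]


first_twelve = list(iterate_pages(fake_fetch, page_size=10, limit=12))
all_rows = list(iterate_pages(fake_fetch, page_size=10))
```

Because the function is a generator, a consumer such as a DataFrame constructor pulls items lazily and each page is only requested when it is needed.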

In [4]:
import datetime

df = PD.DataFrame(stride.iterate('/siri_rides/list', {
    # siri_route_ids can be a comma-separated string containing a list of ids, but we specify only a single one here
    'siri_route_ids': siri_route_id,
    # all date/time parameters must have a timezone
    'scheduled_start_time_to': datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1),
    # any field can be specified in order_by with an asc or desc specifier; multiple comma-separated values are allowed
    'order_by': 'scheduled_start_time desc',
    # Any number can be specified for the limit as we use pagination behind the scenes, default is 10,000
}, limit=20000))
df

Unnamed: 0,journey_ref,id,vehicle_ref,first_vehicle_location_id,updated_duration_minutes,journey_gtfs_ride_id,route_gtfs_ride_id,siri_route_id,scheduled_start_time,updated_first_last_vehicle_locations,last_vehicle_location_id,duration_minutes,gtfs_ride_id
0,2022-06-21-50064705,9985059,80426601,526215926,2022-06-21 23:00:35.349260+00:00,,10693091.0,733,2022-06-21 16:00:00+00:00,2022-06-21 17:02:39.756708+00:00,526402737,38,10693091.0
1,2022-06-21-50064700,9982421,80425301,526067720,2022-06-21 23:00:26.875955+00:00,,10693084.0,733,2022-06-21 15:30:00+00:00,2022-06-21 17:02:08.853851+00:00,526247471,34,10693084.0
2,2022-06-21-50064695,9979568,37582901,525897018,2022-06-21 22:00:25.957956+00:00,,10700546.0,733,2022-06-21 15:00:00+00:00,2022-06-21 16:00:55.002195+00:00,526084257,33,10700546.0
3,2022-06-21-50064690,9976547,80425201,525718141,2022-06-21 22:00:33.580594+00:00,,10693083.0,733,2022-06-21 14:30:00+00:00,2022-06-21 16:01:33.981276+00:00,525920837,35,10693083.0
4,2022-06-21-50064685,9973541,13684902,525530780,2022-06-21 21:00:11.037205+00:00,,10694156.0,733,2022-06-21 14:00:00+00:00,2022-06-21 15:00:42.410121+00:00,525724224,31,10694156.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4292,2022-02-15-50064555,2767878,13684902,146289970,2022-03-21 18:42:42.778633+00:00,,3015901.0,733,2022-02-15 05:20:00+00:00,2022-03-21 18:42:42.778601+00:00,146301252,41,3015901.0
4293,2022-02-15-50064550,2776588,80425201,146595248,2022-03-21 18:54:59.311992+00:00,,3015880.0,733,2022-02-15 05:00:00+00:00,2022-03-20 14:03:55.492574+00:00,146172210,29,3015880.0
4294,2022-02-15-50064545,2780637,37582901,146665887,2022-03-21 18:55:19.506774+00:00,,3015879.0,733,2022-02-15 04:30:00+00:00,2022-03-20 14:04:14.860460+00:00,146351498,29,3015879.0
4295,2022-02-15-50064540,2786309,37582801,146897666,2022-03-21 18:56:01.813101+00:00,,3015878.0,733,2022-02-15 04:00:00+00:00,2022-03-20 14:04:49.086862+00:00,146692025,27,3015878.0
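The query above passes a timezone-aware value for `scheduled_start_time_to`. As a small illustration using only the standard library, the following sketch shows the difference between naive and aware `datetime` objects. `is_aware` is a hypothetical helper written for this example, and nothing here describes how the stride client serializes the value.

```python
import datetime

# Naive datetime: no tzinfo attached, so the timezone is ambiguous
naive = datetime.datetime(2022, 6, 21, 16, 0)

# Timezone-aware datetime in UTC, the form used in the query above
aware = datetime.datetime(2022, 6, 21, 16, 0, tzinfo=datetime.timezone.utc)


def is_aware(dt: datetime.datetime) -> bool:
    """Return True if dt carries usable timezone information."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


# "One day ago" as an aware value, as in the rides query
one_day_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)

# ISO 8601 text of an aware value includes the UTC offset
aware_iso = aware.isoformat()
```

Calling `datetime.datetime.now()` without an argument returns a naive value, which is why the query passes `datetime.timezone.utc` explicitly.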


### Manipulate the data using Pandas

Now we can use Pandas to get some information from this data.

First, let's filter out results that don't have a duration (most likely recent rides for which the duration has not yet been calculated).

In [5]:
df = df[df.duration_minutes.notnull()]
df

Unnamed: 0,journey_ref,id,vehicle_ref,first_vehicle_location_id,updated_duration_minutes,journey_gtfs_ride_id,route_gtfs_ride_id,siri_route_id,scheduled_start_time,updated_first_last_vehicle_locations,last_vehicle_location_id,duration_minutes,gtfs_ride_id
0,2022-06-21-50064705,9985059,80426601,526215926,2022-06-21 23:00:35.349260+00:00,,10693091.0,733,2022-06-21 16:00:00+00:00,2022-06-21 17:02:39.756708+00:00,526402737,38,10693091.0
1,2022-06-21-50064700,9982421,80425301,526067720,2022-06-21 23:00:26.875955+00:00,,10693084.0,733,2022-06-21 15:30:00+00:00,2022-06-21 17:02:08.853851+00:00,526247471,34,10693084.0
2,2022-06-21-50064695,9979568,37582901,525897018,2022-06-21 22:00:25.957956+00:00,,10700546.0,733,2022-06-21 15:00:00+00:00,2022-06-21 16:00:55.002195+00:00,526084257,33,10700546.0
3,2022-06-21-50064690,9976547,80425201,525718141,2022-06-21 22:00:33.580594+00:00,,10693083.0,733,2022-06-21 14:30:00+00:00,2022-06-21 16:01:33.981276+00:00,525920837,35,10693083.0
4,2022-06-21-50064685,9973541,13684902,525530780,2022-06-21 21:00:11.037205+00:00,,10694156.0,733,2022-06-21 14:00:00+00:00,2022-06-21 15:00:42.410121+00:00,525724224,31,10694156.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4292,2022-02-15-50064555,2767878,13684902,146289970,2022-03-21 18:42:42.778633+00:00,,3015901.0,733,2022-02-15 05:20:00+00:00,2022-03-21 18:42:42.778601+00:00,146301252,41,3015901.0
4293,2022-02-15-50064550,2776588,80425201,146595248,2022-03-21 18:54:59.311992+00:00,,3015880.0,733,2022-02-15 05:00:00+00:00,2022-03-20 14:03:55.492574+00:00,146172210,29,3015880.0
4294,2022-02-15-50064545,2780637,37582901,146665887,2022-03-21 18:55:19.506774+00:00,,3015879.0,733,2022-02-15 04:30:00+00:00,2022-03-20 14:04:14.860460+00:00,146351498,29,3015879.0
4295,2022-02-15-50064540,2786309,37582801,146897666,2022-03-21 18:56:01.813101+00:00,,3015878.0,733,2022-02-15 04:00:00+00:00,2022-03-20 14:04:49.086862+00:00,146692025,27,3015878.0


Get the mean duration, in minutes, for all those rides.

In [6]:
df.duration_minutes.mean()

32.846869909239004
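The same pattern extends to grouped statistics. The sketch below runs on a tiny DataFrame with made-up values (they are not results from the API) so that it is self-contained. It uses the conventional `pd` alias, and the column names mirror those in the rides data above.

```python
import pandas as pd

# Made-up rows for illustration only; they are not real results from the API
toy = pd.DataFrame({
    "scheduled_start_time": pd.to_datetime([
        "2022-06-21 14:00:00+00:00",
        "2022-06-21 14:30:00+00:00",
        "2022-06-21 15:00:00+00:00",
        "2022-06-21 15:30:00+00:00",
    ]),
    "duration_minutes": [30.0, 34.0, None, 38.0],
})

# Same filter as above: keep only rides that have a duration
with_duration = toy[toy.duration_minutes.notnull()]

# Overall mean duration in minutes
mean_duration = with_duration.duration_minutes.mean()

# Mean duration per scheduled hour of day (UTC)
by_hour = with_duration.groupby(
    with_duration.scheduled_start_time.dt.hour
).duration_minutes.mean()
```

Grouping by hour is only one option. The same approach works for day of week or any other value derived from `scheduled_start_time`.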