Extensive Documentation - Data Harvester Workflow

Structure Overview

The script contains 1 major process (i.e., class) that requires the user to input 6 parameters.

Major Process

A) Purpose

Iteratively harvests a GTFS-RT feed, parses the relevant entities, structures them into a DataFrame, and appends the result to a CSV file.

B) Function Details

Coded in Python, the ExtractGTFSRT class consists of 152 lines of code.

C) Required Parameters

| Parameter | Type | Purpose |
| --- | --- | --- |
| url | Str | The URL (hyperlink) from which to download the GTFS-RT .pb file. |
| city | Str | The name of the city you are extracting GTFS-RT from. Used to name part of the output CSV file. |
| hrs_collect | Int | The number of hours per day for the harvester to run. This is contingent on how often the GTFS-RT feed updates (e.g., Calgary every 30 sec.; Boston MBTA every 5 sec.). |
| time_zone | Str | The time zone of the study area, as used by pytz. Run pytz.all_timezones to list the valid time zone names. |
| throttle | Int | The number of seconds to pause (sleep) the harvester between requests. This should match the update frequency of the GTFS-RT feed. |

D) Step Details

Below are the backend steps, in order, each briefly explained, followed by a graphic that summarizes them.

  1. Run ExtractGTFSRT
    • Receive the feed message and calculate the iterator. The iterator is the number of times the for loop runs, based on how often the GTFS-RT feed updates and how many hours of data to collect per day (lines 79-87).
    • Parse the entities from the feed (timestamp, vehicle_id, trip_id, lat, lon) and append them to a dictionary (lines 100-120).
    • Construct a DataFrame and append it to the CSV file (lines 123-140).
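The steps above can be sketched as a simple loop. This is a minimal illustration, not the ExtractGTFSRT code itself: `harvest`, `append_rows`, and `fetch_rows` are hypothetical names, `fetch_rows` stands in for the download-and-parse step, and the standard-library `csv` module is used in place of pandas so the sketch has no third-party dependencies.

```python
import csv
import os
import time
from typing import Callable, Dict, List

# Column names taken from the entities listed in the steps above.
FIELDS = ["timestamp", "vehicle_id", "trip_id", "lat", "lon"]


def append_rows(csv_path: str, rows: List[Dict]) -> None:
    """Append rows to csv_path, writing the header only if the file is new."""
    is_new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)


def harvest(
    fetch_rows: Callable[[], List[Dict]],
    csv_path: str,
    iterations: int,
    throttle: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Fetch, append, and pause, repeated `iterations` times.

    fetch_rows is a hypothetical callable that downloads the feed and
    returns one dict per vehicle with the keys in FIELDS.
    """
    for _ in range(iterations):
        append_rows(csv_path, fetch_rows())
        sleep(throttle)  # wait for the next feed update
```

Passing `sleep` as an argument is only a convenience for trying the loop out without waiting; in normal use the default `time.sleep` applies.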

iterator = round((60 sec. / throttle) * 60 (min/hr) * hrs_collect)
iterator = round(# of updates per hr. * hrs_collect)

For example, let's say a GTFS-RT feed from a transit agency updates every 30 seconds and you would like to collect for 12 hrs. per day. Then:

iterator = round((60 sec. / 30 sec. per update) * 60 (min./hr) * 12)
iterator = 1,440
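The same calculation can be expressed in Python. This is a small sketch of the formula above; the function name `compute_iterations` is hypothetical and is not taken from the script.

```python
def compute_iterations(throttle: float, hrs_collect: float) -> int:
    """Return how many times the harvest loop should run.

    throttle: seconds between feed updates.
    hrs_collect: hours of data to collect per day.
    """
    if throttle <= 0:
        raise ValueError("throttle must be a positive number of seconds")
    updates_per_hour = (60 / throttle) * 60  # updates per minute * minutes per hour
    return round(updates_per_hour * hrs_collect)


# Feed updating every 30 seconds, collected for 12 hours
print(compute_iterations(throttle=30, hrs_collect=12))  # 1440
```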

E) Packages Used & Purpose

| Package | Purpose |
| --- | --- |
| google.transit | Language bindings generated from the GTFS-RT protocol buffer. Used to parse entities from the GTFS-RT feed. |
| datetime | Converts timestamps to UTC and local time. |
| requests | Downloads the feed from the URL. |
| time | Throttles (i.e., sleeps) the function. |
| pytz | Handles time zones. |
| os.path | Directory management. |
| pandas | Structures the data and formats time. |
| tqdm | Progress bar. |
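As an illustration of how datetime and pytz can work together, the sketch below converts a POSIX timestamp into UTC and local time. It assumes the feed timestamp is in POSIX seconds; the function name `to_utc_and_local` is hypothetical, and the script may perform this conversion differently.

```python
from datetime import datetime

import pytz


def to_utc_and_local(timestamp: int, time_zone: str):
    """Convert a POSIX timestamp to (UTC datetime, local datetime)."""
    if time_zone not in pytz.all_timezones:
        raise ValueError(f"Unknown time zone: {time_zone}")
    utc_time = datetime.fromtimestamp(timestamp, tz=pytz.utc)
    local_time = utc_time.astimezone(pytz.timezone(time_zone))
    return utc_time, local_time


# Calgary falls under the America/Edmonton time zone.
utc_time, local_time = to_utc_and_local(1700000000, "America/Edmonton")
print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))    # 2023-11-14 22:13:20
print(local_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2023-11-14 15:13:20
```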