<h1> <a href="https://gtfs.org/">GTFS: General Transit Feed Specification</a></h1>

Around the world, public transit agencies make data available about their services, routes, and stops via a standardized data format called <a href="https://gtfs.org/">GTFS</a> (originally developed by Google). 

GTFS has two parts. The static component contains information that changes rarely, including the locations of stops, routes, and schedules; a new version of this static information is typically released every few months. Some agencies also provide a real-time component, based on live GPS data from their buses, trains, etc., that gives up-to-the-minute vehicle positions and arrival predictions, typically updated every 30 seconds.

This practical exercise will be based on only the static GTFS data.

Start by downloading the current GTFS schedule data for South East Queensland from:
https://gtfsrt.api.translink.com.au/ (https://gtfsrt.api.translink.com.au/GTFS/SEQ_GTFS.zip)

Unzip the archive, then upload the following files to your Jupyter account in the cloud:
- <code>calendar.txt</code>
- <code>routes.txt</code>
- <code>stops.txt</code>
- <code>stop_times.txt</code>
- <code>trips.txt</code>
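
If you would rather not unzip everything by hand, the five files can be pulled out of the archive with Python's standard <code>zipfile</code> module. This is only a sketch: the function name <code>extract_needed</code> and the local file name <code>SEQ_GTFS.zip</code> are assumptions, not part of the exercise.

```python
import zipfile

# The five files this exercise uses (taken from the list above).
NEEDED = ["calendar.txt", "routes.txt", "stops.txt", "stop_times.txt", "trips.txt"]

def extract_needed(zip_source, dest_dir=".", names=NEEDED):
    """Extract only the named members from a GTFS zip and return the names that were found."""
    with zipfile.ZipFile(zip_source) as archive:
        available = set(archive.namelist())
        found = [name for name in names if name in available]
        for name in found:
            archive.extract(name, path=dest_dir)
    return found

# Hypothetical usage, assuming the download was saved next to the notebook:
# extract_needed("SEQ_GTFS.zip")
```

The function skips any member that is missing rather than raising an error, so compare the returned list against the five names if you want to be sure everything arrived.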

# Finding our way to the CBD via public transport
Our goal is to travel from where we live to the Brisbane CBD via public transport.
We don't know where the closest stop is, which route the trains or buses follow, or when those buses or trains will arrive.

Once you have <code>stops.txt</code> uploaded to your Jupyter account, open it to view its contents.

In [26]:
# Start by reading stops.txt into a pandas data frame using read_csv method and set the stop_id column as the index

import pandas
stops = pandas.read_csv('stops.txt', index_col = 0)

# display its contents
stops

stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,platform_code
1,1.0,Herschel Street Stop 1 near North Quay,,-27.467834,153.019079,1,https://translink.com.au/stop/000001/gtfs/,0,,
10,10.0,Ann Street Stop 10 at King George Square,,-27.468003,153.023970,1,https://translink.com.au/stop/000010/gtfs/,0,,
100,100.0,Parliament Stop 94A Margaret St,,-27.473751,153.026745,1,https://translink.com.au/stop/000100/gtfs/,0,,
1000,1000.0,Handford Rd at Songbird Way,,-27.339069,153.043907,2,https://translink.com.au/stop/001000/gtfs/,0,,
10000,10000.0,Balcara Ave near Allira Cr,,-27.344106,153.024982,2,https://translink.com.au/stop/010000/gtfs/,0,,
...,...,...,...,...,...,...,...,...,...,...
place_pinesc,,The Pines Shopping Centre,,-28.134660,153.469767,,,1,,
place_inttbl,,Toombul Shopping Centre interchange,,-27.408269,153.059963,,,1,,
place_intuq,,UQ Chancellors Place,,-27.497970,153.011136,,,1,,
place_scuniv,,USC station,,-26.718756,153.062004,,,1,,


In [None]:
# There are thousands of stops across South East Queensland. Our first goal is to find some stops near where we live.

# We start by determining the longitude and latitude of the property where we live.
# Open google maps https://www.google.com/maps and locate the property where you currently live.
# Put a pin on that location and make note of the longitude and latitude. 
# The longitude should be about 153 and the latitude about -27.
# Note that Google Maps lists the latitude first, then the longitude.

my_longitude = 152.9595649359856
my_latitude = -27.38380639217319

In [None]:
# Next we need to be able to measure the distance from our property to each of the stops.
# To measure the distance between two (longitude, latitude) points we need a formula
# such as the haversine formula (https://en.wikipedia.org/wiki/Haversine_formula), which gives the
# distance between two points on a sphere (since the earth is not flat).
# The earth is not a perfect sphere and its radius varies from place to place, but we approximate it as 6371 kilometres.

import math

# https://en.wikipedia.org/wiki/Haversine_formula
def haversine_distance(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1 = math.radians(lon1)
    lat1 = math.radians(lat1)
    lon2 = math.radians(lon2)
    lat2 = math.radians(lat2)

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    r = 6371 # Radius of earth in kilometres.
    return c * r

# test case: each point is passed as longitude first, then latitude
haversine_distance(153.019079, -27.467834, 153.099357, -27.371936) # should be about 13 kilometres

In [None]:
# We can then use the function to compute the distance from our specified longitude and latitude to each stop

def near(stop_row, lon, lat):
    # pass the stop's longitude first, then its latitude, to match the argument order of haversine_distance
    return haversine_distance(lon, lat, stop_row.stop_lon, stop_row.stop_lat)

stops['dist_from_home'] = stops.apply(near, lon=my_longitude, lat=my_latitude, axis=1)
stops # see the new column ...
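
Calling a Python function once per row with <code>apply</code> can be slow when there are thousands of stops. As an optional alternative, the same haversine formula can be evaluated on whole columns at once with NumPy. The sketch below is not part of the exercise: the names <code>haversine_km</code>, <code>demo_stops</code>, <code>home_lon</code> and <code>home_lat</code> are made up for illustration, and the two demo rows are copied from the <code>stops.txt</code> output shown earlier.

```python
import numpy
import pandas

EARTH_RADIUS_KM = 6371  # same spherical approximation used in this exercise

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; accepts scalars or array-likes given in degrees."""
    lon1, lat1, lon2, lat2 = (
        numpy.radians(numpy.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    a = (numpy.sin((lat2 - lat1) / 2) ** 2
         + numpy.cos(lat1) * numpy.cos(lat2) * numpy.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * numpy.arcsin(numpy.sqrt(a))

# Two rows copied from the stops.txt output above.
demo_stops = pandas.DataFrame(
    {"stop_lat": [-27.467834, -27.468003], "stop_lon": [153.019079, 153.023970]},
    index=["1", "10"],
)

# Hypothetical home location: the same coordinates as stop 1.
home_lon, home_lat = 153.019079, -27.467834

# One call computes the distance to every stop (longitude first, then latitude).
demo_stops["dist_from_home"] = haversine_km(
    home_lon, home_lat, demo_stops.stop_lon, demo_stops.stop_lat
)
```

Passing the columns by name (<code>stop_lon</code>, then <code>stop_lat</code>) makes the argument order easy to check against the function signature.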

In [None]:
# We can then sort the stops by this new column using the sort_values method

nearby_stops = stops.sort_values('dist_from_home')
nearby_stops
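
One thing to watch for: <code>stops.txt</code> mixes ordinary boarding stops (<code>location_type</code> 0) with stations and interchanges (<code>location_type</code> 1, such as the <code>place_...</code> rows in the output above). GTFS stop times refer to boarding stops, so if the closest row is a station it will have no matching stop times. The sketch below filters those rows out before picking the nearest stops; <code>nearest_boardable</code> and the demo distances are made up for illustration.

```python
import pandas

def nearest_boardable(stops_df, n=5):
    """Return the n closest rows whose location_type is 0 or blank (GTFS treats blank as 0)."""
    boardable = stops_df[stops_df.location_type.fillna(0) == 0]
    return boardable.nsmallest(n, "dist_from_home")

demo = pandas.DataFrame(
    {
        "stop_name": [
            "Toombul Shopping Centre interchange",
            "Herschel Street Stop 1 near North Quay",
            "Ann Street Stop 10 at King George Square",
        ],
        "location_type": [1, 0, 0],
        "dist_from_home": [0.1, 0.4, 0.9],  # made-up distances in km
    },
    index=["place_inttbl", "1", "10"],
)

# The interchange is closest but is skipped because it is a station, not a boarding stop.
closest = nearest_boardable(demo, n=2)
```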

In [None]:
# Let's choose the first of these stops and see which buses or trains are coming soon and where they are going to ...
our_stop_id = nearby_stops.index[0]
our_stop_id

In [None]:
# Read stop_times.txt into a data frame using the read_csv method.
# set the data type of the stop_id column to type string by adding parameter: dtype={'stop_id':'str'}

stop_times = pandas.read_csv('stop_times.txt', dtype={'stop_id':'str'})

In [None]:
# View just those stop_time rows that match our stop_id

stop_times[stop_times.stop_id==our_stop_id]

In [None]:
# Not all of those trips will necessarily be running today.
# Transit agencies run different schedules on different days of the week, especially for weekends and public holidays.
# To learn about these service schedules we need to load the calendar.txt file into a data frame.
# Set the service_id column as the index and parse the two date columns as dates

services = pandas.read_csv('calendar.txt', index_col = 0, parse_dates=['start_date','end_date'])
services

In [None]:
# Start by viewing only those services that run on this day of the week.
# So, for example, if today is a Thursday, then we require services.thursday == 1

services[services.thursday == 1]

In [None]:
# We also need to ensure that today falls within the start_date and end_date period of that service.
# For that we need to know today's date ...
import pytz
timezone = pytz.timezone('Australia/Brisbane')
today = pandas.Timestamp.now(tz=timezone).tz_localize(None)

In [None]:
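# Find the list of service_ids for services that run today and are within the service start and end dates.
# The calendar dates are parsed as midnight, so we compare the end date against today's date with the
# time of day removed; otherwise a service would be excluded on its final day.

todays_services = services[(services.thursday == 1) & (services.start_date <= today) & (today.normalize() <= services.end_date)].index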
todays_services
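
The filter above hard-codes the day of the week. As a sketch of how it could be made to work on any day, the weekday column can be looked up from the date itself. The helper name <code>services_on</code> and the demo service ids are made up for illustration; the sketch assumes the operating system reports English day names, which is what <code>day_name()</code> returns by default.

```python
import pandas

def services_on(services_df, when):
    """Return the service_ids that run on the calendar date of `when`."""
    day_column = when.day_name().lower()  # e.g. 'thursday', matching the calendar.txt column names
    date = when.normalize()               # drop the time of day before comparing dates
    runs = (
        (services_df[day_column] == 1)
        & (services_df.start_date <= date)
        & (date <= services_df.end_date)
    )
    return services_df[runs].index

# Made-up calendar with one weekday service and one weekend service.
demo_services = pandas.DataFrame(
    {
        "monday": [1, 0], "tuesday": [1, 0], "wednesday": [1, 0], "thursday": [1, 0],
        "friday": [1, 0], "saturday": [0, 1], "sunday": [0, 1],
        "start_date": pandas.to_datetime(["2024-01-01", "2024-01-01"]),
        "end_date": pandas.to_datetime(["2024-05-16", "2024-12-31"]),
    },
    index=pandas.Index(["WEEKDAY", "WEEKEND"], name="service_id"),
)

# 16 May 2024 was a Thursday and is also the last day of the WEEKDAY service.
thursday_services = services_on(demo_services, pandas.Timestamp("2024-05-16 14:30"))
saturday_services = services_on(demo_services, pandas.Timestamp("2024-05-18 09:00"))
```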

In [None]:
# Next we need to learn which trips occur on those services, so we need to load trips.txt into a pandas data frame.
# Set the trip_id column as the index (naming the column is safer than relying on its position in the file).

trips = pandas.read_csv('trips.txt', index_col = 'trip_id')
trips

In [None]:
# To test if a trip is part of a service, we can use the isin method
# trips.service_id.isin(todays_services)

# Find the list of trip_ids for those trips
todays_trips = trips[trips.service_id.isin(todays_services)].index

todays_trips

In [None]:
# We can then use this list of trip ids to find stop times matching these trip ids.
# stop_times.trip_id.isin(todays_trips)

# Find all stop times that stop at our stop today.
stop_times[(stop_times.stop_id==our_stop_id) & (stop_times.trip_id.isin(todays_trips)) ]

In [None]:
# We aren't interested in trying to catch any trains or buses that have already departed, 
# so view only those stop times that have an arrival_time after the time now.

time_now = today.strftime('%H:%M:%S')

arriving_soon = stop_times[(stop_times.stop_id==our_stop_id) & (stop_times.trip_id.isin(todays_trips)) & (time_now <= stop_times.arrival_time)  ]
arriving_soon
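
The comparison above works on text, which is fine as long as every time is zero-padded to <code>HH:MM:SS</code>. GTFS also allows times past <code>24:00:00</code> for trips that run after midnight on the same service day, and a single-digit hour would break a text comparison. A more defensive approach, sketched below, converts each time to a number of seconds first; the helper name <code>gtfs_time_to_seconds</code> is made up for illustration.

```python
def gtfs_time_to_seconds(hms):
    """Convert a GTFS 'H:MM:SS' or 'HH:MM:SS' string to seconds since the start of the service day."""
    hours, minutes, seconds = (int(part) for part in hms.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds

# As text, '9:30:00' sorts after '10:00:00' because '9' > '1' ...
text_says_earlier = "9:30:00" <= "10:00:00"

# ... but as seconds the order is what we expect.
seconds_say_earlier = gtfs_time_to_seconds("9:30:00") <= gtfs_time_to_seconds("10:00:00")
```

With a data frame, the same function could be applied to a column with <code>stop_times.arrival_time.map(gtfs_time_to_seconds)</code> and the result compared against the current time converted the same way.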

In [None]:
# That's great, but we don't know where any of these trains or buses are going to ...
# So, we start by joining this stop_time data with the trips data frame
stops_with_trips = arriving_soon.join(trips, on='trip_id')
stops_with_trips
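
If the <code>join</code> call is unfamiliar, the toy example below shows what it does: each row on the left is matched, via its <code>trip_id</code> column, against the index of the right-hand data frame, and rows without a match are kept with missing values. All of the data here is made up for illustration.

```python
import pandas

demo_stop_times = pandas.DataFrame(
    {"trip_id": ["T1", "T2", "T9"], "arrival_time": ["08:00:00", "08:10:00", "08:20:00"]}
)
demo_trips = pandas.DataFrame(
    {"route_id": ["R1", "R2"], "trip_headsign": ["City", "Airport"]},
    index=pandas.Index(["T1", "T2"], name="trip_id"),
)

# Left join: the 'trip_id' column of demo_stop_times is looked up in the index of demo_trips.
joined = demo_stop_times.join(demo_trips, on="trip_id")

# Trip T9 has no entry in demo_trips, so its trip columns are left empty (NaN).
unmatched = joined[joined.route_id.isna()]
```

Checking for unmatched rows like this is a quick way to spot an index that was not set up as expected.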

In [None]:
# We now have a trip_headsign column, which may help us determine where the bus or train is going
# We also now have a route_id, but it's not particularly meaningful.
# To get information about the route we need to join our stop_time and trip data with the routes.txt data.

In [None]:
# Read routes.txt into a pandas data frame.
# Set the route_id column as the index
routes = pandas.read_csv('routes.txt', index_col = 0)

In [None]:
# Join our stop_time and trip data with the routes data frame based on the 'route_id' column

full = arriving_soon.join(trips, on='trip_id').join(routes, on='route_id')
full

In [None]:
# Filter the output so that we only see the trip_id, arrival_time, route_short_name, route_long_name, trip_headsign
show = full[['trip_id','arrival_time', 'route_short_name', 'route_long_name', 'trip_headsign']]
show

In [None]:
# Let's select one of those trips to explore precisely where it goes ...
our_trip_id = show.iloc[0,0]

In [None]:
# Find all stop_times for our trip_id (do not restrict to our stop_id)

my_stops = stop_times[stop_times.trip_id == our_trip_id]
my_stops

In [None]:
# Unfortunately, these stop_ids don't mean anything to us,
# so we need to join this data with the stops data frame
# display only the arrival_time and stop_name
my_stops.join(stops, on='stop_id')[['arrival_time', 'stop_name']]

In [None]:
# Will this get us towards the Brisbane CBD? If not, explore some other options.
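
One way to answer that question in code, offered as a sketch only: keep just the stops that come after ours on the trip (using the standard <code>stop_sequence</code> column of <code>stop_times.txt</code>) and look for a stop name we recognise as being in the CBD. The helper name <code>stops_after</code>, the demo data, and the choice of "King George Square" as a CBD marker are all assumptions made for illustration.

```python
import pandas

def stops_after(trip_stop_times, boarding_stop_id):
    """Return the rows of one trip that come after the boarding stop, in travel order."""
    ordered = trip_stop_times.sort_values("stop_sequence")
    boarding_seq = ordered.loc[ordered.stop_id == boarding_stop_id, "stop_sequence"].min()
    return ordered[ordered.stop_sequence > boarding_seq]

# Made-up trip with three stops, deliberately listed out of order.
demo_trip = pandas.DataFrame(
    {"stop_id": ["10", "A", "B"], "stop_sequence": [3, 1, 2]}
)
demo_stop_names = pandas.DataFrame(
    {"stop_name": ["Home Street", "Middle Road", "Ann Street Stop 10 at King George Square"]},
    index=["A", "B", "10"],
)

# Stops we would pass after boarding at stop 'A', with their names attached.
later = stops_after(demo_trip, "A").join(demo_stop_names, on="stop_id")

# Assumption: a stop name containing this text counts as reaching the CBD.
reaches_cbd = later.stop_name.str.contains("King George Square").any()
```

Restricting the search to later stops matters because a service that has already passed through the CBD before reaching our stop is heading the wrong way.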