## Edit GTFS as a System
Margaret Atkinson 10/25/22

Overall Goal: Conflate GTFS to TransCAD links

- Issue: Too many route variations
    - First Sub-Issue Goal: Get the number of trips per time period for each route variation

Notes:

This 2018 Recap GTFS has already been cleaned by the R script itsleeds_gleangtfs.R and imported to the link layer in TransCAD. I am starting from the cleaned GTFS (before the link-layer import) so that I can work with all of this data as one connected system instead of as disparate parts.

In [11]:
import matplotlib
matplotlib.use('agg')  # allows notebook to be tested in Travis

import pandas as pd
import cartopy.crs as ccrs
import cartopy
import matplotlib.pyplot as plt
import pandana as pdna
import time

import urbanaccess as ua
from urbanaccess.config import settings
from urbanaccess.gtfsfeeds import feeds
from urbanaccess import gtfsfeeds
from urbanaccess.gtfs.gtfsfeeds_dataframe import gtfsfeeds_dfs
from urbanaccess.network import ua_network, load_network

%matplotlib inline

Example of GTFS in UrbanAccess here:

https://github.com/UDST/urbanaccess/blob/dev/demo/simple_example.ipynb

In [12]:
# required bbox including all of Massachusetts and RI as well as parts of NH, CT, NY
bbox = (-73.7207, 41.1198, -69.7876, 43.1161)
# path to the downloaded and cleaned gtfs - mbta recap file for fall 2018
#   this could also be a folder of gtfs folders (pre merge of multiple gtfs)
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_its_clean"
path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_clean_trips_filter"

In [13]:
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(gtfsfeed_path= path_to_gtfs,
                                           validation=True,
                                           verbose=True,
                                           bbox=bbox,
                                           remove_stops_outsidebbox=False,
                                           append_definitions=True)

Checking GTFS text file header whitespace... Reading files using encoding: utf-8 set in configuration.
GTFS text file header whitespace check completed. Took 1.14 seconds
--------------------------------
Processing GTFS feed: mbta2018_clean_trips_filter
The unique agency id: mbta was generated using the name of the agency in the agency.txt file.
Unique agency id operation complete. Took 0.05 seconds
Unique GTFS feed id operation complete. Took 0.01 seconds
No GTFS feed stops were found to be outside the bounding box coordinates
mbta2018_clean_trips_filter GTFS feed stops: coordinates are in northwest hemisphere. Latitude = North (90); Longitude = West (-90).
Appended route type to stops
Appended route type to stop_times
--------------------------------
Added descriptive definitions to stops, routes, stop_times, and trips tables
Successfully converted ['departure_time'] to seconds past midnight and appended new columns to stop_times. Took 3.80 seconds
1 GTFS feed file(s) successfully re

### HEADWAYS

In [14]:
gtfsfeeds_dfs.calendar_dates.date = gtfsfeeds_dfs.calendar_dates.date.astype('str')
# make network
ua.gtfs.network.create_transit_net(gtfsfeeds_dfs=loaded_feeds,
                                   day='tuesday',
                                   timerange=['06:30:00', '09:30:00'],
                                   calendar_dates_lookup={'date':"20180918"})
gtfsfeeds_dfs.calendar_dates.date = gtfsfeeds_dfs.calendar_dates.date.astype('int64')

Using calendar to extract service_ids to select trips.
34 service_ids were extracted from calendar
21,968 trip(s) 100.00 percent of 21,968 total trip records were found in calendar for GTFS feed(s): ['mbta2018 clean trips filter']
21,968 of 21,968 total trips were extracted representing calendar day: tuesday and calendar_dates search parameters: {'date': '20180918'}. Took 0.03 seconds
There are no departure time records missing from trips following the specified schedule. There are no records to interpolate.
Difference between stop times has been successfully calculated. Took 0.05 seconds
Stop times from 06:30:00 to 09:30:00 successfully selected 128,915 records out of 574,987 total records (22.42 percent of total). Took 0.10 seconds
Starting transformation process for 5,543 total trips...
stop time table transformation to Pandana format edge table completed. Took 4.09 seconds
Time conversion completed: seconds converted to minutes.
7,809 of 8,260 records selected from stops. Took 0.01

In [15]:
ua.gtfs.headways.headways(
    gtfsfeeds_df=loaded_feeds,
    headway_timerange=['06:30:00','09:30:00']
)

loaded_feeds.headways.head()

Stop times from 06:30:00 to 09:30:00 successfully selected 128,915 records out of 574,987 total records (22.42 percent of total). Took 0.02 seconds
Starting route stop headway calculation for 12,535 route stops...
Route stop headway calculation complete. Took 13.57 seconds
headway calculation complete. Took 14.43 seconds


Unnamed: 0,count,mean,std,min,25%,50%,75%,max,unique_stop_id,unique_route_id,node_id_route
123072,0.0,,,,,,,,10000_mbta,195_mbta,10000_mbta_195_mbta
38333,31.0,4.612903,8.712739,0.0,0.0,0.0,0.0,23.0,10000_mbta,43_mbta,10000_mbta_43_mbta
38353,31.0,4.612903,8.712739,0.0,0.0,0.0,0.0,23.0,10000_mbta,43_mbta,10000_mbta_43_mbta
38354,31.0,4.612903,8.712739,0.0,0.0,0.0,0.0,23.0,10000_mbta,43_mbta,10000_mbta_43_mbta
38390,31.0,4.612903,8.712739,0.0,0.0,0.0,0.0,23.0,10000_mbta,43_mbta,10000_mbta_43_mbta


In [16]:
# returns one row for each time the route/stop combination is used by a trip during the time period
#   the headway statistics are the same for every row with that route/stop combination within the time period
#   could keep just the stop_ids that come first on that route by going through the trips table
gtfsfeeds_dfs.headways.query('unique_route_id == "Green-B_mbta"')

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,unique_stop_id,unique_route_id,node_id_route
3912,31.0,5.516129,0.508001,5.0,5.0,6.0,6.0,6.0,70106_mbta,Green-B_mbta,70106_mbta_Green-B_mbta
3936,31.0,5.516129,0.508001,5.0,5.0,6.0,6.0,6.0,70106_mbta,Green-B_mbta,70106_mbta_Green-B_mbta
3960,31.0,5.516129,0.508001,5.0,5.0,6.0,6.0,6.0,70106_mbta,Green-B_mbta,70106_mbta_Green-B_mbta
3984,31.0,5.516129,0.508001,5.0,5.0,6.0,6.0,6.0,70106_mbta,Green-B_mbta,70106_mbta_Green-B_mbta
4008,31.0,5.516129,0.508001,5.0,5.0,6.0,6.0,6.0,70106_mbta,Green-B_mbta,70106_mbta_Green-B_mbta
...,...,...,...,...,...,...,...,...,...,...,...
4391,28.0,6.071429,1.385870,5.0,5.0,6.0,6.0,10.0,70200_mbta,Green-B_mbta,70200_mbta_Green-B_mbta
4415,28.0,6.071429,1.385870,5.0,5.0,6.0,6.0,10.0,70200_mbta,Green-B_mbta,70200_mbta_Green-B_mbta
4439,28.0,6.071429,1.385870,5.0,5.0,6.0,6.0,10.0,70200_mbta,Green-B_mbta,70200_mbta_Green-B_mbta
4463,28.0,6.071429,1.385870,5.0,5.0,6.0,6.0,10.0,70200_mbta,Green-B_mbta,70200_mbta_Green-B_mbta
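
Because the headway statistics repeat for every trip that uses a route/stop combination, one row per `node_id_route` carries the same information. A minimal sketch of collapsing the repeats, using a toy frame built from a few of the Green-B values printed above (the toy frame stands in for `gtfsfeeds_dfs.headways` and is an assumption for illustration):

```python
import pandas as pd

# Toy rows copied from the Green-B output above; repeated rows differ only by index.
headways = pd.DataFrame(
    {
        "count": [31.0, 31.0, 31.0, 28.0, 28.0],
        "mean": [5.516129, 5.516129, 5.516129, 6.071429, 6.071429],
        "unique_stop_id": ["70106_mbta"] * 3 + ["70200_mbta"] * 2,
        "unique_route_id": ["Green-B_mbta"] * 5,
        "node_id_route": ["70106_mbta_Green-B_mbta"] * 3
        + ["70200_mbta_Green-B_mbta"] * 2,
    },
    index=[3912, 3936, 3960, 4391, 4415],
)

# Keep the first row for each route/stop combination.
headways_unique = headways.drop_duplicates(subset="node_id_route").reset_index(drop=True)
print(headways_unique)
```

This keeps the first occurrence of each combination, which is safe only if the repeated rows really are identical, as they appear to be in the output above.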


### Calculating Route Variation Trip Counts
Now the goal is to get the number of trips per route, time period, direction, and shape_id, and also per route, time period, direction, and route_pattern_id. Filtering out uncommonly used route patterns will help identify the conflation issues that are relevant to the model.

Time of Day Periods
- AM Peak: 6:30 AM to 9:30 AM
- MD: 9:30 AM to 3:00 PM
- PM Peak: 3:00 PM to 7:00 PM
- NT: 7:00 PM to 6:30 AM
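
The periods above can be sketched as a small classifier. This is an illustration, not the notebook's own method: the function names are hypothetical, and it assumes each period includes its start time and excludes its end time, which the list does not state.

```python
def to_seconds(hhmmss):
    """Convert a GTFS 'HH:MM:SS' string to seconds past midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def classify_tod(hhmmss):
    """Return 'AM', 'MD', 'PM', or 'NT' for a GTFS time string."""
    # GTFS allows hours >= 24 for trips running past midnight, so wrap to one day.
    sec = to_seconds(hhmmss) % 86400
    if to_seconds("06:30:00") <= sec < to_seconds("09:30:00"):
        return "AM"
    if to_seconds("09:30:00") <= sec < to_seconds("15:00:00"):
        return "MD"
    if to_seconds("15:00:00") <= sec < to_seconds("19:00:00"):
        return "PM"
    return "NT"  # 7:00 PM through 6:30 AM


for t in ["06:30:00", "09:29:59", "12:00:00", "19:00:00", "24:05:00"]:
    print(t, classify_tod(t))
```

Working in whole seconds avoids comparing floating-point encodings of clock times at the period boundaries.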

In [17]:
gtfsfeeds_dfs.trips['tod'] = 0

In [18]:
# calculate time of day period for each trip from the arrival time at its first stop
# NOTE: raises IndexError when a trip has no stop_times row with stop_sequence == 1
for x in gtfsfeeds_dfs.trips.trip_id:
    stop_times_trip = gtfsfeeds_dfs.stop_times.query('trip_id == @x & stop_sequence == 1').arrival_time.array
    # encode HH:MM as an HH.MM number (e.g. 06:30 -> 6.3)
    time_int = int(stop_times_trip[0].split(":")[0]) + int(stop_times_trip[0].split(":")[1])/100
    # lower bounds are inclusive so that times exactly on a boundary are classified
    if 6.3 <= time_int < 9.3:
        time_flag = 'AM'
    elif 9.3 <= time_int < 15:
        time_flag = 'MD'
    elif 15 <= time_int < 19:
        time_flag = 'PM'
    else:
        time_flag = 'NT'

    row = gtfsfeeds_dfs.trips.query('trip_id == @x').index[0]
    gtfsfeeds_dfs.trips.loc[row,'tod'] = time_flag

IndexError: index 0 is out of bounds for axis 0 with size 0
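
This error is consistent with the `stop_sequence == 1` query returning no rows for at least one trip, so that indexing `[0]` fails on an empty array. I have not confirmed that this is the cause here. A minimal sketch of how to check, using hypothetical toy tables in place of the real `trips` and `stop_times`:

```python
import pandas as pd

# Hypothetical toy tables: trip "t2" starts at stop_sequence 2, "t3" has no stop_times.
trips = pd.DataFrame({"trip_id": ["t1", "t2", "t3"]})
stop_times = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t2", "t2"],
        "stop_sequence": [1, 2, 2, 3],
        "arrival_time": ["06:45:00", "06:50:00", "10:00:00", "10:05:00"],
    }
)

# Trips for which a 'stop_sequence == 1' query would return nothing.
has_seq_1 = set(stop_times.loc[stop_times["stop_sequence"] == 1, "trip_id"])
missing = trips.loc[~trips["trip_id"].isin(has_seq_1), "trip_id"].tolist()
print(missing)

# First stop of every trip, whatever its lowest stop_sequence is.
first_stops = (
    stop_times.sort_values(["trip_id", "stop_sequence"])
    .groupby("trip_id", as_index=False)
    .first()
)
print(first_stops)
```

Taking the lowest `stop_sequence` per trip, rather than requiring it to equal 1, also replaces the per-trip query with a single grouped operation.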

In [23]:
# encode arrival_time 'HH:MM:SS' as an HH.MM float (e.g. 19:25:00 -> 19.25)
time_parts = gtfsfeeds_dfs.stop_times['arrival_time'].astype('str').str.split(':')
hours = time_parts.str[0].astype('int64')
minutes = time_parts.str[1].astype('int64')/100
gtfsfeeds_dfs.stop_times['time_integer'] = hours+minutes
gtfsfeeds_dfs.stop_times['tod'] = gtfsfeeds_dfs.stop_times['time_integer'].

In [24]:
stop_times

0          19.25
1          19.26
2          19.28
3          19.30
4          19.32
           ...  
1872157    23.46
1872158    23.51
1872159    23.56
1872160    24.05
1872161    24.10
Name: arrival_time, Length: 1872162, dtype: float64
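
HH.MM values like those printed above can be binned into time of day periods without a loop. A minimal sketch with `numpy.select` on a toy column; the toy values and the inclusive lower bounds are assumptions, and any value at or past 24.00 is simply treated as NT rather than wrapped to the next day:

```python
import numpy as np
import pandas as pd

# Toy frame with HH.MM-encoded times similar to the values printed above.
stop_times = pd.DataFrame({"time_integer": [6.30, 9.29, 9.30, 15.00, 19.25, 24.05]})

t = stop_times["time_integer"]
conditions = [
    (t >= 6.30) & (t < 9.30),    # AM Peak
    (t >= 9.30) & (t < 15.00),   # MD
    (t >= 15.00) & (t < 19.00),  # PM Peak
]
# Everything else, including times at or past 24.00, falls into NT.
stop_times["tod"] = np.select(conditions, ["AM", "MD", "PM"], default="NT")
print(stop_times)
```

Because the encoding is a float, boundary comparisons rely on the computed value matching the literal exactly. Comparing integer minutes or seconds past midnight would be more robust.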

In [None]:
stop_times_trip.apply(lambda x: x[0])

In [None]:
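# number of trips per route variation, service_id, and time period
# (.agg() with no arguments raises a TypeError; .size() counts the rows in each group)
gtfsfeeds_dfs.trips.groupby(by=['route_id' , 'direction_id', 'route_pattern_id','service_id','tod']).size()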
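
A minimal sketch of the trip-count idea on a hypothetical toy trips table: count trips per route variation and time period, then keep only the patterns that carry a meaningful share of trips. The ids, the `share` column, and the `min_share` threshold are illustrative assumptions, not values from the feed.

```python
import pandas as pd

# Hypothetical toy trips table; ids and tod values are illustrative only.
trips = pd.DataFrame(
    {
        "route_id": ["1", "1", "1", "1", "1", "1"],
        "direction_id": [0, 0, 0, 0, 0, 1],
        "route_pattern_id": ["1-_-0", "1-_-0", "1-_-0", "1-_-0", "1-1-0", "1-_-1"],
        "tod": ["AM", "AM", "AM", "MD", "AM", "AM"],
    }
)

keys = ["route_id", "direction_id", "route_pattern_id", "tod"]
trip_counts = trips.groupby(keys).size().reset_index(name="number_of_trips")

# Share of each pattern within its route, direction, and time period.
group_total = trip_counts.groupby(["route_id", "direction_id", "tod"])[
    "number_of_trips"
].transform("sum")
trip_counts["share"] = trip_counts["number_of_trips"] / group_total

# min_share is a hypothetical cutoff for "uncommonly used" patterns.
min_share = 0.5
common = trip_counts[trip_counts["share"] >= min_share]
print(common)
```

A share-based cutoff keeps the filter comparable between busy and infrequent routes. An absolute minimum trip count would be the simpler alternative.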

In [7]:
# count the trips assigned to each weekday bus service_id
service_ids = [
    "BUS42018-hba48011-Weekday-02",
    "BUS42018-hbb48011-Weekday-02",
    "BUS42018-hbc48fr1-Weekday-02",
    "BUS42018-hbc48wk1-Weekday-02",
    "BUS42018-hbf48011-Weekday-02",
    "BUS42018-hbg48011-Weekday-02",
    "BUS42018-hbl48011-Weekday-02",
    "BUS42018-hbq48011-Weekday-02",
    "BUS42018-hbs48sp1-Weekday-02",
    "BUS42018-hbs48sw1-Weekday-02",
    "BUS42018-hbt48011-Weekday-02",
    "BUS42018-htt48011-Weekday-02",
]
service_id_trips = {
    "service_id": service_ids,
    "number_of_trips": [
        len(gtfsfeeds_dfs.trips[gtfsfeeds_dfs.trips['service_id'] == sid])
        for sid in service_ids
    ],
}
pd.DataFrame.from_dict(service_id_trips)

Unnamed: 0,service_id,number_of_trips
0,BUS42018-hba48011-Weekday-02,1040
1,BUS42018-hbb48011-Weekday-02,2214
2,BUS42018-hbc48fr1-Weekday-02,2602
3,BUS42018-hbc48wk1-Weekday-02,2602
4,BUS42018-hbf48011-Weekday-02,808
5,BUS42018-hbg48011-Weekday-02,1495
6,BUS42018-hbl48011-Weekday-02,921
7,BUS42018-hbq48011-Weekday-02,1042
8,BUS42018-hbs48sp1-Weekday-02,1711
9,BUS42018-hbs48sw1-Weekday-02,1697


In [8]:
service_dict = {
    "BUS42018-hba48011-Weekday-02": {},
    "BUS42018-hbb48011-Weekday-02": {},
    "BUS42018-hbc48fr1-Weekday-02": {},
    "BUS42018-hbc48wk1-Weekday-02": {},
    "BUS42018-hbf48011-Weekday-02": {},
    "BUS42018-hbg48011-Weekday-02": {},
    "BUS42018-hbl48011-Weekday-02": {},
    "BUS42018-hbq48011-Weekday-02": {},
    "BUS42018-hbs48sp1-Weekday-02": {},
    "BUS42018-hbs48sw1-Weekday-02": {},
    "BUS42018-hbt48011-Weekday-02": {},
    "BUS42018-htt48011-Weekday-02": {}
    }

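# flag each route with 1 if the service_id has at least one trip on it, else 0
for key in service_dict:
    # routes served by this service_id, looked up once per service
    routes_served = set(gtfsfeeds_dfs.trips.query('service_id == @key').route_id.unique())
    for x in gtfsfeeds_dfs.routes['route_id']:
        service_dict[key][x] = 1 if x in routes_served else 0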
    

In [9]:
pd.DataFrame(service_dict).to_csv(r"C:\Users\matkinson.AD\Downloads\service_route_matrix2.csv")
pd.DataFrame(service_dict)

Unnamed: 0,BUS42018-hba48011-Weekday-02,BUS42018-hbb48011-Weekday-02,BUS42018-hbc48fr1-Weekday-02,BUS42018-hbc48wk1-Weekday-02,BUS42018-hbf48011-Weekday-02,BUS42018-hbg48011-Weekday-02,BUS42018-hbl48011-Weekday-02,BUS42018-hbq48011-Weekday-02,BUS42018-hbs48sp1-Weekday-02,BUS42018-hbs48sw1-Weekday-02,BUS42018-hbt48011-Weekday-02,BUS42018-htt48011-Weekday-02
Red,0,0,0,0,0,0,0,0,0,0,0,0
Mattapan,0,0,0,0,0,0,0,0,0,0,0,0
Orange,0,0,0,0,0,0,0,0,0,0,0,0
Green-B,0,0,0,0,0,0,0,0,0,0,0,0
Green-C,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
710,0,0,0,0,0,0,0,0,0,0,0,0
712,0,0,0,0,0,0,0,0,0,0,0,0
713,0,0,0,0,0,0,0,0,0,0,0,0
714,0,0,0,0,0,0,0,0,0,0,0,0
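
One way to spot service_ids that serve exactly the same set of routes in a matrix like this is to group the columns by their contents. A minimal sketch on a hypothetical toy matrix; the service and route names are made up:

```python
import pandas as pd

# Hypothetical 0/1 matrix: rows are routes, columns are service_ids.
matrix = pd.DataFrame(
    {"svc_a": [1, 0, 1], "svc_b": [1, 1, 0], "svc_c": [1, 1, 0]},
    index=["route_1", "route_2", "route_3"],
)

# Group service_ids by their column of 0/1 flags; groups with more than
# one member serve identical sets of routes.
groups = {}
for service_id in matrix.columns:
    signature = tuple(matrix[service_id])
    groups.setdefault(signature, []).append(service_id)

identical = [ids for ids in groups.values() if len(ids) > 1]
print(identical)
```

Matching route sets are only a first hint. Trip counts and stop_times would still need to be compared before treating two service_ids as duplicates.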


Notes:

The following two service_ids serve identical routes and the same number of trips:
- BUS42018-hbc48fr1-Weekday-02
- BUS42018-hbc48wk1-Weekday-02

Their trips overlap in time. However, hbc48wk1 seems to cover Monday through Thursday and hbc48fr1 Friday, which may allow more flexibility around holidays. This is the main issue with trips that have different ids but the same pattern and stop_times (i.e. the same trip under a different service_id) when days are conflated (e.g. selecting Tuesdays in general instead of a specific date).

Both of these schedules are listed as running Monday through Friday and are taken out of service on individual dates in the optional calendar_dates.txt, where an exception_type of 2 means service is removed and 1 means service is added. See https://multigtfs.readthedocs.io/en/latest/gtfs.html for documentation of this schema.

20180918 (September 18, 2018) seems to be a "regular" date in calendar_dates.txt, meaning that only BUS42018-hbc48fr1-Weekday-02 is turned off for the day.
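
The interaction between the weekly calendar and calendar_dates.txt can be sketched as follows. This illustrates the standard GTFS rule (start from the weekly pattern, add exception_type 1, remove exception_type 2) on hypothetical toy tables. The service ids and rows are made up and are not taken from the MBTA feed, and `active_services` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical toy calendar.txt and calendar_dates.txt; ids and rows are illustrative.
calendar = pd.DataFrame(
    {
        "service_id": ["svc_wk", "svc_fr", "svc_sat"],
        "tuesday": [1, 1, 0],
        "start_date": [20180901, 20180901, 20180901],
        "end_date": [20181231, 20181231, 20181231],
    }
)
calendar_dates = pd.DataFrame(
    {
        "service_id": ["svc_fr", "svc_sat"],
        "date": [20180918, 20180918],
        "exception_type": [2, 1],  # 1 = service added, 2 = service removed
    }
)


def active_services(date, weekday):
    """Return the set of service_ids running on `date` (YYYYMMDD int)."""
    in_range = (calendar["start_date"] <= date) & (calendar["end_date"] >= date)
    active = set(calendar.loc[in_range & (calendar[weekday] == 1), "service_id"])
    exceptions = calendar_dates[calendar_dates["date"] == date]
    active |= set(exceptions.loc[exceptions["exception_type"] == 1, "service_id"])
    active -= set(exceptions.loc[exceptions["exception_type"] == 2, "service_id"])
    return active


print(sorted(active_services(20180918, "tuesday")))
```

Selecting trips by a specific date in this way, rather than by weekday alone, avoids counting the same trip twice under two service_ids that both list the weekday.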