## Edit GTFS as a System
Margaret Atkinson 10/25/22

Overall Goal: Conflate GTFS to TransCAD links

- Issue: Too many route variations
    - First Sub-Issue Goal: Get the number of trips per time period for each route variation

Notes:

This 2018 Recap GTFS has already been cleaned by the R script itsleeds_gleangtfs.R and imported to the link layer in TransCAD. I am starting from the cleaned GTFS (before import to the link layer) so that I can work with all of this data as one connected system instead of disparate parts.

In [1]:
import matplotlib
matplotlib.use('agg')  # allows notebook to be tested in Travis

import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import cartopy
import matplotlib.pyplot as plt
import pandana as pdna
import time

import urbanaccess as ua
from urbanaccess.config import settings
from urbanaccess.gtfsfeeds import feeds
from urbanaccess import gtfsfeeds
from urbanaccess.gtfs.gtfsfeeds_dataframe import gtfsfeeds_dfs
from urbanaccess.network import ua_network, load_network

%matplotlib inline

Example of GTFS in UrbanAccess here:

https://github.com/UDST/urbanaccess/blob/dev/demo/simple_example.ipynb

In [2]:
# required bbox including all of Massachusetts and RI as well as parts of NH, CT, NY
bbox = (-73.7207, 41.1198, -69.7876, 43.1161)
# path to the downloaded and cleaned gtfs - mbta recap file for fall 2018
#   this could also be a folder of gtfs folders (pre merge of multiple gtfs)
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_its_clean"
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_102418"
path_to_gtfs = r"C:\Users\matkinson.AD\Downloads\Nov11_Sandbox\mbta2018_102418_20221108"

In [3]:
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(gtfsfeed_path=path_to_gtfs,
                                           validation=True,
                                           verbose=True,
                                           bbox=bbox,
                                           remove_stops_outsidebbox=False,
                                           append_definitions=True)

Checking GTFS text file header whitespace... Reading files using encoding: utf-8 set in configuration.
GTFS text file header whitespace check completed. Took 0.11 seconds
--------------------------------
Processing GTFS feed: mbta2018_102418_20221108
The unique agency id: mbta was generated using the name of the agency in the agency.txt file.
Unique agency id operation complete. Took 0.02 seconds
Unique GTFS feed id operation complete. Took 0.00 seconds
No GTFS feed stops were found to be outside the bounding box coordinates
mbta2018_102418_20221108 GTFS feed stops: coordinates are in northwest hemisphere. Latitude = North (90); Longitude = West (-90).
Appended route type to stops
Appended route type to stop_times
--------------------------------
Added descriptive definitions to stops, routes, stop_times, and trips tables
Successfully converted ['departure_time'] to seconds past midnight and appended new columns to stop_times. Took 0.96 seconds
1 GTFS feed file(s) successfully read as 

## Update Pattern ID in Trips Table (Consolidate)

In [4]:
routes_from_transcad = pd.read_csv(r"C:\Users\matkinson.AD\Downloads\Nov11_Sandbox\conflated_gtfs_routes\conflated_gtfs_routes_round1.csv")
routes_from_transcad.head()

Unnamed: 0,Route_ID,Route_Name,Route,Short Name,Long Name,Description,Mode,Trip,Sign,Service,...,ScheduleEndTime,Headway_AM,Headway_MD,Headway_PM,Headway_NT,Daily_Trips,Trips_AM,Trips_MD,Trips_PM,Trips_NT
0,1,170-F-1,170,170,Waltham - Dudley,Commuter Bus,3,38173968,Waltham Central Square (Express),BUS42018-hba48011-Weekday-02,...,4:55,27.0,,,,1,1,0,0,0
1,2,170-R-1,170,170,Waltham - Dudley,Commuter Bus,3,38174309,Dudley (Express),BUS42018-hba48011-Weekday-02,...,4:55,,,62.6812,,2,0,0,2,0
2,3,23-R-1,23,23,Ashmont - Ruggles via Washington Street,Key Bus,3,38175312_1,Watertown Yard via Dudley,BUS42018-hba48011-Weekday-02,...,11:4,,329.0,,,1,0,1,0,0
3,4,4-F-1,4,4,North Station - Tide Street,Commuter Bus,3,38174928,North Station,BUS42018-hba48011-Weekday-02,...,6:28,,,22.974,,8,0,0,8,0
4,5,4-F-2,4,4,North Station - Tide Street,Commuter Bus,3,38174920,North Station via South Station,BUS42018-hba48011-Weekday-02,...,6:28,16.1008,,,,8,8,0,0,0


In [5]:
len(routes_from_transcad)

1116

In [6]:
len(gtfsfeeds_dfs.trips['route_pattern_id'].unique())

908

In [7]:
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.merge(routes_from_transcad, how="left", left_on="trip_id", right_on="Trip")

In [8]:
len(gtfsfeeds_dfs.trips)

17655

In [9]:
ct_max = gtfsfeeds_dfs.trips.sort_values('Daily_Trips',ascending = False).drop_duplicates(['Route','Direction','Sign'])
ct_max_filt = ct_max[['Route','Direction','Sign','route_pattern_id']].rename(columns = {'route_pattern_id':'rpid'})
ct_max[['Route','Direction','Sign','route_pattern_id','Daily_Trips']]

Unnamed: 0,Route,Direction,Sign,route_pattern_id,Daily_Trips
14448,Blue,R,Wonderland,Blue-6-1,174.0
14262,Blue,F,Bowdoin,Blue-6-0,165.0
16577,Mattapan,F,Mattapan,Mattapan-_-0,157.0
15460,Green-B,R,Park Street,Green-B-3-1,156.0
16740,Mattapan,R,Ashmont,Mattapan-_-1,156.0
...,...,...,...,...,...
5582,3233,F,River & Milton via Cleary Sq,3233-5-0,0.0
3231,215,R,North Quincy,215-_-1,0.0
7690,455,R,Wonderland,455-5-1,0.0
6503,39,R,Haymarket via Forest Hills,39-_-1,0.0


In [10]:
ct_max_filt

Unnamed: 0,Route,Direction,Sign,rpid
14448,Blue,R,Wonderland,Blue-6-1
14262,Blue,F,Bowdoin,Blue-6-0
16577,Mattapan,F,Mattapan,Mattapan-_-0
15460,Green-B,R,Park Street,Green-B-3-1
16740,Mattapan,R,Ashmont,Mattapan-_-1
...,...,...,...,...
5582,3233,F,River & Milton via Cleary Sq,3233-5-0
3231,215,R,North Quincy,215-_-1
7690,455,R,Wonderland,455-5-1
6503,39,R,Haymarket via Forest Hills,39-_-1


In [11]:
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.merge(ct_max_filt, how='left', on=['Route','Direction','Sign'])
gtfsfeeds_dfs.trips.head()

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,trip_route_type,...,Headway_AM,Headway_MD,Headway_PM,Headway_NT,Daily_Trips,Trips_AM,Trips_MD,Trips_PM,Trips_NT,rpid
0,1,BUS42018-hbc48wk1-Weekday-02,38230147,Harvard,,0,C01-21,10038,1,,...,9.28009,14.0228,8.12799,11.1659,105.0,19.0,24.0,30.0,32.0,1-_-0
1,1,BUS42018-hbc48wk1-Weekday-02,38230148,Harvard,,0,C01-21,10038,1,,...,,,,,,,,,,1-_-0
2,1,BUS42018-hbc48wk1-Weekday-02,38230154,Harvard,,0,C01-16,10038,1,,...,,,,,,,,,,1-_-0
3,1,BUS42018-hbc48wk1-Weekday-02,38230155,Harvard,,0,C01-6,10038,1,,...,,,,,,,,,,1-_-0
4,1,BUS42018-hbc48wk1-Weekday-02,38230157,Harvard,,0,C01-20,10038,1,,...,,,,,,,,,,1-_-0


In [12]:
# Consolidate a trip into the highest-frequency pattern (rpid) only when its matched
# TransCAD record has fewer than 3 trips in every time period; otherwise keep route_pattern_id.
# Trips with no TransCAD match have NaN counts, so the comparison is False and they are unchanged.
low_frequency = (gtfsfeeds_dfs.trips[['Trips_AM', 'Trips_MD', 'Trips_PM', 'Trips_NT']] < 3).all(axis=1)
gtfsfeeds_dfs.trips['rp_id_updated'] = np.where(
    low_frequency, gtfsfeeds_dfs.trips['rpid'], gtfsfeeds_dfs.trips['route_pattern_id']
    )
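
A small sketch of how this rule behaves, using made-up toy rows (the ids and counts below are illustrative and not taken from the feed). It shows that a comparison against NaN evaluates to False, so a trip without TransCAD counts keeps its original `route_pattern_id`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): a low-frequency trip, a frequent trip,
# and a trip with no TransCAD match (NaN counts).
toy = pd.DataFrame({
    'route_pattern_id': ['9-1-0', '9-_-0', '9-2-0'],
    'rpid':             ['9-_-0', '9-_-0', '9-_-0'],
    'Trips_AM': [1, 12, np.nan],
    'Trips_MD': [0, 20, np.nan],
    'Trips_PM': [2, 15, np.nan],
    'Trips_NT': [0, 9, np.nan],
})

period_cols = ['Trips_AM', 'Trips_MD', 'Trips_PM', 'Trips_NT']
# True only when every period has fewer than 3 trips; NaN < 3 is False.
low_frequency = (toy[period_cols] < 3).all(axis=1)
toy['rp_id_updated'] = np.where(low_frequency, toy['rpid'], toy['route_pattern_id'])
print(toy[['route_pattern_id', 'rpid', 'rp_id_updated']])
```

Only the first toy row is reassigned; the frequent trip and the unmatched trip keep their own ids.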

In [13]:
len(gtfsfeeds_dfs.trips['rp_id_updated'].unique())

820

In [14]:
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.drop(columns=list(routes_from_transcad.columns))
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.drop(columns=['unique_agency_id', 'unique_feed_id', 'bikes_allowed_desc', 'wheelchair_accessible_desc'])
gtfsfeeds_dfs.trips

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,rpid,rp_id_updated
0,1,BUS42018-hbc48wk1-Weekday-02,38230147,Harvard,,0,C01-21,010038,1,,1-_-0,1,1-_-0,1-_-0
1,1,BUS42018-hbc48wk1-Weekday-02,38230148,Harvard,,0,C01-21,010038,1,,1-_-0,1,1-_-0,1-_-0
2,1,BUS42018-hbc48wk1-Weekday-02,38230154,Harvard,,0,C01-16,010038,1,,1-_-0,1,1-_-0,1-_-0
3,1,BUS42018-hbc48wk1-Weekday-02,38230155,Harvard,,0,C01-6,010038,1,,1-_-0,1,1-_-0,1-_-0
4,1,BUS42018-hbc48wk1-Weekday-02,38230157,Harvard,,0,C01-20,010038,1,,1-_-0,1,1-_-0,1-_-0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17650,Red,RTL42018-hms48011-Weekday-01,38066722,Alewife,,1,S933_-52,933_0010,1,,Red-3-1,0,1-_-0,Red-3-1
17651,Red,RTL42018-hms48011-Weekday-01,38066723,Alewife,,1,S933_-52,933_0010,1,,Red-3-1,0,1-_-0,Red-3-1
17652,Red,RTL42018-hms48011-Weekday-01,38066730,Alewife,,1,S933_-43,933_0010,1,,Red-3-1,0,1-_-0,Red-3-1
17653,Red,RTL42018-hms48011-Weekday-01,38066735,Alewife,,1,S933_-40,933_0010,1,,Red-3-1,0,1-_-0,Red-3-1
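
In the table above, the Red Line rows carry rpid 1-_-0, which is not a Red Line pattern. One possible explanation, offered as an assumption rather than something the notebook confirms, is that trips with no TransCAD match have NaN in Route, Direction, and Sign, and pandas matches NaN join keys to each other. A minimal sketch of that behavior on hypothetical toy frames (two keys instead of three, for brevity):

```python
import numpy as np
import pandas as pd

# Toy trips (hypothetical): the second row had no TransCAD match, so its keys are NaN.
trips = pd.DataFrame({
    'trip_id': ['t1', 't2'],
    'Route': ['Blue', np.nan],
    'Sign': ['Wonderland', np.nan],
})
# Lookup table that still holds one all-NaN key row, as drop_duplicates would leave behind.
lookup = pd.DataFrame({
    'Route': ['Blue', np.nan],
    'Sign': ['Wonderland', np.nan],
    'rpid': ['Blue-6-1', '1-_-0'],
})

# NaN keys on the left match NaN keys on the right.
merged = trips.merge(lookup, how='left', on=['Route', 'Sign'])
# Dropping NaN-key rows from the lookup first leaves unmatched trips with a missing rpid.
safe = trips.merge(lookup.dropna(subset=['Route', 'Sign']), how='left', on=['Route', 'Sign'])
print(merged)
print(safe)
```

Because the consolidation step only takes rpid where the trip counts are non-null, these rows still end up with their original route_pattern_id, as the rp_id_updated column shows.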


In [15]:
ct_id = gtfsfeeds_dfs.trips
# Each key is a query comparing two id columns; each value explains the comparison.
comparisons = {
    'route_pattern_id == rpid': 'orig id == max id',
    'route_pattern_id != rpid': 'orig id != max id',
    'rp_id_updated == rpid': 'updated id == max id',
    'rp_id_updated != rpid': 'updated id != max id',
    'rp_id_updated == route_pattern_id': 'updated id == orig id',
    'rp_id_updated != route_pattern_id': 'updated id != orig id',
}
pd.DataFrame({
    'label': list(comparisons),
    'number of records': [len(ct_id.query(label)) for label in comparisons],
    'explanation': list(comparisons.values()),
})

Unnamed: 0,label,number of records,explanation
0,route_pattern_id == rpid,958,orig id == max id
1,route_pattern_id != rpid,16697,orig id != max id
2,rp_id_updated == rpid,1120,updated id == max id
3,rp_id_updated != rpid,16535,updated id != max id
4,rp_id_updated == route_pattern_id,17466,updated id == orig id
5,rp_id_updated != route_pattern_id,189,updated id != orig id


In [16]:
gtfsfeeds_dfs.trips['route_pattern_id'] = gtfsfeeds_dfs.trips['rp_id_updated']
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.drop(['rp_id_updated','rpid'], axis=1)

gtfsfeeds_dfs.trips

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed
0,1,BUS42018-hbc48wk1-Weekday-02,38230147,Harvard,,0,C01-21,010038,1,,1-_-0,1
1,1,BUS42018-hbc48wk1-Weekday-02,38230148,Harvard,,0,C01-21,010038,1,,1-_-0,1
2,1,BUS42018-hbc48wk1-Weekday-02,38230154,Harvard,,0,C01-16,010038,1,,1-_-0,1
3,1,BUS42018-hbc48wk1-Weekday-02,38230155,Harvard,,0,C01-6,010038,1,,1-_-0,1
4,1,BUS42018-hbc48wk1-Weekday-02,38230157,Harvard,,0,C01-20,010038,1,,1-_-0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
17650,Red,RTL42018-hms48011-Weekday-01,38066722,Alewife,,1,S933_-52,933_0010,1,,Red-3-1,0
17651,Red,RTL42018-hms48011-Weekday-01,38066723,Alewife,,1,S933_-52,933_0010,1,,Red-3-1,0
17652,Red,RTL42018-hms48011-Weekday-01,38066730,Alewife,,1,S933_-43,933_0010,1,,Red-3-1,0
17653,Red,RTL42018-hms48011-Weekday-01,38066735,Alewife,,1,S933_-40,933_0010,1,,Red-3-1,0


In [17]:
len(gtfsfeeds_dfs.trips['trip_id']) - len(gtfsfeeds_dfs.trips['trip_id'].unique())

0

In [18]:
len(gtfsfeeds_dfs.trips['route_pattern_id'].unique())

820

In [19]:
len(gtfsfeeds_dfs.trips['route_id'].unique())

203

In [None]:
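import os

# Overwrite trips.txt in the GTFS folder with the consolidated route_pattern_id values.
gtfsfeeds_dfs.trips.to_csv(os.path.join(path_to_gtfs, "trips.txt"), index=False)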

### Calculating Route Variation Trip Counts
The goal now is to count trips by route, time period, direction, and shape_id, and separately by route, time period, direction, and route_pattern_id. These counts make it possible to filter out rarely used route patterns, which helps identify the conflation issues that are relevant to the model.

Time of Day Periods:
- AM Peak: 6:30 AM to 9:30 AM
- MD: 9:30 AM to 3:00 PM
- PM Peak: 3:00 PM to 7:00 PM
- NT: 7:00 PM to 6:30 AM
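
The period boundaries above can be sketched as a small standalone function. This is an illustration only: `classify_tod` is a hypothetical helper name, and it compares whole minutes past midnight, which is one way to express the boundaries and not necessarily how the notebook encodes them.

```python
def classify_tod(gtfs_time):
    """Return 'AM', 'MD', 'PM', or 'NT' for a GTFS 'HH:MM:SS' time string."""
    hours, minutes = (int(part) for part in gtfs_time.split(':')[:2])
    # GTFS hours can be 24 or more for service that runs past midnight;
    # those values fall into NT here because they are later than 19:00.
    minutes_past_midnight = hours * 60 + minutes
    if 6 * 60 + 30 <= minutes_past_midnight < 9 * 60 + 30:
        return 'AM'
    if 9 * 60 + 30 <= minutes_past_midnight < 15 * 60:
        return 'MD'
    if 15 * 60 <= minutes_past_midnight < 19 * 60:
        return 'PM'
    return 'NT'


for example in ['06:29:00', '06:30:00', '09:30:00', '15:00:00', '19:00:00', '25:10:00']:
    print(example, classify_tod(example))
```

Each period includes its start time and excludes its end time, so 9:30 AM counts as MD and 7:00 PM counts as NT.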

In [None]:
# Encode arrival_time (HH:MM:SS) as a float HH.MM (e.g. 06:30:00 -> 6.3), so the period
# boundaries 6:30 and 9:30 become 6.3 and 9.3. Despite its name, time_integer holds floats.
time_parts = gtfsfeeds_dfs.stop_times['arrival_time'].astype('str').str.split(':')
hours = time_parts.str[0].astype('int64')
minutes = time_parts.str[1].astype('int64') / 100
gtfsfeeds_dfs.stop_times['time_integer'] = (hours + minutes).astype('float')
time_integer = gtfsfeeds_dfs.stop_times['time_integer']
# "0" flags any row that matches none of the four periods.
gtfsfeeds_dfs.stop_times['tod'] = np.select(
    [(time_integer >= 6.3) & (time_integer < 9.3),
     (time_integer >= 9.3) & (time_integer < 15),
     (time_integer >= 15) & (time_integer < 19),
     (time_integer >= 19) | (time_integer < 6.3)],
    ["AM", "MD", "PM", "NT"],
    default="0")

In [None]:
gtfsfeeds_dfs.stop_times.query('tod=="0"')

In [None]:
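# Label each trip with the time period of its first stop (lowest stop_sequence).
# Taking min() of the tod strings would instead pick the alphabetically smallest label.
first_stop_tod = gtfsfeeds_dfs.stop_times.sort_values(['trip_id', 'stop_sequence']).drop_duplicates('trip_id')[['trip_id', 'tod']]
trips = gtfsfeeds_dfs.trips.merge(first_stop_tod, on='trip_id')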
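
One caveat when reducing stop-level periods to a single period per trip: `min()` on the `tod` strings returns the alphabetically smallest label, not the earliest one. A minimal sketch on toy data (the values are hypothetical), assuming the intent is to label each trip by the period of its first stop:

```python
import pandas as pd

# Toy stop_times (hypothetical): trip 'a' starts in the PM peak and ends at night;
# trip 'b' has a stop_sequence that does not start at 1.
stop_times = pd.DataFrame({
    'trip_id': ['a', 'a', 'a', 'b', 'b'],
    'stop_sequence': [1, 2, 3, 5, 6],
    'tod': ['PM', 'PM', 'NT', 'AM', 'AM'],
})

# Alphabetical minimum of the labels: 'NT' sorts before 'PM', so trip 'a' becomes NT.
tod_by_min = stop_times.groupby('trip_id')['tod'].min()

# Period at the first stop: sort by stop_sequence and keep the first row per trip.
tod_by_first_stop = (stop_times.sort_values(['trip_id', 'stop_sequence'])
                     .drop_duplicates('trip_id')
                     .set_index('trip_id')['tod'])

print(tod_by_min)
print(tod_by_first_stop)
```

Sorting and keeping the first row also avoids assuming that stop_sequence starts at 1, which GTFS does not require.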

In [None]:
trips_count = trips.groupby(by=['route_id' , 'direction_id', 'route_pattern_id','service_id','tod'])['shape_id'].count().reset_index()
trips_count.to_csv(r"C:\Users\matkinson.AD\Downloads\route_pattern_breakdown.csv")
trips_count

Notes:

The following two service_ids serve identical routes and numbers of trips:
- BUS42018-hbc48fr1-Weekday-02
- BUS42018-hbc48wk1-Weekday-02

Their trips overlap in time. However, hbc48wk1 appears to be for Monday through Thursday and hbc48fr1 for Friday, which may allow more flexibility around holidays. This is the main issue to consider for trips that have different ids but the same pattern and stop_times (the same trip under a different service_id) when days are conflated (for example, treating all Tuesdays alike instead of using a specific date).

Both of these schedules run MTWTHF and are taken out of service on specific dates in the optional calendar_dates.txt, where an exception_type of 2 means service is removed and 1 means service is added. See https://multigtfs.readthedocs.io/en/latest/gtfs.html for documentation of this schema.

20180918 (September 18, 2018) seems to be a "regular" date in calendar_dates.txt, meaning that only BUS42018-hbc48fr1-Weekday-02 is turned off for the day.
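
The way calendar_dates.txt exceptions combine with the weekly schedule can be sketched as follows. This is a simplified illustration under stated assumptions: the weekly pattern is hard-coded from the note above instead of being read from the feed, service start and end dates are ignored, and `active_services` is a hypothetical helper name.

```python
import datetime

# Weekly schedule as described in the notes: both services run Monday through Friday.
# Weekday numbers follow Python's convention, Monday = 0.
calendar = {
    'BUS42018-hbc48wk1-Weekday-02': {0, 1, 2, 3, 4},
    'BUS42018-hbc48fr1-Weekday-02': {0, 1, 2, 3, 4},
}
# calendar_dates rows as (service_id, date, exception_type); 1 = added, 2 = removed.
calendar_dates = [
    ('BUS42018-hbc48fr1-Weekday-02', '20180918', 2),
]


def active_services(date_str):
    """Return the set of service_ids running on a YYYYMMDD date."""
    weekday = datetime.datetime.strptime(date_str, '%Y%m%d').weekday()
    active = {sid for sid, days in calendar.items() if weekday in days}
    for sid, exception_date, exception_type in calendar_dates:
        if exception_date != date_str:
            continue
        if exception_type == 1:
            active.add(sid)
        elif exception_type == 2:
            active.discard(sid)
    return active


print(active_services('20180918'))
```

Selecting a specific date this way leaves only one of the two otherwise identical service_ids active, which avoids counting the same trip twice.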

## B Line Checks

In [None]:
bline_trips = gtfsfeeds_dfs.trips.query('route_id == "Green-B"')['trip_id']
gtfsfeeds_dfs.trips.query('route_id == "Green-B"')

In [None]:
bline_stops = gtfsfeeds_dfs.stop_times.query('trip_id in @bline_trips')['stop_id']
gtfsfeeds_dfs.stop_times.query('trip_id in @bline_trips')

In [None]:
gtfsfeeds_dfs.stops.query('stop_id in @bline_stops')

In [None]:
gtfsfeeds_dfs.stops.query('stop_id in @bline_stops').to_csv(
    r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Network_Release_Process\2022_FirstRelease\BLine_Green_Stops_2018GTFS.csv")