## Edit GTFS as a System
Margaret Atkinson 10/25/22

Overall Goal: Conflate GTFS to TransCAD links

    Issue: Too many route variations
        First Sub-Issue Goal: Get the number of trips per time period for each route variation

Notes:

Although this 2018 Recap GTFS feed has already been cleaned by the R script itsleeds_gleangtfs.R and imported to the link layer in TransCAD, I am starting from the cleaned GTFS (before import to the link layer) so that I can work with the data as one connected system rather than as disparate parts.

In [1]:
import matplotlib
matplotlib.use('agg')  # allows notebook to be tested in Travis

import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import cartopy
import matplotlib.pyplot as plt
import pandana as pdna
import time

import urbanaccess as ua
from urbanaccess.config import settings
from urbanaccess.gtfsfeeds import feeds
from urbanaccess import gtfsfeeds
from urbanaccess.gtfs.gtfsfeeds_dataframe import gtfsfeeds_dfs
from urbanaccess.network import ua_network, load_network

%matplotlib inline

Example of GTFS in UrbanAccess here:

https://github.com/UDST/urbanaccess/blob/dev/demo/simple_example.ipynb

In [2]:
# required bbox including all of Massachusetts and RI as well as parts of NH, CT, NY
bbox = (-73.7207, 41.1198, -69.7876, 43.1161)
# path to the downloaded and cleaned gtfs - mbta recap file for fall 2018
#   this could also be a folder of gtfs folders (pre merge of multiple gtfs)
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_its_clean"
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_102418"
path_to_gtfs = r"C:\Users\matkinson.AD\Downloads\Nov11_Sandbox\mbta2018_102418_20221108"

date = "20181024"

In [3]:
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(gtfsfeed_path=path_to_gtfs,
                                           validation=True,
                                           verbose=True,
                                           bbox=bbox,
                                           remove_stops_outsidebbox=False,
                                           append_definitions=True)

Checking GTFS text file header whitespace... Reading files using encoding: utf-8 set in configuration.
GTFS text file header whitespace check completed. Took 0.15 seconds
--------------------------------
Processing GTFS feed: mbta2018_102418_20221108
The unique agency id: mbta was generated using the name of the agency in the agency.txt file.
Unique agency id operation complete. Took 0.02 seconds
Unique GTFS feed id operation complete. Took 0.00 seconds
No GTFS feed stops were found to be outside the bounding box coordinates
mbta2018_102418_20221108 GTFS feed stops: coordinates are in northwest hemisphere. Latitude = North (90); Longitude = West (-90).
Appended route type to stops
Appended route type to stop_times
--------------------------------
Added descriptive definitions to stops, routes, stop_times, and trips tables
Successfully converted ['departure_time'] to seconds past midnight and appended new columns to stop_times. Took 1.94 seconds
1 GTFS feed file(s) successfully read as 

## Update Pattern ID in Trips Table (Consolidate)

In [4]:
routes_from_transcad = pd.read_csv("J:\\Shared drives\\TMD_TSA\\Programs\\MID\\Networks\\Research_Development\\Transit_Networks\\gtfs_to_transcad\\GTFS_Imported_to_TransCAD.csv")
routes_from_transcad.head()

Unnamed: 0,Route_ID,Route_Name,Route,Short Name,Long Name,Description,Mode,Trip,Sign,Service,Length,Direction
0,1,170-F-1,170,170,Waltham - Dudley,Commuter Bus,3,38173968,Waltham Central Square (Express),BUS42018-hba48011-Weekday-02,20.8165,F
1,2,170-R-1,170,170,Waltham - Dudley,Commuter Bus,3,38174309,Dudley (Express),BUS42018-hba48011-Weekday-02,20.4348,R
2,3,23-R-1,23,23,Ashmont - Ruggles via Washington Street,Key Bus,3,38175312_1,Watertown Yard via Dudley,BUS42018-hba48011-Weekday-02,0.965417,R
3,4,4-F-1,4,4,North Station - Tide Street,Commuter Bus,3,38174928,North Station,BUS42018-hba48011-Weekday-02,2.31999,F
4,5,4-F-2,4,4,North Station - Tide Street,Commuter Bus,3,38174920,North Station via South Station,BUS42018-hba48011-Weekday-02,3.02948,F


In [5]:
# are the trip ids unique?
len(gtfsfeeds_dfs.trips['trip_id']) - len(gtfsfeeds_dfs.trips['trip_id'].unique())

0

In [6]:
len(gtfsfeeds_dfs.trips['route_pattern_id'].unique())

945

In [7]:
#gtfsfeeds_dfs.trips.merge(routes_from_transcad, how="left", left_on="trip_id", right_on="Trip")

consolidation_tab = routes_from_transcad.merge(gtfsfeeds_dfs.trips, how="left", left_on="Trip", right_on="trip_id")
consolidation_tab['new_id'] = consolidation_tab['Route'].astype('str') + ':' + consolidation_tab['Length'].round(1).astype('str') + ':' + consolidation_tab['Direction'].astype('str')
consolidation_tab.head()

Unnamed: 0,Route_ID,Route_Name,Route,Short Name,Long Name,Description,Mode,Trip,Sign,Service,...,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,unique_agency_id,unique_feed_id,bikes_allowed_desc,wheelchair_accessible_desc,new_id
0,1,170-F-1,170,170,Waltham - Dudley,Commuter Bus,3,38173968,Waltham Central Square (Express),BUS42018-hba48011-Weekday-02,...,1700028,1.0,,170-_-0,1.0,mbta,mbta2018_102418_20221108_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,170:20.8:F
1,2,170-R-1,170,170,Waltham - Dudley,Commuter Bus,3,38174309,Dudley (Express),BUS42018-hba48011-Weekday-02,...,1700032,1.0,,170-3-1,1.0,mbta,mbta2018_102418_20221108_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,170:20.4:R
2,3,23-R-1,23,23,Ashmont - Ruggles via Washington Street,Key Bus,3,38175312_1,Watertown Yard via Dudley,BUS42018-hba48011-Weekday-02,...,660101-1,1.0,,23-G-1,1.0,mbta,mbta2018_102418_20221108_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,23:1.0:R
3,4,4-F-1,4,4,North Station - Tide Street,Commuter Bus,3,38174928,North Station,BUS42018-hba48011-Weekday-02,...,040033,1.0,,4-1-0,1.0,mbta,mbta2018_102418_20221108_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,4:2.3:F
4,5,4-F-2,4,4,North Station - Tide Street,Commuter Bus,3,38174920,North Station via South Station,BUS42018-hba48011-Weekday-02,...,040034,1.0,,4-_-0,1.0,mbta,mbta2018_102418_20221108_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,4:3.0:F


In [None]:
lookuptab = consolidation_tab.groupby(by="new_id").first().reset_index()[['new_id','route_pattern_id']]
lookuptab = lookuptab.rename(columns = {'route_pattern_id':'rpid'})
lookuptab
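
To make the consolidation step easier to follow, here is a minimal sketch on toy data. The three rows and their pattern IDs are hypothetical; only the key construction (route, length rounded to one decimal, direction) and the use of the first pattern ID per key mirror the cells above.

```python
import pandas as pd

# Hypothetical toy data: three variations of route 4, two of which share
# the same rounded length and direction.
toy = pd.DataFrame({
    "Route": [4, 4, 4],
    "Length": [2.31999, 2.28, 3.02948],
    "Direction": ["F", "F", "F"],
    "route_pattern_id": ["4-1-0", "4-2-0", "4-_-0"],
})

# Same composite key as the notebook: route, length rounded to 0.1, direction.
toy["new_id"] = (
    toy["Route"].astype("str") + ":"
    + toy["Length"].round(1).astype("str") + ":"
    + toy["Direction"].astype("str")
)

# The first pattern ID seen for each key becomes the consolidated ID (rpid).
toy_lookup = (
    toy.groupby("new_id")["route_pattern_id"].first()
    .reset_index()
    .rename(columns={"route_pattern_id": "rpid"})
)
print(toy_lookup)
```

Because the key depends on a rounded length, two near-identical variations whose lengths fall on either side of a rounding boundary would stay separate, so the consolidated count is an upper bound on the number of distinct variations under this definition.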

In [None]:
ct = consolidation_tab.merge(lookuptab,on="new_id", how='left')
#ct['route_pattern_id'] = ct['rpid']

ct = ct.drop(list(routes_from_transcad.columns), axis=1)
ct = ct.drop(['new_id', 'unique_agency_id', 'unique_feed_id', 'bikes_allowed_desc', 'wheelchair_accessible_desc'], axis=1)
ct[['route_pattern_id','rpid']]
#ct

In [None]:
ct.query('route_pattern_id != rpid')

In [None]:
len(gtfsfeeds_dfs.trips['trip_id']) - len(gtfsfeeds_dfs.trips['trip_id'].unique())

In [None]:
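# Attach the consolidated ID (rpid) to every trip via its original route_pattern_id.
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.merge(ct[['route_pattern_id','rpid']].drop_duplicates(), on="route_pattern_id", how='left')
# Keep the original ID for any pattern with no match in the lookup instead of overwriting it with NaN.
gtfsfeeds_dfs.trips['route_pattern_id'] = gtfsfeeds_dfs.trips['rpid'].fillna(gtfsfeeds_dfs.trips['route_pattern_id'])
gtfsfeeds_dfs.trips = gtfsfeeds_dfs.trips.drop(['rpid'], axis=1)
gtfsfeeds_dfs.trips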

In [None]:
len(gtfsfeeds_dfs.trips['trip_id']) - len(gtfsfeeds_dfs.trips['trip_id'].unique())

In [None]:
len(gtfsfeeds_dfs.trips['route_pattern_id'].unique())

In [None]:
len(gtfsfeeds_dfs.trips['route_id'].unique())

In [None]:
gtfsfeeds_dfs.trips.to_csv(r"C:\Users\matkinson.AD\Downloads\mbta2018_102418\trips.txt", index=False)

### Calculating Route Variation Trip Counts
Now the goal is to count trips two ways: (1) by route, time period, direction, and shape_id, and (2) by route, time period, direction, and route_pattern_id. The idea is to be able to filter out rarely used route patterns, which will help identify the conflation issues that are relevant to the model.

Time of Day Periods:
- AM Peak: 6:30 AM to 9:30 AM
- MD: 9:30 AM to 3:00 PM
- PM Peak: 3:00 PM to 7:00 PM
- NT: 7:00 PM to 6:30 AM
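
As a minimal sketch, the period boundaries above can also be written as a standalone function that works in minutes past midnight. The function name `time_period` is hypothetical and not part of the notebook or of UrbanAccess; it assumes GTFS-style "HH:MM:SS" strings and, like the notebook cell, treats any time at or after 19:00 (including GTFS hours of 24 and above) as NT.

```python
def time_period(gtfs_time: str) -> str:
    """Classify a GTFS "HH:MM:SS" time string as AM, MD, PM, or NT."""
    hours, minutes = (int(part) for part in gtfs_time.split(":")[:2])
    # GTFS allows hours >= 24 for trips running past midnight; those values
    # stay above 19:00 here and therefore fall into NT.
    minutes_past_midnight = hours * 60 + minutes
    if 6 * 60 + 30 <= minutes_past_midnight < 9 * 60 + 30:
        return "AM"
    if 9 * 60 + 30 <= minutes_past_midnight < 15 * 60:
        return "MD"
    if 15 * 60 <= minutes_past_midnight < 19 * 60:
        return "PM"
    return "NT"


for example in ["06:30:00", "09:30:00", "15:00:00", "19:00:00", "25:10:00"]:
    print(example, time_period(example))
```

Working in whole minutes avoids comparing floating-point values at the period boundaries.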

In [None]:
# Encode arrival_time ("HH:MM:SS") as a pseudo-decimal HH.MM value, e.g. "06:30:00" -> 6.30.
# This is not fractional hours; it only preserves ordering, so the period
# boundaries below are written the same way (6.3 = 6:30 AM, 9.3 = 9:30 AM).
time_parts = gtfsfeeds_dfs.stop_times['arrival_time'].astype('str').str.split(':')
hours = time_parts.apply(lambda x: x[0]).astype('int64')
minutes = time_parts.apply(lambda x: x[1]).astype('int64') / 100
# round(2) guards against floating-point drift at the period boundaries
gtfsfeeds_dfs.stop_times['time_integer'] = (hours + minutes).astype('float').round(2)
time_integer = gtfsfeeds_dfs.stop_times['time_integer']
# Assign each stop_time to a time-of-day period; "0" flags any row that matched no period.
gtfsfeeds_dfs.stop_times['tod'] = np.where(
    (time_integer >= 6.3) & (time_integer < 9.3), "AM",
    np.where(
        (time_integer >= 9.3) & (time_integer < 15), "MD",
        np.where(
            (time_integer >= 15) & (time_integer < 19), "PM",
            np.where(
                (time_integer >= 19) | (time_integer < 6.3), "NT", "0"
        ))))

In [None]:
gtfsfeeds_dfs.stop_times.query('tod=="0"')

In [None]:
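# Use the tod of each trip's first stop; min() on the tod strings is alphabetical (AM < MD < NT < PM), not chronological.
trips = gtfsfeeds_dfs.trips.merge(gtfsfeeds_dfs.stop_times.sort_values(['trip_id', 'stop_sequence']).groupby(by=['trip_id'])['tod'].first().reset_index()[['trip_id','tod']], on='trip_id')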

In [None]:
trips_count = trips.groupby(by=['route_id' , 'direction_id', 'route_pattern_id','service_id','tod'])['shape_id'].count().reset_index()
trips_count.to_csv(r"C:\Users\matkinson.AD\Downloads\route_pattern_breakdown.csv")
trips_count
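
One possible way to filter out rarely used route patterns from a table shaped like trips_count is sketched below. The toy counts, the `share` column, and the 10% cutoff (`MIN_SHARE`) are all assumptions for illustration and do not come from the notebook; service_id is left out of the grouping to keep the example short.

```python
import pandas as pd

# Hypothetical counts shaped like trips_count; as in the notebook,
# the "shape_id" column holds the number of trips.
toy_counts = pd.DataFrame({
    "route_id": ["4", "4", "4", "170"],
    "direction_id": [0, 0, 0, 1],
    "route_pattern_id": ["4-1-0", "4-_-0", "4-2-0", "170-3-1"],
    "tod": ["AM", "AM", "AM", "PM"],
    "shape_id": [18, 1, 1, 3],
})

# Share of each pattern within its route, direction, and time period.
group_cols = ["route_id", "direction_id", "tod"]
group_totals = toy_counts.groupby(group_cols)["shape_id"].transform("sum")
toy_counts["share"] = toy_counts["shape_id"] / group_totals

MIN_SHARE = 0.10  # hypothetical cutoff, not from the notebook
common = toy_counts[toy_counts["share"] >= MIN_SHARE]
print(common[["route_id", "route_pattern_id", "tod", "shape_id", "share"]])
```

A relative share is used here rather than an absolute trip count so that low-frequency routes are not dropped entirely; an absolute threshold would be an equally reasonable choice.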

Notes:

The following two service_ids serve identical routes with identical numbers of trips:
- BUS42018-hbc48fr1-Weekday-02
- BUS42018-hbc48wk1-Weekday-02

Their trips overlap in time. However, hbc48wk1 seems to be for Monday through Thursday and hbc48fr1 for Friday, which may allow more flexibility around holidays. This is the main issue with trips that have different ids but the same pattern and stop_times (i.e., the same trip under a different service_id) when days are conflated (e.g., "Tuesdays" instead of a specific date).

Both schedules are defined to run Monday through Friday and are taken out of service on specific dates in the optional calendar_dates.txt, where an exception_type of 2 means service is removed and 1 means service is added. See https://multigtfs.readthedocs.io/en/latest/gtfs.html for documentation of this schema.

20180918 (September 18, 2018) seems to be a "regular" date in calendar_dates.txt, meaning that ONLY BUS42018-hbc48fr1-Weekday-02 is turned off for that day.
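
The interaction between calendar.txt and calendar_dates.txt described above can be sketched as follows. The service IDs `svc_wk` and `svc_fr`, the date ranges, and the exception rows are hypothetical stand-ins, not values read from the MBTA feed; only the rule (exception_type 1 adds service, 2 removes it, otherwise the weekday flags apply) follows the GTFS schema.

```python
from datetime import date

# Hypothetical calendar.txt rows:
# service_id -> (weekday flags Monday..Sunday, start_date, end_date)
calendar = {
    "svc_wk": ((1, 1, 1, 1, 1, 0, 0), date(2018, 9, 2), date(2018, 12, 22)),
    "svc_fr": ((1, 1, 1, 1, 1, 0, 0), date(2018, 9, 2), date(2018, 12, 22)),
}

# Hypothetical calendar_dates.txt rows: (service_id, date) -> exception_type
calendar_dates = {
    ("svc_fr", date(2018, 9, 18)): 2,  # removed on a Tuesday
    ("svc_wk", date(2018, 9, 21)): 2,  # removed on a Friday
}


def is_active(service_id, day):
    """Return True if the service runs on the given day."""
    exception = calendar_dates.get((service_id, day))
    if exception == 1:  # service added for this date
        return True
    if exception == 2:  # service removed for this date
        return False
    flags, start, end = calendar[service_id]
    return start <= day <= end and bool(flags[day.weekday()])


def active_services(day):
    """List the service_ids that run on the given day."""
    return sorted(s for s in calendar if is_active(s, day))


print(active_services(date(2018, 9, 18)))  # a Tuesday
print(active_services(date(2018, 9, 21)))  # a Friday
```

Resolving services for one specific date this way keeps two otherwise identical weekday schedules from being double counted, which is the risk when days are conflated.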

## B Line Checks

In [None]:
bline_trips = gtfsfeeds_dfs.trips.query('route_id == "Green-B"')['trip_id']
gtfsfeeds_dfs.trips.query('route_id == "Green-B"')

In [None]:
bline_stops = gtfsfeeds_dfs.stop_times.query('trip_id in @bline_trips')['stop_id']
gtfsfeeds_dfs.stop_times.query('trip_id in @bline_trips')

In [None]:
gtfsfeeds_dfs.stops.query('stop_id in @bline_stops')

In [None]:
gtfsfeeds_dfs.stops.query('stop_id in @bline_stops').to_csv(
    r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Network_Release_Process\2022_FirstRelease\BLine_Green_Stops_2018GTFS.csv")