## Edit GTFS as a System
Margaret Atkinson 10/25/22

Overall Goal: Conflate GTFS to TransCAD links

    Issue: Too many route variations
        First Sub-Issue Goal: Get the number of trips per time period for each route variation

Notes:

While this 2018 Recap GTFS has been cleaned by the R script itsleeds_gleangtfs.R and imported to the link layer in TransCAD, I am starting from the cleaned GTFS (not imported to link layer) so that I can easily connect all this data together in a system instead of disparate parts.

In [1]:
import matplotlib
matplotlib.use('agg')  # allows notebook to be tested in Travis

import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import cartopy
import matplotlib.pyplot as plt
import pandana as pdna
import time

import urbanaccess as ua
from urbanaccess.config import settings
from urbanaccess.gtfsfeeds import feeds
from urbanaccess import gtfsfeeds
from urbanaccess.gtfs.gtfsfeeds_dataframe import gtfsfeeds_dfs
from urbanaccess.network import ua_network, load_network

%matplotlib inline

Example of GTFS in UrbanAccess here:

https://github.com/UDST/urbanaccess/blob/dev/demo/simple_example.ipynb

In [2]:
# required bbox including all of Massachusetts and RI as well as parts of NH, CT, NY
bbox = (-73.7207, 41.1198, -69.7876, 43.1161)
# path to the downloaded and cleaned gtfs - mbta recap file for fall 2018
#   this could also be a folder of gtfs folders (pre merge of multiple gtfs)
#path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_its_clean"
path_to_gtfs = r"J:\Shared drives\TMD_TSA\Programs\MID\Networks\Research_Development\Transit_Networks\gtfs_to_transcad\mbta2018_clean_trips_filter"

In [3]:
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(gtfsfeed_path= path_to_gtfs,
                                           validation=True,
                                           verbose=True,
                                           bbox=bbox,
                                           remove_stops_outsidebbox=False,
                                           append_definitions=True)

Checking GTFS text file header whitespace... Reading files using encoding: utf-8 set in configuration.
GTFS text file header whitespace check completed. Took 1.19 seconds
--------------------------------
Processing GTFS feed: mbta2018_clean_trips_filter
The unique agency id: mbta was generated using the name of the agency in the agency.txt file.
Unique agency id operation complete. Took 0.06 seconds
Unique GTFS feed id operation complete. Took 0.00 seconds
No GTFS feed stops were found to be outside the bounding box coordinates
mbta2018_clean_trips_filter GTFS feed stops: coordinates are in northwest hemisphere. Latitude = North (90); Longitude = West (-90).
Appended route type to stops
Appended route type to stop_times
--------------------------------
Added descriptive definitions to stops, routes, stop_times, and trips tables
Successfully converted ['departure_time'] to seconds past midnight and appended new columns to stop_times. Took 3.81 seconds
1 GTFS feed file(s) successfully re

### Calculating Route Variation Trip Counts
Now the goal is to get the number of trips per ROUTE, time period, direction, and shape_id and by ROUTE, time period, direction, and route_pattern_id. The idea is to be able to filter out uncommonly used route patterns, which will help with identifying conflation issues that are relevant to the model.

Time of Day Periods
- AM Peak - 6:30 AM to 9:30 AM
- MD - 9:30 AM - 3:00 PM
- PM Peak - 3:00 PM - 7:00 PM
- NT - 7:00 PM - 6:30 AM

In [None]:
gtfsfeeds_dfs.trips['tod'] = 0

In [None]:
'''#calculate time of day period for each trip
for x in gtfsfeeds_dfs.trips.trip_id:
    stop_times_trip = gtfsfeeds_dfs.stop_times.query('trip_id == @x & stop_sequence == 1').arrival_time.array
    time_int = int(stop_times_trip[0].split(":")[0]) +int( stop_times_trip[0].split(":")[1])/100
    if time_int  > 6.3 and time_int < 9.3:
        time_flag = 'AM'
    elif time_int  > 9.3 and time_int < 15:
        time_flag = 'MD'
    elif time_int  > 15 and time_int < 19:
        time_flag = 'PM'
    elif time_int  > 19 or time_int < 6.3:
        time_flag = 'NT'
    
    row = gtfsfeeds_dfs.trips.query('trip_id == @x').index[0]
    gtfsfeeds_dfs.trips.loc[row,'tod'] = time_flag'''

In [4]:
hours = gtfsfeeds_dfs.stop_times['arrival_time'].astype('str').str.split(':').apply(lambda x: x[0]).astype('int64') 
minutes = (gtfsfeeds_dfs.stop_times['arrival_time'].astype('str').str.split(':').apply(lambda x: x[1]).astype('int64')/100)
gtfsfeeds_dfs.stop_times['time_integer'] = (hours+minutes).astype('float')
gtfsfeeds_dfs.stop_times['tod'] = np.where(
    ((gtfsfeeds_dfs.stop_times['time_integer'] > 6.3) & gtfsfeeds_dfs.stop_times['time_integer']< 9.3), "AM", 
    np.where(
        ((gtfsfeeds_dfs.stop_times['time_integer'] > 9.3) & gtfsfeeds_dfs.stop_times['time_integer']< 15), "MD",
        np.where(
            ((gtfsfeeds_dfs.stop_times['time_integer'] > 15) & gtfsfeeds_dfs.stop_times['time_integer']< 19), "PM",
            np.where(
                ((gtfsfeeds_dfs.stop_times['time_integer'] > 19) | gtfsfeeds_dfs.stop_times['time_integer']< 6.3), "NT", "0"
            
        ))))

In [5]:
gtfsfeeds_dfs.stop_times.query('tod=="0"')

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,timepoint,checkpoint_id,unique_agency_id,unique_feed_id,route_type,pickup_type_desc,drop_off_type_desc,timepoint_desc,departure_time_sec,time_integer,tod


In [8]:
trips = gtfsfeeds_dfs.trips.merge(gtfsfeeds_dfs.stop_times.query('stop_sequence == 1')[['trip_id','tod']], on='trip_id')

In [13]:
trips_count = trips.groupby(by=['route_id' , 'direction_id', 'route_pattern_id','service_id','tod'])['shape_id'].count().reset_index()
trips_count.to_csv(r"C:\Users\matkinson.AD\Downloads\route_pattern_breakdown.csv")
trips_count

Unnamed: 0,route_id,direction_id,route_pattern_id,service_id,tod,shape_id
0,1,0,1-_-0,BUS42018-hbc48fr1-Weekday-02,AM,114
1,1,0,1-_-0,BUS42018-hbc48wk1-Weekday-02,AM,114
2,1,1,1-_-1,BUS42018-hbc48fr1-Weekday-02,AM,109
3,1,1,1-_-1,BUS42018-hbc48wk1-Weekday-02,AM,109
4,10,0,10-5-0,BUS42018-hbc48fr1-Weekday-02,AM,1
...,...,...,...,...,...,...
1097,Orange,0,Orange-3-0,RTL42018-hmo48011-Weekday-01,AM,150
1098,Orange,1,Orange-3-1,RTL42018-hmo48011-Weekday-01,AM,153
1099,Red,0,Red-1-0,RTL42018-hms48011-Weekday-01,AM,109
1100,Red,0,Red-3-0,RTL42018-hms48011-Weekday-01,AM,114


Notes:

The two following service_ids are identical in the routes and number of trips they serve:
- BUS42018-hbc48fr1-Weekday-02	
- BUS42018-hbc48wk1-Weekday-02

They do have overlap in when trips are running - however, hbc48wk1 seems to be for MTWTH and hbc48fr1 for F. Maybe it allows more flexibility around holidays. This is the main issue when thinking about trips with different ids but the same pattern and stop_times (e.g. same trip, different service_id) when days are conflated (e.g. Tuesdays instead of a specific date)

Both of these schedules run MTWTHF and are taken out of service in the optional calendar_dates.txt where 2 means subtracted and 1 means added. See : https://multigtfs.readthedocs.io/en/latest/gtfs.html for documentation of this schema. 

20180918 (September 18, 2018) seems to be a "regular" date in calendar_dates.txt - meaning that ONLY BUS42018-hbc48fr1-Weekday-02	 is turned off for the day.