# Consolidating Route Patterns in GTFS

This step runs after the R script that filters the trips, schedules, etc. by a chosen date.
It originally came after an import to TransCAD. This notebook is an alternate approach that consolidates solely from the GTFS, instead of trying to join the calculated trips per Route_Name back to the route_pattern_id. There isn't a great field to join back on: we don't know exactly how TransCAD translates the GTFS information into Route_Names or how it consolidates routes, and the number of unique route patterns differs between TransCAD and GTFS, with GTFS having fewer unique route patterns in its trip table (both counted after the R filtering).

As discussed with Marty and Sabiheh, the goal of this consolidation step is to reassign route patterns with low numbers of trips across the day (fewer than three in each time period) to the route_pattern_id with the most trips within the matching Route, Direction, and Headsign. This carries down to the trip level: if a trip's route pattern is consolidated, its route_pattern_id is replaced with the one that has the most trips.

All trips with the same headsign/route/direction are assigned the same route_pattern_id. The route_pattern_id that represents the group is the one with the most trips per day.
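A minimal sketch of this rule on a toy trips table, assuming the columns `trip_id`, `route_id`, `trip_headsign`, `direction_id`, and `route_pattern_id`. The data are made up, and using `mode()` to pick the most frequent id is an illustrative shortcut, not the notebook's own method.

```python
import pandas as pd

# Toy trips table (made-up values): two patterns share one route/headsign/direction
toy_trips = pd.DataFrame({
    'trip_id': ['t1', 't2', 't3', 't4', 't5'],
    'route_id': ['X'] * 5,
    'trip_headsign': ['Downtown'] * 5,
    'direction_id': [0] * 5,
    'route_pattern_id': ['X-1-0', 'X-1-0', 'X-1-0', 'X-2-0', 'X-2-0'],
})
group_cols = ['route_id', 'trip_headsign', 'direction_id']

# Most frequent pattern id within each group, broadcast back to every trip.
# mode() returns its values sorted, so a tie resolves to the smallest id.
toy_trips['route_pattern_id'] = (
    toy_trips.groupby(group_cols)['route_pattern_id']
    .transform(lambda ids: ids.mode().iloc[0])
)
print(toy_trips)
```

Every trip in the group ends up with `X-1-0`, the pattern that had three of the five trips.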

If a group has fewer than 3 trips in every TOD period across the day, assign it the route_pattern_id of the route/direction group with the most trips.

Calculate TOD on a 3:00 to 3:00 day, not 6:30 to 30:29.

Reorder stop times by trip_id at the end.

In [1]:
import time
import datetime

import matplotlib
matplotlib.use('agg')  # allows notebook to be tested in Travis

import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import cartopy
import matplotlib.pyplot as plt
import pandana as pdna

import urbanaccess as ua
from urbanaccess.config import settings
from urbanaccess.gtfsfeeds import feeds
from urbanaccess import gtfsfeeds
from urbanaccess.gtfs.gtfsfeeds_dataframe import gtfsfeeds_dfs
from urbanaccess.network import ua_network, load_network

%matplotlib inline

In [2]:
# required bbox including all of Massachusetts and RI as well as parts of NH, CT, NY
bbox = (-73.7207, 41.1198, -69.7876, 43.1161)
# path to the downloaded and cleaned gtfs - mbta recap file for fall 2018
#   this could also be a folder of gtfs folders (pre merge of multiple gtfs)

path_to_gtfs = r"C:\Users\matkinson.AD\Downloads\Nov12_Sandbox\Part_2_GTFS_R\mbta2018_102418_20221109"

In [3]:
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(gtfsfeed_path= path_to_gtfs,
                                           validation=True,
                                           verbose=True,
                                           bbox=bbox,
                                           remove_stops_outsidebbox=False,
                                           append_definitions=True)

Checking GTFS text file header whitespace... Reading files using encoding: utf-8 set in configuration.
GTFS text file header whitespace check completed. Took 0.10 seconds
--------------------------------
Processing GTFS feed: mbta2018_102418_20221109
The unique agency id: mbta was generated using the name of the agency in the agency.txt file.
Unique agency id operation complete. Took 0.01 seconds
Unique GTFS feed id operation complete. Took 0.00 seconds
No GTFS feed stops were found to be outside the bounding box coordinates
mbta2018_102418_20221109 GTFS feed stops: coordinates are in northwest hemisphere. Latitude = North (90); Longitude = West (-90).
Appended route type to stops
Appended route type to stop_times
--------------------------------
Added descriptive definitions to stops, routes, stop_times, and trips tables
Successfully converted ['departure_time'] to seconds past midnight and appended new columns to stop_times. Took 0.97 seconds
1 GTFS feed file(s) successfully read as 

Needs/Steps:
- Number of trips per time period per route_pattern_id
    - Midpoint time of each trip
    - Each trip classified by TOD (based on midpoint)
    - Sum of trips per TOD by route_pattern_id
- route_pattern_id with most daily trips per Route, Direction, Headsign
    - Sum all tod trips per route_pattern_id
    - grab just the max per Route, Direction, Headsign (but keep route_pattern_id)
- consolidate route patterns by Route, Direction, Headsign
    - if route_pattern_id has less than 3 trips in each of the 4 TODs, replace with max trips route_pattern_id
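The first group of steps (trip midpoint, then TOD) can be sketched without a per-trip loop. This is a minimal sketch on a toy `stop_times` table, assuming the columns `trip_id`, `stop_sequence`, and `arrival_time`. The helper names `to_hours` and `trip_midpoints` are hypothetical and the data are made up.

```python
import numpy as np
import pandas as pd

# Toy stop_times table (made-up values, not from the MBTA feed)
toy_stop_times = pd.DataFrame({
    'trip_id':       ['a', 'a', 'a', 'b', 'b'],
    'stop_sequence': [1, 2, 3, 1, 2],
    'arrival_time':  ['06:40:00', '07:00:00', '07:20:00', '23:50:00', '24:30:00'],
})

def to_hours(times):
    """'HH:MM:SS' strings -> decimal hours; GTFS hours may exceed 23."""
    parts = times.str.split(':', expand=True).astype(int)
    return parts[0] + parts[1] / 60 + parts[2] / 3600

def trip_midpoints(stop_times):
    """One row per trip: first and last arrival (decimal hours) and the midpoint."""
    ordered = stop_times.sort_values(['trip_id', 'stop_sequence'])
    ordered = ordered.assign(hours=to_hours(ordered['arrival_time']))
    ends = ordered.groupby('trip_id')['hours'].agg(start='first', end='last')
    ends['midpoint'] = (ends['start'] + ends['end']) / 2
    return ends

trip_summary = trip_midpoints(toy_stop_times)
# Period boundaries assumed here: AM 6.5-9.5, MD 9.5-15, PM 15-19, otherwise NT
trip_summary['tod'] = np.select(
    [trip_summary['midpoint'].between(6.5, 9.5),
     trip_summary['midpoint'].between(9.5, 15.0),
     trip_summary['midpoint'].between(15.0, 19.0)],
    ['AM', 'MD', 'PM'], default='NT')
print(trip_summary)
```

Counting trips per TOD for each pattern is then a `groupby` on `route_pattern_id` and `tod` once the labels are merged onto the trips table.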

In [4]:
def get_start_stop_times(stop_times):
    """Return one row per trip with the arrival time at its first and last stop."""
    frames = []
    for trip_id in stop_times['trip_id'].unique():
        trip_rows = stop_times.query('trip_id == @trip_id')
        # last stop of the trip (highest stop_sequence)
        max_row = trip_rows.query('stop_sequence == stop_sequence.max()')[['trip_id', 'arrival_time']]
        # first stop of the trip (lowest stop_sequence)
        min_row = trip_rows.query('stop_sequence == stop_sequence.min()')[['trip_id', 'arrival_time']]
        frames.append(min_row.merge(max_row, how='left', on='trip_id', suffixes=('_start', '_end')))
    # each merged frame has a single row with index 0, so the result index is all zeros
    return pd.concat(frames)


In [5]:
simpson = get_start_stop_times(gtfsfeeds_dfs.stop_times)

#### Check the results! 

In [6]:
simpson['arrival_time_end'].str.split(':').str[0]

0    20
0    07
0    07
0    11
0    06
     ..
0    21
0    22
0    23
0    23
0    24
Name: arrival_time_end, Length: 17655, dtype: object

In [7]:
simpson.query('arrival_time_end.str.split(":").str[0].astype("int32") > 23')

Unnamed: 0,trip_id,arrival_time_start,arrival_time_end
0,37940087,23:45:00,24:20:00
0,37940108,24:00:00,24:35:00
0,37940114,24:30:00,25:05:00
0,37940119,24:00:00,24:37:00
0,37940130,24:14:00,24:51:00
...,...,...,...
0,CR-Weekday-Fall-18-731,23:50:00,24:53:00
0,CR-Weekday-Fall-18-732,23:46:00,24:44:00
0,CR-Weekday-Fall-18-837,23:00:00,24:11:00
0,CR-Weekday-Fall-18-839,23:59:00,25:10:00


In [8]:
simpson.query('arrival_time_start.str.split(":").str[0].astype("int32") < 6').sort_values(by='arrival_time_start')

Unnamed: 0,trip_id,arrival_time_start,arrival_time_end
0,38232159,02:30:00,02:52:00
0,38232178,02:54:00,03:09:00
0,38229240,03:20:00,03:30:00
0,38229239,03:26:00,03:40:00
0,CR-Weekday-Fall-18-701,03:50:00,04:40:00
...,...,...,...
0,38167897,05:59:00,06:13:00
0,38169103,05:59:00,06:23:00
0,38168732,05:59:00,06:50:00
0,38442664,05:59:00,06:12:00


In [9]:
((simpson['arrival_time_end'].str.split(":").str[1]).astype('int32')/60)

0    0.050000
0    0.050000
0    0.883333
0    0.833333
0    0.416667
       ...   
0    0.133333
0    0.316667
0    0.166667
0    0.333333
0    0.166667
Name: arrival_time_end, Length: 17655, dtype: float64

#### Start work again!

In [10]:
def assign_tod(start_stop):
    """Add decimal-hour start/end times, the trip midpoint, and a TOD label.

    Modifies start_stop in place and also returns it. Seconds are ignored.
    """
    def to_decimal_hours(times):
        # 'HH:MM:SS' -> HH + MM/60 (hours may exceed 23 for trips ending after midnight)
        parts = times.str.split(":")
        return parts.str[0].astype('int32') + parts.str[1].astype('int32') / 60

    start_stop['at_end_dec'] = to_decimal_hours(start_stop['arrival_time_end'])
    start_stop['at_start_dec'] = to_decimal_hours(start_stop['arrival_time_start'])
    start_stop['midpoint'] = start_stop['at_start_dec'] + ((start_stop['at_end_dec'] - start_stop['at_start_dec']) / 2)

    # between() is inclusive on both ends, so a boundary value falls in the earlier period
    midpoint = start_stop['midpoint']
    start_stop['tod'] = np.where(midpoint.between(6.50, 9.50), 'AM',
                        np.where(midpoint.between(9.50, 15.00), 'MD',
                        np.where(midpoint.between(15.00, 19.00), 'PM', 'NT')))

    return start_stop


In [11]:
smurf = assign_tod(simpson)

In [12]:
smurf

Unnamed: 0,trip_id,arrival_time_start,arrival_time_end,at_end_dec,at_start_dec,midpoint,tod
0,37940074,19:25:00,20:03:00,20.050000,19.416667,19.733333,NT
0,37940075,06:35:00,07:03:00,7.050000,6.583333,6.816667,AM
0,37940076,07:14:00,07:53:00,7.883333,7.233333,7.558333,AM
0,37940077,11:12:00,11:50:00,11.833333,11.200000,11.516667,MD
0,37940079,05:48:00,06:25:00,6.416667,5.800000,6.108333,NT
...,...,...,...,...,...,...,...
0,CR-Weekday-Fall-18-924,20:35:00,21:08:00,21.133333,20.583333,20.858333,NT
0,CR-Weekday-Fall-18-925,21:40:00,22:19:00,22.316667,21.666667,21.991667,NT
0,CR-Weekday-Fall-18-926,22:35:00,23:10:00,23.166667,22.583333,22.875000,NT
0,CR-Weekday-Fall-18-927,22:40:00,23:20:00,23.333333,22.666667,23.000000,NT


#### Exploration: For Route/Headsign/Direction ids with multiple route_pattern_ids, does the TOD period matter?
- Service_ID does not seem to explain why a group has multiple route_pattern_ids.
- Yes, TOD does matter; see below:
    - Basically, a route might run one way in the morning and a different way in the afternoon. See Route 70A for a great example. These patterns do not deviate enough to get a new headsign or route (no VIA either) but may take a different road for a short while. This likely affects a few stops, but not enough to change the headsign, as noted.
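One way to check whether patterns within a group are tied to a time of day is a cross-tabulation of pattern against TOD. This is a minimal sketch on made-up data, assuming a trips table that already carries a `tod` column.

```python
import pandas as pd

# Toy trip table (made-up): two patterns share one route/headsign/direction
toy_tod_trips = pd.DataFrame({
    'route_id': ['X'] * 6,
    'trip_headsign': ['Downtown'] * 6,
    'direction_id': [0] * 6,
    'route_pattern_id': ['X-1-0', 'X-1-0', 'X-1-0', 'X-4-0', 'X-4-0', 'X-4-0'],
    'tod': ['AM', 'AM', 'MD', 'PM', 'PM', 'MD'],
})

# Rows: pattern; columns: TOD period; values: trip counts
counts = pd.crosstab(toy_tod_trips['route_pattern_id'], toy_tod_trips['tod'])
# Share of each period's trips served by each pattern
share = counts / counts.sum(axis=0)
print(counts)
print(share.round(2))
```

A pattern whose share is near 1 in one period and near 0 in another is a time-of-day variant rather than a rarely used deviation.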

In [13]:
# FOR R/H/D, select RPID with most trips
tod_trips = gtfsfeeds_dfs.trips.merge(smurf[['trip_id','tod']], how='left', on='trip_id')
day_rpid = tod_trips.groupby(by=['route_pattern_id']).agg({'trip_id':'nunique', 'route_id':'first','trip_headsign':'first','direction_id':'first'})
day_rpid = day_rpid.rename(columns = {'trip_id':'daily_trips'}).reset_index()
max_rpid = day_rpid.groupby(by=['route_id','trip_headsign','direction_id']).apply(lambda g: g[g['daily_trips'] == g['daily_trips'].max()])[['route_pattern_id','daily_trips']].reset_index()
max_rpid = max_rpid[['route_id','trip_headsign','direction_id','route_pattern_id']].rename(columns = {'route_pattern_id':'route_pattern_id_new'})
fred = max_rpid[max_rpid.duplicated(subset=['route_id','trip_headsign','direction_id'], keep =False)]
fred

Unnamed: 0,route_id,trip_headsign,direction_id,route_pattern_id_new
78,134,Wellington via Veterans Senior Center,1,134-1-1
79,134,Wellington via Veterans Senior Center,1,134-2-1
102,16,Forest Hills,0,16-2-0
103,16,Forest Hills,0,16-3-0
156,217,Quincy Center via North Quincy,0,217-1-0
157,217,Quincy Center via North Quincy,0,217-8-0
205,240,Quincy Center via North Randolph,1,240-2-1
206,240,Quincy Center via North Randolph,1,240-3-1
292,36,Forest Hills,1,36-1-1
293,36,Forest Hills,1,36-_-1


In [14]:
# Original (OG) approach: trips per TOD for each route_pattern_id
tod_rpid = tod_trips.groupby(by=['route_pattern_id','tod']).agg({'tod':'count', 'route_id':'first','trip_headsign':'first','direction_id':'first'})
tod_rpid = tod_rpid.rename(columns = {'tod':'trips_per_tod'}).reset_index()

george = fred.merge(tod_rpid, how='right', left_on='route_pattern_id_new',right_on='route_pattern_id').query('~route_id_x.isna()')
george.query('route_id_x == "70A"').sort_values(by=['trip_headsign_x'])

Unnamed: 0,route_id_x,trip_headsign_x,direction_id_x,route_pattern_id_new,route_pattern_id,tod,trips_per_tod,route_id_y,trip_headsign_y,direction_id_y
1560,70A,North Waltham,0.0,70A-1-0,70A-1-0,MD,3,70A,North Waltham,0
1561,70A,North Waltham,0.0,70A-1-0,70A-1-0,PM,7,70A,North Waltham,0
1564,70A,North Waltham,0.0,70A-4-0,70A-4-0,AM,6,70A,North Waltham,0
1565,70A,North Waltham,0.0,70A-4-0,70A-4-0,MD,2,70A,North Waltham,0
1566,70A,North Waltham,0.0,70A-4-0,70A-4-0,NT,2,70A,North Waltham,0
1562,70A,University Park,1.0,70A-1-1,70A-1-1,AM,5,70A,University Park,1
1563,70A,University Park,1.0,70A-1-1,70A-1-1,MD,5,70A,University Park,1
1567,70A,University Park,1.0,70A-4-1,70A-4-1,MD,2,70A,University Park,1
1568,70A,University Park,1.0,70A-4-1,70A-4-1,NT,2,70A,University Park,1
1569,70A,University Park,1.0,70A-4-1,70A-4-1,PM,6,70A,University Park,1


#### Actually update the route_pattern_ids
- Ties for the most daily trips produce duplicate ids per group, so one is chosen arbitrarily (drop_duplicates keeps the first row).
- The section above could be expanded to pick the pattern with the most peak-period trips, but for now this works.
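A possible peak-period tie-break could look like the following. This is a sketch only, on made-up counts: treating AM and PM as the peak periods and falling back to id order are assumptions, not decisions taken in this notebook.

```python
import pandas as pd

# Toy per-pattern, per-TOD counts (made-up): two patterns tie at 6 daily trips
toy_tod_rpid = pd.DataFrame({
    'route_id': ['X'] * 4,
    'trip_headsign': ['Downtown'] * 4,
    'direction_id': [0] * 4,
    'route_pattern_id': ['X-1-0', 'X-1-0', 'X-2-0', 'X-2-0'],
    'tod': ['AM', 'NT', 'AM', 'MD'],
    'trips_per_tod': [2, 4, 5, 1],
})
group_cols = ['route_id', 'trip_headsign', 'direction_id']

# Count only AM/PM trips toward the peak total (assumed definition of peak)
toy_tod_rpid['peak_trips'] = toy_tod_rpid['trips_per_tod'].where(
    toy_tod_rpid['tod'].isin(['AM', 'PM']), 0)
totals = (toy_tod_rpid.groupby(group_cols + ['route_pattern_id'], as_index=False)
          .agg(daily_trips=('trips_per_tod', 'sum'), peak_trips=('peak_trips', 'sum')))

# Most daily trips first, then most peak trips, then id order so the pick is deterministic
winners = (totals.sort_values(['daily_trips', 'peak_trips', 'route_pattern_id'],
                              ascending=[False, False, True])
           .drop_duplicates(subset=group_cols))
print(winners)
```

Sorting before `drop_duplicates` makes the kept row an explicit choice instead of whichever tied row happens to come first.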

In [53]:
# FOR R/H/D, select RPID with most trips
tod_trips = gtfsfeeds_dfs.trips.merge(smurf[['trip_id','tod']], how='left', on='trip_id')
day_rpid = tod_trips.groupby(by=['route_id','trip_headsign','direction_id','route_pattern_id']).agg({'trip_id':'nunique'})
day_rpid = day_rpid.rename(columns = {'trip_id':'daily_trips'}).reset_index()
max_rpid = day_rpid.groupby(by=['route_id','trip_headsign','direction_id']).apply(lambda g: g[g['daily_trips'] == g['daily_trips'].max()])[['route_pattern_id','daily_trips']].reset_index()
max_rpid = max_rpid[['route_id','trip_headsign','direction_id','route_pattern_id']].rename(columns = {'route_pattern_id':'route_pattern_id_new'})

max_rpid = max_rpid.drop_duplicates(subset=['route_id','trip_headsign','direction_id'])
max_rpid

Unnamed: 0,route_id,trip_headsign,direction_id,route_pattern_id_new
0,1,Dudley,1,1-_-1
1,1,Harvard,0,1-_-0
2,10,Andrew via South Bay Center,0,10-6-0
3,10,City Point,0,10-_-0
4,10,City Point via South Bay Center,0,10-9-0
...,...,...,...,...
712,Orange,Forest Hills,0,Orange-3-0
713,Orange,Oak Grove,1,Orange-3-1
714,Red,Alewife,1,Red-3-1
715,Red,Ashmont,0,Red-1-0


In [54]:
trips_update1 = gtfsfeeds_dfs.trips.merge(max_rpid, how='left', on=['route_id','trip_headsign','direction_id'])
trips_update1['route_pattern_id'] = trips_update1['route_pattern_id_new']
trips_update1 = trips_update1[gtfsfeeds_dfs.trips.columns]
trips_update1[trips_update1.duplicated(subset=['trip_id'], keep =False)]

Unnamed: 0,route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,unique_agency_id,unique_feed_id,bikes_allowed_desc,wheelchair_accessible_desc


#### Now - update route/headsign/direction combos where each TOD period has fewer than three trips
- Take the route_pattern_id with the max number of daily trips per route/direction.
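The low-frequency test can be sketched compactly with `unstack`. This is a minimal sketch on made-up data; the column names mirror the trips table, and the four period labels are assumed to be AM, MD, PM, and NT.

```python
import pandas as pd

# Toy trips with TOD labels (made-up)
toy_tod_trips = pd.DataFrame({
    'route_id': ['X'] * 7,
    'direction_id': [0] * 7,
    'trip_headsign': ['Main'] * 5 + ['Garage'] * 2,
    'tod': ['AM', 'AM', 'AM', 'MD', 'PM', 'AM', 'NT'],
})
group_cols = ['route_id', 'trip_headsign', 'direction_id']
periods = ['AM', 'MD', 'PM', 'NT']

# Trips per group and period; periods with no service become 0
per_tod = (toy_tod_trips.groupby(group_cols + ['tod']).size()
           .unstack('tod', fill_value=0)
           .reindex(columns=periods, fill_value=0))

# A group is low-frequency only if every period has fewer than three trips
low_freq = per_tod[(per_tod < 3).all(axis=1)].reset_index()
print(low_freq)
```

Here `Main` is kept because it has three AM trips, while `Garage` is flagged for reassignment.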

In [55]:
trips_update1.groupby(by=['route_pattern_id','route_id','direction_id','trip_headsign']).agg({'trip_id':'count','service_id':'nunique'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,trip_id,service_id
route_pattern_id,route_id,direction_id,trip_headsign,Unnamed: 4_level_1,Unnamed: 5_level_1
1-_-0,1,0,Harvard,114,1
1-_-1,1,1,Dudley,109,1
10-2-1,10,1,Townsend & Humboldt,1,1
10-6-0,10,0,Andrew via South Bay Center,1,1
10-9-0,10,0,City Point via South Bay Center,31,1
...,...,...,...,...,...
Orange-3-0,Orange,0,Forest Hills,153,1
Orange-3-1,Orange,1,Oak Grove,153,1
Red-1-0,Red,0,Ashmont,109,1
Red-3-0,Red,0,Braintree,114,1


In [56]:
len(trips_update1)

17655

In [57]:
tod_trips2 = trips_update1.merge(smurf[['trip_id','tod']], how='left', on='trip_id')
tod_rpid2 = tod_trips2.groupby(by=['route_id','trip_headsign','direction_id','tod']).agg({'tod':'count','route_pattern_id':'nunique','service_id':'nunique'})
tod_rpid2 = tod_rpid2.rename(columns = {'tod':'trips_per_tod'}).reset_index()
tod_rpid2.head()

Unnamed: 0,route_id,trip_headsign,direction_id,tod,trips_per_tod,route_pattern_id,service_id
0,1,Dudley,1,AM,18,1,1
1,1,Dudley,1,MD,24,1,1
2,1,Dudley,1,NT,39,1,1
3,1,Dudley,1,PM,28,1,1
4,1,Harvard,0,AM,20,1,1


In [62]:
print(len(tod_rpid2.query('service_id > 1')))
print(len(tod_rpid2.query('route_pattern_id != 1')))

59
0


In [97]:
tod_rpid3 = tod_rpid2[['route_id','trip_headsign','direction_id','tod','trips_per_tod']]
woolf = tod_rpid3.pivot_table(index = ['route_id','trip_headsign','direction_id'], columns = ['tod'])

needs_update = woolf['trips_per_tod'].reset_index().fillna(0).query('AM < 3 & MD < 3 & PM < 3 & NT < 3')
needs_update = needs_update.set_index(['route_id','direction_id'])
needs_update.index = ['{}_{}'.format(i, j) for i, j in needs_update.index]
needs_update.reset_index()

tod,index,trip_headsign,AM,MD,NT,PM
0,10_0,Andrew via South Bay Center,1.0,0.0,0.0,0.0
1,10_1,Townsend & Humboldt,1.0,0.0,0.0,0.0
2,100_1,Fellsway Garage via Fellsway,0.0,0.0,0.0,1.0
3,101_0,Medford Square,1.0,0.0,1.0,0.0
4,106_0,Franklin Square via Lebanon Loop,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...
170,CR-Franklin_0,Walpole,1.0,0.0,0.0,1.0
171,CR-Kingston_0,Plymouth,0.0,2.0,0.0,1.0
172,CR-Newburyport_0,Beverly,1.0,0.0,0.0,1.0
173,CR-Worcester_0,Ashland,1.0,0.0,0.0,0.0


In [82]:
byrpid = trips_update1.groupby(
    by=['route_id','trip_headsign','direction_id','route_pattern_id']).agg(
        {'trip_id':'count'}).reset_index().rename(
            columns={'trip_id':'daily_trips'})

gby = byrpid.set_index(['route_id','direction_id'])
gby.index = ['{}_{}'.format(i, j) for i, j in gby.index]

# For each route_direction combo, keep the row(s) with the most daily trips
max_rows = []
for name in gby.index.unique():
    group = gby.loc[[name]]  # a list indexer always returns a DataFrame
    max_rows.append(group[group['daily_trips'] == group['daily_trips'].max()])
flintstone = pd.concat(max_rows)

In [101]:
flintstone = flintstone.drop_duplicates()
flintstone

Unnamed: 0,trip_headsign,route_pattern_id,daily_trips
1_1,Dudley,1-_-1,109
1_0,Harvard,1-_-0,114
10_0,City Point via South Bay Center,10-9-0,31
10_1,Copley via South Bay Center,10-9-1,28
100_0,Elm St,100-3-0,36
...,...,...,...
Mattapan_0,Mattapan,Mattapan-_-0,163
Orange_0,Forest Hills,Orange-3-0,153
Orange_1,Oak Grove,Orange-3-1,153
Red_1,Alewife,Red-3-1,223


In [138]:
update_table = needs_update.reset_index().merge(flintstone.reset_index(), how='left', on='index', suffixes = ('_old','_max'))

update_table = update_table.rename(columns = {'route_pattern_id':'route_pattern_id_new','index':'combo_id'})
update_table = update_table.drop_duplicates(subset='combo_id')
update_table[['route_id','direction_id']] = update_table['combo_id'].str.split('_',expand=True)

update_table = update_table.set_index(['combo_id','trip_headsign_old'])
update_table.index = ['{}_{}'.format(i, j) for i, j in update_table.index]

update_table

Unnamed: 0,AM,MD,NT,PM,trip_headsign_max,route_pattern_id_new,daily_trips,route_id,direction_id
10_0_Andrew via South Bay Center,1.0,0.0,0.0,0.0,City Point via South Bay Center,10-9-0,31,10,0
10_1_Townsend & Humboldt,1.0,0.0,0.0,0.0,Copley via South Bay Center,10-9-1,28,10,1
100_1_Fellsway Garage via Fellsway,0.0,0.0,0.0,1.0,Wellington,100-3-1,36,100,1
101_0_Medford Square,1.0,0.0,1.0,0.0,Malden,101-3-0,53,101,0
106_0_Franklin Square via Lebanon Loop,1.0,0.0,0.0,0.0,Lebanon Loop,106-_-0,20,106,0
...,...,...,...,...,...,...,...,...,...
CR-Franklin_0_Norwood Central,0.0,0.0,0.0,1.0,Forge Park/495,CR-Franklin-0-0,16,CR-Franklin,0
CR-Kingston_0_Plymouth,0.0,2.0,0.0,1.0,Kingston,CR-Kingston-0-0,9,CR-Kingston,0
CR-Newburyport_0_Beverly,1.0,0.0,0.0,1.0,Newburyport,CR-Newburyport-0-0,16,CR-Newburyport,0
CR-Worcester_0_Ashland,1.0,0.0,0.0,0.0,Worcester,CR-Worcester-0-0,20,CR-Worcester,0


#### Quick break to calculate some numbers per route pattern id before they are updated!

In [None]:
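smurf['trip_length'] = smurf['at_end_dec'] - smurf['at_start_dec']
# Round to whole minutes first so float error cannot truncate a minute; zero-pad the minutes
trip_minutes = (smurf['trip_length'] * 60).round().astype('int32')
smurf['trip_length_time'] = (trip_minutes // 60).astype('str') + ":" + (trip_minutes % 60).astype('str').str.zfill(2) + ":00"
smurf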

In [None]:
papa_smurf = smurf.merge(gtfsfeeds_dfs.trips[['trip_id','route_pattern_id']], how='left', on='trip_id')
smurfette = papa_smurf.sort_values('midpoint').groupby(by=['route_pattern_id','tod']).agg({'trip_length':['mean','median','count','first','last']})
smurfette = smurfette.droplevel(0,axis=1).reset_index()
smurfette['first_last_dif'] = round(smurfette['first'] - smurfette['last'],2)
smurfette['first_last_dif_min'] = smurfette['first_last_dif']*60
smurfette

#### Break over - update those route_pattern_ids!

In [140]:
trips_update1
trips_update2 = trips_update1.set_index(['route_id','direction_id','trip_headsign'])
trips_update2.index = ['{}_{}_{}'.format(i, j, k) for i, j, k in trips_update2.index]
trips_update2

Unnamed: 0,service_id,trip_id,trip_short_name,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,unique_agency_id,unique_feed_id,bikes_allowed_desc,wheelchair_accessible_desc
1_0_Harvard,BUS42018-hbc48wk1-Weekday-02,38230147,,C01-21,010038,1,,1-_-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...
1_0_Harvard,BUS42018-hbc48wk1-Weekday-02,38230148,,C01-21,010038,1,,1-_-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...
1_0_Harvard,BUS42018-hbc48wk1-Weekday-02,38230154,,C01-16,010038,1,,1-_-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...
1_0_Harvard,BUS42018-hbc48wk1-Weekday-02,38230155,,C01-6,010038,1,,1-_-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...
1_0_Harvard,BUS42018-hbc48wk1-Weekday-02,38230157,,C01-20,010038,1,,1-_-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Red_1_Alewife,RTL42018-hms48011-Weekday-01,38066722,,S933_-52,933_0010,1,,Red-3-1,0,mbta,mbta2018_102418_20221109_1,,can accommodate at least one rider in a wheelc...
Red_1_Alewife,RTL42018-hms48011-Weekday-01,38066723,,S933_-52,933_0010,1,,Red-3-1,0,mbta,mbta2018_102418_20221109_1,,can accommodate at least one rider in a wheelc...
Red_1_Alewife,RTL42018-hms48011-Weekday-01,38066730,,S933_-43,933_0010,1,,Red-3-1,0,mbta,mbta2018_102418_20221109_1,,can accommodate at least one rider in a wheelc...
Red_1_Alewife,RTL42018-hms48011-Weekday-01,38066735,,S933_-40,933_0010,1,,Red-3-1,0,mbta,mbta2018_102418_20221109_1,,can accommodate at least one rider in a wheelc...


In [141]:
trips_update3 = trips_update2.reset_index().merge(update_table.reset_index()[['index','route_pattern_id_new']], how='left',on='index').rename(columns={'index':'combo_id'})
trips_update3.query('~route_pattern_id_new.isna() & route_pattern_id_new != route_pattern_id')

Unnamed: 0,combo_id,service_id,trip_id,trip_short_name,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,unique_agency_id,unique_feed_id,bikes_allowed_desc,wheelchair_accessible_desc,route_pattern_id_new
224,10_0_Andrew via South Bay Center,BUS42018-hbc48wk1-Weekday-02,38231518,,C23-216,100112,1,,10-6-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,10-9-0
268,10_1_Townsend & Humboldt,BUS42018-hbc48wk1-Weekday-02,38232166,,C66-292,100116,1,,10-2-1,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,10-9-1
373,100_1_Fellsway Garage via Fellsway,BUS42018-hbf48011-Weekday-02,38428429,,F132-63,1000048,1,,100-_-1,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,100-3-1
384,101_0_Medford Square,BUS42018-hbg48011-Weekday-02,38443430,,G112-184,1010151,1,,101-1-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,101-3-0
385,101_0_Medford Square,BUS42018-hbg48011-Weekday-02,38443432,,G101-70,1010151,1,,101-1-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,101-3-0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14980,CR-Kingston_0_Plymouth,CR-Weekday-Kingston-Fall-18,CR-Weekday-Fall-18-063,63.0,,9790006,1,,CR-Kingston-2-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,CR-Kingston-0-0
14981,CR-Kingston_0_Plymouth,CR-Weekday-Kingston-Fall-18,CR-Weekday-Fall-18-065,65.0,,9790004,1,,CR-Kingston-2-0,1,mbta,mbta2018_102418_20221109_1,can accommodate at least one bicycle,can accommodate at least one rider in a wheelc...,CR-Kingston-0-0
14982,CR-Kingston_0_Plymouth,CR-Weekday-Kingston-Fall-18,CR-Weekday-Fall-18-067,67.0,,9790004,1,,CR-Kingston-2-0,2,mbta,mbta2018_102418_20221109_1,no bicycles are allowed on this trip,can accommodate at least one rider in a wheelc...,CR-Kingston-0-0
15774,Green-C_1_Lechmere,LRV42018-hlb48011-Weekday-01,38091440,,B831_-51,830_0004,1,,Green-C-0-1,0,mbta,mbta2018_102418_20221109_1,,can accommodate at least one rider in a wheelc...,Green-C-1-1


In [142]:
trips_update3['route_pattern_id'] = np.where(trips_update3['route_pattern_id_new'].isna(), trips_update3['route_pattern_id'], trips_update3['route_pattern_id_new'])

In [143]:
trips_update3[trips_update3['trip_id'].duplicated(keep=False)]

Unnamed: 0,combo_id,service_id,trip_id,trip_short_name,block_id,shape_id,wheelchair_accessible,trip_route_type,route_pattern_id,bikes_allowed,unique_agency_id,unique_feed_id,bikes_allowed_desc,wheelchair_accessible_desc,route_pattern_id_new


In [144]:
len(trips_update3['trip_id'].unique())

17655

In [145]:
len(trips_update3['route_pattern_id'].unique())

587

In [146]:
len(gtfsfeeds_dfs.trips['route_pattern_id'].unique())

945

In [None]:
gtfsfeeds_dfs.trips = trips_update3[gtfsfeeds_dfs.trips.columns]

### Create Generic Stop Times & Stop Sequence per Route Pattern

- Changing route_pattern_id alone is not enough: TransCAD does not use this field to combine trips into routes, so the change has no effect on the import.
- Working theory is that to consolidate routes one must use the stop_times.txt table as it defines the stop sequence for every trip. Theoretically, this is being used to consolidate trips into routes based on whether they have the same stop sequence.
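If the working theory holds, the number of distinct stop sequences in stop_times should be comparable to the number of routes TransCAD creates. A minimal sketch of counting them on made-up data follows; how TransCAD actually groups trips is not confirmed here.

```python
import pandas as pd

# Toy stop_times (made-up): trips a and b share a stop sequence, c skips a stop
toy_stop_times = pd.DataFrame({
    'trip_id':       ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'stop_sequence': [1, 2, 3, 1, 2, 3, 1, 2],
    'stop_id':       ['s1', 's2', 's3', 's1', 's2', 's3', 's1', 's3'],
})

# One tuple of stop_ids per trip, in travel order
signature = (toy_stop_times.sort_values(['trip_id', 'stop_sequence'])
             .groupby('trip_id')['stop_id'].agg(tuple))

# Trips sharing a signature would collapse into one route under this theory
n_sequences = signature.nunique()
print(signature)
print(n_sequences)
```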

Plan:
- Explore whether scheduled stop times differ by TOD, or whether only realtime GTFS accounts for traffic.
    - For every route_pattern_id, get the average trip length (in minutes).
- For every route_pattern_id, get the average number of minutes between each pair of stops in the stop sequence.
- Replace the stop times and sequence for trips that had their route_pattern_id updated with the generic stop sequence and times per route_pattern_id created in the previous step. 
    - Keep the start time and work off of that.
    - Arrival time will equal departure time given that the difference is usually less than a minute. Will assume difference in time can be included in the minutes to next arrival time for aggregate modeling purposes.

In [None]:
trip_stop_replace = {}

# flag the first stop of every trip
gtfsfeeds_dfs.stop_times['first_stop'] =  0
gtfsfeeds_dfs.stop_times.loc[gtfsfeeds_dfs.stop_times.groupby('trip_id').stop_sequence.idxmin(),'first_stop'] = 1

# Assumption: tod_trips (built before the update) still holds each trip's original route_pattern_id.
# It stands in for the route_pattern_id_old / route_pattern_id_max columns, which trips does not carry.
trips = gtfsfeeds_dfs.trips
original_rpid = trips['trip_id'].map(tod_trips.set_index('trip_id')['route_pattern_id'])
changed = trips[original_rpid.notna() & (original_rpid != trips['route_pattern_id'])]
# Assumption: donors are trips that kept their pattern, so they carry that pattern's own stop sequence
donors = trips[original_rpid == trips['route_pattern_id']]

start_sec = gtfsfeeds_dfs.stop_times.query('first_stop == 1').set_index('trip_id')['departure_time_sec']

for tid, rpid in zip(changed['trip_id'], changed['route_pattern_id']):
    candidates = start_sec.reindex(donors.loc[donors['route_pattern_id'] == rpid, 'trip_id']).dropna()
    if candidates.empty:
        continue
    # donor whose first departure is closest in time to this trip's first departure
    trip_stop_replace[tid] = (candidates - start_sec[tid]).abs().idxmin()

In [None]:
gtfsfeeds_dfs.stop_times = gtfsfeeds_dfs.stop_times.sort_values('stop_sequence')

for tid in trip_stop_replace.values():
    
    gtfsfeeds_dfs.stop_times.loc[gtfsfeeds_dfs.stop_times.loc[:,'trip_id'] == str(tid), 'time_between_stops'] = gtfsfeeds_dfs.stop_times.loc[gtfsfeeds_dfs.stop_times.loc[:,'trip_id'] == str(tid), 'departure_time_sec'].diff()

gtfsfeeds_dfs.stop_times

In [None]:
gtfsfeeds_dfs.stop_times.query('time_between_stops > 0')

In [None]:
stop_times = gtfsfeeds_dfs.stop_times
for trip, new_trip in trip_stop_replace.items():
    start_time = stop_times.query('(trip_id == @trip) & (first_stop == 1)')['departure_time_sec']
    # drop old stop times
    stop_times = stop_times[stop_times['trip_id'] != trip]
    # grab new stop times (copy so the donor trip's rows are left untouched)
    nst = stop_times.query('trip_id == @new_trip').sort_values('stop_sequence').copy()
    nst['trip_id'] = trip

    # replace the start time, then rebuild each stop time from the cumulative gaps
    nst.loc[nst['first_stop'] == 1, 'time_between_stops'] = int(start_time.iloc[0])
    nst['departure_time_sec'] = nst['time_between_stops'].cumsum().astype(int)

    # recalc arrival/dep times as HH:MM:SS; hours may exceed 23 for service past midnight,
    # which a datetime conversion would wrap back to 00
    secs = nst['departure_time_sec']
    nst['arrival_time'] = ((secs // 3600).astype(str).str.zfill(2) + ':'
                           + (secs % 3600 // 60).astype(str).str.zfill(2) + ':'
                           + (secs % 60).astype(str).str.zfill(2))
    nst['departure_time'] = nst['arrival_time']

    stop_times = pd.concat([stop_times, nst])

In [None]:
gtfsfeeds_dfs.stop_times = stop_times[['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence',
       'stop_headsign', 'pickup_type', 'drop_off_type', 'timepoint',
       'checkpoint_id', 'unique_agency_id', 'unique_feed_id', 'route_type',
       'pickup_type_desc', 'drop_off_type_desc', 'timepoint_desc',
       'departure_time_sec']]


In [None]:
gtfsfeeds_dfs.trips.to_csv(r"C:\Users\matkinson.AD\Downloads\sandbox_vetday\trips.txt",index=False)
gtfsfeeds_dfs.stop_times.to_csv(r"C:\Users\matkinson.AD\Downloads\sandbox_vetday\stop_times.txt",index=False)