# Calculating Headways for CTA Buses - work in progress

NOTE: Headways for late night routes are not accurate yet - See the Needs Fixing section below.

### What This Does

This code calculates headways for each CTA bus stop on a given route and direction.  It also generates
summary stats for an entire route (min, max, mean, median, 25th percentile, and 75th percentile headways).

### To Do Next

- Fix late night headway issue

- Add summary stats to the stops geodataframe so they show up in the visualizations when you hover over a stop

- See what happens if you try to loop through many - or all - routes and compile a larger headways dataframe

- Investigate the excess waiting time (EWT) calculations Sean MacMullan found: https://www.trapezegroup.com.au/resources/infographic-how-to-calculate-excess-waiting-time/

### Notes

One bus route can be made up of several patterns.  Headways are calculated for all buses running the same direction on a given route at a particular stop, regardless of which pattern the bus is on.

Vehicle data comes from the Chi Hack Night Ghost Buses breakout team: https://ghostbuses.com/about
This provides location information at 5-minute intervals for every CTA bus.  It includes
data on which route (rt) and pattern (pid) the bus was running, along with the vehicle's distance along the pattern (pdist) and a timestamp.  

Pattern data comes from the CTA's API directly. This tells us which stops are found along
a given pattern and the distance along the pattern where each stop is located.
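
As a rough illustration of how the pattern data is used, the sketch below builds a hypothetical pattern record with the fields this notebook relies on (pid, rtdir, and a list of points with seq, typ, stpid, and pdist) and pulls out each stop's distance along the pattern. The two stops and their pdist values match rows in the sample output further down; the non-stop point and its 'W' type code are assumptions added for contrast.

```python
# Hypothetical pattern record shaped like the fields described in this notebook.
# The 'W' (non-stop) point is an assumption for illustration only.
pattern = {
    'pid': 2944,
    'rtdir': 'Northbound',
    'pt': [
        {'seq': 1, 'typ': 'S', 'stpid': '11807', 'pdist': 22987},
        {'seq': 2, 'typ': 'W', 'pdist': 24500},
        {'seq': 3, 'typ': 'S', 'stpid': '15092', 'pdist': 26041},
    ],
}

# Keep only stop points and map each stop id to its distance along the pattern
stop_distances = {
    point['stpid']: point['pdist']
    for point in pattern['pt']
    if point['typ'] == 'S'
}
print(stop_distances)
```

A lookup like this is what lets a vehicle's pdist be compared against a stop's pdist on the same pattern.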

Combining the datasets above, the general strategy is:

1. Turn vehicle data into intervals:  Time and distance are recorded at the start and end of each 5-minute interval.

2. For a given stop and pattern, find all intervals where a vehicle on that pattern reached or passed the stop.

3. Estimate the time each bus actually reached the stop through interpolation.  The interval gives the time and distance along a given pattern before and after the bus arrived at the stop.  The CTA's pattern data tells us where the bus stop falls along the pattern.  Stop times are estimated assuming the vehicle travels at a constant speed throughout the interval.

4. Combine all stop times for buses running the same direction at a particular stop.

5. Calculate headways between buses based on stop times.

6. Calculate summary statistics on headways.
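
Step 3 can be sketched with plain datetime arithmetic. The numbers come from one interval in the sample output later in this notebook (a bus at pdist 30851 at 23:12 and at pdist 37723 at 23:17, with the stop at pdist 37291). The function name `interpolate_time` is illustrative; unlike the helper defined below, it does not round the result.

```python
from datetime import datetime

def interpolate_time(stop_pdist, start_time, end_time, start_pdist, end_pdist):
    """Estimate when a bus reached stop_pdist, assuming constant speed."""
    # Fraction of the interval's distance covered when the bus reaches the stop
    dist_ratio = (stop_pdist - start_pdist) / (end_pdist - start_pdist)
    # Apply the same fraction to the interval's duration
    return start_time + (end_time - start_time) * dist_ratio

start = datetime(2023, 1, 11, 23, 12)
end = datetime(2023, 1, 11, 23, 17)
est = interpolate_time(37291, start, end, 30851, 37723)
print(est)  # about 23:16:41, which rounds to 23:17
```

The stop sits about 94% of the way through the interval's distance, so the estimate lands about 94% of the way through its five minutes.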


### Needs Fixing

Address late night routes. 

- Some buses run past midnight, so the first bus of the calendar day appears just after midnight. 
- Many of these have no scheduled service in the early morning hours, but the unscheduled hours show up as a long headway.  And different directions of travel may begin and end service at different times. 
    - Need to change the range of times where data is used so it doesn't assume buses should be running during their off time  
    - (Example: see first bus/last bus info on the 55 bus here: https://www.transitchicago.com/bus/55/)
        - Parts of this route run 24 hours
        - Parts run 5am to 2am
        - Parts run 4:10 am to 1:10am
        - First and last buses reach mid-route stops later than these ranges. Buses continue to the end of their route
        - Out-of-service times (for example, between 2am and 5am) show up in the code below as long headways
 
- Possible Approach?

    - DONE: Combine 2 days of data: the "main" date being looked at and the following date for any schedules that spill into overnight.  How to choose a new range that starts and ends with the out-of-service period?

    - Investigate the schedule data from CTA. Can we extract actual start/end times for a route/direction, adding a buffer of extra time on the end to capture buses finishing their routes at end of the schedule? 

- Also, this code currently eliminates all intervals where one timestamp is before midnight and the next is after, since it uses one calendar day of data.  Missing one interval per day is less critical than the issue above, and it would likely be resolved by the same fixes.
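
The two-day approach can be sketched as a "service day" window whose start and end are offsets from midnight on the main date, so an offset above 24 hours rolls into the following calendar day. This is a minimal sketch using only the standard library; the helper name `service_window` is hypothetical, and the '02:30' and '26:30' defaults simply mirror the defaults of get_chn_vehicles() below.

```python
from datetime import datetime, timedelta

def service_window(date_string, start_hhmm='02:30', end_hhmm='26:30'):
    """Return (start, end) datetimes for a service day given 'hh:mm' offsets."""
    day1 = datetime.strptime(date_string, '%Y-%m-%d')

    def to_offset(hhmm):
        # Hours may exceed 23, e.g. '26:30' means 2:30 am on the next day
        hours, minutes = (int(part) for part in hhmm.split(':'))
        return timedelta(hours=hours, minutes=minutes)

    return day1 + to_offset(start_hhmm), day1 + to_offset(end_hhmm)

window_start, window_end = service_window('2023-01-11')
print(window_start, window_end)
```

Choosing offsets that fall inside a route's out-of-service period keeps the overnight gap from being counted as a headway.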




In [3]:
import os
import requests
from dotenv import load_dotenv
import pandas as pd
import geopandas as gpd
from shapely import Point, LineString
import datetime as dt
import numpy as np

In [4]:
# Get API key from the .env file
load_dotenv()
API_KEY = os.getenv('API_KEY')

### Functions

In [5]:
def get_chn_vehicles(date_string:str, start_timedelta_string:str='02:30', end_timedelta_string:str='26:30') -> pd.DataFrame:
    """Parameters:\n

    date_string in 'YYYY-MM-DD' format\n

    start_timedelta_string in 'hh:mm' format.  (optional. Default is '02:30')\n
    end_timedelta_string in 'hh:mm' format. (optional. Default is '26:30') \n

    Data returned:\n

    Vehicle data scraped by the chn ghost bus team for all CTA buses running between the specified
    start and end times on the specified date.  Where an end time over 24 hours is specified, the data returned
    will extend into the following calendar date. Because only two calendar days of data are loaded,
    the maximum valid value for start_timedelta_string or end_timedelta_string is '47:59'\n
    
    Timedelta values start at midnight on day 1.  Example: If start_timedelta_string is '03:45' 
    and end_timedelta_string is '25:07', then data returned is from 3:45 am on the requested date 
    through 1:07 am the following day. \n
    
    Data is returned in a pandas dataframe. Columns include vehicle id (vid), timestamp (tmstmp), 
    pattern id (pid), and distance along the pattern (pdist) for each vehicle at 5-minute intervals 
    throughout the requested time range on the requested calendar day.
    
    """

    day1 = pd.to_datetime(date_string, infer_datetime_format=True)
    day2 = day1 + pd.Timedelta(days=1)
    day2_string = day2.strftime('%Y-%m-%d')

    start_timedelta_string_expanded = start_timedelta_string + ':00'
    end_timedelta_string_expanded = end_timedelta_string + ':00'

    def get_vehicles_single_day(single_day_datestring):
        chn_data_source_single_day = f'https://chn-ghost-buses-public.s3.us-east-2.amazonaws.com/bus_full_day_data_v2/{single_day_datestring}.csv'
        vehicles_single_day = pd.read_csv(
        chn_data_source_single_day, dtype={
            'vid':'int',
            'tmstmp':'str',
            'lat':'float',
            'lon':'float',
            'hdg':'int',
            'pid':'int',
            'rt':'str',
            'pdist':'int',
            'des':'str',
            'dly':'bool',
            'tatripid':'str',
            'origatripno':'int',
            'tablockid':'str',
            'zone':'str',
            'scrape_file':'str',
            'data_hour':'int',
            'data_date':'str'
            }
        )

        vehicles_single_day['tmstmp'] = pd.to_datetime(vehicles_single_day['tmstmp'],infer_datetime_format=True)
    
        return vehicles_single_day

    df_day1_vehicles = get_vehicles_single_day(date_string)
    df_day2_vehicles = get_vehicles_single_day(day2_string)
    
    df_both_days_vehicles = pd.concat([df_day1_vehicles, df_day2_vehicles])

    # Filter for vehicles running between the requested start and end times
    # (by default, 2:30 am on day 1 through 2:30 am on day 2).
    # Note:  First tried 2:30, but caught one bus on the test day / route that had its
    # out-of-service time, based on actual stop times, from 3:42 am to 4:37 am.
    # Then changed the window to 3:50 am and found another bus stop with a bus
    # at 4:11 and another at 5:50.

    # TODO
    # Is it possible to quickly check every bus route and see if the longest headway is actually
    # an out-of-service time around 3am?

    service_day_start = day1+pd.Timedelta(start_timedelta_string_expanded)
    service_day_end = day1+pd.Timedelta(end_timedelta_string_expanded)

    df_vehicles = df_both_days_vehicles.loc[
        (service_day_start < df_both_days_vehicles['tmstmp'])
        & (df_both_days_vehicles['tmstmp'] <= service_day_end)]
 
    return df_vehicles


In [6]:
def get_patterns(vehicles:pd.DataFrame, rt:str) -> pd.DataFrame:
    '''This is a helper function.\n
    Parameters:\n
    vehicles is a dataframe obtained using get_chn_vehicles().\n
    rt is a route id as a string (for example, '55' for the 55 Garfield bus)\n
    Data returned:\n
    patterns data from the CTA's bus tracker API is returned in a dataframe. 
    It includes all pattern ids (pid) found in the vehicles data for the specified
    route. Columns include pattern id (pid) and points (pt).\n
    The pt data for each pattern is its own dataframe with information on every point
    along the pattern. It includes columns for sequence (seq), latitude (lat),
    longitude (lon), type of points (typ) where S indicates a bus stop,
    stop ID (stpid) for stop points, and distance along the pattern (pdist).
     '''
    

    df_output = pd.DataFrame()

    # filter vehicles to the specified route
    rt_vehicles = vehicles.loc[vehicles['rt'] == rt]

    # list pid values included in the route
    pid_list = list(rt_vehicles['pid'].unique())

    # convert pids to strings
    pid_list = [str(i) for i in pid_list]

    # split pid_list into chunks of 10 (limit of the API):
    start = 0
    end = len(pid_list)
    step = 10
    for i in range(start, end, step):
        pid_list_chunk = pid_list[i:i+step]
        pid_string = ','.join(pid_list_chunk)

        # get data from CTA's feed
        api_url = f'http://www.ctabustracker.com/bustime/api/v2/getpatterns?key={API_KEY}&pid={pid_string}&format=json'
        response = requests.get(api_url)
        patterns = response.json()

        # convert json to dataframe
        df_patterns = pd.DataFrame(patterns['bustime-response']['ptr'])

        # add to the output dataframe
        df_output = pd.concat([df_output, df_patterns])


    # convert pt column values to dataframes for each pattern containing that pattern's points
    df_output['pt'] = df_output['pt'].apply(lambda x: pd.DataFrame(x))
    
    return df_output


In [7]:

def get_pattern_linestrings(patterns:pd.DataFrame) -> gpd.GeoDataFrame:
    '''This is for future use and visualization - not necessary to generate
    headway information.\n
    Parameters:\n
    patterns is a dataframe obtained using get_patterns().\n
    Data returned:\n
    Pattern data is returned as a geodataframe with linestring geometry
    representing the path buses travel.'''

    df_patterns = patterns.copy()

    # Turn points into linestrings
    geometry_linestrings = []
    for p in df_patterns['pt']:
        p.sort_values('seq', inplace=True)
        linestring_points = list(zip(p['lon'],p['lat']))

        # generate linestring using all points
        linestring = LineString(linestring_points)
        geometry_linestrings.append(linestring)

    # Create a geodataframe for the patterns using the linestring geometry
    gdf_patterns = gpd.GeoDataFrame(df_patterns, geometry=geometry_linestrings).set_crs(epsg=4326)

    # Drop the original pt column
    gdf_patterns.drop(['pt'], axis=1, inplace=True)

    return gdf_patterns


In [8]:
def get_pattern_stops(patterns) -> gpd.GeoDataFrame:
        '''This is a helper function.\n
        Parameters:\n
        patterns is a dataframe obtained using get_patterns().\n
        Data returned:\n
        Bus stop data is returned as a geodataframe 
        with point geometry, one point per bus stop on each pattern
        associated with a route.\n
        Note that stops serving multiple patterns will be listed multiple 
        times, once for each pattern with the seq and pdist values 
        specific to that pattern.'''

        # get patterns for the route
        df_patterns = patterns.copy()

        # set up a geodataframe to contain stops
        gdf_route_stops = gpd.GeoDataFrame()

        # consider the pid column (pattern ID) and the pt column (dataframe containing
        # points along the pattern)
        for pid, pt in zip(df_patterns['pid'],df_patterns['pt']):
                # sort points sequentially
                pt.sort_values('seq', inplace=True)
                # add the pattern id to each point's data
                pt['pid']=pid
                # add the pattern direction to each point's data
                rtdir = df_patterns['rtdir'].loc[df_patterns['pid'] == pid].tolist()[0]
                pt['rtdir'] = rtdir
                # filter to only show stop points
                stops = pt[pt['typ']=='S']
                # zip lat/lon data to get coordinate pairs
                coords = list(zip(stops['lon'],stops['lat']))
                # turn coordinates into point geometry
                geometry = [Point(c) for c in coords]
                # generate a geodataframe for the stops in this pattern
                gdf_pattern_stops = gpd.GeoDataFrame(stops,geometry=geometry).set_crs(epsg=4326)
                # add this pattern's stops to the dataframe containing all stops on the route
                gdf_route_stops = pd.concat([gdf_route_stops, gdf_pattern_stops])

        return gdf_route_stops


In [9]:
def get_vehicle_intervals(vehicles:pd.DataFrame, rt:str) -> pd.DataFrame:

    '''This is a helper function.\n
    Parameters:\n
    vehicles is a dataframe obtained using get_chn_vehicles().\n
    rt is a route id as a string (for example, '55' for the 55 Garfield bus)\n
    Data returned:\n
    Intervals are returned as a dataframe, with each row representing
    an interval between two points in time and space for one vehicle. 
    Columns are added to the vehicles data for each interval's 
    start time, end time, start pdist, and end pdist.'''

    df_vehicles = vehicles.copy()

    # filter to the specified route
    df_vehicles = df_vehicles.loc[df_vehicles['rt'] == rt]

    # Set up dataframe to contain final formatted data
    df_output = pd.DataFrame()

    # End time for each interval as a timestamp
    # df_vehicles['end_time'] = pd.to_datetime(df_vehicles['tmstmp'],infer_datetime_format=True)
    df_vehicles['end_time'] = df_vehicles['tmstmp']
    vid_list = df_vehicles['vid'].unique().tolist()

    # End location for each interval
    df_vehicles['end_pdist'] = df_vehicles['pdist']

    for vid in vid_list:

        # pare data down to a single vehicle
        df_vehicle = df_vehicles.loc[df_vehicles['vid'] == vid]

        # handle each pattern separately
        pid_list = df_vehicle['pid'].unique().tolist()
        for p in pid_list:
            df_vehicle_pattern = df_vehicle.loc[df_vehicle['pid']==p].copy()
            # sort by time (it should be sorted already, but just in case)
            df_vehicle_pattern.sort_values(by=['end_time'], inplace=True)

            # Create a start time based on the previous timestamp
            end_times = df_vehicle_pattern['end_time'].tolist()
            start_times = np.roll(end_times,shift=1)
            df_vehicle_pattern['start_time'] = start_times

            # Create a start pattern distance based on the previous pdist
            end_distances = df_vehicle_pattern['end_pdist'].tolist()
            start_distances = np.roll(end_distances,shift=1)
            df_vehicle_pattern['start_pdist'] = start_distances

            # Remove the first interval since we don't have real start
            # time or location data for it
            df_vehicle_pattern = df_vehicle_pattern.iloc[1:]

            # add data to the full output dataframe
            df_output = pd.concat([df_output, df_vehicle_pattern])

    return df_output
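
The same interval construction can be sketched with a grouped shift rather than a per-vehicle loop with np.roll. This is an alternative sketch, not a drop-in replacement for the function above; the variable names are illustrative, and the sample observations reuse times and pdist values that appear in the output later in this notebook.

```python
import pandas as pd

# A few observations for two vehicles on one pattern
pings = pd.DataFrame({
    'vid': [8297, 8297, 8297, 1510, 1510],
    'pid': [2944, 2944, 2944, 2944, 2944],
    'tmstmp': pd.to_datetime([
        '2023-01-11 05:12', '2023-01-11 05:17', '2023-01-11 05:22',
        '2023-01-11 23:12', '2023-01-11 23:17',
    ]),
    'pdist': [20013, 26191, 31434, 30851, 37723],
})

pings = pings.sort_values(['vid', 'pid', 'tmstmp'])

# The previous observation of the same vehicle on the same pattern
previous = pings.groupby(['vid', 'pid'])[['tmstmp', 'pdist']].shift(1)

intervals = pings.rename(columns={'tmstmp': 'end_time', 'pdist': 'end_pdist'})
intervals['start_time'] = previous['tmstmp']
intervals['start_pdist'] = previous['pdist']

# The first observation in each group has no predecessor, so drop it
intervals = intervals.dropna(subset=['start_time'])
intervals['start_pdist'] = intervals['start_pdist'].astype(int)
print(intervals)
```

Because the shift happens within each (vid, pid) group, the first row of every group is empty rather than wrapping around to the last row, which is the case np.roll requires trimming by hand.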



In [10]:
# Interpolate estimated times the bus arrived at a stop
def interpolate_stop_time(
    stop_pdist:int, 
    start_time:pd.Timestamp, 
    end_time:pd.Timestamp, 
    start_pdist:int, 
    end_pdist:int
    ) -> pd.Timestamp:

    '''This is a helper function.\n
    Parameters:\n
    stop_pdist is an integer distance along a pattern to a given bus stop.\n
    start_time and end_time are timestamps for the beginning and end of an interval.\n
    start_pdist and end_pdist are integer distances along a pattern at the beginning and
    end of an interval.\n
    Data returned:\n
    timestamp for the estimated time a vehicle reached a stop, assuming it
    traveled at a constant speed from start to end of the interval
    '''

    # How far into the interval distance is the bus stop?
    # stop distance from beginning of interval / full interval distance
    dist_ratio = (stop_pdist-start_pdist)/(end_pdist-start_pdist)

    # estimated bus stop time, assuming it traveled at a steady
    # speed throughout the interval
    est_stop_time = start_time + (end_time - start_time)*dist_ratio

    # round estimated stop time to the nearest minute
    # ('min' is the minute alias; the older 'T' alias is deprecated in recent pandas)
    est_stop_time = est_stop_time.round(freq='min')

    return est_stop_time

In [11]:
def get_stoptimes(rt:str, vehicles:pd.DataFrame) -> pd.DataFrame:

    '''This is a helper function.\n
    Parameters:\n
    vehicles is a dataframe obtained using get_chn_vehicles().\n
    rt is a route id as a string (for example, '55' for the 55 Garfield bus)\n
    Data returned:\n
    Columns are added to the vehicles dataframe indicating the start and end time
    and the start and end distances along a pattern for each interval where a bus
    passed a stop (start_time, end_time, start_pdist, end_pdist). The estimated time 
    each bus actually arrived at the stop (est_stop_time) is also added.\n
    The dataframe returned covers all buses at all stops on the specified route'''
 
    # set up a dataframe to contain the output data
    df_output = pd.DataFrame()

    # turn vehicle data into intervals between consecutive observations of each vehicle
    vehicle_intervals = get_vehicle_intervals(vehicles, rt)

    # get pattern data from the CTA
    df_patterns = get_patterns(vehicles, rt)

    # get all stops on this route, including all patterns
    gdf_stops = get_pattern_stops(df_patterns)

    # Consider each combination of stop and pattern
    for stpid, pid, rtdir in list(zip(gdf_stops['stpid'],gdf_stops['pid'], gdf_stops['rtdir'])):

        # get a single stop on a single pattern
        gdf_this_stop_pattern = gdf_stops.loc[(gdf_stops['stpid'] == stpid) & (gdf_stops['pid'] == pid)]
        if len(gdf_this_stop_pattern) == 0:
            continue
            
        # Find the bus stop's distance along the pattern
        pdist_this_stop = gdf_this_stop_pattern['pdist'].tolist()[0]

        # find the intervals that are on this pattern
        df_this_pattern_intervals = vehicle_intervals.loc[vehicle_intervals['pid'] == pid]
        if len(df_this_pattern_intervals) == 0:
            continue

        # Filter for intervals that start before the stop location and end at or beyond the stop
        def filter_intervals(stop_dist:int, start_pdist:int, end_pdist:int):
            return (start_pdist < stop_dist) & (end_pdist >= stop_dist)
    
        # Create filter for the intervals we're working on
        interval_filter = df_this_pattern_intervals.apply(
            lambda x: filter_intervals(pdist_this_stop, x['start_pdist'], x['end_pdist']), axis=1
            )
        
        # apply the filter
        # copy so the column assignments below modify a new dataframe, not a slice
        df_this_pattern_stop_intervals = df_this_pattern_intervals.loc[interval_filter].copy()
        if len(df_this_pattern_stop_intervals) == 0:
            continue

        # Add stpid, pdist, and rtdir to the data
        df_this_pattern_stop_intervals['stpid'] = stpid
        df_this_pattern_stop_intervals['stop_pdist'] = int(pdist_this_stop)
        df_this_pattern_stop_intervals['rtdir'] = rtdir


        # Estimate time each bus passed the stop (interpolated based on data at start and
        # end of the interval)
        df_this_pattern_stop_intervals['est_stop_time'] = df_this_pattern_stop_intervals.apply(
            lambda x: interpolate_stop_time(
                pdist_this_stop, 
                x['start_time'], 
                x['end_time'], 
                x['start_pdist'], 
                x['end_pdist']), axis=1
            )

        # Add the intervals with stop times to the full output dataframe
        df_output = pd.concat([df_output, df_this_pattern_stop_intervals])

    return df_output
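
The interval filter inside the loop above can also be written as a vectorized boolean mask rather than a row-wise apply. The sketch below assumes a small hand-made intervals dataframe; the pdist values and the stop distance (26041) are taken from the sample output later in this notebook.

```python
import pandas as pd

intervals = pd.DataFrame({
    'start_pdist': [20013, 26191, 30851],
    'end_pdist': [26191, 31434, 37723],
})
stop_pdist = 26041

# Keep intervals that start before the stop and end at or beyond it
reached_stop = (intervals['start_pdist'] < stop_pdist) & (intervals['end_pdist'] >= stop_pdist)
passed = intervals.loc[reached_stop]
print(passed)
```

Only the first interval qualifies, since it is the only one whose start and end distances bracket the stop.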


In [12]:

def get_headways(rt:str, vehicles:pd.DataFrame) -> pd.DataFrame:

        '''Parameters:\n
        vehicles is a dataframe obtained using get_chn_vehicles().\n
        rt is a route id as a string (for example, '55' for the 55 Garfield bus)\n
        Data returned:\n
        Columns are added to the vehicles dataframe indicating the start and end time
        and the start and end distances along a pattern for each interval where a bus
        passed a stop (start_time, end_time, start_pdist, end_pdist), the estimated time 
        each bus actually arrived at the stop (est_stop_time), and headway between each 
        bus (est_headway).  Direction of travel (rtdir) and stop id (stpid)
        are also included.\n
        The dataframe returned covers all buses at all stops on the specified route'''

        df_output = pd.DataFrame()

        # Times buses stopped at each stop on the route
        df_stoptimes = get_stoptimes(rt, vehicles).copy()

        # consider all buses stopping at a given stop moving in the same direction
        for stpid, rtdir in set(zip(df_stoptimes['stpid'], df_stoptimes['rtdir'])):

                # filter data
                df_stop_direction = df_stoptimes.loc[(df_stoptimes['stpid'] == stpid) & (df_stoptimes['rtdir'] == rtdir)].copy()

                # Sort chronologically
                df_stop_direction.sort_values(by='est_stop_time',ascending=True, inplace=True)

                # list stop times in chronological order
                stop_times = df_stop_direction['est_stop_time'].tolist()

                # calculate previous stop time for each line
                prev_stop_times = np.roll(stop_times,1)
                df_stop_direction['previous_stop_time'] = prev_stop_times

                # calculate headway
                df_stop_direction['est_headway'] = (
                df_stop_direction['est_stop_time'] - df_stop_direction['previous_stop_time']
                )

                # drop previous stop time column, no longer needed
                df_stop_direction = df_stop_direction.drop('previous_stop_time', axis=1)

                # Remove headway from the first bus in the dataset since we don't have the 
                # previous bus to compare with
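                # (assign by position on the dataframe itself to avoid chained assignment)
                df_stop_direction.iloc[0, df_stop_direction.columns.get_loc('est_headway')] = pd.NaT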
                
                df_output = pd.concat([df_output, df_stop_direction])

        return df_output
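
For comparison, the per-stop headway calculation can be sketched with a grouped diff. This is an alternative sketch under the same assumptions as the function above (one row per estimated stop time, grouped by stop and direction); the stop ids and directions match the sample output below, but the stop times are made up for illustration.

```python
import pandas as pd

stop_times = pd.DataFrame({
    'stpid': ['11827', '11827', '11827', '8535', '8535'],
    'rtdir': ['Northbound'] * 3 + ['Southbound'] * 2,
    'est_stop_time': pd.to_datetime([
        '2023-01-11 07:47', '2023-01-11 07:22', '2023-01-11 08:10',
        '2023-01-11 11:28', '2023-01-11 11:03',
    ]),
})

# Sort chronologically within each stop and direction
stop_times = stop_times.sort_values(['stpid', 'rtdir', 'est_stop_time'])

# Headway = time since the previous bus at the same stop in the same direction.
# The first bus in each group has no predecessor, so its headway is NaT.
stop_times['est_headway'] = (
    stop_times.groupby(['stpid', 'rtdir'])['est_stop_time'].diff()
)
print(stop_times)
```

The first bus at each stop and direction gets NaT automatically, so no separate step is needed to blank it out.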



In [13]:
def get_headway_stats(headways:pd.DataFrame) -> dict:
    '''Parameters:\n
    headways is a dataframe obtained using get_headways().\n
    Data returned:\n
    Statistics on the headways are returned as a dictionary.'''
    est_headways = headways['est_headway']
    stats = {
        'mean':est_headways.mean(),
        'max':est_headways.max(),
        'min':est_headways.min(),
        '25th_percentile':est_headways.quantile(0.25), # 25th percentile
        'median':est_headways.median(), # 50th percentile
        '75th_percentile':est_headways.quantile(0.75)
    }
    return stats




In [14]:
def get_average_wait_time(headways:pd.DataFrame) -> pd.DataFrame:
    '''Parameters:\n
    headways is a dataframe obtained using get_headways().\n
    Data returned:\n
    Average wait time (AWT) value by stop ID.'''

    # AWT = SUM(D^2)/(2T), where D = the duration between arrivals and T = the timeframe duration.
    # When a single headway spans the whole timeframe (D=T), this simplifies to AWT = D/2
    stops = pd.DataFrame()
    stops['stpid'] = headways['stpid'].unique()
    
    AWTs = []
    mean = []
    for stop in stops['stpid']:
        stop_visits = headways[headways['stpid'] == stop]
        
        start = stop_visits['start_time'].min()
        end = stop_visits['end_time'].max()
        # total_seconds() includes whole days; .seconds would drop them for spans of 24 hours or more
        timeframe_duration = (end-start).total_seconds()/60.0

        headway_minutes = stop_visits['est_headway'].dt.total_seconds()/60
        AWT = ((headway_minutes)**2).sum()/(2*timeframe_duration)
        AWTs.append(AWT)

        mean.append(headway_minutes.mean())

    stops['AWT'] = AWTs
    stops['mean_headway'] = mean
    return stops


## Try it out: Get Actual Stop Times and Headways

In [15]:
vehicles = get_chn_vehicles('2023-01-11')

In [16]:
vehicles

Unnamed: 0,vid,tmstmp,lat,lon,hdg,pid,rt,des,pdist,dly,tatripid,origtatripno,tablockid,zone,scrape_file,data_time,data_hour,data_date
3763,7973,2023-01-11 02:32:00,41.822785,-87.606773,1,2642,4,Illinois Center,15439,False,10000881,238694398,N4 -792,,bus_data/2023-01-11/02:32:56.json,2023-01-11 02:32:00,2,2023-01-11
3764,7936,2023-01-11 02:32:00,41.882610,-87.627747,5,2642,4,Illinois Center,48361,False,10000874,238694475,N4 -791,,bus_data/2023-01-11/02:32:56.json,2023-01-11 02:32:00,2,2023-01-11
3765,1086,2023-01-11 02:32:00,41.722362,-87.594824,179,12428,5,69th Red Line,10695,False,10003771,238664228,N95 -192,,bus_data/2023-01-11/02:32:56.json,2023-01-11 02:32:00,2,2023-01-11
3766,1074,2023-01-11 02:32:00,41.773212,-87.598861,268,12428,5,69th Red Line,55633,False,10003770,238664261,N95 -193,,bus_data/2023-01-11/02:32:56.json,2023-01-11 02:32:00,2,2023-01-11
3767,1177,2023-01-11 02:32:00,41.758879,-87.574355,89,12429,5,95th Red Line,21431,False,10003764,238664195,N95 -191,,bus_data/2023-01-11/02:32:56.json,2023-01-11 02:32:00,2,2023-01-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3883,8365,2023-01-12 02:27:00,41.968843,-87.669590,179,16550,81,Wilson/Marine,27267,False,88348720,238703287,N81 -491,,bus_data/2023-01-12/02:27:56.json,2023-01-12 02:27:00,2,2023-01-12
3884,8344,2023-01-12 02:27:00,41.968449,-87.707863,278,16551,81,Jefferson Park Blue Line,21419,False,88348718,238702167,81 -404,,bus_data/2023-01-12/02:27:56.json,2023-01-12 02:27:00,2,2023-01-12
3885,8394,2023-01-12 02:27:00,41.969240,-87.760502,311,16551,81,Jefferson Park Blue Line,37055,True,88348719,238703274,N81 -492,,bus_data/2023-01-12/02:27:56.json,2023-01-12 02:27:00,2,2023-01-12
3886,7993,2023-01-12 02:27:00,41.735798,-87.672447,270,5572,87,Western,13174,False,1082087,238695537,N87 -791,,bus_data/2023-01-12/02:27:56.json,2023-01-12 02:27:00,2,2023-01-12


In [17]:
%%capture --no-display 
# turn off warnings

# Check out the 90 Harlem bus
headways = get_headways('90',vehicles)


In [18]:
headways.sort_values('est_headway', ascending=False)

Unnamed: 0,vid,tmstmp,lat,lon,hdg,pid,rt,des,pdist,dly,...,data_date,end_time,end_pdist,start_time,start_pdist,stpid,stop_pdist,rtdir,est_stop_time,est_headway
168006,1510,2023-01-11 23:17:00,41.982048,-87.808205,28,2944,90,Harlem Blue Line,37723,False,...,2023-01-11,2023-01-11 23:17:00,37723,2023-01-11 23:12:00,30851,11827,37291,Northbound,2023-01-11 23:17:00,0 days 02:40:00
28661,8410,2023-01-11 07:47:00,41.981379,-87.807760,37,2944,90,Harlem Blue Line,37370,False,...,2023-01-11,2023-01-11 07:47:00,37370,2023-01-11 07:42:00,36422,11827,37291,Northbound,2023-01-11 07:47:00,0 days 02:20:00
85678,1396,2023-01-11 13:52:00,41.982210,-87.808134,37,2944,90,Harlem Blue Line,37853,False,...,2023-01-11,2023-01-11 13:52:00,37853,2023-01-11 13:47:00,35958,11827,37291,Northbound,2023-01-11 13:51:00,0 days 02:06:00
148178,1053,2023-01-11 19:32:00,41.981831,-87.808449,52,2944,90,Harlem Blue Line,37690,False,...,2023-01-11,2023-01-11 19:32:00,37690,2023-01-11 19:27:00,33477,11827,37291,Northbound,2023-01-11 19:32:00,0 days 02:05:00
66227,8297,2023-01-11 11:32:00,41.969489,-87.807152,182,5917,90,Harlem Green Line,5470,False,...,2023-01-11,2023-01-11 11:32:00,5470,2023-01-11 11:27:00,0,8535,962,Southbound,2023-01-11 11:28:00,0 days 01:56:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6015,8297,2023-01-11 04:42:00,41.919270,-87.806099,178,5917,90,Harlem Green Line,25323,False,...,2023-01-11,2023-01-11 04:42:00,25323,2023-01-11 04:37:00,19084,11785,19303,Southbound,2023-01-11 04:37:00,NaT
8265,8297,2023-01-11 05:22:00,41.966949,-87.807053,0,2944,90,Harlem Blue Line,31434,False,...,2023-01-11,2023-01-11 05:22:00,31434,2023-01-11 05:17:00,26191,11814,28320,Northbound,2023-01-11 05:19:00,NaT
7918,8297,2023-01-11 05:17:00,41.953436,-87.807220,0,2944,90,Harlem Blue Line,26191,False,...,2023-01-11,2023-01-11 05:17:00,26191,2023-01-11 05:12:00,20013,11807,22987,Northbound,2023-01-11 05:14:00,NaT
7918,8297,2023-01-11 05:17:00,41.953436,-87.807220,0,2944,90,Harlem Blue Line,26191,False,...,2023-01-11,2023-01-11 05:17:00,26191,2023-01-11 05:12:00,20013,15092,26041,Northbound,2023-01-11 05:17:00,NaT


In [19]:
headway_stats = get_headway_stats(headways)
headway_stats

{'mean': Timedelta('0 days 00:26:24.657534246'),
 'max': Timedelta('0 days 02:40:00'),
 'min': Timedelta('0 days 00:06:00'),
 '25th_percentile': Timedelta('0 days 00:21:00'),
 'median': Timedelta('0 days 00:25:00'),
 '75th_percentile': Timedelta('0 days 00:30:00')}
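
Timedelta values like those above can be easier to compare as plain minutes. The sketch below is an illustration on a small made-up series of headways; the NaT entry stands in for the first bus at a stop, which the pandas statistics skip.

```python
import pandas as pd

est_headways = pd.Series([
    pd.Timedelta(minutes=21),
    pd.Timedelta(minutes=25),
    pd.Timedelta(minutes=30),
    pd.NaT,  # first bus at a stop has no headway
])

# Convert to minutes as floats; NaT becomes NaN and is ignored below
minutes = est_headways.dt.total_seconds() / 60

stats_minutes = {
    'min': minutes.min(),
    '25th_percentile': minutes.quantile(0.25),
    'median': minutes.median(),
    '75th_percentile': minutes.quantile(0.75),
    'max': minutes.max(),
}
print(stats_minutes)
```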

In [20]:
AWTs = get_average_wait_time(headways)
AWTs

Unnamed: 0,stpid,AWT,mean_headway
0,15725,13.767873,25.558140
1,11817,13.824434,24.954545
2,14226,26.673303,42.307692
3,8536,21.366364,32.235294
4,11803,13.596364,25.000000
...,...,...,...
104,11785,13.720455,25.511628
105,11814,13.786878,24.977273
106,11807,13.648869,25.000000
107,15092,13.651131,24.977273
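
To see why AWT can differ from half the mean headway, the formula AWT = SUM(D^2)/(2T) can be applied to two made-up sets of headways that cover the same 30-minute timeframe with the same number of buses. The function name and the numbers are illustrative only.

```python
def average_wait_time(headways_minutes, timeframe_minutes):
    """AWT = sum of squared headways / (2 * timeframe), all in minutes."""
    return sum(h ** 2 for h in headways_minutes) / (2 * timeframe_minutes)

# Three evenly spaced buses versus three bunched buses
even = average_wait_time([10, 10, 10], 30)
bunched = average_wait_time([2, 2, 26], 30)
print(even, bunched)
```

Both cases have a mean headway of 10 minutes, but the bunched case more than doubles the average wait because long gaps are weighted by their square.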


In [21]:
# Visualize routes and stops for the 55 bus (no headway info included in visualization yet)
patterns = get_patterns(vehicles, '55')
route = get_pattern_linestrings(patterns)
stops = get_pattern_stops(patterns)

# m = route.explore(color='blue', tiles='CartoDB_positron')
# stops.explore(m=m, color='red')


## Get Scheduled Stops

In [22]:
# Get scheduled stop times from the GTFS feed

# Note: the file google_transit.zip was downloaded
# from https://www.transitchicago.com/downloads/sch_data/
# and the contents of the zip file saved in the gtfs/ directory

df_stop_times = pd.read_csv(
    'gtfs/stop_times.txt', dtype={
        'trip_id':'str',
        'arrival_time':'str',
        'departure_time':'str',
        'stop_id':'int',
        'stop_sequence':'int',
        'stop_headsign':'str',
        'pickup_type':'int',
        'shape_dist_traveled':'int'
        }, infer_datetime_format=True
    )

In [23]:
df_stop_times['arrival_time'] = pd.to_timedelta(df_stop_times['arrival_time'])
df_stop_times['departure_time'] = pd.to_timedelta(df_stop_times['departure_time'])
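
GTFS stop times are stored as offsets from the start of the service day and can exceed 24:00:00 for trips that run past midnight, which is why they are converted to timedeltas here rather than to clock times. The sketch below shows how such offsets could be turned into timestamps by adding them to a service date; the sample times and the date are made up for illustration.

```python
import pandas as pd

# Made-up arrival times, including two that fall after midnight
arrival_times = pd.Series(['23:55:00', '24:05:00', '25:10:00'])
service_date = pd.Timestamp('2023-01-11')

# Offsets over 24 hours roll into the next calendar day
scheduled = service_date + pd.to_timedelta(arrival_times)
print(scheduled)
```

Timestamps built this way line up with the two-day vehicle data, where a late-night bus also appears under the following calendar date.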

In [24]:

# Get trips from the GTFS feed

df_trips = pd.read_csv(
    'gtfs/trips.txt', dtype={
        'route_id':'str',
        'service_id':'str',
        'trip_id':'str',
        'direction_id':'int',
        'block_id':'str',
        'shape_id':'str',
        'direction':'str',
        'wheelchair_accessible':'int',
        'schd_trip_id':'str'
        }, infer_datetime_format=True
    )


In [25]:
df_trips

Unnamed: 0,route_id,service_id,trip_id,direction_id,block_id,shape_id,direction,wheelchair_accessible,schd_trip_id
0,1,65201,6520000656020,0,652000002856,65206351,South,1,656020
1,1,65201,6520001352020,1,652000002860,65208085,North,1,1352020
2,1,65201,6520002271020,0,652000002858,65206351,South,1,2271020
3,1,65201,6520002749020,1,652000003119,65208085,North,1,2749020
4,1,65201,6520003224020,0,652000002858,65206351,South,1,3224020
...,...,...,...,...,...,...,...,...,...
87266,BLS-1,BLS-102,BLS-102-1,1,,BLS-100-1,East,1,
87267,BLS-1,BLS-103,BLS-103-0,0,,BLS-100-0,West,1,
87268,BLS-1,BLS-103,BLS-103-1,1,,BLS-100-1,East,1,
87269,BLS-1,BLS-104,BLS-104-0,0,,BLS-100-0,West,1,


In [69]:
# Get calendar info from the GTFS feed

df_calendar = pd.read_csv(
    'gtfs/calendar.txt', dtype={
        'service_id':'str',
        'monday':'int',
        'tuesday':'int',
        'wednesday':'int',
        'thursday':'int',
        'friday':'int',
        'saturday':'int',
        'sunday':'int',
        'start_date':'str',
        'end_date':'str'}
)

def calendar_date_to_timestamp(yyyymmdd:str) -> dt.datetime:
    year = yyyymmdd[:4]
    month = yyyymmdd[4:6]
    day = yyyymmdd[6:]
    calendar_timestamp = pd.to_datetime(f'{year}-{month}-{day}')
    return calendar_timestamp

df_calendar['start_date'] = df_calendar['start_date'].apply(calendar_date_to_timestamp)
df_calendar['end_date'] = df_calendar['end_date'].apply(calendar_date_to_timestamp)

df_calendar

Unnamed: 0,service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
0,65201,1,1,1,1,1,0,0,2022-12-20,2023-01-07
1,65202,1,1,1,1,0,0,0,2022-12-20,2023-01-07
2,65203,0,1,1,1,1,0,0,2022-12-20,2023-01-07
3,65204,0,1,1,1,1,0,0,2022-12-20,2023-01-07
4,65205,0,0,0,0,1,0,0,2022-12-20,2023-01-07
...,...,...,...,...,...,...,...,...,...,...
210,1070126,1,0,0,1,0,0,0,2022-12-28,2023-02-28
211,1070127,1,0,0,0,1,0,0,2022-12-28,2023-02-28
212,1070128,1,1,0,1,0,0,0,2022-12-28,2023-02-28
213,1070129,1,0,0,1,1,0,0,2022-12-28,2023-02-28


Timestamp('2022-12-20 00:00:00')
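
One possible way to find which services are in effect on a given date, using the calendar columns above. This is only a sketch: `services_in_effect` is a hypothetical helper, the toy rows are copied from the calendar output, and it ignores any exceptions listed in the feed's `calendar_dates.txt`.

```python
import pandas as pd

def services_in_effect(calendar: pd.DataFrame, date: str) -> list:
    """Hypothetical helper: service_ids whose date range and weekday flag cover `date`."""
    day = pd.Timestamp(date)
    # e.g. 'friday', matching the weekday column names in calendar.txt
    weekday_col = day.day_name().lower()
    in_range = (calendar['start_date'] <= day) & (day <= calendar['end_date'])
    runs_today = calendar[weekday_col] == 1
    return calendar.loc[in_range & runs_today, 'service_id'].tolist()

# Two toy rows copied from the calendar output above
toy_calendar = pd.DataFrame({
    'service_id': ['65201', '65205'],
    'monday':    [1, 0],
    'tuesday':   [1, 0],
    'wednesday': [1, 0],
    'thursday':  [1, 0],
    'friday':    [1, 1],
    'saturday':  [0, 0],
    'sunday':    [0, 0],
    'start_date': pd.to_datetime(['2022-12-20', '2022-12-20']),
    'end_date':   pd.to_datetime(['2023-01-07', '2023-01-07']),
})

friday_services = services_in_effect(toy_calendar, '2022-12-23')     # a Friday
wednesday_services = services_in_effect(toy_calendar, '2022-12-21')  # a Wednesday
```

The resulting list could then be used to filter the schedule to the relevant `service_id` values before calculating headways.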

In [43]:


# Join trip data to each stop time
df_stop_schedule = df_stop_times.merge(df_trips, on='trip_id', how='left')

# Convert directions so they match the format from the CTA bus tracker data
# (for example, 'South' becomes 'Southbound').
# Note: some rail routes have a direction of '0', which becomes '0bound'.
df_stop_schedule['direction'] = df_stop_schedule['direction'].apply(lambda x: x+'bound')

df_stop_schedule


Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,shape_dist_traveled,route_id,service_id,direction_id,block_id,shape_id,direction,wheelchair_accessible,schd_trip_id
0,70234473833,0 days 15:46:00,0 days 15:46:00,30081,19,Cottage Grove,0,54960,G,107009,0,70021402196,307000012,Southbound,1,234473833
1,70234473833,0 days 15:49:30,0 days 15:49:30,30382,20,Cottage Grove,0,60167,G,107009,0,70021402196,307000012,Southbound,1,234473833
2,70234473833,0 days 15:52:30,0 days 15:52:30,30214,21,Cottage Grove,0,67980,G,107009,0,70021402196,307000012,Southbound,1,234473833
3,70234473833,0 days 15:55:30,0 days 15:55:30,30059,22,Cottage Grove,0,72656,G,107009,0,70021402196,307000012,Southbound,1,234473833
4,70234473833,0 days 15:57:30,0 days 15:57:30,30246,23,Cottage Grove,0,75158,G,107009,0,70021402196,307000012,Southbound,1,234473833
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5290961,BLS-103-1,0 days 00:13:30,0 days 00:13:30,999990,20,Rosemont Blue Line,0,18300,BLS-1,BLS-103,1,,BLS-100-1,Eastbound,1,
5290962,BLS-104-0,0 days 00:00:00,0 days 00:00:00,999990,10,O'Hare Airport,0,0,BLS-1,BLS-104,0,,BLS-100-0,Westbound,1,
5290963,BLS-104-0,0 days 00:08:30,0 days 00:08:30,999991,20,O'Hare Airport,0,17300,BLS-1,BLS-104,0,,BLS-100-0,Westbound,1,
5290964,BLS-104-1,0 days 00:00:00,0 days 00:00:00,999991,10,Rosemont Blue Line,0,0,BLS-1,BLS-104,1,,BLS-100-1,Eastbound,1,


In [None]:
# join 

In [47]:
df_stop_schedule.loc[df_stop_schedule['route_id'] == '55']

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,shape_dist_traveled,route_id,service_id,direction_id,block_id,shape_id,direction,wheelchair_accessible,schd_trip_id
117282,6530000010020,0 days 04:37:30,0 days 04:37:30,10603,1,St Louis,0,0,55,65301,0,653000003086,65301295,Westbound,1,10020
117285,6530000010020,0 days 04:37:55,0 days 04:37:55,10604,2,St Louis,0,639,55,65301,0,653000003086,65301295,Westbound,1,10020
117286,6530000010020,0 days 04:38:31,0 days 04:38:31,10605,3,St Louis,0,1375,55,65301,0,653000003086,65301295,Westbound,1,10020
117288,6530000010020,0 days 04:39:01,0 days 04:39:01,10606,4,St Louis,0,2061,55,65301,0,653000003086,65301295,Westbound,1,10020
117291,6530000010020,0 days 04:39:32,0 days 04:39:32,10607,5,St Louis,0,2826,55,65301,0,653000003086,65301295,Westbound,1,10020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5287806,6520047668070,0 days 11:28:50,0 days 11:28:50,10618,16,St Louis,0,10566,55,65206,0,652000000928,65201295,Westbound,1,47668070
5287807,6520047668070,0 days 11:29:38,0 days 11:29:38,10619,17,St Louis,0,11315,55,65206,0,652000000928,65201295,Westbound,1,47668070
5287808,6520047668070,0 days 11:30:09,0 days 11:30:09,17022,18,St Louis,0,11843,55,65206,0,652000000928,65201295,Westbound,1,47668070
5287809,6520047668070,0 days 11:30:54,0 days 11:30:54,10621,19,St Louis,0,12700,55,65206,0,652000000928,65201295,Westbound,1,47668070


# Get scheduled headways

In [48]:

# TODO:  One function to calculate either scheduled or actual headways?

# TODO: Separate schedules for different days.  Need to know which service(s) are in
# effect on a given date for this route, and only show those

def get_scheduled_headways(rt:str, stop_schedule:pd.DataFrame) -> pd.DataFrame:
    '''Parameters:\n
    rt is a route id as a string (for example, '55' for the 55 Garfield bus)\n
    stop_schedule is the GTFS stop times dataframe joined to trips (df_stop_schedule)\n
    Data returned:\n
    The stop_schedule rows for the specified route with a scheduled_headway column
    added: the time between each scheduled arrival and the previous scheduled
    arrival at the same stop in the same direction.  The first arrival at each
    stop/direction has no previous bus to compare with, so its headway is NaT.\n
    The dataframe returned covers all scheduled trips at all stops on the specified
    route, regardless of service_id.'''

    # filter to the specified route
    df_stop_schedule = stop_schedule.loc[stop_schedule['route_id'] == rt]

    stop_direction_frames = []

    # consider all buses stopping at a given stop moving in the same direction
    for stop_id, direction in set(zip(df_stop_schedule['stop_id'], df_stop_schedule['direction'])):

        # filter data and sort chronologically (sort_values returns a new
        # dataframe, so the input is never modified)
        df_stop_direction = df_stop_schedule.loc[
            (df_stop_schedule['stop_id'] == stop_id)
            & (df_stop_schedule['direction'] == direction)
            ].sort_values(by='arrival_time', ascending=True)

        # headway = arrival time minus the previous arrival time.  diff() leaves
        # NaT on the first row, which has no previous bus to compare with.
        df_stop_direction['scheduled_headway'] = df_stop_direction['arrival_time'].diff()

        stop_direction_frames.append(df_stop_direction)

    # pd.concat raises on an empty list (for example, if rt is not in the schedule)
    if not stop_direction_frames:
        return pd.DataFrame()

    return pd.concat(stop_direction_frames)

In [49]:
%%capture --no-display 
# turn off warnings

get_scheduled_headways('55',df_stop_schedule)

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,shape_dist_traveled,route_id,service_id,direction_id,block_id,shape_id,direction,wheelchair_accessible,schd_trip_id,scheduled_headway
821283,6520005804020,0 days 00:44:20,0 days 00:44:20,1654,67,Museum of Science & Industry,0,48136,55,65215,1,652000001166,65205424,Eastbound,1,5804020,NaT
3867384,6530032245020,0 days 00:44:20,0 days 00:44:20,1654,67,Museum of Science & Industry,0,48136,55,65315,1,653000001186,65305424,Eastbound,1,32245020,0 days 00:00:00
1568274,6530012532020,0 days 00:44:50,0 days 00:44:50,1654,67,Museum of Science & Industry,0,48136,55,65312,1,653000002978,65305424,Eastbound,1,12532020,0 days 00:00:30
3234478,6520027798020,0 days 00:44:50,0 days 00:44:50,1654,67,Museum of Science & Industry,0,48136,55,65212,1,652000003262,65205424,Eastbound,1,27798020,0 days 00:00:00
5111982,6520046757010,0 days 00:48:20,0 days 00:48:20,1654,67,Museum of Science & Industry,0,48136,55,65208,1,652000004193,65205424,Eastbound,1,46757010,0 days 00:03:30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4465756,6520038033070,0 days 23:49:38,0 days 23:49:38,10579,14,Midway Orange Line,0,8606,55,65206,0,652000000928,65205425,Westbound,1,38033070,0 days 00:00:00
1756173,6530013051070,0 days 23:49:38,0 days 23:49:38,10579,14,Midway Orange Line,0,8606,55,65306,0,653000000897,65305425,Westbound,1,13051070,0 days 00:00:00
193262,6530000586020,0 days 23:49:38,0 days 23:49:38,10579,14,Midway Orange Line,0,8606,55,65301,0,653000003122,65305425,Westbound,1,586020,0 days 00:00:00
1743110,6530013043070,1 days 00:06:38,1 days 00:06:38,10579,14,Midway Orange Line,0,8606,55,65306,0,653000000805,65305425,Westbound,1,13043070,0 days 00:17:00
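
The zero-minute headways in the output above appear because trips from several `service_id` values (different day types) are interleaved at the same stop. A rough sketch of the effect on made-up data: the service ids `A` and `B` and the arrival times are invented for illustration, and grouping by `service_id` is just one possible way to separate the schedules.

```python
import pandas as pd

# Toy schedule: two services that both stop at the same stop in the same direction
toy = pd.DataFrame({
    'service_id': ['A', 'B', 'A', 'B', 'A'],
    'stop_id': [1654] * 5,
    'direction': ['Eastbound'] * 5,
    'arrival_time': pd.to_timedelta(
        ['00:44:20', '00:44:20', '00:54:20', '00:59:20', '01:04:20']
    ),
})

toy = toy.sort_values('arrival_time', kind='stable')

# Headways with all services mixed together (what the function above does)
toy['mixed_headway'] = (
    toy.groupby(['stop_id', 'direction'])['arrival_time'].diff()
)

# Headways calculated separately for each service
toy['service_headway'] = (
    toy.groupby(['service_id', 'stop_id', 'direction'])['arrival_time'].diff()
)
```

In the mixed column, the two trips scheduled at 00:44:20 produce a headway of zero; in the per-service column, service `A` shows a steady 10 minute headway and service `B` a 15 minute headway.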




# Explore late night schedules
Plan to move this section to the data exploration notebook eventually


In [27]:
# find the min and max scheduled times for each route/service/direction combination

df_minmax_schedule_time = df_stop_schedule.groupby(['route_id','service_id','direction'])['arrival_time'].aggregate(['min', 'max'])


In [28]:

df_minmax_schedule_time


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min,max
route_id,service_id,direction,Unnamed: 3_level_1,Unnamed: 4_level_1
1,65201,Northbound,0 days 05:45:00,0 days 18:54:00
1,65201,Southbound,0 days 06:15:30,0 days 19:27:30
1,65301,Northbound,0 days 05:45:00,0 days 18:54:00
1,65301,Southbound,0 days 06:15:30,0 days 19:27:30
100,65201,Eastbound,0 days 05:20:00,0 days 19:36:30
...,...,...,...,...
X98,65212,Southbound,0 days 00:40:00,0 days 01:05:00
X98,65312,Southbound,0 days 00:40:00,0 days 01:05:00
Y,107001,0bound,0 days 04:46:00,0 days 23:29:00
Y,107006,0bound,0 days 06:16:00,0 days 23:24:00


In [29]:


# Find schedules that go past midnight
df_late_schedules = df_minmax_schedule_time.loc[df_minmax_schedule_time['max'] > pd.Timedelta('24:00:00')].copy()


In [30]:

df_late_schedules.sort_values('min')


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min,max
route_id,service_id,direction,Unnamed: 3_level_1,Unnamed: 4_level_1
20,65306,Westbound,0 days 00:15:00,1 days 00:32:00
20,65306,Eastbound,0 days 00:29:00,1 days 00:31:00
62,65309,Northbound,0 days 00:35:00,1 days 00:42:48
62,65209,Northbound,0 days 00:35:00,1 days 00:42:11
62,65206,Northbound,0 days 00:35:00,1 days 00:41:20
62,...,...,...,...
62,65302,Southbound,0 days 23:47:00,1 days 00:43:00
62,65305,Southbound,0 days 23:47:00,1 days 00:43:00
62,65202,Southbound,0 days 23:47:00,1 days 00:43:00
4,65305,Northbound,0 days 23:59:30,1 days 01:03:00


In [31]:


# Hmm.  There are several buses that the schedule shows starting at roughly 
# midnight-1am and running until midnight-1am the next day.  

# Sort for the late night schedules that have the longest run times
df_late_schedules['total_time'] = df_late_schedules['max'] - df_late_schedules['min']

df_late_schedules.sort_values('total_time',ascending=False).head(50)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min,max,total_time
route_id,service_id,direction,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20,65306,Westbound,0 days 00:15:00,1 days 00:32:00,1 days 00:17:00
62,65309,Northbound,0 days 00:35:00,1 days 00:42:48,1 days 00:07:48
62,65209,Northbound,0 days 00:35:00,1 days 00:42:11,1 days 00:07:11
62,65306,Northbound,0 days 00:35:00,1 days 00:41:54,1 days 00:06:54
62,65206,Northbound,0 days 00:35:00,1 days 00:41:20,1 days 00:06:20
20,65306,Eastbound,0 days 00:29:00,1 days 00:31:00,1 days 00:02:00
62,65206,Southbound,0 days 01:32:00,1 days 00:43:00,0 days 23:11:00
62,65306,Southbound,0 days 01:32:00,1 days 00:43:00,0 days 23:11:00
62,65309,Southbound,0 days 01:32:00,1 days 00:35:00,0 days 23:03:00
62,65209,Southbound,0 days 01:32:00,1 days 00:34:00,0 days 23:02:00


In [32]:

# Looks like not much is happening between 1:30 and 3am outside of the
# 24-hour (or nearly 24-hour) schedules.  Perhaps 2:30 is a safe cutoff time
# to begin and end service days system-wide:  it is before the first buses
# start, and long enough after the previous day's schedule ends to allow all
# remaining buses to get to their final stops.

In [33]:


# Sort by earliest start times:  2:30 still works for everything other
# than near-24 hour buses.

df_late_schedules.sort_values('min',ascending=True).head(50)



Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min,max,total_time
route_id,service_id,direction,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20,65306,Westbound,0 days 00:15:00,1 days 00:32:00,1 days 00:17:00
20,65306,Eastbound,0 days 00:29:00,1 days 00:31:00,1 days 00:02:00
62,65309,Northbound,0 days 00:35:00,1 days 00:42:48,1 days 00:07:48
62,65209,Northbound,0 days 00:35:00,1 days 00:42:11,1 days 00:07:11
62,65206,Northbound,0 days 00:35:00,1 days 00:41:20,1 days 00:06:20
62,65306,Northbound,0 days 00:35:00,1 days 00:41:54,1 days 00:06:54
62,65309,Southbound,0 days 01:32:00,1 days 00:35:00,0 days 23:03:00
62,65206,Southbound,0 days 01:32:00,1 days 00:43:00,0 days 23:11:00
62,65306,Southbound,0 days 01:32:00,1 days 00:43:00,0 days 23:11:00
62,65209,Southbound,0 days 01:32:00,1 days 00:34:00,0 days 23:02:00


In [34]:

# Sort by latest end times:  Still looks OK.
# No schedules end too far past 1am.
# 2:30 am should work as a cutoff
# for service days system-wide!

df_late_schedules.sort_values('max',ascending=False).head(50)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min,max,total_time
route_id,service_id,direction,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,65306,Southbound,0 days 05:32:42,1 days 01:07:00,0 days 19:34:18
3,65206,Southbound,0 days 05:32:42,1 days 01:07:00,0 days 19:34:18
Brn,107006,0bound,0 days 04:00:00,1 days 01:05:00,0 days 21:05:00
Blue,107001,Northbound,0 days 03:00:00,1 days 01:03:00,0 days 22:03:00
8,65306,Southbound,0 days 04:05:00,1 days 01:03:00,0 days 20:58:00
Blue,107006,Southbound,0 days 03:00:00,1 days 01:03:00,0 days 22:03:00
8,65206,Southbound,0 days 04:05:00,1 days 01:03:00,0 days 20:58:00
4,65302,Northbound,0 days 23:59:30,1 days 01:03:00,0 days 01:03:30
Blue,1070127,Southbound,0 days 16:52:30,1 days 01:03:00,0 days 08:10:30
4,65305,Northbound,0 days 23:59:30,1 days 01:03:00,0 days 01:03:30
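
A sketch of how the 2:30 am cutoff might be applied to observed stop times so they line up with GTFS-style service days. `add_service_day`, `service_date` and `service_time` are hypothetical names, the 2:30 value is the tentative cutoff from the exploration above, and the timestamps are made up. The near-24-hour schedules (routes 20 and 62) would still need separate handling.

```python
import pandas as pd

# Assumption: the tentative system-wide cutoff from the exploration above
SERVICE_DAY_CUTOFF = pd.Timedelta(hours=2, minutes=30)

def add_service_day(df: pd.DataFrame, time_col: str = 'est_stop_time') -> pd.DataFrame:
    """Hypothetical helper: assign each timestamp to a service day that starts at the cutoff."""
    out = df.copy()
    # Anything before the cutoff belongs to the previous calendar day's service
    out['service_date'] = (out[time_col] - SERVICE_DAY_CUTOFF).dt.normalize()
    # Time since midnight of the service date; can exceed 24 hours, like GTFS times
    out['service_time'] = out[time_col] - out['service_date']
    return out

toy = pd.DataFrame({
    'est_stop_time': pd.to_datetime([
        '2022-12-20 23:47:00',
        '2022-12-21 00:42:00',   # after midnight, still part of the Dec 20 service day
        '2022-12-21 02:45:00',   # after the cutoff, part of the Dec 21 service day
    ])
})
result = add_service_day(toy)
```

With this approach a bus observed at 12:42 am gets a `service_time` of `1 days 00:42:00`, which can be compared directly with scheduled arrival times that run past midnight.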
