# Calculating Headways for CTA Buses

### Next steps

Investigate data scraping (beyond my current skills; I would need to study CHN's approach or other methods).
- Pattern data from CTA appears to include only the patterns in effect at the moment you request it from the API (i.e., I can't see weekday patterns if I run this on a Saturday). It would be nice to have a reference for all patterns used in a typical week, if there is such a thing as a typical week. Is it valid to assume patterns don't change much over time?

- Scraping / saving data for all buses on all routes would be a huge, quickly expanding dataset.  Any way to automate and save periodic runs of this kind of analysis and store summary info only?
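
One way the "summary info only" idea could look is sketched below. This is not part of the notebook's pipeline: `summarize_headways` and its column names are hypothetical, and the three input headways are made-up values rather than CTA data. The assumption is that each periodic run reduces one stop's headways to a single row, so the stored dataset grows by one row per stop per day instead of one row per vehicle ping.

```python
import pandas as pd


def summarize_headways(headways: pd.Series, rt: str, stpid: str, service_date: str) -> pd.DataFrame:
    """Collapse one day's headways (a Series of Timedeltas) into a single summary row.

    Hypothetical helper: the column names are illustrative choices only.
    """
    # Work in minutes so the summary is easy to read and to store as plain numbers
    minutes = headways.dropna().dt.total_seconds() / 60
    return pd.DataFrame([{
        'service_date': service_date,
        'rt': rt,
        'stpid': stpid,
        'n_headways': int(minutes.count()),
        'mean_min': minutes.mean(),
        'median_min': minutes.median(),
        'max_min': minutes.max(),
    }])


# Illustrative input only: three made-up headways, not real CTA data
example = pd.Series(pd.to_timedelta(['10min', '20min', '30min']))
summary_row = summarize_headways(example, rt='55', stpid='10608', service_date='2023-01-07')
print(summary_row)
```

Each row could then be appended to a running file, for example with `DataFrame.to_csv` in append mode, and the raw vehicle data discarded after each run.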

Consider how to handle overnight routes.  CHN data is by calendar date. As it stands, the process below eliminates all intervals where one timestamp is before midnight and the next is after.

- Pull 3 days of data (the requested date plus the preceding and following dates), combine them, and look for multi-hour gaps to determine where service restarts?

- Or investigate the schedule data from CTA to determine the actual start/end times, adding an hour buffer on either end for our data pulls?

- How many routes run past midnight? Is skipping a single 5-minute interval a big deal? It's simpler all around to keep everything to calendar days if it doesn't skew things too far.
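
The multi-hour-gap idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `label_service_runs` is a hypothetical helper, the 2-hour threshold is a guess rather than anything taken from CTA schedules, and the timestamps are made up. Pings are sorted by time, and a new "run" starts wherever the gap to the previous ping exceeds the threshold, so a run that crosses midnight stays in one piece.

```python
import pandas as pd


def label_service_runs(timestamps: pd.Series, gap_hours: float = 2.0) -> pd.Series:
    """Return an integer run label per timestamp; the label increments after each long gap.

    Hypothetical helper: gap_hours is an assumed threshold, not a CTA-defined value.
    """
    ordered = timestamps.sort_values()
    # Time since the previous ping (NaT for the first one, which compares as False below)
    gaps = ordered.diff()
    new_run = gaps > pd.Timedelta(hours=gap_hours)
    # Cumulative count of long gaps gives a run number; restore the caller's row order
    return new_run.cumsum().reindex(timestamps.index)


# Made-up pings spanning midnight, followed by a long overnight gap
stamps = pd.Series(pd.to_datetime([
    '2023-01-06 23:50', '2023-01-06 23:55', '2023-01-07 00:00', '2023-01-07 00:05',
    '2023-01-07 04:30', '2023-01-07 04:35',
]))
runs = label_service_runs(stamps)
print(runs.tolist())
```

In this toy input the first four pings share a run label even though they straddle midnight, and the label changes only after the gap of more than four hours.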



In [1]:
import os
import requests
from dotenv import load_dotenv
import pandas as pd
import geopandas as gpd
from shapely import Point, LineString
import datetime as dt
import numpy as np

In [2]:
# Get API key from the .env file
load_dotenv()
API_KEY = os.getenv('API_KEY')

### Functions

In [3]:
def get_chn_vehicles(datestring:str) -> pd.DataFrame:
    """Datestring must be in YYYY-MM-DD format. Returns vehicle data scraped by the chn
    ghost bus team for all CTA buses on the specified date."""

    chn_data_source = f'https://chn-ghost-buses-public.s3.us-east-2.amazonaws.com/bus_full_day_data_v2/{datestring}.csv'

    vehicles = pd.read_csv(
        chn_data_source, dtype={
            'vid':'int',
            'tmstmp':'str',
            'lat':'float',
            'lon':'float',
            'hdg':'int',
            'pid':'int',
            'rt':'str',
            'pdist':'int',
            'des':'str',
            'dly':'bool',
            'tatripid':'str',
            'origtatripno':'int',
            'tablockid':'str',
            'zone':'str',
            'scrape_file':'str',
            'data_hour':'int',
            'data_date':'str'
            }
        )

    return vehicles


In [4]:
def get_pid_list(vehicles:pd.DataFrame, rt:str) -> list:
    """vehicles should be a CHN scraped dataframe. See get_chn_vehicles function. 
    Output is a list of pid values found in the data for the specified route."""
    
    rt_vehicles = vehicles.loc[vehicles['rt'] == rt]
    pid_list = list(rt_vehicles['pid'].unique())
    return pid_list

In [5]:
def get_patterns(vehicles:pd.DataFrame, rt:str) -> pd.DataFrame:
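    '''Get patterns data from the CTA's bus tracker API for every pid value
    found in the vehicles data for the specified route. vehicles should be a
    CHN scraped dataframe. See get_chn_vehicles function. Return patterns as
    a dataframe'''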

    df_output = pd.DataFrame()

    # filter vehicles to the specified route
    rt_vehicles = vehicles.loc[vehicles['rt'] == rt]

    # list pid values included in the route
    pid_list = list(rt_vehicles['pid'].unique())

    # convert pids to strings
    pid_list = [str(i) for i in pid_list]

    # split pid_list into chunks of 10 (limit of the API):
    start = 0
    end = len(pid_list)
    step = 10
    for i in range(start, end, step):
        pid_list_chunk = pid_list[i:i+step]
        pid_string = ','.join(pid_list_chunk)

        # get data from CTA's feed
        api_url = f'http://www.ctabustracker.com/bustime/api/v2/getpatterns?key={API_KEY}&pid={pid_string}&format=json'
        response = requests.get(api_url)
        patterns = response.json()

        # convert json to dataframe
        df_patterns = pd.DataFrame(patterns['bustime-response']['ptr'])

        # add to the output dataframe
        df_output = pd.concat([df_output, df_patterns])


    # convert pt column values to dataframes for each pattern containing that pattern's points
    df_output['pt'] = df_output['pt'].apply(lambda x: pd.DataFrame(x))
    
    return df_output
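
The chunking step in the loop above can be tried on its own without calling the API. The sketch below is a minimal stand-alone version: `chunked` and `extract_patterns` are hypothetical helpers, the pid values are made up, and the idea that a response might come back without a `ptr` key is an assumption rather than something confirmed here.

```python
def chunked(items: list, size: int = 10) -> list:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def extract_patterns(payload: dict) -> list:
    """Return the list under 'ptr', or an empty list if the key is missing.

    Hypothetical helper: assumes a response may lack 'ptr' (for example on an error).
    """
    return payload.get('bustime-response', {}).get('ptr', [])


# Made-up pid values, for illustration only
pids = [str(p) for p in range(1, 24)]
pid_strings = [','.join(chunk) for chunk in chunked(pids)]
print(pid_strings)

# A response without 'ptr' yields an empty list rather than a KeyError
print(extract_patterns({'bustime-response': {}}))
```

With 23 pids this produces three comma-separated strings (10, 10, and 3 values), one per request.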


In [6]:
##########
# Deprecated:  The CTA API seems to only provide patterns for shapes currently in use
# at the time the data is requested. Switch to querying a list of PID numbers to get the pattern data for
# any day
##########

# def get_patterns(route:str) -> pd.DataFrame:
#     '''Get patterns data from the CTA's bus tracker API for a specified route.
#     Return patterns as a dataframe'''
    
#     # get data from CTA's feed
#     api_url = f'http://www.ctabustracker.com/bustime/api/v2/getpatterns?key={API_KEY}&rt={route}&format=json'
#     response = requests.get(api_url)
#     patterns = response.json()

#     # convert json to dataframe
#     df_patterns = pd.DataFrame(patterns['bustime-response']['ptr'])

#     # convert pt column values to dataframes for each pattern containing that pattern's points
#     df_patterns['pt'] = df_patterns['pt'].apply(lambda x: pd.DataFrame(x))
    
#     return df_patterns


In [7]:

def get_pattern_linestrings(patterns:pd.DataFrame) -> gpd.GeoDataFrame:
    '''Use the get_patterns function to generate patterns input for this function.
    Return patterns as a geodataframe with linestring geometry for each pattern'''

    df_patterns = patterns.copy()

    # Turn points into linestrings
    geometry_linestrings = []
    for p in df_patterns['pt']:
        p.sort_values('seq', inplace=True)
        linestring_points = list(zip(p['lon'],p['lat']))

        # generate linestring using all points
        linestring = LineString(linestring_points)
        geometry_linestrings.append(linestring)

    # Create a geodataframe for the patterns using the linestring geometry
    gdf_patterns = gpd.GeoDataFrame(df_patterns, geometry=geometry_linestrings).set_crs(epsg=4326)

    # Drop the original pt column
    gdf_patterns.drop(['pt'], axis=1, inplace=True)

    return gdf_patterns


In [8]:
def get_pattern_stops(patterns) -> gpd.GeoDataFrame:
        '''Use the get_patterns function to generate patterns input
        for this function. Return stops as a geodataframe 
        with point geometry, one point per bus stop on each pattern
        associated with the route. Note that stops serving multiple
        patterns will be listed multiple times, once for each pattern
        with the seq and pdist values specific to that pattern.'''

        # get patterns for the route
        df_patterns = patterns.copy()

        # set up a geodataframe to contain stops
        gdf_route_stops = gpd.GeoDataFrame()

        # consider the pid column (pattern ID) and the pt column (dataframe containing
        # points along the pattern)
        for pid, pt in zip(df_patterns['pid'],df_patterns['pt']):
                # sort points sequentially
                pt.sort_values('seq', inplace=True)
                # add the pattern id to each point's data
                pt['pid']=pid
                # filter to only show stop points
                stops = pt[pt['typ']=='S']
                # zip lat/lon data to get coordinate pairs
                coords = list(zip(stops['lon'],stops['lat']))
                # turn coordinates into point geometry
                geometry = [Point(c) for c in coords]
                # generate a geodataframe for the stops in this pattern
                gdf_pattern_stops = gpd.GeoDataFrame(stops,geometry=geometry).set_crs(epsg=4326)
                # add this pattern's stops to the dataframe containing all stops on the route
                gdf_route_stops = pd.concat([gdf_route_stops, gdf_pattern_stops])

        return gdf_route_stops


In [9]:
def get_vehicle_intervals(vehicles:pd.DataFrame, rt:str) -> pd.DataFrame:

    """vehicles should be a CHN scraped dataframe including vehicle ids (vid),
    timestamps (tmstmp), pattern ids (pid), and pattern distances (pdist) for each vehicle at 
    various times.  See get_chn_vehicles function. Output is a dataframe with each row representing
    an interval between two points in time and space for one vehicle. Columns are added
    for each interval's start time, end time, start pdist, and end pdist."""

    df_vehicles = vehicles.copy()

    # filter to the specified route
    df_vehicles = df_vehicles.loc[df_vehicles['rt'] == rt]

    # Set up dataframe to contain final formatted data
    df_output = pd.DataFrame()

    # End time for each interval as a timestamp
    # tmstmp values look like '20230107 00:02', so parse them with an explicit format
    df_vehicles['end_time'] = pd.to_datetime(df_vehicles['tmstmp'], format='%Y%m%d %H:%M')
    vid_list = df_vehicles['vid'].unique().tolist()

    # End location for each interval
    df_vehicles['end_pdist'] = df_vehicles['pdist']

    for vid in vid_list:

        # pare data down to a single vehicle
        df_vehicle = df_vehicles.loc[df_vehicles['vid'] == vid]

        # handle each pattern separately
        pid_list = df_vehicle['pid'].unique().tolist()
        for p in pid_list:
            df_vehicle_pattern = df_vehicle.loc[df_vehicle['pid']==p].copy()
            # sort by time (it should be sorted already, but just in case)
            df_vehicle_pattern.sort_values(by=['end_time'], inplace=True)

            # Create a start time based on the previous timestamp
            end_times = df_vehicle_pattern['end_time'].tolist()
            start_times = np.roll(end_times,shift=1)
            df_vehicle_pattern['start_time'] = start_times

            # Create a start pattern distance based on the previous pdist
            end_distances = df_vehicle_pattern['end_pdist'].tolist()
            start_distances = np.roll(end_distances,shift=1)
            df_vehicle_pattern['start_pdist'] = start_distances

            # Remove the first interval since we don't have real start
            # time or location data for it
            df_vehicle_pattern = df_vehicle_pattern.iloc[1:]

            # add data to the full output dataframe
            df_output = pd.concat([df_output, df_vehicle_pattern])

    return df_output
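
The same "previous row" logic can also be written without the nested loops. The sketch below is an alternative, not the method used above: `add_interval_starts` is a hypothetical helper and the five pings are made-up data. It relies on `groupby(...).shift(1)`, which takes the previous row within each (vid, pid) group, so the first ping of each group ends up with no start values and is dropped.

```python
import pandas as pd


def add_interval_starts(pings: pd.DataFrame) -> pd.DataFrame:
    """Add start_time and start_pdist columns taken from the previous ping of the
    same vehicle on the same pattern. Hypothetical helper for illustration."""
    out = pings.sort_values(['vid', 'pid', 'end_time']).copy()
    grouped = out.groupby(['vid', 'pid'])
    out['start_time'] = grouped['end_time'].shift(1)
    out['start_pdist'] = grouped['end_pdist'].shift(1)
    # The first ping in each group has no predecessor, so it cannot form an interval
    out = out.dropna(subset=['start_time'])
    out['start_pdist'] = out['start_pdist'].astype(int)
    return out


# Made-up pings: two vehicles on one pattern
pings = pd.DataFrame({
    'vid': [1, 1, 1, 2, 2],
    'pid': [10, 10, 10, 10, 10],
    'end_time': pd.to_datetime([
        '2023-01-07 00:00', '2023-01-07 00:05', '2023-01-07 00:10',
        '2023-01-07 00:02', '2023-01-07 00:07',
    ]),
    'end_pdist': [0, 1000, 2500, 500, 1800],
})
intervals = add_interval_starts(pings)
print(intervals)
```

Five pings across two vehicles give three intervals, since each vehicle's first ping has nothing before it.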



In [20]:
# Interpolate the estimated time a bus arrived at a stop
def interpolate_stop_time(
    stop_pdist:int, 
    start_time:pd.Timestamp, 
    end_time:pd.Timestamp, 
    start_pdist:int, 
    end_pdist:int
    ) -> pd.Timestamp:

    # How far into the interval distance is the bus stop?
    # stop distance from beginning of interval / full interval distance
    dist_ratio = (stop_pdist-start_pdist)/(end_pdist-start_pdist)

    # estimated bus stop time, assuming it traveled at a steady
    # speed throughout the interval
    est_stop_time = start_time + (end_time - start_time)*dist_ratio

    # round estimated stop time to the nearest minute
    # ('min' is the current pandas alias; 'T' is deprecated in recent versions)
    est_stop_time = est_stop_time.round(freq='min')

    return est_stop_time
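
A worked example may make the interpolation easier to follow. The interval endpoints below are copied from the first row of the route 55 output later in this notebook (pdist 23441 at 00:07 and 30357 at 00:12); the stop's pdist of 29000 is a hypothetical value chosen only for illustration, since the real value comes from the pattern data.

```python
import pandas as pd

# Interval endpoints taken from the first row of the route 55 output
start_time = pd.Timestamp('2023-01-07 00:07:00')
end_time = pd.Timestamp('2023-01-07 00:12:00')
start_pdist, end_pdist = 23441, 30357

# Hypothetical stop position along the pattern (assumption, not from the data)
stop_pdist = 29000

# Fraction of the interval's distance covered when the bus reaches the stop
dist_ratio = (stop_pdist - start_pdist) / (end_pdist - start_pdist)

# Same fraction of the interval's duration, assuming steady speed
est_stop_time = start_time + (end_time - start_time) * dist_ratio
est_stop_time_rounded = est_stop_time.round('min')

print(round(dist_ratio, 3), est_stop_time_rounded)
```

The stop sits about 80% of the way through the interval's distance, so the estimate lands about 4 of the 5 minutes in, which rounds to 00:11.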

In [18]:

def get_headways(stpid:str, rt:str, vehicles:pd.DataFrame) -> pd.DataFrame:
    '''Input:  vehicles should be a CHN scraped dataframe including vehicle ids (vid),
    timestamps (tmstmp), pattern ids (pid), and pattern distances (pdist) for each vehicle at 
    various times.  This can be obtained using the get_chn_vehicles function. Output is a
    pandas dataframe that includes estimated bus stop times for all buses on the
    specified route. Headways are calculated for every bus except the first one
    in the dataset, since there is no previous bus to compare times with'''

    df_output = pd.DataFrame()

    vehicle_intervals = get_vehicle_intervals(vehicles, rt)

    patterns = get_patterns(vehicles, rt)

    # get all stops on this route, including all patterns
    gdf_route_stops = get_pattern_stops(patterns)

    # get a single stop, including all patterns for this route that use the stop
    gdf_this_stop = gdf_route_stops.loc[gdf_route_stops['stpid'] == stpid]

    # if multiple patterns use this stop, consider them separately since they
    # will have different distance data
    pid_list = gdf_this_stop['pid'].tolist()
    for pid in pid_list:

        # filter down to stop info for this stop, only for this pattern
        gdf_this_stop_pattern = gdf_this_stop.loc[gdf_this_stop['pid'] == pid]
        
        pdist_this_stop = gdf_this_stop_pattern['pdist'].tolist()[0]

        # find the intervals that are on this pattern
        df_this_pattern_intervals = vehicle_intervals.loc[vehicle_intervals['pid'] == pid]

        # Filter for intervals that start before the stop location and end at or beyond the stop
        def filter_intervals(stop_dist:int, start_pdist:int, end_pdist:int) -> bool:
            return (start_pdist < stop_dist) & (end_pdist >= stop_dist)

        # Create filter for the intervals we're working on
        interval_filter = df_this_pattern_intervals.apply(
            lambda x: filter_intervals(pdist_this_stop, x['start_pdist'], x['end_pdist']), axis=1
            )
        
        # apply the filter (copy so the new column below is added to an independent dataframe)
        df_this_pattern_intervals = df_this_pattern_intervals.loc[interval_filter].copy()

        # Estimate time bus passed the stop (interpolated based on data at start and
        # end of the interval)
        df_this_pattern_intervals['est_stop_time'] = df_this_pattern_intervals.apply(
            lambda x: interpolate_stop_time(
                pdist_this_stop, 
                x['start_time'], 
                x['end_time'], 
                x['start_pdist'], 
                x['end_pdist']), axis=1
            )

        # Add the intervals with estimated times buses pass to the output dataframe
        df_output = pd.concat([df_output, df_this_pattern_intervals])

    # Sort the output chronologically
    df_output.sort_values(by='est_stop_time',ascending=True, inplace=True)

    # Calculate headways
    
    stop_times = df_output['est_stop_time'].tolist()

    # calculate previous stop time for each line
    prev_stop_times = np.roll(stop_times,1)
    df_output['previous_stop_time'] = prev_stop_times

    # calculate headway
    df_output['est_headway'] = df_output['est_stop_time']-df_output['previous_stop_time']

    # drop previous stop time column, no longer needed
    df_output = df_output.drop('previous_stop_time', axis=1)

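    # Remove headway from the first bus in the dataset since we don't have the 
    # previous bus to compare with. Assign by position in a single indexing step
    # to avoid chained assignment.
    df_output.iloc[0, df_output.columns.get_loc('est_headway')] = pd.NaT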

    return df_output
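
The headway step can be checked in isolation with `Series.diff`, which is an alternative to the `np.roll` approach above rather than what the function uses. In this sketch the four stop times are the estimated stop times from the first rows of the route 55 output later in this notebook, deliberately shuffled to show that sorting comes first.

```python
import pandas as pd

# Estimated stop times from the first rows of the route 55 output, out of order
stop_times = pd.Series(pd.to_datetime([
    '2023-01-07 00:49', '2023-01-07 00:11', '2023-01-07 01:14', '2023-01-07 01:40',
]))

# Sort chronologically, then subtract each time from the one after it
ordered = stop_times.sort_values().reset_index(drop=True)
headways = ordered.diff()

print(headways)
```

`diff` leaves the first entry as `NaT` on its own, so no separate step is needed to blank out the first bus, and there is no wrap-around value from the last row as there is with `np.roll`.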
    


In [12]:
def get_headway_stats(headways:pd.DataFrame) -> dict:
    est_headways = headways['est_headway']
    stats = {
        'mean':est_headways.mean(),
        'max':est_headways.max(),
        'min':est_headways.min(),
        '25th_pctile':est_headways.quantile(0.25), # 25th percentile
        'median':est_headways.median(), # 50th percentile
        '75th_pctile':est_headways.quantile(0.75)
    }
    return stats




## Try it out

In [13]:
# Get vehicles - I set the date to exactly one week ago today (needed atm because I don't have historic data on 
# patterns.  I'm assuming the snapshot I get today also applies to the same day last week.)

vehicles = get_chn_vehicles('2023-01-07')

vehicles.head()


Unnamed: 0,vid,tmstmp,lat,lon,hdg,pid,rt,des,pdist,dly,tatripid,origtatripno,tablockid,zone,scrape_file,data_time,data_hour,data_date
0,1240,20230107 00:02,41.888995,-87.624241,10,18414,3,Michigan/Chicago,66090,False,398,235351548,3 -707,,bus_data/2023-01-07/00:02:56.json,2023-01-07 00:02:00,0,2023-01-07
1,7963,20230107 00:02,41.894003,-87.61994,178,18414,3,Michigan/Chicago,71034,True,397,235351550,3 -712,,bus_data/2023-01-07/00:02:56.json,2023-01-07 00:02:00,0,2023-01-07
2,1359,20230107 00:02,41.868737,-87.624199,180,18415,3,95th/RED LINE,13126,False,1080331,235351525,3 -758,,bus_data/2023-01-07/00:02:56.json,2023-01-07 00:02:00,0,2023-01-07
3,1296,20230107 00:02,41.829096,-87.617164,178,18415,3,95th/RED LINE,29141,False,1080330,235351566,3 -713,,bus_data/2023-01-07/00:02:56.json,2023-01-07 00:02:00,0,2023-01-07
4,7977,20230107 00:02,41.754945,-87.61507,179,18415,3,95th/RED LINE,56248,False,1080329,235361260,N4 -793,,bus_data/2023-01-07/00:02:56.json,2023-01-07 00:02:00,0,2023-01-07


In [21]:
# Check out the 55 Garfield bus:
patterns = get_patterns(vehicles, '55')
m = get_pattern_linestrings(patterns).explore(color='blue', tiles='CartoDB positron')
get_pattern_stops(patterns).explore(m=m, color='red')


In [23]:
# Check headways at bus stop 10608 on route 55
headways = get_headways('10608','55', vehicles)
headways.head()



Unnamed: 0,vid,tmstmp,lat,lon,hdg,pid,rt,des,pdist,dly,...,scrape_file,data_time,data_hour,data_date,end_time,end_pdist,start_time,start_pdist,est_stop_time,est_headway
505,1027,20230107 00:12,41.793842,-87.681118,268,5425,55,Midway Orange Line,30357,False,...,bus_data/2023-01-07/00:12:56.json,2023-01-07 00:12:00,0,2023-01-07,2023-01-07 00:12:00,30357,2023-01-07 00:07:00,23441,2023-01-07 00:11:00,NaT
1795,8076,20230107 00:52,41.793701,-87.693359,267,5425,55,Midway Orange Line,33899,False,...,bus_data/2023-01-07/00:52:56.json,2023-01-07 00:52:00,0,2023-01-07,2023-01-07 00:52:00,33899,2023-01-07 00:47:00,27148,2023-01-07 00:49:00,0 days 00:38:00
2412,8199,20230107 01:17,41.793774,-87.689868,267,5425,55,Midway Orange Line,32912,False,...,bus_data/2023-01-07/01:17:56.json,2023-01-07 01:17:00,1,2023-01-07,2023-01-07 01:17:00,32912,2023-01-07 01:12:00,27093,2023-01-07 01:14:00,0 days 00:25:00
2912,8228,20230107 01:42,41.793869,-87.684624,269,5425,55,Midway Orange Line,31405,False,...,bus_data/2023-01-07/01:42:56.json,2023-01-07 01:42:00,1,2023-01-07,2023-01-07 01:42:00,31405,2023-01-07 01:37:00,25069,2023-01-07 01:40:00,0 days 00:26:00
3693,8076,20230107 02:42,41.793648,-87.69593,269,1293,55,St Louis,34643,False,...,bus_data/2023-01-07/02:42:56.json,2023-01-07 02:42:00,2,2023-01-07,2023-01-07 02:42:00,34643,2023-01-07 02:37:00,26944,2023-01-07 02:38:00,0 days 00:58:00


In [24]:

# Headway stats for this bus stop
stats = get_headway_stats(headways)
stats

{'mean': Timedelta('0 days 00:19:08.918918918'),
 'max': Timedelta('0 days 00:58:00'),
 'min': Timedelta('0 days 00:00:00'),
 '25th_pctile': Timedelta('0 days 00:12:00'),
 'median': Timedelta('0 days 00:15:30'),
 '75th_pctile': Timedelta('0 days 00:25:00')}

In [None]:
# # view test data
# m = get_pattern_linestrings('90').explore(color='blue', tiles='CartoDB positron')
# get_pattern_stops('90').explore(m=m, color='red')


In [None]:
# # Q: Are there ever patterns that appear on more than one route?  

# pattern_rt_combos = all_vehicles[['pid','rt']].drop_duplicates()
# pattern_rt_combos.groupby(['pid']).count().sort_values(by='rt', ascending=False)

# # A: No patterns appear with more than one route, at least for the snapshot data available
# #    from the CTA API at the moment this was run.
