# Calculating Headways for CTA Buses

### Next steps

Investigate data scraping. (This is beyond my current skills; I would need to study the CHN approach and other methods.)
- Pattern data from the CTA appears to include only the patterns in effect at the moment you request it from the API (i.e., I can't see weekday patterns if I run this on a Saturday). It would be nice to have a reference for all patterns used in a typical week, if there is such a thing as a typical week. Is it valid to assume patterns don't change much over time?

- Scraping / saving data for all buses on all routes would be a huge, quickly expanding dataset.  Any way to automate and save periodic runs of this kind of analysis and store summary info only?
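
One way to keep only summary information is to reduce each run to a small per-route, per-hour table before saving it. The sketch below is only an illustration: the `headways` frame, its `headway_min` column, and the `summarize_by_hour` helper are hypothetical names, and the numbers are made up rather than taken from CTA data.

```python
import pandas as pd

# Hypothetical headways (in minutes) observed at one stop on one route.
headways = pd.DataFrame({
    'rt': ['90'] * 4,
    'est_stop_time': pd.to_datetime([
        '2023-01-07 05:25', '2023-01-07 05:55',
        '2023-01-07 06:10', '2023-01-07 06:40',
    ]),
    'headway_min': [20.0, 30.0, 15.0, 30.0],
})

def summarize_by_hour(df):
    """Reduce per-bus headways to one row per route and hour."""
    hour = df['est_stop_time'].dt.hour.rename('hour')
    return (
        df.groupby(['rt', hour])['headway_min']
        .agg(['count', 'mean', 'max'])
        .reset_index()
    )

summary = summarize_by_hour(headways)
print(summary)
```

A table like this stays small no matter how many raw pings went into it, so periodic runs could append it to a file (for example with `DataFrame.to_csv` and `mode='a'`) instead of storing the full scrape.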

Consider how to handle overnight routes. CHN data is organized by calendar date. As it stands, the process below eliminates every interval where one timestamp falls before midnight and the next falls after it.

- Combine three days of data (the requested date plus the preceding and following dates), then look for multi-hour gaps to decide where a vehicle's service restarts?

- Or investigate the schedule data from CTA to determine the actual start/end times, adding an hour buffer on either end for our data pulls?

- How many routes run past midnight? Is skipping a single 5 minute interval a big deal?  It's simpler all around to keep everything to calendar days if it doesn't skew things too far.
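
The "combine adjacent days and look for multi-hour gaps" idea could be sketched as follows. This is a rough illustration under stated assumptions: the two small frames stand in for real daily CHN data, the `label_service_runs` helper and its `run` column are hypothetical, the two-hour threshold is an arbitrary guess, and the timestamp format is assumed from the `tmstmp` values shown in the output at the end of this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for two consecutive days of CHN data
# (real frames would come from get_vehicles).
day_1 = pd.DataFrame({'vid': [1, 1],
                      'tmstmp': ['20230107 23:50', '20230107 23:55']})
day_2 = pd.DataFrame({'vid': [1, 1, 1],
                      'tmstmp': ['20230108 00:00', '20230108 00:05', '20230108 06:00']})

def label_service_runs(frames, gap=pd.Timedelta(hours=2)):
    """Concatenate daily frames and number each vehicle's runs,
    starting a new run whenever consecutive pings are more than `gap` apart."""
    df = pd.concat(frames, ignore_index=True)
    df['time'] = pd.to_datetime(df['tmstmp'], format='%Y%m%d %H:%M')
    df = df.sort_values(['vid', 'time']).reset_index(drop=True)

    # Time since the same vehicle's previous ping (NaT for its first ping)
    since_prev = df.groupby('vid')['time'].diff()

    # A run starts at a vehicle's first ping or after a long gap
    starts_run = (since_prev.isna() | (since_prev > gap)).astype(int)
    df['run'] = starts_run.groupby(df['vid']).cumsum()
    return df

runs = label_service_runs([day_1, day_2])
print(runs[['vid', 'time', 'run']])
```

Here the pings at 23:55 and 00:00 stay in the same run, so the interval across midnight is kept, while the 06:00 ping starts a new run.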



In [1]:
import os
import requests
from dotenv import load_dotenv
import pandas as pd
import geopandas as gpd
from shapely import Point, LineString
import datetime as dt
import numpy as np

In [2]:
# Get API key from the .env file
load_dotenv()
API_KEY = os.getenv('API_KEY')

### Functions

In [3]:
def get_vehicles(datestring:str) -> object:
    """Datestring must be in YYYY-MM-DD format. Returns vehicle data scraped by the CHN
    Ghost Buses team for all CTA buses on the specified date."""

    chn_data_source = f'https://chn-ghost-buses-public.s3.us-east-2.amazonaws.com/bus_full_day_data_v2/{datestring}.csv'

    vehicles = pd.read_csv(
        chn_data_source, dtype={
            'vid':'int',
            'tmstmp':'str',
            'lat':'float',
            'lon':'float',
            'hdg':'int',
            'pid':'int',
            'rt':'str',
            'pdist':'int',
            'des':'str',
            'dly':'bool',
            'tatripid':'str',
            'origatripno':'int',
            'tablockid':'str',
            'zone':'str',
            'scrape_file':'str',
            'data_hour':'int',
            'data_date':'str'
            }
        )

    return vehicles


# view test data
# get_vehicles('2023-01-08')


In [4]:

def get_patterns(route:str) -> object:
    '''Get patterns data from the CTA's bus tracker API for a specified route.
    Return patterns as a dataframe'''
    
    # get data from CTA's feed
    api_url = f'http://www.ctabustracker.com/bustime/api/v2/getpatterns?key={API_KEY}&rt={route}&format=json'
    response = requests.get(api_url)
    patterns = response.json()

    # convert json to dataframe
    df_patterns = pd.DataFrame(patterns['bustime-response']['ptr'])

    # convert pt column values to dataframes for each pattern containing that pattern's points
    df_patterns['pt'] = df_patterns['pt'].apply(lambda x: pd.DataFrame(x))
    
    return df_patterns

# get_patterns('20')

In [5]:

def get_pattern_linestrings(route:str) -> object:
    '''Get the patterns data from the CTA's bus tracker API for a specified route.
    Return patterns as a geodataframe with linestring geometry for each pattern'''

    df_patterns = get_patterns(route)

    # Turn points into linestrings
    geometry_linestrings = []
    for p in df_patterns['pt']:
        p.sort_values('seq', inplace=True)
        linestring_points = list(zip(p['lon'],p['lat']))

        # generate linestring using all points
        linestring = LineString(linestring_points)
        geometry_linestrings.append(linestring)

    # Create a geodataframe for the patterns using the linestring geometry
    gdf_patterns = gpd.GeoDataFrame(df_patterns, geometry=geometry_linestrings).set_crs(epsg=4326)

    # Drop the original pt column
    gdf_patterns.drop(['pt'], axis=1, inplace=True)

    return gdf_patterns


In [6]:
def get_pattern_stops(route:str) -> object:
    '''Get the patterns data from the CTA's bus tracker API
    for a specified route. Return patterns as a geodataframe
    with point geometry, one point per bus stop on each pattern
    associated with the route. Note that stops serving multiple
    patterns will be listed multiple times, once for each pattern
    with the seq and pdist values specific to that pattern.'''

    # get patterns for the route
    df_patterns = get_patterns(route)

    # set up a geodataframe to contain stops
    gdf_route_stops = gpd.GeoDataFrame()

    # consider the pid column (pattern ID) and the pt column (dataframe containing
    # points along the pattern)
    for pid, pt in zip(df_patterns['pid'],df_patterns['pt']):
        # sort points sequentially
        pt.sort_values('seq', inplace=True)
        # add the pattern id to each point's data
        pt['pid']=pid
        # filter to only show stop points
        stops = pt[pt['typ']=='S']
        # zip lat/lon data to get coordinate pairs
        coords = list(zip(stops['lon'],stops['lat']))
        # turn coordinates into point geometry
        geometry = [Point(c) for c in coords]
        # generate a geodataframe for the stops in this pattern
        gdf_pattern_stops = gpd.GeoDataFrame(stops,geometry=geometry).set_crs(epsg=4326)
        # add this pattern's stops to the dataframe containing all stops on the route
        gdf_route_stops = pd.concat([gdf_route_stops, gdf_pattern_stops])

    return gdf_route_stops

# get_pattern_stops('20')

In [7]:
# view test data
m = get_pattern_linestrings('90').explore(color='blue', tiles='CartoDB positron')
get_pattern_stops('90').explore(m=m, color='red')


In [8]:
def get_vehicle_intervals(vehicles:object) -> object:

    """vehicles should be a CHN scraped dataframe including vehicle ids (vid),
    timestamps (tmstmp), pattern ids (pid), and pattern distances (pdist) for each vehicle at 
    various times.  Output is a dataframe with each row representing
    an interval between two points in time and space for one vehicle. Columns are added
    for each interval's start time, end time, start pdist, and end pdist."""

    df_vehicles = vehicles.copy()

    # Set up dataframe to contain final formatted data
    df_output = pd.DataFrame()

    # End time for each interval as a timestamp
    df_vehicles['end_time'] = pd.to_datetime(df_vehicles['tmstmp'],infer_datetime_format=True)
    vid_list = df_vehicles['vid'].unique().tolist()

    # End location for each interval
    df_vehicles['end_pdist'] = df_vehicles['pdist']

    for vid in vid_list:

        # pare data down to a single vehicle
        df_vehicle = df_vehicles.loc[df_vehicles['vid'] == vid]

        # handle each pattern separately
        pid_list = df_vehicle['pid'].unique().tolist()
        for p in pid_list:
            df_vehicle_pattern = df_vehicle.loc[df_vehicle['pid']==p].copy()
            # sort by time (it should be sorted already, but just in case)
            df_vehicle_pattern.sort_values(by=['end_time'], inplace=True)

            # Create a start time based on the previous timestamp
            end_times = df_vehicle_pattern['end_time'].tolist()
            start_times = np.roll(end_times,shift=1)
            df_vehicle_pattern['start_time'] = start_times

            # Create a start pattern distance based on the previous pdist
            end_distances = df_vehicle_pattern['end_pdist'].tolist()
            start_distances = np.roll(end_distances,shift=1)
            df_vehicle_pattern['start_pdist'] = start_distances

            # Remove the first interval since we don't have real start
            # time or location data for it
            df_vehicle_pattern = df_vehicle_pattern.iloc[1:]

            # add data to the full output dataframe
            df_output = pd.concat([df_output, df_vehicle_pattern])

    return df_output
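
The nested loops with `np.roll` above could probably be replaced by a grouped `shift`, which pairs each ping with the previous ping for the same vehicle and pattern in one pass. The sketch below is an alternative under stated assumptions, not a drop-in replacement that has been checked against the real data: `intervals_with_shift` is a hypothetical name, the `pings` frame is made-up test data, and the timestamp format is assumed from the `tmstmp` values shown in the output at the end of this notebook.

```python
import pandas as pd

# Hypothetical pings for two vehicles on one pattern
pings = pd.DataFrame({
    'vid': [1, 1, 1, 2, 2],
    'pid': [10, 10, 10, 10, 10],
    'tmstmp': ['20230107 05:00', '20230107 05:05', '20230107 05:10',
               '20230107 05:02', '20230107 05:07'],
    'pdist': [0, 1500, 3200, 400, 2100],
})

def intervals_with_shift(vehicles):
    """Build one row per interval using groupby + shift instead of loops."""
    df = vehicles.copy()
    df['end_time'] = pd.to_datetime(df['tmstmp'], format='%Y%m%d %H:%M')
    df['end_pdist'] = df['pdist']
    df = df.sort_values(['vid', 'pid', 'end_time'])

    # Previous ping for the same vehicle on the same pattern
    grouped = df.groupby(['vid', 'pid'])
    df['start_time'] = grouped['end_time'].shift(1)
    df['start_pdist'] = grouped['end_pdist'].shift(1)

    # The first ping in each group has no predecessor, so drop it
    return df.dropna(subset=['start_time'])

intervals = intervals_with_shift(pings)
print(intervals[['vid', 'pid', 'start_time', 'end_time', 'start_pdist', 'end_pdist']])
```

One difference to be aware of: `shift` leaves a missing value in each group's first row, so `start_pdist` comes back as a float column rather than an integer one.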



In [9]:
# Interpolate estimated times the bus arrived at a stop
def interpolate_stop_time(
    stop_pdist:int,
    start_time:pd.Timestamp,
    end_time:pd.Timestamp,
    start_pdist:int,
    end_pdist:int
    ) -> pd.Timestamp:

    # How far into the interval distance is the bus stop?
    # stop distance from beginning of interval / full interval distance
    dist_ratio = (stop_pdist-start_pdist)/(end_pdist-start_pdist)

    # estimated bus stop time, assuming it traveled at a steady
    # speed throughout the interval
    est_stop_time = start_time + (end_time - start_time)*dist_ratio

    return est_stop_time
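
The same linear interpolation can also be applied to whole columns at once rather than row by row. The sketch below uses made-up intervals and a made-up stop distance to show the arithmetic. A bus that covers 3000 feet in five minutes and passes a stop 1500 feet into the interval is estimated to pass it halfway through, at 2 minutes 30 seconds.

```python
import pandas as pd

# Hypothetical intervals: only the first one spans the stop
intervals = pd.DataFrame({
    'start_time': pd.to_datetime(['2023-01-07 05:00', '2023-01-07 05:05']),
    'end_time': pd.to_datetime(['2023-01-07 05:05', '2023-01-07 05:10']),
    'start_pdist': [1000, 4000],
    'end_pdist': [4000, 7000],
})
stop_pdist = 2500  # hypothetical distance of the stop along the pattern

# Keep intervals that start before the stop and end at or beyond it
spans_stop = (intervals['start_pdist'] < stop_pdist) & (intervals['end_pdist'] >= stop_pdist)
crossing = intervals.loc[spans_stop].copy()

# Fraction of the interval's distance covered when the bus reaches the stop
dist_ratio = (stop_pdist - crossing['start_pdist']) / (crossing['end_pdist'] - crossing['start_pdist'])

# Assume steady speed, so the same fraction of the interval's time has elapsed
crossing['est_stop_time'] = crossing['start_time'] + (crossing['end_time'] - crossing['start_time']) * dist_ratio
print(crossing)
```

Because the filter requires `start_pdist < stop_pdist <= end_pdist`, the denominator is always positive for the rows that remain, so the division is safe.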

In [10]:
# # Q: Are there ever patterns that appear on more than one route?  

# pattern_rt_combos = all_vehicles[['pid','rt']].drop_duplicates()
# pattern_rt_combos.groupby(['pid']).count().sort_values(by='rt', ascending=False)

# # A: No patterns appear with more than one route, at least for the snapshot data available
# #    from the CTA API at the moment this was run.


In [11]:
# Next:

# Which vehicles have passed this route and stop combo (on any pattern)?

# When did they each pass?

# What were the headways?

In [12]:
# Find all vehicles that pass a given stop

# vehicle intervals from above as test data
vehicle_intervals = get_vehicle_intervals(get_vehicles('2023-01-07'))


In [13]:
stpid = '15736'
rt = '90'


In [14]:
# def filter_intervals(stop_dist, start_pdist, end_pdist):
#             return (start_pdist < stop_dist) & (end_pdist >= stop_dist)
        
# df_this_pattern_intervals = df_this_pattern_intervals.apply(
#     lambda x: filter_intervals(pdist_this_stop, x['start_pdist'], x['end_pdist']), axis=1
#     )


# vehicle_intervals.apply(lambda x: filter_intervals(7000, x['start_pdist'],x['end_pdist']))


In [15]:

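def get_times_bus_passed(stpid:str, rt:str, vehicle_intervals:object) -> object:
    '''Estimate when each bus passed stop stpid on route rt. vehicle_intervals
    should be the output of get_vehicle_intervals. Returns the intervals that
    span the stop, with an added est_stop_time column, sorted chronologically.'''
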
    df_output = pd.DataFrame()

    # get all stops on this route, including all patterns
    gdf_route_stops = get_pattern_stops(rt)

    # get a single stop, including all patterns for this route that use the stop
    gdf_this_stop = gdf_route_stops.loc[gdf_route_stops['stpid'] == stpid]

    # if multiple patterns use this stop, consider them separately since they
    # will have different distance data
    pid_list = gdf_this_stop['pid'].tolist()
    for pid in pid_list:

        # filter down to stop info for this stop, only for this pattern
        gdf_this_stop_pattern = gdf_this_stop.loc[gdf_this_stop['pid'] == pid]
        
        pdist_this_stop = gdf_this_stop_pattern['pdist'].tolist()[0]

        # find the intervals that are on this pattern
        df_this_pattern_intervals = vehicle_intervals.loc[vehicle_intervals['pid'] == pid]

        # Filter for intervals that start before the stop location and end at or beyond it
        def filter_intervals(stop_dist:int, start_pdist:int, end_pdist:int) -> bool:
            return (start_pdist < stop_dist) & (end_pdist >= stop_dist)

        # Create filter for the intervals we're working on
        interval_filter = df_this_pattern_intervals.apply(
            lambda x: filter_intervals(pdist_this_stop, x['start_pdist'], x['end_pdist']), axis=1
            )
        
        # apply the filter; copy so the column added below is not written to a view
        df_this_pattern_intervals = df_this_pattern_intervals.loc[interval_filter].copy()

        # Estimate time bus passed the stop (interpolated based on data at start and
        # end of the interval)
        df_this_pattern_intervals['est_stop_time'] = df_this_pattern_intervals.apply(
            lambda x: interpolate_stop_time(
                pdist_this_stop, 
                x['start_time'], 
                x['end_time'], 
                x['start_pdist'], 
                x['end_pdist']), axis=1
            )

        # Add the intervals with estimated times buses pass to the output dataframe
        df_output = pd.concat([df_output, df_this_pattern_intervals])

    # Sort the output chronologically
    df_output.sort_values(by='est_stop_time',ascending=True, inplace=True)

    return df_output
    
test = get_times_bus_passed(stpid, rt, vehicle_intervals)

test

Unnamed: 0,vid,tmstmp,lat,lon,hdg,pid,rt,des,pdist,dly,...,zone,scrape_file,data_time,data_hour,data_date,end_time,end_pdist,start_time,start_pdist,est_stop_time
5960,1514,20230107 04:57,41.887111,-87.803698,85,5917,90,Harlem Green Line,38430,False,...,,bus_data/2023-01-07/04:57:56.json,2023-01-07 04:57:00,4,2023-01-07,2023-01-07 04:57:00,38430,2023-01-07 04:52:00,29386,2023-01-07 04:56:06.859796550
6711,8231,20230107 05:17,41.886765,-87.802773,269,5917,90,Harlem Green Line,40073,False,...,,bus_data/2023-01-07/05:17:56.json,2023-01-07 05:17:00,5,2023-01-07,2023-01-07 05:17:00,40073,2023-01-07 05:12:00,31619,2023-01-07 05:15:04.847409510
7614,1061,20230107 05:37,41.8871,-87.804024,89,5917,90,Harlem Green Line,38341,False,...,,bus_data/2023-01-07/05:37:56.json,2023-01-07 05:37:00,5,2023-01-07,2023-01-07 05:37:00,38341,2023-01-07 05:32:00,34346,2023-01-07 05:35:06.382978723
8401,8256,20230107 05:52,41.887138,-87.802956,87,5917,90,Harlem Green Line,38633,False,...,,bus_data/2023-01-07/05:52:56.json,2023-01-07 05:52:00,5,2023-01-07,2023-01-07 05:52:00,38633,2023-01-07 05:47:00,29179,2023-01-07 05:51:02.722657076
9909,1514,20230107 06:17,41.887116,-87.803568,90,5917,90,Harlem Green Line,38466,False,...,,bus_data/2023-01-07/06:17:56.json,2023-01-07 06:17:00,6,2023-01-07,2023-01-07 06:17:00,38466,2023-01-07 06:12:00,30708,2023-01-07 06:15:56.658932714
11268,8231,20230107 06:37,41.888876,-87.805058,179,5917,90,Harlem Green Line,37352,False,...,,bus_data/2023-01-07/06:37:56.json,2023-01-07 06:37:00,6,2023-01-07,2023-01-07 06:37:00,37352,2023-01-07 06:32:00,29322,2023-01-07 06:36:40.423412204
12737,1061,20230107 06:57,41.887123,-87.804396,89,5917,90,Harlem Green Line,38230,False,...,,bus_data/2023-01-07/06:57:56.json,2023-01-07 06:57:00,6,2023-01-07,2023-01-07 06:57:00,38230,2023-01-07 06:52:00,34545,2023-01-07 06:55:05.861601085
13913,8256,20230107 07:12,41.886787,-87.803253,267,5917,90,Harlem Green Line,40152,False,...,,bus_data/2023-01-07/07:12:56.json,2023-01-07 07:12:00,7,2023-01-07,2023-01-07 07:12:00,40152,2023-01-07 07:07:00,33060,2023-01-07 07:09:39.390862944
16039,1514,20230107 07:37,41.889412,-87.805069,179,5917,90,Harlem Green Line,37155,False,...,,bus_data/2023-01-07/07:37:56.json,2023-01-07 07:37:00,7,2023-01-07,2023-01-07 07:37:00,37155,2023-01-07 07:32:00,29938,2023-01-07 07:36:46.407094360
17395,8231,20230107 07:52,41.888898,-87.805059,177,5917,90,Harlem Green Line,37343,False,...,,bus_data/2023-01-07/07:52:56.json,2023-01-07 07:52:00,7,2023-01-07,2023-01-07 07:52:00,37343,2023-01-07 07:47:00,28687,2023-01-07 07:51:42.151109057
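
With estimated passing times in hand, the headway at the stop is the gap between consecutive buses. A minimal sketch, using the first four `est_stop_time` values copied from the output above (the `headway` and `headway_min` names are just illustrative):

```python
import pandas as pd

# est_stop_time values copied from the first four rows of the output above
est_stop_time = pd.Series(pd.to_datetime([
    '2023-01-07 04:56:06.859796550',
    '2023-01-07 05:15:04.847409510',
    '2023-01-07 05:35:06.382978723',
    '2023-01-07 05:51:02.722657076',
]))

# Headway = time since the previous bus passed the stop
# (the first bus has no predecessor, so its headway is missing)
headway = est_stop_time.sort_values().diff()
headway_min = (headway.dt.total_seconds() / 60).round(1)
print(headway_min)
```

All four rows are on pattern 5917, so these gaps are between buses on the same pattern. If several patterns serve a stop, whether to pool them before taking differences is a separate choice that this sketch does not settle.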
