# Calculating Headways for CTA Buses

### Next steps

Investigate data scraping. This is beyond my current skills, so I would need to study how CHN and others approach it.
- Pattern data from the CTA appears to include only the patterns in effect at the moment of the API request (i.e., I can't see weekday patterns if I run this on a Saturday). It would be nice to have a reference for all patterns used in a typical week, if there is such a thing as a typical week. Is it valid to assume patterns don't change much over time?

- Scraping and saving data for all buses on all routes would produce a huge, quickly growing dataset. Is there a way to automate periodic runs of this kind of analysis and store only summary info?
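One possible shape for the "summary info only" idea, as a rough sketch rather than a settled design: collapse a day of raw vehicle pings to one row per route and hour. The function name `summarize_day` and the two summary columns (`n_pings`, `n_vehicles`) are assumptions made for illustration; the input columns (`data_date`, `rt`, `data_hour`, `vid`) are the ones listed for the CHN data below. The toy rows are made up.

```python
import pandas as pd

def summarize_day(vehicles):
    """Collapse raw vehicle pings to one row per date, route, and hour.
    (Hypothetical helper; which statistics are worth keeping is an open question.)"""
    summary = (
        vehicles.groupby(['data_date', 'rt', 'data_hour'])
        .agg(
            n_pings=('vid', 'size'),        # number of scraped rows
            n_vehicles=('vid', 'nunique'),  # distinct buses seen
        )
        .reset_index()
    )
    return summary

# Made-up rows standing in for one day of scraped data
toy = pd.DataFrame({
    'data_date': ['2023-01-08'] * 5,
    'rt': ['90', '90', '90', '20', '20'],
    'data_hour': [10, 10, 11, 10, 10],
    'vid': [1, 2, 1, 5, 5],
})
summary = summarize_day(toy)
print(summary)
```

Each day's summary could then be appended to a single file, which stays far smaller than the raw pings.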

Consider how to handle overnight routes. CHN data is organized by calendar date. As it stands, the process below drops every interval where one timestamp is before midnight and the next is after.

- Combine 3 days of data (the requested date plus the preceding and following dates), then look for multi-hour gaps to decide where to restart?

- Or investigate the schedule data from CTA to determine the actual start/end times, adding an hour buffer on either end for our data pulls?

- How many routes run past midnight? Is skipping a single 5-minute interval a big deal? It's simpler all around to keep everything to calendar days if that doesn't skew things too far.
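The "look for multi-hour gaps" option could be sketched roughly as follows. This is only an illustration: the helper name `split_on_gaps`, the 2-hour threshold, and the sample timestamps are all assumptions, not anything taken from the CHN or CTA data.

```python
import pandas as pd

def split_on_gaps(timestamps, max_gap=pd.Timedelta(hours=2)):
    """Label each timestamp with a run number that increments whenever the gap
    since the previous timestamp exceeds max_gap. (Hypothetical helper.)"""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    # The first diff is NaT, which compares as False, so the first run is 0
    new_run = ts.diff() > max_gap
    return pd.DataFrame({'tmstmp': ts, 'run': new_run.cumsum()})

# Made-up timestamps: service crosses midnight, then stops for several hours
sample = [
    '2023-01-07 23:50:00',
    '2023-01-07 23:55:00',
    '2023-01-08 00:00:00',
    '2023-01-08 00:05:00',
    '2023-01-08 04:30:00',
    '2023-01-08 04:35:00',
]
runs = split_on_gaps(sample)
print(runs)
```

With this labeling, the interval that straddles midnight stays inside one run, and only the long overnight gap starts a new one.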



In [128]:
import os
import requests
from dotenv import load_dotenv
import pandas as pd
import geopandas as gpd
from shapely import Point, LineString
import datetime as dt
import numpy as np

In [13]:
# Get API key from the .env file
load_dotenv()
API_KEY = os.getenv('API_KEY')

### Functions

In [23]:
def get_vehicles(datestring):
    """Datestring must be in YYYY-MM-DD format. Returns vehicle data scraped by the CHN
    Ghost Buses team for all CTA buses on the specified date."""

    chn_data_source = f'https://chn-ghost-buses-public.s3.us-east-2.amazonaws.com/bus_full_day_data_v2/{datestring}.csv'

    vehicles = pd.read_csv(
        chn_data_source, dtype={
            'vid':'int',
            'tmstmp':'str',
            'lat':'float',
            'lon':'float',
            'hdg':'int',
            'pid':'int',
            'rt':'str',
            'pdist':'int',
            'des':'str',
            'dly':'bool',
            'tatripid':'str',
            'origatripno':'int',
            'tablockid':'str',
            'zone':'str',
            'scrape_file':'str',
            'data_hour':'int',
            'data_date':'str'
            }
        )

    return vehicles


# view test data
# get_vehicles('2023-01-08')


In [24]:

def get_patterns(route):
    '''Get patterns data from the CTA's bus tracker API for a specified route.
    Return patterns as a dataframe'''
    
    # get data from CTA's feed
    api_url = f'http://www.ctabustracker.com/bustime/api/v2/getpatterns?key={API_KEY}&rt={route}&format=json'
    response = requests.get(api_url)
    patterns = response.json()

    # convert json to dataframe
    df_patterns = pd.DataFrame(patterns['bustime-response']['ptr'])

    # convert pt column values to dataframes for each pattern containing that pattern's points
    df_patterns['pt'] = df_patterns['pt'].apply(lambda x: pd.DataFrame(x))
    
    return df_patterns

# get_patterns(20)

In [25]:

def get_pattern_linestrings(route):
    '''Get the patterns data from the CTA's bus tracker API for a specified route.
    Return patterns as a geodataframe with linestring geometry for each pattern'''

    df_patterns = get_patterns(route)

    # Turn points into linestrings
    geometry_linestrings = []
    for p in df_patterns['pt']:
        p.sort_values('seq', inplace=True)
        linestring_points = list(zip(p['lon'],p['lat']))

        # generate linestring using all points
        linestring = LineString(linestring_points)
        geometry_linestrings.append(linestring)

    # Create a geodataframe for the patterns using the linestring geometry
    gdf_patterns = gpd.GeoDataFrame(df_patterns, geometry=geometry_linestrings).set_crs(epsg=4326)

    # Drop the original pt column
    gdf_patterns.drop(['pt'], axis=1, inplace=True)

    return gdf_patterns


In [26]:
def get_pattern_stops(route):
    '''Get the patterns data from the CTA's bus tracker API
    for a specified route. Return patterns as a geodataframe
    with point geometry, one point per bus stop on each pattern
    associated with the route. Note that stops serving multiple
    patterns will be listed multiple times, once for each pattern
    with the seq and pdist values specific to that pattern.'''

    # get patterns for the route
    df_patterns = get_patterns(route)

    # set up a geodataframe to contain stops
    gdf_route_stops = gpd.GeoDataFrame()

    # consider the pid column (pattern ID) and the pt column (dataframe containing
    # points along the pattern)
    for pid, pt in zip(df_patterns['pid'], df_patterns['pt']):
        # sort points sequentially
        pt.sort_values('seq', inplace=True)
        # add the pattern id to each point's data
        pt['pid'] = pid
        # filter to only show stop points
        stops = pt[pt['typ'] == 'S']
        # zip lat/lon data to get coordinate pairs
        coords = list(zip(stops['lon'], stops['lat']))
        # turn coordinates into point geometry
        geometry = [Point(c) for c in coords]
        # generate a geodataframe for the stops in this pattern
        gdf_pattern_stops = gpd.GeoDataFrame(stops, geometry=geometry).set_crs(epsg=4326)
        # add this pattern's stops to the dataframe containing all stops on the route
        gdf_route_stops = pd.concat([gdf_route_stops, gdf_pattern_stops])

    return gdf_route_stops

# get_pattern_stops(20)

In [155]:
# view test data
m = get_pattern_linestrings('90').explore(color='blue', tiles='CartoDB positron')
get_pattern_stops('90').explore(m=m, color='red')


In [152]:
def vehicle_intervals(vehicles):

    """vehicles should be a CHN scraped dataframe including vehicle ids (vid),
    timestamps (tmstmp), pattern ids (pid), and pattern distances (pdist) for each vehicle at 
    various times.  Output is a dataframe with each row representing
    an interval between two points in time and space for one vehicle. Columns are added
    for each interval's start time, end time, start pdist, and end pdist."""

    df_vehicles = vehicles.copy()

    # Set up dataframe to contain final formatted data
    df_output = pd.DataFrame()

    # End time for each interval as a timestamp
    # (infer_datetime_format is deprecated in pandas 2.x, where format inference is the default)
    df_vehicles['end_time'] = pd.to_datetime(df_vehicles['tmstmp'])
    vid_list = df_vehicles['vid'].unique().tolist()

    # End location for each interval
    df_vehicles['end_pdist'] = df_vehicles['pdist']

    for vid in vid_list:

        # pare data down to a single vehicle
        df_vehicle = df_vehicles.loc[df_vehicles['vid'] == vid]

        # handle each pattern separately
        pid_list = df_vehicle['pid'].unique().tolist()
        for p in pid_list:
            df_vehicle_pattern = df_vehicle.loc[df_vehicle['pid']==p].copy()
            # sort by time (it should be sorted already, but just in case)
            df_vehicle_pattern.sort_values(by=['end_time'], inplace=True)

            # Create a start time based on the previous timestamp
            end_times = df_vehicle_pattern['end_time'].tolist()
            start_times = np.roll(end_times,shift=1)
            df_vehicle_pattern['start_time'] = start_times

            # Create a start pattern distance based on the previous pdist
            end_distances = df_vehicle_pattern['end_pdist'].tolist()
            start_distances = np.roll(end_distances,shift=1)
            df_vehicle_pattern['start_pdist'] = start_distances

            # Remove the first interval since we don't have real start
            # time or location data for it
            df_vehicle_pattern = df_vehicle_pattern.iloc[1:]

            # add data to the full output dataframe
            df_output = pd.concat([df_output, df_vehicle_pattern])

    return df_output
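For comparison, here is a minimal sketch of the same interval construction using `groupby` and `shift` instead of explicit loops. It is an alternative written for illustration, not the function above: the name `vehicle_intervals_shift` is hypothetical, the toy data is made up, and the rows come back sorted by `vid`, `pid`, and time rather than in order of first appearance.

```python
import pandas as pd

def vehicle_intervals_shift(vehicles):
    """Build one row per interval between consecutive pings of the same
    vehicle on the same pattern. (Hypothetical vectorized variant.)"""
    df = vehicles.copy()
    df['end_time'] = pd.to_datetime(df['tmstmp'])
    df['end_pdist'] = df['pdist']
    df = df.sort_values(['vid', 'pid', 'end_time'])

    # Previous ping within each (vehicle, pattern) group supplies the start values
    grouped = df.groupby(['vid', 'pid'])
    df['start_time'] = grouped['end_time'].shift(1)
    df['start_pdist'] = grouped['end_pdist'].shift(1)

    # The first ping in each group has no predecessor, so drop it
    out = df.dropna(subset=['start_time']).reset_index(drop=True)
    # shift() introduced NaN and made the column float; restore integers
    out['start_pdist'] = out['start_pdist'].astype(int)
    return out

# Made-up pings for two vehicles on one pattern
toy = pd.DataFrame({
    'vid': [1, 1, 1, 2, 2],
    'tmstmp': ['2023-01-08 10:00:00', '2023-01-08 10:05:00', '2023-01-08 10:10:00',
               '2023-01-08 10:02:00', '2023-01-08 10:07:00'],
    'pid': [100, 100, 100, 100, 100],
    'pdist': [0, 1500, 3200, 500, 2100],
})
intervals = vehicle_intervals_shift(toy)
print(intervals[['vid', 'pid', 'start_time', 'end_time', 'start_pdist', 'end_pdist']])
```

Because each group's first row is dropped rather than wrapped around, this avoids the step of rolling the last value into the first position and then discarding it.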



In [60]:
# # Q: Are there ever patterns that appear on more than one route?  

# pattern_rt_combos = all_vehicles[['pid','rt']].drop_duplicates()
# pattern_rt_combos.groupby(['pid']).count().sort_values(by='rt', ascending=False)

# # A: No patterns appear with more than one route, at least for the snapshot data available
# #    from the CTA API at the moment this was run.


In [None]:
# Next:

# Which vehicles have passed this route and stop combo (on any pattern)?

# When did they each pass?

# What were the headways?

In [None]:
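# Find all vehicles that pass a given stop

# vehicle intervals from above as test data
# (`test` is assumed to hold the output of vehicle_intervals() from an earlier run;
# the variable is named df_intervals so it doesn't shadow the vehicle_intervals function)
df_intervals = test
stpid = 17389
rt = '90'

def get_times_passed(stpid, rt, df_intervals):
    # TODO: not implemented yet
    pass


get_times_passed(stpid, rt, df_intervals)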
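One way the three questions above might be answered, sketched under stated assumptions: if the stop's `pdist` on each pattern is known, a bus passed the stop during any interval whose start `pdist` is before the stop and whose end `pdist` is at or past it, and the pass time can be estimated by linear interpolation between the two timestamps. Headways are then the differences between consecutive pass times. The function name `estimate_pass_times`, the `stop_pdist_by_pid` mapping, and the toy intervals are all hypothetical; this is not an implementation of the `get_times_passed` stub, and it assumes roughly constant speed within each interval.

```python
import pandas as pd

def estimate_pass_times(intervals, stop_pdist_by_pid):
    """Estimate when each vehicle passed a stop.
    intervals: one row per vehicle interval with start/end time and pdist.
    stop_pdist_by_pid: dict mapping pattern id -> the stop's pdist on that
    pattern (hypothetical input, e.g. derived from pattern stop data)."""
    df = intervals.copy()
    # pdist of the stop on each row's pattern (NaN if the pattern skips the stop)
    df['stop_pdist'] = df['pid'].map(stop_pdist_by_pid)

    # Keep intervals in which the bus moved past the stop
    crossed = df[(df['start_pdist'] < df['stop_pdist'])
                 & (df['stop_pdist'] <= df['end_pdist'])].copy()

    # Fraction of the interval's distance covered when the bus reaches the stop
    frac = ((crossed['stop_pdist'] - crossed['start_pdist'])
            / (crossed['end_pdist'] - crossed['start_pdist']))
    crossed['pass_time'] = (crossed['start_time']
                            + frac * (crossed['end_time'] - crossed['start_time']))

    return (crossed[['vid', 'pid', 'pass_time']]
            .sort_values('pass_time')
            .reset_index(drop=True))

# Made-up intervals: three buses on two patterns that share the stop
toy_intervals = pd.DataFrame({
    'vid': [1, 1, 2, 3],
    'pid': [100, 100, 100, 200],
    'start_time': pd.to_datetime(['2023-01-08 10:00:00', '2023-01-08 10:05:00',
                                  '2023-01-08 10:10:00', '2023-01-08 10:20:00']),
    'end_time': pd.to_datetime(['2023-01-08 10:05:00', '2023-01-08 10:10:00',
                                '2023-01-08 10:14:00', '2023-01-08 10:30:00']),
    'start_pdist': [1000, 3000, 1500, 3000],
    'end_pdist': [3000, 4000, 2500, 5000],
})
stop_pdist_by_pid = {100: 2000, 200: 4000}

passes = estimate_pass_times(toy_intervals, stop_pdist_by_pid)
headways = passes['pass_time'].diff()  # first value is NaT (no earlier bus)
print(passes.assign(headway=headways))
```

The strict inequality on the start side keeps a bus that is sitting exactly at the stop across two pings from being counted twice.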