# Behind Schedule: An Investigation into Pittsburgh Regional Transit Performance 

# Introduction and Motivation
Pittsburgh Regional Transit (PRT), formerly known as Port Authority, has aimed to provide reliable bus and rail transportation for the Greater Pittsburgh community since its inception. In addition to serving Pittsburgh residents' recurring transportation needs, PRT is relied on exclusively by many students of Pittsburgh-based universities, such as Carnegie Mellon University, the University of Pittsburgh, and Duquesne University, to commute to and from class, work, and social gatherings.

PRT has set monthly average On-Time Performance (OTP) goals for its bus and rail services, clarifying that a bus is considered "On Time" if it is no more than one minute early or five minutes late to its scheduled timepoint. These goals are 73% OTP for bus service and 80% OTP for rail service. This report aims to quantify the reliability of PRT's service and evaluate whether it has consistently met its OTP goals, both holistically and for routes commonly used by university students. Its results are meant to inform university students and the broader community and to help hold PRT accountable to its goals.
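
The on-time window described above can be written down as a small check. This is a minimal sketch, not PRT's actual implementation: the function names, the sign convention (negative means early), and the inclusive boundaries are assumptions made for illustration.

```python
def is_on_time(deviation_minutes, max_early=1.0, max_late=5.0):
    """Return True if a departure falls inside the on-time window.

    deviation_minutes is actual minus scheduled time (assumed convention):
    negative values are early, positive values are late.
    """
    return -max_early <= deviation_minutes <= max_late


def on_time_fraction(deviations):
    """Fraction of timepoints that were on time, in [0, 1]."""
    deviations = list(deviations)
    if not deviations:
        raise ValueError("at least one timepoint is required")
    return sum(is_on_time(d) for d in deviations) / len(deviations)


# Hypothetical deviations (minutes) for four timepoints on one trip
sample = [-2.0, 0.0, 3.0, 6.0]
print(on_time_fraction(sample))
```

Under these assumptions, a route's monthly `on_time_percent` would be this fraction computed over all of its timepoints in that month.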



# Dataset
Monthly OTP Data is sourced from the Western Pennsylvania Regional Data Center, where PRT has catalogued various datasets related to performance and ridership. The dataset contains 22,258 entries (rows), 11 features (columns), and covers the time period from January 2017 to September 2024. In 2018, PRT switched from a more archaic data recording system to one called Clever, which uses more timepoints and fixes previous technical issues. 

This dataset’s columns are as follows: <br>
<br>
*<b>id</b>*: unique identifier key <br>
*<b>route</b>*: route code for joining with PAAC geospatial data <br>
*<b>ridership route code</b>*: route code for joining with ridership data <br>
*<b>full route name</b>*: full route name as it would appear on bus headsign <br>
*<b>current garage</b>*: garage that route operates out of <br>
*<b>mode</b>*: Bus or Light Rail <br>
*<b>month start</b>*: first day of the month in YYYY-MM-DD format <br>
*<b>year month</b>*: year-month key in YYYYMM format <br>
*<b>day type</b>*: weekday, Saturday, or Sunday <br>
*<b>on time percent</b>*: [0,1] fraction of timepoints that bus/rail departed on time from the stop/station <br>
*<b>data source</b>*: UTA or Clever <br>

Our variables of interest are: *month start*, *day type*, *on time percent*, *full route name*, and *current garage*.


# Methodology
Data was split into bus and rail service, then preprocessed to handle on-time performance values of 0. Observations with an OTP of 0.0 were judged to be entry errors and dropped from the dataset. Following preprocessing, the report explores the research questions using a data manipulation library (pandas) and a visualization library (altair). Our research questions are as follows:

<b>Question 1: Has PRT met its OTP targets this year, and have they improved over time (since 2017 when the data collection began)?</b>

<b>Question 2: Do CMU-Centric routes see better OTP? Which routes have the best and worst OTP?</b>

<b>Question 3: Does Bus OTP Performance Differ by Type of Day, and How Does Time of Year Affect this Relationship?</b>

To address the first research question, line charts and bar charts quantify bus OTP in 2024, compare that year's performance to previous years, and show at a high level whether PRT's buses are meeting OTP targets.

To address the second research question, descriptive statistics and bar charts quantify bus OTP across routes, classified as CMU-centric or non-CMU-centric.

To address the third research question, boxplots, line charts, and bar charts quantify bus OTP by type of day, both across all recorded years and in 2024, and show whether PRT's buses are meeting OTP targets on certain service days.

### Imports

In [246]:
import pandas as pd
import numpy as np
import altair as alt
import warnings
warnings.filterwarnings('ignore')

In [247]:
pd.set_option('display.max_rows', 200)

### Reading the Data

In [248]:
data = pd.read_csv('port-authority-otp.csv')

In [249]:
data.head()

| | _id | route | ridership_route_code | route_full_name | current_garage | mode | month_start | year_month | day_type | on_time_percent | data_source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 - FREEPORT ROAD | Ross | Bus | 2017-01-01 | 201701 | WEEKDAY | 0.6837 | Clever |
| 1 | 2 | 1 | 1 | 1 - FREEPORT ROAD | Ross | Bus | 2017-01-01 | 201701 | SAT. | 0.6977 | Clever |
| 2 | 3 | 1 | 1 | 1 - FREEPORT ROAD | Ross | Bus | 2017-01-01 | 201701 | SUN. | 0.628 | Clever |
| 3 | 4 | 2 | 2 | 2 - MOUNT ROYAL | Ross | Bus | 2017-01-01 | 201701 | WEEKDAY | 0.6978 | Clever |
| 4 | 5 | 4 | 4 | 4 - TROY HILL | Ross | Bus | 2017-01-01 | 201701 | WEEKDAY | 0.7438 | Clever |


In [250]:
# split between bus and rail datasets
bus_data = data[data['mode'] == 'Bus']
rail_data = data[data['mode'] == 'Light Rail']

### Descriptive Statistics

In [251]:
bus_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21206 entries, 0 to 22257
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _id                   21206 non-null  int64  
 1   route                 21206 non-null  object 
 2   ridership_route_code  11209 non-null  object 
 3   route_full_name       21206 non-null  object 
 4   current_garage        21206 non-null  object 
 5   mode                  21206 non-null  object 
 6   month_start           21206 non-null  object 
 7   year_month            21206 non-null  int64  
 8   day_type              21206 non-null  object 
 9   on_time_percent       21206 non-null  float64
 10  data_source           20956 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 1.9+ MB


There are 11 columns, including the unique id identifier. Across the 21,206 bus entries, most columns have no null values. ridership_route_code has the most missing data, with only 11,209 of 21,206 values being non-null.

In [252]:
bus_data.describe()

| | _id | year_month | on_time_percent |
|---|---|---|---|
| count | 21206.0 | 21206.0 | 21206.0 |
| mean | 11458.364897 | 202048.078421 | 0.666959 |
| std | 6853.018555 | 221.081322 | 0.155132 |
| min | 1.0 | 201701.0 | 0.0 |
| 25% | 5321.25 | 201901.0 | 0.624925 |
| 50% | 11433.5 | 202011.0 | 0.6896 |
| 75% | 17388.75 | 202210.0 | 0.7508 |
| max | 23404.0 | 202409.0 | 1.0 |


The lowest recorded on-time percentage is 0.00, so it is worth investigating how many OTP values are 0 and whether they are intentional or data entry errors. On average, buses are on time about 66.7% of the time, with a standard deviation of about 15.5 percentage points.
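
One quick way to size the problem is to count the zero values directly. The sketch below uses a hypothetical toy series standing in for `bus_data.on_time_percent`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy sample standing in for bus_data.on_time_percent
otp = pd.Series([0.0, 0.68, 0.71, 0.0, 0.75, 0.62, 0.80, 0.69])

zero_count = int((otp == 0).sum())   # number of exact-zero observations
zero_share = (otp == 0).mean()       # share of observations that are zero

print(f"{zero_count} of {len(otp)} values are zero ({zero_share:.1%})")
```

On the real data, the same two lines applied to `bus_data.on_time_percent` would show whether the zeros are a handful of outliers or a systematic block of records.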

In [253]:
bus_data.on_time_percent.quantile([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

0.01    0.000000
0.05    0.471375
0.10    0.553600
0.25    0.624925
0.50    0.689600
0.75    0.750800
0.90    0.803800
0.95    0.834000
0.99    0.889090
Name: on_time_percent, dtype: float64

Even the 99th percentile of OTP values is only about 89%.

### Ridership Route Code

In [254]:
bus_data.ridership_route_code.nunique()

99

There are 99 unique non-null ridership route codes in the bus data.

In [255]:
bus_data.ridership_route_code.unique()

array(['001', '002', '004', '006', '007', '008', '011', '012', '013',
       '014', '015', '016', '017', '018', '019L', '020', '021', '022',
       '024', '026', '027', '028X', '029', '031', '036', '038', '039',
       '040', '041', '043', '044', '048', '051', '051L', '052L', '053',
       '053L', '054', '055', '056', '057', '058', '059', '060', '061A',
       '061B', '061C', '061D', '064', '065', '067', '068', '069', '071',
       '071A', '071B', '071C', '071D', '074', '075', '077', '078', '079',
       '081', '082', '083', '086', '087', '088', '089', '091', '093',
       'G2', 'G3', 'G31', 'O1', 'O12', 'O5', 'P1', 'P10', 'P12', 'P13',
       'P16', 'P17', 'P2', 'P3', 'P67', 'P68', 'P69', 'P7', 'P71', 'P76',
       'P78', 'Y1', 'Y45', 'Y46', 'Y47', 'Y49', '60', nan], dtype=object)

### Missing/Null Values

In [256]:
bus_data_nan = bus_data[bus_data.data_source.isna()]
bus_data_nan.on_time_percent.describe()

count    250.0
mean       0.0
std        0.0
min        0.0
25%        0.0
50%        0.0
75%        0.0
max        0.0
Name: on_time_percent, dtype: float64

In [257]:
bus_data_nan.route_full_name.unique()

array(['19L - EMSWORTH LIMITED', '51L - CARRICK LIMITED',
       '52L - HOMEVILLE LIMITED', '53L - HOMESTEAD PARK LIMITED',
       '60 - MCKEESPORT - WALNUT', 'G3 - MOON FLYER',
       'G31 - BRIDGEVILLE FLYER', 'O1 - ROSS FLYER',
       'O12 - MCKNIGHT FLYER', 'O5 - THOMPSON RUN FLYER VIA 279',
       'P10 - ALLEGHENY VALLEY FLYER', 'P12 - HOLIDAY PARK FLYER',
       'P13 - MOUNT ROYAL FLYER', 'P16 - PENN HILLS FLYER',
       'P17 - LINCOLN PARK FLYER', 'P2 - EAST BUSWAY SHORT',
       'P3 - EAST BUSWAY-OAKLAND', 'P67 - MONROEVILLE FLYER',
       'P69 - TRAFFORD FLYER', 'P7 - MCKEESPORT FLYER',
       'P71 - SWISSVALE FLYER', 'P76 - LINCOLN HIGHWAY FLYER',
       'P78 - OAKMONT FLYER', 'Y1 - LARGE FLYER',
       'Y45 - BALDWIN MANOR FLYER', 'Y47 - CURRY FLYER'], dtype=object)

The above routes are missing on-time performance figures, likely due to the change in data collection method.

In [258]:
bus_data = bus_data.loc[~((bus_data['on_time_percent'] == 0) | (bus_data['data_source'].isna()))]

# Analysis

## Question 1: Has PRT met its OTP targets this year, and have they improved over time (since 2017 when the data collection began)? 

This analysis aims to quantify performance trends of PRT across time, computing year-over-year performance metrics and determining whether PRT has improved since data collection began. 



#### Helper Code

In [259]:
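# .copy() makes an independent frame so month_start can be overwritten below without a SettingWithCopyWarning
bus_data_2024 = bus_data[(bus_data.month_start >= '2024-01-01')].copy()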

In [260]:
bus_data_2024.month_start = pd.to_datetime(bus_data_2024.month_start).dt.strftime('%B')

In [261]:
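# .copy() makes an independent frame so month_start can be overwritten below without a SettingWithCopyWarning
rail_data_2024 = rail_data[(rail_data.month_start >= '2024-01-01')].copy()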

In [262]:
rail_data_2024.month_start = pd.to_datetime(rail_data_2024.month_start).dt.strftime('%B')

In [263]:
bus_monthly_otp_2024 = bus_data_2024.groupby('month_start')['on_time_percent'].mean().reset_index()
bus_monthly_otp_2024['mode'] = 'Bus'

In [264]:
rail_monthly_otp_2024 = rail_data_2024.groupby('month_start')['on_time_percent'].mean().reset_index()
rail_monthly_otp_2024['mode'] = 'Light Rail'

In [265]:
monthly_otp_2024 = pd.concat([bus_monthly_otp_2024, rail_monthly_otp_2024]).sort_values('month_start')

In [266]:
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

In [267]:
bus_color = 'red'
rail_color = 'blue'

In [268]:
chart = alt.Chart(monthly_otp_2024).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month of 2024')),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.5, 1])),
    color=alt.Color('mode:N', title='Mode of Transportation', 
                    scale=alt.Scale(domain=['Bus', 'Light Rail'], range=[bus_color, rail_color]), 
                    legend=alt.Legend(orient='bottom-right', titleFontSize=14, labelFontSize=15))
).properties(
    title = 'Monthly PRT On-Time Performance Across 2024, Colored By Mode of Transportation', 
    width=800,
    height=400
)

line1 = alt.Chart(pd.DataFrame({'on_time_percent': [0.73], 'label': ['Bus OTP Target']})).mark_rule(strokeDash=[10, 10], size=2).encode(
    y='on_time_percent:Q',
    color=alt.value(bus_color)
) + alt.Chart(pd.DataFrame({'on_time_percent': [0.73], 'label': ['Bus OTP Target']})).mark_text(
    align='left', dx=5, dy=-5, color=bus_color).encode(
    x = alt.value(800),
    y='on_time_percent:Q',
    text='label:N'
)

line2 = alt.Chart(pd.DataFrame({'on_time_percent': [0.8], 'label': ['Rail OTP Target']})).mark_rule(strokeDash=[10, 10], size=2).encode(
    y='on_time_percent:Q',
    color=alt.value(rail_color)
) + alt.Chart(pd.DataFrame({'on_time_percent': [0.8], 'label': ['Rail OTP Target']})).mark_text(
    align='left', dx=5, dy=-5, color=rail_color).encode(
    x = alt.value(800),
    y='on_time_percent:Q',
    text='label:N'
)


#### Figure 1: Monthly PRT On-Time Performance Across 2024, Colored by Mode of Transportation

In [269]:
alt.layer(chart,line1, line2).display()

On-time performance for rail service hovers around 15-20 percentage points above bus service and consistently exceeds PRT's target. This supports our theory that rail service is inherently more reliable because it is less susceptible to traffic congestion and delays. Bus performance, however, consistently failed to meet the target in 2024 and declined over the year, peaking in March and reaching its lowest point in September. One interesting note is that rail performance dipped significantly in May. We theorize that the Pittsburgh Marathon in early May may have led to increased ridership as people commuted into the city, causing delays for rail service.


#### Helper Code

In [270]:
met_target_2024 = bus_data_2024.groupby('month_start').apply(lambda x: (x['on_time_percent'] >= 0.73).mean()*100).reset_index(name='percentage_met_target')

In [271]:
chart = alt.Chart(met_target_2024).mark_bar().encode(
    x=alt.X('month_start', title='Month', sort=month_order),
    y=alt.Y('percentage_met_target', title=' Percentage of Bus Routes that Met Target in 2024, by Month', scale=alt.Scale(domain=[0, 70])),
    tooltip=['month_start', 'percentage_met_target']
).properties(
    title='Percentage of Bus Routes that Met Target OTP Throughout 2024',
    width=800,
    height=400
)


#### Figure 2: Percentage of Buses In a Given Month that met their OTP Target in 2024

In [272]:
chart.display()

A bus route meets the target in a given month if its on-time performance is greater than or equal to 0.73. Consistently throughout the year, fewer than half of all bus routes meet the OTP target, and the share meeting it also appears to decline as the year progresses. The lowest-performing month is May, which may again be related to the Pittsburgh Marathon, as performance seems to recover slightly in the following months. However, the true cause of these fluctuations is unclear.
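
Each row of the dataset is a route, month, and day type combination, so a row-level share is not quite the same as a share of routes. The sketch below shows one possible route-level variant that averages day types within each route and month first. The toy records and the helper name `share_of_routes_meeting_target` are hypothetical, and weighting each day type equally is an assumption.

```python
import pandas as pd

TARGET = 0.73  # PRT's bus OTP target, as stated in this report

# Hypothetical toy records: two routes, two months, two day types each
toy = pd.DataFrame({
    "month_start": ["January"] * 4 + ["February"] * 4,
    "route": ["1", "1", "2", "2"] * 2,
    "day_type": ["WEEKDAY", "SAT."] * 4,
    "on_time_percent": [0.70, 0.80, 0.60, 0.72, 0.76, 0.78, 0.74, 0.75],
})


def share_of_routes_meeting_target(df, target=TARGET):
    """Percentage of routes per month whose mean OTP across day types meets the target."""
    route_month = df.groupby(["month_start", "route"], as_index=False)["on_time_percent"].mean()
    met = route_month["on_time_percent"] >= target
    return met.groupby(route_month["month_start"]).mean().mul(100)


route_level = share_of_routes_meeting_target(toy)

# Row-level share, in the style of the helper cell above, for comparison
row_level = toy.groupby("month_start")["on_time_percent"].apply(
    lambda s: (s >= TARGET).mean() * 100
)

print(route_level)
print(row_level)
```

In the toy data the two measures disagree for January (one of two routes meets the target, but only one of four rows does), which is why it helps to state which unit a "percentage of routes" figure is counting.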

#### Helper Code

In [273]:
bus_monthly_otp = bus_data.groupby('month_start')['on_time_percent'].mean().reset_index()
bus_monthly_otp.month_start = pd.to_datetime(bus_monthly_otp.month_start)
bus_monthly_otp['year_month'] = bus_monthly_otp.month_start.dt.strftime('%Y-%m')
bus_monthly_otp['year'] = pd.DatetimeIndex(bus_monthly_otp.month_start).year
bus_monthly_otp.month_start = bus_monthly_otp.month_start.dt.strftime('%B')
bus_monthly_otp['color'] = np.where(bus_monthly_otp.year == 2024, '2024', 'Other Years')

In [274]:
chart_all_years = alt.Chart(bus_monthly_otp).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month')),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.5, 0.8])),
    color=alt.Color('year:N', scale=alt.Scale(
        domain=[2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024],
        range=['grey', 'grey', 'grey', 'grey', 'green', 'blue', 'grey', 'red']
    ), legend=alt.Legend(title="Year", titleFontSize=14, labelFontSize=15)), size=alt.condition(
        alt.datum.year == 2024,
        alt.value(3),  # Thicker line for 2024
        alt.value(1))
).properties(
    title = 'Monthly PRT On-Time Performance Throughout the Year, 2024 vs. Previous Years', 
    width=800,
    height=400
)

In [275]:
chart_2024_vs_2023 = alt.Chart(bus_monthly_otp[bus_monthly_otp.year >= 2023]).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month')),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.5, 0.8])),
    color=alt.Color('year:N', title='Year',  
                    legend=alt.Legend(orient='bottom-right', titleFontSize=14, labelFontSize=16))
).properties(
    title = 'Monthly PRT On-Time Performance Throughout the Year, 2024 vs. 2023', 
    width=800,
    height=400
)

In [276]:
chart_2024_vs_2020 = alt.Chart((bus_monthly_otp[(bus_monthly_otp.year == 2020) | (bus_monthly_otp.year == 2024)])).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month')),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.5, 0.8])),
    color=alt.Color('year:N', title='Year',  
                    legend=alt.Legend(orient='bottom-right', titleFontSize=14, labelFontSize=16))
).properties(
    title = 'Monthly PRT On-Time Performance Throughout the Year, 2024 vs 2020', 
    width=800,
    height=400
)

vertical_line = alt.Chart(pd.DataFrame({
    'month_start': ['March'],
    'value': [0],  # Set a value just for positioning
})).mark_rule(strokeDash=[10, 10], size=2, color='black').encode(
    x=alt.X('month_start', sort=month_order)
)

# Create label for the vertical line
label = alt.Chart(pd.DataFrame({
    'month_start': ['March'],
    'label': ['Start of COVID Pandemic'],  # The text for the label
    'y': [50]  # Position the label at the top of the chart
})).mark_text(
    align='right', dy=-20, color="black").encode(
    x=alt.X('month_start', sort=month_order),
    text='label'
)

#### Figure 3: Monthly PRT On-Time Performance Throughout the Year, 2024 vs Previous Years

In [277]:
chart_all_years.display()

2024, shown in red, is disappointingly among the lower-performing years since 2017, although the pattern of performance declining over the course of the year is mostly consistent across years. We have singled out 2021 in green and 2022 in blue as notably strong and weak years, respectively.

#### Figure 4: Monthly PRT On-Time Performance Throughout the Year, 2024 vs. 2023

In [278]:
chart_2024_vs_2023.display()

PRT did not noticeably improve between 2023 and 2024. Based on the 2023 trend, OTP may be expected to improve in the final three months of 2024.
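
A year-over-year comparison like this can also be made numerically by pivoting the monthly averages so each year becomes a column. The figures below are hypothetical placeholders, not values from the dataset, and only three months are shown to keep the sketch short.

```python
import pandas as pd

# Hypothetical monthly averages; the real values would come from bus_monthly_otp
monthly = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "month": ["January", "February", "March"] * 2,
    "on_time_percent": [0.70, 0.69, 0.71, 0.69, 0.70, 0.68],
})
month_order = ["January", "February", "March"]

# One row per month, one column per year, in calendar order
wide = (
    monthly.pivot(index="month", columns="year", values="on_time_percent")
    .reindex(month_order)
)

# Change from 2023 to 2024 in percentage points
wide["change_pp"] = (wide[2024] - wide[2023]) * 100
print(wide.round(3))
```

A column of mostly small or negative `change_pp` values would back up the visual impression that performance did not improve.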

#### Figure 5: Monthly PRT On-Time Performance Throughout the Year, 2024 vs. 2020

In [279]:
alt.layer(chart_2024_vs_2020, vertical_line, label).display()

The onset of the pandemic in March 2020 coincided with a noticeable decline in OTP. Even so, 2024's OTP does not reach the 2020 level in any month.

#### Helper Code

In [280]:
chart = alt.Chart(bus_monthly_otp).mark_line().encode(
    x= alt.X('year_month:O', 
             axis=alt.Axis(title='Year and Month', 
                           labelOverlap=True,
                          tickCount=5)),
    y= alt.Y('on_time_percent', 
             axis=alt.Axis(title = 'Average On-Time Performance'), 
             scale=alt.Scale(domain=[0.5, 0.8]))
).properties(
    title = 'Port Authority On-Time Performance Between 2017-2024', 
    width=800,
    height=400
)

pandemic_line = alt.Chart(pd.DataFrame({
    'year_month': ['2020-03'],
    'value': [0],  
})).mark_rule(strokeDash=[10, 10], size=2, color='black').encode(
    x=alt.X('year_month')
)

pandemic_label = alt.Chart(pd.DataFrame({
    'year_month': ['2020-03'],
    'label': ['Start of COVID Pandemic'],  
    'y': [50]  
})).mark_text(
    align='right', dy=-20, color="black").encode(
    x=alt.X('year_month'),
    text='label'
)

rebrand_line = alt.Chart(pd.DataFrame({
    'year_month': ['2022-06'],
    'value': [0],  
})).mark_rule(strokeDash=[10, 10], size=2, color='black').encode(
    x=alt.X('year_month')
)

rebrand_label = alt.Chart(pd.DataFrame({
    'year_month': ['2022-06'],
    'label': ['Port Authority Rebrands to PRT'],  
    'y': [50]  
})).mark_text(
    align='right', dy=-20, color="black").encode(
    x=alt.X('year_month'),
    text='label'
)

#### Figure 6: Port Authority On-Time Performance Between 2017 and 2024

In [281]:
alt.layer(chart, pandemic_line, pandemic_label, rebrand_line, rebrand_label).display()

The rebrand and change of ownership appear to have led PRT to struggle temporarily: changes in budgets, with more money allocated to replacing signs, buses, and the like and less to employees, seem to have had a negative impact on performance. The pandemic led to an obvious decline in performance, which seemed to rebound around September of that year. Overall, however, performance is noisy and has not noticeably improved since 2017.
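
Because the monthly series is noisy, a rolling average can make any long-run trend easier to judge. This is a sketch on a hypothetical series with a repeating pattern; the 12-month window is an assumption chosen to average out seasonality, not something taken from the analysis above.

```python
import pandas as pd

# Hypothetical monthly OTP series with a repeating four-month pattern
months = pd.date_range("2017-01-01", periods=24, freq="MS")
otp = pd.Series([0.63, 0.65, 0.67, 0.69] * 6, index=months, name="on_time_percent")

# 12-month rolling mean; the first 11 months are NaN because the window is incomplete
rolling_12 = otp.rolling(window=12).mean()

print(rolling_12.dropna().round(3))
```

In this toy series the rolling mean is flat even though the monthly values move, which is the kind of pattern that "noisy but not improving" describes.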

## Question 2: Do CMU-Centric routes see better OTP? Which routes have the best and worst OTP?

This analysis identifies high and low-performing areas, aiming to assess whether routes serving CMU perform better.

### CMU Vs. Other Pittsburgh Routes

To compare CMU against other areas of Pittsburgh, this report denotes the 61A-D, 67, 69, 71, and 71A-D, which run through and around CMU's campus, as CMU-centric routes.

#### Helper Code

In [282]:
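cmu_routes = ['61A','61B','61C','61D', '67','69','71','71A','71B','71C','71D']
#2024 bus records for the CMU-centric routes
cmu_bus_routes = bus_data_2024[bus_data_2024["route"].isin(cmu_routes)]
avg_otp_by_route = cmu_bus_routes.groupby("route")["on_time_percent"].mean()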
overall_otp_cmu = avg_otp_by_route.mean()
all_other_routes = bus_data_2024[~bus_data_2024["route"].isin(cmu_routes)]
avg_otp_by_route_other = all_other_routes.groupby("route")["on_time_percent"].mean() 
#similar to the prior series, this is the average OTP for all other routes not denoted as CMU
overall_otp_other = avg_otp_by_route_other.mean()
all_bus_avg_2024 = bus_data_2024.groupby("route")["on_time_percent"].mean().mean()
cmu_data = pd.DataFrame({
    "Route Type": ["CMU Routes", "Other Routes"],
    "Average OTP (%)": [overall_otp_cmu * 100, overall_otp_other * 100] #Multiplied by 100 to move the scale of the graph to 0-100 
})

In [283]:
cmu_otp_chart = alt.Chart(cmu_data).mark_bar().encode(
    x=alt.X("Route Type", title="Bus Route Category"),
    y=alt.Y("Average OTP (%)", title="Average On-Time Percentage", scale=alt.Scale(domain=[0, 100])),
    color=alt.Color("Route Type", scale=alt.Scale(domain=["CMU Routes", "Other Routes"], range=["red", "gray"])) 
    #in order to color CMU routes as red to standout
).properties(
    title="Comparison of On-Time Performance: CMU vs. Other Routes",
    width=800,
    height=400
)
#adds data labels
cmu_otp_text = cmu_otp_chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5 
).encode(
    text=alt.Text("Average OTP (%):Q", format=".1f")
)
#put everything together into one final graph
final_cmu_otp_chart = cmu_otp_chart + cmu_otp_text

#### Table 7: Average OTP By Route

In [284]:
avg_otp_by_route #this represents a series of each CMU bus route and its respective average

route
61A    0.568089
61B    0.537181
61C    0.481370
61D    0.727867
67     0.509296
69     0.672800
71     0.680456
71A    0.696941
71B    0.535641
71C    0.646307
71D    0.742019
Name: on_time_percent, dtype: float64

In [285]:
print("Overall OTP average of CMU routes:", overall_otp_cmu) #this represents the on time average of all CMU bus routes

Overall OTP average of CMU routes: 0.6179969696969697


In [286]:
print("Overall OTP average of non-CMU routes:", overall_otp_other) #total OTP average of the prior series which indicates non CMU routes

Overall OTP average of non-CMU routes: 0.6896045434969853


In [287]:
print("Overall OTP average of all routes in 2024", all_bus_avg_2024) #average OTP average of ALL routes in 2024, regardless of if its categorized as CMU route or Other

Overall OTP average of all routes in 2024 0.6814840969835815


#### Figure 7: Comparison of On-Time Performance: CMU vs Other Routes

In [288]:
final_cmu_otp_chart

CMU bus routes had an average OTP of 61.8%, compared with 69.0% for non-CMU bus routes. This means that CMU bus routes are on time 7.2 percentage points less often than the average non-CMU route and fall a full 11.2 percentage points below the 73% target.
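
The gaps quoted above are differences in percentage points, which is not the same as a relative (percent) difference. A short worked example using the averages printed earlier in this notebook makes the distinction concrete; the variable names are introduced here for illustration.

```python
# Averages printed earlier in this notebook
cmu_otp = 0.6179969696969697
other_otp = 0.6896045434969853
target = 0.73

# Absolute gaps, in percentage points
gap_vs_other_pp = (other_otp - cmu_otp) * 100
gap_vs_target_pp = (target - cmu_otp) * 100

# Relative gap: how much lower CMU OTP is as a share of non-CMU OTP
relative_gap_pct = (other_otp - cmu_otp) / other_otp * 100

print(f"Gap vs other routes: {gap_vs_other_pp:.1f} percentage points")
print(f"Gap vs target:       {gap_vs_target_pp:.1f} percentage points")
print(f"Relative gap:        {relative_gap_pct:.1f}%")
```

The absolute gap to other routes is about 7.2 points, while the relative gap is larger at roughly 10.4%, so it matters which one is being reported.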

### Lowest Performing Routes

An interesting pattern emerges: four of the ten lowest-performing routes (61C, 67, 71B, and 61B), including the single worst route (61C), are CMU-centric routes.

#### Helper Code

In [289]:
avg_times = bus_data_2024.groupby("route")["on_time_percent"].mean()
bottom_10_routes = avg_times.nsmallest(10)

overall_avg = avg_times.mean()
bot_10_df = bottom_10_routes.reset_index() #convert to a DataFrame so it is easier to modify and plot
bot_10_df.columns = ['route', 'on_time_percent']

bot_10_df['on_time_percent'] *= 100 #convert fractions to percentages (0-100 scale)
bot_10_df = bot_10_df.sort_values(by='on_time_percent', ascending=True) #sort from worst to best

highlight_routes = {'67', '71B', '61C', '61B'} #CMU bus routes
bot_10_df['color'] = bot_10_df['route'].apply(lambda x: 'CMU Routes' if x in highlight_routes else 'Other Routes')


In [290]:
bot_chart = (
    alt.Chart(bot_10_df)
    .mark_bar(size=20)
    .encode(
        x=alt.X('route:N', title='Route', sort=None),
        y=alt.Y('on_time_percent:Q', title='On-Time Percentage', scale=alt.Scale(domain=[0, 100])),
        color=alt.Color('color:N', scale=alt.Scale(domain=['CMU Routes', 'Other Routes'], range=['red', 'gray']))
    )
)
bot_text = bot_chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-10
).encode(
    text=alt.Text('on_time_percent:Q', format='.1f')  
)

#adding in an average line for reference on the performance of these routes versus an "average" pittsburgh route
avg_line = (
    alt.Chart(pd.DataFrame({'y': [all_bus_avg_2024 * 100]})) 
    .mark_rule(color='black', strokeDash=[4, 4])  
    .encode(y='y:Q')
)

avg_label = (
    #y uses the same 0-100 scale as avg_line so the label sits just above the rule
    alt.Chart(pd.DataFrame({'y': [all_bus_avg_2024 * 100], 'text': [f'Avg: {all_bus_avg_2024*100:.1f}%']}))
    .mark_text(align='left', dx=-25, dy=-8, color='black')
    .encode(
        y='y:Q',
        text='text:N'
    )
)


final_bot_chart = (bot_chart + bot_text + avg_line + avg_label).properties(
    title="Bottom 10 Bus Routes by OTP %",
    width=800,
    height=400
)

#### Table 8: Bottom 10 Routes

In [291]:
bottom_10_routes

route
61C    0.481370
65     0.505533
67     0.509296
77     0.509844
58     0.518985
71B    0.535641
61B    0.537181
15     0.546063
1      0.554789
83     0.562433
Name: on_time_percent, dtype: float64

#### Table 9: Bottom Routes, CMU vs Other Routes

In [292]:
bot_10_df 

| | route | on_time_percent | color |
|---|---|---|---|
| 0 | 61C | 48.137037 | CMU Routes |
| 1 | 65 | 50.553333 | Other Routes |
| 2 | 67 | 50.92963 | CMU Routes |
| 3 | 77 | 50.984444 | Other Routes |
| 4 | 58 | 51.898519 | Other Routes |
| 5 | 71B | 53.564074 | CMU Routes |
| 6 | 61B | 53.718148 | CMU Routes |
| 7 | 15 | 54.606296 | Other Routes |
| 8 | 1 | 55.478889 | Other Routes |
| 9 | 83 | 56.243333 | Other Routes |


#### Figure 10: Bottom 10 Bus Routes by On-Time Performance Percentage

In [293]:
final_bot_chart

The bottom 10 routes are the 61C, 65, 67, 77, 58, 71B, 61B, 15, 1, and 83. Four of these are CMU-centric routes, all four performing below 54%, including the worst route overall, the 61C, at 48.1%. That is an astounding 20.0 percentage points below the average bus route and 24.9 percentage points below the target OTP.
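
The CMU routes in the bottom 10 were highlighted with a hand-written set in the helper code. As an alternative sketch, the same labels can be derived from the `cmu_routes` list so that they stay correct if the ranking changes. The route averages below are copied from Table 8; the variable names `bottom_df` and `cmu_in_bottom` are introduced here for illustration.

```python
import pandas as pd

cmu_routes = ['61A', '61B', '61C', '61D', '67', '69', '71', '71A', '71B', '71C', '71D']

# Bottom 10 route averages, copied from Table 8
bottom_10 = pd.Series({
    '61C': 0.481370, '65': 0.505533, '67': 0.509296, '77': 0.509844,
    '58': 0.518985, '71B': 0.535641, '61B': 0.537181, '15': 0.546063,
    '1': 0.554789, '83': 0.562433,
}, name='on_time_percent')

bottom_df = bottom_10.rename_axis('route').reset_index()

# Label each route by membership in cmu_routes instead of a hard-coded set
bottom_df['color'] = bottom_df['route'].isin(cmu_routes).map(
    {True: 'CMU Routes', False: 'Other Routes'}
)

cmu_in_bottom = bottom_df.loc[bottom_df['color'] == 'CMU Routes', 'route'].tolist()
print(cmu_in_bottom)
```

This reproduces the four highlighted routes without needing to update a separate set by hand.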

### Downtown Routes Vs. Non-Downtown Routes

As part of investigating why the CMU bus routes appear to be significantly less punctual, the report compares bus routes that do not enter downtown with those that do. The routes listed below were identified as not entering downtown after speaking with a Pittsburgh Port Authority representative.

#### Helper Code

In [294]:
not_downtown = ['75', '71A', '71C', '59', '55', '64', '74', '54','65', '93', '60', '89']
not_downtown_routes = bus_data_2024[bus_data_2024["route"].isin(not_downtown)]
not_downtown_avg_otp = not_downtown_routes.groupby("route")["on_time_percent"].mean()
overall_not_downtown_avg = not_downtown_avg_otp.mean()
downtown_routes = bus_data_2024[~bus_data_2024["route"].isin(not_downtown)]
downtown_avg_otp = downtown_routes.groupby("route")["on_time_percent"].mean().mean()
downtown_data = pd.DataFrame({
    "Route Type": ["Non-Downtown Routes", "Downtown Routes"],
    "Average OTP (%)": [overall_not_downtown_avg  * 100, downtown_avg_otp * 100]
})

In [295]:
downtown_otp_chart = alt.Chart(downtown_data).mark_bar().encode(
    x=alt.X("Route Type", title="Bus Route Category"),
    y=alt.Y("Average OTP (%)", title="Average On-Time Percentage", scale=alt.Scale(domain=[0, 100])),
    color=alt.Color("Route Type", scale=alt.Scale(domain=["Non-Downtown Routes", "Downtown Routes"], range=["green", "gray"]))
).properties(
    title="Comparison of On-Time Performance: Non-Downtown vs. Downtown Routes",
    width=800,
    height=400
)

downtown_otp_text = downtown_otp_chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5 
).encode(
    text=alt.Text("Average OTP (%):Q", format=".1f")
)

final_downtown_otp_chart = downtown_otp_chart + downtown_otp_text


#### Table 11: Average OTP for downtown and non-downtown routes

In [296]:
not_downtown_avg_otp

route
54     0.613200
55     0.685678
59     0.637789
60     0.795419
64     0.620263
65     0.505533
71A    0.696941
71C    0.646307
74     0.634978
75     0.666856
89     0.734152
93     0.658274
Name: on_time_percent, dtype: float64

In [297]:
print("Overall Average for Non-downtown Routes: ", overall_not_downtown_avg)

Overall Average for Non-downtown Routes:  0.657949074074074


In [298]:
print("Overall Average for downtown routes:", downtown_avg_otp)

Overall Average for downtown routes: 0.684806688453159


#### Figure 12: Comparison of On-Time Performance: Non-Downtown vs. Downtown Routes

In [299]:
final_downtown_otp_chart

The routes going downtown (68.5%) actually performed better than those operating outside of downtown (65.8%), outperforming them by 2.7 percentage points. The routes operating outside of downtown fall slightly below the 68.1% average for all buses.

### OTP by Garage
Another possible factor is the garage from which these buses originate. The following analysis takes a closer look at OTP by garage.

#### Helper Code

In [300]:
avg_times_garage = bus_data_2024.groupby("current_garage")["on_time_percent"].mean()
avg_times_garage_df = avg_times_garage.reset_index()
avg_times_garage_df['on_time_percent'] *= 100 
avg_times_garage_df = avg_times_garage_df.sort_values(by='on_time_percent', ascending=True)
avg_times_garage_df['color'] = avg_times_garage_df['current_garage'].apply(lambda x: 'blue' if x == 'Collier' else 'gray')
#collier is the highest performing garage, hence we will be highlighting it on the graph

In [301]:
garage_chart = (
    alt.Chart(avg_times_garage_df)
    .mark_bar(size=40)
    .encode(
        x=alt.X('current_garage:N', title='Garage', sort='-y'),
        y=alt.Y('on_time_percent:Q', title='Average On-Time Percentage', scale=alt.Scale(domain=[0, 100])),
        color=alt.Color('color:N', scale=alt.Scale(domain=['blue', 'gray'], range=['blue', 'gray']), legend=None)
    )
)

garage_text = garage_chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5  
).encode(
    text=alt.Text('on_time_percent:Q', format='.1f')  
)

final_garage_chart = (
    (garage_chart + garage_text + avg_line + avg_label)
    .properties(
        title="Average On-Time Percentage by Garage",
        width=800,
        height=400
    )
)

#### Table 13: Average OTP By Garage

In [302]:
avg_times_garage

current_garage
Collier         0.734075
East Liberty    0.646722
Ross            0.670426
West Mifflin    0.662378
Name: on_time_percent, dtype: float64

#### Figure 14: Average On-Time Performance by Garage

In [303]:
final_garage_chart

Of all the garages, Collier had the best performance at 73.4% OTP, followed by Ross at 67.0%, West Mifflin at 66.2%, and East Liberty at 64.7%. This means that Collier is the only garage that meets the 73% target set by PRT.
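
The comparison against the target can be tabulated directly from the garage averages in Table 13. This is a small sketch; the column names in `summary` are chosen here for illustration.

```python
import pandas as pd

TARGET = 0.73  # PRT's bus OTP target, as stated in this report

# Garage averages copied from Table 13
garage_otp = pd.Series({
    "Collier": 0.734075,
    "East Liberty": 0.646722,
    "Ross": 0.670426,
    "West Mifflin": 0.662378,
})

summary = pd.DataFrame({
    "otp_pct": garage_otp * 100,
    "gap_to_target_pp": (garage_otp - TARGET) * 100,  # negative means below target
    "meets_target": garage_otp >= TARGET,
}).sort_values("otp_pct", ascending=False)

print(summary.round(1))
```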

## Question 3: Does Bus OTP Performance Differ by Type of Day, and How Does Time of Year Affect this Relationship?

This analysis examines how bus OTP varies by type of day (weekday, Saturday, or Sunday) and how that relationship changes over the course of the year.


### Holistic On-Time Performance by Type of Day

#### Helper Code

In [304]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [305]:
by_day_boxplot = alt.Chart(bus_data).mark_boxplot(size=100).encode(
    x=alt.X('day_type', axis = alt.Axis(title = 'Day of the Week', labelAngle=0), sort=['WEEKDAY', 'SAT.', 'SUN.']),
    y=alt.Y('on_time_percent:Q', axis=alt.Axis(title = 'On-Time Performance')),
    color=alt.Color('day_type:O', scale=alt.Scale(domain=['WEEKDAY', 'SAT.', 'SUN.'],
                                                  range=['red', 'green', 'blue']),
                    legend=alt.Legend(title='Day Type'))
).properties(
    title='Distribution of Bus On-Time Performance by Type of Day',
    width=800,
    height=400
)

by_day_mean_points = alt.Chart(bus_data).mark_point(color='white', size=20).encode(
    x=alt.X('day_type', axis = alt.Axis(title = 'Day of the Week', labelAngle=0), sort=['WEEKDAY', 'SAT.', 'SUN.']),
    y='mean(on_time_percent):Q'
)


#### Figure 15: Distribution of On-Time Performance by Type of Day (Weekday, Saturday, Sunday)

In [306]:
by_day_boxplot + by_day_mean_points

#### Helper Code

In [307]:
by_day_2024_boxplot = alt.Chart(bus_data_2024).mark_boxplot(size=100).encode(
    x=alt.X('day_type', axis = alt.Axis(title = 'Day of the Week', labelAngle=0), sort=['WEEKDAY', 'SAT.', 'SUN.']),
    y=alt.Y('on_time_percent:Q', axis=alt.Axis(title = 'On-Time Performance'), scale=alt.Scale(domain=[0.1, 1.0])),
    color=alt.Color('day_type:O', scale=alt.Scale(domain=['WEEKDAY', 'SAT.', 'SUN.'],
                                                  range=['red', 'green', 'blue']),
                    legend=alt.Legend(title='Day Type'))
).properties(
    title='Distribution of Bus On-Time Performance by Type of Day, 2024',
    width=800,
    height=400
)

by_day_2024_mean_points = alt.Chart(bus_data_2024).mark_point(color='white', size=20).encode(
    x=alt.X('day_type', axis = alt.Axis(title = 'Day of the Week', labelAngle=0), sort=['WEEKDAY', 'SAT.', 'SUN.']),
    y='mean(on_time_percent):Q'
)


In [308]:
bus_monthly_otp_by_day = bus_data.groupby(['month_start', 'day_type'])['on_time_percent'].mean().reset_index()
bus_monthly_otp_by_day['year'] = pd.DatetimeIndex(bus_monthly_otp_by_day.month_start).year

#### Figure 16: Distribution of On-Time Performance by Type of Day in 2024 (Weekday, Saturday, Sunday)

In [309]:
by_day_2024_boxplot + by_day_2024_mean_points

### Trends Across Time: Average On-time Performance for Bus Routes By Day of Week

#### Helper Code

In [310]:
daytype_year_chart = alt.Chart(bus_monthly_otp_by_day).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month', labelOverlap=True,
                          tickCount=5)),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.6, 0.8])),
    color=alt.Color('day_type:N', title='Type of Day', 
                    scale=alt.Scale(domain=['WEEKDAY', 'SAT.', 'SUN.'], range=["red", "green", "blue"]), 
                    legend=alt.Legend(orient='top-right', titleFontSize=14, labelFontSize=15))
).properties(
    title = 'Monthly Port Authority On-Time Performance Across Time, colored by Type of Day', 
    width=800,
    height=400
)


In [311]:
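# Take an explicit copy so the column assignment below does not modify a slice of bus_monthly_otp_by_day
bus_monthly_otp_by_day_2024 = bus_monthly_otp_by_day[bus_monthly_otp_by_day.year == 2024].copy()
# Replace dates with month names (e.g. 'January') for the 2024 chart's x-axis
bus_monthly_otp_by_day_2024['month_start'] = pd.to_datetime(bus_monthly_otp_by_day_2024['month_start']).dt.strftime('%B')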

In [312]:
daytype_year_chart_2024 = alt.Chart(bus_monthly_otp_by_day_2024).mark_line().encode(
    x= alt.X('month_start', sort = month_order, axis=alt.Axis(title='Month', labelOverlap=True,
                          tickCount=5)),
    y= alt.Y('on_time_percent', axis=alt.Axis(title = 'Average On-Time Performance'), scale=alt.Scale(domain=[0.60, 0.80])),
    color=alt.Color('day_type:N', title='Type of Day', 
                    scale=alt.Scale(domain=['WEEKDAY', 'SAT.', 'SUN.'], range=["red", "green", "blue"]), 
                    legend=alt.Legend(orient='bottom-right', titleFontSize=14, labelFontSize=15))
).properties(
    title = 'Monthly Average On-Time Performance Across 2024, colored by Type of Day', 
    width=800,
    height=400
)


#### Figure 17: Monthly Average On-Time Performance Across Time, Colored by Type of Day

In [313]:
daytype_year_chart

Sunday bus service has a consistently higher OTP. This may seem counterintuitive because Sunday buses run at a lower frequency, but it can be reasoned that lower ridership and fewer buses on the road allow these buses to be on time more often than Weekday and Saturday service. Saturday service is most consistently the worst, which could be because higher ridership is negatively affecting bus arrival times.

#### Figure 18: Monthly Average On-Time Performance Across 2024, Colored by Type of Day (Weekday, Saturday, Sunday)

In [314]:
daytype_year_chart_2024

OTP tends to decrease throughout the year regardless of type of day. Sunday performance is the best except in May and September, Saturday performance is the worst except in May, and Weekday performance sits in the middle, seeing the worst metrics in May and the best ones in January and September.

### Seasons of 2024: Percentage of Bus Routes that Met the OTP Target by Type of Day

#### Helper Code

In [315]:
bus_data_2024['season'] = bus_data_2024.month_start.map({'December': 'Winter', 'January': 'Winter', 'February': 'Winter',
                                'March': 'Spring', 'April': 'Spring', 'May': 'Spring',
                                'June': 'Summer', 'July': 'Summer', 'August': 'Summer',
                                'September': 'Fall', 'October': 'Fall', 'November': 'Fall'})

In [316]:
met_target_by_daytype_2024 = bus_data_2024.groupby(['season', 'day_type']).apply(lambda x: (x['on_time_percent'] >= 0.73).mean()*100).reset_index(name='percentage_met_target')

In [317]:
season_met_target_daytype_2024_chart = alt.Chart(met_target_by_daytype_2024).mark_bar().encode(
    x=alt.X('season:N', title='Season', sort=['Winter', 'Spring', 'Summer', 'Fall']),
    xOffset='day_type:N',
    y=alt.Y('percentage_met_target:Q', title='Percentage of Bus Routes that Met Target'),
    color=alt.Color('day_type:N', scale=alt.Scale(domain=['WEEKDAY', 'SAT.', 'SUN.'],
                                                  range=['red', 'green', 'blue']),
                    legend=alt.Legend(title='Day Type')),
    tooltip=['season', 'percentage_met_target']
).properties(
    title='Percentage of Bus Routes that Met Target OTP Throughout 2024',
    width=800,
    height=400
)


#### Figure 19: Percentage of Bus Routes that Met Target OTP of 73% Throughout 2024

In [318]:
season_met_target_daytype_2024_chart.display()

Observations:
1. In Winter, Saturday service had the highest share of bus routes meeting Port Authority's OTP target.
2. In Spring and Summer, Sunday service had the highest share of bus routes meeting the target.
3. In Fall, which is limited to September, Weekday service had the highest share of bus routes meeting the target.

Note: October through December are not included in the dataset, so the Fall results reflect September alone and are not directly comparable with the other seasons. Weekday service hangs in the middle, with the worst comparative performance in Winter and the best in Spring.
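To make the metric behind Figure 19 concrete, the sketch below computes the share of rows meeting the 73% target within each season and day type. It is a minimal illustration on invented data: the `toy` DataFrame and its values are hypothetical stand-ins for `bus_data_2024`, and `BUS_OTP_TARGET` and `share_met_target` are names introduced here. It uses a boolean column and a grouped mean in place of `groupby(...).apply(...)`, which should give the same percentages.

```python
import pandas as pd

# Hypothetical rows standing in for bus_data_2024; the values are invented.
toy = pd.DataFrame({
    "season": ["Winter", "Winter", "Winter", "Winter", "Spring", "Spring"],
    "day_type": ["SAT.", "SAT.", "WEEKDAY", "WEEKDAY", "SUN.", "SUN."],
    "on_time_percent": [0.75, 0.70, 0.60, 0.74, 0.80, 0.73],
})

BUS_OTP_TARGET = 0.73  # PRT's stated bus OTP goal

# Flag each row as meeting the target, then average the flags within each
# (season, day_type) group; the mean of a boolean column is the share of True.
share_met_target = (
    toy.assign(met_target=toy["on_time_percent"] >= BUS_OTP_TARGET)
    .groupby(["season", "day_type"])["met_target"]
    .mean()
    .mul(100)
    .reset_index(name="percentage_met_target")
)

print(share_met_target)
```

Note that this is a share of rows clearing the threshold, not a mean OTP, so a group can have a high average OTP and still show a low percentage if many of its rows sit just under 73%.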

# Discussion of Results

In 2024, Rail Service consistently exceeded the 80% On-Time Performance (OTP) target set by PRT, but Bus Service fell short of its 73% target. Though PRT may have initially considered these goals ambitious, it is disappointing that, despite a five-minute allowance for delays, no month in 2024 recorded an average bus OTP meeting 73%. Furthermore, fewer than 45% of bus routes met the OTP target in any given month, meaning more than half of the routes fell short of the target every month. Particularly concerning is the downward trend in bus performance throughout 2024, with the lowest monthly average OTP recorded in September, coinciding with the return of students to campus, a period when reliable bus service becomes especially critical.

Since 2017, when PRT began collecting performance data, bus service has been influenced by numerous internal and external factors. Notably, OTP peaked in early 2020 but was sharply impacted by the onset of the COVID-19 pandemic in March of that year. There was a recovery in subsequent months, and performance metrics stabilized at levels slightly below the pre-pandemic peak. In 2021, PRT recorded its most consistent bus performance, possibly attributable to increased investment in services. As such, the company decided to kick off its rebrand in 2022, but this event unfortunately coincided with another sharp decline in performance. The degree to which any fund mismanagement and rebrand-related structural changes caused declines in performance is unclear, but it is probable that these changes at least partially contributed to the negative change in performance metrics. Since 2022, performance has hovered below pre-pandemic levels, with added volatility and inconsistency.

The poor performance of bus routes servicing Carnegie Mellon University (CMU) is especially problematic. These routes not only fall short of the 73% OTP target but also lag behind the system-wide average. Given that they pass through CMU, the University of Pittsburgh, and other Pittsburgh-based universities, it is plausible that increased ridership and traffic congestion in these areas contribute to delays. To address this, PRT has undertaken a project to construct a hyper-transit corridor servicing these routes. PRT describes this Bus Rapid Transit (BRT) project as one to “improve the transit amenity and reliability experience for all users of the corridor between the three neighborhood areas in the City of Pittsburgh. Five bus routes, the 61A, 61B, 61C, 71B, and P3 will become “BRT” routes, and provide upgraded service from Oakland heading west to Downtown Pittsburgh. However, even users of the non-BRT routes, which include routes 61D, 71A, 71C and 71D with shortened service, will experience the benefits of the BRT amenities in Oakland. In this area, all bus riders will experience upgraded stations with amenities, dedicated bus lanes, and transit signal priority. East of the Oakland area, these routes will continue to have the benefit of more reliable service by providing reliability improvements in Oakland.” The 61A-D routes are some of the worst-performing in the entire city, and the BRT project, which is currently in phase 1 of its construction, will hopefully improve their service significantly.

Interestingly, routes connecting downtown Pittsburgh showed slightly better OTP compared to non-downtown routes, contrary to initial expectations. Although the difference was not statistically significant, it merits further exploration given its potential implications for route efficiency. Performance across PRT’s garages revealed that buses from the Collier garage consistently outperformed others, even though these buses must navigate the often-congested Fort Pitt Tunnel. This outperformance raises questions about operational practices or structural differences at Collier compared to garages in Ross, East Liberty, and West Mifflin.
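One way to probe a difference like the downtown versus non-downtown gap is a permutation test on the difference in mean OTP. The sketch below is not the test used in this report; it is one possible approach, shown on invented numbers. The function `permutation_p_value` and the lists `downtown_otp` and `non_downtown_otp` are hypothetical and would need to be replaced with route-level values from the dataset.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical route-level OTP values, invented for illustration only.
downtown_otp = [0.68, 0.71, 0.66, 0.70, 0.69, 0.72]
non_downtown_otp = [0.67, 0.70, 0.65, 0.69, 0.71, 0.66]

p_value = permutation_p_value(downtown_otp, non_downtown_otp)
print(f"p = {p_value:.3f}")
```

A permutation test makes no normality assumption, which may suit OTP values that are bounded between 0 and 1 and can be skewed.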

Regarding service days, PRT categorizes them as Weekday, Saturday, and Sunday, which limits detailed analysis by assuming uniformity within these groups. While overall differences in performance by service day are minor, Sunday service consistently achieved higher OTP, likely due to lower ridership and reduced bus frequency. Although students may find the longer Sunday wait times frustrating, these lower frequencies appear to enhance punctuality. In contrast, Saturday service typically had the worst OTP, potentially due to higher ridership and its impact on schedules.

Seasonal analysis revealed additional insights. For instance, Saturday Winter service demonstrated the best relative performance across all groups and seasons, while Sunday Spring and Summer services performed best within their respective seasons. Fall service, encompassing only September, saw the strongest weekday performance. The uncharacteristic dip in Sunday OTP in May and September remains unexplained, but it may be tied to ridership fluctuations or local events.

Looking forward, future work could benefit from analyzing ridership data in tandem with on-time performance to better understand the relationship between passenger volume and delays. Incorporating ridership insights might shed light on whether increased demand directly correlates with performance declines and could offer targeted solutions to mitigate these issues. Additionally, applying the same level of in-depth analysis conducted for bus service to rail service could yield valuable findings about the factors contributing to the superior performance of PRT's rail network. This comparative perspective might reveal best practices or operational strategies that could be adapted for bus services. Addressing these research areas could lead to a more comprehensive understanding of service dynamics.

# Appendix

Link to dataset: https://data.wprdc.org/dataset/port-authority-monthly-average-on-time-performance-by-route

Link to video presentation: https://www.youtube.com/watch?v=f3vtpNLKIMo