## Motivation

Our motivation for using Port Authority’s Monthly Bus Performance data is both personal and societal. Many CMU and Pitt students, including us, rely on Port Authority to commute to and from class, work, and social gatherings. Port Authority is also an important resource for the broader Pittsburgh community, and many people depend on it entirely for transportation. Buses are often scheduled 30 minutes to an hour apart, so missing a bus because it came 10 minutes early, or arriving 10 minutes late to an important meeting or test, can have serious consequences. Quantifying the performance and reliability of Port Authority’s buses will provide insights that help Pittsburgh residents plan accordingly and avoid being late to important events. For CMU students, we hope the information makes them better informed about the tendencies of bus routes in their area. We also hope to spark a larger discussion about the efficacy of Pittsburgh’s public transportation system and to hold Port Authority accountable to its on-time performance (OTP) goals of 73% for bus and 80% for rail service. Our insights, if used by Port Authority, could lead to improvements in how it allocates resources and could increase trust in Pittsburgh’s public transit.

## Related Work

We’ve been aware of Port Authority’s faults throughout our time as students and commuters in Pittsburgh. Both of us have had trouble with late or missing buses, so when one of us suggested a public transportation project while brainstorming, we checked whether Port Authority collected any data on its buses and ridership. Once we found a suitable dataset and learned about Port Authority’s On Time Performance goals, we decided that a closer look at Pittsburgh’s system and its potential flaws would be worthwhile.

## Data

From a preliminary investigation, the data appears to be in a manageable format that is ready to be analyzed. It can be downloaded as a CSV file, is well organized, and includes a primary key (the id field) in addition to descriptive fields such as the route, date, type of transport, day type, and on-time percentage. The id field could provide an easy way to join this data to similar datasets from Port Authority, but we will have to investigate this further once analysis begins. The data is in one table with 11 columns, named as follows: “id”, “route” (text), “ridership_route_code” (text), “full_route_name” (text), “current_garage” (text), “mode” (text), “month_start” (date), “year_month” (text), “day_type” (text), “on_time_percent” (float8), “data_source” (text). Some columns are categorical because they relate to a route number or name, others are interval data related to the time of recording, and “on_time_percent”, the column of highest interest, is numerical data on a ratio scale. The table has 22,258 rows and covers the start of 2017 through October 2024.

There is an additional dataset providing monthly ridership information with 11 columns and 22,317 rows; we haven’t decided whether we’ll analyze it alongside our primary dataset. Although its data collection started later (2019), it shares most of its columns with the primary dataset, so it wouldn’t be difficult to integrate into our plan. Its columns are as follows: “id”, “route” (text), “ridership_route_code” (text), “full_route_name” (text), “current_garage” (text), “mode” (text), “month_start” (date), “year_month” (text), “day_type” (text), “avg_riders” (int), “day_count” (int).


https://data.wprdc.org/dataset/port-authority-monthly-average-on-time-performance-by-route/resource/00eb9600-69b5-4f11-b20a-8c8ddd8cfe7a

https://data.wprdc.org/dataset/prt-monthly-average-ridership-by-route/resource/12bb84ed-397e-435c-8d1b-8ce543108698

After a brief investigation, it appears that some of the cleaning and preprocessing steps have been done for us. Both datasets are stewarded, and while we may have to deal with a few rows with missing or incorrect entries and pivot the data to create categories that are useful to us, both datasets are generally prepared for analysis. One analysis of Pittsburgh bus ridership was done by the NYC Data Science Academy, but its focus appears to be ridership, whereas ours will be primarily OTP. We also found some articles on Port Authority’s bus performance, but none seemed to go as deep as we intend to, and we couldn’t determine whether they used the exact same dataset.
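As a rough illustration of the first cleaning pass, the sketch below checks that a CSV has the 11 columns listed above and sets aside rows whose “on_time_percent” is missing or out of range. It uses only the Python standard library. The sample rows, route codes, garage names, and mode labels are made up for illustration, and we are assuming “on_time_percent” is stored as a fraction between 0 and 1, which we still need to confirm against the real download.

```python
import csv
import io

# Column names as listed in the Data section.
EXPECTED_COLUMNS = [
    "id", "route", "ridership_route_code", "full_route_name",
    "current_garage", "mode", "month_start", "year_month",
    "day_type", "on_time_percent", "data_source",
]

# Hypothetical rows standing in for the downloaded CSV (values are made up).
SAMPLE_CSV = """id,route,ridership_route_code,full_route_name,current_garage,mode,month_start,year_month,day_type,on_time_percent,data_source
1,R1,R1,R1 - EXAMPLE ROUTE,Garage A,Bus,2017-01-01,201701,WEEKDAY,0.68,Source A
2,R1,R1,R1 - EXAMPLE ROUTE,Garage A,Bus,2017-01-01,201701,SAT.,,Source A
3,R2,R2,R2 - EXAMPLE LINE,Garage B,Rail,2017-01-01,201701,WEEKDAY,0.84,Source A
"""


def load_otp_rows(handle):
    """Return (clean, rejected) rows from an open CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    clean, rejected = [], []
    for row in reader:
        raw = (row["on_time_percent"] or "").strip()
        try:
            value = float(raw)
        except ValueError:
            rejected.append(row)  # blank or non-numeric entry
            continue
        # Assumption: OTP is a fraction in [0, 1]; NaN also fails this check.
        if not 0.0 <= value <= 1.0:
            rejected.append(row)
            continue
        row["on_time_percent"] = value
        clean.append(row)
    return clean, rejected


clean_rows, rejected_rows = load_otp_rows(io.StringIO(SAMPLE_CSV))
print(len(clean_rows), "usable rows;", len(rejected_rows), "rejected")
```

Keeping the rejected rows instead of silently dropping them would let us report how much of the data was excluded and why.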

## Questions

1. Has Port Authority met its OTP targets this year, and have they improved over time (since 2017 when the data collection began)?
We hope to understand the performance trend of the Port Authority across time, which will help in assessing the effectiveness of past initiatives and planning future improvements. Our plan is to visualize trends holistically and for specific bus routes using line charts and compute year-over-year performance metrics.
2. Looking at CMU bus routes specifically, is their performance better? Which pockets of Pittsburgh have the best and worst OTP?
By identifying high- and low-performing areas, we hope to create data-driven justifications for targeted interventions to improve service reliability in specific neighborhoods, and to assess whether routes serving CMU perform better, potentially because of higher ridership. Our plan is to filter and analyze data for routes serving Carnegie Mellon University (CMU), compare their performance metrics against other routes, and potentially use geospatial analysis (such as heat maps) to identify areas with the best and worst OTP.
3. Which days/times of day tend to see better bus performance?
We hope that insights into temporal performance variations can help in optimizing schedules and resource allocation to improve bus punctuality during peak times. Our plan is to perform a time series analysis to evaluate OTP across different days of the week and times of the day. We might use box plots and time-based charts to visualize performance trends, identify patterns, and suggest schedule adjustments.
4. (Time permitting) How does ridership impact service?
Understanding the relationship between ridership levels and service performance can inform capacity planning and service adjustments to better match demand. Our plan is to correlate ridership data with OTP metrics to investigate how changes in passenger numbers affect bus performance. We could use scatter plots and regression analysis to identify significant relationships and propose service adjustments.
5. (Time permitting) Repeat the analysis but with Rail, which routes see better performance?
Analyzing rail performance will give us a more comprehensive view of public transit reliability and help address issues specific to rail service. We plan to apply the same analysis techniques used for buses to the rail data and identify the routes that perform worse.
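The year-over-year comparison in Question 1 could start from something like the following sketch. It averages monthly OTP values by mode and calendar year, compares each average to the 73% bus and 80% rail targets quoted in the Motivation section, and computes the change from the previous year. The records are made-up placeholders, and the mode labels “Bus” and “Rail” are assumptions about how the “mode” column is coded.

```python
from collections import defaultdict
from statistics import mean

# Targets from the Motivation section, written as fractions.
OTP_TARGETS = {"Bus": 0.73, "Rail": 0.80}

# Hypothetical (mode, month_start, on_time_percent) records; values are made up.
records = [
    ("Bus", "2017-01-01", 0.66),
    ("Bus", "2017-02-01", 0.68),
    ("Bus", "2018-01-01", 0.70),
    ("Bus", "2018-02-01", 0.72),
    ("Rail", "2017-01-01", 0.82),
    ("Rail", "2018-01-01", 0.78),
]


def yearly_otp(rows):
    """Unweighted mean OTP for each (mode, year) pair."""
    buckets = defaultdict(list)
    for mode, month_start, otp in rows:
        year = int(month_start[:4])  # month_start is an ISO date string
        buckets[(mode, year)].append(otp)
    return {key: mean(values) for key, values in buckets.items()}


def year_over_year(yearly):
    """Change in mean OTP versus the previous year, where one exists."""
    changes = {}
    for (mode, year), value in yearly.items():
        previous = yearly.get((mode, year - 1))
        if previous is not None:
            changes[(mode, year)] = value - previous
    return changes


yearly = yearly_otp(records)
changes = year_over_year(yearly)

for mode, year in sorted(yearly):
    value = yearly[(mode, year)]
    status = "met" if value >= OTP_TARGETS[mode] else "missed"
    delta = changes.get((mode, year))
    delta_text = "n/a" if delta is None else f"{delta:+.1%}"
    print(f"{mode} {year}: {value:.1%} ({status} target, YoY {delta_text})")
```

Note that this is an unweighted mean across route-month rows, so a low-ridership route counts as much as a busy one. Weighting by ridership or trip count may be closer to how a system-wide OTP figure is defined, which is something we would need to check.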

### Workflow

1. Data Cleaning and Preprocessing.
2. Exploratory Data Analysis Round 1, which will be more experimental as we look for patterns to exploit.
3. Discuss preliminary EDA and brainstorm more relationships/patterns to visualize.
4. Exploratory Data Analysis Rounds 2, 3, and however many more we need. These rounds will be more informed and purposeful.
5. Narrative Creation, Official Visualization, and Reporting.
6. Insights and Final Recommendations.

## Possible Findings and Implications

At a high level, we’ve theorized that Port Authority has not met its performance targets for this year, but that it has likely improved since 2017. We believe that certain years, such as 2020 and 2022, might be outliers in terms of OTP due to external factors (COVID, layoffs, budget cuts, etc.). We anticipate that CMU routes heading downtown will see better monthly performance than ones heading towards CMU from downtown. We believe that routes with more ridership may receive more resources, which would improve the quality of service. Based on our own experience as riders, we also believe that OTP is best during traditional business hours and declines from the evening until the end of service around 2am. Given that Port Authority’s OTP goals are higher for rail service, we also theorize that rail routes will see better performance across the board.
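To check the ridership hypothesis, one possible first step is to match the two datasets on their shared columns and compute a Pearson correlation between “avg_riders” and “on_time_percent”. The sketch below assumes (route, month_start, day_type) identifies a row in both tables, which we have not yet verified, and all values shown are made up.

```python
from math import sqrt

# Hypothetical lookups keyed by (route, month_start, day_type); values are made up.
otp_by_key = {
    ("R1", "2019-01-01", "WEEKDAY"): 0.62,
    ("R2", "2019-01-01", "WEEKDAY"): 0.70,
    ("R3", "2019-01-01", "WEEKDAY"): 0.66,
    ("R4", "2019-01-01", "WEEKDAY"): 0.74,
    ("R5", "2018-12-01", "WEEKDAY"): 0.69,  # no ridership match (before 2019)
}
riders_by_key = {
    ("R1", "2019-01-01", "WEEKDAY"): 1000,
    ("R2", "2019-01-01", "WEEKDAY"): 2000,
    ("R3", "2019-01-01", "WEEKDAY"): 3000,
    ("R4", "2019-01-01", "WEEKDAY"): 4000,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Inner join: keep only keys present in both datasets.
shared_keys = sorted(otp_by_key.keys() & riders_by_key.keys())
riders = [riders_by_key[key] for key in shared_keys]
otp = [otp_by_key[key] for key in shared_keys]

r = pearson(riders, otp)
print(f"{len(shared_keys)} matched rows, r = {r:.2f}")
```

A positive coefficient would be consistent with the hypothesis but would not show that ridership drives resources or service quality; the regression analysis planned in Question 4 would need to account for factors such as garage, mode, and year.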

Link to Data Set:
https://data.wprdc.org/dataset/port-authority-monthly-average-on-time-performance-by-route