Analysis of the CTA's GTFS data and the CTA Train Tracker API to interpret how the actual arrival times of CTA trains correspond to their predicted arrival times.

michaelslice/CTA-Data-Analysis-Visualization

CTA-Data-Visualization

Undergraduate research I conducted alongside Prof. Koop from 2022 to 2023, looking into inconsistencies in the CTA's arrival predictions for trains on the 'L' lines.

CTA 'L' DATA

The data we have from the CTA comes from a couple of different sources. The main data that we wish to analyze comes from the CTA Train Tracker Map (https://www.transitchicago.com/traintrackermap/). This is tied to the CTA Train Tracker API (https://www.transitchicago.com/developers/traintracker/), but because the map feed is general, it does not face the same limits and key constraints as the API. The API documentation is still useful for deciphering the data.

The General Transit Feed Specification (GTFS) is an open format for transit schedule data. Chicago's GTFS data provides information about its routes, including both bus and "L" routes. We can download this data from the feed and use the various files to extract "L"-specific information. The l_stops data below is this extracted information. Note that a station (parent_station) can have multiple stops.
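As a rough illustration of that extraction step, the sketch below filters a GTFS-style stops table down to "L" stops and counts the stops per parent station. It is only a sketch under assumptions: the rows are made up, and the id range used for the filter (30000 to 39999) is inferred from the numbering pattern described later in this README rather than from any official rule.

```python
import pandas as pd

# Tiny made-up stand-in for the GTFS stops.txt table (not real CTA rows).
stops = pd.DataFrame(
    {
        "stop_id": [1001, 30001, 30002, 40010],
        "stop_name": ["Bus stop", "Platform A", "Platform B", "Example Station"],
        "parent_station": pd.array([pd.NA, 40010, 40010, pd.NA], dtype="Int64"),
    }
)

# Assumption: "L" platform stops are the five-digit ids that start with 3.
is_l_stop = stops["stop_id"].between(30000, 39999)
l_stops = stops[is_l_stop].copy()

# A station (parent_station) can have several stops, so count them per station.
stops_per_station = l_stops.groupby("parent_station").size()
print(stops_per_station)
```

With the real feed, the made-up frame would be replaced by something like `pd.read_csv("stops.txt")`, where the path depends on where the GTFS archive was unpacked.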

To run the code, you will need the data and a Python installation with support for pandas and pyarrow. We suggest using Microsoft VS Code and/or Anaconda. To generate the maps, you will also need shapely, which can be installed via Anaconda.

There are many Python references, but for pandas, the following reference is open-access: https://wesmckinney.com/book/

Deciphering the Data

image

The CTA "L" stations seem to be numbered consecutively according to when they were catalogued or added. They are a five-digit number that starts with four (4) and ends with zero (0). Evidence that these correspond to a chronological order includes the last three stops:

The stop ids start with a three (3) and serve to split a "station" into separate tracks/directions. These, too, are numbered based on chronological order, and new stops can be added to an existing (older) station. Evidence:

Train Tracker Data

image

In the collected data, we find a third identifier (CurrentStationId) which seems to correspond to a fixed location along the track. Perhaps there are sensors at these locations that allow trains to check in as they pass. The first digit corresponds to the line, but only imperfectly. Because lines overlap and trains from different lines run on the same tracks, this imperfect match reinforces the idea that the ids refer to sensors rather than to lines. The second digit is a 1 or 5, which indicates direction (see Appendix C of the API Documentation), and the last three digits tend to be mostly ordered.
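The digit pattern described above can be made concrete with a small helper that splits an id into its three parts. This is a minimal sketch assuming the ids are always five digits long; the function name, the field names, and the example id are hypothetical and are not taken from the CTA documentation.

```python
from typing import NamedTuple


class StationIdParts(NamedTuple):
    line_digit: int       # first digit: imperfectly tied to the line
    direction_digit: int  # second digit: observed to be 1 or 5
    sequence: int         # last three digits: mostly ordered along the track


def split_current_station_id(station_id: int) -> StationIdParts:
    """Split a five-digit CurrentStationId into its apparent components."""
    text = str(station_id)
    if len(text) != 5:
        raise ValueError(f"expected a five-digit id, got {station_id!r}")
    return StationIdParts(int(text[0]), int(text[1]), int(text[2:]))


# 21045 is a made-up id used only to show the split.
print(split_current_station_id(21045))
```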

Other Elements

The other fields in the entries data include:

  • Line: a code for the line (e.g. B for Blue Line)
  • Datetime: a timestamp indicating the time the data was collected (UTC)
  • Lat/Lng: latitude and longitude that can be used to plot the position on a map
  • RunNumber: a number assigned to a train when it is running. Trains can keep the same run number for multiple trips up and down a line.
  • ExitStationId: ??? Numbers seem to match the CurrentStationId format
  • Direction: the direction the train is traveling (in radians)
  • LineName: an expansion of the Line code
  • DirMod: seems to be a language element that specifies how the train is moving
  • DestName: the end of the trip (usually the end of the line), important when a route splits
  • IsSched: a boolean, generally always False for our data
  • CSSClass: used for browser rendering
  • Flags: ???

Predictions

image

For each train, the CTA calculates the number of minutes until it is expected at the next stops on the line. When that number of minutes is less than two, it instead lists the expected time as "Due". The ParentStop attribute matches the 4xxxx parent_station numbers that are listed above in the GTFS data. Thus, we have a link between the routes that are listed and the stops that are predicted. When a prediction has not changed since the previous set of predictions, it is not recorded again.
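The "Due" rule can be written as a one-line check. The sketch below is illustrative only: the function name and the "N min" label format are assumptions, and only the two-minute threshold comes from the description above.

```python
def format_prediction(minutes: float) -> str:
    """Return "Due" for predictions under two minutes, else a minutes label."""
    # The "N min" wording is a hypothetical display format, not the CTA's.
    return "Due" if minutes < 2 else f"{int(minutes)} min"


print(format_prediction(1))  # under the two-minute threshold
print(format_prediction(7))
```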

In general, we would expect that the stops would be listed in order from the next stop to the one after and so on. However, when the stops are close together and thus expected within a minute of each other, the predictions may be the same. In these cases, the stop may be out of order, meaning that the stops as ordered by StopOrder are not as they would be encountered by the train.

Single Train Run

For a given run number, we can track the train's path over time and even display it on a map.

image

image

image
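Tracking one run amounts to filtering the entries by run number and ordering them by timestamp. The sketch below assumes the entries are in a pandas DataFrame with columns named RunNumber, Datetime, Lat, and Lng, as in the field list above; the rows and the helper name run_path are made up for illustration.

```python
import pandas as pd

# Made-up entries; real data would come from the collected Train Tracker files.
entries = pd.DataFrame(
    {
        "RunNumber": [101, 202, 101, 101],
        "Datetime": pd.to_datetime(
            [
                "2023-01-01 12:02",
                "2023-01-01 12:00",
                "2023-01-01 12:00",
                "2023-01-01 12:01",
            ],
            utc=True,
        ),
        "Lat": [41.90, 41.80, 41.88, 41.89],
        "Lng": [-87.63, -87.70, -87.63, -87.63],
    }
)


def run_path(entries: pd.DataFrame, run_number: int) -> pd.DataFrame:
    """Return one run's positions in chronological order."""
    run = entries[entries["RunNumber"] == run_number]
    ordered = run.sort_values("Datetime")
    return ordered[["Datetime", "Lat", "Lng"]].reset_index(drop=True)


path = run_path(entries, 101)
print(path)
```

Because a train can keep the same run number for several trips up and down a line, a real analysis would likely also need to split the ordered path into separate trips.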

EDF DATA

The General Transit Feed Specification (GTFS) is an open format for packaging scheduled service data. GTFS data is produced by hundreds of transit agencies around the world, including the CTA, to deliver content for inclusion in maps and directions-giving services such as Google Maps.

Information in the CTA train tracker beta comes from data fed to CTA from its rail infrastructure (unlike buses, our current railcar fleet does not have GPS hardware). This data is then processed by software we use to monitor our rail system which also generates the predictions for train arrivals based on recent train travel times from one point to another. (The software is a product called QuicTrak®.)

EUCLIDEAN DISTANCE

In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. Here, we can use this formula to calculate the distance between train stops along the line.
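For two points (x1, y1) and (x2, y2), the distance is sqrt((x2 - x1)^2 + (y2 - y1)^2). A minimal sketch of that calculation follows; the function name is an assumption and is not taken from the project's code.

```python
import math


def euclidean_distance(p: tuple[float, float], q: tuple[float, float]) -> float:
    """Straight-line distance between two 2D points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 3-4-5 triangle
```

Note that applying this directly to raw latitude and longitude values gives a result in degrees, which is only an approximation of ground distance. It is fine for ranking nearby stops but not for measuring track length.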

image

GeoPandas

image

Using GeoPandas, we can match each train's location, given by the CurrentStationIds in the EDF data, to the scheduled ParentStops in l_stops.

Parent Station Ids Sorted by Line and Chronological Order Mapping

image

  • Green: Green Line
  • Blue: Blue Line
  • Brown: Brown Line
  • Purple: Purple Line
  • Pink: Pink Line
  • Red: Red Line
  • Orange: Orange Line
  • Grey: If marked grey, the stop has more than one parent station id.
  • Black: If marked black, there are overlapping train lines that have stops with the same parent station id.

Null Island Encounter

image

In the train tracker data (EDF), some of the CurrentStationIds are placed at Null Island, the point on the Earth's surface at zero degrees latitude and zero degrees longitude (0° N, 0° E). This matters because it shows that the CTA track sensors, on the Yellow Line in particular, are producing position data that does not make sense.
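Such records can be picked out by checking for a latitude and longitude that are both exactly zero. The sketch below assumes a pandas DataFrame with CurrentStationId, Lat, and Lng columns; the ids and coordinates are made up for illustration.

```python
import pandas as pd

# Made-up entries; the ids are placeholders, not real CurrentStationIds.
entries = pd.DataFrame(
    {
        "CurrentStationId": [11001, 71002, 71003, 71002],
        "Lat": [41.88, 0.0, 0.0, 0.0],
        "Lng": [-87.63, 0.0, 0.0, 0.0],
    }
)

# A record sits at Null Island when both coordinates are exactly zero.
at_null_island = (entries["Lat"] == 0) & (entries["Lng"] == 0)

null_island_ids = sorted(entries.loc[at_null_island, "CurrentStationId"].unique().tolist())
valid_entries = entries[~at_null_island]

print(null_island_ids)
```

Keeping the flagged ids, rather than silently dropping the rows, makes it possible to see which sensors are affected.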

Null Island Points

image

The CurrentStationIds above correspond to the Yellow Line stops: Dempster-Skokie, Oakton-Skokie, and Howard.

l_stops & Stations Nearest Joins (sjoin_nearest)

image

Instead of calculating the Euclidean distance between every CurrentStationId and every parent_station to find the closest one, we used the GeoPandas sjoin_nearest function to connect each CurrentStationId to its closest parent_station along the different train lines.
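A nearest join of this kind might look like the sketch below. It is not the project's actual code: the points and ids are made up, and projecting to EPSG:32616 (WGS 84 / UTM zone 16N, which covers Chicago) before the join is an assumption made here so that the reported distances are in metres rather than degrees.

```python
import geopandas as gpd
from shapely.geometry import Point

# Made-up sensor locations keyed by CurrentStationId (longitude, latitude).
sensors = gpd.GeoDataFrame(
    {"CurrentStationId": [11001, 11002]},
    geometry=[Point(-87.630, 41.880), Point(-87.650, 41.950)],
    crs="EPSG:4326",
)

# Made-up station locations keyed by parent_station.
stations = gpd.GeoDataFrame(
    {"parent_station": [40001, 40002]},
    geometry=[Point(-87.631, 41.881), Point(-87.651, 41.949)],
    crs="EPSG:4326",
)

# Assumption: project both layers to a metric CRS before measuring distance.
sensors_m = sensors.to_crs(epsg=32616)
stations_m = stations.to_crs(epsg=32616)

# For each sensor, attach the closest station and the distance to it.
nearest = gpd.sjoin_nearest(
    sensors_m, stations_m, how="left", distance_col="distance_m"
)
print(nearest[["CurrentStationId", "parent_station", "distance_m"]])
```

Ids placed at Null Island would need to be filtered out first, since their nearest station would be meaningless.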

Standard Deviation of Time to Travel Between Consecutive Stops on The Blue Line

image

Variation

The variation in the travel times between stops is important in understanding the consistency and reliability of the Chicago Blue Line. By analyzing the standard deviation of the train times, we can help commuters better understand their travel times.
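The standard deviation per stop pair can be computed with a pandas groupby. The sketch below uses made-up travel times and hypothetical column names (from_stop, to_stop, minutes); how travel times are derived from the raw entries is not shown here.

```python
import pandas as pd

# Made-up observed travel times between consecutive stops, in minutes.
travel = pd.DataFrame(
    {
        "from_stop": ["A", "A", "A", "B", "B", "B"],
        "to_stop": ["B", "B", "B", "C", "C", "C"],
        "minutes": [2.0, 3.0, 4.0, 5.0, 5.0, 5.0],
    }
)

# Mean and sample standard deviation for each consecutive stop pair.
spread = travel.groupby(["from_stop", "to_stop"])["minutes"].agg(["mean", "std"])
print(spread)
```

A stop pair with a large standard deviation relative to its mean is one where travel time is less predictable, which is where predictions are most likely to miss.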

Delays

A reliable and consistent public transit system is essential for the functioning of a busy city like Chicago. However, inconsistencies and delays can cause significant disruptions to commuters' daily routines, leading to lost time and productivity.

Time to Travel Around all Stops in the Loop

image

The Brown, Purple, Pink, and Orange lines provide access to key residential, commercial, and cultural areas throughout the city. Comparing the average travel time between stops across these four lines gives an idea of which lines are the most efficient during the day.
