# NYCT Delay Alerts: Data Preparation

In this notebook, we will load a snapshot of the NYCT alerts dataset and perform simple natural language processing (NLP) in order to extract the issue causing the delay and the train stop at (or near) which the issue occurred.

After setting up a virtual environment (e.g. `python3 -m venv venv`), you can install the necessary dependencies with `pip install -r requirements.txt`.

# Import dependencies

In [1]:
import pandas as pd
import numpy as np
import re
from collections import namedtuple, Counter
from datetime import datetime
from enum import StrEnum, auto
import json

In [2]:
# Additional setup
pd.options.mode.chained_assignment = None

# Read data

## Alerts data

Here we load the MTA alerts dataset, which is available for download [here](https://data.ny.gov/Transportation/MTA-Service-Alerts-Beginning-April-2020/7kct-peq7). At the time of writing, the dataset contains alerts from April 2020 to August 2024.

In [3]:
alerts_df = pd.read_csv('../../data/raw-data/service-alerts.csv')


In [4]:
alerts_df.head(5)

Unnamed: 0,Alert ID,Event ID,Update Number,Date,Agency,Status Label,Affected,Header,Description
0,283131,135287,0,01/08/2024 07:17:00 PM,LIRR,extra-service,Babylon Branch,tweet,g
1,253262,119758,0,09/15/2023 01:05:00 PM,NYCT Subway,boarding-change,6,,
2,198453,92346,0,02/04/2023 05:50:00 AM,LIRR,extra-service,City Terminal Zone,tweet,
3,179799,82495,0,11/04/2022 03:18:00 AM,LIRR,station-notice,City Terminal Zone,.,
4,158731,70726,0,07/30/2022 04:01:00 AM,LIRR,service-change,Port Jefferson Branch,Weekend Changes,


In [5]:
def filter_by_status_label(df, status_label):
    ''' Keep the rows of df whose pipe-delimited status labels include status_label '''
    return df[df["Status Label"].apply(lambda x: status_label in x.split(" | "))]
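
Splitting on `" | "` before testing membership is what makes this an exact-label match and not a substring match. The sketch below illustrates the difference on a toy frame; the label `some-delays` is made up for illustration and is not claimed to exist in the dataset.

```python
import pandas as pd

# Toy rows; "some-delays" is a hypothetical label used only to show the difference
toy = pd.DataFrame({
    "Alert ID": [1, 2, 3],
    "Status Label": ["delays", "reroute | delays", "some-delays"],
})

def has_status_label(labels, status_label):
    ''' True if status_label is exactly one of the pipe-delimited labels '''
    return status_label in labels.split(" | ")

# Exact membership after splitting on the delimiter
exact = toy[toy["Status Label"].apply(lambda x: has_status_label(x, "delays"))]
# Naive substring test, which also matches the made-up label
substring = toy[toy["Status Label"].str.contains("delays")]

print(list(exact["Alert ID"]))      # [1, 2]
print(list(substring["Alert ID"]))  # [1, 2, 3]
```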

In [6]:
alerts_nyct = alerts_df[alerts_df["Agency"] == "NYCT Subway"]
nyct = filter_by_status_label(alerts_nyct, "delays")

In [7]:
# Minor cleaning
nyct["Header"] = nyct["Header"].replace(np.nan, None)
nyct["Description"] = nyct["Description"].replace(np.nan, None)

In [8]:
nyct.sample(20)

Unnamed: 0,Alert ID,Event ID,Update Number,Date,Agency,Status Label,Affected,Header,Description
288962,47230,13170,1,12/26/2020 12:50:00 PM,NYCT Subway,delays,E,Northbound E trains are proceeding with delays...,
162031,182127,83829,0,11/15/2022 01:01:00 PM,NYCT Subway,delays,J,J trains are running with delays in both direc...,
44508,300730,144102,1,03/19/2024 03:54:00 PM,NYCT Subway,delays,A | B | C | D,Southbound A B C D trains are running with del...,
50704,294470,140826,1,02/21/2024 08:43:00 PM,NYCT Subway,reroute | delays,A | C | F,Southbound A C trains are running via the F li...,Trains are delayed while we remove a train wit...
212614,131457,54658,1,03/20/2022 06:27:00 AM,NYCT Subway,delays,4 | 5,Southbound 4 5 trains are running with delays ...,Southbound 4 5 trains have resumed making loca...
190535,153581,67794,0,07/07/2022 02:08:00 AM,NYCT Subway,delays,C | E,Northbound C E trains are running with delays ...,
249880,94133,32915,2,09/28/2021 01:12:00 PM,NYCT Subway,delays,B | C,Northbound B C trains are proceeding with dela...,
165958,178163,81575,0,10/27/2022 06:39:00 AM,NYCT Subway,delays,A | H,A Rockaway Park Shuttle trains are delayed in ...,
24523,321559,154852,0,06/05/2024 10:48:00 PM,NYCT Subway,delays,2 | 3,Northbound 2 3 trains are delayed while NYPD r...,
47329,297882,142626,0,03/07/2024 04:36:00 PM,NYCT Subway,delays,N | W,Southbound N W trains are running with delays ...,


## Station data

We'll also find it useful later to have data on the NYCT stations. This data is available for download [here](https://data.ny.gov/Transportation/MTA-Subway-Stations/39hk-dx4f/about_data).

In [9]:
stations_df = pd.read_csv('../../data/raw-data/subway-stations.csv')

In [10]:
stations_df

Unnamed: 0,GTFS Stop ID,Station ID,Complex ID,Division,Line,Stop Name,Borough,CBD,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,North Direction Label,South Direction Label,ADA,ADA Northbound,ADA Southbound,ADA Notes,Georeference
0,R01,1,1,BMT,Astoria,Astoria-Ditmars Blvd,Q,False,N W,Elevated,40.775036,-73.912034,Last Stop,Manhattan,0,0,0,,POINT (-73.912034 40.775036)
1,R03,2,2,BMT,Astoria,Astoria Blvd,Q,False,N W,Elevated,40.770258,-73.917843,Astoria,Manhattan,1,1,1,,POINT (-73.917843 40.770258)
2,R04,3,3,BMT,Astoria,30 Av,Q,False,N W,Elevated,40.766779,-73.921479,Astoria,Manhattan,0,0,0,,POINT (-73.921479 40.766779)
3,R05,4,4,BMT,Astoria,Broadway,Q,False,N W,Elevated,40.761820,-73.925508,Astoria,Manhattan,0,0,0,,POINT (-73.925508 40.76182)
4,R06,5,5,BMT,Astoria,36 Av,Q,False,N W,Elevated,40.756804,-73.929575,Astoria,Manhattan,0,0,0,,POINT (-73.929575 40.756804)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
491,S15,517,517,SIR,Staten Island,Prince's Bay,SI,False,SIR,Open Cut,40.525507,-74.200064,Ferry,South Shore,0,0,0,,POINT (-74.200064 40.525507)
492,S14,518,518,SIR,Staten Island,Pleasant Plains,SI,False,SIR,Embankment,40.522410,-74.217847,Ferry,South Shore,0,0,0,,POINT (-74.217847 40.52241)
493,S13,519,519,SIR,Staten Island,Richmond Valley,SI,False,SIR,Open Cut,40.519631,-74.229141,Ferry,Tottenville,0,0,0,,POINT (-74.229141 40.519631)
494,S09,522,522,SIR,Staten Island,Tottenville,SI,False,SIR,At Grade,40.512764,-74.251961,Ferry,Last Stop,1,1,1,,POINT (-74.251961 40.512764)


# Data processing

There are a number of improvements we may want to make in order to make this dataset more information-rich. Ultimately, we would like to identify common issues that cause delays, and identify which stations are particularly issue-prone.

## Clean alert headers and descriptions

We notice that the textual content of an alert is sometimes split between the header and the description. We will clean and combine these two fields for easier processing.

In [11]:
def clean_whitespace(text):
    ''' Clean newlines and extra whitespace in text '''
    if text is None:
        return None
    no_newlines = text.replace('\n', ' ')
    return re.sub(r'\s+', ' ', no_newlines).strip()

In [12]:
nyct["Header"] = nyct["Header"].apply(clean_whitespace)
nyct["Description"] = nyct["Description"].apply(clean_whitespace)

In [13]:
def combine_strings_or_nones(strings, delimiter=" "):
    ''' Join multiple strings, ignoring Nones '''
    return delimiter.join(list(filter(None, strings)))

In [14]:
nyct["Combined description"] = nyct.apply(
    lambda x: combine_strings_or_nones([x["Header"], x["Description"]]), 1)
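
As a quick sanity check, here is a self-contained sketch of the two helpers applied to a made-up header with a missing description. The helper bodies are restated in condensed form so the snippet runs on its own; the sample text is invented.

```python
import re

def clean_whitespace(text):
    ''' Collapse runs of whitespace (including newlines) into single spaces '''
    if text is None:
        return None
    return re.sub(r'\s+', ' ', text).strip()

def combine_strings_or_nones(strings, delimiter=" "):
    ''' Join strings, skipping None (and empty strings, since filter(None, ...) drops falsy values) '''
    return delimiter.join(filter(None, strings))

header = "Northbound E trains\nare delayed.  "  # made-up sample text
description = None

combined = combine_strings_or_nones(
    [clean_whitespace(header), clean_whitespace(description)])
print(repr(combined))  # 'Northbound E trains are delayed.'
```

Note that when both fields are missing, the result is an empty string, not `None`.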

In [15]:
nyct["Combined description"]

9         Northbound E F trains are holding in stations ...
10        Southbound 6 trains are delayed while we addre...
16        Northbound N trains are delayed while we remov...
24        Northbound 2 5 trains are delayed while we add...
29        Northbound F trains are delayed while we condu...
                                ...                        
343788    Southbound 4 trains have resumed making expres...
343792    Southbound 2 and 3 trains are proceeding at no...
343794    Southbound 6 trains are proceeding at normal s...
343800    3 trains are running with delays in both direc...
343801    3 trains are proceeding with delays in both di...
Name: Combined description, Length: 128197, dtype: object

## Identify delay causes from alert descriptions

In this section, we will use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to parse the most likely cause of the delay from the alert description. This approach is not infallible, but as we will see, it works well because alert descriptions tend to use consistent, formulaic language. Using an LLM to determine the cause is also an option, and I experimented with it, but ultimately decided against it. For more details, check out my write-up!

Here we define the issue type and specific issues that we will use when categorizing alerts. How did I arrive at this list of issues? I landed on them using an iterative approach:

1. Define a few obvious issues that appear in the alerts, like "disruptive passenger" or "door problem".
2. Tag alerts whose description contains a keyword for one of these issues
3. Filter for alerts that did not match any issue
4. Identify the most prevalent issues in those unmatched alerts and add them to the list
5. Repeat steps 2-4 until satisfied with the coverage
6. Come up with broader categories to group issues together (issue types)
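
One pass of that loop can be sketched as follows. This is an illustration under assumptions, not the exact procedure used to build the list: the starter patterns and alert texts are made up, and counting words in the unmatched alerts is just one simple way to surface candidates for new issues.

```python
import re
from collections import Counter

# Hypothetical starter patterns and toy alerts, for illustration only
patterns = {"DOOR_PROBLEM": r"door", "DISRUPTIVE_PASSENGER": r"disruptive"}
alerts = [
    "Trains are delayed while we address a door problem at 14 St.",
    "Trains are delayed while we fix a signal problem near 59 St.",
    "Trains are running with delays after a signal malfunction at 125 St.",
]

def first_match(text):
    ''' Return the first issue whose pattern matches, or None '''
    lowered = text.lower()
    for issue, pattern in patterns.items():
        if re.search(pattern, lowered):
            return issue
    return None

# Alerts that no pattern matched
unmatched = [a for a in alerts if first_match(a) is None]

# Count words in the unmatched alerts to surface candidate keywords
word_counts = Counter(
    word for a in unmatched for word in re.findall(r"[a-z]+", a.lower()))
print(word_counts.most_common(5))
```

Here "signal" shows up in both unmatched alerts, which would suggest adding a signal-related issue before the next pass.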

In [16]:
class IssueType(StrEnum):
    MAINTENANCE = auto()
    BRAKE_ACTIVATED = auto()
    MECHANICAL_ISSUE = auto()
    HUMAN_DISRUPTION = auto()
    OBJECT_ON_TRACKS = auto()
    EMS_NYPD_FDNY_RESPONSE = auto()
    CLEANING = auto()
    MISC = auto()

In [17]:
class Issue(StrEnum):
    # Maintenance
    TRACK_MAINTENANCE = auto()
    SIGNAL_MAINTENANCE = auto()
    SWITCH_MAINTENANCE = auto()
    WORK_TRAIN = auto()
    MISC_MAINTENANCE = auto()
    # Brake activated
    BRAKES_ACTIVATED = auto()
    EMERGENCY_BRAKE_PULLED = auto()
    # Mechanical issue
    SIGNAL_PROBLEM = auto()
    SWITCH_PROBLEM = auto()
    COMMS_PROBLEM = auto()
    LOSS_OF_POWER = auto()
    RAIL_PROBLEM = auto()
    DOOR_PROBLEM = auto()
    TRAIN_PROBLEM = auto()
    MECHANICAL_PROBLEM = auto()
    # Human disruption
    DISRUPTIVE_PASSENGER = auto()
    PERSON_ON_TRACKS = auto()
    PERSON_STRUCK_BY_TRAIN = auto()
    MEDICAL_EMERGENCY = auto()
    VANDALISM = auto()
    # Object on tracks
    SOMETHING_ON_TRACKS = auto()
    FALLEN_TREE = auto()
    # EMS / NYPD / FDNY response
    EMS = auto()
    NYPD = auto()
    FDNY = auto()
    # Cleaning
    CLEANING = auto()
    # Misc
    SHORT_STAFFED = auto()
    FIRE = auto()
    FLOODING = auto()
    SOUTH_CHANNEL_BRIDGE = auto()
    TRACK_INSPECTIONS = auto()

In [18]:
# Match issues with issue type
ISSUE_TO_ISSUE_TYPE = {
    # Maintenance
    Issue.TRACK_MAINTENANCE: IssueType.MAINTENANCE,
    Issue.SIGNAL_MAINTENANCE: IssueType.MAINTENANCE,
    Issue.SWITCH_MAINTENANCE: IssueType.MAINTENANCE,
    Issue.WORK_TRAIN: IssueType.MAINTENANCE,
    Issue.MISC_MAINTENANCE: IssueType.MAINTENANCE,
    # Brake activated
    Issue.BRAKES_ACTIVATED: IssueType.BRAKE_ACTIVATED,
    Issue.EMERGENCY_BRAKE_PULLED: IssueType.BRAKE_ACTIVATED,
    # Mechanical issue
    Issue.SIGNAL_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.SWITCH_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.COMMS_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.LOSS_OF_POWER: IssueType.MECHANICAL_ISSUE,
    Issue.RAIL_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.DOOR_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.TRAIN_PROBLEM: IssueType.MECHANICAL_ISSUE,
    Issue.MECHANICAL_PROBLEM: IssueType.MECHANICAL_ISSUE,
    # Human disruption
    Issue.DISRUPTIVE_PASSENGER: IssueType.HUMAN_DISRUPTION,
    Issue.PERSON_ON_TRACKS: IssueType.HUMAN_DISRUPTION,
    Issue.PERSON_STRUCK_BY_TRAIN: IssueType.HUMAN_DISRUPTION,
    Issue.MEDICAL_EMERGENCY: IssueType.HUMAN_DISRUPTION,
    Issue.VANDALISM: IssueType.HUMAN_DISRUPTION,
    # Object on tracks
    Issue.SOMETHING_ON_TRACKS: IssueType.OBJECT_ON_TRACKS,
    Issue.FALLEN_TREE: IssueType.OBJECT_ON_TRACKS,
    # EMS / NYPD / FDNY response
    Issue.EMS: IssueType.EMS_NYPD_FDNY_RESPONSE,
    Issue.NYPD: IssueType.EMS_NYPD_FDNY_RESPONSE,
    Issue.FDNY: IssueType.EMS_NYPD_FDNY_RESPONSE,
    # Cleaning
    Issue.CLEANING: IssueType.CLEANING,
    # Misc
    Issue.SHORT_STAFFED: IssueType.MISC,
    Issue.FIRE: IssueType.MISC,
    Issue.FLOODING: IssueType.MISC,
    Issue.SOUTH_CHANNEL_BRIDGE: IssueType.MISC,
    Issue.TRACK_INSPECTIONS: IssueType.MISC,
}

In [19]:
def get_issue_category(issue_str):
    ''' Given an issue in string format, return the issue type in string format '''
    if not issue_str:
        return None
    return ISSUE_TO_ISSUE_TYPE[Issue[issue_str]].name

In [20]:
# For convenience, we will define a named tuple to hold the issue and its associated regex pattern together.
IssueRegexPattern = namedtuple('IssueRegexPattern', ['issue', 'pattern'])

Here we define the regex pattern for each issue. Since our algorithm will choose the first pattern in the list that matches the description, **the order here matters!** For that reason, we put issues that are more likely the root cause of the delay near the top (e.g. disruptive passenger), and issues that are more like fallbacks (e.g. NYPD) toward the end.
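
To see why ordering matters, consider a specific pattern and a more general one that both match the same text. The sketch below uses plain tuples and a made-up alert; `find_first` is a stand-in with first-match semantics, not the notebook's own function.

```python
import re

def find_first(text, patterns):
    ''' Return the name of the first (name, pattern) pair that matches text '''
    lowered = text.lower()
    for name, pattern in patterns:
        if re.search(pattern, lowered):
            return name
    return None

# Made-up alert text that matches both patterns
text = "Trains are delayed because of signal maintenance near 14 St."

specific_first = [
    ("SIGNAL_MAINTENANCE", r"signal maintenance"),
    ("SIGNAL_PROBLEM", r"signal"),
]
general_first = list(reversed(specific_first))

print(find_first(text, specific_first))  # SIGNAL_MAINTENANCE
print(find_first(text, general_first))   # SIGNAL_PROBLEM
```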

In [21]:
issue_regex_patterns = [
    IssueRegexPattern(Issue.SOUTH_CHANNEL_BRIDGE, r"(marine traffic|south channel bridge)"),
    IssueRegexPattern(Issue.FALLEN_TREE, r"tree"),
    IssueRegexPattern(Issue.FIRE, r"(fire|smoke)"),
    IssueRegexPattern(Issue.FLOODING, r"flooding|water"),
    IssueRegexPattern(Issue.DISRUPTIVE_PASSENGER, r"disruptive|assault|altercation"),
    IssueRegexPattern(Issue.DOOR_PROBLEM, r"door"),
    IssueRegexPattern(Issue.PERSON_STRUCK_BY_TRAIN, r"struck by[^\.]+train"),
    IssueRegexPattern(Issue.PERSON_ON_TRACKS, r"(unauthorized (person|individual)|fell onto the tracks)"),
    IssueRegexPattern(Issue.SOMETHING_ON_TRACKS, r"(debris|remov[^\.]+from the tracks)"),
    IssueRegexPattern(Issue.MEDICAL_EMERGENCY, r"(injured|medical|sick|((person|someone) in (need|crisis))|(needs help))"),
    IssueRegexPattern(Issue.EMERGENCY_BRAKE_PULLED, r"brake cord"),
    IssueRegexPattern(Issue.VANDALISM, r"vandaliz"),
    IssueRegexPattern(Issue.TRACK_MAINTENANCE, r"track maintenance"),
    IssueRegexPattern(Issue.SIGNAL_MAINTENANCE, r"signal maintenance"),
    IssueRegexPattern(Issue.SWITCH_MAINTENANCE, r"switch maintenance"),
    IssueRegexPattern(Issue.SIGNAL_PROBLEM, r"signal"),
    IssueRegexPattern(Issue.SWITCH_PROBLEM, r"switch"),
    IssueRegexPattern(Issue.COMMS_PROBLEM, r"communication[^\.]+(problem|malfunction|issue)"),
    IssueRegexPattern(Issue.RAIL_PROBLEM, r"(broken rail|replac[^\.]+rail|rail repair)"),
    IssueRegexPattern(Issue.LOSS_OF_POWER, r"power"),
    IssueRegexPattern(Issue.BRAKES_ACTIVATED, r"(train[^\.]+brake[^\.]+activat|activat[^\.]+train[^\.]+brake)"),
    IssueRegexPattern(Issue.CLEANING, r"cleaning"),
    IssueRegexPattern(Issue.WORK_TRAIN, r"work train"),
    IssueRegexPattern(Issue.MISC_MAINTENANCE, r"(maintenance|planned (track )?work|work equipment)"),
    IssueRegexPattern(Issue.TRAIN_PROBLEM, r"(removed? a train.*from service|remov.*train.*in need of repair)"),
    IssueRegexPattern(Issue.TRACK_INSPECTIONS, r"track inspection"),
    IssueRegexPattern(Issue.MECHANICAL_PROBLEM, r"mechanical problems?"),
    IssueRegexPattern(Issue.SHORT_STAFFED, r"with the crews we have available"),
    IssueRegexPattern(Issue.EMS, r"\bems\b"),
    IssueRegexPattern(Issue.NYPD, r"nypd"),
    IssueRegexPattern(Issue.FDNY, r"fdny"),
]

In [22]:
def find_first_issue(text, issue_patterns):
    ''' Given text and a list of IssueRegexPattern, return the first issue with a match '''
    # Lowercase once up front; the patterns are all written in lowercase
    text_lower = text.lower()
    for issue_pattern in issue_patterns:
        if re.search(issue_pattern.pattern, text_lower) is not None:
            return issue_pattern.issue.name
    return None

In [23]:
nyct["Issue"] = nyct["Combined description"].apply(lambda x: find_first_issue(x, issue_regex_patterns))

In [24]:
nyct["Issue type"] = nyct["Issue"].apply(get_issue_category)

In [25]:
nyct

Unnamed: 0,Alert ID,Event ID,Update Number,Date,Agency,Status Label,Affected,Header,Description,Combined description,Issue,Issue type
9,333837,161166,0,07/18/2024 06:17:00 AM,NYCT Subway,delays,E | F,Northbound E F trains are holding in stations ...,,Northbound E F trains are holding in stations ...,RAIL_PROBLEM,MECHANICAL_ISSUE
10,332834,160668,0,07/15/2024 10:53:00 AM,NYCT Subway,delays,6,Southbound 6 trains are delayed while we addre...,,Southbound 6 trains are delayed while we addre...,SIGNAL_PROBLEM,MECHANICAL_ISSUE
16,330046,159187,0,07/04/2024 09:29:00 PM,NYCT Subway,delays,N,Northbound N trains are delayed while we remov...,,Northbound N trains are delayed while we remov...,SOMETHING_ON_TRACKS,OBJECT_ON_TRACKS
24,326100,157189,0,06/21/2024 05:59:00 PM,NYCT Subway,delays,2 | 5,Northbound 2 5 trains are delayed while we add...,,Northbound 2 5 trains are delayed while we add...,DOOR_PROBLEM,MECHANICAL_ISSUE
29,322274,155232,0,06/07/2024 07:37:00 PM,NYCT Subway,delays,F,Northbound F trains are delayed while we condu...,,Northbound F trains are delayed while we condu...,,
...,...,...,...,...,...,...,...,...,...,...,...,...
343788,118,87,4,04/28/2020 08:51:00 PM,NYCT Subway,delays,4,Southbound 4 trains have resumed making expres...,,Southbound 4 trains have resumed making expres...,SIGNAL_PROBLEM,MECHANICAL_ISSUE
343792,117,86,3,04/28/2020 08:45:00 PM,NYCT Subway,delays,2 | 3,Southbound 2 and 3 trains are proceeding at no...,,Southbound 2 and 3 trains are proceeding at no...,SIGNAL_PROBLEM,MECHANICAL_ISSUE
343794,115,85,1,04/28/2020 08:09:00 PM,NYCT Subway,delays,6,Southbound 6 trains are proceeding at normal s...,,Southbound 6 trains are proceeding at normal s...,SIGNAL_PROBLEM,MECHANICAL_ISSUE
343800,86,82,0,04/28/2020 02:38:00 PM,NYCT Subway,delays,3,3 trains are running with delays in both direc...,,3 trains are running with delays in both direc...,BRAKES_ACTIVATED,BRAKE_ACTIVATED


Let's manually inspect some of these classifications to make sure they seem reasonable.

In [26]:
def filter_by_issue_or_type(df, enum_value, column):
    ''' Return the rows in df where the column matches the enum value; enum_value can be None '''
    if (enum_value is None):
        return df[df[column].isna()]
    else:
        return df[df[column] == enum_value]

In [27]:
def filter_by_issue(df, issue):
    return filter_by_issue_or_type(df, issue, "Issue")

In [28]:
def print_as_list(df):
    for row in list(df):
        print(row, end="\n\n")

Let's print the descriptions of some delays tagged as switch problems:

In [29]:
print_as_list(filter_by_issue(nyct, Issue.SWITCH_PROBLEM.name).sample(10)["Combined description"])

There is no 3 train service between Crown Hts-Utica Av and New Lots Av in both directions. We are addressing a switch problem at New Lots Av. Service Changes: The last stop on southbound 3 trains will be Chambers St or Crown Hts-Utica Av. Some southbound 3 trains will run via the 2 line from Franklin Av to Flatbush Av-Brooklyn College To ease train congestion in Brooklyn, some southbound 4 trains will end at Brooklyn Bridge-City Hall or Bowling Green. Travel Alternatives: For service between New Lots Av and Crown Hts-Utica Av, use the B14 or B15 buses. As an alternative for service in Brooklyn, consider using 2 4 5 trains. Expect delays in 2 3 4 5 train service.

Northbound D N R trains are running at slow speeds in Brooklyn after our crews made temporary repairs to a malfunctioning switch near Atlantic Av-Barclays Ctr. Expect long waits between southbound trains as well.

G trains are delayed in both directions while our crews work to correct a switch malfunction near Bedford-Nostrand

And some delays where the issue is an object on the tracks:

In [30]:
print_as_list(filter_by_issue(nyct, Issue.SOMETHING_ON_TRACKS.name).sample(10)["Combined description"])

E F trains are running with delays in both directions. What Happened? We removed debris from the tracks near Briarwood.

4 5 6 trains are running with severe delays in both directions after we removed debris from the tracks that activated a train's brakes near 59 St. 4 5 6 trains have resumed making scheduled stops in both directions.

Southbound A trains are delayed while our crews remove debris from the tracks at Euclid Av.

Q trains are running with delays in both directions after we removed debris from the tracks that caused a train's brakes to activate at Sheepshead Bay.

Southbound 1 trains are running with delays after we removed debris from the tracks at 86 St. Southbound 1 trains have resumed making local stops from 96 St to 72 St.

Northbound 2 and 3 trains are delayed while we remove debris from the tracks at Nevins St.

Northbound 6 trains are delayed while we work to remove debris from the tracks near Westchester Sq-E Tremont Av.

2 trains are delayed in both directions wh

And finally, let's sample a few alerts with no issue identified.

In [31]:
print_as_list(filter_by_issue(nyct, None).sample(10)["Combined description"])

Southbound F trains are running with delays after we held a train at Broadway-Lafayette St.

Expect longer waits for 5 trains in both directions. We're running as much service as we can with the train crews we have available.

You may wait longer for an N train. We're running as much service as we can with the train crews we have available.

You may wait longer for a southbound R train. We're running as much service as we can with the train crews we have available.

Queens-bound E M trains are running at slower speeds because of congestion along Queens Boulevard. Some M trains are running on the F line from 47-50 Sts-Rockefeller Ctr to 36 St/Jackson Hts-Roosevelt Av. Expect longer waits for F R while sharing the tracks with rerouted trains.

You may wait longer for a 4 train. We're running as much service as we can with the train crews we have available.

Northbound 6 trains are delayed while we move equipment into place for track replacement work at Hunts Point Av.

You may experience

You may notice that some alerts *could* be classified as one of our identified issues, but weren't picked up by our regex patterns. This is okay: as we'll soon see, our coverage is still high enough to work with. More advanced NLP techniques could help, but we've reached the [80/20 point](https://en.wikipedia.org/wiki/Pareto_principle).
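
One simple way to put a number on coverage is the share of alerts with a non-null issue. The sketch below uses a made-up four-row frame; in the notebook, the same expressions would presumably be applied to `nyct["Issue"]`.

```python
import pandas as pd

# Toy stand-in for the "Issue" column; None marks an alert with no match
toy = pd.DataFrame(
    {"Issue": ["SIGNAL_PROBLEM", None, "DOOR_PROBLEM", "SIGNAL_PROBLEM"]})

# Fraction of alerts that were assigned an issue
coverage = toy["Issue"].notna().mean()
print(f"{coverage:.0%}")  # 75%

# Breakdown by issue, keeping the unmatched alerts visible
print(toy["Issue"].value_counts(dropna=False))
```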

## Identify the main station associated with the alert

Next, we'll want to identify which station the issue occurred near or at. We'll start by creating a regex pattern that matches *all* station names.

In [32]:
STATION_NAMES_DEDUPED = set(list(stations_df["Stop Name"]))
# Escape parentheses so they aren't interpreted as regex groups
ESCAPED_STATION_NAMES = [station_name.replace('(',r'\(').replace(')',r'\)') for station_name in STATION_NAMES_DEDUPED]
STATION_NAMES_REGEX = fr'({"|".join(ESCAPED_STATION_NAMES)})'
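
One caveat with a plain alternation: Python's `re` tries alternatives left to right, so if a short name such as `14 St` happens to come before `14 St-Union Sq` in the pattern, the shorter name wins. Because the pattern above is built from a set, that order is arbitrary. The sketch below shows a possible refinement, which is not what this notebook does: sort the names longest-first and use `re.escape` to escape every special character, not only parentheses.

```python
import re

# Two real-looking stop names where one is a prefix of the other
names = ["14 St", "14 St-Union Sq"]
text = "Trains are delayed at 14 St-Union Sq."

# Shorter alternative listed first: the prefix is matched
short_first = "(" + "|".join(re.escape(n) for n in names) + ")"

# Longest alternative listed first: the full name is matched
by_length = sorted(names, key=len, reverse=True)
longest_first = "(" + "|".join(re.escape(n) for n in by_length) + ")"

print(re.search(short_first, text).group(1))    # 14 St
print(re.search(longest_first, text).group(1))  # 14 St-Union Sq
```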

In [33]:
def clean_text_for_station_identification(text):
    ''' Perform some simple cleaning to help identify station names '''
    return text \
        .replace(' - ', '-') \
        .replace('Jct', 'Junction') \
        .replace('Delancy', 'Delancey') \
        .replace("Christopher St-Sheridan Sq", "Christopher St-Stonewall")

In [34]:
def get_station(description):
    ''' Given some text, find a station name that follows "at", "near", or "for".
        If the text says "at that station", take the first station name found anywhere instead '''
    if not description:
        return None
    description_cleaned = clean_text_for_station_identification(description)
    if "at that station" in description:
        station_name_regex = STATION_NAMES_REGEX
        regex_group = 1
    else:
        station_name_regex = fr'(at|near|for) {STATION_NAMES_REGEX}'
        regex_group = 2
    station_name_match = re.search(station_name_regex, description_cleaned)
    if station_name_match:
        return station_name_match.group(regex_group)
    else:
        return None

In [35]:
nyct["Station"] = nyct["Combined description"].apply(get_station)

In [36]:
nyct["Station"].value_counts()

Station
14 St                       3496
59 St                       3011
125 St                      2826
Times Sq-42 St              2103
Atlantic Av-Barclays Ctr    2033
                            ... 
Pleasant Plains                4
New Dorp                       4
Bay Terrace                    3
Tompkinsville                  2
Arthur Kill                    2
Name: count, Length: 368, dtype: int64

Let's inspect a few descriptions where no station was identified.

In [37]:
no_station_identified = nyct[nyct["Station"].isna()]
print_as_list(no_station_identified.sample(10)["Combined description"])

A Rockaway Park Shuttle trains are delayed in both directions in the Rockaways while the South Channel Bridge is opened to allow marine traffic to pass.

A Rockaway Park Shuttle trains are delayed in both directions while the South Channel Bridge opens for marine traffic to pass.

Southbound Q trains are running with delays after we addressed a problem with our signaling system along the D N R lines in Brooklyn. D N R trains are experiencing delays, causing congestion which is delaying Q trains.

Northbound N trains are running with delays after we moved work equipment along the line in Brooklyn. Northbound N local service has resumed from Kings Hwy to 59 St.

A Rockaway Park Shuttle trains are delayed in both directions while the South Channel Bridge opens for marine traffic to pass.

You may wait longer for a northbound B train after we moved a train that had its brakes activated earlier in Brooklyn.

Expect longer wait times for uptown N and R trains following a temporary loss of po

One failure mode of our algorithm is alerts which use the phrasing `"... <station name> ... at that station."`. We will also leave this as a potential 80/20 improvement.

### Disambiguate station names by services

We have station names, but *which* station? I mean, how many 23 Sts are there?

In [38]:
len([station for station in stations_df["Stop Name"] if station == "23 St"])

5

I rest my case. In order to disambiguate which station each alert refers to, we can use the `"Affected"` column in the alerts dataset to determine which line the station is on, and then find the station which matches the name and line.

Why not just compare the `"Affected"` services in the alert to the `"Daytime Routes"` listed for each station? Take this alert for example:

In [39]:
nyct[(nyct["Affected"] == "A") & (nyct["Station"] == "23 St")].sample(1)

Unnamed: 0,Alert ID,Event ID,Update Number,Date,Agency,Status Label,Affected,Header,Description,Combined description,Issue,Issue type,Station
278162,62489,18051,0,03/22/2021 10:36:00 AM,NYCT Subway,delays,A,Northbound A trains are delayed while we inves...,,Northbound A trains are delayed while we inves...,BRAKES_ACTIVATED,BRAKE_ACTIVATED,23 St


In this case, the issue at 23 St affects service on the `A`, but 23 St does not serve the `A` train! We can remedy this by looking at *lines* instead of individual stations. Our approach will be to group stations by line, and determine which services run on that line, as well as all the station names. Then, for a given station name and affected services, we match it to a line which includes that station name and contains that service. If this approach fails for whatever reason, we will pick a station at random whose name matches.

One edge case arises in which a line contains a station which serves a route that does not run along the rest of the line. One example is BMT Broadway - Brighton, which includes West 8 St-NY Aquarium, which serves the `F`, despite the `F` not running along that line. This could cause issues because when searching for a train station called `"23 St"` that serves the `F`, the obvious choice would be the 23 St on the IND 6th Av - Culver, but the 23 St on BMT Broadway - Brighton would technically also be a candidate by our previous logic. We mitigate this issue by requiring that a service must be served by at least 3 stations on the line for it to "count".

One note is that we only have the *daytime* routes of each station, which excludes late-night service changes. However, this is a fine approximation with the data available.

In [40]:
# The "Affected" column in the alerts dataset is pipe-delimited
def split_affected_services(services_string):
    return services_string.split(' | ')

# The "Daytime Routes" column in the station dataset is space-delimited
def split_served_services(services_string):
    return services_string.split(' ')

In [41]:
def flatten_daytime_routes(daytime_routes_list, min_appearances=3):
    ''' Given a list of space-delimited services, flatten them into a set
        Also filters out services that appear fewer than `min_appearances` times '''
    # If the line has < min_appearances stations, relax the restriction
    min_appearances = min(len(daytime_routes_list), min_appearances)
    routes_list = []
    for daytime_routes in daytime_routes_list:
        routes_list += split_served_services(daytime_routes)
    counter = Counter(routes_list)
    return set([key for key in counter if counter[key] >= min_appearances])
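
To make the threshold concrete, here is a condensed, self-contained restatement of the same logic applied to made-up inputs: a line whose stations are served by the `Q`, with a single station also served by the `F`, and a two-station shuttle line.

```python
from collections import Counter

def flatten_routes(daytime_routes_list, min_appearances=3):
    ''' Condensed restatement of the filtering logic, for illustration '''
    # Relax the threshold on lines with fewer stations than min_appearances
    min_appearances = min(len(daytime_routes_list), min_appearances)
    counter = Counter(
        route for routes in daytime_routes_list for route in routes.split(" "))
    return {route for route, n in counter.items() if n >= min_appearances}

# Made-up line: four stations on the Q, one of which is also served by the F
print(flatten_routes(["Q", "Q", "F Q", "Q"]))  # {'Q'}

# Made-up two-station line: the threshold relaxes to 2, so S is kept
print(flatten_routes(["S", "S"]))  # {'S'}
```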

In [42]:
services_per_line = stations_df.groupby(["Line", "Division"]).aggregate(**{
    "Stop Name": ("Stop Name", list), 
    "GTFS Stop ID": ("GTFS Stop ID", list),
    "Daytime Routes": ("Daytime Routes", flatten_daytime_routes)
}).reset_index()
services_per_line.head(5)

Unnamed: 0,Line,Division,Stop Name,GTFS Stop ID,Daytime Routes
0,4th Av,BMT,"[Atlantic Av-Barclays Ctr, Union St, 4 Av-9 St...","[R31, R32, R33, R34, R35, R36, R39, R40, R41, ...","{N, R}"
1,63rd St,IND,"[21 St-Queensbridge, Roosevelt Island, Lexingt...","[B04, B06, B08]",{F}
2,6th Av - Culver,IND,"[W 4 St-Wash Sq, 57 St, 47-50 Sts-Rockefeller ...","[D20, B10, D15, D16, D17, D18, D19, D21, D22, ...","{M, B, F, D, G}"
3,8th Av - Fulton St,IND,"[Inwood-207 St, Dyckman St, 190 St, 181 St, 17...","[A02, A03, A05, A06, A07, A09, A10, A11, A12, ...","{C, A, B, E}"
4,Astoria,BMT,"[Astoria-Ditmars Blvd, Astoria Blvd, 30 Av, Br...","[R01, R03, R04, R05, R06, R08, R11, R13, R09]","{N, W}"


Let's verify that a line with fewer than 3 stops still contains routes:

In [43]:
services_per_line[services_per_line["Line"] == "Lexington - Shuttle"]

Unnamed: 0,Line,Division,Stop Name,GTFS Stop ID,Daytime Routes
19,Lexington - Shuttle,IRT,"[Times Sq-42 St, Grand Central-42 St]","[902, 901]",{S}


In [44]:
def does_line_match_name_and_services(line, station_name, services):
    ''' Check that the station name matches a station on the line and that at least one of the affected services runs on that line '''
    return len(set(services).intersection(line["Daytime Routes"])) > 0 and station_name in line["Stop Name"]

In [45]:
def fallback_disambiguate_station_name(station_name):
    ''' Pick a random station that matches the station name '''
    return stations_df[stations_df["Stop Name"] == station_name].sample(1)["GTFS Stop ID"].iloc[0]

In [46]:
def disambiguate_station_name_by_line(station_name, affected_services):
    ''' Returns the GTFS Stop ID for the station that matches the affected services
        Resorts to randomly picking a station with a matching name as a fallback '''
    if not station_name or not affected_services:
        return None
    affected_services = split_affected_services(affected_services)
    matching_lines = services_per_line[services_per_line.apply(
        lambda line: does_line_match_name_and_services(line, station_name, affected_services), axis=1)]
    if len(matching_lines) == 0:
        return fallback_disambiguate_station_name(station_name)
    matching_line = matching_lines.sample(1).iloc[0]
    # Guaranteed to exist because of the filter
    matching_stop_name_index = list(matching_line["Stop Name"]).index(station_name)
    return matching_line["GTFS Stop ID"][matching_stop_name_index]
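
The matching idea can be seen in isolation with plain dictionaries. The sketch below is a simplification under stated assumptions: each toy line lists only its `23 St` stop, the route sets and stop IDs are copied from the outputs shown in this notebook, and the hypothetical `stop_id_for` returns the first matching line (or `None`) instead of sampling at random or falling back to a name-only match.

```python
# Toy subset of lines; only the "23 St" stop of each line is included
lines = [
    {"Line": "8th Av - Fulton St", "Stop Name": ["23 St"],
     "GTFS Stop ID": ["A30"], "Daytime Routes": {"A", "B", "C", "E"}},
    {"Line": "6th Av - Culver", "Stop Name": ["23 St"],
     "GTFS Stop ID": ["D18"], "Daytime Routes": {"B", "D", "F", "G", "M"}},
]

def stop_id_for(station_name, affected):
    ''' Hypothetical helper: first line with the station name and a shared service '''
    services = set(affected.split(" | "))
    for line in lines:
        if station_name in line["Stop Name"] and services & line["Daytime Routes"]:
            index = line["Stop Name"].index(station_name)
            return line["GTFS Stop ID"][index]
    return None

print(stop_id_for("23 St", "A"))      # A30
print(stop_id_for("23 St", "F | M"))  # D18
```

A service such as `B` appears in both route sets here, so an alert affecting only the `B` would match both lines; the notebook's function resolves that case by sampling one of the matching lines.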

This cell can take a while to run, since it applies a row-wise Python function to each of the roughly 128,000 alerts.

In [47]:
nyct["GTFS Stop ID"] = nyct.apply(lambda x: disambiguate_station_name_by_line(x["Station"], x["Affected"]), axis=1)

Let's inspect some results by looking at alerts for `"23 St"` and ensure that different GTFS Stop IDs are present.

In [48]:
nyct[nyct["Station"] == "23 St"][["Affected", "Station", "GTFS Stop ID"]].sample(10)

Unnamed: 0,Affected,Station,GTFS Stop ID
228657,F,23 St,D18
136204,1,23 St,130
269412,A | E,23 St,A30
321941,E,23 St,A30
228961,N | Q,23 St,R19
5631,F | M,23 St,D18
148467,1 | 2,23 St,130
275097,A | C | E,23 St,A30
73518,1,23 St,130
313857,6,23 St,634


And to verify one of those GTFS Stop IDs:

In [49]:
stations_df[stations_df["GTFS Stop ID"] == "A30"]

Unnamed: 0,GTFS Stop ID,Station ID,Complex ID,Division,Line,Stop Name,Borough,CBD,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,North Direction Label,South Direction Label,ADA,ADA Northbound,ADA Southbound,ADA Notes,Georeference
164,A30,165,165,IND,8th Av - Fulton St,23 St,M,True,C E,Subway,40.745906,-73.998041,Uptown,Downtown,0,0,0,,POINT (-73.998041 40.745906)


## Group multiple alerts for the same event together

Now we can take the penultimate step: grouping alerts by `Event ID`. Each alert is tied to an `Event ID` along with an `Update Number` that starts at `0` and increases with each update. Because we would like to analyze *events* rather than individual alerts, we will group by `Event ID` and aggregate the remaining relevant data.

In [50]:
def parse_alert_datetime(alert_datetime):
    ''' Converts a datetime string to a Python datetime object '''
    return datetime.strptime(alert_datetime, "%m/%d/%Y %I:%M:%S %p")

def parse_alert_datetimes(alert_datetimes):
    ''' Converts a list of datetime strings to Python datetime objects '''
    return [parse_alert_datetime(dt) for dt in alert_datetimes]

def aggregate_items_dedup(lists_of_items):
    ''' Flatten a list of lists to a deduplicated set '''
    items_set = set()
    for list_of_items in lists_of_items:
        items_set = items_set.union(set(list_of_items))
    return items_set

def group_affected_services(list_of_affected_services):
    ''' Aggregate affected services by deduplicating '''
    return aggregate_items_dedup([split_affected_services(affected) for affected in list_of_affected_services])
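
As an alternative to parsing each string with `datetime.strptime`, pandas can parse a whole column at once using the same format string. This is a minimal sketch on two values in the dataset's `Date` format; the `dates` Series is an assumption for illustration, and the cells in this notebook keep using the `strptime` helpers above.

```python
import pandas as pd

# Two sample strings in the "MM/DD/YYYY hh:mm:ss AM/PM" layout used by the Date column
dates = pd.Series(["01/08/2024 07:17:00 PM", "09/15/2023 01:05:00 PM"])

# Passing an explicit format avoids per-value format inference
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.min(), parsed.max())
```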

In [51]:
# First we sort by update number, so that when we choose the first issue / stop ID, it is always from the earliest alert
nyct_events_groupby = nyct.sort_values("Update Number").groupby("Event ID")
nyct_events = nyct_events_groupby.agg(**{
    'First alert datetime': ('Date', lambda x: min(parse_alert_datetimes(x))),
    'Last alert datetime': ('Date', lambda x: max(parse_alert_datetimes(x))),
    'Number of updates': ('Update Number', 'count'),
    'Affected services': ('Affected', group_affected_services),
    'Combined description': ('Combined description', list),
    'Issue': ('Issue', 'first'),
    'Issue type': ('Issue type', 'first'),
    'GTFS Stop ID': ('GTFS Stop ID', 'first'),
}).reset_index()
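
To see why the sort matters, here is a minimal sketch of the same sort-then-aggregate pattern on a toy frame. The rows are made up for illustration; only the column names mirror the real data.

```python
import pandas as pd

# Event 7 has two alerts listed out of order; event 9 has one
toy_alerts = pd.DataFrame({
    "Event ID": [7, 7, 9],
    "Update Number": [1, 0, 0],
    "Issue": ["SIGNAL_PROBLEM", "TRAIN_PROBLEM", "FIRE"],
})

toy_events = toy_alerts.sort_values("Update Number").groupby("Event ID").agg(**{
    "Number of updates": ("Update Number", "count"),
    # After sorting, 'first' sees update 0 before update 1 within each event
    "Issue": ("Issue", "first"),
}).reset_index()
print(toy_events)
```

Event 7 ends up with the issue from its update `0`. One caveat worth keeping in mind: the `'first'` aggregation skips missing values by default, so if the earliest alert has no identified issue, the value comes from the next alert that does.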

In [52]:
nyct_events.sample(10)

Unnamed: 0,Event ID,First alert datetime,Last alert datetime,Number of updates,Affected services,Combined description,Issue,Issue type,GTFS Stop ID
3972,8838,2020-10-23 09:32:00,2020-10-23 09:44:00,3,{L},[Expect longer wait time for Brooklyn-bound L ...,TRAIN_PROBLEM,MECHANICAL_ISSUE,D19
4669,10627,2020-11-18 11:22:00,2020-11-18 11:27:00,2,"{F, E}",[Southbound E and F trains are delayed while w...,MEDICAL_EMERGENCY,HUMAN_DISRUPTION,B04
16809,40903,2021-12-03 16:02:00,2021-12-03 16:02:00,1,"{C, A}",[Northbound A C trains are running with delays...,SIGNAL_PROBLEM,MECHANICAL_ISSUE,A34
15575,36747,2021-10-31 06:05:00,2021-10-31 06:57:00,2,{E},[Northbound E trains are delayed while FDNY wo...,FIRE,MISC,G09
62295,154114,2024-05-31 23:28:00,2024-05-31 23:28:00,1,"{G, F}",[Northbound F trains and Court Sq-bound G trai...,SIGNAL_PROBLEM,MECHANICAL_ISSUE,F20
15397,36079,2021-10-26 18:54:00,2021-10-26 19:39:00,2,"{C, E}",[Southbound C and E trains are delayed while w...,SIGNAL_PROBLEM,MECHANICAL_ISSUE,A31
8777,18767,2021-04-04 05:06:00,2021-04-04 05:06:00,1,{3},[3 trains are delayed entering and leaving Har...,PERSON_ON_TRACKS,HUMAN_DISRUPTION,301
21893,56345,2022-04-04 00:59:00,2022-04-04 01:07:00,3,{Q},[Northbound Q trains are delayed while our cre...,SIGNAL_PROBLEM,MECHANICAL_ISSUE,D31
12552,27030,2021-07-30 11:29:00,2021-07-30 11:55:00,3,{Q},[Southbound Q trains are delayed while we trou...,DOOR_PROBLEM,MECHANICAL_ISSUE,D32
30628,81550,2022-10-27 00:31:00,2022-10-27 00:36:00,2,"{M, J}",[Broad St-bound J and Middle Village-Metropoli...,DISRUPTIVE_PASSENGER,HUMAN_DISRUPTION,M16


In [53]:
def format_percent(ratio):
    return "{:.2%}".format(ratio)

In [54]:
no_issue_identified = filter_by_issue_or_type(nyct_events, None, "Issue")
print(f'Percent of events with an identified issue: {format_percent(1 - len(no_issue_identified) / len(nyct_events))}')

Percent of events with an identified issue: 96.36%


In [55]:
no_station_identified = nyct_events[nyct_events["GTFS Stop ID"].isna()]
print(f'Percent of events with an identified station: {format_percent(1 - len(no_station_identified) / len(nyct_events))}')

Percent of events with an identified station: 91.27%


In [56]:
percent_of_stations_represented = len(nyct_events["GTFS Stop ID"].value_counts()) / len(stations_df)
print(f'Percent of stations represented: {format_percent(percent_of_stations_represented)}')

Percent of stations represented: 97.38%


## Join with station data

Now, we can complete our data preparation by joining each event with its associated station, if it was identified with one.

In [57]:
# Perform a left join so that we don't drop any alerts.
nyct_events_station = nyct_events.join(stations_df.set_index("GTFS Stop ID"), on="GTFS Stop ID", how="left")
# Drop columns that seem less relevant.
nyct_events_station = nyct_events_station.drop(
    columns=["Station ID", "Complex ID", "CBD", "North Direction Label", "South Direction Label", "ADA Northbound", "ADA Southbound", 
             "ADA Notes", "Georeference"])
nyct_events_station["ADA"] = nyct_events_station["ADA"].apply(lambda x: np.nan if pd.isna(x) else bool(x))
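
The effect of the left join can be checked on a toy example: events without a stop ID are kept, and their station columns come back as missing. The two small frames below are invented for illustration.

```python
import pandas as pd

# One event with an identified stop, one without
toy_events = pd.DataFrame({"Event ID": [1, 2], "GTFS Stop ID": ["A30", None]})
toy_stations = pd.DataFrame({
    "GTFS Stop ID": ["A30", "D18"],
    "Stop Name": ["23 St", "23 St"],
})

joined = toy_events.join(toy_stations.set_index("GTFS Stop ID"), on="GTFS Stop ID", how="left")

# Both events survive; the unmatched one has a missing Stop Name
print(joined)
```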

In [58]:
nyct_events_station.sample(10)

Unnamed: 0,Event ID,First alert datetime,Last alert datetime,Number of updates,Affected services,Combined description,Issue,Issue type,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,ADA
11389,24369,2021-06-25 09:56:00,2021-06-25 09:56:00,1,"{F, E}",[Queens-bound E and F trains are delayed while...,DISRUPTIVE_PASSENGER,HUMAN_DISRUPTION,F07,IND,Queens Blvd,75 Av,Q,E F,Subway,40.718331,-73.837324,False
46926,118539,2023-09-06 16:29:00,2023-09-06 16:29:00,1,"{N, R, W}",[Northbound N R W trains are running with dela...,DISRUPTIVE_PASSENGER,HUMAN_DISRUPTION,R15,BMT,Broadway - Brighton,49 St,M,N R W,Subway,40.759901,-73.984139,True
51119,129069,2023-11-24 12:43:00,2023-11-24 12:45:00,2,{F},[Northbound F trains are delayed while we remo...,CLEANING,CLEANING,F15,IND,6th Av - Culver,Delancey St-Essex St,M,F,Subway,40.718611,-73.988114,False
57242,142113,2024-03-04 02:30:00,2024-03-04 02:30:00,1,"{C, A}",[A C trains are running with delays in both di...,LOSS_OF_POWER,MECHANICAL_ISSUE,A40,IND,8th Av - Fulton St,High St,Bk,A C,Subway,40.699337,-73.990531,False
63539,157680,2024-06-24 21:49:00,2024-06-24 21:52:00,2,{5},[Southbound 5 trains are delayed while we inve...,FALLEN_TREE,OBJECT_ON_TRACKS,502,IRT,Dyre Av,Baychester Av,Bx,5,Open Cut,40.878663,-73.838591,False
207,550,2020-05-06 09:29:00,2020-05-06 09:29:00,1,{D},[Expect longer wait times in northbound D trai...,BRAKES_ACTIVATED,BRAKE_ACTIVATED,D43,BMT,Sea Beach / West End / Culver / Brighton,Coney Island-Stillwell Av,Bk,D F N Q,Viaduct,40.577422,-73.981233,True
21199,54325,2022-03-17 17:21:00,2022-03-17 17:21:00,1,{F},[Southbound F trains are running with delays a...,DISRUPTIVE_PASSENGER,HUMAN_DISRUPTION,B06,IND,63rd St,Roosevelt Island,M,F,Subway,40.759145,-73.95326,True
44134,111404,2023-07-12 10:57:00,2023-07-12 12:11:00,3,{Q},[Southbound Q trains are delayed while we inve...,BRAKES_ACTIVATED,BRAKE_ACTIVATED,B08,IND,63rd St,Lexington Av/63 St,M,F Q,Subway,40.764629,-73.966113,True
46820,118275,2023-09-04 09:14:00,2023-09-04 10:31:00,2,"{N, R}",[Southbound N trains are running with delays a...,RAIL_PROBLEM,MECHANICAL_ISSUE,R25,BMT,Broadway,Cortlandt St,M,R W,Subway,40.710668,-74.011029,True
33754,88775,2022-12-28 15:44:00,2022-12-28 16:09:00,3,{7},[34 St-bound 7 trains are delayed while we inv...,BRAKES_ACTIVATED,BRAKE_ACTIVATED,701,IRT,Flushing,Flushing-Main St,Q,7,Subway,40.7596,-73.83003,True


# Export

Now we'll prepare and export our hard work to analyze further in the next notebook. We'll first define human-friendly descriptions for each issue and issue type.

In [59]:
ISSUE_TO_TEXT = {
    # Maintenance
    Issue.TRACK_MAINTENANCE: "Track maintenance",
    Issue.SIGNAL_MAINTENANCE: "Signal maintenance",
    Issue.SWITCH_MAINTENANCE: "Switch maintenance",
    Issue.WORK_TRAIN: "Work train-related",
    Issue.MISC_MAINTENANCE: "Unspecified maintenance",
    # Brake activated
    Issue.BRAKES_ACTIVATED: "Brakes activated",
    Issue.EMERGENCY_BRAKE_PULLED: "Emergency brake cord pulled",
    # Mechanical issue
    Issue.SIGNAL_PROBLEM: "Signal problem",
    Issue.SWITCH_PROBLEM: "Switch problem",
    Issue.COMMS_PROBLEM: "Communication problem",
    Issue.LOSS_OF_POWER: "Power issue",
    Issue.RAIL_PROBLEM: "Rail problem",
    Issue.DOOR_PROBLEM: "Door problem",
    Issue.TRAIN_PROBLEM: "Unspecified train problem",
    Issue.MECHANICAL_PROBLEM: "Unspecified mechanical problem",
    # Human disruption
    Issue.DISRUPTIVE_PASSENGER: "Disruptive passenger",
    Issue.PERSON_ON_TRACKS: "Person on tracks",
    Issue.PERSON_STRUCK_BY_TRAIN: "Person struck by train",
    Issue.MEDICAL_EMERGENCY: "Medical emergency",
    Issue.VANDALISM: "Vandalism",
    # Object on tracks
    Issue.SOMETHING_ON_TRACKS: "Something on tracks",
    Issue.FALLEN_TREE: "Fallen tree",
    # EMS / NYPD / FDNY response
    Issue.EMS: "Unspecified EMS response",
    Issue.NYPD: "Unspecified NYPD response",
    Issue.FDNY: "Unspecified FDNY response",
    # Cleaning
    Issue.CLEANING: "Cleaning",
    # Short staffed
    Issue.SHORT_STAFFED: "Short-staffed",
    # Misc
    Issue.FIRE: "Fire / smoke",
    Issue.FLOODING: "Flooding",
    Issue.SOUTH_CHANNEL_BRIDGE: "South Channel Bridge open",
    Issue.TRACK_INSPECTIONS: "Track inspection",
}

ISSUE_TYPE_TO_TEXT = {
    IssueType.MAINTENANCE: "Maintenance",
    IssueType.BRAKE_ACTIVATED: "Brake activated",
    IssueType.MECHANICAL_ISSUE: "Mechanical issue",
    IssueType.HUMAN_DISRUPTION: "Passenger issue",
    IssueType.OBJECT_ON_TRACKS: "Object on tracks",
    IssueType.EMS_NYPD_FDNY_RESPONSE: "EMS/NYPD/FDNY response",
    IssueType.CLEANING: "Cleaning",
    IssueType.MISC: "Miscellaneous"
}

In [60]:
def get_issue_text(issue_enum_str):
    if not issue_enum_str:
        return None
    return ISSUE_TO_TEXT[Issue[issue_enum_str]]

def get_issue_type_text(issue_type_enum_str):
    if not issue_type_enum_str:
        return None
    return ISSUE_TYPE_TO_TEXT[IssueType[issue_type_enum_str]]

## CSV export

To prepare for CSV serialization, every field should be a string or number.

In [61]:
def serialize_nyct_events_dataset(df):
    ''' Prepare the events dataset for CSV '''
    new_df = df.copy()
    new_df["First alert datetime"] = new_df["First alert datetime"].apply(lambda x: x.isoformat())
    new_df["Last alert datetime"] = new_df["Last alert datetime"].apply(lambda x: x.isoformat())
    new_df["Affected services"] = new_df["Affected services"].apply(lambda x: " ".join(x))
    new_df["Combined description"] = new_df["Combined description"].apply(lambda x: " | ".join(x))
    new_df["Issue"] = new_df["Issue"].apply(get_issue_text)
    new_df["Issue type"] = new_df["Issue type"].apply(get_issue_type_text)
    return new_df

In [62]:
serialized_nyct_events = serialize_nyct_events_dataset(nyct_events_station)
serialized_nyct_events.sample(10)

Unnamed: 0,Event ID,First alert datetime,Last alert datetime,Number of updates,Affected services,Combined description,Issue,Issue type,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,ADA
15302,35761,2021-10-24T07:41:00,2021-10-24T08:15:00,4,A,Southbound A trains are delayed while NYPD res...,Disruptive passenger,Passenger issue,A06,IND,8th Av - Fulton St,181 St,M,A,Subway,40.851695,-73.937969,True
34481,90147,2023-01-11T08:21:00,2023-01-11T08:38:00,2,2,2 trains are delayed in both directions while ...,Person on tracks,Passenger issue,211,IRT,Lenox - White Plains Rd,Pelham Pkwy,Bx,2 5,Elevated,40.857192,-73.867615,True
49535,125263,2023-10-26T17:27:00,2023-10-26T18:02:00,2,C,Southbound C trains are running with severe de...,Disruptive passenger,Passenger issue,A09,IND,8th Av - Fulton St,168 St,M,A C,Subway,40.840719,-73.939561,True
40317,102450,2023-05-03T00:08:00,2023-05-03T00:17:00,2,4 5,Northbound 4 5 trains are delayed while we rem...,Unspecified train problem,Mechanical issue,225,IRT,Lenox - White Plains Rd,125 St,M,2 3,Subway,40.807754,-73.945495,False
12953,27904,2021-08-13T08:15:00,2021-08-13T08:28:00,2,D,Northbound D trains are delayed while our crew...,Track maintenance,Maintenance,D10,IND,Concourse,167 St,Bx,B D,Subway,40.833771,-73.91844,False
56412,140601,2024-02-20T11:50:00,2024-02-20T11:54:00,2,4 5 6,Southbound 6 trains are running on the express...,Unspecified mechanical problem,Mechanical issue,622,IRT,Lexington Av,116 St,M,6,Subway,40.798629,-73.941617,False
22449,57834,2022-04-16T19:11:00,2022-04-16T19:11:00,1,Q,Northbound Q trains are running with delays af...,Door problem,Mechanical issue,B08,IND,63rd St,Lexington Av/63 St,M,F Q,Subway,40.764629,-73.966113,True
49444,125034,2023-10-25T05:57:00,2023-10-25T05:57:00,1,4,Southbound 4 trains are running with delays wh...,Rail problem,Mechanical issue,416,IRT,Jerome Av,138 St-Grand Concourse,Bx,4 5,Subway,40.813224,-73.929849,False
39639,100987,2023-04-20T18:45:00,2023-04-20T19:09:00,3,4 5,Southbound 4 5 trains are delayed while we add...,Unspecified mechanical problem,Mechanical issue,419,IRT,Lexington Av,Wall St,M,4 5,Subway,40.707557,-74.011862,False
410,971,2020-05-15T21:36:00,2020-05-15T22:21:00,2,4,Southbound 4 trains are running at slower spee...,Signal problem,Mechanical issue,D03,IND,Concourse,Bedford Park Blvd,Bx,B D,Subway,40.873244,-73.887138,True


In [72]:
serialized_nyct_events.to_csv('../../data/processed-data/nyct-events.csv', index=False)
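
When the CSV is read back, the flattened fields have to be split again. The sketch below uses an in-memory stand-in for the exported file, with two made-up rows and only three of the columns, and assumes that no description itself contains the `" | "` separator.

```python
import io

import pandas as pd

# Stand-in for nyct-events.csv: made-up rows with the same delimiters as the export
csv_text = (
    "Event ID,Affected services,Combined description\n"
    "1,4 5,First update | Second update\n"
    "2,A,Only update\n"
)

# Reading services as strings keeps values such as "4" from being parsed as integers
loaded = pd.read_csv(io.StringIO(csv_text), dtype={"Affected services": str})

# Undo the " ".join and " | ".join applied during serialization
loaded["Affected services"] = loaded["Affected services"].apply(lambda s: s.split(" "))
loaded["Combined description"] = loaded["Combined description"].apply(lambda s: s.split(" | "))
print(loaded)
```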

## JSON export

We can also export as JSON, which is friendlier for web applications. The preparation mirrors the CSV one, except that list-valued fields can stay as lists instead of being joined into strings.

In [64]:
def prepare_nyct_events_for_json(df):
    new_df = df.copy()
    new_df["First alert datetime"] = new_df["First alert datetime"].apply(lambda x: x.isoformat())
    new_df["Last alert datetime"] = new_df["Last alert datetime"].apply(lambda x: x.isoformat())
    new_df["Affected services"] = new_df["Affected services"].apply(list)
    new_df["Issue"] = new_df["Issue"].apply(get_issue_text)
    new_df["Issue type"] = new_df["Issue type"].apply(get_issue_type_text)
    new_df["Daytime Routes"] = new_df["Daytime Routes"].apply(lambda x: [] if pd.isna(x) else split_served_services(x))
    return new_df

We can also nest all the station fields under a single `stationData` key, which is set to `None` (JSON `null`) when no station was identified for the event.

In [65]:
def convert_nyct_df_to_json(df):
    return [
        {
            "eventId": row["Event ID"],
            "firstAlertDatetime": row["First alert datetime"],
            "lastAlertDatetime": row["Last alert datetime"],
            "numUpdates": row["Number of updates"],
            "affectedServices": row["Affected services"],
            "combinedDescriptions": row["Combined description"],
            "issue": row["Issue"],
            "issueType": row["Issue type"],
            "stationData": {
                "gtfsStopId": row["GTFS Stop ID"],
                "division": row["Division"],
                "line": row["Line"],
                "stopName": row["Stop Name"],
                "daytimeRoutes": row["Daytime Routes"],
                "borough": row["Borough"],
                "structure": row["Structure"],
                "latLong": [row["GTFS Latitude"], row["GTFS Longitude"]],
                "adaFriendly": row["ADA"]
            } if not pd.isna(row["GTFS Stop ID"]) else None
        } for _, row in df.iterrows()]

In [66]:
json_ready_nyct_events = prepare_nyct_events_for_json(nyct_events_station)

In [67]:
nyct_events_json = convert_nyct_df_to_json(json_ready_nyct_events)

Print an example of the JSON data.

In [68]:
print(json.dumps(nyct_events_json[0], indent=2))

{
  "eventId": 82,
  "firstAlertDatetime": "2020-04-28T14:38:00",
  "lastAlertDatetime": "2020-04-28T14:45:00",
  "numUpdates": 2,
  "affectedServices": [
    "3"
  ],
  "combinedDescriptions": [
    "3 trains are running with delays in both directions while we investigate why a train's brakes were activated at Harlem - 148 St.",
    "3 trains are proceeding with delays in both directions after we moved a train that had its brakes activated at Harlem - 148 St."
  ],
  "issue": "Brakes activated",
  "issueType": "Brake activated",
  "stationData": {
    "gtfsStopId": "301",
    "division": "IRT",
    "line": "Lenox - White Plains Rd",
    "stopName": "Harlem-148 St",
    "daytimeRoutes": [
      "3"
    ],
    "borough": "M",
    "structure": "Subway",
    "latLong": [
      40.82388,
      -73.93647
    ],
    "adaFriendly": false
  }
}


Print an example where there is no station data.

In [69]:
print(json.dumps(
    [obj for obj in nyct_events_json if not obj["stationData"]][0], indent=2))

{
  "eventId": 171,
  "firstAlertDatetime": "2020-04-29T09:31:00",
  "lastAlertDatetime": "2020-04-29T09:31:00",
  "numUpdates": 1,
  "affectedServices": [
    "2"
  ],
  "combinedDescriptions": [
    "2 trains are running approximately every 12 minutes between Wakefield - 241 St and Flatbush Av - Brooklyn College. We're running as many trains as we possibly can with the crews we have available."
  ],
  "issue": "Short-staffed",
  "issueType": "Miscellaneous",
  "stationData": null
}


In [70]:
with open('../../data/processed-data/nyct-events.json', 'w') as f:
    json.dump(nyct_events_json, f)
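
A consumer of this file needs to handle the `null` station case. As a minimal sketch, the snippet below parses two records that follow the exported schema (trimmed to a few keys, and held in a string instead of the real file) and keeps only the events with station data.

```python
import json

# Trimmed stand-ins for exported records; the real objects carry more keys
sample_json = json.dumps([
    {"eventId": 82, "issue": "Brakes activated",
     "stationData": {"gtfsStopId": "301", "latLong": [40.82388, -73.93647]}},
    {"eventId": 171, "issue": "Short-staffed", "stationData": None},
])

events = json.loads(sample_json)

# stationData is null (None in Python) when no station was identified
located = [e for e in events if e["stationData"] is not None]
stop_ids = [e["stationData"]["gtfsStopId"] for e in located]
print(stop_ids)
```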