# Part 2 - Cleaning and Transforming of the Alerts Data

## Import Alerts Uncleaned Data in Databricks

In [0]:
import re
import pandas as pd
SAS_TOKEN = 'sp=racwdlmeop&st=2023-01-19T15:17:20Z&se=2023-02-10T23:17:20Z&spr=https&sv=2021-06-08&sr=c&sig=SNP1pr7qFgO1k1a8nm2MmfX9mp2EnPJKaBQ7eHEEgsg%3D'
CONTAINER = 'fg4'
STOR_ACCT = 'cohort40storage'
ROOT_PATH = f'wasbs://{CONTAINER}@{STOR_ACCT}.blob.core.windows.net/'

spark.conf.set(f'fs.azure.sas.{CONTAINER}.{STOR_ACCT}.blob.core.windows.net', SAS_TOKEN)

read_path = ROOT_PATH + 'mta-nypd/alerts.csv'



After the subway alerts were pulled in using Selenium and saved to a blob container, the data had to be explored, cleaned, and filtered. The cell above provides the permissions to access the saved blob. SAS tokens and container names are hidden for security purposes. The `re` (regular expression) and pandas modules are also imported for the data cleaning step. The cell below reads in the blob and displays the first 1000 rows in a PySpark dataframe.

In [0]:
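alerts = spark.read.csv(
    read_path,
    header=True,
    mode="DROPMALFORMED",  # silently drop rows that cannot be parsed
    multiLine=True         # alert messages can span several lines
)
# Databricks' built-in display() renders up to the first 1000 rows as a table
display(alerts)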


After browsing through the table above, it was noted that the titles of the alerts are informative and can be used to filter the alerts. For example, some alerts appear to be updates of a previous alert, meaning that they do not refer to a new incident but to an earlier one. The titles often also include the boroughs and the trains affected by the alert. To get an idea of the size of the data, the alerts dataframe was converted from a PySpark dataframe to a pandas dataframe. The shape attribute showed that there were almost 130 thousand rows (129,993) and 5 columns.

In [0]:
df1 = alerts.toPandas()
df1.shape

Out[3]: (129993, 5)

Given that update alerts do not refer to new incidents or delays, the code below creates two lists: one with the indices of the alerts whose title contains the word "update" and one with the indices of those whose title does not. Calling `len` on both lists shows that about half the alerts are updates. This means that about half the alerts can be filtered out.

In [0]:
updates = []
non_updates = []
for i in range(len(df1)):
    if re.search('update', df1.title[i].lower()):
        updates.append(i)
    else:
        non_updates.append(i)
        
print(f'There are {len(updates)} alerts that are updates.')
print(f'There are {len(non_updates)} alerts that do not contain the word "update" in the title.')

There are 64670 alerts that are updates.
There are 65323 alerts that do not contain the word "update" in the title.
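
The same split can also be sketched without an explicit loop, using pandas' vectorized string methods. The snippet below is only an illustration on toy data: the `toy` dataframe and its titles are made up, and `str.contains` with `case=False` is assumed to be an acceptable stand-in for `re.search('update', title.lower())`.

```python
import pandas as pd

# Toy titles standing in for df1['title'] (illustrative data, not the real alerts)
toy = pd.DataFrame({
    "title": [
        "Update: BKLYN, F Train, Delays",
        "MANH, L Train, Delays",
        "UPDATED: QNS, 7 Train, Delays",
    ]
})

# Case-insensitive literal substring match on every title at once
is_update = toy["title"].str.contains("update", case=False, regex=False)

n_updates = int(is_update.sum())
n_non_updates = int((~is_update).sum())
print(f"There are {n_updates} alerts that are updates.")
print(f"There are {n_non_updates} alerts that do not contain the word \"update\" in the title.")
```

A boolean mask like `is_update` can be reused directly for filtering, which avoids keeping separate lists of indices.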


It was then noted that alert titles that name the affected trains using the word 'train' appear to refer to unexpected incidents or delays. Titles that include the word 'line' appear to be planned delays, and those that include neither 'train' nor 'line' are commonly planned service alerts or holiday delays. For the purpose of this analysis, we want to look at the unexpected alerts, so the titles that do not include the word 'train' will be filtered out. The code below shows that, among the non-update alerts, significantly more appear to be unexpected.

In [0]:
non_update_train = []
non_update_missing_trains = []
for i in non_updates:
    if re.search('train', df1.title[i].lower()):
        non_update_train.append(i)
    else:
        non_update_missing_trains.append(i)
        
print(f'There are {len(non_update_train)} alerts with the word "train" in the title.')
print(f'There are {len(non_update_missing_trains)} alerts without the word "train" in the title.')

There are 55669 alerts with the word "train" in the title.
There are 9654 alerts without the word "train" in the title.


In [0]:
activity = []
for i in range(len(df1)):
    if re.search('train', df1.title[i].lower()):
        activity.append(1)
    else:
        activity.append(0)

df2 = df1.assign(Delay_Status = activity)
# Adds a delay status to dataframe

updates = []

for i in range(len(df2)):
    if re.search('update', df2.title[i].lower()):
        updates.append(1)
    else:
        updates.append(0)

df3 = df2.assign(Update_Status = updates)
# Adds an update status into dataframe

df3.head()

Unnamed: 0,id,datetime,agency,title,message,Delay_Status,Update_Status
0,1172816,1/17/22 11:47 PM,NYC,"Update: BKLYN, F Train, Delays",Northbound F trains are running on the A line ...,1,1
1,1172813,1/17/22 11:42 PM,NYC,"Update: BKLYN, F Train, Delays",Northbound F trains are running on the A line ...,1,1
2,1172812,1/17/22 11:42 PM,NYC,"MANH, A and E Trains, Delays",A E trains are delayed while NYPD responds to ...,1,0
3,1172811,1/17/22 11:36 PM,NYC,"MANH, L Train, Delays",8 Av-bound L trains are delayed while our crew...,1,0
4,1172808,1/17/22 11:31 PM,NYC,"MANH, Q Train, Delays",Q trains are running with delays in both direc...,1,0


The code above added two columns to start the filtering process. The first column, 'Delay_Status', is 1 if the alert is believed to be for an unexpected delay and 0 if it is believed to refer to a planned delay. The second column, 'Update_Status', is 1 if the alert is an update and 0 if it is not. Then, using the code below, only the rows that have a 'Delay_Status' of 1 and an 'Update_Status' of 0 are selected for further processing. This version of the dataset will be referred to as 'unexpected_delays'.

In [0]:
non_update_df = df3[df3['Update_Status']==0]
print(non_update_df.shape)
# removed update alerts
unexpected_delays = non_update_df[non_update_df['Delay_Status'] == 1]
# remove expected delays
print(unexpected_delays.shape)

(65323, 7)
(55669, 7)
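
As a compact illustration of the two flags and the filter described above, the sketch below builds both columns from boolean masks and applies the combined condition in one step. The `toy` dataframe and its titles are hypothetical; only the column names 'Delay_Status' and 'Update_Status' come from the notebook.

```python
import pandas as pd

# Toy titles (illustrative only)
toy = pd.DataFrame({
    "title": [
        "Update: BKLYN, F Train, Delays",
        "MANH, A and E Trains, Delays",
        "Planned Work: Weekend Service Changes",
    ]
})

# 1/0 flags built from boolean masks, mirroring Delay_Status and Update_Status
flags = toy.assign(
    Delay_Status=toy["title"].str.contains("train", case=False, regex=False).astype(int),
    Update_Status=toy["title"].str.contains("update", case=False, regex=False).astype(int),
)

# Keep rows that mention a train and are not updates
unexpected = flags[(flags["Delay_Status"] == 1) & (flags["Update_Status"] == 0)]
print(unexpected["title"].tolist())
```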


It is ideal to categorize the alerts by affected train line and affected borough. The set below lists the different train tokens found across the alerts. The padding spaces and trailing commas are part of each token so that a single letter or digit is only matched when it stands alone. It should be noted that some tokens are errors and some are equivalent to others. For example, 'sir', 'SIR', and all similar variants refer to the Staten Island Railway.

In [0]:
train_lines = {' 1 ',' 2 ',' 3 ',' 4 ',' 5 ',' 6 ',' 7 ',' A ',' B ',' C ',' D ',' E ',' F ',' J ',' L ',' G ',' M ',' N ',' Q ',' R ',' S ',' W ',' Z ',' 1, ',' 2, ',' 3, ',' 4, ',' 5, ',' 6, ',' 7, ',' A, ',' B, ',' C, ',' D, ',' E, ',' F, ',' J, ',' L, ',' G, ',' M, ',' N, ',' Q, ',' R, ',' S, ',' W, ',' Z, ',' FM, ',' sir ', ' SIR ', ' SiR ', ' SIR, '}

The code below creates a new dataframe, similar to unexpected_delays but with one additional column, the affected train line. The code iterates through all the trains listed above and searches for each one in the title of every row of unexpected_delays. Each time there is a match, the row is added to the new dataframe. Since an alert can affect multiple trains, it can be represented by multiple rows in the new dataframe, each differing from the others only in the affected train.

In [0]:
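train_lines_df = pd.DataFrame()
for line in train_lines:
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    temp_df = unexpected_delays[unexpected_delays['title'].str.contains(line, regex=False)].copy()
    # Strip the padding spaces and commas so that ' A ' and ' A, ' both become 'A'
    temp_df['train_line'] = line.strip(" ,")
    train_lines_df = pd.concat([train_lines_df, temp_df])

# Adds a column for associated train line
train_lines_df.shape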

Out[27]: (81848, 8)
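
To make the row duplication concrete, here is a minimal sketch of the same matching loop on two made-up titles. The `toy` dataframe and `toy_lines` list are hypothetical and cover only a small subset of the tokens above.

```python
import pandas as pd

# Two illustrative titles: one affecting two trains, one affecting a single train
toy = pd.DataFrame({
    "title": [
        "MANH, A and E Trains, Delays",
        "MANH, L Train, Delays",
    ]
})

# Padded tokens, so that 'A' is not matched inside words such as 'MANH'
toy_lines = [" A ", " E ", " L "]

parts = []
for line in toy_lines:
    # regex=False treats the token as a literal substring
    matched = toy[toy["title"].str.contains(line, regex=False)].copy()
    matched["train_line"] = line.strip(" ,")
    parts.append(matched)

toy_lines_df = pd.concat(parts, ignore_index=True)
print(toy_lines_df)
```

The first title matches both ' A ' and ' E ', so it appears twice in `toy_lines_df`, once per train line.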

The code below works the same way as the previous cell, except that it adds the affected borough rather than the train line. This loop can also add rows to the dataframe: if an alert affects multiple boroughs, it is split into multiple rows, one per borough.

In [0]:
borough = ['MANH', 'BX', 'QNS', 'BKLYN', 'SIR']
train_lines_borough_df = pd.DataFrame()
for bor in borough:
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    temp_df = train_lines_df[train_lines_df['title'].str.contains(bor, regex=False)].copy()
    temp_df['Borough'] = bor
    train_lines_borough_df = pd.concat([train_lines_borough_df, temp_df])

# Latest dataset with borough
train_lines_borough_df.head()
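
An alternative way to get one row per borough, sketched here on made-up titles, is to collect every borough code found in a title and then use `explode`. This is not the method used in this notebook: the regular expression with word boundaries is an assumption and is slightly stricter than the plain substring search above, and the row order differs because rows stay grouped by alert rather than by borough.

```python
import pandas as pd

# Illustrative titles only
toy = pd.DataFrame({
    "title": [
        "BX, MANH, 1 Train, Local to Express",
        "QNS, 7 Train, Delays",
    ]
})

boroughs = ['MANH', 'BX', 'QNS', 'BKLYN', 'SIR']

# One capture group, so findall returns the list of borough codes in each title
pattern = r"\b(" + "|".join(boroughs) + r")\b"
toy["Borough"] = toy["title"].str.findall(pattern)

# explode turns each list element into its own row
exploded = toy.explode("Borough").reset_index(drop=True)
print(exploded)
```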

Since the data has been filtered to contain only alerts that we understand to be delays and not updates, we can drop the 'Delay_Status' and 'Update_Status' columns.

In [0]:
del train_lines_borough_df['Delay_Status']
del train_lines_borough_df['Update_Status']
train_lines_borough_df.head()

Unnamed: 0,id,datetime,agency,title,message,train_line,Borough
41,1172692,1/17/22 4:23 PM,NYC,"MANH, 1 Train, Some Delays",1 trains are delayed entering/leaving South Fe...,1,MANH
55,1172650,1/17/22 1:57 PM,NYC,"MANH, 1 Train, Some Delays",1 trains may experience delays while entering ...,1,MANH
119,1172337,1/16/22 9:52 PM,NYC,"MANH, 1 Train, Local to Express",Uptown 1 trains are running express from Times...,1,MANH
203,1171944,1/15/22 4:52 PM,NYC,"MANH, 1 and 2 Trains, Delays",1 2 trains are delayed in both directions whil...,1,MANH
267,1171633,1/14/22 9:39 PM,NYC,"BX, MANH, 1 Train, Local to Express",Northbound 1 trains are running on the express...,1,MANH


To save the table, the pandas dataframe was converted to a PySpark dataframe and saved as a CSV called "Nonupdates_Active_Borough_Train_Subway_Alerts.csv". Spark writes this as a directory containing a part file rather than as a single file.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# Every column is stored as a nullable string
mySchema = StructType([
    StructField("id", StringType(), True),
    StructField("datetime", StringType(), True),
    StructField("agency", StringType(), True),
    StructField("title", StringType(), True),
    StructField("message", StringType(), True),
    StructField("train_line", StringType(), True),
    StructField("Borough", StringType(), True),
])
subway_alerts = spark.createDataFrame(train_lines_borough_df, schema=mySchema)

# ROOT_PATH already ends with '/'; coalesce(1) produces a single part file in the output directory
subway_alerts.coalesce(1).write.mode('overwrite').csv(ROOT_PATH + "Nonupdates_Active_Borough_Train_Subway_Alerts.csv", header=True)

The next portion was done in Visual Studio Code. We begin by reading in the file written by the cell above from blob storage.

In [0]:
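from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobClient

# Assigning column header names
column_names = ['Alert_Code', 'DateTime', 'Agency', 'Title', 'Message', 'Train_Line', 'Borough']

# Reading in the data; STOR_ACCT, CONTAINER and SAS_TOKEN must be defined as in the first cell
blob_client1 = BlobClient.from_blob_url(f'https://{STOR_ACCT}.blob.core.windows.net/{CONTAINER}/Nonupdates_Active_Borough_Train_Subway_Alerts.csv/part-00000-tid-2334004524103334285-aee0707e-bc17-434d-9d5e-7e56df1da108-713-1-c000.csv?{SAS_TOKEN}')

# The file was written with a header row, so header=0 replaces it with column_names.
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False.
full_alerts = pd.read_csv(BytesIO(blob_client1.download_blob().readall()), names=column_names, header=0, on_bad_lines='skip')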


In [0]:
# Drop the rows whose Message is null, since they cannot be categorized
full_alerts.dropna(subset=["Message"], inplace=True)
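
Because the file saved from Databricks was written with a header row, it is worth checking how `names` and `header` interact in `pd.read_csv` when the columns are renamed on read. The sketch below uses a made-up two-row CSV; `csv_text` and `new_names` are hypothetical.

```python
from io import StringIO

import pandas as pd

# A tiny CSV with its own header line (illustrative data)
csv_text = 'id,title\n1,"MANH, L Train, Delays"\n2,"QNS, 7 Train, Delays"\n'
new_names = ["Alert_Code", "Title"]

# header=None: the existing header line is read as an ordinary data row
kept = pd.read_csv(StringIO(csv_text), names=new_names, header=None)

# header=0: the existing header line is consumed and replaced by new_names
replaced = pd.read_csv(StringIO(csv_text), names=new_names, header=0)

print(len(kept), len(replaced))
```

With `header=None` the old header survives as a row of data (here a row whose `Title` is the string "title"), which would later be categorized like any other alert.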

The next step is to categorize the MTA messages. A separate file of code shows how we arrived at the categories using topic modeling (an LDA model). The code below loops through two dictionaries. In the first, each key is a single keyword or phrase; in the second, each key is a pair of keywords that must both appear in the message. In both, the value is the category to assign. The code also creates the "Message_Category" column, where the assigned category is stored.

In [0]:
# Define the keywords or phrases that will trigger a certain category
single_keywords = {
    "nypd": "NYPD/FDNY Investigation",
    "police": "NYPD/FDNY Investigation",
    "investigation": "NYPD/FDNY Investigation",
    "investigate": "NYPD/FDNY Investigation",
    "unauthorized person": "NYPD/FDNY Investigation",
    "fdny": "NYPD/FDNY Investigation",
    "fire": "NYPD/FDNY Investigation",
    "assaulted": "NYPD/FDNY Investigation",
    "assault": "NYPD/FDNY Investigation", 
    "disruptive": "NYPD/FDNY Investigation",
    "altercation": "NYPD/FDNY Investigation",
    "unruly passenger": "NYPD/FDNY Investigation",
    "unruly": "NYPD/FDNY Investigation",
    "maintenance": "Train/Track Maintenance",
    "clean": "Train/Track Maintenance", 
    "cleaned": "Train/Track Maintenance",
    "cleaning": "Train/Track Maintenance",
    "switch": "Train/Track Maintenance",
    "replaced rails": "Train/Track Maintenance",
    "replaced a rail": "Train/Track Maintenance",
    "rail replacement": "Train/Track Maintenance",
    "replace rails": "Train/Track Maintenance",
    "replacing rails": "Train/Track Maintenance",
    "rail condition": "Train/Track Maintenance",
    "replace a rail": "Train/Track Maintenance",
    "broken rail": "Train/Track Maintenance",
    "tree on the tracks": "Train/Track Maintenance",
    "debris": "Train/Track Maintenance",
    "garbage": "Train/Track Maintenance",
    "vandalized": "Train/Track Maintenance",
    "vandalism": "Train/Track Maintenance",
    "dirty": "Train/Track Maintenance",
    "track work": "Train/Track Maintenance",
    "from the tracks": "Train/Track Maintenance",
    "track replacement": "Train/Track Maintenance", 
    "work train": "Train/Track Maintenance",
    "rail power": "Train/Track Maintenance",
    "repair": "Train/Track Maintenance",
    "move equipment": "Train/Track Maintenance",
    "track condition": "Train/Track Maintenance",
    "inspection": "Train/Track Maintenance",
    "replacement track": "Train/Track Maintenance",
    "remove": "Train/Track Maintenance",
    "elevators": "Mechanical Issues",
    "mechanical": "Mechanical Issues",
    "emergency brake": "Mechanical Issues",
    "door problem": "Mechanical Issues",
    "malfunction": "Mechanical Issues",
    "power outage": "Mechanical Issues",
    "loss of power": "Mechanical Issues",
    "communication issue": "Mechanical Issues",
    "communications issue": "Mechanical Issues",
    "lighting": "Mechanical Issues",
    "connectivity": "Mechanical Issues",
    "communications problem": "Mechanical Issues",
    "stalled train": "Mechanical Issues",
    "signal": "Signal Issues",
    "sick": "Medical",
    "ems": "Medical",
    "injured": "Medical",
    "injury": "Medical",
    "medical": "Medical",
    "emergency teams": "Medical",
    "emergency crews": "Medical",
    "struck by": "Medical",
    "emergency personel": "Medical",
    "are running on": "Change of Service",
    "are running along": "Change of Service",
    "running express": "Change of Service",
    "for continuing service": "Change of Service"
     
    
}
# Pairs of keywords: a category is triggered only when both appear in the message
combined_keywords = {
    ("someone", "doors"): "NYPD/FDNY Investigation",
    ("passenger", "doors"): "NYPD/FDNY Investigation",
    ("removed", "tracks"): "Train/Track Maintenance",
    ("remove", "tracks"): "Train/Track Maintenance",
    ("removed", "service"): "Train/Track Maintenance",
    ("remove", "service"): "Train/Track Maintenance",
    ("inspect", "tracks"): "Train/Track Maintenance",
    ("inspected", "tracks"): "Train/Track Maintenance",
    ("isolate", "train"): "Train/Track Maintenance",
    ("removed", "car"): "Train/Track Maintenance",
    ("move", "storage"): "Train/Track Maintenance",
    ("equipment", "work"): "Train/Track Maintenance",
    ("brakes", "activated"): "Mechanical Issues",
    ("brakes", "activate"): "Mechanical Issues",
    ("brakes", "activating"): "Mechanical Issues",
    ("brake's", "activated"): "Mechanical Issues",
    ("loss of", "power"): "Mechanical Issues",
    ("share", "track"): "Change of Service",
    ("sharing", "track"): "Change of Service",   
    ("service", "suspended"): "Change of Service",
    ("divert", "trains"): "Change of Service",
    
}

# Loop through each message in the dataframe
for index, row in full_alerts.iterrows():
    message = row["Message"]
    # Treat missing messages as empty strings and compare in lowercase
    message = '' if pd.isnull(message) else str(message).lower()
    # Initialize the category to None
    category = None
    # Single keywords: the first keyword found in the message sets the category
    for keyword, cat in single_keywords.items():
        if keyword in message:
            category = cat
            break
    # Keyword pairs: a pair whose keywords are all present overrides the single-keyword match
    for keywords, cat in combined_keywords.items():
        if all(k in message for k in keywords):
            category = cat
            break
    # Anything left unmatched goes into a catch-all category
    if category is None:
        category = "Miscellaneous"
    full_alerts.at[index, "Message_Category"] = category
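
The matching order can be easier to follow when the rule is written as a small function and tried on a few messages. The sketch below is only an illustration: `categorize` is a hypothetical helper name, and the two toy dictionaries and the example messages are made up. It follows the same precedence as the loop above, where a matching keyword pair overrides a single-keyword match.

```python
def categorize(message, single_keywords, combined_keywords, default="Miscellaneous"):
    """Return a category for one message (hypothetical helper, not part of the notebook)."""
    text = message.lower() if isinstance(message, str) else ""
    category = None
    # First single keyword found wins among single keywords
    for keyword, cat in single_keywords.items():
        if keyword in text:
            category = cat
            break
    # A keyword pair that is fully present overrides the single-keyword result
    for keywords, cat in combined_keywords.items():
        if all(k in text for k in keywords):
            category = cat
            break
    return category if category is not None else default


# Toy dictionaries and messages (illustrative only)
toy_single = {"signal": "Signal Issues", "sick": "Medical"}
toy_combined = {("brakes", "activated"): "Mechanical Issues"}

examples = [
    "Trains are delayed because of signal problems.",
    "Delays after a train's brakes were activated near a signal.",
    "Trains are running with delays.",
]
results = [categorize(m, toy_single, toy_combined) for m in examples]
print(results)
```

A function like this could also be applied column-wise, for example with `full_alerts["Message"].apply(categorize, args=(single_keywords, combined_keywords))`, instead of iterating with `iterrows`.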


In [0]:
# Write out the new dataframe as a CSV
full_alerts.to_csv('full_alerts.csv', index=False)