# On this Notebook
creation date: February 23rd, 2025

This notebook builds a preliminary understanding of the subway delay data, then cleans it to correct errors and inconsistent spellings introduced when the data was entered.

In [35]:
import pandas as pd 
from collections import Counter
import jellyfish

df = pd.read_csv('data/delays/subway/ttc-subway-delay-data-2014-2024.csv')

## Subway Lines
Examine which lines appear in the data and whether their names are spelled consistently.

### Lines Code Info
- YU: Yonge-University Line 1   | Terminal Stations: Finch and Vaughan Metropolitan Centre
- BD: Bloor-Danforth Line 2     | Terminal Stations: Kipling and Kennedy
- SRT: Scarborough Line 3       | Terminal Stations: Kennedy and McCowan | **Note, this line is no longer open**
- SHP: Sheppard Line 4          | Terminal Stations: Sheppard-Yonge and Don Mills

In [36]:
print(df['Line'].value_counts())

Line
YU              107901
BD               95079
SHP               7723
SRT               7667
YU/BD             2890
                 ...  
YU - BD LINE         1
60                   1
9 BELLAMY            1
YU BD                1
20 CLIFFSIDE         1
Name: count, Length: 95, dtype: int64


Use `Counter` to tally how often each line label occurs.

In [37]:
# count how many times each raw line label appears
lines = df['Line'].tolist()
lines = Counter(lines)
print(lines)

Counter({'YU': 107901, 'BD': 95079, 'SHP': 7723, 'SRT': 7667, 'YU/BD': 2890, nan: 699, 'YU / BD': 92, 'B/D': 63, 'YUS': 39, 'YU/ BD': 33, 'BD/YU': 21, 'YU & BD': 18, 'YUS/BD': 18, '999': 9, 'BD LINE': 6, 'BD/YUS': 5, 'YU - BD': 5, 'YU-BD': 4, 'YU/BD LINES': 4, 'YU/BD LINE': 3, 'SHEP': 3, 'BLOOR DANFORTH': 2, '95 YORK MILLS': 2, '36 FINCH WEST': 2, '29 DUFFERIN': 2, '510 SPADINA': 2, '11 BAYVIEW': 2, 'YU LINE': 2, '16 MCCOWAN': 2, '35 JANE': 2, 'BD LINE 2': 2, '31 GREENWOOD': 1, '60': 1, '9 BELLAMY': 1, '45 KIPLING': 1, '504': 1, '500': 1, 'SHEPPARD': 1, '104 FAYWOOD': 1, '60 STEELES WEST': 1, '25 DON MILLS': 1, '555': 1, '126 CHRISTIE': 1, '37 ISLINGTON': 1, '504 KING': 1, '116 MORNINGSIDE': 1, '73 ROYAL YORK': 1, 'BLOOR DANFORTH LINE': 1, 'YU/SHEP': 1, '66': 1, '341 KEELE': 1, '63 OSSINGTON': 1, '32 EGLINTON WEST': 1, '129 MCCOWAN NORTH': 1, 'YU BD': 1, 'YU - BD LINE': 1, '85 SHEPPARD EAST': 1, 'BLOOR DANFORTH LINES': 1, 'YONGE UNIVERSITY SERVI': 1, '704 RAD BUS': 1, 'YU\\BD': 1, '46 

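The combined Yonge-University / Bloor-Danforth labels (YU/BD, YU / BD, BD/YU, and so on) need a single spelling, and so do the Bloor-Danforth variants (BD, B/D, BD LINE, and so on).

The nan entries and the miscellaneous labels, which look like bus and streetcar route numbers, account for a negligible share of the rows. They can be dropped so that the analysis focuses on the main subway lines.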

In [38]:
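def map_lines(line):
    # non-string labels (e.g. NaN) cannot be matched, so treat them as 'Other'
    if not isinstance(line, str):
        return 'Other'
    # labels naming both lines, e.g. 'YU/BD' or 'BD/YUS'
    if 'YU' in line and 'BD' in line:
        return 'LINE 1 and LINE 2'
    elif 'YU' in line:
        return 'LINE 1'
    # loose check so that 'B/D' and 'BLOOR DANFORTH' also count as Line 2;
    # note that it matches any label containing both letters
    elif 'B' in line and 'D' in line:
        return 'LINE 2'
    elif 'SHP' in line:
        return 'LINE 4'
    elif 'SRT' in line:
        return 'LINE 3'
    else:
        return 'Other'

df['Line Number'] = df['Line'].map(map_lines)
# drop unmapped rows; .copy() avoids SettingWithCopyWarning on later column assignments
df = df[df['Line Number'] != 'Other'].copy()
df['Line Number'].value_counts()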

Line Number
LINE 1               107944
LINE 2                95160
LINE 4                 7723
LINE 3                 7667
LINE 1 and LINE 2      3103
Name: count, dtype: int64
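
Because `map_lines` relies on substring checks, it is worth auditing which raw labels end up under each canonical line. The sketch below is only an illustration: it repeats the same rules so that it runs on its own, and `raw_labels` is a small hand-picked sample of labels from the `Counter` output, not the full data.

```python
from collections import defaultdict

def map_lines(line):
    # Same rules as the notebook's map_lines, repeated so this sketch runs alone.
    if not isinstance(line, str):
        return 'Other'
    if 'YU' in line and 'BD' in line:
        return 'LINE 1 and LINE 2'
    elif 'YU' in line:
        return 'LINE 1'
    elif 'B' in line and 'D' in line:
        return 'LINE 2'
    elif 'SHP' in line:
        return 'LINE 4'
    elif 'SRT' in line:
        return 'LINE 3'
    else:
        return 'Other'

# Hand-picked sample of raw labels taken from the Counter output (not exhaustive).
raw_labels = ['YU', 'BD', 'B/D', 'BLOOR DANFORTH', 'SHP', 'SHEP', 'SHEPPARD',
              'YU/SHEP', '704 RAD BUS', '35 JANE', 'SRT']

# Group the raw labels by the canonical line they are mapped to.
audit = defaultdict(list)
for label in raw_labels:
    audit[map_lines(label)].append(label)

for canonical, labels in audit.items():
    print(f"{canonical}: {labels}")
```

On this sample the audit shows two things to be aware of: '704 RAD BUS' contains both a 'B' and a 'D' and is therefore counted as Line 2, while 'SHEP' and 'SHEPPARD' do not contain 'SHP' and fall into 'Other'. Both cases involve only a handful of rows in the counts above.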

In [39]:
test = df[(df['Min Delay'] == 0) & (df['Min Gap'] == 0)]
print(f"There are {test.shape[0]} rows with 0 min delay and 0 min gap")

There are 144264 rows with 0 min delay and 0 min gap


## Delay Code mapping

Map each delay code to a plain-language description of its cause.

In [40]:
codes = pd.read_csv('data/delays/subway/subway_delay_code.csv')
delay_codes = dict(zip(codes['SUB RMENU CODE'], codes['CODE DESCRIPTION']))

def delay_causes(code):
    # fall back to a placeholder for codes missing from the lookup table
    return delay_codes.get(code, 'Unrecognized Code')

df['Cause for Delay'] = df['Code'].map(delay_causes)

## Bound

Drop the bounds that do not make sense, i.e. anything other than N, E, S, or W. Despite extensive searching, Y, R, 5, and 0 do not appear to stand for anything. **Note:** there are quite a few 'B' bounds. These might indicate that both directions were delayed, so they are kept for now.


!!! **This part is not finished yet**

In [44]:
df = pd.read_csv('data/delays/subway/ttc-subway-delay-data-2014-2024.csv')

In [46]:
df.head()

Unnamed: 0,Datetime,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle
0,1/1/2014 0:21:00,Wednesday,VICTORIA PARK STATION,MUPR1,55,60,W,BD,5111
1,1/1/2014 2:06:00,Wednesday,HIGH PARK STATION,SUDP,3,7,W,BD,5001
2,1/1/2014 2:40:00,Wednesday,SHEPPARD STATION,MUNCA,0,0,,YU,0
3,1/1/2014 3:10:00,Wednesday,LANSDOWNE STATION,SUDP,3,8,W,BD,5116
4,1/1/2014 3:20:00,Wednesday,BLOOR STATION,MUSAN,5,10,S,YU,5386


In [42]:
# keep only rows whose bound is N, W, E, S, or B (possibly "both directions")
print(df.shape)
df = df[df['Bound'].isin(['N', 'W', 'E', 'S', 'B'])]
df['Bound'].value_counts()

(221597, 11)


Bound
S    46637
W    39885
N    39565
E    37706
B      311
Name: count, dtype: int64
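
Before discarding rows, it can help to tally exactly which bound values the filter removes, including missing ones (note that `isin` also drops rows where `Bound` is empty). The following is a minimal sketch on made-up toy data; the frame `toy` and its values are illustrative and do not come from the delay file.

```python
import pandas as pd

# Toy stand-in for the real 'Bound' column (values are illustrative only).
toy = pd.DataFrame({"Bound": ["N", "S", "B", "Y", None, "5", "W", "E", "R"]})

valid_bounds = ["N", "E", "S", "W", "B"]
is_valid = toy["Bound"].isin(valid_bounds)

# Tally what the filter would discard, counting missing values as well.
dropped = toy.loc[~is_valid, "Bound"].value_counts(dropna=False)
print(dropped)

kept = toy[is_valid]
print(f"kept {len(kept)} of {len(toy)} rows")
```

Running the same tally on the real column would show how many rows are lost to each unexplained code versus to missing values.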

## Station Name Cleaning

Standardize the station names so that each station has a single canonical spelling. Misspellings and alternate names are corrected by fuzzy-matching each raw name against the list of known stations on its line.

Read in the station names for each line.

In [23]:
stations = {}

for i in range(1, 5):
    with open(f'data/delays/subway/stations_on_lines/line{i}.txt', 'r') as f:
        # skip the first line of each file (assumed to be a header)
        names = f.read().splitlines()[1:]
    stations[f'LINE {i}'] = names

print(stations)

{'LINE 1': ['FINCH', 'NORTH YORK CENTRE', 'SHEPPARD-YONGE', 'YORK MILLS', 'LAWRENCE', 'EGLINTON', 'DAVISVILLE', 'ST CLAIR', 'SUMMERHILL', 'ROSEDALE', 'BLOOR', 'WELLESLEY', 'COLLEGE', 'DUNDAS', 'QUEEN', 'KING', 'UNION', 'ST ANDREW', 'OSGOODE', 'ST PATRICK', "QUEEN'S PARK", 'MUSEUM', 'ST GEORGE', 'SPADINA', 'DUPONT', 'ST CLAIR WEST', 'EGLINTON WEST', 'GLENCAIRN', 'LAWRENCE WEST', 'YORKDALE', 'WILSON', 'SHEPPARD WEST', 'DOWNSVIEW PARK', 'FINCH WEST', 'YORK UNIVERSITY', 'PIONEER VILLAGE', 'HIGHWAY 407', 'VAUGHAN METROPOLITAN CENTRE'], 'LINE 2': ['KIPLING', 'ISLINGTON', 'ROYAL YORK', 'OLD MILL', 'JANE', 'RUNNYMEDE', 'HIGH PARK', 'KEELE', 'DUNDAS WEST', 'LANSDOWNE', 'DUFFERIN', 'OSSINGTON', 'CHRISTIE', 'BATHURST', 'SPADINA', 'ST GEORGE', 'BAY', 'BLOOR-YONGE', 'SHERBOURNE', 'CASTLE FRANK', 'BROADVIEW', 'CHESTER', 'PAPE', 'DONLANDS', 'GREENWOOD', 'COXWELL', 'WOODBINE', 'MAIN STREET', 'VICTORIA PARK', 'WARDEN', 'KENNEDY'], 'LINE 3': ['KENNEDY', 'LAWRENCE EAST', 'ELLESMERE', 'MIDLAND', 'SCARBORO

Define the function for fuzzy matching. We use `jaro_winkler_similarity`, which gives extra weight to a shared prefix. This helps with entries like UNION TO KING, where both UNION and KING are stations: the entry is matched to the station it starts with.
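
To see the prefix weighting at work, here is a minimal pure-Python sketch of the textbook Jaro-Winkler formula. It is an illustration only: the helper names `jaro` and `jaro_winkler` are made up for this example, and jellyfish's implementation may differ in details, so its exact scores need not match these.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro score boosted by the length of the common prefix (up to 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


raw = "UNION TO KING"
for candidate in ["UNION", "KING"]:
    print(candidate, round(jaro_winkler(candidate, raw), 3))
```

In this sketch UNION shares a four-character prefix with the raw entry and scores above the 0.8 cutoff used in the next cell, while KING shares no prefix and scores well below it.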

In [24]:
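def map_stations(line, station):
    """Return the best-matching station on `line`, or 'NO MATCH'."""
    # drop the word 'STATION' so only the name itself is compared
    station = station.replace("STATION", "").strip()
    best_match = 'NO MATCH'
    # minimum similarity a candidate must exceed to count as a match
    highest_score = 0.8

    # rows tagged with both lines are matched against the Line 1 list only
    if line == "LINE 1 and LINE 2":
        line = "LINE 1"

    for s in stations[line]:
        # never pair a 'WEST' station with a non-'WEST' name or vice versa
        # (e.g. EGLINTON vs EGLINTON WEST)
        if ('WEST' in s) != ('WEST' in station):
            continue

        score = jellyfish.jaro_winkler_similarity(s, station)
        if score > highest_score:
            highest_score = score
            best_match = s

    return best_match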

In [25]:
df['Station Name'] = df.apply(lambda x: map_stations(x['Line Number'], x['Station']), axis=1)

In [26]:
test = df[df['Station Name'] == 'NO MATCH'] 
print(test['Station'].value_counts())
print(test.shape)
test.to_csv('data/delays/subway/cleaned_data/subway_delays_no_match.csv', index=False)
df.to_csv('data/delays/subway/cleaned_data/subway_delays.csv', index=False)

Station
YONGE BD STATION          2870
YONGE SHP STATION         1187
WARDEN STATION              86
BLOOR DANFORTH SUBWAY       55
YONGE SHEP STATION          37
                          ... 
MCBRIEN BUIDING              1
W/O CASTLE FRANK TO GR       1
CHURCH EMERGENCY EXIT        1
YONGE TO COXWELL STATI       1
HOSTLER 2 WILSON YARD        1
Name: count, Length: 186, dtype: int64
(4639, 12)
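
Most of the unmatched rows come from a few recurring names such as YONGE BD STATION and YONGE SHP STATION. One possible way to handle them is a small hand-written alias table that is consulted before fuzzy matching. The sketch below is hypothetical: `ALIASES` and `resolve_alias` are not part of this notebook, and the alias targets are assumptions based on the station lists (for instance, YONGE BD is taken to mean BLOOR-YONGE on Line 2, whereas the Line 1 list calls that station BLOOR, so a real table would probably need to be keyed by line as well).

```python
# Hypothetical alias table; the targets are assumptions, not verified mappings.
ALIASES = {
    "YONGE BD": "BLOOR-YONGE",
    "YONGE SHP": "SHEPPARD-YONGE",
    "YONGE SHEP": "SHEPPARD-YONGE",
}

def resolve_alias(raw_station):
    """Return the canonical name for a known alias, or None if there is none."""
    cleaned = raw_station.replace("STATION", "").strip()
    return ALIASES.get(cleaned)

for name in ["YONGE BD STATION", "YONGE SHP STATION", "MCBRIEN BUIDING"]:
    print(name, "->", resolve_alias(name))
```

Names that return `None` would still go through `map_stations`, and entries that are not stations at all (yards, buildings, emergency exits) would remain 'NO MATCH'.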


## Delay and Min Gap Both Zero

There appear to be a large number of rows where both Min Delay and Min Gap are 0.

In [14]:
test = df[(df['Min Delay'] == 0) & (df['Min Gap'] == 0)]
print(f"There are {test.shape[0]} rows with 0 min delay and 0 min gap")

There are 87591 rows with 0 min delay and 0 min gap
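
A natural follow-up is to express the zero rows as a share of the data and to see which delay codes they belong to. The snippet below is a minimal sketch on made-up toy rows; the frame `toy` is illustrative, and only the column names are taken from the delay data.

```python
import pandas as pd

# Toy rows with the same column names as the delay data (values are made up).
toy = pd.DataFrame({
    "Code":      ["MUNCA", "SUDP", "MUNCA", "MUSAN", "SUDP"],
    "Min Delay": [0, 3, 0, 5, 0],
    "Min Gap":   [0, 7, 0, 10, 4],
})

# Boolean mask: True where both the delay and the gap are zero.
both_zero = (toy["Min Delay"] == 0) & (toy["Min Gap"] == 0)

# The mean of a boolean mask is the fraction of True values.
share = both_zero.mean()
print(f"{both_zero.sum()} of {len(toy)} rows ({share:.0%}) have zero delay and zero gap")

# Which codes account for the zero rows?
print(toy.loc[both_zero, "Code"].value_counts())
```

Applied to the real frame, the same mask would show whether the zero rows are concentrated in a few codes, which would help decide whether to keep or drop them.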
