# New York City MTA Analysis

This project analyzes average ridership between stations in the New York City subway system to identify where congestion is highest. We model the subway network as a graph and use graph properties to summarize congestion at and between stations. The analysis can help riders choose less crowded times to travel, and it can help the city plan new routes in heavily congested areas so that riders reach their destinations on time.

In [1]:
# Import statements
import numpy as np
import pandas as pd
import networkx as nx

# Read in csv files
stations_raw = pd.read_csv('../../data/mta/MTA_Subway_Stations_updated.csv')
rides_raw = pd.read_csv('../../data/mta/MTA_Subway_Origin-Destination_Ridership_Estimate__2024_20241008.csv')

Here are samples of the two datasets used for analysis. The stations dataset contains information about each station, including its name, location, line, and more. The rides dataset contains the estimated average ridership between pairs of stations, broken down by month, day of week, and hour of day. We will clean this data and use it to represent the subway system as a graph so that we can leverage graph properties in our analysis.

In [4]:
stations_raw.head()

Unnamed: 0,GTFS Stop ID,Station ID,Complex ID,Division,Line,Stop Name,Borough,CBD,Daytime Routes,Structure,GTFS Latitude,GTFS Longitude,North Direction Label,South Direction Label,ADA,ADA Northbound,ADA Southbound,ADA Notes,Georeference
0,R01,1,1,BMT,Astoria,Astoria-Ditmars Blvd,Q,False,N W,Elevated,40.775036,-73.912034,Last Stop,Manhattan,0,0,0,,POINT (-73.912034 40.775036)
1,R03,2,2,BMT,Astoria,Astoria Blvd,Q,False,N W,Elevated,40.770258,-73.917843,Astoria,Manhattan,1,1,1,,POINT (-73.917843 40.770258)
2,R04,3,3,BMT,Astoria,30 Av,Q,False,N W,Elevated,40.766779,-73.921479,Astoria,Manhattan,0,0,0,,POINT (-73.921479 40.766779)
3,R05,4,4,BMT,Astoria,Broadway,Q,False,N W,Elevated,40.76182,-73.925508,Astoria,Manhattan,0,0,0,,POINT (-73.925508 40.76182)
4,R06,5,5,BMT,Astoria,36 Av,Q,False,N W,Elevated,40.756804,-73.929575,Astoria,Manhattan,0,0,0,,POINT (-73.929575 40.756804)


In [5]:
rides_raw.head()

Unnamed: 0,Year,Month,Day of Week,Hour of Day,Timestamp,Origin Station Complex ID,Origin Station Complex Name,Origin Latitude,Origin Longitude,Destination Station Complex ID,Destination Station Complex Name,Destination Latitude,Destination Longitude,Estimated Average Ridership,Origin Point,Destination Point
0,2024,1,Monday,1,01/08/2024 01:00:00 AM,26,"DeKalb Av (B,Q,R)",40.690635,-73.981824,355,"Winthrop St (2,5)",40.656652,-73.9502,0.5556,POINT (-73.981824 40.690635),POINT (-73.9502 40.656652)
1,2024,1,Monday,1,01/08/2024 01:00:00 AM,231,"Grand St (B,D)",40.718267,-73.993753,284,Nassau Av (G),40.724635,-73.951277,0.3068,POINT (-73.993753 40.718267),POINT (-73.951277 40.724635)
2,2024,1,Monday,1,01/08/2024 01:00:00 AM,313,"72 St (1,2,3)",40.778453,-73.98197,71,8 Av (N),40.635064,-74.011719,0.3012,POINT (-73.98197 40.778453),POINT (-74.011719 40.635064)
3,2024,1,Monday,1,01/08/2024 01:00:00 AM,320,23 St (1),40.744081,-73.995657,309,103 St (1),40.799446,-73.968379,0.9,POINT (-73.995657 40.744081),POINT (-73.968379 40.799446)
4,2024,1,Monday,1,01/08/2024 01:00:00 AM,399,68 St-Hunter College (6),40.768141,-73.96387,618,"14 St (A,C,E)/8 Av (L)",40.740335,-74.002134,0.294,POINT (-73.96387 40.768141),POINT (-74.002134 40.740335)


## Data Processing

The first step in the analysis is dropping unnecessary features from both datasets. In the stations data we keep each station's borough so we can look at congestion within each of New York's boroughs. In the ridership data we keep the average ridership along with the day of week and hour, so that we can compare congestion at different times during the week.

In [6]:
# Create DataFrame of station names, boroughs, and ids to use as node features
station_boroughs = stations_raw[['Complex ID', 'Borough']].drop_duplicates()
station_names = rides_raw[['Origin Station Complex Name', 'Origin Station Complex ID']].drop_duplicates()
stations = station_boroughs.merge(station_names, left_on='Complex ID', right_on='Origin Station Complex ID')
stations = stations.rename(columns={'Complex ID': 'ID', 'Origin Station Complex Name': 'Name'})
stations = stations[['ID', 'Borough', 'Name']]
stations.head()

Unnamed: 0,ID,Borough,Name
0,1,Q,"Astoria-Ditmars Blvd (N,W)"
1,2,Q,"Astoria Blvd (N,W)"
2,3,Q,"30 Av (N,W)"
3,4,Q,"Broadway (N,W)"
4,5,Q,"36 Av (N,W)"
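
The merge above uses pandas' default inner join, so any station whose complex ID never appears as an origin in the ridership data is dropped without warning. A minimal sketch of how this could be checked, using small made-up frames (`toy_boroughs` and `toy_names` are hypothetical stand-ins for `station_boroughs` and `station_names`, not the real data):

```python
import pandas as pd

# Toy stand-ins for station_boroughs and station_names (values are illustrative only)
toy_boroughs = pd.DataFrame({'Complex ID': [1, 2, 3], 'Borough': ['Q', 'Q', 'M']})
toy_names = pd.DataFrame({
    'Origin Station Complex ID': [1, 2, 4],
    'Origin Station Complex Name': ['A St', 'B St', 'D St'],
})

# An outer join with indicator=True labels each row by the side it came from
check = toy_boroughs.merge(
    toy_names,
    left_on='Complex ID',
    right_on='Origin Station Complex ID',
    how='outer',
    indicator=True,
)

# Rows that the default inner join would silently drop
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Complex ID', 'Origin Station Complex ID', '_merge']])

# A complex ID listed under more than one borough would have one borough
# overwrite the other when the nodes are added to the graph
dupes = int(toy_boroughs['Complex ID'].duplicated().sum())
print('duplicate complex IDs:', dupes)
```

Running the same check on the real frames would show whether any stations are lost before the graph is built.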


In [7]:
# Create DataFrame of rides between stations
rides = rides_raw[['Day of Week', 'Hour of Day', 'Origin Station Complex ID', 'Destination Station Complex ID', 'Estimated Average Ridership']]
rides.head()

Unnamed: 0,Day of Week,Hour of Day,Origin Station Complex ID,Destination Station Complex ID,Estimated Average Ridership
0,Monday,1,26,355,0.5556
1,Monday,1,231,284,0.3068
2,Monday,1,313,71,0.3012
3,Monday,1,320,309,0.9
4,Monday,1,399,618,0.294


We create our graph representation using the NetworkX package. We use a MultiDiGraph because every ride is directed: it starts at one station and ends at another. We also allow multiple edges between the same pair of stations, where each edge represents the ridership for a particular day and hour. This makes it easier to analyze congestion during the week, because we can filter edges by the time period we are interested in.

In [8]:
# Create Graph
G = nx.MultiDiGraph()

# Create graph nodes with features for name and borough
# (named station_ids to avoid shadowing the built-in id function)
station_ids = stations['ID']
name = stations['Name']
borough = stations['Borough']

# Add nodes
for i, n, b in zip(station_ids, name, borough):
    G.add_node(i, name=n, borough=b)

# Create graph edges with features day, hour, origin, destination, and ridership
day = rides['Day of Week']
hour = rides['Hour of Day']
origin = rides['Origin Station Complex ID']
destination = rides['Destination Station Complex ID']
ridership = rides['Estimated Average Ridership']

# Add edges
for d, h, orig, dest, r in zip(day, hour, origin, destination, ridership):
    G.add_edge(orig, dest, day=d, hour=h, ridership=r)
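
To illustrate why a MultiDiGraph suits this data, here is a small sketch on a toy graph. The station IDs, names, and ridership values are invented for the example; only the attribute names (`name`, `borough`, `day`, `hour`, `ridership`) match the graph built above.

```python
import networkx as nx

# Toy graph: two hypothetical stations with parallel edges for different hours
toy = nx.MultiDiGraph()
toy.add_node(1, name='A St', borough='M')
toy.add_node(2, name='B St', borough='Bk')
toy.add_edge(1, 2, day='Monday', hour=8, ridership=10.0)
toy.add_edge(1, 2, day='Monday', hour=9, ridership=4.5)
toy.add_edge(2, 1, day='Monday', hour=8, ridership=3.0)

# Parallel edges are stored separately rather than overwriting each other
n_parallel = toy.number_of_edges(1, 2)
print(n_parallel)  # 2

# Filter the outgoing edges of station 1 to a single hour and sum their ridership
total_8am = sum(
    info['ridership']
    for _, _, info in toy.out_edges(1, data=True)
    if info['hour'] == 8
)
print(total_8am)  # 10.0
```

With a plain DiGraph, the second `add_edge(1, 2, ...)` call would update the existing edge's attributes instead of adding a new edge, so the per-hour records would be lost.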

## Data Analysis

Here we answer a series of questions about the subway system to gain more insight into congestion in New York. The questions cover different aspects of the data, from where stations are located to when riders travel, so that together the answers describe the network as a whole.
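
Most of the station-level questions below follow one pattern: sum ridership over a station's outgoing or incoming edges, optionally filtered by day or hour, then rank the stations. As a rough sketch of that shared pattern, the hypothetical helper `top_stations_by` (not part of the notebook) parameterizes the direction and the filters. It is shown on an invented toy graph.

```python
import networkx as nx

def top_stations_by(G, direction='out', days=None, hours=None, k=5):
    """Return the names of the k stations with the highest summed ridership.

    direction: 'out' sums edges leaving a station, 'in' sums edges arriving.
    days / hours: optional collections used to filter edges; None keeps all.
    """
    edges_of = G.out_edges if direction == 'out' else G.in_edges
    totals = {}
    for station in G.nodes:
        total = 0.0
        for _, _, info in edges_of(station, data=True):
            if days is not None and info['day'] not in days:
                continue
            if hours is not None and info['hour'] not in hours:
                continue
            total += info['ridership']
        totals[station] = total
    ranked = sorted(totals, key=totals.get, reverse=True)[:k]
    return [G.nodes[s]['name'] for s in ranked]

# Toy graph with made-up stations and ridership values
toy = nx.MultiDiGraph()
for node_id, node_name in [(1, 'A St'), (2, 'B St'), (3, 'C St')]:
    toy.add_node(node_id, name=node_name, borough='M')
toy.add_edge(1, 2, day='Monday', hour=8, ridership=10.0)
toy.add_edge(2, 3, day='Monday', hour=8, ridership=7.0)
toy.add_edge(3, 1, day='Saturday', hour=2, ridership=20.0)

monday_origins = top_stations_by(toy, 'out', days={'Monday'}, k=2)
overnight_dests = top_stations_by(toy, 'in', hours=range(1, 6), k=1)
print(monday_origins)   # ['A St', 'B St']
print(overnight_dests)  # ['A St']
```

The cells below spell each query out separately, which keeps every answer self-contained at the cost of some repetition.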

1. What are the top 5 origin subway stations by ridership in each borough?

In [9]:

def q1a(G):
    # Get all borough names
    boroughs = sorted(list(set([n[1]['borough'] for n in G.nodes(data=True)])))

    top_stations = {}
    for b in boroughs:
        # Get all stations in borough
        in_borough = list(set([n[0] for n in G.nodes(data=True) if n[1]['borough'] == b]))

        riders = {}
        for station in in_borough:
            # For each station add up the ridership of all outgoing edges
            cur_station = 0
            for _, _, info in G.out_edges(station, data=True):
                cur_station += info['ridership']            
            riders[station] = cur_station

        # Sort stations by ridership and get the top 5
        top5 = sorted(riders, key=riders.get, reverse=True)[:5]
        top5_names = [G.nodes[i]['name'] for i in top5]
        top_stations[b] = top5_names

    codes = {'Bx': 'Bronx',
            'Bk': 'Brooklyn',
            'M': 'Manhattan',
            'Q': 'Queens',
            'SI': 'Staten Island'}

    # Print out the top stations by borough
    for b, stations in top_stations.items():
        print(f'Top Stations in {codes[b]}:')
        [print(f'{i+1}. {s}') for i, s in enumerate(stations)]
        print()

q1a(G)

Top Stations in Brooklyn:
1. Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5)
2. Bedford Av (L)
3. Jay St-MetroTech (A,C,F,R)
4. Court St (R)/Borough Hall (2,3,4,5)
5. Crown Hts-Utica Av (3,4)

Top Stations in Bronx:
1. 161 St-Yankee Stadium (B,D,4)
2. 3 Av-149 St (2,5)
3. Parkchester (6)
4. Fordham Rd (4)
5. Hunts Point Av (6)

Top Stations in Manhattan:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
5. 34 St-Penn Station (A,C,E)

Top Stations in Queens:
1. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
2. Flushing-Main St (7)
3. 103 St-Corona Plaza (7)
4. Sutphin Blvd-Archer Av-JFK Airport (E,J,Z)
5. Junction Blvd (7)



2. What are the top 5 origin subway stations by ridership on Monday, Tuesday, and Wednesday combined?

In [10]:
def q1b(G):
    # All valid days of week
    days = set(['Monday', 'Tuesday', 'Wednesday'])

    riders = {}
    for station in G.nodes:
        # For each station, add outgoing ridership if on correct day
        cur_station = 0
        for _, _, info in G.out_edges(station, data=True):
            if info['day'] in days:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations on Monday, Tuesday, Wednesday:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q1b(G)

Top Stations on Monday, Tuesday, Wednesday:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
5. Fulton St (A,C,J,Z,2,3,4,5)


3. What are the top 5 origin subway stations by ridership on Saturday and Sunday combined?

In [11]:
def q1c(G):
    # All valid days of week
    days = set(['Saturday', 'Sunday'])

    riders = {}
    for station in G.nodes:
        # For each station, add outgoing ridership if on correct day
        cur_station = 0
        for _, _, info in G.out_edges(station, data=True):
            if info['day'] in days:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations on Saturday, Sunday:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q1c(G)

Top Stations on Saturday, Sunday:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
3. Grand Central-42 St (S,4,5,6,7)
4. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
5. 34 St-Penn Station (A,C,E)


4. What are the top 5 origin subway stations by ridership between 1am and 5am across all days and boroughs?

In [12]:
def q1d(G):
    # Hours from 1am to 5am
    hours = range(1, 6)

    riders = {}
    for station in G.nodes:
        # For each station, add outgoing ridership if at correct time
        cur_station = 0
        for _, _, info in G.out_edges(station, data=True):
            if info['hour'] in hours:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations from 1am to 5am:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q1d(G)

Top Stations from 1am to 5am:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
3. Flushing-Main St (7)
4. 103 St-Corona Plaza (7)
5. Jamaica Center-Parsons/Archer (E,J,Z)


5. What are the top 5 origin subway stations by ridership between 6am and 9am across all days and boroughs?

In [13]:
def q1e(G):
    # Hours from 6am to 9am
    hours = range(6, 10)

    riders = {}
    for station in G.nodes:
        # For each station, add outgoing ridership if at correct time
        cur_station = 0
        for _, _, info in G.out_edges(station, data=True):
            if info['hour'] in hours:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations from 6am to 9am:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q1e(G)

Top Stations from 6am to 9am:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
4. 34 St-Penn Station (1,2,3)
5. Flushing-Main St (7)


6. What are the top 5 destination subway stations by ridership in each borough?

In [14]:
def q2a(G):
    # Get all borough names
    boroughs = sorted(list(set([n[1]['borough'] for n in G.nodes(data=True)])))

    top_stations = {}
    for b in boroughs:
        # Get all stations in borough
        in_borough = list(set([n[0] for n in G.nodes(data=True) if n[1]['borough'] == b]))

        riders = {}
        for station in in_borough:
            # For each station add up the ridership of all incoming edges
            cur_station = 0
            for _, _, info in G.in_edges(station, data=True):
                cur_station += info['ridership']            
            riders[station] = cur_station

        # Sort stations by ridership and get the top 5
        top5 = sorted(riders, key=riders.get, reverse=True)[:5]
        top5_names = [G.nodes[i]['name'] for i in top5]
        top_stations[b] = top5_names

    codes = {'Bx': 'Bronx',
            'Bk': 'Brooklyn',
            'M': 'Manhattan',
            'Q': 'Queens',
            'SI': 'Staten Island'}

    # Print out the top stations by borough
    for b, stations in top_stations.items():
        print(f'Top Stations in {codes[b]}:')
        [print(f'{i+1}. {s}') for i, s in enumerate(stations)]
        print()

q2a(G)

Top Stations in Brooklyn:
1. Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5)
2. Bedford Av (L)
3. Jay St-MetroTech (A,C,F,R)
4. Court St (R)/Borough Hall (2,3,4,5)
5. Crown Hts-Utica Av (3,4)

Top Stations in Bronx:
1. 161 St-Yankee Stadium (B,D,4)
2. 3 Av-149 St (2,5)
3. Parkchester (6)
4. 149 St-Grand Concourse (2,4,5)
5. Fordham Rd (4)

Top Stations in Manhattan:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
5. Fulton St (A,C,J,Z,2,3,4,5)

Top Stations in Queens:
1. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
2. Flushing-Main St (7)
3. Court Sq (E,G,M,7)
4. 103 St-Corona Plaza (7)
5. Junction Blvd (7)



7. What are the top 5 destination subway stations by ridership on Thursday and Friday combined?

In [15]:
def q2b(G):
    # All valid days of week
    days = set(['Thursday', 'Friday'])

    riders = {}
    for station in G.nodes:
        # For each station, add incoming ridership if on correct day
        cur_station = 0
        for _, _, info in G.in_edges(station, data=True):
            if info['day'] in days:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations on Thursday, Friday:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q2b(G)

Top Stations on Thursday, Friday:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
5. Fulton St (A,C,J,Z,2,3,4,5)


8. What are the top 5 destination subway stations by ridership on Saturday?

In [16]:
def q2c(G):
    # All valid days of week
    days = set(['Saturday'])

    riders = {}
    for station in G.nodes:
        # For each station, add incoming ridership if on correct day
        cur_station = 0
        for _, _, info in G.in_edges(station, data=True):
            if info['day'] in days:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations on Saturday:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q2c(G)

Top Stations on Saturday:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
3. 14 St-Union Sq (L,N,Q,R,W,4,5,6)
4. Grand Central-42 St (S,4,5,6,7)
5. 34 St-Penn Station (A,C,E)


9. What are the top 5 destination subway stations by ridership between 12am and 5am across all days and boroughs?

In [17]:
def q2d(G):
    # Hours from 12am to 5am
    hours = range(0, 6)

    riders = {}
    for station in G.nodes:
        # For each station, add incoming ridership if at correct time
        cur_station = 0
        for _, _, info in G.in_edges(station, data=True):
            if info['hour'] in hours:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations from 12am to 5am:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q2d(G)

Top Stations from 12am to 5am:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
5. Fulton St (A,C,J,Z,2,3,4,5)


10. What are the top 5 destination subway stations by ridership between 6pm and 9pm across all days and boroughs?

In [18]:
def q2e(G):
    # Hours from 6pm-9pm
    hours = range(18, 22)

    riders = {}
    for station in G.nodes:
        # For each station, add incoming ridership if at correct time
        cur_station = 0
        for _, _, info in G.in_edges(station, data=True):
            if info['hour'] in hours:
                cur_station += info['ridership']
        riders[station] = cur_station

    # Sort stations by ridership and get the top 5
    top5 = sorted(riders, key=riders.get, reverse=True)[:5]
    top5_names = [G.nodes[i]['name'] for i in top5]

    # Print the top stations
    print(f'Top Stations from 6pm to 9pm:')
    [print(f'{i+1}. {s}') for i, s in enumerate(top5_names)]

q2e(G)

Top Stations from 6pm to 9pm:
1. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
2. Grand Central-42 St (S,4,5,6,7)
3. 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
4. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
5. 14 St-Union Sq (L,N,Q,R,W,4,5,6)


11. What are the top 10 most congested source-destination station pairs on Mondays between 1pm and 2pm?

In [19]:
def q3a(G):
    # Valid day of week and hours
    days = set(['Monday'])
    hours = range(13, 15)

    riders = {}
    for start, end, info in G.edges(data=True):
        # Add ridership info to dictionary if edge matches the valid features
        if info['day'] in days and info['hour'] in hours:
            riders[(start, end)] = riders.get((start, end), 0) + info['ridership']

    # Sort edges by ridership and take top 10
    top10 = sorted(riders, key=riders.get, reverse=True)[:10]
    top10_names = [(riders[(j, k)], G.nodes[j]['name'], G.nodes[k]['name']) for j, k in top10]

    # Print the top edges
    print(f'Top Station Pairs from 1pm-2pm on Monday:\n')
    [print(f'{i+1}. {start} --> {end}\n {int(ride)} average riders') for i, (ride, start, end) in enumerate(top10_names)]

q3a(G)

Top Station Pairs from 1pm-2pm on Monday:

1. Grand Central-42 St (S,4,5,6,7) --> Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E)
 282 average riders
2. Flushing-Main St (7) --> 103 St-Corona Plaza (7)
 273 average riders
3. Fulton St (A,C,J,Z,2,3,4,5) --> Grand Central-42 St (S,4,5,6,7)
 272 average riders
4. Flushing-Main St (7) --> Junction Blvd (7)
 271 average riders
5. Grand Central-42 St (S,4,5,6,7) --> 14 St-Union Sq (L,N,Q,R,W,4,5,6)
 270 average riders
6. Flushing-Main St (7) --> 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
 259 average riders
7. 14 St-Union Sq (L,N,Q,R,W,4,5,6) --> Grand Central-42 St (S,4,5,6,7)
 255 average riders
8. Junction Blvd (7) --> Flushing-Main St (7)
 239 average riders
9. Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E) --> Grand Central-42 St (S,4,5,6,7)
 232 average riders
10. Grand Central-42 St (S,4,5,6,7) --> Fulton St (A,C,J,Z,2,3,4,5)
 231 average riders
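
The pair totals can also be computed straight from a DataFrame, which gives a quick cross-check on the graph traversal. The sketch below assumes a frame with the same column names as `rides`; the rows in `toy_rides` are invented for illustration.

```python
import pandas as pd

# Made-up rows using the same column names as the rides DataFrame
toy_rides = pd.DataFrame({
    'Day of Week': ['Monday', 'Monday', 'Monday', 'Tuesday'],
    'Hour of Day': [13, 14, 13, 13],
    'Origin Station Complex ID': [1, 1, 2, 1],
    'Destination Station Complex ID': [2, 2, 1, 2],
    'Estimated Average Ridership': [5.0, 3.0, 4.0, 100.0],
})

# Keep Monday rows for the 1pm and 2pm hours, matching range(13, 15)
mask = (toy_rides['Day of Week'] == 'Monday') & toy_rides['Hour of Day'].isin([13, 14])

# Sum ridership per (origin, destination) pair and keep the 10 largest totals
pair_totals = (
    toy_rides[mask]
    .groupby(['Origin Station Complex ID', 'Destination Station Complex ID'])
    ['Estimated Average Ridership']
    .sum()
    .nlargest(10)
)
print(pair_totals)
```

If the two approaches disagree on the real data, the difference would point to rows lost while building the graph, for example edges whose endpoints were dropped during the station merge.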


12. What are the top 10 most congested source-destination station pairs within Queens on Fridays between 6pm and 9pm?

In [20]:
def q3b(G):
    # Valid day of week, hours, and borough
    days = set(['Friday'])
    hours = range(18, 22)
    borough = 'Q'

    riders = {}
    for start, end, info in G.edges(data=True):
        # Add ridership info to dictionary if edge matches the valid features
        if info['day'] in days and info['hour'] in hours and G.nodes[start]['borough'] == borough and G.nodes[end]['borough'] == borough:
            riders[(start, end)] = riders.get((start, end), 0) + info['ridership']

    # Sort edges by ridership and take top 10
    top10 = sorted(riders, key=riders.get, reverse=True)[:10]
    top10_names = [(riders[(j, k)], G.nodes[j]['name'], G.nodes[k]['name']) for j, k in top10]

    # Print the top edges
    print(f'Top Station Pairs from 6pm to 9pm on Friday in Queens:\n')
    [print(f'{i+1}. {start} --> {end}\n {int(ride)} average riders') for i, (ride, start, end) in enumerate(top10_names)]

q3b(G)

Top Station Pairs from 6pm to 9pm on Friday in Queens:

1. Flushing-Main St (7) --> 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R)
 645 average riders
2. Flushing-Main St (7) --> 103 St-Corona Plaza (7)
 644 average riders
3. Flushing-Main St (7) --> Junction Blvd (7)
 593 average riders
4. Junction Blvd (7) --> Flushing-Main St (7)
 433 average riders
5. Flushing-Main St (7) --> 90 St-Elmhurst Av (7)
 385 average riders
6. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R) --> Flushing-Main St (7)
 345 average riders
7. 103 St-Corona Plaza (7) --> Flushing-Main St (7)
 278 average riders
8. Flushing-Main St (7) --> 111 St (7)
 240 average riders
9. 82 St-Jackson Hts (7) --> Flushing-Main St (7)
 235 average riders
10. 74-Broadway (7)/Jackson Hts-Roosevelt Av (E,F,M,R) --> Jamaica-179 St (F)
 222 average riders


13. What are the top 10 most congested source-destination station pairs within Brooklyn between 1am and 5am?

In [24]:
def q3c(G):
    # Valid hours and borough
    hours = range(1, 6)
    borough = 'Bk'

    riders = {}
    for start, end, info in G.edges(data=True):
        # Add ridership info to dictionary if edge matches the valid features
        if info['hour'] in hours and G.nodes[start]['borough'] == borough and G.nodes[end]['borough'] == borough:
            riders[(start, end)] = riders.get((start, end), 0) + info['ridership']

    # Sort edges by ridership and take top 10
    top10 = sorted(riders, key=riders.get, reverse=True)[:10]
    top10_names = [(riders[(j, k)], G.nodes[j]['name'], G.nodes[k]['name']) for j, k in top10]

    # Print the top edges
    print(f'Top Station Pairs from 1am to 5am in Brooklyn:\n')
    [print(f'{i+1}. {start} --> {end}\n {int(ride)} average riders') for i, (ride, start, end) in enumerate(top10_names)]

q3c(G)

Top Station Pairs from 1am to 5am in Brooklyn:

1. Crown Hts-Utica Av (3,4) --> Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5)
 136 average riders
2. Flatbush Av-Brooklyn College (2,5) --> Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5)
 107 average riders
3. Bedford Av (L) --> Myrtle-Wyckoff Avs (L,M)
 104 average riders
4. Crown Hts-Utica Av (3,4) --> Court St (R)/Borough Hall (2,3,4,5)
 104 average riders
5. Myrtle-Wyckoff Avs (L,M) --> Bedford Av (L)
 93 average riders
6. Crown Hts-Utica Av (3,4) --> Nevins St (2,3,4,5)
 92 average riders
7. Euclid Av (A,C) --> Jay St-MetroTech (A,C,F,R)
 92 average riders
8. Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5) --> 36 St (D,N,R)
 86 average riders
9. Bedford Av (L) --> DeKalb Av (L)
 86 average riders
10. Lorimer St (L)/Metropolitan Av (G) --> Myrtle-Wyckoff Avs (L,M)
 77 average riders


14. What are the top 10 most congested source-destination station pairs where the source is in Brooklyn and the destination is in Manhattan, Monday through Thursday between 6am and 7am?

In [25]:
def q3d(G):
    # Valid day of week, hours, and borough
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
    hours = range(6, 8)
    source = 'Bk'
    dest = 'M'

    riders = {}
    for start, end, info in G.edges(data=True):
        # Add ridership info to dictionary if edge matches the valid features
        if info['hour'] in hours and info['day'] in days and G.nodes[start]['borough'] == source and G.nodes[end]['borough'] == dest:
            riders[(start, end)] = riders.get((start, end), 0) + info['ridership']

    # Sort edges by ridership and take top 10
    top10 = sorted(riders, key=riders.get, reverse=True)[:10]
    top10_names = [(riders[(j, k)], G.nodes[j]['name'], G.nodes[k]['name']) for j, k in top10]

    # Print the top edges
    print(f'Top Station Pairs from 6am to 7am, Monday to Thursday, from Brooklyn to Manhattan:\n')
    [print(f'{i+1}. {start} --> {end}\n {int(ride)} average riders') for i, (ride, start, end) in enumerate(top10_names)]

q3d(G)

Top Station Pairs from 6am to 7am, Monday to Thursday, from Brooklyn to Manhattan:

1. Atlantic Av-Barclays Ctr (B,D,N,Q,R,2,3,4,5) --> Bowling Green (4,5)
 1029 average riders
2. Crown Hts-Utica Av (3,4) --> Grand Central-42 St (S,4,5,6,7)
 692 average riders
3. Kings Hwy (B,Q) --> 34 St-Herald Sq (B,D,F,M,N,Q,R,W)
 655 average riders
4. Court St (R)/Borough Hall (2,3,4,5) --> Grand Central-42 St (S,4,5,6,7)
 653 average riders
5. Flatbush Av-Brooklyn College (2,5) --> Grand Central-42 St (S,4,5,6,7)
 589 average riders
6. Crown Hts-Utica Av (3,4) --> Fulton St (A,C,J,Z,2,3,4,5)
 580 average riders
7. Flatbush Av-Brooklyn College (2,5) --> Fulton St (A,C,J,Z,2,3,4,5)
 557 average riders
8. Bedford Av (L) --> Grand Central-42 St (S,4,5,6,7)
 555 average riders
9. Kings Hwy (B,Q) --> 47-50 Sts-Rockefeller Ctr (B,D,F,M)
 503 average riders
10. Sheepshead Bay (B,Q) --> 47-50 Sts-Rockefeller Ctr (B,D,F,M)
 478 average riders


15. What are the top 10 most congested source-destination station pairs where the source is in the Bronx and the destination is in Manhattan, Monday through Thursday between 6am and 7am?

In [26]:
def q3e(G):
    # Valid day of week, hours, and borough
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
    hours = range(6, 8)
    source = 'Bx'
    dest = 'M'

    riders = {}
    for start, end, info in G.edges(data=True):
        # Add ridership info to dictionary if edge matches the valid features
        if info['hour'] in hours and info['day'] in days and G.nodes[start]['borough'] == source and G.nodes[end]['borough'] == dest:
            riders[(start, end)] = riders.get((start, end), 0) + info['ridership']

    # Sort edges by ridership and take top 10
    top10 = sorted(riders, key=riders.get, reverse=True)[:10]
    top10_names = [(riders[(j, k)], G.nodes[j]['name'], G.nodes[k]['name']) for j, k in top10]

    # Print the top edges
    print(f'Top Station Pairs from 6am to 7am, Monday to Thursday, from Bronx to Manhattan:\n')
    [print(f'{i+1}. {start} --> {end}\n {int(ride)} average riders') for i, (ride, start, end) in enumerate(top10_names)]

q3e(G)

Top Station Pairs from 6am to 7am, Monday to Thursday, from Bronx to Manhattan:

1. Parkchester (6) --> Grand Central-42 St (S,4,5,6,7)
 562 average riders
2. Parkchester (6) --> 14 St-Union Sq (L,N,Q,R,W,4,5,6)
 402 average riders
3. Parkchester (6) --> 125 St (4,5,6)
 395 average riders
4. Parkchester (6) --> 68 St-Hunter College (6)
 383 average riders
5. Parkchester (6) --> 86 St (4,5,6)
 372 average riders
6. Parkchester (6) --> Lexington Av-53 St (E,M)/51 St (6)
 357 average riders
7. Parkchester (6) --> Fulton St (A,C,J,Z,2,3,4,5)
 346 average riders
8. Parkchester (6) --> Brooklyn Bridge-City Hall (4,5,6)/Chambers St (J,Z)
 320 average riders
9. 161 St-Yankee Stadium (B,D,4) --> 59 St-Columbus Circle (A,B,C,D,1)
 306 average riders
10. Woodlawn (4) --> 86 St (4,5,6)
 302 average riders
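
The pair-level queries above differ only in their day, hour, and borough filters. One possible way to express them with a single function is sketched below. `top_pairs` is a hypothetical helper, not part of the notebook, and the toy graph uses invented stations and values.

```python
import networkx as nx

def top_pairs(G, days=None, hours=None, source=None, dest=None, k=10):
    """Return (total_ridership, origin_name, destination_name) for the k busiest pairs.

    Each filter is optional; None means no restriction on that attribute.
    """
    totals = {}
    for start, end, info in G.edges(data=True):
        if days is not None and info['day'] not in days:
            continue
        if hours is not None and info['hour'] not in hours:
            continue
        if source is not None and G.nodes[start]['borough'] != source:
            continue
        if dest is not None and G.nodes[end]['borough'] != dest:
            continue
        # Parallel edges between the same pair are summed together
        totals[(start, end)] = totals.get((start, end), 0) + info['ridership']
    ranked = sorted(totals, key=totals.get, reverse=True)[:k]
    return [(totals[p], G.nodes[p[0]]['name'], G.nodes[p[1]]['name']) for p in ranked]

# Toy graph with made-up stations in three boroughs
toy = nx.MultiDiGraph()
toy.add_node(1, name='A St', borough='Bk')
toy.add_node(2, name='B St', borough='M')
toy.add_node(3, name='C St', borough='Bx')
toy.add_edge(1, 2, day='Monday', hour=6, ridership=5.0)
toy.add_edge(1, 2, day='Tuesday', hour=7, ridership=6.0)
toy.add_edge(3, 2, day='Monday', hour=6, ridership=4.0)
toy.add_edge(1, 2, day='Saturday', hour=6, ridership=50.0)

weekdays = {'Monday', 'Tuesday', 'Wednesday', 'Thursday'}
bk_to_m = top_pairs(toy, days=weekdays, hours=range(6, 8), source='Bk', dest='M')
bx_to_m = top_pairs(toy, days=weekdays, hours=range(6, 8), source='Bx', dest='M')
print(bk_to_m)  # [(11.0, 'A St', 'B St')]
print(bx_to_m)  # [(4.0, 'C St', 'B St')]
```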
