# Centrality Measures - New York Subway System

Data 620 Week 4 

*February 27, 2022*

*By Alexander Ng and Philip Tanofsky*



## 1. Introduction

The New York City Subway is the most extensive rapid transit system in the world, with 472 stations. It delivered 1.72 billion rides in 2017 and is the busiest system in the Western world ([List of metro systems](https://en.wikipedia.org/wiki/List_of_metro_systems)). The system runs through all five boroughs (Brooklyn, Bronx, Manhattan, Queens and Staten Island) and operates 24/7 except during emergencies ([New York City Subway](https://en.wikipedia.org/wiki/New_York_City_Subway)).


<img src="https://upload.wikimedia.org/wikipedia/commons/2/23/Official_New_York_City_Subway_Map_2013_vc.jpg" />

In this study, we will examine two questions related to subway network centrality.

In Part 4, we consider how network centrality relates to station-level ridership declines between 2018, before the Covid-19 pandemic, and 2020, when Covid-19 reduced ridership to historic lows. Using the borough of each station as a categorical attribute, do we see variations in ridership declines that could be explained by remoteness? For example, were declines smaller in the outer boroughs because outer Queens or Brooklyn has fewer transportation options than Manhattan? Alexander Ng is the primary author of this section.

In Part 5, we explore how the subway network relates to the city's wealth distribution. For example, how do subway network centrality measures compare to the median household income near each station? By bucketing median household income into local quartiles, we construct low, medium, high and very high income categories. Philip Tanofsky is the primary author of this section.


In this proposal, we outline the data sources and the methods of analysis to be implemented in the following week.

## 2. Data Sources

### Subway Graph

The latest available dataset representing the NYC subway network is provided by the Metropolitan Transportation Authority (MTA) on its developer website (http://web.mta.info/developers/developer-data-terms.html#data) in General Transit Feed Specification (GTFS) format, also known as GTFS static or static transit. GTFS datasets are publicly available online. The GTFS dataset consists of multiple plain text files. We will infer the desired network graph by joining these files to obtain the nodes and the edges.




### Census Data

Median household income is available from [censusreporter.org](https://www.censusreporter.org), which draws on the US Census Bureau's American Community Survey (ACS) for 2015-2019, with median income inflation-adjusted to 2019 dollars. We obtain the median household income (field B19013) at the census tract level. For each subway station, we assign the median household income of the census tract containing the station, based on the geocoded location provided by the GTFS dataset.

A single datafile in `geojson` format for the entire NYC metropolitan area encompasses all subway stations.  The map partitions the city at census tract level and provides the household median income.
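
The station-to-tract assignment described above can be sketched as a point-in-polygon lookup. This is a minimal illustration under stated assumptions, not the final implementation: the two square "tracts", their `geoid` values and the `median_income` property name are hypothetical stand-ins for the real `geojson` features, and only the outer ring of simple `Polygon` geometries is handled (holes and `MultiPolygon` tracts are ignored).

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of [lon, lat] pairs?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this polygon edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_tract_income(lon, lat, features):
    """Return the income of the first tract containing the point, else None."""
    for feature in features:
        outer_ring = feature["geometry"]["coordinates"][0]
        if point_in_ring(lon, lat, outer_ring):
            return feature["properties"]["median_income"]
    return None


# Hypothetical tracts: two unit squares side by side (GeoJSON order is [lon, lat]).
tracts = [
    {"properties": {"geoid": "tract-A", "median_income": 50000},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
    {"properties": {"geoid": "tract-B", "median_income": 90000},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
]

income_a = find_tract_income(0.5, 0.5, tracts)
income_b = find_tract_income(1.5, 0.5, tracts)
income_none = find_tract_income(5.0, 5.0, tracts)
```

A station that falls outside every tract returns `None`, which would flag it for manual review.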


### Ridership Data

The latest subway ridership data from the MTA is available at [Subway and bus ridership for 2020](https://new.mta.info/agency/new-york-city-transit/subway-bus-ridership-2020).
The data file used is an Excel workbook that lists ridership by aggregated station (https://new.mta.info/sites/default/files/inline-files/2020%20Subway%20Tables_ul.xlsx).   As the MTA points out, many stations are linked by transfer tunnels.   For example, the 14th Street A, C, E station is combined with the 8th Avenue L station, since a rider can walk between the two after passing through the turnstile at either one.
These subtleties also affect our definition of the nodes.

## 3.  Methodology

We describe the construction of the subway network in this section as there are some complexities arising from the GTFS format.

*  Stations are defined as `parent` and `child`.   For example, Times Square - 42 Street station is listed as `127`, `127N` and `127S`.
The identifier suffixes `N` and `S` indicate the direction of service at the station (uptown or downtown).   We are only interested in the parent station `127` in this case.   However, the routes (and hence the edges of the graph) are provided in terms of the child nodes `127N` or `127S`.

*  Edges are not defined between Stations directly but through a hierarchy of ancillary objects:
    +  `Routes` defines the train line.  For example, the `1` line is the Broadway - 7th Avenue Local with `route_id` 1.
    +  `Service` defines a category of scheduled stops along the route associated with a direction.  For example, `service_id` AFA21GEN-1037-Sunday-00.
    +  `Trip` defines an instance of `Service` with a stated start time, say 6:00am.  For example, the `trip_id` is AFA21GEN-1037-Sunday-00_000600_1..S03R.
    +  `StopTimes` define the specific child station node to be reached at a target arrival time by a `Trip`.  For example, AFA21GEN-1037-Sunday-00_000600_1..S03R begins at `101S` then goes to `103S` and `104S`.  These station IDs correspond to Van Cortlandt Park - 242 Street, 238 Street and 231 Street respectively.

We think of an edge as defined by the pair of consecutive stations along a trip.
Thus, we see that edges are defined multiple times in the GTFS specification.   We define the set of edges to be the union of all edges for all Trips and treat multiple definitions of each edge as redundant.
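
The edge construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the inline `STOPS_TXT` and `STOP_TIMES_TXT` strings are tiny hand-made samples that mimic the GTFS `stops.txt` and `stop_times.txt` files (with only the columns needed here), and the trip identifiers `TRIP_SOUTH` and `TRIP_NORTH` are hypothetical. Edges are treated as undirected, which is our simplifying assumption.

```python
import csv
import io
from collections import defaultdict

# Hand-made samples in the shape of GTFS stops.txt and stop_times.txt.
STOPS_TXT = """stop_id,stop_name,parent_station
101,Van Cortlandt Park - 242 Street,
101N,Van Cortlandt Park - 242 Street,101
101S,Van Cortlandt Park - 242 Street,101
103,238 Street,
103N,238 Street,103
103S,238 Street,103
104,231 Street,
104N,231 Street,104
104S,231 Street,104
"""

STOP_TIMES_TXT = """trip_id,stop_id,stop_sequence
TRIP_SOUTH,101S,1
TRIP_SOUTH,103S,2
TRIP_SOUTH,104S,3
TRIP_NORTH,104N,1
TRIP_NORTH,103N,2
TRIP_NORTH,101N,3
"""


def load_parent_map(stops_text):
    """Map every stop_id to its parent station (a parent maps to itself)."""
    parent = {}
    for row in csv.DictReader(io.StringIO(stops_text)):
        parent[row["stop_id"]] = row["parent_station"] or row["stop_id"]
    return parent


def build_edges(stop_times_text, parent):
    """Union of consecutive parent-station pairs over all trips."""
    trips = defaultdict(list)
    for row in csv.DictReader(io.StringIO(stop_times_text)):
        trips[row["trip_id"]].append((int(row["stop_sequence"]), row["stop_id"]))
    edges = set()
    for stops in trips.values():
        ordered = [parent[stop_id] for _, stop_id in sorted(stops)]
        for a, b in zip(ordered, ordered[1:]):
            if a != b:
                # frozenset makes the edge undirected and deduplicates repeats
                edges.add(frozenset((a, b)))
    return edges


parent_map = load_parent_map(STOPS_TXT)
edges = build_edges(STOP_TIMES_TXT, parent_map)
```

Both sample trips traverse the same three stations in opposite directions, so the set holds only two edges: `101`-`103` and `103`-`104`.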

*  Transfers.   When two stations are linked by a transfer tunnel, such as stations `L01` (8th Avenue L Train station) and `A31` (A, C, E 14 Street Station), we define them to be equivalent for the Covid ridership study because ridership is combined for stations which allow transfers.   However, stations linked by transfer remain distinct for the median household income study because they may belong to different census tracts and have different median household incomes.   To study Covid-impacted ridership, we have a total of 424 stations (after combining transfers).   To study median household income, we have 472 stations according to the MTA ridership webpage.
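
One way to treat transfer-linked stations as a single node is to contract them with a small union-find structure. The sketch below assumes the transfer pairs are already available as a list; the neighbor IDs `L02` and `A30` are placeholders chosen for illustration, not values checked against the GTFS feed.

```python
def merge_transfers(nodes, edges, transfers):
    """Contract stations linked by transfers into one representative node."""
    rep = {n: n for n in nodes}

    def find(n):
        # Follow representative links to the root, shortening the path as we go
        while rep[n] != n:
            rep[n] = rep[rep[n]]
            n = rep[n]
        return n

    for a, b in transfers:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the alphabetically smaller ID as the representative
            rep[max(ra, rb)] = min(ra, rb)

    merged_nodes = {find(n) for n in nodes}
    merged_edges = {
        frozenset((find(a), find(b)))
        for a, b in edges
        if find(a) != find(b)
    }
    return merged_nodes, merged_edges


# L01 and A31 are the transfer pair from the text; L02 and A30 are placeholders.
nodes = {"L01", "L02", "A31", "A30"}
edges = [("L01", "L02"), ("A31", "A30")]
transfers = [("L01", "A31")]

merged_nodes, merged_edges = merge_transfers(nodes, edges, transfers)
```

The ridership study would use the merged graph, while the income study would keep the original `nodes` and `edges`.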


## 4.  Covid-19 Impact on Ridership

Using the geolocation of each station to obtain its borough as the categorical variable, we then consider the relative and absolute decline in ridership between 2018 and 2020 at each node.   Lastly, we consider the network centrality of each node, measured by its eigenvector centrality.
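
As a small worked example of the two decline measures, the sketch below computes the absolute and relative decline per station and averages the relative decline by borough. All station names and ridership figures are made up for illustration, and averaging by borough is only one possible summary.

```python
from collections import defaultdict
from statistics import mean


def ridership_decline(before, after):
    """Return (absolute decline, relative decline as a fraction of before)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Hypothetical rows: (station, borough, 2018 ridership, 2020 ridership)
rows = [
    ("Station X", "Manhattan", 1000, 250),
    ("Station Y", "Manhattan", 2000, 500),
    ("Station Z", "Queens", 800, 400),
]

relative_by_borough = defaultdict(list)
for station, borough, riders_2018, riders_2020 in rows:
    _, relative = ridership_decline(riders_2018, riders_2020)
    relative_by_borough[borough].append(relative)

mean_relative_decline = {b: mean(v) for b, v in relative_by_borough.items()}
```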

We can also consider each station's distance to a single node as a centrality measure.  For example, using the Grand Central Terminal station as a central node, we can evaluate the graph-theoretic distance (the number of edges on a shortest path) from this station and compare that to the ridership declines.
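
The distance from a chosen central station can be computed with a breadth-first search, since every edge counts as one hop. The sketch below assumes the graph is stored as a dictionary of neighbor sets; the node IDs (`GCT`, `S1` to `S4`, `ISO`) are hypothetical and do not correspond to GTFS identifiers.

```python
from collections import deque


def bfs_distances(adj, source):
    """Number of edges on a shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy network; "ISO" is disconnected from the rest.
adj = {
    "GCT": {"S1", "S2"},
    "S1": {"GCT", "S3"},
    "S2": {"GCT"},
    "S3": {"S1", "S4"},
    "S4": {"S3"},
    "ISO": set(),
}

distances = bfs_distances(adj, "GCT")
```

Stations that cannot be reached from the source are simply absent from the result.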




## 5. Income Bracket Vs. Network Centrality

The geographic location of a station certainly played a key role in past decisions about subway line routes and station locations. This evaluation, however, focuses on the median income of those living near each subway stop. The goal is to better understand how wealth distribution across New York City relates not only to access to the subway system but also to ease of travel across it.

We will analyze the network centrality of the New York City Subway system by categorizing each station based on the median household income of the station's surrounding area as defined below.

Income brackets based on New York, New Jersey, Pennsylvania household income quartiles in 2021:
* **Low:** less than or equal to \$36,000
* **Medium:** Between \$36,001 and \$78,000
* **High:** Between \$78,001 and \$150,000
* **Very high:** greater than \$150,000  
Source: https://dqydj.com/income-by-city/
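
The bracket assignment can be expressed as a small lookup function. This sketch uses the thresholds listed above; the function name `income_bracket` is our own choice for illustration.

```python
def income_bracket(median_income):
    """Map a median household income in dollars to its income category."""
    if median_income <= 36_000:
        return "Low"
    if median_income <= 78_000:
        return "Medium"
    if median_income <= 150_000:
        return "High"
    return "Very high"


sample_brackets = [income_bracket(x) for x in (30_000, 60_000, 100_000, 200_000)]
```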

The network centrality metrics for the analysis will be degree centrality and eigenvector centrality.
* Degree centrality for a vertex $v$ is defined as $C_{D}(v)=\deg(v)$ for a given graph $G:=(V,E)$ with $|V|$ vertices and $|E|$ edges, according to [Wikipedia](https://en.wikipedia.org/wiki/Centrality#Degree_centrality).
* Eigenvector centrality relies on a recursive approach that recomputes the score of each node as a weighted sum of the centralities of all nodes in that node’s neighborhood, $v_i = \sum_{j \in N} x_{i,j} v_j$, then normalizes $v$ and repeats until $v$ stops changing, according to the textbook *Social Network Analysis for Startups*.
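
Both metrics can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the implementation we will use: the graph is unweighted (so every $x_{i,j}$ is 1 for neighbors), scores start at 1, and each pass is normalized by the largest score. The loop is capped at `max_iter` because this plain iteration may not settle on some graphs, for example bipartite ones. The four-node graph is a made-up example.

```python
def degree_centrality(adj):
    """Degree of each node: the number of neighbors."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def eigenvector_centrality(adj, tol=1e-10, max_iter=1000):
    """Repeated neighbor-sum with max-normalization until scores stop changing."""
    v = {node: 1.0 for node in adj}
    for _ in range(max_iter):
        # New score of a node is the sum of its neighbors' current scores
        new_v = {node: sum(v[nb] for nb in adj[node]) for node in adj}
        largest = max(new_v.values())
        if largest == 0:
            return new_v  # graph has no edges
        new_v = {node: score / largest for node, score in new_v.items()}
        change = max(abs(new_v[node] - v[node]) for node in adj)
        v = new_v
        if change < tol:
            break
    return v


# Made-up example: a triangle A-B-C with a pendant node D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

deg = degree_centrality(adj)
ev = eigenvector_centrality(adj)
```

In this example `A` has both the highest degree and the highest eigenvector score, while `D` scores lowest on both.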

One hypothetical outcome is that subway stations with higher degree centrality tend to fall in the low or medium income categories, whereas stations with lower degree centrality have higher median incomes. Another possible outcome is that areas with higher median incomes surround stations with higher eigenvector centrality, implying that such a station may not have a high degree centrality itself but is close to other stations that do.

Another evaluation to consider regarding wealth distribution and network centrality is the pattern observed across the boroughs. The relationship between degree centrality and average income may differ across the five boroughs and thus may not be consistent across the entirety of New York City.