<a href="https://colab.research.google.com/github/hawa1983/Capstone/blob/main/ibx_capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NYC Subway Station Catchment Population Analysis
Goal: Calculate the population within a 0.5-mile radius (catchment area) of each New York City subway station using 2020 census block group data. We will use Python libraries (Pandas, GeoPandas, Shapely) and the U.S. Census API to gather required data and perform spatial analysis. The steps include:
- Loading and cleaning the GTFS stops.txt to get subway station locations.
- Downloading Census TIGER/Line shapefiles for 2020 block group boundaries in NYC (the five boroughs).
- Using the Census API (ACS 5-Year 2020, variable B01003_001) to get total population for each block group.
- Merging population data with the block group geometries.
- Creating a 0.5-mile buffer around each station.
- Intersecting each station’s buffer with block group polygons and using area-weighted interpolation to estimate how much of each block group’s population lies within the buffer (walker-data.com).
- Summing the weighted populations to get the total catchment population for each station, and outputting the results to a CSV.

## 1. Load and Prepare Subway Stops Data
First, load the GTFS stops.txt file into a Pandas DataFrame and filter it to get only subway stations (not individual entrances or platforms). In GTFS, subway stations are typically marked with location_type = 1 (parent stations), while individual stop platforms have location_type = 0 and a parent_station reference. We will keep only records where location_type = 1 to represent each station once. Then we convert the DataFrame to a GeoDataFrame using latitude and longitude coordinates.

## Methodology

This workflow implements a concise, reproducible pipeline for ingesting and preparing MTA GTFS stop data for geospatial analysis. The methodology emphasizes efficient I/O, correct use of GTFS semantics, and creation of a GIS-ready dataset.

### i. Source Data Acquisition

GTFS is the canonical data format used by transit agencies, and the MTA publishes it as a ZIP archive containing multiple CSV-like text files. The workflow retrieves this feed programmatically using `requests`. Pulling the archive directly from the authoritative source rather than relying on a cached local copy ensures freshness and traceability of inputs. Because the data volume is small, downloading to memory is both efficient and avoids I/O overhead.

### ii. In-Memory Decompression and Targeted Extraction

The GTFS archive contains many components (routes, trips, shapes, stop times, etc.), but only `stops.txt` is required for this task. Using `BytesIO` with `ZipFile` allows selective extraction in memory without persisting intermediate files. This approach simplifies pipeline construction and avoids filesystem dependencies that complicate reproducibility and deployment.
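
The in-memory pattern can be tried offline with a minimal sketch that uses only the standard library. The archive is built in memory as a stand-in for `response.content`; `make_demo_feed` and `read_member_rows` are hypothetical helper names, the station row is taken from the output shown later in this notebook, and the platform row is illustrative.

```python
import csv
import io
import zipfile


def make_demo_feed() -> bytes:
    """Build a tiny GTFS-like ZIP in memory (stand-in for response.content)."""
    stops_txt = (
        "stop_id,stop_name,stop_lat,stop_lon,location_type,parent_station\n"
        "101,Van Cortlandt Park-242 St,40.889248,-73.898583,1,\n"
        "101N,Van Cortlandt Park-242 St,40.889248,-73.898583,,101\n"  # illustrative platform row
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("stops.txt", stops_txt)
        archive.writestr("routes.txt", "route_id\n1\n")  # present but never read
    return buffer.getvalue()


def read_member_rows(zip_bytes: bytes, member: str) -> list[dict]:
    """Read one CSV member from a ZIP held in memory, without touching disk."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


rows = read_member_rows(make_demo_feed(), "stops.txt")
print(len(rows), rows[0]["stop_name"])
```

Only the requested member is decompressed; the other files in the archive are never read.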

### iii. Granularity Selection via GTFS Metadata

GTFS provides a hierarchical representation of physical and logical transit elements. The `location_type` field encodes this structure:

* `0` (or blank): stop or platform (fine-grained physical points)
* `1`: station (logical grouping)
* `2`: entrance/exit

Many mapping or network-level analyses operate at the station level rather than the platform or entrance level. Therefore the workflow filters `location_type == 1` to isolate station entities. This avoids spatial redundancy, simplifies downstream joins, and aligns with how GTFS models stop relationships (`parent_station`).
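
As a small illustration of this filter, the sketch below uses plain Python on a few hand-written rows. The stop IDs are illustrative and `is_station` is a hypothetical helper. It assumes `location_type` arrives as text, where a blank value means a stop or platform.

```python
# Hand-written sample rows; in the notebook these come from stops.txt.
stops = [
    {"stop_id": "101", "location_type": "1", "parent_station": ""},
    {"stop_id": "101N", "location_type": "", "parent_station": "101"},
    {"stop_id": "101S", "location_type": "", "parent_station": "101"},
    {"stop_id": "103", "location_type": "1", "parent_station": ""},
]


def is_station(stop: dict) -> bool:
    """True for parent stations (location_type == 1)."""
    return stop["location_type"].strip() == "1"


# Keep one record per station.
stations = [s["stop_id"] for s in stops if is_station(s)]

# Group the remaining child stops under their parent station.
children: dict[str, list[str]] = {}
for s in stops:
    if not is_station(s) and s["parent_station"]:
        children.setdefault(s["parent_station"], []).append(s["stop_id"])

print(stations)
print(children)
```

Filtering on the parent records counts each station once, while the `parent_station` links remain available if platform-level detail is needed later.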

### iv. Geospatial Structuring of the Dataset

To support spatial analysis, the script converts the filtered DataFrame into a GeoDataFrame:

* `points_from_xy` is used to construct Shapely Point geometries from the GTFS latitude/longitude fields.
* The coordinate reference system is explicitly defined as EPSG:4326 (WGS84), ensuring interoperability with other spatial layers and enabling CRS transformations when needed.

This step transforms raw GTFS tabular data into a GIS-compatible object that supports spatial indexing, spatial joins, visualization, and projection into New York–specific CRSs (e.g., EPSG:2263 for distance-based analyses).

### v. Summary for a Data Scientist

The methodology forms a clean, dependency-light ETL pipeline:

1. Fetch GTFS from the authoritative MTA endpoint to ensure freshness.
2. Decompress and parse only the necessary component (`stops.txt`) in memory.
3. Use GTFS metadata to select the conceptually appropriate level of granularity (stations vs. platforms).
4. Construct a geospatially explicit dataset suitable for spatial analytics and integration with other NYC geodata.



In [109]:
import pandas as pd
import geopandas as gpd
from io import BytesIO
from zipfile import ZipFile
import requests

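# Download the GTFS ZIP from the MTA URL
url = "http://web.mta.info/developers/data/nyct/subway/google_transit.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page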
with ZipFile(BytesIO(response.content)) as z:
    # Read stops.txt directly from inside the ZIP
    with z.open("stops.txt") as f:
        stops_df = pd.read_csv(f)

# Filter for station-level entries
stations_df = stops_df[stops_df['location_type'] == 1].copy()

# Convert to GeoDataFrame
stations_gdf = gpd.GeoDataFrame(
    stations_df,
    geometry=gpd.points_from_xy(stations_df.stop_lon, stations_df.stop_lat),
    crs="EPSG:4326"
)

print("Number of stations loaded:", len(stations_gdf))
print(stations_gdf.head(3))

# Save the results to CSV
# stations_gdf.to_csv("stations_gdf.csv", index=False)


Number of stations loaded: 496
  stop_id                  stop_name   stop_lat   stop_lon  location_type  \
0     101  Van Cortlandt Park-242 St  40.889248 -73.898583            1.0   
3     103                     238 St  40.884667 -73.900870            1.0   
6     104                     231 St  40.878856 -73.904834            1.0   

  parent_station                    geometry  
0            NaN  POINT (-73.89858 40.88925)  
3            NaN  POINT (-73.90087 40.88467)  
6            NaN  POINT (-73.90483 40.87886)  


## 2. Retrieve Census Block Group Geometries for NYC

To determine populations around stations, we need the geographic boundaries of census block groups in NYC. Block groups are small areas used by the Census (each block group typically has 600–3,000 people, making them fine-grained units for population data; catalog.data.gov). We will use the TIGER/Line shapefiles for 2020 block groups provided by the U.S. Census Bureau (catalog.data.gov).

We'll download the shapefile for all block groups in New York State and then filter it to the five boroughs (New York County/Manhattan, Bronx, Kings/Brooklyn, Queens, Richmond/Staten Island). Each of these boroughs corresponds to a county with FIPS codes 061, 005, 047, 081, 085 respectively (when prefixed with state FIPS 36 for New York). After loading the shapefile with GeoPandas, we filter by the county codes.

**Explanation:** We use the Census TIGER shapefile for 2020 block groups in New York. GeoPandas can read directly from a zip URL. The shapefile’s attribute table has the fields STATEFP, COUNTYFP, TRACTCE, BLKGRPCE, and GEOID (some TIGER layers use suffixed names such as GEOID20), representing state, county, tract, and block group codes and a concatenated GEOID. We filter by COUNTYFP values to get only the block groups in NYC’s five counties.

**Note:** The CRS of the loaded shapefile is likely a geographic coordinate system (NAD83). Before doing area calculations or buffering, we will need to project these geometries to a planar coordinate system with units in meters (discussed in a later step).


## Methodology

This workflow ingests TIGER/Line block-group geometries from the Census Bureau, filters them to the New York City counties, and prepares them for integration with demographic or spatial data products. The methodology emphasizes direct-to-GeoDataFrame ingestion, standards-compliant geographic identifiers, and efficient spatial subsetting.

### i. Direct Ingestion of TIGER/Line Shapefiles from the Census Source

The TIGER/Line dataset is the canonical source for U.S. Census geographic boundaries. Instead of downloading shapefiles manually or relying on local copies, the workflow reads the ZIP archive directly from the Census servers using `geopandas.read_file`.

`gpd.read_file` supports remote URLs and handles ZIP extraction internally, allowing for:

* zero intermediate files,
* fully automated and reproducible ingestion,
* direct loading into a GeoDataFrame with CRS-aware geometries.

This ensures that the geometries represent the official 2020 block-group boundaries.

### ii. Selection of NYC Block Groups via County FIPS Filtering

Block groups are nested within counties, and the TIGER dataset contains all block groups for the entire state of New York. To isolate only those within New York City, the workflow filters on the `COUNTYFP` attribute using the FIPS codes for the five NYC counties:

* 005 (Bronx)
* 047 (Kings/Brooklyn)
* 061 (New York/Manhattan)
* 081 (Queens)
* 085 (Richmond/Staten Island)

Using county FIPS codes rather than name-based filters ensures stability: names can vary across datasets, whereas FIPS codes are standardized identifiers maintained by the federal government that rarely change.

The filtering yields a subset of block-group polygons corresponding precisely to New York City’s geographic extent.
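
One practical pitfall with FIPS filters is losing leading zeros when codes pass through integers or CSV round trips (`"061"` becoming `61`). The sketch below shows a defensive normalization step. The helper names `normalize_county_fips` and `in_nyc` are hypothetical, and the notebook itself does not need them because the shapefile already stores `COUNTYFP` as zero-padded text.

```python
NYC_COUNTY_FIPS = {
    "005": "Bronx",
    "047": "Kings (Brooklyn)",
    "061": "New York (Manhattan)",
    "081": "Queens",
    "085": "Richmond (Staten Island)",
}


def normalize_county_fips(value) -> str:
    """Return a 3-character, zero-padded county code from a str or int."""
    return str(value).strip().zfill(3)


def in_nyc(value) -> bool:
    """True when the county code belongs to one of the five NYC counties."""
    return normalize_county_fips(value) in NYC_COUNTY_FIPS


print(in_nyc("061"), in_nyc(61), in_nyc("001"))
```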

### iii. Creation of an NYC-Specific Block-Group GeoDataFrame

The resulting subset is copied into a new GeoDataFrame to avoid chained-assignment issues and to explicitly define the filtered dataset as an independent spatial layer. This GeoDataFrame contains:

* block-group geometries (polygon boundaries),
* associated TIGER attributes (state FIPS, county FIPS, tract ID, block-group ID, GEOIDs),
* a CRS standardized by TIGER (EPSG:4269, NAD83).

This makes the dataset directly compatible with Census ACS data (e.g., population estimates) and ready for spatial joins, overlays, or projection transformations.

### iv. Validation and Preview

The `print` and `head` operations serve as a quick validation step to confirm:

* the total number of block groups contained within NYC,
* the presence of expected fields,
* correct geometry types.

This is a standard check when constructing a reproducible spatial ETL pipeline.


### v. Summary

The methodology establishes a clean, reproducible ETL workflow for Census block-group geometries:

1. Read TIGER/Line shapefiles directly from the Census server using GeoPandas.
2. Subset the dataset using county FIPS codes to isolate New York City block groups.
3. Create a standalone GeoDataFrame containing only NYC geometries.
4. Validate structure and attributes for downstream integration.

The resulting layer is ready for spatial joins with ACS variables, neighborhood boundary overlays, or area-weighted calculations.


In [110]:
# Load the 2020 TIGER/Line block group shapefile for New York State (FIPS 36)
shapefile_url = "https://www2.census.gov/geo/tiger/TIGER2020/BG/tl_2020_36_bg.zip"
block_groups_gdf = gpd.read_file(shapefile_url)

# Filter for NYC’s five counties (boroughs)
nyc_county_fips = ["005", "047", "061", "081", "085"]  # Bronx, Brooklyn, Manhattan, Queens, Staten Island
block_groups_nyc = block_groups_gdf[block_groups_gdf["COUNTYFP"].isin(nyc_county_fips)].copy()

print("Total NYC block groups:", len(block_groups_nyc))
block_groups_nyc.head(2)


Total NYC block groups: 6807


| | STATEFP | COUNTYFP | TRACTCE | BLKGRPCE | GEOID | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 36 | 61 | 23900 | 1 | 360610239001 | Block Group 1 | G5030 | S | 27517 | 0 | 40.8322236 | -73.9404112 | POLYGON ((-73.94112 40.83166, -73.94088 40.832... |
| 25 | 36 | 61 | 13900 | 1 | 360610139001 | Block Group 1 | G5030 | S | 23621 | 0 | 40.7688543 | -73.9868884 | POLYGON ((-73.98806 40.76979, -73.98666 40.769... |


## 3. Retrieve Population Data for Block Groups (ACS 5-Year 2020)

Next, we gather population data for each block group. We use the American Community Survey (ACS) 5-Year 2020 data, specifically the variable B01003_001 which is the total population of the block group. We can query the Census API for this data. The API requires a valid Census API key (you can obtain one for free from the Census Bureau). Insert your API key in the code where indicated.

We'll query all block groups in the five NYC counties. The API call will specify the state (36 for New York) and each county FIPS. We can either make one call per county or attempt a single call with multiple counties. For simplicity, we'll loop over the counties.

**Explanation:** We call the ACS API for each county. The query requests the field B01003_001E (total population estimate) for all block groups in the specified state and county (for=block group:*&in=state:36&in=county:XYZ). The response is JSON with each row containing the population and geographic identifiers (state, county, tract, block group). We combine the results and create a GEOID that matches the one in the shapefile (concatenating state, county, tract, block group codes). We also convert the population to numeric type.

**Note:** Ensure you have a valid Census API key in place of "YOUR_CENSUS_API_KEY". Without a key, the API might rate-limit or reject the request. If you prefer, you could use the cenpy library to retrieve ACS data as well, but here we use direct requests for transparency.


## Methodology

This workflow retrieves population counts for all Census block groups in New York City using the Census Bureau’s ACS 5-Year API and prepares them for integration into geospatial or demographic analyses. The methodology emphasizes API-driven ETL, normalization of disparate county-level responses, and construction of standardized geographic identifiers (GEOIDs).

### i. Programmatic Access to ACS via the Census API

The ACS 5-Year dataset provides the most granular, stable population estimates available at the block-group level. Accessing it through the API, rather than static downloads, ensures up-to-date values and supports automated data pipelines.

The script uses an authenticated GET request to the 2020 ACS 5-year endpoint. The API query requests a single variable:

* `B01003_001E`: total population estimate

The query is formulated to return all block groups within a given county using the `for=block group:*`, `in=state:36` (New York), and `in=county:<FIPS>` clauses. The script issues one request per county FIPS code (the five NYC counties), which keeps each request small and each response easy to parse and validate.

### ii. Iterative Extraction and Row-Level Parsing

For each county:

* A parameterized URL is constructed using the county FIPS code and API key.
* The API returns a JSON payload where the first row contains column names and subsequent rows contain values.
* The script converts the payload into a DataFrame per county.

This approach preserves the original Census field structure, avoids schema inconsistency, and maintains clarity in how attributes relate to geography (state, county, tract, block group).
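
The header-plus-rows payload shape can be illustrated without calling the API. In the sketch below the payload is hard-coded: its three rows mirror the first rows of the population output shown later in this notebook, with zero-padded codes reconstructed from their GEOIDs. `rows_to_records` is a hypothetical helper name.

```python
import json

# Hard-coded sample in the same shape as the API response:
# first row is the header, remaining rows are string values.
payload = json.loads("""
[
  ["B01003_001E", "state", "county", "tract", "block group"],
  ["6600", "36", "005", "000100", "1"],
  ["1542", "36", "005", "000200", "2"],
  ["2830", "36", "005", "002001", "1"]
]
""")


def rows_to_records(data: list) -> list[dict]:
    """Turn [header, row, row, ...] into a list of dicts keyed by column name."""
    header, *rows = data
    return [dict(zip(header, row)) for row in rows]


records = rows_to_records(payload)
for record in records:
    # Values arrive as strings; convert the estimate to an integer.
    record["population"] = int(record.pop("B01003_001E"))

print(records[0])
print(sum(r["population"] for r in records))
```

Keeping the geographic codes as strings preserves their leading zeros, which matters when they are later concatenated into a GEOID.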

### iii. Consolidation Across NYC Counties

Once each county’s block-group dataset is parsed, the DataFrames are concatenated into a unified dataset. This yields a complete block-group coverage for all five boroughs. Because each county is queried separately but defined under the same schema, row-wise concatenation is straightforward and preserves data integrity.

### iv. Variable Normalization and Type Enforcement

To prepare the dataset for quantitative analysis:

* The population variable is renamed to a human-readable form (`population`).
* The population field is cast to numeric, which is necessary because API responses return all values as strings.

This normalization ensures compatibility with statistical operations, aggregations, and merges.

### v. Construction of Standardized Geography Identifiers (GEOIDs)

Census block groups are most commonly referenced via their 12-digit GEOID, constructed as:

* state FIPS (2 digits)
* county FIPS (3 digits)
* tract code (6 digits)
* block group code (1 digit)

The script concatenates these components into a full GEOID string for each row. This step is essential because:

* GEOID is the join key used across TIGER/Line shapefiles and most Census-derived spatial datasets.
* Downstream spatial joins (e.g., merging with block-group polygons) require exact GEOID matching.

This ensures the dataset is immediately ready for integration with geospatial boundaries or further demographic enrichments.
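
A minimal sketch of the GEOID construction with a sanity check follows. The function name `build_bg_geoid` is hypothetical. The API already returns zero-padded strings, so the plain concatenation used in this notebook works; the padding here only matters when components have passed through integers or a CSV round trip.

```python
def build_bg_geoid(state, county, tract, block_group) -> str:
    """Concatenate state (2) + county (3) + tract (6) + block group (1) digits."""
    geoid = (
        str(state).zfill(2)
        + str(county).zfill(3)
        + str(tract).zfill(6)
        + str(block_group)
    )
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"not a 12-digit block group GEOID: {geoid!r}")
    return geoid


# Zero-padded strings (as returned by the API) and bare integers give the same key.
print(build_bg_geoid("36", "005", "000100", "1"))
print(build_bg_geoid(36, 5, 100, 1))
```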

### vi. Summary

The methodology implements a clean API-driven data ingestion pipeline:

1. Query the ACS 5-Year API for block-group population across NYC counties.
2. Parse API responses into uniform, structured DataFrames.
3. Consolidate all county-level data into one dataset.
4. Normalize variable names and enforce numeric types.
5. Construct standard 12-digit GEOIDs for interoperability with spatial boundaries.

This produces a complete, analysis-ready block-group population table suitable for merging with geospatial layers, modeling population distributions, or performing domain-level analytics.


In [111]:
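import os
import requests

# Your Census API key: read from the CENSUS_API_KEY environment variable,
# falling back to a placeholder (avoid committing a real key to the notebook)
API_KEY = os.environ.get("CENSUS_API_KEY", "YOUR_CENSUS_API_KEY")

# Collect one DataFrame of population data per county
pop_list = []
for county in nyc_county_fips:
    url = (
        "https://api.census.gov/data/2020/acs/acs5?get=B01003_001E"
        f"&for=block%20group:*&in=state:36&in=county:{county}&key={API_KEY}"
    )
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop on an HTTP error before parsing JSON
    data = response.json()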
    # The first row of the response is the header
    columns = data[0]
    values = data[1:]
    df = pd.DataFrame(values, columns=columns)
    pop_list.append(df)

# Concatenate all counties data
pop_df = pd.concat(pop_list, ignore_index=True)
pop_df.rename(columns={"B01003_001E": "population"}, inplace=True)
# Convert population to numeric
pop_df["population"] = pd.to_numeric(pop_df["population"])
# Construct the full 12-digit GEOID for block group (state+county+tract+block group codes)
pop_df["GEOID"] = pop_df["state"] + pop_df["county"] + pop_df["tract"] + pop_df["block group"]
pop_df.head(3)


| | population | state | county | tract | block group | GEOID |
|---|---|---|---|---|---|---|
| 0 | 6600 | 36 | 5 | 100 | 1 | 360050001001 |
| 1 | 1542 | 36 | 5 | 200 | 2 | 360050002002 |
| 2 | 2830 | 36 | 5 | 2001 | 1 | 360050020011 |


## 4. Merge Population Data with Block Group Geometries

Now we join the population data with the GeoDataFrame of NYC block group geometries. This will give each block group polygon a population attribute. The join can be done on the GEOID field (the shapefile might have a field named GEOID or GEOID20). We'll ensure our DataFrame’s GEOID string matches the shapefile’s GEOID format.

**Explanation:** We perform a left join so that each block group polygon gets a population value from ACS. All NYC block groups should find a match in the population DataFrame. After this, block_groups_nyc contains geometry and population for each block group. We are now ready to perform spatial analysis.


## Methodology

This workflow harmonizes geographic identifiers between TIGER/Line block group geometries and ACS-derived population data, then integrates the two datasets via a relational merge. The methodology emphasizes identifier normalization, schema resolution, and creation of a unified spatial–demographic data product.

### i. Identifier Field Resolution Across TIGER Versions

TIGER/Line shapefiles may expose the GEOID under different field names depending on the release or layer:

* `GEOID`
* `GEOID20` (or occasionally `GEOID10` for older vintages)

Before joining, the workflow checks which version exists in the loaded GeoDataFrame and assigns the appropriate field name to `geo_field`. This allows the pipeline to operate robustly even when TIGER schemas change slightly year-to-year.

The selected field is cast to string to ensure consistent data types. This step is critical because merges on identifiers can fail silently when keys differ in type (e.g., integer vs. string), especially in mixed Census-derived datasets.
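
The field-resolution step can be sketched as a small helper. `resolve_geoid_field` is a hypothetical name. Unlike the two-way check in this notebook, the sketch raises an error when none of the candidate names is present, which is a design choice of the sketch rather than the notebook's behavior.

```python
def resolve_geoid_field(columns, candidates=("GEOID", "GEOID20", "GEOID10")) -> str:
    """Return the first candidate GEOID column name present in `columns`."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"none of {candidates} found in columns")


cols = ["STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE", "GEOID", "geometry"]
print(resolve_geoid_field(cols))

# Why the cast to string matters: an integer key never equals its string form,
# so a join on mixed types finds no matches without raising an error.
print(360610239001 == "360610239001")
```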

### ii. Controlled Relational Join with ACS Population Data

The ACS-derived population DataFrame (`pop_df`) contains:

* a standardized 12-digit block group `GEOID`,
* the variable `population` (B01003_001E).

The workflow performs a left join:

* `left`: TIGER block group geometries
* `right`: ACS population table
* `left_on`: the resolved TIGER GEOID column (`geo_field`)
* `right_on`: `"GEOID"` in `pop_df`

A left join ensures that every NYC block group polygon is retained even if population is missing (which is rare but possible due to sampling or ACS suppression). This is important for maintaining full geographic coverage.
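
The left-join semantics can be shown with plain Python dictionaries. In this sketch the first two GEOIDs and their populations come from the merged output in this notebook, and the third GEOID is a made-up key added to show an unmatched polygon.

```python
# Left side: one entry per block group polygon.
block_groups = ["360610239001", "360610139001", "360619999999"]  # last one is hypothetical

# Right side: population keyed by GEOID.
population_by_geoid = {"360610239001": 1538, "360610139001": 2059}

# Left join: every polygon is kept; a missing match yields None.
merged = [
    {"GEOID": geoid, "population": population_by_geoid.get(geoid)}
    for geoid in block_groups
]

unmatched = [row["GEOID"] for row in merged if row["population"] is None]
print(len(merged), unmatched)
```

In pandas the equivalent check after the merge is to count missing values in the `population` column, for example with `block_groups_nyc["population"].isna().sum()`.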

### iii. Construction of a Unified Spatial–Demographic Dataset

After the merge, the resulting GeoDataFrame contains:

* complete block group geometries,
* TIGER metadata,
* the ACS population attribute merged at the correct level of geographic granularity.

This yields a single, analysis-ready block-group dataset capable of supporting:

* spatial modeling,
* demographic normalization (e.g., population density),
* integration with additional ACS variables,
* multilevel geospatial joins.

The post-merge inspection (`columns.tolist()` and `head`) serves as a validation step to confirm that the join succeeded and that column naming conflicts were resolved appropriately.

### iv. Summary for a Data Scientist

The methodology establishes a structured identifier-normalization and merge workflow:

1. Detect the appropriate GEOID field in the TIGER dataset and cast it to a consistent type.
2. Perform a controlled left join to merge ACS population data onto block-group geometries.
3. Produce a unified GeoDataFrame containing both spatial boundaries and population attributes.
4. Validate schema integrity after the merge.

This provides a fully integrated spatial–demographic block-group layer suitable for modeling, mapping, and further enrichment.


In [112]:
# Ensure the GEOID field in block_groups_nyc matches the format (could be GEOID or GEOID20)
geo_field = "GEOID" if "GEOID" in block_groups_nyc.columns else "GEOID20"
block_groups_nyc[geo_field] = block_groups_nyc[geo_field].astype(str)

# Merge population into block group geodataframe
block_groups_nyc = block_groups_nyc.merge(pop_df[["GEOID", "population"]],
                                          left_on=geo_field, right_on="GEOID",
                                          how="left")
print("Columns in block_groups_nyc:", block_groups_nyc.columns.tolist())
block_groups_nyc.head(2)


Columns in block_groups_nyc: ['STATEFP', 'COUNTYFP', 'TRACTCE', 'BLKGRPCE', 'GEOID', 'NAMELSAD', 'MTFCC', 'FUNCSTAT', 'ALAND', 'AWATER', 'INTPTLAT', 'INTPTLON', 'geometry', 'population']


| | STATEFP | COUNTYFP | TRACTCE | BLKGRPCE | GEOID | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | geometry | population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 61 | 23900 | 1 | 360610239001 | Block Group 1 | G5030 | S | 27517 | 0 | 40.8322236 | -73.9404112 | POLYGON ((-73.94112 40.83166, -73.94088 40.832... | 1538 |
| 1 | 36 | 61 | 13900 | 1 | 360610139001 | Block Group 1 | G5030 | S | 23621 | 0 | 40.7688543 | -73.9868884 | POLYGON ((-73.98806 40.76979, -73.98666 40.769... | 2059 |


## 5. Buffer Each Station by 0.5 Miles

Each subway station is considered the center of a catchment area with a 0.5-mile radius. To construct this, we create a circular buffer of 0.5 miles around each station point. **Important:** We must project our data to a planar coordinate system (with distance in consistent units) before buffering. Working in latitude/longitude (degrees) would give incorrect distances and areas (geopandas.org). We need a projection in feet or meters appropriate for NYC (for example, EPSG:32618 is UTM zone 18N, which covers New York in meters, and EPSG:2263 is NY State Plane in feet). Here, we'll use a meter-based CRS so the buffer distance can be given in meters.

We will transform both the station points and block group polygons to the same projected CRS.

**Explanation:** We use GeoDataFrame.to_crs() to convert the geometries to UTM Zone 18N, which uses meters. The buffer distance is calculated as 0.5 * 1609.34 (since 1 mile is ~1609.34 meters). Using GeoSeries.buffer() generates a polygon around each point with the given radius. Now stations_proj has an extra column buffer_geom containing the 0.5-mile radius polygon for each station.


## Methodology

This workflow prepares transit station geometries and Census block-group polygons for spatial proximity analysis by projecting them into a planar coordinate reference system (CRS) and generating uniform buffer zones around subway stations. The methodology emphasizes CRS selection, metric-space transformations, and the creation of geometrically consistent service-area representations.

### i. Reprojection into a Planar, Metric CRS

Accurate distance and area calculations require a planar CRS because geographic coordinate systems (e.g., EPSG:4326/WGS84) are angular, not metric. The workflow uses:

* **EPSG:32618** — UTM Zone 18N

This CRS is suitable because:

1. It is a **conformal, projected CRS** defined in meters.
2. It covers the New York region accurately.
3. It minimizes distortion for local-scale analyses (tens of miles or less).

Both datasets—the point-based subway stations and polygon-based block groups—are reprojected into the same CRS to ensure geometries are compatible for distance-based spatial operations.
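
To see why degrees are not a usable distance unit here, the sketch below estimates how many meters one degree spans at New York's latitude. It assumes a spherical Earth with a mean radius of 6,371 km, which is an approximation for illustration only and is not what the projection itself uses. `meters_per_degree` is a hypothetical helper.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def meters_per_degree(lat_deg: float) -> tuple[float, float]:
    """Return (meters per degree of latitude, meters per degree of longitude)."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon


# 40.75 is roughly the latitude of Midtown Manhattan.
lat_m, lon_m = meters_per_degree(40.75)
print(round(lat_m), round(lon_m), round(lon_m / lat_m, 3))
```

Because a degree of longitude is noticeably shorter than a degree of latitude at this latitude, a buffer specified in degrees would be stretched rather than circular on the ground, which is why both layers are projected first.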

### ii. Construction of Metric Buffers Around Stations

The analysis requires a 0.5-mile service or influence area around each station. Buffering is performed in the projected CRS where units are meters. The workflow:

1. Converts 0.5 miles to meters
   `0.5 * 1609.34 ≈ 804.67 meters`
2. Applies a geometric buffer operation to each station point:
   `geometry.buffer(buffer_distance)`

The buffer produces a circular polygon representing the area reachable within a 0.5-mile radius (as the crow flies). This is commonly used in urban analytics to approximate catchment areas, transit-access regions, or neighborhood service radii.
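
The unit conversion and the resulting catchment size can be checked with a few lines of arithmetic. This sketch uses the exact definition of the international mile (1609.344 m); the notebook's rounded factor of 1609.34 differs by only about 2 mm over half a mile. `miles_to_meters` is a hypothetical helper.

```python
import math

METERS_PER_MILE = 1609.344  # exact, by definition of the international mile


def miles_to_meters(miles: float) -> float:
    """Convert a distance in miles to meters."""
    return miles * METERS_PER_MILE


radius_m = miles_to_meters(0.5)
# Area of an ideal circle with that radius, in square kilometers.
area_km2 = math.pi * radius_m ** 2 / 1e6
print(round(radius_m, 2), round(area_km2, 2))
```

Each ideal catchment therefore covers roughly 2 square kilometers. The buffer polygon approximates the circle with straight segments, so its computed area is very slightly smaller.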

### iii. Embedding Buffer Geometries into the Spatial Dataset

The buffer polygons are stored in a new column, `buffer_geom`, within the reprojected stations GeoDataFrame. Maintaining both the original point geometry and the buffer geometry supports:

* proximity analysis (intersections with block groups),
* visualization of service areas,
* downstream aggregations (e.g., population within walking distance of transit),
* sensitivity analysis if the buffer radius changes.

Storing buffers as separate geometry columns rather than replacing the primary geometry conforms to best practices for multi-geometry spatial modeling.

### iv. Summary for a Data Scientist

The methodology implements a standard metric reprojection and service-area construction workflow:

1. Reproject all layers into a common planar CRS (UTM Zone 18N) for accurate metric distance calculations.
2. Compute a half-mile buffer around each transit station in meters.
3. Store buffer geometries alongside original geometries to support downstream spatial joins and coverage analyses.

The resulting station-layer now contains metrically accurate catchment areas suitable for population-access calculations, coverage modeling, or equity analyses.


In [113]:
# Project both stations and block groups to a planar CRS for accurate distance/area calculations
projected_crs = "EPSG:32618"  # WGS 84 / UTM Zone 18N (meters)
stations_proj = stations_gdf.to_crs(projected_crs)
block_groups_proj = block_groups_nyc.to_crs(projected_crs)

# Add a buffer geometry of 0.5 miles around each station (0.5 mile ≈ 804.67 meters)
buffer_distance = 0.5 * 1609.34  # miles to meters conversion
stations_proj["buffer_geom"] = stations_proj.geometry.buffer(buffer_distance)

stations_proj.head(2)[["stop_id", "stop_name", "buffer_geom"]]


| | stop_id | stop_name | buffer_geom |
|---|---|---|---|
| 0 | 101 | Van Cortlandt Park-242 St | POLYGON ((593591.236 4527046.49, 593587.361 45... |
| 3 | 103 | 238 St | POLYGON ((593404.955 4526535.53, 593401.081 45... |


## 6. Intersect Buffers with Block Groups and Calculate Area-Weighted Population

For each station’s buffer, we find which block group polygons it overlaps and how much of each. The population within the buffer can be estimated by assuming population is uniformly distributed across each block group’s area. Then the fraction of a block group’s area that falls inside the buffer is the fraction of that block group’s population we count towards the station. This is the area-weighted interpolation approach (walker-data.com): using area of overlap as weights to allocate population from polygons to the buffer region.

We will iterate over each station buffer, intersect it with the block group geometries, compute area overlaps, and sum the weighted populations:

**Explanation:** We iterate through each station’s buffer polygon. Using a spatial index speeds up the search by first filtering block groups whose bounding box intersects the buffer’s bounds. Then we filter to actual intersections. For each intersecting block group, we compute the overlap polygon, take its area (intersection_area), and divide by the block group’s total area to get area_weight. We multiply this weight by the block group’s population to get the portion of the population inside the buffer. Finally, summing these contributions yields the catchment population for the station.

This method assumes population is uniformly distributed within a block group, which is a common assumption for area-weighted interpolation (walker-data.com). Note that block groups are small (hundreds to a few thousand people; catalog.data.gov), so this assumption is reasonably accurate at the city scale, though it could over- or underestimate in some cases (e.g., if a buffer cuts through a block group with uneven population distribution).

We also included each station’s stop_id, stop_name, and original latitude/longitude (if needed, we could get lat/lon from the original data before projection to avoid any numeric changes due to projection). The catchment population is rounded to one decimal place in this example; you may keep it as an integer if desired.

## Methodology

This workflow estimates the population within a half-mile catchment area around each subway station by intersecting station buffers with Census block-group polygons and allocating population through area-weighted interpolation. The approach combines spatial indexing, geometric overlays, and proportional allocation to create a station-level population measure.

### i. Precomputation of Block-Group Areas

Before performing any overlay operations, the full area of each block-group polygon is computed in the projected CRS. Because both the block-group geometries and station buffers exist in a metric, planar coordinate system, the resulting areas are expressed in square meters. Precomputing these areas is essential for efficiency: each block group’s area is reused many times, and repeatedly calculating polygon areas during the station loop would dramatically increase runtime.

### ii. Spatial Index Construction

A spatial index is built on the block-group layer. This index provides efficient bounding-box intersection queries, which are used to identify candidate block groups that might intersect a station’s buffer. Without a spatial index, each station buffer would need to be compared against every block group, resulting in prohibitively slow O(N_stations × N_blockgroups) behavior. The index reduces this to a logarithmic search plus a small number of actual candidates, making large-scale spatial overlay computations tractable.
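
The bounding-box test that a spatial index applies to find candidates can be sketched in plain Python. This example shows only the box-overlap check on a linear scan; it is not a tree-based index, and the coordinates, keys, and helper names (`boxes_intersect`, `candidates`) are hypothetical.

```python
Box = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def boxes_intersect(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidates(query: Box, boxes: dict[str, Box]) -> list[str]:
    """Return keys whose box intersects the query box (cheap prefilter)."""
    return [key for key, box in boxes.items() if boxes_intersect(query, box)]


# Hypothetical block-group bounding boxes in projected meters.
block_group_boxes = {
    "A": (0, 0, 100, 100),
    "B": (90, 90, 200, 200),
    "C": (500, 500, 600, 600),
}
buffer_box = (50, 50, 150, 150)
print(candidates(buffer_box, block_group_boxes))
```

A tree-based index answers the same question without scanning every box. Candidates still need the exact geometric test afterwards, because overlapping boxes do not guarantee overlapping shapes.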

### iii. Station-Level Catchment Computation Using Area-Weighted Interpolation

For each station, the workflow retrieves that station’s buffer polygon and performs a two-stage intersection process:

1. **Candidate selection:** The spatial index is used to find block groups whose bounding boxes intersect the buffer’s bounding box. This filters the dataset to only the relevant nearby polygons.
2. **Exact geometry intersection:** A precise geometric intersection check is applied to the candidate set to identify block groups that truly intersect the buffer area.
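
To make the two-stage idea concrete, here is a minimal pure-Python sketch under simplifying assumptions: block groups are stood in for by axis-aligned rectangles, the buffer is a circle, and the helper names (`bbox_overlaps`, `circle_intersects_rect`) are illustrative and not part of GeoPandas. The rectangles and the station position are made up. A real spatial index avoids even the linear bounding-box scan shown here.

```python
import math

def bbox_overlaps(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circle_intersects_rect(cx, cy, r, rect):
    """Exact test: distance from the circle centre to the nearest point of the rectangle."""
    minx, miny, maxx, maxy = rect
    nearest_x = min(max(cx, minx), maxx)
    nearest_y = min(max(cy, miny), maxy)
    return math.hypot(cx - nearest_x, cy - nearest_y) <= r

# Hypothetical station and half-mile radius, in metres (0.5 mi = 804.672 m)
cx, cy, r = 1000.0, 1000.0, 804.672
buffer_bbox = (cx - r, cy - r, cx + r, cy + r)

# Hypothetical "block groups" as rectangles (minx, miny, maxx, maxy)
rects = {
    "A": (800, 800, 1200, 1200),      # contains the station
    "B": (1700, 1700, 1900, 1900),    # inside the buffer's bounding box, outside the circle
    "C": (3000, 3000, 3500, 3500),    # far away
}

# Stage 1: cheap bounding-box pre-filter
candidates = [k for k, rect in rects.items() if bbox_overlaps(rect, buffer_bbox)]

# Stage 2: exact geometric test on the candidates only
matches = [k for k in candidates if circle_intersects_rect(cx, cy, r, rects[k])]

print(candidates, matches)
```

Rectangle B shows why the second stage is needed: it passes the bounding-box test because it sits in a corner of the buffer's box, yet it lies outside the circle itself.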

If no block groups intersect the buffer, the catchment population is zero. When intersections do occur, the workflow computes the exact overlap area between each block group and the buffer. The fraction of each block group that lies within the buffer—the area weight—is calculated by dividing the overlap area by the block group’s total area.

Population is then allocated proportionally using the standard assumption of uniform population distribution within each block group. Each block group contributes its total population multiplied by this area weight. Summing the contributions across all intersecting block groups yields the estimated population within the half-mile buffer of that station.
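
Written as a formula (the notation is introduced here for illustration and does not appear in the original notebook): let $B_s$ be the buffer of station $s$, let $g$ range over block groups with population $P_g$, and let $A(\cdot)$ denote planar area in the projected CRS. The estimated catchment population is then:

```latex
\hat{P}_s \;=\; \sum_{g \,:\, g \cap B_s \neq \emptyset} P_g \cdot \frac{A(g \cap B_s)}{A(g)}
```

For instance, a block group of 1,200 residents with one quarter of its area inside the buffer contributes 1,200 × 0.25 = 300 people to that station’s estimate.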

### iv. Assembly of Station-Level Results

For each station, relevant attributes such as station identifier, station name, and projected coordinates are combined with the computed catchment population into a results structure. After processing all stations, these records are assembled into a consolidated DataFrame representing station-level population catchments across the entire network.

### v. Summary for a Data Scientist

This methodology implements a scalable, GIS-aware population allocation pipeline:

1. Precompute polygon areas in a metric CRS.
2. Use a spatial index to efficiently retrieve candidate polygons for each buffer.
3. Apply exact geometric overlay to determine intersection areas.
4. Perform area-weighted interpolation to estimate population within station catchments.
5. Aggregate all results into a unified station-level population dataset.

The resulting output is suitable for accessibility modeling, transit equity analysis, service planning, or integration into broader urban analytics workflows.


In [114]:
import pandas as pd  # used below to assemble the results table

# Precompute each block group's full area (in projected CRS units, e.g., square meters)
block_groups_proj["area"] = block_groups_proj.geometry.area

catchment_results = []  # list to collect population results for each station

# Use spatial index for efficiency to find candidate block groups for each buffer
bg_sindex = block_groups_proj.sindex

for idx, station in stations_proj.iterrows():
    station_id = station["stop_id"]
    station_name = station["stop_name"]
    buffer_poly = station["buffer_geom"]

    # Find block groups whose bounding box intersects the buffer's bounding box (spatial index pre-filter)
    possible_matches_index = list(bg_sindex.intersection(buffer_poly.bounds))
    possible_bgs = block_groups_proj.iloc[possible_matches_index]
    # Refine: take only those that actually intersect the buffer geometry
    intersecting_bgs = possible_bgs[possible_bgs.intersects(buffer_poly)].copy()
    if intersecting_bgs.empty:
        catchment_pop = 0
    else:
        # Compute intersection polygon area for each overlapping block group
        intersecting_bgs["intersection_area"] = intersecting_bgs.geometry.intersection(buffer_poly).area
        # Calculate area weight (fraction of block group area within buffer)
        intersecting_bgs["area_weight"] = intersecting_bgs["intersection_area"] / intersecting_bgs["area"]
        # Estimate population in buffer = area weight * total block group population
        intersecting_bgs["pop_in_buffer"] = intersecting_bgs["area_weight"] * intersecting_bgs["population"]
        # Sum up the population contributions from all intersecting block groups
        catchment_pop = intersecting_bgs["pop_in_buffer"].sum()

    catchment_results.append({
        "stop_id": station_id,
        "stop_name": station_name,
        "lat": station.geometry.y,   # Note: projected northing in meters, not degrees; use the original latitude if needed
        "lon": station.geometry.x,   # projected easting in meters, not degrees
        "catchment_population": round(catchment_pop, 1)  # rounding to 1 decimal (optional)
    })

# Convert results to DataFrame
catchment_df = pd.DataFrame(catchment_results)
catchment_df.head(5)


Unnamed: 0,stop_id,stop_name,lat,lon,catchment_population
0,101,Van Cortlandt Park-242 St,4527046.0,592786.565798,16760.3
1,103,238 St,4526536.0,592600.285284,30776.3
2,104,231 St,4525886.0,592274.388599,43777.7
3,106,Marble Hill-225 St,4525404.0,591859.291238,38729.2
4,107,215 St,4524830.0,591407.264497,29386.7


## 7. Output Results to CSV

Finally, we save the results to a CSV file with the columns stop_id, stop_name, lat, lon, and catchment_population. This CSV can be used for further analysis or visualization.

The CSV file station_catchment_population.csv will contain one row per station.

Each station’s catchment_population is the sum of area-weighted block group population within 0.5 miles of that station. This completes the computation of subway station catchment populations. You can further utilize this data for analysis such as ranking stations by surrounding population or mapping the results.
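
As a small illustration of the ranking use mentioned above, the sketch below sorts stations by catchment population. It rebuilds a tiny DataFrame from the five rows shown in the earlier output so that it runs on its own; in practice one would presumably start from the saved CSV instead. The `rank` column name is an assumption made for this example.

```python
import pandas as pd

# Five rows copied from the catchment output shown earlier in this notebook
sample = pd.DataFrame({
    "stop_id": ["101", "103", "104", "106", "107"],
    "stop_name": [
        "Van Cortlandt Park-242 St",
        "238 St",
        "231 St",
        "Marble Hill-225 St",
        "215 St",
    ],
    "catchment_population": [16760.3, 30776.3, 43777.7, 38729.2, 29386.7],
})

# Sort from most to least populated catchment and attach a 1-based rank
ranked = (
    sample.sort_values("catchment_population", ascending=False)
    .reset_index(drop=True)
)
ranked["rank"] = ranked.index + 1

print(ranked[["rank", "stop_name", "catchment_population"]])
```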

Sources:

- NYC subway station locations from GTFS stops.txt (user-provided data).
- U.S. Census Bureau TIGER/Line Shapefiles for 2020 Block Groups (catalog.data.gov).
- American Community Survey 5-Year 2020 data for block group populations (variable B01003_001).
- Methodology of area-weighted interpolation for spatial data (walker-data.com, geopandas.org), assuming uniform population distribution over block group areas (walker-data.com).
- Definition and typical size of census Block Groups (catalog.data.gov).

In [115]:
# Save the results to CSV
catchment_df.to_csv("station_catchment_population.csv", index=False)

print("Saved station_catchment_population.csv with columns:", catchment_df.columns.tolist())


Saved station_catchment_population.csv with columns: ['stop_id', 'stop_name', 'lat', 'lon', 'catchment_population']


# Jobs Within 0.5 Miles of NYC Subway Stations (LODES 2021 and Census Block Groups)

**Introduction**

This analysis computes the number of jobs within a 0.5-mile radius of each New York City subway station. We use the LODES Workplace Area Characteristics (WAC) data (2021, New York) for employment counts and the 2020 Census block group boundaries for spatial analysis. The methodology mirrors the preceding population analysis by using area-weighted interpolation, i.e., allocating job counts from block group polygons to station buffer areas in proportion to the area of overlap (gis.stackexchange.com). We produce a CSV listing each station’s total jobs and jobs by industry sector (e.g., retail, education, health, manufacturing). All data used are openly available: the U.S. Census LEHD program for LODES employment (census.gov), TIGER/Line for block group shapefiles, and the MTA’s GTFS feed for station locations.

**Data sources:**

- LODES WAC 2021 (NY): Provides job counts by workplace location at the Census block level, including total employment and breakdowns by industry sector. The WAC file for New York 2021 (LODES version 8.1) contains one record per workplace census block, with fields such as C000 (total jobs) and CNSxx for NAICS sector categories (lehd.ces.census.gov). For example, CNS07 = Retail Trade (NAICS 44-45), CNS15 = Educational Services (NAICS 61), CNS16 = Health Care and Social Assistance (NAICS 62), and CNS05 = Manufacturing (NAICS 31-33) (lehd.ces.census.gov). We will use these fields for our sector job counts.

- 2020 Block Group Shapefiles (NYC): We obtain the TIGER/Line 2020 shapefile for Census block groups in New York State and filter to the five NYC counties (Bronx, Kings, New York, Queens, Richmond) (en.wikipedia.org). Block groups are subdivisions of census tracts (generally 600–3,000 people); each consists of the census blocks whose 4-digit block codes share the same first digit (catalog.data.gov). Block groups coded “0” represent water-only areas with no population (catalog.data.gov) and presumably no jobs, so we exclude them.

- GTFS stops.txt (NYC Subway): Provided by the user, containing subway stop locations. We filter for records with location_type = 1, which correspond to stations rather than individual stop platforms (gtfs.org). These station points serve as the centers of our 0.5-mile buffers.

The analysis proceeds through data download/preparation, constructing station buffers, spatially joining job data to geography, performing area-weighted interpolation, and aggregating results by station.
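
Because LODES is published by block while the geometry used here is by block group, job counts have to be rolled up before they can be joined. A 15-digit block GEOID is state (2 digits) + county (3) + tract (6) + block (4), and the block group is identified by its first 12 digits, which end with the first digit of the block code. A minimal sketch with made-up GEOIDs and job counts:

```python
import pandas as pd

# Hypothetical block-level records: two blocks in block group 1, one in block group 2
blocks = pd.DataFrame({
    "w_geocode": ["360050001001000", "360050001001001", "360050001002000"],
    "C000": [10, 5, 7],
})

# The first 12 characters of the block GEOID give the block-group GEOID
blocks["GEOID"] = blocks["w_geocode"].str[:12]

# Sum block-level jobs within each block group
bg_jobs = blocks.groupby("GEOID", as_index=False)["C000"].sum()

print(bg_jobs)
```

Keeping the geocode as a string matters here: slicing works on characters, and a numeric dtype would drop leading zeros for states whose FIPS code starts with 0.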

## Download Latest LODES WAC Data for NY (2021)

First, we download the Workplace Area Characteristics (WAC) file for New York 2021 from the Census LEHD site (census.gov). According to the Census Bureau, the LODES 8.1 release includes 2021 data for most states (census.gov). The file is a gzipped CSV with one record per workplace census block. We load it with pandas, whose read_csv can read a gzipped CSV directly from a URL, so no separate download step is needed.

Important: We limit the columns to those needed for efficiency. Specifically, we keep:

- w_geocode: Workplace Census Block (15-digit code)
- C000: Total number of jobs (lehd.ces.census.gov)
- CNS05: Manufacturing jobs (NAICS 31-33) (lehd.ces.census.gov)
- CNS07: Retail jobs (NAICS 44-45) (lehd.ces.census.gov)
- CNS15: Education jobs (NAICS 61) (lehd.ces.census.gov)
- CNS16: Health jobs (NAICS 62) (lehd.ces.census.gov)

These cover the total and the example sectors named above. (Additional sectors could be included similarly if needed.)
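
The sketch below shows, on a tiny in-memory gzipped CSV, how `usecols` trims the file and why the geocode is read as a string. The rows and the extra `createdate` column are invented for the example. The first GEOID uses state FIPS 01 only to show that a numeric dtype would drop a leading zero; New York codes start with 36 and would not be affected.

```python
import gzip
import io

import pandas as pd

# Invented two-row file in the general shape of a WAC extract
raw = (
    "w_geocode,C000,CNS05,CNS07,createdate\n"
    "010010201001000,12,0,3,20230101\n"
    "360050001001000,40,5,10,20230101\n"
)
gz_bytes = gzip.compress(raw.encode("utf-8"))

keep = ["w_geocode", "C000", "CNS07"]

# Read only the needed columns, keeping the geocode as text
df = pd.read_csv(
    io.BytesIO(gz_bytes),
    compression="gzip",
    usecols=keep,
    dtype={"w_geocode": "string"},
)

# For comparison: letting pandas infer the dtype turns the geocode into an integer
as_int = pd.read_csv(io.BytesIO(gz_bytes), compression="gzip", usecols=keep)

print(df)
print(as_int["w_geocode"].iloc[0])  # leading zero is gone
```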

## 1. Load LODES WAC Jobs (Block-Level)

In [116]:
import pandas as pd

def load_lodes_wac_jobs(lodes_path: str, job_cols=None) -> pd.DataFrame:
    """
    Load LODES Workplace Area Characteristics (WAC) jobs data for New York.

    Parameters
    ----------
    lodes_path : str
        Path or URL to the LODES WAC file (e.g., 'ny_wac_2021.csv.gz').
    job_cols : list of str, optional
        List of job columns to keep. If None, defaults to ['C000'] (total jobs).

    Returns
    -------
    pd.DataFrame
        DataFrame with at least:
        - 'w_geocode' : block-level GEOID (string)
        - job columns
    """
    if job_cols is None:
        job_cols = ["C000"]  # total jobs column in LODES WAC

    usecols = ["w_geocode"] + job_cols
    lodes_df = pd.read_csv(lodes_path, usecols=usecols, dtype={"w_geocode": "string"})

    # Ensure job columns are numeric
    for col in job_cols:
        lodes_df[col] = pd.to_numeric(lodes_df[col], errors="coerce").fillna(0)

    return lodes_df


## 2. Aggregate Jobs from Blocks to Block Groups

In [117]:
def aggregate_jobs_to_block_groups(lodes_df: pd.DataFrame,
                                   job_cols=None,
                                   geoid_col_out: str = "GEOID") -> pd.DataFrame:
    """
    Aggregate block-level LODES jobs to block-group level using GEOID truncation.

    Parameters
    ----------
    lodes_df : pd.DataFrame
        LODES WAC DataFrame with 'w_geocode' and job columns.
    job_cols : list of str, optional
        Job columns to aggregate. If None, uses all columns except 'w_geocode'.
    geoid_col_out : str, default 'GEOID'
        Name of the output GEOID column (block-group level).

    Returns
    -------
    pd.DataFrame
        DataFrame with:
        - geoid_col_out (block-group GEOID, 12 chars)
        - aggregated job columns
    """
    if job_cols is None:
        job_cols = [c for c in lodes_df.columns if c != "w_geocode"]

    df = lodes_df.copy()
    # Block-group GEOID is the first 12 digits of the 15-digit block GEOID
    df[geoid_col_out] = df["w_geocode"].str.slice(0, 12)

    grouped = (
        df.groupby(geoid_col_out, as_index=False)[job_cols]
        .sum()
    )

    return grouped


## 3. Attach Jobs to Block-Group GeoDataFrame

In [118]:
import geopandas as gpd

def attach_jobs_to_block_groups(block_groups_gdf: gpd.GeoDataFrame,
                                jobs_df: pd.DataFrame,
                                bg_geoid_col: str = "GEOID",
                                jobs_geoid_col: str = "GEOID") -> gpd.GeoDataFrame:
    """
    Merge aggregated jobs onto the block-group GeoDataFrame.

    Parameters
    ----------
    block_groups_gdf : gpd.GeoDataFrame
        Block-group geometries with a GEOID column.
    jobs_df : pd.DataFrame
        Block-group level jobs table with a GEOID column.
    bg_geoid_col : str, default 'GEOID'
        GEOID column name in block_groups_gdf.
    jobs_geoid_col : str, default 'GEOID'
        GEOID column name in jobs_df.

    Returns
    -------
    gpd.GeoDataFrame
        GeoDataFrame with job columns attached.
    """
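    gdf = block_groups_gdf.copy()
    jobs = jobs_df.copy()  # avoid mutating the caller's DataFrame

    gdf[bg_geoid_col] = gdf[bg_geoid_col].astype(str)
    jobs[jobs_geoid_col] = jobs[jobs_geoid_col].astype(str)

    merged = gdf.merge(
        jobs,
        left_on=bg_geoid_col,
        right_on=jobs_geoid_col,
        how="left"
    )

    # Block groups without any LODES record have no jobs: fill NaN with 0
    job_cols = [c for c in jobs.columns if c != jobs_geoid_col and c in merged.columns]
    merged[job_cols] = merged[job_cols].fillna(0)

    return merged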


## 4. Generic Catchment Calculator (Area-Weighted for Any Attribute)

In [119]:
import geopandas as gpd
import pandas as pd

def compute_catchment_for_attribute(
    stations_gdf: gpd.GeoDataFrame,
    block_groups_gdf: gpd.GeoDataFrame,
    attribute_name: str,
    buffer_col: str = "buffer_geom",
    area_col: str = "area",
    id_cols: tuple = ("stop_id", "stop_name"),
    bg_sindex=None
) -> pd.DataFrame:
    """
    Compute area-weighted catchment totals for a given attribute within station buffers.

    Parameters
    ----------
    stations_gdf : gpd.GeoDataFrame
        Stations with projected geometry and a buffer geometry column.
    block_groups_gdf : gpd.GeoDataFrame
        Block groups with projected geometry, precomputed area, and the attribute.
    attribute_name : str
        Name of the numeric attribute on block_groups_gdf (e.g., 'population', 'C000', 'jobs').
    buffer_col : str, default 'buffer_geom'
        Column name of buffer polygon geometry in stations_gdf.
    area_col : str, default 'area'
        Column name of polygon area in block_groups_gdf.
    id_cols : tuple, default ('stop_id', 'stop_name')
        Columns to carry through to the output.
    bg_sindex : spatial index, optional
        Precomputed spatial index for block_groups_gdf. If None, will build one.

    Returns
    -------
    pd.DataFrame
        DataFrame with one row per station and a `catchment_{attribute_name}` column.
    """
    if bg_sindex is None:
        bg_sindex = block_groups_gdf.sindex

    results = []

    for idx, station in stations_gdf.iterrows():
        buffer_poly = station[buffer_col]
        if buffer_poly is None or buffer_poly.is_empty:
            catchment_value = 0.0
        else:
            candidate_idx = list(bg_sindex.intersection(buffer_poly.bounds))
            if not candidate_idx:
                catchment_value = 0.0
            else:
                candidates = block_groups_gdf.iloc[candidate_idx]
                intersecting = candidates[candidates.intersects(buffer_poly)].copy()

                if intersecting.empty:
                    catchment_value = 0.0
                else:
                    # Exact intersection areas
                    intersecting["intersection_area"] = intersecting.geometry.intersection(buffer_poly).area
                    # Area weights
                    intersecting["area_weight"] = intersecting["intersection_area"] / intersecting[area_col]
                    # Weighted attribute
                    intersecting["weighted_attr"] = intersecting["area_weight"] * intersecting[attribute_name]
                    catchment_value = float(intersecting["weighted_attr"].sum())

        # Carry through every requested identifier column, not only the first two
        record = {col: station[col] for col in id_cols}
        record[f"catchment_{attribute_name}"] = catchment_value

        # Optional: add coordinates from projected geometry (or original lat/lon if you prefer)
        record["x"] = station.geometry.x
        record["y"] = station.geometry.y

        results.append(record)

    return pd.DataFrame(results)


## 5. Jobs-Specific Wrapper

In [120]:
def compute_jobs_within_half_mile(
    stations_proj: gpd.GeoDataFrame,
    block_groups_proj: gpd.GeoDataFrame,
    jobs_col: str = "jobs",
    buffer_col: str = "buffer_geom",
    area_col: str = "area",
    id_cols: tuple = ("stop_id", "stop_name"),
) -> pd.DataFrame:
    """
    Compute jobs within 0.5 miles of each station using the generic catchment engine.

    Assumes:
    - stations_proj has a buffer geometry column (0.5 mile radius).
    - block_groups_proj has the given jobs_col and precomputed area.
    - Both layers are in the same metric CRS.

    Parameters
    ----------
    stations_proj : gpd.GeoDataFrame
        Projected stations with buffer geometry.
    block_groups_proj : gpd.GeoDataFrame
        Projected block groups with job counts.
    jobs_col : str, default 'jobs'
        Column name for jobs on block_groups_proj.
    buffer_col : str, default 'buffer_geom'
        Column name of buffer polygon geometry in stations_proj.
    area_col : str, default 'area'
        Column name of polygon area in block_groups_proj.
    id_cols : tuple, default ('stop_id', 'stop_name')
        Columns to carry through to the output.

    Returns
    -------
    pd.DataFrame
        DataFrame with one row per station and 'catchment_jobs'.
    """
    bg_sindex = block_groups_proj.sindex

    jobs_catchment_df = compute_catchment_for_attribute(
        stations_gdf=stations_proj,
        block_groups_gdf=block_groups_proj,
        attribute_name=jobs_col,
        buffer_col=buffer_col,
        area_col=area_col,
        id_cols=id_cols,
        bg_sindex=bg_sindex
    )

    # For convenience, rename catchment column to a fixed name
    jobs_catchment_df = jobs_catchment_df.rename(
        columns={f"catchment_{jobs_col}": "catchment_jobs"}
    )

    return jobs_catchment_df


In [121]:
import requests
from functools import reduce

# ---------------------------------------------------------------------
# 1. Load LODES WAC 2021 for NY with total jobs + all 20 industry sectors
# ---------------------------------------------------------------------

# LODES8 WAC file (New York, workplace area characteristics, 2021)
# lodes_path = "https://lehd.ces.census.gov/data/lodes/LODES8/ny/wac/ny_wac_2021.csv.gz"
lodes_path = "https://lehd.ces.census.gov/data/lodes/LODES8/ny/wac/ny_wac_S000_JT00_2021.csv.gz"


# All 20 industry columns CNS01..CNS20
industry_cols = [f"CNS{i:02d}" for i in range(1, 21)]

# Total jobs + industry columns
job_cols = ["C000"] + industry_cols

# Use Module 1: load_lodes_wac_jobs
lodes_df = load_lodes_wac_jobs(lodes_path, job_cols=job_cols)

# Use Module 2: aggregate_jobs_to_block_groups
jobs_bg_df = aggregate_jobs_to_block_groups(lodes_df, job_cols=job_cols)

# jobs_bg_df now has: GEOID, C000, CNS01..CNS20


# ---------------------------------------------------------------------
# 2. Attach jobs (total + by industry) to block-group geometries
# ---------------------------------------------------------------------

# Use Module 3: attach_jobs_to_block_groups
block_groups_nyc_with_jobs = attach_jobs_to_block_groups(block_groups_nyc, jobs_bg_df)

# At this point, block_groups_nyc_with_jobs contains:
# - geometry for each block group
# - GEOID
# - C000 (total jobs)
# - CNS01..CNS20 (jobs by industry)


# ---------------------------------------------------------------------
# 3. Project block groups to metric CRS and ensure area column
# ---------------------------------------------------------------------

projected_crs = "EPSG:32618"  # same CRS used for stations_proj and buffers

block_groups_proj = block_groups_nyc_with_jobs.to_crs(projected_crs)

if "area" not in block_groups_proj.columns:
    block_groups_proj["area"] = block_groups_proj.geometry.area


# ---------------------------------------------------------------------
# 4. Compute total jobs within 0.5 miles of each station
#    (using Module 5: compute_jobs_within_half_mile)
# ---------------------------------------------------------------------

# Here we tell Module 5 to use 'C000' as the jobs column
jobs_catchment_df = compute_jobs_within_half_mile(
    stations_proj=stations_proj,
    block_groups_proj=block_groups_proj,
    jobs_col="C000",           # total jobs column name
    buffer_col="buffer_geom",
    area_col="area",
    id_cols=("stop_id", "stop_name"),
)

# jobs_catchment_df has:
# - stop_id
# - stop_name
# - x, y (projected coordinates)
# - catchment_jobs  (total jobs within 0.5 miles)


# ---------------------------------------------------------------------
# 5. Compute jobs within 0.5 miles by industry (CNS01..CNS20)
#    using Module 4: compute_catchment_for_attribute
# ---------------------------------------------------------------------

bg_sindex = block_groups_proj.sindex
industry_catchment_dfs = []

for col in industry_cols:
    df_attr = compute_catchment_for_attribute(
        stations_gdf=stations_proj,
        block_groups_gdf=block_groups_proj,
        attribute_name=col,          # e.g. 'CNS01'
        buffer_col="buffer_geom",
        area_col="area",
        id_cols=("stop_id", "stop_name"),
        bg_sindex=bg_sindex
    )
    # Rename catchment_CNSxx -> catchment_cnsxx for cleaner column names
    df_attr = df_attr.rename(
        columns={f"catchment_{col}": f"catchment_{col.lower()}"}
    )
    industry_catchment_dfs.append(df_attr)

# ---------------------------------------------------------------------
# 6. Merge total jobs catchments with industry-specific catchments
# ---------------------------------------------------------------------

catchment_all = jobs_catchment_df.copy()

# Merge each industry catchment table by stop_id
for df_attr in industry_catchment_dfs:
    catchment_all = catchment_all.merge(
        df_attr.drop(columns=["stop_name", "x", "y"]),
        on="stop_id",
        how="left"
    )

# catchment_all now contains:
# - stop_id
# - stop_name
# - x, y (projected coordinates)
# - catchment_jobs        (total jobs within 0.5 miles)
# - catchment_cns01 ... catchment_cns20 (jobs by industry within 0.5 miles)

# Example: inspect the first few rows
catchment_all.head()


Unnamed: 0,stop_id,stop_name,catchment_jobs,x,y,catchment_cns01,catchment_cns02,catchment_cns03,catchment_cns04,catchment_cns05,...,catchment_cns11,catchment_cns12,catchment_cns13,catchment_cns14,catchment_cns15,catchment_cns16,catchment_cns17,catchment_cns18,catchment_cns19,catchment_cns20
0,101,Van Cortlandt Park-242 St,5323.15758,592786.565798,4527046.0,0.0,0.0,0.0,200.983764,5.45841,...,434.381465,34.831864,1.005934,117.129051,2355.88755,684.68095,23.103223,524.951167,93.706907,179.612473
1,103,238 St,7045.581151,592600.285284,4526536.0,0.0,0.0,0.0,415.832702,4.470685,...,723.955567,161.348894,7.242055,143.048887,1661.324412,1401.247679,33.425062,721.615802,196.735858,322.377761
2,104,231 St,7101.219383,592274.388599,4525886.0,0.0,0.0,0.0,365.129515,11.401835,...,538.257261,223.20682,12.335477,63.129821,1538.165648,1584.72961,44.252094,600.950456,192.764758,311.0
3,106,Marble Hill-225 St,6261.653527,591859.291238,4525404.0,0.0,0.0,0.0,173.322256,10.783227,...,448.600238,126.609629,10.96987,43.096075,1728.51328,1359.260802,87.039225,350.224826,214.288497,326.001417
4,107,215 St,5624.31127,591407.264497,4524830.0,0.0,0.0,0.0,27.462784,9.154062,...,151.143761,115.583955,105.0,97.642611,1034.550078,1132.283331,67.269092,458.959151,204.787778,656.916667


# Vehicle Availability near NYC Subway Stations (ACS 2020 Analysis)
## 1. Retrieve ACS 2020 Data for Household Vehicle Availability (B08201)

We first obtain American Community Survey (ACS) 5-Year 2020 estimates for table B08201: Household Size by Vehicles Available. We need two variables from this table:

- B08201_001E – Total households (api.census.gov)
- B08201_002E – Households with no vehicle available (api.census.gov)

These correspond to the total number of households and the subset with zero vehicles, respectively (the table also includes households with 1, 2, 3, and 4+ vehicles) (censusreporter.org). The loader below queries the Census API directly with requests at the tract level for the five NYC counties; a later step downscales the tract totals to block groups using block-group population shares. Libraries such as CenPy or censusdata could be used to issue the same query.

## Generic ACS Tract-Level Loader for NYC

In [122]:
import requests
import pandas as pd

def load_acs_tracts_nyc(
    api_key: str,
    variables: list,
    year: int = 2020,
    state_fips: str = "36",
    nyc_county_fips: list = None
) -> pd.DataFrame:
    """
    Load ACS 5-year data for specified variables at the tract level
    for NYC counties.

    Parameters
    ----------
    api_key : str
        Census API key.
    variables : list of str
        List of ACS variable codes (e.g., ['B08201_001E', 'B08201_002E']).
    year : int, default 2020
        ACS year.
    state_fips : str, default '36'
        State FIPS (36 = NY).
    nyc_county_fips : list of str, optional
        County FIPS codes for NYC. If None, uses the five boroughs.

    Returns
    -------
    pd.DataFrame
        Tract-level DataFrame with:
        - one col per requested variable (numeric)
        - 'state', 'county', 'tract'
        - 'TRACT_GEOID' = 11-digit state+county+tract
    """
    if nyc_county_fips is None:
        nyc_county_fips = ["005", "047", "061", "081", "085"]

    base_url = f"https://api.census.gov/data/{year}/acs/acs5"
    get_vars = ",".join(variables)

    dfs = []
    for county in nyc_county_fips:
        url = (
            f"{base_url}?get={get_vars}"
            f"&for=tract:*&in=state:{state_fips}&in=county:{county}"
            f"&key={api_key}"
        )
        resp = requests.get(url)
        resp.raise_for_status()
        data = resp.json()
        cols = data[0]
        rows = data[1:]
        df = pd.DataFrame(rows, columns=cols)
        dfs.append(df)

    df_all = pd.concat(dfs, ignore_index=True)

    # convert numeric vars
    for v in variables:
        df_all[v] = pd.to_numeric(df_all[v], errors="coerce").fillna(0)

    # 11-digit tract GEOID
    df_all["TRACT_GEOID"] = (
        df_all["state"] + df_all["county"] + df_all["tract"]
    )

    return df_all


## 2. Disaggregate Tract-Level Households to Block Groups

In [126]:
import geopandas as gpd

def disaggregate_tract_vehicles_to_block_groups(
    block_groups_gdf: gpd.GeoDataFrame,
    veh_tract_df: pd.DataFrame,
    pop_col: str = "population"
) -> gpd.GeoDataFrame:
    """
    Downscale tract-level vehicle availability (hh_total, hh_0veh)
    to block groups using block-group population as weights.
    """
    gdf = block_groups_gdf.copy()

    # Make sure tract ID exists on block groups
    gdf["TRACT_GEOID"] = (
        gdf["STATEFP"] + gdf["COUNTYFP"] + gdf["TRACTCE"]
    )

    # Merge tract-level vehicle totals
    gdf = gdf.merge(
        veh_tract_df[["TRACT_GEOID", "hh_total", "hh_0veh"]],
        on="TRACT_GEOID",
        how="left"
    )

    # Tract-level population totals for weighting
    tract_pop = gdf.groupby("TRACT_GEOID")[pop_col].transform("sum")

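    # Population share within tract (weights); tracts with zero population get weight 0.
    # Series.where turns zero totals into NaN, which keeps the dtype numeric and
    # avoids the fillna downcasting warning that pd.NA triggers.
    weights = (gdf[pop_col] / tract_pop.where(tract_pop > 0)).fillna(0)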

    # Disaggregate to block groups
    gdf["hh_total_bg"] = gdf["hh_total"] * weights
    gdf["hh_0veh_bg"] = gdf["hh_0veh"] * weights

    return gdf
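
A toy example may help show what the population-share weighting does. The sketch below uses invented values and placeholder IDs: tract t1 has 100 households split across two block groups with 300 and 100 residents, and tract t2 has 50 households but a single block group with zero population.

```python
import pandas as pd

# Invented block groups; hh_total is the tract total repeated on each block group
bg = pd.DataFrame({
    "GEOID": ["bg1", "bg2", "bg3"],
    "TRACT_GEOID": ["t1", "t1", "t2"],
    "population": [300, 100, 0],
    "hh_total": [100, 100, 50],
})

# Tract population, broadcast back to each block-group row
tract_pop = bg.groupby("TRACT_GEOID")["population"].transform("sum")

# Each block group's share of its tract's population (0 where the tract is empty)
weights = (bg["population"] / tract_pop.where(tract_pop > 0)).fillna(0)

# Allocate tract households in proportion to population
bg["hh_total_bg"] = bg["hh_total"] * weights

print(bg[["GEOID", "hh_total_bg"]])
```

Block groups bg1 and bg2 receive 75 and 25 households. The 50 households of tract t2 are dropped because the tract has no population to weight by, which is a limitation of this weighting scheme to keep in mind.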


In [123]:
veh_vars = ["B08201_001E", "B08201_002E"]  # total hh, no-vehicle hh

veh_tract_raw = load_acs_tracts_nyc(
    api_key=API_KEY,
    variables=veh_vars,
    year=2020,
    state_fips="36",
    nyc_county_fips=nyc_county_fips
)

veh_tract = veh_tract_raw[["TRACT_GEOID"] + veh_vars].copy()
veh_tract = veh_tract.rename(
    columns={
        "B08201_001E": "hh_total",
        "B08201_002E": "hh_0veh"
    }
)

print(veh_tract.head())
print(veh_tract[["hh_total", "hh_0veh"]].describe())


   TRACT_GEOID  hh_total  hh_0veh
0  36005019900      3145     2524
1  36005020000      1466      857
2  36005020100      1288      913
3  36005020200       875      497
4  36005020400       994      354
         hh_total      hh_0veh
count  2327.00000  2327.000000
mean   1371.59046   751.487752
std     904.81293   761.749639
min       0.00000     0.000000
25%     749.00000   201.500000
50%    1208.00000   525.000000
75%    1790.00000  1065.500000
max    8078.00000  6076.000000


In [128]:
# Disaggregate tract-level households to block groups using population weights

block_groups_proj_with_veh = disaggregate_tract_vehicles_to_block_groups(
    block_groups_gdf=block_groups_proj,  # your existing projected BG layer with 'population'
    veh_tract_df=veh_tract,
    pop_col="population"
)

# Ensure area exists for catchment calculations
if "area" not in block_groups_proj_with_veh.columns:
    block_groups_proj_with_veh["area"] = block_groups_proj_with_veh.geometry.area




In [129]:
bg_sindex = block_groups_proj_with_veh.sindex

# Total households within 0.5 miles
hh_total_catch = compute_catchment_for_attribute(
    stations_gdf=stations_proj,
    block_groups_gdf=block_groups_proj_with_veh,
    attribute_name="hh_total_bg",
    buffer_col="buffer_geom",
    area_col="area",
    id_cols=("stop_id", "stop_name"),
    bg_sindex=bg_sindex
)

# Zero-vehicle households within 0.5 miles
hh_0veh_catch = compute_catchment_for_attribute(
    stations_gdf=stations_proj,
    block_groups_gdf=block_groups_proj_with_veh,
    attribute_name="hh_0veh_bg",
    buffer_col="buffer_geom",
    area_col="area",
    id_cols=("stop_id", "stop_name"),
    bg_sindex=bg_sindex
)

# Assemble final vehicle-availability table

veh_catch = hh_total_catch[
    ["stop_id", "stop_name", "x", "y", "catchment_hh_total_bg"]
].rename(columns={"catchment_hh_total_bg": "catchment_hh_total"})

veh_catch = veh_catch.merge(
    hh_0veh_catch[["stop_id", "catchment_hh_0veh_bg"]].rename(
        columns={"catchment_hh_0veh_bg": "catchment_hh_0veh"}
    ),
    on="stop_id",
    how="left"
)

veh_catch["share_hh_0veh"] = (
    veh_catch["catchment_hh_0veh"] / veh_catch["catchment_hh_total"]
)

veh_catch.head()


Unnamed: 0,stop_id,stop_name,x,y,catchment_hh_total,catchment_hh_0veh,share_hh_0veh
0,101,Van Cortlandt Park-242 St,592786.565798,4527046.0,6608.535281,2849.703971,0.431216
1,103,238 St,592600.285284,4526536.0,12454.784914,6141.814385,0.493129
2,104,231 St,592274.388599,4525886.0,17059.822583,9846.745263,0.577189
3,106,Marble Hill-225 St,591859.291238,4525404.0,15435.797888,9332.703173,0.604614
4,107,215 St,591407.264497,4524830.0,11176.962351,7651.186986,0.68455


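Note: the following cell is an earlier run (In [125]) that applies the same catchment calculation to the tract-level columns hh_total and hh_0veh directly, without the population-share downscaling. Since every block group in a tract then carries the full tract total, the household counts are inflated: Van Cortlandt Park-242 St shows about 26,816 households here versus about 6,609 in the disaggregated result above, against a catchment population of about 16,760. The zero-vehicle shares are similar in both runs, but the counts from the disaggregated run are the ones to use.

In [125]: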
if "area" not in block_groups_proj_with_veh.columns:
    block_groups_proj_with_veh["area"] = block_groups_proj_with_veh.geometry.area

bg_sindex = block_groups_proj_with_veh.sindex

hh_total_catch = compute_catchment_for_attribute(
    stations_gdf=stations_proj,
    block_groups_gdf=block_groups_proj_with_veh,
    attribute_name="hh_total",
    buffer_col="buffer_geom",
    area_col="area",
    id_cols=("stop_id", "stop_name"),
    bg_sindex=bg_sindex
)

hh_0veh_catch = compute_catchment_for_attribute(
    stations_gdf=stations_proj,
    block_groups_gdf=block_groups_proj_with_veh,
    attribute_name="hh_0veh",
    buffer_col="buffer_geom",
    area_col="area",
    id_cols=("stop_id", "stop_name"),
    bg_sindex=bg_sindex
)

veh_catch = hh_total_catch[
    ["stop_id", "stop_name", "x", "y", "catchment_hh_total"]
].merge(
    hh_0veh_catch[["stop_id", "catchment_hh_0veh"]],
    on="stop_id",
    how="left"
)

veh_catch["share_hh_0veh"] = (
    veh_catch["catchment_hh_0veh"] / veh_catch["catchment_hh_total"]
)

veh_catch.head()


Unnamed: 0,stop_id,stop_name,x,y,catchment_hh_total,catchment_hh_0veh,share_hh_0veh
0,101,Van Cortlandt Park-242 St,592786.565798,4527046.0,26815.943989,11702.900985,0.436416
1,103,238 St,592600.285284,4526536.0,50214.163885,24724.281993,0.492377
2,104,231 St,592274.388599,4525886.0,72637.54375,42179.069723,0.580679
3,106,Marble Hill-225 St,591859.291238,4525404.0,65410.264577,39532.278397,0.604374
4,107,215 St,591407.264497,4524830.0,51049.206155,35282.209472,0.691141
