# External Data Collection

This notebook downloads external datasets to support our Transportation Equity Demand Index (TEDI) analysis:
- **Census Demographics (2022)**: Median household income and household vehicle ownership
- **Weather Data**: Daily temperature, precipitation, and snow for January-June of 2024 and 2025

Note: The 2022 ACS 5-year estimates serve as a fixed demographic baseline. Demographic characteristics change slowly, so the same values give consistent equity measures across the 2024-2025 study period.

In [1]:
import os
import requests
import pandas as pd

# Set up external data directory
os.makedirs("../data/external", exist_ok=True)

# Get your Census API key from: https://api.census.gov/data/key_signup.html
# and export it as the CENSUS_API_KEY environment variable before running.
CENSUS_API_KEY = os.getenv("CENSUS_API_KEY")

## Querying the ACS 2022 5-Year Census API

We use the following variables:
- `B19013_001E`: Median Household Income
- `B08201_001E`: Total Households
- `B08201_002E`: Households Without a Vehicle

These values are retrieved for all **PUMAs** (Public Use Microdata Areas) in **New York State (state FIPS code 36)**.
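
To see how the `for` and `in` geography clauses end up in the request, the query string can be built and inspected without calling the API. The following is a minimal sketch using only the standard library; `build_acs_url` is a hypothetical helper introduced here for illustration, not part of the notebook or of any Census client.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.census.gov/data/2022/acs/acs5"

def build_acs_url(variables, state_fips, api_key=None):
    """Return the query URL for all PUMAs in one state (hypothetical helper)."""
    params = {
        "get": ",".join(variables),
        "for": "public use microdata area:*",  # every PUMA
        "in": f"state:{state_fips}",           # restricted to one state
    }
    if api_key:  # leave the key out entirely when none is configured
        params["key"] = api_key
    return f"{BASE_URL}?{urlencode(params)}"

url = build_acs_url(["NAME", "B19013_001E", "B08201_001E", "B08201_002E"], "36")

# Decode the query string again to check what the server would receive
query = parse_qs(urlsplit(url).query)
print(query["for"], query["in"])
```

Printing `url` before sending the request is a cheap way to confirm the geography and variable list when the API returns an unexpected response.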

In [2]:
# Define API request
base_url = "https://api.census.gov/data/2022/acs/acs5"
get_vars = [
    "NAME",
    "B19013_001E",  # Median Household Income
    "B08201_001E",  # Total Households
    "B08201_002E"   # Households Without a Vehicle
]
params = {
    "get": ",".join(get_vars),
    "for": "public use microdata area:*",
    "in": "state:36",  # NY State
    "key": CENSUS_API_KEY
}

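# Query API (fail early on HTTP errors instead of a confusing JSON decode error)
response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()
data = response.json()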

# Convert to DataFrame
columns = data[0]
acs_df = pd.DataFrame(data[1:], columns=columns)

# Convert numeric columns (the API returns values as strings)
numeric_cols = ["B19013_001E", "B08201_001E", "B08201_002E"]
acs_df[numeric_cols] = acs_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Compute % of households without a vehicle
acs_df["Percent_No_Vehicle"] = 100 * acs_df["B08201_002E"] / acs_df["B08201_001E"]
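
ACS estimate fields can carry large negative placeholder codes (for example `-666666666`) when an estimate is unavailable, and a PUMA with zero households would make the percentage undefined. Neither case is handled above. The sketch below shows one way to guard against both; the helper names are hypothetical and the rows are toy values, not real ACS output.

```python
import numpy as np
import pandas as pd

def clean_acs_estimates(df, estimate_cols):
    """Replace negative ACS placeholder codes with NaN (hypothetical helper)."""
    out = df.copy()
    for col in estimate_cols:
        values = pd.to_numeric(out[col], errors="coerce")
        out[col] = values.where(values >= 0)  # negatives become NaN
    return out

def percent_no_vehicle(df):
    """Share of households without a vehicle; NaN when there are no households."""
    total = df["B08201_001E"].replace(0, np.nan)
    return 100 * df["B08201_002E"] / total

# Toy rows for illustration only, not real ACS output
toy = pd.DataFrame({
    "B19013_001E": ["65000", "-666666666"],
    "B08201_001E": ["1000", "0"],
    "B08201_002E": ["250", "0"],
})
toy = clean_acs_estimates(toy, ["B19013_001E", "B08201_001E", "B08201_002E"])
toy["Percent_No_Vehicle"] = percent_no_vehicle(toy)
print(toy)
```

Treating any negative estimate as missing is an assumption that fits counts and incomes, which cannot be negative; it would not suit variables that can legitimately fall below zero.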

## Save Dataset

We now save the cleaned ACS dataset to a CSV file in `../data/external/`.


In [3]:
# Rename and reorder for clarity
acs_df = acs_df.rename(columns={
    "B19013_001E": "Median_Income",
    "B08201_001E": "Total_Households",
    "B08201_002E": "No_Vehicle_Households",
    "public use microdata area": "PUMA"
})

acs_df = acs_df[[
    "PUMA", "NAME", "Median_Income", 
    "Total_Households", "No_Vehicle_Households", "Percent_No_Vehicle"
]]

# Save
acs_path = "../data/external/census_acs_puma_2022.csv"
acs_df.to_csv(acs_path, index=False)
print("Saved ACS data to:", acs_path)

Saved ACS data to: ../data/external/census_acs_puma_2022.csv


## Weather Data Collection
Weather affects transportation demand and equity: on rainy or cold days, taxi usage may rise in areas with limited public transit.

In [4]:
from meteostat import Point, Daily
from datetime import datetime

# NYC coordinates: latitude, longitude, elevation in meters
nyc = Point(40.7128, -74.0060, 10)

print("Downloading NYC weather data using meteostat...")

Downloading NYC weather data using meteostat...


In [5]:
# Get weather data for our specific periods
# Training period: Jan-June 2024
start_2024 = datetime(2024, 1, 1)
end_2024 = datetime(2024, 6, 30)

# Testing period: Jan-June 2025
start_2025 = datetime(2025, 1, 1) 
end_2025 = datetime(2025, 6, 30)

# Fetch data
print("Fetching 2024 weather data...")
weather_2024 = Daily(nyc, start_2024, end_2024).fetch()

print("Fetching 2025 weather data...")
weather_2025 = Daily(nyc, start_2025, end_2025).fetch()

# Combine datasets
weather_combined = pd.concat([weather_2024, weather_2025])
print(f"✓ Retrieved {len(weather_combined)} days of weather data")

Fetching 2024 weather data...
Fetching 2025 weather data...
✓ Retrieved 363 days of weather data
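
The reported 363 days can be checked against the calendar: January through June has 182 days in 2024 (a leap year) and 181 in 2025. The sketch below computes the expected dates with the standard library and lists any that are absent; `expected_days` and `missing_days` are hypothetical helpers introduced for this check.

```python
from datetime import date, timedelta

def expected_days(start, end):
    """All calendar dates from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

def missing_days(observed, periods):
    """Dates in the requested periods that are absent from `observed`."""
    expected = [d for start, end in periods for d in expected_days(start, end)]
    seen = set(observed)
    return [d for d in expected if d not in seen]

periods = [
    (date(2024, 1, 1), date(2024, 6, 30)),  # training period
    (date(2025, 1, 1), date(2025, 6, 30)),  # testing period
]
total = sum(len(expected_days(s, e)) for s, e in periods)
print(total)  # 182 + 181 = 363
```

Assuming the combined frame keeps its daily `DatetimeIndex`, a call such as `missing_days(weather_combined.index.date, periods)` should return an empty list when the download is complete.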


In [7]:
weather_combined.columns

Index(['tavg', 'tmin', 'tmax', 'prcp', 'snow', 'wdir', 'wspd', 'wpgt', 'pres',
       'tsun'],
      dtype='object')

In [10]:
# Process and save weather data
weather_df = weather_combined.reset_index()
weather_df = weather_df.rename(columns={
    'time': 'date',
    'tavg': 'temperature_avg',   # daily mean temperature, °C
    'prcp': 'precipitation_mm',  # daily precipitation total, mm
    'snow': 'snow_mm'            # snow depth, mm
})

# Keep only needed columns 
weather_final = weather_df[['date', 'temperature_avg', 'precipitation_mm', 'snow_mm']].copy()

# Save to file
weather_path = "../data/external/weather_2024_2025.csv" 
weather_final.to_csv(weather_path, index=False)

print(f"✓ Saved weather data: {len(weather_final)} days")
print(f"✓ Coverage: {weather_final['date'].min()} to {weather_final['date'].max()}")
print(weather_final.head())

✓ Saved weather data: 363 days
✓ Coverage: 2024-01-01 00:00:00 to 2025-06-30 00:00:00
        date  temperature_avg  precipitation_mm  snow_mm
0 2024-01-01              6.0               0.0      0.0
1 2024-01-02              3.5               0.0      0.0
2 2024-01-03              4.0               0.0      0.0
3 2024-01-04              4.3               0.0      0.0
4 2024-01-05              0.9               0.0      0.0
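
The motivation given for collecting weather, that rainy or cold days may shift taxi demand, suggests turning the daily values into simple indicators before joining them to trip data. The sketch below is one possible approach, not a step from this notebook: `add_weather_flags` is a hypothetical helper, and the thresholds (1.0 mm of precipitation, 0 °C mean temperature) are illustrative assumptions that would need tuning.

```python
import pandas as pd

def add_weather_flags(df, rain_mm=1.0, cold_c=0.0):
    """Add boolean is_rainy / is_cold columns (thresholds are illustrative)."""
    out = df.copy()
    out["is_rainy"] = out["precipitation_mm"] >= rain_mm
    out["is_cold"] = out["temperature_avg"] <= cold_c
    return out

# Toy rows for illustration only, not Meteostat output
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "temperature_avg": [6.0, -2.5, 1.0],
    "precipitation_mm": [0.0, 0.0, 12.4],
})
flagged = add_weather_flags(toy)
print(flagged[["date", "is_rainy", "is_cold"]])
```

Days with missing measurements compare as `False` under both flags, so it may be worth counting NaN values before relying on the indicators.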
