In [12]:
import numpy as np
import pandas as pd
import requests

## Goals

Predict the number of riders on a given day from the day of the week, whether it is a holiday, temperature, chance of rain, and something like the estimated PTO people take that week.

I'm not using the week's date as a predictor; that feels like it would overfit, or maybe interfere with the PTO feature. I'll figure it out.

## WMATA Dataset

Downloaded on September 8, 2024. The data starts on January 1, 2023 and runs through the week before the download.

In [5]:
wmata_full = pd.read_csv("Daily Ridership - Station-Level Full data.csv")
wmata_full.head()

Unnamed: 0,Date,Mode,Day of Week,Holiday,Station Name,Weekday / Saturday / Sunday,Entries Or Boardings
0,5/22/2021 12:00:00 AM,Rail,Sat,No,Addison Road,Saturday,15
1,5/29/2021 12:00:00 AM,Rail,Sat,No,Addison Road,Saturday,216
2,6/5/2021 12:00:00 AM,Rail,Sat,No,Addison Road,Saturday,255
3,6/12/2021 12:00:00 AM,Rail,Sat,No,Addison Road,Saturday,334
4,6/19/2021 12:00:00 AM,Rail,Sat,No,Addison Road,Saturday,271


In [8]:
# making the date column simpler, just dates and no times
# normalize() sets every timestamp to midnight and keeps the datetime64 dtype
wmata_full['Date'] = pd.to_datetime(wmata_full['Date']).dt.normalize()
wmata_full.head()

Unnamed: 0,Date,Mode,Day of Week,Holiday,Station Name,Weekday / Saturday / Sunday,Entries Or Boardings
0,2021-05-22,Rail,Sat,No,Addison Road,Saturday,15
1,2021-05-29,Rail,Sat,No,Addison Road,Saturday,216
2,2021-06-05,Rail,Sat,No,Addison Road,Saturday,255
3,2021-06-12,Rail,Sat,No,Addison Road,Saturday,334
4,2021-06-19,Rail,Sat,No,Addison Road,Saturday,271


In [9]:
wmata_full['Month'] = wmata_full['Date'].dt.month
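
Since the goal is a prediction per day, the station-level rows eventually need to be collapsed into one row per date. Below is a minimal sketch of that step on made-up rows shaped like `wmata_full`. The function name `daily_features`, the output column names, and the assumption that holidays are marked `"Yes"` in the `Holiday` column are mine, not something the dataset confirms.

```python
import pandas as pd

# Toy rows shaped like wmata_full; the ridership numbers are invented for illustration.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-02", "2023-01-02", "2023-01-03"]),
    "Holiday": ["Yes", "Yes", "No"],  # assumption: holidays are flagged as "Yes"
    "Station Name": ["Takoma", "Pentagon", "Takoma"],
    "Entries Or Boardings": [100, 250, 400],
})

def daily_features(df):
    """Collapse station-level rows into one row per day with calendar features."""
    daily = (
        df.groupby("Date", as_index=False)
          .agg(riders=("Entries Or Boardings", "sum"),
               holiday=("Holiday", "first"))
    )
    daily["day_of_week"] = daily["Date"].dt.day_name()
    daily["month"] = daily["Date"].dt.month
    daily["is_holiday"] = daily["holiday"].eq("Yes")
    return daily.drop(columns="holiday")

daily = daily_features(sample)
print(daily)
```

Weather and PTO columns could later be joined onto this frame by date.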

In [14]:
# I need zip codes for each station, so that then I can look at weather in that zip code for the day
# Thankfully, MD has a dataset for all the stations in the DMV
metro_stations_full = pd.read_csv("Maryland_Transit_-_WMATA_Metro_Stops.csv")
metro_stations_full.head()

metro_stations = metro_stations_full[["NAME","ADDRESS","MetroLine"]]

# but they don't freaking have zip codes


In [16]:
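# Read the OpenCage API key from the environment instead of hard-coding it in the notebook
# (the variable name OPENCAGE_API_KEY is just a convention picked here)
import os
API_KEY = os.environ['OPENCAGE_API_KEY']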
BASE_URL = 'https://api.opencagedata.com/geocode/v1/json'

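def get_zip_code(address):
    """Return the postcode OpenCage reports for an address, or None if there is no result."""
    params = {
        'q': address,
        'key': API_KEY
    }
    response = requests.get(BASE_URL, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors such as a bad key or a rate limit
    data = response.json()

    if data['results']:
        # use a separate name so the address argument is not overwritten
        components = data['results'][0].get('components', {})
        return components.get('postcode')
    return None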

# Assuming the addresses are in a column named 'ADDRESS'.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
metro_stations = metro_stations.assign(ZipCode=metro_stations['ADDRESS'].apply(get_zip_code))

metro_stations.head()


Unnamed: 0,NAME,ADDRESS,MetroLine,ZipCode
0,College Park-U of Md,"4931 CALVERT ROAD, COLLEGE PARK, MD","green, yellow",20740
1,Capitol Heights,"133 CENTRAL AVENUE, CAPITOL HEIGHTS, MD","blue, orange, silver",20019
2,Morgan Boulevard,"300 GARRETT MORGAN BLVD., LANDOVER, MD","blue, orange, silver",20785
3,Georgia Ave Petworth,"3700 GEORGIA AVENUE NW, WASHINGTON, DC","green, yellow",20009
4,Takoma,"327 CEDAR STREET NW, WASHINGTON, DC",red,20012
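
The geocoding cell above makes one API request per row. A small sketch of two helpers that could make that step easier to test and cheaper to rerun: one pulls the postcode out of a response payload, the other calls a lookup function only once per distinct address. Both names (`extract_postcode`, `lookup_unique`) are hypothetical, and the payload shape is assumed to be the one `get_zip_code` already reads (`results[0]['components']['postcode']`).

```python
def extract_postcode(payload):
    """Return the postcode of the first geocoding result, or None if there is none."""
    results = payload.get("results") or []
    if not results:
        return None
    return results[0].get("components", {}).get("postcode")

def lookup_unique(addresses, lookup):
    """Call lookup once per distinct address and return {address: result}."""
    cache = {}
    for address in addresses:
        if address not in cache:
            cache[address] = lookup(address)
    return cache

# Offline check with hand-written payloads; no network call is made here.
print(extract_postcode({"results": [{"components": {"postcode": "20740"}}]}))
print(extract_postcode({"results": []}))
```

With the real data this might be used as `metro_stations['ADDRESS'].map(lookup_unique(metro_stations['ADDRESS'], get_zip_code))`, which gives the same column while skipping repeated addresses.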


In [31]:
# we got zip codes! Let's see a list of unique ones.
metro_stations["ZipCode"].unique()
metro_stations[metro_stations["ZipCode"].isna()]
# we have 8 stations with no zip code, may need to enter those manually

# .copy() gives an independent frame, so the assignment below does not trigger SettingWithCopyWarning
manual_zipcodes = metro_stations[metro_stations['ZipCode'].isna()].copy()
# zip codes are listed in row order; McLean uses 22102, the zip code given in its ADDRESS
manual_zipcodes["ZipCode"] = ['20784','22211','22202','20745','20852','22202','22303','22102']
manual_zipcodes


Unnamed: 0,NAME,ADDRESS,MetroLine,ZipCode
16,New Carrollton,"4700 GARDEN CITY DRIVE, NEW CARROLLTON, MD",orange,20784
22,Arlington|Cemetery,"1000 NORTH MEMORIAL DRIVE, ARLINGTON, VA",blue,22211
24,Pentagon,"2 SOUTH ROTARY ROAD, ARLINGTON, VA","blue, yellow",22202
29,Southern Ave,"1411 SOUTHERN AVENUE, TEMPLE HILLS, MD",green,20745
39,White Flint,"5500 MARINELLI ROAD, ROCKVILLE, MD",red,20852
58,Ronald Reagan Washington National Airport,"2400 S. SMITH BLVD., ARLINGTON, VA","blue, yellow",22202
76,Huntington,"2701 HUNTINGTON AVENUE, ALEXANDRIA, VA",yellow,22303
89,McLean,"1824 DOLLEY MADISON BOULIVARD MCLEAN, VA 22102",silver,22102
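
Entering the zip codes as a positional list only works while the rows stay in the same order, and the values live in a separate frame rather than in the main one. A rough alternative, sketched on a three-row toy frame, keys the manual values by station name and fills only the missing entries. The dictionary name `manual_zip_by_name` is made up; the zip codes shown are the ones already used above.

```python
import pandas as pd

# Toy frame with the same NAME / ZipCode columns as metro_stations.
stations = pd.DataFrame({
    "NAME": ["Takoma", "Pentagon", "Huntington"],
    "ZipCode": ["20012", None, None],
})

# Hypothetical lookup keyed by station name instead of row position.
manual_zip_by_name = {"Pentagon": "22202", "Huntington": "22303"}

# map() builds a Series of manual values aligned to the rows;
# fillna() only uses it where the geocoder returned nothing.
stations["ZipCode"] = stations["ZipCode"].fillna(
    stations["NAME"].map(manual_zip_by_name)
)
print(stations)
```

Rows that already have a zip code are left untouched, so the fill can be rerun safely after the geocoding step.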


## Joining datasets

In [54]:
wmata_clean = wmata_full.dropna(subset=['Station Name'])
# compare the distinct station names in each dataset
wmata_stations_list = sorted(wmata_clean['Station Name'].unique())
gis_stations_list = sorted(metro_stations['NAME'].unique())
print(len(wmata_stations_list))  # 98 stations
print(len(gis_stations_list))

98
91


In [63]:
# the names of the stations don't always match
print(wmata_stations_list[0])
gis_stations_list[0]

'Addison Road Seat Pleasant'

In [57]:
for st in wmata_stations_list:
    if st not in gis_stations_list:
        print("Station " + st + " not in GIS dataset")

Station Addison Road not in GIS dataset
Station Archives not in GIS dataset
Station Arlington Cemetery not in GIS dataset
Station Ashburn not in GIS dataset
Station Downtown Largo not in GIS dataset
Station Dulles Airport not in GIS dataset
Station Dunn Loring not in GIS dataset
Station Foggy Bottom-GWU not in GIS dataset
Station Gallery Place not in GIS dataset
Station Georgia Ave-Petworth not in GIS dataset
Station Herndon not in GIS dataset
Station Hyattsville Crossing not in GIS dataset
Station Innovation Center not in GIS dataset
Station Judiciary Square not in GIS dataset
Station King St-Old Town not in GIS dataset
Station Loudoun Gateway not in GIS dataset
Station Mt Vernon Sq not in GIS dataset
Station Navy Yard-Ballpark not in GIS dataset
Station NoMa-Gallaudet U not in GIS dataset
Station North Bethesda not in GIS dataset
Station Potomac Yard not in GIS dataset
Station Reston Town Center not in GIS dataset
Station Shaw-Howard U not in GIS dataset
Station Stadium-Armory not in GIS dataset
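
Many of these misses look like formatting differences rather than genuinely missing stations (`Georgia Ave-Petworth` vs `Georgia Ave Petworth`, `Arlington Cemetery` vs `Arlington|Cemetery`). One possible way to propose matches before joining, sketched with the standard library's `difflib`: normalize punctuation and case, then take the closest GIS name above a similarity cutoff. The helper names and the 0.6 cutoff are assumptions, and the suggestions would still need a manual check, especially for renamed stations.

```python
import difflib

# A few names taken from the two datasets above.
wmata_names = ["Addison Road", "Georgia Ave-Petworth", "Arlington Cemetery"]
gis_names = ["Addison Road Seat Pleasant", "Georgia Ave Petworth",
             "Arlington|Cemetery", "Takoma"]

def normalize(name):
    """Lowercase and turn punctuation into spaces so near-identical names compare equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def suggest_matches(wmata_names, gis_names, cutoff=0.6):
    """Map each WMATA name to its closest GIS name, or None if nothing is close enough."""
    gis_by_norm = {normalize(n): n for n in gis_names}
    suggestions = {}
    for name in wmata_names:
        hits = difflib.get_close_matches(
            normalize(name), list(gis_by_norm), n=1, cutoff=cutoff
        )
        suggestions[name] = gis_by_norm[hits[0]] if hits else None
    return suggestions

suggestions = suggest_matches(wmata_names, gis_names)
for wmata_name, gis_name in suggestions.items():
    print(wmata_name, "->", gis_name)
```

The resulting dictionary could serve as a rename map on one of the frames before merging on station name.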

## Getting a weather dataset