Machine learning

predict number of rides at a given station/area
determine how many stations will be needed if the amount of rides increases?

Goal 
improve bikeshare connectivity in Prince George, Maryland.
KPIS
Geographic & Accessibility KPIs
These show spatial patterns of bikeshare use.
Top Origin & Destination Stations – Identify the most frequently used stations.
Most Popular Routes – Determine the most common start-to-end station pairs.
Trips per Square Mile – Identify high- and low-density usage areas.
Coverage & Accessibility – Percentage of the city covered by bikeshare stations.
Ward-to-Ward Trip Flow – Count trips between different wards to see mobility patterns.


In [1]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
import statistics as st
import seaborn as sns
import datetime 
from geopy import distance
import folium
from folium.plugins import MarkerCluster
from folium.features import GeoJsonTooltip
from branca.colormap import LinearColormap
from collections import Counter
import json
from shapely.geometry import Point
import geopandas as gpd
from shapely.geometry import shape
from shapely.wkt import loads 

Predicting Ride Demand (Number of Rides per Station per Hour/Day)

Features: Station proximity, metro distance,, time of day, weekday/weekend. In/out? (how many are starting and ending on a given station at a given time). Population? others: distance to city center, distance to major points of interest (tourist attractions, business districts, universities) distance to parks & recreational areas, distance to high density residential areas

Model: Poisson Regression, Time Series Regression (ARIMA, XGBoost), Neural Networks.

Insight: Helps predict peak usage times and optimize bike availability.

the df should have station name, time of day or weekday, distance to the closest station, distance to metro. First I should see if there's a correlation between this

Im starting with the premise that not only weather or time, but station location influence the amount of rides

In [2]:
data_types = {
    "rideable_type": "category", 
    "start_station_name": "category", 
    "end_station_name": "category", 
    "member_casual":"category",
    # "ride_id":"uint32",
    "time_of_day":"category",
    "trip_type":"category"}

In [3]:
prince_george = pd.read_csv("prince_georgy_cabi.csv",parse_dates= ["started_at", "ended_at"],dtype=data_types, low_memory=False)

1. Defining the Problem
Target Variable: Number of rides per station per time unit (hour/day/week).
Features:
Station Proximity: Average distance to nearest stations.
Metro Distance: Distance to the nearest metro station.
Time Factors: Hour of the day, day of the week, season.
Area: DC vs. Maryland (as a categorical variable).
2. Data Preprocessing
Aggregate the data by station and time interval (hourly or daily).
Encode categorical features (e.g., one-hot encoding for area).
Normalize numerical features like distance.
Handle missing values if any exist.
3. Choose a Model
Linear Regression (for simple relationships).
Poisson Regression (good for count data like demand).
Random Forest / XGBoost (for non-linear relationships and interactions).
4. Model Training & Evaluation
Train the model on historical data.
Use RMSE or Mean Absolute Error (MAE) for evaluation.
Tune hyperparameters (if using tree-based models).

In [4]:
prince_george.columns
prince_george = prince_george.drop(columns=["Unnamed: 0","AREA_COVER", "index_right",'ACREAGE',
       'IMPRT_DATE', 'SHAPE_AREA', 'SHAPE_LEN'])

In [5]:
prince_george_fixed = prince_george[prince_george["start_station_name"].isin(['1301 McCormick Dr / Wayne K. Curry Admin Bldg',
 '40th Ave & Bladensburg Rd',
 'Baltimore Ave & Jefferson St',
 'Baltimore Ave & Van Buren St / Riverdale Park Station',
 'Baltimore Avenue and Hotel Drive at UMD',
 'Bladensburg Waterfront Park',
 'Bowdoin Ave & Calvert Rd/ College Park Metro',
 'Bowdoin Ave & Calvert Rd/ College Park Station',
 'Capitol Heights Metro',
 'Chillum Rd & Riggs Rd / Riggs Plaza',
 'Crescent Rd & Ridge Rd',
 'Fleet St & Waterfront St',
 'Greenbelt Station Parkway',
 'Guilford Drive & Rowalt Drive / UMD',
 'Hyattsville Library / Adelphi Rd & Toledo Rd',
 "Largo Rd & Campus Way / Prince Georges's Comm Col",
 'Largo Town Center Metro',
 'National Harbor Carousel',
 'New Hampshire Ave & East-West Hwy',
 'Northwestern High School',
 'Oglethorpe St & 42nd Ave',
 'Oxon Hill Park & Ride',
 'Perry & 35th St',
 "Prince George's Plaza Metro",
 'Queens Chapel & Hamilton St',
 'Rhode Island Ave & 39th St / Brentwood Arts Exchange',
 'Rhode Island Avenue /Charles Armentrout Drive - Melrose Skate Park ',
 'Riggs Rd & East West Hwy',
 'Riverdale Park Town Center',
 'Roosevelt Center & Crescent Rd',
 'Southern Ave Metro',
 'Tanger Outlets',
 'The Mall at Prince Georges',
 'Walker Mill Road/ Walker Mill Regional Park ',
 'West Hyattsville Metro'])|prince_george["start_station_name"].isna()]

In [6]:
prince_george_fixed.isna().sum()

rideable_type                 0
started_at                    0
ended_at                      0
start_station_name        62863
end_station_name          63953
member_casual                 0
start_lat                     0
start_lng                     0
end_lat                     168
end_lng                     168
trip_duration_minutes     77832
time_of_day               77832
year                          0
geometry                      0
WARD                     130316
NAME_left                130316
COUNTY                        0
area                          0
NAME_right                    0
dtype: int64

In [7]:
prince_george_fixed["rideable_type"].value_counts()

rideable_type
electric_bike    93496
classic_bike     32922
docked_bike       3898
Name: count, dtype: int64

In [8]:
ebikes = prince_george[prince_george["rideable_type"] == "electric_bike"]

In [9]:
docked = prince_george[(prince_george["rideable_type"] == "classic_bike")&(prince_george["rideable_type"] == "docked_bike")]

In [10]:
docked.isna().sum()

rideable_type            0
started_at               0
ended_at                 0
start_station_name       0
end_station_name         0
member_casual            0
start_lat                0
start_lng                0
end_lat                  0
end_lng                  0
trip_duration_minutes    0
time_of_day              0
year                     0
geometry                 0
WARD                     0
NAME_left                0
COUNTY                   0
area                     0
NAME_right               0
dtype: int64

In [11]:
ebikes.isna().sum()

rideable_type                0
started_at                   0
ended_at                     0
start_station_name       62863
end_station_name         63445
member_casual                0
start_lat                    0
start_lng                    0
end_lat                      0
end_lng                      0
trip_duration_minutes    66921
time_of_day              66921
year                         0
geometry                     0
WARD                     93566
NAME_left                93566
COUNTY                       0
area                         0
NAME_right                   0
dtype: int64

all the station missing values correspond to ebikes.

In [12]:
# unique stations
avg_lat_per_station = prince_george_fixed.groupby("start_station_name", as_index=False, observed=False)["start_lat"].mean()
avg_lng_per_station = prince_george_fixed.groupby("start_station_name", as_index=False, observed = False)["start_lng"].mean()

pg_unique_stations= avg_lat_per_station.merge(avg_lng_per_station)
pg_unique_stations = pg_unique_stations.dropna(subset=["start_lat","start_lng"])
len(pg_unique_stations)

35

In [13]:
from sklearn.neighbors import NearestNeighbors

# Assume df_stations has columns: ["station_name", "latitude", "longitude"]
coords = pg_unique_stations[["start_lat", "start_lng"]].values

# Use Nearest Neighbors to find closest stations
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(coords)
distances, _ = nbrs.kneighbors(coords)

# Exclude distance to itself (first column is 0)
pg_unique_stations["avg_nearest_distance"] = np.mean(distances[:, 1:], axis=1)

#REVIEW - When using NearestNeighbors from sklearn.neighbors with algorithm='ball_tree' or algorithm='kd_tree', the distance is measured in Euclidean units (i.e., degrees of latitude/longitude). This is not ideal for geographic distances because latitude and longitude are not uniform in scale.

In [14]:
pg_unique_stations

Unnamed: 0,start_station_name,start_lat,start_lng,avg_nearest_distance
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,38.908392,-76.843263,0.037088
21,40th Ave & Bladensburg Rd,38.935389,-76.949285,0.010079
32,Baltimore Ave & Jefferson St,38.955494,-76.940138,0.010131
33,Baltimore Ave & Van Buren St / Riverdale Park ...,38.969583,-76.937349,0.010064
34,Baltimore Avenue and Hotel Drive at UMD,38.986639,-76.936072,0.012159
35,Bladensburg Waterfront Park,38.934324,-76.938248,0.016052
36,Bowdoin Ave & Calvert Rd/ College Park Metro,38.978103,-76.928879,0.008393
37,Bowdoin Ave & Calvert Rd/ College Park Station,38.978106,-76.928876,0.008395
38,Capitol Heights Metro,38.888527,-76.913163,0.05584
39,Chillum Rd & Riggs Rd / Riggs Plaza,38.961747,-76.995907,0.024209


list of unique stations 2021-2024 in prince george . As determined in dc_maryland_updated_2024
['1301 McCormick Dr / Wayne K. Curry Admin Bldg',
 '40th Ave & Bladensburg Rd',
 'Baltimore Ave & Jefferson St',
 'Baltimore Ave & Van Buren St / Riverdale Park Station',
 'Baltimore Avenue and Hotel Drive at UMD',
 'Bladensburg Waterfront Park',
 'Bowdoin Ave & Calvert Rd/ College Park Metro',
 'Bowdoin Ave & Calvert Rd/ College Park Station',
 'Capitol Heights Metro',
 'Chillum Rd & Riggs Rd / Riggs Plaza',
 'Crescent Rd & Ridge Rd',
 'Fleet St & Waterfront St',
 'Greenbelt Station Parkway',
 'Guilford Drive & Rowalt Drive / UMD',
 'Hyattsville Library / Adelphi Rd & Toledo Rd',
 "Largo Rd & Campus Way / Prince Georges's Comm Col",
 'Largo Town Center Metro',
 'National Harbor Carousel',
 'New Hampshire Ave & East-West Hwy',
 'Northwestern High School',
 'Oglethorpe St & 42nd Ave',
 'Oxon Hill Park & Ride',
 'Perry & 35th St',
 "Prince George's Plaza Metro",
 'Queens Chapel & Hamilton St',
 'Rhode Island Ave & 39th St / Brentwood Arts Exchange',
 'Rhode Island Avenue /Charles Armentrout Drive - Melrose Skate Park ',
 'Riggs Rd & East West Hwy',
 'Riverdale Park Town Center',
 'Roosevelt Center & Crescent Rd',
 'Southern Ave Metro',
 'Tanger Outlets',
 'The Mall at Prince Georges',
 'Walker Mill Road/ Walker Mill Regional Park ',
 'West Hyattsville Metro']

In [15]:
# using Havesian distance instead of Eucledian Units
from sklearn.neighbors import BallTree
import numpy as np
import pandas as pd

# Convert degrees to radians (needed for haversine)
pg_unique_stations[["lat_rad", "lon_rad"]] = np.radians(pg_unique_stations[["start_lat", "start_lng"]])

# Fit Nearest Neighbors model using Haversine distance
coords = pg_unique_stations[["lat_rad", "lon_rad"]].values
tree = BallTree(coords, metric="haversine")

# Find nearest stations (excluding itself)
distances, _ = tree.query(coords, k=5)  # 5 nearest neighbors

# Convert distances from radians to kilometers (Earth radius ≈ 6371 km)
pg_unique_stations["avg_nearest_distance_km"] = distances[:, 1:].mean(axis=1) * 6371

In [16]:
pg_unique_stations

Unnamed: 0,start_station_name,start_lat,start_lng,avg_nearest_distance,lat_rad,lon_rad,avg_nearest_distance_km
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,38.908392,-76.843263,0.037088,0.67908,-1.341168,3.552301
21,40th Ave & Bladensburg Rd,38.935389,-76.949285,0.010079,0.679551,-1.343018,0.953318
32,Baltimore Ave & Jefferson St,38.955494,-76.940138,0.010131,0.679902,-1.342859,1.059654
33,Baltimore Ave & Van Buren St / Riverdale Park ...,38.969583,-76.937349,0.010064,0.680148,-1.34281,1.006029
34,Baltimore Avenue and Hotel Drive at UMD,38.986639,-76.936072,0.012159,0.680445,-1.342788,1.283055
35,Bladensburg Waterfront Park,38.934324,-76.938248,0.016052,0.679532,-1.342826,1.478655
36,Bowdoin Ave & Calvert Rd/ College Park Metro,38.978103,-76.928879,0.008393,0.680296,-1.342662,0.807663
37,Bowdoin Ave & Calvert Rd/ College Park Station,38.978106,-76.928876,0.008395,0.680296,-1.342662,0.807814
38,Capitol Heights Metro,38.888527,-76.913163,0.05584,0.678733,-1.342388,5.443042
39,Chillum Rd & Riggs Rd / Riggs Plaza,38.961747,-76.995907,0.024209,0.680011,-1.343832,2.231618


# calculating distance to metro

In [17]:
with open ("Maryland_Transit_-_WMATA_Metro_Stops (1).geojson") as i:
    metro_stations = json.loads(i.read())
metro_features = metro_stations['features']

# # convert metro_stations (dict) into geo dataframe so as to get the geometry that we will use to map 
metro_stations_gdf = gpd.GeoDataFrame(
    pd.DataFrame([feature['properties'] for feature in metro_features]),  # Extract properties as attributes
    geometry=[shape(feature['geometry']) for feature in metro_features],  # Convert geometries
    crs="EPSG:4326" )

In [18]:
with open ("Maryland_Transit_-_MARC_Trains_Stations.geojson") as i:
    train_stations = json.loads(i.read())
train_features = train_stations['features']

# # convert metro_stations (dict) into geo dataframe so as to get the geometry that we will use to map 
train_stations_gdf = gpd.GeoDataFrame(
    pd.DataFrame([feature['properties'] for feature in train_features]),  # Extract properties as attributes
    geometry=[shape(feature['geometry']) for feature in train_features],  # Convert geometries
    crs="EPSG:4326" )

In [19]:
#maryland boundaries
with open ("Maryland_Physical_Boundaries_-_County_Boundaries_(Detailed).geojson") as i:
    maryland = json.loads(i.read())

features = maryland["features"]

#GDF
maryland_gdf = gpd.GeoDataFrame(
    pd.DataFrame([feature['properties'] for feature in features]),  # Extract properties as attributes
    geometry=[shape(feature['geometry']) for feature in features],  # Convert geometries
    crs="EPSG:4326")

In [20]:
metro_stations_gdf.head(1)

Unnamed: 0,OBJECTID,GIS_ID,NAME,WEB_URL,ADDRESS,MetroLine,geometry
0,1,mstn_1,College Park-U of Md,http://www.wmata.com/rail/station_detail.cfm?s...,"4931 CALVERT ROAD, COLLEGE PARK, MD","green, yellow",POINT (-76.92812 38.97862)


In [21]:
metro_stations_gdf = metro_stations_gdf[["NAME","MetroLine","geometry"]]

In [22]:
geometry = [Point(xy) for xy in zip(pg_unique_stations['start_lng'], pg_unique_stations['start_lat'])]
pg_unique_stations_gdf = gpd.GeoDataFrame(pg_unique_stations, geometry=geometry, crs="EPSG:4326")

# pg_unique_stations["geometry"] = pg_unique_stations.apply(lambda row: Point(row["start_lat"], row["start_lng"]), axis=1)
# pg_unique_stations_gdf = gpd.GeoDataFrame(pg_unique_stations, geometry="geometry", crs="EPSG:4326")  # WGS84

In [23]:
pg_unique_stations_gdf

Unnamed: 0,start_station_name,start_lat,start_lng,avg_nearest_distance,lat_rad,lon_rad,avg_nearest_distance_km,geometry
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,38.908392,-76.843263,0.037088,0.67908,-1.341168,3.552301,POINT (-76.84326 38.90839)
21,40th Ave & Bladensburg Rd,38.935389,-76.949285,0.010079,0.679551,-1.343018,0.953318,POINT (-76.94928 38.93539)
32,Baltimore Ave & Jefferson St,38.955494,-76.940138,0.010131,0.679902,-1.342859,1.059654,POINT (-76.94014 38.95549)
33,Baltimore Ave & Van Buren St / Riverdale Park ...,38.969583,-76.937349,0.010064,0.680148,-1.34281,1.006029,POINT (-76.93735 38.96958)
34,Baltimore Avenue and Hotel Drive at UMD,38.986639,-76.936072,0.012159,0.680445,-1.342788,1.283055,POINT (-76.93607 38.98664)
35,Bladensburg Waterfront Park,38.934324,-76.938248,0.016052,0.679532,-1.342826,1.478655,POINT (-76.93825 38.93432)
36,Bowdoin Ave & Calvert Rd/ College Park Metro,38.978103,-76.928879,0.008393,0.680296,-1.342662,0.807663,POINT (-76.92888 38.9781)
37,Bowdoin Ave & Calvert Rd/ College Park Station,38.978106,-76.928876,0.008395,0.680296,-1.342662,0.807814,POINT (-76.92888 38.97811)
38,Capitol Heights Metro,38.888527,-76.913163,0.05584,0.678733,-1.342388,5.443042,POINT (-76.91316 38.88853)
39,Chillum Rd & Riggs Rd / Riggs Plaza,38.961747,-76.995907,0.024209,0.680011,-1.343832,2.231618,POINT (-76.99591 38.96175)


In [24]:
# pg_unique_stations_gdf = pg_unique_stations_gdf.to_crs(epsg=4326)  # Maryland State Plane
# metro_stations_gdf = metro_stations_gdf.to_crs(epsg=4326)
# EPSG:4326 (Latitude/Longitude) → Degrees (Not Good for Distance)
# EPSG:3857 or EPSG:26985 (Projected) → Meters/KM (Best for Distance Calculations)

pg_unique_stations_gdf = pg_unique_stations_gdf.to_crs(epsg=26985)
metro_stations_gdf = metro_stations_gdf.to_crs(epsg=26985)

In [26]:
from shapely.ops import nearest_points

def find_nearest_metro(bike_station, metro_stations):
    """Find the nearest metro station and return its distance in meters."""
    nearest_metro = nearest_points(bike_station, metro_stations.union_all())[1]
    return bike_station.distance(nearest_metro)  # Output in meters

# Compute nearest metro distance for each bikeshare station
pg_unique_stations_gdf["distance_to_metro_meters"] = pg_unique_stations_gdf["geometry"].apply(
    lambda x: find_nearest_metro(x, metro_stations_gdf)
)

# Convert to kilometers for better readability
pg_unique_stations_gdf["distance_to_metro_km"] = pg_unique_stations_gdf["distance_to_metro_meters"] / 1000

# Check results
pg_unique_stations_gdf[["start_station_name", "distance_to_metro_km"]].head()


Unnamed: 0,start_station_name,distance_to_metro_km
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,0.385641
21,40th Ave & Bladensburg Rd,2.800774
32,Baltimore Ave & Jefferson St,1.750083
33,Baltimore Ave & Van Buren St / Riverdale Park ...,1.282804
34,Baltimore Avenue and Hotel Drive at UMD,1.125836


In [27]:
from scipy.spatial import cKDTree
import numpy as np

# Extract coordinates
bike_coords = np.array(list(pg_unique_stations_gdf.geometry.apply(lambda x: (x.x, x.y))))
metro_coords = np.array(list(metro_stations_gdf.geometry.apply(lambda x: (x.x, x.y))))

# Create KDTree for fast nearest-neighbor search
metro_tree = cKDTree(metro_coords)

# Find nearest metro station for each bikeshare station
distances, indices = metro_tree.query(bike_coords)

# Store distance in meters (since projected CRS is used)
pg_unique_stations_gdf["distance_to_metro_meters"] = distances
pg_unique_stations_gdf["distance_to_metro_km"] = distances / 1000  # Convert to km

print(pg_unique_stations_gdf[["start_station_name", "distance_to_metro_km"]].head())

                                   start_station_name  distance_to_metro_km
5       1301 McCormick Dr / Wayne K. Curry Admin Bldg              0.385641
21                          40th Ave & Bladensburg Rd              2.800774
32                       Baltimore Ave & Jefferson St              1.750083
33  Baltimore Ave & Van Buren St / Riverdale Park ...              1.282804
34            Baltimore Avenue and Hotel Drive at UMD              1.125836


In [52]:

avg_lat = pg_unique_stations["start_lat"].mean()
avg_lng = pg_unique_stations["start_lng"].mean()

m=folium.Map(location=[avg_lat, avg_lng],   
                zoom_start=12,              
                max_zoom=26,                
                min_zoom=2)  


# # Add bikeshare stations (blue)
# for idx, row in pg_unique_stations_gdf.iterrows():
#     folium.CircleMarker(
#         location=[row.geometry.y, row.geometry.x],
#         radius=3,
#         color="blue",
#         fill=True,
#         fill_color="blue",
#         fill_opacity=0.6,
#         popup=f"Bikeshare Station: {row.start_station_name}",
#     ).add_to(m)

l1 = folium.GeoJson(
    pg_unique_stations_gdf,  
    overlay= True, 
    control = True,
    show = True,
    name= "Cabi Stations",
    marker=folium.CircleMarker(radius=3, fill_color="blue", fill_opacity=1, color="black", weight=1),
    tooltip=folium.GeoJsonTooltip(fields=["start_station_name"],
                                  aliases=["Station: "]),
    popup=folium.GeoJsonPopup(fields=["start_station_name"]),
    highlight_function=lambda x: {"fillOpacity": 0.6},
    zoom_on_click=False,
).add_to(m)

l2 = folium.GeoJson(
    metro_stations_gdf,  
    overlay= True, 
    control = True,
    show = True,
    name= "Metro Stations",
    marker=folium.Marker(radius=4,icon= folium.Icon(color="red", icon="train", prefix="fa")),
    tooltip=folium.GeoJsonTooltip(fields=["NAME"],
                                  aliases=["Metro Station: "]),
    popup=folium.GeoJsonPopup(fields=["NAME"]),
    highlight_function=lambda x: {"fillOpacity": 0.8},
    zoom_on_click=False,
).add_to(m)

l3 = folium.GeoJson(
    train_stations_gdf,  
    overlay= True, 
    control = True,
    show = True,
    name= "Train Stations",
    marker=folium.Marker(radius=4,icon= folium.Icon(color="red", icon="train", prefix="fa")),
    tooltip=folium.GeoJsonTooltip(fields=["Name"],
                                  aliases=["Train Station: "]),
    popup=folium.GeoJsonPopup(fields=["Name"]),
    highlight_function=lambda x: {"fillOpacity": 0.8},
    zoom_on_click=False,
).add_to(m)
# # Add metro stations (red)
# for idx, row in metro_stations_gdf.iterrows():
#     folium.Marker(
#         location=[row.geometry.y, row.geometry.x],
#         icon=folium.Icon(color="red", icon="train", prefix="fa"),
#         popup=f"Metro Station: {row.NAME}",
#     ).add_to(m)

# for idx, row in train_stations_gdf.iterrows():
#     folium.Marker(
#         location=[row.geometry.y, row.geometry.x],
#         icon=folium.Icon(color="red", icon="train", prefix="fa"),
#         popup=f"Train Station: {row.Name}",
#     ).add_to(m)

# Show the map
m.add_child(folium.LayerControl())
m


# POI (points of interest)

In [32]:
poi_data = {
    "POI Name": [
        "National Harbor",
        "Six Flags America",
        "MGM National Harbor Resort & Casino",
        "Gaylord National Resort & Convention Center",
        "University of Maryland, College Park",
        "College Park Aviation Museum",
        "Oxon Cove Park and Oxon Hill Farm",
        "Montpelier Mansion",
        "Lake Artemesia",
        "Dinosaur Park"
    ],
    "Latitude": [38.78417, 38.90251, 38.79555, 38.78072, 38.98692, 38.97485, 38.80500, 39.06984, 38.99067, 39.00000],
    "Longitude": [-77.01639, -76.77130, -77.00856, -77.01599, -76.94255, -76.92233, -77.01611, -76.85025, -76.92233, -76.88000]}

In [33]:
geometry = [Point(xy) for xy in zip(poi_data['Longitude'], poi_data['Latitude'])]
poi_gdf = gpd.GeoDataFrame(poi_data, geometry=geometry, crs="EPSG:4326")

In [34]:
poi_gdf

Unnamed: 0,POI Name,Latitude,Longitude,geometry
0,National Harbor,38.78417,-77.01639,POINT (-77.01639 38.78417)
1,Six Flags America,38.90251,-76.7713,POINT (-76.7713 38.90251)
2,MGM National Harbor Resort & Casino,38.79555,-77.00856,POINT (-77.00856 38.79555)
3,Gaylord National Resort & Convention Center,38.78072,-77.01599,POINT (-77.01599 38.78072)
4,"University of Maryland, College Park",38.98692,-76.94255,POINT (-76.94255 38.98692)
5,College Park Aviation Museum,38.97485,-76.92233,POINT (-76.92233 38.97485)
6,Oxon Cove Park and Oxon Hill Farm,38.805,-77.01611,POINT (-77.01611 38.805)
7,Montpelier Mansion,39.06984,-76.85025,POINT (-76.85025 39.06984)
8,Lake Artemesia,38.99067,-76.92233,POINT (-76.92233 38.99067)
9,Dinosaur Park,39.0,-76.88,POINT (-76.88 39)


In [35]:
pg_unique_stations_gdf = pg_unique_stations_gdf.to_crs(epsg=26985)
poi_gdf = poi_gdf.to_crs(epsg=26985)

In [36]:
from shapely.ops import nearest_points

def find_nearest_poi(bike_station, poi):
    """Find the nearest metro station and return its distance in meters."""
    nearest_poi = nearest_points(bike_station, poi.union_all())[1]
    return bike_station.distance(nearest_poi)  # Output in meters

# Compute nearest poi distance for each bikeshare station
pg_unique_stations_gdf["distance_to_poi_meters"] = pg_unique_stations_gdf["geometry"].apply(
    lambda x: find_nearest_poi(x, poi_gdf)
)

# Convert to kilometers for better readability
pg_unique_stations_gdf["distance_to_poi_km"] = pg_unique_stations_gdf["distance_to_poi_meters"] / 1000

# Check results
pg_unique_stations_gdf[["start_station_name", "distance_to_poi_km"]].head()

Unnamed: 0,start_station_name,distance_to_poi_km
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,6.275948
21,40th Ave & Bladensburg Rd,4.964679
32,Baltimore Ave & Jefferson St,2.645513
33,Baltimore Ave & Van Buren St / Riverdale Park ...,1.426825
34,Baltimore Avenue and Hotel Drive at UMD,0.562109


In [37]:
pg_unique_stations_gdf.head()

Unnamed: 0,start_station_name,start_lat,start_lng,avg_nearest_distance,lat_rad,lon_rad,avg_nearest_distance_km,geometry,distance_to_metro_meters,distance_to_metro_km,distance_to_poi_meters,distance_to_poi_km
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,38.908392,-76.843263,0.037088,0.67908,-1.341168,3.552301,POINT (413594.048 137848.45),385.640842,0.385641,6275.948063,6.275948
21,40th Ave & Bladensburg Rd,38.935389,-76.949285,0.010079,0.679551,-1.343018,0.953318,POINT (404396.781 140834.92),2800.774297,2.800774,4964.678507,4.964679
32,Baltimore Ave & Jefferson St,38.955494,-76.940138,0.010131,0.679902,-1.342859,1.059654,POINT (405188.392 143067.257),1750.082593,1.750083,2645.512964,2.645513
33,Baltimore Ave & Van Buren St / Riverdale Park ...,38.969583,-76.937349,0.010064,0.680148,-1.34281,1.006029,POINT (405429.015 144631.399),1282.803766,1.282804,1426.825009,1.426825
34,Baltimore Avenue and Hotel Drive at UMD,38.986639,-76.936072,0.012159,0.680445,-1.342788,1.283055,POINT (405538.376 146524.925),1125.836326,1.125836,562.108635,0.562109


In [38]:
station_features = pg_unique_stations_gdf[["start_station_name","avg_nearest_distance_km","distance_to_metro_km","distance_to_poi_km"]]

In [39]:
station_features

Unnamed: 0,start_station_name,avg_nearest_distance_km,distance_to_metro_km,distance_to_poi_km
5,1301 McCormick Dr / Wayne K. Curry Admin Bldg,3.552301,0.385641,6.275948
21,40th Ave & Bladensburg Rd,0.953318,2.800774,4.964679
32,Baltimore Ave & Jefferson St,1.059654,1.750083,2.645513
33,Baltimore Ave & Van Buren St / Riverdale Park ...,1.006029,1.282804,1.426825
34,Baltimore Avenue and Hotel Drive at UMD,1.283055,1.125836,0.562109
35,Bladensburg Waterfront Park,1.478655,2.716629,4.705577
36,Bowdoin Ave & Calvert Rd/ College Park Metro,0.807663,0.087113,0.672631
37,Bowdoin Ave & Calvert Rd/ College Park Station,0.807814,0.086695,0.672557
38,Capitol Heights Metro,5.443042,0.156233,9.615494
39,Chillum Rd & Riggs Rd / Riggs Plaza,2.231618,1.228416,5.402427


# ML

##  1) Defining the problem
predict the daily ride demand (y) for each bikeshare station based on station location features (X). This is a regression problem since the target variable (ride demand) is continuous.

## 2) Prepare the Data

In [40]:
prince_george_fixed["start_station_name"].nunique()

35

In [41]:
prince_george_fixed["start_station_name"] = prince_george_fixed["start_station_name"].astype("object")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prince_george_fixed["start_station_name"] = prince_george_fixed["start_station_name"].astype("object")


In [42]:
prince_george_fixed["date"] = pd.to_datetime(prince_george_fixed["started_at"],format="ISO8601").dt.date
prince_george_fixed["month"] = pd.to_datetime(prince_george_fixed["started_at"],format="ISO8601").dt.month




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prince_george_fixed["date"] = pd.to_datetime(prince_george_fixed["started_at"],format="ISO8601").dt.date
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prince_george_fixed["month"] = pd.to_datetime(prince_george_fixed["started_at"],format="ISO8601").dt.month


In [43]:
prince_george_fixed["start_station_name"].value_counts()

start_station_name
West Hyattsville Metro                                                 9752
National Harbor Carousel                                               7714
Prince George's Plaza Metro                                            5908
Baltimore Ave & Van Buren St / Riverdale Park Station                  3836
Riverdale Park Town Center                                             3652
Chillum Rd & Riggs Rd / Riggs Plaza                                    3644
Baltimore Ave & Jefferson St                                           3364
The Mall at Prince Georges                                             3055
Oglethorpe St & 42nd Ave                                               2584
Perry & 35th St                                                        2564
Tanger Outlets                                                         2559
Fleet St & Waterfront St                                               2510
Queens Chapel & Hamilton St                                          

In [53]:

monthly_rides = prince_george_fixed.groupby(["start_station_name", "month"],observed=False).size().reset_index(name="monthly_rides")

In [54]:
monthly_rides.sort_values(by="monthly_rides", ascending=False)

Unnamed: 0,start_station_name,month,monthly_rides
339,West Hyattsville Metro,10,1150
335,West Hyattsville Metro,6,1023
338,West Hyattsville Metro,9,1006
336,West Hyattsville Metro,7,1001
151,National Harbor Carousel,6,995
...,...,...,...
291,Southern Ave Metro,1,10
42,Bladensburg Waterfront Park,12,7
1,1301 McCormick Dr / Wayne K. Curry Admin Bldg,2,5
329,Walker Mill Road/ Walker Mill Regional Park,12,1


In [55]:
full_df = monthly_rides.merge(station_features, on="start_station_name", how="left")

In [56]:
full_df

Unnamed: 0,start_station_name,month,monthly_rides,avg_nearest_distance_km,distance_to_metro_km,distance_to_poi_km
0,1301 McCormick Dr / Wayne K. Curry Admin Bldg,1,10,3.552301,0.385641,6.275948
1,1301 McCormick Dr / Wayne K. Curry Admin Bldg,2,5,3.552301,0.385641,6.275948
2,1301 McCormick Dr / Wayne K. Curry Admin Bldg,3,28,3.552301,0.385641,6.275948
3,1301 McCormick Dr / Wayne K. Curry Admin Bldg,4,23,3.552301,0.385641,6.275948
4,1301 McCormick Dr / Wayne K. Curry Admin Bldg,5,20,3.552301,0.385641,6.275948
...,...,...,...,...,...,...
337,West Hyattsville Metro,8,923,1.556320,0.136318,4.144417
338,West Hyattsville Metro,9,1006,1.556320,0.136318,4.144417
339,West Hyattsville Metro,10,1150,1.556320,0.136318,4.144417
340,West Hyattsville Metro,11,822,1.556320,0.136318,4.144417


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
full_df[["avg_nearest_distance_km", "distance_to_metro_km","distance_to_poi_km"]] = scaler.fit_transform(full_df[["avg_nearest_distance_km", "distance_to_metro_km","distance_to_poi_km"]])


In [61]:
# Defining features (X) and target (y)
X = full_df[["avg_nearest_distance_km", "distance_to_metro_km", "distance_to_poi_km", "month"]]
y = full_df["monthly_rides"]

In [62]:
#train test/split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
#train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [66]:
# Evaluate model
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
print("R² Score:", r2_score(y_test, y_pred))

Mean Absolute Error: 38.548840579710145
R² Score: 0.9159432382962864
