# 🤖 Notebook: Machine Learning Modeling

This notebook trains machine learning models to predict ride demand for Capital Bikeshare stations.

## ✅ Goals:
- Load processed dataset with distance-based features
- Define target and features
- Train/test split
- Train machine learning models
- Evaluate model performance
- Analyze feature importance


In [None]:
# Importing necessary libraries 
# 📚 Data Science Libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# 📦 Standard Library
import sys
from pathlib import Path

# 📊 Data Analysis 
import pandas as pd
import numpy as np

# 🛠️ Project-Specific Modules
sys.path.append(str(Path().resolve().parent / "src"))

from paths import INTERIM_DIR, PROCESSED_DIR
from helpers_folium import load_geojson_as_gdf, load_bikeshare_data, create_centered_map

## Prepare the data

In [37]:
# load Prince George's County station features
pg_station_features = pd.read_parquet(INTERIM_DIR / "station_features_2021_to_2024.parquet")

# load Prince George's County df
pg_df = pd.read_parquet(PROCESSED_DIR / "prince_george.parquet")

In [38]:
# Parse started_at once, then derive calendar features from it
started_at = pd.to_datetime(pg_df["started_at"], format="ISO8601")
pg_df["date"] = started_at.dt.date
pg_df["month"] = started_at.dt.month
pg_df["year"] = started_at.dt.year
pg_df["dow"] = started_at.dt.day_of_week  # Monday=0 ... Sunday=6


## Grouping by Station and Week


In [40]:
pg_df["started_at"] = pd.to_datetime(pg_df["started_at"],format="ISO8601")

In [41]:
pg_df["year_week"] = pg_df["started_at"].dt.strftime("%Y-%U")
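
The `%U` directive numbers weeks with Sunday as the first day, and days before the year's first Sunday fall into week `00`. That is why rides on 2022-01-01 (a Saturday) get the label `2022-00`. A minimal sketch using only the standard library, with dates picked for illustration:

```python
from datetime import datetime

# %U: week of the year with Sunday as the first day of the week.
# Days before the first Sunday of the year belong to week 00.
saturday = datetime(2022, 1, 1)  # Saturday
sunday = datetime(2022, 1, 2)    # first Sunday of 2022

label_saturday = saturday.strftime("%Y-%U")
label_sunday = sunday.strftime("%Y-%U")

print(label_saturday)  # 2022-00
print(label_sunday)    # 2022-01
```

Week `00` is therefore usually a partial week, so its ride counts are not directly comparable with those of full weeks.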

In [42]:
# Group by station and week, counting rides per group.
# NOTE: despite its name, avg_rides holds the total ride count per station per week;
# "member_casual" is used only as a column to count rows (count() skips nulls).
weekly_rides = pg_df.groupby(["start_station_name", "year_week"], observed=False).agg(
    avg_rides=("member_casual", "count")
).reset_index()
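
`count` on a single column tallies only the non-null values in that column, while `size` counts rows regardless of nulls. The toy frame below is hypothetical data, not the bikeshare dataset, and only illustrates the difference:

```python
import pandas as pd

# Hypothetical toy data: one ride has a missing member_casual value
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B"],
    "year_week": ["2022-01", "2022-01", "2022-01", "2022-01"],
    "member_casual": ["member", None, "casual", "member"],
})

grouped = rides.groupby(["start_station_name", "year_week"])

# count() skips the null value, size() counts every row
by_count = grouped.agg(rides=("member_casual", "count")).reset_index()
by_size = grouped.size().reset_index(name="rides")

print(by_count)
print(by_size)
```

If `member_casual` is never null in `pg_df`, both approaches give the same totals.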

In [43]:
pg_df.head()

| | rideable_type | started_at | ended_at | start_station_name | end_station_name | member_casual | start_lat | start_lng | end_lat | end_lng | ... | WARD | NAME_left | COUNTY | area | NAME_right | date | month | dow | Population_Density | year_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | electric_bike | 2022-01-01 01:14:29 | 2022-01-01 01:18:46 | Capitol Heights Metro | | member | 38.888528 | -76.913045 | 38.88 | -76.92 | ... | | | Prince George's | Maryland | TOWN OF CAPITOL HEIGHTS | 2022-01-01 | 1 | 5 | 2250 | 2022-00 |
| 1 | classic_bike | 2022-01-01 06:27:29 | 2022-01-01 06:50:59 | Chillum Rd & Riggs Rd / Riggs Plaza | The Mall at Prince Georges | member | 38.961737 | -76.995922 | 38.968842 | -76.954171 | ... | | | Prince George's | Maryland | CHILLUM | 2022-01-01 | 1 | 5 | 4358 | 2022-00 |
| 2 | electric_bike | 2022-01-01 08:08:08 | 2022-01-01 08:14:01 | Baltimore Ave & Jefferson St | | casual | 38.955485 | -76.940117 | 38.97 | -76.94 | ... | | | Prince George's | Maryland | CITY OF HYATTSVILLE | 2022-01-01 | 1 | 5 | 2571 | 2022-00 |
| 3 | classic_bike | 2022-01-01 09:51:55 | 2022-01-01 10:18:21 | The Mall at Prince Georges | Chillum Rd & Riggs Rd / Riggs Plaza | member | 38.968842 | -76.954171 | 38.961737 | -76.995922 | ... | | | Prince George's | Maryland | CITY OF HYATTSVILLE | 2022-01-01 | 1 | 5 | 2571 | 2022-00 |
| 4 | electric_bike | 2022-01-01 10:28:21 | 2022-01-01 10:33:19 | | Prince George's Plaza Metro | casual | 38.96 | -76.95 | 38.965742 | -76.954803 | ... | | | Prince George's | Maryland | CITY OF HYATTSVILLE | 2022-01-01 | 1 | 5 | 2571 | 2022-00 |


In [44]:
# Merge station features with grouped weekly rides
final_df = weekly_rides.merge(pg_station_features, on="start_station_name", how="left")
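
A left merge keeps every weekly row even when a station has no entry in the features table, which leaves NaN in the feature columns. A small sketch, on hypothetical toy frames, of how the `indicator` and `validate` options of `merge` can surface unmatched or duplicated stations:

```python
import pandas as pd

# Hypothetical toy frames standing in for weekly_rides and pg_station_features
weekly = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "avg_rides": [3, 2, 5],
})
features = pd.DataFrame({
    "start_station_name": ["A"],
    "pop_density": [1666],
})

merged = weekly.merge(
    features,
    on="start_station_name",
    how="left",
    validate="many_to_one",  # raises if a station appears twice in features
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Stations that found no matching feature row
unmatched = merged.loc[merged["_merge"] == "left_only", "start_station_name"].unique()
print(unmatched)
```

Rows flagged `left_only` would reach the model with missing feature values, so it may be worth checking for them before training.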

## Random Forest for Weekly Rides (based on 2021-2024)

In [47]:
# Split 'year_week' into numeric 'year' and 'week' columns
final_df["year"] = final_df["year_week"].apply(lambda x: int(x.split("-")[0]))
final_df["week"] = final_df["year_week"].apply(lambda x: int(x.split("-")[1]))

# Drop the original 'year_week' column
final_df.drop(columns=["year_week"], inplace=True)

In [51]:
final_df.head()

| | start_station_name | avg_rides | avg_distance_nearest_station_km | distance_to_metro_km | distance_to_poi_km | distance_to_cc_km | pop_density | year | week |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1301 McCormick Dr / Wayne K. Curry Admin Bldg | 3 | 3.552301 | 0.385641 | 6.275948 | 2.786083 | 1666 | 2022 | 1 |
| 1 | 1301 McCormick Dr / Wayne K. Curry Admin Bldg | 2 | 3.552301 | 0.385641 | 6.275948 | 2.786083 | 1666 | 2022 | 2 |
| 2 | 1301 McCormick Dr / Wayne K. Curry Admin Bldg | 1 | 3.552301 | 0.385641 | 6.275948 | 2.786083 | 1666 | 2022 | 8 |
| 3 | 1301 McCormick Dr / Wayne K. Curry Admin Bldg | 1 | 3.552301 | 0.385641 | 6.275948 | 2.786083 | 1666 | 2022 | 9 |
| 4 | 1301 McCormick Dr / Wayne K. Curry Admin Bldg | 1 | 3.552301 | 0.385641 | 6.275948 | 2.786083 | 1666 | 2022 | 11 |


In [None]:
# Define features (X) and target (y): the weekly ride count per station
X = final_df[[
    "avg_distance_nearest_station_km",
    "distance_to_metro_km",
    "distance_to_poi_km",
    "year",
    "week",
    "distance_to_cc_km",
    "pop_density",
]]
y = final_df["avg_rides"]

# Keep station names aligned with the rows so test predictions can be traced back to stations
station_names = final_df["start_station_name"]

# Train/test split (80/20); station_names is split alongside X and y
X_train, X_test, y_train, y_test, station_train, station_test = train_test_split(
    X, y, station_names, test_size=0.2, random_state=42
)

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
print("R² Score:", r2_score(y_test, y_pred))

Mean Absolute Error: 4.490835351089589
R² Score: 0.8704908261441793
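
The goals list a feature importance analysis. One common way to get it from a fitted `RandomForestRegressor` is the impurity-based `feature_importances_` attribute. The sketch below runs on synthetic data with made-up column names (`signal`, `noise`) so that it is self-contained; with the notebook's objects one would presumably pair `model.feature_importances_` with `X.columns` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): the target depends mostly on "signal"
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = 10 * X_demo["signal"] + rng.normal(scale=0.1, size=500)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(
    demo_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on the test set is a common cross-check.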


In [54]:
# interpreting MAE

print("Mean Weekly Rides:", y.mean())
print("Standard Deviation of weekly Rides:", y.std())

Mean Weekly Rides: 16.336401065633325
Standard Deviation of weekly Rides: 19.259434281589357


### Interpreting MAE

- If MAE is much smaller than the mean of the target, the model is relatively accurate.
- If MAE is close to or larger than the mean, predictions may not be reliable.

Example interpretation:

- If mean weekly rides = 10 and MAE = 2, predictions are off by 2 rides on average, a relative error of about 20%, which is good.
- If mean weekly rides = 5 and MAE = 2, the relative error is about 40%, which is high.
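
Applying the same rule of thumb to the values reported above gives the relative error for this model. A small sketch, with the numbers copied from the notebook output and rounded:

```python
mae = 4.4908           # Mean Absolute Error from the evaluation cell
mean_weekly = 16.3364  # Mean Weekly Rides

# Relative error: MAE as a fraction of the mean weekly ride count
relative_error = mae / mean_weekly
print(f"Relative MAE: {relative_error:.1%}")  # Relative MAE: 27.5%
```

The error is roughly 27% of the mean weekly ride count, which falls between the two illustrative cases above.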