## DS340-H Final Capstone Project
Jennifer Ruffin

**_Research Questions_**
1. How does station-level demand (in terms of trip origins and destinations) vary across different months and times of day?
2. What are the resulting peak usage periods for the most popular stations?

In this notebook, I will employ the machine learning methods Logistic Regression and Random Forests, as well as the probabilistic regression methods Poisson Regression and Negative Binomial Regression, to predict how heavily each station is used in a given month, on a given day of the week, and at a given hour.

In [2]:
import pandas as pd
bikeData = pd.read_csv('/Users/jenniferruffin/Desktop/finalcapstone.csv')  

In [3]:
bikeData['started_time'] = pd.to_datetime(bikeData['started_time'], errors='coerce')
bikeData['ended_time'] = pd.to_datetime(bikeData['ended_time'], errors='coerce')

print(bikeData.dtypes)

Unnamed: 0                     int64
start_station_name            object
end_station_name              object
started_date                  object
started_time          datetime64[ns]
ended_date                    object
ended_time            datetime64[ns]
trip_duration                float64
startmonth                    object
endmonth                      object
startTOD                      object
endTOD                        object
start_hour                     int64
end_hour                       int64
start_day_of_week             object
end_day_of_week               object
dtype: object


### Strategy 1: Probabilistic Regression Models for Predicting Start Counts (Poisson and Negative Binomial...thanks MATH/STAT 220!)

Step 1: Import necessary libraries


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error

Step 2: Aggregate data to station-hour level for start counts

In [16]:
hourly_starts = bikeData.groupby(['start_station_name', 'startmonth', 'start_hour', 'start_day_of_week']).size().reset_index(name='start_count')
hourly_starts = hourly_starts.rename(columns={'start_hour': 'hour','startmonth': 'month', 'start_day_of_week': 'day_of_week'})

print(hourly_starts.head(10)) # preview of the new dataframe used for these models

    start_station_name   month  hour day_of_week  start_count
0   Broadway and Cabot   April    14      Monday            2
1   Broadway and Cabot  August     1     Tuesday            1
2   Broadway and Cabot  August     2      Friday            1
3   Broadway and Cabot  August     2    Saturday            1
4   Broadway and Cabot  August     2    Thursday            2
5   Broadway and Cabot  August     2     Tuesday            1
6   Broadway and Cabot  August     3   Wednesday            1
7   Broadway and Cabot  August     5   Wednesday            1
8   Broadway and Cabot  August     6      Friday            2
9   Broadway and Cabot  August     6      Monday            1
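
One detail of this aggregation is worth keeping in mind: `groupby(...).size()` only returns combinations that occur at least once, so a station-hour with no trips does not appear as a row with `start_count` 0. The sketch below illustrates this with a few made-up trips (the station names and values are hypothetical, not taken from the dataset) and uses `collections.Counter` as a stand-in for the pandas call.

```python
from collections import Counter
from itertools import product

# Hypothetical trips: (station, month, hour, day_of_week). Not real data.
trips = [
    ("Station A", "August", 8, "Monday"),
    ("Station A", "August", 8, "Monday"),
    ("Station A", "August", 9, "Monday"),
    ("Station B", "August", 8, "Monday"),
]

# Similar in spirit to groupby([...]).size(): count trips per unique combination.
start_counts = Counter(trips)

# Every combination that could have occurred, given the values seen.
stations = sorted({t[0] for t in trips})
hours = sorted({t[2] for t in trips})
all_cells = list(product(stations, ["August"], hours, ["Monday"]))

# Cells with no trips are absent from the counts (an implicit zero).
missing_cells = [cell for cell in all_cells if cell not in start_counts]

print(dict(start_counts))
print("Cells with zero trips:", missing_cells)
```

Because of this, the smallest `start_count` the models ever see is 1, which may matter when interpreting the count regressions.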


Step 3: Defining and Encoding predictors and target variable

_Because the predictors are categorical, I encoded them as integers so that running the models would be easier_

In [10]:
categorical_cols = ['start_station_name', 'month', 'day_of_week']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    hourly_starts[col + '_encoded'] = encoders[col].fit_transform(hourly_starts[col])

# Add hour as a predictor
hourly_starts['hour_of_day'] = hourly_starts['hour']

# defining predictors and target
predictors = ['start_station_name_encoded', 'month_encoded', 'day_of_week_encoded',
                   'hour_of_day'] # Add more predictors if available
target = 'start_count'
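
A caveat on this encoding: `LabelEncoder` assigns integers in sorted label order, so a formula term like `month_encoded` is read as a single numeric slope (for example, April before August alphabetically) rather than as separate categories. One alternative, sketched here as an assumption rather than as what this notebook ran, is to let the statsmodels formula interface expand the original string columns into indicator variables with `C(...)`. The names `categorical_terms`, `numeric_terms`, and `categorical_formula` are hypothetical.

```python
# Build a formula that treats the string columns as categorical factors.
# C(...) is the patsy/statsmodels formula syntax for a categorical term.
categorical_terms = ["start_station_name", "month", "day_of_week"]
numeric_terms = ["hour"]  # assumption: hour kept numeric; C(hour) is another option
target = "start_count"

rhs = [f"C({name})" for name in categorical_terms] + numeric_terms
categorical_formula = f"{target} ~ " + " + ".join(rhs)

print(categorical_formula)
```

A string like this could be passed as `formula=` to `smf.glm` together with the unencoded `hourly_starts` columns. With many stations it adds roughly one coefficient per station, so fitting would likely be slower.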

Step 4: Split data into testing and training data

In [11]:
# Split data
train_p, test_p = train_test_split(hourly_starts, test_size=0.2, random_state=0)

Step 5: Run Regressions

- Block 1: Poisson
- Block 2: Negative Binomial

In [13]:
# Poisson
formula = f"{target} ~ " + " + ".join(predictors)
poisson_model = smf.glm(formula=formula, data=train_p, family=sm.families.Poisson()).fit()

print(poisson_model.summary())

# Make predictions 
predictions_poisson = poisson_model.predict(test_p)
print("\nPoisson Regression Predictions (first 10):")
print(predictions_poisson.head(10))

# Evaluate predictions using mean absolute error (mean_absolute_error was imported in Step 1)
mae_poisson = mean_absolute_error(test_p[target], predictions_poisson.round()) # Round predictions to integers
print(f"Poisson Regression Mean Absolute Error: {mae_poisson}")

                 Generalized Linear Model Regression Results                  
Dep. Variable:            start_count   No. Observations:               309016
Model:                            GLM   Df Residuals:                   309011
Model Family:                 Poisson   Df Model:                            4
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.9427e+06
Date:                Tue, 08 Apr 2025   Deviance:                   2.8351e+06
Time:                        14:45:32   Pearson chi2:                 4.29e+06
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2489
Covariance Type:            nonrobust                                         
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept           

In [14]:
# Negative Binomial Regression
negativebinomial_model = smf.glm(formula=formula, data=train_p,
                                  family=sm.families.NegativeBinomial()).fit()

print("\nNegative Binomial Regression Summary:")
print(negativebinomial_model.summary())

predictions_nb = negativebinomial_model.predict(test_p)
mae_nb = mean_absolute_error(test_p[target], predictions_nb.round())
print(f"Negative Binomial Regression Mean Absolute Error: {mae_nb}")




Negative Binomial Regression Summary:
                 Generalized Linear Model Regression Results                  
Dep. Variable:            start_count   No. Observations:               309016
Model:                            GLM   Df Residuals:                   309011
Model Family:        NegativeBinomial   Df Model:                            4
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -9.7238e+05
Date:                Tue, 08 Apr 2025   Deviance:                   3.0349e+05
Time:                        14:46:26   Pearson chi2:                 4.52e+05
No. Iterations:                     9   Pseudo R-squ. (CS):            0.03982
Covariance Type:            nonrobust                                         
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------
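
The two summaries report enough to compare how well each family handles the spread of the counts. A common rule of thumb divides the Pearson chi-square by the residual degrees of freedom; values far above 1 suggest overdispersion relative to the assumed family. The sketch below plugs in the figures printed in the summaries above. Note that `sm.families.NegativeBinomial()` uses a fixed dispersion parameter `alpha=1.0` by default, so the second ratio reflects that default rather than an estimated `alpha`.

```python
# Figures copied from the two GLM summaries above.
df_residuals = 309011
pearson_chi2_poisson = 4.29e6
pearson_chi2_negbin = 4.52e5

# Pearson chi2 / residual df: roughly 1 when the variance assumption holds.
dispersion_poisson = pearson_chi2_poisson / df_residuals
dispersion_negbin = pearson_chi2_negbin / df_residuals

print(f"Poisson dispersion ratio: {dispersion_poisson:.2f}")
print(f"Negative Binomial dispersion ratio: {dispersion_negbin:.2f}")
```

The Poisson ratio comes out near 14 and the Negative Binomial ratio near 1.5, which suggests the counts are more variable than a Poisson model assumes.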

### Strategy 2: Predicting Peak Usage Times with ML Strategies Logistic Regression and Random Forests

Step 1: Import necessary libraries 

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

Step 2: Aggregate hourly start counts for each station

In [17]:
hourly_station_usage = bikeData.groupby(['start_station_name', 'start_hour','start_day_of_week','startmonth']).size().reset_index(name='hourly_starts')
hourly_station_usage = hourly_station_usage.rename(columns={'start_hour': 'hour','start_day_of_week': 'day_of_week', 'startmonth': 'month'})

print(hourly_station_usage.head(10))

    start_station_name  hour day_of_week   month  hourly_starts
0   Broadway and Cabot     0      Friday     May              1
1   Broadway and Cabot     0    Thursday    July              1
2   Broadway and Cabot     0   Wednesday    July              1
3   Broadway and Cabot     1      Sunday    June              1
4   Broadway and Cabot     1     Tuesday  August              1
5   Broadway and Cabot     1     Tuesday    June              1
6   Broadway and Cabot     2      Friday  August              1
7   Broadway and Cabot     2      Monday    July              1
8   Broadway and Cabot     2    Saturday  August              1
9   Broadway and Cabot     2    Thursday  August              2


Step 3: Defining a threshold for **"high usage"** 

_For this context, "high usage" means an hourly count at or above the 90th percentile (the top 10%) of hourly counts for that station_

In [21]:
def define_hourly_high_usage(df, percentile=0.9):
    df['hourly_threshold'] = df.groupby('start_station_name')['hourly_starts'].transform(lambda x: x.quantile(percentile)) # CS 315 help, lambda
    df['is_peak'] = (df['hourly_starts'] >= df['hourly_threshold']).astype(int)
    df = df.drop(columns=['hourly_threshold'])
    return df

hourly_usage_with_peak = define_hourly_high_usage(hourly_station_usage.copy()) # copied to not modify original dataframe
print(hourly_usage_with_peak.head(10)) # 0 indicates not a peak, 1 indicates peak


    start_station_name  hour day_of_week   month  hourly_starts  is_peak
0   Broadway and Cabot     0      Friday     May              1        0
1   Broadway and Cabot     0    Thursday    July              1        0
2   Broadway and Cabot     0   Wednesday    July              1        0
3   Broadway and Cabot     1      Sunday    June              1        0
4   Broadway and Cabot     1     Tuesday  August              1        0
5   Broadway and Cabot     1     Tuesday    June              1        0
6   Broadway and Cabot     2      Friday  August              1        0
7   Broadway and Cabot     2      Monday    July              1        0
8   Broadway and Cabot     2    Saturday  August              1        0
9   Broadway and Cabot     2    Thursday  August              2        1
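
Because the threshold is compared with `>=`, ties at the 90th percentile are all marked as peaks, so a station can end up with more than 10% of its rows flagged. The sketch below reproduces the idea on hypothetical counts, using a small helper `linear_quantile` (a name introduced here for illustration) that mimics the linear interpolation pandas uses by default for `quantile`.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical hourly start counts for one station (not real data).
station_counts = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]

threshold = linear_quantile(station_counts, 0.9)
is_peak = [int(count >= threshold) for count in station_counts]
peak_share = sum(is_peak) / len(is_peak)

print("threshold:", threshold)
print("share flagged as peak:", peak_share)
```

Here two of the ten rows tie at the threshold, so 20% are flagged. Ties like this may be why about 11.5% of the test rows later in the notebook (8,916 of 77,255) carry `is_peak = 1` rather than exactly 10%.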


Step 4: Encoding and Defining Predictors -- similar to Strategy 1

In [22]:
categorical_cols_peak = ['start_station_name', 'month', 'day_of_week']
encoders_peak = {}
for col in categorical_cols_peak:
    encoders_peak[col] = LabelEncoder()
    hourly_usage_with_peak[col + '_encoded'] = encoders_peak[col].fit_transform(hourly_usage_with_peak[col])

# Define predictors and target
predictors_peak = ['start_station_name_encoded', 'month_encoded', 'day_of_week_encoded', 'hour']
target_peak = 'is_peak'

Step 5: Splitting data into testing and training

In [23]:
train_peak, test_peak = train_test_split(hourly_usage_with_peak, test_size=0.2, random_state=0)

Step 6: Run Models

- Block 1: Logistic Regression
- Block 2: Random Forests

In [24]:
# Logistic Regression
model_peak = LogisticRegression(random_state=42)
model_peak.fit(train_peak[predictors_peak], train_peak[target_peak])

# Make predictions
predictions_peak_lr = model_peak.predict(test_peak[predictors_peak])

# Evaluate
print("\nLogistic Regression for Peak Usage Prediction:")
print("Accuracy:", accuracy_score(test_peak[target_peak], predictions_peak_lr))
print("Classification Report:\n", classification_report(test_peak[target_peak], predictions_peak_lr))


Logistic Regression for Peak Usage Prediction:
Accuracy: 0.8845899941751343
Classification Report:
               precision    recall  f1-score   support

           0       0.88      1.00      0.94     68339
           1       0.00      0.00      0.00      8916

    accuracy                           0.88     77255
   macro avg       0.44      0.50      0.47     77255
weighted avg       0.78      0.88      0.83     77255
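
The accuracy above is less encouraging than it looks. Precision and recall of 0.00 for class 1 mean the model never predicted a peak, and a classifier that always predicts 0 scores exactly the share of non-peak rows. The sketch below checks this using the support counts from the report.

```python
# Support counts and accuracy taken from the classification report above.
non_peak_rows = 68339
peak_rows = 8916
reported_accuracy = 0.8845899941751343

# Accuracy of a baseline that always predicts "not a peak" (class 0).
baseline_accuracy = non_peak_rows / (non_peak_rows + peak_rows)

print(f"always-0 baseline accuracy: {baseline_accuracy:.6f}")
print("matches reported accuracy:", abs(baseline_accuracy - reported_accuracy) < 1e-6)
```

Options for imbalanced targets exist, such as `class_weight='balanced'` in `LogisticRegression`, though whether they would help here is untested.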





In [25]:
# Random Forests
rf_peak = RandomForestClassifier(random_state=42)
rf_peak.fit(train_peak[predictors_peak], train_peak[target_peak])

# Make predictions
predictions_peak_rf = rf_peak.predict(test_peak[predictors_peak])

# Evaluate
print("\nRandom Forest for Peak Usage Prediction:")
print("Accuracy:", accuracy_score(test_peak[target_peak], predictions_peak_rf))
print("Classification Report:\n", classification_report(test_peak[target_peak], predictions_peak_rf))


Random Forest for Peak Usage Prediction:
Accuracy: 0.8495501909261537
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.92      0.92     68339
           1       0.34      0.32      0.33      8916

    accuracy                           0.85     77255
   macro avg       0.62      0.62      0.62     77255
weighted avg       0.85      0.85      0.85     77255
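
Raw accuracy favors the Logistic Regression (0.88 vs. 0.85), but it is dominated by the non-peak class. Averaging the per-class recalls, which the reports list as the macro avg recall, gives a fairer comparison. The sketch below recomputes it from the rounded recall values printed above, so the results are approximate.

```python
# Per-class recall values as printed in the two classification reports.
recall_logistic = {"non_peak": 1.00, "peak": 0.00}
recall_forest = {"non_peak": 0.92, "peak": 0.32}

def macro_recall(recalls):
    """Unweighted mean of per-class recall (also called balanced accuracy)."""
    return sum(recalls.values()) / len(recalls)

print(f"Logistic Regression macro recall: {macro_recall(recall_logistic):.2f}")
print(f"Random Forest macro recall: {macro_recall(recall_forest):.2f}")
```

On this measure the Random Forest (about 0.62) does better than the Logistic Regression (0.50), since it is the only one of the two that identifies any peak rows.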

