## DS340-H Final Capstone Project
Jennifer Ruffin

**_Research Question: How does station-level demand (in terms of trip origins and destinations) vary across different months and times of day, and what are the resulting peak usage periods for the most popular stations?_**

In this notebook, I use Logistic Regression, Gradient Boosting, and Random Forests to predict which stations are more likely to be heavily used on a given day.



Read in the data set

In [67]:
import pandas as pd

bikeData = pd.read_csv('/Users/jenniferruffin/Desktop/finalcapstone.csv')  

In [68]:
bikeData['started_time'] = pd.to_datetime(bikeData['started_time'], errors='coerce')
bikeData['ended_time'] = pd.to_datetime(bikeData['ended_time'], errors='coerce')

print(bikeData.dtypes)


Unnamed: 0                     int64
start_station_name            object
end_station_name              object
started_date                  object
started_time          datetime64[ns]
ended_date                    object
ended_time            datetime64[ns]
trip_duration                float64
startmonth                    object
endmonth                      object
startTOD                      object
endTOD                        object
start_hour                     int64
end_hour                       int64
start_day_of_week             object
end_day_of_week               object
dtype: object


#### As in my R code, the cell below computes the total daily counts per station

In [69]:
# daily start counts per station
daily_starts = bikeData.groupby([bikeData['started_time'].dt.date, 'start_station_name']).size().reset_index(name='start_count')
daily_starts = daily_starts.rename(columns={'started_time': 'date', 'start_station_name': 'station'})

# daily end counts per station
daily_ends = bikeData.groupby([bikeData['ended_time'].dt.date, 'end_station_name']).size().reset_index(name='end_count')
daily_ends = daily_ends.rename(columns={'ended_time': 'date', 'end_station_name': 'station'})

# merge the two counts
total_activity = pd.merge(daily_starts, daily_ends, on=['date', 'station'], how='outer').fillna(0)

# get total daily counts
total_activity['total_activity'] = total_activity['start_count'] + total_activity['end_count']

print("Daily Activity per Station:")
print(total_activity.sort_values(by='total_activity', ascending=False).head(10))



Daily Activity per Station:
           date                                            station  \
332  2025-04-07                       MIT at Mass Ave / Amherst St   
131  2025-04-07              Central Square at Mass Ave / Essex St   
267  2025-04-07                Harvard Square at Mass Ave/ Dunster   
139  2025-04-07        Charles Circle - Charles St at Cambridge St   
329  2025-04-07                    MIT Pacific St at Purrington St   
39   2025-04-07                                 Ames St at Main St   
152  2025-04-07  Christian Science Plaza - Massachusetts Ave at...   
349  2025-04-07                          Mass Ave/Lafayette Square   
89   2025-04-07                        Boylston St at Fairfield St   
331  2025-04-07                                      MIT Vassar St   

     start_count  end_count  total_activity  
332      57004.0      56010        113014.0  
131      47530.0      47527         95057.0  
267      40938.0      42356         83294.0  
139      34146.0 
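
Every row in the output above carries the date 2025-04-07, while the `started_date` column holds dates in 2024. This suggests that `started_time` contains only a clock time, so `pd.to_datetime` attached a default date and all trips were grouped into a single day. A minimal sketch of grouping on the date columns instead, assuming `started_date` and `ended_date` hold the true trip dates; the toy `trips` frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for bikeData (hypothetical values, same column names)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A"],
    "end_station_name":   ["B", "B", "A", "B"],
    "started_date": ["2024-01-31", "2024-01-31", "2024-01-31", "2024-02-01"],
    "ended_date":   ["2024-01-31", "2024-01-31", "2024-01-31", "2024-02-01"],
})

# Build the grouping keys from the date columns rather than the time columns
start_day = pd.to_datetime(trips["started_date"]).dt.date.rename("date")
end_day = pd.to_datetime(trips["ended_date"]).dt.date.rename("date")

# Daily start counts per station
toy_starts = (
    trips.groupby([start_day, trips["start_station_name"].rename("station")])
    .size()
    .reset_index(name="start_count")
)

# Daily end counts per station
toy_ends = (
    trips.groupby([end_day, trips["end_station_name"].rename("station")])
    .size()
    .reset_index(name="end_count")
)

# Outer merge keeps stations that only appear as a start or only as an end
toy_activity = pd.merge(toy_starts, toy_ends, on=["date", "station"], how="outer").fillna(0)
toy_activity["total_activity"] = toy_activity["start_count"] + toy_activity["end_count"]

print(toy_activity)
```

With keys built this way, the counts are split across the actual calendar days, so a per-date threshold compares stations within the same day.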

The `total_activity` counts are now passed to the function `defineHighUsage()`. As a reminder, the goal of this notebook is to predict which stations are more likely to be heavily used on a given day. A station counts as high usage on a given date when its total activity is at or above that date's 80th percentile (`percentile=0.8`), which places it in the top 20% of stations for that day. These are the stations I use for the predictions.

In [70]:
def defineHighUsage(daily_counts, percentile=0.8):
    # Per-date threshold: the given quantile of total_activity across all stations on that date
    thresholds = (
        daily_counts.groupby('date')['total_activity']
        .quantile(percentile)
        .reset_index(name='threshold')
    )
    # Attach each date's threshold to its station rows
    daily_counts_merged = pd.merge(daily_counts, thresholds, on='date')
    # Flag stations at or above the threshold as high usage (1), otherwise 0
    daily_counts_merged['high_usage'] = (daily_counts_merged['total_activity'] >= daily_counts_merged['threshold']).astype(int)
    return daily_counts_merged
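
To make the thresholding rule concrete, here is a small sketch on made-up numbers. It uses `groupby(...).transform` to broadcast each date's 80th percentile back onto the rows, which is one alternative to the merge used in `defineHighUsage()`; the `toy` frame, its dates, and its station labels are hypothetical.

```python
import pandas as pd

# Hypothetical daily totals for two dates
toy = pd.DataFrame({
    "date": ["d1"] * 5 + ["d2"] * 2,
    "station": ["A", "B", "C", "D", "E", "A", "B"],
    "total_activity": [10.0, 20.0, 30.0, 40.0, 50.0, 100.0, 200.0],
})

# Per-date 80th percentile, repeated on every row of that date
toy["threshold"] = toy.groupby("date")["total_activity"].transform(
    lambda s: s.quantile(0.8)
)

# 1 if the station is at or above its date's threshold, else 0
toy["high_usage"] = (toy["total_activity"] >= toy["threshold"]).astype(int)

print(toy)
```

In this toy data only station E on `d1` and station B on `d2` are flagged, because pandas interpolates linearly between the sorted values when computing the quantile.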

Call the function and list the stations in descending order of total activity

In [71]:
total_activity_with_usage = defineHighUsage(total_activity, percentile=0.8)

print(total_activity_with_usage.sort_values(by='total_activity', ascending=False).head(10))

           date                                            station  \
332  2025-04-07                       MIT at Mass Ave / Amherst St   
131  2025-04-07              Central Square at Mass Ave / Essex St   
267  2025-04-07                Harvard Square at Mass Ave/ Dunster   
139  2025-04-07        Charles Circle - Charles St at Cambridge St   
329  2025-04-07                    MIT Pacific St at Purrington St   
39   2025-04-07                                 Ames St at Main St   
152  2025-04-07  Christian Science Plaza - Massachusetts Ave at...   
349  2025-04-07                          Mass Ave/Lafayette Square   
89   2025-04-07                        Boylston St at Fairfield St   
331  2025-04-07                                      MIT Vassar St   

     start_count  end_count  total_activity  threshold  high_usage  
332      57004.0      56010        113014.0    20210.0           1  
131      47530.0      47527         95057.0    20210.0           1  
267      40938.0      



#### **_I encountered a problem: the trip-level data and the high-usage labels now sit in two separate dataframes, but the models need them together._**

So, I merged my original dataset (`bikeData`) with the dataframe that holds the high-usage information (`total_activity_with_usage`).

In [72]:
bikeData['start_date'] = bikeData['started_time'].dt.date
bikeData_merged = pd.merge(bikeData, total_activity_with_usage,
                           left_on=['start_date', 'start_station_name'],
                           right_on=['date', 'station'],
                           how='left')

print("NaNs in high_usage after merge:", bikeData_merged['high_usage'].isnull().sum())
print(bikeData_merged.head())

# Replace the original bikeData with the merged one
bikeData = bikeData_merged.drop(columns=['start_date', 'date', 'station'])

NaNs in high_usage after merge: 962
   Unnamed: 0  start_station_name                         end_station_name  \
0           1  Ames St at Main St    Central Square at Mass Ave / Essex St   
1           2  Ames St at Main St    Central Square at Mass Ave / Essex St   
2           3  One Memorial Drive  Kennedy-Longfellow School 158 Spring St   
3           4  Ames St at Main St                      Brookline Town Hall   
4           5  Mass Ave T Station                         Chinatown T Stop   

  started_date        started_time  ended_date          ended_time  \
0   2024-01-31 2025-04-07 12:16:49  2024-01-31 2025-04-07 12:21:02   
1   2024-01-12 2025-04-07 08:14:16  2024-01-12 2025-04-07 08:19:48   
2   2024-01-29 2025-04-07 15:00:05  2024-01-29 2025-04-07 15:05:47   
3   2024-01-09 2025-04-07 16:33:40  2024-01-09 2025-04-07 17:00:41   
4   2024-01-23 2025-04-07 10:19:21  2024-01-23 2025-04-07 10:31:39   

   trip_duration startmonth endmonth  ... start_day_of_week end_day_of_wee
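
The merge above leaves 962 rows without a `high_usage` label. One way to inspect such rows is the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying whether each row found a match. A minimal sketch on made-up frames (`toy_trips` and `toy_usage` are hypothetical, and station "C" is deliberately missing from the usage table):

```python
import pandas as pd

# Hypothetical trip-level rows
toy_trips = pd.DataFrame({
    "start_date": ["2024-01-31", "2024-01-31", "2024-01-31"],
    "start_station_name": ["A", "B", "C"],
})

# Hypothetical station-day labels; station "C" has no entry
toy_usage = pd.DataFrame({
    "date": ["2024-01-31", "2024-01-31"],
    "station": ["A", "B"],
    "high_usage": [1, 0],
})

merged = pd.merge(
    toy_trips, toy_usage,
    left_on=["start_date", "start_station_name"],
    right_on=["date", "station"],
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Rows that found no matching station-day in the usage table
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["start_date", "start_station_name"]])
```

Looking at the unmatched keys shows whether the gaps come from missing dates, missing station names, or both, before deciding to drop or fill those rows.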

In [73]:

print(bikeData.dtypes)

Unnamed: 0                     int64
start_station_name            object
end_station_name              object
started_date                  object
started_time          datetime64[ns]
ended_date                    object
ended_time            datetime64[ns]
trip_duration                float64
startmonth                    object
endmonth                      object
startTOD                      object
endTOD                        object
start_hour                     int64
end_hour                       int64
start_day_of_week             object
end_day_of_week               object
start_count                  float64
end_count                    float64
total_activity               float64
threshold                    float64
high_usage                   float64
dtype: object


#### **_The dataset is now prepped and high usage is defined, so let's fit the models._**

*Import Libraries*

In [74]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import f1_score, accuracy_score, classification_report, confusion_matrix
import numpy as np

Label encoding to handle the categorical variables
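
`LabelEncoder` maps each category to an integer assigned in sorted order of the labels, so a linear model such as logistic regression will treat those codes as ordered magnitudes. One common alternative is one-hot encoding, which gives every category its own 0/1 column. A minimal sketch using `pd.get_dummies` on a made-up frame (the `toy_cats` values are hypothetical; this is an illustration, not the encoding used in this notebook):

```python
import pandas as pd

# Hypothetical categorical predictors with the notebook's column names
toy_cats = pd.DataFrame({
    "start_day_of_week": ["Mon", "Tue", "Mon"],
    "startTOD": ["Morning", "Evening", "Morning"],
})

# One indicator column per category value
X_onehot = pd.get_dummies(toy_cats, columns=["start_day_of_week", "startTOD"])

print(X_onehot.columns.tolist())
```

Tree-based models such as random forests are usually less sensitive to this choice, since they split on thresholds rather than fitting a single coefficient per column.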

In [87]:
categorical_cols = ['start_day_of_week', 'endTOD', 'startTOD', 'startmonth']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    bikeData[col + '_encoded'] = encoders[col].fit_transform(bikeData[col])

numerical_cols = [] # Add numerical predictors if you have them (e.g., lagged usage)
scaler = None
if numerical_cols:
    scaler = StandardScaler()
    bikeData[numerical_cols] = scaler.fit_transform(bikeData[numerical_cols])

# Define features (X) and target (y)
feature_cols = [col + '_encoded' for col in categorical_cols] + numerical_cols
X = bikeData[feature_cols]
y = bikeData['high_usage']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikeData[col + '_encoded'] = encoders[col].fit_transform(bikeData[col])
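
This warning typically appears when a new column is assigned on a dataframe that was itself created by slicing or filtering another dataframe. Taking an explicit `.copy()` of the subset before adding columns usually avoids it. A minimal sketch, where the filtering step and the `full` and `subset` names are hypothetical:

```python
import pandas as pd

full = pd.DataFrame({
    "startTOD": ["Morning", "Evening", "Morning"],
    "high_usage": [1.0, None, 0.0],
})

# Hypothetical filtering step; .copy() makes the subset an independent frame
subset = full[full["high_usage"].notna()].copy()

# Adding a column now modifies only the copy
subset["startTOD_encoded"] = subset["startTOD"].astype("category").cat.codes

print(subset)
```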

Split the data into training and test sets

In [90]:
print(bikeData)

         Unnamed: 0              start_station_name  \
0                 1              Ames St at Main St   
1                 2              Ames St at Main St   
2                 3              One Memorial Drive   
3                 4              Ames St at Main St   
4                 5              Mass Ave T Station   
...             ...                             ...   
3179249     3179250      Washington St at Temple Pl   
3179250     3179251  Boston City Hall - 28 State St   
3179251     3179252  Boston City Hall - 28 State St   
3179252     3179253  Boston City Hall - 28 State St   
3179253     3179254           EF - North Point Park   

                                     end_station_name started_date  \
0               Central Square at Mass Ave / Essex St   2024-01-31   
1               Central Square at Mass Ave / Essex St   2024-01-12   
2             Kennedy-Longfellow School 158 Spring St   2024-01-29   
3                            

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [89]:
scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(X_train)  
testX_scaled = scaler.transform(X_test) 

# Train the classifier
clf = LogisticRegression(random_state=0).fit(trainX_scaled, y_train)

# Predict on the testing data
y_pred = clf.predict(testX_scaled)

# Calculate the F1 score for each class separately (average=None returns one score per class)
f1 = f1_score(y_test, y_pred, average=None)

print("F1 Score:", f1)


F1 Score: [0.         0.74694898]
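
An F1 score of 0 for the first class means that no test row was correctly predicted as that class, which can happen when a classifier predicts the majority class for every row. A confusion matrix makes this visible. A minimal sketch on made-up labels (the arrays `y_true` and `y_hat` are hypothetical and are not this notebook's results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical imbalanced labels: mostly class 1
y_true = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# A classifier that always predicts the majority class
y_hat = np.ones_like(y_true)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Per-class F1; zero_division=0 silences the undefined-precision warning for class 0
f1_per_class = f1_score(y_true, y_hat, average=None, labels=[0, 1], zero_division=0)

print(cm)
print(f1_per_class)
```

If the confusion matrix for the real model has an empty first column like this one, checking `y_train.value_counts()` and trying `LogisticRegression(class_weight='balanced')` are two reasonable next steps.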
