The dataset from the New York City Department of Transportation provides a daily record of the number of bicycles crossing into or out of Manhattan via the East River bridges over a period of 9 months. The dataset comprises 210 entries and contains 11 columns. These columns include features such as the date, day of the week, high and low temperatures, precipitation, and the count of bicycles crossing each of the four major East River bridges: Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. Additionally, there is a column named "Total" representing the total number of bicycles crossing all bridges combined. The dataset is a mix of numerical and categorical data types. The objective is to utilize this dataset to build a Random Forest regression algorithm that can accurately predict the number of bicycles crossing the bridges in New York City based on the provided features.




Dataset link:
https://www.kaggle.com/datasets/new-york-city/nyc-east-river-bicycle-crossings/data

To improve the RandomForestRegressor model's accuracy, feature scaling (StandardScaler), feature selection (SelectFromModel), and parameter tuning were used. The optimal accuracy was achieved with these best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}.

### Step 1: Import Required Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble  import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

import warnings
warnings.filterwarnings('ignore')

### Step 2: Load Dataset:

In [2]:

# load to DataFrame
df_bicycle = pd.read_csv("nyc-east-river-bicycle-counts.csv")
df_bicycle

Unnamed: 0.1,Unnamed: 0,Date,Day,High Temp (°F),Low Temp (°F),Precipitation,Brooklyn Bridge,Manhattan Bridge,Williamsburg Bridge,Queensboro Bridge,Total
0,0,2016-04-01 00:00:00,2016-04-01 00:00:00,78.1,66.0,0.01,1704.0,3126,4115.0,2552.0,11497
1,1,2016-04-02 00:00:00,2016-04-02 00:00:00,55.0,48.9,0.15,827.0,1646,2565.0,1884.0,6922
2,2,2016-04-03 00:00:00,2016-04-03 00:00:00,39.9,34.0,0.09,526.0,1232,1695.0,1306.0,4759
3,3,2016-04-04 00:00:00,2016-04-04 00:00:00,44.1,33.1,0.47 (S),521.0,1067,1440.0,1307.0,4335
4,4,2016-04-05 00:00:00,2016-04-05 00:00:00,42.1,26.1,0,1416.0,2617,3081.0,2357.0,9471
...,...,...,...,...,...,...,...,...,...,...,...
205,205,2016-04-26 00:00:00,2016-04-26 00:00:00,60.1,46.9,0.24,1997.0,3520,4559.0,2929.0,13005
206,206,2016-04-27 00:00:00,2016-04-27 00:00:00,62.1,46.9,0,3343.0,5606,6577.0,4388.0,19914
207,207,2016-04-28 00:00:00,2016-04-28 00:00:00,57.9,48.0,0,2486.0,4152,5336.0,3657.0,15631
208,208,2016-04-29 00:00:00,2016-04-29 00:00:00,57.0,46.9,0.05,2375.0,4178,5053.0,3348.0,14954


In [3]:
print("Shape of DataFrame:", df_bicycle.shape)

df_bicycle.info()

Shape of DataFrame: (210, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           210 non-null    int64  
 1   Date                 210 non-null    object 
 2   Day                  210 non-null    object 
 3   High Temp (°F)       210 non-null    float64
 4   Low Temp (°F)        210 non-null    float64
 5   Precipitation        210 non-null    object 
 6   Brooklyn Bridge      210 non-null    float64
 7   Manhattan Bridge     210 non-null    int64  
 8   Williamsburg Bridge  210 non-null    float64
 9   Queensboro Bridge    210 non-null    float64
 10  Total                210 non-null    int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 18.2+ KB


In [4]:
# Check for null values in each column
null_counts = df_bicycle.isnull().sum()

# Sort null value counts in descending order
null_counts_sorted = null_counts.sort_values(ascending=False)

print("Null value counts in each column:")
print(null_counts)

Null value counts in each column:
Unnamed: 0             0
Date                   0
Day                    0
High Temp (°F)         0
Low Temp (°F)          0
Precipitation          0
Brooklyn Bridge        0
Manhattan Bridge       0
Williamsburg Bridge    0
Queensboro Bridge      0
Total                  0
dtype: int64


In [5]:
df_bicycle['Precipitation'].unique()

array(['0.01', '0.15', '0.09', '0.47 (S)', '0', '0.2', 'T', '0.16',
       '0.24', '0.05'], dtype=object)

The "Precipitation" column  contain different types of precipitation measurements, including numerical values (e.g., '0.01', '0.15') and textual representations (e.g., 'T' for trace amounts, '0.47 (S)' for snow)

In [6]:
# Convert textual representations to numerical values
df_bicycle['Precipitation'] = df_bicycle['Precipitation'].replace({'T': 0.01})  # Replace 'T' with trace amount
df_bicycle['Precipitation'] = df_bicycle['Precipitation'].str.extract(r'(\d+\.\d+)').astype(float)  # Extract numeric part

# Impute missing or invalid values
df_bicycle['Precipitation'] = df_bicycle['Precipitation'].replace({'0': 0.00})  # Replace '0' with '0.00'


In [7]:
# Separate features and target variable
X = df_bicycle.drop(columns=["Unnamed: 0", "Date", "Day", "Total"])  # Exclude non-numeric and target columns
y = df_bicycle["Total"]

### Step 3: Split Dataset:

In [8]:
# Split the dataset into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


### Step 4: Feature scaling

In [9]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Step 5: Feature selection

In [10]:
# Feature selection
random_forest = RandomForestRegressor(random_state=42)
feature_selector = SelectFromModel(random_forest)
X_train_selected = feature_selector.fit_transform(X_train_scaled, y_train)
X_test_selected = feature_selector.transform(X_test_scaled)

### Step 6: Create Random Forest Regressor

In [11]:
# Create Random Forest Regressor
regressor = RandomForestRegressor(random_state=42)

### Step 7: Parameter Tuning

To add parameter tuning to the Random Forest Regressor, we use a technique called grid search.

In [12]:
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [13]:
# Perform grid search
grid_search = GridSearchCV(estimator=regressor, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}


### Step 8: Train the model with the best parameters

In [14]:
# Train the model with the best parameters
best_random_forest_regressor  = RandomForestRegressor(**best_params, random_state=42)
best_random_forest_regressor.fit(X_train, y_train)

### Step 9: Make predictions

In [15]:
# Make predictions
y_pred = best_random_forest_regressor.predict(X_test)

### Step 10: Evaluate the model

In [16]:
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)

Mean Absolute Error: 1309.7711476133945
Mean Squared Error: 2450804.4242117056
Root Mean Squared Error: 1565.504527049253
R-squared: 0.9116316267863822


The Mean Absolute Error (MAE) of 1309.77 suggests that, on average, the model's predictions deviate by approximately 1309.77 bicycles. Similarly, the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) values of 2450804.42 and 1565.50, respectively, reflect relatively low prediction errors compared to the scale of the target variable. Additionally, the high R-squared (R²) value of 0.9116 indicates that the model explains approximately 91.2% of the variance in the target variable, signifying strong explanatory power.