# (Re-sampled) Feature Engineering - Bike Share System in the SF Bay Area

Author: Owen Hsu

**Note:**<br/>
Due to the computational limitations of the local machine and Google Colab, the dataset is resampled in this notebook, which is the revised version titled "(Re-sampled) Feature Engineering - Bike Share System in the SF Bay Area." The original dataset can be fitted in the future when sufficient computational resources are available.

## Table of Contents

1. Feature Overview
2. Loading and Setup
3. Assessment
4. Categorical Variables
5. Datetime Variables
6. Train/Test Split and Scaling

## Feature Overview

| Columns                    | Description                                                        |
|:---------------------------|:------------------------------------------------------------------:|
| Time                       | Recorded Date and Time                                             |
| Station ID                 | Station ID                                                         |
| Dock Number                | The number of docks at the station                                 |
| Mean Dew Point             | The average dew point of the date                                  |
| Mean Humidity              | The average humidity of the date                                   |
| Mean Sea Level Pressure    | The average sea level pressure of the date                         |
| Mean Visibility            | The average visibility of the date                                 |
| Mean Wind Speed            | The average wind speed of the date                                 |
| Precipitation Inches       | Determine whether the date of each record has precipitation        |
| Cloud Cover                | The fraction of the sky covered by clouds on the given date        |
| Zip Code                   | The zip code for the weather record                                |
| Usage Rate Category        | Usage Rate Category of the station                                 |
| Holiday                    | Determine whether the date of each record is a holiday or not      |
| Weekend                    | Determine whether the date of each record is a weekend or not      |
| Hour                       | Recorded Time (Hour)                                               |


## Loading and Setup

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler 
# Filter warnings
from warnings import filterwarnings
filterwarnings('ignore')



In [2]:
# Load the dataset
df = pd.read_parquet('data/BikeData_after_EDA.parquet')

## Assessment

In [3]:
# Set display options to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
# Print the shape of the data
df.shape

(7337194, 15)

In [5]:
# Look at the first 5 rows
df.head()

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,hour
0,2014-05-01 00:00:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
1,2014-05-01 00:05:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
2,2014-05-01 00:10:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
3,2014-05-01 00:15:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
4,2014-05-01 00:20:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0


In [6]:
# Get a quick overview of dataset variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7337194 entries, 0 to 7337193
Data columns (total 15 columns):
 #   Column                          Dtype         
---  ------                          -----         
 0   time                            datetime64[ns]
 1   station_id                      int64         
 2   dock_count                      int64         
 3   mean_dew_point_f                float64       
 4   mean_humidity                   float64       
 5   mean_sea_level_pressure_inches  float64       
 6   mean_visibility_miles           float64       
 7   mean_wind_speed_mph             float64       
 8   precipitation_inches            int64         
 9   cloud_cover                     float64       
 10  zip_code                        int64         
 11  usage_rate_category             object        
 12  weekend                         int64         
 13  holiday                         int64         
 14  hour                            int32         
dty

In [7]:
# Sanity check to make sure we don't have any NAs
df.isna().sum()

time                              0
station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
usage_rate_category               0
weekend                           0
holiday                           0
hour                              0
dtype: int64

In [8]:
# Get a statistical summary of the dataset
df.describe()

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,hour
count,7337194,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0
mean,2014-10-30 13:33:29.071412224,42.99356,17.65689,49.68303,68.59634,30.01985,9.598144,6.647163,0.1599805,3.262926,94339.87,0.2853681,0.02747494,11.5
min,2014-05-01 00:00:00,2.0,11.0,13.0,24.0,29.63,4.0,0.0,0.0,0.0,94041.0,0.0,0.0,0.0
25%,2014-07-31 01:55:00,24.0,15.0,46.0,63.0,29.93,10.0,4.0,0.0,1.0,94107.0,0.0,0.0,6.0
50%,2014-10-30 20:45:00,42.0,15.0,50.0,69.0,30.0,10.0,6.0,0.0,3.0,94107.0,0.0,0.0,11.0
75%,2015-01-29 20:30:00,64.0,19.0,55.0,75.0,30.11,10.0,9.0,0.0,5.0,94301.0,1.0,0.0,17.0
max,2015-04-30 23:55:00,84.0,27.0,65.0,96.0,30.41,20.0,23.0,1.0,8.0,95113.0,1.0,1.0,23.0
std,,23.99402,3.982221,6.848807,10.73235,0.1291384,1.313508,3.303102,0.366588,2.229327,424.8337,0.4515896,0.1634628,6.920271


#### Numerical Features

In [9]:
# Identify the numerical columns
num_cols = df.select_dtypes(include=['number']).columns.tolist()

# Show the list of numerical columns
num_cols

['station_id',
 'dock_count',
 'mean_dew_point_f',
 'mean_humidity',
 'mean_sea_level_pressure_inches',
 'mean_visibility_miles',
 'mean_wind_speed_mph',
 'precipitation_inches',
 'cloud_cover',
 'zip_code',
 'weekend',
 'holiday',
 'hour']

In [10]:
# Count the number of numerical columns
print(f'Number of numerical columns: {len(num_cols)}')

Number of numerical columns: 13


#### Categorical Features

In [11]:
# Identify the categorical columns
cate_cols = df.select_dtypes(include='object').columns.tolist()

# Show the list of categorical columns
cate_cols

['usage_rate_category']

In [12]:
# Count the number of categorical columns
print(f'Number of categorical columns: {len(cate_cols)}')

Number of categorical columns: 1


#### Datetime Features

In [13]:
# Identify the datetime columns
dat_cols = df.select_dtypes(include=['datetime']).columns.tolist()

# Show the list of datetime columns
dat_cols

['time']

In [14]:
# Count the number of datetime columns
print(f'Number of datetime columns: {len(dat_cols)}')

Number of datetime columns: 1


## Categorical Variables

#### Mapping the target variable to numerical values

We only have one categorical variable in our dataset. Let's proceed with manually encoding this variable.

In [15]:
# Get the unique values in the 'usage_rate_category' column
df['usage_rate_category'].unique()

array(['Median', 'High', 'Low'], dtype=object)

We will convert the 'usage_rate_category' column into a numerical one, where 0 represents low usage, 1 represents median usage, and 2 represents high usage.

In [16]:
# Convert the 'usage_rate_category' column to a numerical one
df['usage_rate_category'] = df['usage_rate_category'].replace({'Low': 0, 'Median': 1, 'High': 2}).astype(int)

# Verify the conversion
df.sample(10)

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,hour
756310,2014-10-12 12:25:00,80,15,43.0,53.0,29.94,10.0,5.0,0,0.0,95113,1,1,0,12
5188609,2014-09-29 12:50:00,67,27,56.0,77.0,29.99,10.0,8.0,0,5.0,94107,1,0,0,12
123711,2014-05-27 13:15:00,16,15,46.0,59.0,30.03,10.0,8.0,0,0.0,95113,1,0,0,13
5001871,2014-09-10 06:45:00,57,15,56.0,66.0,29.9,9.0,5.0,0,1.0,94107,0,0,0,6
2518889,2014-06-23 12:05:00,28,23,52.0,68.0,29.96,10.0,7.0,0,2.0,94041,1,0,0,12
3903372,2014-05-24 13:20:00,50,23,51.0,71.0,29.99,10.0,10.0,0,3.0,94107,1,1,0,13
177950,2014-06-08 21:10:00,11,19,53.0,55.0,29.77,10.0,5.0,0,0.0,95113,0,1,0,21
2485491,2014-06-06 12:55:00,31,15,54.0,70.0,29.84,10.0,7.0,0,2.0,94041,1,0,0,12
3108476,2015-04-12 10:35:00,30,15,47.0,58.0,30.03,10.0,5.0,0,0.0,94041,1,1,0,10
2459067,2014-05-24 18:55:00,30,15,53.0,73.0,29.99,10.0,8.0,0,0.0,94041,1,1,0,18


In [17]:
# Verify that no categorical columns remain in the dataset
df.select_dtypes(['object']).columns

Index([], dtype='object')
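
As a side note, the same ordinal encoding can be written with `Series.map`, which makes it easy to catch labels that are missing from the dictionary. The snippet below is a minimal sketch on a hypothetical toy series (`usage` and `category_order` are illustrative names, not part of the notebook above).

```python
import pandas as pd

# Hypothetical toy column standing in for df['usage_rate_category']
usage = pd.Series(['Median', 'High', 'Low', 'Median'], name='usage_rate_category')

# Ordinal mapping used in this notebook: Low < Median < High
category_order = {'Low': 0, 'Median': 1, 'High': 2}

# Series.map returns NaN for labels missing from the dictionary,
# so unexpected values can be detected before casting to int
encoded = usage.map(category_order)
assert encoded.notna().all(), 'Found a label that is not in category_order'
encoded = encoded.astype(int)

print(encoded.tolist())  # [1, 2, 0, 1]
```

Unlike `replace`, which leaves unknown labels untouched and would then fail inside `astype(int)`, this version fails at the explicit check with a readable message.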

## Datetime Variables

#### Extracting datetime variables

We have one datetime variable. Let's extract useful date/time components: month, day, hour, and minute. Since we already have the 'hour' column, we will drop it first so that the date/time information appears in the following order in our dataframe: month, day, hour, minute.

In [18]:
# Drop the column 'hour'
df = df.drop('hour', axis=1)

# Extract datetime components and create new columns
df['month'] = df['time'].dt.month
df['day'] = df['time'].dt.day
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute

# Verify the conversion
df.sample(15)

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,month,day,hour,minute
1793336,2014-06-27 21:05:00,23,15,55.0,77.0,29.98,10.0,6.0,0,1.0,94063,1,0,0,6,27,21,5
3226223,2014-06-25 20:50:00,38,15,58.0,71.0,29.99,14.0,7.0,0,5.0,94301,1,0,0,6,25,20,50
635075,2014-09-15 06:50:00,16,15,56.0,67.0,29.89,10.0,7.0,0,2.0,95113,0,0,0,9,15,6,50
6952696,2015-03-23 07:05:00,69,23,49.0,71.0,30.25,10.0,10.0,1,6.0,94107,0,0,0,3,23,7,5
6231978,2015-01-11 08:15:00,48,15,46.0,87.0,30.09,6.0,3.0,0,5.0,94107,1,1,0,1,11,8,15
7009691,2015-03-29 04:40:00,57,15,49.0,67.0,30.08,10.0,6.0,0,1.0,94107,1,1,0,3,29,4,40
4523221,2014-07-24 07:30:00,73,15,57.0,63.0,29.96,10.0,7.0,0,2.0,94107,1,0,0,7,24,7,30
3439268,2014-11-21 10:40:00,37,11,53.0,77.0,30.09,8.0,4.0,0,6.0,94301,1,0,0,11,21,10,40
5400722,2014-10-20 00:55:00,69,23,55.0,68.0,29.97,9.0,9.0,1,5.0,94107,0,0,0,10,20,0,55
4438272,2014-07-16 08:25:00,58,19,57.0,74.0,30.0,10.0,14.0,0,6.0,94107,1,0,0,7,16,8,25


In [19]:
# Drop the column 'time'
df = df.drop('time', axis=1)

In [20]:
# Look at the first 5 rows
df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,month,day,hour,minute
0,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,0
1,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,5
2,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,10
3,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,15
4,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,20


In [21]:
# Verify that no datetime columns remain in the dataset
df.select_dtypes(['datetime']).columns

Index([], dtype='object')
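
To make the extraction step easier to follow, here is a minimal sketch of the same `.dt` accessor pattern on a hypothetical two-row frame (`toy` is an illustrative name; only the 'time' column name is taken from the dataset).

```python
import pandas as pd

# Hypothetical two-row frame with the same 'time' column name as the dataset
toy = pd.DataFrame({'time': pd.to_datetime(['2014-05-01 00:05:00',
                                            '2015-04-30 23:55:00'])})

# Pull each component out of the timestamp with the .dt accessor
for part in ['month', 'day', 'hour', 'minute']:
    toy[part] = getattr(toy['time'].dt, part)

# Once the components exist, the raw timestamp can be dropped
toy = toy.drop(columns='time')
print(toy)
```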

In [22]:
# Check the shape of the dataset
df.shape

(7337194, 17)

In [23]:
# Get a quick overview of dataset variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7337194 entries, 0 to 7337193
Data columns (total 17 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   station_id                      int64  
 1   dock_count                      int64  
 2   mean_dew_point_f                float64
 3   mean_humidity                   float64
 4   mean_sea_level_pressure_inches  float64
 5   mean_visibility_miles           float64
 6   mean_wind_speed_mph             float64
 7   precipitation_inches            int64  
 8   cloud_cover                     float64
 9   zip_code                        int64  
 10  usage_rate_category             int64  
 11  weekend                         int64  
 12  holiday                         int64  
 13  month                           int32  
 14  day                             int32  
 15  hour                            int32  
 16  minute                          int32  
dtypes: float64(6), int32(4), in

Our current dataset contains 7,337,194 rows. Applying SMOTE to address the class imbalance would oversample the minority classes and, by our estimate, produce over 12 million rows. To keep the computational workload manageable, we subsample the data to 30-minute intervals by keeping only the records taken at minute 0 and minute 30 of each hour.

In [29]:
# Resampling our data to 30-minute intervals
resampled_df = df[df['minute'].isin([0, 30])]

In [30]:
# Check the shape of the dataset before train/test split
resampled_df.shape

(1222903, 17)
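
The effect of this filter can be checked on a small example. The sketch below uses a hypothetical one-hour stretch of 5-minute readings (`toy` and `kept` are illustrative names) and shows that the filter keeps 2 of every 12 readings, or roughly one sixth of the rows.

```python
import pandas as pd

# Hypothetical one-hour stretch of 5-minute readings for a single station
toy = pd.DataFrame({'time': pd.date_range('2014-05-01 00:00', periods=12, freq='5min')})
toy['minute'] = toy['time'].dt.minute

# Same filter as above: keep only readings taken on the hour or half hour
kept = toy[toy['minute'].isin([0, 30])]

print(len(toy), len(kept))  # 12 2
print(round(7337194 / 6))   # 1222866, close to the 1,222,903 rows actually kept
```

The small gap between the one-sixth estimate and the actual row count is expected if the 5-minute readings are not perfectly regular for every station, which is an assumption here rather than something verified in this notebook.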

In [48]:
# Get a quick overview of dataset variables
resampled_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1222903 entries, 0 to 7337188
Data columns (total 17 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   station_id                      1222903 non-null  int64  
 1   dock_count                      1222903 non-null  int64  
 2   mean_dew_point_f                1222903 non-null  float64
 3   mean_humidity                   1222903 non-null  float64
 4   mean_sea_level_pressure_inches  1222903 non-null  float64
 5   mean_visibility_miles           1222903 non-null  float64
 6   mean_wind_speed_mph             1222903 non-null  float64
 7   precipitation_inches            1222903 non-null  int64  
 8   cloud_cover                     1222903 non-null  float64
 9   zip_code                        1222903 non-null  int64  
 10  usage_rate_category             1222903 non-null  int64  
 11  weekend                         1222903 non-null  int64  
 12  holid

In [31]:
# Verify the data selection
resampled_df.sample(10)

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,month,day,hour,minute
7121907,61,27,43.0,63.0,30.02,10.0,6.0,0,4.0,94107,1,0,0,4,9,20,0
975786,10,15,52.0,86.0,29.83,9.0,7.0,1,7.0,95113,1,1,0,11,29,22,30
6910221,61,27,49.0,64.0,30.05,10.0,6.0,0,2.0,94107,1,0,0,3,19,19,30
668105,14,19,59.0,70.0,29.97,10.0,6.0,0,4.0,95113,1,0,0,9,23,6,0
450355,13,15,60.0,65.0,29.99,10.0,7.0,1,4.0,95113,0,0,0,8,6,21,30
5286345,56,19,55.0,80.0,29.96,10.0,9.0,0,4.0,94107,1,0,0,10,9,21,30
206466,14,19,49.0,48.0,29.89,10.0,6.0,0,2.0,95113,1,1,0,6,14,21,30
2723981,32,11,48.0,39.0,30.05,10.0,3.0,0,0.0,94041,1,0,0,10,3,0,0
5764923,73,15,44.0,71.0,30.4,10.0,4.0,0,1.0,94107,1,0,0,11,25,15,0
3851300,42,15,49.0,65.0,29.96,10.0,16.0,0,5.0,94107,1,0,0,5,19,18,0


## Train/Test Split and Scaling

#### Train/Test Split

First, let's split the data into train and test sets by using the train_test_split function.

In [32]:
# Define the features and target variable
X = resampled_df.drop(columns='usage_rate_category')
y = resampled_df['usage_rate_category']

# Split the data so the test set contains 20% of the points, 
# (the training set will contain the rest of the points)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
# Show the shape of train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(978322, 16) (244581, 16) (978322,) (244581,)
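
The split above is a plain random split. Because the target is imbalanced, one possible alternative is a stratified split, which keeps the class proportions equal in both sets through the `stratify` parameter of `train_test_split`. The sketch below uses hypothetical toy labels (`X_toy`, `y_toy`) and is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% class 1, 20% class 2, 10% class 0
y_toy = np.array([1] * 70 + [2] * 20 + [0] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr).tolist())  # [8, 56, 16]
print(np.bincount(y_te).tolist())  # [2, 14, 4]
```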


#### Scaling Data

In [34]:
# Initialize a scaler
my_standard_scaler = StandardScaler()

# Fit the scaler on train data and then transform the train data
X_scaled_train = my_standard_scaler.fit_transform(X_train)

# Transform test data using the same scaler
X_scaled_test = my_standard_scaler.transform(X_test)
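
The scaler is fitted on the training data only so that no information from the test set leaks into the preprocessing. A minimal sketch with hypothetical single-feature arrays (`X_tr`, `X_te`) shows that the learned statistics come from the training rows and are simply reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test arrays
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[10.0]])

scaler = StandardScaler().fit(X_tr)

# The learned mean comes from the training rows only: (1 + 2 + 3) / 3 = 2.0
print(scaler.mean_)

# The test row is scaled with those training statistics,
# so it can fall far outside the range seen during training
print(scaler.transform(X_te))
```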

#### Handling Imbalanced Datasets

Let's take a look at the distribution of our target variable.

In [35]:
# Get the percentage of each unique value in 'usage_rate_category' column
resampled_df['usage_rate_category'].value_counts() / len(resampled_df) * 100

usage_rate_category
1    67.185051
2    19.953749
0    12.861200
Name: count, dtype: float64

The data is imbalanced: most records fall in class 1, median usage (67%), followed by class 2, high usage (20%), and class 0, low usage (13%).

We will apply SMOTE to the training set only to address this issue, since SMOTE generates synthetic samples for the minority classes.
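
For intuition, the core idea of SMOTE is to place a new point somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The snippet below is a simplified NumPy illustration of that interpolation step on hypothetical 2-D points; `synthetic_point` is an illustrative helper, not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synthetic_point(points, i, k=2):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # random position along the segment
    return points[i] + lam * (points[j] - points[i])

# The new point lies between (0, 0) and one of its two closest neighbours
new_point = synthetic_point(minority, i=0)
print(new_point)
```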

In [36]:
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the X_scaled_train set
X_sm_train, y_sm_train = smote.fit_resample(X_scaled_train, y_train)

In [37]:
# Print the shape of the new X_train dataset
print(X_sm_train.shape)

(1971942, 16)


In [38]:
# Print the shape of the new y_train dataset
print(y_sm_train.shape)

(1971942,)


In [39]:
# Compare the original class distribution with resampled class distribution
print('Original Class Distribution:')
display(pd.Series(y_train).value_counts().sort_index())

print('\nResampled Class Distribution:')
display(pd.Series(y_sm_train).value_counts().sort_index())

Original Class Distribution:


usage_rate_category
0    125794
1    657314
2    195214
Name: count, dtype: int64


Resampled Class Distribution:


usage_rate_category
0    657314
1    657314
2    657314
Name: count, dtype: int64
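
The resampled size can be verified with simple arithmetic. With its default settings, SMOTE raises every class to the size of the majority class, so the balanced training set has (number of classes) x (majority count) rows. The sketch below uses the class counts printed above; the variable names are illustrative.

```python
# Class counts in y_train before SMOTE, copied from the output above
train_counts = {0: 125794, 1: 657314, 2: 195214}

majority = max(train_counts.values())

# With default settings, every class is raised to the majority count
balanced_rows = majority * len(train_counts)
synthetic_rows = balanced_rows - sum(train_counts.values())

print(balanced_rows)   # 1971942
print(synthetic_rows)  # 993620
```

In other words, roughly half of the rows in the balanced training set are synthetic.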

#### Converting the Data into DataFrames

Now that the features are scaled and the training set has been balanced with SMOTE, we can convert the arrays back into training and testing DataFrames. The test set is scaled but not resampled.

In [40]:
# Column order: the 16 features followed by the target
final_cols = X.columns.tolist() + ['usage_rate_category']

# Convert the SMOTE-resampled training data (X_sm_train and y_sm_train) into a DataFrame
train_df = pd.DataFrame(np.column_stack((X_sm_train, y_sm_train)), columns=final_cols)

# Convert the scaled test data (X_scaled_test and y_test) into a DataFrame
test_df = pd.DataFrame(np.column_stack((X_scaled_test, y_test)), columns=final_cols)
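
One detail worth noting: `np.column_stack` builds a single float array, so the integer target is upcast to float, which is why 'usage_rate_category' shows up as 1.0, 2.0, and so on in the outputs that follow. If an integer target is preferred, one option is to build the frame from the features and attach the target as a separate column. This is a sketch with hypothetical stand-in arrays (`X_arr`, `y_arr`, `feature_names`, `out`), not the code used to produce the saved files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a scaled feature matrix and its integer target
X_arr = np.array([[-0.9, 0.3], [0.5, -1.2]])
y_arr = np.array([1, 2])
feature_names = ['station_id', 'dock_count']

# Build the frame from the features first, then attach the target as its own
# column so it keeps an integer dtype instead of being upcast to float
out = pd.DataFrame(X_arr, columns=feature_names)
out['usage_rate_category'] = y_arr

print(out.dtypes)
```

When the target is a pandas Series with a non-default index, as `y_test` is after the split, passing `y_test.to_numpy()` avoids index alignment when assigning the column.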

#### Train Data

In [41]:
# Look at the first 5 rows of the final train dataset
train_df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,month,day,hour,minute,usage_rate_category
0,-0.915834,-0.66691,-0.392057,1.249166,1.007851,-1.217929,-1.710234,-0.436715,-0.566383,-0.651909,-0.631934,-0.167764,-1.310991,-1.557247,0.649982,-1.000258,1.0
1,-0.040825,-0.66691,0.775875,0.130813,-0.927883,0.30591,0.71217,-0.436715,-0.117856,-0.54837,-0.631934,-0.167764,0.138593,-0.194967,-1.372796,-1.000258,2.0
2,0.542515,0.337471,0.483892,-0.241972,-1.2376,0.30591,1.620572,-0.436715,0.779197,-0.54837,-0.631934,-0.167764,-0.151324,-0.535537,0.072046,-1.000258,1.0
3,-1.41584,-0.66691,-1.55999,-0.894344,1.085281,0.30591,-1.104633,-0.436715,-1.463436,1.818895,1.582445,-0.167764,-1.600907,0.940267,-1.372796,-1.000258,1.0
4,1.542526,-0.66691,0.629884,0.130813,-1.160171,0.30591,-0.196231,-0.436715,-0.117856,1.818895,1.582445,-0.167764,0.718426,-0.98963,-1.083828,-1.000258,1.0


In [42]:
# Print the shape of the train data
train_df.shape

(1971942, 17)

In [43]:
# Sanity check to make sure we don't have any NAs
train_df.isna().sum()

station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
weekend                           0
holiday                           0
month                             0
day                               0
hour                              0
minute                            0
usage_rate_category               0
dtype: int64

#### Test Data

In [44]:
# Look at the first 5 rows of the final test dataset
test_df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,month,day,hour,minute,usage_rate_category
0,-0.707499,-0.66691,0.921867,0.969578,-0.927883,0.30591,-1.104633,-0.436715,-0.117856,-0.651909,1.582445,-0.167764,0.718426,1.39436,-0.072439,-1.000258,1.0
1,-0.457496,-1.671291,1.067858,0.596793,0.156128,-0.456009,0.106569,2.289821,0.33067,-0.703678,1.582445,-0.167764,1.008343,1.05379,1.083435,0.999742,1.0
2,0.125844,-0.66691,-0.68404,-0.98754,-0.773024,0.30591,1.317772,-0.436715,-1.01491,-0.54837,-0.631934,-0.167764,-0.44124,-1.103154,-1.372796,-1.000258,0.0
3,0.167511,0.337471,0.337901,1.249166,0.930422,0.30591,0.71217,-0.436715,-0.566383,-0.54837,-0.631934,-0.167764,-1.021074,1.280837,-0.650375,0.999742,2.0
4,0.459181,-0.66691,-0.246066,0.224009,0.233558,0.30591,-0.196231,-0.436715,-1.01491,-0.54837,-0.631934,-0.167764,-1.021074,0.259127,1.083435,-1.000258,0.0


In [45]:
# Print the shape of the test data
test_df.shape

(244581, 17)

In [46]:
# Sanity check to make sure we don't have any NAs
test_df.isna().sum()

station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
weekend                           0
holiday                           0
month                             0
day                               0
hour                              0
minute                            0
usage_rate_category               0
dtype: int64

In [47]:
# Save the final train and test datasets to Parquet format
train_df.to_parquet('data/resampled_train_dataset.parquet', index=False)
test_df.to_parquet('data/resampled_test_dataset.parquet', index=False)