# Feature Engineering - Bike Share System in the SF Bay Area

Author: Owen Hsu

## Table of Contents

1. Feature Overview
2. Loading and Setup
3. Assessment
4. Categorical Variables
5. Datetime Variables
6. Train/Test Split and Scaling

## Feature Overview

| Columns                    | Description                                                        |
|:---------------------------|:------------------------------------------------------------------:|
| Time                       | Recorded Date and Time                                             |
| Station ID                 | Station ID                                                         |
| Dock Number                | The number of docks at the station                                 |
| Mean Dew Point             | The average dew point of the date                                  |
| Mean Humidity              | The average humidity of the date                                   |
| Mean Sea Level Pressure    | The average sea level pressure of the date                         |
| Mean Visibility            | The average visibility of the date                                 |
| Mean Wind Speed            | The average wind speed of the date                                 |
| Precipitation Inches       | Whether precipitation was recorded on the date (1 = yes, 0 = no)   |
| Cloud Cover                | The fraction of the sky covered by clouds on given date            |
| Zip Code                   | The zip code for the weather record                                |
| Usage Rate Category        | Usage Rate Category of the station                                 |
| Holiday                    | Determine whether the date of each record is a holiday or not      |
| Weekend                    | Determine whether the date of each record is a weekend or not      |
| Hour                       | Recorded Time (Hour)                                               |


## Loading and Setup

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler 
# Filter warnings
from warnings import filterwarnings
filterwarnings('ignore')



In [2]:
# Load the dataset
df = pd.read_parquet('data/BikeData_after_EDA.parquet')

## Assessment

In [3]:
# Set display options to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
# Print the shape of the data
df.shape

(7337194, 15)

In [5]:
# Look at the first 5 rows
df.head()

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,hour
0,2014-05-01 00:00:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
1,2014-05-01 00:05:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
2,2014-05-01 00:10:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
3,2014-05-01 00:15:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0
4,2014-05-01 00:20:00,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,Median,0,0,0


In [6]:
# Get a quick overview of dataset variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7337194 entries, 0 to 7337193
Data columns (total 15 columns):
 #   Column                          Dtype         
---  ------                          -----         
 0   time                            datetime64[ns]
 1   station_id                      int64         
 2   dock_count                      int64         
 3   mean_dew_point_f                float64       
 4   mean_humidity                   float64       
 5   mean_sea_level_pressure_inches  float64       
 6   mean_visibility_miles           float64       
 7   mean_wind_speed_mph             float64       
 8   precipitation_inches            int64         
 9   cloud_cover                     float64       
 10  zip_code                        int64         
 11  usage_rate_category             object        
 12  weekend                         int64         
 13  holiday                         int64         
 14  hour                            int32         
dty

In [7]:
# Sanity check to make sure we don't have any NAs
df.isna().sum()

time                              0
station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
usage_rate_category               0
weekend                           0
holiday                           0
hour                              0
dtype: int64
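
With 15 columns, a list of zeros is easy to scan by eye, but the same check can be made programmatic so that it fails loudly if the upstream data changes. The following is a minimal sketch on a hypothetical toy frame; the helper name `columns_with_missing` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bike share data
toy = pd.DataFrame({
    "dock_count": [27, 15, 19],
    "mean_humidity": [41.0, np.nan, 69.0],
})

def columns_with_missing(frame):
    """Return the names of columns that contain at least one missing value."""
    na_counts = frame.isna().sum()
    return na_counts[na_counts > 0].index.tolist()

missing_cols = columns_with_missing(toy)
print(missing_cols)
```

On the real dataset, an empty list from this helper would correspond to the all-zero output shown above.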

In [8]:
# Get a statistical summary of the dataset
df.describe()

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,hour
count,7337194,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0,7337194.0
mean,2014-10-30 13:33:29.071412224,42.99356,17.65689,49.68303,68.59634,30.01985,9.598144,6.647163,0.1599805,3.262926,94339.87,0.2853681,0.02747494,11.5
min,2014-05-01 00:00:00,2.0,11.0,13.0,24.0,29.63,4.0,0.0,0.0,0.0,94041.0,0.0,0.0,0.0
25%,2014-07-31 01:55:00,24.0,15.0,46.0,63.0,29.93,10.0,4.0,0.0,1.0,94107.0,0.0,0.0,6.0
50%,2014-10-30 20:45:00,42.0,15.0,50.0,69.0,30.0,10.0,6.0,0.0,3.0,94107.0,0.0,0.0,11.0
75%,2015-01-29 20:30:00,64.0,19.0,55.0,75.0,30.11,10.0,9.0,0.0,5.0,94301.0,1.0,0.0,17.0
max,2015-04-30 23:55:00,84.0,27.0,65.0,96.0,30.41,20.0,23.0,1.0,8.0,95113.0,1.0,1.0,23.0
std,,23.99402,3.982221,6.848807,10.73235,0.1291384,1.313508,3.303102,0.366588,2.229327,424.8337,0.4515896,0.1634628,6.920271


#### Numerical Features

In [9]:
# Identify the numerical columns
num_cols = df.select_dtypes(include=['number']).columns.tolist()

# Show the list of numerical columns
num_cols

['station_id',
 'dock_count',
 'mean_dew_point_f',
 'mean_humidity',
 'mean_sea_level_pressure_inches',
 'mean_visibility_miles',
 'mean_wind_speed_mph',
 'precipitation_inches',
 'cloud_cover',
 'zip_code',
 'weekend',
 'holiday',
 'hour']

In [10]:
# Count the number of numerical columns
print(f'Number of numerical columns: {len(num_cols)}')

Number of numerical columns: 13


#### Categorical Features

In [11]:
# Identify the categorical columns
cate_cols = df.select_dtypes(include='object').columns.tolist()

# Show the list of categorical columns
cate_cols

['usage_rate_category']

In [12]:
# Count the number of categorical columns
print(f'Number of categorical columns: {len(cate_cols)}')

Number of categorical columns: 1


#### Datetime Features

In [13]:
# Identify the datetime columns
dat_cols = df.select_dtypes(include=['datetime']).columns.tolist()

# Show the list of datetime columns
dat_cols

['time']

In [14]:
# Count the number of datetime columns
print(f'Number of datetime columns: {len(dat_cols)}')

Number of datetime columns: 1
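
The three `select_dtypes` calls above can also be collected into one dictionary, which makes it easy to confirm that every column falls into exactly one group (13 + 1 + 1 = 15 in our dataset). This is a small sketch on a hypothetical four-column toy frame; the name `column_groups` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind found in the dataset
toy = pd.DataFrame({
    "time": pd.to_datetime(["2014-05-01 00:00:00", "2014-05-01 00:05:00"]),
    "dock_count": [27, 27],
    "mean_humidity": [41.0, 41.0],
    # dtype=object is set explicitly so the column is treated as categorical text
    "usage_rate_category": pd.Series(["Median", "High"], dtype=object),
})

# Group the column names by dtype family
column_groups = {
    "numerical": toy.select_dtypes(include="number").columns.tolist(),
    "categorical": toy.select_dtypes(include="object").columns.tolist(),
    "datetime": toy.select_dtypes(include="datetime").columns.tolist(),
}

# Every column should land in exactly one group
n_grouped = sum(len(cols) for cols in column_groups.values())
print(column_groups)
print(n_grouped == toy.shape[1])
```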


## Categorical Variables

#### Mapping the target variable to numerical values

We only have one categorical variable in our dataset. Let's proceed with manually encoding this variable.

In [15]:
# Get the unique values in the 'usage_rate_category' column
df['usage_rate_category'].unique()

array(['Median', 'High', 'Low'], dtype=object)

We will convert the 'usage_rate_category' column to a numerical one, where 0 represents low usage, 1 represents median usage, and 2 represents high usage.

In [16]:
# Convert the 'usage_rate_category' column to a numerical one
df['usage_rate_category'] = df['usage_rate_category'].replace({'Low': 0, 'Median': 1, 'High': 2}).astype(int)

# Verify the conversion
df.sample(10)

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,hour
946560,2014-11-23 11:00:00,5,19,42.0,64.0,30.29,10.0,5.0,0,2.0,95113,1,1,0,11
4471036,2014-07-19 02:45:00,67,27,56.0,71.0,29.96,10.0,10.0,0,6.0,94107,1,1,0,2
221670,2014-06-18 18:30:00,3,15,42.0,43.0,29.93,10.0,7.0,0,0.0,95113,1,0,0,18
2135692,2014-12-14 23:40:00,83,15,47.0,86.0,30.08,9.0,1.0,1,6.0,94063,1,1,0,23
6736460,2015-03-02 00:25:00,50,23,44.0,69.0,29.86,10.0,5.0,1,5.0,94107,0,0,0,0
4451629,2014-07-17 17:30:00,69,23,56.0,68.0,29.99,10.0,13.0,0,4.0,94107,2,0,0,17
6535898,2015-02-10 14:55:00,55,23,45.0,71.0,30.17,10.0,4.0,0,3.0,94107,2,0,0,14
2145187,2014-12-19 22:55:00,25,15,52.0,89.0,30.15,9.0,6.0,0,8.0,94063,2,0,0,22
3072108,2015-03-25 01:35:00,30,15,50.0,65.0,30.22,10.0,5.0,0,0.0,94041,0,0,0,1
7146661,2015-04-11 06:30:00,82,15,48.0,70.0,30.05,10.0,7.0,0,4.0,94107,2,1,0,6


In [17]:
# Verify that no categorical columns remain in the dataset
df.select_dtypes(['object']).columns

Index([], dtype='object')
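
As a possible alternative to `replace`, the same ordinal encoding can be written with `Series.map`, which returns `NaN` for any label missing from the dictionary and so makes unexpected categories easy to detect before casting to integers. The sketch below runs on a hypothetical toy column; the name `usage_order` is an assumption, and the mapping itself is the one used above.

```python
import pandas as pd

# Ordinal mapping used in this notebook: Low < Median < High
usage_order = {"Low": 0, "Median": 1, "High": 2}

# Hypothetical toy column standing in for df['usage_rate_category']
toy = pd.DataFrame({
    "usage_rate_category": pd.Series(["Median", "High", "Low", "Median"], dtype=object)
})

encoded = toy["usage_rate_category"].map(usage_order)

# map() yields NaN for labels that are not in the dictionary,
# so check for unmapped values before casting to int
n_unmapped = int(encoded.isna().sum())
if n_unmapped:
    raise ValueError(f"{n_unmapped} rows have an unexpected category label")

toy["usage_rate_category"] = encoded.astype(int)
print(toy["usage_rate_category"].tolist())
```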

## Datetime Variables

#### Extracting datetime variables

We have one datetime variable. Let's extract useful date/time components: month, day, hour, and minute. Since we already have the 'hour' column, we will drop it first and re-extract it so that the date/time information appears in the following order in our dataframe: month, day, hour, minute.

In [18]:
# Drop the column 'hour'
df = df.drop('hour', axis=1)

# Extract datetime components and create new columns
df['month'] = df['time'].dt.month
df['day'] = df['time'].dt.day
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute

# Verify the conversion
df.sample(15)

Unnamed: 0,time,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,month,day,hour,minute
4031862,2014-06-05 16:50:00,82,15,51.0,74.0,29.85,9.0,11.0,0,3.0,94107,2,0,0,6,5,16,50
3884337,2014-05-22 11:05:00,56,19,50.0,72.0,29.98,10.0,11.0,0,6.0,94107,1,0,0,5,22,11,5
6843893,2015-03-12 12:10:00,76,19,51.0,72.0,30.22,10.0,5.0,0,3.0,94107,1,0,0,3,12,12,10
682392,2014-09-26 20:35:00,16,15,53.0,67.0,29.97,10.0,7.0,1,4.0,95113,1,0,0,9,26,20,35
5633318,2014-11-12 15:55:00,71,19,46.0,56.0,29.99,10.0,4.0,1,7.0,94107,1,0,0,11,12,15,55
2528686,2014-06-28 12:30:00,27,15,53.0,56.0,29.96,10.0,8.0,0,0.0,94041,1,1,0,6,28,12,30
6549190,2015-02-11 18:35:00,66,19,46.0,78.0,30.15,10.0,3.0,0,4.0,94107,2,0,0,2,11,18,35
786179,2014-10-19 06:25:00,8,15,56.0,77.0,29.9,8.0,5.0,0,4.0,95113,1,1,0,10,19,6,25
5576401,2014-11-07 00:50:00,47,19,54.0,81.0,30.18,5.0,4.0,1,3.0,94107,2,0,0,11,7,0,50
2824104,2014-11-22 15:35:00,29,23,54.0,77.0,30.14,10.0,7.0,1,5.0,94041,1,1,0,11,22,15,35
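
To make the `.dt` accessor behaviour concrete, here is a small self-contained sketch that applies the same extraction to two timestamps taken from the sample above. The toy frame is an assumption for illustration; the extraction calls mirror the cell above.

```python
import pandas as pd

# Two timestamps copied from the sample output above
toy = pd.DataFrame({
    "time": pd.to_datetime(["2014-06-05 16:50:00", "2014-11-07 00:50:00"])
})

# Extract the components in the order month, day, hour, minute
toy["month"] = toy["time"].dt.month
toy["day"] = toy["time"].dt.day
toy["hour"] = toy["time"].dt.hour
toy["minute"] = toy["time"].dt.minute

print(toy.drop(columns="time").values.tolist())
```

Note that the year is not extracted here, so records from May 2014 and a hypothetical May 2015 would share the same month value; the dataset spans 2014-05-01 to 2015-04-30, so each month occurs only once.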


In [19]:
# Drop the column 'time'
df = df.drop('time', axis=1)

In [20]:
# Look at the first 5 rows
df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,usage_rate_category,weekend,holiday,month,day,hour,minute
0,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,0
1,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,5
2,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,10
3,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,15
4,2,27,45.0,41.0,30.06,10.0,6.0,0,3.0,95113,1,0,0,5,1,0,20


In [21]:
# Verify that no datetime columns remain in the dataset
df.select_dtypes(['datetime']).columns

Index([], dtype='object')

In [22]:
# Double check the shape of the dataset before train/test split
df.shape

(7337194, 17)

In [23]:
# Get a quick overview of dataset variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7337194 entries, 0 to 7337193
Data columns (total 17 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   station_id                      int64  
 1   dock_count                      int64  
 2   mean_dew_point_f                float64
 3   mean_humidity                   float64
 4   mean_sea_level_pressure_inches  float64
 5   mean_visibility_miles           float64
 6   mean_wind_speed_mph             float64
 7   precipitation_inches            int64  
 8   cloud_cover                     float64
 9   zip_code                        int64  
 10  usage_rate_category             int64  
 11  weekend                         int64  
 12  holiday                         int64  
 13  month                           int32  
 14  day                             int32  
 15  hour                            int32  
 16  minute                          int32  
dtypes: float64(6), int32(4), in

## Train/Test Split and Scaling

#### Train/Test Split

First, let's split the data into train and test sets by using the train_test_split function.

In [24]:
# Define the features and target variable
X = df.drop(columns='usage_rate_category')
y = df['usage_rate_category']

# Hold out 20% of the rows as the test set;
# the remaining 80% form the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
# Show the shape of train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(5869755, 16) (1467439, 16) (5869755,) (1467439,)
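
The split above is purely random. When the target classes are imbalanced, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. The sketch below is an illustration on hypothetical toy data with roughly the class mix of our target (13% / 67% / 20%); it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 130 rows of class 0, 670 of class 1, 200 of class 2
y_toy = np.repeat([0, 1, 2], [130, 670, 200])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

With 7.3 million rows, a random split will already be close to proportional, so stratification mainly matters for smaller samples.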


#### Scaling Data

In [26]:
# Initialize a scaler
my_standard_scaler = StandardScaler()

# Fit the scaler on train data and then transform the train data
X_scaled_train = my_standard_scaler.fit_transform(X_train)

# Transform test data using the same scaler
X_scaled_test = my_standard_scaler.transform(X_test)
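
The reason for fitting on the training data only is that the test set is then standardized with the training mean and standard deviation, so no information from the test set leaks into preprocessing. A minimal worked sketch on hypothetical one-column toy arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a single feature
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[20.0], [40.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
X_train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses the training statistics; nothing is re-estimated on the test data
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The scaled training column has mean 0 and unit standard deviation, while the scaled test column generally does not, which is expected.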

#### Handling Imbalanced Datasets

Let's take a look at the distribution of our target variable.

In [27]:
# Get the percentage of each unique value in 'usage_rate_category' column
df['usage_rate_category'].value_counts() / len(df) * 100

usage_rate_category
1    67.176239
2    19.979613
0    12.844147
Name: count, dtype: float64

The data is imbalanced: class 1 (median usage) accounts for 67% of the records, followed by class 2 (high usage, 20%) and class 0 (low usage, 13%).

We will apply SMOTE to the training set only to address this issue, since SMOTE generates synthetic samples for the minority classes; the test set keeps its original class distribution.

In [28]:
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the X_scaled_train set
X_sm_train, y_sm_train = smote.fit_resample(X_scaled_train, y_train)
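
To give some intuition for what SMOTE does, the sketch below illustrates its core idea: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. This is a simplified illustration, not the `imblearn` implementation; the function name `smote_like_samples` and the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Ask for k + 1 neighbours and drop the first column,
    # which for distinct points is the query point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    # Pick a random base sample and one of its k neighbours for each new point
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_new)]

    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1)
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical toy minority class: 20 points in two dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_like_samples(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two existing points, it always lies inside the bounding box of the minority samples.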

In [29]:
# Print the shape of the new X_train dataset
print(X_sm_train.shape)

(11831433, 16)


In [30]:
# Print the shape of the new y_train dataset
print(y_sm_train.shape)

(11831433,)


In [31]:
# Compare the original class distribution with resampled class distribution
print('Original Class Distribution:')
display(pd.Series(y_train).value_counts().sort_index())

print('\nResampled Class Distribution:')
display(pd.Series(y_sm_train).value_counts().sort_index())

Original Class Distribution:


usage_rate_category
0     753475
1    3943811
2    1172469
Name: count, dtype: int64


Resampled Class Distribution:


usage_rate_category
0    3943811
1    3943811
2    3943811
Name: count, dtype: int64
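
The resampled size can be checked by hand. With its default settings, SMOTE oversamples every class except the majority until each matches the majority count, so the total is the majority count times the number of classes. A short arithmetic check using the counts printed above:

```python
# Class counts in y_train, copied from the output above
original_counts = {0: 753475, 1: 3943811, 2: 1172469}

majority = max(original_counts.values())

# Each class is brought up to the majority count
resampled_total = majority * len(original_counts)

# Number of synthetic rows SMOTE has to create per class
synthetic_per_class = {label: majority - n for label, n in original_counts.items()}

print(resampled_total)
print(synthetic_per_class)
```

This reproduces the 11,831,433 rows reported for `X_sm_train`, of which about 5.96 million are synthetic.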

#### Converting the Data into DataFrames

We have now scaled both sets and resampled the training set, so we can convert the resulting arrays back into training and testing DataFrames.

In [32]:
# Convert the resampled training data (X_sm_train and y_sm_train) into a DataFrame
train_df = pd.DataFrame(np.column_stack((X_sm_train, y_sm_train)), columns=df.drop('usage_rate_category', axis=1).columns.tolist() + ['usage_rate_category'])

# Convert the scaled test data (X_scaled_test and y_test) into a DataFrame
test_df = pd.DataFrame(np.column_stack((X_scaled_test, y_test)), columns=df.drop('usage_rate_category', axis=1).columns.tolist() + ['usage_rate_category'])
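
One side effect of `np.column_stack` is that the integer target is cast to float, because a NumPy array holds a single dtype. If integer labels are preferred, a possible alternative is to build the DataFrame from the features and then attach the target as a separate column. The sketch below uses hypothetical toy arrays; the names `feature_names`, `X_arr`, and `y_ser` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for the scaled features and the target
feature_names = ["dock_count", "mean_humidity"]
X_arr = np.array([[-0.67, 0.41], [0.34, -0.43]])
y_ser = pd.Series([1, 0], index=[5, 2])  # shuffled index, as after a random split

# column_stack forces a common dtype, so the labels become floats
stacked = np.column_stack((X_arr, y_ser))
print(stacked.dtype)

# Alternative: build from the features, then attach the target as its own column.
# to_numpy() drops the shuffled index so rows are matched by position, not by label.
frame = pd.DataFrame(X_arr, columns=feature_names)
frame["usage_rate_category"] = y_ser.to_numpy()
print(frame.dtypes.tolist())
```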

#### Train Data

In [33]:
# Look at the first 5 rows of the final train dataset
train_df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,month,day,hour,minute,usage_rate_category
0,-1.375345,-0.667162,-0.099799,-0.428081,0.38837,0.306077,-1.406683,-0.436337,-1.01458,1.820021,1.582616,-0.168116,-1.310568,-0.195224,1.228122,-1.303354,1.0
1,1.125367,0.336991,0.776477,0.410568,-0.15374,0.306077,1.015281,-0.436337,0.779536,-0.548136,1.582616,-0.168116,0.428313,0.14541,0.216731,-0.434267,1.0
2,1.125367,0.336991,-1.122121,0.410568,2.0147,-0.455251,-0.195701,-0.436337,-0.566051,-0.548136,-0.631865,-0.168116,-1.600382,0.713132,-1.372598,-0.723963,0.0
3,0.083403,-0.667162,-0.537937,1.715132,0.543259,-2.739235,-1.103938,-0.436337,0.779536,-0.548136,1.582616,-0.168116,-1.600382,-0.535857,0.939153,-0.434267,1.0
4,0.125082,-0.667162,0.776477,1.06285,-0.463517,0.306077,0.712536,-0.436337,0.331007,-0.548136,-0.631865,-0.168116,1.00794,-0.762946,0.650184,1.593602,1.0


In [34]:
# Print the shape of the train data
train_df.shape

(11831433, 17)

In [35]:
# Sanity check to make sure we don't have any NAs
train_df.isna().sum()

station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
weekend                           0
holiday                           0
month                             0
day                               0
hour                              0
minute                            0
usage_rate_category               0
dtype: int64

#### Test Data

In [36]:
# Look at the first 5 rows of the final test dataset
test_df.head()

Unnamed: 0,station_id,dock_count,mean_dew_point_f,mean_humidity,mean_sea_level_pressure_inches,mean_visibility_miles,mean_wind_speed_mph,precipitation_inches,cloud_cover,zip_code,weekend,holiday,month,day,hour,minute,usage_rate_category
0,0.75026,2.345298,0.922523,0.596934,0.001149,0.306077,-0.801192,2.291809,0.779536,-0.548136,-0.631865,-0.168116,1.00794,0.826677,-1.083629,0.724515,2.0
1,-0.416739,-0.667162,0.776477,1.249216,-0.540961,-0.455251,0.107045,2.291809,-0.117522,-0.703502,-0.631865,-0.168116,1.587567,-0.649401,-1.661567,-0.434267,1.0
2,-0.37506,1.341144,0.192293,-0.800813,0.853036,0.306077,0.40979,-0.436337,-0.117522,-0.091454,1.582616,-0.168116,-1.020755,1.394399,-0.216723,-1.303354,2.0
3,1.33376,0.336991,0.922523,0.317385,-0.540961,0.306077,0.712536,-0.436337,0.779536,-0.548136,-0.631865,-0.168116,0.138499,1.735032,1.661576,0.724515,1.0
4,-1.667095,-0.667162,-0.537937,-0.893996,-0.463517,0.306077,0.712536,-0.436337,0.779536,1.820021,-0.631865,-0.168116,-0.441128,0.372499,-0.361207,0.145124,1.0


In [37]:
# Print the shape of the test data
test_df.shape

(1467439, 17)

In [38]:
# Sanity check to make sure we don't have any NAs
test_df.isna().sum()

station_id                        0
dock_count                        0
mean_dew_point_f                  0
mean_humidity                     0
mean_sea_level_pressure_inches    0
mean_visibility_miles             0
mean_wind_speed_mph               0
precipitation_inches              0
cloud_cover                       0
zip_code                          0
weekend                           0
holiday                           0
month                             0
day                               0
hour                              0
minute                            0
usage_rate_category               0
dtype: int64

In [39]:
# Save the final train and test datasets to Parquet format
train_df.to_parquet('data/train_dataset.parquet', index=False)
test_df.to_parquet('data/test_dataset.parquet', index=False)