# Feature Engineering Exercise
- Michael Vincent
- 8/21/22

## Imports

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Load the data

In [2]:
# Load the data
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSWcs7d0Hz9D4QsdQrMVoYA1jH7uRiYk2SzPr0AH6gB0FyqphhumdJAM4ga-Ebg9vzfKGmW751pXHJ2/pub?output=csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


## Clean the data

In [3]:
# Check for missing values and duplicates
print('Duplicates:', df.duplicated().sum())
print('Missing Values:', df.isna().sum().sum())

Duplicates: 0
Missing Values: 0


## Process the data

In [4]:
# Make a copy of the data to perform
# feature engineering on.
fe_df = df.copy()
fe_df['datetime'] = pd.to_datetime(fe_df['datetime'])
fe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB


In [5]:
# Create new columns for month, day of the week, and hour
fe_df['month'] = fe_df['datetime'].dt.month_name()
fe_df['day'] = fe_df['datetime'].dt.day_name()
fe_df['hour'] = fe_df['datetime'].dt.hour.astype('object')

# Drop the datetime and season columns
# as they are now redundant.
fe_df.drop(columns = ['datetime', 'season'], inplace = True)

# Make sure the changes were made
fe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   holiday     10886 non-null  int64  
 1   workingday  10886 non-null  int64  
 2   weather     10886 non-null  int64  
 3   temp        10886 non-null  float64
 4   atemp       10886 non-null  float64
 5   humidity    10886 non-null  int64  
 6   windspeed   10886 non-null  float64
 7   casual      10886 non-null  int64  
 8   registered  10886 non-null  int64  
 9   count       10886 non-null  int64  
 10  month       10886 non-null  object 
 11  day         10886 non-null  object 
 12  hour        10886 non-null  object 
dtypes: float64(3), int64(7), object(3)
memory usage: 1.1+ MB


In [6]:
# Convert the temperatures from degrees Celsius to degrees Fahrenheit
fe_df[['temp', 'atemp']] = fe_df[['temp', 'atemp']] * 1.8 + 32
# Make sure the changes were applied
fe_df[['temp', 'atemp']].head()

Unnamed: 0,temp,atemp
0,49.712,57.911
1,48.236,56.543
2,48.236,56.543
3,49.712,57.911
4,49.712,57.911


In [7]:
# Make a new column temp_variance that is the 
# difference of temp and atemp.
fe_df['temp_variance'] = fe_df['temp'] - fe_df['atemp']
# Drop the atemp column as it is now redundant.
fe_df.drop(columns = 'atemp', inplace = True)
# Make sure the changes were made.
fe_df.head()

Unnamed: 0,holiday,workingday,weather,temp,humidity,windspeed,casual,registered,count,month,day,hour,temp_variance
0,0,0,1,49.712,81,0.0,3,13,16,January,Saturday,0,-8.199
1,0,0,1,48.236,80,0.0,8,32,40,January,Saturday,1,-8.307
2,0,0,1,48.236,80,0.0,5,27,32,January,Saturday,2,-8.307
3,0,0,1,49.712,75,0.0,3,10,13,January,Saturday,3,-8.199
4,0,0,1,49.712,75,0.0,0,1,1,January,Saturday,4,-8.199


In [8]:
# Drop the casual and registered columns from both data frames.
# casual + registered adds up to count (see the rows above), so
# keeping them as features would leak the target.
fe_df.drop(columns = ['casual', 'registered'], inplace = True)
df.drop(columns = ['casual', 'registered'], inplace = True)

# Make sure the columns were dropped.
display(fe_df.head())
display(df.head())

Unnamed: 0,holiday,workingday,weather,temp,humidity,windspeed,count,month,day,hour,temp_variance
0,0,0,1,49.712,81,0.0,16,January,Saturday,0,-8.199
1,0,0,1,48.236,80,0.0,40,January,Saturday,1,-8.307
2,0,0,1,48.236,80,0.0,32,January,Saturday,2,-8.307
3,0,0,1,49.712,75,0.0,13,January,Saturday,3,-8.199
4,0,0,1,49.712,75,0.0,1,January,Saturday,4,-8.199


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,1


## Modeling

In [9]:
# Set the target and features
X = df.drop(columns = 'count')
X_fe = fe_df.drop(columns = 'count')
y = df['count']

In [10]:
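# Split the data. X and X_fe have the same number of rows, so using the
# same random_state selects the same rows in both calls, and
# y_train / y_test line up with X_train_fe / X_test_fe as well.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
X_train_fe, X_test_fe = train_test_split(X_fe, random_state = 42)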
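
The alignment between the two splits can be checked on toy data. The sketch below uses hypothetical stand-ins (`X_toy`, `X_fe_toy`, `y_toy`) rather than the bike data, and also shows an alternative that passes all three objects to a single `train_test_split` call, which makes the alignment explicit instead of relying on a shared `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins (hypothetical data) for X, X_fe, and y
X_toy = pd.DataFrame({'temp': np.arange(20, dtype=float)})
X_fe_toy = pd.DataFrame({'temp_f': 1.8 * np.arange(20) + 32})
y_toy = pd.Series(np.arange(20) * 10)

# Two separate calls with the same random_state, as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
X_fe_tr, X_fe_te = train_test_split(X_fe_toy, random_state=42)
same_rows = (X_tr.index.equals(X_fe_tr.index)
             and X_te.index.equals(X_fe_te.index))

# One call that splits all three objects together
X_tr2, X_te2, X_fe_tr2, X_fe_te2, y_tr2, y_te2 = train_test_split(
    X_toy, X_fe_toy, y_toy, random_state=42)
aligned = (X_tr2.index.equals(y_tr2.index)
           and X_fe_tr2.index.equals(y_tr2.index))

print('separate calls pick the same rows:', same_rows)
print('single call keeps everything aligned:', aligned)
```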

In [11]:
# Make column selectors 
num_selector = make_column_selector(dtype_include = 'number')
cat_selector = make_column_selector(dtype_include = 'object')

In [12]:
# Make tuples for column transformers
num_tuple = (StandardScaler(), num_selector)
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`,
# and later releases removed `sparse`.
cat_tuple = (OneHotEncoder(sparse = False, handle_unknown = 'ignore'), cat_selector)
# Make the column transformer
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder = 'passthrough')
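
For readers on a newer scikit-learn, here is a minimal sketch of the same transformer that tolerates either argument name. The helper `make_dense_encoder` and the toy frame are hypothetical and only illustrate how the selectors route columns; they are not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_dense_encoder():
    """Return a dense one-hot encoder on old and new scikit-learn versions."""
    try:
        # Argument name used from scikit-learn 1.2 onward
        return OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    except TypeError:
        # Older releases only know the `sparse` argument
        return OneHotEncoder(sparse=False, handle_unknown='ignore')


# Toy frame (hypothetical data): one numeric and one object column
toy = pd.DataFrame({
    'temp': [10.0, 20.0, 30.0],
    'day': pd.Series(['Saturday', 'Sunday', 'Saturday'], dtype='object'),
})

toy_transformer = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    (make_dense_encoder(), make_column_selector(dtype_include='object')),
    remainder='passthrough',
)

# One scaled numeric column plus one column per distinct day
encoded = toy_transformer.fit_transform(toy)
print(encoded.shape)
```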

### Linear regression

In [13]:
# Try a linear regression model
lr_pipe = make_pipeline(col_transformer, LinearRegression())

In [14]:
%%time
# Train the linear regression on the data without feature engineering
lr_pipe.fit(X_train, y_train)

CPU times: user 8min 1s, sys: 12.9 s, total: 8min 14s
Wall time: 4min 23s


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b54310>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b542d0>)])),
                ('linearregression', LinearRegression())])

In [15]:
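# Evaluate the linear regression model without feature engineering
# Note: scikit-learn metrics expect (y_true, y_pred). MAE and MSE are
# symmetric, but r2_score is not, so the r2 values reported below
# reflect the swapped argument order used in these cells.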
train_preds = lr_pipe.predict(X_train)
test_preds = lr_pipe.predict(X_test)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores = pd.DataFrame({'lr without fe': [train_mae, 
                                         train_mse, 
                                         train_rmse, 
                                         train_r2,
                                         test_mae,
                                         test_mse,
                                         test_rmse,
                                         test_r2]}, 
                      index = ['train mae',
                               'train mse',
                               'train rmse',
                               'train r2',
                               'test mae',
                               'test mse',
                               'test rmse',
                               'test r2'])
scores

Unnamed: 0,lr without fe
train mae,1.53503e-12
train mse,4.888818e-24
train rmse,2.211067e-12
train r2,1.0
test mae,125.222
test mse,26915.07
test rmse,164.0581
test r2,-1.033968
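
The same eight metric lines are repeated in every evaluation cell, so a small helper may be easier to maintain. The sketch below is one possible version; the name `regression_metrics` and the toy arrays are hypothetical. It passes the arguments in scikit-learn's documented order, `(y_true, y_pred)`, and the toy example shows how much `r2_score` can change when that order is swapped.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, prefix):
    """Return MAE, MSE, RMSE, and R^2 for one split, keyed by prefix."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        f'{prefix} mae': mean_absolute_error(y_true, y_pred),
        f'{prefix} mse': mse,
        f'{prefix} rmse': np.sqrt(mse),
        f'{prefix} r2': r2_score(y_true, y_pred),
    }


# Toy values (hypothetical), not taken from the bike data
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])

right = regression_metrics(y_true, y_pred, 'toy')
swapped_r2 = r2_score(y_pred, y_true)  # arguments in the wrong order

print(right)
print('r2 with swapped arguments:', swapped_r2)
```

With a helper like this, a column of the scores table could be built from `{**regression_metrics(y_train, train_preds, 'train'), **regression_metrics(y_test, test_preds, 'test')}`.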


In [16]:
%%time
# Train the linear regression on the data with feature engineering
lr_pipe.fit(X_train_fe, y_train)

CPU times: user 64.6 ms, sys: 68.9 ms, total: 133 ms
Wall time: 70.7 ms



In [17]:
# Evaluate the linear regression model with feature engineering
train_preds = lr_pipe.predict(X_train_fe)
test_preds = lr_pipe.predict(X_test_fe)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores['lr with fe'] = [train_mae,
                        train_mse,
                        train_rmse,
                        train_r2,
                        test_mae,
                        test_mse,
                        test_rmse,
                        test_r2]
scores

Unnamed: 0,lr without fe,lr with fe
train mae,1.53503e-12,78.501125
train mse,4.888818e-24,11891.052049
train rmse,2.211067e-12,109.046101
train r2,1.0,0.431169
test mae,125.222,80.299653
test mse,26915.07,12284.660032
test rmse,164.0581,110.836186
test r2,-1.033968,0.37631


### Decision tree

In [18]:
# Construct a decision tree model
dt_pipe = make_pipeline(col_transformer, DecisionTreeRegressor())

In [19]:
%%time
# Train the decision tree on the data without feature engineering
dt_pipe.fit(X_train, y_train)

CPU times: user 8.05 s, sys: 86.5 ms, total: 8.14 s
Wall time: 8.08 s


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b54310>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b542d0>)])),
                ('decisiontreeregressor', DecisionTreeRegressor())])

In [20]:
# Evaluate the decision tree without feature engineering
train_preds = dt_pipe.predict(X_train)
test_preds = dt_pipe.predict(X_test)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores['dt without fe'] = [train_mae,
                           train_mse,
                           train_rmse,
                           train_r2,
                           test_mae,
                           test_mse,
                           test_rmse,
                           test_r2]
scores

Unnamed: 0,lr without fe,lr with fe,dt without fe
train mae,1.53503e-12,78.501125,0.0
train mse,4.888818e-24,11891.052049,0.0
train rmse,2.211067e-12,109.046101,0.0
train r2,1.0,0.431169,1.0
test mae,125.222,80.299653,117.729611
test mse,26915.07,12284.660032,31789.332843
test rmse,164.0581,110.836186,178.295633
test r2,-1.033968,0.37631,-0.624423


In [21]:
%%time
# Train the decision tree on the data with feature engineering
dt_pipe.fit(X_train_fe, y_train)

CPU times: user 103 ms, sys: 6.97 ms, total: 110 ms
Wall time: 110 ms



In [22]:
# Evaluate the decision tree with feature engineering
train_preds = dt_pipe.predict(X_train_fe)
test_preds = dt_pipe.predict(X_test_fe)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores['dt with fe'] = [train_mae,
                        train_mse,
                        train_rmse,
                        train_r2,
                        test_mae,
                        test_mse,
                        test_rmse,
                        test_r2]
scores

Unnamed: 0,lr without fe,lr with fe,dt without fe,dt with fe
train mae,1.53503e-12,78.501125,0.0,0.030867
train mse,4.888818e-24,11891.052049,0.0,1.497305
train rmse,2.211067e-12,109.046101,0.0,1.223644
train r2,1.0,0.431169,1.0,0.999954
test mae,125.222,80.299653,117.729611,62.663115
test mse,26915.07,12284.660032,31789.332843,9765.002572
test rmse,164.0581,110.836186,178.295633,98.818028
test r2,-1.033968,0.37631,-0.624423,0.700739


### Random forest

In [23]:
# Construct a random forest model
rf_pipe = make_pipeline(col_transformer, RandomForestRegressor())

In [24]:
%%time
# Fit the random forest on the data without feature engineering
rf_pipe.fit(X_train, y_train)

CPU times: user 6min 3s, sys: 408 ms, total: 6min 3s
Wall time: 6min 6s


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b54310>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f8ac2b542d0>)])),
                ('randomforestregressor', RandomForestRegressor())])

In [25]:
# Evaluate the random forest without feature engineering
train_preds = rf_pipe.predict(X_train)
test_preds = rf_pipe.predict(X_test)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores['rf without fe'] = [train_mae,
                           train_mse,
                           train_rmse,
                           train_r2,
                           test_mae,
                           test_mse,
                           test_rmse,
                           test_r2]
scores

Unnamed: 0,lr without fe,lr with fe,dt without fe,dt with fe,rf without fe
train mae,1.53503e-12,78.501125,0.0,0.030867,37.831006
train mse,4.888818e-24,11891.052049,0.0,1.497305,3405.266008
train rmse,2.211067e-12,109.046101,0.0,1.223644,58.354657
train r2,1.0,0.431169,1.0,0.999954,0.836847
test mae,125.222,80.299653,117.729611,62.663115,101.915312
test mse,26915.07,12284.660032,31789.332843,9765.002572,25004.590061
test rmse,164.0581,110.836186,178.295633,98.818028,158.128397
test r2,-1.033968,0.37631,-0.624423,0.700739,-0.883001


In [26]:
%%time
# Fit the random forest on the data with feature engineering
rf_pipe.fit(X_train_fe, y_train)

CPU times: user 5.1 s, sys: 26 ms, total: 5.13 s
Wall time: 5.11 s



In [27]:
# Evaluate the random forest with feature engineering
train_preds = rf_pipe.predict(X_train_fe)
test_preds = rf_pipe.predict(X_test_fe)
train_mae = mean_absolute_error(train_preds, y_train)
train_mse = mean_squared_error(train_preds, y_train)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(train_preds, y_train)
test_mae = mean_absolute_error(test_preds, y_test)
test_mse = mean_squared_error(test_preds, y_test)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(test_preds, y_test)
scores['rf with fe'] = [train_mae,
                        train_mse,
                        train_rmse,
                        train_r2,
                        test_mae,
                        test_mse,
                        test_rmse,
                        test_r2]
scores

Unnamed: 0,lr without fe,lr with fe,dt without fe,dt with fe,rf without fe,rf with fe
train mae,1.53503e-12,78.501125,0.0,0.030867,37.831006,18.160113
train mse,4.888818e-24,11891.052049,0.0,1.497305,3405.266008,770.633956
train rmse,2.211067e-12,109.046101,0.0,1.223644,58.354657,27.760295
train r2,1.0,0.431169,1.0,0.999954,0.836847,0.973786
test mae,125.222,80.299653,117.729611,62.663115,101.915312,48.394249
test mse,26915.07,12284.660032,31789.332843,9765.002572,25004.590061,5189.064198
test rmse,164.0581,110.836186,178.295633,98.818028,158.128397,72.035159
test r2,-1.033968,0.37631,-0.624423,0.700739,-0.883001,0.800812


## Conclusions

In [28]:
# Display the model metrics rounded to two decimal places.
scores.round(2)

Unnamed: 0,lr without fe,lr with fe,dt without fe,dt with fe,rf without fe,rf with fe
train mae,0.0,78.5,0.0,0.03,37.83,18.16
train mse,0.0,11891.05,0.0,1.5,3405.27,770.63
train rmse,0.0,109.05,0.0,1.22,58.35,27.76
train r2,1.0,0.43,1.0,1.0,0.84,0.97
test mae,125.22,80.3,117.73,62.66,101.92,48.39
test mse,26915.07,12284.66,31789.33,9765.0,25004.59,5189.06
test rmse,164.06,110.84,178.3,98.82,158.13,72.04
test r2,-1.03,0.38,-0.62,0.7,-0.88,0.8


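Feature engineering sped up training considerably. Without it, the linear regression took about 4.5 minutes of wall time (roughly 8 minutes of CPU time) and the random forest took about 6 minutes, while the decision tree took about 8 seconds. With feature engineering, the linear regression and the decision tree each fit in roughly a tenth of a second, and the random forest fit in about 5 seconds. The datetime feature in the original data was likely the culprit: it is stored as text and nearly every hourly timestamp is distinct, so the OneHotEncoder had to create a column for almost every training row (likely thousands of columns). This would also explain why the linear regression and decision tree without feature engineering fit the training data perfectly but did poorly on the test set: test timestamps were never seen during fitting, and handle_unknown = 'ignore' encodes them as all zeros.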
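
The effect of timestamp cardinality on one-hot encoding can be illustrated on a small scale. The sketch below uses 72 synthetic hourly timestamps (hypothetical data, not the bike data) and `pd.get_dummies` as a stand-in for the OneHotEncoder; it compares the number of encoded columns for the raw text timestamps with the number for the derived month, day, and hour columns.

```python
import numpy as np
import pandas as pd

# 72 consecutive hourly timestamps (hypothetical toy data)
stamps = pd.Timestamp('2011-01-01') + pd.to_timedelta(np.arange(72), unit='h')

# Raw version: timestamps kept as text, as in the original data frame
raw = pd.DataFrame({'datetime': stamps.strftime('%Y-%m-%d %H:%M:%S')})

# Engineered version: month, day of the week, and hour as categories
engineered = pd.DataFrame({
    'month': stamps.month_name(),
    'day': stamps.day_name(),
    'hour': stamps.hour.astype(str),
})

# Every distinct timestamp becomes its own column in the raw version
raw_cols = pd.get_dummies(raw).shape[1]
# The engineered version needs one column per month, day, and hour value
fe_cols = pd.get_dummies(engineered).shape[1]

print('distinct raw timestamps:', raw['datetime'].nunique())
print('one-hot columns, raw:', raw_cols)
print('one-hot columns, engineered:', fe_cols)
```

In the raw version the column count grows with the number of rows, whereas in the engineered version it is capped at 12 months, 7 days, and 24 hours no matter how many rows are added.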

The random forest with feature engineering performed the best on the test set overall, with the lowest test MAE (48.39) and test RMSE (72.04) of the six models. The same model without feature engineering took about six minutes to train and scored far worse (test MAE 101.92), so this problem is a good example of why feature engineering is important.