<a href="https://colab.research.google.com/github/juDEcorous/ML-Regression/blob/main/CORE_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CORE: Feature Engineering 
Name: Jude Maico Jr.

Your task is to engineer some new features to try to improve a model's ability to predict the total number of bike share rentals during a given hour of the day.

1. Import the data, then drop the 'casual' and 'registered' columns. These are redundant with your target, 'count'.
2. Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:
    1. Name of the Month
    2. Name of the Day of the Week
    3. Hour of the Day
        1. Make sure all 3 new columns are 'object' datatype so they can be one-hot encoded later.
        2. Drop the 'datetime' and 'season' columns. These are now redundant.
3. The temperatures in the 'temp' and 'atemp' columns are in Celsius. Use `.apply()` and a Lambda function to convert them to Fahrenheit.
4. Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp'). If the current temperature is warmer than average ('atemp'), the value in 'temp_variance' should be positive.
    1. Drop the 'atemp' column.
    
Optional:
- Use a predictive model of your choice and try to predict the 'count' of hourly bike-share users with both the original features and the engineered feature set you created.

- Remember to drop the 'casual' and 'registered' columns from both versions before modeling.

- Did these feature engineering choices improve your ability to predict the 'count'?

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression

In [8]:
df = pd.read_csv('/content/drive/MyDrive/Datas/bikeshare_train - bikeshare_train.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [9]:
df.duplicated().sum()

0

No duplicates or missing values were found.

In [10]:
df = df.drop(columns = ['casual', 'registered'])
df.sample(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
9031,2012-08-18 16:00:00,3,0,0,1,31.16,33.335,37,16.9979,641
1080,2011-03-09 13:00:00,1,0,1,2,13.12,14.395,76,23.9994,99
1578,2011-04-11 16:00:00,2,0,1,1,30.34,33.335,48,35.0008,235


In [11]:
# for optional assignment
df2 = df.copy()
df2.sample(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
3915,2011-09-14 4:00:00,3,0,1,1,24.6,28.03,83,6.0032,9
3840,2011-09-11 0:00:00,3,0,0,1,25.42,28.03,88,0.0,108
3206,2011-08-03 12:00:00,3,0,1,3,31.16,34.85,55,8.9981,161


In [12]:
df['datetime'] = pd.to_datetime(df['datetime'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB


In [13]:
#name of the month
df['month'] = df['datetime'].dt.month_name()

#day of the week
df['week_day'] = df['datetime'].dt.day_name()

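#time of the day: six 4-hour bins (0-3 -> 1, 4-7 -> 2, ..., 20-23 -> 6)
df['day_time'] = (df['datetime'].dt.hour % 24 + 4) // 4
# assign the mapped labels back instead of replacing in place on a column selection
df['day_time'] = df['day_time'].map({1: 'Late Night',
                                     2: 'Early Morning',
                                     3: 'Morning',
                                     4: 'Noon',
                                     5: 'Evening',
                                     6: 'Night'})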

df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month,week_day,day_time
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,Late Night
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,Late Night
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,Late Night
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,Late Night
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,Early Morning
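
The cell above groups the hour into six four-hour bins, whereas the task statement asks for the hour of the day itself. A minimal sketch of both options on a toy frame follows; the frame `toy` and the column name `hour` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy timestamps (illustrative only, not the bike-share data)
toy = pd.DataFrame({'datetime': pd.to_datetime(
    ['2011-01-01 00:00:00', '2011-01-01 07:00:00',
     '2011-01-01 12:00:00', '2011-01-01 23:00:00'])})

# Option 1: raw hour of the day, cast to strings so an encoder
# treats it as a category rather than a number
toy['hour'] = toy['datetime'].dt.hour.astype(str)

# Option 2: the notebook's binning; (hour + 4) // 4 maps
# 0-3 -> 1, 4-7 -> 2, 8-11 -> 3, 12-15 -> 4, 16-19 -> 5, 20-23 -> 6
labels = {1: 'Late Night', 2: 'Early Morning', 3: 'Morning',
          4: 'Noon', 5: 'Evening', 6: 'Night'}
toy['day_time'] = ((toy['datetime'].dt.hour + 4) // 4).map(labels)

print(toy[['hour', 'day_time']])
```

Keeping the raw hour would give the encoder 24 categories instead of 6, which may capture rush-hour peaks that a four-hour bin averages out; whether that helps here has not been tested in this notebook.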


In [14]:
df.drop(columns = ['datetime', 'season'], inplace = True)
df.columns

Index(['holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity',
       'windspeed', 'count', 'month', 'week_day', 'day_time'],
      dtype='object')

In [15]:
df[['temp','atemp']] = df[['temp', 'atemp']].apply(lambda x: (1.8 * x) +32)
df.rename(columns = {'temp' : 'fahrenheit_temp'}, inplace = True)
df.head()

Unnamed: 0,holiday,workingday,weather,fahrenheit_temp,atemp,humidity,windspeed,count,month,week_day,day_time
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,Late Night
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,Late Night
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,Late Night
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,Late Night
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,Early Morning
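
As a quick check of the conversion, the sketch below applies the same formula, F = 1.8 * C + 32, to the first two Celsius rows shown earlier in the notebook. The frame `toy` is illustrative. Note that `DataFrame.apply` passes each column to the lambda as a whole Series, so the arithmetic is vectorized.

```python
import pandas as pd

# First two rows of 'temp' and 'atemp' from the notebook output, in Celsius
toy = pd.DataFrame({'temp': [9.84, 9.02], 'atemp': [14.395, 13.635]})

# F = 1.8 * C + 32, applied column by column
fahrenheit = toy[['temp', 'atemp']].apply(lambda c: 1.8 * c + 32)

print(fahrenheit.round(3))
```

The rounded values should agree with the first two rows of the converted table above (49.712 and 57.911, then 48.236 and 56.543).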


In [16]:
median_temp = df['atemp'].median()

def thermal(temp):
    if temp > median_temp:
        return 'Warm'
    else:
        return 'Cold'
    
df['atemp'] = df['atemp'].apply(thermal)
df.rename(columns = {'atemp' : 'temp_variance'}, inplace = True)
df.sample(5)

Unnamed: 0,holiday,workingday,weather,fahrenheit_temp,temp_variance,humidity,windspeed,count,month,week_day,day_time
5917,0,1,1,61.52,Cold,43,31.0009,410,February,Thursday,Evening
7097,0,0,1,60.044,Cold,66,8.9981,56,April,Saturday,Late Night
9424,0,0,1,71.852,Warm,64,0.0,117,September,Sunday,Late Night
6645,0,1,1,67.424,Cold,67,6.0032,5,March,Wednesday,Early Morning
8443,0,1,2,80.708,Warm,61,0.0,11,July,Friday,Early Morning
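
The task statement describes 'temp_variance' as a signed number (positive when 'temp' is warmer than 'atemp'), while the cell above stores a Warm/Cold label based on the median of 'atemp'. A minimal sketch of the numeric version is shown below. The first two rows reuse Fahrenheit values from the notebook output; the third row is made up to show a positive result.

```python
import pandas as pd

toy = pd.DataFrame({
    # rows 0-1: values from the notebook output; row 2: hypothetical
    'fahrenheit_temp': [49.712, 48.236, 70.0],
    'atemp': [57.911, 56.543, 65.0],
})

# Positive when the current temperature is above the average ('atemp')
toy['temp_variance'] = toy['fahrenheit_temp'] - toy['atemp']

# 'atemp' is no longer needed once the difference is stored
toy = toy.drop(columns='atemp')

print(toy.round(3))
```

With this version 'temp_variance' would be a numeric feature and would belong with the scaled columns rather than the one-hot encoded ones.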


# Optional Assignment:

## Preparing the Dataset for ML

In [17]:
data_types = df.dtypes
object_data_types = data_types[(data_types == 'object')]

for column in object_data_types.index:
  print(column)
  print('\n')
  print(f'Unique Values: \n{df[column].unique()}')
  print('\n')

temp_variance


Unique Values: 
['Cold' 'Warm']


month


Unique Values: 
['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August'
 'September' 'October' 'November' 'December']


week_day


Unique Values: 
['Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday']


day_time


Unique Values: 
['Late Night' 'Early Morning' 'Morning' 'Noon' 'Evening' 'Night']




In [18]:
data_types = df2.dtypes
object_data_types = data_types[(data_types == 'object')]

for column in object_data_types.index:
  print(column)
  print('\n')
  print(f'Unique Values: \n{df2[column].unique()}')
  print('\n')

datetime


Unique Values: 
['2011-01-01 0:00:00' '2011-01-01 1:00:00' '2011-01-01 2:00:00' ...
 '2012-12-19 21:00:00' '2012-12-19 22:00:00' '2012-12-19 23:00:00']




In [19]:
df.describe()

Unnamed: 0,holiday,workingday,weather,fahrenheit_temp,humidity,windspeed,count
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,0.028569,0.680875,1.418427,68.415548,61.88646,12.799395,191.574132
std,0.166599,0.466159,0.633839,14.024862,19.245033,8.164537,181.144454
min,0.0,0.0,1.0,33.476,0.0,0.0,1.0
25%,0.0,0.0,1.0,57.092,47.0,7.0015,42.0
50%,0.0,1.0,1.0,68.9,62.0,12.998,145.0
75%,0.0,1.0,2.0,79.232,77.0,16.9979,284.0
max,1.0,1.0,4.0,105.8,100.0,56.9969,977.0


In [20]:
df2.describe()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,2.506614,0.028569,0.680875,1.418427,20.23086,23.655084,61.88646,12.799395,191.574132
std,1.116174,0.166599,0.466159,0.633839,7.79159,8.474601,19.245033,8.164537,181.144454
min,1.0,0.0,0.0,1.0,0.82,0.76,0.0,0.0,1.0
25%,2.0,0.0,0.0,1.0,13.94,16.665,47.0,7.0015,42.0
50%,3.0,0.0,1.0,1.0,20.5,24.24,62.0,12.998,145.0
75%,4.0,0.0,1.0,2.0,26.24,31.06,77.0,16.9979,284.0
max,4.0,1.0,1.0,4.0,41.0,45.455,100.0,56.9969,977.0


In [21]:
#final check on first model
df.info()
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   holiday          10886 non-null  int64  
 1   workingday       10886 non-null  int64  
 2   weather          10886 non-null  int64  
 3   fahrenheit_temp  10886 non-null  float64
 4   temp_variance    10886 non-null  object 
 5   humidity         10886 non-null  int64  
 6   windspeed        10886 non-null  float64
 7   count            10886 non-null  int64  
 8   month            10886 non-null  object 
 9   week_day         10886 non-null  object 
 10  day_time         10886 non-null  object 
dtypes: float64(2), int64(5), object(4)
memory usage: 935.6+ KB


Unnamed: 0,holiday,workingday,weather,fahrenheit_temp,temp_variance,humidity,windspeed,count,month,week_day,day_time
0,0,0,1,49.712,Cold,81,0.0,16,January,Saturday,Late Night
1,0,0,1,48.236,Cold,80,0.0,40,January,Saturday,Late Night
2,0,0,1,48.236,Cold,80,0.0,32,January,Saturday,Late Night


In [22]:
#final check on second model
df2.info()
df2.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   count       10886 non-null  int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 850.6+ KB


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32


Model 1:<br>
**Categorical Features**: holiday, workingday, weather<br>
**Nominal Features**: temp_variance, month, week_day, day_time<br>
**Numerical Features**: fahrenheit_temp, humidity, windspeed

Model 2:<br>
**Categorical Features**: season, holiday, workingday, weather<br>
**Nominal Features**: datetime<br>
**Numerical Features**: temp, atemp, humidity, windspeed

The target, 'count', is numeric, so a **regression model** will be used.

## Model 1 (Feature Engineered)

In [23]:
target = 'count'
y = df[target]
X = df.drop(columns = [target])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [24]:
# Categorical features: already numeric
cat_cols = ['holiday', 'workingday', 'weather']
cat_scale = StandardScaler()
cat_tuples = (cat_scale, cat_cols)

In [25]:
# Nominal features
nom_cols = ['temp_variance', 'month', 'week_day', 'day_time']
# sparse_output replaces the older sparse argument in scikit-learn >= 1.2
ohe = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
nom_tuples = (ohe, nom_cols)

In [26]:
# Numerical features
num_cols = ['fahrenheit_temp', 'humidity', 'windspeed']
num_scaler = StandardScaler()
num_tuples = (num_scaler, num_cols)

In [27]:
preprocessor = make_column_transformer(cat_tuples, nom_tuples, num_tuples,
                                       remainder = 'drop')
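
To make the preprocessor's output more concrete, here is a minimal sketch on a made-up frame with one numeric column, one nominal column, and one column that is not listed and is therefore dropped. The names `toy`, `toy_preprocessor`, and `unused` are illustrative only.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up values
toy = pd.DataFrame({'humidity': [40, 60, 80, 60],
                    'month': ['January', 'July', 'July', 'January'],
                    'unused': [1, 2, 3, 4]})

# sparse_threshold=0 asks the transformer for a dense array regardless of
# what the encoder returns, so no sparse-related encoder argument is needed
toy_preprocessor = make_column_transformer(
    (StandardScaler(), ['humidity']),
    (OneHotEncoder(handle_unknown='ignore'), ['month']),
    remainder='drop', sparse_threshold=0)

out = toy_preprocessor.fit_transform(toy)

# 1 scaled column + 2 one-hot columns; 'unused' is dropped
print(out.shape)
```

By the same counting, and assuming every category appears in the training split, the Model 1 preprocessor would produce 3 scaled categorical columns, 2 + 12 + 7 + 6 = 27 one-hot columns, and 3 scaled numerical columns, for 33 columns in total.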

In [28]:
def performance(model, X_train, X_test, y_train, y_test, 
                preprocessor_number, name):
    
    #pipeline
    model_pipe = make_pipeline(preprocessor_number, model)
    model_pipe.fit(X_train, y_train)
    
    #prediction
    model_test = model_pipe.predict(X_test)
    model_train = model_pipe.predict(X_train)
    
    #train score
    mae_train = mean_absolute_error(y_train, model_train)
    mse_train = mean_squared_error(y_train, model_train)
    rmse_train = np.sqrt(mse_train)
    r2_train = r2_score(y_train, model_train)     
    
    #prediction score
    mae_test = mean_absolute_error(y_test, model_test)
    mse_test = mean_squared_error(y_test, model_test)
    rmse_test = np.sqrt(mse_test)
    r2_test = r2_score(y_test, model_test)   
    
    print(f'{name} Trained Scores:')
    print(f'R^2: {r2_train:.3f} \nMAE: {mae_train:.3f}')
    print(f'MSE: {mse_train:.3f} \nRMSE: {rmse_train:.3f} \n')
          
    print(f'{name} Test Scores:')
    print(f'R^2: {r2_test:.3f} \nMAE: {mae_test:.3f}')
    print(f'MSE: {mse_test:.3f} \nRMSE: {rmse_test:.3f}')
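
For reference, the four scores printed by `performance` can be computed by hand. The sketch below uses made-up targets and predictions and only the standard library; it illustrates the definitions and is not a replacement for the scikit-learn functions used above.

```python
import math

y_true = [3.0, 5.0, 7.0]   # made-up targets
y_pred = [2.0, 5.0, 9.0]   # made-up predictions
n = len(y_true)

# Mean absolute error and mean squared error
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}')
```

MAE and RMSE are in the same units as the target (rentals per hour), while R^2 is the share of the target's variance that the predictions account for.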

In [29]:
linear = LinearRegression()
performance(linear, X_train, X_test, y_train, y_test,
            preprocessor, name = 'Engineered Linear Regression')

Engineered Linear Regression Trained Scores:
R^2: 0.537 
MAE: 90.270
MSE: 15179.346 
RMSE: 123.204 

Engineered Linear Regression Test Scores:
R^2: 0.516 
MAE: 92.311
MSE: 15877.300 
RMSE: 126.005




## Model 2

In [30]:
y2 = df2[target]
X2 = df2.drop(columns = [target])
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, 
                                                        random_state = 42)

In [31]:
# Categorical features for second model
cat_cols2 = ['season', 'holiday', 'workingday', 'weather']
cat_scale2 = StandardScaler()
cat_tuples2 = (cat_scale2, cat_cols2)

In [32]:
# Nominal features for second model
nom_cols2 = ['datetime']
# sparse_output replaces the older sparse argument in scikit-learn >= 1.2
ohe2 = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
nom_tuples2 = (ohe2, nom_cols2)

In [33]:
# Numerical features for second model
num_cols2 = ['temp', 'atemp', 'humidity', 'windspeed']
num_scaler2 = StandardScaler()
num_tuples2 = (num_scaler2, num_cols2)

In [34]:
preprocessor2 = make_column_transformer(cat_tuples2, nom_tuples2, num_tuples2,
                                        remainder = 'drop')

In [35]:
linear2 = LinearRegression()
performance(linear2, X2_train, X2_test, y2_train, y2_test,
            preprocessor2, name = 'Default Model Linear Regression')

Default Model Linear Regression Trained Scores:
R^2: 1.000 
MAE: 0.000
MSE: 0.000 
RMSE: 0.000 

Default Model Linear Regression Test Scores:
R^2: 0.269 
MAE: 114.894
MSE: 23978.123 
RMSE: 154.849


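Observations:<br>
- The default model overfits: it scores perfectly on the training set (R^2 = 1.000, MAE = 0.000) but poorly on the test set (R^2 = 0.269). A likely cause is that one-hot encoding the raw 'datetime' strings gives nearly every training row its own column, so the model can memorize the training targets, while timestamps that appear only in the test set are encoded as all zeros.
- On the test set, the engineered model explains 51.6% of the variance in 'count' (R^2 = 0.516), compared with 26.9% for the default dataset.
- The engineered model also has the lower test MAE (92.311 vs. 114.894), MSE (15877.300 vs. 23978.123), and RMSE (126.005 vs. 154.849).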
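
The overfitting in the default model can be illustrated with a small sketch of what one-hot encoding does to raw timestamp strings. The timestamps below are made up, and the sketch assumes, as seems to be the case in this dataset, that each row has its own timestamp.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up hourly timestamps kept as raw strings, as in Model 2
train = [['2011-01-01 0:00:00'], ['2011-01-01 1:00:00'], ['2011-01-01 2:00:00']]
test = [['2011-01-01 3:00:00'], ['2011-01-01 4:00:00']]

ohe_demo = OneHotEncoder(handle_unknown='ignore')
train_encoded = ohe_demo.fit_transform(train).toarray()
test_encoded = ohe_demo.transform(test).toarray()

print(train_encoded.shape)   # one column per distinct training timestamp
print(test_encoded.sum())    # unseen timestamps contribute nothing
```

With one indicator column per training row, a linear model has enough free coefficients to reproduce every training target exactly, but those columns carry no information for timestamps it has never seen.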

Overall, on this dataset the engineered features perform better than the default features.