# Task

Engineer some new features to try to improve a model's ability to predict the total number of bike share rentals during a given hour of the day.

- Import the data then drop the 'casual' and 'registered' columns. These are redundant with the target, 'count'.
- Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:
  - Name of the Month
  - Name of the Day of the Week
  - Hour of the Day
- Make sure all 3 new columns are 'object' datatype so they can be one-hot encoded later.
- Drop the 'datetime' and 'season' columns. These are now redundant.
- The temperatures in the 'temp' and 'atemp' columns are in Celsius. Use `.apply()` and a lambda function to convert them to Fahrenheit.
- Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp'). If the current temperature is warmer than average ('atemp'), the value in 'temp_variance' should be positive.
- Drop the 'atemp' column.

# Imports

In [1]:
import pandas as pd

# Data Loading

In [2]:
filename = 'https://raw.githubusercontent.com/jaytrey777/Feature-Engineering-Exercise--Core-/main/bikeshare_train.csv'
df = pd.read_csv(filename)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


# Data Prep

## Drop the 'casual' and 'registered' columns

In [3]:
df.drop(columns = ['casual', 'registered'], inplace = True)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,1


## Transform the 'datetime' column into a datetime type and use it to create 3 new columns


In [4]:
df['datetime'] = pd.to_datetime(df['datetime'])  # convert the column to a datetime data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB
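As an aside, the conversion can also happen at load time. The sketch below is an alternative, not what this notebook does: it passes `parse_dates` to `pd.read_csv`. The inline `csv_text` is a made-up three-column stand-in for the real file, used here only so the example runs without a network connection.

```python
import io

import pandas as pd

# Hypothetical miniature of the bike share CSV (only three of its columns).
csv_text = """datetime,temp,count
2011-01-01 00:00:00,9.84,16
2011-01-01 01:00:00,9.02,40
"""

# parse_dates asks read_csv to convert the listed columns while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])
print(parsed.dtypes)
```

Either route should leave 'datetime' with a datetime dtype, so the `.dt` accessor used in the next step works the same way.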


## Create columns:
 - Name of the Month
 - Name of the Day of the Week
 - Hour of the Day

In [5]:
df['month'] = df['datetime'].dt.month_name()  # name of the month, e.g. 'January'
df['day of week'] = df['datetime'].dt.day_name()  # name of the weekday, e.g. 'Saturday'
df['hour of the day'] = df['datetime'].dt.hour  # hour of the day as an integer (0-23)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month,day of week,hour of the day
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


## Ensure all new columns are datatype 'object'

In [6]:
df.info()  # check which of the new columns are already 'object'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   datetime         10886 non-null  datetime64[ns]
 1   season           10886 non-null  int64         
 2   holiday          10886 non-null  int64         
 3   workingday       10886 non-null  int64         
 4   weather          10886 non-null  int64         
 5   temp             10886 non-null  float64       
 6   atemp            10886 non-null  float64       
 7   humidity         10886 non-null  int64         
 8   windspeed        10886 non-null  float64       
 9   count            10886 non-null  int64         
 10  month            10886 non-null  object        
 11  day of week      10886 non-null  object        
 12  hour of the day  10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(2)
memory usage: 1.1+ MB


In [7]:
# only 'hour of the day' needs converting; 'month' and 'day of week' are already 'object'
df['hour of the day'] = df['hour of the day'].astype(dtype = 'object') 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   datetime         10886 non-null  datetime64[ns]
 1   season           10886 non-null  int64         
 2   holiday          10886 non-null  int64         
 3   workingday       10886 non-null  int64         
 4   weather          10886 non-null  int64         
 5   temp             10886 non-null  float64       
 6   atemp            10886 non-null  float64       
 7   humidity         10886 non-null  int64         
 8   windspeed        10886 non-null  float64       
 9   count            10886 non-null  int64         
 10  month            10886 non-null  object        
 11  day of week      10886 non-null  object        
 12  hour of the day  10886 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(6), object(3)
memory usage: 1.1+ MB
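The reason the dtype matters can be sketched with `pd.get_dummies`, which is one possible encoder (the task does not say which one will be used later). By default it encodes object, string, and category columns and passes numeric columns through unchanged. The small `hours` frame below is made up for illustration.

```python
import pandas as pd

# Made-up sample: four rows of an integer hour column.
hours = pd.DataFrame({'hour of the day': [0, 1, 2, 1]})

# With an integer dtype the column is treated as numeric and left alone.
as_int = pd.get_dummies(hours)

# After casting to 'object', each distinct hour gets its own indicator column.
as_obj = pd.get_dummies(hours.astype('object'))

print(list(as_int.columns))
print(list(as_obj.columns))
```

Left as an integer, the hour would be read as an ordered quantity, which may not suit a feature where hour 23 and hour 0 are neighbors.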


## Drop the 'datetime' and 'season' columns

In [8]:
df.drop(columns = ['datetime', 'season'], inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   holiday          10886 non-null  int64  
 1   workingday       10886 non-null  int64  
 2   weather          10886 non-null  int64  
 3   temp             10886 non-null  float64
 4   atemp            10886 non-null  float64
 5   humidity         10886 non-null  int64  
 6   windspeed        10886 non-null  float64
 7   count            10886 non-null  int64  
 8   month            10886 non-null  object 
 9   day of week      10886 non-null  object 
 10  hour of the day  10886 non-null  object 
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB


## Convert 'temp' and 'atemp' from Celsius to Fahrenheit

In [9]:
df.head()  # show the Celsius values before conversion

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month,day of week,hour of the day
0,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


In [10]:
df['temp'] = df['temp'].apply(lambda x: (x * 9/5) + 32 )
df['atemp'] = df['atemp'].apply(lambda x: (x * 9/5) + 32 )
df.head() # printing to see the result

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month,day of week,hour of the day
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4
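The task asks for `.apply()` with a lambda, which is what the cell above uses. For comparison, the same formula can be written as plain Series arithmetic. The sketch below uses three Celsius values taken from the first rows shown above and checks that both forms agree.

```python
import pandas as pd

# Celsius values copied from the 'temp' and 'atemp' columns shown earlier.
celsius = pd.Series([9.84, 9.02, 14.395], name='temp')

# Element-by-element conversion, as in the notebook.
via_apply = celsius.apply(lambda c: (c * 9 / 5) + 32)

# Same formula applied to the whole Series at once.
vectorized = celsius * 9 / 5 + 32

# Raises an AssertionError if the two results differ.
pd.testing.assert_series_equal(via_apply, vectorized)
print(vectorized.round(3).tolist())
```

On a frame of this size either form is fine; the arithmetic version mainly avoids calling a Python function once per row.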


## Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp')

In [11]:
df['temp_variance'] = df['temp'] - df['atemp']
df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month,day of week,hour of the day,temp_variance
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0,-8.199
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1,-8.307
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2,-8.307
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3,-8.199
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4,-8.199
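A quick sanity check on these values: subtracting 'atemp' from 'temp' gives a negative number whenever the current temperature is below 'atemp', as the task requires. Because the +32 offset cancels in the subtraction, the Fahrenheit difference should equal 9/5 times the Celsius difference. The sketch below verifies both points on the first two rows. The frame `temps_c` is a small stand-in built from the Celsius values shown earlier.

```python
import pandas as pd

# Celsius values from the first two rows of the original data.
temps_c = pd.DataFrame({'temp': [9.84, 9.02], 'atemp': [14.395, 13.635]})

# Difference before conversion (Celsius degrees).
diff_c = temps_c['temp'] - temps_c['atemp']

# Convert both columns, then take the difference again (Fahrenheit degrees).
temps_f = temps_c * 9 / 5 + 32
diff_f = temps_f['temp'] - temps_f['atemp']

print(diff_f.round(3).tolist())
print((diff_c * 9 / 5).round(3).tolist())
```

Both printed lists should match the 'temp_variance' values in the first two rows above.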


## Drop the 'atemp' column

In [12]:
df.drop(columns = 'atemp', inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   holiday          10886 non-null  int64  
 1   workingday       10886 non-null  int64  
 2   weather          10886 non-null  int64  
 3   temp             10886 non-null  float64
 4   humidity         10886 non-null  int64  
 5   windspeed        10886 non-null  float64
 6   count            10886 non-null  int64  
 7   month            10886 non-null  object 
 8   day of week      10886 non-null  object 
 9   hour of the day  10886 non-null  object 
 10  temp_variance    10886 non-null  float64
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB
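The steps in this notebook can also be collected into a single function so they can be reapplied to another file with the same columns. This is a sketch only: `engineer_features` is a hypothetical helper name, the two-row `sample` frame stands in for the real data, and its 'casual' and 'registered' values are placeholders. It casts all three new columns to 'object' explicitly rather than relying on the dtype the `.dt` accessors return.

```python
import pandas as pd


def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's feature steps to a copy of `raw` (hypothetical helper)."""
    out = raw.drop(columns=['casual', 'registered'])
    out['datetime'] = pd.to_datetime(out['datetime'])

    # Three new categorical columns, all stored as 'object'.
    out['month'] = out['datetime'].dt.month_name().astype('object')
    out['day of week'] = out['datetime'].dt.day_name().astype('object')
    out['hour of the day'] = out['datetime'].dt.hour.astype('object')
    out = out.drop(columns=['datetime', 'season'])

    # Celsius to Fahrenheit, using .apply() with a lambda as the task asks.
    for col in ['temp', 'atemp']:
        out[col] = out[col].apply(lambda c: (c * 9 / 5) + 32)

    # Positive when 'temp' is above 'atemp'.
    out['temp_variance'] = out['temp'] - out['atemp']
    return out.drop(columns='atemp')


# Two-row stand-in; 'casual' and 'registered' hold placeholder values.
sample = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00'],
    'season': [1, 1], 'holiday': [0, 0], 'workingday': [0, 0],
    'weather': [1, 1], 'temp': [9.84, 9.02], 'atemp': [14.395, 13.635],
    'humidity': [81, 80], 'windspeed': [0.0, 0.0],
    'casual': [3, 8], 'registered': [13, 32], 'count': [16, 40],
})

result = engineer_features(sample)
print(result.dtypes)
```

Because the function works on the frame returned by `drop` rather than using `inplace = True`, the input frame is left unchanged.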
