# Engineer and select features


This lab demonstrates feature engineering for a regression model, using bike rental demand prediction as the example. Well-chosen features often lead to a more accurate model. The input is the Bike Rental UCI dataset, which is based on real data from Capital Bikeshare, the company that operates a bike rental network in Washington, DC, in the United States. The dataset contains 17,379 rows and 17 columns; each row records the number of bike rentals within a specific hour of a day in 2011 or 2012. The raw feature set also includes weather conditions (such as temperature, humidity, and wind speed) and calendar flags that mark each date as a holiday, a weekday, and so on.

The field to predict is cnt, a count ranging from 1 to 977 that represents the number of bike rentals within a specific hour. Our main goal is to construct effective features in the training data, so we build two models using the same algorithm but two different datasets. In the visual designer, the Split Data module splits the input so that the training data contains the records for 2011 and the testing data contains the records for 2012. Both datasets start from the same raw data, but we add different engineered features to each training set:

- Set A = weather + holiday + weekday + weekend features for the predicted day
- Set B = number of bikes that were rented in each of the previous 12 hours

We build two training datasets by combining the feature sets as follows:

- Training set 1: feature set A only
- Training set 2: feature sets A+B

For the model, we use regression because the label column (the number of rentals) is a numeric count. The designer version of this experiment uses the Boosted Decision Tree Regression algorithm; the notebook below uses scikit-learn's DecisionTreeRegressor, a single decision tree, instead.

Importing the required libraries:
1. pandas: For reading and manipulating our dataset
2. numpy: Used for working on arrays
3. sklearn.tree: For importing our Decision Tree Regressor
4. sklearn.metrics: Importing this will enable us to use different metrics for evaluating our model

In [1]:
import pandas as pd
import numpy as np
# import the regressor 
from sklearn.tree import DecisionTreeRegressor  
from sklearn.metrics import mean_absolute_error

Reading the dataset via the web url provided

In [2]:
dataframe = pd.read_csv("https://introtomlsampledata.blob.core.windows.net/data/bike-rental/bike-rental-hour.csv")

Take a first look at the dataset with head(n), where n is the number of rows to preview from the top (the default is 5). tail(n) shows the last n rows instead.

In [3]:
dataframe.head()

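| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |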
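As a minimal sketch of how head and tail behave, the snippet below uses a small hypothetical frame named toy (not the real dataset; the values are illustrative only).

```python
import pandas as pd

# Toy frame standing in for the bike rental data (illustrative values only).
toy = pd.DataFrame({"hr": range(8), "cnt": [16, 40, 32, 13, 1, 1, 2, 3]})

first_rows = toy.head()    # default n=5, so rows 0..4
last_rows = toy.tail(3)    # last three rows, so rows 5..7

print(first_rows)
print(last_rows)
```

Passing an explicit n to either call changes how many rows come back; neither call modifies the frame.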


Finding out the names and types of columns that we have in the dataset using dataframe.columns and dataframe.dtypes. Remember columns and dtypes are attributes and not methods, that is why there is no () after them.

In [4]:
dataframe.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [5]:
dataframe.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object
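The difference between attributes and methods can be checked directly. The sketch below uses a hypothetical two-column frame named toy and Python's built-in callable to show which names need parentheses.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
toy = pd.DataFrame({"hr": [0, 1], "temp": [0.24, 0.22]})

# Attributes hold data and are read without parentheses.
column_names = list(toy.columns)
temp_dtype = str(toy.dtypes["temp"])
print(column_names, temp_dtype)

# Methods are callables and only run when followed by parentheses.
print(callable(toy.head))     # True
print(callable(toy.columns))  # False
```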

Converting type of "season" and "weathersit" to categorical.

In [6]:
categoryVariableList = ["season","weathersit"]
for var in categoryVariableList:
    dataframe[var] = dataframe[var].astype("category")


In [7]:
dataframe.dtypes

instant          int64
dteday          object
season        category
yr               int64
mnth             int64
hr               int64
holiday          int64
weekday          int64
workingday       int64
weathersit    category
temp           float64
atemp          float64
hum            float64
windspeed      float64
casual           int64
registered       int64
cnt              int64
dtype: object
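To see what the conversion does, here is a small sketch on a hypothetical season-like Series (the values are made up). It uses the .cat accessor, which pandas exposes on categorical columns, to list the distinct categories and the integer codes stored behind them.

```python
import pandas as pd

# Hypothetical season codes; the real column holds the values 1 to 4 as well.
season = pd.Series([1, 1, 2, 4, 3, 2])
season_cat = season.astype("category")

print(season_cat.dtype)                   # category
categories = list(season_cat.cat.categories)
codes = season_cat.cat.codes.tolist()
print(categories)                         # [1, 2, 3, 4]
print(codes)                              # [0, 0, 1, 3, 2, 1]
```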

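Dropping the "instant", "dteday", "casual", and "registered" columns. In drop, axis=0 (or 'index') removes rows and axis=1 (or 'columns') removes columns. The casual and registered counts add up to cnt in the rows previewed above, so keeping them as features would leak the label into the inputs.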

In [8]:
dataframe = dataframe.drop(["instant", "dteday", "casual" ,"registered"], axis = 1)

In [9]:
dataframe.head()

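| | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 16 |
| 1 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 40 |
| 2 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 32 |
| 3 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 13 |
| 4 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 1 |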
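One reason to drop "casual" and "registered" is that, at least in the five rows previewed earlier, they sum exactly to "cnt". The sketch below rebuilds those five rows in a hypothetical frame named toy, checks the sum, and then drops the columns with the equivalent columns= keyword.

```python
import pandas as pd

# The five preview rows from the first head() output, reduced to four columns.
toy = pd.DataFrame({
    "instant": [1, 2, 3, 4, 5],
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# If the two partial counts always add up to the label, they leak it.
adds_up = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())
print(adds_up)  # True

# drop(columns=[...]) is equivalent to drop([...], axis=1).
trimmed = toy.drop(columns=["instant", "casual", "registered"])
print(list(trimmed.columns))  # ['cnt']
```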


Creating the dataset for set A.
We call dataframe.copy() rather than writing dataframesetA = dataframe, because plain assignment only binds a second name to the same DataFrame object. Any later change to dataframe, such as the lag columns added for set B, would then also appear in dataframesetA. For more background, see https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas

In [10]:
dataframesetA = dataframe.copy()
dataframesetA.columns

Index(['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt'],
      dtype='object')
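The effect of copying versus plain assignment can be shown in a few lines. This is a sketch on a hypothetical frame; the names original, alias, and snapshot are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"cnt": [16, 40, 32]})
alias = original             # second name for the same object
snapshot = original.copy()   # independent copy

alias["lag"] = 0             # adds the column to `original` as well

print(alias is original)            # True
print("lag" in original.columns)    # True
print("lag" in snapshot.columns)    # False
```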

Creating dataset for set A+B.

We use the provided azureml_main function, which relies on the following calls.

np.arange: Returns evenly spaced values within a given interval. Values are generated within the half-open interval [start, stop), that is, including start but excluding stop. For integer arguments the function is equivalent to the Python built-in range function, but it returns an ndarray rather than a range object.

column.shift: Shifts the values by the desired number of periods (rows, which are hours in our case). The vacated positions become NaN.

column.fillna: Fills NA/NaN values with the specified value (0, in our case).
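The three calls above can be tried on a small example first. The Series below reuses the first four cnt values from the preview; everything else is a minimal sketch.

```python
import numpy as np
import pandas as pd

steps = np.arange(1, 4)         # half-open interval: 1, 2, 3
print(steps)

cnt = pd.Series([16, 40, 32, 13])
shifted = cnt.shift(1)          # NaN, 16, 40, 32: each row sees the previous hour
filled = shifted.fillna(0)      # 0, 16, 40, 32: the leading NaN becomes 0
print(filled.tolist())
```

Note that shift introduces NaN, which turns the integer column into floats; that is why the lag columns display as 16.0, 40.0, and so on.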

In [11]:
def azureml_main(dataframe1 = None, dataframe2 = None):

    # Execution logic goes here
    #print(f'Input pandas.DataFrame #1: {dataframe1}')

    # If a zip file is connected to the third input port,
    # it is unzipped under "./Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule

    for i in np.arange(1, 13):
        prev_col_name = 'cnt' if i == 1 else 'Rentals in hour -{}'.format(i-1)
        new_col_name = 'Rentals in hour -{}'.format(i)

        dataframe1[new_col_name] = dataframe1[prev_col_name].shift(1).fillna(0)

    # Return value must be of a sequence of pandas.DataFrame
    # E.g.
    #   -  Single return value: return dataframe1,
    #   -  Two return values: return dataframe1, dataframe2
    return dataframe1,

In [12]:
dataframesetAB = azureml_main(dataframe)[0]

In [13]:
dataframesetAB.columns

Index(['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt',
       'Rentals in hour -1', 'Rentals in hour -2', 'Rentals in hour -3',
       'Rentals in hour -4', 'Rentals in hour -5', 'Rentals in hour -6',
       'Rentals in hour -7', 'Rentals in hour -8', 'Rentals in hour -9',
       'Rentals in hour -10', 'Rentals in hour -11', 'Rentals in hour -12'],
      dtype='object')

In [14]:
dataframesetAB.head()

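| | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | ... | Rentals in hour -3 | Rentals in hour -4 | Rentals in hour -5 | Rentals in hour -6 | Rentals in hour -7 | Rentals in hour -8 | Rentals in hour -9 | Rentals in hour -10 | Rentals in hour -11 | Rentals in hour -12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | ... | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | ... | 40.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |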
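The loop in azureml_main builds each lag column by shifting the previous lag column by one row. An alternative sketch, not part of the provided function, shifts "cnt" directly by i rows. Assuming "cnt" has no missing values, the two approaches give the same columns, as the toy comparison below suggests (three lags instead of twelve, for brevity).

```python
import pandas as pd

toy = pd.DataFrame({"cnt": [16, 40, 32, 13, 1]})

# Chained version, mirroring the loop in azureml_main.
chained = toy.copy()
for i in range(1, 4):
    prev = "cnt" if i == 1 else f"Rentals in hour -{i - 1}"
    chained[f"Rentals in hour -{i}"] = chained[prev].shift(1).fillna(0)

# Direct version: shift the label column by i rows in one step.
direct = toy.copy()
for i in range(1, 4):
    direct[f"Rentals in hour -{i}"] = direct["cnt"].shift(i).fillna(0)

same = chained.equals(direct)
print(same)
print(direct)
```

Both versions work on copies here, so the toy frame itself is left unchanged.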


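Splitting the data into test and train sets for both set A and set A+B. Note that the cells below assign the 2011 rows (yr == 0) to the test set and the 2012 rows to the training set, which is the reverse of the designer split described at the start of this lab.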

In [15]:
dataframesetAtest = dataframesetA[dataframesetA["yr"] == 0]
dataframesetAtrain = dataframesetA[dataframesetA["yr"] != 0]
dataframesetAtest.shape,dataframesetAtrain.shape

((8645, 13), (8734, 13))

In [16]:
dataframesetABtest = dataframesetAB[dataframesetAB["yr"] == 0]
dataframesetABtrain = dataframesetAB[dataframesetAB["yr"] != 0]
dataframesetABtest.shape,dataframesetABtrain.shape

((8645, 25), (8734, 25))
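The split above relies on boolean masks over the "yr" column, where 0 marks 2011 rows in this dataset. A minimal sketch on a hypothetical five-row frame shows that a mask and its negation partition the rows without overlap.

```python
import pandas as pd

# Hypothetical rows: two from 2011 (yr == 0) and three from 2012 (yr == 1).
toy = pd.DataFrame({"yr": [0, 0, 1, 1, 1], "cnt": [16, 40, 32, 13, 1]})

is_2011 = toy["yr"] == 0
rows_2011 = toy[is_2011]
rows_2012 = toy[~is_2011]    # ~ negates the mask, same rows as yr != 0

print(rows_2011.shape, rows_2012.shape)  # (2, 2) (3, 2)
```

Which half serves as training data and which as test data is a separate decision from how the mask is built.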

Dropping "yr" column from all test and train datasets

In [17]:
dataframesetAtest = dataframesetAtest.drop(["yr"],axis = 1)
dataframesetAtrain = dataframesetAtrain.drop(["yr"],axis = 1)
dataframesetABtest = dataframesetABtest.drop(["yr"],axis = 1)
dataframesetABtrain = dataframesetABtrain.drop(["yr"],axis = 1)

In [18]:
dataframesetAtest.shape,dataframesetAtrain.shape

((8645, 12), (8734, 12))

In [19]:
dataframesetABtest.shape,dataframesetABtrain.shape

((8645, 24), (8734, 24))


Creating X and y for test and train set A 


In [20]:
dataframesetAtestX = dataframesetAtest.drop(["cnt"],axis = 1)
dataframesetAtesty = dataframesetAtest["cnt"]
dataframesetAtestX.shape,dataframesetAtesty.shape

((8645, 11), (8645,))

In [21]:
dataframesetAtrainX = dataframesetAtrain.drop(["cnt"],axis = 1)
dataframesetAtrainy = dataframesetAtrain["cnt"]
dataframesetAtrainX.shape,dataframesetAtrainy.shape

((8734, 11), (8734,))

Creating X and y for test and train set A + B

In [22]:
dataframesetABtestX = dataframesetABtest.drop(["cnt"],axis = 1)
dataframesetABtesty = dataframesetABtest["cnt"]
dataframesetABtestX.shape,dataframesetABtesty.shape

((8645, 23), (8645,))

In [23]:
dataframesetABtrainX = dataframesetABtrain.drop(["cnt"],axis = 1)
dataframesetABtrainy = dataframesetABtrain["cnt"]
dataframesetABtrainX.shape,dataframesetABtrainy.shape

((8734, 23), (8734,))
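The four cells above repeat the same two steps. As a sketch, they could be folded into a small helper; split_features_label is a hypothetical name, not something the notebook defines.

```python
import pandas as pd

def split_features_label(frame, label="cnt"):
    """Return (X, y): all columns except `label`, and the `label` column."""
    return frame.drop(columns=[label]), frame[label]

# Hypothetical three-row frame for illustration.
toy = pd.DataFrame({"hr": [0, 1, 2], "temp": [0.24, 0.22, 0.22], "cnt": [16, 40, 32]})
X, y = split_features_label(toy)
print(X.shape, y.shape)  # (3, 2) (3,)
```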

Creating a regressor object for set A and fitting it with X and y data.


In [24]:
regressorA = DecisionTreeRegressor(random_state = 0)  
regressorA.fit(dataframesetAtrainX, dataframesetAtrainy) 

DecisionTreeRegressor(random_state=0)

Creating a regressor object for set A + B and fitting it with X and y data.

In [25]:
regressorAB = DecisionTreeRegressor(random_state = 0)  
regressorAB.fit(dataframesetABtrainX, dataframesetABtrainy) 

DecisionTreeRegressor(random_state=0)
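A DecisionTreeRegressor with default settings has no depth limit, so it can fit its training data exactly when no two identical inputs carry different labels. The toy sketch below (made-up numbers) illustrates why the error is measured on held-out data rather than on the training set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Four distinct inputs with arbitrary targets (illustrative values only).
X_train = np.array([[0], [1], [2], [3]])
y_train = np.array([1.0, 3.0, 2.0, 5.0])

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# The fully grown tree gives every training point its own leaf.
train_mae = mean_absolute_error(y_train, tree.predict(X_train))
print(train_mae)  # 0.0
```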

Calculating the Mean Absolute Error for both the sets.

In [26]:
maesetA = mean_absolute_error(dataframesetAtesty, regressorA.predict(dataframesetAtestX))

In [27]:
maesetAB = mean_absolute_error(dataframesetABtesty,regressorAB.predict(dataframesetABtestX))

Printing both the Mean Absolute Errors.

In [28]:
print("The Mean Absolute Error for set A: " + str(maesetA))
print("The Mean Absolute Error for set A + B: " + str(maesetAB))

The Mean Absolute Error for set A: 92.086003470214
The Mean Absolute Error for set A + B: 37.039329091960674
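Mean absolute error is the average of the absolute differences between true and predicted values. The sketch below computes it by hand on made-up numbers, then uses the two MAE values printed above to express the improvement from adding feature set B as a relative reduction.

```python
import numpy as np

# MAE by hand on illustrative values: errors are 1, 0, and 2.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
mae = float(np.mean(np.abs(y_true - y_pred)))
print(mae)  # 1.0

# MAE values reported by the notebook output above.
mae_a = 92.086003470214
mae_ab = 37.039329091960674
relative_reduction = (mae_a - mae_ab) / mae_a
print(f"{relative_reduction:.1%}")
```

On these numbers, adding the 12 lag features cuts the error by roughly 60 percent for this particular model and split.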
