
Project guide: https://www.dataquest.io/m/213/guided-project%3A-predicting-bike-rentals

Solution by DataQuest: https://github.com/dataquestio/solutions/blob/master/Mission213Solution.ipynb

In this project, I will use different machine learning algorithms (linear regression, decision trees and random forest) to predict the target and compare the output.

I will use the [hourly record](https://github.com/gazay/dlnd-project-01/blob/master/Bike-Sharing-Dataset/hour.csv) of the Bike Sharing Dataset compiled by Hadi Fanaee-T. The target variable is the total number of bikes which were rented at a given hour. The following table shows a summary of the dataset, in which the "Description" column is an excerpt from https://github.com/gazay/dlnd-project-01/tree/master/Bike-Sharing-Dataset.

|Variable|Data type|Description|Drop the feature?|Target variable?|
|---|---|---|---|---|
|instant|discrete|record index|No (kept as a proxy for time; see 2.2)||
|dteday|ordinal|date|Yes (uninformative)||
|season|ordinal (circular)|season (1: spring, 2: summer, 3: fall, 4: winter)|||
|yr|ordinal (circular)|year (0: 2011, 1: 2012)|||
|mnth|ordinal (circular)|month (1 to 12)|||
|hr|ordinal (circular)|hour (0 to 23)|||
|holiday|nominal|whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)|||
|weekday|ordinal (circular)|day of the week|||
|workingday|nominal|if day is neither weekend nor holiday is 1, otherwise is 0.|||
|weathersit|nominal|1 (Clear, Few clouds, Partly cloudy), 2 (Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist), 3 (Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds), 4 (Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog)|||
|temp|continuous|Normalized temperature in Celsius. The values are divided by 41 (max)|||
|atemp|continuous|Normalized feeling temperature in Celsius. The values are divided by 50 (max)|||
|hum|continuous|Normalized humidity. The values are divided by 100 (max)|||
|windspeed|continuous|Normalized wind speed. The values are divided by 67 (max)|||
|casual|discrete|count of casual users|Yes (leaks info on target variable)||
|registered|discrete|count of registered users|Yes (leaks info on target variable)||
|cnt|discrete|count of total rental bikes including both casual and registered||Yes|

The steps I will take are shown below.

1. [Read in data](#1.-Read-in-data)<br />
2. [Clean data](#2.-Clean-data)<br />
    2.1. [Check for missing values](#2.1.-Check-for-missing-values)<br />
    2.2. [Remove uninformative features](#2.2.-Remove-uninformative-features)<br />
    2.3. [Remove features leaking info on target](#2.3.-Remove-features-leaking-info-on-target)<br />
3. [Process features](#3.-Process-features)<br />
    3.1. [Min-max scale continuous variables](#3.1.-Min-max-scale-continuous-variables)<br />
    3.2. [One-hot encode nominal data](#3.2.-One-hot-encode-nominal-data)<br />
4. [Predict and evaluate](#4.-Predict-and-evaluate)<br />
5. [Closing remarks](#5.-Closing-remarks)<br />
    5.1. [RMSE instead of MAE](#5.1.-RMSE-instead-of-MAE)<br />
    5.2. [Correlated features](#5.2.-Correlated-features)<br />
    5.3. [Suggestion for future research](#5.3.-Suggestion-for-future-research)<br />
    &nbsp;&nbsp;&nbsp;&nbsp;5.3.1. [Aggregate features](#5.3.1.-Aggregate-features)<br />


# 1. Read in data

In [1]:
from pprint import pprint
from scipy.stats import pearsonr
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from time import time

import numpy as np
import pandas as pd
import re

# Read in data
df = pd.read_csv("bike_rental_hour.csv")

# Display first 5 rows
df.head(5)

| |instant|dteday|season|yr|mnth|hr|holiday|weekday|workingday|weathersit|temp|atemp|hum|windspeed|casual|registered|cnt|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|2011-01-01|1|0|1|0|0|6|0|1|0.24|0.2879|0.81|0.0|3|13|16|
|1|2|2011-01-01|1|0|1|1|0|6|0|1|0.22|0.2727|0.8|0.0|8|32|40|
|2|3|2011-01-01|1|0|1|2|0|6|0|1|0.22|0.2727|0.8|0.0|5|27|32|
|3|4|2011-01-01|1|0|1|3|0|6|0|1|0.24|0.2879|0.75|0.0|3|10|13|
|4|5|2011-01-01|1|0|1|4|0|6|0|1|0.24|0.2879|0.75|0.0|0|1|1|


# 2. Clean data
## 2.1. Check for missing values
There are no missing values in any column.

In [2]:
df.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

## 2.2. Remove uninformative features

`dteday` (date in year-month-day format) will be removed because its information is already carried by `yr`, `mnth` and `weekday`.

In [3]:
df = df.drop("dteday", axis=1)
df.columns

Index(['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

`instant` (row ID) will be kept to see whether bike rentals increase over time.

## 2.3. Remove features leaking info on target

`casual` and `registered` will be removed because together they add up to the target variable (`cnt` = `casual` + `registered`).

In [4]:
df = df.drop(["casual", "registered"], axis=1)
df.columns

Index(['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt'],
      dtype='object')
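
The leak can be checked directly on the five rows displayed in section 1. The sketch below copies those values by hand into a small data frame (`sample` and `leaks_target` are names made up for illustration) and confirms that the two columns sum to the target.

```python
import pandas as pd

# Count columns of the first five rows, copied from the df.head(5) output in section 1
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# If the two columns always sum to cnt, a model given both could
# reproduce the target without learning anything about demand
leaks_target = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())
print(leaks_target)
```

The same comparison could be run on the full data frame before the two columns are dropped.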

# 3. Process features
## 3.1. Min-max scale continuous variables

Continuous variables are `temp` (normalised temperature), `atemp` (normalised feeling temperature), `hum` (normalised humidity) and `windspeed` (normalised wind speed). The current values are the quotients obtained by dividing each original value by the maximum value of its variable.

These will be rescaled with min-max scaling, so that each variable ranges from 0 to 1.

In [5]:
# Set maximum values of original data (source https://bit.ly/2MAVs0K)
# for each continuous variable
max_vals = {"temp": 41, "atemp": 50, "hum": 100, "windspeed": 67}

# Rescale values into min-max scale
# ("float" replaces np.float, which was removed in NumPy 1.24)
for col_name in df.select_dtypes("float"):

    # Unscale values to recover original values
    col_original = df[col_name] * max_vals[col_name]

    # Min-max scale original values
    col_minmax = (col_original - col_original.min()) / (col_original.max() - col_original.min())

    # Update data frame
    df[col_name] = col_minmax
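
Min-max scaling is unaffected by multiplying a column by a positive constant, so the unscaling step above does not change the result; it only makes the intent explicit. A minimal sketch with made-up values (the `min_max` helper and the sample series are hypothetical, not part of the notebook):

```python
import numpy as np
import pandas as pd

def min_max(col):
    """Scale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (col - col.min()) / (col.max() - col.min())

# Made-up normalised temperatures (original value divided by 41)
temp_norm = pd.Series([0.02, 0.24, 0.50, 1.00])

scaled_direct = min_max(temp_norm)         # scale the normalised values as they are
scaled_unscaled = min_max(temp_norm * 41)  # recover original values first, then scale

same = bool(np.allclose(scaled_direct, scaled_unscaled))
print(same)
```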

## 3.2. One-hot encode nominal data

Create one column per category for nominal variables (`holiday`, `workingday`, `weathersit`).

In [6]:
cat_cols = ["holiday", "workingday", "weathersit"]

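# Create dummy columns for nominal columns
# (Values are first converted into strings because, by default,
# pandas' get_dummies function only encodes object and category columns;
# dtype=int keeps the dummies as 0/1 rather than booleans)
dummy_cols = pd.get_dummies(df[cat_cols].astype(str), dtype=int)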

# Remove original nominal columns
df.drop(cat_cols, axis=1, inplace=True)

# Add dummy columns to data frame
df = pd.concat([df, dummy_cols], axis=1)
df.head(5)



##########################################
# Incomplete code block for feature hashing
# which I hope to further work on another time
#
# df_cats = df.select_dtypes(include="category")
# h = FeatureHasher(n_features=10, input_type="string")
# pd.DataFrame(h.transform(df_cats).toarray())
##########################################

| |instant|season|yr|mnth|hr|weekday|temp|atemp|hum|windspeed|cnt|holiday_0|holiday_1|workingday_0|workingday_1|weathersit_1|weathersit_2|weathersit_3|weathersit_4|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|1|0|1|0|6|0.22449|0.2879|0.81|0.0|16|1|0|1|0|1|0|0|0|
|1|2|1|0|1|1|6|0.204082|0.2727|0.8|0.0|40|1|0|1|0|1|0|0|0|
|2|3|1|0|1|2|6|0.204082|0.2727|0.8|0.0|32|1|0|1|0|1|0|0|0|
|3|4|1|0|1|3|6|0.22449|0.2879|0.75|0.0|13|1|0|1|0|1|0|0|0|
|4|5|1|0|1|4|6|0.22449|0.2879|0.75|0.0|1|1|0|1|0|1|0|0|0|
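
As an alternative sketch, the string conversion can be skipped by naming the columns to encode: `pd.get_dummies` with `columns=` encodes the listed columns whatever their dtype and drops the originals in the same step. The toy data frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows with integer-coded nominal columns
toy = pd.DataFrame({
    "hr": [0, 1, 2, 3],
    "holiday": [0, 0, 1, 0],
    "weathersit": [1, 2, 1, 3],
})

# Only the listed columns are encoded; "hr" is passed through unchanged
encoded = pd.get_dummies(toy, columns=["holiday", "weathersit"], dtype=int)
print(encoded.columns.tolist())
```

This produces the same `name_value` column naming as the cell above, so either approach should fit the rest of the notebook.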


# 4. Predict and evaluate

* *The text in this section and the code in the following cell are modified versions of section "6. Predict and Evaluate" of my [other project](https://github.com/gknam/projects/blob/master/DataScience/DataQuest/Step6_MachineLearning/4_LinearRegressionForMachineLearning/project1/PredictingHouseSalePrices.ipynb).*

The following procedures will be carried out with different learning algorithms: linear regression, decision trees and random forest.

___
**K-fold cross validation** will be carried out with K ranging from 2 to 10 inclusive.

In each split of a K-fold run, **feature selection** will be done based on each feature's importance, which will be evaluated using the **extremely randomised trees** algorithm ([ExtraTreesRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor)).

Feature selection will be done 10 times per split. The first selection will include only the most important feature. The second selection will be the first selection plus the next most important feature, and so on up to a selection of the 10 most important features.

With each set of selected features, predictions will be made using the chosen learning algorithm. Prediction errors will then be measured using [root-mean-square error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (RMSE) and averaged across the splits of each K-fold run.

A total of 90 averaged RMSEs will be produced, each representing a unique combination of feature set size (n=10) and K (n=9). Finally, the combination with the lowest RMSE will be selected, and its RMSE will be reported together with the number and names of the selected features.
___

**Note**: Feature selection is done ***during*** cross validation rather than beforehand. [This will prevent bias which can be created from using feature sets selected from the *whole* dataset (all rows) to make prediction on *subsets* of data (subsets of rows). By doing this, cross validation assesses the **model fitting process** rather than the model itself](https://stats.stackexchange.com/a/27751).
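
The same idea can be sketched more compactly with a scikit-learn `Pipeline`, which refits the selector on the training rows of every split. This is an illustrative alternative rather than the code used in this notebook: the data are synthetic, and the choices of 3 features, 5 splits and 50 trees are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the bike data (assumption: 10 features, 3 informative)
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    # threshold=-np.inf makes max_features the only selection criterion,
    # so the 3 features with the highest importance are kept
    ("select", SelectFromModel(ExtraTreesRegressor(n_estimators=50, random_state=0),
                               threshold=-np.inf, max_features=3)),
    ("model", LinearRegression()),
])

# The whole pipeline (selection + model) is refit on the training rows of each split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_split = -scores  # scikit-learn reports the negated RMSE
print(rmse_per_split.mean())
```

Because selection happens inside the pipeline, the test rows of a split never influence which features are chosen for that split.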

In [7]:
def feature_selection(selector, features, target, top_features_number):

    selector.fit(features, target)
    
    feature_cols_series = pd.Series(features.columns, \
                                    name="Feature_cols")
    
    if type(selector) == ExtraTreesRegressor:
        feature_significance = selector.feature_importances_
        feature_significance_label = "Feature_importance_score"
        ascending = False

    elif type(selector) == RFE:
        feature_significance = selector.ranking_
        feature_significance_label = "Feature_importance_ranking"
        ascending = True
        
    
    feature_significance_series = pd.Series(feature_significance, \
                                          name=feature_significance_label)

    feature_cols_top = pd.concat([feature_cols_series, feature_significance_series], axis=1)\
            .sort_values(by=feature_significance_label, ascending=ascending)\
            .iloc[:top_features_number, 0].values
            
    return feature_cols_top

def get_model_name(model):
    # Class name of the estimator, e.g. "LinearRegression"
    model_name = type(model).__name__
    
    return model_name

def train_and_test(fs_train, fs_test, t_train, t_test, model, error_metric):
    
    # Fit model
    model.fit(fs_train, t_train)
    
    # Predict target using test dataset
    p_test = model.predict(fs_test)
    
    # Get error
    
    # RMSE (root-mean-square error)
    if error_metric == "RMSE":
        err = np.sqrt(mean_squared_error(t_test, p_test))
    
    # MAE (mean absolute error); no square root is taken here
    elif error_metric == "MAE":
        err = mean_absolute_error(t_test, p_test)
            
    return err




def k_fold_cross_val(target_col, model, error_metric, selector, num_folds, feature_set_sizes):
    # Get features
    feature_cols = df.columns.drop(target_col)

    features = df[feature_cols]
    target = df[target_col]

    # Track lowest error (start at infinity so that any real error is lower)
    lowest_err = np.inf

    # Track best method
    # [fold, number of selected features, 
    # names of selected features, lowest error]
    best_method = [None, None, None, lowest_err]


    start = time()
    err_all = {str(fold) + "-fold CV": [] for fold in num_folds}

    # K-fold cross validation
    for fold in num_folds:

        err_split = []
        kf = KFold(n_splits=fold, shuffle=True)

        err_selections = {}
        feature_selections = {}

        kf_splits = kf.split(features)

        # For each split of the fold
        for split, (train_ind, test_ind) in enumerate(kf_splits):

            err_selections[split] = []
            feature_selections[split] = []

            # Get target (KFold yields row positions, hence iloc)
            t_train = target.iloc[train_ind]
            t_test = target.iloc[test_ind]

            # Get features
            f_train = features.iloc[train_ind]
            f_test = features.iloc[test_ind]

            # For each feature set size
            for fss in feature_set_sizes:

                # Select features
                fs_cols = feature_selection(selector, f_train, t_train, top_features_number=fss)

                fs_train = f_train[fs_cols]
                fs_test = f_test[fs_cols]

                # record error and selected features
                err = train_and_test(fs_train, fs_test, t_train, t_test, model, error_metric)
                err_selections[split].append(err)
                feature_selections[split].append(fs_cols)

        # Get mean error and feature names per feature set size
        err_array = np.array([err_selections[i] for i in err_selections])
        for fs_ind, fs in enumerate(feature_set_sizes):
            err_fs_mean = np.mean(err_array[:, fs_ind])
            # Kept as a plain list: selections differ in length across
            # feature set sizes, so they cannot form a rectangular array
            features_fs = [feature_selections[i][fs_ind] for i in feature_selections]

            if err_fs_mean < lowest_err:
                lowest_err = err_fs_mean
                best_method = [fold, fs, features_fs, lowest_err]

            err_all[str(fold) + "-fold CV"].append(err_fs_mean)

    end = time()

    # Get unique feature selections in best_method
    # (source https://stackoverflow.com/a/3724558)
    features_fs_best = [list(x) for x in set(tuple(x) for x in best_method[2])]

    # State model name
    model_name = get_model_name(model)
    print(model_name + ": ")
    
    # State duration of cross-validation
    print("Duration: " + ("{:.2f}").format(end - start) + " seconds")
    print()

    # State (1) fold and feature sets with which best prediction was made
    # and (2) details of error measure
    print("Best prediction was made with {} {}".format(error_metric, best_method[3]))
    print("This was achieved in")
    print("(1) {}-fold cross-validation".format(best_method[0]))
    print("(2) with {} most important features selected in each split of the fold, \
which are listed below.".format(best_method[1]))
    print("(There may be multiple sets because feature selection was made for each split)")
    print()
    
    # State most important features selected for the fold in which best prediction was made
    for ind, val in enumerate(features_fs_best):
        val = re.sub(r"\[|\]", "", str(val))
        to_print = str(val) + " and" if ind < len(features_fs_best) - 1 \
                                  else str(val)
        print(to_print)
        print()
    print("\n")

    ## Return error metric for all attempted numbers of folds (columns) and feature set sizes (rows)
    # err_all_df = pd.DataFrame(data=err_all, index=[str(i) + " features" for i in feature_set_sizes])
    
    # return err_all_df



target_col = "cnt"
models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor()]
selector = ExtraTreesRegressor()
error_metric = "RMSE"
num_folds = range(2, 11)
feature_set_sizes = range(1, 11)

## Initialise RFE feature selector
# estimator = SVR(kernel="linear")
# selector = RFE(estimator, n_features_to_select=1, step=1)

# State feature selector name
selector_name = get_model_name(selector)
print("Feature selection is made using {}.".format(selector_name))

# State numbers of folds
print("K-fold cross validation is carried out with K ranging from {} to and including {}."
      .format(min(num_folds), max(num_folds)))
print()

# State sizes of feature sets
print("In each fold, N most important features are selected to train the model")
print("where N ranges from {} to and including {}".format(min(feature_set_sizes), max(feature_set_sizes)))

print("\n\n\n")

for model in models:
    k_fold_cross_val(target_col, model, error_metric, selector, num_folds, feature_set_sizes)

Feature selection is made using ExtraTreesRegressor.
K-fold cross validation is carried out with K ranging from 2 to and including 10.

In each fold, N most important features are selected to train the model
where N ranges from 1 to and including 10




LinearRegression: 
Duration: 306.32 seconds

Best prediction was made with RMSE 141.84283731934252
This was achieved in
(1) 10-fold cross-validation
(2) with 10 most important features selected in each split of the fold, which are listed below.
(There may be multiple sets because feature selection was made for each split)

'hr', 'temp', 'yr', 'atemp', 'hum', 'workingday_0', 'workingday_1', 'season', 'instant', 'weathersit_3' and

'hr', 'temp', 'yr', 'atemp', 'workingday_0', 'workingday_1', 'hum', 'instant', 'season', 'weathersit_3' and

'hr', 'temp', 'yr', 'atemp', 'workingday_1', 'workingday_0', 'instant', 'hum', 'season', 'mnth' and

'hr', 'atemp', 'temp', 'yr', 'hum', 'workingday_0', 'workingday_1', 'instant', 'season', 'weathersit_3'

The random forest algorithm produced the most accurate predictions (RMSE≈45.56), decision trees were slightly less accurate (RMSE≈58.24), and linear regression performed noticeably worse (RMSE≈141.84).

On the other hand, linear regression was the fastest to run and random forest the slowest.

Also, including more features in the model led to better prediction accuracy for all three algorithms. Among all features, `hr` (hour of day) was the most important in predicting the number of bike rentals for each hour.


# 5. Closing remarks

## 5.1. RMSE instead of MAE

Root-mean-square error (RMSE) and mean absolute error (MAE) both express prediction error in the units of the target (here, bikes rented per hour). I chose RMSE because it squares each error before averaging, so bigger errors are given a higher weight (https://stats.stackexchange.com/a/48268).
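
A small worked example of the difference, using two made-up sets of prediction errors with the same total absolute error (the `mae` and `rmse` helpers are written here for illustration):

```python
import numpy as np

def mae(errors):
    """Mean absolute error of an array of prediction errors."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root-mean-square error of an array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

even_errors = np.array([1.0, 1.0, 1.0, 1.0])    # four small misses
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss

print(mae(even_errors), rmse(even_errors))      # 1.0 1.0
print(mae(one_big_error), rmse(one_big_error))  # 1.0 2.0
```

MAE cannot tell the two cases apart, whereas RMSE doubles when the same total error is concentrated in a single prediction.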

## 5.2. Correlated features

In [8]:
def get_correlated_columns(df, r_thres=0.9, p_thres=0.05, dtype=None):
    """
    df: data frame
    r_thres: Threshold for Pearson's r value
    p_thres: Threshold for p value of Pearson's r
    dtype: data type of columns to include (all columns if None)
    
    Return (1) a list of highly correlated column pairs and
    (2) the set of all columns appearing in those pairs.
    Highly correlated means r > r_thres and p < p_thres.
    """
    
    # Select columns of specified data type (all columns by default)
    cols = df.columns if dtype is None else df.select_dtypes(dtype).columns

    cols_corr_pair = [] # Pairs of correlated columns
    cols_corr_all = set() # All correlated columns
    
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            
            col1_name = cols[i]
            col2_name = cols[j]
            
            col1 = df[col1_name]
            col2 = df[col2_name]
            r, p = pearsonr(col1, col2)

            # Compare against the thresholds passed in by the caller
            if (r > r_thres) and (p < p_thres):
                cols_corr_pair.append([col1_name, col2_name])
                cols_corr_all.add(col1_name)
                cols_corr_all.add(col2_name)
                
    return cols_corr_pair, cols_corr_all

# "float" replaces np.float, which was removed in NumPy 1.24
cols_corr_pair, cols_corr_all = get_correlated_columns(df.drop("cnt", axis=1), dtype="float")
print(cols_corr_pair)
print(cols_corr_pair)

[['temp', 'atemp']]


Temperature (`temp`) and feeling temperature (`atemp`) are highly (r > 0.9) and significantly (p < 0.05) correlated. Such correlation can distort the feature importance scores calculated by the random forest algorithm (https://stats.stackexchange.com/a/144732/209342). This is an issue for this project because extremely randomised trees, which I used for feature selection, are closely related to random forests.

One alternative could be to do feature selection using the recursive feature elimination (RFE) algorithm, which could reduce this effect ([Gregorutti et al., 2017](https://arxiv.org/abs/1310.5726)).
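
A minimal sketch of what RFE-based selection could look like with scikit-learn's `RFE` class, which is already imported in the first cell. The data are synthetic, and the use of linear regression as the ranking estimator and a target of 3 features are assumptions made for illustration; Gregorutti et al. study RFE with random forest importances instead.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bike data (assumption: 8 features, 3 informative)
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Repeatedly fit the estimator and drop the weakest feature until 3 remain
rfe = RFE(LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)      # column indices of the retained features
print(rfe.ranking_)  # 1 marks a retained feature; larger ranks were dropped earlier
```

An `RFE` instance built this way could be passed as `selector` to `feature_selection`, which already handles the `ranking_` attribute.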

## 5.3. Suggestion for future research
### 5.3.1. Aggregate features

One could investigate interactions between features. For example, one could check whether feeling temperature (`atemp`) affects the number of bike rentals differently depending on `windspeed`. A new variable could be created for this by multiplying the two variables together.
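
A minimal sketch of such an interaction term on made-up values (the column name `atemp_x_windspeed` is hypothetical):

```python
import pandas as pd

# Made-up min-max scaled values for three hours
toy = pd.DataFrame({
    "atemp": [0.2, 0.5, 0.8],
    "windspeed": [0.0, 0.3, 0.6],
})

# Product of the two features; it is large only when both are large
toy["atemp_x_windspeed"] = toy["atemp"] * toy["windspeed"]
print(toy)
```

The new column could then go through the same feature selection as the others to see whether it ranks among the most important features.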