#### Build a random forest model for bike rentals
In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

- formula
- data
- num.trees: the number of trees in the forest.
- respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" for regression.
- seed: because this is a random algorithm, you will set the seed to get reproducible results.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
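As a minimal sketch of that step, the snippet below assembles the formula string with paste() and, as an alternative that this exercise does not use, builds a formula object with base R's reformulate(). The names fmla_string and fmla_object are illustrative only.

```r
# Outcome and input column names, as in the exercise
outcome <- "cnt"
vars <- c("hr", "holiday", "workingday", "weathersit",
          "temp", "atemp", "hum", "windspeed")

# Option 1: assemble a character string (the approach used in this exercise)
fmla_string <- paste(outcome, "~", paste(vars, collapse = " + "))

# Option 2 (an alternative): build a formula object with stats::reformulate()
fmla_object <- reformulate(vars, response = outcome)

# Both describe the same model: outcome first, then the inputs
all.vars(as.formula(fmla_string))
all.vars(fmla_object)
```

The string form is convenient for printing and inspection; the formula object can be useful with functions that do not accept a character formula.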

In [None]:
# bikesJuly is in the workspace
str(bikesJuly)
# 'data.frame':	744 obs. of  12 variables:
#  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
#  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
#  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
#  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
#  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
#  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
#  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
#  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
#  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

# Random seed to reproduce results
seed
# [1] 423563

# the outcome column
(outcome <- "cnt")
# [1] "cnt"

# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
# [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
# [6] "atemp"      "hum"        "windspeed"

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
# [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

# Load the package ranger
library(ranger)

# Fit and print the random forest model.
(bike_model_rf <- ranger(fmla, 
                         bikesJuly, 
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
# Ranger result

# Call:
#  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 

# Type:                             Regression 
# Number of trees:                  500 
# Sample size:                      744 
# Number of independent variables:  8 
# Mtry:                             2 
# Target node size:                 5 
# Variable importance mode:         none 
# Splitrule:                        variance 
# OOB prediction error (MSE):       8230.568 
# R squared (OOB):                  0.8205434

#### Predict bike rentals with the random forest model
In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

```
predict(model, data)$predictions
```

In [None]:
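# Load dplyr (for %>%, mutate(), summarize()) and ggplot2 (for plotting)
library(dplyr)
library(ggplot2)

# Make predictions on the August data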
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse
#       rmse
# 1 96.66032

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

![predict_bike_rentals_with_the_random_forest_model](./figures/predict_bike_rentals_with_the_random_forest_model.png)

#### Visualize random forest bike model predictions
In the previous exercise, you saw that the random forest bike model achieved a lower RMSE on the August data than the quasipoisson model.

In this exercise you will visualize the random forest model's August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.

In [None]:
# gather() comes from the tidyr package
library(tidyr)

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14) 

# Plot predictions and cnt by date/time 
ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

![visualize_random_forest_bike_model_predictions](./figures/visualize_random_forest_bike_model_predictions.png)
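
gather() still works, but tidyr documents it as superseded by pivot_longer(). As a hedged sketch, the reshaping step above could also be written as shown below. The three-row data frame toy and its values are made up for illustration and are not taken from the bike data.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for bikesAugust: hourly index, actual count, prediction
toy <- data.frame(
  instant = c(13748L, 13749L, 13750L),
  cnt     = c(47L, 33L, 13L),
  pred    = c(50.2, 30.1, 15.7)
)

long <- toy %>%
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>%
  # Stack cnt and pred into one column named value, labeled by valuetype
  pivot_longer(cols = c(cnt, pred),
               names_to = "valuetype",
               values_to = "value") %>%
  # Keep the first two weeks
  filter(instant < 14)

long
```

The rows come out in a different order than with gather() (interleaved by hour rather than stacked by column), which does not affect the plot because ggplot2 groups by valuetype.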

#### vtreat on a small example
In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

To design a treatment plan, use the function designTreatmentsZ():

```
treatplan <- designTreatmentsZ(data, varlist)
```

- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

```
scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)
```

- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
    - "clean": a numerical variable with no NAs or NaNs
    - "lev": an indicator variable for a specific level of the original categorical variable.
    
(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either "clean" or "lev":

```
newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)
```

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

```
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
```

- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data

In [None]:
# dframe is in the workspace
dframe
#   color size popularity
# 1      b   13  1.0785088
# 2      r   11  1.3956245
# 3      r   15  0.9217988
# 4      r   14  1.2025453
# 5      r   13  1.0838662
# 6      b   11  0.8043527
# 7      r    9  1.1035440
# 8      g   12  0.8746332
# 9      b    7  0.6947058
# 10     b   12  0.8832502

# Create and print a vector of variable names
(vars <- c("color", "size"))
# [1] "color" "size"

# Load vtreat, plus magrittr (for use_series()) and dplyr (for select() and filter())
library(vtreat)
library(magrittr)
library(dplyr)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
# [1] "vtreat 1.2.0 inspecting inputs Sat Feb  8 09:34:11 2020"
# [1] "designing treatments Sat Feb  8 09:34:11 2020"
# [1] " have initial level statistics Sat Feb  8 09:34:11 2020"
# [1] "design var color Sat Feb  8 09:34:11 2020"
# [1] "design var size Sat Feb  8 09:34:11 2020"
# [1] " scoring treatments Sat Feb  8 09:34:11 2020"
# [1] "have treatment plan Sat Feb  8 09:34:11 2020"

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))
#         varName origName  code
# 1    color_catP    color  catP
# 2    size_clean     size clean
# 3 color_lev_x_b    color   lev
# 4 color_lev_x_g    color   lev
# 5 color_lev_x_r    color   lev

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
# [1] "size_clean"    "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"

# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
#   size_clean color_lev_x_b color_lev_x_g color_lev_x_r
# 1          13             1             0             0
# 2          11             0             0             1
# 3          15             0             0             1
# 4          14             0             0             1
# 5          13             0             0             1
# 6          11             1             0             0
# 7           9             0             0             1
# 8          12             0             1             0
# 9           7             1             0             0
# 10         12             1             0             0

#### Novel levels
When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.
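
The problem can be sketched in base R. In the snippet below, the data frames train and test are hypothetical: the level "y" appears only in test, and "r" appears only in train. Encoding each one independently with model.matrix() produces column sets that do not line up.

```r
# Hypothetical training data: the level "y" never appears
train <- data.frame(color = c("b", "r", "r", "g"), stringsAsFactors = FALSE)
# Hypothetical future data: "y" is new, and "r" happens to be absent
test  <- data.frame(color = c("g", "y", "b"), stringsAsFactors = FALSE)

# One-hot-encode each data frame independently (-1 drops the intercept)
train_mm <- model.matrix(~ color - 1, data = train)
test_mm  <- model.matrix(~ color - 1, data = test)

colnames(train_mm)
colnames(test_mm)

# Columns that a model trained on train_mm has never seen
setdiff(colnames(test_mm), colnames(train_mm))
# Columns that the model expects but the test encoding lacks
setdiff(colnames(train_mm), colnames(test_mm))
```

A model fit to the columns of train_mm has no coefficient for the new column and finds one of its expected columns missing.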

vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).

In this exercise you will see how vtreat handles categorical values that did not appear in the training set. The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace. dframe and a new data frame testframe are also in your workspace.

In [None]:
# Print testframe
testframe
#   color size popularity
# 1      g    7  0.9733920
# 2      g    8  0.9122529
# 3      y   10  1.4217153
# 4      g   12  1.1905828
# 5      g    6  0.9866464
# 6      y    8  1.3697515
# 7      b   12  1.0959387
# 8      g   12  0.9161547
# 9      g   12  1.0000460
# 10     r    8  1.3137360

# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
#   size_clean color_lev_x_b color_lev_x_g color_lev_x_r
# 1           7             0             1             0
# 2           8             0             1             0
# 3          10             0             0             0
# 4          12             0             1             0
# 5           6             0             1             0
# 6           8             0             0             0
# 7          12             1             0             0
# 8          12             0             1             0
# 9          12             0             1             0
# 10          8             0             0             1

#### vtreat the bike rental data
In this exercise you will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

The data frames bikesJuly and bikesAugust are in the workspace.

For your convenience, we have defined the variable vars with the vector of input columns for the model.

In [None]:
# The outcome column
(outcome <- "cnt")

# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("clean", "lev")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column

# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)

#### Find the right number of trees for a gradient boosting machine
In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data is loaded into your workspace. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the gradient boosting model. The function xgb.cv() uses cross-validation to estimate the out-of-sample error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this exercise, the key arguments to the xgb.cv() call are:

- data: a numeric matrix.
- label: vector of outcomes (also numeric).
- nrounds: the maximum number of rounds (trees to build).
- nfold: the number of folds for the cross-validation. 5 is a good number.
- objective: "reg:linear" for continuous outcomes (newer xgboost releases deprecate this name in favor of "reg:squarederror").
- eta: the learning rate.
- max_depth: depth of trees.
- early_stopping_rounds: after this many rounds without improvement, stop.
- verbose: 0 to stay silent.
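
The following is a minimal sketch of such a call, not the exercise's solution. It makes several assumptions: the data are synthetic (X and y stand in for as.matrix(bikesJuly.treat) and bikesJuly$cnt), the matrix and outcome are bundled into an xgb.DMatrix and the tuning arguments are passed through params (a form that recent xgboost releases should also accept), and the objective is written as "reg:squarederror", the newer name for the squared-error objective listed above as "reg:linear".

```r
library(xgboost)

# Synthetic stand-in for the treated training data (NOT the bike data)
set.seed(423563)
n <- 200
X <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, paste0("x", 1:4)))
y <- 3 * X[, "x1"] - 2 * X[, "x2"] + rnorm(n)

# Bundle the numeric matrix and the numeric outcome vector
dtrain <- xgb.DMatrix(data = X, label = y)

# Cross-validate, recording the holdout RMSE after each added tree
cv <- xgb.cv(
  params = list(objective = "reg:squarederror",  # squared-error regression
                eta = 0.3,                       # learning rate
                max_depth = 6),                  # depth of trees
  data = dtrain,
  nrounds = 100,               # maximum number of trees to build
  nfold = 5,                   # number of cross-validation folds
  early_stopping_rounds = 10,  # stop after 10 rounds without improvement
  verbose = 0                  # stay silent
)

# One row per round; test_rmse_mean is the average holdout RMSE
elog <- as.data.frame(cv$evaluation_log)

# Number of trees that minimizes the holdout RMSE
(ntrees <- which.min(elog$test_rmse_mean))
```

The value of ntrees depends on the data and the random fold assignment, so it will differ for the bike data.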

#### Fit an xgboost bike rental model and predict
In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July and predict on data for the month of August.

The datasets for July and August are loaded into your workspace. Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

For convenience, the number of trees to use, ntrees, from the previous exercise is in the workspace.

The arguments to xgboost() are similar to those of xgb.cv().

In [None]:
# Examine the workspace
ls()

# The number of trees to use, as determined by xgb.cv
ntrees

# Load the package xgboost
library(xgboost)

# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   max_depth = 6,
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

![fit_an_xgboost_bike_rental_model_and_predict](./figures/fit_an_xgboost_bike_rental_model_and_predict.png)

#### Evaluate the xgboost bike rental model
In this exercise you will evaluate the gradient boosting model bike_model_xgb that you fit in the last exercise, using data from the month of August. You'll compare this model's RMSE for August to the RMSE of previous models that you've built.

The dataset bikesAugust is in the workspace. You have already made predictions using the xgboost model; they are in the column pred.

In [None]:
# bikesAugust is in the workspace
str(bikesAugust)

# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

Even though this gradient boosting model made some negative predictions, overall it makes smaller errors than the previous two models. Perhaps rounding negative predictions up to zero is a reasonable tradeoff.
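
One way to apply that rounding is base R's pmax(). The sketch below uses made-up values for cnt and pred, and the helper rmse() is defined here only for illustration. Because actual counts are never negative, moving a negative prediction up to zero always brings it closer to the actual value.

```r
# Made-up actual counts and predictions (not from the bike data)
cnt  <- c(4, 10, 27, 50)
pred <- c(-3.2, 8.5, 30.1, 47.0)

# Root mean squared error of a vector of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Raise negative predictions to zero; leave the rest unchanged
pred_clipped <- pmax(pred, 0)

rmse(cnt, pred)          # RMSE with the raw predictions
rmse(cnt, pred_clipped)  # RMSE after clipping at zero
```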

#### Visualize the xgboost bike rental model
You've now seen three different ways to model the bike rental data. For this example, you've seen that the gradient boosting model had the smallest RMSE. To finish up the course, let's compare the gradient boosting model's predictions to the other two models as a function of time.

On completing this exercise, you will have completed the course. Congratulations! Now you have the tools to apply a variety of approaches to your regression tasks.

In [None]:
# Print quasipoisson_plot
print(quasipoisson_plot)

# Print randomforest_plot
print(randomforest_plot)

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")

The gradient boosting model captures rental variations due to time of day and other factors better than the previous two models.

![visualize_the_xgboost_bike_rental_model_1](./figures/visualize_the_xgboost_bike_rental_model_1.png)

![visualize_the_xgboost_bike_rental_model_3](./figures/visualize_the_xgboost_bike_rental_model_3.png)

![visualize_the_xgboost_bike_rental_model_2](./figures/visualize_the_xgboost_bike_rental_model_2.png)