#### Find the right number of trees for a gradient boosting machine
In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data is loaded into your workspace. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the gradient boosting model. The function xgb.cv() uses cross-validation to estimate the out-of-sample error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this exercise, the key arguments to the xgb.cv() call are:

- data: a numeric matrix.
- label: vector of outcomes (also numeric).
- nrounds: the maximum number of rounds (trees to build).
- nfold: the number of folds for the cross-validation. 5 is a good number.
- objective: "reg:linear" for continuous outcomes (newer xgboost releases call this "reg:squarederror").
- eta: the learning rate.
- max_depth: depth of trees.
- early_stopping_rounds: after this many rounds without improvement, stop.
- verbose: 0 to stay silent.
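
The selection rule described above (pick the number of trees that minimizes the holdout RMSE) can be sketched in base R. The data frame below is a hypothetical stand-in for the evaluation log that xgb.cv() returns as cv$evaluation_log. The column names assume the train_rmse_mean / test_rmse_mean layout produced with an RMSE metric, and the values are invented purely for illustration.

```r
# Hypothetical evaluation log, shaped like cv$evaluation_log from xgb.cv().
# The RMSE values are made up for illustration; they are not course results.
elog <- data.frame(
  iter            = 1:8,
  train_rmse_mean = c(140, 100, 75, 58, 46, 38, 32, 28),
  test_rmse_mean  = c(142, 104, 82, 68, 61, 59, 60, 62)
)

# Training error keeps falling as trees are added ...
ntrees_train <- which.min(elog$train_rmse_mean)

# ... but the holdout error bottoms out earlier; this is the number to use.
ntrees_test <- which.min(elog$test_rmse_mean)

print(c(ntrees_train = ntrees_train, ntrees_test = ntrees_test))
```

Taking the minimum of the holdout column rather than the training column is the point of the cross-validation: the training RMSE almost always keeps improving, so it would favor too many trees.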

#### Fit an xgboost bike rental model and predict
In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July and predict on data for the month of August.

The datasets for July and August are loaded into your workspace. Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

For convenience, the number of trees to use, ntrees, from the previous exercise is in the workspace.

The arguments to xgboost() are similar to those of xgb.cv().

In [None]:
# Examine the workspace
ls()

# The number of trees to use, as determined by xgb.cv
ntrees

# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   max_depth = 6,          # maximum tree depth
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

![fit_an_xgboost_bike_rental_model_and_predict](./figures/fit_an_xgboost_bike_rental_model_and_predict.png)

#### Evaluate the xgboost bike rental model
In this exercise you will evaluate the gradient boosting model bike_model_xgb that you fit in the last exercise, using data from the month of August. You'll compare this model's RMSE for August to the RMSE of previous models that you've built.

The dataset bikesAugust is in the workspace. You have already made predictions using the xgboost model; they are in the column pred.

In [None]:
# bikesAugust is in the workspace
str(bikesAugust)

# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

Even though this gradient boosting model made some negative predictions, overall it makes smaller errors than the previous two models. Perhaps rounding negative predictions up to zero is a reasonable tradeoff.
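
As a minimal sketch of that tradeoff, negative predictions can be clamped to zero with pmax() before computing the RMSE. The vectors actual and pred below are toy values chosen for illustration, not the August data, and the rmse() helper is a hypothetical convenience function.

```r
# Toy counts and predictions (illustrative values only, not the bike data)
actual <- c(3, 0, 12, 7)
pred   <- c(4.5, -2, 10, -1)

# Hypothetical helper: root mean squared error
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Round negative predictions up to zero; non-negative ones are unchanged
pred_clamped <- pmax(pred, 0)

rmse_raw     <- rmse(actual, pred)
rmse_clamped <- rmse(actual, pred_clamped)

print(c(raw = rmse_raw, clamped = rmse_clamped))
```

Because rental counts are never negative, moving a negative prediction up to zero always moves it closer to the true value, so the clamped RMSE cannot be larger than the raw one.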

#### Visualize the xgboost bike rental model
You've now seen three different ways to model the bike rental data. For this example, you've seen that the gradient boosting model had the smallest RMSE. To finish up the course, let's compare the gradient boosting model's predictions to those of the other two models as a function of time.

On completing this exercise, you will have completed the course. Congratulations! Now you have the tools to apply a variety of approaches to your regression tasks.

In [None]:
# Print quasipoisson_plot
print(quasipoisson_plot)

# Print randomforest_plot
print(randomforest_plot)

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")

The gradient boosting model captures rental variations due to time of day and other factors better than the previous models.

![visualize_the_xgboost_bike_rental_model_1](./figures/visualize_the_xgboost_bike_rental_model_1.png)

![visualize_the_xgboost_bike_rental_model_3](./figures/visualize_the_xgboost_bike_rental_model_3.png)

![visualize_the_xgboost_bike_rental_model_2](./figures/visualize_the_xgboost_bike_rental_model_2.png)