#### Fit a model of sparrow survival probability
In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status ("Survived", "Perished"). The variables we will consider are:

- total_length: length of the bird from tip of beak to tip of tail (mm)
- weight: in grams
- humerus: length of the humerus (the "upper arm bone" that connects the wing to the body) (inches)

Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:

```
glm(formula, data = data, family = binomial)
```
You will call summary() and broom::glance() to see two different ways of examining a logistic regression model. One of the diagnostics that you will look at is the analog to R^2, called pseudo-R^2.

$$\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null.deviance}}$$

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R^2 is analogous to R^2 for standard regression: R^2 is a measure of the "variance explained" of a regression model. The pseudo-R^2 is a measure of the "deviance explained".
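
As a minimal sketch of the pseudo-R^2 formula, the following uses base R only and the built-in mtcars data rather than the sparrow data. The model vs ~ mpg and the `toy_` names are illustrative assumptions. A fitted glm object stores both quantities as `deviance` and `null.deviance`, so the calculation does not strictly need broom.

```r
# Illustrative only: built-in mtcars data, not the sparrow data
toy_model <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Both deviances are stored on the fitted glm object
toy_pseudoR2 <- 1 - toy_model$deviance / toy_model$null.deviance
toy_pseudoR2
```

A value near 0 means the inputs explain little of the null deviance, and a value near 1 means they explain most of it.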

In [None]:
# sparrow is in the workspace
summary(sparrow)
#       status       age             total_length      wingspan    
# Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
# Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
#               Mode  :character   Median :160.0   Median :247.0  
#                                   Mean   :160.4   Mean   :247.5  
#                                   3rd Qu.:162.5   3rd Qu.:251.0  
#                                   Max.   :167.0   Max.   :256.0  
#     weight       beak_head        humerus           femur       
# Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
# 1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
# Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
# Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
# 3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
# Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
#     legbone          skull           sternum      
# Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
# 1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
# Median :1.130   Median :0.6000   Median :0.8500  
# Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
# 3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
# Max.   :1.230   Max.   :0.6400   Max.   :0.9300
 
# Create the survived column
sparrow$survived <- sparrow$status == "Survived"

# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
# survived ~ total_length + weight + humerus

# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = "binomial")


# Call summary
summary(sparrow_model)
# Call:
# glm(formula = fmla, family = "binomial", data = sparrow)

# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -2.1117  -0.6026   0.2871   0.6577   1.7082  

# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
# total_length  -0.5435     0.1409  -3.858 0.000115 ***
# weight        -0.5689     0.2771  -2.053 0.040060 *  
# humerus       75.4610    19.1586   3.939 8.19e-05 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# (Dispersion parameter for binomial family taken to be 1)

#     Null deviance: 118.008  on 86  degrees of freedom
# Residual deviance:  75.094  on 83  degrees of freedom
# AIC: 83.094

# Number of Fisher Scoring iterations: 5

# Call glance (from the broom package)
library(broom)
(perf <- glance(sparrow_model))
#   null.deviance df.null    logLik      AIC      BIC deviance df.residual
# 1      118.0084      86 -37.54718 83.09436 92.95799 75.09436          83

# Calculate pseudo-R-squared
(pseudoR2 <- 1- perf$deviance / perf$null.deviance)
# [1] 0.3636526

#### Predict sparrow survival
In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.

Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:
```
predict(model, type = "response")
```
Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
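
The difference between the two prediction scales can be sketched on a toy model. The example below assumes the built-in mtcars data and an arbitrary model vs ~ mpg, neither of which is part of the exercise. It checks that applying the inverse logit (`plogis()`) to the default log-odds predictions gives the same values as `type = "response"`.

```r
# Illustrative only: built-in mtcars data, not the sparrow data
toy_model <- glm(vs ~ mpg, data = mtcars, family = binomial)

log_odds <- predict(toy_model)                     # default scale: log-odds
prob     <- predict(toy_model, type = "response")  # probability scale

# The inverse logit maps log-odds to probabilities
all.equal(plogis(log_odds), prob)
```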

You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model's gain curve is close to the ideal ("wizard") gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

- frame: data frame with prediction column and ground truth column
- xvar: the name of the column of predictions (as a string)
- truthVar: the name of the column with actual outcome (as a string)
- title: a title for the plot (as a string)

```
GainCurvePlot(frame, xvar, truthVar, title)
```
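
To make the idea of a gain curve concrete, here is a rough base R sketch of the quantity such a plot displays: sort the rows by predicted score, highest first, and track the cumulative fraction of actual events captured. The vectors `toy_pred` and `toy_truth` are made-up values, and this is the general cumulative-gain calculation, not the internals of GainCurvePlot().

```r
# Hypothetical predictions and outcomes (1 = event occurred)
toy_pred  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
toy_truth <- c(1,   1,   0,   1,   0,   1,   0,   0)

# Sort by predicted score, highest first
ord <- order(toy_pred, decreasing = TRUE)

# x-axis: fraction of rows examined; y-axis: fraction of all events found so far
frac_items  <- seq_along(ord) / length(ord)
frac_events <- cumsum(toy_truth[ord]) / sum(toy_truth)

data.frame(frac_items, frac_events)
```

A model that sorts well finds most of the events early, so `frac_events` climbs faster than `frac_items`.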

In [None]:
# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")

# Look at gain curve (GainCurvePlot() comes from the WVPlots package)
library(WVPlots)
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

![predict_sparrow_survival](./figures/predict_sparrow_survival.png)

#### Fit a model to predict bike rental counts
In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

The data frame has the columns:

- cnt: the number of bikes rented in that hour (the outcome)
- hr: the hour of the day (0-23, as a factor)
- holiday: TRUE/FALSE
- workingday: TRUE if neither a holiday nor a weekend, else FALSE
- weathersit: categorical, "Clear to partly cloudy"/"Light Precipitation"/"Misty"
- temp: normalized temperature in Celsius
- atemp: normalized "feeling" temperature in Celsius
- hum: normalized humidity
- windspeed: normalized windspeed
- instant: the time index -- number of hours since beginning of data set (not a variable)
- mnth and yr: month and year indices (not variables)

Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.
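
A Poisson model assumes that the mean and the variance of the outcome are equal. When the variance is much larger than the mean, quasipoisson is the usual choice. The sketch below illustrates that comparison on simulated counts, not the bike data. The `toy_` names and the "variance more than twice the mean" cutoff are arbitrary assumptions for illustration, not a formal test.

```r
set.seed(42)

# Hypothetical overdispersed counts (negative binomial draws), not the bike data
toy_counts <- rnbinom(500, mu = 50, size = 2)

toy_mean <- mean(toy_counts)
toy_var  <- var(toy_counts)
c(mean = toy_mean, variance = toy_var)

# Arbitrary illustrative cutoff: variance more than twice the mean
toy_family <- if (toy_var > 2 * toy_mean) "quasipoisson" else "poisson"
toy_family
```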

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

In [None]:
# bikesJuly is in the workspace
str(bikesJuly)
# 'data.frame':	744 obs. of  12 variables:
# $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
# $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
# $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
# $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
# $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
# $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
# $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
# $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
# $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

# The outcome column
outcome 
# [1] "cnt"

# The inputs to use
vars 
# [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
# [6] "atemp"      "hum"        "windspeed"

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
# [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly[[outcome]]))
# [1] 273.6653

(var_bikes <- var(bikesJuly[[outcome]]))
# [1] 45863.84

# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = "quasipoisson")

# Call glance
(perf <- glance(bike_model))
#   null.deviance df.null logLik AIC BIC deviance df.residual
# 1      133364.9     743     NA  NA  NA  28774.9         712

# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
# [1] 0.7842393

#### Predict bike rentals on new data
In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = "response" with predict() when predicting counts from a glm poisson or quasipoisson model.
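
The reason for `type = "response"` can be sketched on simulated data. For these families glm() uses a log link by default, so the default predictions are log counts. The data frame `toy` and its coefficients are assumptions made up for this example. It also shows the RMSE calculation on the count scale.

```r
set.seed(1)

# Hypothetical count data with a log-linear relationship to x
toy <- data.frame(x = runif(200))
toy$y <- rpois(200, lambda = exp(1 + 2 * toy$x))

toy_model <- glm(y ~ x, data = toy, family = quasipoisson)

link_pred  <- predict(toy_model, newdata = toy)                     # log scale
count_pred <- predict(toy_model, newdata = toy, type = "response")  # count scale

# Exponentiating the link-scale predictions recovers the counts
all.equal(exp(link_pred), count_pred)

# RMSE must be computed on the count scale
toy_rmse <- sqrt(mean((count_pred - toy$y)^2))
toy_rmse
```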

In [None]:
# bikesAugust is in the workspace
str(bikesAugust)

# bike_model is in the workspace
summary(bike_model)
# Call:
# glm(formula = fmla, family = quasipoisson, data = bikesJuly)

# Deviance Residuals: 
#     Min        1Q    Median        3Q       Max  
# -21.6117   -4.3121   -0.7223    3.5507   16.5079  

# Coefficients:
#                               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                    5.934986   0.439027  13.519  < 2e-16 ***
# hr1                           -0.580055   0.193354  -3.000 0.002794 ** 
# hr2                           -0.892314   0.215452  -4.142 3.86e-05 ***
# hr3                           -1.662342   0.290658  -5.719 1.58e-08 ***
# hr4                           -2.350204   0.393560  -5.972 3.71e-09 ***
# hr5                           -1.084289   0.230130  -4.712 2.96e-06 ***
# hr6                            0.211945   0.156476   1.354 0.176012    
# hr7                            1.211135   0.132332   9.152  < 2e-16 ***
# hr8                            1.648361   0.127177  12.961  < 2e-16 ***
# hr9                            1.155669   0.133927   8.629  < 2e-16 ***
# hr10                           0.993913   0.137096   7.250 1.09e-12 ***
# hr11                           1.116547   0.136300   8.192 1.19e-15 ***
# hr12                           1.282685   0.134769   9.518  < 2e-16 ***
# hr13                           1.273010   0.135872   9.369  < 2e-16 ***
# hr14                           1.237721   0.136386   9.075  < 2e-16 ***
# hr15                           1.260647   0.136144   9.260  < 2e-16 ***
# hr16                           1.515893   0.132727  11.421  < 2e-16 ***
# hr17                           1.948404   0.128080  15.212  < 2e-16 ***
# hr18                           1.893915   0.127812  14.818  < 2e-16 ***
# hr19                           1.669277   0.128471  12.993  < 2e-16 ***
# hr20                           1.420732   0.131004  10.845  < 2e-16 ***
# hr21                           1.146763   0.134042   8.555  < 2e-16 ***
# hr22                           0.856182   0.138982   6.160 1.21e-09 ***
# hr23                           0.479197   0.148051   3.237 0.001265 ** 
# holidayTRUE                    0.201598   0.079039   2.551 0.010961 *  
# workingdayTRUE                 0.116798   0.033510   3.485 0.000521 ***
# weathersitLight Precipitation -0.214801   0.072699  -2.955 0.003233 ** 
# weathersitMisty               -0.010757   0.038600  -0.279 0.780572    
# temp                          -3.246001   1.148270  -2.827 0.004833 ** 
# atemp                          2.042314   0.953772   2.141 0.032589 *  
# hum                           -0.748557   0.236015  -3.172 0.001581 ** 
# windspeed                      0.003277   0.148814   0.022 0.982439    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# (Dispersion parameter for quasipoisson family taken to be 38.98949)

#     Null deviance: 133365  on 743  degrees of freedom
# Residual deviance:  28775  on 712  degrees of freedom
# AIC: NA

# Number of Fisher Scoring iterations: 5

# Make predictions on August data
bikesAugust$pred  <- predict(bike_model, newdata = bikesAugust, type = "response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
#       rmse
# 1 112.5815

# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

![predict_bike_rentals_on_new_data](./figures/predict_bike_rentals_on_new_data.png)

#### Visualize the Bike Rental Predictions
In the previous exercise, you visualized the bike model's predictions using the standard "outcome vs. prediction" scatter plot. Since the bike rental data is time series data, you might be interested in how the model performs as a function of time. In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column. gather() takes as arguments:

- The "wide" data frame to be gathered (implicit in a pipe)
- The name of the key column to be created - contains the names of the gathered columns.
- The name of the value column to be created - contains the values of the gathered columns.
- The names of the columns to be gathered into a single column.

You'll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant, counts the number of observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0.
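
To show what the gathered result looks like, the sketch below builds the same "long" layout by hand in base R. It is a stand-in for illustration, not a replacement for gather(), and the small data frame `wide` is made up.

```r
# Hypothetical wide data: actual and predicted counts side by side
wide <- data.frame(instant = 1:3,
                   cnt  = c(10, 20, 30),
                   pred = c(12, 18, 33))

# Stack the two columns: 'valuetype' records the source column (the key),
# 'value' holds the contents of that column
long <- data.frame(
  instant   = rep(wide$instant, times = 2),
  valuetype = rep(c("cnt", "pred"), each = nrow(wide)),
  value     = c(wide$cnt, wide$pred)
)
long
```

Each original row now appears twice, once per gathered column, which is what lets ggplot2 map `valuetype` to color and linetype.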

In [None]:
# Plot predictions and cnt by date/time
bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")

![visualize_the_bike_rental_predictions](./figures/visualize_the_bike_rental_predictions.png)

#### Model soybean growth with GAM
In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn't grow at a steady rate, but rather has a "growth spurt" that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

```
y ~ s(x)
```

Also remember that gam() from the package mgcv has the calling interface

```
gam(formula, family, data)
```

For standard regression, use family = gaussian (the default).
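
As a hedged sketch of this interface, the example below fits a linear model and a GAM to simulated S-shaped data and compares their adjusted R-squared values. The data frame `toy` and its growth curve are assumptions invented for illustration, not the soybean data.

```r
library(mgcv)
set.seed(10)

# Hypothetical S-shaped growth curve plus noise
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- 1 / (1 + exp(-2 * (toy$x - 5))) + rnorm(200, sd = 0.05)

toy_lin <- lm(y ~ x, data = toy)
toy_gam <- gam(y ~ s(x), family = gaussian, data = toy)

# Adjusted R-squared of each model
c(linear = summary(toy_lin)$adj.r.squared,
  gam    = summary(toy_gam)$r.sq)
```

Because the smooth term can bend with the data, the GAM should follow the plateau at each end of the curve, where a straight line cannot.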

The soybean training data, soybean_train, is loaded into your workspace. It has two columns: the outcome weight and the variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time, has already been loaded into the workspace as well.

In [None]:
# soybean_train is in the workspace
summary(soybean_train)

# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point()

![model_soybean_growth_with_GAM](./figures/model_soybean_growth_with_GAM.png)

In [None]:
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)

# Call summary() on model.lin and look for R-squared
summary(model.lin)
# Call:
# lm(formula = fmla.lin, data = soybean_train)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -9.3933 -1.7100 -0.3909  1.9056 11.4381 

# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
# Time         0.292094   0.007444   39.24   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residual standard error: 2.778 on 328 degrees of freedom
# Multiple R-squared:  0.8244,	Adjusted R-squared:  0.8238 
# F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16

# Call summary() on model.gam and look for R-squared
summary(model.gam)
# Family: gaussian 
# Link function: identity 

# Formula:
# weight ~ s(Time)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Approximate significance of smooth terms:
#           edf Ref.df     F p-value    
# s(Time) 8.495   8.93 338.2  <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# R-sq.(adj) =  0.902   Deviance explained = 90.4%
# GCV = 4.4395  Scale est. = 4.3117    n = 330

# Call plot() on model.gam
plot(model.gam)

![model_soybean_growth_with_GAM_2](./figures/model_soybean_growth_with_GAM_2.png)

#### Predict with the soybean model on test data
In this exercise you will apply the soybean models from the previous exercise (model.lin and model.gam, already in your workspace) to new data: soybean_test.

In [None]:
# soybean_test is in the workspace
summary(soybean_test)
#      Plot    Variety   Year         Time           weight       
#  1988F8 : 4   F:43    1988:32   Min.   :14.00   Min.   : 0.0380  
#  1988P7 : 4   P:39    1989:26   1st Qu.:23.00   1st Qu.: 0.4248  
#  1989F8 : 4           1990:24   Median :41.00   Median : 3.0025  
#  1990F8 : 4                     Mean   :44.09   Mean   : 7.1576  
#  1988F4 : 3                     3rd Qu.:69.00   3rd Qu.:15.0113  
#  1988F2 : 3                     Max.   :84.00   Max.   :30.2717  
#  (Other):60

# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
# # A tibble: 2 x 2
#   modeltype  rmse
#   <chr>     <dbl>
# 1 pred.gam   2.29
# 2 pred.lin   3.19

# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")

![predict_with_the_soybean_model_on_test_data](./figures/predict_with_the_soybean_model_on_test_data.png)

#### Build a random forest model for bike rentals
In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

- formula
- data
- num.trees: the number of trees in the forest.
- respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" for regression.
- seed: because this is a random algorithm, you will set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
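
The formula-assembly step can be sketched on its own with a shortened, illustrative variable list. The names `toy_outcome`, `toy_vars`, and `toy_fmla` are assumptions for this example. as.formula() converts the string into a formula object for functions that require one.

```r
toy_outcome <- "cnt"
toy_vars    <- c("hr", "holiday", "temp")

# Join the inputs with " + ", then prepend the outcome and "~"
toy_fmla_string <- paste(toy_outcome, "~", paste(toy_vars, collapse = " + "))
toy_fmla_string

# Convert the string to a formula object
toy_fmla <- as.formula(toy_fmla_string)
all.vars(toy_fmla)
```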

In [None]:
# bikesJuly is in the workspace
str(bikesJuly)
# 'data.frame':	744 obs. of  12 variables:
#  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
#  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
#  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
#  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
#  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
#  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
#  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
#  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
#  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

# Random seed to reproduce results
seed
# [1] 423563

# the outcome column
(outcome <- "cnt")
# [1] "cnt"

# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
# [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
# [6] "atemp"      "hum"        "windspeed"

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
# [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

# Load the package ranger
library(ranger)

# Fit and print the random forest model.
(bike_model_rf <- ranger(fmla, 
                         bikesJuly, 
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
# Ranger result

# Call:
#  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 

# Type:                             Regression 
# Number of trees:                  500 
# Sample size:                      744 
# Number of independent variables:  8 
# Mtry:                             2 
# Target node size:                 5 
# Variable importance mode:         none 
# Splitrule:                        variance 
# OOB prediction error (MSE):       8230.568 
# R squared (OOB):                  0.8205434

#### Predict bike rentals with the random forest model
In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

```
predict(model, data)$predictions
```

In [None]:
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse
#       rmse
# 1 96.66032

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

![predict_bike_rentals_with_the_random_forest_model](./figures/predict_bike_rentals_with_the_random_forest_model.png)

#### Visualize random forest bike model predictions
In the previous exercise, you saw that the random forest bike model did better on the August data than the quasipoisson model, in terms of RMSE.

In this exercise you will visualize the random forest model's August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.

In [None]:
first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14) 

# Plot predictions and cnt by date/time 
ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

![visualize_random_forest_bike_model_predictions](./figures/visualize_random_forest_bike_model_predictions.png)

#### vtreat on a small example
In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

To design a treatment plan, use the function designTreatmentsZ():

```
treatplan <- designTreatmentsZ(data, varlist)
```

- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

```
scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)
```

- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
    - "clean": a numerical variable with no NAs or NaNs
    - "lev": an indicator variable for a specific level of the original categorical variable.
    
(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either "clean" or "lev":

```
newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)
```
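
The same selection can be written in base R, which may help clarify what the pipeline does. The sketch below assumes a hand-built data frame, `toy_scoreFrame`, that mirrors the scoreFrame printed in the exercise code. It is not produced by vtreat here.

```r
# Hand-built stand-in for the scoreFrame shown in the exercise output
toy_scoreFrame <- data.frame(
  varName  = c("color_catP", "size_clean", "color_lev_x_b",
               "color_lev_x_g", "color_lev_x_r"),
  origName = c("color", "size", "color", "color", "color"),
  code     = c("catP", "clean", "lev", "lev", "lev"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose code is "clean" or "lev", then take varName
keep <- toy_scoreFrame$code %in% c("clean", "lev")
toy_newvars <- toy_scoreFrame$varName[keep]
toy_newvars
```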

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

```
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
```
- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data

In [None]:
# dframe is in the workspace
dframe
#   color size popularity
# 1      b   13  1.0785088
# 2      r   11  1.3956245
# 3      r   15  0.9217988
# 4      r   14  1.2025453
# 5      r   13  1.0838662
# 6      b   11  0.8043527
# 7      r    9  1.1035440
# 8      g   12  0.8746332
# 9      b    7  0.6947058
# 10     b   12  0.8832502

# Create and print a vector of variable names
(vars <- c("color", "size"))
# [1] "color" "size"

# Load the package vtreat, plus magrittr for use_series()
library(vtreat)
library(magrittr)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
# [1] "vtreat 1.2.0 inspecting inputs Sat Feb  8 09:34:11 2020"
# [1] "designing treatments Sat Feb  8 09:34:11 2020"
# [1] " have initial level statistics Sat Feb  8 09:34:11 2020"
# [1] "design var color Sat Feb  8 09:34:11 2020"
# [1] "design var size Sat Feb  8 09:34:11 2020"
# [1] " scoring treatments Sat Feb  8 09:34:11 2020"
# [1] "have treatment plan Sat Feb  8 09:34:11 2020"

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))
#         varName origName  code
# 1    color_catP    color  catP
# 2    size_clean     size clean
# 3 color_lev_x_b    color   lev
# 4 color_lev_x_g    color   lev
# 5 color_lev_x_r    color   lev
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
# [1] "size_clean"    "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"

# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
#   size_clean color_lev_x_b color_lev_x_g color_lev_x_r
# 1          13             1             0             0
# 2          11             0             0             1
# 3          15             0             0             1
# 4          14             0             0             1
# 5          13             0             0             1
# 6          11             1             0             0
# 7           9             0             0             1
# 8          12             0             1             0
# 9           7             1             0             0
# 10         12             1             0             0

#### Novel levels
When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).
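
The core idea of a "safe" encoding can be sketched in base R: fix the set of indicator columns from the training data, so that a level seen only in later data matches none of them and is encoded as all zeros. The helper `one_hot()` and the small data frames are hypothetical. This is a hand-rolled illustration of the behaviour, not vtreat's implementation.

```r
# Training and test data; "y" appears only in the test set (a novel level)
toy_train <- data.frame(color = c("b", "r", "r", "g", "b"))
toy_test  <- data.frame(color = c("g", "y", "b", "r"))

# The indicator columns are determined by the training data alone
train_levels <- sort(unique(toy_train$color))

# Hypothetical helper: one 0/1 column per training level
one_hot <- function(x, levels) {
  out <- sapply(levels, function(lev) as.integer(x == lev))
  colnames(out) <- paste0("color_lev_x_", levels)
  out
}

test_encoded <- one_hot(toy_test$color, train_levels)
test_encoded
```

The second test row, color "y", gets a zero in every indicator column, so a downstream model still receives a well-formed input.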

In this exercise you will see how vtreat handles categorical values that did not appear in the training set. The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace. dframe and a new data frame testframe are also in your workspace.

In [None]:
# Print testframe
testframe
#   color size popularity
# 1      g    7  0.9733920
# 2      g    8  0.9122529
# 3      y   10  1.4217153
# 4      g   12  1.1905828
# 5      g    6  0.9866464
# 6      y    8  1.3697515
# 7      b   12  1.0959387
# 8      g   12  0.9161547
# 9      g   12  1.0000460
# 10     r    8  1.3137360

# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
#   size_clean color_lev_x_b color_lev_x_g color_lev_x_r
# 1           7             0             1             0
# 2           8             0             1             0
# 3          10             0             0             0
# 4          12             0             1             0
# 5           6             0             1             0
# 6           8             0             0             0
# 7          12             1             0             0
# 8          12             0             1             0
# 9          12             0             1             0
# 10          8             0             0             1

#### vtreat the bike rental data
In this exercise you will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

The data frames bikesJuly and bikesAugust are in the workspace.

For your convenience, we have defined the variable vars with the list of variable columns for the model.

In [None]:
# The outcome column
(outcome <- "cnt")

# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("clean", "lev")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column

# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)