### HSML 6295 Session 5 (Regression Trees) Revised

#### I. The `HCAHPS` Data Set

We use the HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems) data set to predict a hospital's overall rating.

Read in the data set, call it "`full`", and drop observations with at least one missing value.


In [None]:
full = read.csv("HSML 6295 ds HCAHPS.csv")
full = na.omit(full)


**Variable Name** | **Description** 
---               | ---         
`nurses`          | Communication with Nurses 
`doctors`         | Communication with Doctors 
`staff`           | Responsiveness of Hospital Staff
`care`            | Care Transition
`meds`            | Communication about Medicines
`clean`           | Cleanliness and Quietness of Hospital Environment
`info`            | Discharge Information
`rating`          | Overall Rating of Hospital

Show the structure of the data set.


In [None]:
str(full)




Move the response variable `rating` from the eighth to the first position and rename to `response`.


In [None]:
full = subset(full, select=c(8,1:7))
names(full)[names(full)=="rating"] = "response"
# show list of variables in current data set
names(full)



Compute summary statistics for the full data set. 


In [None]:
library(stargazer)
stargazer(full, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Full Data Set", digits=1)



Calculate the number of predictor variables.


In [None]:
(predictors = ncol(full)-1)




Create a vector called `train_id` of 2,792/2 = 1,396 row numbers drawn at random, without replacement, from 1 to 2,792, the number of observations in the "`full`" data set.


In [None]:
set.seed (12345)
train_id = sample(1:nrow(full), nrow(full)/2)


Split the full data set into two subsets of equal sample size, called "`train`" and "`test`".
To do so, use the random numbers in the `train_id` list created above to tag the observations that will be assigned to the training set. 


In [None]:
train = full[train_id,]




Assign the observations whose ID number is not included in the `train_id` list to the test set.


In [None]:
test = full[-train_id,]
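
The split logic can be checked on a small example. The sketch below is a minimal illustration, assuming a hypothetical 10-row data frame `toy` in place of `full`; the names `toy`, `toy_train_id`, `toy_train`, and `toy_test` are not part of the notebook.

```r
# Toy stand-in for `full`: 10 rows instead of 2,792 (hypothetical data)
toy = data.frame(id = 1:10,
                 response = c(71, 65, 80, 77, 69, 74, 83, 60, 72, 78))

set.seed(12345)
# Draw half of the row numbers at random, without replacement
toy_train_id = sample(1:nrow(toy), nrow(toy)/2)

toy_train = toy[toy_train_id, ]    # rows whose number was drawn
toy_test  = toy[-toy_train_id, ]   # all remaining rows

# The two subsets do not overlap and together cover every row
length(intersect(toy_train$id, toy_test$id))
sort(c(toy_train$id, toy_test$id))
```

Negative indexing (`-toy_train_id`) drops the listed rows, which is why the test set is the exact complement of the training set.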




Compute summary statistics for the training set.


In [None]:
stargazer(train, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Training Set", digits=1)



Compute summary statistics for the test set.


In [None]:
stargazer(test, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Test Set", digits=1)


Note that while the maximum values of most predictors are much larger in the test set than in the training set (e.g. 89.2 versus 69.8 for `care`), the differences in mean values are never larger than one unit and the differences in median values are never larger than half a unit.

#### II. Null Model

The null (intercept-only) model uses the mean value of the response in the *training* set as the prediction of all values of the response in the *test* set. It is called the "null model" because it does not use any predictor variables.


In [None]:
predicted = rep(mean(train$response), length(test$response))




Compute the test error, defined as the mean squared error.


In [None]:
(mse_null = round(mean((predicted - test$response)^2),2))
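
The same mean-squared-error expression recurs for every method in this session. As a minimal sketch, it could be wrapped in a helper; the function `mse()` and the toy numbers below are hypothetical and are not part of the notebook.

```r
# Hypothetical helper (not part of the original notebook)
mse = function(predicted, actual) mean((predicted - actual)^2)

# Toy numbers: the null model predicts the training mean for every test case
toy_train_y = c(70, 72, 68, 74)
toy_test_y  = c(69, 75, 71)
toy_pred = rep(mean(toy_train_y), length(toy_test_y))

# Errors are 2, -4 and 0, so the MSE is (4 + 16 + 0) / 3
mse(toy_pred, toy_test_y)
```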



#### III. Linear Regression

Fit a linear regression model to the training set.


In [None]:
linear = lm(response ~ ., train)
round(coef(summary(linear)),2)
(r2 = round(summary(linear)$r.squared,2))



Compute the predicted values of the response in the *test* set and the test error.


In [None]:
predicted = predict(linear, newdata = test)
(mse_linear = round(mean((predicted - test$response)^2),2))



Plot the predicted against the actual ratings.


In [None]:
par(pty = "s")
plot(predicted ~ test$response, xlim=c(30,110), ylim = c(30,110), asp=1,
     ylab = "Predicted Rating", xlab = "Actual Rating")
abline(0,1)
title(main = "Linear Regression Model")


#### IV. Ridge Regression

Define `x` as the matrix of predictors and `y` as the vector of responses in the training set. Also define a list ("`grid`") of $\lambda$ (lambda) values for which the ridge regression model is fit.


In [None]:
x = model.matrix(response ~ ., data=train)[,-1]
y = train$response
grid=10^seq(-1,.5,length=40)



Compute the value of $\lambda$ (lambda) that minimizes the cross-validated mean squared error in the *training* set. This value is stored as `cv$lambda.min`. To fit a ridge regression model, we set the parameter `alpha` to 0. The parameter `nfolds` sets the number of cross-validation folds.


In [None]:
library(glmnet)
set.seed (1)
cv = cv.glmnet(x, y, alpha=0, lambda = grid, nfolds = 10)
round(cv$lambda.min, 4)
plot(cv)


The horizontal axis in this graph is drawn on a logarithmic scale to show more detail. "Log" refers to the natural logarithm.
The value of $\lambda$ that minimizes the cross-validated mean squared error is shown at the left dotted line:


In [None]:
round(log(cv$lambda.min),2)



The red dots are the point estimates of the cross-validated prediction error and the gray bars extend one standard error above and below them. The right dotted line in the graph marks the "second-best" value of $\lambda$: the largest value whose cross-validated error is still within one standard error of the minimum achieved at `cv$lambda.min`. It is stored as `cv$lambda.1se`:


In [None]:
round(cv$lambda.1se,4)
round(log(cv$lambda.1se),2)
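
The one-standard-error rule behind `cv$lambda.1se` can be sketched by hand. The example below is only an illustration: the vectors `lambda`, `cv_mean`, and `cv_se` hold made-up numbers, not output from `cv.glmnet`.

```r
# Hypothetical cross-validation results for five candidate lambda values
lambda  = c(0.1, 0.3, 1.0, 3.0, 10.0)
cv_mean = c(14.6, 14.4, 14.5, 14.9, 16.2)   # mean CV error per lambda
cv_se   = c(0.4, 0.4, 0.4, 0.5, 0.6)        # standard error per lambda

# Lambda with the smallest mean CV error
i_min = which.min(cv_mean)
lambda_min = lambda[i_min]

# Largest lambda whose CV error is within one standard error of the minimum
threshold = cv_mean[i_min] + cv_se[i_min]
lambda_1se = max(lambda[cv_mean <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

Choosing the larger value trades a statistically negligible increase in estimated error for a more heavily regularized model.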



Compute the coefficient estimates for the ridge regression model that corresponds to `cv$lambda.min` and save the result as `ridge`.


In [None]:
ridge = glmnet(x, y, alpha=0, lambda=cv$lambda.min)
round(coef(ridge),2)


The coefficients for `nurses` and `staff` have shrunk from 0.51 and -0.10 to 0.48 and -0.08, respectively. The coefficient for `doctors` has increased from 0.11 to 0.12, however. This is possible because ridge regression does not shrink each coefficient separately. Up to scaling constants, it minimizes the residual sum of squares plus a penalty on the sum of the squared coefficient estimates,
$$ \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2$$
In this example, the shrinkage of the coefficients for `nurses` and `staff` (and others) more than compensates for the expansion of the coefficient for `doctors`.
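
That the penalty acts on the sum of squared coefficients, rather than on each coefficient separately, can be illustrated with the closed-form ridge estimate. This is a minimal sketch on simulated data, assuming standardized predictors and a centered response; the function `ridge_coef()` and all data are hypothetical, and its `lambda` is on a different scale than the one used by `glmnet`.

```r
set.seed(1)
n = 200
# Two strongly correlated hypothetical predictors
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.3)
X = scale(cbind(x1, x2))          # standardize the predictors
y = 3 * x1 - 1 * x2 + rnorm(n)
y = y - mean(y)                   # center the response, so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef = function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

b_ols   = ridge_coef(0)           # lambda = 0 reproduces least squares
b_ridge = ridge_coef(50)

rbind(b_ols, b_ridge)
c(ols = sum(b_ols^2), ridge = sum(b_ridge^2))   # the penalized quantity shrinks
```

The sum of squared coefficients is guaranteed to fall as `lambda` grows; whether any single coefficient rises or falls depends on how the predictors are correlated.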

If we wanted to shrink the coefficient estimates even further, we could use the "second-best" value of $\lambda$:


In [None]:
ridge.2 = glmnet(x, y, alpha=0, lambda=cv$lambda.1se)
round(coef(ridge.2),2)


The larger value of $\lambda$ has shrunk the coefficient for `nurses` from 0.48 to 0.32. The sign of the coefficient for `staff` has flipped from negative to positive but its absolute value has shrunk: it changed from -0.08 to 0.03. The coefficient for `doctors` has increased further, from 0.12 to 0.16.

Compute the predicted values of the response in the *test* set and the test error.


In [None]:
x = model.matrix(response ~ ., test)[,-1]
# predictions from the model fit at lambda.min
predicted = predict(ridge, s=cv$lambda.min, newx=x)
(mse_ridge = round(mean((predicted - test$response)^2),2))
# predictions from the model fit at lambda.1se
predicted.2 = predict(ridge.2, s=cv$lambda.1se, newx=x)
(mse_ridge.2 = round(mean((predicted.2 - test$response)^2),2))


#### V. The Lasso

Define `x` as the matrix of predictors and `y` as the vector of responses in the training set. Also define a list ("`grid`") of $\lambda$ (lambda) values for which the lasso regression model is fit.


In [None]:
x = model.matrix(response ~ ., data=train)[,-1]
y = train$response
grid=10^seq(-1.5,-0.2,length=40)



Compute the value of $\lambda$ (lambda) that minimizes the cross-validated mean squared error in the *training* set. This value is stored as `cv$lambda.min`. To fit a lasso regression model, we set the parameter `alpha` to 1. The parameter `nfolds` sets the number of cross-validation folds.


In [None]:
library(glmnet)
set.seed (1)
cv = cv.glmnet(x, y, alpha=1, lambda = grid, nfolds = 10)
round(cv$lambda.min, 4)
plot(cv)


The horizontal axis in this graph is drawn on a logarithmic scale to show more detail. "Log" refers to the natural logarithm.
The value of $\lambda$ that minimizes the cross-validated mean squared error is shown at the left dotted line:


In [None]:
round(log(cv$lambda.min),2)



The red dots are the point estimates of the cross-validated prediction error and the gray bars extend one standard error above and below them. The right dotted line in the graph marks the "second-best" value of $\lambda$: the largest value whose cross-validated error is still within one standard error of the minimum achieved at `cv$lambda.min`. It is stored as `cv$lambda.1se`:


In [None]:
round(cv$lambda.1se,4)
round(log(cv$lambda.1se),2)



Compute the coefficient estimates for the lasso regression model that corresponds to `cv$lambda.min` and save the result as `lasso`.


In [None]:
lasso = glmnet(x, y, alpha=1, lambda=cv$lambda.min)
round(coef(lasso),2)


As shown in the top horizontal axis of the graph, the lasso model given by `cv$lambda.min` includes only 6 predictors. The predictor `meds` is dropped. The coefficients for `nurses` and `doctors` have shrunk from 0.51 and 0.11 to 0.47 and 0.10, respectively.
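
Why the lasso can set a coefficient to exactly zero is easiest to see in the special case of orthonormal predictors, where the lasso estimate is the least-squares estimate shrunk toward zero by a fixed amount and truncated at zero (soft-thresholding). The sketch below is only an illustration under that assumption: the function `soft_threshold()`, the threshold `t`, and the coefficient values are hypothetical and are not taken from the HCAHPS fit, whose correlated predictors are not shrunk by a uniform amount.

```r
# Soft-thresholding: move each coefficient toward zero by t,
# and set it to exactly zero when its absolute value is at most t
soft_threshold = function(b, t) sign(b) * pmax(abs(b) - t, 0)

# Hypothetical least-squares coefficients (not from the HCAHPS data)
b_ls = c(a = 0.50, b = 0.10, c = -0.02)

b_lasso = soft_threshold(b_ls, t = 0.03)
b_lasso
```

The small coefficient `c` is dropped entirely, while the larger ones survive with reduced magnitude. Ridge regression, by contrast, shrinks coefficients without setting them to zero.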

If we wanted to shrink the coefficient estimates even further, we could use the "second-best" value of $\lambda$:


In [None]:
lasso.2 = glmnet(x, y, alpha=1, lambda=cv$lambda.1se)
round(coef(lasso.2),2)


As shown in the graph, the more restrictive lasso model also drops `staff` as a predictor. The coefficients for `nurses` and `doctors` have shrunk from 0.47 and 0.10 to 0.42 and 0.05, respectively.

Compute the predicted values of the response in the *test* set and the test error.


In [None]:
x = model.matrix(response ~ ., test)[,-1]
# predictions from the model fit at lambda.min
predicted = predict(lasso, s=cv$lambda.min, newx=x)
(mse_lasso = round(mean((predicted - test$response)^2),2))
# predictions from the model fit at lambda.1se
predicted.2 = predict(lasso.2, s=cv$lambda.1se, newx=x)
(mse_lasso.2 = round(mean((predicted.2 - test$response)^2),2))


#### VI. Single Pruned Tree

Using the training set, grow the unpruned ("fully grown") tree.


In [None]:
library(tree)
tree = tree(response ~ ., data=train)
summary(tree)



Plot the unpruned tree.


In [None]:
plot(tree)
text(tree, pretty = 0)
title(main = "Unpruned Regression Tree \n")



Using the training set, compute the cross-validation error for subtrees of various sizes. The parameter `K` sets the number of cross-validation folds.


In [None]:
set.seed(6295)
cv = cv.tree(tree, K = 10)



Identify the tree size that minimizes the cross-validation error.


In [None]:
(arg_min_cv = cv$size[which.min(cv$dev)])




Plot the cross-validation error as a function of the tree size.


In [None]:
plot(cv$dev ~ cv$size, type='b', col="lightseagreen", lwd=2,
     xlab = "Subtree Size (Terminal Nodes)", ylab = "Cross-Validated Prediction Error")
axis(1, at=cv$size)



In this example, the subtree size that minimizes the cross-validated mean squared error is identical to the size of the fully grown tree. Both the unpruned and the optimal tree have 9 terminal nodes. Thus, pruning will not affect the performance of this predictive model:


In [None]:
pruned_tree = prune.tree(tree, best = arg_min_cv)
summary(pruned_tree)
plot(pruned_tree)
text(pruned_tree, pretty=0)
title(main = "Pruned Regression Tree \n")



Compute the predicted values of the response in the *test* set and the test error.


In [None]:
predicted = predict(pruned_tree, newdata=test)
(mse_tree = round(mean((predicted - test$response)^2),2))



Plot the predicted against the actual ratings.


In [None]:
par(pty = "s")
plot(predicted ~ test$response, xlim=c(35,100), ylim = c(35,100), asp=1,
     ylab = "Predicted Rating", xlab = "Actual Rating")
abline(0,1)
title(main = "Single Pruned Tree")



Plot the same predictions again, restricting the vertical axis to predicted ratings between 68 and 70 to show detail.


In [None]:
par(pty = "m")
plot(predicted ~ test$response, xlim=c(35,100), ylim=c(68,70),
     ylab = "Predicted Rating", xlab = "Actual Rating")
abline(0,1)
title(main = "Single Pruned Tree: Detail")


#### VII. Bagging

Using the training set, grow one unpruned tree for each of 500 bootstrap samples. The parameter `mtry` sets the number of predictors that are tried at each split. Setting it to the total number of predictors turns the random forest into bagging.


In [None]:
library(randomForest)
set.seed(1)
bag = randomForest(response ~ ., data = train, mtry = predictors, importance = TRUE)
bag



Compute the predicted values of the response in the *test* set and the test error.


In [None]:
predicted = predict(bag, newdata=test)
(mse_bag = round(mean((predicted - test$response)^2),2))
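
The mechanics of bagging (resample, refit, average) can be sketched in base R. This is a minimal illustration on simulated data, not what `randomForest` does internally: one-split trees ("stumps") stand in for the fully grown trees, only 25 bootstrap samples are drawn, and the helpers `fit_stump()` and `predict_stump()` are hypothetical.

```r
set.seed(1)
# Hypothetical data: one predictor, a step-shaped signal plus noise
n = 200
x_train = runif(n);   y_train = ifelse(x_train > 0.5, 80, 70) + rnorm(n, sd = 2)
x_test  = runif(100); y_test  = ifelse(x_test  > 0.5, 80, 70) + rnorm(100, sd = 2)

# Base learner: find the single split that minimizes the residual sum of squares
fit_stump = function(x, y) {
  xs = sort(unique(x))
  cuts = (head(xs, -1) + tail(xs, -1)) / 2          # midpoints between neighbors
  rss = sapply(cuts, function(s) {
    l = y[x < s]; r = y[x >= s]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  best = cuts[which.min(rss)]
  list(cut = best, left = mean(y[x < best]), right = mean(y[x >= best]))
}
predict_stump = function(s, x) ifelse(x < s$cut, s$left, s$right)

B = 25   # number of bootstrap samples (the notebook uses 500 trees)
pred_matrix = sapply(1:B, function(b) {
  id = sample(1:n, n, replace = TRUE)               # bootstrap sample of training rows
  predict_stump(fit_stump(x_train[id], y_train[id]), x_test)
})

bagged = rowMeans(pred_matrix)                      # average the B predictions
mse_bag_toy  = mean((bagged - y_test)^2)
mse_null_toy = mean((mean(y_train) - y_test)^2)
c(bagged = mse_bag_toy, null = mse_null_toy)
```

Each column of `pred_matrix` holds the test predictions of one bootstrap tree; averaging across columns is the "aggregation" step.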


#### VIII. Random Forest

Define $m = \sqrt{p}$, the number of predictors tried at each split, rounded to the nearest integer. With $p = 7$ predictors, $\sqrt{7} \approx 2.65$, so $m = 3$.


In [None]:
(m = round(sqrt(predictors)))




Grow a random forest of 500 trees. The parameter `mtry` sets the number of predictors that are tried at each split.


In [None]:
library(randomForest)
set.seed(1)
rf = randomForest(response ~ ., data = train, mtry = m, importance = TRUE)
rf



Compute the predicted values of the response in the *test* set and the test error.


In [None]:
predicted = predict(rf, newdata=test)
(mse_rf = round(mean((predicted - test$response)^2),2))



Plot the predicted against the actual ratings.


In [None]:
par(pty = "s")
plot(predicted ~ test$response, xlim=c(35,100), ylim = c(35,100), asp=1,
     ylab = "Predicted Rating", xlab = "Actual Rating")
abline(0,1)
title(main = "Random Forest")



Plot the importance of each predictor.


In [None]:
varImpPlot(rf)



#### IX. Boosting

Grow a sequence of 5,000 trees. Setting `interaction.depth` to 1 restricts each tree to a single split.


In [None]:
library(gbm)
set.seed(1)
boost = gbm(response ~ ., data = train, distribution = "gaussian", 
            n.trees=5000, interaction.depth=1)
summary(boost, plotit = FALSE)



Compute the predicted values of the response in the *test* set and the test error.


In [None]:
predicted = predict(boost, newdata = test, n.trees=5000)
(mse_boost = round(mean((predicted - test$response)^2),2))
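
The sequential idea behind boosting, in which each new tree is fit to the residuals of the current model and added with a small weight, can be sketched in base R. This is a simplified illustration on simulated data, not the `gbm` implementation: the helpers `fit_stump()` and `predict_stump()`, the 100 trees, and the shrinkage value of 0.1 are assumptions of this sketch.

```r
set.seed(1)
n = 200
x = runif(n)
y = 70 + 10 * sin(2 * pi * x) + rnorm(n, sd = 1)    # hypothetical smooth signal

# One-split regression tree: the split that minimizes the residual sum of squares
fit_stump = function(x, y) {
  xs = sort(unique(x))
  cuts = (head(xs, -1) + tail(xs, -1)) / 2
  rss = sapply(cuts, function(s) {
    l = y[x < s]; r = y[x >= s]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  best = cuts[which.min(rss)]
  list(cut = best, left = mean(y[x < best]), right = mean(y[x >= best]))
}
predict_stump = function(s, x) ifelse(x < s$cut, s$left, s$right)

n_trees = 100      # the notebook uses 5,000
shrinkage = 0.1    # learning rate (an assumption of this sketch)

fitted = rep(mean(y), n)          # start from the mean of the response
train_mse = numeric(n_trees)
for (b in 1:n_trees) {
  residual = y - fitted                               # what is still unexplained
  stump = fit_stump(x, residual)                      # fit a stump to the residuals
  fitted = fitted + shrinkage * predict_stump(stump, x)
  train_mse[b] = mean((y - fitted)^2)
}
c(start = mean((y - mean(y))^2), end = train_mse[n_trees])
```

The training error falls with every added tree, which is why the number of trees and the shrinkage rate must be chosen with the test (or cross-validated) error in mind rather than the training error.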


#### X. Summary

**Prediction Method**   | **Test MSE**
---                     | ---:
Null Model              | 55.19
Linear Regression       | 14.40
Ridge Regression        | 14.39
Lasso                   | 14.38
Single Pruned Tree      | 18.92
Bagging                 | 13.60
Random Forest           | 13.31
Boosting                | 15.60


Plot the test MSE values in a Cleveland dot plot.


In [None]:
x = c(mse_boost, mse_rf, mse_bag, mse_tree, mse_lasso, mse_ridge, mse_linear)
l = c("Boosting", "Random Forest", "Bagging", "Single Pruned Tree", "Lasso", "Ridge Regression", "Linear Regression")
dotchart(x, labels = l, xlab = "Test Mean Squared Error",
         color = ifelse(x==x[which.min(x)], "red3", "black"))


The graph shows how much the two ensemble methods, bagging and random forest, improve on the single pruned tree. The graph also shows how much the random-forest model, which only considers *a subset of* all available predictors at each split, improves on the bootstrap-aggregation (bagging) model, which considers all available predictors at each split.
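
The size of these improvements can be read directly from the summary table. The sketch below copies the test MSE values from the table; the names `mse_table`, `reduction`, `gain_bagging`, and `gain_rf` are introduced here for illustration only.

```r
# Test MSE values copied from the summary table
mse_table = c("Null Model" = 55.19, "Linear Regression" = 14.40,
              "Ridge Regression" = 14.39, "Lasso" = 14.38,
              "Single Pruned Tree" = 18.92, "Bagging" = 13.60,
              "Random Forest" = 13.31, "Boosting" = 15.60)

# Percentage reduction in test MSE relative to the null model
reduction = round(100 * (1 - mse_table[-1] / mse_table["Null Model"]), 1)
sort(reduction, decreasing = TRUE)

# Absolute gains from single tree to bagging, and from bagging to random forest
gain_bagging = unname(mse_table["Single Pruned Tree"] - mse_table["Bagging"])
gain_rf      = unname(mse_table["Bagging"] - mse_table["Random Forest"])
c(tree_to_bagging = gain_bagging, bagging_to_rf = gain_rf)
```

Moving from a single tree to bagging lowers the test MSE by 5.32, while the further step to a random forest lowers it by 0.29.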

**Knowledge Check 1**

Use the "visits" data set to predict `visits` as a function of all 11 predictors in that data set.

Which prediction method achieves the lowest test error?

How does the test error of the null model compare to the test errors of the tree- and regression-based methods that use the predictor variables?
