## HSML 6295 Session 5 -- Classification Trees

This code uses the following packages:


In [None]:
# install.packages("tree")
# install.packages("randomForest")
# install.packages("gbm")
# install.packages("glmnet")
# install.packages("stargazer")
library(stargazer)


### I. The "Framingham Heart Study" Data Set

In this exercise we use the Framingham data set to predict whether a respondent's body mass index (BMI) is above 25 kg/m^2^, the clinical definition of "overweight".

Read in the data set, call it "`framingham`", and drop observations with at least one missing value.


In [None]:
framingham = read.csv("HSML 6295 ds Framingham.csv")
framingham = na.omit(framingham)



Show the dimensions (number of observations, number of variables) of this data set.


In [None]:
dim(framingham)




Show the names of the variables in this data set.


In [None]:
names(framingham)




Define the `response` variable as 1 if the respondent's BMI exceeded 25 kg/m^2^ and 0 otherwise.


In [None]:
framingham$response = ifelse(framingham$BMI > 25, 1, 0)
table(framingham$response)
round(prop.table(table(framingham$response)),4)



Retain a subset of the variables for building the prediction models below.


In [None]:
framingham = subset(framingham, 
                    select = c(response, Male, Age, Education, 
                               Smoker, Cigs.per.Day, BP.Medication, Hypertension, 
                               Diabetes, Heart.Rate, Glucose, observed.CHD)) 
names(framingham)   # show list of variables in subset



Create a list called `train_id` of 3,658/2 = 1,829 random numbers between 1 and 3,658, the number of observations in the "`framingham`" data set.


In [None]:
set.seed(101)
train_id = sample(1:nrow(framingham), nrow(framingham)/2)


Split the subset into two subsets of equal sample size, called "`training_set`" and "`test_set`".
To do so, use the random numbers in the `train_id` list created above to tag the observations that will be assigned to the training set. 


In [None]:
training_set = framingham[train_id,]




Assign the observations whose ID number is not included in the `train_id` list to the test set.


In [None]:
test_set = framingham[-train_id,]




Compute summary statistics for the training set.


In [None]:
stargazer(training_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Training Set", digits=2)



Compute summary statistics for the test set.


In [None]:
stargazer(test_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Test Set", digits=2)


Note that while the maximum values of most continuous variables may vary modestly between the training and test sets, the differences in mean and median values are quite small and often nil.

Define the confusion matrix (CM), accuracy, true positive rate (TPR), and false positive rate (FPR) achieved by a given prediction model as functions of the observed (`Actual`) and predicted (`Predicted`) responses.


In [None]:
# confusion matrix
CM = function(Actual, Predicted) {
    addmargins(table(Actual, Predicted), FUN = list(Total = sum), quiet = TRUE)
}
# accuracy
accuracy = function(Actual, Predicted) {
    round(100*mean(Actual == Predicted),2)
}
# true positive rate
TPR = function(Actual, Predicted) {
    round(100*sum((Actual==1)*(Predicted==1))/sum(Actual==1),2)
}
# false positive rate
FPR = function(Actual, Predicted) {
    round(100*sum((Actual==0)*(Predicted==1))/sum(Actual==0),2)
}
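
As a quick sanity check, the three measures can be worked out by hand on a small example. The sketch below uses two made-up vectors (`actual` and `predicted` are hypothetical and not taken from the Framingham data) and applies the same formulas as the functions above directly, so that it runs on its own.

```r
# Toy example with hypothetical labels (not Framingham data)
actual    = c(1, 1, 1, 0, 0, 0, 0, 1)
predicted = c(1, 0, 1, 0, 1, 0, 0, 1)

# confusion matrix: rows are actual classes, columns are predicted classes
toy_cm = table(Actual = actual, Predicted = predicted)
toy_cm

# same formulas as accuracy(), TPR(), and FPR(), without rounding
toy_accuracy = 100 * mean(actual == predicted)
toy_tpr = 100 * sum((actual == 1) * (predicted == 1)) / sum(actual == 1)
toy_fpr = 100 * sum((actual == 0) * (predicted == 1)) / sum(actual == 0)
c(accuracy = toy_accuracy, TPR = toy_tpr, FPR = toy_fpr)
```

Six of the eight toy observations are classified correctly (accuracy 75), three of the four actual positives are predicted positive (TPR 75), and one of the four actual negatives is predicted positive (FPR 25).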


### II. Null Model

The simplest prediction model assigns to each test observation the modal (most common) class found in the training set, *even if fewer than half the training observations belong to the modal class*. This classifier is also referred to as the "null model" in that it does not use any predictor variables. It thus serves as a baseline model: any prediction model that uses at least one predictor variable should perform at least as well as the null model.

Compute the predicted values of the response in the *test* set.

1. Find the modal class in the *training* set.


In [None]:
round(prop.table(table(training_set$response)),4)
mode = function(x) {
  unique_x = unique(x)
  unique_x[which.max(tabulate(match(x, unique_x)))]
}
mode(training_set$response)



2. Assign this class to *all* observations in the *test* set.


In [None]:
test_set$prediction = rep(mode(training_set$response), length(test_set$response))




Compute measures of predictive performance.


In [None]:
(CM(test_set$response, test_set$prediction))
(accuracy_null = accuracy(test_set$response, test_set$prediction))
(TPR_null = TPR(test_set$response, test_set$prediction))
(FPR_null = FPR(test_set$response, test_set$prediction))


### III. Logistic Regression

Fit a logistic regression of the `response` on all predictor variables in the training set and save the result as `logistic`.


In [None]:
logistic = glm(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                          + BP.Medication + Hypertension + Diabetes 
                          + Heart.Rate + Glucose + observed.CHD, 
               data   = training_set, 
               family = binomial)
round(coef(summary(logistic)),2)


For purposes of comparing the logistic regression model to the ridge and lasso regression models, it is useful to *scale* all predictors by 

a. subtracting from each value $x_i$ the predictor's mean value $\bar{x}$ (known as "centering")
b. dividing the resulting difference by the predictor's standard deviation $s$ (known as "standardizing")

$\frac{x_i - \bar{x}}{s}$
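
A minimal sketch of this transformation on a made-up vector (the values in `x_toy` are hypothetical): computing $(x_i - \bar{x})/s$ by hand should give the same result as R's built-in `scale()` function, which is the function applied to the predictors in this section.

```r
x_toy = c(2, 4, 4, 4, 5, 5, 7, 9)            # hypothetical predictor values

z_manual = (x_toy - mean(x_toy)) / sd(x_toy) # center, then standardize
z_scale  = as.numeric(scale(x_toy))          # built-in equivalent

all.equal(z_manual, z_scale)
round(c(mean = mean(z_manual), sd = sd(z_manual)), 2)
```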


In [None]:
stargazer(training_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Training Set", digits=2)

A = function(x) scale(x)
(variables   = ncol(training_set))
training_set[2:variables] = lapply(training_set[2:variables], A)

stargazer(training_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Training Set Scaled", digits=2)


When we center and standardize the predictors, they all have mean 0 and standard deviation 1.

The resulting logistic regression model is estimated as:


In [None]:
logistic = glm(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                          + BP.Medication + Hypertension + Diabetes 
                          + Heart.Rate + Glucose + observed.CHD, 
               data   = training_set, 
               family = binomial)
round(coef(summary(logistic)),2)


Note that the z and p values of the coefficient estimates are the same as before. The z and p values measure the strength of the statistical association between the predictor and the response after accounting for the influence of all other predictors included in the model. 
Centering and standardizing the predictors removes the influence of the unit of measurement. For instance, if we measured cigarette consumption in *packs* per day rather than *cigarettes* per day, mean cigarette consumption would be 9.12/20 = 0.456 packs per day and the coefficient estimate would be 20\*0.01 = 0.2, so that the product of the coefficient estimate and the predictor remains the same. The magnitude of the coefficient estimate matters when we estimate ridge or lasso regression models because these models penalize the sum of the squared coefficient estimates (ridge) and the sum of the absolute values of the coefficient estimates (lasso), respectively. Therefore, if we measured cigarette consumption in packs per day, its larger coefficient would be penalized more heavily than if we measured it in cigarettes per day, *even though the strength of the statistical association with the response is the same*. To avoid this dependency on the unit of measurement, we center and standardize the predictors.
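
The effect of the unit of measurement can be illustrated with simulated data. The sketch below is an assumption-laden toy example (`cigs_sim`, `packs_sim`, `y_sim`, and the data-generating coefficients are all hypothetical, not Framingham values): it fits the same logistic regression twice, once per unit, and compares the coefficient estimates and z values.

```r
set.seed(1)
n_sim     = 500
cigs_sim  = rpois(n_sim, lambda = 9)                        # simulated cigarettes per day
y_sim     = rbinom(n_sim, 1, plogis(-0.5 + 0.05 * cigs_sim)) # simulated binary response
packs_sim = cigs_sim / 20                                   # same information, in packs

fit_cigs  = glm(y_sim ~ cigs_sim,  family = binomial)
fit_packs = glm(y_sim ~ packs_sim, family = binomial)

# ratio of the two slope estimates and the two z values
coef_ratio = unname(coef(fit_packs)["packs_sim"] / coef(fit_cigs)["cigs_sim"])
z_cigs     = coef(summary(fit_cigs))["cigs_sim", "z value"]
z_packs    = coef(summary(fit_packs))["packs_sim", "z value"]
round(c(coef_ratio = coef_ratio, z_cigs = z_cigs, z_packs = z_packs), 3)
```

The slope estimate in packs is 20 times the slope estimate in cigarettes while the z value is unchanged, which is why a penalty on coefficient magnitude would treat the two versions of the same predictor differently.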

We also center and standardize the predictors in the test set.


In [None]:
# drop the constant null-model predictions: scaling a constant column yields NaN
test_set$prediction = NULL

stargazer(test_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Test Set", digits=2)

A = function(x) scale(x)
(variables   = ncol(test_set))
test_set[2:variables] = lapply(test_set[2:variables], A)

stargazer(test_set, 
          type = "text", 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          title="Test Set Scaled", digits=2)


Compute the predicted values of the response in the *test* set.

1. Compute the predicted probability for each observation in the test set


In [None]:
test_set$prediction = predict(logistic, test_set, type="response")




2. Convert the predicted probability to a predicted class using the probability threshold of 0.5


In [None]:
test_set$prediction = ifelse(test_set$prediction > 0.5, 1, 0)




Generate the confusion matrix for the test set and compute the accuracy, true positive rate (TPR), and false positive rate (FPR).


In [None]:
(CM(test_set$response, test_set$prediction))
(accuracy_logistic = accuracy(test_set$response, test_set$prediction))
(TPR_logistic = TPR(test_set$response, test_set$prediction))
(FPR_logistic = FPR(test_set$response, test_set$prediction))


### IV. Ridge Regression

Declare the matrix of predictors `x` and the response variable `y`, and define the list ("`grid`") of $\lambda$ (lambda) values for which the ridge regression model is fit.


In [None]:
x = model.matrix(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                            + BP.Medication + Hypertension + Diabetes 
                            + Heart.Rate + Glucose + observed.CHD, 
                 data = training_set)[,-1]
y = training_set$response
grid=10^seq(-0.8,-1.9,length=40)


Compute the value of $\lambda$, stored as `cv$lambda.min`, that minimizes the training error, defined as the cross-validated prediction error for the *training* set. 

**To fit a ridge regression model, we set `alpha` to 0.**


In [None]:
library(glmnet)
set.seed (1)
cv = cv.glmnet(x, y, alpha=0, family = "binomial", lambda = grid)
round(cv$lambda.min, 4)
plot(cv)


The horizontal axis in this graph is drawn on a logarithmic scale to show more detail. "Log" refers to the natural logarithm, also abbreviated as "ln".
The value of $\log(\lambda)$ marked by the left dotted line is


In [None]:
round(log(cv$lambda.min),2)



It is the natural logarithm of the value of $\lambda$ that minimizes the training error (`cv$lambda.min`). The red dots are the point estimates of the prediction error and the gray bars extend one standard error above and below the red dots. The right dotted line in the graph marks `cv$lambda.1se`, the largest value of $\lambda$ whose cross-validated prediction error ("Binomial Deviance") lies within one standard error of the minimum achieved at `cv$lambda.min`:


In [None]:
round(cv$lambda.1se,4)
round(log(cv$lambda.1se),2)


You can think of this "second-best" value of $\lambda$ as the largest value of $\lambda$ whose cross-validated prediction error is statistically indistinguishable from that of `cv$lambda.min`. If our goal is to shrink the coefficient estimates as much as possible, we could choose a value of $\lambda$ as high as `cv$lambda.1se`.

Compute the coefficient estimates for the ridge regression model that corresponds to `cv$lambda.min` and save the result as `ridge`.


In [None]:
ridge = glmnet(x, y, alpha=0, lambda=cv$lambda.min, family = "binomial")
round(coef(ridge),2)


The numbers of predictors included in the various ridge regression fits are shown above the top horizontal axis in the graph above. When we fit ridge regression models, the coefficient estimates are "shrunk", i.e. their absolute magnitude is reduced. For instance, the coefficient estimate for the predictor `Smoker` has shrunk from -0.43 in the logistic regression fit (i.e. $\lambda = 0$) to -0.37 in the ridge regression fit ($\lambda = 0.0153$).

If we wanted to shrink the coefficient estimates even further, we could use the "second-best" value of $\lambda$:


In [None]:
ridge.2 = glmnet(x, y, alpha=0, lambda=cv$lambda.1se, family = "binomial")
round(coef(ridge.2),2)


The larger value of $\lambda$ has shrunk the coefficient for `Smoker` even further to -0.20. 

Compute the predicted values of the response in the *test* set.

1. Compute the predicted probability for each observation in the test set.


In [None]:
x = model.matrix(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                            + BP.Medication + Hypertension + Diabetes 
                            + Heart.Rate + Glucose + observed.CHD, 
                 data = test_set)[,-1]

test_set$prediction = predict(ridge, s=cv$lambda.min, newx=x, type = "response")



2. Convert the predicted probability to a predicted class using the probability threshold of 0.5.


In [None]:
test_set$prediction = ifelse(test_set$prediction > 0.5, 1, 0)




Generate the confusion matrix for the test set and compute the accuracy, true positive rate (TPR), and false positive rate (FPR).


In [None]:
(CM(test_set$response, test_set$prediction))
(accuracy_ridge = accuracy(test_set$response, test_set$prediction))
(TPR_ridge = TPR(test_set$response, test_set$prediction))
(FPR_ridge = FPR(test_set$response, test_set$prediction))


### V. The Lasso

Compute the value of $\lambda$, stored as `cv$lambda.min`, that minimizes the training error, defined as the cross-validated prediction error for the *training* set.

**To fit a lasso model, we set `alpha` to 1.**


In [None]:
x = model.matrix(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                            + BP.Medication + Hypertension + Diabetes 
                            + Heart.Rate + Glucose + observed.CHD, 
                 data = training_set)[,-1]
y = training_set$response
grid=10^seq(-1.6,-2.2,length=40)

library(glmnet)
set.seed (1)
cv = cv.glmnet(x, y, alpha=1, family = "binomial", lambda = grid)
round(cv$lambda.min,4)
plot(cv)



In the graph, the value of $\lambda$ that minimizes the cross-validated training error, `cv$lambda.min`, is shown at


In [None]:
round(log(cv$lambda.min),2)



The top horizontal axis in the graph shows that the lasso model that minimizes the training error includes only 8 predictors.

To see which predictors the optimal lasso model has dropped, we compute the coefficient estimates for the lasso model that corresponds to `cv$lambda.min` and save the result as `lasso`.


In [None]:
lasso = glmnet(x, y, alpha=1, lambda=cv$lambda.min, family = "binomial")
round(coef(lasso),2)


The optimal lasso model no longer includes the predictors `Cigs.per.Day`, `BP.Medication`, and `Diabetes`.
Also, note that the absolute magnitudes of the coefficients of the other predictors have shrunk. For instance, the coefficient for `Smoker` has shrunk from -0.43 in the logistic regression model (i.e. $\lambda = 0$) to -0.34 in the lasso model ($\lambda = 0.0065$).

If we wanted to drop even more predictors (and shrink the remaining non-zero coefficient estimates even further), we could use the "second-best" value of $\lambda$, which is larger than `cv$lambda.min` and shown at the right dotted line in the graph:


In [None]:
round(cv$lambda.1se, 4)
round(log(cv$lambda.1se),2)


The top horizontal axis of the graph shows that this more restrictive model only includes 6 predictors. 
The coefficient estimates for this lasso model are:


In [None]:
lasso.2 = glmnet(x, y, alpha=1, lambda=cv$lambda.1se, family = "binomial")
round(coef(lasso.2),2)


In addition to the predictors `Cigs.per.Day`, `BP.Medication`, and `Diabetes`, the more restrictive lasso model drops (shrinks to zero) the coefficients of the predictors `Glucose` and `observed.CHD`. Also, the coefficient for `Smoker` has shrunk from -0.34 to -0.25.

Compute the predicted values of the response in the *test* set.

1. Compute the predicted probability for each observation in the test set.


In [None]:
x = model.matrix(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                            + BP.Medication + Hypertension + Diabetes 
                            + Heart.Rate + Glucose + observed.CHD, 
                 data = test_set)[,-1]

test_set$prediction = predict(lasso, s=cv$lambda.min, newx=x, type = "response")



2. Convert the predicted probability to a predicted class using the probability threshold of 0.5.


In [None]:
test_set$prediction = ifelse(test_set$prediction > 0.5, 1, 0)




Generate the confusion matrix for the test set and compute the accuracy, true positive rate (TPR), and false positive rate (FPR).


In [None]:
(CM(test_set$response, test_set$prediction))
(accuracy_lasso = accuracy(test_set$response, test_set$prediction))
(TPR_lasso = TPR(test_set$response, test_set$prediction))
(FPR_lasso = FPR(test_set$response, test_set$prediction))



Note that we could have obtained the logistic regression model by estimating a ridge (`alpha` = 0) or lasso (`alpha` = 1) regression model and setting `lambda` = 0.


In [None]:
library(glmnet)
x = model.matrix(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                            + BP.Medication + Hypertension + Diabetes 
                            + Heart.Rate + Glucose + observed.CHD, 
                 data = training_set)[,-1]
y = training_set$response
logistic.ridge = glmnet(x, y, alpha=0, lambda=0, family = "binomial")
logistic.lasso = glmnet(x, y, alpha=1, lambda=0, family = "binomial")
round(coef(summary(logistic)),2)
round(coef(logistic.ridge),2)
round(coef(logistic.lasso),2)


### VI. Single Pruned Tree

To grow a classification tree using the `tree` command, we must convert the response variable to a "factor" variable with two levels, "Not Overweight" and "Overweight". (If we don't convert the response variable to a factor, the `tree` command will grow a regression tree instead.)


In [None]:
training_set$response = factor(training_set$response, 
                               levels = c(0,1), 
                               labels = c("Not Overweight", "Overweight"))
table(training_set$response)


In [None]:
names(training_set)




Using the training set, grow the unpruned ("fully grown") tree and save the result as `tree`.


In [None]:
library(tree)
tree = tree(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                        + BP.Medication + Hypertension + Diabetes 
                        + Heart.Rate + Glucose + observed.CHD,
            data = training_set)
summary(tree)



Generate the confusion matrix for the *training* set.


In [None]:
training_set$prediction = predict(tree, newdata=training_set, type = "class")
CM(training_set$response, training_set$prediction)


Note that the number of misclassified patients is 666 and matches the number reported by the `summary(tree)` command above. 149 patients were false negatives: the predicted class was "Not Overweight" when their actual class was "Overweight". 517 patients were false positives: the predicted class was "Overweight" when their actual class was "Not Overweight".

Plot the unpruned tree.


In [None]:
plot(tree)
text(tree, pretty = 0)
title(main = "Unpruned Classification Tree \n")


Note that the number of terminal nodes is 4, matching the number reported by the `summary(tree)` command above. Three terminal nodes predict that the respondent's BMI is greater than 25 ("Overweight").

Compute the 20-fold cross-validated (CV) prediction error for subtrees of various sizes. The cross-validated prediction error is the average number of misclassified patients in the test set defined by each of the 20 cross-validation folds. The tree sizes are measured by the number of terminal nodes.


In [None]:
set.seed(6295)
cv = cv.tree(tree, FUN=prune.misclass, K=20)



Plot the cross-validated prediction error as a function of the tree size.


In [None]:
plot(cv$dev ~ cv$size, type='b', col="lightseagreen", lwd=2,
     xlab = "Subtree Size (Terminal Nodes)", ylab = "Cross-Validated Prediction Error")



Save the size of the subtree that minimizes the cross-validated prediction error.


In [None]:
(arg_min_cv = cv$size[which.min(cv$dev)])




Prune the original tree to the size that minimizes the CV error and save the result as `pruned_tree`.


In [None]:
pruned_tree = prune.misclass(tree, best = arg_min_cv)
summary(pruned_tree)


The number of misclassified *training* observations is 675.

Plot the pruned tree.


In [None]:
plot(pruned_tree)
text(pruned_tree, pretty=0)
title(main = "Pruned Classification Tree \n")


To obtain the pruned tree in this example, we've eliminated the split on "Age".

Compute the predicted values of the response in the *test* set.


In [None]:
test_set$prediction = predict(pruned_tree, newdata=test_set, type = "class")




Compute measures of predictive performance (after converting the predictions back to numeric format so that we can apply the functions defined at the end of section I.).


In [None]:
test_set$prediction = as.numeric(test_set$prediction)-1
(CM(test_set$response, test_set$prediction))
(accuracy_tree = accuracy(test_set$response, test_set$prediction))
(TPR_tree = TPR(test_set$response, test_set$prediction))
(FPR_tree = FPR(test_set$response, test_set$prediction))


The number of misclassified patients when we apply the pruned classification tree to the *test* set is 272+392 = 664, slightly fewer than the misclassified *training* observations. This is not typical but, as shown here, does happen every now and then.

### VII. Bootstrap Aggregation (Bagging)

Using the training set, grow one unpruned tree for each of 500 bootstrap samples and save the result as `bag`. 500 bootstrap samples is the default for the `randomForest` command; this value can be changed by specifying the `ntree` option (the default corresponds to `ntree=500`). The option `mtry` specifies the number of predictors that are considered at each split. Bootstrap aggregation always considers all predictors at each split, so we set `mtry` to the number of predictors:


In [None]:
(n_predictors = ncol(framingham)-1)



In [None]:
library(randomForest)
set.seed(1)
(bag = randomForest(response ~ Male + Age + Education + Smoker 
                              + Cigs.per.Day + BP.Medication 
                              + Hypertension + Diabetes + Heart.Rate 
                              + Glucose + observed.CHD,
                   data = training_set, 
                   mtry = n_predictors, 
                   importance = TRUE))



Compute the predicted values of the response in the *test* set.


In [None]:
test_set$prediction = predict(bag, test_set, type = "class")




Compute measures of predictive performance.


In [None]:
test_set$prediction = as.numeric(test_set$prediction)-1
(CM(test_set$response, test_set$prediction))
(accuracy_bag = accuracy(test_set$response, test_set$prediction))
(TPR_bag = TPR(test_set$response, test_set$prediction))
(FPR_bag = FPR(test_set$response, test_set$prediction))


### VIII. Random Forest

Random forests are grown just like the bootstrap-aggregated forests. The difference is that, at each split of each tree in a random forest, only a random subset of all available predictors is considered. At each split of each tree in a bootstrap-aggregated forest, *all* available predictors are considered.
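
The two sampling steps involved can be sketched in a few lines of base R. This is an illustration only, under stated assumptions: `n_toy`, `boot_rows`, `candidates_bag`, and `candidates_rf` are hypothetical names, and the code shows the idea of the sampling rather than how the `randomForest` package implements it internally.

```r
predictors = c("Male", "Age", "Education", "Smoker", "Cigs.per.Day",
               "BP.Medication", "Hypertension", "Diabetes",
               "Heart.Rate", "Glucose", "observed.CHD")
p_toy = length(predictors)       # number of available predictors
m_toy = round(sqrt(p_toy))       # number of candidates per split in a random forest

set.seed(1)
n_toy     = 10                                        # hypothetical number of training rows
boot_rows = sample(1:n_toy, n_toy, replace = TRUE)    # bootstrap sample: rows drawn with replacement

candidates_bag = predictors                 # bagging: all predictors compete at a split
candidates_rf  = sample(predictors, m_toy)  # random forest: a fresh random subset per split
candidates_rf
```

Both methods grow each tree on a bootstrap sample such as `boot_rows`; they differ only in which predictors are allowed to compete at a given split.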

Define $m = \sqrt{p}$, the number of predictors considered at each split.


In [None]:
(m = round(sqrt(n_predictors)))




Grow a random forest of 500 trees and save the result as `rf`.


In [None]:
library(randomForest)
set.seed(1)
(rf = randomForest(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                              + BP.Medication + Hypertension + Diabetes 
                              + Heart.Rate + Glucose + observed.CHD, 
                  data = training_set, 
                  mtry = m, 
                  importance = TRUE))



Compute the predicted values of the response in the *test* set


In [None]:
test_set$prediction = predict(rf, newdata = test_set, type = "class")




Compute measures of predictive performance.


In [None]:
test_set$prediction = as.numeric(test_set$prediction)-1
(CM(test_set$response, test_set$prediction))
(accuracy_rf = accuracy(test_set$response, test_set$prediction))
(TPR_rf = TPR(test_set$response, test_set$prediction))
(FPR_rf = FPR(test_set$response, test_set$prediction))



Plot the importance of each predictor


In [None]:
varImpPlot(rf)



### IX. Boosting

Convert the response variable in the training set back to numeric (0/1) format, as the `gbm` command requires a numeric 0/1 response to grow classification trees when the option `distribution = "bernoulli"` is specified.


In [None]:
training_set$response = as.numeric(training_set$response)-1




Grow a sequence of 5,000 trees using the *training* set and save the result as `boost`.


In [None]:
library(gbm)
set.seed(1)
boost = gbm(response ~ Male + Age + Education + Smoker + Cigs.per.Day 
                        + BP.Medication + Hypertension + Diabetes 
                        + Heart.Rate + Glucose + observed.CHD, 
            data = training_set, 
            distribution = "bernoulli",
            n.trees=5000, 
            interaction.depth=1)


Compute the predicted values of the response in the *test* set.

1. Compute the predicted probability for each observation in the test set.


In [None]:
test_set$prediction = predict(boost, newdata=test_set, n.trees=5000, type = "response")




2. Convert the predicted probability into a predicted class using the probability threshold of 0.5


In [None]:
test_set$prediction = ifelse(test_set$prediction > 0.5, 1, 0)




Compute measures of predictive performance.


In [None]:
(CM(test_set$response, test_set$prediction))
(accuracy_boost = accuracy(test_set$response, test_set$prediction))
(TPR_boost = TPR(test_set$response, test_set$prediction))
(FPR_boost = FPR(test_set$response, test_set$prediction))


### X. Model Comparison

The following table shows the values of three performance statistics for the eight different prediction models that were fitted to the training set and evaluated on the test set. The eight models are listed in ascending order of the false positive rate. Note that the three performance statistics reported in the table were computed for a probability threshold of 0.5. A different probability threshold might yield a different ranking.


In [None]:
FPR      = c(FPR_lasso, FPR_ridge, FPR_logistic, FPR_rf, 
             FPR_boost, FPR_bag, FPR_tree, FPR_null)
TPR      = c(TPR_lasso, TPR_ridge, TPR_logistic, TPR_rf, 
             TPR_boost, TPR_bag, TPR_tree, TPR_null)
accuracy = c(accuracy_lasso, accuracy_ridge, accuracy_logistic, accuracy_rf, 
             accuracy_boost, accuracy_bag, accuracy_tree, accuracy_null)

# Matrix of predictive performance statistics
results = t(rbind(FPR, TPR, accuracy))
rownames(results) = c("Lasso","Ridge Regression","Logistic Regression",
                      "Random Forest","Boosting","Bagging",
                      "Single Pruned Tree","Null Model")
results


The table shows that, with the exception of the logistic regression, boosting, and bagging models, a higher false positive rate (FPR) goes hand in hand with a higher true positive rate (TPR). The intuition is the same as in the construction of ROC curves for a given class of predictive model when we vary the probability threshold: the more readily a model predicts a positive response, the more readily that model will return both true *and* false positives.

We can show the performance of the eight prediction models in the space spanned by the false positive rate on the horizontal axis and the true positive rate on the vertical axis.


In [None]:
FPR = c(0, FPR_lasso, FPR_ridge, FPR_logistic, FPR_rf, 
        FPR_boost, FPR_bag, FPR_tree, FPR_null)
FPR_frontier = c(0, FPR_lasso, FPR_ridge, FPR_rf, FPR_tree, FPR_null)
TPR = c(0, TPR_lasso, TPR_ridge, TPR_logistic, TPR_rf, 
        TPR_boost, TPR_bag, TPR_tree, TPR_null)
TPR_frontier = c(0, TPR_lasso, TPR_ridge, TPR_rf, TPR_tree, TPR_null)
accuracy = c(0, accuracy_lasso, accuracy_ridge, accuracy_logistic, accuracy_rf, 
             accuracy_boost, accuracy_bag, accuracy_tree, accuracy_null)
roc = data.frame(FPR, TPR, accuracy)
attr(roc, "row.names") = c("", "", "", "", "", "", "", "", "null")
par(pty = "s")
plot(FPR, TPR, xlim=c(-10,110), ylim = c(-10,110), asp=1,
     xlab = "False Positive Rate (FPR)", ylab = "True Positive Rate (TPR)")
with(roc, 
     text(TPR ~ FPR, labels = row.names(roc), pos = 1, col='dodgerblue', cex = 0.8))
lines(FPR_frontier, TPR_frontier, type='b', lwd=2, col='dodgerblue')
lines(c(0,100), c(0,100), type='l', lwd=2, col='red', lty=3)
title(main = "Predictive Performance of All 8 Models")


The dotted red line shows the ROC curve of the "random guess" model, which randomly assigns a proportion $p$ of the test observations to the positive class and the remainder $1-p$ to the negative class, so that its TPR and FPR both equal $p$ on average. The `null` model is a special case of the "random guess" model that sets $p=1$.
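
A small simulation can make the "random guess" line concrete. The sketch below rests on assumptions that are not part of the analysis above: the responses are simulated (`actual_sim`, with a hypothetical 55% share of positives), and each observation is assigned to the positive class independently with probability $p$ rather than in an exact proportion.

```r
set.seed(1)
n_sim      = 10000
actual_sim = rbinom(n_sim, 1, 0.55)     # simulated responses (hypothetical share of positives)

# FPR and TPR (in percent) of a classifier that predicts 1 with probability p
random_guess = function(p) {
  guess = rbinom(n_sim, 1, p)
  c(FPR = 100 * mean(guess[actual_sim == 0]),
    TPR = 100 * mean(guess[actual_sim == 1]))
}

guess_30  = random_guess(0.3)   # lands close to the point (30, 30)
guess_100 = random_guess(1)     # always predicts 1: the point (100, 100)
round(rbind(guess_30, guess_100), 1)
```

Because the guess ignores the predictors, it flags actual positives and actual negatives at the same rate, which places every such classifier on or near the 45-degree line.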

There is a cluster of six models whose false positive rate is between 35% and 45%, as shown in the figure below:


In [None]:
roc$label_color = "dimgray"
roc$label_color[roc$accuracy == accuracy_lasso] = "dodgerblue"
roc$label_color[roc$accuracy == accuracy_ridge] = "dodgerblue"
roc$label_color[roc$accuracy == accuracy_rf]    = "dodgerblue"
attr(roc, "row.names") = c("","lasso","ridge","logistic","random forest","boost","bag","tree","null")
plot(FPR, TPR, xlim=c(36,45), ylim = c(65.5,72.3), 
     xlab = "False Positive Rate (FPR)", ylab = "True Positive Rate (TPR)")
with(roc, text(TPR ~ FPR, labels = row.names(roc), pos = 4, col=roc$label_color, cex = 0.8))
lines(FPR_frontier, TPR_frontier, type='b', lwd=2, col='dodgerblue')
title(main = "The logistic, boost, and bag models are strongly dominated.")


The `lasso` and `ridge` regression models and the `random forest` model lie on the *frontier* (shown in blue): For each of these models there is no other model to their northwest, i.e. with a higher TPR *and* a lower FPR. 

The other three models -- `logistic`, `boost`, and `bag` -- are *strongly dominated*: For each of these models, there is at least one other model to their northwest, i.e. at least one other model offers a lower FPR *and* a higher TPR. For instance, the `ridge` regression model's FPR (37.39) is lower (to the "west") and its TPR (68.01) is higher ("north") than the `logistic` regression model's FPR (37.52) and TPR (67.81).

The `ridge` regression model itself is *weakly dominated*: Although there is no single model to its northwest, a segment (solid orange line) of the straight (dotted orange) line connecting the `lasso` and `random forest` models lies to its northwest. This segment represents all the *convex combinations* of the `lasso` and `random forest` models that achieve a higher TPR and lower FPR than the `ridge` model. For instance, we could predict the overweight status of three out of every four (75%) individuals using the `lasso` model and predict the overweight status of the remaining fourth (25%) using the `random forest` model. This "mixture" of the `lasso` and `random forest` models would achieve an FPR of 0.75\*36.04 + 0.25\*40.71 = 37.21 (less than the 37.39 achieved by the `ridge` model) and a TPR of 0.75\*67.72 + 0.25\*72.15 = 68.83 (more than the 68.01 achieved by the `ridge` model). 
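
The arithmetic of this mixture can be checked directly. The sketch below hard-codes the FPR and TPR values quoted in the text (in percent) rather than reading them from the fitted models; the names `pt_lasso`, `pt_rf`, `pt_ridge`, and `mixture` are introduced here for illustration only.

```r
# (FPR, TPR) in percent, as quoted in the text
pt_lasso = c(FPR = 36.04, TPR = 67.72)
pt_rf    = c(FPR = 40.71, TPR = 72.15)
pt_ridge = c(FPR = 37.39, TPR = 68.01)

w = 0.75                                   # share of individuals scored by the lasso
mixture = w * pt_lasso + (1 - w) * pt_rf   # convex combination of the two models
round(mixture, 2)

# TRUE if the mixture lies to the northwest of the ridge model
dominates_ridge = unname(mixture["FPR"] < pt_ridge["FPR"] &
                         mixture["TPR"] > pt_ridge["TPR"])
dominates_ridge
```

Other values of `w` trace out the rest of the line between the two models; only those on the segment northwest of `ridge` dominate it.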


In [None]:
# labels for the six points on the frontier (the first point is the origin)
frontier_labels = c("","lasso","ridge","random forest","tree","null")

slope_lasso_rf = (TPR_rf - TPR_lasso)/(FPR_rf - FPR_lasso)
FPR_start = FPR_lasso + (TPR_ridge - TPR_lasso)/slope_lasso_rf
TPR_end = TPR_rf - slope_lasso_rf*(FPR_rf - FPR_ridge)

plot(FPR_frontier, TPR_frontier, xlim=c(36,41.5), ylim = c(67.5,72.3), 
     xlab = "False Positive Rate (FPR)", ylab = "True Positive Rate (TPR)")
text(FPR_frontier, TPR_frontier, labels = frontier_labels, pos = 4, col = "dodgerblue", cex = 0.8)
lines(FPR_frontier, TPR_frontier, type='b', lwd=2, col='dodgerblue')
lines(c(FPR_lasso, FPR_rf), c(TPR_lasso, TPR_rf), 
      type='b', lwd=2, col='orangered', lty=3)
lines(c(FPR_start, FPR_ridge), c(TPR_ridge, TPR_end), 
      type='b', lwd=2, col='orangered')
title(main = "The ridge regression model is weakly dominated.")



In the graph, a strongly dominated model is easy to identify from the slope of the line that connects it to the model immediately to its left, i.e., the model that achieves the next lower FPR.


In [None]:
FPR      = c(0, FPR_lasso, FPR_ridge, FPR_logistic, FPR_rf, 
             FPR_boost, FPR_bag, FPR_tree, FPR_null)
TPR      = c(0, TPR_lasso, TPR_ridge, TPR_logistic, TPR_rf, 
             TPR_boost, TPR_bag, TPR_tree, TPR_null)
accuracy = c(0, accuracy_lasso, accuracy_ridge, accuracy_logistic, accuracy_rf, 
             accuracy_boost, accuracy_bag, accuracy_tree, accuracy_null)
roc = data.frame(FPR, TPR, accuracy)
attr(roc, "row.names") = c("","lasso","ridge","logistic","random forest","boost","bag","tree","null")

roc$label_color = "dimgray"
# Color by model name rather than by matching accuracies, which would also
# recolor any other model that happens to share the same accuracy
roc$label_color[row.names(roc) %in% c("lasso","ridge","random forest")] = "dodgerblue"
plot(FPR, TPR, xlim=c(36,45), ylim = c(65.5,72.3), 
     xlab = "False Positive Rate (FPR)", ylab = "True Positive Rate (TPR)")
with(roc, 
     text(TPR ~ FPR, labels = row.names(roc), pos = 4, col=roc$label_color, cex = 0.8))
lines(c(FPR_rf, FPR_boost, FPR_bag), c(TPR_rf, TPR_boost, TPR_bag), 
      type='b', lwd=2, col='dimgray')
title(main = "Identifying strongly dominated models")


Whenever the slope of the line connecting a model to the model immediately to its left is negative, that model is strongly dominated. In the graph, the slope of the line connecting the `random forest` model to the `boost` model is negative, and thus the `boost` model is strongly dominated. Ditto for the `bag` and `logistic` models.
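
As a rough sketch, the left-neighbor rule can be applied to the four models whose rates are quoted in the text (`lasso`, `ridge`, `logistic`, `random forest`). The `boost`, `bag`, and `tree` models are left out because their rates are not quoted here, and `roc_txt` is a hypothetical name chosen to leave the notebook's `roc` object untouched. The sketch assumes that no two models share the same FPR.

```r
# Rates in percent, copied from the text (rounded to two decimals)
roc_txt = data.frame(
  model = c("lasso", "ridge", "logistic", "random forest"),
  FPR   = c(36.04, 37.39, 37.52, 40.71),
  TPR   = c(67.72, 68.01, 67.81, 72.15),
  stringsAsFactors = FALSE
)

# Sort from west to east, then compute the slope of the line that connects
# each model to its neighbor on the left (NA for the leftmost model)
roc_txt = roc_txt[order(roc_txt$FPR), ]
roc_txt$slope_left = c(NA, diff(roc_txt$TPR)/diff(roc_txt$FPR))

# A negative slope flags a strongly dominated model
is_dominated = !is.na(roc_txt$slope_left) & roc_txt$slope_left < 0
strongly_dominated = roc_txt$model[is_dominated]
strongly_dominated
```

With these four models only `logistic` is flagged. When several models are flagged, it may be worth dropping them and repeating the check, since a model whose left neighbor was itself dominated can slip through a single pass.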

We can use the graph to identify weakly dominated models as well. Whenever the slope of the line connecting a model to its neighbor to the right (next larger FPR) is larger than the slope of the line connecting that model to its neighbor to the left (next smaller FPR), that model is weakly dominated.


In [None]:
# Labels for the frontier points, kept in their own vector: `roc` now holds all
# nine models, so its row names no longer line up with the six frontier points
frontier_labels = c("","lasso","ridge","random forest","tree","null")
plot(FPR_frontier, TPR_frontier, xlim=c(36,50), ylim = c(65.5,74), 
     xlab = "False Positive Rate (FPR)", ylab = "True Positive Rate (TPR)")
text(FPR_frontier, TPR_frontier, labels = frontier_labels, 
     pos = 4, col="dodgerblue", cex = 0.8)
lines(FPR_frontier, TPR_frontier, 
      type='b', lwd=2, col='dodgerblue', lty=3) # lty=3 draws a dotted line
lines(c(FPR_ridge, FPR_rf), c(TPR_ridge, TPR_rf), 
      type='b', lwd=2, col='dodgerblue')
lines(c(FPR_tree, FPR_null), c(TPR_tree, TPR_null), 
      type='b', lwd=2, col='dodgerblue')
title(main = "Identifying weakly dominated models")


In the graph, the slope of the solid line connecting the `ridge` model to the `random forest` model is larger than the slope of the line connecting the `ridge` model to the `lasso` model, and thus the `ridge` model is weakly dominated. The same reasoning shows that the single pruned `tree` is weakly dominated by a convex combination of the `random forest` (neighbor to its left) and `null` (neighbor to its right) models.
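
The slope comparison can be sketched in a few lines of R. The rates for `lasso`, `ridge`, and `random forest` are the rounded values quoted in the text; the `null` model is entered as FPR = TPR = 100 on the assumption that it labels every respondent as overweight; the `tree` model is omitted because its rates are not quoted here. `roc_front` is a hypothetical name for this illustration.

```r
# Models that are not strongly dominated, sorted by FPR (rates in percent)
roc_front = data.frame(
  model = c("lasso", "ridge", "random forest", "null"),
  FPR   = c(36.04, 37.39, 40.71, 100),
  TPR   = c(67.72, 68.01, 72.15, 100),   # null: assumed to predict "overweight" for all
  stringsAsFactors = FALSE
)

# slope[i] is the slope of the line connecting model i to model i + 1
slope = diff(roc_front$TPR)/diff(roc_front$FPR)
roc_front$slope_left  = c(NA, slope)   # line to the left neighbor
roc_front$slope_right = c(slope, NA)   # line to the right neighbor

# A model is weakly dominated if the line to its right neighbor is steeper
# than the line to its left neighbor; the two end points cannot be checked
weakly_dominated = with(roc_front, 
  model[!is.na(slope_left) & !is.na(slope_right) & slope_right > slope_left])
weakly_dominated
```

Only `ridge` is flagged in this reduced set. After a flagged model is dropped, the slopes of its former neighbors change, so the comparison should be repeated until no model is flagged.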

Once you've dropped the three strongly dominated models (`logistic`, `boost`, and `bag`) and the two weakly dominated models (`ridge` and `tree`), only the `lasso`, `random forest`, and `null` models remain. Which model should you use? 

One criterion is the relative weight you assign to false positives and false negatives. The `null` model offers a false negative rate (1 - sensitivity) of 0% at the cost of a false positive rate of 100%. The `lasso` model produces far fewer false positives but more false negatives. 

The slope of the line connecting two neighboring models on the frontier shows how many true positives you gain, on average, for each additional false positive you accept when you move to the model with the next larger FPR.
Suppose you are classifying 20,000 patients, and you know that 10,000 of these are overweight and 10,000 are not. You just don't know which patients are overweight. If you use the `lasso` model, you will identify 6,772 of the 10,000 overweight patients. But you'll also falsely label 3,604 of the 10,000 normal-weight patients as overweight. Switching to the `random forest` model will enable you to raise the number of true positives by 


In [None]:
100*round((TPR_rf - TPR_lasso),2)



to 7,215 patients. But this switch comes at the cost of raising the number of false positives by 


In [None]:
100*round((FPR_rf - FPR_lasso),2)



to 4,071. Thus, when you switch from the `lasso` model to the `random forest` model, accepting one additional false positive lets you identify 


In [None]:
round((TPR_rf - TPR_lasso)/(FPR_rf - FPR_lasso),2)



additional true positives on average. 
You can raise the number of true positives even further by switching to the `null` model. But this switch yields only 


In [None]:
round((TPR_null - TPR_rf)/(FPR_null - FPR_rf),2)



additional true positives for every additional false positive: The slope of the line connecting the `random forest` and `null` models is smaller than the slope of the line connecting the `lasso` and `random forest` models.
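
One way to turn this trade-off into a choice is to attach a relative cost to each kind of error. The sketch below is an illustration under stated assumptions, not a recommended cost model: it uses the rounded rates from the text, the 10,000 overweight and 10,000 normal-weight patients from the example above, and a hypothetical helper `best_model()` whose argument `fn_cost` is the cost of one false negative measured in false positives.

```r
# Frontier models with rates in percent (rounded values from the text;
# the null model has an FPR of 100% and a false negative rate of 0%)
front = data.frame(
  model = c("lasso", "random forest", "null"),
  FPR   = c(36.04, 40.71, 100),
  TPR   = c(67.72, 72.15, 100),
  stringsAsFactors = FALSE
)

# Error counts among 10,000 normal-weight and 10,000 overweight patients
front$false_pos = 100*front$FPR
front$false_neg = 100*(100 - front$TPR)

# Hypothetical helper: the model with the lowest total cost when one false
# negative costs fn_cost times as much as one false positive
best_model = function(fn_cost) {
  total_cost = front$false_pos + fn_cost*front$false_neg
  front$model[which.min(total_cost)]
}

sapply(c(1, 2, 3), best_model)   # "lasso" "random forest" "null"
```

With equally sized groups, moving to the next model along the frontier pays off once `fn_cost` exceeds the reciprocal of the slope connecting the two models, which is why a flatter segment requires a higher cost of false negatives to justify the switch.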

A second criterion is interpretability. The `lasso` model is substantially easier to understand and apply in practice than the `random forest` model.
