# _Ensemble Learning with Super Learner and H2O in R_
## Nima Hejazi and Evan Muzzall
## [The Hacker Within](http://www.thehackerwithin.org/berkeley/), 6 December 2016

In [1]:
# preliminaries
library(mlbench)
data(BreastCancer)
head(BreastCancer)

Id,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
1000025,5,1,1,1,2,1,3,1,1,benign
1002945,5,4,4,5,7,10,3,2,1,benign
1015425,3,1,1,1,2,2,3,1,1,benign
1016277,6,8,8,1,3,4,3,7,1,benign
1017023,4,1,1,3,2,1,3,1,1,benign
1017122,8,10,10,8,7,10,9,7,1,malignant


## The Super Learner algorithm

* __R package:__ `SuperLearner`
* Main functions: `SuperLearner`, `CV.SuperLearner`
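
Before turning to the package, it may help to see the stacking idea in miniature. The following is a minimal sketch on simulated data, using base R only: two candidate learners (an intercept-only mean and a main-terms logistic regression) are each evaluated by V-fold cross-validation, and a convex weight is then chosen to minimize the cross-validated squared error. The data, the variable names, and the one-dimensional `optimize` search are all assumptions made for illustration; the package's default meta-learner is non-negative least squares (`method.NNLS`), for which this is only a simplified stand-in.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

# Assign each observation to one of V validation folds
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = n))

# Z holds the cross-validated predictions of each candidate learner
Z <- matrix(NA_real_, nrow = n, ncol = 2,
            dimnames = list(NULL, c("mean", "glm")))
for (v in seq_len(V)) {
  train <- dat[fold_id != v, ]
  valid <- dat[fold_id == v, ]
  Z[fold_id == v, "mean"] <- mean(train$y)
  fit <- glm(y ~ x, data = train, family = binomial())
  Z[fold_id == v, "glm"] <- predict(fit, newdata = valid, type = "response")
}

# Cross-validated squared-error risk of a convex combination of the two learners
cv_risk <- function(alpha) {
  mean((y - (alpha * Z[, "mean"] + (1 - alpha) * Z[, "glm"]))^2)
}

# Choose the weight on the mean learner; the remainder goes to the GLM
alpha_hat <- optimize(cv_risk, interval = c(0, 1))$minimum
weights <- c(mean = alpha_hat, glm = 1 - alpha_hat)
print(round(weights, 3))
```

The weights play the same role as the coefficients that `SuperLearner` reports for each algorithm in its library.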

In [2]:
set.seed(654123)
library(SuperLearner)
library(cvAUC)

Loading required package: nnls
Super Learner
Version: 2.0-20
Package created on 2016-04-06

Loading required package: ROCR
Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

 
cvAUC version: 1.1.0
Notice to cvAUC users: Major speed improvements in version 1.1.0
 


First, we need to reshape the data into the form `SuperLearner` expects: a data frame of covariates `X` and a numeric outcome vector `Y`. We'll do this with the `dplyr` package.

In [4]:
"%ni%" = Negate("%in%")
library(dplyr)

dim(BreastCancer)

# examine whether there are NAs in the data
colSums(is.na(BreastCancer))

In [5]:
# remove the NAs before proceeding with Super Learner
bc <- BreastCancer %>%
 dplyr::filter(complete.cases(.))
dim(bc)
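
To see what the NA check and the `complete.cases` filter do, here is a small sketch on a made-up data frame (`toy` and its columns are hypothetical, not part of `BreastCancer`). It uses base R subsetting, which should behave the same as the `dplyr::filter` call above for this purpose.

```r
# A made-up data frame with one missing value in each column
toy <- data.frame(a = c(1, NA, 3),
                  b = factor(c("x", "y", NA)))

# Count missing values per column
colSums(is.na(toy))

# Keep only the rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), , drop = FALSE]
dim(toy_complete)
```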

In [6]:
# create a data.frame of covariates to be used in prediction
X <- bc %>%
 dplyr::select(which(colnames(.) %ni% c("Id", "Class")))
head(X)

Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses
5,1,1,1,2,1,3,1,1
5,4,4,5,7,10,3,2,1
3,1,1,1,2,2,3,1,1
6,8,8,1,3,4,3,7,1
4,1,1,3,2,1,3,1,1
8,10,10,8,7,10,9,7,1
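
Although the covariates print as numbers, `mlbench` stores them in `BreastCancer` as factors (some of them ordered). Whether to keep them as factors or treat them as numeric scores is a modeling choice that this notebook does not make explicitly. If numeric covariates are wanted, one possible conversion is sketched below on a made-up data frame (`toy_X` is hypothetical). Going through `as.character` matters: calling `as.numeric` directly on a factor returns the internal level codes, not the labels.

```r
# Made-up covariates stored as factors, mimicking two BreastCancer columns
toy_X <- data.frame(
  Cl.thickness = factor(c(5, 3, 10), levels = 1:10, ordered = TRUE),
  Bare.nuclei  = factor(c(1, 10, 2), levels = 1:10)
)

# Convert every factor column to the numeric value of its label
toy_X_num <- as.data.frame(
  lapply(toy_X, function(col) as.numeric(as.character(col)))
)
str(toy_X_num)
```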


In [7]:
# create a numeric vector of the binary outcomes...
Y <- bc %>%
 dplyr::select(which(colnames(.) %in% c("Class")))
Y <- as.vector(ifelse(Y == "benign", 0, 1))
head(Y)
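
The same 0/1 encoding can be written more directly on a factor. The sketch below uses a made-up factor (`toy_class` is hypothetical) and codes "malignant" as 1, matching the convention above.

```r
# A made-up outcome factor with the same two levels as BreastCancer$Class
toy_class <- factor(c("benign", "malignant", "benign"),
                    levels = c("benign", "malignant"))

# Logical comparison, then coercion to numeric: malignant -> 1, benign -> 0
toy_Y <- as.numeric(toy_class == "malignant")

# Cross-tabulate to confirm the encoding
table(toy_class, toy_Y)
```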

Next, we set up a `SuperLearner` library, i.e., the set of candidate learners to be combined, and fit the ensemble.

In [8]:
SL.lib <- c("SL.mean", "SL.glm", "SL.bayesglm", "SL.stepAIC", "SL.randomForest", "SL.gam")

SL.fit <- SuperLearner(X = X,
                       Y = Y,
                       family = binomial(),
                       SL.library = SL.lib,
                       verbose = FALSE
                      )

Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin

“glm.fit: fitted probabilities numerically 0 or 1 occurred”

In [12]:
SL.fit$coef
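
The coefficients are the weights given to each algorithm in the ensemble. As a rough illustration of how they are used, the sketch below forms an ensemble prediction as a weighted sum of per-algorithm predictions, which is our understanding of what the default NNLS method does. All numbers and names here (`lib_pred`, `coef_toy`) are made up for illustration and are not output from the fit above.

```r
# Made-up predicted probabilities from three learners for three observations
lib_pred <- cbind(SL.mean_All         = c(0.35, 0.35, 0.35),
                  SL.glm_All          = c(0.10, 0.80, 0.55),
                  SL.randomForest_All = c(0.05, 0.95, 0.40))

# Made-up non-negative weights that sum to one
coef_toy <- c(SL.mean_All = 0, SL.glm_All = 0.25, SL.randomForest_All = 0.75)

# Ensemble prediction: weighted sum of the library predictions
ensemble <- drop(lib_pred %*% coef_toy)
print(ensemble)
```

A learner with weight 0 (here, the mean) contributes nothing to the final prediction.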

### Cross-validated Super Learner

In [None]:
V = 5
cv.SL.fit <- CV.SuperLearner(X = X,
                             Y = Y,
                             V = V,
                             family = binomial(),
                             SL.library = SL.lib,
                             verbose = FALSE
                            )

“Error in algorithm SL.glm 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“Error in algorithm SL.step 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“Error in algorithm SL.glm 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“Error in algorithm SL.step 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“Error in algorithm SL.glm 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“Error in algorithm SL.step 
  The Algorithm will be removed from the Super Learner (i.e. given weight 0) 
“glm.fit: algorithm did not converge”

In [None]:
cv.SL.fit$coef

In [None]:
library(ggplot2)
plot(cv.SL.fit)

In [None]:
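# validation-fold row indices (a list of length V) and cross-validated predictions
fld = cv.SL.fit$folds
predsY = cv.SL.fit$SL.predict
n = length(predsY)
# fold[i] will hold the validation fold that observation i belongs to
fold = rep(NA, n)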

for(k in 1:V) {
    ii = unlist(fld[k])
    fold[ii] = k
}

ciout = ci.cvAUC(predsY, Y, folds = fold)
txt = paste("AUC = ", round(ciout$cvAUC, 2),", 95% CI = ", round(ciout$ci[1], 2), "-", round(ciout$ci[2], 2))
print(txt)
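
For intuition about the quantity being reported, the AUC can be computed by hand from ranks (the Mann-Whitney statistic), and a cross-validated AUC can be formed by averaging the fold-specific AUCs. The sketch below does this in base R on simulated predictions; `auc_rank` is a hypothetical helper, not a `cvAUC` function, and the sketch does not reproduce the confidence interval that `ci.cvAUC` computes.

```r
# AUC via the rank-sum (Mann-Whitney) formula; midranks handle ties
auc_rank <- function(pred, label) {
  r  <- rank(pred)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated labels and predictions (hypothetical, for illustration only)
set.seed(2)
label <- rep(c(0, 1), each = 50)
pred  <- plogis(rnorm(100, mean = 2 * label - 1))

# Fold membership: each fold contains 10 observations of each class
fold <- rep(1:5, times = 20)

# AUC within each validation fold, then the average across folds
fold_auc <- sapply(split(seq_along(label), fold),
                   function(ii) auc_rank(pred[ii], label[ii]))
mean(fold_auc)
```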

In [None]:
pred <- prediction(predsY,Y)
perf1 <- performance(pred, "sens", "spec")
plot(1-slot(perf1,"x.values")[[1]],slot(perf1,"y.values")[[1]],type="s")
abline(0,1)
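
The ROC curve plots sensitivity against 1 - specificity as the classification threshold varies. As a minimal sketch of what `ROCR` is computing for us, the points can be traced by hand in base R; `roc_points` and the toy predictions are hypothetical and used only for illustration.

```r
# Sensitivity (TPR) and 1 - specificity (FPR) at each observed threshold
roc_points <- function(pred, label) {
  thr <- sort(unique(pred), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(pred[label == 1] >= t))
  fpr <- sapply(thr, function(t) mean(pred[label == 0] >= t))
  # start the curve at the origin (threshold above every prediction)
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Made-up predictions and labels
toy_pred  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
toy_label <- c(1, 1, 0, 1, 0, 0)

roc <- roc_points(toy_pred, toy_label)
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)  # reference line for a non-informative classifier
```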

### Compare to main terms logistic regression
SL.sht="SL.glm"
fit.test=CV.SuperLearner(Y=Y,X=X,family=binomial(),SL.library=SL.sht,verbose=FALSE,V=V)
fld=fit.test$folds
predsY=fit.test$SL.predict
n=length(predsY)
fold=rep(NA,n)
for(k in 1:V) {
    ii=unlist(fld[k])
    fold[ii]=k
}

### Get the CI for the x-validated AUC
ciout2=ci.cvAUC(predsY, Y, folds = fold)
txt2=paste("AUC = ",round(ciout2$cvAUC,2),", 95% CI = ",round(ciout2$ci[1],2),"-",round(ciout2$ci[2],2))

#### Compare via ROC plots
pred <- prediction(predsY,Y)
perf1 <- performance(pred, "sens", "spec")
lines(1-slot(perf1,"x.values")[[1]],slot(perf1,"y.values")[[1]],type="s",col=2)
text(0.75,0.30,txt2,col=2)
legend(0.05,0.95,c("SL","Logit Reg"),col=1:2,lty=rep(1,2),bty="n")