
Issues with binary outcomes using SuperLearner #20

Open
philipclare opened this issue Sep 24, 2018 · 8 comments

philipclare commented Sep 24, 2018

Hi,

I'm having an issue with binary outcomes. From the source, and from discussions elsewhere, it appears that in order to handle continuous outcomes, the package rescales them into the range [0, 1] and then fits them with a quasibinomial family.
However, this seems to cause problems with some SuperLearner wrappers even when the outcome is genuinely binary. For example, using the 'SL.ranger' wrapper to classify with random forests returns an error. I may be misunderstanding something, but perhaps this could be fixed by using a quasibinomial family for rescaled continuous variables and a binomial family for binary variables?
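For context, the rescaling of a continuous outcome into [0, 1] can be sketched in base R (illustrative only; `scale01` is my name for the transform, not an actual ltmle internal):

```r
# Rescale a numeric outcome linearly onto [0, 1] using its observed range.
scale01 <- function(y) {
  rng <- range(y, na.rm = TRUE)
  (y - rng[1]) / (rng[2] - rng[1])
}

scale01(c(2, 5, 11))  # returns 0, 1/3, 1
```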

A replicable (hopefully) example:

cloudstor <- "C:/Users/z3312911/Cloudstor/"
.libPaths(paste0(cloudstor,"R Library"))

library("tmle")
library("ltmle")
library("haven")
library("SuperLearner")
library("simcausal")
library("gam")
library("ranger")

# Data creation using simcausal
D <- DAG.empty() + 
  node("ba", distr="rnorm", mean=0, sd = 1) +
  node("bb", distr="rnorm", mean=0, sd = 1) +
  node("bc", distr="rnorm", mean=0, sd = 1) +
  node("u", t=0, distr="rnorm", mean=0, sd = 1) +
  node("l", t=0, distr="rbern", prob=plogis(-2 + 1.5*u[t] + 0.1*ba - 0.1*bb + 0.1*bc)) + 
  node("a", t=0, distr="rbern", prob=plogis(-2 + 1.0*l[t] + 0.2*ba - 0.2*bb + 0.2*bc)) +
  node("u", t=1:4, distr="rnorm", mean=0.7*u[t-1], sd = 1) +
  node("l", t=1:4, distr="rbern", prob=plogis(-2 + 1.0*a[t-1] + 2.0*l[t-1] + 1.5*u[t] + 0.1*ba - 0.1*bb + 0.1*bc)) +
  node("a", t=1:4, distr="rbern", prob=plogis(-2 + 2.0*a[t-1] + 1.0*l[t] + 0.2*ba - 0.2*bb + 0.2*bc)) +
  node("y", t=4, distr="rbern", prob=plogis(-3 + 0.20*a[t-4] + 0.40*a[t-3] + 0.60*a[t-2] + 0.80*a[t-1] + 1.00*a[t]
                                        + 0.50*l[t-4] + 0.50*l[t-3] + 0.50*l[t-2] + 0.50*l[t-1] + 0.50*l[t]
                                        + 0.50*u[t-4] + 0.50*u[t-3] + 0.50*u[t-2] + 0.50*u[t-1] + 0.50*u[t]
                                        + 0.2*ba - 0.2*bb + 0.2*bc))
# Set this causal structure, defining all 'u' variables as latent (so they will not be included in the data)
D <- set.DAG(D, latent.v = c("u_0","u_1","u_2","u_3","u_4"))
# Create simulated dataset
data <- sim(D, n=500)

# LTMLE estimation with repeated observations of exposure but a single outcome,
# comparing exposure at all time points to no exposure
rltmle1 <- ltmle(data, Anodes=c("a_0","a_1","a_2","a_3","a_4"),
                 Lnodes=c("l_1","l_2","l_3","l_4"),
                 Ynodes="y_4",
                 survivalOutcome=FALSE,
                 abar=list(c(1,1,1,1,1),c(0,0,0,0,0)),
                 SL.library=c("SL.glm","SL.glm.interaction","SL.gam","SL.ranger"))

Thanks!

@joshuaschwab (Owner)

Hi philipclare,
I believe ltmle is working as expected, and you can use the result.

> summary(rltmle1)
Estimator:  tmle 
Call:
ltmle(data = data, Anodes = c("a_0", "a_1", "a_2", "a_3", "a_4"), 
    Lnodes = c("l_1", "l_2", "l_3", "l_4"), Ynodes = "y_4", survivalOutcome = FALSE, 
    abar = list(c(1, 1, 1, 1, 1), c(0, 0, 0, 0, 0)), SL.library = c("SL.glm", 
        "SL.glm.interaction", "SL.gam", "SL.ranger"))

Treatment Estimate:
   Parameter Estimate:  0.75259 
    Estimated Std Err:  0.15614 
              p-value:  1.4352e-06 
    95% Conf Interval: (0.44657, 1) 
...

The issue is:

  1. The initial SuperLearner Q regression is done on the binary Y; this works fine.
  2. The subsequent Q regressions are done on Q_{k+1}, which is continuous in [0, 1] even though Y is binary; this causes an error in SL.ranger.
    You end up seeing a lot of messages, but these are errors in SL.ranger that cause SuperLearner to give SL.ranger weight 0. SuperLearner still finishes because the other wrappers don't error, and ltmle also finishes.

You can check that SL.ranger has Coef 0 and Risk NA at all of the L nodes, but non-NA Risk and usually positive Coef at the A and Y nodes (where the regressand is binary):

> rltmle1$fit$Q
[[1]]
[[1]]$l_1
                             Risk      Coef
SL.glm_All             0.01384787 0.8163311
SL.glm.interaction_All 0.01424013 0.1836689
SL.gam_All             0.01391151 0.0000000
SL.ranger_All                  NA 0.0000000

[[1]]$l_2
                             Risk      Coef
SL.glm_All             0.01602560 0.7128688
SL.glm.interaction_All 0.01662685 0.2871312
SL.gam_All             0.01614384 0.0000000
SL.ranger_All                  NA 0.0000000

[[1]]$l_3
                             Risk Coef
SL.glm_All             0.01242406    1
SL.glm.interaction_All 0.01405787    0
SL.gam_All             0.01255452    0
SL.ranger_All                  NA    0

[[1]]$l_4
                              Risk      Coef
SL.glm_All             0.007104126 0.8113366
SL.glm.interaction_All 0.008230640 0.1886634
SL.gam_All             0.007162058 0.0000000
SL.ranger_All                   NA 0.0000000

[[1]]$y_4
                            Risk       Coef
SL.glm_All             0.1248136 0.79882078
SL.glm.interaction_All 0.2314382 0.01448145
SL.gam_All             0.1252425 0.14037659
SL.ranger_All          0.1334619 0.04632119


[[2]]
[[2]]$l_1
                             Risk      Coef
SL.glm_All             0.01983224 0.7186961
SL.glm.interaction_All 0.02036409 0.2813039
SL.gam_All             0.02003470 0.0000000
SL.ranger_All                  NA 0.0000000

[[2]]$l_2
                             Risk      Coef
SL.glm_All             0.02443332 0.2579116
SL.glm.interaction_All 0.02609210 0.1566819
SL.gam_All             0.02442018 0.5854065
SL.ranger_All                  NA 0.0000000

[[2]]$l_3
                             Risk Coef
SL.glm_All             0.01451075    1
SL.glm.interaction_All 0.01778840    0
SL.gam_All             0.01467757    0
SL.ranger_All                  NA    0

[[2]]$l_4
                              Risk      Coef
SL.glm_All             0.006327121 0.0000000
SL.glm.interaction_All 0.007821409 0.1613392
SL.gam_All             0.006265457 0.8386608
SL.ranger_All                   NA 0.0000000

[[2]]$y_4
                            Risk       Coef
SL.glm_All             0.1248136 0.79882078
SL.glm.interaction_All 0.2314382 0.01448145
SL.gam_All             0.1252425 0.14037659
SL.ranger_All          0.1334619 0.04632119


> rltmle1$fit$g
[[1]]
[[1]]$a_0
                            Risk       Coef
SL.glm_All             0.1150444 0.90374677
SL.glm.interaction_All 0.1191908 0.00000000
SL.gam_All             0.1154734 0.09625323
SL.ranger_All          0.1216147 0.00000000

[[1]]$a_1
                            Risk      Coef
SL.glm_All             0.1352237 0.4424547
SL.glm.interaction_All 0.1487119 0.0000000
SL.gam_All             0.1350058 0.2728304
SL.ranger_All          0.1379507 0.2847149

[[1]]$a_2
                            Risk      Coef
SL.glm_All             0.1481390 0.8445075
SL.glm.interaction_All 0.1717063 0.0111711
SL.gam_All             0.1498611 0.0000000
SL.ranger_All          0.1548507 0.1443214

[[1]]$a_3
                            Risk      Coef
SL.glm_All             0.1961866 0.8642027
SL.glm.interaction_All 0.2271313 0.1357973
SL.gam_All             0.1981329 0.0000000
SL.ranger_All          0.2118192 0.0000000

[[1]]$a_4
                            Risk      Coef
SL.glm_All             0.1764165 0.6134281
SL.glm.interaction_All 0.2333325 0.0000000
SL.gam_All             0.1787627 0.0000000
SL.ranger_All          0.1790368 0.3865719


[[2]]
[[2]]$a_0
                            Risk       Coef
SL.glm_All             0.1150444 0.90374677
SL.glm.interaction_All 0.1191908 0.00000000
SL.gam_All             0.1154734 0.09625323
SL.ranger_All          0.1216147 0.00000000

[[2]]$a_1
                            Risk      Coef
SL.glm_All             0.1352237 0.4424547
SL.glm.interaction_All 0.1487119 0.0000000
SL.gam_All             0.1350058 0.2728304
SL.ranger_All          0.1379507 0.2847149

[[2]]$a_2
                            Risk      Coef
SL.glm_All             0.1481390 0.8445075
SL.glm.interaction_All 0.1717063 0.0111711
SL.gam_All             0.1498611 0.0000000
SL.ranger_All          0.1548507 0.1443214

[[2]]$a_3
                            Risk      Coef
SL.glm_All             0.1961866 0.8642027
SL.glm.interaction_All 0.2271313 0.1357973
SL.gam_All             0.1981329 0.0000000
SL.ranger_All          0.2118192 0.0000000

[[2]]$a_4
                            Risk      Coef
SL.glm_All             0.1764165 0.6134281
SL.glm.interaction_All 0.2333325 0.0000000
SL.gam_All             0.1787627 0.0000000
SL.ranger_All          0.1790368 0.3865719

We could consider adding functionality to ltmle that lets the user specify a separate SL.library for each regression, or one library for the initial Q regression and another for all subsequent Q regressions. Alternatively, we could try to detect which wrappers fail with non-binary regressands, but that might be tricky. Or we could simply try to improve the messages (also tricky, because these mostly come from SuperLearner) and/or the documentation.

Josh


ck37 commented Sep 24, 2018

Hi Josh,

Just a quick note that if the outcome variable is non-binary, e.g. a continuous range in [0, 1], then SuperLearner should be used with family = gaussian(). This will be necessary for almost any machine learning algorithm to work correctly.

Thanks,
Chris

@joshuaschwab (Owner)

Thanks Chris.
How about we add an option SL.family with default NULL, where NULL means
SuperLearner(..., family = gaussian()) if the outcome is continuous in [0, 1]
SuperLearner(..., family = binomial()) if the outcome is binary
but the user can also specify SL.family = gaussian() or SL.family = binomial() to force one or the other?
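A minimal sketch of that default dispatch (illustrative only; `choose_family` is a hypothetical helper, not ltmle's actual API):

```r
# Pick the GLM family for a SuperLearner call: honor an explicit SL.family if
# given; otherwise use binomial() for binary Y and gaussian() for continuous Y.
choose_family <- function(y, SL.family = NULL) {
  if (!is.null(SL.family)) return(SL.family)
  if (all(y %in% c(0, 1, NA))) binomial() else gaussian()
}

choose_family(c(0, 1, 1))$family    # "binomial"
choose_family(c(0, 0.4, 1))$family  # "gaussian"
```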

@Lauren-EylerDang

Hi Josh,

I was hoping to use the ltmle package with the highly adaptive lasso (HAL), but I have this same issue: HAL doesn't support family="quasibinomial", so HAL is dropped from the SL library because the initial outcome variable is binary but the subsequent Q regressions are for a [0, 1] outcome. I ended up hard-coding the ltmle estimator, using HAL with family="binomial" for the initial Q regression and HAL with family="gaussian" for the subsequent Q regressions. Do you think that is a reasonable way to deal with this issue? I didn't see documentation for the SL.family=NULL option you described above (sorry if I missed it), but is there a way to do what you described with the ltmle package?

Thanks,
Lauren

@joshuaschwab (Owner)

Hi Lauren,
Sorry, I never followed up on implementing the SL.family option. Yes, I think your approach sounds reasonable. I don't know whether HAL ever returns predictions less than 0 or greater than 1 when all Y are continuous in [0, 1]; if it can, you might need to bound the predictions.
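Bounding could be as simple as truncating predictions to the unit interval (a base-R sketch; `bound01` is an illustrative helper, not part of ltmle or hal9001):

```r
# Truncate predictions to [0, 1] so downstream quasibinomial steps stay valid.
bound01 <- function(p) pmin(pmax(p, 0), 1)

bound01(c(-0.2, 0.5, 1.3))  # returns 0, 0.5, 1
```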
You could also write your own HAL wrapper that automatically selects the family, something like:

n <- 100
rexpit <- function(x) rbinom(n=length(x), size=1, prob=plogis(x))
W <- rnorm(n)
A1 <- rexpit(W)
L <- 0.3 * W + 0.2 * A1 + rnorm(n)
A2 <- rexpit(W + A1 + L)
Y <- rexpit(W - 0.6 * A1 + L - 0.8 * A2)
data <- data.frame(W, A1, L, A2, Y)

SL.hal9001.flexible <- function (Y, X, newX = NULL, max_degree = 3, 
                                 fit_type = c("glmnet", "lassi"), 
                                 n_folds = 10, use_min = TRUE, 
                                 family, # not actually used, but needed for compatibility with SuperLearner
                                 obsWeights = rep(1, length(Y)), ...) {
  # Select the family from the regressand: binomial for binary Y, gaussian otherwise
  if (all(Y %in% c(0, 1, NA))) {
    family <- stats::binomial()
  } else {
    family <- stats::gaussian()
  }
  print(family) # temporary, to check which family was selected
  SL.hal9001(Y, X, newX, max_degree, fit_type, n_folds, use_min, family, obsWeights, ...)
  # may need to bound predictions?
}

ltmle(data, Anodes=c("A1", "A2"), Lnodes="L", Ynodes="Y", abar=c(0, 0), SL.library = "SL.hal9001.flexible", estimate.time = FALSE)

If you try it, let me know how it works. This would be the same idea as the SL.family=NULL option. If it works well, I could add it to the next release.
Josh

@Lauren-EylerDang

Lauren-EylerDang commented Dec 9, 2020 via email

@joshuaschwab (Owner)

I just pushed a fix to the SLfamily branch (I didn't end up making a SL.family option - it just happens internally). Will you try it and see if it works for you?

@Lauren-EylerDang

Lauren-EylerDang commented Dec 28, 2020 via email
