
Using causal_forest to estimate average treatment effects over subgroups #238

Closed · markhwhiteii opened this issue Apr 22, 2018 · 26 comments
Labels: feature, help wanted, question

Comments

@markhwhiteii

markhwhiteii commented Apr 22, 2018

After reading a few of the papers on causal forests, my understanding is that the trees will tend to split more often on variables that cause the treatment effect to vary (that is, moderators of the experimental effect). However, the object returned by causal_forest, together with its associated predict function, seems to behave like any other machine learning algorithm focused on prediction rather than on causal statistical inference.

How can I use the causal_forest function to find where the treatment effect varies?

For example, I simulate a randomized experiment with a binary outcome where there is a positive effect for women and a negative effect for black men; there are also nuisance variables in the data. How would I use causal_forest to tell me something like, "The trees tend to split most on gender, and within the male branch they are also more likely to split on race"? This would help show me where the treatment effects are occurring.

Here are the data:

set.seed(1839)
n <- 5000
X <- data.frame(
  condition = factor(sample(c("control", "treatment"), n, TRUE)),
  gender = factor(sample(c("male", "female"), n, TRUE)),
  race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
  generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
  has_kids = factor(sample(c("no", "yes"), n, TRUE))
)
y <- factor(ifelse(
  X$gender == "female" & X$condition == "treatment",
  sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
  ifelse(X$race == "black" & X$condition == "treatment",
    sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
    sample(c("positive", "negative"), n, TRUE, c(.40, .60))
  )
))

And then I fit a model using grf::causal_forest:

library(grf)
mod <- causal_forest(
  model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
  (as.numeric(y) - 1), # convert outcome to 0 or 1
  (as.numeric(X[, 1]) - 1), # convert treatment to 0 or 1
  num.trees = 4000
)

Looking at mod, I see variable importance values that hint at where the splits are occurring most:

Variable importance: 
    1     2     3     4     5     6     7     8 
0.233 0.203 0.267 0.061 0.055 0.046 0.081 0.054 
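
To map those indices back to covariate names, here is a minimal sketch, assuming grf's variable_importance() helper and the same design matrix used to train mod:

# attach column names to the importance values and sort (sketch)
X.mat <- model.matrix(~ -1 + ., X[, -1])
imp <- variable_importance(mod)
sort(setNames(as.numeric(imp), colnames(X.mat)), decreasing = TRUE)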

However, this doesn't capture how multiple X variables may depend on one another in their interaction with W on Y (what may be called a three-way interaction in the general linear model world).

If I use the predict() function, I can only look at individual-level conditional treatment effects and their associated variance estimates. For example, just the first row of my data:

predict(mod, model.matrix(~ -1 + ., X[1, -1]), estimate.variance = TRUE)

However, how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?

Additionally, can the causal_forest function tell me where these variations are most likely to occur? It seems like this is possible from the papers I have read on causal forests (as well as earlier papers on causal trees, transformed outcome trees, etc.), but I may very well be mistaken.

@lbuckley13

Hi Mark,

I have the same question as you. The best answer that I've found was published by a group at Mt. Sinai Medical School using a neutral randomized controlled trial of a weight loss intervention in patients with type 2 diabetes mellitus. To cut to the chase, their analysis led them to conclude that the neutral overall finding masked heterogeneous treatment effects within the overall cohort and that those heterogeneous treatment effects were driven by two parameters (hemoglobin A1c and self-care ability). The reference is Baum A, et al. Lancet Diabetes Endocrinol 2017;5:808-15.

I'd be interested to learn whether you've made any progress on this question on your own. Please let me know.

Leo

@adeldaoud

Mark, excellent questions. Wondering the same things. Especially,

how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?

and

Additionally, can the causal_forest function tell me where these variations are most likely to occur?

@markhwhiteii
Author

markhwhiteii commented May 18, 2018

@lbuckley13 @adeldaoud What I have been doing so far is "aggregating up" to get effects conditional on one variable (e.g., gender = male, race = Black, etc.). To do that, I've been using the following code. First, I'll recreate the data and model above and then show how I aggregate up:

# copying code from above ------------------------------------------------------
set.seed(1839)
n <- 5000
X <- data.frame(
  condition = factor(sample(c("control", "treatment"), n, TRUE)),
  gender = factor(sample(c("male", "female"), n, TRUE)),
  race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
  generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
  has_kids = factor(sample(c("no", "yes"), n, TRUE))
)
y <- factor(ifelse(
  X$gender == "female" & X$condition == "treatment",
  sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
  ifelse(X$race == "black" & X$condition == "treatment",
         sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
         sample(c("positive", "negative"), n, TRUE, c(.40, .60))
  )
))

library(grf)
mod <- causal_forest(
  model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
  (as.numeric(y) - 1),             # convert outcome to 0 or 1
  (as.numeric(X[, 1]) - 1),        # convert treatment to 0 or 1
  num.trees = 4000
)

# aggregating up ---------------------------------------------------------------
model_hat <- predict(mod, estimate.variance = TRUE)
model_sigma <- sqrt(model_hat$variance.estimates)

# save the estimated treatment effects and standard errors back to a data frame
# and compute 95% confidence intervals
dat <- as.data.frame(X[, -1])
dat$pred_est <- c(model_hat$predictions)
dat$pred_var <- c(model_sigma)
dat$pred_est_lb <- dat$pred_est - 1.96 * dat$pred_var
dat$pred_est_ub <- dat$pred_est + 1.96 * dat$pred_var

# aggregate up: for each factor, get the average estimate at each level
library(dplyr)
cates <- lapply(names(dat[, 1:4]), function(x) {
  tmp <- dat %>% 
    group_by_(x) %>% 
    transmute(
      variable = x,
      ate = round(mean(pred_est) * 100, 2),
      ate_lb = round(mean(pred_est_lb) * 100, 2),
      ate_ub = round(mean(pred_est_ub) * 100, 2)
    ) %>% 
    unique() %>% 
    as.data.frame()
  tmp <- tmp[, c(2, 1, 3, 4, 5)]
  names(tmp)[2] <- "level"
  tmp
})
cates <- do.call(rbind, cates) %>% 
  mutate_if(is.character, as.factor)

# visualize these
library(ggplot2)
ggplot(cates, aes(x = level, y = ate, color = variable)) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    strip.text.y = element_text(colour = "black"),
    strip.background = element_rect(colour = NA, fill = NA),
    legend.position = "none"
  ) +
  geom_point() +
  geom_errorbar(aes(ymin = ate_lb, ymax = ate_ub), width = .2) +
  geom_hline(yintercept = 0, linetype = 3) +
  facet_grid(variable ~ ., scales = "free_y") +
  coord_flip()

The resulting cates (for conditional average treatment effects) data frame looks like:

     variable      level   ate ate_lb ate_ub
1      gender       male -6.15 -22.74  10.45
2      gender     female  7.08 -10.36  24.52
3        race   hispanic  5.37 -11.88  22.61
4        race      black -9.18 -25.43   7.07
5        race      asian  1.46 -15.45  18.38
6        race      white  5.04 -12.69  22.77
7  generation babyboomer -2.59 -19.27  14.08
8  generation millennial  0.30 -16.72  17.33
9  generation          x  4.07 -13.31  21.45
10   has_kids        yes -0.05 -17.34  17.24
11   has_kids         no  1.25 -15.52  18.01

And the plot looks like:

[screenshot: dot plot of the CATE estimates with 95% CI error bars, faceted by variable, as produced by the ggplot code above]

The only thing I'm wondering is: are the confidence intervals too wide, since they are based on individual-level estimates?

Also, this does not get me closer to answering the question about how to find if it is warranted to look at a subgroup effect (sort of akin to telling me if the interaction is significant or not). But I thought I'd share it with you both in hopes that we can move it forward.

@swager
Member

swager commented May 18, 2018

@markhwhiteii thank you for the very detailed post. This type of subgroup analysis seems like a very useful thing to be able to do.

In terms of the CIs, the short answer is that, yes, as you suspect, the intervals are too wide. As an extreme example of this, if you wanted to estimate the average treatment effect for everyone, the function average_treatment_effect would give accurate uncertainty quantification for the ATE, whereas averaging the endpoints of the individual CIs would give an interval that is far too wide.

One solution would be to extend the ATE function so that it can focus on a subgroup, allowing for syntax something like this:

average_treatment_effect(mod, subset=(X$race == "black"))

Would something like this be helpful?

@swager
Member

swager commented May 18, 2018

Also, as to the original question of looking for variables that are associated with treatment heterogeneity, I often try something like this:

  • Divide the training set into quintiles by CATE estimate,
  • Compute the mean and variance of X[,j] for different j across all quintiles, and then
  • Look for j where these distributions are very different.

This can help you find variables that are associated with treatment heterogeneity.
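
A rough sketch of this recipe in R, using the forest mod and covariates X from the earlier posts (the use of OOB predictions and column means here is just one way to operationalize the three steps):

X.mat <- model.matrix(~ -1 + ., X[, -1])
tau.hat <- predict(mod)$predictions  # OOB CATE estimates on the training set

# 1. Divide the training set into quintiles by CATE estimate.
cate.quintile <- cut(tau.hat,
                     breaks = quantile(tau.hat, probs = seq(0, 1, 0.2)),
                     include.lowest = TRUE, labels = 1:5)

# 2. Compute the mean (and variance) of each X[, j] within each quintile.
means.by.quintile <- aggregate(as.data.frame(X.mat),
                               by = list(cate.quintile = cate.quintile), FUN = mean)
vars.by.quintile <- aggregate(as.data.frame(X.mat),
                              by = list(cate.quintile = cate.quintile), FUN = var)

# 3. Covariates whose distributions shift strongly across quintiles are
#    candidates for driving treatment heterogeneity.
means.by.quintile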

@markhwhiteii
Author

@swager responding to each comment, in turn:

  1. Yes, that would be very helpful. It looks like that syntax would allow me to do something like average_treatment_effect(mod, subset=(X$race == "black" & X$generation == "x")) as well? Is the math behind this already in the causal forest papers?

  2. Thanks for this tip on how we can find what variables determine heterogeneity—that helps!

@adeldaoud

@swager , @markhwhiteii , thanks for your input. My follow-up questions:

  1. Regarding the reply below, would we not get an indication of this procedure by looking at variable importance?

Also, as to the original question of looking for variables that are associated with treatment heterogeneity, I often try something like this:

Divide the training set into quintiles by CATE estimate, Compute the mean and variance of X[,j] for different j across all quintiles, and then Look for j where these distributions are very different.

  2. Building on @markhwhiteii ’s question:

Also, this does not get me closer to answering the question about how to find if it is warranted to look at a subgroup effect (sort of akin to telling me if the interaction is significant or not). But I thought I'd share it with you both in hopes that we can move it forward.

How can we capture interaction effects between two or more variables by looking at variable importance or by searching manually over the CATE distribution (as you suggested)? I know that trees build interactions between variables as the algorithm grows them, but how does that show up when we want to retrieve the most important interactions? Would it, for example, be possible to build a variable importance measure for interaction effects, e.g. one that indicates that gender and race are important only together, but not separately?

  3. Related to all these questions, I am trying to understand the difference between ITE and CATE in the context of causal_forest, and how their respective SEs should be understood.

In the original tutorial https://github.com/swager/grf/ , Usage Examples, we specify the test set (X.test) as a desired sequence that reflects the variable of interest.

# Generate data.
n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 2, length.out = 101)

(3a) My first question: we set all the other p variables to zero, and I wonder whether this is equivalent to setting covariates to their mean in a regression framework? And how does that affect our interpretation of the average effect for the X.test sample?

We then plot the point estimate and the SE, which leads to (3b) my second question. I am interpreting this as one hundred and one ITEs, one predicted for each point in the desired sequence, and not as a CATE. In other words, I am interpreting this in a similar way as giving the model one observation for which I want one ITE estimate with a corresponding SE. So in @markhwhiteii's example this could be the ITEs for the first two rows:

model_hat.ite <- predict(mod, model.matrix(~ -1 + ., X[1:2, -1]), estimate.variance = TRUE)
model_sigma.ite <- sqrt(model_hat.ite$variance.estimates)
str(model_hat.ite)

From this interpretation of ITEs, I would now like to get group-averaged ITEs, which leads me to the questions @markhwhiteii has raised. If my interpretation of the ITE is correct, then I am not sure how we get a CATE that is group-averaged (e.g., the ATE for gender = female & race = black) and that is sensitive to the size of each group (maybe we can call it a GATE, a group-averaged treatment effect). You have already indicated an answer to this, but could you please show what the mathematics behind it would look like (e.g., is it averaging over trees, or sampling for the GATE given its group size)?

@kingsharaman

kingsharaman commented May 24, 2018

I created a version of average_treatment_effect() that supports subsetting, and it also lets you provide predictions beforehand so you don't have to call predict() within average_treatment_effect() each time.

You can find my version of the R file here:
https://github.com/kingsharaman/grf/blob/master/r-package/grf/R/average_treatment_effect.R

@swager should I create a pull request?

@swager
Member

swager commented May 29, 2018

Thanks a lot, @kingsharaman! Yes, please create a PR.

I like the idea of not having to compute predictions each time you want to call the average_treatment_effect function. However, I don't like passing the predictions separately (as they should really be part of the forest object).

How about doing something like this instead? Once you compute OOB predictions the first time you call average_treatment_effect, then you add them to the object in the parent scope, so the next time you can just grab them from there? Some prototyping:

foo <- function(cf) {
  if (is.null(cf$predictions)) {
    preds <- 1:3  # stand-in for the (expensive) OOB predictions
    cf.with.preds <- cf
    cf.with.preds$predictions <- preds
    # overwrite the caller's copy of `cf` so the predictions are cached on it
    eval.parent(substitute(cf <- cf.with.preds))
  }
}

a <- list(forest = "forest")
foo(a)  # afterwards, a$predictions holds the cached values

@jtibshirani let me know what you think.

p.s. We should probably move this conversation to a new thread with the PR.

@swager
Member

swager commented May 29, 2018

@adeldaoud the main idea is that if you want to estimate an average effect over a subgroup, then you can just use the same method as you'd use for an ATE, but only considering those observations in the subgroup. For example, AIPW computes the ATE by averaging Gamma over all samples, where

Gamma[i] = mu.hat.1[i] - mu.hat.0[i] + W[i] / e.hat[i] * (Y[i] - mu.hat.1[i]) - (1 - W[i]) / (1 - e.hat[i]) * (Y[i] - mu.hat.0[i])

where mu.hat.1[i] is a prediction for treated units at the i-th feature value, etc. Now, if you want to estimate the ATE over a subgroup, just average Gamma[i] over the subgroup. (But you can only do this for fairly large subgroups. Notice that Gamma[i] is a terrible estimate of the CATE at a point because it is noisy; you need to average it over a large subgroup to average out the noise.)
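
To make this concrete, here is an illustrative sketch of averaging these scores over a subgroup, using the forest mod and covariates X from earlier in the thread. It assumes the fitted object exposes Y.hat, W.hat, Y.orig, and W.orig, and the plug-in standard error is only meant to convey the idea (the built-in average_treatment_effect handles this more carefully):

tau.hat <- predict(mod)$predictions                 # OOB CATE estimates
mu.hat.0 <- mod$Y.hat - mod$W.hat * tau.hat         # estimate of E[Y | X, W = 0]
mu.hat.1 <- mod$Y.hat + (1 - mod$W.hat) * tau.hat   # estimate of E[Y | X, W = 1]

# AIPW scores Gamma[i] as defined above
Gamma <- mu.hat.1 - mu.hat.0 +
  mod$W.orig / mod$W.hat * (mod$Y.orig - mu.hat.1) -
  (1 - mod$W.orig) / (1 - mod$W.hat) * (mod$Y.orig - mu.hat.0)

# Group-averaged treatment effect ("GATE") for, e.g., black respondents
subgroup <- X$race == "black"
c(estimate = mean(Gamma[subgroup]),
  std.err = sd(Gamma[subgroup]) / sqrt(sum(subgroup)))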

@jtibshirani
Member

jtibshirani commented May 30, 2018

@kingsharaman a couple thoughts on the discussion around subsetting:

  • I think we should create two separate PRs, one that adds support for subsetting, and another that addresses the fact that we compute OOB predictions multiple times. This helps keep these different concerns separate, and will make sure each change doesn't get hung up by the other. I see you've already opened Added subsetting to average_treatment_effect #249 to add subsetting support -- thank you for the contribution!
  • Offline, @swager and I discussed what to do about the fact we predict OOB multiple times, and are not entirely satisfied with the approaches considered above. I am going to give it a bit of thought and will circle back tomorrow with a suggestion. I'm glad that your discussion brought up this performance issue we previously overlooked.

@jtibshirani
Member

Concerning the OOB predictions, here's what I propose: by default, we should calculate OOB predictions automatically during training and return them as part of the forest object (forest$oob_predictions = predict_oob(...)). We can introduce a flag skip_oob_prediction so that power users can avoid this extra cost if they know they don't need OOB predictions for their analysis. In methods like average_treatment_effect, we can verify that forest$oob_predictions is present, and fail with an informative error if it is not ('it looks like this forest was trained with skip_oob_prediction enabled...').
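
A purely illustrative sketch of that check (the oob_predictions field and skip_oob_prediction flag are the names proposed above, not part of the released grf API):

get_oob_predictions <- function(forest) {
  if (is.null(forest$oob_predictions)) {
    stop("It looks like this forest was trained with skip_oob_prediction enabled; ",
         "re-train without that flag to use methods that rely on OOB predictions.")
  }
  forest$oob_predictions
}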

As an aside to @swager -- if we calculate OOB predictions during training by default, I wonder if we can now avoid hanging on to the training matrix (forest$X.orig)?

@adeldaoud

@swager, thanks for the reply. So the new subsetting function will also be sensitive to the size (and clustering) of the subgroup and calibrate the standard errors accordingly?

My 3rd question (parts of it re-quoted below) aims at understanding the difference between an SE produced for an ITE (a prediction for one individual conditional on all the features) and an SE produced for a CATE (a prediction for more than one individual conditional on all the features). With subsetting we seem to get what I would call a group-averaged treatment effect (GATE), which produces a prediction for more than one individual conditional on some of the features (e.g., race or age).

  3. Related to all these questions, I am trying to understand the difference between ITE and CATE in the context of causal_forest, and how their respective SEs should be understood.

@markhwhiteii
Author

markhwhiteii commented Jun 4, 2018

@kingsharaman great function. I've sourced it in my sessions as average_treatment_effect_2 and have used it as follows:

All of my variables are factors, so I turn them all to dummy variables first:

# create data for causal forest
grf_dat <- model.matrix(~ -1 + ., dat[, c(1, 3:7)])
grf_dat <- cbind(
  grf_dat, 
  response = as.numeric(dat[complete.cases(dat), 8]) - 1
)
grf_dat <- cbind(
  grf_dat,
  condition = as.numeric(dat[complete.cases(dat), 2]) - 1
)

Then I fit the causal forest:

set.seed(1839)
model <- causal_forest(
  grf_dat[, 1:15], 
  grf_dat[, 16], 
  grf_dat[, 17],
  num.trees = 4000
)

Then I use the most recent function from the pull request:

# generate predictions so we don't have to do it every time
predictions <- predict(model)$predictions

# get conditional average treatment effects
cates <- data.frame(var = rep(NA, 40), level = NA, cate = NA, lb = NA, ub = NA)
counter <- 1
cov_names <- names(dat[, c(1, 3:7)])
for (i in seq_along(cov_names)) {
  for (j in levels(dat[[cov_names[i]]])) {
    tmp <- average_treatment_effect_2(
      model, 
      subset = dat[complete.cases(dat), cov_names[i]] == j,
      predictions = predictions
    )
    cates$var[[counter]] <- cov_names[i]
    cates$level[[counter]] <- j
    cates$cate[[counter]] <- tmp[[1]]
    cates$lb[[counter]] <- tmp[[1]] - 1.96 * tmp[[2]]
    cates$ub[[counter]] <- tmp[[1]] + 1.96 * tmp[[2]]
    counter <- counter + 1
  }
}
cates <- cates[complete.cases(cates), ]

That calculates the average treatment effect for each level of each covariate.

swager changed the title from "Using causal_forest to find heterogenous treatment effects" to "Using causal_forest to estimate average treatment effects over subgroups" on Jun 24, 2018
@Zaw5009

Zaw5009 commented Jun 25, 2018

Hi, I am looking to calculate ATEs (and their SEs) for predicted deciles based on predictions from a causal forest. A version of this is done in the uplift package, but in that package 1) causal forests aren't used and 2) you have to bootstrap the SEs. Here is an example of a paper that used this approach: http://journals.ama.org/doi/10.1509/jmr.16.0163?code=amma-site

@markhwhiteii
Author

markhwhiteii commented Jul 5, 2018

@Zaw5009 very interesting. I'd love it if someone could explain the difference between a causal forest and an uplift random forest; from reading the algorithms and pseudocode in their respective papers, they look to be very similar.

What do you figure is the best way to compute bootstrap SEs? It seems like bootstrapping an algorithm that already uses bagging would be prohibitively computationally expensive.

@Zaw5009

Zaw5009 commented Jul 5, 2018

@markhwhiteii I think they're basically the same. What I want is to be able to reproduce something like this
[image: plot of outcomes by predicted treatment-effect decile groups, from the referenced paper]
with standard errors using GRF (instead of risk I would just plot the ATE as a horizontal line in my case. The groups are predicted deciles of the treatment effect.)

@markhwhiteii
Author

@Zaw5009 if the groups are categorical, you can subset using this function: https://github.com/kingsharaman/grf/blob/master/r-package/grf/R/average_treatment_effect.R. But yes, I'm looking for the same type of functionality. Have you tried bootstrapping SEs for the upliftRF? I'm going to take a crack at it today.

@Zaw5009

Zaw5009 commented Jul 5, 2018

I have not tried the bootstrapping approach, but this solves my problem. I can just plug in the values from the holdout group to obtain lift deciles and then calculate the ATEs and their accompanying SEs for each predicted lift decile. Thanks!
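
For reference, a sketch of that workflow with the subset-aware average_treatment_effect() discussed in this thread (here the deciles are formed from the forest's own OOB predictions for brevity; with a holdout sample, you would form them from the holdout predictions instead):

tau.hat <- predict(mod)$predictions
cate.decile <- cut(tau.hat,
                   breaks = quantile(tau.hat, probs = seq(0, 1, 0.1)),
                   include.lowest = TRUE, labels = 1:10)

# ATE and SE within each predicted-CATE decile
decile.ates <- t(sapply(levels(cate.decile), function(d) {
  average_treatment_effect(mod, subset = cate.decile == d)
}))
decile.ates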

@swager
Member

swager commented Jul 6, 2018

Thanks a lot @markhwhiteii! We'll make sure this functionality is included in the next release.

In terms of uplift forests: It looks like the main difference between uplift forests and causal forests is that uplift forests try to find regions in feature space with a large divergence between the treated and control outcome distributions, whereas causal forests directly target treatment heterogeneity. Thus, although the algorithms are similar, the statistical motivation seems quite different. The CCIF function in the uplift package looks more similar to causal forests.

Another difference is that causal forests, as implemented here, are locally robust in a way that increases accuracy in observational studies: See Section 6.2 of https://arxiv.org/pdf/1610.01271.pdf, and also https://arxiv.org/pdf/1712.04912.pdf for a broader discussion.

jtibshirani added the help wanted and feature labels on Jul 10, 2018
@philipy1219

@swager As you noted before, a fairly large subgroup is required when averaging Gamma over it. Is there any guideline for deciding whether the size is "large" enough to get a credible result?

@swager
Member

swager commented Aug 7, 2018

@philipy1219 the subset-specific CIs should be valid even over small-ish subsets; however, the CIs may be very wide.

@tianshengwang

Also, as to the original question of looking for variables that are associated with treatment heterogeneity, I often try something like this:

  • Divide the training set into quintiles by CATE estimate,
  • Compute the mean and variance of X[,j] for different j across all quintiles, and then
  • Look for j where these distributions are very different.

This can help you find variables that are associated with treatment heterogeneity.

@swager This method (dividing the training set into quintiles ...) also works for a binary outcome, right? It seems the estimated CATE from causal_forest will be a continuous value even when the outcome is binary (0, 1).

@swager
Member

swager commented Feb 1, 2020

Yes -- everything with causal forests works the same for both continuous and binary outcomes. If you have binary outcomes, then the CATE is estimated on the "difference in probabilities" scale.
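
(Concretely, with a binary outcome the estimand at a point x is tau(x) = P(Y = 1 | X = x, W = 1) - P(Y = 1 | X = x, W = 0), so a CATE estimate of 0.1 means the treatment is estimated to raise the probability of a positive outcome by 10 percentage points for units with covariates x.)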

@tianshengwang

Yes -- everything with causal forests works the same for both continuous and binary outcomes. If you have binary outcomes, then the CATE is estimated on the "difference in probabilities" scale.

Thanks a lot!

@tianshengwang

Yes -- everything with causal forests works the same for both continuous and binary outcomes. If you have binary outcomes, then the CATE is estimated on the "difference in probabilities" scale.

Does causal forest work for recurrent events? For example, when assessing the comparative effectiveness of treatment A vs. treatment B for an asthma exacerbation outcome, the majority of patients will have 0 events, some will have only 1 event, and a smaller portion will have multiple events during a fixed follow-up period (e.g., 1 year).
