
GRF refuses to report the average treatment effect when the dataset is imbalanced #406

Closed
ginward opened this issue Apr 23, 2019 · 18 comments

@ginward
Contributor

ginward commented Apr 23, 2019

Description of the bug
Download the dataset at this link. This is a dataset by LaLonde (1986), Dehejia and Wahba (1999) on the NSW training program. I am not sure if there is copyright - if so it belongs to them.

The data set is very imbalanced - there are around 10000 untreated example and less than 300 treated example.

When I run the code below with the development version, average_treatment_effect(forest, target.sample = "all") returns NaN. The same code returns numbers with the released version 0.10.2; it only returns NaN on the development branch.

Is this supposed to be a feature (i.e., returning NaN when the data is very imbalanced), or is it actually a bug?

The average treatment effect on the treated and the overlap-weighted treatment effect are both normal, although those numbers also differ slightly from the released branch.

Steps to reproduce

# Compile the grf development code and load it
library(grf)
library(dplyr)
select <- dplyr::select

TREES_CPS <- 1000
SEED <- 6000
set.seed(SEED)

dat <- read.csv("bug.csv")
forest <- causal_forest(
  X = as.matrix(select(dat, re74, re75, age, education, black, hispanic, married, nodegree)),
  Y = dat$re78,
  W = dat$treat,
  num.trees = TREES_CPS,
  seed = SEED
)
average_treatment_effect(forest, target.sample = "all")

Output in Development Version

> average_treatment_effect(forest, target.sample = "all")
estimate  std.err 
     NaN      NaN 

Output in Release Version

> average_treatment_effect(forest, target.sample = "all")
 estimate   std.err 
-2058.140  1401.162 

GRF version
development

@swager swager added the bug label Apr 25, 2019
@swager
Member

swager commented Apr 26, 2019

I tried running this on both the old and new versions of the package. First, on this dataset, the propensity scores get very close to 0 (down to 0.0005). So even though the old version gives a point estimate for the ATE, it also warns: "Estimated treatment propensities go as low as 0 which means that treatment effects for some controls may not be well identified..."

The difference from the release version is that, in the development version, the forest that estimates the propensity scores appears to split more aggressively, and now returns some propensity estimates that are exactly (rather than very nearly) zero. That is why the ATE estimate comes out as NaN: we are dividing by a zero propensity.

I'm removing the "bug" label, as what the development version is doing still appears to be reasonable (it simply estimates a small propensity score as exactly zero instead of as a very small positive number). However, it may be worth investigating further why the development version makes more splits when estimating the propensity score here.

p.s. In cases like these, with very small propensity scores, it is often a good idea to focus on the average treatment effect on the treated instead of the overall average treatment effect.

@swager swager removed the bug label Apr 26, 2019
@ginward
Contributor Author

ginward commented Apr 26, 2019

@swager Thanks for your reply! I guess what you mean by "release version" is actually the development version? It seems that only the old version (0.10.2) has been released, and the one on GitHub is not released yet. (Correct me if I am wrong.)

I agree with you: when the data has propensities close to 0, it is better to use the treatment effect on the treated. What puzzled me was the NaN, but I think what you said is right: it comes from dividing by a value that is almost 0.

It would indeed be interesting to find out why the new version splits more aggressively...

@swager
Member

swager commented Apr 26, 2019

Thanks, I fixed the typo in my previous comment. I discussed with @jtibshirani, and it appears that the more aggressive splitting is due to a subtle interaction between early stopping rules for tree splitting and honesty. It's not clear to me whether such more aggressive splitting is better or worse in practice (my hunch is that slightly more aggressive splitting may in fact be good, in that it gives us more power to detect failures in overlap), so I'm closing this ticket for now. However, if anyone comes across adverse effects of the more aggressive splitting in the development version, please let me know.

@swager swager closed this as completed Apr 26, 2019
@ginward
Contributor Author

ginward commented Apr 28, 2019

> I tried running this on both the old and new versions of the package. First, on this dataset, the propensity scores get very close to 0 (they get down to 0.0005). So even though the old version gives a point estimate for the ATE, it also warns Estimated treatment propensities go as low as 0 which means that treatment effects for some controls may not be well identified...
>
> The difference with the release version is that the forest that estimates the propensity scores appears to be splitting more aggressively, and now returns some propensity estimates that are exactly (rather than very nearly) zero. So that's why the ATE estimate comes out as NaN (because we're dividing by a zero propensity).
>
> I'm removing the "bug" label, as what the development version is doing still appears to be reasonable (it's just estimating a small propensity score by exactly zero instead of a very small positive number). However, it may be worth investigating further why the development version makes more splits when estimating the propensity score here.
>
> p.s. In cases like these, with very small propensity scores, it's often a good idea to focus on the average treatment effect on the treated instead of the overall average treatment effect.

@swager Just curious: the calculation of the ATE comes from the following formula:

target.sample = "all": the ATE on the whole population, (1/n) * sum_{i = 1}^n E[Y(1) - Y(0) | X = X_i].

How is this "dividing by a zero propensity"? My understanding was that, since there are now more propensities equal to 0, those zeros would go into a denominator. But from this formula, the propensities don't appear in the denominator at all.

@ginward
Contributor Author

ginward commented Apr 28, 2019

Unless it is dividing by 0 in a step prior to calculating the average treatment effect ...

@davidahirshberg
Member

This formula defines the average treatment effect, but it cannot be used to calculate it, as it involves potential outcomes that we do not observe. For each i, we observe either Y_i(0) [if W_i=0] or Y_i(1) [if W_i=1] but not both. Our estimator does divide by the propensity score --- take a look at lines 183-210 of average_treatment_effect.R
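To see concretely where the division happens, here is a minimal base-R sketch of an AIPW-style ATE estimate. This is illustrative only: the helper `aipw_ate` and the toy numbers are invented for this example, and grf's actual code lives in average_treatment_effect.R.

```r
# Minimal AIPW-style ATE sketch (illustrative only, not grf's implementation):
# tau_hat = mean( mu1 - mu0 + W*(Y - mu1)/e - (1 - W)*(Y - mu0)/(1 - e) )
aipw_ate <- function(Y, W, mu0, mu1, e.hat) {
  mean(mu1 - mu0 +
         W * (Y - mu1) / e.hat -
         (1 - W) * (Y - mu0) / (1 - e.hat))
}

# Toy data: two treated and two control observations.
Y   <- c(3, 1, 2.5, 0.5)
W   <- c(1, 0, 1, 0)
mu0 <- rep(1, 4)               # fitted control-outcome regression
mu1 <- rep(3, 4)               # fitted treated-outcome regression
e.ok   <- rep(0.5, 4)
e.zero <- c(0.5, 0, 0.5, 0.5)  # one fitted propensity is exactly 0

aipw_ate(Y, W, mu0, mu1, e.ok)    # 2
aipw_ate(Y, W, mu0, mu1, e.zero)  # NaN
```

With e.hat = 0 at a control observation, the numerator W * (Y - mu1) is 0, so R evaluates 0/0 = NaN, and the NaN propagates through the mean, matching the NaN estimate and std.err reported above.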

@ginward
Contributor Author

ginward commented Apr 28, 2019

@davidahirshberg Thanks! I will take a look.

@ginward
Contributor Author

ginward commented May 9, 2019

> This formula defines the average treatment effect, but it cannot be used to calculate it, as it involves potential outcomes that we do not observe. For each i, we observe either Y_i(0) [if W_i=0] or Y_i(1) [if W_i=1] but not both. Our estimator does divide by the propensity score --- take a look at lines 183-210 of average_treatment_effect.R

@davidahirshberg if the estimator does divide by the propensity score, does it mean that the ATEs are by default doubly robust?

@swager swager reopened this May 9, 2019
@ginward
Contributor Author

ginward commented May 9, 2019

@swager are you reopening this issue? 😂

@swager
Member

swager commented May 9, 2019

Hi @ginward, I reopened the issue while there's still discussion on it. We only actively monitor open tickets.

For the question: the function average_treatment_effect implements a doubly robust ATE estimator. When estimating ATEs with machine learning methods, you essentially always want to use a doubly robust (or related) method to avoid bias due to regularization (see, e.g., the double machine learning paper by Chernozhukov et al.). For a discussion of exactly what we implement, see https://arxiv.org/pdf/1902.07409.pdf.

@ginward
Contributor Author

ginward commented May 9, 2019

Thanks @swager

I have indeed read the application paper you wrote, and at the time I thought doubly robust estimation was only implemented for the cluster-robust estimators.

Anyway, it looks like the doubly robust method will not work when the fitted propensity is 0 (it basically produces +Inf). Is this how a doubly robust estimator is meant to behave? What if a person indeed has a propensity of treatment that is close to 0?

@swager
Member

swager commented May 9, 2019

Yes, that's right. If some people's treatment propensity is (essentially) zero, then the ATE is (essentially) not identified, because the treatment effect function is not identified in the region with poor overlap. That is why any non-parametric estimate of the ATE will blow up in this case.

The only way around this is to change estimands, so that the target of inference is still identified even when some propensities are 0. One simple way to do this is to use the average treatment effect on the treated (ATT). Because the ATT only depends on the CATE function in regions where some people are treated, it does not matter that the CATE function is unidentified in the region with zero propensity of treatment: that region is simply ignored by the ATT.
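To make the contrast concrete, here is a hedged base-R sketch of a doubly robust ATT estimate. The helper `aipw_att` and the toy numbers are made up for illustration and are not grf's exact implementation (in grf this corresponds to target.sample = "treated").

```r
# Doubly robust ATT sketch (illustrative only). Controls are reweighted by
# e/(1 - e), so a unit with propensity exactly 0 gets weight 0 instead of
# triggering a 0/0 division as in the ATE case.
aipw_att <- function(Y, W, mu0, e.hat) {
  sum(W * (Y - mu0) -
        (1 - W) * (e.hat / (1 - e.hat)) * (Y - mu0)) / sum(W)
}

Y   <- c(3, 1, 2.5, 0.5)
W   <- c(1, 0, 1, 0)
mu0 <- rep(1, 4)
e.zero <- c(0.5, 0, 0.5, 0.5)  # a control with propensity exactly 0

aipw_att(Y, W, mu0, e.zero)  # 2 (finite, unlike the ATE on the same data)
```

The propensity only ever enters through the control weight e/(1 - e), which is well behaved near e = 0; this is the algebraic reason the ATT remains identified under poor overlap.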

@ginward
Contributor Author

ginward commented May 9, 2019

@swager I see. This cleared things up a lot. Much appreciated!

@jtibshirani
Member

As a note, I walked back part of the change that caused forests to split more aggressively (#415). The behavior of the development version should now match that of 0.10.2 more closely.

I'm going to close this out, since the issue seems to be resolved. Thanks again @ginward for reporting it; it helped us catch an unintentional change in behavior before we released it!

@ginward
Contributor Author

ginward commented May 11, 2019

@jtibshirani Sure thing!

@minhengw

Dear all,

I have read all the posts above and learned a lot. Much appreciated.

My question is similar: the output of average_treatment_effect is [1] NaN. However, I am using an IV forest instead of a conventional causal forest, and the estimand in my case is the Average (Conditional) Local Average Treatment Effect (ACLATE):

average_treatment_effect(ivforest, target.sample = "all")

The warning message also indicated poor overlap, but I can't switch to estimating the effect on the treated, because that is not supported for IV forests. The error message is as follows:

Error in average_treatment_effect(ivforest, target.sample = "treated") :
For any forest type other than causal_forest, the only implemented option is method=AIPW and target.sample=all

I am eager to know how to handle this situation, because the IV forest is essential to my research paper. Thank you very much!

Warm regards,
David

@swager
Member

swager commented May 15, 2022

The first thing I'd do in a situation like this is to make a histogram of the estimated compliance scores, i.e., estimates of Delta(x) = P[W = 1 | Z = 1, X = x] - P[W = 1 | Z = 0, X = x]. Are there any regions of X-space where these are reliably away from 0? If yes, then focusing on those regions can help; if no, then you have a weak-IV problem, and using AIPW for the ACLATE is simply not going to work.
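As a sketch of that diagnostic, here is a base-R toy example that estimates compliance scores using simple stratum means in place of a forest. All variable names and numbers below are hypothetical.

```r
# Toy compliance-score estimate: Delta(x) = P[W=1 | Z=1, x] - P[W=1 | Z=0, x],
# computed by stratum means (a forest would be used in practice).
set.seed(1)
n <- 2000
x <- sample(c("a", "b"), n, replace = TRUE)   # one discrete covariate
z <- rbinom(n, 1, 0.5)                         # binary instrument
# Compliance is strong in stratum "a" and weak in stratum "b".
p <- ifelse(x == "a", 0.1 + 0.7 * z, 0.45 + 0.05 * z)
w <- rbinom(n, 1, p)

delta.hat <- sapply(c("a", "b"), function(s) {
  mean(w[x == s & z == 1]) - mean(w[x == s & z == 0])
})
delta.hat  # roughly c(a = 0.7, b = 0.05)
# A histogram of per-observation compliance estimates (hist(...)) would show
# whether any region of X-space is reliably away from 0.
```

Here stratum "a" would be a region where the (A)LATE is well identified, while stratum "b" would exhibit the weak-IV problem described above.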

Algorithmically, one thing you can do to focus on well-identified regions is to use the `subset` argument to restrict attention to regions of X-space where treatment effects are well identified. For example, in the case of ATE estimation under unconfoundedness, the following paper considers subsetting to regions of X-space where the propensity scores don't get too close to 0 or 1:

Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1), 187-199.

The important thing to remember in doing so is that your subsetting should only be a function of X, not of Z / W / Y.
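For example, overlap-based trimming along these lines might look like the sketch below. The vector `e.hat` is hypothetical, and the 0.05/0.95 cutoffs are the commonly cited rule of thumb associated with Crump et al. (2009), to be checked against your own setting.

```r
# Hypothetical fitted propensity scores, one per observation.
e.hat <- c(0.001, 0.30, 0.65, 0.999, 0.50)

# Keep only the region with acceptable overlap. Crucially, this mask depends
# on X only (through the fitted propensities), not on Z, W, or Y.
overlap <- e.hat > 0.05 & e.hat < 0.95
which(overlap)  # observations retained: 2, 3, 5
```

In grf, a logical mask like this can then be passed as the subset argument of average_treatment_effect().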

@minhengw

Dear Dr. Swager,

Thanks! But I am not sure whether, in the case of an IV forest, I should use a specific range of Z.hat or of W.hat to define the subsample, and then pass it as subset to average_treatment_effect().

Many thanks.
David
