Performance of Causal Forests in the tails of the covariate distribution #1246

Closed
thomasklausch2 opened this issue Dec 5, 2022 · 3 comments

thomasklausch2 commented Dec 5, 2022

I am trying to understand how causal forests behave in the tails of the covariate distribution. In simulations like the one below, I often find that the CF estimates flatten out to a constant towards the tails (see plot), so they are biased there. For comparison, I also use the naïve approach of fitting separate random forests for Y0 and Y1 and estimating tau as the difference in predictions, \hat{Y1}-\hat{Y0}.
That approach does not seem to suffer from the bias in the tails (but has larger variance throughout). The problem seems to persist in larger samples (e.g. n=1e4).
Is there anything I can do to get a better fit in the tails? Thanks.

library(grf)
library(ranger)

## Simulate non-linear ps-scores and y-models
e.x = function(x) 1/ (1 + exp(- ( 3 * x) ))
mu.0 = function(x) sin(-1/2- 4*x)
mu.1 = function(x) sin(1/2 + 4*x)
tau  = function(x) (mu.1(x) - mu.0(x))
m.x = function(x) mu.0(x) + e.x(x) * tau(x)

## Sample training data
set.seed(2023)
n = 1000
sd.y = 0.6
Z = rnorm(n, mean=0, sd = 0.3)
y0 = mu.0(Z) + rnorm(n, sd = sd.y)
y1 = mu.1(Z) + rnorm(n, sd = sd.y)
e  = e.x(Z)
W  = rbinom(n, 1, e)
y  = W*y1 + (1-W) * y0
df = data.frame(Z = Z, y, y0, y1, e, W = factor(W), W.num = W)

## Estimate causal forest with tuning on
cf = causal_forest(X = cbind(df$Z), Y = df$y, W = df$W.num, num.trees = 2e3, tune.parameters = 'all')

## Estimate heterogeneity using conditional-mean random forests (one per treatment arm)
rf0 = ranger(y ~ Z, data = df[df$W.num==0,], num.trees = 5e3 )
rf1 = ranger(y ~ Z, data = df[df$W.num==1,], num.trees = 5e3 )

## Create test data
Z.test  = seq(-1,1,0.01)
df.test = data.frame(Z = Z.test)
tau.test = tau(Z.test)
tau.test.cf = predict(cf, newdata = df.test)[,1]
tau.test.cdmrf = predict(rf1, data= df.test)$predictions - predict(rf0, data= df.test)$predictions

## Compare fits
par(mfrow=c(1,2))
plot(Z.test,tau.test, ylim=c(-3,3),type='l', main='Predictions vs true effect')
lines(Z.test,tau.test.cf,col=2)
lines(Z.test,tau.test.cdmrf,col=4)
plot(Z.test,tau.test.cf-tau.test, ylim=c(-3,3),col=2, type='l', main= 'Error')
lines(Z.test,tau.test.cdmrf-tau.test,col=4)
abline(h=0)
legend('bottomright', legend = c('Causal Forest', 'Standard Forests'),lty=c(1,1),col=c(2,4))
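
To put a number on the flattening seen in the plot, the error can be summarised separately for the centre and the tails of the test grid. The sketch below is a minimal base-R helper; `rmse_by_region` and the cutoff of 0.6 (two standard deviations of Z in the simulation above) are hypothetical choices for illustration, not part of grf. The toy input mimics an estimator that is exact in the centre and constant beyond the cutoff.

```r
# Hypothetical helper (not part of grf): RMSE of an effect estimate
# inside and outside the region |z| <= cut.
rmse_by_region <- function(z, est, truth, cut) {
  stopifnot(length(z) == length(est), length(z) == length(truth))
  err2 <- (est - truth)^2
  center <- abs(z) <= cut
  c(center = sqrt(mean(err2[center])), tails = sqrt(mean(err2[!center])))
}

# Toy check: the true effect from the simulation, tau(x) = mu.1(x) - mu.0(x)
# = 2 * sin(1/2 + 4 * x), and an "estimate" that is frozen beyond |z| > 0.6.
z <- seq(-1, 1, by = 0.01)
truth <- 2 * sin(1/2 + 4 * z)
flat_est <- 2 * sin(1/2 + 4 * pmax(pmin(z, 0.6), -0.6))

res <- rmse_by_region(z, flat_est, truth, cut = 0.6)
print(res)
```

With the objects from the script above, the same helper could be called as `rmse_by_region(Z.test, tau.test.cf, tau.test, cut = 0.6)` and again with `tau.test.cdmrf` to compare the two approaches region by region.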

@thomasklausch2 thomasklausch2 changed the title Performance of Causal Forests in the tails of the covariate space Performance of Causal Forests in the tails of the covariate distribution Dec 5, 2022
Member

erikcs commented Dec 6, 2022

Hi @thomasklausch2, you could try local linear prediction, i.e. replace the tau.test.cf line with tau.test.cf = predict(cf, newdata = df.test, linear.correction.variables = 1)[,1].
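
A self-contained sketch of that suggestion, assuming the grf package is installed: it re-simulates data with the same design as the original post (the observed outcome is written directly as mu.0 + W * tau + noise), fits a causal forest without tuning to keep it short, and compares default and local linear predictions in the tails. The tail cutoff of 0.6 is an arbitrary choice for illustration, and whether the correction actually helps will depend on the data.

```r
library(grf)

set.seed(2023)
n <- 1000
Z <- rnorm(n, sd = 0.3)
tau <- function(x) 2 * sin(1/2 + 4 * x)      # equals mu.1(x) - mu.0(x) from the post
W <- rbinom(n, 1, 1 / (1 + exp(-3 * Z)))     # same propensity model as e.x
Y <- sin(-1/2 - 4 * Z) + W * tau(Z) + rnorm(n, sd = 0.6)

X <- matrix(Z, ncol = 1)
cf <- causal_forest(X, Y, W, num.trees = 2000)

X.test <- matrix(seq(-1, 1, by = 0.01), ncol = 1)
tau.hat.default <- predict(cf, newdata = X.test)$predictions
# Local linear prediction, correcting along the first (and only) covariate
tau.hat.ll <- predict(cf, newdata = X.test,
                      linear.correction.variables = 1)$predictions

# RMSE restricted to the tails of the test grid (cutoff is an assumption)
tails <- abs(X.test[, 1]) > 0.6
tail_rmse <- function(est) sqrt(mean(((est - tau(X.test[, 1]))[tails])^2))
print(c(default = tail_rmse(tau.hat.default), local.linear = tail_rmse(tau.hat.ll)))
```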

Author

thomasklausch2 commented Dec 6, 2022

Thanks! That indeed solves the issue to some extent. Is this the same thing that has been called 'local centering' in the literature? (e.g. here on p.142)

EDIT: oh no, maybe it's not the same, because your addition is applied only at prediction time and not during estimation? As I understand the reference, local centering means replacing Y_i by Y_i - mu(X_i) and D_i by D_i - e(X_i) before estimation. Is that a different feature of causal_forest?
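
For reference on this question: as far as the grf documentation describes it, causal_forest takes Y.hat and W.hat arguments (estimates of E[Y | X] and E[W | X]) and, when they are left as NULL, estimates them with separate regression forests before growing the causal forest on the residualised data. That corresponds to the centering step described here and is separate from the prediction-time linear.correction.variables option. The sketch below, assuming grf is installed, passes the nuisance estimates explicitly so the centering is visible; the simulated data and sample size are illustrative.

```r
library(grf)

set.seed(1)
n <- 500
X <- matrix(rnorm(n, sd = 0.3), ncol = 1)
W <- rbinom(n, 1, 1 / (1 + exp(-3 * X[, 1])))
Y <- sin(-1/2 - 4 * X[, 1]) + W * 2 * sin(1/2 + 4 * X[, 1]) + rnorm(n, sd = 0.6)

# Out-of-bag nuisance estimates: m(x) = E[Y | X = x] and e(x) = E[W | X = x]
Y.hat <- predict(regression_forest(X, Y))$predictions
W.hat <- predict(regression_forest(X, W))$predictions

# Supplying them explicitly; with Y.hat = NULL and W.hat = NULL grf fits
# comparable regression forests itself.
cf <- causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat, num.trees = 1000)

# The centered quantities Y_i - m(X_i) and W_i - e(X_i)
centered <- cbind(Y.centered = Y - cf$Y.hat, W.centered = W - cf$W.hat)
print(head(centered))
```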

Member

erikcs commented Dec 6, 2022

What you are highlighting here is "boundary bias", which is quite normal; the local linear prediction applies a correction that may help. (Note that what constitutes a boundary becomes tricky to define when the dimension gets high, and correction methods like these will have a hard time there.)
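
Boundary bias can be illustrated with a simple kernel analogy in base R. This is not grf's algorithm, only a sketch of the general phenomenon: at the edge of the support, a locally weighted average only sees data on one side and is pulled towards the interior, while a locally weighted linear fit adjusts for the slope. The linear signal, Gaussian kernel, and bandwidth below are assumptions chosen for illustration.

```r
set.seed(1)
n <- 500
x <- runif(n)                        # covariate supported on [0, 1]
y <- 2 * x + rnorm(n, sd = 0.1)      # linear signal with a little noise

x0 <- 1                              # evaluation point on the boundary
h <- 0.2                             # bandwidth (assumption)
w <- dnorm((x - x0) / h)             # kernel weights centered at x0

# Local constant estimate: weighted average of nearby outcomes
local_constant <- sum(w * y) / sum(w)

# Local linear estimate: intercept of a weighted regression centered at x0
local_linear <- unname(coef(lm(y ~ I(x - x0), weights = w))[1])

truth <- 2 * x0
print(c(truth = truth, local.constant = local_constant, local.linear = local_linear))
```

In this toy setting all observations lie to the left of x0, so the weighted average falls below the true value, whereas the local linear fit extrapolates along the estimated slope.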
