I am trying to understand the causal forest's behavior in the tails of the covariate distribution. I run simulations like the one below and often find that CF estimates are constant towards the tails (see plot), which means that the CF is biased there. In the simulation below I compare this to the naïve approach of fitting a separate random forest to each of Y0 and Y1 to predict the outcomes, and then obtaining tau as the difference in predictions \hat{Y1} - \hat{Y0}.
That approach seemingly does not suffer from the bias problem in the tails (but has larger variance throughout). Even in larger samples (e.g. n = 1e4) the problem persists.
Is there anything I can do to get a better fit in the tails? Thanks.
library(grf)
library(ranger)
## Simulate non-linear ps-scores and y-models
e.x  = function(x) 1 / (1 + exp(-3 * x))
mu.0 = function(x) sin(-1/2 - 4*x)
mu.1 = function(x) sin(1/2 + 4*x)
tau  = function(x) mu.1(x) - mu.0(x)
m.x  = function(x) mu.0(x) + e.x(x) * tau(x)
## Sample training data
set.seed(2023)
n = 1000
sd.y = 0.6
Z = rnorm(n, mean=0, sd = 0.3)
y0 = mu.0(Z) + rnorm(n, sd = sd.y)
y1 = mu.1(Z) + rnorm(n, sd = sd.y)
e = e.x(Z)
W = rbinom(n, 1, e)
y = W*y1 + (1-W) * y0
df = data.frame(Z = Z, y, y0, y1, e, W = factor(W), W.num = W)
## Estimate causal forest with tuning on
cf = causal_forest(X = cbind(df$Z), Y = df$y, W = df$W.num, num.trees = 2e3, tune.parameters = 'all')
## Estimate heterogeneity using conditional-mean random forests
rf0 = ranger(y ~ Z, data = df[df$W.num==0,], num.trees = 5e3 )
rf1 = ranger(y ~ Z, data = df[df$W.num==1,], num.trees = 5e3 )
## Create test data
Z.test = seq(-1,1,0.01)
df.test = data.frame(Z = Z.test)
tau.test = tau(Z.test)
tau.test.cf = predict(cf, newdata = df.test)[,1]
tau.test.cdmrf = predict(rf1, data= df.test)$predictions - predict(rf0, data= df.test)$predictions
## Compare fits
par(mfrow=c(1,2))
plot(Z.test,tau.test, ylim=c(-3,3),ty='l', main='Predictions vs true effect')
lines(Z.test,tau.test.cf,col=2)
lines(Z.test,tau.test.cdmrf,col=4)
plot(Z.test,tau.test.cf-tau.test, ylim=c(-3,3),col=2, ty='l', main= 'Error')
lines(Z.test,tau.test.cdmrf-tau.test,col=4)
abline(h=0)
legend('bottomright', legend = c('Causal Forest', 'Standard Forests'),lty=c(1,1),col=c(2,4))
thomasklausch2 changed the title from "Performance of Causal Forests in the tails of the covariate space" to "Performance of Causal Forests in the tails of the covariate distribution" on Dec 5, 2022.
Hi @thomasklausch2, you could try a local linear forest, i.e. replace tau.test.cf with tau.test.cf = predict(cf, newdata = df.test, linear.correction.variables = 1)[,1].
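To make the suggestion concrete, here is a minimal sketch of the local linear prediction dropped into the simulation above (the argument takes the column indices of X to correct on; here X has a single column, so the index is 1):

```r
## Local linear prediction: fits a ridge-regularized linear correction
## within each test point's forest neighborhood, which can reduce the
## boundary bias visible in the tails.
tau.test.llcf = predict(cf,
                        newdata = df.test,
                        linear.correction.variables = 1)[, 1]

## Overlay on the earlier comparison plot
lines(Z.test, tau.test.llcf, col = 3)
```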
Thanks! That indeed solves the issue to some extent. Is this the same thing that has been called 'local centering' in the literature? (e.g. here on p.142)
EDIT: oh no, maybe it's not the same, because your addition only changes the prediction step, not the estimator? I understand from the reference that local centering means replacing Y_i with Y_i - mu(X_i) and D_i with D_i - e(X_i) before estimation. Is that a different feature of causal_forest?
What you are highlighting here is "boundary bias", which is actually quite normal; the local linear prediction applies a correction that may help. (Note that the question of what constitutes a boundary becomes tricky as the dimension grows, and correction methods like these will then have a hard time.)
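As a side note on the local-centering question above: that orthogonalization is a separate feature and is on by default in causal_forest, which internally fits regression forests for E[Y|X] and E[W|X] and then works with the residuals. A sketch making the centering explicit (up to tuning, this should match the default behavior):

```r
## Explicit local centering, as causal_forest does internally by default:
## estimate the marginal outcome and propensity models with regression
## forests, then pass the fitted values in via Y.hat and W.hat.
Y.forest = regression_forest(X = cbind(df$Z), Y = df$y)
Y.hat    = predict(Y.forest)$predictions   # out-of-bag estimates of E[Y|X]
W.forest = regression_forest(X = cbind(df$Z), Y = df$W.num)
W.hat    = predict(W.forest)$predictions   # out-of-bag estimates of E[W|X]

cf.centered = causal_forest(X = cbind(df$Z), Y = df$y, W = df$W.num,
                            Y.hat = Y.hat, W.hat = W.hat,
                            num.trees = 2e3)
```

So local centering addresses confounding-induced bias in the estimator, while the local linear correction addresses boundary bias at prediction time; they are complementary.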