Address poor performance of honest forests on small datasets. #273
Comments
Just flagging that I'm currently working with @rugilmartin to allow users to select the fraction of data that should be used for splitting.
What has been done so far to allow users to select the fraction of data that should be used for honest splitting:
I am a bit worried about the lack of backward compatibility, so I hope that the error message is clear. Also, the help files that come up in R should clarify that there was a change.
For testing that honesty.fraction is working, we could count the number of observations used in the training and estimation sets in the forest objects that come out. We could also look at the depth of the trees: the training sample size together with the minimum leaf size constraint implies a maximum depth (but not a minimum depth, which depends on the splitting rule).
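(As a rough illustration of that depth bound, here is a sketch assuming, hypothetically, that every child of a split must contain at least `min.node.size` observations; grf's actual constraint differs in detail.)

```r
# Rough worst-case depth bound: along any root-to-leaf path, each split
# sheds at least min.node.size observations into a sibling subtree, and
# the final leaf itself holds at least min.node.size observations.
# So: n.split >= depth * min.node.size + min.node.size.
max_depth_bound <- function(n.split, min.node.size) {
  floor(n.split / min.node.size) - 1
}

# E.g. 250 observations available for splitting, minimum leaf size 5:
max_depth_bound(n.split = 250, min.node.size = 5)  # 49
```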
@susanathey On counting the number of observations used in each set: good idea. Currently, the trees do not store this information, but I'm working on having them save it during training.
@rugilmartin @halflearned thank you for tackling this and for the detailed write-up! I think that we should have two parameters,
In your write-up above, I assume that part of the motivation is to avoid having a parameter that only makes sense if a separate flag is enabled? I'm not too concerned about this, as there's precedent for such a set-up. Lastly, I really like @susanathey's suggestion to test this directly by checking the number of samples against the provided honesty.fraction.
@jtibshirani and @susanathey thanks again for the feedback. @jtibshirani I'm convinced by your counterarguments! We'll leave both flags in then.
On forests that are:
erikcs@b59b694 allows pruning to be shut off. This simple example trains a causal forest with:

```r
library(grf)
# Note: n and the honesty/pruning settings were not shown in the original
# snippet; the values below are illustrative and were varied in the experiment.
n = 500                   # illustrative
honesty = TRUE            # illustrative
honesty.fraction = 0.9    # illustrative
prune = FALSE             # illustrative

p = 5
X = matrix(runif(n * p), nrow = n, ncol = p)
W.hat = rep(1/2, n)
Y.hat = rep(0, n)
W = rbinom(n, 1, 0.5)
sigma = function(x) 1 / (1 + exp(-4 * (x - 0.5)))
tau = sigma(X[, 1]) * sigma(X[, 2])
Y = (W - 1/2) * tau + rnorm(n)
cf = causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat,
                   honesty = honesty,
                   honesty.fraction = honesty.fraction,
                   prune = prune)
tau.hat = predict(cf)$predictions
mse = mean((tau - tau.hat)^2)
```

But there is nothing striking that sets pruning vs. no pruning apart (moreover, the adaptive forest is doing worse; this reverses with a different DGP).
Yeah, the honest forests are not so bad, especially in a causal setting! Adaptive forests only get an upside when there's a very strong signal. In the simulation below, though, adaptive causal forests do beat honest ones. Setting the honesty fraction to 0.9 helps, but not as much as one would hope. Does turning off pruning make a difference, and bring us closer to the adaptive forests? (Note that, as currently implemented, honesty.fraction counts the fraction of the data used for splitting, so the interesting case is making it large, not small.)

```r
library(grf)
n = 100; p = 4
sigma = function(x) 1 / (1 + exp(-3 * (x - 0.5)))
reps = replicate(100, {
  X = matrix(runif(n * p), nrow = n, ncol = p)
  W.hat = rep(1/2, n); Y.hat = rep(0, n)
  W = rbinom(n, 1, 0.5)
  tau = 6 * sigma(X[, 1]) * sigma(X[, 2])
  Y = (W - 1/2) * tau + rnorm(n)
  cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
  cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, honesty.fraction = 0.9)
  cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
  c(mean((predict(cf1)$predictions - tau)^2),
    mean((predict(cf2)$predictions - tau)^2),
    mean((predict(cf3)$predictions - tau)^2))
})
```

```
> rowMeans(reps)
[1] 0.5884138 0.4715652 0.3002721
```
Here it is with no pruning:

```r
cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, prune = FALSE)
cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, honesty.fraction = 0.9, prune = FALSE)
cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
```

```
> rowMeans(reps)
[1] 0.5821823 0.4395167 0.3153103

# With 10x more trees (20,000) in cf1 and cf2
> rowMeans(reps)
[1] 0.5607152 0.4139892 0.2972485
```
Interesting. OK, it seems like it clearly helps, at least a little. Want to open a PR so we can merge this in? Then we can talk about what we want to use for defaults. (In the PR, can you also add tests confirming it helps in the basic regression case?)
OK, I will open a PR, but it will take a short while, as the branch above is completely mangled from a merge with my open .Rproj PR (which made switching between branches that change C++ code faster).
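(A basic regression-case test along the lines requested above might look like the following sketch; the seed, the DGP, and the expectation that a larger honesty.fraction reduces MSE here are illustrative assumptions, not grf's actual test suite.)

```r
library(grf)

# Illustrative sketch: on a small regression problem, a forest given a
# larger honesty.fraction (more data for split selection) is expected to
# achieve a lower MSE on average than the default even split.
set.seed(42)
n <- 100; p <- 4
X <- matrix(runif(n * p), nrow = n, ncol = p)
mu <- 2 * X[, 1]           # true conditional mean
Y <- mu + rnorm(n)

rf.half  <- regression_forest(X, Y, honesty = TRUE)                          # default split
rf.large <- regression_forest(X, Y, honesty = TRUE, honesty.fraction = 0.9)  # more data for splitting

mse.half  <- mean((predict(rf.half)$predictions - mu)^2)
mse.large <- mean((predict(rf.large)$predictions - mu)^2)
# A test could assert mse.large < mse.half (averaged over seeds, since
# a single replication is stochastic).
```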
Currently a tuned forest #484 does well (honesty is always on by default):
For very small n there is little to do, as using a tree-based method in the first place is questionable.
Here are some results replicating the simulation I posted earlier: once with a small sample size and strong effects, and once with a very small sample size and very strong effects (relative to what's typical in a heterogeneous treatment effects problem). Overall, it seems like we're currently doing fine in the reasonable settings:

```r
library(grf)
sigma = function(x) 1 / (1 + exp(-3 * (x - 0.5)))

n = 400; p = 4
reps = replicate(25, {
  X = matrix(runif(n * p), nrow = n, ncol = p)
  W.hat = rep(1/2, n); Y.hat = rep(0, n)
  W = rbinom(n, 1, 0.5)
  tau = sigma(X[, 1]) * sigma(X[, 2])
  Y = (W - 1/2) * tau + rnorm(n)
  cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
  cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
  cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat,
                      honesty = TRUE, tune.parameters = TRUE)
  c(mean((predict(cf1)$predictions - tau)^2),
    mean((predict(cf2)$predictions - tau)^2),
    mean((predict(cf3)$predictions - tau)^2))
})
```

```
> rowMeans(reps)
[1] 0.03083259 0.13115172 0.03217602
```

```r
n = 100; p = 4
reps2 = replicate(25, {
  X = matrix(runif(n * p), nrow = n, ncol = p)
  W.hat = rep(1/2, n); Y.hat = rep(0, n)
  W = rbinom(n, 1, 0.5)
  tau = 6 * sigma(X[, 1]) * sigma(X[, 2])
  Y = (W - 1/2) * tau + rnorm(n)
  cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
  cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
  cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat,
                      honesty = TRUE, tune.parameters = TRUE)
  c(mean((predict(cf1)$predictions - tau)^2),
    mean((predict(cf2)$predictions - tau)^2),
    mean((predict(cf3)$predictions - tau)^2))
})
```

```
> rowMeans(reps2)
[1] 0.5346930 0.2705676 0.3404423
```
* Explain the new parameters `honesty.fraction` and `prune.empty.leaves`.
* Update the suggested mitigation strategy when training honest forests on a small sample.
* Clarify the parameter tuning behavior related to honesty.

Relates to #273.
Regarding this benchmark on data from the UCI database: here is a benchmark on the bike data (n = 2500 subsample) using "tuned" parameters for randomForest and ranger, from @mattschaelling's caret script (thanks):

grf does not come out on top.
We can get honest grf closer to the dishonest grf above by setting ci.group.size = 1, since this is only about prediction, and manually setting sample.fraction = 0.7, as this is not tuned but decreases the MSE. We also set mtry = 12, since there is not much point in making the tuning grid bigger with only 12 variables.

```r
grf <- regression_forest(X = datagrf[, -Yi], Y = datagrf[, Yi],
                         ci.group.size = 1,
                         mtry = 12,
                         sample.fraction = 0.7,
                         tune.parameters = TRUE)
print(rmse.grf.tuned.noCIs <- sqrt(mean((datagrf[, Yi] - predict(grf)$predictions)^2)))
#[1] 23.77796
```

prune.empty.leaves is correctly shut off by tuning, but no amount of tweaking brings the MSE on par with the dishonest forest below (though increasing the subsample from n = 2500 brings them closer).

Summary:
a) Should sample.fraction be tuned when ci.group.size = 1?
b) Tuning mtry when ncol(X) is tiny (like 12) seems meaningless.
The dishonest forest can be tweaked further to give an RMSE on par with the best randomForest and ranger:

```r
grf <- regression_forest(X = datagrf[, -Yi], Y = datagrf[, Yi],
                         honesty = FALSE, tune.parameters = TRUE,
                         sample.fraction = 0.75, ci.group.size = 1, mtry = 12)
print(rmse.grf.dishonest.tuned.noCIs <- sqrt(mean((datagrf[, Yi] - predict(grf)$predictions)^2)))
#[1] 21.63562
```
When honesty is enabled, the training subsample is further split in half before performing splitting. With small datasets, this may not leave enough information for the algorithm to determine high-quality splits.
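To make the arithmetic concrete, here is a small sketch assuming grf's default per-tree subsample of sample.fraction = 0.5 and the even honesty split described above:

```r
# Observations effectively available for choosing splits in one tree,
# assuming each tree draws a subsample (sample.fraction = 0.5 assumed)
# and honesty then sets aside half of that subsample for estimation.
n <- 200
n.subsample <- floor(0.5 * n)            # per-tree subsample: 100
n.splitting <- floor(0.5 * n.subsample)  # half left for split selection: 50
n.splitting  # only 50 of the original 200 observations guide the splits
```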
This issue is still pending a concrete proposal on how it should be addressed.