Address poor performance of honest forests on small datasets. #273

Closed · jtibshirani opened this issue Aug 4, 2018 · 18 comments
@jtibshirani (Member) commented Aug 4, 2018

When honesty is enabled, the training subsample is further split in half before splitting is performed. On small datasets, this may not leave enough information for the algorithm to determine high-quality splits.

This issue is still pending a concrete proposal on how it should be addressed.
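
To make the halving concrete, here is back-of-the-envelope arithmetic (illustrative only, not grf internals; grf's default sample.fraction = 0.5 is assumed):

  n <- 100                        # a small dataset
  subsample <- n * 0.5            # observations drawn per tree: 50
  splitting <- subsample * 0.5    # honest halving leaves 25 to choose splits
  splitting
  # [1] 25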

@halflearned (Member)

Just flagging that I'm currently working with @rugilmartin to allow users to select the fraction of data that should be used for splitting.

@rugilmartin (Contributor)

What has been done so far to allow users to select the fraction of data that should be used for honest splitting:

  • The honesty argument in every type of forest (causal, custom, quantile, instrumental, and regression) has been changed to honesty.fraction.
    • honesty: was a boolean argument, which defaulted to TRUE and determined whether or not honest splitting would be used.
    • honesty.fraction: is a double argument, which defaults to 0.5 and determines the fraction of the sample that is used for the training and cross-validation step in honest splitting. Passing a value less than or equal to 0 or greater than or equal to 1 turns off honest splitting and results in a forest of non-honest trees.
    • This change is made in all *_forest.R files, all *_tuning.R files, and all *ForestBindings.cpp files.
    • This change is also represented in the auto-generated RcppExports.R file.
    • This change is also represented in ForestOptions.cpp, ForestOptions.h, TreeOptions.cpp, TreeOptions.h, and TreeTrainer.cpp.
  • The new design passes all 145 existing testthat tests.
  • Advice on additional testing?
    • We don’t have any tests that deal with values other than 0.0 or 0.5 for honesty.fraction.
    • Only one existing test deals with honesty.fraction directly. This test used to set honesty = FALSE. It now sets honesty.fraction = 0.0. All other tests used the default value for honesty, so they now use the default value for honesty.fraction.
    • I’m not sure how to test that alternative values of honesty.fraction are performing as expected/desired, so any advice would be much appreciated!
    • One thought I had: if a test sets a value of honesty.fraction that is close to 0.0 or 1.0, we would expect larger confidence intervals, since in one case there isn’t much data available for training and cross-validation and in the other there isn’t much data available for prediction. Is my understanding here correct? And if so, does this sound like a good test?
  • The default value for honesty.fraction is the same as the previous fixed value for honest splitting. The behavior of the package is unchanged in cases where the default is used: honest splitting on, with half of the sample used for training and cross validation.
  • R uses partial string matching for function arguments, which can cause problems if a user attempts to call a forest using the old honesty argument (a sketch follows this list). That call might look like: regression_forest(..., honesty = FALSE).
    • R would interpret this as: regression_forest(..., honesty.fraction = FALSE), since there is no longer an argument called “honesty.”
    • Once honesty.fraction is passed from R to C++ by Rcpp, the value is coerced to a double and passed as an argument called honesty_fraction. In the case above, we get honesty_fraction = 0.0.
    • This does not pose a problem when a user attempts to use honesty = FALSE. In this case, we get honesty_fraction = 0.0 in C++, which turns off honest splitting and results in a forest of non-honest trees, as desired. This is because values of honesty.fraction less than or equal to 0 or greater than or equal to 1 turn off honest splitting.
    • However, this does pose a problem when a user attempts to set honesty = TRUE. In this case, we get honesty_fraction = 1.0 in C++, which turns off honest splitting and results in a forest of non-honest trees. However, since the user passed honesty = TRUE, they probably intended to use the old default value of 0.5 for honest splitting. So really, they meant honesty_fraction = 0.5.
  • @halflearned and I discussed the above issue with partial matching and came up with four possible solutions:
    1. Put honesty back in as a Boolean argument like before, in addition to honesty.fraction. This would give us two arguments related to honesty and we would have to resolve conflicts between the two somehow. This didn’t seem like a good option.
    2. Have honesty.fraction = TRUE (in R) be interpreted as honesty_fraction = 0.5 (in C++) and keep honesty.fraction = FALSE (in R) interpreted as honesty_fraction = 0.0 (in C++). This would allow users to continue to use their old grf code (with honesty = TRUE and honesty = FALSE behaving the same as they used to). However, this could be confusing for new users. It isn’t intuitive to have regression_forest(..., honesty = FALSE) mean the same thing as regression_forest(..., honesty.fraction = 0.0). And it isn’t intuitive to have regression_forest(..., honesty = TRUE) just be another way of using the default value of 0.5 for honesty.fraction.
    3. Same as option 2, but also raise a warning when a Boolean value is passed for honesty.fraction. This might be sufficient to prompt users to go back and fix their legacy grf code, but the warning might be overlooked by users who have large programs with many warnings that they are used to overlooking. This also doesn't address the confusion that option 2 might generate for new users.
    4. Don't change anything about the interpretation of the arguments, and raise an error when a Boolean value is passed for honesty.fraction. This requires a bit more work for users with legacy grf code because it forces them to replace any reference to honesty with a reference to honesty.fraction and the double value they want to use (0.0 for honesty = FALSE, 0.5 for honesty = TRUE). However, it has the advantage of guaranteeing that users are made aware of the change the first time they attempt to use honesty improperly. It also makes the use of honesty.fraction easy to understand (Boolean values don't need to be handled in weird, unintuitive ways because they simply aren't accepted).
  • We decided to go with option 4 above. As it stands, an error message is thrown if a user passes a Boolean value for honesty.fraction (either by writing honesty = TRUE/FALSE or by writing honesty.fraction = TRUE/FALSE). We think this is the right way to go for a few reasons:
    • It guarantees users are made aware of the change the first time they attempt to use honesty improperly.
    • It makes the use of honesty.fraction easy to understand. Boolean values don’t need to be handled in weird, unintuitive ways because they simply aren’t accepted.
    • The changes users will need to make to their existing grf code are minimal. In fact, since honesty was a Boolean, we can only imagine two possible changes a user would want to make: honesty = TRUE becomes honesty.fraction = 0.5 and honesty = FALSE becomes honesty.fraction = 0.0. These changes are obvious once a user reads the updated documentation.
    • Since grf is moving towards a 1.0 release, it isn’t critical to maintain backwards compatibility with versions <1.0.
    • This option standardizes the use of honesty.fraction going forward, which should make future updates easier. Continuing to accommodate Boolean values for honesty would quickly get confusing as the package grows and changes.
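
A minimal sketch of the partial-matching pitfall and the option-4 check (demo_forest is a hypothetical stand-in, not grf's actual implementation):

  # Hypothetical stand-in for a forest constructor with the new argument.
  demo_forest <- function(honesty.fraction = 0.5) {
    if (is.logical(honesty.fraction)) {
      stop("honesty.fraction must be a double; use honesty.fraction = 0.5 ",
           "instead of honesty = TRUE, and 0.0 instead of honesty = FALSE.")
    }
    honesty.fraction
  }

  demo_forest(honesty.fraction = 0.25)  # ok: returns 0.25
  # R's partial matching binds `honesty` to `honesty.fraction`, so without
  # the is.logical() check this call would silently pass TRUE (coerced to 1.0):
  try(demo_forest(honesty = TRUE))      # now raises the error above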

@susanathey (Collaborator)

I am a bit worried about the lack of backward compatibility, so I hope that the error message is clear. The help files that come up in R should also clarify that there was a change.

@susanathey (Collaborator)

For testing that honesty.fraction is working, we could count the number of observations used in the training and estimation sets in the forest objects that come out. We could also look at the depth of the trees: the training sample size, together with the minimum leaf size constraint, determines a maximum depth (but not a minimum depth -- that depends on the splitting rule).

@halflearned (Member)

@susanathey On counting the number of observations used in each set: good idea. Currently, the trees do not store this information, but I'm working on having them save it during training.

@jtibshirani (Member, Author) commented Sep 3, 2018

@rugilmartin @halflearned thank you for tackling this and for the detailed write-up!

I think that we should have two parameters, honesty and honesty.fraction, where honesty.fraction is only used if honesty = TRUE. There are a couple reasons for this choice:

  • We will want to include honesty.fraction in the set of parameters that we automatically tune. The current way that the user indicates a parameter should be tuned is by setting it to NULL. If a non-null value is provided, it indicates that the parameter value should be used as is, and not included in tuning. If honesty were enabled by setting honesty.fraction to a value in (0, 1), then there would be no way to indicate it should be tuned. (A sketch of this two-parameter logic follows this list.)
  • In general, it's best not to rely on a double value to indicate a distinct change in behavior. Floating-point values often carry small rounding errors, so a value intended to be exactly 0.0 can arrive as a tiny non-zero number and an equality check against 0.0 will silently fail. It's also not as clean conceptually that, depending on its value, the parameter switches from representing a fraction to representing a boolean flag.
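
A sketch of how the two parameters could interact (resolve_honesty is an illustrative helper, not grf's actual code; the NULL-means-tune convention follows the description above):

  resolve_honesty <- function(honesty = TRUE, honesty.fraction = NULL) {
    if (!honesty) {
      return(0.0)         # honesty off: the fraction is irrelevant
    }
    if (is.null(honesty.fraction)) {
      return(NA_real_)    # NULL marks the parameter for automatic tuning
    }
    stopifnot(honesty.fraction > 0, honesty.fraction < 1)
    honesty.fraction
  }

  resolve_honesty(honesty = TRUE, honesty.fraction = 0.7)  # 0.7, used as-is
  resolve_honesty(honesty = TRUE)                          # NA: to be tuned
  resolve_honesty(honesty = FALSE)                         # 0: honesty disabled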

In your write-up above, I assume that part of the motivation is to avoid having a parameter that only makes sense if a separate flag is enabled? I'm not too concerned about this, as there's precedent for such a set-up: clusters must be non-empty for samples_per_cluster, tune.parameters must be specified for num.fit.trees, etc.

Lastly, I really like @susanathey's suggestion to test this directly by checking the number of samples against the provided honesty.fraction. It should already be possible to count the number of examples used for splitting vs. prediction -- details can be found in #258. It might also be nice to have a test (either in this repo or in the grf simulations repo we're building out) that confirms changing the honesty fraction can improve forest performance on a small dataset.
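
A sketch of what such a test could look like, assuming the two-argument interface proposed above and tree accessors along the lines of those discussed in #258 (get_tree(), drawn.samples, and the per-node samples field are assumptions here, not settled API):

  library(grf)
  n <- 1000
  X <- matrix(runif(n * 4), n, 4)
  Y <- X[, 1] + rnorm(n)
  frac <- 0.3  # fraction of each subsample used for splitting

  forest <- regression_forest(X, Y, honesty = TRUE, honesty.fraction = frac,
                              sample.fraction = 0.5, num.trees = 50)
  tree <- get_tree(forest, 1)

  n.drawn <- length(tree$drawn.samples)  # full subsample drawn for this tree
  n.estimation <- length(unique(unlist(  # estimation (honesty) half
    lapply(tree$nodes, function(node) node$samples))))
  # The splitting portion should be roughly `frac` of the drawn subsample:
  abs((n.drawn - n.estimation) / n.drawn - frac) < 0.05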

@halflearned (Member)

@jtibshirani and @susanathey thanks again for the feedback.

@jtibshirani I'm convinced by your counterarguments! We'll leave both flags in then.

@erikcs (Member) commented Jul 17, 2019

Comparing forests that are:

  1. adaptive (honesty=FALSE)
  2. pruned honest (honesty=TRUE)
  3. non-pruned honest (new: prune = FALSE)

erikcs@b59b694 allows pruning to be shut off (in TreeTrainer::train(), pruning is skipped when doing repopulate_leaf_nodes).

This simple example trains a causal forest with:

  library(grf)
  n = 100  # or 1500; both sample sizes appear in the table below
  p = 5
  X = matrix(runif(n * p), nrow = n, ncol = p)
  W.hat = rep(1/2, n)
  Y.hat = rep(0, n)
  W = rbinom(n, 1, 0.5)
  sigma = function(x) 1 / (1 + exp(-4 * (x - 0.5)))
  tau = sigma(X[, 1]) * sigma(X[, 2])
  Y = (W - 1/2) * tau + rnorm(n)

  # honesty, honesty.fraction, and prune are set per the table rows below.
  cf = causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat,
                     honesty = honesty,
                     honesty.fraction = honesty.fraction,
                     prune = prune)
  tau.hat = predict(cf)$predictions
  mse = mean((tau - tau.hat)^2)
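
The table below aggregates this over repetitions. A sketch of the aggregation (the exact script is not shown here, so this wrapper is an assumption):

  # One MSE per repetition; report mean and Monte Carlo standard error.
  mses = replicate(1000, {
    X = matrix(runif(n * p), nrow = n, ncol = p)
    W = rbinom(n, 1, 0.5)
    tau = sigma(X[, 1]) * sigma(X[, 2])
    Y = (W - 1/2) * tau + rnorm(n)
    cf = causal_forest(X, Y, W, Y.hat = rep(0, n), W.hat = rep(1/2, n),
                       honesty = honesty,
                       honesty.fraction = honesty.fraction,
                       prune = prune)
    mean((tau - predict(cf)$predictions)^2)
  })
  c(mse = mean(mses), se = sd(mses) / sqrt(length(mses)))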

But there is nothing striking that sets pruning apart from no pruning (moreover, the adaptive forest is doing worse here... this reverses with a different DGP):

================================================================
   mse (n=100)  mse (n=1500)  honesty  honesty.fraction  prune
----------------------------------------------------------------
    0.17410      0.09738      FALSE     NULL              FALSE
   (0.00266)    (0.00050)

    0.07254      0.02023      TRUE      0.50000           TRUE
   (0.00176)    (0.00022)
    0.07501      0.01917      TRUE      0.25000           TRUE
   (0.00174)    (0.00020)
    0.07494      0.02178      TRUE      0.10000           TRUE
   (0.00174)    (0.00018)

    0.07281      0.02023      TRUE      0.50000           FALSE
   (0.00178)    (0.00022)
    0.07494      0.01916      TRUE      0.25000           FALSE
   (0.00174)    (0.00020)
    0.07494      0.02181      TRUE      0.10000           FALSE
   (0.00173)    (0.00018)
================================================================
(Monte Carlo standard errors in parentheses, from 1,000 repetitions)

@swager (Member) commented Jul 17, 2019

Yeah, the honest forests are not so bad, especially in a causal setting! Adaptive forests only get an upside when there's a very strong signal. In the simulation below, though, we do get that adaptive causal forests beat honest ones. Setting the honesty fraction to 0.9 helps, but not as much as one would hope. Does turning off pruning make a difference, and bring us closer to the adaptive forests? (Note that, as it's currently implemented, honesty.fraction counts the fraction of the data used for splitting -- so the interesting case is where we make it big, not small.)

library(grf)
n = 100; p = 4
sigma = function(x) 1 / (1 + exp(-3 * (x - 0.5)))

reps = replicate(100, {
    X = matrix(runif(n * p), nrow = n, ncol = p)
    W.hat = rep(1/2, n); Y.hat = rep(0, n)
    W = rbinom(n, 1, 0.5)
    tau = 6 * sigma(X[, 1]) * sigma(X[, 2])
    Y = (W - 1/2) * tau + rnorm(n)
    
    cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
    cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, honesty.fraction = 0.9)
    cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
    
    c(mean((predict(cf1)$predictions - tau)^2),
      mean((predict(cf2)$predictions - tau)^2),
      mean((predict(cf3)$predictions - tau)^2))
})

> rowMeans(reps)
[1] 0.5884138 0.4715652 0.3002721

@erikcs (Member) commented Jul 17, 2019

Here it is with no pruning:

  cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, prune = FALSE)
  cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE, honesty.fraction = 0.9, prune = FALSE)
  cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
  
> rowMeans(reps)
[1] 0.5821823 0.4395167 0.3153103

# With 10x more trees (20,000) in cf1 and cf2
> rowMeans(reps)
[1] 0.5607152 0.4139892 0.2972485

@swager (Member) commented Jul 18, 2019

Interesting. Ok, it seems like it clearly helps, at least a little. Want to open a PR so we can merge this in? Then we can talk about what we want to use for defaults. (In the PR, can you also add tests confirming it helps in the basic regression case?)

@erikcs (Member) commented Jul 18, 2019

Ok, I will open a PR, but it will take a short while, as the above branch is completely mangled from a merge with my open .Rproj PR (which made switching between branches that change C++ code faster).

@erikcs (Member) commented Aug 22, 2019

Currently, a tuned forest (#484) does well (honesty is always on by default):

  • On data where an adaptive forest is better: tuning shuts off prune.empty.leaves and adjusts honesty.fraction

    Further improvements can potentially be gained by increasing num.trees manually

  • On various data where pruning might/might not be beneficial: tuning adjusts prune.empty.leaves (and honesty.fraction) accordingly

For very small n there is little to do, as using a tree-based method in the first place is questionable.

@swager (Member) commented Aug 22, 2019

Here are some results replicating the simulation I posted earlier: once with a small sample size and strong effects, and once with a very small sample size and very strong effects (relative to what's typical in a heterogeneous treatment effects problem).

Overall, it seems like we're currently doing fine for the reasonable settings:

sigma = function(x) 1 / (1 + exp(-3 * (x - 0.5)))

n = 400; p = 4
reps = replicate(25, {
    X = matrix(runif(n * p), nrow = n, ncol = p)
    W.hat = rep(1/2, n); Y.hat = rep(0, n)
    W = rbinom(n, 1, 0.5)
    tau = sigma(X[, 1]) * sigma(X[, 2])
    Y = (W - 1/2) * tau + rnorm(n)
    
    cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
    cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
    cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat,
                        honesty = TRUE, tune.parameters = TRUE)
    
    c(mean((predict(cf1)$predictions - tau)^2),
      mean((predict(cf2)$predictions - tau)^2),
      mean((predict(cf3)$predictions - tau)^2))
})
> rowMeans(reps)
[1] 0.03083259 0.13115172 0.03217602

n = 100; p = 4
reps2 = replicate(25, {
    X = matrix(runif(n * p), nrow = n, ncol = p)
    W.hat = rep(1/2, n); Y.hat = rep(0, n)
    W = rbinom(n, 1, 0.5)
    tau = 6 * sigma(X[, 1]) * sigma(X[, 2])
    Y = (W - 1/2) * tau + rnorm(n)
    
    cf1 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = TRUE)
    cf2 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat, honesty = FALSE)
    cf3 = causal_forest(X, Y, W, W.hat = W.hat, Y.hat = Y.hat,
                        honesty = TRUE, tune.parameters = TRUE)
    
    c(mean((predict(cf1)$predictions - tau)^2),
      mean((predict(cf2)$predictions - tau)^2),
      mean((predict(cf3)$predictions - tau)^2))
})

> rowMeans(reps2)
[1] 0.5346930 0.2705676 0.3404423

jtibshirani added a commit that referenced this issue Aug 25, 2019
* Explain the new parameters `honesty.fraction` and `prune.empty.leaves`.
* Update the suggested mitigation strategy when training honest forests on a
  small sample.
* Clarify the parameter tuning behavior related to honesty.

Relates to #273.
@erikcs (Member) commented Sep 9, 2019

Regarding this benchmark on data from the UCI database:

Here is a benchmark on the bike data (n = 2500 subsample), using "tuned" parameters for randomForest and ranger from @mattschaelling's caret script (thanks):

rmse.rf.tuned                21.63832*
rmse.ranger.default          33.60643
rmse.ranger.tuned            23.19322
rmse.ranger.tuneRanger       21.97965* (3 parameters from their tuneRanger package)
rmse.grf.default             29.42733
rmse.grf.tuned               28.18083*
rmse.grf.tuned.10k.num.trees 28.42887
rmse.grf.tuned.tweaked       25.60646
rmse.grf.dishonest           23.21979*
rmse.grf.dishonest.tuned     23.46529

grf does not come out on top.

@erikcs (Member) commented Sep 9, 2019

We can get honest grf closer to the dishonest grf above by setting ci.group.size = 1 (since this benchmark is only about prediction) and by manually setting sample.fraction = 0.7 (this parameter is not tuned, but increasing it decreases the MSE).

We also set mtry = 12, since there is not much point in making the tuning grid larger when there are only 12 variables for mtry to choose from.

grf <- regression_forest(X = datagrf[, -Yi], Y = datagrf[, Yi],
                         ci.group.size = 1,
                         mtry = 12,
                         sample.fraction = 0.7,
                         tune.parameters = TRUE)
print(rmse.grf.tuned.noCIs <- sqrt(mean((datagrf[, Yi] - predict(grf)$predictions)^2)))
#[1] 23.77796

prune.empty.leaves is correctly shut off by tuning, but no amount of tweaking brings the MSE on par with the dishonest forest below (though increasing the subsample beyond n = 2500 brings them closer).

Summary: a) sample.fraction should perhaps be tuned when ci.group.size = 1; b) tuning mtry when ncol(X) is tiny (like 12) seems meaningless.

@erikcs (Member) commented Sep 9, 2019

The dishonest forest can be tweaked further to give an RMSE on par with the best randomForest and ranger results:

grf <- regression_forest(X = datagrf[, -Yi], Y = datagrf[, Yi], honesty = FALSE, tune.parameters = TRUE,
                         sample.fraction = 0.75, ci.group.size = 1, mtry = 12)
print(rmse.grf.dishonest.tuned.noCIs <- sqrt(mean((datagrf[, Yi] - predict(grf)$predictions)^2)))
#[1] 21.63562

@jtibshirani (Member, Author) commented Feb 21, 2020

This was addressed in the following PRs: #297, #456, #484, #496.
