Account for hierarchical structure #178

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs, where studies occur across administrative units like provinces, towns, school districts, etc.

Comments
@jtibshirani I'd be happy to give this a try. Do you know of any paper/other repo that might give me a sense about how I might need to alter the sampling strategy after passing in the requisite information?
Thanks, @lminer! First, a few thoughts as preliminaries: …

So anyway, noting these preliminaries, you can think of the forest as just doing a half-sampling bootstrap (along with some tricks that make it computationally tractable, but that don't matter from the perspective of the IID/non-IID question). Thus, to make the CIs robust to non-IID data, we just need to make the "usual" modification to the half-sampling bootstrap, i.e., cluster similar items during half-sampling. The way we implement the bootstrap of little bags in the code is that we train trees in groups of size …
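To make the clustered half-sampling concrete, here is a minimal R sketch (function and variable names are mine for illustration; this is not grf's actual internals):

```r
# Illustrative sketch: draw half of the *clusters*, then keep every row
# belonging to a drawn cluster, instead of drawing half of the rows directly.
cluster_half_sample <- function(clusters) {
  ids <- unique(clusters)
  drawn <- ids[sample.int(length(ids), floor(length(ids) / 2))]
  which(clusters %in% drawn)  # row indices available to grow this tree
}

# Example: 6 observations in 3 clusters; each draw keeps whole clusters.
clusters <- c("a", "a", "b", "b", "c", "c")
cluster_half_sample(clusters)
```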
@swager thanks for such a detailed exposition. Seems fairly straightforward if we're only clustering at one level. Do you have a sense of what the procedure would be if we have multiple levels that are nested? I'm assuming non-nested multi-way clustering is complicated and I shouldn't bother with that.
@lminer, I have not looked at the code for either of these R packages so I can't speak to bias or CI coverage, but the articles were good reads. The REEMtree package handles a variety of nested or clustered relationships, both cross-sectional and longitudinal: https://cran.r-project.org/web/packages/REEMtree/index.html. And glmertree handles nested data if isolating a treatment effect across subgroups is what you're after: https://cran.r-project.org/web/packages/glmertree/index.html
Still trying to find an authoritative article on this. The only thing that I can find about bootstrap standard errors when there are multiple levels is this. It suggests that if you had two levels, like city and school, you would first randomly sample a city, and then from within that city you would randomly sample a school and use that as your unit of sampling. Does this seem right?
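A hedged R sketch of that two-stage scheme (the one-school-per-city choice and all names here are illustrative assumptions, not a settled design):

```r
# Hypothetical two-stage draw for nested levels (city -> school), as described
# above: sample cities first, then sample a school within each drawn city.
resample <- function(x, n) x[sample.int(length(x), n)]  # avoids sample()'s scalar quirk

nested_sample <- function(city, school, city_frac = 0.5) {
  drawn_cities <- resample(unique(city), floor(length(unique(city)) * city_frac))
  idx <- integer(0)
  for (ct in drawn_cities) {
    drawn_school <- resample(unique(school[city == ct]), 1)
    idx <- c(idx, which(city == ct & school == drawn_school))
  }
  idx
}

# Example: two cities with two schools each.
city   <- c(1, 1, 1, 1, 2, 2, 2, 2)
school <- c(1, 1, 2, 2, 3, 3, 4, 4)
nested_sample(city, school)
```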
@swager, @jtibshirani I propose implementing the following. We'll add two extra arguments: …

We're only going to implement clustering at a single level. In order to alleviate issues associated with clusters of wildly different sizes, the clustering will work as follows: if … One question: does it make sense to provide this option for all the forests in the package (instrumental, causal, quantile, regression)? Yes, right?
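The equalization rule above is truncated in the thread; one plausible reading, as a hedged R sketch with hypothetical names (`samples_per_cluster` is my label, not a confirmed argument name), is to cap how many observations each drawn cluster contributes:

```r
# Hedged sketch of cluster-size equalization (names are hypothetical):
# within each drawn cluster, keep at most `samples_per_cluster` observations,
# so that very large clusters do not dominate the subsample.
resample <- function(x, n) x[sample.int(length(x), n)]

equalized_cluster_sample <- function(clusters, drawn_clusters, samples_per_cluster) {
  idx <- integer(0)
  for (cl in drawn_clusters) {
    rows <- which(clusters == cl)
    idx <- c(idx, resample(rows, min(length(rows), samples_per_cluster)))
  }
  idx
}

# Example: cluster "b" is much larger but contributes at most 2 rows.
clusters <- c("a", "a", "b", "b", "b", "b", "b")
equalized_cluster_sample(clusters, c("a", "b"), samples_per_cluster = 2)
```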
Sounds great, thanks! One minor thing: how about calling the second argument …
@swager Digging into this more, I see that there are several different sampling options: …

Do I need to implement each of these options for the hierarchical use case, or is plain bootstrap enough? Also, after I've chosen the clusters, how should I sample observations from the clusters: with replacement or without replacement?
@swager I think I've got the basic implementation down for training. Do I need to make any adjustments for prediction?
Great! Unless I'm missing something, I think prediction should be OK as already implemented. As to your previous point (sorry for missing it earlier), everything in GRF so far runs on bootstrap sampling without replacement, so that's the most important case.
@swager that makes it easier. Last two questions: should I sample from within the clusters with or without replacement? For the OOB sample, do I include observations from sampled clusters that haven't themselves been sampled?
It's most consistent with the rest if all sampling is without replacement (including within clusters). For the OOB sample, we should only include observations from clusters that haven't been used at all, because an OOB sample is supposed to be independent of the tree that was grown; in the case of cluster-wide correlations, a sample may be correlated with a tree prediction whenever the tree used a training example from the same cluster as that sample. Finally, for the same reason: in the case of subsample splitting for honesty, we should make sure that we split the subsample along cluster boundaries (i.e., all samples from the same cluster end up in the same half).
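A minimal R sketch of those two rules, assuming cluster-level draws as in the earlier sketches (helper names are mine, not grf's internals):

```r
resample <- function(x, n) x[sample.int(length(x), n)]

# OOB rows: only rows whose cluster was never drawn for this tree.
oob_rows <- function(clusters, drawn_clusters) {
  which(!(clusters %in% drawn_clusters))
}

# Honesty: split the drawn *clusters* (not rows) in half, so all rows from a
# given cluster land on the same side of the split.
honest_split <- function(drawn_clusters) {
  split_half <- resample(drawn_clusters, floor(length(drawn_clusters) / 2))
  list(splitting = split_half, estimation = setdiff(drawn_clusters, split_half))
}

clusters <- c("a", "a", "b", "b", "c", "c", "d", "d")
drawn <- c("a", "c")
oob_rows(clusters, drawn)  # rows from clusters b and d only
honest_split(drawn)
```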
Got it. For the OOB sample, do we include all observations in a cluster or just …
@swager now that we need to subsample along cluster boundaries, I have another few questions. Basically, where in the code do I need to make this change? …
We should subsample along clusters in all cases during training, including those in the first two bullets. Then, given that during training we did all subsampling along clusters, the prediction code should be able to run verbatim (including confidence intervals). The reason for this is that the uncertainty quantification at prediction time is driven by the sampling in the training phase, so if the sampling is already cluster-robust, the uncertainty quantification will also be cluster-robust.
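For context, a hypothetical end-to-end usage once such an argument lands in the R API (the `clusters` name follows the proposal above but is not finalized in this thread):

```r
# Hypothetical usage, assuming a `clusters` argument on the forest trainers
# (name taken from the proposal above; treat it as illustrative).
library(grf)

n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
cluster_ids <- rep(1:20, each = 10)              # 20 clusters of 10 observations
Y <- X[, 1] + rnorm(20)[cluster_ids] + rnorm(n)  # add a cluster-level random effect

forest <- regression_forest(X, Y, clusters = cluster_ids)
preds <- predict(forest, estimate.variance = TRUE)  # OOB predictions with variances
```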