
Account for hierarchical structure #178

Closed
lminer opened this issue Feb 5, 2018 · 15 comments

@lminer
Contributor

lminer commented Feb 5, 2018

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

@lminer
Contributor Author

lminer commented Feb 5, 2018

@jtibshirani I'd be happy to give this a try. Do you know of any paper/other repo that might give me a sense about how I might need to alter the sampling strategy after passing in the requisite information?

@swager
Member

swager commented Feb 6, 2018

Thanks, @lminer! First, a few thoughts as preliminaries:

  • The point estimates given by the current GRF implementation should be fine even with hierarchical structure; however, the confidence intervals clearly need to be adjusted for failure of the IID assumption, like you said.
  • As described in Section 5.1 here (https://arxiv.org/pdf/1610.01271v3.pdf), the core idea of our confidence intervals is built around a half-sampling variance estimator. In fact, the estimator (51) is just pure half-sampling (but is computationally intractable); then (52) develops a computationally tractable "bootstrap of little bags" approximation to (51) following Sexton and Laake (2009).
  • To account for non-IID data, the half-sampling behind the computationally infeasible optimal estimator (51) needs to be changed in the usual way (e.g., half-sample towns rather than individuals); however, the "little bags" approximation and resulting Monte Carlo correction can be left unchanged.
  • Aside: the simplest example of this Monte Carlo correction in the code is in RegressionPredictionStrategy::compute_variance in this file.

So anyways, noting these preliminaries, you can think of the forest as just doing a half-sampling bootstrap (along with some tricks that make it computationally tractable, but that don't matter from the perspective of the IID/non-IID question). Thus, to make the CIs robust to non-IID data, we just need to make the "usual" modification to half-sampling bootstrap, i.e., cluster similar items during half-sampling.

The way we implement the bootstrap of little bags in the code is that we train trees in groups of size ci_group_size, and each of these tree groups only uses data sampled from the same half sample. You can see these half-samples being generated here. At the very least, you'll need to modify this line (and pipe down the required group information needed to do so). I'll let @jtibshirani chime in in case there's anything else we'd need to worry about?

@lminer
Contributor Author

lminer commented Feb 15, 2018

@swager thanks for such a detailed exposition. Seems fairly straightforward if we're only clustering at one level. Do you have a sense of what the procedure would be if we have multiple levels that are nested? I'm assuming non-nested multi-way clustering is complicated and I shouldn't bother with that.

@nredell

nredell commented Feb 16, 2018

@lminer, I have not looked at the code for either of these R packages, so I can't speak to bias or CI coverage, but the articles were good reads. The REEMtree package handles a variety of nested or clustered relationships, both cross-sectional and longitudinal: https://cran.r-project.org/web/packages/REEMtree/index.html.

And glmertree handles nested data if isolating a treatment effect across subgroups is what you're after: https://cran.r-project.org/web/packages/glmertree/index.html

@lminer
Contributor Author

lminer commented Feb 19, 2018

Still trying to find an authoritative article on this. The only thing I can find about bootstrap standard errors when there are multiple levels is this. It suggests that if you had two levels, like city and school, you would first randomly sample a city, then randomly sample a school from within that city, and use that as your unit of sampling. Does this seem right?

@lminer
Contributor Author

lminer commented Feb 23, 2018

@swager, @jtibshirani I propose implementing the following.

We'll add two extra arguments

  • clusters: (optional argument) a vector of numbers or factors, specifying the clusters
  • samples_per_cluster: (optional argument) the number of observations to be sampled from each sampled cluster

We're only going to implement clustering at a single level. To alleviate issues associated with clusters of wildly different sizes, the clustering will work as follows. If samples_per_cluster is not specified, we will set it equal to the size of the smallest cluster. When sampling for the bootstrap, we will half-sample by clusters. Rather than taking all observations in a sampled cluster, we will take a sample of size samples_per_cluster from each selected cluster.
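A minimal sketch of this procedure (illustrative only; `sample_clusters` and its signature are invented here, and `samples_per_cluster == 0` stands in for "not specified"):

```cpp
// Sketch of the proposal: half-sample clusters, then draw samples_per_cluster
// observations without replacement from each selected cluster. If
// samples_per_cluster is 0 (unspecified), default to the smallest cluster size.
#include <algorithm>
#include <map>
#include <random>
#include <vector>

std::vector<size_t> sample_clusters(const std::vector<int>& cluster_ids,
                                    size_t samples_per_cluster,
                                    std::mt19937& rng) {
  // Group observation indices by cluster label.
  std::map<int, std::vector<size_t>> clusters;
  for (size_t i = 0; i < cluster_ids.size(); ++i)
    clusters[cluster_ids[i]].push_back(i);

  // Default: the size of the smallest cluster.
  if (samples_per_cluster == 0) {
    samples_per_cluster = clusters.begin()->second.size();
    for (const auto& c : clusters)
      samples_per_cluster = std::min(samples_per_cluster, c.second.size());
  }

  // Half-sample at the cluster level.
  std::vector<int> labels;
  for (const auto& c : clusters) labels.push_back(c.first);
  std::shuffle(labels.begin(), labels.end(), rng);
  labels.resize(labels.size() / 2);

  // Within each selected cluster, subsample without replacement.
  std::vector<size_t> sample;
  for (int label : labels) {
    std::vector<size_t> members = clusters[label];
    std::shuffle(members.begin(), members.end(), rng);
    members.resize(std::min(samples_per_cluster, members.size()));
    sample.insert(sample.end(), members.begin(), members.end());
  }
  return sample;
}
```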

One question. Does it make sense to provide this option for all the forests in the package: instrumental, causal, quantile, regression? Yes, right?

@swager
Member

swager commented Feb 24, 2018

Sounds great, thanks! One minor thing: How about calling the second argument samples_per_cluster instead? And yes, we should provide the option to all the forest types.

@lminer
Contributor Author

lminer commented Feb 27, 2018

@swager Digging into this more, I see that there are several different sampling options.

  • bootstrap
  • bootstrap without replacement
  • bootstrap weighted
  • bootstrap weighted without replacement

Do I need to implement each of these options for the hierarchical use case or is plain bootstrap enough?

Also, after I've chosen the clusters, how should I sample observations from the clusters? With replacement or without replacement?

@lminer
Contributor Author

lminer commented Mar 5, 2018

@swager I think I've got the basic implementation down for training. Do I need to make any adjustments for prediction?

@swager
Member

swager commented Mar 5, 2018

Great! Unless I'm missing something, I think prediction should be OK as already implemented. As to your previous point (sorry for missing it earlier), everything in GRF so far runs on bootstrap sampling without replacement, so that's the most important case.

@lminer
Contributor Author

lminer commented Mar 7, 2018

@swager that makes it easier. Last two questions. Should I sample from within the clusters with or without replacement? For the oob sample, do I include observations from sampled clusters that haven't themselves been sampled?

@swager
Member

swager commented Mar 7, 2018

It's most consistent with the rest if all sampling is without replacement (including within clusters). For the OOB sample, we should only include observations from clusters that haven't been used at all (because an OOB sample is supposed to be independent from the tree that was grown; and, in case of cluster-wide correlations, a sample may be correlated with a tree prediction whenever the tree used a training example from the same cluster as the sample).

Finally, for the same reason: In the case of subsample splitting for honesty, we should make sure that we split the subsample along cluster boundaries (i.e., all samples from the same cluster end up in the same half).

@lminer
Contributor Author

lminer commented Mar 7, 2018

Got it. For the OOB sample, do we include all observations in a cluster or just samples_per_cluster?

@lminer
Contributor Author

lminer commented Mar 7, 2018

@swager now that we need to subsample along cluster boundaries, I have another few questions. Basically, where in the code do I need to make this change?

  • Do I only subsample along clusters after checking for honesty here in treetrainer.cpp, where we explicitly check for honesty?
  • Or do we also subsample along clusters before then, i.e., here as well?
  • Finally, do we only do this when calculating confidence intervals (tree.train called from train_ci_group), or should we also subsample along clusters for the point estimates?

@swager
Member

swager commented Mar 20, 2018

We should subsample along clusters in all cases during training, including those in the first two bullets.

Then, given that during training we did all subsampling along clusters, the prediction code should be able to run verbatim (including confidence intervals). The reason for this is that the uncertainty quantification at prediction time is driven by the sampling in the training phase, so if the sampling is already cluster-robust, the uncertainty quantification will also be cluster-robust.
