
Account for hierarchical structure #178

Closed
lminer opened this issue Feb 5, 2018 · 15 comments

@lminer
Contributor

lminer commented Feb 5, 2018

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

@lminer
Contributor Author

lminer commented Feb 5, 2018

@jtibshirani I'd be happy to give this a try. Do you know of any paper/other repo that might give me a sense about how I might need to alter the sampling strategy after passing in the requisite information?

@swager
Member

swager commented Feb 6, 2018

Thanks, @lminer! First, a few thoughts as preliminaries:

  • The point estimates given by the current GRF implementation should be fine even with hierarchical structure; however, the confidence intervals clearly need to be adjusted for failure of the IID assumption, like you said.
  • As described in Section 5.1 here (https://arxiv.org/pdf/1610.01271v3.pdf), the core idea of our confidence intervals is built around a half-sampling variance estimator. In fact, the estimator (51) is just pure half-sampling (but is computationally intractable); then (52) develops a computationally tractable "bootstrap of little bags" approximation to (51) following Sexton and Laake (2009).
  • To account for non-IID data, the half-sampling behind the computationally infeasible optimal estimator (51) needs to be changed in the usual way (e.g., half-sample towns rather than individuals); however, the "little bags" approximation and resulting Monte Carlo correction can be left unchanged.
  • Aside: the simplest example of this Monte Carlo correction in the code is in RegressionPredictionStrategy::compute_variance in this file.

So anyways, noting these preliminaries, you can think of the forest as just doing a half-sampling bootstrap (along with some tricks that make it computationally tractable, but that don't matter from the perspective of the IID/non-IID question). Thus, to make the CIs robust to non-IID data, we just need to make the "usual" modification to half-sampling bootstrap, i.e., cluster similar items during half-sampling.

The way we implement the bootstrap of little bags in the code is that we train trees in groups of size ci_group_size, and each of these tree groups only uses data sampled from the same half sample. You can see these half-samples being generated here. At the very least, you'll need to modify this line (and pipe down the required group information needed to do so). I'll let @jtibshirani chime in in case there's anything else we'd need to worry about?

@lminer
Contributor Author

lminer commented Feb 15, 2018

@swager thanks for such a detailed exposition. Seems fairly straightforward if we're only clustering at one level. Do you have a sense of what the procedure would be if we have multiple levels that are nested? I'm assuming non-nested multi-way clustering is complicated and I shouldn't bother with that.

@nredell

nredell commented Feb 16, 2018

@lminer, I have not looked at the code for either of these R packages, so I can't speak to bias or CI coverage, but the articles were good reads. The REEMtree package handles a variety of nested or clustered relationships, both cross-sectional and longitudinal: https://cran.r-project.org/web/packages/REEMtree/index.html.

And glmertree handles nested data if isolating a treatment effect across subgroups is what you're after: https://cran.r-project.org/web/packages/glmertree/index.html

@lminer
Contributor Author

lminer commented Feb 19, 2018

Still trying to find an authoritative article on this. The only thing I can find about bootstrap standard errors when there are multiple levels is this. It suggests that if you had two levels, like city and school, you would first randomly sample a city, then randomly sample a school from within that city, and use that as your unit of sampling. Does this seem right?

@lminer
Contributor Author

lminer commented Feb 23, 2018

@swager, @jtibshirani I propose implementing the following.

We'll add two extra arguments

  • clusters: (optional argument) a vector of numbers or factors, specifying the clusters
  • samples_per_cluster: (optional argument) the number of observations to be sampled from each sampled cluster

We're only going to implement clustering at a single level. To alleviate issues associated with clusters of wildly different sizes, the clustering will work as follows. If samples_per_cluster is not specified, we will set it equal to the size of the smallest cluster. When sampling for the bootstrap, we will half-sample by clusters. Rather than taking all observations in a sampled cluster, we will take a sample of size samples_per_cluster from each selected cluster.
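A minimal sketch of this procedure (illustrative only; `sample_clusters` and its signature are invented here, and `samples_per_cluster == 0` stands in for "not specified"):

```cpp
// Sketch of the proposal: half-sample clusters, then draw samples_per_cluster
// observations without replacement from each selected cluster. If
// samples_per_cluster is 0 (unspecified), default to the smallest cluster size.
#include <algorithm>
#include <map>
#include <random>
#include <vector>

std::vector<size_t> sample_clusters(const std::vector<int>& cluster_ids,
                                    size_t samples_per_cluster,
                                    std::mt19937& rng) {
  // Group observation indices by cluster label.
  std::map<int, std::vector<size_t>> clusters;
  for (size_t i = 0; i < cluster_ids.size(); ++i)
    clusters[cluster_ids[i]].push_back(i);

  // Default: the size of the smallest cluster.
  if (samples_per_cluster == 0) {
    samples_per_cluster = clusters.begin()->second.size();
    for (const auto& c : clusters)
      samples_per_cluster = std::min(samples_per_cluster, c.second.size());
  }

  // Half-sample at the cluster level.
  std::vector<int> labels;
  for (const auto& c : clusters) labels.push_back(c.first);
  std::shuffle(labels.begin(), labels.end(), rng);
  labels.resize(labels.size() / 2);

  // Within each selected cluster, subsample without replacement.
  std::vector<size_t> sample;
  for (int label : labels) {
    std::vector<size_t> members = clusters[label];
    std::shuffle(members.begin(), members.end(), rng);
    members.resize(std::min(samples_per_cluster, members.size()));
    sample.insert(sample.end(), members.begin(), members.end());
  }
  return sample;
}
```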

One question. Does it make sense to provide this option for all the forests in the package: instrumental, causal, quantile, regression? Yes, right?

@swager
Member

swager commented Feb 24, 2018

Sounds great, thanks! One minor thing: How about calling the second argument samples_per_cluster instead? And yes, we should provide the option to all the forest types.

@lminer
Contributor Author

lminer commented Feb 27, 2018

@swager Digging into this more, I see that there are several different sampling options.

  • bootstrap
  • bootstrap without replacement
  • bootstrap weighted
  • bootstrap weighted without replacement

Do I need to implement each of these options for the hierarchical use case or is plain bootstrap enough?

Also, after I've chosen the clusters, how should I sample observations from the clusters? With replacement or without replacement?

@lminer
Contributor Author

lminer commented Mar 5, 2018

@swager I think I've got the basic implementation down for training. Do I need to make any adjustments for prediction?

@swager
Member

swager commented Mar 5, 2018

Great! Unless I'm missing something, I think prediction should be OK as already implemented. As to your previous point (sorry for missing it earlier), everything in GRF so far runs on bootstrap sampling without replacement, so that's the most important case.

@lminer
Contributor Author

lminer commented Mar 7, 2018

@swager that makes it easier. Last two questions. Should I sample from within the clusters with or without replacement? For the oob sample, do I include observations from sampled clusters that haven't themselves been sampled?

@swager
Member

swager commented Mar 7, 2018

It's most consistent with the rest if all sampling is without replacement (including within clusters). For the OOB sample, we should only include observations from clusters that haven't been used at all (because an OOB sample is supposed to be independent from the tree that was grown; and, in case of cluster-wide correlations, a sample may be correlated with a tree prediction whenever the tree used a training example from the same cluster as the sample).

Finally, for the same reason: In the case of subsample splitting for honesty, we should make sure that we split the subsample along cluster boundaries (i.e., all samples from the same cluster end up in the same half).

@lminer
Contributor Author

lminer commented Mar 7, 2018

Got it. For the OOB sample, do we include all observations in a cluster or just samples_per_cluster?

@lminer
Contributor Author

lminer commented Mar 7, 2018

@swager now that we need to subsample along cluster boundaries, I have another few questions. Basically, where in the code do I need to make this change?

  • Do I only subsample along clusters after checking for honesty here in treetrainer.cpp, where we explicitly check for honesty?
  • Or do we also subsample along clusters before then, i.e., here as well?
  • Finally, do we only do this when calculating confidence intervals (tree.train called from train_ci_group), or should we also subsample along clusters for the point estimates?

@swager
Member

swager commented Mar 20, 2018

We should subsample along clusters in all cases during training, including those in the first two bullets.

Then, given that during training we did all subsampling along clusters, the prediction code should be able to run verbatim (including confidence intervals). The reason for this is that the uncertainty quantification at prediction time is driven by the sampling in the training phase, so if the sampling is already cluster-robust, the uncertainty quantification will also be cluster-robust.
