
[WIP] Clustered standard errors #180

Merged · 16 commits · Apr 7, 2018

Conversation

@lminer (Contributor) commented Mar 21, 2018:

Resolves #178. Provides an option for calculating clustered standard errors.

This is accomplished by half-sampling with replacement at the cluster level rather than at the individual observation level. To alleviate issues caused by clusters of wildly different sizes, we cluster as follows: if samples_per_cluster is not specified, we set it equal to the size of the smallest cluster. When sampling for the bootstrap, we half-sample by clusters; rather than taking all observations in a sampled cluster, we take a sample of size samples_per_cluster from the selected cluster. Subsampling also occurs along cluster boundaries.

We only make this adjustment to sampling when calculating standard errors. Point estimate calculations are identical to the non-cluster case.
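The cluster-level sampling described above can be sketched roughly as follows. This is an illustrative simplification, not the PR's actual implementation; the function name `sample_clusters` and its signature are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical sketch: half-sample clusters, then draw samples_per_cluster
// observations from each selected cluster so that clusters of very different
// sizes contribute equally.
std::vector<size_t> sample_clusters(
    const std::map<size_t, std::vector<size_t>>& cluster_map,
    size_t samples_per_cluster,
    std::mt19937& rng) {
  // Collect cluster IDs and shuffle them to take a 50% half-sample.
  std::vector<size_t> cluster_ids;
  for (const auto& entry : cluster_map) {
    cluster_ids.push_back(entry.first);
  }
  std::shuffle(cluster_ids.begin(), cluster_ids.end(), rng);
  cluster_ids.resize(cluster_ids.size() / 2);

  // From each selected cluster, draw samples_per_cluster observations
  // without replacement, rather than taking the whole cluster.
  std::vector<size_t> samples;
  for (size_t cluster : cluster_ids) {
    std::vector<size_t> members = cluster_map.at(cluster);
    std::shuffle(members.begin(), members.end(), rng);
    for (size_t i = 0; i < samples_per_cluster && i < members.size(); i++) {
      samples.push_back(members[i]);
    }
  }
  return samples;
}
```

With four clusters of three observations each and samples_per_cluster = 2, the half-sample selects two clusters and yields four observations in total.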

size_t obs_cluster = clusters[s];
this->cluster_map[obs_cluster].push_back(s);
}
}
Contributor (Author):

@swager I put the cluster information in Data, but it might be better to put it in ForestOptions.

Member:

Conceptually, I agree that this clustering information belongs in Data, but given the current code structure, I think it will be cleaner to put it in ForestOptions. We'll avoid the situation where the data contains cluster information during training but not prediction, we won't create multiple data constructors, etc.

mean_uncorrected <- mean(preds_uncorrected.oob$variance.estimates)
mean_corrected <- mean(preds_corrected.oob$variance.estimates)
mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)

@lminer (Contributor, Author) — Mar 21, 2018:

@swager I'm not sure how to evaluate the output here. I simulate clusters by just making 20 copies of the same data. mean_no_cluster is the mean of the OOB variance estimates on the unduplicated data. mean_uncorrected is the mean of the OOB variance estimates on the duplicated data, without correcting for clusters. mean_corrected is the mean of the OOB variance estimates on the duplicated data after correcting for clusters.

The results of this test are as follows:

mean_no_cluster = 0.033
mean_uncorrected = 0.009
mean_corrected = 0.167

Do these results pass the sniff test? I worry that the corrected variance estimates are too high.

Member:

Yup, looks like there's a bug!

@jtibshirani (Member):

Thank you so much @lminer for the PR! We'll take a close look.

@jtibshirani (Member):

@lminer -- I'm taking a look at this now. Unless you've already started, I'll resolve the merge conflicts that came up (since I created them in the first place!)

@lminer (Contributor, Author) commented Apr 2, 2018:

@jtibshirani I'm happy with you resolving the conflicts :)

@jtibshirani (Member):

Great, would you be able to give me write permissions to your branch?

@jtibshirani (Member) left a comment:

I'm still debugging some issues with the merge, but wanted to get you some preliminary comments.

Many of these comments are pretty minor, but I noticed one bigger issue that may be leading to the poor results you're seeing. With clustering enabled, my understanding of the correct algorithm is as follows:

  • For each CI group, sample 50% of the clusters (without replacement). Each CI group is associated with a list of cluster IDs.
  • Within this half-sample of cluster IDs, sample 'sample_fraction' of the clusters (without replacement). Each tree is associated with a list of cluster IDs.
  • If honesty is enabled, split these cluster IDs in half, so that one half can be used for growing the tree, and the other half is used in repopulating the leaves.
  • To grow the tree, draw samples_per_cluster from each of the cluster IDs, and do the same when repopulating the leaves for honesty.

Currently, it looks like your code is drawing samples from clusters in the very first step (when we draw a half-sample for a CI group). My apologies if this wasn't very clear from your conversations, or if I'm misunderstanding your approach.

My suggestion for how to structure this sampling scheme:

  • Store all information about clusters inside of SamplingOptions, to limit logic around cluster sampling as much as possible to RandomSampler.
  • Use essentially the same standard methods for sampling, subsampling, etc., but make sure that these are operating on cluster IDs instead of sample IDs. It'd be nice to add a comment somewhere explaining that we're working with cluster IDs instead of sample IDs when clustering is enabled.
  • At the very last step (growing trees), draw actual samples from clusters. Use these to grow the trees and if honesty is enabled, re-populate the leaves.

Tagging @swager for context on the above.
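The staged sampling scheme described above can be sketched roughly as follows. This is illustrative only; the names (`draw_without_replacement`, `sample_clusters_for_tree`, `TreeClusterSample`) are hypothetical and not the code ultimately merged:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: sample `fraction` of `ids` without replacement.
std::vector<size_t> draw_without_replacement(std::vector<size_t> ids,
                                             double fraction,
                                             std::mt19937& rng) {
  std::shuffle(ids.begin(), ids.end(), rng);
  ids.resize(static_cast<size_t>(ids.size() * fraction));
  return ids;
}

struct TreeClusterSample {
  std::vector<size_t> grow_clusters;     // used to grow the tree
  std::vector<size_t> honesty_clusters;  // used to repopulate the leaves
};

// Staged sampling over cluster IDs: 50% of clusters per CI group, then
// sample_fraction of that half-sample per tree, then an honesty split.
TreeClusterSample sample_clusters_for_tree(
    const std::vector<size_t>& all_clusters,
    double sample_fraction,
    bool honesty,
    std::mt19937& rng) {
  // Step 1: each CI group gets a 50% half-sample of cluster IDs.
  std::vector<size_t> ci_group =
      draw_without_replacement(all_clusters, 0.5, rng);
  // Step 2: each tree subsamples sample_fraction of the CI group's clusters.
  std::vector<size_t> tree_clusters =
      draw_without_replacement(ci_group, sample_fraction, rng);

  TreeClusterSample result;
  if (honesty) {
    // Step 3: split the tree's cluster IDs in half for honest estimation.
    size_t half = tree_clusters.size() / 2;
    result.grow_clusters.assign(tree_clusters.begin(),
                                tree_clusters.begin() + half);
    result.honesty_clusters.assign(tree_clusters.begin() + half,
                                   tree_clusters.end());
  } else {
    result.grow_clusters = tree_clusters;
  }
  // Step 4 (not shown): draw samples_per_cluster observations from each
  // cluster ID when growing the tree and when repopulating the leaves.
  return result;
}
```

The key point is that every stage before the final draw operates on cluster IDs, not sample IDs.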

sample_weights(0) {}
sample_weights(0),
samples_per_cluster(1)
{}
Member:
(nitpick) These parentheses should be on the same line as the last parameter. This applies here and a few places below.


bool get_sample_with_replacement() const;
const std::vector<double>& get_sample_weights() const;
unsigned int get_samples_per_cluster() const;
Member:

For consistency, we should use uint to refer to unsigned integer (it's been typedef'd in globals.h). This comment applies to a few different places in this PR.

I'm not sure this typedef is actually good practice, as we inherited this choice from the ranger repository. For now, it's best to go with the standard in other parts of the code.

Contributor (Author):

👍

mtry(mtry),
min_node_size(min_node_size),
honesty(honesty) {}
honesty(honesty),
clustered(samples_per_cluster > 1) {}
Member:
It seems unnecessary to spread information about clustering across three objects (TreeOptions, SamplingOptions, and Data) -- could we move everything into SamplingOptions?


} else if (length(clusters) == 0) {
clusters <- vector(mode="numeric", length=0)
} else if (!is.vector(clusters) | !all(clusters == floor(clusters))) {
stop("clusters must be a vector of integers.")
Member:

(nitpick) Start of error should be capitalized. Same comment applies below.

Contributor (Author):

👍

std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
auto clusters = data->get_clusters();
Member:

The standard in this repo is to avoid using 'auto', except when dealing with iterators.

@lminer (Contributor, Author) — Apr 3, 2018:

Got it. For my own edification, is there a rationale for this style choice? Is auto disfavored because it obscures type information?

samples_per_cluster(1)
{}

SamplingOptions::SamplingOptions(bool sample_with_replacement, unsigned int samples_per_cluster):
Member:

Generally, I'm not a big fan of telescoping constructors like this. It'd be best to only have a default constructor (takes no arguments), and one that accepts all arguments. The implication in this case: we should nuke the constructor SamplingOptions(bool sample_with_replacement) above. This may apply to a few different places in this PR.
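The suggested pattern (a default constructor plus one full-argument constructor, with no intermediate "telescoping" overloads) might look like this sketch; the member set shown here is simplified relative to the PR:

```cpp
#include <cassert>

// Sketch of the suggested constructor style: one default constructor and one
// constructor that takes all arguments, avoiding telescoping overloads.
class SamplingOptions {
public:
  SamplingOptions():
      sample_with_replacement(true),
      samples_per_cluster(1) {}

  SamplingOptions(bool sample_with_replacement,
                  unsigned int samples_per_cluster):
      sample_with_replacement(sample_with_replacement),
      samples_per_cluster(samples_per_cluster) {}

  bool get_sample_with_replacement() const { return sample_with_replacement; }
  unsigned int get_samples_per_cluster() const { return samples_per_cluster; }

private:
  bool sample_with_replacement;
  unsigned int samples_per_cluster;
};
```

Callers either accept all defaults or specify every option, so there is no ambiguity about which partially-specified combination a given constructor represents.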

Contributor (Author):

👍


validate_samples_per_cluster <- function(samples_per_cluster, clusters) {
if (is.null(clusters) || length(clusters) == 0) {
return(1)
Member:

I mentioned this earlier, but it seems like this should be a different null marker, like 0? The value 1 could actually be valid when clustering is enabled.

Contributor (Author):

👍

std::vector<size_t>& subsamples,
std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
Member:

This doesn't seem right to me -- can't samples_per_cluster be 1, but clustering is still enabled? For example, the smallest cluster could have size 1.

I think we just want to check clusters.empty().
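The suggested check could look like this sketch (illustrative; the merged code routes this through SamplingOptions accessors that may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: determine whether clustering is enabled by checking for cluster
// IDs directly, rather than testing samples_per_cluster > 1, which fails
// when the smallest cluster has size 1.
bool clustering_enabled(const std::vector<size_t>& clusters) {
  return !clusters.empty();
}
```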

mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)

expect_true(mean_uncorrected < mean_corrected)
})
Member:

(nitpick) Should add newline here.

@@ -136,13 +135,15 @@ std::vector<std::shared_ptr<Tree>> ForestTrainer::train_ci_group(Data* data,
std::vector<std::shared_ptr<Tree>> trees;

std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
std::vector<size_t> dummy_oob_sample;
Member:

While this PR was open, I refactored RandomSampler to not accept 'OOB samples' unless it is really necessary. As you update your sampling approach, it would be good to remove unnecessary oob_sample parameters as well.

@lminer (Contributor, Author) commented Apr 4, 2018:

@jtibshirani Thanks for the very thorough review. I believe I've made the changes that you suggested to the sampling strategy. It's definitely a lot cleaner now. Note, however, that I am seeing even larger standard errors for the one test I have of the whole pipeline. I've updated my comment to that test to give the actual numbers.

@swager (Member) left a comment:

Some comments from a quick read-through. I'm going to take a closer look and see if I can figure out what's driving the test results.

std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
sampler.sample_for_ci(data, 0.5, sample);
Member:

We want to always use the same clustering strategy, whether ci_group_size is >1 or not. Sampling by groups changes the estimands (i.e., do we give equal weights to clusters or equal weights to samples), so this matters even if we're not building CIs.


cluster_size <- 20
# data sim
X <- rnorm(1000)
Member:

Please define the simulation in terms of n, p, etc. Also, forests may behave very strangely when p = 1, so I'd recommend using at least p = 4.

Also, to make sure running tests doesn't slow down development, it might be better to use smaller sample sizes if they still highlight the desired phenomena (e.g., I'm using n=200; p = 4; cluster_size = 10 and the results are qualitatively similar):

> c(mean_no_cluster, mean_uncorrected, mean_corrected, mean_corrected_no_cluster)
[1] 0.06357332 0.02013494 0.15627015 0.12970989


@swager (Member) commented Apr 5, 2018:

@jtibshirani and I went through the PR in detail. It looks like the errors were driven by sampling with replacement vs without in one spot. After fixing some issues, everything looks good now. We'll handle some last changes and then merge in the PR -- hopefully within a couple days. Thanks again for the very nice contribution!

@swager (Member) commented Apr 5, 2018:

P.S. We also fixed the couple of things I mentioned in my previous comment, so no need to worry about those.

@lminer (Contributor, Author) commented Apr 5, 2018:

@swager that's great news. Thanks for hunting down the bug. Where was it exactly? It would be good to know for my own benefit.

@swager (Member) left a comment:

The sampling issue was on the line noted below. Two other things that mattered:

  • In the test, if we don't set samples_per_cluster=1, then the min_node_size parameter affects trees differently depending on the number of repeats (so we just set the parameter to 1), and
  • There was some inconsistency in how OOB samples were defined.

@jtibshirani will push an update shortly with the changes.

std::vector<size_t>& samples) {
if (options.clustering_enabled()) {
size_t num_samples = options.get_num_clusters();
bootstrap(num_samples, sample_fraction, samples);
Member:

This line should sample without replacement.
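The difference between the buggy with-replacement draw and the required without-replacement draw can be sketched as follows (helper names hypothetical; this is not the PR's actual RandomSampler code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// With replacement (the bug): the same cluster ID can be drawn repeatedly.
std::vector<size_t> bootstrap_clusters(size_t num_clusters, double fraction,
                                       std::mt19937& rng) {
  std::uniform_int_distribution<size_t> dist(0, num_clusters - 1);
  std::vector<size_t> samples;
  size_t count = static_cast<size_t>(num_clusters * fraction);
  for (size_t i = 0; i < count; i++) {
    samples.push_back(dist(rng));  // duplicates possible
  }
  return samples;
}

// Without replacement (the fix): every drawn cluster ID is distinct, which
// is what the half-sampling step for confidence intervals requires.
std::vector<size_t> subsample_clusters(size_t num_clusters, double fraction,
                                       std::mt19937& rng) {
  std::vector<size_t> ids(num_clusters);
  for (size_t i = 0; i < num_clusters; i++) {
    ids[i] = i;
  }
  std::shuffle(ids.begin(), ids.end(), rng);
  ids.resize(static_cast<size_t>(num_clusters * fraction));
  return ids;
}
```

Sampling with replacement at this step inflates the between-half-sample variance, which is consistent with the too-large corrected variance estimates reported in the test.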

@jtibshirani jtibshirani force-pushed the lim-2018.02.26-clustered_sampling branch from 8642b5b to 3c4b19f Compare April 6, 2018 01:36
* Prefer size_t to uint for representing IDs.
* For efficiency, use a vector instead of unordered_map for storing cluster information.
* Minor naming improvements.
@jtibshirani (Member) commented Apr 6, 2018:

Almost there! 🌴

@swager could I pass off a few things to you before we merge?

  • It'd be great to do a pass at the R code/ interface additions and make sure you're happy with it.
  • I managed to pass 'test_clustered_standard_errors' even with some pretty big bugs around OOB prediction and non-honest trees. Would it be possible to add a couple more cases to that test, focusing on these areas?

@jtibshirani (Member) commented Apr 7, 2018:

Okay, we're going to merge this, and @swager will add some tests in a future PR. Thanks again @lminer!

@jtibshirani jtibshirani merged commit 9d248a3 into grf-labs:master Apr 7, 2018
@jtibshirani jtibshirani deleted the lim-2018.02.26-clustered_sampling branch April 7, 2018 01:59
jtibshirani added a commit that referenced this pull request Apr 7, 2018