
[WIP] Clustered standard errors #180

Merged · 16 commits · Apr 7, 2018

Conversation

@lminer (Contributor) commented Mar 21, 2018:

Resolves #178. Provides an option for calculating clustered standard errors.

This is accomplished by half-sampling with replacement at the cluster level rather than at the individual observation level. To alleviate issues caused by clusters of wildly different sizes, we cluster as follows: if samples_per_cluster is not specified, we set it equal to the size of the smallest cluster. When sampling for the bootstrap, we half-sample by clusters; rather than taking all observations in a sampled cluster, we take a sample of size samples_per_cluster from the selected cluster. Subsampling also occurs along cluster boundaries.

We only make this adjustment to sampling when calculating standard errors. Point estimate calculations are identical to the non-cluster case.
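The cluster-level sampling described above can be sketched roughly as follows. This is an illustrative simplification, not the PR's actual implementation; the function name `sample_clusters` and its signature are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical sketch: half-sample clusters, then draw samples_per_cluster
// observations from each selected cluster so that clusters of very different
// sizes contribute equally.
std::vector<size_t> sample_clusters(
    const std::map<size_t, std::vector<size_t>>& cluster_map,
    size_t samples_per_cluster,
    std::mt19937& rng) {
  // Collect cluster IDs and shuffle them to take a 50% half-sample.
  std::vector<size_t> cluster_ids;
  for (const auto& entry : cluster_map) {
    cluster_ids.push_back(entry.first);
  }
  std::shuffle(cluster_ids.begin(), cluster_ids.end(), rng);
  cluster_ids.resize(cluster_ids.size() / 2);

  // From each selected cluster, draw samples_per_cluster observations
  // without replacement, rather than taking the whole cluster.
  std::vector<size_t> samples;
  for (size_t cluster : cluster_ids) {
    std::vector<size_t> members = cluster_map.at(cluster);
    std::shuffle(members.begin(), members.end(), rng);
    for (size_t i = 0; i < samples_per_cluster && i < members.size(); i++) {
      samples.push_back(members[i]);
    }
  }
  return samples;
}
```

With four clusters of three observations each and samples_per_cluster = 2, the half-sample selects two clusters and yields four observations in total.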

size_t obs_cluster = clusters[s];
this->cluster_map[obs_cluster].push_back(s);
}
}
Contributor (Author):

@swager I put the cluster information in Data, but it might be better to put it in ForestOptions.

Member:

Conceptually, I agree that this clustering information belongs in Data, but given the current code structure, I think it will be cleaner to put it in ForestOptions. We'll avoid the situation where the data contains cluster information during training but not prediction, we won't create multiple data constructors, etc.

mean_uncorrected <- mean(preds_uncorrected.oob$variance.estimates)
mean_corrected <- mean(preds_corrected.oob$variance.estimates)
mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)

@lminer (Contributor, Author) — Mar 21, 2018:

@swager I'm not sure how to evaluate the output here. I simulate clusters by just making 20 copies of the same data. mean_no_cluster is the mean of the OOB variance estimates on the unduplicated data. mean_uncorrected is the mean of the OOB variance estimates on the duplicated data, without correcting for clusters. mean_corrected is the mean of the OOB variance estimates on the duplicated data after correcting for clusters.

The results of this test are as follows:

mean_no_cluster = 0.033
mean_uncorrected = 0.009
mean_corrected = 0.167

Do these results pass the sniff test? I worry that the corrected variance estimates are too high.

Member:

Yup, looks like there's a bug!

@jtibshirani (Member):

Thank you so much @lminer for the PR! We'll take a close look.

@jtibshirani (Member):

@lminer -- I'm taking a look at this now. Unless you've already started, I'll resolve the merge conflicts that came up (since I created them in the first place!)

@lminer (Contributor, Author) commented Apr 2, 2018:

@jtibshirani I'm happy with you resolving the conflicts :)

@jtibshirani (Member):

Great, would you be able to give me write permissions to your branch?

@jtibshirani (Member) left a comment:

I'm still debugging some issues with the merge, but wanted to get you some preliminary comments.

Many of these comments are pretty minor, but I noticed one bigger issue that may be leading to the poor results you're seeing. With clustering enabled, my understanding of the correct algorithm is as follows:

  • For each CI group, sample 50% of the clusters (without replacement). Each CI group is associated with a list of cluster IDs.
  • Within this half-sample of cluster IDs, sample 'sample_fraction' of the clusters (without replacement). Each tree is associated with a list of cluster IDs.
  • If honesty is enabled, split these cluster IDs in half, so that one half can be used for growing the tree, and the other half is used in repopulating the leaves.
  • To grow the tree, draw samples_per_cluster from each of the cluster IDs, and do the same when repopulating the leaves for honesty.

Currently, it looks like your code is drawing samples from clusters in the very first step (when we draw a half-sample for a CI group). My apologies if this wasn't very clear from your conversations, or if I'm misunderstanding your approach.

My suggestion for how to structure this sampling scheme:

  • Store all information about clusters inside of SamplingOptions, to limit logic around cluster sampling as much as possible to RandomSampler.
  • Use essentially the same standard methods for sampling, subsampling, etc., but make sure that these are operating on cluster IDs instead of sample IDs. It'd be nice to add a comment somewhere explaining that we're working with cluster IDs instead of sample IDs when clustering is enabled.
  • At the very last step (growing trees), draw actual samples from clusters. Use these to grow the trees and if honesty is enabled, re-populate the leaves.

Tagging @swager for context on the above.
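The staged sampling scheme described above can be sketched roughly as follows. This is illustrative only; the names (`draw_without_replacement`, `sample_clusters_for_tree`, `TreeClusterSample`) are hypothetical and not the code ultimately merged:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: sample `fraction` of `ids` without replacement.
std::vector<size_t> draw_without_replacement(std::vector<size_t> ids,
                                             double fraction,
                                             std::mt19937& rng) {
  std::shuffle(ids.begin(), ids.end(), rng);
  ids.resize(static_cast<size_t>(ids.size() * fraction));
  return ids;
}

struct TreeClusterSample {
  std::vector<size_t> grow_clusters;     // used to grow the tree
  std::vector<size_t> honesty_clusters;  // used to repopulate the leaves
};

// Staged sampling over cluster IDs: 50% of clusters per CI group, then
// sample_fraction of that half-sample per tree, then an honesty split.
TreeClusterSample sample_clusters_for_tree(
    const std::vector<size_t>& all_clusters,
    double sample_fraction,
    bool honesty,
    std::mt19937& rng) {
  // Step 1: each CI group gets a 50% half-sample of cluster IDs.
  std::vector<size_t> ci_group =
      draw_without_replacement(all_clusters, 0.5, rng);
  // Step 2: each tree subsamples sample_fraction of the CI group's clusters.
  std::vector<size_t> tree_clusters =
      draw_without_replacement(ci_group, sample_fraction, rng);

  TreeClusterSample result;
  if (honesty) {
    // Step 3: split the tree's cluster IDs in half for honest estimation.
    size_t half = tree_clusters.size() / 2;
    result.grow_clusters.assign(tree_clusters.begin(),
                                tree_clusters.begin() + half);
    result.honesty_clusters.assign(tree_clusters.begin() + half,
                                   tree_clusters.end());
  } else {
    result.grow_clusters = tree_clusters;
  }
  // Step 4 (not shown): draw samples_per_cluster observations from each
  // cluster ID when growing the tree and when repopulating the leaves.
  return result;
}
```

The key point is that every stage before the final draw operates on cluster IDs, not sample IDs.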

sample_weights(0) {}
sample_weights(0),
samples_per_cluster(1)
{}
Member:
(nitpick) These parentheses should be on the same line as the last parameter. This applies here and a few places below.


bool get_sample_with_replacement() const;
const std::vector<double>& get_sample_weights() const;
unsigned int get_samples_per_cluster() const;
Member:

For consistency, we should use uint to refer to unsigned integer (it's been typedef'd in globals.h). This comment applies to a few different places in this PR.

I'm not sure this typedef is actually good practice, as we inherited this choice from the ranger repository. For now, it's best to go with the standard in other parts of the code.

Contributor (Author):

👍

mtry(mtry),
min_node_size(min_node_size),
honesty(honesty) {}
honesty(honesty),
clustered(samples_per_cluster > 1) {}
Member:
It seems unnecessary to spread information about clustering across three objects (TreeOptions, SamplingOptions, and Data) -- could we move everything into SamplingOptions?


} else if (length(clusters) == 0) {
clusters <- vector(mode="numeric", length=0)
} else if (!is.vector(clusters) | !all(clusters == floor(clusters))) {
stop("clusters must be a vector of integers.")
Member:

(nitpick) Start of error should be capitalized. Same comment applies below.

Contributor (Author):

👍

std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
auto clusters = data->get_clusters();
Member:

The standard in this repo is to avoid using 'auto', except when dealing with iterators.

@lminer (Contributor, Author) — Apr 3, 2018:

Got it. For my own edification, is there a rationale for this style choice? Is auto disfavored because it obscures type information?

samples_per_cluster(1)
{}

SamplingOptions::SamplingOptions(bool sample_with_replacement, unsigned int samples_per_cluster):
Member:

Generally, I'm not a big fan of telescoping constructors like this. It'd be best to only have a default constructor (takes no arguments), and one that accepts all arguments. The implication in this case: we should nuke the constructor SamplingOptions(bool sample_with_replacement) above. This may apply to a few different places in this PR.
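The suggested pattern (a default constructor plus one full-argument constructor, with no intermediate "telescoping" overloads) might look like this sketch; the member set shown here is simplified relative to the PR:

```cpp
#include <cassert>

// Sketch of the suggested constructor style: one default constructor and one
// constructor that takes all arguments, avoiding telescoping overloads.
class SamplingOptions {
public:
  SamplingOptions():
      sample_with_replacement(true),
      samples_per_cluster(1) {}

  SamplingOptions(bool sample_with_replacement,
                  unsigned int samples_per_cluster):
      sample_with_replacement(sample_with_replacement),
      samples_per_cluster(samples_per_cluster) {}

  bool get_sample_with_replacement() const { return sample_with_replacement; }
  unsigned int get_samples_per_cluster() const { return samples_per_cluster; }

private:
  bool sample_with_replacement;
  unsigned int samples_per_cluster;
};
```

Callers either accept all defaults or specify every option, so there is no ambiguity about which partially-specified combination a given constructor represents.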

Contributor (Author):

👍


validate_samples_per_cluster <- function(samples_per_cluster, clusters) {
if (is.null(clusters) || length(clusters) == 0) {
return(1)
Member:

I mentioned this earlier, but it seems like this should be a different null marker, like 0? The value 1 could actually be valid when clustering is enabled.

Contributor (Author):

👍

std::vector<size_t>& subsamples,
std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
Member:

This doesn't seem right to me -- can't samples_per_cluster be 1, but clustering is still enabled? For example, the smallest cluster could have size 1.

I think we just want to check clusters.empty().
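The suggested check could look like this sketch (illustrative; the merged code routes this through SamplingOptions accessors that may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: determine whether clustering is enabled by checking for cluster
// IDs directly, rather than testing samples_per_cluster > 1, which fails
// when the smallest cluster has size 1.
bool clustering_enabled(const std::vector<size_t>& clusters) {
  return !clusters.empty();
}
```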

mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)

expect_true(mean_uncorrected < mean_corrected)
})
Member:

(nitpick) Should add newline here.

@@ -136,13 +135,15 @@ std::vector<std::shared_ptr<Tree>> ForestTrainer::train_ci_group(Data* data,
std::vector<std::shared_ptr<Tree>> trees;

std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
std::vector<size_t> dummy_oob_sample;
Member:

While this PR was open, I refactored RandomSampler to not accept 'OOB samples' unless it is really necessary. As you update your sampling approach, it would be good to remove unnecessary oob_sample parameters as well.

@lminer (Contributor, Author) commented Apr 4, 2018:

@jtibshirani Thanks for the very thorough review. I believe I've made the changes that you suggested to the sampling strategy. It's definitely a lot cleaner now. Note, however, that I am seeing even larger standard errors for the one test I have of the whole pipeline. I've updated my comment to that test to give the actual numbers.

@swager (Member) left a comment:

Some comments from a quick read-through. I'm going to take a closer look and see if I can figure out what's driving the test results.

std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
sampler.sample_for_ci(data, 0.5, sample);
Member:

We want to always use the same clustering strategy, whether ci_group_size is >1 or not. Sampling by groups changes the estimands (i.e., do we give equal weights to clusters or equal weights to samples), so this matters even if we're not building CIs.


cluster_size <- 20
# data sim
X <- rnorm(1000)
Member:

Please define the simulation in terms of n, p, etc. Also, forests may behave very strangely when p = 1, so I'd recommend using at least p = 4.

Also, to make sure running tests doesn't slow down development, it might be better to use smaller sample sizes if they still highlight the desired phenomena (e.g., I'm using n=200; p = 4; cluster_size = 10 and the results are qualitatively similar):

> c(mean_no_cluster, mean_uncorrected, mean_corrected, mean_corrected_no_cluster)
[1] 0.06357332 0.02013494 0.15627015 0.12970989


@swager (Member) commented Apr 5, 2018:

@jtibshirani and I went through the PR in detail. It looks like the errors were driven by sampling with replacement vs without in one spot. After fixing some issues, everything looks good now. We'll handle some last changes and then merge in the PR -- hopefully within a couple days. Thanks again for the very nice contribution!

@swager (Member) commented Apr 5, 2018:

P.S. We also fixed the couple of things I mentioned in my previous comment, so no need to worry about those.

@lminer (Contributor, Author) commented Apr 5, 2018:

@swager that's great news. Thanks for hunting down the bug. Where was it exactly? It would be good to know for my own benefit.

@swager (Member) left a comment:

The sampling issue was on the line noted below. Two other things that mattered:

  • In the test, if we don't set samples_per_cluster=1, then the min_node_size parameter affects trees differently depending on the number of repeats (so we just set the parameter to 1), and
  • There was some inconsistency in how OOB samples were defined.

@jtibshirani will push an update shortly with the changes.

std::vector<size_t>& samples) {
if (options.clustering_enabled()) {
size_t num_samples = options.get_num_clusters();
bootstrap(num_samples, sample_fraction, samples);
Member:

This line should sample without replacement.
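The difference between the buggy with-replacement draw and the required without-replacement draw can be sketched as follows (helper names hypothetical; this is not the PR's actual RandomSampler code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// With replacement (the bug): the same cluster ID can be drawn repeatedly.
std::vector<size_t> bootstrap_clusters(size_t num_clusters, double fraction,
                                       std::mt19937& rng) {
  std::uniform_int_distribution<size_t> dist(0, num_clusters - 1);
  std::vector<size_t> samples;
  size_t count = static_cast<size_t>(num_clusters * fraction);
  for (size_t i = 0; i < count; i++) {
    samples.push_back(dist(rng));  // duplicates possible
  }
  return samples;
}

// Without replacement (the fix): every drawn cluster ID is distinct, which
// is what the half-sampling step for confidence intervals requires.
std::vector<size_t> subsample_clusters(size_t num_clusters, double fraction,
                                       std::mt19937& rng) {
  std::vector<size_t> ids(num_clusters);
  for (size_t i = 0; i < num_clusters; i++) {
    ids[i] = i;
  }
  std::shuffle(ids.begin(), ids.end(), rng);
  ids.resize(static_cast<size_t>(num_clusters * fraction));
  return ids;
}
```

Sampling with replacement at this step inflates the between-half-sample variance, which is consistent with the too-large corrected variance estimates reported in the test.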

@jtibshirani jtibshirani force-pushed the lim-2018.02.26-clustered_sampling branch from 8642b5b to 3c4b19f Compare April 6, 2018 01:36
* Prefer size_t to uint for representing IDs.
* For efficiency, use a vector instead of unordered_map for storing cluster information.
* Minor naming improvements.
@jtibshirani (Member) commented Apr 6, 2018:

Almost there! 🌴

@swager could I pass off a few things to you before we merge?

  • It'd be great to do a pass at the R code/ interface additions and make sure you're happy with it.
  • I managed to pass 'test_clustered_standard_errors' even with some pretty big bugs around OOB prediction and non-honest trees. Would it be possible to add a couple more cases to that test, focusing on these areas?

@jtibshirani (Member) commented Apr 7, 2018:

Okay, we're going to merge this, and @swager will add some tests in a future PR. Thanks again @lminer!

@jtibshirani jtibshirani merged commit 9d248a3 into grf-labs:master Apr 7, 2018
@jtibshirani jtibshirani deleted the lim-2018.02.26-clustered_sampling branch April 7, 2018 01:59
jtibshirani added a commit that referenced this pull request Apr 7, 2018