[WIP] Clustered standard errors #180
core/src/commons/Data.cpp
size_t obs_cluster = clusters[s];
this->cluster_map[obs_cluster].push_back(s);
}
}
@swager I put the cluster information in Data, but it might be better to put it in ForestOptions.
Conceptually, I agree that this clustering information belongs in Data, but given the current code structure, I think it will be cleaner to put it in ForestOptions. We'll avoid the situation where the data contains cluster information during training but not prediction, we won't create multiple data constructors, etc.
mean_uncorrected <- mean(preds_uncorrected.oob$variance.estimates)
mean_corrected <- mean(preds_corrected.oob$variance.estimates)
mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)
@swager I'm not sure how to evaluate the output here. I simulate clusters by just making 20 copies of the same data. mean_no_cluster is the mean of the OOB variance estimates on the unduplicated data. mean_uncorrected is the mean of the OOB variance estimates on the duplicated data, without correcting for clusters. mean_corrected is the mean of the OOB variance estimates on the duplicated data after correcting for clusters.
The results of this test are as follows:
mean_no_cluster = 0.033
mean_uncorrected = 0.009
mean_corrected = 0.167
Do these results pass the sniff test? I worry that the corrected variance estimates are too high.
Yup, looks like there's a bug!
Thank you so much @lminer for the PR! We'll take a close look.
@lminer -- I'm taking a look at this now. Unless you've already started, I'll resolve the merge conflicts that came up (since I created them in the first place!)
@jtibshirani I'm happy with you resolving the conflicts :)
Great, would you be able to give me write permissions to your branch?
I'm still debugging some issues with the merge, but wanted to get you some preliminary comments.
Many of these comments are pretty minor, but I noticed one bigger issue that may be leading to the poor results you're seeing. With clustering enabled, my understanding of the correct algorithm is as follows:
- For each CI group, sample 50% of the clusters (without replacement). Each CI group is associated with a list of cluster IDs.
- Within this half-sample of cluster IDs, sample 'sample_fraction' of the clusters (without replacement). Each tree is associated with a list of cluster IDs.
- If honesty is enabled, split these cluster IDs in half, so that one half can be used for growing the tree, and the other half is used in repopulating the leaves.
- To grow the tree, draw samples_per_cluster samples from each of the cluster IDs, and do the same when repopulating the leaves for honesty.
Currently, it looks like your code is drawing samples from clusters in the very first step (when we draw a half-sample for a CI group). My apologies if this wasn't very clear from your conversations, or if I'm misunderstanding your approach.
My suggestion for how to structure this sampling scheme:
- Store all information about clusters inside of SamplingOptions, to limit logic around cluster sampling as much as possible to RandomSampler.
- Use essentially the same standard methods for sampling, subsampling, etc., but make sure that these are operating on cluster IDs instead of sample IDs. It'd be nice to add a comment somewhere explaining that we're working with cluster IDs instead of sample IDs when clustering is enabled.
- At the very last step (growing trees), draw actual samples from clusters. Use these to grow the trees and, if honesty is enabled, re-populate the leaves. A rough sketch of this last step follows this list.
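For concreteness, a minimal sketch of that last step (purely illustrative: the helper name draw_samples_from_clusters, the cluster_map layout, and the use of C++17 std::sample are my assumptions, not grf's actual API):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <vector>

// Hypothetical helper: expand a tree's cluster IDs into concrete sample IDs.
std::vector<size_t> draw_samples_from_clusters(
    const std::vector<size_t>& tree_clusters,                  // cluster IDs drawn for this tree
    const std::map<size_t, std::vector<size_t>>& cluster_map,  // cluster ID -> its sample IDs
    size_t samples_per_cluster,
    std::mt19937_64& rng) {
  std::vector<size_t> samples;
  for (size_t cluster : tree_clusters) {
    // Draw samples_per_cluster observations (without replacement) from this cluster.
    const std::vector<size_t>& members = cluster_map.at(cluster);
    std::sample(members.begin(), members.end(), std::back_inserter(samples),
                samples_per_cluster, rng);
  }
  return samples;
}
```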
Tagging @swager for context on the above.
sample_weights(0) {}
sample_weights(0),
samples_per_cluster(1)
{}
(nitpick) These parentheses should be on the same line as the last parameter. This applies here and a few places below.
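For example, the preferred layout would be (illustrative only -- the constructor signature is assumed, not copied from the PR):

```cpp
SamplingOptions::SamplingOptions():
    sample_weights(0),
    samples_per_cluster(1) {}  // '{}' joins the last initializer's line
```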
core/src/sampling/SamplingOptions.h
bool get_sample_with_replacement() const;
const std::vector<double>& get_sample_weights() const;
unsigned int get_samples_per_cluster() const;
For consistency, we should use uint to refer to unsigned integers (it's been typedef'd in globals.h). This comment applies to a few different places in this PR.
I'm not sure this typedef is actually good practice, as we inherited this choice from the ranger repository. For now, it's best to go with the standard in other parts of the code.
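For reference, the typedef in question looks roughly like this (ranger defines it this way; I'm assuming grf's globals.h matches):

```cpp
// core/src/commons/globals.h (inherited from ranger):
typedef unsigned int uint;

// So the accessor above would be declared as:
uint get_samples_per_cluster() const;
```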
👍
core/src/tree/TreeOptions.cpp
mtry(mtry),
min_node_size(min_node_size),
honesty(honesty) {}
honesty(honesty),
clustered(samples_per_cluster > 1) {}
It seems unnecessary to spread information about clustering across three objects (TreeOptions, SamplingOptions, and Data) -- could we move everything into SamplingOptions?
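Concretely, the consolidated object might look something like this (a sketch under my assumptions about field and accessor names -- not the final grf API):

```cpp
#include <cstddef>
#include <vector>

// Sketch: all clustering state lives in SamplingOptions, so TreeOptions
// and Data need no cluster-related members.
class SamplingOptions {
public:
  bool get_sample_with_replacement() const;
  const std::vector<double>& get_sample_weights() const;

  const std::vector<size_t>& get_clusters() const;  // cluster ID per sample
  unsigned int get_samples_per_cluster() const;
  bool clustering_enabled() const { return !clusters.empty(); }

private:
  bool sample_with_replacement;
  std::vector<double> sample_weights;
  std::vector<size_t> clusters;
  unsigned int samples_per_cluster;
};
```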
r-package/grf/R/input_utilities.R
} else if (length(clusters) == 0) {
clusters <- vector(mode="numeric", length=0)
} else if (!is.vector(clusters) | !all(clusters == floor(clusters))) {
stop("clusters must be a vector of integers.")
(nitpick) The start of the error message should be capitalized. Same comment applies below.
👍
core/src/sampling/RandomSampler.cpp
std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
auto clusters = data->get_clusters();
The standard in this repo is to avoid using 'auto', except when dealing with iterators.
Got it. For my own edification, is there a rationale for this style choice? Is auto not favored because it obscures type information?
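My read of the convention, with a hypothetical snippet (not taken from the grf codebase):

```cpp
#include <cstddef>
#include <map>
#include <vector>

void print_cluster_sizes(const std::map<size_t, std::vector<size_t>>& cluster_map) {
  // 'auto' is fine for iterators: spelling out
  // std::map<size_t, std::vector<size_t>>::const_iterator adds little.
  for (auto it = cluster_map.begin(); it != cluster_map.end(); ++it) {
    // Everywhere else, prefer the explicit type so it is visible at the call site.
    size_t cluster = it->first;
    const std::vector<size_t>& members = it->second;
    // ... e.g., use cluster and members.size() ...
    (void) cluster; (void) members;
  }
}
```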
samples_per_cluster(1)
{}

SamplingOptions::SamplingOptions(bool sample_with_replacement, unsigned int samples_per_cluster):
Generally, I'm not a big fan of telescoping constructors like this. It'd be best to only have a default constructor (takes no arguments), and one that accepts all arguments. The implication in this case: we should nuke the constructor SamplingOptions(bool sample_with_replacement) above. This may apply to a few different places in this PR.
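In other words, something like the following (a sketch; the full parameter list depends on what SamplingOptions ends up holding):

```cpp
class SamplingOptions {
public:
  // Only two constructors: a default one...
  SamplingOptions();

  // ...and one that accepts every option at once.
  SamplingOptions(bool sample_with_replacement,
                  unsigned int samples_per_cluster);

  // The telescoping variant SamplingOptions(bool sample_with_replacement)
  // goes away, so defaults are defined in exactly one place.
};
```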
👍
r-package/grf/R/input_utilities.R
validate_samples_per_cluster <- function(samples_per_cluster, clusters) {
if (is.null(clusters) || length(clusters) == 0) {
return(1)
I mentioned this earlier, but it seems like this should be a different null marker, like 0? The value 1 could actually be valid when clustering is enabled.
👍
core/src/sampling/RandomSampler.cpp
std::vector<size_t>& subsamples,
std::vector<size_t>& oob_samples,
Data* data) {
if (options.get_samples_per_cluster() > 1) {
This doesn't seem right to me -- can't samples_per_cluster be 1, but clustering is still enabled? For example, the smallest cluster could have size 1.
I think we just want to check clusters.empty().
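That is, the guard would become something like this (a sketch, assuming the clusters vector is reachable from the options object as suggested above):

```cpp
// samples_per_cluster == 1 is legal (e.g., the smallest cluster has size 1),
// so test for the presence of clusters directly:
if (!options.get_clusters().empty()) {
  // ... clustered sampling path ...
}
```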
mean_corrected_no_cluster <- mean(preds_corrected_no_cluster.oob$variance.estimates)

expect_true(mean_uncorrected < mean_corrected)
})
(nitpick) Should add newline here.
core/src/forest/ForestTrainer.cpp
@@ -136,13 +135,15 @@ std::vector<std::shared_ptr<Tree>> ForestTrainer::train_ci_group(Data* data,
std::vector<std::shared_ptr<Tree>> trees;

std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
std::vector<size_t> dummy_oob_sample;
While this PR was open, I refactored RandomSampler to not accept 'OOB samples' unless it is really necessary. As you update your sampling approach, it would be good to remove unnecessary oob_sample parameters as well.
@jtibshirani Thanks for the very thorough review. I believe I've made the changes that you suggested to the sampling strategy. It's definitely a lot cleaner now. Note, however, that I am seeing even larger standard errors for the one test I have of the whole pipeline. I've updated my comment on that test to give the actual numbers.
Some comments from a quick read-through. I'm going to take a closer look and see if I can figure out what's driving the test results.
core/src/forest/ForestTrainer.cpp
std::vector<size_t> sample;
sampler.sample(data->get_num_rows(), 0.5, sample);
sampler.sample_for_ci(data, 0.5, sample);
We want to always use the same clustering strategy, whether ci_group_size is >1 or not. Sampling by groups changes the estimand (i.e., do we give equal weight to clusters or equal weight to samples), so this matters even if we're not building CIs.
cluster_size <- 20
# data sim
X <- rnorm(1000)
Please define the simulation in terms of n, p, etc. Also, forests may behave very strangely when p = 1, so I'd recommend using at least p = 4.
Finally, to make sure running tests doesn't slow down development, it might be better to use smaller sample sizes if they still highlight the desired phenomena (e.g., I'm using n = 200; p = 4; cluster_size = 10 and the results are qualitatively similar):
> c(mean_no_cluster, mean_uncorrected, mean_corrected, mean_corrected_no_cluster)
[1] 0.06357332 0.02013494 0.15627015 0.12970989
@jtibshirani and I went through the PR in detail. It looks like the errors were driven by sampling with replacement vs. without in one spot. After fixing some issues, everything looks good now. We'll handle some last changes and then merge in the PR -- hopefully within a couple days. Thanks again for the very nice contribution!
p.s. we also fixed the couple things I mentioned in my previous comment, so no need to worry about those.
@swager that's great news. Thanks for hunting the bug. Where was it exactly? Would be good to know for my own benefit.
The sampling issue was on the line noted below. Two other things that mattered:
- In the test, if we don't set samples_per_cluster=1, then the min_node_size parameter affects trees differently depending on the number of repeats (so we just set the parameter to 1), and
- There was some inconsistency in how OOB samples were defined.
@jtibshirani will push an update shortly with the changes.
core/src/sampling/RandomSampler.cpp
std::vector<size_t>& samples) {
if (options.clustering_enabled()) {
size_t num_samples = options.get_num_clusters();
bootstrap(num_samples, sample_fraction, samples);
This line should sample without replacement.
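Concretely, the fix amounts to swapping bootstrap() for the without-replacement sampler, something like this (a sketch -- I'm assuming sample() is the internal no-replacement counterpart used for the non-clustered CI half-sample elsewhere in this file):

```cpp
if (options.clustering_enabled()) {
  size_t num_samples = options.get_num_clusters();
  // Draw the half-sample of clusters *without* replacement, mirroring the
  // non-clustered CI half-sample.
  sample(num_samples, sample_fraction, samples);
}
```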
* Prefer size_t to uint for representing IDs.
* For efficiency, use a vector instead of unordered_map for storing cluster information.
* Minor naming improvements.
… the drawn clusters.
Almost there! 🌴 @swager could I pass off a few things to you before we merge?
[WIP] Clustered standard errors
Resolves #178. Provides an option for calculating clustered standard errors.
This is accomplished by half-sampling with replacement at the cluster level, rather than at the individual observation level. In order to alleviate issues associated with clusters of wildly different sizes, the clustering works as follows. If samples_per_cluster is not specified, we set it equal to the size of the smallest cluster. When sampling for the bootstrap, we half-sample by clusters. Rather than taking all observations in a sampled cluster, we take a sample of size samples_per_cluster from the selected cluster. Subsampling also occurs along cluster boundaries.
We only make this adjustment to sampling when calculating standard errors. Point estimate calculations are identical to the non-cluster case.
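To illustrate the samples_per_cluster default described above, a minimal sketch (my own illustration of the rule, not grf's actual implementation; the null-marker return value follows the review discussion):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Default samples_per_cluster to the size of the smallest cluster, so every
// sampled cluster can contribute the same number of observations.
size_t default_samples_per_cluster(const std::vector<size_t>& clusters) {
  if (clusters.empty()) {
    return 0;  // null marker: clustering is disabled
  }
  size_t num_clusters = *std::max_element(clusters.begin(), clusters.end()) + 1;
  std::vector<size_t> cluster_sizes(num_clusters, 0);
  for (size_t cluster : clusters) {
    cluster_sizes[cluster]++;
  }
  // Find the smallest cluster that actually appears in the data.
  size_t smallest = clusters.size();
  for (size_t size : cluster_sizes) {
    if (size > 0 && size < smallest) {
      smallest = size;
    }
  }
  return smallest;
}
```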