Output tree$nodes[[i]]$samples #258

Closed
predt opened this issue Jul 10, 2018 · 5 comments

Comments

@predt

predt commented Jul 10, 2018

Hello @jtibshirani
Quick question: for a given terminal node "i" of a tree (i.e. a leaf), does the output tree$nodes[[i]]$samples correspond to the observations from the training subsample used to build the tree (i.e. J1 in the paper) that fall in that leaf, or to the observations from the other subsample (J2) that fall in that leaf?
Thanks!

@predt I'm sorry I missed your question earlier! Would you be able to open a new issue with this question, and I will add a detailed answer there? Keeping each issue scoped to one topic helps ensure that other users with the same question will be able to find the answer as well. To answer briefly, that vector only contains examples from the second subsample (J2).

Thanks, @jtibshirani. Since tree$nodes[[i]]$samples corresponds to J2, the complement in "drawn_samples" should give me the set of samples in J1. Is that correct?
I'm working on the appendix of an application of GRF, and I'm using an example tree figure to make the explanation of how a tree is built more pedagogical. I wanted to add the theta.hat.P values that result from splitting a node (theta.hat.P is the notation in the paper) to illustrate how splits favor heterogeneity in the context of a generalized causal forest. That is why I'm looking for the J1 samples. Thanks.

@jtibshirani
Member

You're right, drawn_samples will include all samples that went into constructing the tree. If honesty is enabled, this set includes both the samples used to perform splits (J1), and the samples that populate the leaf nodes (J2). If honesty is not enabled, these two sets are the same, and drawn_samples will be equal to the union of all samples in the leaf nodes.

I've kept this issue open and tagged it with 'documentation', so we remember to add an explanation to get_tree about the different list elements that are returned.
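As a rough illustration of the non-honest case described above, the sketch below checks that the samples stored in the leaves account for every entry in drawn_samples. It uses a hand-built mock list rather than a fitted forest: the field names `drawn_samples` and `nodes[[i]]$samples` come from this thread, while the sample indices and the helper `leaf_samples` are made up for the example.

```r
# Mock tree mirroring the two fields discussed in this thread. The values are
# invented for illustration; a real tree would come from get_tree().
tree_no_honesty <- list(
  drawn_samples = c(2, 5, 7, 9, 11, 14),
  nodes = list(
    list(),                        # internal node: no samples stored
    list(samples = c(2, 7, 11)),   # leaf
    list(samples = c(5, 9, 14))    # leaf
  )
)

# Hypothetical helper: collect the samples stored in every node. Nodes
# without a `samples` element contribute nothing, since unlist() drops NULLs.
leaf_samples <- function(tree) {
  unique(unlist(lapply(tree$nodes, function(node) node$samples)))
}

# Without honesty, the leaves should account for every drawn sample.
covers_all <- setequal(leaf_samples(tree_no_honesty),
                       tree_no_honesty$drawn_samples)
print(covers_all)
```

With honesty enabled, the same check would be expected to fail, because drawn_samples would also contain the samples used only for splitting.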

@susanathey
Collaborator

It would be better to keep track of which is which (J1 and J2) for the use case of working with the results from a single tree; this may also matter for different methods of calculating standard errors.

@jtibshirani
Member

@susanathey to clarify the exchange above: because you have access to both the leaf samples of a tree and the overall 'drawn samples' for that tree, both J1 and J2 can be calculated fairly easily. In particular, J2 is the union of nodes[[i]]$samples over all leaf nodes, and J1 is the difference between drawn_samples and J2.

My intuition is that unless accessing both J1 and J2 is part of a common (and performance-sensitive) workflow, we shouldn't return those sets separately, since doing so would duplicate the same set in J1 and J2 when honesty isn't enabled. Let me know if that seems off.
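The set computation described in this comment could be sketched roughly as follows. This is a minimal example on a mock honest tree, not code from the package: `drawn_samples` and `nodes[[i]]$samples` are the fields named in this thread, whereas `split_honest_samples`, `honest_tree`, and the sample indices are hypothetical.

```r
# Hypothetical honest tree: drawn_samples holds J1 and J2 together, while
# the leaves hold only J2. Values are invented for illustration.
honest_tree <- list(
  drawn_samples = c(1, 3, 4, 6, 8, 10, 12, 15),
  nodes = list(
    list(),                      # internal node: no samples stored
    list(samples = c(3, 8)),     # leaf
    list(samples = c(6, 15))     # leaf
  )
)

split_honest_samples <- function(tree) {
  # J2: union of the samples stored in the leaf nodes.
  j2 <- unique(unlist(lapply(tree$nodes, function(node) node$samples)))
  # J1: drawn samples that never appear in a leaf.
  j1 <- setdiff(tree$drawn_samples, j2)
  list(J1 = j1, J2 = j2)
}

subsamples <- split_honest_samples(honest_tree)
print(subsamples$J1)
print(subsamples$J2)
```

With a fitted forest, the object returned by get_tree would take the place of the mock list, assuming it exposes the same two fields. When honesty is not enabled, J1 computed this way comes out empty, because every drawn sample also appears in a leaf.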

@susanathey
Collaborator

@jtibshirani Sorry, I misunderstood. Maybe we can post a code sample and/or add it to our testing or demo code for users who might want to access these sets.

@jtibshirani
Member

I've updated the documentation in #268.
