Add support for factor variables. #109

Closed
jtibshirani opened this issue Jul 23, 2017 · 5 comments
Labels: feature, requires research

Comments

@jtibshirani
Member

jtibshirani commented Jul 23, 2017

We should likely implement the approach suggested in ESL, where in each node the levels of a factor variable are ordered by their mean outcome before performing the split. This should be properly generalized to handle non-regression forests.
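
As a rough illustration of the ordering idea for the regression case, here is a minimal sketch in plain Python. It is not grf code: `best_factor_split` and its arguments are hypothetical names, and squared error is assumed as the split criterion. The levels are sorted by mean outcome, so only the K-1 cuts along that ordering are scanned rather than every subset of levels.

```python
from collections import defaultdict


def best_factor_split(categories, outcomes):
    """Return (left_levels, sse) for the best binary split of one factor.

    Hypothetical helper for illustration; assumes a squared-error criterion.
    """
    # Collect the outcomes observed for each level of the factor.
    groups = defaultdict(list)
    for c, y in zip(categories, outcomes):
        groups[c].append(y)

    # Order the levels by their mean outcome in this node.
    ordered = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))

    def sse(values):
        # Sum of squared deviations from the mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_levels, best_sse = None, float("inf")
    # Scan only the cuts that respect the ordering.
    for k in range(1, len(ordered)):
        left = [y for c in ordered[:k] for y in groups[c]]
        right = [y for c in ordered[k:] for y in groups[c]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_levels, best_sse = set(ordered[:k]), total
    return best_levels, best_sse


levels, error = best_factor_split(
    ["a", "a", "b", "b", "c", "c"],
    [1.0, 1.2, 5.0, 5.2, 1.1, 0.9],
)
```

In this toy input, levels "a" and "c" have similar means and end up on the same side of the split. How to define the ordering for non-regression forests is the open question raised above and is not addressed by this sketch.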

@jtibshirani jtibshirani changed the title Add support for factor variables Add support for factor variables. Jul 23, 2017
@lwu9

lwu9 commented Mar 26, 2018

When there is a categorical variable X1 in X, it is possible that a child node contains observations from only one category, i.e. all values of X1 in that node are the same. In this case, how can we get the pseudo-outcomes? They are obtained through the inverse of Ap, which may be singular, isn't it?

@swager
Member

swager commented Mar 28, 2018

The X-values aren't used to compute the pseudo-outcomes in the leaf in the standard GRF formulation; rather, only the "outcomes" matter (e.g., W and Y for causal_forest). The features X enter into the problem by determining which leaf an observation falls into.
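
To make this concrete, here is a minimal sketch of the pseudo-outcome computation for one parent node. It assumes the causal-forest form with a scalar treatment, rho_i = (W_i - W_bar)(Y_i - Y_bar - (W_i - W_bar) * beta) / Ap with Ap = mean((W_i - W_bar)^2). The function name is hypothetical and this is not the grf implementation. Note that the function takes only W and Y for the observations in the node; X never appears.

```python
def causal_pseudo_outcomes(w, y):
    """Pseudo-outcomes for one node from treatments w and outcomes y.

    Hypothetical helper for illustration; assumes a scalar treatment.
    """
    n = len(w)
    w_bar = sum(w) / n
    y_bar = sum(y) / n
    wc = [wi - w_bar for wi in w]  # centered treatments
    yc = [yi - y_bar for yi in y]  # centered outcomes

    # Ap is the average squared centered treatment (a scalar here).
    a_p = sum(v * v for v in wc) / n
    if a_p == 0.0:
        # Ap is singular only when W does not vary inside the node.
        raise ValueError("no treatment variation in this node")

    # Least-squares slope of centered Y on centered W within the node.
    beta = sum(a * b for a, b in zip(wc, yc)) / (n * a_p)

    return [a * (b - a * beta) / a_p for a, b in zip(wc, yc)]


rho = causal_pseudo_outcomes([0, 1, 0, 1], [1.0, 3.0, 2.0, 5.0])
```

Under these assumptions, a node in which a categorical feature takes a single value does not by itself make Ap singular; what matters is whether W varies among the observations in the node.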

@lwu9

lwu9 commented Mar 28, 2018

Thanks for your reply @swager! I understand that in your causal_forest, the pseudo-outcomes only involve W and Y. But if we want to do local linear regression, our psi should be:
psi(Y_i) = Y_i - theta * X_i (where theta depends on the query point x), shouldn't it? So when we calculate Ap, we need to take the derivative of psi w.r.t. theta, and the result will include the features X. Is there anything wrong with my understanding?
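
The concern can be illustrated with a small sketch. It assumes a least-squares score with an intercept and a single feature X1, so that Ap is, up to sign, the Gram matrix (1/n) * sum z_i z_i^T with z_i = (1, X1_i). This setup is an assumption made for illustration and is not taken from grf; the helper names are hypothetical.

```python
def gram_2x2(x1):
    """Gram matrix (1/n) * sum z_i z_i^T for z_i = (1, x1_i)."""
    n = len(x1)
    m1 = sum(x1) / n                 # mean of x1
    m2 = sum(v * v for v in x1) / n  # mean of x1 squared
    return [[1.0, m1], [m1, m2]]


def det_2x2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# All observations in the node share one category: zero determinant.
constant_node = det_2x2(gram_2x2([1.0, 1.0, 1.0]))
# Two categories present in the node: the matrix is invertible.
mixed_node = det_2x2(gram_2x2([0.0, 1.0, 0.0, 1.0]))
```

The determinant equals the within-node variance of X1, so under this assumed score the matrix cannot be inverted when every observation in the node has the same category.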

@jtibshirani jtibshirani added the help wanted label May 28, 2018
@jtibshirani jtibshirani added the requires research and feature labels and removed the help wanted label Dec 9, 2018
@jtibshirani
Member Author

We added the sufrep package, which contains a collection of methods for handling categorical variables, and a tutorial for how to use sufrep with grf.

@lwu9

lwu9 commented Feb 23, 2020

Thanks
