Does causal_forest support a sparse.model.matrix? #99

Closed
jeffwong opened this issue Jul 12, 2017 · 7 comments

@jeffwong

I am interested in using causal_forest with a treatment variable, W, and categorical covariates, X. I would like to generate the covariates with sparse.model.matrix(~ ., data = X), since the one-hot encoding will be very sparse.

I believe the training function ultimately calls this C++ function: https://github.com/swager/grf/blob/15c7af3b1cf39fc1ae231fe325a8d800ad78ff14/r-package/grf/bindings/RegressionForestBindings.cpp#L12. The Rcpp NumericMatrix class is for dense matrices. Is it possible to support sparse data here?

@jtibshirani
Member

Unfortunately this isn't currently possible. It seems very useful and doesn't look too difficult to add though -- hopefully we'll be able to get to it in the next week or so!

@jeffwong
Author

Great! I imagine it could improve performance a lot too.

@jeffwong
Author

jeffwong commented Jul 17, 2017

I was looking at the code again, and I think the only place the input NumericMatrix object is used is in convert_data:

Data* RcppUtilities::convert_data(Rcpp::NumericMatrix input_data,
                                  const std::vector<std::string>& variable_names) {
  size_t num_rows = input_data.nrow();
  size_t num_cols = input_data.ncol();

  Data* data = new Data(input_data.begin(), variable_names, num_rows, num_cols);
  data->sort();
  return data;
}

Do you just need a convert_data that takes a sparse matrix instead? I think the sparse matrix types will still have an iterator like .begin(). If you think it is a simple change, I can help with a PR, although I don't know exactly how else the Data class uses the iterator.

@jtibshirani
Member

Unfortunately the Data constructor takes in an array with one item per element in the matrix, not an iterator. So I think we'll need a new subtype of Data that works based on a sparse matrix (similar to this file from ranger, which is what grf is originally based on: https://github.com/imbs-hl/ranger/blob/master/src/DataSparse.h).
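To make the idea of a sparse-backed Data subtype concrete, here is a minimal sketch of element lookup over compressed sparse column storage, the layout that dgCMatrix uses (its `i`, `p`, and `x` slots). The names `DataView` and `SparseColumnData` are hypothetical and are not grf's or ranger's actual classes; the sketch also assumes row indices are sorted within each column, as they are in a valid dgCMatrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical read-only interface; the real Data class has more members.
class DataView {
public:
  virtual ~DataView() = default;
  virtual double get(std::size_t row, std::size_t col) const = 0;
  virtual std::size_t num_rows() const = 0;
  virtual std::size_t num_cols() const = 0;
};

// Compressed sparse column (CSC) storage:
//   row_indices  - row index of each stored value (dgCMatrix slot `i`)
//   col_pointers - start offset of each column, length num_cols + 1 (slot `p`)
//   values       - the stored non-zero values (slot `x`)
class SparseColumnData : public DataView {
public:
  SparseColumnData(std::vector<int> row_indices,
                   std::vector<int> col_pointers,
                   std::vector<double> values,
                   std::size_t num_rows)
      : row_indices_(std::move(row_indices)),
        col_pointers_(std::move(col_pointers)),
        values_(std::move(values)),
        num_rows_(num_rows) {
    assert(!col_pointers_.empty());
    assert(row_indices_.size() == values_.size());
    assert(static_cast<std::size_t>(col_pointers_.back()) == values_.size());
  }

  // Binary search within one column; entries that are not stored are zero.
  double get(std::size_t row, std::size_t col) const override {
    assert(row < num_rows_ && col < num_cols());
    auto first = row_indices_.begin() + col_pointers_[col];
    auto last = row_indices_.begin() + col_pointers_[col + 1];
    auto it = std::lower_bound(first, last, static_cast<int>(row));
    if (it != last && *it == static_cast<int>(row)) {
      return values_[it - row_indices_.begin()];
    }
    return 0.0;
  }

  std::size_t num_rows() const override { return num_rows_; }
  std::size_t num_cols() const override { return col_pointers_.size() - 1; }

private:
  std::vector<int> row_indices_;
  std::vector<int> col_pointers_;
  std::vector<double> values_;
  std::size_t num_rows_;
};
```

Memory scales with the number of non-zero entries rather than rows times columns, at the cost of a logarithmic lookup per element instead of a constant-time array index.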

If the sparse matrix is primarily meant to handle the one-hot encoding, you could try an alternative approach to categorical variables suggested in ESL: represent the categories as integers 1, ..., n, ordered by their mean outcome. To be true to the recommendation, we should recompute the ordering at every split (and likely use the mean of gradients rather than outcomes), but this may work pretty well in the short term.
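As a rough illustration of the simpler, one-time version of that encoding (a single global ordering by mean outcome, not a fresh ordering at every split), the sketch below maps each category to its rank. The function names are hypothetical, categories are assumed to be labelled 0 to num_categories - 1, and a category with no observations is given a mean of 0 purely as a placeholder choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns codes[c] in 1..n: the rank of category c when categories are
// sorted by ascending mean outcome. Ties keep their original label order.
std::vector<int> rank_categories_by_mean_outcome(
    const std::vector<int>& categories,
    const std::vector<double>& outcomes,
    std::size_t num_categories) {
  assert(categories.size() == outcomes.size());

  std::vector<double> sums(num_categories, 0.0);
  std::vector<std::size_t> counts(num_categories, 0);
  for (std::size_t i = 0; i < categories.size(); ++i) {
    assert(categories[i] >= 0);
    std::size_t c = static_cast<std::size_t>(categories[i]);
    assert(c < num_categories);
    sums[c] += outcomes[i];
    counts[c] += 1;
  }

  // Assumption: unobserved categories get mean 0.0.
  std::vector<double> means(num_categories, 0.0);
  for (std::size_t c = 0; c < num_categories; ++c) {
    if (counts[c] > 0) {
      means[c] = sums[c] / static_cast<double>(counts[c]);
    }
  }

  std::vector<std::size_t> order(num_categories);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&means](std::size_t a, std::size_t b) {
                     return means[a] < means[b];
                   });

  std::vector<int> codes(num_categories, 0);
  for (std::size_t rank = 0; rank < order.size(); ++rank) {
    codes[order[rank]] = static_cast<int>(rank) + 1;
  }
  return codes;
}

// Replaces each category label with its ordered code, giving a single
// numeric column in place of n one-hot columns.
std::vector<double> encode_column(const std::vector<int>& categories,
                                  const std::vector<int>& codes) {
  std::vector<double> encoded;
  encoded.reserve(categories.size());
  for (int c : categories) {
    encoded.push_back(static_cast<double>(codes.at(static_cast<std::size_t>(c))));
  }
  return encoded;
}
```

The resulting column is dense but only one column wide, so it sidesteps the sparse-matrix requirement entirely for this use case.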

@jtibshirani
Member

Closing, as we've now added support for passing in a sparse matrix of type dgCMatrix.

This should help cut down on memory usage, but we still have a lot of work to do to improve speed when there are many features. I also wanted to note that we now set a much more reasonable default for mtry, the number of variables considered at each split (#121). So it's worth upgrading to release v0.9.4 or setting mtry explicitly.

@laixx214

Hi, I am wondering whether support for sparse matrix inputs to the grf functions is still being worked on. I am using a moderately large dataset (~5G) and found that causal_forest uses a lot of memory. Being able to pass a sparse matrix could help reduce memory consumption and speed up processing. Any information would be highly appreciated. Thanks a lot!

@erikcs
Member

erikcs commented Jan 19, 2024

Hi @laixx214, see #939
