Does causal_forest support a sparse.model.matrix? #99
Comments
Unfortunately this isn't currently possible. It seems very useful and doesn't look too difficult to add though -- hopefully we'll be able to get to it in the next week or so!
Great! I imagine it could improve performance a lot too.
I was looking at the code again, and I think the only place the input NumericMatrix object is used is in convert_data.
Do you just need a convert_data that takes a sparse matrix instead? I think the sparse matrix types will still have an iterator like .begin(). If you think it is a simple change, I can even help with a PR, although I don't know exactly how else the class
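To make the question above concrete, here is a minimal sketch of how a data accessor could read entries directly from a sparse input without expanding it to a dense buffer. It assumes a compressed sparse column (CSC) layout with row indices sorted within each column. The `SparseColumns` struct and `get_entry` function are hypothetical names for illustration and are not taken from grf or Rcpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container mirroring a compressed sparse column (CSC) layout.
// The field names are illustrative only and do not come from grf.
struct SparseColumns {
  std::size_t num_rows;
  std::size_t num_cols;
  // Size num_cols + 1; the stored entries of column j occupy
  // positions [col_start[j], col_start[j + 1]).
  std::vector<std::size_t> col_start;
  std::vector<std::size_t> row_index;  // row of each stored entry, sorted within a column
  std::vector<double> value;           // value of each stored entry
};

// Look up a single entry; anything that is not stored is an implicit zero.
double get_entry(const SparseColumns& m, std::size_t row, std::size_t col) {
  assert(row < m.num_rows && col < m.num_cols);
  assert(m.col_start.size() == m.num_cols + 1);
  auto first = m.row_index.begin() + static_cast<std::ptrdiff_t>(m.col_start[col]);
  auto last = m.row_index.begin() + static_cast<std::ptrdiff_t>(m.col_start[col + 1]);
  // Binary search for the requested row among this column's stored rows.
  auto it = std::lower_bound(first, last, row);
  if (it == last || *it != row) {
    return 0.0;
  }
  return m.value[static_cast<std::size_t>(it - m.row_index.begin())];
}
```

Under this layout a one-hot column stores only the rows where the indicator is 1, so a lookup costs a binary search over that column's stored rows rather than a direct index into a dense array.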
Unfortunately the If the sparse matrix is primarily to handle the one-hot encoding, you could try an alternate approach to handling categorical variables suggested in ESL: represent the categories as integers 1 .. n, with the categories sorted by their mean outcome. To be true to the recommendation, we should perform a new ordering at every split (and likely take the mean of gradients rather than outcomes), but this may work pretty well in the short term.
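The ordering trick described above can be sketched as follows. This is a simplified, one-time encoding computed from raw outcomes before training. It does not re-order at every split or use gradients, which the comment notes would be closer to the ESL recommendation. The function name `rank_categories_by_mean_outcome` is hypothetical and not part of grf.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Map each category label to an integer code 1..n, ordered by the mean
// outcome observed for that category (the lowest mean gets code 1).
std::map<std::string, int> rank_categories_by_mean_outcome(
    const std::vector<std::string>& categories,
    const std::vector<double>& outcomes) {
  assert(categories.size() == outcomes.size());

  // Accumulate the sum and count of outcomes per category.
  std::map<std::string, std::pair<double, std::size_t>> totals;
  for (std::size_t i = 0; i < categories.size(); ++i) {
    auto& entry = totals[categories[i]];
    entry.first += outcomes[i];
    entry.second += 1;
  }

  // Sort by mean outcome; ties fall back to label order so the result is deterministic.
  std::vector<std::pair<double, std::string>> means;
  for (const auto& kv : totals) {
    means.emplace_back(kv.second.first / static_cast<double>(kv.second.second), kv.first);
  }
  std::sort(means.begin(), means.end());

  // Assign codes 1..n in sorted order.
  std::map<std::string, int> codes;
  for (std::size_t rank = 0; rank < means.size(); ++rank) {
    codes[means[rank].second] = static_cast<int>(rank) + 1;
  }
  return codes;
}
```

The resulting codes replace the categorical column with a single numeric column, so a variable with n levels takes one feature instead of n indicator columns.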
Closing, as we've now added support for passing in a sparse matrix of type This should help cut down on memory usage, but we still have a lot of work to do to improve speed when there is a large number of features. I also wanted to note that we now set a much more reasonable default for mtry, the number of parameters to consider in each split (#121). So it's worth upgrading to release v0.9.4, or setting mtry explicitly.
Hi, I am wondering whether support for sparse matrices as inputs to the grf functions is still being implemented. I am using a moderately large dataset (~5G) and realized that causal_forest takes a lot of memory. Being able to use a sparse matrix could help reduce memory consumption and speed up processing. Any information will be highly appreciated! Thanks a lot!
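As a rough back-of-the-envelope check on the memory argument, the sketch below compares dense storage with compressed sparse column (CSC) storage. It assumes 8-byte doubles and 4-byte integer indices. Both function names are hypothetical, and the numbers say nothing about how much memory grf itself uses during training.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(double) == 8, "the estimate below assumes 8-byte doubles");

// Bytes needed to store every entry of a dense rows x cols matrix of doubles.
std::size_t dense_bytes(std::size_t rows, std::size_t cols) {
  return rows * cols * sizeof(double);
}

// Bytes needed for CSC storage: one double plus one 4-byte row index per
// stored entry, plus cols + 1 column offsets of 4 bytes each.
std::size_t csc_bytes(std::size_t rows, std::size_t cols, std::size_t stored_entries) {
  assert(stored_entries <= rows * cols);
  const std::size_t index_bytes = 4;  // assumed width of an integer index
  return stored_entries * (sizeof(double) + index_bytes) + (cols + 1) * index_bytes;
}
```

For example, a single categorical variable with 50 levels, one-hot encoded over 1,000 rows, has exactly 1,000 nonzero entries: the dense form takes 400,000 bytes, while the CSC form takes 12,204 bytes under these assumptions.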
I am interested in using causal_forest with a treatment variable, W, and covariates, X, that are categorical variables. I would like to use
sparse.model.matrix(~ ., data = X)
to generate the covariates, since the one-hot encoding will be very sparse. I believe the training function ultimately calls this cpp function: https://github.com/swager/grf/blob/15c7af3b1cf39fc1ae231fe325a8d800ad78ff14/r-package/grf/bindings/RegressionForestBindings.cpp#L12. The Rcpp NumericMatrix class is for dense matrices. Is it possible to get this working for sparse data?