R package GBM cannot handle sparse matrices (xgboost can) #525
Comments
Thanks @szilard. This should be an easy fix. We got a little too aggressive on the type checking. ;-) |
Sounds great :) |
@szilard We just chatted about this and @terrytangyuan pointed out that reticulate does not yet support sparse matrices, so we will have to patch reticulate to fix this. |
Hi, It seems the R package does not support matrix of type One potential solution is to cast the matrix to type |
@navdeep-G I think it would be best to wait for the fix to reticulate instead of allowing people to assume that we are fully supporting sparse matrices when we don't. |
@ledell Yes, I meant a user can cast themselves. I'll update my response to make that clear. |
Ah, OK. Btw, I'm gonna give it a try in Python, that should be able to take advantage of sparse representation, right? |
@szilard Yes, that should be able to pass the data. If it doesn't please let us know in another issue and I can investigate. Thx! |
Here are the results from Python (using sparse matrices in 1-hot encoding): h2o4gpu: 12sec h2o4gpu code: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.py |
@szilard Is this a CPU result? BTW, I am adding support for sparse matrices in reticulate. Will update this issue when it's ready. |
No, all timings in this GitHub issue are on GPU (Tesla V100 on EC2 p3.2xlarge) |
Update: support for sparse matrices has been added in reticulate. I'll start working on this issue soon. |
@szilard I've added support for this. Could you try it out? |
I reinstalled the reticulate and h2o4gpu R packages but I still get:
|
@szilard Try restarting your R/RStudio session? |
I'm just using R from the terminal. Do I need to reinstall the python part? |
Ah, the latest h2o4gpu from github installs reticulate from CRAN:
I reinstalled reticulate manually from GitHub, now it works :) |
Cool, it now runs much faster. With sparse matrices: h2o4gpu from R: 15 sec. My guess is that h2o4gpu R calls h2o4gpu python, which in turn calls xgboost, right? It seems there is some overhead from the wrapping (probably copying data or CPU/GPU transfer issues). Anyway, it's probably good enough. Maybe I can also run it on larger data to see whether the overhead becomes relatively smaller. |
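To put the timings reported in this thread side by side, here is a small arithmetic sketch. The numbers are the 1M-record GPU timings quoted in the issue (48, 15, 26 and 8 seconds); the dictionary keys and the `ratio` helper are illustrative names only, and the "wrapper overhead" ratio is just one rough way to read the numbers, not a profiling result.

```python
# Timings in seconds for 1M records, as reported in this issue (GPU runs from R).
timings = {
    "h2o4gpu_nonsparse": 48,
    "h2o4gpu_sparse": 15,
    "xgboost_nonsparse": 26,
    "xgboost_sparse": 8,
}


def ratio(slow: str, fast: str) -> float:
    """Return how many times longer `slow` took than `fast`."""
    return timings[slow] / timings[fast]


# Speedup each library gets from switching to sparse input.
h2o4gpu_speedup = ratio("h2o4gpu_nonsparse", "h2o4gpu_sparse")
xgboost_speedup = ratio("xgboost_nonsparse", "xgboost_sparse")

# Remaining gap between h2o4gpu (R) and plain xgboost, both on sparse input.
wrapper_overhead = ratio("h2o4gpu_sparse", "xgboost_sparse")

print(f"h2o4gpu sparse speedup: {h2o4gpu_speedup:.2f}x")
print(f"xgboost sparse speedup: {xgboost_speedup:.2f}x")
print(f"h2o4gpu vs xgboost (sparse): {wrapper_overhead:.2f}x")
```

Both libraries gain a similar factor from sparse input, so the remaining gap looks like a roughly constant multiple on this dataset; whether it shrinks on larger data is exactly what the larger run is meant to check.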
For 10M records: h2o4gpu R: 88 s (all of the above using sparse matrices). I can see that h2o4gpu (both R and py) spends a good while doing something on the CPU (on 1 core only) before the GPU gets busy. Any idea what that overhead is? xgboost doesn't show this behavior; it starts using the GPU almost right away. Code: |
Thanks. It's probably extra copies of the data, like you mentioned. It should be a simple matter to clean up such things. We'll get to work on it. |
Cool @pseudotensor , let me know if updated, so I can re-run it. |
Since sparse matrices have been implemented and this issue is already closed, I moved the remaining performance problem into a new GitHub issue #549 |
Great. Thanks! |
Train on airline data, 1M records
GBM n_estimators = 100L, max_depth = 10L, learning_rate = 0.1
h2o4gpu in R cannot handle sparse matrices (this makes it slower and makes it use more memory)
(xgboost in R can)
h2o4gpu (non-sparse) 48sec
xgboost non-sparse 26sec
xgboost sparse 8sec
(sparse uses Matrix::sparse.model.matrix while non-sparse uses model.matrix for 1-hot encoding)
AUC is the same for all
h2o4gpu code: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.R
xgboost code: https://github.com/szilard/GBM-perf/blob/master/gpu/run/2-xgboost.R
data and setup here: https://github.com/szilard/GBM-perf
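As a rough illustration of why the sparse 1-hot encoding is so much cheaper, the sketch below builds a one-hot encoding directly in CSR form (row pointers, column indices, values) using only the Python standard library. The function `one_hot_csr` and the toy rows are hypothetical and not taken from the benchmark code; in practice one would use `Matrix::sparse.model.matrix` in R or a sparse-matrix library in Python. Unlike `model.matrix`, this sketch keeps every level of every column (no reference level is dropped).

```python
from typing import Dict, List, Sequence, Tuple


def one_hot_csr(
    rows: Sequence[Sequence[str]],
) -> Tuple[List[int], List[int], List[float], int]:
    """One-hot encode categorical rows into CSR arrays.

    Returns (indptr, indices, data, n_cols), where n_cols is the number
    of one-hot columns produced.
    """
    # Map each (column position, level) pair to an output column index.
    column_of: Dict[Tuple[int, str], int] = {}
    for row in rows:
        for j, level in enumerate(row):
            column_of.setdefault((j, level), len(column_of))

    indptr: List[int] = [0]
    indices: List[int] = []
    data: List[float] = []
    for row in rows:
        # Each original column contributes exactly one non-zero per row,
        # so only those entries are stored.
        indices.extend(sorted(column_of[(j, level)] for j, level in enumerate(row)))
        data.extend([1.0] * len(row))
        indptr.append(len(indices))
    return indptr, indices, data, len(column_of)


# Toy rows standing in for categorical features (made-up values).
rows = [
    ("AA", "SFO", "Mon"),
    ("UA", "JFK", "Tue"),
    ("AA", "JFK", "Wed"),
    ("DL", "SFO", "Mon"),
]

indptr, indices, data, n_cols = one_hot_csr(rows)
dense_cells = len(rows) * n_cols  # cells a dense matrix would hold
stored_cells = len(data)          # non-zeros the CSR layout holds
print(f"columns={n_cols} dense_cells={dense_cells} stored_nonzeros={stored_cells}")
```

The number of stored values grows with rows times original columns, while the dense layout grows with rows times total levels, so the gap widens as high-cardinality categorical columns are added.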