R package GBM cannot handle sparse matrices (xgboost can) #525

szilard · 2018-03-28T21:57:40Z

OS platform: Linux Ubuntu 16.04
Installed from (source or binary): Python wheel + R package
Version: 0.2-nccl-cuda9/h2o4gpu-0.2.0-cp36-cp36m-linux_x86_64.whl
Python version (optional): 3.6
CUDA/cuDNN version: 9.0
GPU model (optional): EC2 p3.2xlarge (Tesla V100)
CPU model: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
RAM available: 60GB

Train on airline data, 1M records
GBM n_estimators = 100L, max_depth = 10L, learning_rate = 0.1

h2o4gpu in R cannot handle sparse matrices (this makes it slower and use more memory):

Error in resolve_model_input(x) :
  Input x of type "dgCMatrix" is not currently supported.

(xgboost in R can)

h2o4gpu (non-sparse) 48sec
xgboost non-sparse 26sec
xgboost sparse 8sec

(sparse uses Matrix::sparse.model.matrix while non-sparse uses model.matrix for 1-hot encoding)
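
To illustrate why the sparse encoding is cheaper: dgCMatrix is the Matrix package's compressed sparse column (CSC) class, which stores only the non-zero entries. Below is a minimal pure-Python sketch of 1-hot encoding a single categorical column into CSC arrays. The function name `one_hot_csc` and the toy `carriers` column are hypothetical, and this is not how sparse.model.matrix is actually implemented.

```python
def one_hot_csc(values):
    """One-hot encode a list of category labels into CSC arrays.

    Returns (levels, data, indices, indptr), loosely mirroring the
    x / i / p slots of an R dgCMatrix. Illustrative sketch only.
    """
    levels = sorted(set(values))
    rows_per_level = {level: [] for level in levels}
    for row, value in enumerate(values):
        rows_per_level[value].append(row)

    data, indices, indptr = [], [], [0]
    for level in levels:                 # one output column per level
        rows = rows_per_level[level]
        indices.extend(rows)             # row index of each non-zero
        data.extend([1.0] * len(rows))   # every non-zero is a 1
        indptr.append(len(indices))      # where the next column starts
    return levels, data, indices, indptr


carriers = ["AA", "UA", "AA", "DL", "UA", "AA"]  # made-up toy column
levels, data, indices, indptr = one_hot_csc(carriers)

dense_cells = len(carriers) * len(levels)
print("levels: ", levels)
print("indices:", indices)
print("indptr: ", indptr)
print("stored values:", len(data), "of", dense_cells, "dense cells")
```

With 1-hot encoding each row contributes exactly one non-zero per categorical variable, so the stored size grows with the number of rows rather than rows times levels.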

AUC is the same for all

h2o4gpu code: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.R
xgboost code: https://github.com/szilard/GBM-perf/blob/master/gpu/run/2-xgboost.R

data and setup here: https://github.com/szilard/GBM-perf


ledell · 2018-03-28T22:24:11Z

Thanks @szilard. This should be an easy fix. We got a little too aggressive on the type checking. ;-)

szilard · 2018-03-28T22:25:52Z

Sounds great :)

ledell · 2018-03-28T22:39:01Z

@szilard We just chatted about this and @terrytangyuan pointed out that reticulate does not yet support sparse matrices, so we will have to patch reticulate to fix this.

navdeep-G · 2018-03-28T22:39:06Z

Hi,

It seems the R package does not currently support matrices of type dgCMatrix. This is because reticulate itself does not handle this type of matrix.

One potential solution is to cast the matrix to a dense matrix with as.matrix(m) before passing it to h2o4gpu, but that might incur a performance hit.
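
For a rough sense of that hit, the sketch below estimates the memory of a dense double matrix against CSC storage. It assumes 8-byte doubles and 4-byte integer indices (as in R's dgCMatrix) and ignores object overhead. The shape and the non-zero count are hypothetical, not measured from the airline data, and the helper names are made up for illustration.

```python
def dense_bytes(n_rows, n_cols):
    """Bytes for a dense matrix of 8-byte doubles."""
    return n_rows * n_cols * 8


def csc_bytes(n_cols, nnz):
    """Bytes for CSC storage: one double and one int row index per
    non-zero, plus n_cols + 1 int column pointers."""
    return nnz * (8 + 4) + (n_cols + 1) * 4


# Hypothetical shape: 1M rows, 500 one-hot columns, 8 non-zeros per row.
n_rows, n_cols = 1_000_000, 500
nnz = n_rows * 8

dense = dense_bytes(n_rows, n_cols)
sparse = csc_bytes(n_cols, nnz)
print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
print(f"ratio:  {dense / sparse:.1f}x")
```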

cc @ledell @terrytangyuan

ledell · 2018-03-28T22:40:54Z

@navdeep-G I think it would be best to wait for the fix to reticulate instead of allowing people to assume that we are fully supporting sparse matrices when we don't.

navdeep-G · 2018-03-28T22:43:32Z

@ledell Yes, I meant a user can cast themselves. I'll update my response to make that clear.

szilard · 2018-03-28T22:49:15Z

Ah, OK.

Btw, I'm gonna give it a try in Python, that should be able to take advantage of sparse representation, right?

navdeep-G · 2018-03-28T22:52:58Z

@szilard Yes, that should be able to pass the data. If it doesn't please let us know in another issue and I can investigate. Thx!

szilard · 2018-03-29T11:50:39Z

Here are the results from Python (using sparse matrices in 1-hot encoding):

h2o4gpu: 12sec
xgboost: 8 sec

h2o4gpu code: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.py
xgboost code: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/xgboost.py

terrytangyuan · 2018-03-29T17:26:34Z

@szilard Is this a CPU result? BTW, I am adding support for sparse matrices in reticulate. Will update this issue when it's ready.

szilard · 2018-03-29T17:43:50Z

No, all timings in this GitHub issue are on GPU (Tesla V100 on EC2 p3.2xlarge)

terrytangyuan · 2018-04-02T17:17:10Z

Update: support for sparse matrices has been added in reticulate. I'll start working on this issue soon.

terrytangyuan · 2018-04-06T12:11:09Z

@szilard I've added support for this. Could you try it out?

szilard · 2018-04-11T19:53:42Z

I reinstalled the reticulate and h2o4gpu R packages but I still get:

  Input x of type "dgCMatrix" is not currently supported.

terrytangyuan · 2018-04-11T19:56:13Z

@szilard Try restarting your R/RStudio session?

szilard · 2018-04-11T20:46:48Z

I'm just using R from the terminal. Do I need to reinstall the python part?

szilard · 2018-04-11T20:57:38Z

Ah, the latest h2o4gpu from github installs reticulate from CRAN:

> devtools::install_github("h2oai/h2o4gpu", subdir = "src/interface_r")
Downloading GitHub repo h2oai/h2o4gpu@master
from URL https://api.github.com/repos/h2oai/h2o4gpu/zipball/master
Installing h2o4gpu
trying URL 'https://cloud.r-project.org/src/contrib/reticulate_1.6.tar.gz'

I reinstalled reticulate manually from GitHub, now it works :)

szilard · 2018-04-11T21:10:00Z

Cool, it now runs much faster. With sparse matrices:

h2o4gpu from R: 15 sec
h2o4gpu from python: 12 sec
xgboost: 8 sec

My guess is that h2o4gpu R calls h2o4gpu python, which in turn calls xgboost, right? It seems there is some overhead from the wrapping (probably copying data or CPU/GPU transfer issues). Anyway, it's probably good enough. Maybe I can also try running it on larger data to see whether the overhead is relatively smaller or not.

szilard · 2018-04-11T21:51:34Z

For 10M records:

h2o4gpu R: 88 s
h2o4gpu py: 66 s
xgb R: 26 s
xgb py: 26 s

(all of the above using sparse matrices)

I can see that h2o4gpu (both R and py) spends a good while doing something on the CPU (on 1 core only) before the GPU gets busy. Any idea what that overhead is? xgboost doesn't show this behavior; it starts using the GPU almost right away.

Code:
h2o4gpu R: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.R
h2o4gpu py: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/run.py
xgboost R: https://github.com/szilard/GBM-perf/blob/master/gpu/run/2-xgboost.R
xgboost py: https://github.com/szilard/GBM-perf/blob/master/wip-testing/h2o4gpu/xgboost.py
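
To put the two sets of sparse timings side by side, here is a small sketch that computes each h2o4gpu run's slowdown relative to xgboost. The numbers are the ones reported in this thread; the dictionary layout and the `slowdown` helper are just illustrative.

```python
# Sparse-matrix GPU timings reported in this thread, in seconds.
timings_sec = {
    "1M": {"h2o4gpu R": 15, "h2o4gpu py": 12, "xgboost": 8},
    "10M": {"h2o4gpu R": 88, "h2o4gpu py": 66, "xgboost": 26},
}


def slowdown(size, tool, baseline="xgboost"):
    """Ratio of a tool's run time to the baseline's for one data size."""
    return timings_sec[size][tool] / timings_sec[size][baseline]


for size in timings_sec:
    for tool in ("h2o4gpu R", "h2o4gpu py"):
        print(f"{size:>3} {tool:<11} {slowdown(size, tool):.2f}x xgboost")
```

On these numbers the ratio to xgboost is larger at 10M records than at 1M, so the wrapper overhead does not appear to shrink relative to training time as the data grows.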

pseudotensor · 2018-04-11T21:53:59Z

Thanks. It's probably extra copies of the data, like you mentioned. It should be a simple matter to clean up such things. We'll get to work on it.

szilard · 2018-04-11T21:55:58Z

Cool @pseudotensor, let me know when it's updated so I can re-run it.

szilard · 2018-04-11T22:02:52Z

Since sparse matrices have been implemented and this issue is already closed, I moved the remaining performance problem into a new GitHub issue #549

terrytangyuan · 2018-04-12T01:31:22Z

Great. Thanks!

szilard mentioned this issue Mar 28, 2018

h2o4gpu results szilard/GBM-perf#7

Open

ledell added the R label Mar 28, 2018

ledell assigned ledell and terrytangyuan Mar 28, 2018

terrytangyuan mentioned this issue Apr 5, 2018

Fixes #525: Support sparse feature matrix for xgboost models #537

Merged

terrytangyuan closed this as completed in 8c0bf80 Apr 6, 2018

terrytangyuan added this to the 0.2.1 milestone Apr 6, 2018

szilard mentioned this issue Apr 11, 2018

h2o4gpu GBM slower than xgboost #549

Open
