
Stack overflow for a not so big task #1738

Closed
ajing opened this issue Mar 27, 2017 · 23 comments

Comments

@ajing commented Mar 27, 2017

I created a task with 298 samples and 22,160 features, then did some resampling for CV. However, I got a stack overflow error. Is there any way to avoid this?

To reproduce the error (according to mb706's comment):

```r
xt = cbind(as.data.frame(matrix(rnorm(298 * 22160), ncol = 22160)),
           as.factor(sample(c("a", "b"), 298, TRUE)))
names(xt)[length(xt)] = "x"
av_task = makeClassifTask("xt", xt, target = "x")
r = resample(learner = makeLearner("classif.svm", predict.type = "prob"),
             task = av_task,
             resampling = makeResampleDesc(method = "CV", iters = 3),
             show.info = FALSE, measures = auc)
```

```
Error: protect(): protection stack overflow
```
@SteveBronder (Contributor) commented Mar 27, 2017

It seems like this is a problem with svm and not mlr. See the Stack Overflow question here.

@larskotthoff (Contributor) commented Mar 27, 2017

And for problems like this, please always post a complete, reproducible example.

@mb706 (Contributor) commented Mar 27, 2017

The problem occurs because mlr calls svm via the formula interface:

```r
f = getTaskFormula(.task)
e1071::svm(f, data = getTaskData(.task, .subset),
           probability = .learner$predict.type == "prob", ...)
```

Instead, we could use

```r
td = getTaskData(.task, .subset, target.extra = TRUE)
e1071::svm(td$data, td$target,
           probability = .learner$predict.type == "prob", ...)
```

which appears to work a bit better at first glance (though I only checked the training step, as I have limited memory right now).

I reproduced it like this:

```r
xt = cbind(as.data.frame(matrix(rnorm(298 * 22160), ncol = 22160)),
           as.factor(sample(c("a", "b"), 298, TRUE)))
names(xt)[length(xt)] = "x"
av_task = makeClassifTask("xt", xt, target = "x")
r = resample(learner = makeLearner("classif.svm", predict.type = "prob"),
             task = av_task,
             resampling = makeResampleDesc(method = "CV", iters = 3),
             show.info = FALSE, measures = auc)
```
@larskotthoff (Contributor) commented Mar 27, 2017

Ah, well spotted! Could you make a PR using the other interface so we can quickly see if that breaks anything else please?

@mb706 (Contributor) commented Mar 27, 2017

I can do that tomorrow. I have seen quite a lot of learners that use formula instead of data.frame. In many of these learners, the .formula generic only goes on to extract the data and then call the regular (.data.frame) generic anyway. Is there a reason mlr prefers formula over data.frame?

@larskotthoff (Contributor) commented Mar 27, 2017

Not as far as I know, but @berndbischl and @mllg would know better.

@mllg (Member) commented Mar 27, 2017

> Is there a reason mlr prefers formula over data.frame?

Not really, just a "habit". Nearly all packages work with a formula interface and only very few provide an alternative. It is usually safe to switch to a character/data.frame interface, but be aware that it is not unlikely that you will discover some bugs, as these are less tested 😞
The speedup can be tremendous, though.
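The overhead mllg describes is easy to see even in base R. A minimal sketch, using `lm()` vs. `lm.fit()` as a stand-in for e1071's svm (the sizes and variable names here are illustrative, not from the thread):

```r
# Illustration of formula-interface overhead (hedged: uses base R's
# lm()/lm.fit(), not e1071::svm(), since the effect is generic).
# With many columns, building and processing a wide formula is itself
# costly; the matrix interface skips that work entirely.
set.seed(1)
n <- 100
p <- 1000
X <- matrix(rnorm(n * p), n, p)
df <- as.data.frame(X)
df$y <- rnorm(n)

# formula interface: one term per column
f <- as.formula(paste("y ~", paste(names(df)[1:p], collapse = " + ")))
t_formula <- system.time(lm(f, data = df))["elapsed"]

# matrix interface: no formula, no terms object, no model.frame
t_matrix <- system.time(lm.fit(cbind(1, X), df$y))["elapsed"]

cat(sprintf("formula: %.3fs, matrix: %.3fs\n", t_formula, t_matrix))
```

On a typical machine the formula path is markedly slower, which is the "tremendous speedup" being referred to.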

@ajing (Author) commented Mar 27, 2017

Is there any simple fix for this? I currently only need glmnet and svm, so I tried to create two new models with makeRLearnerClassif. What do I need to modify in the trainLearner.classif.glmnet function? Remove .formula from the args and add x and y instead?

@mllg (Member) commented Mar 28, 2017

glmnet is not using the formula interface. As soon as @mb706 manages to create a PR for e1071's SVM, you should be all set after switching to the devel version.

@mb706 (Contributor) commented Mar 28, 2017

> Is there a reason mlr prefers formula over data.frame?

To answer my own question: the reason is that the formula interface accepts factors (converting them to dummy variables) while the data.frame interface (really a matrix interface) does not.

To solve this, mlr would need to do the dummy encoding itself. It would also need to watch out that the `scale` parameter of svm must be padded to account for the different number of columns, and must be FALSE for the dummy columns.
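A minimal sketch of what that encoding would involve, assuming mlr used model.matrix() for it (illustrative only, not the actual mlr internals):

```r
# Sketch of the extra work described above if mlr bypassed the formula
# interface (hedged: illustrative, not real mlr code):
df <- data.frame(num = c(1.5, 2.5, 3.5),
                 fac = factor(c("a", "b", "a")))

# model.matrix() performs the dummy encoding the formula interface provides
X <- model.matrix(~ . - 1, data = df)
colnames(X)  # "num" "faca" "facb": the factor became two dummy columns

# svm's `scale` argument must then be padded to the new column count,
# with FALSE for the dummy columns
scale_arg <- colnames(X) %in% names(df)  # TRUE only for original numerics
```

This is exactly the bookkeeping (column expansion plus padding of per-column parameters) that makes the switch less trivial than it first appears.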

@mb706 (Contributor) commented Mar 28, 2017

@ajing if you really need to use svm, you can try the fix_1738_svm_no_formula branch (maybe installing it with

```r
devtools::install_github("mlr-org/mlr", ref = "fix_1738_svm_no_formula")
```

works), although I can't recommend it: I cannot guarantee correctness, and in particular, the svm learner in that branch does not handle categorical features.

@mb706 (Contributor) commented Mar 28, 2017

@ajing the actual solution is to use the command line option --max-ppsize, e.g.

```sh
R --max-ppsize 100000
```

to increase the size of the protect call stack. The default is 50,000, the maximum is 500,000 (but leads to some slowdown due to slower GC), see help(Memory).

@larskotthoff (Contributor) commented Mar 28, 2017

Well, as an alternative, what do you think of having a second learner that doesn't use the formula interface and doesn't support factors? I don't think it's a good idea to do this kind of conversion manually.

@mb706 (Contributor) commented Mar 28, 2017

@larskotthoff (Contributor) commented Mar 28, 2017

Right, it's the latter part that I'm worried about. At the moment, the mlr learners are relatively thin wrappers around the actual learners, and making this wrapper thicker would increase maintenance costs. As we have a lot of learners, I think we should avoid this.

Thoughts, @berndbischl @mllg @jakob-r?

@ajing (Author) commented Mar 28, 2017

Thanks for the prompt reply!

Another relevant question... Is there any simple way to debug customized functions like makeRLearnerRegr(), makePreprocWrapper(), and makeFilterWrapper()? Using setBreakpoint? When I use traceback(), it usually returns a lot of output.

Thanks!

@larskotthoff (Contributor) commented Mar 28, 2017

Well, that depends on your definition of "simple". You will get a lot of information, but usually the last stack frame(s) are the only thing you need to look at.

@ajing (Author) commented Mar 28, 2017

Thanks @mb706 !

```sh
R --max-ppsize 500000
```

works well.

@berndbischl (Contributor) commented Mar 29, 2017

> Is there a reason mlr prefers formula over data.frame?
>
> Not really, just a "habit". Nearly all packages work with a formula interface and only very few provide an alternative. It is usually safe to switch to a character/data.frame interface, but be aware that it is not unlikely that you will discover some bugs, as these are less tested.
> The speedup can be tremendous, though.

The comment that our use of the formula interface is just a (bad) habit is REALLY not true.
I have thought about changing to a matrix interface, or conditionally supporting one, quite often. And it is also NOT true that switching to a "char/df" interface is "simple and safe". I don't even know what that means? Most packages support a NUMERICAL MATRIX interface.

But guess what: I discovered exactly the same problems that @mb706 reports here.

  • you have to convert factors manually now
  • and what is even worse: how will this affect certain operations in the algorithm, like scale and so on? (EDIT: or think about algos that perform internal forward search?)
  • I also don't like that the learner interfaces become even longer

So this is FAR from a simple switch. And the MAIN reason we use the formula interface is that it is the only interface we can SAFELY rely on.

I see this as an important point, and I would love to have the package be even more performant. But if we do this, it will be a very big change, which needs to be

  • properly discussed
  • implemented correctly (for more than 1 learner)
  • tested correctly
    and so on.

If somebody sees a clear and doable solution, please post.

@berndbischl (Contributor) commented Mar 29, 2017

> Thanks for the prompt reply!
>
> Another relevant question... Is there any simple way to debug customized functions like makeRLearnerRegr(), makePreprocWrapper(), makeFilterWrapper()? Using setBreakpoint? When I was using traceback(), it usually returns lots of data information.
>
> Thanks!

What I always do if I have problems with client code in other packages I don't control:

a) Download the source. You can do that manually, from CRAN or GitHub. Much easier, though, is to install the "rt" tool from @mllg and use the rclone command in the shell:
https://github.com/rdatsci/rt

so like

```sh
rclone mlr
```

b) When you have the source code locally, don't load library(mlr); use devtools::load_all("mlr") instead.
c) Now you can set breakpoints in the mlr sources, add prints, or change code...
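The breakpoint step can also be done without editing sources at all, via base R's trace(); a minimal sketch (f() and g() below are hypothetical stand-ins, not mlr functions):

```r
# Small sketch of injecting debugging code at function entry with trace()
# (hedged: toy functions, not mlr code):
g <- function(x) x * 2
f <- function(x) g(x) + 1

# trace() injects an expression at entry without touching the source;
# in an interactive session you would typically use quote(browser())
trace(g, quote(message("g called with x = ", x)), print = FALSE)
res <- f(21)
untrace(g)

res  # 43

# debugonce(g) would instead stop in the browser on the next call only
```

This works for package functions too once they are loaded with devtools::load_all().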

@berndbischl (Contributor) commented Mar 29, 2017

There are some important points in this thread that we should really put into an FAQ.

@mb706 (Contributor) commented Mar 29, 2017

We could modify getTaskFormula to give a warning (instructing the user to use --max-ppsize) if the task has more than 16k features.

(Ideally we would want to suppress this warning if the user is already using --max-ppsize. The "ppsize" is stored in a variable called R_PPStackSize, but I have not found it in any of the R include files on my system, so I'm not sure what the options are in trying to query it with a C function.)

(We could make this warning appear only once per R session.)
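A rough sketch of such a once-per-session warning (the helper name, environment, and threshold below are illustrative, not actual mlr code):

```r
# Hedged sketch of the once-per-session warning proposed above.
# State lives in an environment so repeated calls stay silent.
.ppsize_warning_shown <- new.env()

warnLargeFormula <- function(n.features, threshold = 16000) {
  if (n.features > threshold && is.null(.ppsize_warning_shown$done)) {
    .ppsize_warning_shown$done <- TRUE
    warning("Task has ", n.features, " features; the formula interface ",
            "may overflow the protection stack. Consider starting R ",
            "with --max-ppsize 500000.", call. = FALSE)
  }
  invisible(NULL)
}

warnLargeFormula(22160)  # warns
warnLargeFormula(22160)  # silent: warning shown at most once per session
```

Querying the actual ppsize from R to suppress the warning would still need the C-level lookup discussed above.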

@jitendrashahiitb commented May 31, 2018

@mb706 you wrote:

> the actual solution is to use the command line option --max-ppsize, e.g.
>
> `R --max-ppsize 100000`
>
> to increase the size of the protect call stack.

But how does one do that using RStudio on Ubuntu? I have seen how this is done on Windows.
