Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e1071::svm(): Use formula interface only if factors are present #1740

Merged
merged 10 commits into from Jun 17, 2019

Conversation

@mb706
Copy link
Contributor

commented Mar 28, 2017

See if this fixes #1738

@mb706

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2017

Note 1: You guys could systematically go through the learners and look for learners that use the formula interface but could conceivably use a superior data.frame interface. Anecdotically, I have seen the shim that is e1071:::svm.formula many times before (where they just get the data.frame out of the formula and then use that). EDIT: This turns out to be a bit more complicated, since the formula interface is used for dummy-encoding of factor features. It could still be done with some work. Things to look out for are parameters that somehow refer to columns (e.g. weighting or scaling of different features) and need to be padded to the new number of columns, and whether or not an intercept column is added.

Note 2: I don't really like the test I implemented, it takes a relatively large amount of time and memory while really only testing for a design decision. Feel free to leave it out.

Note 3: If you think the test for many-feature-dataframes are a good idea, it would be possible to systematically apply that test to all the learners that are supposed to handle many-features-situations.

mb706
@mb706

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2017

This turns out to be harder than I thought, see my comment in the issue

@berndbischl

This comment has been minimized.

Copy link
Contributor

commented Mar 29, 2017

i would not advise to merge this. see what i also posted in the issue. we should create a clean issue out of this.

@mllg

This comment has been minimized.

Copy link
Member

commented Apr 3, 2017

The latest changes (only using formula if factors are present) looks good to me.

@larskotthoff

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2017

Yep, looks good to me.

@berndbischl

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

Yep, looks good to me.

how about commenting on the general issue i raised? this is more then "hey code looks good here".
i mean you can argue that we can implement this only for this learner as this is a clear improvement, but then please do so. and that we should possibly open up a more general issue for the general problem and not block this improvement.

also does anybody see any PROBLEM that thi might introduce? or is it just a clear enhancement?

@berndbischl

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

i mean, just not the possible side effects this has with this PR #1763

what happens if we merge that too?
no the formula is taken ONLY if we task ALSO contains factors? pretty weird semantics right?
or stated otherwise: does the svm then support the (now not existing) "formula" property...?

@larskotthoff

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

As I've said in the other thread, an alternative would be to have two separate learners for this. That would avoid issues with being able to specify formulas.

@berndbischl

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

As I've said in the other thread,

where exactly please?

@larskotthoff

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

@berndbischl

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

#1738 (comment)

i am not a big fan of this. it is
a) confusing for users and overly complex to undertand
b) it copies code A LOT.
c) the svm mlr learner should really do this internally

@giuseppec

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2017

I have to agree with bernd. We should not try to autodetect this and internally do a if else here that changes the behaivour. I thought that our convention was to always use the formula interface whenever possible (e.g. randomForest). I think the user should be able to choose this himself.
Have to think about this a bit longer but one possible solution could be to introduce a formula (and a data.frame) property and let the user decide what to use (e.g. using a configureMlr or as an option of the learner or by checking if the task contains a formula and always prefer the formula interface if the learner supports this, otherwise use the data interface).

@giuseppec

This comment has been minimized.

Copy link
Contributor

commented Apr 6, 2017

What cases do we have?

  • the learner supports both, using a formula and data.frame (or matrix)
  • the learner supports only using a formula
  • the learner supports only using a data.frame (or matrix)
  • ???

Concrete suggestion:

  • add a new option prefer.formula.interface to configureMlr
  • default should be prefer.formula.interface = TRUE which always prefers the formula interface whenever possible (i.e. when the learner has a formula property which we have to add)
  • learners that allow using a formula should have a formula properties

Consequences:

  • this way, we don't change the current behaivour and it should be always clear what mlr is doing (currently this is not even clear for users because it is not documented which learner uses which interface)
  • a user will be able to change the interface (and speedup) learners that also support using data.frames (or matrices)
  • contra: our tests will take longer as we have to test both options.
@mb706

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2017

My opinion: I suggest not to complicate the UI but instead make getTaskData to also conform to the formula if one is given (see my comment on the pr).

Different solution: Use the formula if a user specified formula was attached to the task (would add another condition to the if clause), and use the data.frame directly otherwise.

@larskotthoff

This comment has been minimized.

Copy link
Contributor

commented Jul 14, 2017

What's the status here? Have we decided on a course of action?

@giuseppec

This comment has been minimized.

Copy link
Contributor

commented Jul 14, 2017

I think we haven't made a concrete decision yet. Do we even need this PR here which only solves this issue for SVM? Or do we want to solve this on an abstract level?
I still have this incomplete PR #1763 with some concrete TODOs where we could also try to address the issue mentioned here.

@giuseppec giuseppec referenced this pull request Jul 24, 2017
3 of 3 tasks complete
@pat-s

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

pushing this discussion as it sets the base for some PRs now (#1899, #1763, #1740)

@giuseppec is again working on #1763, specifying the formula within the task.
Is this agreed on? I think is not that practical because than the task can only be used with models supporting this particular formula notation (correct me if I'm wrong?)

There must always be one place where the formula is specified manually. Why is the ParamSet of a learner (as in #1899) a bad place?

@giuseppec

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2017

  1. Yes, I started working at this again. But I am currently just playing around, still not finished ; -).
  2. No, we don't have an agreement yet. What I want to do: Everything that currently works should of course still work +I would like to add a possibility to specify the formula in the task (just optional). But I agree with you that this can then only be used for learners that support formula objects. However, as @mb706 suggested we could extend getTaskData so that it uses only the features mentioned in the formula.
  3. I don't think that adding a formula to the learner-object is bad in general (I rather think that maybe both should be possible, specifying the formula in task AND in learner objects).
    Suppose you want to apply a learner to 2 or more different tasks (e.g. using the benchmark function). Setting the formula in the learner will only work for one task (as you need to add the feature names).
  4. I think it is very important that we do this stepwise (maybe multiple smaller PR). A aig PR has a high chance that it will never be merged.
@pat-s

This comment has been minimized.

Copy link
Member

commented Aug 3, 2017

But I agree with you that this can then only be used for learners that support formula objects.

This would be ok if the same formula notation is accepted by multiple learners. However, in the case of mgcv::gam the s() notation is unique I think (at least I do not know of any other smooth notation). So if we would implement it this way, we would need to create an extra task only for mgcv::gam.

@berndbischl
what is the main issue that a formula should not be in the paramset for specific learners?

@berndbischl berndbischl force-pushed the master branch 2 times, most recently from cf3db65 to 57f099e Mar 3, 2018

@ja-thomas ja-thomas force-pushed the master branch from 57f099e to 20d5a39 Mar 3, 2018

@pat-s pat-s changed the title Fix 1738 svm no formula e1071::svm(): Use formula interface only if factors are present Jun 6, 2019

@pat-s

This comment has been minimized.

Copy link
Member

commented Jun 6, 2019

Since the discussion about formula handling was ported to mlr3 and will probably not solved anymore here (which blocked the merging of this PR), is there something that would still speak against merging this SVM fix here?

@pat-s

pat-s approved these changes Jun 17, 2019

@pat-s

This comment has been minimized.

Copy link
Member

commented Jun 17, 2019

@larskotthoff ready to merge?

pat-s and others added some commits Jun 17, 2019

@larskotthoff larskotthoff merged commit e6f045e into master Jun 17, 2019

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
deploy/netlify Deploy preview ready!
Details

@larskotthoff larskotthoff deleted the fix_1738_svm_no_formula branch Jun 17, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.