Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PipeOpFeatureUnion with differing row IDs? #216

Open
mb706 opened this issue Aug 6, 2019 · 2 comments
Open

PipeOpFeatureUnion with differing row IDs? #216

mb706 opened this issue Aug 6, 2019 · 2 comments
Labels
Priority: Medium Status: Needs Design Needs some thought and design decisions. Tag: POFU

Comments

@mb706
Copy link
Collaborator

mb706 commented Aug 6, 2019

PipeOpFeatureUnion could under some circumstances want to unite tasks that have differing row IDs, e.g. after PipeOpSubsample on two different paths sampled (and $filter()ed) different sets of rows.

graph = greplicate(PipeOpSubsample$new() %>>%
    PipeOpLearnerCV$new("classif.rpart"), 2) %>>%
  PipeOpFeatureUnion$new()
graph$plot()  # this is what it looks like

graph$train("iris")  # assertion error

mlr-org/mlr3#309 could solve part of this, but the problem goes deeper:

  • what if we do sampling with replacement?
  • what if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping?
@mb706 mb706 added the Status: Needs Discussion We still need to think about what the solution should look like label Aug 6, 2019
@pfistfl
Copy link
Sponsor Member

pfistfl commented Apr 19, 2020

what if we do sampling with replacement?

If we subsample with replacement, we basically have the row_id twice in the data used for training the learner. cv-ed predictions will be the same for duplicated instances, this should not be a problem. We might get a problem when we do something stochastic after PipeOpSubsample, but I currently can not think about anything.
As a solution, I would suggest to simply choose the first element if multiple are provided until we encounter a solution where this yields invalid result.
Note also, that the first problem we'd encounter here is an invalid Resampling objects as rows could end up in train as well as in test data I guess.

what if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping?
We currently only allow cv for learnercv and therefore this should not be a problem.

A more future proof version would be doing aggregation, but how correct aggregation would look like is unclear and depends on the situation. Therefore I would argue to not tackle this now.

Currently blocked by mlr-org/mlr3#309

@mb706 mb706 removed the Status: Needs Discussion We still need to think about what the solution should look like label Sep 29, 2021
@mb706
Copy link
Collaborator Author

mb706 commented Sep 29, 2021

Official stance of mlr3 is now that we should solve this "manually" from within mlr3pipelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium Status: Needs Design Needs some thought and design decisions. Tag: POFU
Projects
None yet
Development

No branches or pull requests

2 participants