PipeOpFeatureUnion with differing row IDs? #216

mb706 · 2019-08-06T11:11:36Z

PipeOpFeatureUnion can, under some circumstances, be asked to unite tasks that have differing row IDs, e.g. when PipeOpSubsample on two different paths has sampled (and $filter()ed) different sets of rows.

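library(mlr3)
library(mlr3pipelines)

# two parallel branches, each subsampling independently before cross-validated prediction
graph = greplicate(PipeOpSubsample$new() %>>%
    PipeOpLearnerCV$new(lrn("classif.rpart")), 2) %>>%
  PipeOpFeatureUnion$new()
graph$plot()  # this is what it looks like

graph$train(tsk("iris"))  # assertion error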

mlr-org/mlr3#309 could solve part of this, but the problem goes deeper:

- What if we do sampling with replacement?
- What if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping?
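To make the two situations concrete, here is a small base R sketch that does not use mlr3pipelines at all. It only simulates the row IDs that two independent subsampling branches might produce; the sample sizes and variable names are illustrative assumptions, not what PipeOpSubsample does internally.

```r
set.seed(1)
all_ids = seq_len(150)  # iris has 150 rows

# Two branches subsample independently, without replacement.
# The size of 100 is an arbitrary choice for illustration.
ids_a = sample(all_ids, 100)
ids_b = sample(all_ids, 100)

# Independent draws will typically not cover the same set of rows.
same_rows = setequal(ids_a, ids_b)

# With replacement, the same row ID can occur more than once within a branch.
ids_boot = sample(all_ids, 150, replace = TRUE)
has_duplicates = anyDuplicated(ids_boot) > 0
```

Any union step therefore has to decide what to do with rows missing from one branch and with rows that occur several times.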


pfistfl · 2020-04-19T14:11:07Z

what if we do sampling with replacement?

If we subsample with replacement, we basically have the same row_id twice in the data used for training the learner. Cross-validated predictions will be the same for duplicated instances, so this should not be a problem. We might run into a problem when we do something stochastic after PipeOpSubsample, but I currently cannot think of an example.
As a solution, I would suggest simply choosing the first element if multiple are provided, until we encounter a situation where this yields an invalid result.
Note also that the first problem we'd encounter here is an invalid Resampling object, as rows could end up in the train as well as in the test data, I guess.
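A minimal base R sketch of the "choose the first element" idea, operating on plain data frames rather than on mlr3 Tasks. The tables, column names and the helper `keep_first()` are hypothetical. Using an inner join, so that only rows present in both branches survive, is an assumption; the thread does not settle how rows missing from one branch should be handled.

```r
# Hypothetical prediction tables from two branches (illustrative values only).
branch_a = data.frame(row_id = c(1L, 2L, 2L, 4L), pred_a = c(0.1, 0.7, 0.7, 0.3))
branch_b = data.frame(row_id = c(2L, 3L, 4L, 4L), pred_b = c(0.6, 0.2, 0.9, 0.9))

# Keep only the first occurrence of each row ID (hypothetical helper).
keep_first = function(df) df[!duplicated(df$row_id), , drop = FALSE]

# Inner join on row_id: rows that are missing from either branch are dropped.
united = merge(keep_first(branch_a), keep_first(branch_b), by = "row_id")
```

Here `united` contains one row per row ID that appears in both branches, with one prediction column from each branch.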

what if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping?
PipeOpLearnerCV currently only allows cross-validation (cv) as its resampling, so this should not be a problem.

A more future-proof version would aggregate the duplicated predictions, but what a correct aggregation looks like is unclear and depends on the situation. Therefore I would argue for not tackling this now.
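For comparison, aggregation could look roughly like the following base R sketch. Averaging with `mean` is only one possible choice and is an assumption made for illustration; as noted above, which aggregation is correct depends on the situation. The table and column names are hypothetical.

```r
# Hypothetical predictions in which some rows were predicted more than once,
# as could happen with repeated CV or bootstrapping.
preds = data.frame(
  row_id = c(1L, 1L, 2L, 3L, 3L),
  prob   = c(0.25, 0.75, 0.9, 0.5, 1.0)
)

# Collapse to one value per row ID by averaging the duplicates.
aggregated = aggregate(prob ~ row_id, data = preds, FUN = mean)
```

For numeric predictions a mean is a natural candidate; for response (class label) predictions a majority vote or some other rule would be needed instead.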

Currently blocked by mlr-org/mlr3#309

mb706 · 2021-09-29T14:55:18Z

Official stance of mlr3 is now that we should solve this "manually" from within mlr3pipelines.

mb706 mentioned this issue Aug 6, 2019

cbinding data with different row number or row ids mlr-org/mlr3#309

Closed

mb706 added the "Status: Needs Discussion" label Aug 6, 2019

pfistfl added the Status: Blocked label Apr 19, 2020

mb706 added the "Priority: Medium" and "Tag: POFU" labels Aug 31, 2020

mb706 mentioned this issue Sep 11, 2020

Extend PipeOpLearnerCV for other resamplings #500

Open

mb706 removed the "Status: Needs Discussion" label Sep 29, 2021

mb706 removed the Status: Blocked label Oct 4, 2021
