add cross-method for training of encode impact #471

Open · wants to merge 10 commits into master
Conversation

@sumny (Member) commented Aug 11, 2020

closes #423

@codecov-commenter commented (comment minimized)

@sumny sumny marked this pull request as ready for review August 24, 2020 18:58
@mb706 (Collaborator) left a comment

Features with > 2 levels: encode by learner. Two-level features are encoded 0/1 by default, but give an option to also encode them by learner (at a performance penalty). The user should supply a learner during construction. We may have to implement a few trivial learners ourselves, because a learner of our choice may not be able to handle high-cardinality features. glmnet might work if we use a sparse matrix. Things like a smoothed class average could be implemented as their own (fast) learner. Classif tasks should probably use prob learners, regr tasks plain regr learners.
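The "smoothed class average" idea could be sketched as a plain function (a hypothetical base-R illustration, not code from this PR; the function name and the `smoothing` parameter are assumptions):

```r
# Hypothetical sketch of a smoothed class-average impact encoding:
# per-level target means pulled toward the global mean, so rare levels
# are regularized instead of memorized.
smoothed_impact = function(x, y, smoothing = 10) {
  global_mean = mean(y)
  vapply(split(y, x), function(yl) {
    n = length(yl)
    (sum(yl) + smoothing * global_mean) / (n + smoothing)
  }, numeric(1))
}

x = factor(c("a", "a", "b", "b", "b"))
y = c(1, 0, 1, 1, 0)
smoothed_impact(x, y, smoothing = 2)
#    a    b
# 0.55 0.64
```

Since this is just a closed-form group statistic, it would be fast enough to serve as one of the "trivial learners" mentioned above.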

R/PipeOpEncodeImpact.R — two review threads (outdated, resolved)
@sumny sumny added the Status: Needs Discussion We still need to think about what the solution should look like label Sep 24, 2020
@sumny (Member, Author) commented Sep 29, 2020

This PR now proposes a new experimental way to handle resampled impact encodings.
As proposed by @mb706 and @pfistfl, resampled impact encoding should probably be done via PipeOpLearnerCV; this approach is the most straightforward and the easiest to extend.
It requires impact encoders to be Learners (because we want to use the resampling infrastructure).
We cannot simply inherit from LearnerRegr or LearnerClassif without some mlr3 base changes, because none of the response, se, or prob predict_types is flexible enough to handle impact encodings (assertions would fail). This PR therefore relies on experimental mlr3 changes in the pipelines_encode_impact_cv branch, see https://github.com/mlr-org/mlr3/tree/pipelines_encode_impact_cv.

I propose:

  • Add new impact encoder classes ImpactEncoderClassif and ImpactEncoderRegr that inherit from Learner but have a new predict_type called impact
  • ImpactEncoders work like a normal Learner during train, i.e., they save the impact encoding as the $state$model; during predict, the impacts are returned and the response is always set to <NA> (a response is only returned so that nothing downstream breaks)
  • In mlr3 this only requires minor changes to extend the predictions for classif and regr
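In plain terms, the train/predict contract described above might look like this base-R sketch (hypothetical illustration only, not the PR's R6 code; the function names are made up):

```r
# Conceptual sketch of an ImpactEncoderRegr: train() stores per-level
# target means as the model; predict() returns impact columns plus an
# all-NA response (kept only so downstream assertions do not break).
train_impact = function(x, y) {
  model = vapply(split(y, x), mean, numeric(1))
  c(model, .TEMP.MISSING = NA_real_)  # slot for missing/unseen levels
}
predict_impact = function(model, x) {
  idx = match(as.character(x), names(model))
  data.frame(response = NA_real_, impact = unname(model[idx]))
}

model = train_impact(factor(c("a", "a", "b")), c(1, 0, 2))
predict_impact(model, factor(c("a", "b", "c")))  # "c" is unseen -> NA impact
```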

Simple example:

library(devtools)

load_all("mlr3")  # pipelines_encode_impact_cv branch
load_all("mlr3pipelines")  # encode_impact_cv branch

taskr = TaskRegr$new("task", backend = data.table(
    x1 = factor(c("a", "a", "b", "b", "b", "b", "b", "a")),
    x2 = factor(c("c", "d", "c", "e", "c", "e", "e", "c")),
    y = c(1, 0.5, -1, -1.5, -0.5, 0, -0.5, 1.5)
  ), target = "y")
encoder1 = ImpactEncoderRegrSimple$new()  # works like PipeOpEncodeImpact
encoder1$train(taskr)
encoder1$model
$x1
                    [,1]
a              1.0624646
b             -0.6374873
.TEMP.MISSING         NA

$x2
                    [,1]
c              0.3124922
d              0.5624438
e             -0.6041465
.TEMP.MISSING         NA

attr(,"class")
[1] "encode.impact.regr.simple_model"
encoder1$predict(taskr)
<PredictionRegr> for 8 observations:
    row_id truth response  impact.x1  impact.x2
         1   1.0       NA  1.0624646  0.3124922
         2   0.5       NA  1.0624646  0.5624438
         3  -1.0       NA -0.6374873  0.3124922
---                                            
         6   0.0       NA -0.6374873 -0.6041465
         7  -0.5       NA -0.6374873 -0.6041465
         8   1.5       NA  1.0624646  0.3124922

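To illustrate what wrapping such an encoder in PipeOpLearnerCV buys, here is a base-R sketch of resampled impact encoding (hypothetical, not mlr3 code): each observation's impact comes from a model fit without that observation's fold, which avoids target leakage.

```r
# Manual k-fold impact encoding: per-fold held-out target means,
# with the training-fold global mean as fallback for unseen levels.
cv_impact = function(x, y, folds = 3) {
  fold_id = rep_len(seq_len(folds), length(x))
  out = numeric(length(x))
  for (f in seq_len(folds)) {
    test = fold_id == f
    means = vapply(split(y[!test], x[!test]), mean, numeric(1))
    out[test] = means[as.character(x[test])]
    out[test][is.na(out[test])] = mean(y[!test])  # unseen-level fallback
  }
  out
}

x = factor(c("a", "a", "b", "b", "b", "b", "b", "a"))
y = c(1, 0.5, -1, -1.5, -0.5, 0, -0.5, 1.5)
cv_impact(x, y)
```

The values differ from the in-sample impacts above precisely because each row is encoded by a model that never saw it.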
More cool stuff: a 3-fold cross-validated lme4::lmer encoding that can be used directly in a Graph:

set.seed(1234)
pocv = PipeOpLearnerCV$new(ImpactEncoderRegrGlmm$new(), param_vals = list(resampling.folds = 3))
pocv$train(list(taskr))[[1]]$data()
      y encode.impact.regr.glmm.impact.x1 encode.impact.regr.glmm.impact.x2
1:  1.0                         0.1666667                        -0.5000000
2:  0.5                         1.1904766                         0.2000000
3: -1.0                        -0.7803032                         0.4446710
4: -1.5                        -0.4603177                         0.2000000
5: -0.5                        -0.4603177                         0.2000000
6:  0.0                        -0.7803032                        -0.5248605
7: -0.5                        -0.6666667                        -0.5000000
8:  1.5                         0.1666667                        -0.5000000
pocv$state$model
$x1
                    [,1]
a              0.9411766
b             -0.6647060
.TEMP.MISSING  0.1382353

$x2
                     [,1]
c             -0.03997347
d             -0.05185426
e             -0.09608396
.TEMP.MISSING -0.06263723

attr(,"class")
[1] "encode.impact.regr.glmm_model"

If the general consensus is that we are fine with introducing Learners for impact encoding and a new predict_type, I can finish this up with more impact encoders, tests, and documentation, but let's discuss this first.

@sumny sumny requested a review from mb706 September 29, 2020 16:39
Development

Successfully merging this pull request may close these issues.

PipeEncodeImpact: Add CV
3 participants