add cross-method for training of encode impact #471

Open · wants to merge 10 commits into master
Conversation

@sumny (Member) commented Aug 11, 2020

closes #423

@codecov-commenter commented (comment minimized)

@sumny sumny marked this pull request as ready for review August 24, 2020 18:58
@mb706 (Collaborator) left a comment

Features with > 2 levels: encode by learner. Two-level features are encoded 0/1 by default, but give an option to also encode them by learner (at a performance penalty). The user should supply a learner during construction. We may have to implement a few trivial learners ourselves, because a learner of our choice may not be able to handle high-cardinality features. glmnet might work if we use a sparse matrix. Things like a smoothed class average could be implemented as their own (fast) learner. Classif tasks should probably use prob learners, regr tasks plain regr learners.
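The "smoothed class average" idea could be sketched as a plain function (a hypothetical base-R illustration, not code from this PR; the function name and the `smoothing` parameter are assumptions):

```r
# Hypothetical sketch of a smoothed class-average impact encoding:
# per-level target means pulled toward the global mean, so rare levels
# are regularized instead of memorized.
smoothed_impact = function(x, y, smoothing = 10) {
  global_mean = mean(y)
  vapply(split(y, x), function(yl) {
    n = length(yl)
    (sum(yl) + smoothing * global_mean) / (n + smoothing)
  }, numeric(1))
}

x = factor(c("a", "a", "b", "b", "b"))
y = c(1, 0, 1, 1, 0)
smoothed_impact(x, y, smoothing = 2)
#    a    b
# 0.55 0.64
```

Since this is just a closed-form group statistic, it would be fast enough to serve as one of the "trivial learners" mentioned above.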

R/PipeOpEncodeImpact.R — two review threads (outdated, resolved)
@sumny sumny added the Status: Needs Discussion We still need to think about what the solution should look like label Sep 24, 2020
@sumny (Member, Author) commented Sep 29, 2020

This PR now proposes a new experimental way to handle resampled impact encodings.
As proposed by @mb706 and @pfistfl, resampled impact encoding should probably be done via PipeOpLearnerCV; this approach is the most straightforward and the easiest to extend.
It requires impact encoders to be Learners (because we want to use the resampling infrastructure).
We cannot simply inherit from LearnerRegr or LearnerClassif without some mlr3 base changes, because none of the response, se, or prob predict_types is flexible enough to handle impact encodings (assertions would fail). This PR therefore relies on experimental mlr3 changes in the pipelines_encode_impact_cv branch, see https://github.com/mlr-org/mlr3/tree/pipelines_encode_impact_cv.

I propose:

  • Add new impact encoder classes ImpactEncoderClassif and ImpactEncoderRegr that inherit from Learner but have a new predict_type called impact
  • ImpactEncoders work like a normal Learner during train, i.e., they save the impact encoding as the $state$model; during predict, the impacts are returned and the response is always set to <NA> (a response is only returned so that nothing downstream breaks)
  • In mlr3 this only requires minor changes to extend the predictions for classif and regr
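In plain terms, the train/predict contract described above might look like this base-R sketch (hypothetical illustration only, not the PR's R6 code; the function names are made up):

```r
# Conceptual sketch of an ImpactEncoderRegr: train() stores per-level
# target means as the model; predict() returns impact columns plus an
# all-NA response (kept only so downstream assertions do not break).
train_impact = function(x, y) {
  model = vapply(split(y, x), mean, numeric(1))
  c(model, .TEMP.MISSING = NA_real_)  # slot for missing/unseen levels
}
predict_impact = function(model, x) {
  idx = match(as.character(x), names(model))
  data.frame(response = NA_real_, impact = unname(model[idx]))
}

model = train_impact(factor(c("a", "a", "b")), c(1, 0, 2))
predict_impact(model, factor(c("a", "b", "c")))  # "c" is unseen -> NA impact
```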

Simple example:

library(devtools)

load_all("mlr3")  # pipelines_encode_impact_cv branch
load_all("mlr3pipelines")  # encode_impact_cv branch

taskr = TaskRegr$new("task", backend = data.table(
    x1 = factor(c("a", "a", "b", "b", "b", "b", "b", "a")),
    x2 = factor(c("c", "d", "c", "e", "c", "e", "e", "c")),
    y = c(1, 0.5, -1, -1.5, -0.5, 0, -0.5, 1.5)
  ), target = "y")
encoder1 = ImpactEncoderRegrSimple$new()  # works like PipeOpEncodeImpact
encoder1$train(taskr)
encoder1$model
$x1
                    [,1]
a              1.0624646
b             -0.6374873
.TEMP.MISSING         NA

$x2
                    [,1]
c              0.3124922
d              0.5624438
e             -0.6041465
.TEMP.MISSING         NA

attr(,"class")
[1] "encode.impact.regr.simple_model"
encoder1$predict(taskr)
<PredictionRegr> for 8 observations:
    row_id truth response  impact.x1  impact.x2
         1   1.0       NA  1.0624646  0.3124922
         2   0.5       NA  1.0624646  0.5624438
         3  -1.0       NA -0.6374873  0.3124922
---                                            
         6   0.0       NA -0.6374873 -0.6041465
         7  -0.5       NA -0.6374873 -0.6041465
         8   1.5       NA  1.0624646  0.3124922

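To illustrate what wrapping such an encoder in PipeOpLearnerCV buys, here is a base-R sketch of resampled impact encoding (hypothetical, not mlr3 code): each observation's impact comes from a model fit without that observation's fold, which avoids target leakage.

```r
# Manual k-fold impact encoding: per-fold held-out target means,
# with the training-fold global mean as fallback for unseen levels.
cv_impact = function(x, y, folds = 3) {
  fold_id = rep_len(seq_len(folds), length(x))
  out = numeric(length(x))
  for (f in seq_len(folds)) {
    test = fold_id == f
    means = vapply(split(y[!test], x[!test]), mean, numeric(1))
    out[test] = means[as.character(x[test])]
    out[test][is.na(out[test])] = mean(y[!test])  # unseen-level fallback
  }
  out
}

x = factor(c("a", "a", "b", "b", "b", "b", "b", "a"))
y = c(1, 0.5, -1, -1.5, -0.5, 0, -0.5, 1.5)
cv_impact(x, y)
```

The values differ from the in-sample impacts above precisely because each row is encoded by a model that never saw it.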
More cool stuff: a 3-fold cross-validated lme4::lmer encoding that can be used directly in a Graph:

set.seed(1234)
pocv = PipeOpLearnerCV$new(ImpactEncoderRegrGlmm$new(), param_vals = list(resampling.folds = 3))
pocv$train(list(taskr))[[1]]$data()
      y encode.impact.regr.glmm.impact.x1 encode.impact.regr.glmm.impact.x2
1:  1.0                         0.1666667                        -0.5000000
2:  0.5                         1.1904766                         0.2000000
3: -1.0                        -0.7803032                         0.4446710
4: -1.5                        -0.4603177                         0.2000000
5: -0.5                        -0.4603177                         0.2000000
6:  0.0                        -0.7803032                        -0.5248605
7: -0.5                        -0.6666667                        -0.5000000
8:  1.5                         0.1666667                        -0.5000000
pocv$state$model
$x1
                    [,1]
a              0.9411766
b             -0.6647060
.TEMP.MISSING  0.1382353

$x2
                     [,1]
c             -0.03997347
d             -0.05185426
e             -0.09608396
.TEMP.MISSING -0.06263723

attr(,"class")
[1] "encode.impact.regr.glmm_model"

If the general consensus is that we are fine with introducing Learners for impact encoding and a new predict_type, I can finish this up with more impact encoders, tests, and documentation, but let's discuss this first.

@sumny sumny requested a review from mb706 September 29, 2020 16:39
Development

Successfully merging this pull request may close these issues.

PipeEncodeImpact: Add CV
3 participants