Enable using fully predefined indices in resampling #2412

merged 38 commits into from Sep 27, 2018
02261b2
add 'grouping' option
pat-s Aug 14, 2018
a9ab8aa
account for grouping and blocking
pat-s Aug 14, 2018
77955eb
add tests
pat-s Aug 14, 2018
51bf71d
Merge branch 'master' into factor-cv
pat-s Aug 14, 2018
b034ad2
use fixed instead of grouping
pat-s Aug 14, 2018
c5749de
hardcode iters when fixed = TRUE
pat-s Aug 14, 2018
bb85a2c
update tests and rename
pat-s Aug 14, 2018
5ae5272
fix doc of makeResampleDesc
pat-s Aug 14, 2018
a5ae62f
also apply args blocking.cv and fixed to RepCV to fix blocking tests
pat-s Aug 14, 2018
cac576d
update tests
pat-s Aug 14, 2018
192355c
update resample::blocking
pat-s Aug 16, 2018
beb3a92
update documentation of arg 'fixed'
pat-s Sep 17, 2018
afd50eb
grouping -> fixed
pat-s Sep 17, 2018
c03fff1
Merge branch 'master' into factor-cv
pat-s Sep 17, 2018
75c33a4
set defaults for fixed and blocking.cv in makeResampleDesc
pat-s Sep 17, 2018
8d75335
fixed -> grouping
pat-s Sep 17, 2018
ae9568e
explain why are doing a hard levels reset
pat-s Sep 17, 2018
83d122c
simplify code
pat-s Sep 17, 2018
f77f818
hand over new default args of makeResampleDesc
pat-s Sep 17, 2018
6269677
update tests
pat-s Sep 17, 2018
5e3d77d
fix test expectations (handled by seed?)
pat-s Sep 17, 2018
7d9d079
fixed and blocking.cv are official params and not just items
pat-s Sep 24, 2018
af8a40d
fix doc
pat-s Sep 24, 2018
54a76ad
Deploy from Travis build 12597 [ci skip]
pat-s Sep 24, 2018
66b0c23
update docs from master
pat-s Sep 24, 2018
7dc7606
merge
pat-s Sep 24, 2018
312763e
update docs
pat-s Sep 24, 2018
c31d4e9
Deploy from Travis build 12603 [ci skip]
pat-s Sep 24, 2018
367979e
Deploy from Travis build 12602 [ci skip]
pat-s Sep 24, 2018
8f01847
update vignette
pat-s Sep 27, 2018
50f51a4
update help page
pat-s Sep 27, 2018
9f6ab3e
merge master
pat-s Sep 27, 2018
87e1195
update NEWS
pat-s Sep 27, 2018
f5bc3a2
Merge branch 'master' into factor-cv
pat-s Sep 27, 2018
4f6e39d
Deploy from Travis build 12624 [ci skip]
pat-s Sep 27, 2018
dadd833
Deploy from Travis build 12622 [ci skip]
pat-s Sep 27, 2018
24b82d2
Deploy from Travis build 12621 [ci skip]
pat-s Sep 27, 2018
210cc85
Merge branch 'master' into factor-cv
larskotthoff Sep 27, 2018
3 changes: 3 additions & 0 deletions NEWS.md
@@ -1,5 +1,8 @@
# mlr 2.14:

## general
* add option to use fully predefined indices in resampling (`makeResampleDesc(fixed = TRUE)`)

## learners - new
* classif.liquidSVM
* regr.liquidSVM
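
The NEWS entry for `fixed = TRUE` can be illustrated with a small base R sketch. It is not mlr's implementation: the `blocking` values are made up, and the shuffling of fold order that the diff performs via `chunk(..., shuffle = TRUE)` is left out. It only shows the idea that every level of the blocking factor becomes exactly one test fold.

```r
# Hypothetical data: six observations assigned to three predefined groups.
blocking <- factor(c("a", "a", "b", "c", "c", "b"))

# With fully predefined indices, every factor level becomes exactly one test fold.
test_inds <- lapply(levels(blocking), function(lev) which(blocking == lev))
names(test_inds) <- levels(blocking)

# The training set of each fold is everything outside its test set.
train_inds <- lapply(test_inds, function(i) setdiff(seq_along(blocking), i))

str(test_inds)
```

In mlr itself, going by the parameter documentation in this PR, the factor would be supplied via the task's `blocking` argument and the behaviour requested with `makeResampleDesc("CV", fixed = TRUE)`; the number of folds then follows from the number of levels rather than from `iters`.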
26 changes: 21 additions & 5 deletions R/ResampleDesc.R
@@ -65,6 +65,17 @@
#' of the resampling indices will be 20. Default is \dQuote{horizon} which gives mutually exclusive chunks
#' of test indices.}
#' }
#' @param fixed (`logical(1)`)\cr
#' Whether indices supplied via argument 'blocking' in the task should be used as
#' fully pre-defined indices. Default is `FALSE` which means
#' they will be used following the 'blocking' approach.
#' `fixed` only works with ResampleDesc `CV` and the supplied indices must match
#' the number of observations. When `fixed = TRUE`, the `iters` argument will be ignored
#' and is internally set to the number of supplied factor levels in `blocking`.
#' @param blocking.cv (`logical(1)`)\cr
#' Should 'blocking' be used in `CV`? Default to `FALSE`.
#' This is different to `fixed = TRUE` and cannot be combined. Please check the mlr online tutorial
#' for more details.
#' @param stratify (`logical(1)`)\cr
#' Should stratification be done for the target variable?
#' For classification tasks, this means that the resampling strategy is applied to all classes
@@ -92,7 +103,8 @@
#'
#' # Holdout a.k.a. test sample estimation
#' makeResampleDesc("Holdout")
-makeResampleDesc = function(method, predict = "test", ..., stratify = FALSE, stratify.cols = NULL) {
+makeResampleDesc = function(method, predict = "test", ..., stratify = FALSE,
+  stratify.cols = NULL, fixed = FALSE, blocking.cv = FALSE) {
assertChoice(method, choices = c("Holdout", "CV", "LOO", "RepCV",
"Subsample", "Bootstrap", "SpCV", "SpRepCV",
"GrowingWindowCV", "FixedWindowCV"))
@@ -106,6 +118,8 @@ makeResampleDesc = function(method, predict = "test", ..., stratify = FALSE, str
d$predict = predict
d$stratify = stratify
d$stratify.cols = stratify.cols
d$fixed = fixed
d$blocking.cv = blocking.cv
addClasses(d, stri_paste(method, "Desc"))
}

@@ -134,9 +148,10 @@ makeResampleDescHoldout = function(iters, split = 2 / 3) {
makeResampleDescInternal("holdout", iters = 1L, split = split)
}

-makeResampleDescCV = function(iters = 10L) {
+makeResampleDescCV = function(iters = 10L, fixed = FALSE, blocking.cv = FALSE) {
   iters = asInt(iters, lower = 2L)
-  makeResampleDescInternal("cross-validation", iters = iters)
+  makeResampleDescInternal("cross-validation", iters = iters, fixed = fixed,
+    blocking.cv = blocking.cv)
}

makeResampleDescSpCV = function(iters = 10L) {
@@ -159,10 +174,11 @@ makeResampleDescBootstrap = function(iters = 30L) {
makeResampleDescInternal("OOB bootstrapping", iters = iters)
}

-makeResampleDescRepCV = function(reps = 10L, folds = 10L) {
+makeResampleDescRepCV = function(reps = 10L, folds = 10L, fixed = FALSE, blocking.cv = FALSE) {
   reps = asInt(reps, lower = 2L)
   folds = asInt(folds, lower = 2L)
-  makeResampleDescInternal("repeated cross-validation", iters = folds * reps, folds = folds, reps = reps)
+  makeResampleDescInternal("repeated cross-validation", iters = folds * reps, folds = folds, reps = reps,
+    fixed = fixed, blocking.cv = blocking.cv)
}

makeResampleDescSpRepCV = function(reps = 10L, folds = 10L) {
14 changes: 12 additions & 2 deletions R/ResampleInstance.R
@@ -11,7 +11,7 @@
#' \item{train.inds (list of [integer])}{List of of training indices for all iterations.}
#' \item{test.inds (list of [integer])}{List of of test indices for all iterations.}
#' \item{group ([factor])}{Optional grouping of resampling iterations. This encodes whether
-#' specfic iterations 'belong together' (e.g. repeated CV), and it can later be used to
+#' specific iterations 'belong together' (e.g. repeated CV), and it can later be used to
#' aggregate performance values accordingly. Default is 'factor()'.}
#' }
#'
@@ -61,7 +61,17 @@ makeResampleInstance = function(desc, task, size, ...) {
if (length(blocking) && desc$stratify)
stop("Blocking can currently not be mixed with stratification in resampling!")

if (length(blocking)) {
# 'fixed' only exists by default for 'CV' -> is.null(desc$fixed)
# only use this way of blocking if 'fixed = FALSE' -> is.null(desc$fixed)

fixed = desc$fixed
blocking.cv = desc$blocking.cv
if (fixed == FALSE) {
### check if blocking should be used or not
blocking.cv = desc$blocking.cv
}
Sponsor Member:

With default values for fixed and blocking.cv this could be simply fixed = desc$fixed; blocking.cv = desc$blocking.cv.

Member Author:

Yes this simplifies things. I've set defaults in makeResampleDesc().


if (length(blocking) > 0 && !fixed && blocking.cv) {
if (is.null(task))
stop("Blocking always needs the task!")
levs = levels(blocking)
62 changes: 57 additions & 5 deletions R/ResampleInstances.R
@@ -8,10 +8,55 @@ instantiateResampleInstance.HoldoutDesc = function(desc, size, task = NULL) {
}

instantiateResampleInstance.CVDesc = function(desc, size, task = NULL) {
-  if (desc$iters > size)
-    stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size)
-  test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters)
-  makeResampleInstanceInternal(desc, size, test.inds = test.inds)

# Random sampling CV
if (!desc$fixed) {
if (desc$iters > size) {
stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size)
}
test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters)
makeResampleInstanceInternal(desc, size, test.inds = test.inds)
} else {

# CV with only predefined indices ("fixed")

if(is.null(task$blocking)) {
stopf("To use blocking in resampling, you need to pass a factor variable when creating the task!")
}

# In the inner call, the implementation is able to adapt by automatically reducing one level (see line if (0 %in% length.test.inds)).
# So having always `length(iters) = length(levels(task$blocking)` is the most safe environment for the function to work.
if (desc$iters != length(levels(task$blocking))) {
desc$iters = length(levels(task$blocking))
Sponsor Member:

Why is the number of iterations being adjusted here? There should be a warning.

Member Author:

Two reasons:

  1. If the user passes a number of iterations that differs from the number of levels used for fixed in makeResampleDesc(), the function will error.
  2. In the inner call, the function is able to adapt by automatically dropping one level. So always keeping iters equal to length(levels(task$blocking)) is the safest setting for the function to work in.

I added the explanation as a comment.

warningf("Adjusting levels to match number of blocking levels.")
}
levs = levels(task$blocking)
n_levels = length(levs)

# Why do we need the helper desc? If we would call 'instantiateResampleInstance()' here,
# we would call the function within itself and will receive an 'error-c-stack-usage-is-too-close-to-the-limit' error
# So we simply change the class name to mimic a new function..
attr(desc, "class")[1] = "CVHelperDesc"
# create fake ResampleInstance
inst = instantiateResampleInstance(desc, n_levels, task)
attr(desc, "class")[1] = "CVDesc"

# now exchange block indices with indices of elements of this block and shuffle
test.inds = lapply(inst$test.inds, function(i) which(task$blocking %in% levs[i]))

# Nested resampling: We need to create a list with length(levels) first.
# Then one fold will be length(0) because we are missing one factor level because we are in the inner level
Sponsor Member:

What happens if the number of outer folds is less than the number of levels (or a simple train/test split) and more than one factor level is missing?

Member Author:

In

    if (desc$iters != length(levels(task$blocking))) {
      desc$iters = length(levels(task$blocking))

we ensure that the number of folds always equals the number of levels.

Sponsor Member:

Ah ok. Could you add information on what the number of levels was and what it was set to in the warning please?

Member Author:

The problem we have here is the following:

  • The warning is triggered for both inner and outer level
  • Determining the level (inner/outer) is tricky
  • The inner level first resets to the factor levels of the outer level and is then adjusted further. So we actually get a false positive for the inner level even if it's set correctly (e.g. outer = 5, inner = 4).

Even if this doesn't sound logical, I would even vote for removing the warning completely. Users who set fixed = T usually know what they want and what they need to set.
It's counterproductive if the warning is raised even when the specification is correct (e.g. outer = 5, inner = 4).

Not sure if this line of thinking is easy to follow here. Let me know if I should explain it again in more detail.

I would propose to mention the adjustment in the tutorial (and in the help page?).

Sponsor Member:

But at some point you can figure out whether there are enough levels for the folds, right? So there shouldn't need to be any false positives?

Member Author:

Further down in the function I have the final number of levels. But at that point I have no information on whether the level count was adjusted or not. And even if I added a flag, I would still be missing the original level count, since desc$iters is reassigned.

I don't think implementing a robust warning for both the inner and the outer level is worth the effort.
I would prefer to note it in the help page (in details) and in the tutorial.

Something like "Setting iters with fixed = T has no effect. iters will be set to length(blocking.levels) in the outer and length(blocking.levels) - 1 in the inner level".
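
The outer/inner fold counts in the proposed note can be reproduced with a minimal base R sketch. It assumes a made-up blocking factor with five levels and is not the code from the diff. One deliberate difference: the diff removes the first zero-length fold it finds via `match(0, length.test.inds)`, whereas this sketch drops every empty fold.

```r
# Hypothetical blocking factor with five levels, two observations each.
blocking <- factor(rep(c("b1", "b2", "b3", "b4", "b5"), each = 2))

# Outer level: one test fold per factor level.
outer_test <- lapply(levels(blocking), function(lev) which(blocking == lev))
n_outer <- length(outer_test)

# Inner level for the first outer fold: only the outer training rows remain.
# Subsetting a factor keeps all five levels, so one level is now empty.
inner_blocking <- blocking[-outer_test[[1]]]
inner_test <- lapply(levels(inner_blocking), function(lev) which(inner_blocking == lev))

# Drop the zero-length fold that belongs to the held-out level.
inner_test <- inner_test[lengths(inner_test) > 0]
n_inner <- length(inner_test)

c(outer = n_outer, inner = n_inner)
```

This is why a user-supplied `iters` has no effect with `fixed = TRUE`: the fold count follows from the number of non-empty blocking levels at each resampling level.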

Sponsor Member:

Ok.

Member Author:

Thanks for discussing this Lars! It's also not a solution I am completely happy with and a problem I thought a lot about.

I'll do the required changes soon, including the tutorial page. Finally getting this done.

# We check for this and remove this fold
# There is no other way to do this. If we initially set "desc$iters" to length(levels) - 1, test.inds will not be created correctly
length.test.inds = unlist(lapply(test.inds, function(x) length(x)))
if (0 %in% length.test.inds) {
index = match(0, length.test.inds)
test.inds[[index]] = NULL
size = length(task$env$data[,1])
desc$iters = length(test.inds)
}
makeResampleInstanceInternal(desc, size, test.inds = test.inds)
}
}

instantiateResampleInstance.SpCVDesc = function(desc, size, task = NULL) {
@@ -52,7 +97,7 @@ instantiateResampleInstance.BootstrapDesc = function(desc, size, task = NULL) {

instantiateResampleInstance.RepCVDesc = function(desc, size, task = NULL) {
folds = desc$iters / desc$reps
-  d = makeResampleDesc("CV", iters = folds)
+  d = makeResampleDesc("CV", iters = folds, blocking.cv = desc$blocking.cv, fixed = desc$fixed)
i = replicate(desc$reps, makeResampleInstance(d, size = size), simplify = FALSE)
train.inds = Reduce(c, lapply(i, function(j) j$train.inds))
test.inds = Reduce(c, lapply(i, function(j) j$test.inds))
@@ -78,3 +123,10 @@ instantiateResampleInstance.GrowingWindowCVDesc = function(desc, size, task = NU
makeResamplingWindow(desc, size, task, coords, "GrowingWindowCV")
}

instantiateResampleInstance.CVHelperDesc = function(desc, size, task = NULL) {

if (desc$iters > size)
stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size)
test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters)
makeResampleInstanceInternal(desc, size, test.inds = test.inds)
}
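
The class swap that makes `CVHelperDesc` necessary can be sketched with plain S3 dispatch in base R. All names here (`instantiate`, `FixedDesc`, `HelperDesc`) are hypothetical stand-ins rather than mlr functions, and the helper uses `split()` and `sample()` instead of mlr's `chunk()`. The sketch only shows why temporarily renaming the class avoids the method calling itself until the C stack limit is reached.

```r
# Hypothetical generic standing in for instantiateResampleInstance().
instantiate <- function(desc, size) UseMethod("instantiate")

# Helper method: plain random partition of `size` units into `desc$iters` folds.
instantiate.HelperDesc <- function(desc, size) {
  split(sample(seq_len(size)), rep_len(seq_len(desc$iters), size))
}

instantiate.FixedDesc <- function(desc, size) {
  # Calling instantiate(desc, size) with the class unchanged would dispatch
  # straight back to this method and recurse without end.
  class(desc)[1] <- "HelperDesc"   # redirect dispatch to the helper method
  folds <- instantiate(desc, size)
  class(desc)[1] <- "FixedDesc"    # restore the original class
  unname(folds)
}

desc <- structure(list(iters = 3L), class = "FixedDesc")
folds <- instantiate(desc, size = 3L)
length(folds)
```

An alternative design would be to factor the shared body into an ordinary, non-generic helper function and call it from both methods, which avoids mutating the class attribute altogether.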
36 changes: 18 additions & 18 deletions docs/articles/tutorial/advanced_tune.html

16 changes: 8 additions & 8 deletions docs/articles/tutorial/bagging.html
