New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable using fully predefined indices in resampling #2412
Changes from all commits
02261b2
a9ab8aa
77955eb
51bf71d
b034ad2
c5749de
bb85a2c
5ae5272
a5ae62f
cac576d
192355c
beb3a92
afd50eb
c03fff1
75c33a4
8d75335
ae9568e
83d122c
f77f818
6269677
5e3d77d
7d9d079
af8a40d
54a76ad
66b0c23
7dc7606
312763e
c31d4e9
367979e
8f01847
50f51a4
9f6ab3e
87e1195
f5bc3a2
4f6e39d
dadd833
24b82d2
210cc85
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -8,10 +8,55 @@ instantiateResampleInstance.HoldoutDesc = function(desc, size, task = NULL) { | |||||
} | ||||||
|
||||||
instantiateResampleInstance.CVDesc = function(desc, size, task = NULL) { | ||||||
if (desc$iters > size) | ||||||
stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size) | ||||||
test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters) | ||||||
makeResampleInstanceInternal(desc, size, test.inds = test.inds) | ||||||
|
||||||
# Random sampling CV | ||||||
if (!desc$fixed) { | ||||||
if (desc$iters > size) { | ||||||
stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size) | ||||||
} | ||||||
test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters) | ||||||
makeResampleInstanceInternal(desc, size, test.inds = test.inds) | ||||||
} else { | ||||||
|
||||||
# CV with only predefined indices ("fixed") | ||||||
|
||||||
if(is.null(task$blocking)) { | ||||||
stopf("To use blocking in resampling, you need to pass a factor variable when creating the task!") | ||||||
} | ||||||
|
||||||
# In the inner call, the implementation is able to adapt by automatically reducing one level (see line if (0 %in% length.test.inds)). | ||||||
# So having always `length(iters) = length(levels(task$blocking)` is the most safe environment for the function to work. | ||||||
if (desc$iters != length(levels(task$blocking))) { | ||||||
desc$iters = length(levels(task$blocking)) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the number of iterations being adjusted here? There should be a warning. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Two reasons:
I added the explanation as a comment. |
||||||
warningf("Adjusting levels to match number of blocking levels.") | ||||||
} | ||||||
levs = levels(task$blocking) | ||||||
n_levels = length(levs) | ||||||
|
||||||
# Why do we need the helper desc? If we would call 'instantiateResampleInstance()' here, | ||||||
# we would call the function within itself and will receive an 'error-c-stack-usage-is-too-close-to-the-limit' error | ||||||
# So we simply change the class name to mimic a new function.. | ||||||
attr(desc, "class")[1] = "CVHelperDesc" | ||||||
# create fake ResampleInstance | ||||||
inst = instantiateResampleInstance(desc, n_levels, task) | ||||||
attr(desc, "class")[1] = "CVDesc" | ||||||
|
||||||
# now exchange block indices with indices of elements of this block and shuffle | ||||||
test.inds = lapply(inst$test.inds, function(i) which(task$blocking %in% levs[i])) | ||||||
|
||||||
# Nested resampling: We need to create a list with length(levels) first. | ||||||
# Then one fold will be length(0) because we are missing one factor level because we are in the inner level | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens if the number of outer folds is less than the number of levels (or a simple train/test split) and more than one factor level is missing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In Lines 27 to 28 in 192355c
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah ok. Could you add information on what the number of levels was and what it was set to in the warning please? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem we have here is the following:
Even if this doesn't sound logical, I would even vote for the complete removal of the warning. Users who set Not sure if this thinking if easy to follow here. Let me know if I should explain it again in more detail. I would propose to mention the adjustment in the tutorial (and in the help page?). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But at some point you can figure out whether there are enough levels for the folds, right? So there shouldn't need to be any false positives? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Further down in the function I have the final number of levels. But then I have no information if the level count was adjusted or not. And even if I would add a flag, I am missing the original level count since I think the effort of implementing a robust warning for both inner and outer is not worth the effort. Something like "Setting iters with fixed = T has no effect. iters will be set to length(blocking.levels) in the outer and length(blocking.levels) - 1 in the inner level". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for discussing this Lars! It's also not a solution I am completely happy with and a problem I thought a lot about. I'll do the required changes soon, including the tutorial page. Finally getting this done. |
||||||
# We check for this and remove this fold | ||||||
# There is no other way to do this. If we initially set "desc$iters" to length(levels) - 1, test.inds will not be created correctly | ||||||
length.test.inds = unlist(lapply(test.inds, function(x) length(x))) | ||||||
if (0 %in% length.test.inds) { | ||||||
index = match(0, length.test.inds) | ||||||
test.inds[[index]] = NULL | ||||||
size = length(task$env$data[,1]) | ||||||
desc$iters = length(test.inds) | ||||||
} | ||||||
makeResampleInstanceInternal(desc, size, test.inds = test.inds) | ||||||
} | ||||||
} | ||||||
|
||||||
instantiateResampleInstance.SpCVDesc = function(desc, size, task = NULL) { | ||||||
|
@@ -52,7 +97,7 @@ instantiateResampleInstance.BootstrapDesc = function(desc, size, task = NULL) { | |||||
|
||||||
instantiateResampleInstance.RepCVDesc = function(desc, size, task = NULL) { | ||||||
folds = desc$iters / desc$reps | ||||||
d = makeResampleDesc("CV", iters = folds) | ||||||
d = makeResampleDesc("CV", iters = folds, blocking.cv = desc$blocking.cv, fixed = desc$fixed) | ||||||
i = replicate(desc$reps, makeResampleInstance(d, size = size), simplify = FALSE) | ||||||
train.inds = Reduce(c, lapply(i, function(j) j$train.inds)) | ||||||
test.inds = Reduce(c, lapply(i, function(j) j$test.inds)) | ||||||
|
@@ -78,3 +123,10 @@ instantiateResampleInstance.GrowingWindowCVDesc = function(desc, size, task = NU | |||||
makeResamplingWindow(desc, size, task, coords, "GrowingWindowCV") | ||||||
} | ||||||
|
||||||
instantiateResampleInstance.CVHelperDesc = function(desc, size, task = NULL) { | ||||||
|
||||||
if (desc$iters > size) | ||||||
stopf("Cannot use more folds (%i) than size (%i)!", desc$iters, size) | ||||||
test.inds = chunk(seq_len(size), shuffle = TRUE, n.chunks = desc$iters) | ||||||
makeResampleInstanceInternal(desc, size, test.inds = test.inds) | ||||||
} |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With default values for
fixed
andblocking.cv
this could be simplyfixed = desc$fixed; blocking.cv = desc$blocking.cv
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this simplifies things. I've set defaults in
makeResampleDesc()
.