# Enable using fully predefined indices in resampling #2412

Merged 38 commits on Sep 27, 2018

## Conversation

Supersedes #2267

### To do

• Write tutorial section and explain the difference to "blocking"

### Idea

Use fully predefined indices via the blocking argument in the task for resampling.

#### Single resampling

4 classes -> 4 folds.

#### Nested resampling

• Outer: 4 classes -> 4 folds
• Inner: 3 classes -> 3 folds

This is only compatible with a single repetition, hence only with "CV".

To use predefined indices in "RepCV", one should use the already existing "blocking" implementation. This differs in that the number of classes does not also define the number of folds, so more combinations than just the number of folds can be generated.
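To make the distinction concrete, here is a minimal base R sketch. It is not the mlr implementation; the factor `blk` and the helper `blocked_folds()` are made up for illustration. With fixed indices every factor level is exactly one test fold, whereas with blocking the levels are kept together but assigned to folds at random, so repetitions can yield different partitions.

```r
# Hypothetical blocking factor: 12 observations in 4 groups
blk = factor(rep(c("a", "b", "c", "d"), each = 3))

# "fixed": every factor level becomes exactly one test fold
fixed_folds = lapply(levels(blk), function(lv) which(blk == lv))
names(fixed_folds) = levels(blk)

# "blocking" (illustrative helper, not an mlr function):
# levels stay together, but are randomly assigned to k folds
blocked_folds = function(blk, k) {
  levs = levels(blk)
  fold_of_level = sample(rep(seq_len(k), length.out = length(levs)))
  lapply(seq_len(k), function(f) which(blk %in% levs[fold_of_level == f]))
}

set.seed(1)
two_folds = blocked_folds(blk, k = 2)

str(fixed_folds)
str(two_folds)
```

Here the fixed variant always gives 4 folds of 3 observations each, while the blocked variant with `k = 2` gives 2 folds whose composition depends on the random assignment of levels.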

### Implementation

The user needs to initialize this special resampling by setting fixed = TRUE in makeResampleDesc(). Otherwise, the "blocking" implementation will be used.

As mentioned, the approach uses the factor variable supplied in the task via the blocking argument. I know this can cause confusion between "blocking" and "grouping", but we need to differentiate both approaches somehow.
On the other hand, this way we do not need a new task argument that takes a factor vector.

In a nested setting, a possible workflow would look as follows:

    # lrn, ps, ctrl and the task ct are assumed to be defined beforehand
    inner = makeResampleDesc("CV", iters = 4, fixed = TRUE)
    outer = makeResampleDesc("CV", iters = 5, fixed = TRUE)
    tune_wrapper = makeTuneWrapper(lrn, resampling = inner, par.set = ps,
      control = ctrl, show.info = FALSE)

    p = resample(tune_wrapper, ct, outer, show.info = FALSE,
      extract = getTuneResult)

So rather than doing a random sampling, we use the predefined indices specified in "blocking".

The function also handles a slight misspecification by issuing a warning:

    inner = makeResampleDesc("CV", iters = 5, fixed = TRUE)
    outer = makeResampleDesc("CV", iters = 5, fixed = TRUE)

    "iters (5) is not equal to length of blocking levels (4)!"

If the inner fold count exceeds the outer fold count, an error is thrown.

Logically, the inner fold count must always be one less than the outer fold count.
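The reason for "one less" is that each outer test fold removes one factor level from the training data, so only n - 1 levels remain for the inner folds. A small base R sketch of this bookkeeping (illustrative only, not the mlr code; `blk` is a made-up factor):

```r
# Hypothetical blocking factor with 4 levels
blk = factor(rep(c("a", "b", "c", "d"), each = 3))

for (outer_level in levels(blk)) {
  # The outer training set excludes the held-out level
  train_idx = which(blk != outer_level)
  # droplevels() removes the now-empty level, leaving n - 1 inner levels
  inner_blk = droplevels(blk[train_idx])
  cat("outer test fold:", outer_level,
      "| inner folds:", paste(levels(inner_blk), collapse = ", "), "\n")
}

# Number of inner levels when level "a" is the outer test fold
n_inner = nlevels(droplevels(blk[blk != "a"]))
```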

Users can also combine fixed indices in the outer level with random sampling in the inner level:

    inner = makeResampleDesc("CV", iters = 5)
    outer = makeResampleDesc("CV", iters = 5, fixed = TRUE)
    tune_wrapper = makeTuneWrapper(lrn, resampling = inner, par.set = ps,
      control = ctrl, show.info = FALSE)
    expect_success(resample(tune_wrapper, ct, outer, show.info = FALSE,
      extract = getTuneResult))

To explicitly avoid clashes between "fixed" and "blocking" when a "blocking" factor was given in the task, I had to add a small helper argument.
To use "blocking" in single "CV", the user now needs to explicitly enable it by using makeResampleDesc("CV", iters = 5, blocking.cv = TRUE).
But I guess people would mostly use "blocking" with "RepCV" anyway?

Just for clarification: This PR changes nothing about the existing "blocking" implementation besides the need to explicitly trigger it when using "CV".

### pat-s requested review from larskotthoff, mllg and jakob-r Aug 14, 2018

 use fixed instead of grouping 
 b034ad2 
### larskotthoff commented Aug 14, 2018

 Thanks. What exactly does fixed = TRUE do here? I didn't understand the difference between your first and second example. Since the number of iterations is fixed by the number of levels, why not make this the automatic choice instead of asking the user to specify it again?

### pat-s referenced this pull request Aug 15, 2018

Merged

#### getResamplingIndices(): Translate inner resampling indices to outer indices #2413

 update resample::blocking 
 192355c 
### pat-s commented Aug 16, 2018

 Vignette update added. Please review using the netlify preview: https://deploy-preview-2412--nervous-hopper-4136be.netlify.com/articles/resample.html

### larskotthoff requested changes Aug 16, 2018

    @@ -64,6 +64,9 @@
    #' else it will be a fraction of the total training indices. IE for 100 training sets and a value of .2, the increment
    #' of the resampling indices will be 20. Default is \dQuote{horizon} which gives mutually exclusive chunks
    #' of test indices.}
    #' \item{fixed (logical(1))}{Whether indices supplied via argument 'blocking' in the task should be used in resampling. Default is FALSE.

#### larskotthoff Aug 16, 2018

Contributor

This wording suggests that blocking is ignored in resampling, and the documentation for blocking says the opposite.

#### pat-s Sep 17, 2018

Author Member

Made it more clear now. Please take another look :)

    @@ -64,6 +64,9 @@
    #' else it will be a fraction of the total training indices. IE for 100 training sets and a value of .2, the increment
    #' of the resampling indices will be 20. Default is \dQuote{horizon} which gives mutually exclusive chunks
    #' of test indices.}
    #' \item{fixed (logical(1))}{Whether indices supplied via argument 'blocking' in the task should be used in resampling. Default is FALSE.
    #' 'grouping' only works with 'CV' and the supplied indices must match the number of observations.}

#### larskotthoff Aug 16, 2018

Contributor

Where does the grouping come from?

#### pat-s Sep 17, 2018

Author Member

Leftover, should be fixed

    @@ -64,6 +64,9 @@
    #' else it will be a fraction of the total training indices. IE for 100 training sets and a value of .2, the increment
    #' of the resampling indices will be 20. Default is \dQuote{horizon} which gives mutually exclusive chunks
    #' of test indices.}
    #' \item{fixed (logical(1))}{Whether indices supplied via argument 'blocking' in the task should be used in resampling. Default is FALSE.
    #' 'grouping' only works with 'CV' and the supplied indices must match the number of observations.}
    #' \item{blocking.cv (logical(1))}{Should 'blocking' be used in 'CV'? Default to FALSE}

#### larskotthoff Aug 16, 2018

Contributor

It sounds like this does the same thing as fixed.

#### pat-s Sep 17, 2018

Author Member

I added more detail to the doc and a link to the (not yet existing) tutorial page.

    if (length(blocking)) {
      # 'fixed' only exists by default for 'CV' -> is.null(desc$grouping)
      # only use this way of blocking of 'fixed = FALSE' -> is.null(desc$grouping)
      if(is.null(desc$fixed)) {

#### larskotthoff Aug 16, 2018

Contributor

Space after if (isn't this supposed to be checked automatically?).

#### larskotthoff Aug 16, 2018

Contributor

Also I would give fixed a default value in makeResampleDesc so that this can never be NULL to remove the additional check.

#### pat-s Sep 17, 2018

Author Member

Set a default value.

        } else {
          fixed = TRUE
        }
      }

#### larskotthoff Aug 16, 2018

Contributor

With default values for fixed and blocking.cv this could be simply fixed = desc$fixed; blocking.cv = desc$blocking.cv.

#### pat-s Sep 17, 2018

Author Member

Yes this simplifies things. I've set defaults in makeResampleDesc().

    }
    if (desc$iters != length(levels(task$blocking))) {
      desc$iters = length(levels(task$blocking))

#### larskotthoff Aug 16, 2018

Contributor

Why is the number of iterations being adjusted here? There should be a warning.

#### pat-s Sep 17, 2018

Author Member

Two reasons:

1. If the user inserts a different number than levels used for fixed in makeResampleDesc(), the function will error.
2. In the inner call, the function is able to adapt by automatically reducing one level.

So always having iters = length(levels(task$blocking)) is the safest environment for the function to work.

I added the explanation as a comment.

      desc$iters = length(levels(task$blocking))
    }
    levs = levels(task$blocking)
    n_levels = length(levels(task$blocking))

#### larskotthoff Aug 16, 2018

Contributor

Or just length(levs).

#### pat-s Sep 17, 2018

Author Member

Yes, thanks.

    test.inds = lapply(inst$test.inds, function(i) which(task$blocking %in% levs[i]))
    # Nested resampling: We need to create a list with length(levels) first.
    # Then one fold will be length(0) because we are missing one factor level because we are in the inner level

#### larskotthoff Aug 16, 2018

Contributor

What happens if the number of outer folds is less than the number of levels (or a simple train/test split) and more than one factor level is missing?

#### pat-s Sep 17, 2018

Author Member

In

Lines 27 to 28 in 192355c

    if (desc$iters != length(levels(task$blocking))) {
      desc$iters = length(levels(task$blocking))

we check that the number of folds always equals the number of levels.

#### larskotthoff Sep 17, 2018

Contributor

Ah ok. Could you add information on what the number of levels was and what it was set to in the warning please?
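As a rough sketch of what such a warning could look like (illustrative only; `adjust_iters()` is a hypothetical helper and not part of mlr, and the message wording is an assumption), the check could report both the requested number of iterations and the value it was reset to:

```r
# Hypothetical helper, not part of mlr: align iters with the number of
# blocking levels and report both values in the warning
adjust_iters = function(iters, blocking) {
  n_levels = nlevels(blocking)
  if (iters != n_levels) {
    warning(sprintf(
      "iters (%i) is not equal to length of blocking levels (%i); setting iters to %i.",
      iters, n_levels, n_levels))
    iters = n_levels
  }
  iters
}

blk = factor(rep(c("a", "b", "c", "d"), each = 3))

# Catch the warning, print its message and continue with the adjusted value
res = withCallingHandlers(
  adjust_iters(5L, blk),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

This sketch does not address how to tell the inner level from the outer level; it only shows the message content.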

#### pat-s Sep 24, 2018

Author Member

The problem we have here is the following:

• The warning is triggered for both inner and outer level
• Determining the level (inner/outer) is tricky
• The inner level resets to the factor levels of the outer level first and is then further adjusted. So we actually get a false positive for the inner level even if it's set correctly (e.g. outer = 5, inner = 4).

Even if this doesn't sound logical, I would even vote for the complete removal of the warning. Users who set fixed = T usually know what they want and what they need to set.
It's counterproductive if the warning is raised even if the specification is correct (e.g. outer = 5, inner = 4).

Not sure if this reasoning is easy to follow here. Let me know if I should explain it again in more detail.

I would propose to mention the adjustment in the tutorial (and in the help page?).

#### larskotthoff Sep 24, 2018

Contributor

But at some point you can figure out whether there are enough levels for the folds, right? So there shouldn't need to be any false positives?


#### larskotthoff Aug 16, 2018

Contributor

Should also check whether the right observations are together.

#### pat-s Sep 17, 2018

Author Member

Done.

tests/testthat/test_base_fixed_indices_cv.R

### pat-s commented Sep 17, 2018

 @larskotthoff Sorry for the delay, I was on vacation and busy with some other stuff. Hope you still know what's going on here :) Once the technical part is approved, I'll update the tutorial section.
 fix test expectations (handled by seed?) 
 5e3d77d 

### larskotthoff reviewed Sep 17, 2018

 {r} str(getResamplingIndices(p, inner = TRUE)) 

#### larskotthoff Sep 17, 2018

Contributor

Could you please mention here that the number of inner folds is automatically adjusted based on the available levels?

#### pat-s Sep 24, 2018

Author Member

Yes, as said, I will write a new section comparing blocking and grouping and explain what happens.

 update vignette 
 8f01847 
 update help page 
 50f51a4 
 merge master 
Merge branch 'master' into factor-cv

Conflicts: 76 generated files under docs/ (articles/tutorial pages and figures, favicon.ico, reference pages); list omitted.
 9f6ab3e 
 update NEWS 
 87e1195 
### pat-s commented Sep 27, 2018

 @larskotthoff updated help page and tutorial - please take a look. Remember that you can use the netlify preview of the docs (https://deploy-preview-2412--nervous-hopper-4136be.netlify.com/articles/tutorial/resample.html) once the pkgdown files have been deployed by Travis.

### larskotthoff commented Sep 27, 2018

 Thanks, merging.
