Filter: Add ensemble filter methods #2456

Merged
merged 115 commits into from Jul 19, 2019


@pat-s
Member

commented Oct 19, 2018

Purpose

In the filtering/feature selection field, ensemble filter methods are becoming increasingly popular. Ensemble filters aggregate the rankings of multiple single filters and create a new ranking.

This approach has been shown to outperform single filter methods, e.g. in https://ieeexplore.ieee.org/document/8250495.

Implementation

I decided to establish a new class "FilterEnsemble" which distinguishes the ensemble filters from the normal single filters. This decision has (as always) positive and negative side-effects.

Ensemble filters are created in the same way as normal filters in their own file, R/FilterEnsemble.R.
They share the same class structure with some minor differences:

  • no pkg, supported.tasks, supported.features arguments (all checked by the simple filters)
  • new argument basal.methods which stands for "single filter methods"

Calculation is done as usual via generateFilterValuesData() or filterFeatures():

  1. First, a FilterValues object is created as usual by calling generateFilterValuesData() with the single filters.
  2. Then, the specific ensemble filter calculations are done on the FilterValues object (e.g. taking the mean across all voters for each feature).
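The two steps can be sketched in base R without mlr. This is only an illustration of the idea, not the code in R/FilterEnsemble.R: the scores are copied from the iris example in this post, and ranking the scores within each base filter before averaging is an assumption that happens to reproduce the E-mean values shown there.

```r
# Base filter scores in long format (values copied from the iris example in this post).
fv = data.frame(
  name = rep(c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"), 2),
  method = rep(c("gain.ratio", "information.gain"), each = 4),
  value = c(0.871, 0.858, 0.420, 0.247, 0.955, 0.940, 0.452, 0.267),
  stringsAsFactors = FALSE
)

# Step 1: turn each base filter's scores into ranks (higher score -> higher rank).
fv$rank = ave(fv$value, fv$method, FUN = rank)

# Step 2: aggregate the ranks per feature, here with the mean ("E-mean"-style).
ens = aggregate(rank ~ name, data = fv, FUN = mean)
ens = ens[order(-ens$rank), ]
print(ens)
```

Swapping `mean` for `min`, `max` or `median` in step 2 gives the other simple aggregations.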

Notation

Notation differs a bit among the functions.
In generateFilterValuesData(), an ensemble method is passed in a list together with its required simple methods, e.g.:

generateFilterValuesData(iris.task, 
  method = list("E-min", c('gain.ratio','information.gain')))

To keep makeFilterWrapper() flexible, so that the single methods used by an ensemble method are tunable, a new argument base.methods (fw.base.methods in the wrapper) was introduced. It only takes effect if an ensemble method is set, either in filterFeatures(method = "") or in makeFilterWrapper(fw.method = "").

# lrn is a learner created beforehand, e.g. makeLearner("classif.lda")
makeFilterWrapper(lrn, fw.method = "E-min", 
  fw.base.methods = c("gain.ratio", "information.gain"))
filterFeatures(iris.task, method = "E-min", 
  base.methods = c("gain.ratio", "information.gain"), abs = 2)

This gives the user the option to tune

  • over multiple filters set in fw.method
  • over multiple single filters if an ensemble filter is within fw.method

Tuning simple filters is not supported due to the lack of sampling without replacement for DiscreteVectorParams in ParamHelpers (see berndbischl/ParamHelpers#206).

As multiple rankings are calculated and returned when using an ensemble filter, filterFeatures() will always prioritize the ensemble method unless a different method is set via the new select.method argument.
This only applies if filterFeatures() is called directly, because the wrapper uses only one filter method for subsetting anyway (and in the ensemble case, the ensemble method is prioritized).

Other changes

  • getFilterValuesData() now returns a tbl instead of a data.frame (I think there is no reason not to use the enhanced data.frame output; it does not harm any internal processes).
  • plotFilterValues() got a bit "smarter" and easier now regarding the ordering of multiple facets
  • I added multiple examples to the help pages of filterFeatures(), generateFilterValuesData() and makeFilterWrapper()
  • Instead of a wide data.frame the values are now returned in a long (tidy) data.frame. This makes it easier to apply post-processing methods (like group_by() calls etc)
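A small base R sketch of what the long layout makes easy, and of how code written against the old wide layout could convert back. The data.frame is built by hand from the iris example in this post (it is not produced by mlr here), and the names `fv`, `top2` and `wide` are only illustrative.

```r
# Filter values in the long layout (values copied from the iris example in this post).
fv = data.frame(
  name = rep(c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"), 2),
  type = "numeric",
  method = rep(c("gain.ratio", "information.gain"), each = 4),
  value = c(0.871, 0.858, 0.420, 0.247, 0.955, 0.940, 0.452, 0.267),
  stringsAsFactors = FALSE
)

# Grouped post-processing: keep the two best-scoring features per method.
top2 = do.call(rbind, lapply(split(fv, fv$method), function(d) {
  head(d[order(-d$value), ], 2)
}))
print(top2)

# Back to a wide layout with one column per method.
wide = reshape(fv, idvar = c("name", "type"), timevar = "method", direction = "wide")
names(wide) = sub("^value\\.", "", names(wide))
print(wide)
```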

To-do

  • tests

  • Cache filterValues in a tuning process (and don't recalculate them all the time) #1995

Examples

library(mlr)
#> Loading required package: ParamHelpers
#> Warning: replacing previous import 'stats::filter' by 'dplyr::filter' when
#> loading 'mlr'
fval = generateFilterValuesData(iris.task, 
  method = list("E-mean", c("gain.ratio", "information.gain")))
fval
#> FilterValues:
#> Task: iris-example
#> # A tibble: 12 x 4
#>    name         type    method           value
#>    <chr>        <chr>   <chr>            <dbl>
#>  1 Petal.Width  numeric E-mean           4    
#>  2 Petal.Length numeric E-mean           3    
#>  3 Sepal.Length numeric E-mean           2    
#>  4 Sepal.Width  numeric E-mean           1    
#>  5 Petal.Width  numeric gain.ratio       0.871
#>  6 Petal.Length numeric gain.ratio       0.858
#>  7 Sepal.Length numeric gain.ratio       0.420
#>  8 Sepal.Width  numeric gain.ratio       0.247
#>  9 Petal.Width  numeric information.gain 0.955
#> 10 Petal.Length numeric information.gain 0.940
#> 11 Sepal.Length numeric information.gain 0.452
#> 12 Sepal.Width  numeric information.gain 0.267

filterFeatures(iris.task, method = "E-min", 
  base.methods = c("gain.ratio", "information.gain"), abs = 2)
#> Supervised task: iris-example
#> Type: classif
#> Target: Species
#> Observations: 150
#> Features:
#>    numerics     factors     ordered functionals 
#>           2           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: FALSE
#> Classes: 3
#>     setosa versicolor  virginica 
#>         50         50         50 
#> Positive class: NA


### makeFilterWrapper(), can ofc also be used within tuneParams()
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.lda")
inner = makeResampleDesc("Holdout")
outer = makeResampleDesc("CV", iters = 2)

# usage of an ensemble filter
lrn = makeFilterWrapper(makeLearner("classif.lda"), fw.method = "E-Borda",
  fw.base.methods = c("gain.ratio", "information.gain"),
  fw.perc = 0.5)
r = resample(lrn, task, outer, extract = function(model) {
  getFilteredFeatures(model)
})
#> Resampling: cross-validation
#> Measures:             mmce
#> [Resample] iter 1:    0.0533333
#> [Resample] iter 2:    0.0533333
#> 
#> Aggregated Result: mmce.test.mean=0.0533333
#> 
print(r$extract)
#> [[1]]
#> [1] "Petal.Length" "Petal.Width" 
#> 
#> [[2]]
#> [1] "Petal.Length" "Petal.Width"

plotFilterValues(fval)

Created on 2018-10-22 by the reprex package (v0.2.1)

pat-s added some commits Oct 12, 2018

wip
@@ -26,12 +26,18 @@
#' Mutually exclusive with arguments `fw.perc` and `fw.abs`.
#' @param fw.mandatory.feat ([character])\cr
#' Mandatory features which are always included regardless of their scores
#' @param ensemble.method ([character])\cr
#' Which ensemble method should be used. Can only be used with >= 2 filter methods.

@larskotthoff

larskotthoff Oct 19, 2018

Contributor

How exactly does this work? You can only specify one method in the wrapper, can't you?

@larskotthoff

larskotthoff Oct 19, 2018

Contributor

Also why is this a character? Comments and code below suggest that this is a logical value.

@pat-s

pat-s Oct 21, 2018

Author Member

No, you can use multiple methods.

Also why is this a character? Comments and code below suggest that this is a logical value.

I'll check again. But as said, you can use multiple ones.

#' @template arg_task
#' @param method ([character])\cr
#' Filter method(s), see above.
#' Default is \dQuote{randomForestSRC.rfsrc}.
#' @param nselect (`integer(1)`)\cr
#' Number of scores to request. Scores are getting calculated for all features per default.
#' @param ensemble.method ([character])\cr
#' Ensemble filter method to use. Can only be used with >= 2 filter methods.

@larskotthoff

larskotthoff Oct 19, 2018

Contributor

Should be consistent with wrapper -- character or logical?

@pat-s

pat-s Oct 21, 2018

Author Member

Nothing is finished yet :)


### ensemble rank aggregation

if (any(c("E-min", "E-mean", "E-median", "E-max", "E-Borda") %in% ensemble.method)) {

@larskotthoff

larskotthoff Oct 19, 2018

Contributor

Possible values for ensemble method should be documented.

@pat-s

pat-s Oct 21, 2018

Author Member

Yep, I will ofc do this.
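For readers following along, a rough sketch of what the five names in the check above could stand for. The rank matrix and the filter names are made up, and treating "E-Borda" as the sum of ranks is an assumption about the usual Borda count, not something confirmed by the diff shown here.

```r
# Ranks of four features under three hypothetical base filters (rows = features).
ranks = cbind(
  filter.a = c(4, 3, 2, 1),
  filter.b = c(3, 4, 1, 2),
  filter.c = c(4, 2, 3, 1)
)
rownames(ranks) = c("f1", "f2", "f3", "f4")

# One aggregation function per ensemble method name.
aggregators = list(
  "E-min" = min, "E-mean" = mean, "E-median" = median, "E-max" = max,
  "E-Borda" = sum  # assumption: Borda as the sum of ranks
)

# Apply every aggregator row-wise; the result has one column per ensemble method.
ens = sapply(aggregators, function(f) apply(ranks, 1, f))
print(ens)
```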


test_that("ensemble methods work", {
fi = generateFilterValuesData(multiclass.task, method = c('gain.ratio','information.gain'),
ensemble.method = c("E-Borda", "E-min"))

@larskotthoff

larskotthoff Oct 19, 2018

Contributor

What does it mean if multiple ensemble methods are specified?

@pat-s

pat-s Oct 21, 2018

Author Member

The same as if you would use multiple single ones. You get back a DF with all listed rankings when you use generateFilterValuesData().

@mb706

Contributor

commented Oct 20, 2018

Instead of putting lots of special code into the generateFilterValuesData function, the ensemble stuff should probably happen somewhere else. My suggestion is that the filter code should be changed to also accept functions or Filter objects (i.e. the objects found in mlr:::.FilterRegister). Ensembles (and other interesting things) could then be implemented using functionals that create new filters from existing ones:

filterFeatures(pid.task, "univariate.model.score", abs = 3,
  perf.learner = "classif.logreg")
# (there should probably be a better way to access this than by ":::")
filterFeatures(pid.task, mlr:::.FilterRegister$univariate.model.score, abs = 3,
  perf.learner = "classif.logreg")
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
# alternative:
filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  filter.args = list(univariate.model.score = list(perf.learn = "classif.logreg")),
  abs = 3)

In these examples, the makeFilterEnsemble method would return a Filter object (mostly a function with some metadata about allowed task types) that does the ensemble things internally; "generateFilterValuesData" should not be involved in this and call the metafilter just the same way it calls an ordinary filter.
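The "functional that creates a new filter from existing ones" idea can be sketched in plain R. Everything below is hypothetical: `filter_variance`, `filter_abs_cor` and `make_filter_ensemble` are illustrative stand-ins with a made-up `(data, target)` interface, not mlr Filter objects and not the makeFilterEnsemble proposed above.

```r
# Hypothetical base filters: each takes a data.frame and a target name and
# returns one named score per feature.
filter_variance = function(data, target) {
  feats = setdiff(names(data), target)
  sapply(data[feats], var)
}
filter_abs_cor = function(data, target) {
  feats = setdiff(names(data), target)
  sapply(data[feats], function(x) abs(cor(x, data[[target]])))
}

# Functional: builds a new filter with the same interface from existing ones.
make_filter_ensemble = function(aggregate_fun, base_filters) {
  force(aggregate_fun)
  force(base_filters)
  function(data, target) {
    # One column of ranks per base filter, one row per feature.
    ranks = sapply(base_filters, function(f) rank(f(data, target)))
    apply(ranks, 1, aggregate_fun)
  }
}

ens_median = make_filter_ensemble(median,
  list(variance = filter_variance, abs.cor = filter_abs_cor))
scores = ens_median(mtcars[, c("mpg", "wt", "hp", "qsec")], target = "mpg")
print(sort(scores, decreasing = TRUE))
```

Because the ensemble is itself called like any other filter, the calling code does not need a special case for it, which is the point of the suggestion.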

@pat-s

Member Author

commented Oct 20, 2018

Hi guys,

This is all WIP here. The main idea is to use them in makeFilterWrapper(). I'll come back to your comments later.

@pat-s

Member Author

commented Oct 21, 2018

@mb706 Thanks for your input.

Sounds like a good idea. I prefer the following notation:

filterFeatures(pid.task,
  makeFilterEnsemble("median", c("univariate.model.score", "variance", "anova.test")),
  abs = 3,
  univariate.model.score.perf.learn = "classif.logreg")
@pat-s

Member Author

commented Oct 21, 2018

  • new class FilterEnsemble
  • new listFilterEnsembleMethods() etc
  • wrapper working
  • filterFeatures() and generateFilterValuesData() working

Doc, tests and more concrete examples in the next days.

@pat-s

Member Author

commented Oct 22, 2018

@larskotthoff @mb706
Looking forward to your comments now - see first post.

@larskotthoff

Contributor

commented Oct 22, 2018

Looks like builds are failing...

@pat-s

Member Author

commented Oct 22, 2018

Tests etc are still missing. It's more about the general approach before fixing all the details and then changing everything again.

Would be great if we could talk about the "big picture" 🙂

pat-s and others added some commits Jun 19, 2019

@@ -1,5 +1,9 @@
# mlr 2.14.0.9000

## Breaking

- Instead of a wide `data.frame` filter values are now returned in a long (tidy) `tibble`. This makes it easier to apply post-processing methods (like `group_by()`, etc) (@pat-s, #2456)

@jakob-r

jakob-r Jun 19, 2019

Member

I could imagine that this is actually something more people programmed against. So a breaking change here might irritate some people. Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.
Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

@pat-s

pat-s Jun 19, 2019

Author Member

I could imagine that this is actually something more people programmed against.

This refers to returning a long DF?
Changing this again would involve several hours since I have to re-arrange all outputs..

Regarding internal calculations: As we have data.table as a dependency - why don't we just use that?

Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

Also I don't see the need to use a tibble here. If someone wants to do anything to the values one is free to transform and convert them as one pleases.

Yes, sure. For now it is only used for printing, not internally (i.e. the DF is coerced right before it's returned). Which ofc makes no difference for the Import of tibble. I just hate printing a DF that fills my console to Inf...

@jakob-r

jakob-r Jun 21, 2019

Member

This refers to returning a long DF?

Yes

Changing this again would involve several hours since I have to re-arrange all outputs..

Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.

Besides one exception for which I failed with base R, we are not using any dplyr stuff internally.

Can you point me to it?

I just hate it to print a DF that fills my console to Inf...

Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.

if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
  print.data.frame = function(x, ...) {
    tibble:::print.tbl(tibble::as_tibble(x), ...)
  }
}

@pat-s

pat-s Jun 21, 2019

Author Member

Just saying that it could break someone's code. It is something that could make a reverse dependency check worth it before going on CRAN.

It will break code since the structure of the returned filter value is different, yes.

It is something that could make a reverse dependency check worth it before going on cran.

Yes, I always do that.

Then just add the following to your .Rprofile. No need to add a whole package to the dependencies.

Nice hack. I'll use it :) - and get rid of using tibble then in the package.

Can you point me to it?

out = tidyr::gather(out, method, "value", !!dplyr::enquo(method))

I tried a lot of non-dplyr stuff here but eventually gave up.

@jakob-r

jakob-r Jun 21, 2019

Member

I updated it using melt from data.table
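For context, a minimal sketch of the wide-to-long step with data.table's melt(). This is not necessarily the exact call that went into the PR; the table is built by hand from two rows of the iris example in this post, and the column names follow the printed FilterValues output.

```r
library(data.table)

# Wide layout: one column per filter method (values from the iris example).
wide = data.table(
  name = c("Petal.Width", "Petal.Length"),
  type = "numeric",
  gain.ratio = c(0.871, 0.858),
  information.gain = c(0.955, 0.940)
)

# Long layout: one row per (feature, method) pair.
long = melt(wide, id.vars = c("name", "type"),
  variable.name = "method", value.name = "value")
print(long)
```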

@pat-s

pat-s Jun 21, 2019

Author Member

🚀
so easy, damn...

pat-s and others added some commits Jun 21, 2019

@pat-s

Member Author

commented Jun 22, 2019

I assume that the ensemble filters are not using caching since I see long runtimes for them in my study. The aggregation step cannot cause this so I assume the simple filters are not being taken from the cache. Have to inspect.

@pat-s

Member Author

commented Jun 23, 2019

I made a mistake during merging (most likely) - caching was not used so far in this PR because the memoized function was not used. See 049969f. Fixed it now.

I was wondering heavily why everything took so long in my project.. 🙄 🤦‍♂
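To illustrate why calling the non-memoized function hurts so much with ensemble filters, here is a generic base R memoization sketch. It is not the caching mechanism used in mlr and does not reproduce commit 049969f; `compute_filter` and `memoize` are hypothetical helpers.

```r
# Hypothetical stand-in for an expensive base filter computation.
n_calls = 0
compute_filter = function(method) {
  n_calls <<- n_calls + 1  # count how often the real work is done
  nchar(method)            # dummy "score"
}

# Minimal memoizer: caches results in an environment keyed by the argument.
memoize = function(f) {
  cache = new.env()
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

compute_filter_cached = memoize(compute_filter)

# Simulate three tuning iterations that each need both base filters.
for (i in 1:3) {
  compute_filter_cached("gain.ratio")
  compute_filter_cached("information.gain")
}
print(n_calls)
```

With the cached version the base filters are computed once each; calling `compute_filter` directly in the loop would do the work six times.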

pat-s and others added some commits Jun 23, 2019

@pat-s

Member Author

commented Jun 24, 2019

@larskotthoff @jakob-r
I guess we're good for now here. If I encounter more issues along the way I'll fix them separately.

@jakob-r If you approve your review, feel free to merge.

DESCRIPTION Outdated
@@ -207,6 +208,7 @@ Suggests:
LiblineaR,
lintr (>= 1.0.0.9001),
MASS,
magrittr,

@jakob-r

jakob-r Jun 25, 2019

Member

Not needed.

DESCRIPTION Outdated
@@ -147,6 +147,7 @@ Imports:
ggplot2,
methods,
parallelMap (>= 1.3),
rlang,

@jakob-r

jakob-r Jun 25, 2019

Member

where is .data used?

@pat-s

pat-s Jun 25, 2019

Author Member

Leftover 👍 See commit.

pat-s and others added some commits Jun 25, 2019

@pat-s

Member Author

commented Jul 1, 2019

@jakob-r Mergeable now?

pat-s and others added some commits Jul 18, 2019

@pat-s

Member Author

commented Jul 19, 2019

Merging now.

@pat-s pat-s merged commit 3092400 into master Jul 19, 2019

1 check passed

deploy/netlify Deploy preview ready!
Details

@pat-s pat-s deleted the fs-ensemble branch Jul 19, 2019
