Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilterFindCorrelation #62

Merged
merged 13 commits into from Feb 24, 2020
Merged

FilterFindCorrelation #62

merged 13 commits into from Feb 24, 2020

Conversation

@mb706
Copy link
Contributor

mb706 commented Feb 11, 2020

See #61: Filter that emulates caret::findFilterCorrelation(exact = FALSE). Only trouble is: findFilterCorrelation(cutoff = 0.3) excludes more features than findFilterCorrelation(cutoff = 0.7), whereas for filter scores, lower cutoff values mean fewer features get excluded. Therefore this filter produces negative scores.

> task = tsk("sonar")
> #what are the features names dropped by findCorrelation(cutoff = 0.9)?
> task$feature_names[caret::findCorrelation(cor(task$data(cols = task$feature_names)), cutoff = 0.9, exact = FALSE)]
[1] "V18" "V15" "V20"
>
> # what are the features with scores < -0.9?
> which(flt("findcorrelation")$calculate(task)$scores < -0.9)
V20 V15 V18 
 58  59  60 
>
> # what are the features dropped by PipeOpFilter(filter.cutoff = -0.9)?
> filtered_task = po("filter", flt("findcorrelation"), filter.cutoff = -0.9)$train(list(task))[[1]]
> setdiff(task$feature_names, filtered_task$feature_names)
[1] "V15" "V18" "V20"

This fixes #61 (at least all that can be fixed).

mb706 added 3 commits Feb 11, 2020
Copy link
Member

pat-s left a comment

We do not necessarily need to match the cutoff numbers of {caret} with the cutoff used in mlr3.

Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".

In this case a mlr3 cutoff value of 0.1 would correspond to {carets}'s 0.9 value: A feature with a high value is "better" and means that it has a low correlation value, according to the findcorrelation filter.
Imo documenting this should suffice.

This way users can trust the "1 to x" ranking of filters in mlr3 which means that the "best" feature of a filter comes in at rank 1.

Reprex

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3db
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

task = tsk("sonar")

filtered_task = po("filter", flt("findcorrelation"), 
                   filter.cutoff = 0.1)$train(list(task))[[1]]
setdiff(task$feature_names, filtered_task$feature_names)
#> [1] "V15" "V18" "V20"

which(flt("findcorrelation")$calculate(task)$scores < 0.1)
#> V20 V15 V18 
#>  58  59  60

Created on 2020-02-19 by the reprex package (v0.3.0)

R/FilterFindCorrelation.R Outdated Show resolved Hide resolved
public = list(
initialize = function() {
super$initialize(
id = "correlation",

This comment has been minimized.

Copy link
@pat-s

pat-s Feb 19, 2020

Member

We already have a filter with id = "correlation".

Can we find a more matching name that deviates clearly from the existing one?

This comment has been minimized.

Copy link
@mb706

mb706 Feb 20, 2020

Author Contributor

This is obviously a typo, it should be "findcorrelation", because it imitates the caret::findCorrelation function. What other name would you suggest / what naming scheme do you use in mlr3filters?

DESCRIPTION Outdated
@@ -43,7 +43,8 @@ Suggests:
mlr3measures,
praznik,
rpart,
testthat
testthat,
caret

This comment has been minimized.

Copy link
@pat-s

pat-s Feb 19, 2020

Member

Is this already implemented in {tidymodels}? I am worried about adding this to suggest because {caret}'s lifecycle is coming to its end in the foreseeable future.

This comment has been minimized.

Copy link
@mb706

mb706 Feb 20, 2020

Author Contributor

caret is only used for comparison in tests and not actually loaded by the filter, but I can have a look if they copied the behaviour to somewhere in another package.

@github-actions github-actions bot deployed to production Feb 19, 2020 Active
@github-actions github-actions bot deployed to production Feb 19, 2020 Active
@github-actions github-actions bot deployed to production Feb 19, 2020 Active
@github-actions github-actions bot deployed to production Feb 20, 2020 Active
@github-actions github-actions bot deployed to production Feb 20, 2020 Active
@github-actions github-actions bot deployed to production Feb 20, 2020 Active
@github-actions github-actions bot deployed to production Feb 20, 2020 Active
@mb706

This comment has been minimized.

Copy link
Contributor Author

mb706 commented Feb 20, 2020

Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".

The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores? Currently a findCorrelation cutoff value of 0.9 corresponds to a filter value of -0.9 .

@pat-s

This comment has been minimized.

Copy link
Member

pat-s commented Feb 21, 2020

The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores?

Yes, because positive scores / a ranking with the "best" features in front aligns with the way cutoff can be tuned. I don't care about the sign so much but having it positive stays in line with all other filters.

We just need a clear statement in the help page that features with high values actually mean low correlation with others and hence they are ranked best.

@pat-s pat-s merged commit 8c9bc21 into master Feb 24, 2020
8 checks passed
8 checks passed
windows-latest (release)
Details
windows-latest (release)
Details
macOS-latest (release)
Details
macOS-latest (release)
Details
macOS-latest (devel)
Details
macOS-latest (devel)
Details
ubuntu-18.04 (release)
Details
ubuntu-18.04 (release)
Details
@pat-s pat-s deleted the findcorrelation branch Feb 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

2 participants
You can’t perform that action at this time.