diff --git a/docs/articles/tutorial/cost_sensitive_classif.html b/docs/articles/tutorial/cost_sensitive_classif.html
index e58a04febb..ef7df3b71e 100644
--- a/docs/articles/tutorial/cost_sensitive_classif.html
+++ b/docs/articles/tutorial/cost_sensitive_classif.html
@@ -391,7 +391,7 @@

 ## Prediction: 1000 observations
 ## predict.type: prob
 ## threshold: Bad=0.50,Good=0.50
-## time: 0.01
+## time: 0.02
 ##   id truth   prob.Bad prob.Good response
 ## 1  1  Good 0.03525092 0.9647491      Good
 ## 2  2   Bad 0.63222363 0.3677764       Bad
@@ -418,7 +418,7 @@

 ## Prediction: 1000 observations
 ## predict.type: prob
 ## threshold: Bad=0.17,Good=0.83
-## time: 0.01
+## time: 0.02
 ##   id truth   prob.Bad prob.Good response
 ## 1  1  Good 0.03525092 0.9647491      Good
 ## 2  2   Bad 0.63222363 0.3677764       Bad
diff --git a/docs/articles/tutorial/create_filter.html b/docs/articles/tutorial/create_filter.html
index afb8dd7d92..101614c0d4 100644
--- a/docs/articles/tutorial/create_filter.html
+++ b/docs/articles/tutorial/create_filter.html
@@ -357,22 +357,40 @@

## Supported features: numerics,factors,ordered

The nonsense.filter is now registered in mlr and shown by listFilterMethods().

listFilterMethods()$id
-##  [1] anova.test                 auc                       
-##  [3] carscore                   cforest.importance        
-##  [5] chi.squared                FSelectorRcpp.gainratio   
-##  [7] FSelectorRcpp.infogain     FSelectorRcpp.symuncert   
-##  [9] kruskal.test               linear.correlation        
-## [11] mrmr                       nonsense.filter           
-## [13] oneR                       permutation.importance    
-## [15] praznik.CMIM               praznik.DISR              
-## [17] praznik.JMI                praznik.JMIM              
-## [19] praznik.MIM                praznik.MRMR              
-## [21] praznik.NJMIM              randomForest.importance   
-## [23] randomForestSRC.rfsrc      randomForestSRC.var.select
-## [25] ranger.impurity            ranger.permutation        
-## [27] rank.correlation           relief                    
-## [29] univariate.model.score     variance                  
-## 33 Levels: anova.test auc carscore cforest.importance ... variance
+## [1] anova.test
+## [2] auc
+## [3] carscore
+## [4] cforest.importance
+## [5] FSelector_chi.squared
+## [6] FSelector_gain.ratio
+## [7] FSelector_information.gain
+## [8] FSelector_oneR
+## [9] FSelector_relief
+## [10] FSelector_symmetrical.uncertainty
+## [11] FSelectorRcpp_gain.ratio
+## [12] FSelectorRcpp_information.gain
+## [13] FSelectorRcpp_symmetrical.uncertainty
+## [14] kruskal.test
+## [15] linear.correlation
+## [16] mrmr
+## [17] nonsense.filter
+## [18] permutation.importance
+## [19] praznik_CMIM
+## [20] praznik_DISR
+## [21] praznik_JMI
+## [22] praznik_JMIM
+## [23] praznik_MIM
+## [24] praznik_MRMR
+## [25] praznik_NJMIM
+## [26] randomForest.importance
+## [27] randomForestSRC.rfsrc
+## [28] randomForestSRC.var.select
+## [29] ranger.impurity
+## [30] ranger.permutation
+## [31] rank.correlation
+## [32] univariate.model.score
+## [33] variance
+## 36 Levels: anova.test auc carscore ... variance

You can use it like any other filter method already integrated in mlr (i.e., via the method argument of generateFilterValuesData() or the fw.method argument of makeFilterWrapper()); see also the page on feature selection.

d = generateFilterValuesData(iris.task, method = c("nonsense.filter", "anova.test"))
 d
diff --git a/docs/articles/tutorial/feature_selection.html b/docs/articles/tutorial/feature_selection.html
index 00c5bd5d94..311773cd53 100644
--- a/docs/articles/tutorial/feature_selection.html
+++ b/docs/articles/tutorial/feature_selection.html
@@ -316,24 +316,30 @@ 

Calculating the feature importance

Different methods for calculating feature importance are built into mlr's function generateFilterValuesData(). Currently, classification, regression and survival analysis tasks are supported. A table showing all available methods can be found in the article on filter methods.

The most basic approach is to use generateFilterValuesData() directly on a Task() with a character string specifying the filter method.

-
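A minimal sketch of such a call, assuming iris.task and the renamed FSelectorRcpp_information.gain filter (the filter choice is an assumption):

# compute one importance score per feature of iris.task
fv = generateFilterValuesData(iris.task, method = "FSelectorRcpp_information.gain")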

fv is a FilterValues() object and fv$data contains a data.frame that gives the importance values for all features. Optionally, a vector of filter methods can be passed.

- +
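A minimal sketch with a vector of methods, producing the fv2 object that is plotted below (the particular pair of filters is an assumption):

# several filters at once; one score column per method ends up in fv2$data
fv2 = generateFilterValuesData(iris.task,
  method = c("FSelectorRcpp_information.gain", "FSelectorRcpp_gain.ratio"))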

A bar plot of importance values for the individual features can be obtained using function plotFilterValues().

plotFilterValues(fv2) + ggpubr::theme_pubr()

@@ -352,7 +358,7 @@

Function filterFeatures() supports these three methods as shown in the following example. Moreover, you can either specify the method for calculating the feature importance or you can use previously computed importance values via argument fval.

# Keep the 2 most important features
-filtered.task = filterFeatures(iris.task, method = "FSelectorRcpp.infogain", abs = 2)
+filtered.task = filterFeatures(iris.task, method = "FSelectorRcpp_information.gain", abs = 2)
 
 # Keep the 25% most important features
 filtered.task = filterFeatures(iris.task, fval = fv, perc = 0.25)
@@ -384,12 +390,13 @@ 

Using fixed parameters

In the following example we calculate the 10-fold cross-validated error rate (mmce) of the k-nearest neighbor classifier (FNN::fnn()) with preceding feature selection on the iris (datasets::iris()) data set. We use FSelectorRcpp_information.gain as the importance measure, with the aim of subsetting the data set to the two features with the highest importance. In each resampling iteration, feature selection is carried out on the corresponding training data set before fitting the learner.

-lrn = makeFilterWrapper(learner = "classif.fnn", fw.method = "FSelectorRcpp.infogain", fw.abs = 2)
-rdesc = makeResampleDesc("CV", iters = 10)
-r = resample(learner = lrn, task = iris.task, resampling = rdesc, show.info = FALSE, models = TRUE)
-r$aggr
-## mmce.test.mean 
-##           0.04
+lrn = makeFilterWrapper(learner = "classif.fnn",
+  fw.method = "FSelectorRcpp_information.gain", fw.abs = 2)
+rdesc = makeResampleDesc("CV", iters = 10)
+r = resample(learner = lrn, task = iris.task, resampling = rdesc, show.info = FALSE, models = TRUE)
+r$aggr
+## mmce.test.mean 
+##           0.04

You may want to know which features have been used. Luckily, we have called resample() with the argument models = TRUE, which means that r$models contains a list of models (makeWrappedModel()) fitted in the individual resampling iterations. In order to access the selected feature subsets we can call getFilteredFeatures() on each model.

sfeats = sapply(r$models, getFilteredFeatures)
 table(sfeats)
@@ -408,7 +415,7 @@ 

  • The threshold of the filter method (fw.threshold)
In the following regression example we consider the BostonHousing (mlbench::BostonHousing()) data set. We use a Support Vector Machine and determine the optimal percentage value for feature selection such that the 3-fold cross-validated mean squared error (mse()) of the learner is minimal. Additionally, we tune the hyperparameters of the algorithm at the same time. As the search strategy for tuning, a random search with five iterations is used.

-lrn = makeFilterWrapper(learner = "regr.ksvm", fw.method = "chi.squared")
    +
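A minimal sketch of how this setup could look after the filter rename; the tuned hyperparameters and their ranges are assumptions, not the original values:

lrn = makeFilterWrapper(learner = "regr.ksvm", fw.method = "FSelector_chi.squared")
# tune the filter percentage together with (assumed) SVM hyperparameters C and sigma
ps = makeParamSet(
  makeNumericParam("fw.perc", lower = 0, upper = 1),
  makeNumericParam("C", lower = -10, upper = 10, trafo = function(x) 2^x),
  makeNumericParam("sigma", lower = -10, upper = 10, trafo = function(x) 2^x))
rdesc = makeResampleDesc("CV", iters = 3)
res = tuneParams(lrn, task = bh.task, resampling = rdesc, par.set = ps,
  control = makeTuneControlRandom(maxit = 5), show.info = FALSE)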
     

After tuning we can generate a new wrapped learner with the optimal percentage value for further use (e.g. to predict on new data).

    -
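A minimal sketch of that step, assuming the tuning result from the sketch above is stored in res:

# set all tuned values (fw.perc plus the SVM hyperparameters) on the wrapped learner
lrn = setHyperPars(lrn, par.vals = res$x)
mod = train(lrn, bh.task)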

sfeats is a FeatSelResult (selectFeatures()) object. The selected features and the corresponding performance can be accessed as follows:
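For example (slot names as documented for FeatSelResult):

sfeats$x  # character vector of selected feature names
sfeats$y  # performance of the best feature set found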

    Further information about the sequential feature selection process can be obtained by function analyzeFeatSelResult().

    analyzeFeatSelResult(sfeats)
    @@ -579,7 +586,7 @@ 

    The selected features are:

    sfeats$x
    @@ -601,27 +608,27 @@ 

    lapply(r$models, getFeatSelResult)
     ## [[1]]
     ## FeatSel result:
    -## Features (18): mean_radius, mean_perimeter, mean_compactness, mean_concavity, mean_concavepoints, SE_texture, SE_perimeter, SE_compactness, SE_concavepoints, SE_symmetry, SE_fractaldim, worst_perimeter, worst_area, worst_smoothness, worst_compactness, worst_concavity, worst_symmetry, pnodes
    +## Features (18): mean_radius, mean_perimeter, mean_compactness, ...
     ## cindex.test.mean=0.5382065
     ## 
     ## [[2]]
     ## FeatSel result:
    -## Features (18): mean_radius, mean_perimeter, mean_area, mean_smoothness, mean_concavity, mean_symmetry, mean_fractaldim, SE_texture, SE_area, SE_compactness, SE_concavepoints, worst_radius, worst_texture, worst_area, worst_compactness, worst_concavepoints, tsize, pnodes
    +## Features (18): mean_radius, mean_perimeter, mean_area, mean_sm...
     ## cindex.test.mean=0.6349051
     ## 
     ## [[3]]
     ## FeatSel result:
    -## Features (20): mean_texture, mean_smoothness, mean_concavity, mean_concavepoints, mean_symmetry, mean_fractaldim, SE_texture, SE_perimeter, SE_compactness, SE_concavepoints, SE_symmetry, SE_fractaldim, worst_texture, worst_area, worst_smoothness, worst_compactness, worst_concavity, worst_concavepoints, worst_symmetry, pnodes
    +## Features (20): mean_texture, mean_smoothness, mean_concavity, ...
     ## cindex.test.mean=0.6812985
     ## 
     ## [[4]]
     ## FeatSel result:
    -## Features (11): mean_perimeter, mean_concavity, mean_concavepoints, mean_symmetry, SE_perimeter, SE_symmetry, worst_smoothness, worst_compactness, worst_concavity, worst_symmetry, tsize
    +## Features (11): mean_perimeter, mean_concavity, mean_concavepoi...
     ## cindex.test.mean=0.6924829
     ## 
     ## [[5]]
     ## FeatSel result:
    -## Features (14): mean_area, mean_smoothness, mean_fractaldim, SE_texture, SE_area, SE_compactness, SE_concavity, SE_concavepoints, SE_symmetry, SE_fractaldim, worst_area, worst_compactness, tsize, pnodes
    +## Features (14): mean_area, mean_smoothness, mean_fractaldim, SE...
     ## cindex.test.mean=0.6701811

diff --git a/docs/articles/tutorial/feature_selection_files/figure-html/unnamed-chunk-4-1.png b/docs/articles/tutorial/feature_selection_files/figure-html/unnamed-chunk-4-1.png
index 24e3aa36d2..65e69f8e53 100644
Binary files a/docs/articles/tutorial/feature_selection_files/figure-html/unnamed-chunk-4-1.png and b/docs/articles/tutorial/feature_selection_files/figure-html/unnamed-chunk-4-1.png differ
diff --git a/docs/articles/tutorial/filter_methods.html b/docs/articles/tutorial/filter_methods.html
index 2d8cdc63ad..8eaca17820 100644
--- a/docs/articles/tutorial/filter_methods.html
+++ b/docs/articles/tutorial/filter_methods.html
@@ -290,20 +290,20 @@

    Current methods

 Method                                 | Package       | Description
-chi.squared                            | FSelector     | Chi-squared statistic of independence between feature and target
+FSelector_chi.squared                  | FSelector     | Chi-squared statistic of independence between feature and target
+FSelector_gain.ratio                   | FSelector     | Entropy-based gain ratio between feature and target
+FSelector_information.gain             | FSelector     | Entropy-based information gain between feature and target
+FSelector_oneR                         | FSelector     | oneR association rule
+FSelector_relief                       | FSelector     | RELIEF algorithm
+FSelector_symmetrical.uncertainty      | FSelector     | Entropy-based symmetrical uncertainty between feature and target
-FSelectorRcpp.gainratio                | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
+FSelectorRcpp_gain.ratio               | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
-FSelectorRcpp.infogain                 | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
+FSelectorRcpp_information.gain         | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
-FSelectorRcpp.symuncert                | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
+FSelectorRcpp_symmetrical.uncertainty  | FSelectorRcpp | Entropy-based Filters: Algorithms that find ranks of importance of discrete attributes, basing on their entropy with a continuous class attribute
 kruskal.test                           |               | Kruskal Test for binary and multiclass classification tasks
 linear.correlation                     |               | Pearson correlation between feature and target
 mrmr                                   | mRMRe         | Minimum redundancy, maximum relevance filter
-oneR                                   | FSelector     | oneR association rule
 permutation.importance                 |               |
-praznik.CMIM                           | praznik       | Minimal conditional mutual information maximisation filter
+praznik_CMIM                           | praznik       | Minimal conditional mutual information maximisation filter
-praznik.DISR                           | praznik       | Double input symmetrical relevance filter
+praznik_DISR                           | praznik       | Double input symmetrical relevance filter
-praznik.JMI                            | praznik       | Joint mutual information filter
+praznik_JMI                            | praznik       | Joint mutual information filter
-praznik.JMIM                           | praznik       | Minimal joint mutual information maximisation filter
+praznik_JMIM                           | praznik       | Minimal joint mutual information maximisation filter
-praznik.MIM                            | praznik       | Conditional mutual information based feature selection filters
+praznik_MIM                            | praznik       | Conditional mutual information based feature selection filters
-praznik.MRMR                           | praznik       | Minimum redundancy maximal relevancy filter
+praznik_MRMR                           | praznik       | Minimum redundancy maximal relevancy filter
-praznik.NJMIM                          | praznik       | Minimal normalised joint mutual information maximisation filter
+praznik_NJMIM                          | praznik       | Minimal normalised joint mutual information maximisation filter
-relief                                 | FSelector     | RELIEF algorithm
 univariate.model.score                 |               | Resamples an mlr learner for each input feature individually. The resampling performance is used as filter score, with rpart as default learner.
 variance                               |               | A simple variance filter

diff --git a/docs/articles/tutorial/learning_curve_files/figure-html/LearningCurveTPFP-1.png b/docs/articles/tutorial/learning_curve_files/figure-html/LearningCurveTPFP-1.png
index d2a6e48ea3..53aa3e4b61 100644
Binary files a/docs/articles/tutorial/learning_curve_files/figure-html/LearningCurveTPFP-1.png and b/docs/articles/tutorial/learning_curve_files/figure-html/LearningCurveTPFP-1.png differ
diff --git a/docs/articles/tutorial/nested_resampling.html b/docs/articles/tutorial/nested_resampling.html
index 46468e0549..0e94d05561 100644
--- a/docs/articles/tutorial/nested_resampling.html
+++ b/docs/articles/tutorial/nested_resampling.html
@@ -581,7 +581,7 @@

Filter methods assign an importance value to each feature. Based on these values you can select a feature subset by either keeping all features with importance higher than a certain threshold or by keeping a fixed number or percentage of the highest ranking features. Often, neither the threshold nor the number or percentage of features is known in advance and thus tuning is necessary.

    In the example below the threshold value (fw.threshold) is tuned in the inner resampling loop. For this purpose the base Learner (makeLearner()) "regr.lm" is wrapped two times. First, makeFilterWrapper() is used to fuse linear regression with a feature filtering preprocessing step. Then a tuning step is added by makeTuneWrapper().
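A minimal sketch of this double wrapping, assuming bh.task; the concrete grid and resampling settings are assumptions:

lrn = makeFilterWrapper(learner = "regr.lm", fw.method = "FSelector_chi.squared")
# tune the filter threshold in the inner CV loop
ps = makeParamSet(makeNumericParam("fw.threshold", lower = 0, upper = 1))
inner = makeResampleDesc("CV", iters = 2)
lrn = makeTuneWrapper(lrn, resampling = inner, par.set = ps,
  control = makeTuneControlGrid(), show.info = FALSE)
outer = makeResampleDesc("CV", iters = 3)
r = resample(lrn, bh.task, resampling = outer, models = TRUE, show.info = FALSE)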

    +## Hyperparameters: fw.method=FSelector_ch...

    The result of the feature selection can be extracted by function getFilteredFeatures(). Almost always all 13 features are selected.
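A sketch of that extraction, assuming r is the outer resample result with models = TRUE from the sketch above; for tune-wrapped models the filtered model sits in $learner.model$next.model:

# selected features per outer fold
lapply(r$models, function(mod) getFilteredFeatures(mod$learner.model$next.model))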

    You can access results for individual learners and tasks and inspect them further.

    feats = getBMRFeatSelResults(res, learner.id = "regr.lm.featsel", drop = TRUE)
    @@ -891,7 +891,7 @@ 

    Example 3: One task, two learners, feature filtering with tuning

    Here is a minimal example for feature filtering with tuning of the feature subset size.

    # Feature filtering with tuning in the inner resampling loop
    -lrn = makeFilterWrapper(learner = "regr.lm", fw.method = "chi.squared")
    +lrn = makeFilterWrapper(learner = "regr.lm", fw.method = "FSelector_chi.squared")
     ps = makeParamSet(makeDiscreteParam("fw.abs", values = seq_len(getTaskNFeats(bh.task))))
     ctrl = makeTuneControlGrid()
 inner = makeResampleDesc("CV", iters = 2)
    diff --git a/docs/news/index.html b/docs/news/index.html
    index 5fad52e039..1734c05dca 100644
    --- a/docs/news/index.html
    +++ b/docs/news/index.html
    @@ -323,20 +323,45 @@ 

    filter - new

-  • praznik.JMI
-  • praznik.DISR
-  • praznik.JMIM
-  • praznik.MIM
-  • praznik.NJMIM
-  • praznik.MRMR
-  • praznik.CMIM
+  • praznik_JMI
+  • praznik_DISR
+  • praznik_JMIM
+  • praznik_MIM
+  • praznik_NJMIM
+  • praznik_MRMR
+  • praznik_CMIM
+  • FSelectorRcpp_gain.ratio
+  • FSelectorRcpp_information.gain
+  • FSelectorRcpp_symmetrical.uncertainty

    filter - general

-  • Replaced filters information.gain, gain.ratio and symmetrical.uncertainty depending on package FSelector by package FSelectorRcpp. The change comes with a ~ 100 times speedup.
+  • Added filters FSelectorRcpp_gain.ratio, FSelectorRcpp_information.gain and FSelectorRcpp_symmetrical.uncertainty from package FSelectorRcpp. These filters are ~ 100 times faster than the FSelector implementations. Please note that the two packages do things slightly differently internally, so the FSelectorRcpp methods should not be seen as a direct replacement for the FSelector filters.
+  • Prefixed all filters from package FSelector with "FSelector_" to distinguish them from the new FSelectorRcpp filters:
+      • information.gain -> FSelector_information.gain
+      • gain.ratio -> FSelector_gain.ratio
+      • symmetrical.uncertainty -> FSelector_symmetrical.uncertainty
+      • chi.squared -> FSelector_chi.squared
+      • relief -> FSelector_relief
+      • oneR -> FSelector_oneR
    @@ -347,6 +372,15 @@

  • regr.liquidSVM
+featSel - general

+  • The FeatSelResult object now contains an additional slot x.bit.names that stores the optimal bits.
+  • The slot x now always contains the real feature names and not the bit.names.
+  • This fixes a bug and makes makeFeatSelWrapper usable with custom bit.names.
