
mlr3fselect

Package website: release | dev

mlr3fselect is the feature selection package of the mlr3 ecosystem. It searches for the best-performing feature set for any mlr3 learner. The package supports several optimization algorithms, e.g., Random Search, Recursive Feature Elimination, and Genetic Search. Moreover, it can automatically optimize learners and estimate the performance of optimized feature sets with nested resampling. The package is built on the optimization framework bbotk.
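As a rough illustration of the nested resampling mentioned above, the sketch below wraps a learner in an `AutoFSelector` and resamples the wrapped learner. It assumes a recent mlr3fselect where `AutoFSelector$new()` accepts these named arguments; `classif.rpart`, the holdout inner resampling, and the budget of 5 evaluations are choices made here only to keep the run short.

```r
library("mlr3verse")

# Learner that runs a feature selection on its training data before fitting.
# classif.rpart is an assumption made for speed, not a recommendation.
afs = AutoFSelector$new(
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),            # inner resampling: scores candidate feature sets
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),  # small budget, chosen for illustration
  fselector = fs("random_search")
)

# Outer resampling: estimates the performance of the whole procedure,
# including the feature selection step.
rr = resample(tsk("spam"), afs, rsmp("cv", folds = 3))
score = rr$aggregate(msr("classif.ce"))
score
```

The outer folds never see the data used to pick the features, which is what keeps the performance estimate from being optimistically biased.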

Resources

There are several sections about feature selection in the mlr3book.

The gallery features a collection of case studies and demos about optimization.

The cheatsheet summarizes the most important functions of mlr3fselect.

Installation

Install the latest release from CRAN:

install.packages("mlr3fselect")

Install the development version from GitHub:

remotes::install_github("mlr-org/mlr3fselect")

Example

We run a feature selection for a support vector machine on the Spam data set.

library("mlr3verse")

tsk("spam")
## <TaskClassif:spam> (4601 x 58): HP Spam Detection
## * Target: type
## * Properties: twoclass
## * Features (57):
##   - dbl (57): address, addresses, all, business, capitalAve, capitalLong, capitalTotal,
##     charDollar, charExclamation, charHash, charRoundbracket, charSemicolon,
##     charSquarebracket, conference, credit, cs, data, direct, edu, email, font, free,
##     george, hp, hpl, internet, lab, labs, mail, make, meeting, money, num000, num1999,
##     num3d, num415, num650, num85, num857, order, original, our, over, parts, people, pm,
##     project, re, receive, remove, report, table, technology, telnet, will, you, your

We construct an instance with the fsi() function. The instance describes the optimization problem.

instance = fsi(
  task = tsk("spam"),
  learner = lrn("classif.svm", type = "C-classification"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 20)
)
instance
## <FSelectInstanceSingleCrit>
## * State:  Not optimized
## * Objective: <ObjectiveFSelect:classif.svm_on_spam>
## * Search Space:
##             id    class lower upper nlevels
##  1:    address ParamLgl    NA    NA       2
##  2:  addresses ParamLgl    NA    NA       2
##  3:        all ParamLgl    NA    NA       2
##  4:   business ParamLgl    NA    NA       2
##  5: capitalAve ParamLgl    NA    NA       2
## ---                                        
## 53: technology ParamLgl    NA    NA       2
## 54:     telnet ParamLgl    NA    NA       2
## 55:       will ParamLgl    NA    NA       2
## 56:        you ParamLgl    NA    NA       2
## 57:       your ParamLgl    NA    NA       2
## * Terminator: <TerminatorEvals>

We select a simple random search as the optimization algorithm.

fselector = fs("random_search", batch_size = 5)
fselector
## <FSelectorRandomSearch>: Random Search
## * Parameters: batch_size=5
## * Properties: single-crit, multi-crit
## * Packages: mlr3fselect
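Random search is only one of the available algorithms. A minimal sketch of how others might be looked up and constructed, assuming the `mlr_fselectors` dictionary and the `"sequential"` and `"rfe"` keys of a recent mlr3fselect release:

```r
library("mlr3verse")

# Keys of all feature selection algorithms registered in the dictionary
keys = mlr_fselectors$keys()
keys

# Sequential forward selection: starts with an empty set and adds features
fselector_sfs = fs("sequential", strategy = "sfs")

# Recursive feature elimination: needs a learner that reports feature importance
fselector_rfe = fs("rfe")
```

Any of these objects can be used in place of the random search fselector in the steps that follow.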

To start the feature selection, we simply pass the instance to the fselector.

fselector$optimize(instance)

The fselector writes the best feature set to the instance.

instance$result_feature_set
##  [1] "address"           "addresses"         "all"               "business"         
##  [5] "capitalAve"        "capitalLong"       "capitalTotal"      "charDollar"       
##  [9] "charExclamation"   "charHash"          "charRoundbracket"  "charSemicolon"    
## [13] "charSquarebracket" "conference"        "credit"            "cs"               
## [17] "data"              "direct"            "edu"               "email"            
## [21] "font"              "free"              "george"            "hp"               
## [25] "internet"          "lab"               "labs"              "mail"             
## [29] "make"              "meeting"           "money"             "num000"           
## [33] "num1999"           "num3d"             "num415"            "num650"           
## [37] "num85"             "num857"            "order"             "our"              
## [41] "parts"             "people"            "pm"                "project"          
## [45] "re"                "receive"           "remove"            "report"           
## [49] "table"             "technology"        "telnet"            "will"             
## [53] "you"               "your"

The instance also stores the corresponding measured performance.

instance$result_y
## classif.ce 
## 0.07042005

The archive contains all evaluated feature sets.

as.data.table(instance$archive)
##     address addresses   all business capitalAve capitalLong capitalTotal charDollar charExclamation
##  1:    TRUE      TRUE  TRUE     TRUE       TRUE        TRUE         TRUE       TRUE            TRUE
##  2:    TRUE      TRUE  TRUE    FALSE      FALSE        TRUE         TRUE       TRUE            TRUE
##  3:    TRUE      TRUE FALSE    FALSE       TRUE        TRUE         TRUE       TRUE            TRUE
##  4:    TRUE      TRUE  TRUE     TRUE       TRUE        TRUE         TRUE       TRUE            TRUE
##  5:   FALSE     FALSE FALSE    FALSE      FALSE       FALSE        FALSE       TRUE           FALSE
## ---                                                                                                
## 16:   FALSE     FALSE FALSE    FALSE      FALSE       FALSE        FALSE      FALSE           FALSE
## 17:   FALSE     FALSE FALSE     TRUE       TRUE        TRUE        FALSE      FALSE            TRUE
## 18:   FALSE     FALSE  TRUE     TRUE      FALSE       FALSE        FALSE       TRUE           FALSE
## 19:    TRUE      TRUE  TRUE     TRUE      FALSE        TRUE         TRUE       TRUE            TRUE
## 20:    TRUE     FALSE  TRUE    FALSE      FALSE        TRUE        FALSE       TRUE           FALSE
## 55 variables not shown: [charHash, charRoundbracket, charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu, ...]
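Because the archive converts to a `data.table`, it can be summarized with ordinary data.table syntax. The sketch below ranks the evaluated feature sets by classification error and counts the features in each. To stay self-contained and quick it builds a smaller stand-in instance (`classif.rpart`, holdout, 10 evaluations), which is an assumption of this example and not the setup used above; the column name `n_selected` is likewise made up here.

```r
library("mlr3verse")
library("data.table")

# Small, fast stand-in for the instance above
instance = fsi(
  task = tsk("spam"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 10)
)
fs("random_search", batch_size = 5)$optimize(instance)

archive = as.data.table(instance$archive)
feature_cols = instance$archive$cols_x   # names of the logical feature columns

# Number of selected features in each evaluated set
archive[, n_selected := rowSums(.SD), .SDcols = feature_cols]

# Evaluated feature sets ordered by classification error, best first
ranked = archive[order(classif.ce), .(batch_nr, n_selected, classif.ce)]
ranked
```

A view like this makes it easy to spot whether a much smaller feature set performs nearly as well as the best one.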

We fit a final model with the optimized feature set to make predictions on new data.

task = tsk("spam")
learner = lrn("classif.svm", type = "C-classification")

task$select(instance$result_feature_set)
learner$train(task)
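Predicting on new data could then look like the following sketch. It is self-contained, so it uses a hand-picked feature subset in place of `instance$result_feature_set` and holds out 500 rows of the spam task to play the role of new data; both are assumptions made for illustration.

```r
library("mlr3verse")

task = tsk("spam")

# Hypothetical feature subset standing in for instance$result_feature_set
selected = c("charExclamation", "charDollar", "remove", "free", "hp")
task$select(selected)

# Hold out some rows to act as unseen data
set.seed(1)
test_ids = sample(task$row_ids, 500)
train_ids = setdiff(task$row_ids, test_ids)

learner = lrn("classif.svm", type = "C-classification")
learner$train(task, row_ids = train_ids)

# New data must contain the selected feature columns
new_data = task$data(rows = test_ids)
prediction = learner$predict_newdata(new_data)

# Scoring works here only because the held-out rows still carry the target
prediction$score(msr("classif.ce"))
```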