---
title: "Comparing mlr3pipelines to other frameworks"
author: "Florian Pfisterer"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Comparing mlr3pipelines to other frameworks}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  cache = FALSE,
  collapse = TRUE,
  comment = "#>"
)
set.seed(8008135)
compiler::enableJIT(0)
library("mlr3")
library("mlr3pipelines")
```
# Comparing mlr3pipelines to other frameworks
Below, we collect some examples in which **mlr3pipelines** is compared to other software packages,
such as **mlr**, **recipes** and **sklearn**.
Before diving deeper, we give a short introduction to PipeOps.
## An introduction to "PipeOp"s
In this example, we create a linear Pipeline.
After scaling all input features, we rotate our data using principal component analysis.
After this transformation, we use a simple Decision Tree learner for classification.
As exemplary data, we will use the "`iris`" classification task.
This object contains the famous iris dataset and some meta-information, such as the target variable.
```{r}
library("mlr3")
task = mlr_tasks$get("iris")
```
We quickly split our data into a train and a test set:
```{r}
test.idx = sample(seq_len(task$nrow), 30)
train.idx = setdiff(seq_len(task$nrow), test.idx)
# Set task to only use train indexes
task$row_roles$use = train.idx
```
A Pipeline (or `Graph`) contains multiple pipeline operators ("`PipeOp`"s), where each `PipeOp`
transforms the data when it flows through it. For this use case, we require 3 transformations:
- A `PipeOp` that scales the data
- A `PipeOp` that performs PCA
- A `PipeOp` that contains the **Decision Tree** learner
A list of available `PipeOp`s can be obtained from
```{r}
library("mlr3pipelines")
mlr_pipeops$keys()
```
First we define the required `PipeOp`s:
```{r}
op1 = PipeOpScale$new()
op2 = PipeOpPCA$new()
op3 = PipeOpLearner$new(learner = mlr_learners$get("classif.rpart"))
```
### A quick glance into a PipeOp
In order to get a better understanding of what the respective PipeOps do,
we quickly look at one of them in detail:
The most important slots in a PipeOp are:
- `$train()`: A function used to train the PipeOp.
- `$predict()`: A function used to predict with the PipeOp.
The `$train()` and `$predict()` functions define the core functionality of our PipeOp.
In many cases, it is imperative to treat train and test data separately in order not to leak information from the training set into the test set. For this, we require a `$train()` function that learns the appropriate transformation from the training set and a `$predict()` function that applies the transformation to future data.
In the case of `PipeOpPCA` this means the following:
- `$train()` learns a rotation matrix from its input and saves this matrix to
an additional slot, `$state`. It returns the rotated input data stored
in a new `Task`.
- `$predict()` uses the rotation matrix stored in `$state` in order to rotate
future, unseen data. It returns those in a new `Task`.
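To make this more concrete, here is a minimal sketch of training a single `PipeOpPCA` by hand and inspecting the `$state` it creates (note that `PipeOp`s take and return *lists* of objects):
```{r, eval = FALSE}
pca_op = PipeOpPCA$new()
trained = pca_op$train(list(task))  # returns a list containing the rotated data as a new Task
pca_op$state$rotation               # the rotation matrix learned during training
pca_op$predict(list(task))          # applies the stored rotation to (possibly new) data
```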
### Constructing the Pipeline
We can now connect the `PipeOp`s constructed earlier to a **Pipeline**.
We can do this using the `%>>%` operator.
```{r}
linear_pipeline = op1 %>>% op2 %>>% op3
```
The result of this operation is a "`Graph`".
A `Graph` connects the input and output of each `PipeOp` to the following `PipeOp`.
This allows us to specify linear processing pipelines.
In this case, we connect the output of the **scaling** PipeOp to the input of the **PCA** PipeOp
and the output of the **PCA** PipeOp to the input of **PipeOpLearner**.
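Before training, we can inspect the resulting `Graph`, for example by printing it or by plotting its structure (a quick sketch):
```{r, eval = FALSE}
linear_pipeline         # prints the PipeOps and the edges connecting them
linear_pipeline$plot()  # visualizes the Graph
```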
We can now train the `Graph` using the `iris` **Task**.
```{r}
linear_pipeline$train(task)
```
When we train the graph, the data flows through it as follows:
- The Task flows into the `PipeOpScale`. The `PipeOp` scales each column in the data
contained in the Task and returns a new Task that contains the scaled data to its output.
- The scaled Task flows into the `PipeOpPCA`. PCA transforms the data and returns a (possibly smaller)
Task, that contains the transformed data.
- This transformed data then flows into the learner, in our case **classif.rpart**.
It is then used to train the learner, and as a result saves a model that can be
used to predict new data.
In order to predict on new data, we need to save the relevant transformations our data went through
during training. Therefore, each `PipeOp` saves a state, which stores the information required to appropriately
transform future data. In our case, this is the **mean** and **standard deviation** of each column for `PipeOpScale`, the PCA rotation matrix for `PipeOpPCA`, and the learned model for `PipeOpLearner`.
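For illustration, these stored states can be inspected directly on the trained `Graph` (a sketch, assuming the default `PipeOp` ids):
```{r, eval = FALSE}
linear_pipeline$pipeops$scale$state          # per-column centering and scaling values
linear_pipeline$pipeops$pca$state            # among others, the PCA rotation matrix
linear_pipeline$pipeops$classif.rpart$state  # state of the trained learner, including the rpart model
```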
```{r}
# predict on test.idx
task$row_roles$use = test.idx
linear_pipeline$predict(task)
```
## mlr3pipelines vs. mlr
In order to showcase the benefits of **mlr3pipelines** over **mlr's** `Wrapper` mechanism,
we compare the case of imputing missing values before filtering the top 2 features and then applying a learner.
While **mlr** wrappers are generally less verbose and require a little less code, this comes at the cost of flexibility.
Wrappers, for example, generally cannot process data in parallel.
### mlr
```{r, eval = FALSE}
library("mlr")
# We first create a learner
lrn = makeLearner("classif.rpart")
# Wrap this learner in a FilterWrapper
lrn.wrp = makeFilterWrapper(lrn, fw.abs = 2L)
# And wrap the resulting wrapped learner into an ImputeWrapper.
lrn.wrp = makeImputeWrapper(lrn.wrp)
# Afterwards, we can train the resulting wrapped learner on a task
train(lrn.wrp, iris.task)
```
### mlr3pipelines
```{r, eval = FALSE}
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")
impute = PipeOpImpute$new()
filter = PipeOpFilter$new(filter = FilterVariance$new(), param_vals = list(filter.nfeat = 2L))
rpart = PipeOpLearner$new(mlr_learners$get("classif.rpart"))
# Assemble the Pipeline
pipeline = impute %>>% filter %>>% rpart
# And convert to a 'GraphLearner'
learner = GraphLearner$new(id = "Pipeline", pipeline)
```
The fact that **mlr's** wrappers have to be applied inside-out, i.e. in reverse order,
is often confusing. This is much more straightforward in `mlr3pipelines`, where
we simply chain the different steps using `%>>%`. Additionally, `mlr3pipelines`
offers far greater flexibility with respect to the kinds of Pipelines that can be constructed:
it allows for the construction of parallel and conditional pipelines,
which was previously not possible.
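As a brief illustration of this added flexibility, the following sketch (a hypothetical example, not taken from either package's documentation) builds a conditional pipeline in which a `PipeOpBranch` decides whether PCA or no transformation is applied before the learner:
```{r, eval = FALSE}
branch = PipeOpBranch$new(2)
alternatives = gunion(list(PipeOpPCA$new(), PipeOpNOP$new()))
unbranch = PipeOpUnbranch$new(2)
conditional_pipeline = branch %>>% alternatives %>>% unbranch %>>%
  PipeOpLearner$new(mlr_learners$get("classif.rpart"))
# The active path can be selected (or tuned over) via the branch's "selection" parameter:
conditional_pipeline$param_set$values$branch.selection = 1
```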
## mlr3pipelines vs. sklearn.pipeline.Pipeline
To broaden the horizon, we also compare to **Python's** **sklearn** `Pipeline`.
`sklearn.pipeline.Pipeline` sequentially applies a list of transforms before fitting a final estimator.
Intermediate steps of the pipeline are `transforms`, i.e. steps that can learn from the data,
but also transform the data while it flows through them.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters;
to this end, it enables setting the parameters of the various steps.
It is thus conceptually very similar to `mlr3pipelines`.
As with `mlr3pipelines`, we can tune over a full `Pipeline` using various tuning methods.
`Pipeline` mainly supports linear pipelines: it can execute parallel steps, for example for
**Bagging**, but it does not support conditional execution, i.e. something like `PipeOpBranch`.
At the same time, the different `transforms` in the pipeline can be cached, which makes tuning over the configuration
space of a `Pipeline` more efficient, as executing some steps multiple times can be avoided.
Below, we compare functionality available in both `mlr3pipelines` and `sklearn.pipeline.Pipeline`.
The following example, obtained from the **sklearn** documentation, showcases a **Pipeline**
that selects a feature and performs PCA on the original data, concatenates the resulting data sets,
and applies a Support Vector Machine.
### sklearn
```{python, eval = FALSE}
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features where good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
```
### mlr3pipelines
```{r, eval = FALSE}
library("mlr3verse")
iris = mlr_tasks$get("iris")
# Build the steps
copy = PipeOpCopy$new(2)
pca = PipeOpPCA$new()
selection = PipeOpFilter$new(filter = FilterVariance$new())
union = PipeOpFeatureUnion$new(2)
svm = PipeOpLearner$new(mlr_learners$get("classif.svm", param_vals = list(kernel = "linear")))
# Assemble the Pipeline
pipeline = copy %>>% gunion(list(pca, selection)) %>>% union %>>% svm
learner = GraphLearner$new(id = "Pipeline", pipeline)
# For tuning, we define the resampling and the Parameter Space
resampling = mlr3::mlr_resamplings$get("cv", param_vals = list(folds = 5L))
param_set = paradox::ParamSet$new(params = list(
  paradox::ParamDbl$new("classif.svm.cost", lower = 0.1, upper = 1),
  paradox::ParamInt$new("pca.rank.", lower = 1, upper = 3),
  paradox::ParamInt$new("variance.filter.nfeat", lower = 1, upper = 2)
))
pe = PerformanceEvaluator$new(iris, learner, resampling, param_set)
terminator = TerminatorEvaluations$new(60)
tuner = TunerGridSearch$new(pe, terminator, resolution = 10)$tune()
# Set the learner to the optimal values and train
learner$param_set$values = tuner$tune_result()$values
```
In summary, we can achieve similar results with a comparable number of lines, while offering
greater flexibility with respect to which kinds of pipelines we want to optimize over.
In addition, experiments using `mlr3` can be arbitrarily parallelized via the `future` package.
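For example, the resampling of a `GraphLearner` can be parallelized simply by declaring a parallel plan with the `future` package before calling `resample()` (a minimal sketch):
```{r, eval = FALSE}
library("future")
plan("multisession")  # run resampling iterations in parallel R sessions
resample(iris, learner, mlr3::mlr_resamplings$get("cv"))
```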
## mlr3pipelines vs recipes
**recipes** is a package that covers some of the same preprocessing steps as
**mlr3pipelines**.
Both packages offer the possibility to connect different pre- and post-processing
methods using a pipe operator.
As the `recipes` package tightly integrates with the `tidymodels` ecosystem, much of the
functionality available there can be used together with `recipes`.
We compare `recipes` to `mlr3pipelines` using an example from the `recipes` vignette.
The aim of the analysis is to predict whether customers pay back their loans given some
information on the customers. To do this, we build a model with the following preprocessing steps:
1. Missing values are imputed using k-nearest neighbours.
2. All factor variables are converted to numerics using dummy encoding.
3. The data is first centered and then scaled.
To validate the procedure, the data is first split into a train and a test set using
`initial_split()`, `training()` and `testing()`. The recipe trained on the train data (see steps above)
is then applied to the test data.
### recipes
```{r, eval = FALSE}
library("tidymodels")
library("rsample")
data("credit_data")
set.seed(55)
train_test_split = initial_split(credit_data)
credit_train = training(train_test_split)
credit_test = testing(train_test_split)
rec = recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
trained_rec = prep(rec, training = credit_train)
# Apply to train and test set
train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)
```
Afterwards, the transformed data can be used for training and prediction:
```{r, eval = FALSE}
# Train
rf = rand_forest(mtry = 12, trees = 200) %>%
  set_engine("ranger", importance = 'impurity') %>%
  fit(Status ~ ., data = train_data)
# Predict
prds = predict(rf, test_data)
```
### mlr3pipelines
The same analysis can be performed with **mlr3pipelines**:
```{r, eval = FALSE}
library("data.table")
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
data("credit_data", package = "recipes")
set.seed(55)
# Create the task
tsk = TaskClassif$new(id = "credit_task", target = "Status",
  backend = as_data_backend(data.table(credit_data)))
# Build up the Pipeline:
g = PipeOpImpute$new() %>>%
  PipeOpEncode$new() %>>%
  PipeOpScale$new() %>>%
  PipeOpLearner$new(mlr_learners$get("classif.ranger"))
# We can visualize what happens to the data using the `plot` function:
g$plot()
# And we can use `mlr3`'s full functionality by wrapping the Graph into a GraphLearner.
glrn = GraphLearner$new(g)
resample(tsk, glrn, mlr_resamplings$get("holdout"))
```
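If an explicit train/test split analogous to `initial_split()` in the **recipes** example is preferred, the `GraphLearner` can also be trained and used for prediction on manually chosen row ids (a sketch):
```{r, eval = FALSE}
train_ids = sample(seq_len(tsk$nrow), floor(0.75 * tsk$nrow))
test_ids = setdiff(seq_len(tsk$nrow), train_ids)
glrn$train(tsk, row_ids = train_ids)
prds = glrn$predict(tsk, row_ids = test_ids)
```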