vignettes/basic-04-parallelization.Rmd

---
title: "Introduction to parallelization"
author: "Michel Lang"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to parallelization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(mlr3)
knitr::opts_knit$set(
  datatable.print.keys = FALSE,
  datatable.print.class = TRUE
)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
future::plan("sequential")
set.seed(123)
```

This introduction shows how to parallelize `mlr3` with the packages [`future`](https://cran.r-project.org/package=future) and [`future.apply`](https://cran.r-project.org/package=future.apply).

# Installation

Make sure you have installed `future` and `future.apply`:

```{r, eval=FALSE}
if (!requireNamespace("future"))
  install.packages("future")
if (!requireNamespace("future.apply"))
  install.packages("future.apply")
```

# Parallel resampling

The most outer loop in resampling runs independent repetitions of applying a learner on a subset of a task, predict on a different subset and score the performance by comparing true and predicted labels.
This loop is what is called embarrassingly parallel.

In the following, we will consider the spam task and a simple classification tree (`"classif.rpart"`) to illustrate the parallelization.

```{r, eval = FALSE}
library("mlr3")

task = mlr_tasks$get("spam")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("subsampling")

system.time(
  resample(task, learner, resampling)
)[3L]
```

We now use the `future` package to parallelize the resampling by selecting a backend via the function `plan` and then repeat the resampling.
We use the "multiprocess" backend here which uses threads on linux/mac and a socket cluster on windows:

```{r, eval = FALSE}
future::plan("multiprocess")
system.time(
  resample(task, learner, resampling)
)[3L]
```

On most systems you should see a decrease in the reported real CPU time.
On some systems (e.g. windows), the overhead for parallelization is quite large though.
Therefore, you should only enable parallelization for experiments which run more than 10s each.

Benchmarking is also parallelized. The following code sends 64 jobs (4 tasks * 16 resampling repeats) to the future backend:

```{r, eval = FALSE}
tasks = mlr_tasks$mget(c("iris", "spam", "pima"))
learners = mlr_learners$mget("classif.rpart")
resamplings = mlr_resamplings$mget("subsampling", param_vals = list(ratio = 0.8, repeats = 16))

future::plan("multiprocess")
system.time(
  benchmark(expand_grid(tasks, learners, resamplings))
)[3L]
```