---
title: "Introduction to Tasks"
author: "Michel Lang"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to Tasks}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
library(mlr3)
options(
datatable.print.keys = FALSE,
datatable.print.class = TRUE
)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
set.seed(123)
```
## Predefined tasks
`mlr3` ships with some popular machine learning toy tasks.
These are stored in `mlr3::mlr_tasks`, a `mlr3::Dictionary` that acts as a simple key-value store:
```{r}
library(mlr3)
mlr_tasks
```
The `mlr_tasks` object offers public member methods to modify and extend the key-value store; see the help page `?mlr_tasks`.
We can obtain an overview of all currently available tasks with:
```{r}
as.data.frame(mlr_tasks)
```
## Retrieving tasks
For illustration purposes, we now retrieve the popular [iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) from `mlr_tasks`:
```{r}
task = mlr_tasks$get("iris")
print(task)
```
The `task` object is a `mlr3::Task`, which holds various information about the respective task.
The task properties and characteristics can be queried using the task's public member values and methods (see `?mlr3::Task`).
Most of them should be self-explanatory, e.g.,
```{r}
# public member values
task$nrow
task$ncol
# public member methods
task$head(n = 3)
```
In `mlr3` tasks, each row has a unique identifier (row name), which can be either `integer` or `character`.
These can be used to select specific rows.
```{r}
# iris uses integer row_ids
task$row_ids
# filter rows 1, 51, and 101
task$data(rows = c(1, 51, 101))
# filter rows 1, 51, and 101 and only select column "Species"
task$data(rows = c(1, 51, 101), cols = "Species")
```
Note that the method `$data()` is only an accessor and does not modify the underlying data/task.
To modify the underlying data/task, you can use the [`$filter()`](basic-01-tasks.html#filter) and [`$select()`](basic-01-tasks.html#select) methods, which are mutators.
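As a quick check of the accessor behaviour, the following sketch re-uses only calls shown above. The names `iris_task` and `three_rows` are introduced here for illustration, so that `task` stays untouched.
```{r}
library(mlr3)
# retrieve the iris task under a separate, illustrative name
iris_task = mlr_tasks$get("iris")
n_before = iris_task$nrow
# $data() returns the requested subset of the data ...
three_rows = iris_task$data(rows = c(1, 51, 101))
nrow(three_rows)
# ... but the task itself still has all of its rows
iris_task$nrow == n_before
```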
Each task comes with at least one associated performance measure, stored as a list inside the task:
```{r}
task$measures
```
To change a measure for a task, simply overwrite this slot.
## Manual task creation
To manually create a task from a `data.frame`, you must determine the task type to select the respective constructor:
* Classification task: The target column contains labels (stored as `character`/`factor`) with only a few distinct values.
<br>$\Rightarrow$ `TaskClassif`.
* Regression task: The target column is numeric (stored as `integer`/`double`).
<br>$\Rightarrow$ `TaskRegr`.
* Cluster task: There is no target, but you want to identify similarities in the feature space.
<br>$\Rightarrow$ Not yet implemented.
Let's assume we want to create a simple regression task using the `mtcars` data set from the package `datasets` to predict the column `"mpg"` (miles per gallon).
We only keep the target and the first two features (`cyl` and `disp`) here to keep the output in the following examples short.
```{r}
data("mtcars", package = "datasets")
cars = mtcars[, 1:3]
str(cars)
```
Before we can create a regression task, we must create a `mlr3::DataBackend`, an abstraction for data storage systems.
Here, we will stick to the simplest form of data storage: an in-memory table format using `data.table::data.table()`.
We construct the backend first, and then pass it to the regression task constructor:
```{r}
b = as_data_backend(cars)
task = TaskRegr$new(id = "cars", backend = b, target = "mpg")
print(task)
```
Note that the `cars` `data.frame` has character row names, which will automatically be used as `row_ids`.
Analogous to the filtering of row ids by integers, we can also filter the row ids by the respective characters, e.g.:
```{r}
# cars data set uses character row_ids
task$row_ids
# filter rows with id "Merc 280" and "Volvo 142E"
task$data(rows = c("Merc 280", "Volvo 142E"))
```
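A classification task can presumably be created in the same way. The sketch below assumes that `TaskClassif$new()` accepts the same `id`, `backend`, and `target` arguments as `TaskRegr$new()` above; check `?TaskClassif` for the exact signature. It uses the `iris` data set from the package `datasets` with the factor column `"Species"` as target. The names `b_iris` and `iris_manual` are chosen here for illustration.
```{r}
library(mlr3)
data("iris", package = "datasets")
# the target "Species" is a factor with three levels, so this is a classification problem
b_iris = as_data_backend(iris)
# assumption: same constructor arguments as TaskRegr$new()
iris_manual = TaskClassif$new(id = "iris_manual", backend = b_iris, target = "Species")
print(iris_manual)
# the four measurement columns remain as features
iris_manual$feature_names
```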
## Column roles
Now, we want the original `rownames()` of `mtcars` to be a regular column.
Thus, we first pre-process the `data.frame` and then re-create the task.
```{r}
library("data.table")
# `as.data.table` removes rownames, whereas `keep.rownames` ensures
# that they are stored in a separate column.
data = as.data.table(cars[, 1:3], keep.rownames = TRUE)
b = as_data_backend(data)
task = TaskRegr$new(id = "cars", backend = b, target = "mpg")
# we now have integer row_ids
task$row_ids
# there is a new "feature" called "rn"
task$feature_names
```
In `mlr3`, columns (and rows) can be assigned roles.
We have seen three different roles for columns so far:
1. The target column (here `"mpg"`), also called dependent variable.
2. Features, also called independent variables.
3. The `row_id`. This column is there for technical reasons, and is typically useless for learning.
The different roles are stored as a list of column name vectors:
```{r}
task$col_roles
```
As the output shows, the column `"mpg"` is the target column and there are three features:
`"rn"` (row names), `"cyl"`, and `"disp"`.
More roles are documented in the help for tasks.
In the following, we want to learn on neither the primary key (which is taken care of by `mlr3`) nor the new column `rn` with the row names.
However, we still might want to carry `rn` around for different reasons.
For example, we can use the row names in plots or to associate outliers with the row names.
Therefore, we need to change the role of the row names column `rn` and remove it from the set of features.
```{r}
task$feature_names
task$set_col_role("rn", new_roles = character(0L))
# "rn" not listed as feature any more
task$feature_names
# also eliminated from "data" and "head"
task$data(rows = 1:2)
task$head(2)
```
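Conceptually, changing a column role only moves a column name between the name vectors of the role list. The following sketch mimics this with a plain R list; `roles` is a hypothetical stand-in for illustration and not the internal representation used by `mlr3`.
```{r}
# plain-list stand-in for the role list (illustration only, not mlr3 internals)
roles = list(feature = c("rn", "cyl", "disp"), target = "mpg")
# dropping "rn" from the feature role leaves the other assignments untouched
roles$feature = setdiff(roles$feature, "rn")
str(roles)
```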
## Row roles
Just like columns, it is also possible to assign different roles to rows.
Rows can have two different roles:
1. Role `"use"`:
Rows that are generally available for model fitting (although they may also be used as a test set in resampling).
This is the default role.
2. Role `"validation"`:
Rows that are held back (see below).
Rows which have missing values in the target column upon task creation are automatically moved to the validation set.
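The rule for missing target values can be sketched in base R as follows. This only illustrates the split on a made-up `data.frame` called `toy`; it is not how `mlr3` implements it, and the names `toy`, `ids`, and `sketch_roles` are hypothetical.
```{r}
# toy data: two observations have no label (made-up values)
toy = data.frame(x = c(0.5, 1.1, 2.3, 3.8, 4.2), y = c(1.2, NA, 3.1, NA, 5.0))
ids = seq_len(nrow(toy))
# split the row ids by whether the target "y" is observed
sketch_roles = list(
  use = ids[!is.na(toy$y)],
  validation = ids[is.na(toy$y)]
)
str(sketch_roles)
```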
There are several reasons to hold some observations back or treat them differently:
1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
2. Some observations may be unlabeled, e.g., in data mining cups or [Kaggle](https://www.kaggle.com/) competitions.
These observations cannot be used for training a model, but you can still predict labels for them.
Instead of creating a task with only a subset of observations and then manually applying the fitted model to a held-back `data.frame`, you can just call the function `validate()` later on.
Marking observations as validation works analogously to changing column roles:
```{r}
str(task$row_roles)
task$nrow
task$set_row_role(rows = 29:32, new_role = "validation")
task$nrow
```
All pre- and post-processing you have used on the training data is also applied to the validation data in exactly the same way.
## Task mutation
A task can be mutated using the methods `$filter()`, `$select()`, `$rbind()`, `$cbind()`, and `$overwrite()`.
The `iris` task is used again to showcase the mutators.
### $filter()
Subsetting based on rows is done with `$filter()`.
Afterwards, we can check the modified task by either querying the data with `$data()` or checking the number of rows.
```{r}
task = mlr_tasks$get("iris")
task$filter(c(10, 50, 99))
task$data()
task$nrow
```
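Row ids for `$filter()` do not have to be typed by hand. The sketch below derives them from a condition on a feature, using only the accessors introduced earlier. It assumes that `$data()` returns the rows in the same order as `$row_ids`; the names `long_sepals`, `sepal`, and `keep` are illustrative.
```{r}
library(mlr3)
long_sepals = mlr_tasks$get("iris")
# read the feature of interest without modifying the task
sepal = long_sepals$data(cols = "Sepal.Length")
# assumption: rows of $data() are in the same order as $row_ids
keep = long_sepals$row_ids[sepal$Sepal.Length > 7]
# mutate the task so that only the matching rows remain
long_sepals$filter(keep)
long_sepals$nrow
```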
### $select()
The equivalent method for subsetting columns is `$select()`.
```{r}
task$feature_names
task$select("Petal.Length")
task$ncol
```
You might wonder why there are still two columns left even though we selected only one.
The subsetting only applies to the columns listed as "feature" (`task$col_roles`).
The "target" column is not touched.
### $rbind(), $cbind()
These methods add rows or columns to the data set, respectively.
In the following example we duplicate the rows of the `iris` task (please do not do this in practice):
```{r}
# init a fresh task
task = mlr_tasks$get("iris")
# tasks are R6 objects: `task2` is a second reference to the same task, not a copy
task2 = task
task$rbind(task2$data())
task$nrow
```
The same logic applies to `$cbind()`.
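Since mutators change the task in place, a plain assignment only creates a second reference to the same object. The sketch below contrasts this with an independent copy. It assumes that tasks provide the standard R6 `$clone()` method; the names `original`, `alias`, and `copy` are illustrative.
```{r}
library(mlr3)
original = mlr_tasks$get("iris")
# plain assignment: second reference to the same object
alias = original
# assumption: tasks offer the standard R6 clone method
copy = original$clone(deep = TRUE)
original$filter(1:10)
# the alias follows the mutation of `original`
alias$nrow
# the clone keeps the number of rows from before the filter
copy$nrow
```
If an unmodified version of a task is needed later on, cloning it before calling a mutator is the safer choice.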