---
aliases:
- "/data_and_basic_modeling.html"
---
# Data and Basic Modeling {#sec-basics}
{{< include ../../common/_setup.qmd >}}
`r chapter = "Data and Basic Modeling"`
`r authors(chapter)`
In this chapter, we will introduce the `r mlr3` objects and corresponding `r ref_pkg("R6")` classes that implement the essential building blocks of machine learning.
These building blocks include the data (and the methods for creating training and test sets), the machine learning `r index('algorithm')` (and its training and prediction process), the configuration of a machine learning algorithm through its `r index('hyperparameters')`, and evaluation measures to assess the quality of predictions.
In the simplest definition, `r index('machine learning')` (ML) is the process of learning models of relationships from data.[Machine Learning/Supervised Learning]{.aside}
`r index('Supervised learning')` is a subfield of ML in which datasets consist of labeled observations, which means that each data point consists of `r index('features')`, which are variables to make predictions from, and a `r index('target')`, which is the quantity that we are trying to predict.
For example, predicting a car's miles per gallon (target) based on the car's properties (features) such as horsepower and the number of gears is a supervised learning problem, which we will return to several times in this book.
In `mlr3`, we refer to datasets and their associated metadata as `r index('tasks')` (@sec-tasks).
The term 'task' is used to refer to the prediction problem that we are trying to solve.
Tasks are defined by the features used for prediction and the targets to predict, so there can be multiple tasks associated with any given dataset.
For example, predicting miles per gallon (mpg) from horsepower is one task, predicting horsepower from mpg is another task, and predicting the number of gears from the car's model is yet another task.
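
To make this concrete, here is a minimal sketch (not part of the chapter's running example) that builds two regression tasks from the same `datasets::mtcars` columns using `as_task_regr()`, which is introduced in @sec-tasks. The task IDs `"mpg_from_hp"` and `"hp_from_mpg"` and the variable names are illustrative choices made here.

```r
library(mlr3)

# the same two columns of mtcars back both tasks
cars = datasets::mtcars[, c("mpg", "hp")]

# task 1: predict miles per gallon from horsepower
tsk_mpg = as_task_regr(cars, target = "mpg", id = "mpg_from_hp")
# task 2: predict horsepower from miles per gallon
tsk_hp = as_task_regr(cars, target = "hp", id = "hp_from_mpg")

# same data, different prediction problems
c(target = tsk_mpg$target_names, feature = tsk_mpg$feature_names)
c(target = tsk_hp$target_names, feature = tsk_hp$feature_names)
```

The only difference between the two tasks is which column is declared as the target, which is why a single dataset can give rise to several tasks.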
Supervised learning can be further divided into `r index('regression', aside = TRUE)` -- which is the prediction of numeric target values, e.g. predicting a car's mpg -- and `r index('classification', aside = TRUE)` -- which is the prediction of categorical values/labels, e.g., predicting a car's model.
@sec-special also discusses other tasks, including `r index('cost-sensitive classification', "cost-sensitive", parent = "classification")` and `r index('unsupervised learning')`.
For any supervised learning task, the goal is to build a `r index('model')` that captures the relationship between the features and target, usually by `r index('training', "model training")` the model on existing data so that it can make predictions for new and previously unseen data.
A `r index('model', aside = TRUE)` is formally a mapping from a feature vector to a prediction.
A prediction can take many forms depending on the task; for example, in classification this can be a predicted label, or a set of predicted probabilities or scores.
Models are induced by passing `r index('training data')` to machine learning algorithms, such as `r index('decision trees', "decision tree")`, `r index('support vector machines', "support vector machine")`, `r index('neural networks', "neural network")`, and many more.
Machine learning algorithms are called learners[Learners]{.aside} in `mlr3` (@sec-learners) as, given data, they learn models.
Each learner has a parameterized space that potential models are drawn from and during the training process, these parameters are fitted to best match the data.
For example, the parameters could be the coefficients used for individual features when training a linear regression model.
During training, most machine learning algorithms are 'fitted'/'trained'\index{model training}\index{fitting|see{model training}} by optimizing a loss-function that quantifies the mismatch between ground truth target values in the training data and the predictions of the model.
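
As a sketch of what optimizing a loss function can look like, one common formulation is empirical risk minimization. The notation is an assumption introduced here for illustration: $f_\theta$ denotes a model with parameters $\theta$, $(x_i, y_i)$ for $i = 1, \dots, n$ are the training observations, and $L$ is the loss function.

$$
\hat{\theta} = \underset{\theta}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, f_{\theta}(x_i)\right)
$$

For example, choosing the squared error $L(y, \hat{y}) = (y - \hat{y})^2$ for a linear model gives ordinary least squares. Not every learner fits this template exactly, so it should be read as a guiding picture rather than a universal definition.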
For a model to be most useful, it should generalize beyond the training data to make 'good' predictions (@sec-predicting) on new and previously 'unseen' (by the model) data.
The simplest way to test this is to split data into `r index('training data')` and `r index('test data')`[Train/Test Data]{.aside} -- where the model is trained on the training data and then the separate test data is used to evaluate models in an unbiased way by assessing to what extent the model has learned the true relationships that underlie the data (@sec-performance).
This evaluation procedure estimates a model's `r index('generalization error', aside = TRUE)`, i.e., how well we expect the model to perform in general.
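
Written out under assumed notation (introduced here for illustration): for a trained model $\hat{f}$, a loss function $L$, and $m$ test observations $(x_j, y_j)$ that were not used for training, a simple estimate of the generalization error is the average loss on the test data.

$$
\widehat{GE} = \frac{1}{m} \sum_{j=1}^{m} L\left(y_j, \hat{f}(x_j)\right)
$$

With the absolute error $L(y, \hat{y}) = |y - \hat{y}|$, this estimate is the mean absolute error on the test set.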
There are many ways to evaluate models and to split data for estimating generalization error (@sec-resampling).
This brief overview of ML provides the basic knowledge required to use `mlr3` and is summarized in @fig-ml-abstraction-basics.
In the rest of this book, we will provide introductions to methodology when relevant.
For texts about ML, including detailed methodology and underpinnings of different algorithms, we recommend @hastie2001, @james_introduction_2014, and @bishop_2006.
In the next few sections we will look at the building blocks of `mlr3` using regression as an example; we will then consider how to extend this to classification in @sec-classif.
```{r data_and_basic_modeling-001, echo=FALSE}
#| label: fig-ml-abstraction-basics
#| fig-cap: "General overview of the machine learning process."
#| fig-alt: "A flowchart starting with the 'Data D' with two arrows to 'Dtrain' and 'Dtest'. 'Dtrain' has an arrow to 'Learner', which has an arrow to 'Model'. 'Dtest' has an arrow (labeled with 'Features') to 'Model' and an arrow (labeled with 'Labels') to 'Measure'. The 'Model' box has an arrow to 'Prediction', which has an arrow to 'Measure', which has an arrow to 'Performance'. The whole flowchart has curly brackets next to it that says 'Repeat = Resampling'."
include_multi_graphics("mlr3book_figures-1")
```
## Tasks {#sec-tasks}
`r index('Tasks')` are objects that contain the (usually tabular) data and additional metadata that define a machine learning problem.
The `r index('metadata')` contain, for example, the name of the target feature for supervised machine learning problems.
This information is extracted automatically when required, so the user does not have to specify the prediction target every time a model is trained.
### Constructing Tasks {#sec-tasks-built-in}
`mlr3` includes a few predefined machine learning tasks in the `r ref("mlr_tasks", aside = TRUE)` `Dictionary`.
```{r data_and_basic_modeling-002}
mlr_tasks
```
To get a task from the dictionary, use the `r ref("tsk()", aside = TRUE)` function and assign the return value to a new variable.
Below we retrieve `tsk("mtcars")`, which uses the `r ref("datasets::mtcars")` dataset:
```{r data_and_basic_modeling-003}
tsk_mtcars = tsk("mtcars")
tsk_mtcars
```
Running `tsk()` without any arguments will list all the tasks in the dictionary; this also works for all other sugar constructors that you will encounter throughout the book.
:::{.callout-tip}
## Help Pages
Usually in R, the help pages of functions can be queried with `?`.
The same is true of R6 classes, so if you want to find the help page of the `mtcars` task you could use `?mlr_tasks_mtcars`.
We have also added a `$help()` method to many of our classes, which allows you to access the help page from any instance of that class, for example: `tsk("mtcars")$help()`.
:::
To create your own regression task, you will need to construct a new instance of `r ref("TaskRegr", aside = TRUE)`.
The simplest way to do this is with the function `r ref("as_task_regr()", aside = TRUE)` to convert a `data.frame` type object to a regression task, specifying the target feature by passing this to the `target` argument.
For this example, we will ignore that `mtcars` is already available as a predefined task in `mlr3`.
In the code below we load the `datasets::mtcars` dataset, subset the data to only include columns `"mpg"`, `"cyl"`, `"disp"`, print the modified data's properties, and then set up a regression task called `"cars"` (`id = "cars"`) in which we will try to predict miles per gallon (`target = "mpg"`) from the number of cylinders (`"cyl"`) and displacement (`"disp"`):
```{r data_and_basic_modeling-004}
data("mtcars", package = "datasets")
mtcars_subset = subset(mtcars, select = c("mpg", "cyl", "disp"))
str(mtcars_subset)
tsk_mtcars = as_task_regr(mtcars_subset, target = "mpg", id = "cars")
```
The data can be in any tabular format, e.g. a `data.frame()`, `data.table()`, or `tibble()`.
The `target` argument specifies the prediction target column.
The `id` argument is optional and specifies an identifier for the task that is used in plots and summaries; if omitted, the variable name of the data will be used as the `id`.
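
The effect of omitting `id` can be checked directly. The sketch below rebuilds the `mtcars_subset` data frame so that it runs on its own; `tsk_default` and `tsk_named` are illustrative variable names.

```r
library(mlr3)

mtcars_subset = subset(datasets::mtcars, select = c("mpg", "cyl", "disp"))

# no id given: the name of the data variable is used
tsk_default = as_task_regr(mtcars_subset, target = "mpg")
tsk_default$id

# explicit id
tsk_named = as_task_regr(mtcars_subset, target = "mpg", id = "cars")
tsk_named$id
```

Setting an explicit `id` is mainly useful when the variable name is uninformative, since the identifier shows up in printed summaries and plots.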
:::{.callout-tip}
## UTF8 Column Names
As many machine learning models do not work properly with arbitrary UTF8 names, `mlr3` defaults to throwing an error if any of the column names passed to `r ref("as_task_regr()")` (and other task constructors) contain a non-ASCII character or do not comply with R's variable naming scheme.
Therefore, we recommend converting names with `r ref("make.names()")` if possible.
You can bypass this check by setting `options(mlr3.allow_utf8_names = TRUE)` (but do not be surprised if errors occur later, especially when passing objects to other packages).
:::
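
As a small sketch of the renaming step recommended in the tip above, `make.names()` from base R turns arbitrary strings into syntactically valid names. The data frame `df` and its column names are made up for illustration.

```r
# a data frame whose column names are not valid R variable names
df = data.frame(
  "miles per gallon" = c(21, 22.8),
  "2wd" = c(1, 0),
  cyl = c(6, 4),
  check.names = FALSE  # keep the problematic names as given
)
names(df)

# replace invalid characters and prefix names that start with a digit
names(df) = make.names(names(df), unique = TRUE)
names(df)
```

After renaming, the data frame can be passed to a task constructor in the usual way.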
Printing a task provides a summary. In this case, we can see that the task has `r tsk_mtcars$nrow` observations and `r tsk_mtcars$ncol` columns (32 x 3), of which `mpg` is the target, that there are no special properties (`Properties: -`), and that there are `r length(tsk_mtcars$feature_names)` features stored in double-precision floating point format.
```{r data_and_basic_modeling-005}
tsk_mtcars
```
We can plot the task using the `r mlr3viz` package, which gives a graphical summary of the distribution of the target and feature values:
```{r data_and_basic_modeling-006, message=FALSE}
#| label: fig-mtcars
#| fig-cap: "Overview of the mtcars dataset."
#| fig-alt: Diagram shows six plots, three are line plots showing the relationship between continuous variables, and three are scatter plots showing relationships between other variables.
library(mlr3viz)
autoplot(tsk_mtcars, type = "pairs")
```
### Retrieving Data {#sec-retrieve-data}
We have looked at how to create tasks to store data and metadata; now we will look at how to retrieve the stored data.
Various fields can be used to retrieve metadata about a task. The dimensions, for example, can be retrieved using `$nrow` and `$ncol`:
```{r data_and_basic_modeling-007}
c(tsk_mtcars$nrow, tsk_mtcars$ncol)
```
The names of the feature and target columns are stored in the `$feature_names` and `$target_names` fields, respectively.
```{r data_and_basic_modeling-008}
c(Features = tsk_mtcars$feature_names,
Target = tsk_mtcars$target_names)
```
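
Alongside the names, a task also records how each feature is stored. The following sketch uses the `$feature_types` field on the predefined `mtcars` task; the variable name `task` is an illustrative choice.

```r
library(mlr3)

task = tsk("mtcars")

# one row per feature: column name (id) and storage type (type)
task$feature_types

# all mtcars features are stored as numeric (double) values
unique(task$feature_types$type)
```

This is a convenient check before training, as learners differ in the feature types they accept.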
The columns of a task have unique `character`-valued names and the rows are identified by unique natural numbers, called row IDs.
They can be accessed through the `$row_ids` field:
```{r data_and_basic_modeling-009}
head(tsk_mtcars$row_ids)
```
Row IDs are not used as features when training or predicting but are metadata that allow access to individual observations.
Note that row IDs are not the same as row numbers.
This is best demonstrated by example. Below we create a regression task from random data and print the original row IDs, which correspond to row numbers 1-5; we then filter three rows (we will return to this method just below) and print the new row IDs, which no longer correspond to the row numbers.
```{r data_and_basic_modeling-010}
task = as_task_regr(data.frame(x = runif(5), y = runif(5)),
target = "y")
task$row_ids
task$filter(c(4, 1, 3))
task$row_ids
```
This design decision allows tasks and learners to transparently operate on real database management systems, where primary keys are required to be unique, but not necessarily consecutive.
See @sec-backends for more information on using databases as data backends for tasks.
The data contained in a task can be accessed through `$data()`, which returns a `r ref("data.table")` object.
This method has optional `rows` and `cols` arguments to specify subsets of the data to retrieve.
```{r data_and_basic_modeling-011}
# retrieve all data
tsk_mtcars$data()
# retrieve data for rows with IDs 1, 5, and 10 and all feature columns
tsk_mtcars$data(rows = c(1, 5, 10), cols = tsk_mtcars$feature_names)
```
:::{.callout-tip}
## Accessing Rows by Number
You can work with row numbers instead of row IDs by using the `$row_ids` field to extract the row ID corresponding to a given row number:
```{r data_and_basic_modeling-012, eval = FALSE}
# select the 2nd row of the task by extracting the second row_id:
tsk_mtcars$data(rows = tsk_mtcars$row_ids[2])
```
:::
You can always use 'standard' R methods to extract summary data from a task, for example, to summarize the underlying data:
```{r data_and_basic_modeling-013}
summary(as.data.table(tsk_mtcars))
```
### Task Mutators {#sec-tasks-mutators}
After a task has been created, you may want to perform operations on the task such as filtering down to subsets of rows and columns, which is often useful for manually creating train and test splits or to fit models on a subset of given features.
Above we saw how to access subsets of the underlying dataset using `$data()`; however, this will not change the underlying task.
Therefore, we provide `r index('mutators', aside = TRUE)`, which modify the given `Task` in place, which can be seen in the examples below.
Subsetting by features (columns) is possible with `$select()` with the desired feature names passed as a character vector and subsetting by observations (rows) is performed with `$filter()` by passing the row IDs as a numeric vector. `r index(NULL, "$select()", parent = "Task", code = TRUE)` `r index(NULL, "$filter()", parent = "Task", code = TRUE)`
```{r data_and_basic_modeling-014}
tsk_mtcars_small = tsk("mtcars") # initialize with the full task
tsk_mtcars_small$select("cyl") # keep only one feature
tsk_mtcars_small$filter(2:3) # keep only these rows
tsk_mtcars_small$data()
```
As `R6` uses reference semantics (@sec-r6), you need to use `$clone()` if you want to modify a task while keeping the original object intact.
```{r data_and_basic_modeling-015}
# the wrong way
tsk_mtcars = tsk("mtcars")
tsk_mtcars_wrong = tsk_mtcars
tsk_mtcars_wrong$filter(1:2)
# original data affected
tsk_mtcars$head()
# the right way
tsk_mtcars = tsk("mtcars")
tsk_mtcars_right = tsk_mtcars$clone()
tsk_mtcars_right$filter(1:2)
# original data unaffected
tsk_mtcars$head()
```
To add extra rows and columns to a task, you can use `$rbind()` and `$cbind()` respectively: `r index(NULL, "$cbind()", parent = "Task", code = TRUE)` `r index(NULL, "$rbind()", parent = "Task", code = TRUE)`
```{r data_and_basic_modeling-016}
tsk_mtcars_small$cbind( # add another column
data.frame(disp = c(150, 160))
)
tsk_mtcars_small$rbind( # add another row
data.frame(mpg = 23, cyl = 5, disp = 170)
)
tsk_mtcars_small$data()
```
## Learners {#sec-learners}
Objects of class `r ref("Learner", aside = TRUE)` provide a unified interface to many popular machine learning algorithms in R.
The `r ref("mlr_learners", aside = TRUE)` dictionary contains all the learners available in `mlr3`.
We will discuss the available learners in @sec-lrns-add; for now, we will just use a regression tree learner as an example to discuss the `Learner` interface.
As with tasks, you can access learners from the dictionary with a single sugar function, in this case, `r ref("lrn()", aside = TRUE)`.
```{r data_and_basic_modeling-017}
lrn("regr.rpart")
```
All `Learner` objects include the following metadata, which can be seen in the output above:
* `$feature_types`: the type of features the learner can handle.
* `$packages`: the packages required to be installed to use the learner.
* `$properties`: the properties of the learner. For example, the "missings" property means a model can handle missing data, and "importance" means it can compute the relative importance of each feature.
* `$predict_types`: the types of prediction that the model can make (@sec-predicting).
* `$param_set`: the set of available hyperparameters (@sec-param-set).
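
Each of these fields can also be queried directly, which is useful for checking a learner's capabilities in code. The sketch below is a minimal illustration using the regression tree learner from above.

```r
library(mlr3)

lrn_rpart = lrn("regr.rpart")

lrn_rpart$feature_types  # feature types the learner accepts
lrn_rpart$packages       # packages needed to train and predict
lrn_rpart$properties     # e.g. whether missing values are supported
lrn_rpart$predict_types  # prediction types the learner can return

# check a single capability programmatically
"missings" %in% lrn_rpart$properties
```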
To run a machine learning experiment, learners pass through two stages (@fig-basics-learner):
* `r index('Training', "model training", aside = TRUE)`: A training `Task` is passed to the learner's `r index("$train()", parent = "Learner", code = TRUE)` method which trains and stores a `r index('model')`, i.e., the learned relationship of the features to the target.
* `r index('Predicting', "model predicting", aside = TRUE)`: New data, potentially a different partition of the original dataset, is passed to the `r index("$predict()", parent = "Learner", code = TRUE)` method of the trained learner to predict the target values.
```{r data_and_basic_modeling-018, echo=FALSE, out.width = "70%"}
#| label: fig-basics-learner
#| fig-cap: Overview of the different stages of a learner. Top -- data (features and a target) are passed to an (untrained) learner. Bottom -- new data are passed to the trained model which makes predictions for the 'missing' target column.
#| fig-alt: Diagram shows two boxes, the first is labeled "$train() on Training Data" and shows data pointing at the Learner. The second is labeled "$predict() on New Data to Get Predictions" and shows different data pointing at a learner which now includes a "$model". An arrow then shows predictions being made from the "Learner" in the second box.
include_multi_graphics("mlr3book_figures-2")
```
### Training {#sec-training}
In the simplest use case, models are trained by passing a task to a learner with the `r index("$train()", parent = "Learner", aside = TRUE, code = TRUE)` method:
```{r data_and_basic_modeling-019}
# load mtcars task
tsk_mtcars = tsk("mtcars")
# load a regression tree
lrn_rpart = lrn("regr.rpart")
# pass the task to the learner via $train()
lrn_rpart$train(tsk_mtcars)
```
After training, the fitted model is stored in the `r index("$model", parent = "Learner", aside = TRUE, code = TRUE)` field for future inspection and prediction:
```{r data_and_basic_modeling-020}
# inspect the trained model
lrn_rpart$model
```
We see that the regression tree has identified features in the task that are predictive of the target (`mpg`) and used them to partition observations.
The textual representation of the model depends on the type of learner.
For more information on any model see the learner help page, which can be accessed in the same way as tasks with the `$help()` method, e.g., `lrn_rpart$help()`.
#### Partitioning Data {#sec-basics-partition}
When assessing the quality of a model's predictions, you will likely want to partition your dataset to get a fair and unbiased estimate of a model's generalization error.
In @sec-performance we will look at resampling and benchmark experiments, which will go into more detail about performance estimation, but for now, we will just discuss the simplest method of splitting data using the `r ref("partition()", aside = TRUE)` function.
This function creates index sets that randomly split the given task into two disjoint sets: a training set\index{training data} (67% of the total data by default) and a test set\index{test data} (the remaining 33% of the total data not in the training set).
```{r data_and_basic_modeling-021}
splits = partition(tsk_mtcars)
splits
```
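
The split proportion can be changed with the `ratio` argument of `partition()`. The sketch below is an illustration rather than part of the running example: the 80/20 split and the seed value are arbitrary choices, and the two checks confirm that the index sets are disjoint and together cover the whole task.

```r
library(mlr3)

task = tsk("mtcars")

set.seed(42)  # partition() samples rows randomly
splits = partition(task, ratio = 0.8)

# the two index sets do not overlap ...
length(intersect(splits$train, splits$test))
# ... and together they cover every row of the task
setequal(c(splits$train, splits$test), task$row_ids)
```

Setting a seed beforehand makes the split reproducible across runs.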
When training, we will tell the learner to only use the training data by passing the row IDs from `partition()` to the `row_ids` argument of `$train()`:
```{r data_and_basic_modeling-022}
lrn_rpart$train(tsk_mtcars, row_ids = splits$train)
```
Now we can use our trained learner to make predictions on new data.
### Predicting {#sec-predicting}
Predicting from trained models is as simple as passing your data as a `Task` to the `r index("$predict()", parent = "Learner", aside = TRUE, code = TRUE)` method of the trained `Learner`.
Carrying straight on from our last example, we will call the `$predict()` method of our trained learner and again will use the `row_ids` argument, but this time to pass the IDs of our `r index("test set", "test data")`:
```{r data_and_basic_modeling-023}
prediction = lrn_rpart$predict(tsk_mtcars, row_ids = splits$test)
```
The `$predict()` method returns an object inheriting from `r ref("Prediction", aside = TRUE)`, in this case `r ref("PredictionRegr", aside = TRUE)` as this is a regression task.
```{r data_and_basic_modeling-024}
prediction
```
The `row_ids` column corresponds to the row IDs of the predicted observations.
The `truth` column contains the ground truth data if available, which the object extracts from the task, in this case: `tsk_mtcars$truth(splits$test)`.
Finally, the `response` column contains the values predicted by the model.
The `Prediction` object can easily be converted into a `data.table` or `data.frame` using `as.data.table()`/`as.data.frame()` respectively.
All data in the above columns can be accessed directly, for example, to get the first two predicted responses:
```{r data_and_basic_modeling-025}
prediction$response[1:2]
```
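
The conversion to a `data.table` mentioned above makes it easy to work with predictions using ordinary table operations. The following self-contained sketch repeats the train and predict steps with illustrative variable names (`task`, `learner`, `pred_dt`) and an arbitrary seed, and then computes the mean absolute error by hand.

```r
library(mlr3)
library(data.table)

task = tsk("mtcars")
set.seed(1)
splits = partition(task)

learner = lrn("regr.rpart")
learner$train(task, row_ids = splits$train)
prediction = learner$predict(task, row_ids = splits$test)

# one row per predicted observation: row_ids, truth, response
pred_dt = as.data.table(prediction)
head(pred_dt)

# mean absolute error on the test set, computed by hand
pred_dt[, mean(abs(truth - response))]
```

In practice you would normally rely on the evaluation measures discussed in @sec-performance, but the manual calculation shows what the columns contain.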
Similarly to plotting `Task`s, `r mlr3viz` provides an `r ref("ggplot2::autoplot()")` method for `Prediction` objects.
```{r data_and_basic_modeling-026, message = FALSE, warning = FALSE, out.width = "70%"}
#| label: fig-basics-truthresponse
#| fig-cap: "Comparing predicted and ground truth values for the mtcars dataset."
#| fig-alt: "A scatter plot with predicted values on one axis and ground truth values on the other. A trend line is fit to show that in general there is good agreement between predicted and ground truth values."
library(mlr3viz)
prediction = lrn_rpart$predict(tsk_mtcars, splits$test)
autoplot(prediction)
```
In the examples above we made predictions by passing a task to `$predict()`.
However, if you would rather pass a `data.frame` type object directly, then you can use `r index("$predict_newdata()", parent = "Learner", code = TRUE)`.
Note, the `truth` column values are all `NA`, as we did not include a target column in the generated data.
```{r data_and_basic_modeling-027}
mtcars_new = data.table(cyl = c(5, 6), disp = c(100, 120),
hp = c(100, 150), drat = c(4, 3.9), wt = c(3.8, 4.1),
qsec = c(18, 19.5), vs = c(1, 0), am = c(1, 1),
gear = c(6, 4), carb = c(3, 5))
prediction = lrn_rpart$predict_newdata(mtcars_new)
prediction
```
#### Changing the Prediction Type {.unnumbered .unlisted}
While predicting a single numeric quantity is the most common prediction type in regression, it is not the only prediction type.
Several regression models can also predict standard errors.
To predict these, the `r index('$predict_type', parent = "Learner", code = TRUE)` field of a `r ref("LearnerRegr")` must be changed from `"response"` (the default) to `"se"` before training.
The `"rpart"` learner we used above does not support predicting standard errors, so in the example below we will use a linear regression model (`lrn("regr.lm")`).
```{r data_and_basic_modeling-028}
library(mlr3learners)
lrn_lm = lrn("regr.lm", predict_type = "se")
lrn_lm$train(tsk_mtcars, splits$train)
lrn_lm$predict(tsk_mtcars, splits$test)
```
Now the output includes an `se` column as desired.
In @sec-basics-classif-learner we will see prediction types playing an even bigger role in the context of classification.
Having covered the unified train/predict interface, we can now look at how to use hyperparameters to configure these methods for individual algorithms.
### Hyperparameters {#sec-param-set}
`Learner`s encapsulate a machine learning algorithm and its `r index('hyperparameters')`, which affect *how* the algorithm is run and can be set by the user.
Hyperparameters may affect how a model is trained or how it makes predictions, and deciding how to set hyperparameters can require expert knowledge.
Hyperparameters can be optimized automatically (@sec-optimization), but in this chapter we will focus on how to set them manually.
#### Paradox and Parameter Sets
We will continue our running example with a regression tree learner.
To access the hyperparameters in the decision tree, we use `r index("$param_set", parent = "Learner", aside = TRUE, code = TRUE)`:
```{r data_and_basic_modeling-029}
lrn_rpart$param_set
```
The output above is a `r ref("ParamSet", aside = TRUE)` object, supplied by the `r ref_pkg("paradox")` package.
A more detailed introduction of the `paradox` package is provided in @sec-paradox.
These objects provide information on hyperparameters including their name (`id`), data type (`class`), technically valid ranges for hyperparameter values (`lower`, `upper`), the number of levels possible if the data type is categorical (`nlevels`), the default value from the underlying package (`default`), and finally the set value (`value`).
The second column references classes defined in `r ref_pkg("paradox")` that determine the class of the parameter and the possible values it can take.
@tbl-parameters-classes lists the possible hyperparameter types, all of which inherit from `r ref("Domain")`.
| Hyperparameter Class | Hyperparameter Type |
|----------------------|-----------------------|
| `ParamDbl` | Real-valued (numeric) |
| `ParamInt` | Integer |
| `ParamFct` | Categorical (factor) |
| `ParamLgl` | Logical / Boolean |
| `ParamUty` | Untyped |
: Hyperparameter classes and the type of hyperparameter they represent. {#tbl-parameters-classes}
In our decision tree example, we can infer from the `ParamSet` output that:
* `cp` must be a "double" (`ParamDbl`) taking values between `0` (`lower`) and `1` (`upper`) with a default of 0.01 (`default`).
* `keep_model` must be a "logical" (`ParamLgl`) taking values `TRUE` or `FALSE` with default `FALSE`.
* `xval` must be an "integer" (`ParamInt`) taking values between `0` and `Inf` with a default of `10` and has a set value of `0`.
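
The same information can be read from the `ParamSet` programmatically instead of from the printed output. The sketch below uses the `$class`, `$lower`, `$upper` and `$default` fields for the `cp` hyperparameter; the variable name `param_set` is an illustrative choice.

```r
library(mlr3)
library(data.table)

param_set = lrn("regr.rpart")$param_set

# class, bounds and package default of the complexity parameter cp
param_set$class[["cp"]]
param_set$lower[["cp"]]
param_set$upper[["cp"]]
param_set$default[["cp"]]

# the same information for all hyperparameters as a table
as.data.table(param_set)[, c("id", "class", "lower", "upper", "nlevels")]
```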
In rare cases (we try to minimize it as much as possible), hyperparameters are initialized to values which deviate from the default in the underlying package.
When this happens, the reason will always be given in the learner help page.
In the case of `lrn("regr.rpart")`, the `xval` hyperparameter is initialized to `0` because `xval` controls internal cross-validations and if a user accidentally leaves this at the default `10`, model training can take an unnecessarily long time.
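
The difference between the package default and the value set at construction can be seen by comparing two fields of the parameter set. This is a minimal sketch using a freshly constructed learner.

```r
library(mlr3)

lrn_rpart = lrn("regr.rpart")

# value set by mlr3 when the learner is constructed
lrn_rpart$param_set$values$xval
# default of the underlying rpart package
lrn_rpart$param_set$default$xval
```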
#### Getting and Setting Hyperparameter Values
Now that we have looked at how hyperparameter sets are stored, we can think about getting and setting them.
Returning to our decision tree, say we are interested in growing a tree with depth `1`, also known as a "decision stump", where data is split only once into two terminal nodes.
From the parameter set output, we know that the `maxdepth` parameter has a default of `30` and that it takes integer values.
There are a few different ways we could change this hyperparameter.
The simplest way is during construction of the learner by passing the hyperparameter name and new value to `lrn()`:
```{r data_and_basic_modeling-030}
lrn_rpart = lrn("regr.rpart", maxdepth = 1)
```
We can get a list of non-default hyperparameters (i.e., those that have been set) by using `$param_set$values`:
```{r data_and_basic_modeling-031}
lrn_rpart$param_set$values
```
Now we can see that `maxdepth = 1` (as we discussed above, `xval = 0` is changed during construction) and the learned regression tree reflects this:
```{r data_and_basic_modeling-032}
lrn_rpart$train(tsk("mtcars"))$model
```
The `$values` field simply returns a `list` of set hyperparameters, so another way to update hyperparameters is by updating an element in the list:
```{r data_and_basic_modeling-033}
lrn_rpart$param_set$values$maxdepth = 2
lrn_rpart$param_set$values
# now with depth 2
lrn_rpart$train(tsk("mtcars"))$model
```
To set multiple values at once we recommend either setting these during construction or using `r index("$set_values()", parent = "Learner", aside = TRUE, code = TRUE)`, which updates the given hyperparameters (argument names) with the respective values.
```{r data_and_basic_modeling-034}
lrn_rpart = lrn("regr.rpart", maxdepth = 3, xval = 1)
lrn_rpart$param_set$values
# or with set_values
lrn_rpart$param_set$set_values(xval = 2, cp = 0.5)
lrn_rpart$param_set$values
```
In addition to `$set_values()`, `r index("$configure()", parent = "Learner", aside = TRUE, code = TRUE)` allows setting hyperparameters and learner fields simultaneously.
Arguments matching parameter names are set as hyperparameters, while remaining arguments are set as learner fields such as `$timeout` (@sec-encapsulation):
```{r data_and_basic_modeling-034a}
lrn_rpart = lrn("regr.rpart")
lrn_rpart$configure(maxdepth = 3, timeout = c(train = 10, predict = Inf))
lrn_rpart$timeout
```
:::{.callout-warning}
## Setting Hyperparameters Using a `list`
As `lrn_rpart$param_set$values` returns a `list`, some users may be tempted to set hyperparameters by passing a new `list` to `$values` -- this would work but **we do not recommend it**.
This is because passing a `list` will wipe any existing hyperparameter values if they are not included in the list.
For example:
```{r data_and_basic_modeling-035}
# set xval and cp
lrn_rpart_params = lrn("regr.rpart", xval = 0, cp = 1)
# passing maxdepth through a list, removing all other values
lrn_rpart_params$param_set$values = list(maxdepth = 1)
# we have removed xval and cp by mistake
lrn_rpart_params$param_set$values
# now with set_values
lrn_rpart_params = lrn("regr.rpart", xval = 0, cp = 1)
lrn_rpart_params$param_set$set_values(maxdepth = 1)
lrn_rpart_params$param_set$values
```
:::
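The difference between overwriting and updating can be sketched with plain R lists. This is only an analogy: the `values` list below is a hypothetical stand-in for `$param_set$values`, and base R's `modifyList()` is used to mimic an update. It is not a description of how `$set_values()` is implemented.

```r
# hypothetical stand-in for lrn_rpart_params$param_set$values
values = list(xval = 0, cp = 1)

# assigning a new list discards every entry that is not in the new list
overwritten = list(maxdepth = 1)
names(overwritten)

# merging keeps the existing entries and adds or updates the named ones
updated = modifyList(values, list(maxdepth = 1))
names(updated)
```

In the first case only `maxdepth` survives, whereas the merged list still contains `xval` and `cp`.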
Whichever method you choose, all have safety checks to ensure your new values fall within the allowed parameter range:
```{r data_and_basic_modeling-036, error=TRUE}
lrn("regr.rpart", cp = 2, maxdepth = 2)
```
#### Hyperparameter Dependencies
{{< include ../../common/_optional.qmd >}}
More complex hyperparameter spaces may include dependencies, which occur when setting a hyperparameter is conditional on the value of another hyperparameter; this is most important in the context of model tuning (@sec-optimization).
One such example is a `r index('support vector machine')` (`lrn("regr.svm")`).
The field `r index("$deps", parent = "ParamSet", code = TRUE)` returns a `data.table`, which lists the hyperparameter dependencies in the `Learner`.
For example we can see that the `cost` (`id`-column) parameter is dependent on the `type` (`on`-column) parameter.
```{r data_and_basic_modeling-037}
lrn("regr.svm")$param_set$deps
```
The `cond` column tells us what the condition is, which will either mean that `id` can be set if `on` equals a single value (`r ref("CondEqual")`) or any value in the listed set (`r ref("CondAnyOf")`).
```{r data_and_basic_modeling-038}
lrn("regr.svm")$param_set$deps[[1, "cond"]]
lrn("regr.svm")$param_set$deps[[3, "cond"]]
```
This tells us that the parameter `cost` should only be set if the `type` parameter is one of `"eps-regression"` or `"nu-regression"`, and `degree` should only be set if `kernel` is equal to `"polynomial"`.
The `Learner` will error if dependent hyperparameters are set when their conditions are not met:
```{r data_and_basic_modeling-039, error=TRUE}
# error as kernel is not polynomial
lrn("regr.svm", kernel = "linear", degree = 1)
# works because kernel is polynomial
lrn("regr.svm", kernel = "polynomial", degree = 1)
```
### Baseline Learners {#sec-basics-featureless}
Before we move on to learner evaluation, we will highlight an important class of learners.
These are extremely simple or 'weak' learners known as `r index('baselines', aside = TRUE)`.
Baselines are useful in model comparison (@sec-performance) and as fallback learners (@sec-encapsulation-fallback, @sec-fallback).
For regression, we have implemented the baseline `lrn("regr.featureless")`, which always predicts new values to be the mean (or median, if the `robust` hyperparameter is set to `TRUE`) of the target in the training data:
```{r data_and_basic_modeling-040}
# generate data
df = as_task_regr(data.frame(x = runif(1000), y = rnorm(1000, 2, 1)),
target = "y")
lrn("regr.featureless")$train(df, 1:995)$predict(df, 996:1000)
```
It is good practice to test all new models against a baseline, and also to include baselines in experiments with multiple other models.
In general, a model that does not outperform a baseline is a 'bad' model; on the other hand, a model is not necessarily 'good' just because it outperforms the baseline.
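To make the idea of a featureless baseline concrete, the following is a minimal base R sketch of what such a learner does conceptually. The variable names are illustrative assumptions and the code does not use `mlr3`: the 'model' is a single number computed from the training targets, and the features are ignored entirely.

```r
set.seed(1)
# simulated target, mirroring the example above
y = rnorm(1000, 2, 1)
train_ids = 1:995
test_ids = 996:1000

# the 'model' is one number computed from the training targets only
pred_mean = mean(y[train_ids])
pred_median = median(y[train_ids])  # analogous to setting robust = TRUE

# every test observation receives the same prediction
predictions = rep(pred_mean, length(test_ids))
predictions
```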
## Evaluation {#sec-eval}
Perhaps *the most* important step of the applied machine learning workflow is evaluating model performance.
Without this, we would have no way to know if our trained model makes very accurate predictions, is worse than randomly guessing, or somewhere in between.
We will continue with our decision tree example to establish whether the quality of our predictions is 'good'. First, we will rerun the above code so it is easier to follow along.
```{r data_and_basic_modeling-041}
lrn_rpart = lrn("regr.rpart")
tsk_mtcars = tsk("mtcars")
splits = partition(tsk_mtcars)
lrn_rpart$train(tsk_mtcars, splits$train)
prediction = lrn_rpart$predict(tsk_mtcars, splits$test)
```
### Measures
The quality of predictions is evaluated using measures that compare them to the ground truth data for supervised learning tasks.
Similarly to `Task`s and `Learner`s, the available measures in `mlr3` are stored in a dictionary called `r ref("mlr_measures", aside = TRUE)` and can be accessed with `r index("msr()", "msr()/msrs()", aside = TRUE, code = TRUE)`:
```{r data_and_basic_modeling-042}
as.data.table(msr())
```
All measures implemented in `mlr3` are defined primarily by three components: 1) the function that defines the measure; 2) whether a lower or higher value is considered 'good'; and 3) the range of possible values the measure can take.
As well as these defining elements, other metadata are important to consider when selecting and using a `Measure`, including if the measure has any special properties (e.g., requires training data), the type of predictions the measure can evaluate, and whether the measure has any 'control parameters'.
All this information is encapsulated in the `r ref("Measure", aside = TRUE)` object.
By example, let us consider the `r index('mean absolute error')` (MAE):
```{r data_and_basic_modeling-043}
measure = msr("regr.mae")
measure
```
This measure computes the absolute difference ('error') between true and predicted values, $f(y, \hat{y}) = | y - \hat{y} |$, averaged over the test set.
Lower values are considered better (`Minimize: TRUE`), which is intuitive as we would like the true values, $y$, to be identical (or as close as possible) in value to the predicted values, $\hat{y}$.
We can see that the range of possible values the measure can take is from $0$ to $\infty$ (`Range: [0, Inf]`), it has no special properties (`Properties: -`), it evaluates `response` type predictions for regression models (`Predict type: response`), and it has no control parameters (`Parameters: list()`).
Now let us see how to use this measure for scoring our predictions.
### Scoring Predictions
Usually, supervised learning measures compare the difference between predicted values and the ground truth.
`mlr3` simplifies the process of bringing these quantities together by storing the predictions and true outcomes in the `r ref("Prediction", index = TRUE)` object as we have already seen.
```{r data_and_basic_modeling-044}
prediction
```
To calculate model performance, we simply call the `r index("$score()", parent = "Prediction", aside = TRUE, code = TRUE)` method of a `Prediction` object and pass as a single argument the measure that we want to compute:
```{r data_and_basic_modeling-045}
prediction$score(measure)
```
Note that all task types have default measures that are used if the argument to `$score()` is omitted; for regression this is the mean squared error (`msr("regr.mse")`), which is the squared difference between true and predicted values, $f(y, \hat{y}) = (y - \hat{y})^2$, averaged over the test set.
It is possible to calculate multiple measures at the same time by passing multiple measures to `$score()`.
For example, below we compute performance for mean squared error (`"regr.mse"`) and mean absolute error (`"regr.mae"`) -- note we use `r index("msrs()", "msr()/msrs()", aside = TRUE, code = TRUE)` to load multiple measures at once.
```{r data_and_basic_modeling-046}
measures = msrs(c("regr.mse", "regr.mae"))
prediction$score(measures)
```
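Both measures are simple enough to compute by hand, which can be a useful sanity check. The sketch below uses made-up toy vectors rather than the predictions above. On a regression `Prediction` object the corresponding values are stored in the `$truth` and `$response` fields.

```r
# toy values; on a Prediction object these would be $truth and $response
truth = c(3, 5, 2)
response = c(2.5, 5, 4)

# mean absolute error: average of |y - y_hat|
mae = mean(abs(truth - response))
# mean squared error: average of (y - y_hat)^2
mse = mean((truth - response)^2)

c(regr.mse = mse, regr.mae = mae)
```

Because the errors are squared before averaging, the single large error (the third observation) contributes far more to the MSE than to the MAE.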
### Technical Measures {#sec-basics-measures-tech}
{{< include ../../common/_optional.qmd >}}
`mlr3` also provides measures that do not quantify the quality of the predictions of a model, but instead provide 'meta'-information about the model.
These include:
* `msr("time_train")` -- The time taken to train a model.
* `msr("time_predict")` -- The time taken for the model to make predictions.
* `msr("time_both")` -- The total time taken to train the model and then make predictions.
* `msr("selected_features")` -- The number of features selected by a model, which can only be used if the model has the "selected_features" property.
For example, we could score our decision tree to see how many seconds it took to train the model and make predictions:
```{r data_and_basic_modeling-047}
measures = msrs(c("time_train", "time_predict", "time_both"))
prediction$score(measures, learner = lrn_rpart)
```
Notice a few key properties of these measures:
1) `time_both` is simply the sum of `time_train` and `time_predict`.
2) We had to pass `learner = lrn_rpart` to `$score()` as these measures have the `requires_learner` property:
```{r data_and_basic_modeling-048}
msr("time_train")$properties
```
3) These can be used after model training and predicting because we automatically store model run times whenever `$train()` and `$predict()` are called, so the measures above are equivalent to:
```{r data_and_basic_modeling-049}
c(lrn_rpart$timings, both = sum(lrn_rpart$timings))
```
The `selected_features` measure calculates how many features were used in the fitted model.
```{r data_and_basic_modeling-050}
msr_sf = msr("selected_features")
msr_sf
```
We can see that this measure contains `r index('control parameters', aside = TRUE)` (`Parameters: normalize=FALSE`), which control how the measure is computed.
As with hyperparameters these can be accessed with `r index("$param_set", parent = "Measure", code = TRUE)`:
```{r data_and_basic_modeling-051}
msr_sf = msr("selected_features")
msr_sf$param_set
```
The `normalize` parameter specifies whether the returned number of selected features should be normalized by the total number of features; this is useful if you are comparing this value across tasks with differing numbers of features.
We would change this parameter in the exact same way as we did with the learner above:
```{r data_and_basic_modeling-052}
msr_sf$param_set$values$normalize = TRUE
prediction$score(msr_sf, task = tsk_mtcars, learner = lrn_rpart)
```
Note that we passed the task and learner as the measure has the `requires_task` and `requires_learner` properties.
## Our First Regression Experiment {#sec-basics-regr-experiment}
We have now seen how to train a model, make predictions and score them.
What we have not yet attempted is to ascertain if our predictions are any 'good'.
So before we look at how the building blocks of `mlr3` extend to classification, we will take a brief pause to put together everything above in a short experiment to assess the quality of our predictions.
We will do this by comparing the performance of a featureless regression learner to a decision tree with changed hyperparameters.
```{r data_and_basic_modeling-053}
library(mlr3)
set.seed(349)
# load and partition our task
tsk_mtcars = tsk("mtcars")
splits = partition(tsk_mtcars)
# load featureless learner
lrn_featureless = lrn("regr.featureless")
# load decision tree and set hyperparameters
lrn_rpart = lrn("regr.rpart", cp = 0.2, maxdepth = 5)
# load MSE and MAE measures
measures = msrs(c("regr.mse", "regr.mae"))
# train learners
lrn_featureless$train(tsk_mtcars, splits$train)
lrn_rpart$train(tsk_mtcars, splits$train)
# make and score predictions
lrn_featureless$predict(tsk_mtcars, splits$test)$score(measures)
lrn_rpart$predict(tsk_mtcars, splits$test)$score(measures)
```
Before starting the experiment we load the `mlr3` library and set a seed.
We loaded the `mtcars` task using `tsk()` and then split this using `partition` with the default 2/3 split.
Next, we loaded a featureless baseline learner (`"regr.featureless"`) with the `lrn()` function.
Then we loaded a decision tree (`lrn("regr.rpart")`) but changed the complexity parameter and maximum tree depth from their defaults.
We then used `msrs()` to load multiple measures at once, the mean squared error (MSE: `regr.mse`) and the mean absolute error (MAE: `regr.mae`).
With all objects loaded, we trained our models, ensuring we passed the same training data to both.
Finally, we made predictions from our trained models and scored these.
For both MSE and MAE, lower values are 'better' (`Minimize: TRUE`) and we can therefore conclude that our decision tree performs better than the featureless baseline.
In @sec-benchmarking we will see how to formalize comparison between models in a more efficient way using `r ref("benchmark()", index = TRUE)`.
Now that we have put everything together, you may notice that our learners and measures both have the `"regr."` prefix, which is a handy way of reminding us that we are working with a regression task and therefore must make use of learners and measures built for regression.
In the next section, we will extend the building blocks of `mlr3` to consider classification tasks, which make use of learners and measures with the `"classif."` prefix.
## Classification {#sec-classif}
`r index('Classification')` problems are ones in which a model predicts a discrete, categorical target, as opposed to a continuous, numeric quantity.
For example, predicting the species of penguin from its physical characteristics would be a classification problem as there is a defined set of species.
`mlr3` ensures that the interface for all tasks is as similar as possible (if not identical) and therefore we will not repeat any content from the previous section but will just focus on differences that make classification a unique machine learning problem.
We will first demonstrate the similarities between regression and classification by performing an experiment very similar to the one in @sec-basics-regr-experiment, using code that will now be familiar to you.
We will then move to differences in tasks, learners and predictions, before looking at `r index('thresholding')`, which is a method specific to classification.
### Our First Classification Experiment {#sec-basics-classif-experiment}
The interface for classification tasks, learners, and measures, is identical to the regression setting, except the underlying objects inherit from `r ref("TaskClassif", index = TRUE)`, `r ref("LearnerClassif", index = TRUE)`, and `r ref("MeasureClassif", index = TRUE)`, respectively.
We can therefore run a very similar experiment to the one above.
```{r data_and_basic_modeling-054}
library(mlr3)
set.seed(349)
# load and partition our task
tsk_penguins = tsk("penguins")
splits = partition(tsk_penguins)
# load featureless learner
lrn_featureless = lrn("classif.featureless")
# load decision tree and set hyperparameters
lrn_rpart = lrn("classif.rpart", cp = 0.2, maxdepth = 5)
# load accuracy measure
measure = msr("classif.acc")
# train learners
lrn_featureless$train(tsk_penguins, splits$train)
lrn_rpart$train(tsk_penguins, splits$train)
# make and score predictions
lrn_featureless$predict(tsk_penguins, splits$test)$score(measure)
lrn_rpart$predict(tsk_penguins, splits$test)$score(measure)
```
In this experiment, we loaded the predefined task `penguins`, which is based on the `r ref("palmerpenguins::penguins")` dataset, then partitioned the data into training and test splits.
We loaded the featureless classification baseline (using the default which always predicts the most common class in the training data, but which also has the option of predicting (uniformly or weighted) random response values) and a classification decision tree, then the accuracy measure (number of correct predictions divided by total number of predictions), trained our models and finally made and scored predictions.
Once again we can be happy with our predictions, which are vastly more accurate than the baseline.
Now that we have seen the similarities between classification and regression, we can turn to some key differences.
### TaskClassif
Classification tasks, objects inheriting from `r ref("TaskClassif", aside = TRUE)`, are very similar to regression tasks, except that the target variable is of type factor and will have a limited number of possible classes/categories that observations can fall into.
You can view the predefined classification tasks in `mlr3` by filtering the `mlr_tasks` dictionary:
```{r data_and_basic_modeling-055}
as.data.table(mlr_tasks)[task_type == "classif"]
```
You can create your own task with `r ref("as_task_classif", aside = TRUE)`.
```{r data_and_basic_modeling-056}
as_task_classif(palmerpenguins::penguins, target = "species")
```
There are two types of classification tasks supported in `mlr3`: binary classification\index{classification!binary}, in which the outcome can be one of two categories, and multiclass classification\index{classification!multiclass}, where the outcome can be one of three or more categories.
The `sonar` task is an example of a binary classification problem, as the target can only take two different values; in `mlr3` terminology it has the "twoclass" property:
```{r data_and_basic_modeling-057}
tsk_sonar = tsk("sonar")
tsk_sonar
tsk_sonar$class_names
```
In contrast, `tsk("penguins")` is a multiclass problem as there are more than two species of penguins; it has the "multiclass" property:
```{r data_and_basic_modeling-058}
tsk_penguins = tsk("penguins")
tsk_penguins$properties
tsk_penguins$class_names
```
A further difference between these tasks is that binary classification tasks have an extra field called `r index('$positive', parent = "TaskClassif", aside = TRUE, code = TRUE)`, which defines the 'positive' class.
In binary classification, as there are only two possible class types, by convention one of these is known as the 'positive' class, and the other as the 'negative' class.
It is arbitrary which is which, though often the more 'important' (and often smaller) class is set as the positive class.
You can set the positive class during or after construction.
If no positive class is specified then `mlr3` assumes the first level in the `target` column is the positive class, which can lead to misleading results.
```{r data_and_basic_modeling-059}
# Load the "Sonar" dataset from the "mlbench" package as an example
data(Sonar, package = "mlbench")
# specifying the positive class:
tsk_classif = as_task_classif(Sonar, target = "Class", positive = "R")
tsk_classif$positive
# changing after construction
tsk_classif$positive = "M"
tsk_classif$positive
```
While the choice of positive and negative class is arbitrary, they are essential to ensuring results from models and performance measures are interpreted as expected -- this is best demonstrated when we discuss thresholding (@sec-classif-prediction) and ROC metrics (@sec-roc).
Finally, plotting is possible with `r ref("autoplot.TaskClassif")`; below we plot a comparison between the target column and features.
```{r data_and_basic_modeling-060, warning = FALSE, message = FALSE, fig.height = 5}
#| label: fig-penguins-overview
#| fig-cap: Overview of part of the penguins dataset.
#| fig-alt: Diagram showing the distribution of target and feature values for a subset of the penguins data. The 'Adelie' species has an even split between male/female, short bill length and average bill depth. The 'Chinstrap' species only come from the island 'Dream' and have a lower body mass. The 'Gentoo' species only come from the island 'Biscoe', and have a longer flipper length and higher body mass.
library(ggplot2)
autoplot(tsk("penguins"), type = "duo") +
theme(strip.text.y = element_text(angle = -45, size = 8))
```
### LearnerClassif and MeasureClassif {#sec-basics-classif-learner}
Classification learners, which inherit from `r ref("LearnerClassif", aside = TRUE)`, have nearly the same interface as regression learners.
However, a key difference is that the possible predictions in classification are either `"response"` -- predicting an observation's class (a penguin's species in our example, this is sometimes called "hard labeling") -- or `"prob"` -- predicting a vector of probabilities, also called "posterior probabilities", of an observation belonging to each class.
In classification, the latter can be more useful as it provides information about the confidence of the predictions:
```{r data_and_basic_modeling-061}
lrn_rpart = lrn("classif.rpart", predict_type = "prob")
lrn_rpart$train(tsk_penguins, splits$train)
prediction = lrn_rpart$predict(tsk_penguins, splits$test)
prediction
```
Notice how the predictions include the predicted probabilities for all three classes, as well as the `response`, which (by default) is the class with the highest predicted probability.
Also, the interface for classification measures, which are of class `r ref('MeasureClassif', aside = TRUE)`, is identical to regression measures.
The key difference in usage is that you will need to ensure your selected measure evaluates the prediction type of interest.`r index(NULL, "$predict_type", parent = "Measure", code = TRUE)`
To evaluate `"response"` predictions, you will need measures with `predict_type = "response"`, or to evaluate probability predictions you will need `predict_type = "prob"`.
The easiest way to find these measures is by filtering the `r ref("mlr_measures")` dictionary:
```{r data_and_basic_modeling-062}
as.data.table(msr())[
task_type == "classif" & predict_type == "prob" &
!sapply(task_properties, function(x) "twoclass" %in% x)]
```
We also filtered to remove any measures that have the `"twoclass"` property as this would conflict with our `"multiclass"` task.
We need to use `sapply()` for this because the `task_properties` column is a list column.
We can evaluate the quality of our probability predictions and response predictions simultaneously by providing multiple measures:
```{r data_and_basic_modeling-063}
measures = msrs(c("classif.mbrier", "classif.logloss", "classif.acc"))
prediction$score(measures)
```
The `r index('accuracy')` measure evaluates the `"response"` predictions whereas the `r index('Brier score', lower = FALSE)` (`"classif.mbrier"`, squared difference between predicted probabilities and the truth) and `r index('logloss')` (`"classif.logloss"`, negative logarithm of the predicted probability for the true class) are evaluating the probability predictions.
If no measure is passed to `r index("$score()", parent = "Prediction", code = TRUE)`, the default is the `r index('classification error')` (`msr("classif.ce")`), which is the number of misclassifications divided by the number of predictions, i.e., $1 -$ `msr("classif.acc")`.
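The verbal definitions above can be sketched in base R on a small, made-up example. The probability matrix and labels are hypothetical, and the code follows the descriptions in the text rather than the exact `mlr3` implementations, which may differ in details such as how probabilities close to zero are handled in the logloss.

```r
classes = c("Adelie", "Chinstrap", "Gentoo")
truth = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie"), levels = classes)
# one row of predicted probabilities per observation
prob = matrix(
  c(0.8, 0.1, 0.1,
    0.2, 0.2, 0.6,
    0.5, 0.4, 0.1,
    0.6, 0.3, 0.1),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)

# response: the class with the highest predicted probability
response = factor(classes[max.col(prob, ties.method = "first")],
  levels = classes)

# accuracy and classification error sum to one
acc = mean(response == truth)
ce = mean(response != truth)

# multiclass Brier score: squared difference between the predicted
# probabilities and the one-hot encoded truth, summed over classes
onehot = outer(as.character(truth), classes, "==")
mbrier = mean(rowSums((onehot - prob)^2))

# logloss: negative log of the probability predicted for the true class
p_true = prob[cbind(seq_along(truth), as.integer(truth))]
logloss = -mean(log(p_true))

c(acc = acc, ce = ce, mbrier = mbrier, logloss = logloss)
```

The third observation is misclassified, which lowers the accuracy, but the probability-based measures also penalize the correct yet less confident predictions for the second and fourth observations.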
### `PredictionClassif`, Confusion Matrix, and Thresholding {#sec-classif-prediction}
`r ref("PredictionClassif", aside = TRUE)` objects have two important differences from their regression analog.
Firstly, the added field `$confusion`, and secondly the added method `r index('$set_threshold()', parent = "PredictionClassif", code = TRUE)`.
#### Confusion matrix {.unnumbered .unlisted}
A `r index('confusion matrix', aside = TRUE)` is a popular way to show the quality of classification (response) predictions in a more detailed fashion by seeing if a model is good at (mis)classifying observations in a particular class.
For binary and multiclass classification, the confusion matrix is stored in the `r index('$confusion', parent = "PredictionClassif", code = TRUE, aside = TRUE)` field of the `r ref("PredictionClassif")` object:
```{r data_and_basic_modeling-064}
prediction$confusion
```
The rows in a confusion matrix are the predicted class and the columns are the true class.
All off-diagonal entries are incorrectly classified observations, and all diagonal entries are correctly classified.
In this case, the classifier does fairly well classifying all penguins, but we could have found that it only classifies the Adelie species well but often conflates Chinstrap and Gentoo, for example.
You can visualize the predicted class labels with `autoplot.PredictionClassif()`.
```{r data_and_basic_modeling-065}
#| output: false
#| cache: false
autoplot(prediction)
```
```{r data_and_basic_modeling-066, out.width = "70%"}
#| fig-cap: "Counts of each class label in the ground truth data (left) and predictions (right)."
#| fig-alt: "Two stacked bar plots. Bottom left corresponds to true number of Gentoo species (41), middle left is true Chinstrap (22) and top left is true Adelie (50). Bottom right is predicted number of Gentoo species (41), middle right is Chinstrap (20), and top right is Adelie (52)."
#| label: fig-basics-classlabels
#| warning: false
#| message: false
#| echo: false
plt = ggplot2::last_plot()
plt = plt + ggplot2::scale_fill_manual(values = c("grey30", "grey50", "grey70"))
print(plt)
```
In the binary classification case, the top left entry corresponds to `r index('true positives')`, the top right to `r index('false positives')`, the bottom left to `r index('false negatives')` and the bottom right to `r index('true negatives')`.
Taking `tsk_sonar` as an example with ``r tsk_sonar$positive`` as the positive class:
```{r data_and_basic_modeling-067}
splits = partition(tsk_sonar)
lrn_rpart$
train(tsk_sonar, splits$train)$
predict(tsk_sonar, splits$test)$
confusion
```
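The layout described above can be reproduced with base R's `table()` on toy labels. This is a sketch using hypothetical `"pos"`/`"neg"` labels, with the positive class as the first factor level so that it occupies the first row and column.

```r
# toy labels; the positive class is the first factor level
lvls = c("pos", "neg")
truth = factor(c("pos", "pos", "neg", "neg", "pos", "neg"), levels = lvls)
response = factor(c("pos", "neg", "neg", "pos", "pos", "neg"), levels = lvls)

# rows are the predicted class, columns are the true class
conf = table(response = response, truth = truth)
conf

tp = conf["pos", "pos"]  # top left: true positives
fp = conf["pos", "neg"]  # top right: false positives
fn = conf["neg", "pos"]  # bottom left: false negatives
tn = conf["neg", "neg"]  # bottom right: true negatives
```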
We will return to the concept of binary (mis)classification in greater detail in @sec-roc.
#### Thresholding {.unnumbered .unlisted}
The final big difference compared to regression we will discuss is `r index('thresholding', aside = TRUE)`.
We saw previously that the default `response` prediction type is the class with the highest predicted probability.
For `k` classes with predicted probabilities $p_1,\dots,p_k$, this is the same as saying `response` = argmax$\{p_1,\dots,p_k\}$.
If the maximum probability is not unique, i.e., multiple classes are predicted to have the highest probability, then the response is chosen randomly from these.
In binary classification, this means that the positive class will be selected if the predicted probability of the positive class is greater than 50%, and the negative class otherwise.
This 50% value is known as the threshold and it can be useful to change this threshold if there is class imbalance (when one class is over- or under-represented in a dataset), or if there are different costs associated with classes, or simply if there is a preference to 'over'-predict one class.
As an example, let us take `tsk("german_credit")` in which 700 customers have good credit and 300 have bad.
Now we could easily build a model with around 70% accuracy simply by always predicting a customer will have good credit:
```{r data_and_basic_modeling-068}
task_credit = tsk("german_credit")
lrn_featureless = lrn("classif.featureless", predict_type = "prob")
split = partition(task_credit)
lrn_featureless$train(task_credit, split$train)
prediction = lrn_featureless$predict(task_credit, split$test)
prediction$score(msr("classif.acc"))
```
```{r data_and_basic_modeling-069}
#| output: false
#| cache: false
autoplot(prediction)
```
```{r data_and_basic_modeling-070, out.width = "70%"}
#| fig-cap: "Class labels ground truth (left) and predictions (right). The learner completely ignores the 'bad' class."
#| fig-alt: "Two stacked bar plots. Bottom left corresponds to true number of 'good' customers (231) and top left is 'bad' customers (99). Right is a single bar corresponding to 330 'good' predictions, 'bad' is never predicted."
#| label: fig-basics-classlabels-german
#| echo: false
#| warning: false
#| message: false
plt = ggplot2::last_plot()
plt = plt + ggplot2::scale_fill_manual(values = c("grey30", "grey50"))
print(plt)
```
While this model may appear to have good performance on the surface, in fact, it just ignores all 'bad' customers -- this can create big problems in this finance example, as well as in healthcare tasks and other settings where `r index('false positives')` cost more than `r index('false negatives')` (see @sec-cost-sens for cost-sensitive classification).
Thresholding allows classes to be selected with a different probability threshold, so instead of predicting that a customer has bad credit if P(good) < 50%, we might predict bad credit if P(good) < 70% -- notice how we write this in terms of the positive class, which in this task is 'good'.
Let us see this in practice:
```{r data_and_basic_modeling-071}
prediction$set_threshold(0.7)
prediction$score(msr("classif.acc"))
```
```{r data_and_basic_modeling-072}
lrn_rpart = lrn("classif.rpart", predict_type = "prob")
lrn_rpart$train(task_credit, split$train)
prediction = lrn_rpart$predict(task_credit, split$test)
prediction$score(msr("classif.acc"))
prediction$confusion
prediction$set_threshold(0.7)
prediction$score(msr("classif.acc"))
prediction$confusion
```
While our model performs 'worse' overall, i.e. with lower accuracy, it is still a 'better' model as it more accurately captures the relationship between classes.
In the binary classification setting, `$set_threshold()` only requires one numeric argument, which corresponds with the threshold for the positive class -- hence it is essential to ensure the positive class is correctly set in your task.
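Conceptually, binary thresholding is a single comparison against the predicted probability of the positive class. The following base R sketch uses made-up probabilities and a hypothetical helper, `classify()`. How exact ties at the threshold are resolved is an assumption here and may differ from `$set_threshold()`.

```r
# hypothetical predicted probabilities for the positive class ('good')
p_good = c(0.90, 0.65, 0.55, 0.30)

# illustrative helper: predict 'good' only if P(good) exceeds the threshold
classify = function(p, threshold) ifelse(p > threshold, "good", "bad")

classify(p_good, 0.5)
classify(p_good, 0.7)
```

Raising the threshold from 0.5 to 0.7 flips the second and third observations from 'good' to 'bad', so the positive class is predicted less often.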
In multiclass classification, thresholding works by first assigning a threshold to each of the `n` classes, dividing the predicted probabilities for each class by these thresholds to return `n` ratios, and then the class with the highest ratio is selected.
For example, say we are predicting if a new observation will be of class A, B, C, or D and we have predicted $P(A) = 0.2, P(B) = 0.4, P(C) = 0.1, P(D) = 0.3$.
We will assume that the threshold for all classes is identical and `1`:
```{r data_and_basic_modeling-073}
probs = c(0.2, 0.4, 0.1, 0.3)
thresholds = c(A = 1, B = 1, C = 1, D = 1)
probs/thresholds
```
We would therefore predict our observation is of class B as this is the highest ratio.
However, we could change our thresholds so that D has the lowest threshold and is most likely to be predicted, A has the highest threshold, and B and C have equal thresholds:
```{r data_and_basic_modeling-074}
thresholds = c(A = 0.5, B = 0.25, C = 0.25, D = 0.1)
probs/thresholds
```
Now our observation will be predicted to be in class D.
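The same ratio rule can be applied to several observations at once. The sketch below uses a hypothetical helper, `predict_with_thresholds()`, which is not part of `mlr3`, and a made-up probability matrix. Ties are broken by taking the first class to keep the output deterministic, which is an assumption of this sketch.

```r
# hypothetical helper, not part of mlr3
predict_with_thresholds = function(prob, thresholds) {
  # divide each column of probabilities by the threshold of its class
  ratios = sweep(prob, 2, thresholds[colnames(prob)], "/")
  # select the class with the highest ratio in each row
  colnames(prob)[max.col(ratios, ties.method = "first")]
}

# one row of predicted probabilities per observation
prob = matrix(
  c(0.2, 0.4, 0.1, 0.3,
    0.6, 0.1, 0.2, 0.1),
  ncol = 4, byrow = TRUE, dimnames = list(NULL, c("A", "B", "C", "D"))
)

equal = c(A = 1, B = 1, C = 1, D = 1)
custom = c(A = 0.5, B = 0.25, C = 0.25, D = 0.1)

predict_with_thresholds(prob, equal)
predict_with_thresholds(prob, custom)
```

With the custom thresholds the first observation moves from class B to class D, while the second observation stays in class A because its probability for A is high enough to keep the largest ratio.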
In `mlr3`, this is achieved by passing a named numeric vector to `$set_threshold()`.
This is demonstrated below with `tsk("zoo")`.
Before changing the thresholds, some classes are never predicted and some are predicted more often than they occur.
```{r data_and_basic_modeling-075}
#| label: fig-zoopreds