/
bk01_base-functions.Rmd
405 lines (332 loc) · 15.3 KB
/
bk01_base-functions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
---
title: "Relationship to base and plyr functions"
comment: "*side-by-side workflow comparison*"
output:
html_document:
toc: true
toc_float: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE)
```
## Why not base?
You need a way to iterate in R in a data-structure-informed way. What does that mean?
* Iterate over elements of a list
* Iterate over rows or columns of a 2-dimensional object
* Iterate over sub data frames induced by one or more factors
* Iterate over tuples formed from the i-th element of several vectors of equal length
**All of this is absolutely possible with base R**, using `for()` loops or the "apply" functions, such as `apply()`, `[slvmt]apply()`, and `by()`. I know, because I did so religiously for over 15 years. It works fine!
Why might you do otherwise? As an instructor, I encounter lots of useRs who sheepishly say they know they should be using the apply functions, but they don't. Because they've never quite figured them out or been able to form the habit.
I have a theory about why. The user interface of the "apply"" functions is not as consistent as it could be, which slows down learning. The return objects frequently require further checking and massage to use downstream, which diminishes the payoff. In particular, there's a tendency to return a vector (atomic or otherwise) or array, instead of data frame, with the original factor levels appearing in a names attribute.
Regardless of your preference, hopefully the side-by-side examples below are a helpful tour of different ways to iterate. Even if you prefer purrr/tidyverse in some settings, you might want base methods when you don't want any dependencies.
## Why purrr?
purrr addresses some of the friction identified in the base functions for "split-apply-combine":
* The `map()` family of functions is highly internally consistent, making it easier to transfer expertise from one function to another.
* Greater encouragement for type-safe simplification to atomic vector or data frame, producing output that is more ready for the next step.
* Concise syntax for defining anonymous functions.
We load purrr now. Below we also use example lists from [repurrrsive](https://github.com/jennybc/repurrrsive) which are [introduced and explored elsewhere](ls00_inspect-explore.html).
```{r}
library(purrr)
library(repurrrsive)
```
## Why not plyr?
If you have never heard of the [plyr package](https://CRAN.R-project.org/package=plyr), you can skip this section! Personally, I'm a bit sad that it's time to move on from plyr. But it is time.
Why? plyr is no longer under active development. The innovation is happening elsewhere, in purrr and the other packages in the tidyverse. Also, even I must admit that plyr can be rather slow. I wouldn't recommend writing new code that requires plyr.
In terms of understanding or translating plyr code, here's the minimum you need to know: The key feature of the main 3 * 4 = 12 functions in plyr is their explicit, memorable form.
```{r eval = FALSE}
XYply()
```
where `X` and `Y` specify the type of input and output, respectively. `X` and `Y` take values from here:
* `a` = "array"
* `d` = "data frame"
* `l` = "list"
* `_` = "nothing" (not eligible as `X` re: input, for obvious reasons)
The most useful function proved to be `ddply()`: data frame in, data frame out. This inspired the creation of the [dplyr package](https://dplyr.tidyverse.org). Most tasks suitable for `a*ply()` and, especially `l*ply()` are now handled by purrr's `map()` family. The functions called for their side effects, as opposed to return value, are those named `*_ply()`; this is handled by purrr's `walk()` functions. Marching through a data frame row by row is now a job for purrr's `pmap()`.
## `lapply()` vs. `purrr::map()`
These are the core mapping functions of base and purrr, respectively. They are "list in, list out". The main (only?) difference is access to purrr's shortcuts for indexing by name or position and for creating anonymous functions.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
lapply(got_chars[1:3],
function(x) x[["name"]])
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map(got_chars[1:3], "name")
```
</div>
</div>
This is also `plyr::llply()`, although it offers some other arguments, such as `.progress` and `.parallel`.
```{r eval = FALSE}
plyr::llply(got_chars[1:3], function(x) x[["name"]])
```
## `sapply()` vs. ¯\\\_(ツ)\_/¯
`sapply()` is a base function that attempts to apply a reasonable simplification to the output of `lapply()`. It's handy for interactive use, but due to the unpredictability of it return value, it's unwise to use it in programming. There is no equivalent in purrr or plyr.
Problem demonstration: The characters in Game of Thrones can have aliases. Some have several, some have one, some have none. Depending on which characters I work with, the same `sapply()` call can return an object of entirely different class, i.e. a list or a character vector. This can be a source of mysterious bugs in code meant to run non-interactively.
```{r}
aliases1 <- sapply(got_chars[20:22], function(x) x[["aliases"]])
str(aliases1)
aliases2 <- sapply(got_chars[c(3, 22, 27)], function(x) x[["aliases"]])
str(aliases2)
```
With purrr, you would use `map()` to get a list back or `map_chr()` to get atomic character vector. If you use `map_chr()` when you should not, you'll get an informative error right away (shown below) and can adjust your approach accordingly.
```{r}
map_chr(got_chars[2:4], "aliases")
```
## `vapply()` vs. `map_*()`
Base `vapply()` requires you to specify a template for the return value and is described as a safer alternative to `sapply()`. The closest purrr functions are the type-specific mapping functions: `map_lgl()`, `map_int()`, `map_dbl()`, and `map_chr()` that are "list in, atomic vector out". Here's comparable use of `vapply()` and `map_chr()` to get some of the Game of Thrones characters' names.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
vapply(got_chars[1:3],
function(x) x[["name"]],
character(1))
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map_chr(got_chars[1:3], "name")
```
</div>
</div>
### `vapply()` always simplifies
What's not to love with `vapply()` then? It suffers from the `drop = TRUE` vs `FALSE` problem we have when requesting a single row or column from a 2-dimensional object. Except `vapply()` has no `drop` argument to control this behavior. It's an example of the base functions being more difficult to program around. The template allows you to specify the form of each individual result, but there is no way to specify the form -- such as the dimension -- of the *overall* result.
I adapt this example from my real life, where I have `vapply()` inside a function and `n` is an argument to that function, i.e. it varies. Here I simply define `n` in the global environment prior to the `vapply()` call. Note how `vapply()` returns a 2 dimensional object in the first case and atomic vector in the second. As it says in the docs: "Simplification is always done in `vapply`." Believe it.
```{r}
f <- function(x, n) rep(x, n)
n <- 3
vapply(c("a", "b"), f, character(n), n = n)
n <- 1
vapply(c("a", "b"), f, character(n), n = n)
```
## ¯\\\_(ツ)\_/¯ vs. `map_dfr()`
The `purrr::map_dfr()` function is "list in, data frame out" and there is no true base equivalent. Given the centrality of data frames for analysis, it is handy to have a function to produce them, without resorting to `do.call()`.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
l <- lapply(got_chars[23:25],
`[`, c("name", "playedBy"))
mat <- do.call(rbind, l)
(df <- as.data.frame(mat, stringsAsFactors = FALSE))
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map_dfr(got_chars[23:25],
`[`, c("name", "playedBy"))
```
</div>
</div>
The base workflow above gets trickier if you're extracting elements of disparate type. At that point, it may make more sense to use `vapply()` repeatedly. For comparability, we'll show similar using purrr's type-specific mapping, which is also safer than relying on automatic type conversion from `map_dfr()`.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
data.frame(
name = vapply(got_chars[23:25], `[[`,
character(1), "name"),
id = vapply(got_chars[23:25], `[[`,
integer(1), "id"),
stringsAsFactors = FALSE
)
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
tibble::tibble(
name = map_chr(got_chars[23:25], "name"),
id = map_int(got_chars[23:25], "id")
)
```
</div>
</div>
## `mapply()` vs. `map2()`, `pmap()`
When you need to iterate over 2 or more vectors/lists in parallel, the base option is `mapply()`. Unlike the other apply functions, the first argument is `FUN`, the function to apply, and the multiple vector inputs are provided "loose" via `...`.
For exactly two vector inputs, purrr has `map2()`, with all the usual type-specific variants. For an arbitrary number of vector inputs, use purrr `pmap()` or type-specific variants, with the inputs packaged in a list. A very handy special case is when the input is a data frame, in which case `pmap_*()` applies `.f` to each row.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
nms <- vapply(got_chars[16:18],
`[[`, character(1), "name")
birth <- vapply(got_chars[16:18],
`[[`, character(1), "born")
mapply(function(x, y) paste(x, "was born", y),
nms, birth)
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
nms <- got_chars[16:18] %>%
map_chr("name")
birth <- got_chars[16:18] %>%
map_chr("born")
map2_chr(nms, birth, ~ paste(.x, "was born", .y))
## and again, but with pmap()
df <- tibble::tibble(
nms,
connector = "was born",
birth
)
pmap_chr(df, paste)
```
</div>
</div>
## `aggregate()` vs. `dplyr::summarize()`
Consider a data frame, as opposed to a nested list. How do you split it into pieces, according to one or more factors, apply a function to the pieces, and combine the results?
Create a tiny excerpt of the Gapminder dataset that contains a bit of data for Canada and Germany. Load dplyr, now that we are more in the data frame world.
```{r message = FALSE}
library(dplyr)
library(gapminder)
(mini_gap <- gapminder %>%
filter(country %in% c("Canada", "Germany"), year > 2000) %>%
droplevels())
```
Simple summary of a single variable for each country. Specifically, take the mean of life expectancy. In this case, the formula method of base `aggregate()` is quite nice, i.e. it returns a convenient data frame.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
aggregate(lifeExp ~ country, mini_gap, mean)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
mini_gap %>%
group_by(country) %>%
summarize(lifeExp = mean(lifeExp))
```
</div>
</div>
Simple summaries of two variables for each country. We take the mean of life expectancy and of GDP per capita.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
## formula method
aggregate(cbind(lifeExp, gdpPercap) ~ country, mini_gap, mean)
## data.frame method
aggregate(mini_gap[c("lifeExp", "gdpPercap")], list(mini_gap$country), mean)
## tapply() more general but output less useful here (data frame?)
## returns named vector
tapply(mini_gap$lifeExp, mini_gap$country, mean)
## returns list
tapply(mini_gap$lifeExp, mini_gap$country, mean, simplify = FALSE)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
mini_gap %>%
group_by(country) %>%
summarize_at(vars(lifeExp, gdpPercap), mean)
```
</div>
</div>
Bivariate summary of two variables for each country. We compute the correlation of life expectancy and year, for the full gapminder dataset now. On the base side, we can no longer use `aggregate()` or `tapply()` and need to graduate to `by()`.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
## by() with simplification (the default)
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year))
head(by_obj)
## by() without simplification
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year),
simplify = FALSE)
head(by_obj)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
gapminder %>%
group_by(country) %>%
summarize(cor = cor(lifeExp, year))
```
</div>
</div>
This is a good place to pause and glance back over the base vs tidyverse calls and results in this section. It shows that you can absolutely do these tasks either way, but the consistency of the code and return values is higher in the tidyverse.
## `by()` vs. `tidyr::nest()`
Fit a linear model of life expectancy against year. On the tidyverse side, we now create a nested data frame, with one meta-row per country. Therefore we load tidyr to get `nest()`. The data needed for each country's linear model is stored as a list-column of country-specific data frame.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
by_obj <- by(gapminder,
gapminder$country,
function(df) lm(lifeExp ~ year, data = df))
str(by_obj[1:2], max.level = 1)
by_obj[[1]]
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
library(tidyr)
nested_df <- gapminder %>%
group_by(country, continent) %>%
nest() %>%
mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
str(nested_df$fit[1:2], max.level = 1)
nested_df$fit[[1]]
```
</div>
</div>
What if you want to inspect the fits for Oceania? On the base side, you've got some work to do because the country information lives only in the names and the continent information is not directly linked to the model fits at all. On the tidyverse side, where the fits live in a data frame that carries country and continent info, we can use our usual techniques for filtering rows based on the data.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
o_countries <- as.character(unique(gapminder$country[gapminder$continent == "Oceania"]))
by_obj[names(by_obj) %in% o_countries]
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
nested_df %>%
filter(continent == "Oceania") %>%
.$fit
```
</div>
</div>
Let's form a data frame with one row per country and variables for country, continent, estimated intercept, and estimated slope. I also want to guarantee that the country and continent factors have the same levels as they originally had in gapminder.
<div class = "row">
<div class = "col-md-6">
**base**
```{r}
coefs <- lapply(by_obj, coef)
coefs <- do.call(rbind, coefs)
coefs <- data.frame(
country = I(rownames(coefs)),
coefs
)
coefs$continent <- gapminder$continent[match(coefs$country, gapminder$country)]
coefs$continent <- factor(coefs$continent, levels = levels(gapminder$continent))
coefs$country <- factor(coefs$country, levels = levels(gapminder$country))
head(coefs)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
nested_df %>%
mutate(coefs = map(fit, coef),
intercept = map_dbl(coefs, 1),
slope = map_dbl(coefs, 2)) %>%
select(country, continent, intercept, slope)
```
</div>
</div>
This illustrates the post-processing that is often necessary in a base workflow. We need to restore the country info from the names of `by_obj`, use them to look up the continents in gapminder, and then restore the original factor levels for both country and continent. It illustrates the payoff for temporarily tolerating the `data` and `fit` list-columns in the nested data frame used in the tidyverse workflow. The country and continent factors remain intact and directly associated with the data and fits.