#+OPTIONS: title:t date:t author:t email:t
#+OPTIONS: toc:t h:6 num:nil |:t todo:nil
#+OPTIONS: *:t -:t ::t <:t \n:t e:t creator:nil
#+OPTIONS: f:t inline:t tasks:t tex:t timestamp:t
#+OPTIONS: html-preamble:t html-postamble:nil
#+PROPERTY: header-args:R :session R:purrr :eval no :exports code :tangle yes :comments link
#+TITLE: Functional programming in R (with /purrr/)
#+DATE: {{{time(%B %d\, %Y)}}}
#+AUTHOR: Marie-Hélène Burle
#+EMAIL: msb2@sfu.ca
* Introduction
** What is functional programming?
Functional programming is a programming paradigm in which programs are built by applying and composing functions. It is often contrasted with /imperative programming/, in which programs are sequences of statements that change state. While some languages are based strictly on functional programming (e.g. Haskell), R allows both imperative code (e.g. loops) and functional code (e.g. many base functions, the src_R[:eval no]{apply()} family, src_R[:eval no]{purrr}).
** Iterations
Iterations are the repetition of a process (e.g. applying the same function to several variables, several datasets, or several files).
The classic methods in R are:
- loops
- the src_R[:eval no]{apply()} family of functions
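As a point of comparison, here is a minimal sketch of both classic methods in base R. The list src_R[:eval no]{measurements} is hypothetical toy data made up for this illustration.
#+BEGIN_SRC R
## Hypothetical toy data: three numeric vectors stored in a list
measurements <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

## With a loop: pre-allocate the output, then fill it one element at a time
means_loop <- numeric(length(measurements))
names(means_loop) <- names(measurements)
for (i in seq_along(measurements)) {
  means_loop[i] <- mean(measurements[[i]])
}

## With the apply() family: vapply() handles the iteration
## and checks that each result is a single number
means_apply <- vapply(measurements, mean, numeric(1))
#+END_SRC
Both approaches give the same named numeric vector; the second one simply hides the bookkeeping.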
** The purrr package
One of the src_R[:eval no]{tidyverse} core packages, src_R[:eval no]{purrr} was written in 2015 by [[https://github.com/lionel-][Lionel Henry]] (also the maintainer), [[http://hadley.nz/][Hadley Wickham]], and [[https://www.rstudio.com/][RStudio inc.]]
*** Goal
The src_R[:eval no]{purrr} package is a set of tools allowing consistent functional programming in R in a src_R[:eval no]{tidyverse} style (using src_R[:eval no]{magrittr} pipes and following the same naming conventions found in other src_R[:eval no]{tidyverse} packages).
As Hadley Wickham says, in many ways, src_R[:eval no]{purrr} is the equivalent of src_R[:eval no]{dplyr}, but while src_R[:eval no]{dplyr} focuses on data frames, src_R[:eval no]{purrr} works on vectors: it works on the elements of atomic vectors, lists, and data frames. Since R's most basic data structure is the vector, this makes src_R[:eval no]{purrr} extremely powerful and flexible.
*** Logistics
Install it with:
#+BEGIN_SRC R
install.packages("tidyverse")
## or
install.packages("purrr")
#+END_SRC
Load it with:
#+BEGIN_SRC R
library(tidyverse)
## or
library(purrr)
#+END_SRC
As always, once the package is loaded, you can get information on the package with:
#+BEGIN_SRC R
?purrr
#+END_SRC
and on any of its functions with:
#+BEGIN_SRC R
## ?name_of_function
## e.g. for the map function
?map
#+END_SRC
* Let's dive in
** Load packages
First, let's load the packages that we will use. It is always a good idea to load all the packages that you will be using at the top of the script. This helps others who use your script to know what is required to run it.
#+BEGIN_SRC R
library(tidyverse) # we will use purrr and other core packages
library(magrittr) # we will use several types of pipes
#+END_SRC
** Create some fake banding data
Let's create some imaginary bird banding data:
#+BEGIN_SRC R
banding <- tibble(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0)
)
banding
#+END_SRC
** Map: apply functions to elements of a list
Imagine that you want to calculate the mean for each of the morphometric measurements (mass, tarsus, and wing).
#+BEGIN_VERBATIM
How would you usually do this?
Spend 5 minutes writing code you would usually use.
#+END_VERBATIM
To apply functions to elements of a list, you can use src_R[:eval no]{map}, one of the key functions of the src_R[:eval no]{purrr} package.
*** Usage
#+BEGIN_SRC R
map(.x, .f, ...)
#+END_SRC
#+BEGIN_EXAMPLE
.x a list or atomic vector
.f a function, formula, or atomic vector
... additional arguments passed to .f
#+END_EXAMPLE
For every element of src_R[:eval no]{.x}, apply src_R[:eval no]{.f}.
What we have, in the simplest case, is:
#+BEGIN_SRC R
map(list, function)
#+END_SRC
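A minimal runnable sketch of this usage, including the src_R[:eval no]{...} argument. The list src_R[:eval no]{toy} is hypothetical data made up for this illustration.
#+BEGIN_SRC R
library(purrr)

## Hypothetical toy list; the second element contains a missing value
toy <- list(a = c(1, 2, 3), b = c(4, NA, 6))

## .f is applied to each element of .x; the result is a list of the same length
map(toy, mean)

## Arguments given after .f are passed on to it (here, na.rm = TRUE goes to mean())
toy_means <- map(toy, mean, na.rm = TRUE)
#+END_SRC
In the first call, the mean of src_R[:eval no]{b} is src_R[:eval no]{NA}; in the second, the missing value is dropped before averaging.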
*** In our example
#+BEGIN_VERBATIM
How could we use src_R[:eval no]{map()} to calculate the means of all 3 measurement types?
#+END_VERBATIM
#+BEGIN_RED
A data frame is a list! It is a list of vectors.
Without running it on your computer, try to guess what the result of the following will be:
#+BEGIN_SRC R
length(banding)
#+END_SRC
Now, run it. What do you get? Why?
#+END_RED
So, back to our example, we do have a list: a list of vectors. That's what our banding data frame is! So there is no problem applying src_R[:eval no]{map()} to it.
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
map(banding[4:6], mean)
#+END_SRC
or using a pipe
#+BEGIN_SRC R
banding[4:6] %>% map(mean)
#+END_SRC
#+HTML: </div>
However, the output of src_R[:eval no]{map()} is always a list. And a list as output is not really convenient here. There are other map functions which have vector or data frame outputs. To get a numeric vector as the output, we use src_R[:eval no]{map_dbl()}:
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
map_dbl(banding[4:6], mean)
#+END_SRC
or
#+BEGIN_SRC R
banding[4:6] %>% map_dbl(mean)
#+END_SRC
#+HTML: </div>
Similarly, you can calculate the variance, the sum, look for the largest value, or apply any other function to our data.
#+BEGIN_VERBATIM
Spend 2 min writing code for these.
#+END_VERBATIM
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
map_dbl(banding[4:6], var)
map_dbl(banding[4:6], sum)
map_dbl(banding[4:6], max)
#+END_SRC
#+HTML: </div>
*** Stepping things up
Now, imagine that you would like to plot the relationship between tarsus and mass for each population.
#+BEGIN_VERBATIM
How would you usually do that?
Spend 5 min writing code for this.
And feel free to chat.
#+END_VERBATIM
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
You could write a for loop:
#+BEGIN_SRC R
for (i in unique(banding$population)) {
  print(ggplot(banding %>% filter(population == i),
               aes(tarsus, mass)) + geom_point())
}
#+END_SRC
But this is the functional programming method:
#+BEGIN_SRC R
banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())
#+END_SRC
Let's save those graphs in a variable called src_R[:eval no]{graphs} that we will use later.
#+BEGIN_SRC R
graphs <-
  banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())
#+END_SRC
#+HTML: </div>
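To see what src_R[:eval no]{split()} hands over to the map functions, here is a small sketch of the same split-then-map pattern. It uses the built-in src_R[:eval no]{mtcars} dataset rather than the banding data (an assumption made so that the result is reproducible), and src_R[:eval no]{map_int()}, the integer counterpart of src_R[:eval no]{map_dbl()}.
#+BEGIN_SRC R
library(purrr)

## split() turns a data frame into a named list of data frames, one per group
by_cyl <- split(mtcars, mtcars$cyl)
names(by_cyl)

## map_int() then applies nrow() to each group and returns an integer vector
n_per_group <- map_int(by_cyl, nrow)
#+END_SRC
The names of the list (one per group) are carried over to the output, which is why the graphs saved above are named after the populations.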
*** Formulas
#+BEGIN_RED
Formulas = a shorter notation for anonymous functions
#+END_RED
**** With one element
The code:
#+BEGIN_SRC R
map(function(x) x + 3)
#+END_SRC
which contains the anonymous function src_R[:eval no]{function(x) x + 3} can be written as:
#+BEGIN_SRC R
map(~ . + 3)
#+END_SRC
This code abbreviation is called a "formula".
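The two notations can be checked against each other with a runnable sketch. The input vector src_R[:eval no]{x} is hypothetical, and src_R[:eval no]{map_dbl()} is used instead of src_R[:eval no]{map()} only to get a numeric vector that is easy to compare.
#+BEGIN_SRC R
library(purrr)

## Hypothetical input vector, used only for illustration
x <- c(1, 2, 3)

## Anonymous function and formula notation give the same result
with_function <- map_dbl(x, function(x) x + 3)
with_formula <- map_dbl(x, ~ . + 3)

identical(with_function, with_formula)
#+END_SRC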
#+BEGIN_VERBATIM
Your turn: write the following anonymous function as a formula.
#+END_VERBATIM
#+BEGIN_SRC R
map(function(x) mean(x) + 3)
#+END_SRC
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
map(~ mean(.) + 3)
#+END_SRC
#+HTML: </div>
**** With 2 elements
The code:
#+BEGIN_SRC R
map2(function(x, y) x + y)
#+END_SRC
can be shortened to:
#+BEGIN_SRC R
map2(~ .x + .y)
#+END_SRC
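Here is a runnable sketch of the two-element case. The vectors src_R[:eval no]{a} and src_R[:eval no]{b} are hypothetical inputs, and src_R[:eval no]{map2_dbl()} is used so that the output is a numeric vector.
#+BEGIN_SRC R
library(purrr)

## Hypothetical inputs of the same length
a <- c(1, 2, 3)
b <- c(10, 20, 30)

## .x refers to the elements of the first input, .y to those of the second
sums_function <- map2_dbl(a, b, function(x, y) x + y)
sums_formula <- map2_dbl(a, b, ~ .x + .y)

identical(sums_function, sums_formula)
#+END_SRC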
**** Referring to elements
| 1st element | | 2nd element | | 3rd element |
|-------------+---+-------------+---+-------------|
| =.= | | | | |
| =.x= | | =.y= | | |
| =..1= | | =..2= | | =..3= |
etc.
#+BEGIN_VERBATIM
Your turn: write the following anonymous function as a formula.
#+END_VERBATIM
#+BEGIN_SRC R
pmap(function(x1, x2, y) lm(y ~ x1 + x2))
#+END_SRC
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
pmap(~ lm(..3 ~ ..1 + ..2))
#+END_SRC
#+HTML: </div>
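The positional notation can be tried on something simpler than a model. This is a minimal sketch with three hypothetical input vectors (src_R[:eval no]{u}, src_R[:eval no]{v}, and src_R[:eval no]{w}); note that the src_R[:eval no]{pmap} functions take their inputs as a single list.
#+BEGIN_SRC R
library(purrr)

## Hypothetical inputs of the same length
u <- c(1, 2, 3)
v <- c(10, 20, 30)
w <- c(100, 200, 300)

## ..1, ..2, and ..3 refer to the elements of the 1st, 2nd, and 3rd inputs
totals_function <- pmap_dbl(list(u, v, w), function(x1, x2, y) x1 + x2 + y)
totals_formula <- pmap_dbl(list(u, v, w), ~ ..1 + ..2 + ..3)

identical(totals_function, totals_formula)
#+END_SRC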
** src_R[:eval no]{map_if}/src_R[:eval no]{modify_if} and src_R[:eval no]{map_at}/src_R[:eval no]{modify_at}
We built our data frame with src_R[:eval no]{tibble()} which, as is the norm in the src_R[:eval no]{tidyverse}, does not transform strings into factors:
#+BEGIN_SRC R
banding <-
  tibble(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0)
  ) %T>%
  str()
#+END_SRC
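Several base R functions, however, do, or did by default until R 4.0.0.
Let's build the same data with the base R function src_R[:eval no]{data.frame()}. Since R 4.0.0, src_R[:eval no]{data.frame()} no longer converts strings into factors by default, so we set src_R[:eval no]{stringsAsFactors = TRUE} to reproduce the older behaviour:
#+BEGIN_SRC R
banding <-
  data.frame(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0),
    stringsAsFactors = TRUE # the default before R 4.0.0
  ) %T>%
  str()
#+END_SRC
#+BEGIN_RED
The reason several base R functions transformed strings into factors by default is historic. This used to be essential to save space. But this is not relevant anymore and had become somewhat of an annoyance, which is why the default changed in R 4.0.0.
#+END_RED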
If you have such a data frame, you may wish to transform the factors into characters.
#+BEGIN_VERBATIM
How can you do this?
#+END_VERBATIM
src_R[:eval no]{map()} has the derivatives src_R[:eval no]{map_if()} and src_R[:eval no]{map_at()}, which apply a function only to the elements that meet a condition or that sit at certain locations. Here, we can use src_R[:eval no]{map_if()}:
#+BEGIN_SRC R
banding %>%
map_if(is.factor, as.character) %T>%
str()
#+END_SRC
However, src_R[:eval no]{map_if} and src_R[:eval no]{map_at} always return lists. If you want the output to be of the same type as the input, use src_R[:eval no]{modify_if} and src_R[:eval no]{modify_at} instead.
#+BEGIN_SRC R
banding <-
  data.frame(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0),
    stringsAsFactors = TRUE # needed since R 4.0.0 to get factor columns
  )
banding %>%
  modify_if(is.factor, as.character) %>%
  head() %T>%
  str()
#+END_SRC
#+BEGIN_RED
This could also be accomplished with src_R[:eval no]{mutate_if()} (superseded by src_R[:eval no]{across()} since dplyr 1.0.0, but still available):
#+BEGIN_SRC R
banding %>% mutate_if(is.factor, as.character)
#+END_SRC
But the src_R[:eval no]{map()} functions also work with lists and are more flexible than src_R[:eval no]{mutate()} and its derivatives.
#+END_RED
*** Usage
#+BEGIN_SRC R
modify(.x, .f, ...)
modify_if(.x, .p, .f, ...)
modify_at(.x, .at, .f, ...)
#+END_SRC
#+BEGIN_EXAMPLE
.x a list or atomic vector
.f a function, formula, or atomic vector
... additional arguments passed to .f
.p a predicate function.
Only the elements for which .p evaluates to TRUE will be modified
.at a character vector of names or a numeric vector of positions.
Only the elements corresponding to .at will be modified
#+END_EXAMPLE
For every element of src_R[:eval no]{.x}, apply src_R[:eval no]{.f}, and return a modified version of src_R[:eval no]{.x}.
So basically, in its simplest form, we have:
#+BEGIN_SRC R
modify(list, function)
#+END_SRC
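The difference in return type between the src_R[:eval no]{map} and src_R[:eval no]{modify} variants can be seen in a minimal sketch. The data frame src_R[:eval no]{df} is hypothetical toy data; its factor column is created explicitly with src_R[:eval no]{factor()} so that the example behaves the same in any R version.
#+BEGIN_SRC R
library(purrr)

## Hypothetical small data frame with one factor column
df <- data.frame(
  id = 1:3,
  group = factor(c("a", "b", "a")),
  value = c(1.5, 2.5, 3.5)
)

## map_if() returns a plain list...
as_list <- map_if(df, is.factor, as.character)
class(as_list)

## ...whereas modify_if() returns the same type as its input (a data frame)
as_df <- modify_if(df, is.factor, as.character)
class(as_df)

## modify_at() selects the elements by name or position instead of by predicate
rounded <- modify_at(df, "value", round)
#+END_SRC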
** Walk: apply side effects to elements of a list
Now, we want to save the 3 graphs we previously drew into 3 files.
#+BEGIN_VERBATIM
How would you do this?
Spend 5 minutes writing code you would usually use.
#+END_VERBATIM
To apply side effects to elements of a list, we use the src_R[:eval no]{walk} family of functions.
*** Usage
#+BEGIN_SRC R
walk(.x, .f, ...)
#+END_SRC
#+BEGIN_EXAMPLE
.x a list or atomic vector
.f a function, formula, or atomic vector
... additional arguments passed to .f
#+END_EXAMPLE
*** Apply to our example
We already have a list of graphs: src_R[:eval no]{graphs}. Now, we can create a vector of paths where we want to save them:
#+BEGIN_SRC R
paths <- paste0("population_", names(graphs), ".png")
#+END_SRC
So we want to save each element of src_R[:eval no]{graphs} into an element of src_R[:eval no]{paths}. The function we will use is src_R[:eval no]{ggsave}. To apply it to all of our elements, instead of using src_R[:eval no]{map}, we will use src_R[:eval no]{walk} because we are not trying to create a new object.
The problem is that we have 2 lists to deal with. src_R[:eval no]{map} and src_R[:eval no]{walk} only iterate over one list. But src_R[:eval no]{map2} and src_R[:eval no]{walk2} iterate over 2 lists in parallel (src_R[:eval no]{pmap} and src_R[:eval no]{pwalk} iterate over any number of lists).
Here is how src_R[:eval no]{walk2} works (it is the same for src_R[:eval no]{map2}):
#+BEGIN_SRC R
walk2(.x, .y, .f, ...)
#+END_SRC
#+BEGIN_EXAMPLE
.x, .y vectors of the same length.
A vector of length 1 will be recycled.
.f a function, formula, or atomic vector
... additional arguments passed to .f
#+END_EXAMPLE
#+BEGIN_VERBATIM
Give it a try:
use src_R[:eval no]{walk2} to save the elements of src_R[:eval no]{graphs} into the elements of src_R[:eval no]{paths} using src_R[:eval no]{ggsave}.
Don't hesitate to look up the help file for src_R[:eval no]{ggsave} with src_R[:eval no]{?ggsave} if you don't remember how to use it!
#+END_VERBATIM
#+BEGIN_accordion
Answer
#+END_accordion
#+HTML: <div class="panel">
#+BEGIN_SRC R
walk2(paths, graphs, ggsave)
#+END_SRC
#+HTML: </div>
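The same pattern works for any function called for its side effect. Here is a minimal sketch that writes text files instead of graphs; the contents and the file names are hypothetical, and the files go to a temporary directory.
#+BEGIN_SRC R
library(purrr)

## Hypothetical contents and file paths, written to a temporary directory
contents <- c("first file", "second file")
files <- file.path(tempdir(), c("walk_demo_1.txt", "walk_demo_2.txt"))

## walk2() calls writeLines(text, con) on each pair of elements for its side effect
## and invisibly returns its first input, so nothing is printed
returned <- walk2(contents, files, writeLines)

## The files now exist
file.exists(files)
#+END_SRC
Because the first input is returned invisibly, a src_R[:eval no]{walk} call can sit in the middle of a pipeline without interrupting it.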
* Summary of the map and walk functions family
We will use a different src_R[:eval no]{map} function (or src_R[:eval no]{walk} function, if we only want the side effects) depending on:
#+BEGIN_VERSE
- How many lists we are using in the input
#+END_VERSE
| number of lists in input | | | purrr function    |
|--------------------------+---+---+-------------------|
| 1                        | | | =map= or =walk=   |
| 2                        | | | =map2= or =walk2= |
| more                     | | | =pmap= or =pwalk= |
#+HTML: <br>
#+BEGIN_VERSE
- The class of the output we want
#+END_VERSE
| class we want for the output | | | purrr function |
|--------------------------------+---+---+----------------|
| nothing* | | | =walk= |
| list* | | | =map= |
| double | | | =map_dbl= |
| integer | | | =map_int= |
| character | | | =map_chr= |
| logical | | | =map_lgl= |
| data frame (by row-binding) | | | =map_dfr= |
| data frame (by column-binding) | | | =map_dfc= |
#+HTML: <br>
Results are returned predictably and consistently, which is [[https://blog.rstudio.com/2016/01/06/purrr-0-2-0/][not the case]] with src_R[:eval no]{sapply()}.
*As [[https://github.com/jennybc][Jenny Bryan]] said [[https://speakerdeck.com/jennybc/data-rectangling][nicely]]:
#+BEGIN_QUOTE
"src_R[:eval no]{walk()} can be thought of as src_R[:eval no]{map_nothing()}
src_R[:eval no]{map()} can be thought of as src_R[:eval no]{map_list()}"
#+END_QUOTE
#+HTML: <br>
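A short sketch of the typed variants from the table above, and of one case where src_R[:eval no]{sapply()} is less predictable. The list src_R[:eval no]{toy} is hypothetical data made up for this illustration.
#+BEGIN_SRC R
library(purrr)

## Hypothetical toy list
toy <- list(a = 1:3, b = 4:9)

lengths_int <- map_int(toy, length)                   # integer vector
pasted_chr <- map_chr(toy, ~ paste(., collapse = "")) # character vector
is_long_lgl <- map_lgl(toy, ~ length(.) > 3)          # logical vector

## With an empty input, sapply() falls back to a list,
## while map_int() still returns an (empty) integer vector
empty_sapply <- sapply(list(), length)
empty_map <- map_int(list(), length)
class(empty_sapply)
class(empty_map)
#+END_SRC
The typed variants also throw an error rather than silently changing the output type when a result does not fit the requested class.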
#+BEGIN_VERSE
- How we want to select the input
#+END_VERSE
| selecting input based on | | | purrr function |
|--------------------------+---+---+----------------|
| condition | | | =map_if= |
| location | | | =map_at= |
* Conclusion
These are some of the most important src_R[:eval no]{purrr} functions. But there are many others and I encourage you to explore them on your own.
Great resources for this are:
- The [[http://r4ds.had.co.nz/iteration.html][iteration chapter]] of [[http://hadley.nz/][Hadley Wickham]]'s book [[http://r4ds.had.co.nz/index.html][R for data science]]
- The [[https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf][purrr cheatsheet]]
- The [[https://cran.r-project.org/web/packages/purrr/purrr.pdf][purrr CRAN manual]]
- The vignettes and help files for the many purrr functions
Have fun!!!
#+HTML: <script>; var acc = document.getElementsByClassName("accordion"); var i; for (i = 0; i < acc.length; i++) {; acc[i].addEventListener("click", function() {; this.classList.toggle("active"); var panel = this.nextElementSibling; if (panel.style.maxHeight){; panel.style.maxHeight = null; } else {; panel.style.maxHeight = panel.scrollHeight + "px"; }; }); }; </script>