-
Notifications
You must be signed in to change notification settings - Fork 4
/
08-func.Rmd
537 lines (368 loc) · 17.7 KB
/
08-func.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
# Iteration & Functions {#func}
<div class="right meme"><img src="images/memes/functions.jpg"
alt="History channel aliens conspiracy guy. Top text: I've got functions inside functions...; Bottom text: ... inside functions." /></div>
## Learning Objectives {#ilo-func}
You will learn about functions and iteration by using simulation to calculate a power analysis for an independent samples t-test.
### Basic {-}
1. Work with basic [iteration functions](#iteration-functions) `rep`, `seq`, `replicate` [(video)](https://youtu.be/X3zFA71JzgE){class="video"}
2. Use [`map()` and `apply()` functions](#map-apply) [(video)](https://youtu.be/HcZxQZwJ8T4){class="video"}
3. Write your own [custom functions](#custom-functions) with `function()` [(video)](https://youtu.be/Qqjva0xgYC4){class="video"}
4. Set [default values](#defaults) for the arguments in your functions
### Intermediate {-}
5. Understand [scope](#scope)
6. Use [error handling and warnings](#warnings-errors) in a function
### Advanced {-}
<div class="right meme"><img src="images/memes/purrr.gif"
alt="Jacobim Mugatu (Will Ferrell) and Katinka Ingabogovinanana (Mila Jojovich) from Zoolander, sitting in a theatre with a small dog; bottom text: purrr::map so hot right now" /></div>
The topics below are not (yet) covered in these materials, but they are directions for independent learning.
7. Repeat commands having multiple arguments using `purrr::map2_*()` and `purrr::pmap_*()`
8. Create **nested data frames** using `dplyr::group_by()` and `tidyr::nest()`
9. Work with **nested data frames** in `dplyr`
10. Capture and deal with errors using 'adverb' functions `purrr::safely()` and `purrr::possibly()`
In the next two lectures, we are going to learn more about `r glossary("iteration")` (doing the same commands over and over) and custom `r glossary("function", "functions")` through a data simulation exercise, which will also lead us into more traditional statistical topics.
## Setup {#setup-func}
1. Open your `reprores-class-notes` project
1. Create a new R Markdown file called `08-func.Rmd`
1. Update the YAML header
1. Replace the setup chunk with the one below:
```{r setup-func, results = 'hide', warning = FALSE, message = FALSE, verbatim = "r setup, include = FALSE"}
knitr::opts_chunk$set(echo = TRUE)
# packages needed for this chapter
library(tidyverse) # loads purrr for iteration
library(broom) # converts test output to tidy tables
set.seed(8675309) # makes sure random numbers are reproducible
```
Download the [Apply functions with purrr cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/purrr.pdf).
## Iteration functions {#iteration-functions}
We first learned about the two basic iteration functions, `rep()` and `seq()` in the [Working with Data](#rep_seq) chapter.
### rep()
The function `rep()` lets you repeat the first argument a number of times.
Use `rep()` to create a vector of alternating `"A"` and `"B"` values of length 24.
```{r rep1}
rep(c("A", "B"), 12)
```
If you don't specify what the second argument is, it defaults to `times`, repeating the vector in the first argument that many times. Make the same vector as above, setting the second argument explicitly.
```{r rep1-times}
rep(c("A", "B"), times = 12)
```
If the second argument is a vector that is the same length as the first argument, each element in the first vector is repeated than many times. Use `rep()` to create a vector of 11 `"A"` values followed by 3 `"B"` values.
```{r rep-vector}
rep(c("A", "B"), c(11, 3))
```
You can repeat each element of the vector a sepcified number of times using the `each` argument, Use `rep()` to create a vector of 12 `"A"` values followed by 12 `"B"` values.
```{r rep-each}
rep(c("A", "B"), each = 12)
```
What do you think will happen if you set both `times` to 3 and `each` to 2?
```{r rep-times-each}
rep(c("A", "B"), times = 3, each = 2)
```
### seq()
The function `seq()` is useful for generating a sequence of numbers with some pattern.
Use `seq()` to create a vector of the integers 0 to 10.
```{r seq1-10}
seq(0, 10)
```
You can set the `by` argument to count by numbers other than 1 (the default). Use `seq()` to create a vector of the numbers 0 to 100 by 10s.
```{r seq-by}
seq(0, 100, by = 10)
```
The argument `length.out` is useful if you know how many steps you want to divide something into. Use `seq()` to create a vector that starts with 0, ends with 100, and has 12 equally spaced steps (hint: how many numbers would be in a vector with 2 *steps*?).
```{r seq-length-out}
seq(0, 100, length.out = 13)
```
### replicate()
You can use the `replicate()` function to run a function `n` times.
For example, you can get 3 sets of 5 numbers from a random normal distribution by setting `n` to `3` and `expr` to `rnorm(5)`.
```{r replicate1}
replicate(n = 3, expr = rnorm(5))
```
By default, `replicate()` simplifies your result into a `r glossary("matrix")` that is easy to convert into a table if your function returns vectors that are the same length. If you'd rather have a list of vectors, set `simplify = FALSE`.
```{r repliacte2}
replicate(n = 3, expr = rnorm(5), simplify = FALSE)
```
### map() and apply() functions {#map-apply}
`purrr::map()` and `lapply()` return a list of the same length as a vector or list, each element of which is the result of applying a function to the corresponding element. They function much the same, but purrr functions have some optimisations for working with the tidyverse. We'll be working mostly with purrr functions in this course, but apply functions are very common in code that you might see in examples on the web.
Imagine you want to calculate the power for a two-sample t-test with a mean difference of 0.2 and SD of 1, for all the sample sizes 100 to 1000 (by 100s). You could run the `power.t.test()` function 20 times and extract the values for "power" from the resulting list and put it in a table.
```{r}
p100 <- power.t.test(n = 100, delta = 0.2, sd = 1, type="two.sample")
# 18 more lines
p1000 <- power.t.test(n = 500, delta = 0.2, sd = 1, type="two.sample")
tibble(
n = c(100, "...", 1000),
power = c(p100$power, "...", p1000$power)
)
```
However, the `apply()` and `map()` functions allow you to perform a function on each item in a vector or list. First make an object `n` that is the vector of the sample sizes you want to test, then use `lapply()` or `map()` to run the function `power.t.test()` on each item. You can set other arguments to `power.t.test()` after the function argument.
```{r}
n <- seq(100, 1000, 100)
pcalc <- lapply(n, power.t.test,
delta = 0.2, sd = 1, type="two.sample")
# or
pcalc <- purrr::map(n, power.t.test,
delta = 0.2, sd = 1, type="two.sample")
```
These functions return a list where each item is the result of `power.t.test()`, which returns a list of results that includes the named item "power". This is a special list that has a summary format if you just print it directly:
```{r}
pcalc[[1]]
```
But you can see the individual items using the `str()` function.
```{r}
pcalc[[1]] |> str()
```
`sapply()` is a version of `lapply()` that returns a vector or array instead of a list, where appropriate. The corresponding purrr functions are `map_dbl()`, `map_chr()`, `map_int()` and `map_lgl()`, which return vectors with the corresponding `r glossary("data type")`.
You can extract a value from a list with the function `[[`. You usually see this written as `pcalc[[1]]`, but if you put it inside backticks, you can use it in apply and map functions.
```{r sapply1}
sapply(pcalc, `[[`, "power")
```
We use `map_dbl()` here because the value for "power" is a `r glossary("double")`.
```{r map-dbl}
purrr::map_dbl(pcalc, `[[`, "power")
```
We can use the `map()` functions inside a `mutate()` function to run the `power.t.test()` function on the value of `n` from each row of a table, then extract the value for "power", and delete the column with the power calculations.
```{r purrr-mypower}
mypower <- tibble(
n = seq(100, 1000, 100)) |>
mutate(pcalc = purrr::map(n, power.t.test,
delta = 0.2,
sd = 1,
type="two.sample"),
power = purrr::map_dbl(pcalc, `[[`, "power")) |>
select(-pcalc)
```
```{r purrr-plot, echo = FALSE, fig.cap="Power for a two-sample t-test with d = 0.2"}
ggplot(mypower, aes(n, power)) +
geom_hline(yintercept = 0.8, color = "red") +
geom_smooth(method = "loess", formula = y~x, se = FALSE) +
geom_point() +
ylim(0, 1) + xlim(0, 1000)
```
## Custom functions {#custom-functions}
In addition to the built-in functions and functions you can access from packages, you can also write your own functions (and eventually even packages!).
### Structuring a function {#structure-function}
The general structure of a function is as follows:
```{r}
function_name <- function(my_args) {
# process the arguments
# return some value
}
```
Here is a very simple function. Can you guess what it does?
```{r}
add1 <- function(my_number) {
my_number + 1
}
add1(10)
```
Let's make a function that reports p-values in APA format (with "p = [rounded value]" when p >= .001 and "p < .001" when p < .001).
First, we have to name the function. You can name it anything, but try not to duplicate existing functions or you will overwrite them. For example, if you call your function `rep`, then you will need to use `base::rep()` to access the normal `rep` function. Let's call our p-value function `report_p` and set up the framework of the function.
```{r}
report_p <- function() {
}
```
### Arguments {#arguments}
We need to add one `r glossary("argument")`, the p-value you want to report. The names you choose for the arguments are private to that argument, so it is not a problem if they conflict with other variables in your script. You put the arguments in the parentheses of `function()` in the order you want them to default (just like the built-in functions you've used before).
```{r}
report_p <- function(p) {
}
```
### Argument defaults {#defaults}
You can add a default value to any argument. If that argument is skipped, then the function uses the default argument. It probably doesn't make sense to run this function without specifying the p-value, but we can add a second argument called `digits` that defaults to 3, so we can round p-values to any number of digits.
```{r}
report_p <- function(p, digits = 3) {
}
```
Now we need to write some code inside the function to process the input arguments and turn them into a **return**ed output. Put the output as the last item in function.
```{r}
report_p <- function(p, digits = 3) {
if (p < .001) {
reported = "p < .001"
} else {
roundp <- round(p, digits)
reported = paste("p =", roundp)
}
reported
}
```
You might also see the returned output inside of the `return()` function. This does the same thing.
```{r}
report_p <- function(p, digits = 3) {
if (p < .001) {
reported = "p < .001"
} else {
roundp <- round(p, digits)
reported = paste("p =", roundp)
}
return(reported)
}
```
When you run the code defining your function, it doesn't output anything, but makes a new object in the Environment tab under **`Functions`**. Now you can run the function.
```{r}
report_p(0.04869)
report_p(0.0000023)
```
### Scope {#scope}
What happens in a function stays in a function. You can change the value of a variable passed to a function, but that won't change the value of the variable outside of the function, even if that variable has the same name as the one in the function.
```{r}
reported <- "not changed"
# inside this function, reported == "p = 0.002"
report_p(0.0023)
reported # still "not changed"
```
### Warnings and errors {#warnings-errors}
::: {.try data-latex=""}
What happens when you omit the argument for `p`? Or if you set `p` to 1.5 or "a"?</p>
:::
You might want to add a more specific warning and stop running the function code if someone enters a value that isn't a number. You can do this with the `stop()` function.
If someone enters a number that isn't possible for a p-value (0-1), you might want to warn them that this is probably not what they intended, but still continue with the function. You can do this with `warning()`.
```{r}
report_p <- function(p, digits = 3) {
if (!is.numeric(p)) stop("p must be a number")
if (p <= 0) warning("p-values are normally greater than 0")
if (p >= 1) warning("p-values are normally less than 1")
if (p < .001) {
reported = "p < .001"
} else {
roundp <- round(p, digits)
reported = paste("p =", roundp)
}
reported
}
```
```{r error-warn, error=TRUE, warning=TRUE}
report_p()
report_p("a")
report_p(-2)
report_p(2)
```
## Iterating your own functions
### Build code
First, let's build up the code that we want to iterate.
#### Simulate and structure data
Create a vector of 20 random numbers drawn from a normal distribution with a mean of 5 and standard deviation of 1 using the `rnorm()` function and store them in the variable `A`.
```{r}
A <- rnorm(20, mean = 5, sd = 1)
```
A `tibble` is a type of table or `data.frame`. The function `tibble::tibble()` creates a tibble with a column for each argument. Each argument takes the form `column_name = data_vector`.
Create a table called `dat` including two vectors: `A` that is a vector of 20 random normally distributed numbers with a mean of 5 and SD of 1, and `B` that is a vector of 20 random normally distributed numbers with a mean of 5.5 and SD of 1.
```{r}
dat <- tibble(
A = rnorm(20, 5, 1),
B = rnorm(20, 5.5, 1)
)
```
#### Statistical test
You can run a Welch two-sample t-test by including the two samples you made as the first two arguments to the function `t.test`. You can reference one column of a table by its names using the format `table_name$column_name`
```{r}
t.test(dat$A, dat$B)
```
You can also convert the table to long format using the `gather` function and specify the t-test using the format `dv_column~grouping_column`.
```{r}
longdat <- gather(dat, group, score, A:B)
t.test(score~group, data = longdat)
```
#### Tidy output
You can use the function `broom::tidy()` to extract the data from a statistical test in a table format. The example below pipes everything together.
```{r}
tibble(
A = rnorm(20, 5, 1),
B = rnorm(20, 5.5, 1)
) |>
gather(group, score, A:B) |>
t.test(score~group, data = _) |>
broom::tidy()
```
::: {.info data-latex=""}
In the pipeline above, `t.test(score~group, data = _)` uses the `_` notation to change the location of the piped-in data table from it's default position as the first argument to a different position.
:::
#### Extract important values
Finally, we can extract a single value from this results table using `pull()`.
```{r}
tibble(
A = rnorm(20, 5, 1),
B = rnorm(20, 5.5, 1)
) |>
gather(group, score, A:B) |>
t.test(score~group, data = _) |>
broom::tidy() |>
pull(p.value)
```
### Custom function
Next, we can group the code above inside a function.
First, name your function `t_sim` and wrap the code above in a function with no arguments.
```{r}
t_sim <- function() {
tibble(
A = rnorm(20, 5, 1),
B = rnorm(20, 5.5, 1)
) |>
gather(group, score, A:B) |>
t.test(score~group, data = _) |>
broom::tidy() |>
pull(p.value)
}
```
Run it a few times to see what happens.
```{r}
t_sim()
```
#### Iterate
Let's run the `t_sim` function 1000 times, assign the resulting p-values to a vector called `reps`, and check what proportion of p-values are lower than alpha (e.g., .05). This number is the power for this analysis.
```{r}
reps <- replicate(1000, t_sim())
alpha <- .05
power <- mean(reps < alpha)
power
```
#### Set seed {#seed}
You can use the `set.seed` function before you run a function that uses random numbers to make sure that you get the same random data back each time. You can use any integer you like as the seed.
```{r set-seed-90210}
set.seed(90201)
```
::: {.warning data-latex=""}
Make sure you don't ever use `set.seed()` **inside** of a simulation function, or you will just simulate the exact same data over and over again.
:::
```{r img-seed-alignment, echo=FALSE, fig.cap="@KellyBodwin"}
knitr::include_graphics("images/memes/seed_alignment.png")
```
#### Add arguments
You can just edit your function each time you want to calculate power for a different sample n, but it is more efficient to build this into your function as an arguments. Redefine `t_sim`, setting arguments for the mean and SD of group A, the mean and SD of group B, and the number of subjects per group. Give them all default values.
```{r}
t_sim <- function(n = 10, m1=0, sd1=1, m2=0, sd2=1) {
tibble(
A = rnorm(n, m1, sd1),
B = rnorm(n, m2, sd2)
) |>
gather(group, score, A:B) |>
t.test(score~group, data = _) |>
broom::tidy() |>
pull(p.value)
}
```
### Test your function
Test your function with some different values to see if the results make sense.
```{r}
t_sim(100)
t_sim(100, 0, 1, 0.5, 1)
```
Use `replicate` to calculate power for 100 subjects/group with an effect size of 0.2 (e.g., A: m = 0, SD = 1; B: m = 0.2, SD = 1). Use 1000 replications.
```{r}
reps <- replicate(1000, t_sim(100, 0, 1, 0.2, 1))
power <- mean(reps < .05)
power
```
Compare this to power calculated from the `power.t.test` function.
```{r}
power.t.test(n = 100, delta = 0.2, sd = 1, type="two.sample")
```
::: {.try data-latex=""}
Calculate power via simulation and `power.t.test` for the following tests:
* 20 subjects/group, A: m = 0, SD = 1; B: m = 0.2, SD = 1
* 40 subjects/group, A: m = 0, SD = 1; B: m = 0.2, SD = 1
* 20 subjects/group, A: m = 10, SD = 1; B: m = 12, SD = 1.5
:::
## Glossary {#glossary-func}
`r glossary_table()`
## Further Resources {#resources-func}
* Chapters 19 and 21 of [R for Data Science](http://r4ds.had.co.nz)
* [Apply functions with purrr cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/purrr.pdf)