Generate data from different distributions via pmap()
Jenny Bryan 2018-05-08
Uniform[min, max] via runif()
CONSIDER:
runif(n, min = 0, max = 1)
Want to do this for several triples of (n, min, max).
Store each triple as a row in a data frame.
Now iterate over the rows.
library(tidyverse)
Notice how df’s variable names are the same as runif’s argument names. Do this when you can!
df <- tribble(
~ n, ~ min, ~ max,
1L, 0, 1,
2L, 10, 100,
3L, 100, 1000
)
df
#> # A tibble: 3 x 3
#> n min max
#> <int> <dbl> <dbl>
#> 1 1 0 1
#> 2 2 10 100
#> 3 3 100 1000
Set the seed to make the random draws reproducible.
Practice on single rows.
set.seed(123)
(x <- df[1, ])
#> # A tibble: 1 x 3
#> n min max
#> <int> <dbl> <dbl>
#> 1 1 0 1
runif(n = x$n, min = x$min, max = x$max)
#> [1] 0.2875775
x <- df[2, ]
runif(n = x$n, min = x$min, max = x$max)
#> [1] 80.94746 46.80792
x <- df[3, ]
runif(n = x$n, min = x$min, max = x$max)
#> [1] 894.7157 946.4206 141.0008
Think out loud in pseudo-code.
## x <- df[i, ]
## runif(n = x$n, min = x$min, max = x$max)
## runif(n = df$n[i], min = df$min[i], max = df$max[i])
## runif with all args from the i-th row of df
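The pseudo-code above can be written out as an explicit loop. This is only a sketch in base R, assuming a plain data.frame copy of df and a made-up result list called out. It shows what we want pmap() to do for us, not how pmap() is implemented.

```r
# Plain data.frame stand-in for df, so the block runs without the tidyverse
df <- data.frame(n = 1:3, min = c(0, 10, 100), max = c(1, 100, 1000))

set.seed(123)
out <- vector("list", nrow(df))  # one slot per row (hypothetical name)
for (i in seq_len(nrow(df))) {
  # runif with all args taken from the i-th row of df
  out[[i]] <- runif(n = df$n[i], min = df$min[i], max = df$max[i])
}

# One vector per row, with as many draws as that row's n
lengths(out)
#> [1] 1 2 3
```

Because the seed and the order of the runif() calls match the single-row practice above, the draws in out should be the same numbers.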
Just. Do. It. with pmap().
set.seed(123)
pmap(df, runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
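Something worth checking for yourself: because df is a named list, pmap() hands each column to runif() as a named argument, so the order of the columns should not matter. A small sketch, assuming a plain data.frame copy of df and the made-up names by_name and reordered:

```r
library(purrr)

# Plain data.frame stand-in for df
df <- data.frame(n = 1:3, min = c(0, 10, 100), max = c(1, 100, 1000))

set.seed(123)
by_name <- pmap(df, runif)

# Same columns in a different order; matching is by name, not position
set.seed(123)
reordered <- pmap(df[c("max", "n", "min")], runif)

identical(by_name, reordered)
#> [1] TRUE
```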
Finessing variable and argument names
Q: What if you can’t arrange it so that variable names and arg names are the same?
foofy <- tibble(
alpha = 1:3, ## was: n
beta = c(0, 10, 100), ## was: min
gamma = c(1, 100, 1000) ## was: max
)
foofy
#> # A tibble: 3 x 3
#> alpha beta gamma
#> <int> <dbl> <dbl>
#> 1 1 0 1
#> 2 2 10 100
#> 3 3 100 1000
A: Rename the variables on-the-fly, on the way in.
set.seed(123)
foofy %>%
rename(n = alpha, min = beta, max = gamma) %>%
pmap(runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
A: Write a wrapper around runif() to say how df vars <--> runif args.
## wrapper option #1:
## ARGNAME = l$VARNAME
my_runif <- function(...) {
l <- list(...)
runif(n = l$alpha, min = l$beta, max = l$gamma)
}
set.seed(123)
pmap(foofy, my_runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
## wrapper option #2:
my_runif <- function(alpha, beta, gamma, ...) {
runif(n = alpha, min = beta, max = gamma)
}
set.seed(123)
pmap(foofy, my_runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
You can use ..i to refer to input by position.
set.seed(123)
pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3))
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
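Use this with extreme caution. Easy to shoot yourself in the foot: positions follow column order, so reordering or adding a column silently changes which variable feeds which argument.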
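To make the risk of positional references concrete, here is a sketch in which the columns of foofy arrive in a different order. It assumes a plain data.frame copy of foofy; shuffled and res are made-up names. Nothing errors, but beta is now used as n, so the result has the wrong shape.

```r
library(purrr)

# Plain data.frame stand-in for foofy
foofy <- data.frame(alpha = 1:3, beta = c(0, 10, 100), gamma = c(1, 100, 1000))

# Same data, different column order
shuffled <- foofy[c("beta", "alpha", "gamma")]

set.seed(123)
# ..1 is now beta, ..2 is alpha, ..3 is gamma
res <- pmap(shuffled, ~ runif(n = ..1, min = ..2, max = ..3))

# We wanted 1, 2, and 3 draws; we got 0, 10, and 100, with no error or warning
lengths(res)
#> [1]   0  10 100
```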
Extra variables in the data frame
What if data frame includes variables that should not be passed to .f()?
df_oops <- tibble(
n = 1:3,
min = c(0, 10, 100),
max = c(1, 100, 1000),
oops = c("please", "ignore", "me")
)
df_oops
#> # A tibble: 3 x 4
#> n min max oops
#> <int> <dbl> <dbl> <chr>
#> 1 1 0 1 please
#> 2 2 10 100 ignore
#> 3 3 100 1000 me
This will not work!
set.seed(123)
pmap(df_oops, runif)
#> Error in .f(n = .l[[c(1L, i)]], min = .l[[c(2L, i)]], max = .l[[c(3L, : unused argument (oops = .l[[c(4, i)]])
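The error comes from argument matching: runif() has no argument called oops and no ... to absorb it. One way to see in advance which columns will cause trouble is sketched below in base R, with a plain data.frame stand-in for df_oops (extra is a made-up name).

```r
# Plain data.frame stand-in for df_oops
df_oops <- data.frame(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000),
  oops = c("please", "ignore", "me")
)

# Formal arguments that runif() accepts
names(formals(runif))
#> [1] "n"   "min" "max"

# Columns that have no matching argument
extra <- setdiff(names(df_oops), names(formals(runif)))
extra
#> [1] "oops"
```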
A: Use dplyr::select() to limit the variables passed to pmap().
set.seed(123)
df_oops %>%
select(n, min, max) %>% ## if it's easier to say what to keep
pmap(runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
set.seed(123)
df_oops %>%
select(-oops) %>% ## if it's easier to say what to omit
pmap(runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
A: Use a custom wrapper and absorb extra variables with the ... argument.
my_runif <- function(n, min, max, ...) runif(n, min, max)
set.seed(123)
pmap(df_oops, my_runif)
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
Add the generated data to the data frame as a list-column
set.seed(123)
(df_aug <- df %>%
mutate(data = pmap(., runif)))
#> # A tibble: 3 x 4
#> n min max data
#> <int> <dbl> <dbl> <list>
#> 1 1 0 1 <dbl [1]>
#> 2 2 10 100 <dbl [2]>
#> 3 3 100 1000 <dbl [3]>
#View(df_aug)
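A lighter alternative to View() for checking the list-column is to compare the length of each element with the n that requested it. A sketch, assuming the same df and df_aug as above, rebuilt here so the block runs on its own:

```r
library(tidyverse)

df <- tribble(
  ~n, ~min, ~max,
  1L,    0,    1,
  2L,   10,  100,
  3L,  100, 1000
)

set.seed(123)
df_aug <- df %>%
  mutate(data = pmap(., runif))

# Each element of the list-column is a numeric vector of length n
lengths(df_aug$data)
#> [1] 1 2 3
identical(lengths(df_aug$data), df_aug$n)
#> [1] TRUE

# Pull out a single element to look at the actual draws
df_aug$data[[3]]
```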
What about computing within a data frame, in the presence of the complications discussed above? Use list() in the place of the . placeholder above to select the target variables and, if necessary, map variable names to argument names. Thanks @hadley for sharing this trick.
How to address variable names != argument names:
foofy <- tibble(
alpha = 1:3, ## was: n
beta = c(0, 10, 100), ## was: min
gamma = c(1, 100, 1000) ## was: max
)
set.seed(123)
foofy %>%
mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif))
#> # A tibble: 3 x 4
#> alpha beta gamma data
#> <int> <dbl> <dbl> <list>
#> 1 1 0 1 <dbl [1]>
#> 2 2 10 100 <dbl [2]>
#> 3 3 100 1000 <dbl [3]>
How to address presence of ‘extra variables’ with either an inclusion or exclusion mentality
df_oops <- tibble(
n = 1:3,
min = c(0, 10, 100),
max = c(1, 100, 1000),
oops = c("please", "ignore", "me")
)
set.seed(123)
df_oops %>%
mutate(data = pmap(list(n, min, max), runif))
#> # A tibble: 3 x 5
#> n min max oops data
#> <int> <dbl> <dbl> <chr> <list>
#> 1 1 0 1 please <dbl [1]>
#> 2 2 10 100 ignore <dbl [2]>
#> 3 3 100 1000 me <dbl [3]>
df_oops %>%
mutate(data = pmap(select(., -oops), runif))
#> # A tibble: 3 x 5
#> n min max oops data
#> <int> <dbl> <dbl> <chr> <list>
#> 1 1 0 1 please <dbl [1]>
#> 2 2 10 100 ignore <dbl [2]>
#> 3 3 100 1000 me <dbl [3]>
Review
What have we done?
- Arranged inputs as rows in a data frame
- Used pmap() to implement a loop over the rows.
- Used dplyr verbs rename() and select() to manipulate data on the way into pmap().
- Wrote custom wrappers around runif() to deal with:
  - df var names != .f() arg names
  - df vars that aren’t formal args of .f()
- Demonstrated all of the above when working inside a data frame and adding generated data as a list-column