Update ex03

jennybc committed Apr 3, 2018
1 parent 6bfd877 commit c3f5c1ffd91caa144f78e4be89c9606ecb7a5e75
Showing with 64 additions and 47 deletions.
  1. +29 −17 ex03_row-wise-iteration-are-you-sure.R
  2. +35 −30 ex03_row-wise-iteration-are-you-sure.md
@@ -1,5 +1,5 @@
#' ---
#' title: "Are you absolutely SURE you need to iterate over rows?"
#' title: "Are you absolutely sure that you, personally, need to iterate over rows?"
#' author: "Jenny Bryan"
#' date: "`r format(Sys.Date())`"
#' output: github_document
@@ -35,38 +35,50 @@ new_df <- function() {
#' Sometimes it's easy to fixate on one (unfavorable) way of accomplishing
#' something, because it feels like a natural extension of a successful
#' small-scale experiment.

#' Start with a small example, row 1 of the data frame.
#'
#' Let's create a string from row 1 of the data frame.
df <- new_df()
paste(df$name[1], "is", df$age[1], "years old")

#' I want to scale up, therefore I must ... loop over all rows!
#' I want to scale up, therefore I obviously must ... loop over all rows!
n <- nrow(df)
s <- vector(mode = "character", length = n)
for (i in seq_len(n)) {
  cat(paste(df$name[i], "is", df$age[i], "years old"), sep = "\n")
  s[i] <- paste(df$name[i], "is", df$age[i], "years old")
}
s

#' HOLD ON. What if I told you `paste()` is already vectorized over its
#' arguments?
paste(df$name, "is", df$age, "years old") %>% cat(sep = "\n")
paste(df$name, "is", df$age, "years old")

#' A surprising number of "iterate over rows" problems can be solved by
#' A surprising number of "iterate over rows" problems can be eliminated by
#' exploiting functions that are already vectorized and by making your own
#' functions vectorized over the primary argument. Writing a loop is not
#' necessarily bad, but it should always give you pause.
#' functions vectorized over the primary argument.
#'
#' Writing an explicit loop in your code is not necessarily bad, but it should
#' always give you pause. Has someone already written this loop for you? Ideally
#' in C or C++ and inside a package that's being regularly checked, with high
#' test coverage. That is usually the better choice.

# ----
#' ## Don't forget to work "inside the box"
#'

#' Even better: work with a natively vectorized function that knows about your
#' data frame!
library(glue)
#' For this string interpolation task, we can even work with a vectorized
#' function that is happy to do lookup inside a data frame. The [glue
#' package](https://glue.tidyverse.org) is doing the work under the hood here,
#' but its Greatest Functions are now re-exported by stringr, which we already
#' attached via `library(tidyverse)`.

glue_data(df, "{name} is {age} years old")
str_glue_data(df, "{name} is {age} years old")

#' You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`, because
#' the other variables in `df` are automatically available for use.

df %>%
  mutate(sentence = glue("{name} is {age} years old"))
  mutate(sentence = str_glue("{name} is {age} years old"))

#' The tidyverse style is to manage data holistically in a data frame and
#' provide a user interface that encourages self-explaining code with low
#' "syntactical noise".
#'
#' Q for team: Are there gotchas here? Has `glue()` always played this nicely
#' inside `mutate()`?
@@ -1,4 +1,4 @@
Are you absolutely SURE you need to iterate over rows?
Are you absolutely sure that you, personally, need to iterate over rows?
================
Jenny Bryan
2018-04-02
@@ -25,61 +25,69 @@ new_df <- function() {

Sometimes it’s easy to fixate on one (unfavorable) way of accomplishing
something, because it feels like a natural extension of a successful
small-scale experiment. Start with a small example, row 1 of the data
frame.
small-scale experiment.

Let’s create a string from row 1 of the data frame.

``` r
df <- new_df()
paste(df$name[1], "is", df$age[1], "years old")
#> [1] "Reed is 14 years old"
```

I want to scale up, therefore I must … loop over all rows\!
I want to scale up, therefore I obviously must … loop over all rows\!

``` r
n <- nrow(df)
s <- vector(mode = "character", length = n)
for (i in seq_len(n)) {
  cat(paste(df$name[i], "is", df$age[i], "years old"), sep = "\n")
  s[i] <- paste(df$name[i], "is", df$age[i], "years old")
}
#> Reed is 14 years old
#> Wesley is 12 years old
#> Eli is 12 years old
#> Toby is 1 years old
s
#> [1] "Reed is 14 years old" "Wesley is 12 years old"
#> [3] "Eli is 12 years old" "Toby is 1 years old"
```

HOLD ON. What if I told you `paste()` is already vectorized over its
arguments?

``` r
paste(df$name, "is", df$age, "years old") %>% cat(sep = "\n")
#> Reed is 14 years old
#> Wesley is 12 years old
#> Eli is 12 years old
#> Toby is 1 years old
paste(df$name, "is", df$age, "years old")
#> [1] "Reed is 14 years old" "Wesley is 12 years old"
#> [3] "Eli is 12 years old" "Toby is 1 years old"
```
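
As a minimal sketch of what "vectorized over its arguments" means here: `paste()` works element-wise and recycles the length-1 strings to match the longer vectors. The `people` data frame below is a hypothetical stand-in built in base R, not the `df` returned by `new_df()`.

``` r
# Hypothetical two-row stand-in for the example data (an assumption, not new_df()).
people <- data.frame(
  name = c("Reed", "Wesley"),
  age = c(14, 12),
  stringsAsFactors = FALSE
)

# "is" and "years old" have length 1, so paste() recycles them across both rows.
sentences <- paste(people$name, "is", people$age, "years old")

# One string per row, with no explicit loop.
length(sentences) == nrow(people)
```

The result has one element per row: "Reed is 14 years old" and "Wesley is 12 years old".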

A surprising number of “iterate over rows” problems can be solved by
A surprising number of “iterate over rows” problems can be eliminated by
exploiting functions that are already vectorized and by making your own
functions vectorized over the primary argument. Writing a loop is not
necessarily bad, but it should always give you pause. Even better: work
with a natively vectorized function that knows about your data frame\!
functions vectorized over the primary argument.

Writing an explicit loop in your code is not necessarily bad, but it
should always give you pause. Has someone already written this loop for
you? Ideally in C or C++ and inside a package that’s being regularly
checked, with high test coverage. That is usually the better choice.
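
One way to read "making your own functions vectorized over the primary argument" is sketched below. The helper `age_phrase()` is hypothetical and not part of the original example; it swaps a scalar `if` for the vectorized `ifelse()`, so it accepts a whole column at once.

``` r
# Hypothetical helper, vectorized over `age` (its primary argument).
age_phrase <- function(age) {
  # ifelse() decides element by element; a plain if () would only handle length 1.
  unit <- ifelse(age == 1, "year", "years")
  paste(age, unit, "old")
}

phrases <- age_phrase(c(14, 12, 12, 1))
phrases
```

Because every step inside the helper is itself vectorized, the caller never needs a row-wise loop, and the last element comes out as "1 year old" rather than "1 years old".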

## Don’t forget to work “inside the box”

For this string interpolation task, we can even work with a vectorized
function that is happy to do lookup inside a data frame. The [glue
package](https://glue.tidyverse.org) is doing the work under the hood
here, but its Greatest Functions are now re-exported by stringr, which
we already attached via `library(tidyverse)`.

``` r
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#>     collapse
glue_data(df, "{name} is {age} years old")
str_glue_data(df, "{name} is {age} years old")
#> Reed is 14 years old
#> Wesley is 12 years old
#> Eli is 12 years old
#> Toby is 1 years old
```
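
For comparison, base R's `with()` offers a similar "look up the names inside the data frame" style without any extra packages. This is only a rough analogue, not what glue does internally, and `people` is again a hypothetical stand-in for `df`.

``` r
# Hypothetical stand-in for the example data (an assumption, not new_df()).
people <- data.frame(
  name = c("Reed", "Toby"),
  age = c(14, 1),
  stringsAsFactors = FALSE
)

# with() evaluates the expression using the columns of `people` as variables,
# so `name` and `age` need no `people$` prefix.
inside_box <- with(people, paste(name, "is", age, "years old"))
inside_box
```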

You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`,
because the other variables in `df` are automatically available for use.

``` r
df %>%
  mutate(sentence = glue("{name} is {age} years old"))
  mutate(sentence = str_glue("{name} is {age} years old"))
#> # A tibble: 4 x 3
#> name age sentence
#> <chr> <dbl> <S3: glue>
@@ -92,6 +100,3 @@ df %>%
The tidyverse style is to manage data holistically in a data frame and
provide a user interface that encourages self-explaining code with low
“syntactical noise”.

Q for team: Are there gotchas here? Has `glue()` always played this
nicely inside `mutate()`?
