bk01_base-functions.Rmd

---
title: "Relationship to base and plyr functions"
comment: "*side-by-side workflow comparison*"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE)
```

## Why not base?

You need a way to iterate in R in a data-structure-informed way. What does that mean?

  * Iterate over elements of a list
  * Iterate over rows or columns of a 2-dimensional object  
  * Iterate over sub data frames induced by one or more factors
  * Iterate over tuples formed from the i-th element of several vectors of equal length

**All of this is absolutely possible with base R**, using `for()` loops or the "apply" functions, such as `apply()`, `[slvmt]apply()`, and `by()`. I know, because I did so religiously for over 15 years. It works fine!

Why might you do otherwise? As an instructor, I encounter lots of useRs who sheepishly say they know they should be using the apply functions, but they don't. Because they've never quite figured them out or been able to form the habit.

I have a theory about why. The user interface of the "apply"" functions is not as consistent as it could be, which slows down learning. The return objects frequently require further checking and massage to use downstream, which diminishes the payoff. In particular, there's a tendency to return a vector (atomic or otherwise) or array, instead of data frame, with the original factor levels appearing in a names attribute.

Regardless of your preference, hopefully the side-by-side examples below are a helpful tour of different ways to iterate. Even if you prefer purrr/tidyverse in some settings, you might want base methods when you don't want any dependencies.

## Why purrr?

purrr addresses some of the friction identified in the base functions for "split-apply-combine":

  * The `map()` family of functions is highly internally consistent, making it easier to transfer expertise from one function to another.
  * Greater encouragement for type-safe simplification to atomic vector or data frame, producing output that is more ready for the next step.
  * Concise syntax for defining anonymous functions.

We load purrr now. Below we also use example lists from [repurrrsive](https://github.com/jennybc/repurrrsive) which are [introduced and explored elsewhere](ls00_inspect-explore.html).

```{r}
library(purrr)
library(repurrrsive)
```

## Why not plyr?

If you have never heard of the [plyr package](https://CRAN.R-project.org/package=plyr), you can skip this section! Personally, I'm a bit sad that it's time to move on from plyr. But it is time.

Why? plyr is no longer under active development. The innovation is happening elsewhere, in purrr and the other packages in the tidyverse. Also, even I must admit that plyr can be rather slow. I wouldn't recommend writing new code that requires plyr.

In terms of understanding or translating plyr code, here's the minimum you need to know: The key feature of the main 3 * 4 = 12 functions in plyr is their explicit, memorable form.

```{r eval = FALSE}
XYply()
```

where `X` and `Y` specify the type of input and output, respectively. `X` and `Y` take values from here:

  * `a` = "array"
  * `d` = "data frame"
  * `l` = "list"
  * `_` = "nothing" (not eligible as `X` re: input, for obvious reasons)
  
The most useful function proved to be `ddply()`: data frame in, data frame out. This inspired the creation of the [dplyr package](https://dplyr.tidyverse.org). Most tasks suitable for `a*ply()` and, especially `l*ply()` are now handled by purrr's `map()` family. The functions called for their side effects, as opposed to return value, are those named `*_ply()`; this is handled by purrr's `walk()` functions. Marching through a data frame row by row is now a job for purrr's `pmap()`.

## `lapply()` vs. `purrr::map()`

These are the core mapping functions of base and purrr, respectively. They are "list in, list out". The main (only?) difference is access to purrr's shortcuts for indexing by name or position and for creating anonymous functions.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
lapply(got_chars[1:3],
       function(x) x[["name"]])
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map(got_chars[1:3], "name")
```
</div>
</div>

This is also `plyr::llply()`, although it offers some other arguments, such as `.progress` and `.parallel`.

```{r eval = FALSE}
plyr::llply(got_chars[1:3], function(x) x[["name"]])
```

## `sapply()` vs. ¯\\\_(ツ)\_/¯

`sapply()` is a base function that attempts to apply a reasonable simplification to the output of `lapply()`. It's handy for interactive use, but due to the unpredictability of it return value, it's unwise to use it in programming. There is no equivalent in purrr or plyr.

Problem demonstration: The characters in Game of Thrones can have aliases. Some have several, some have one, some have none. Depending on which characters I work with, the same `sapply()` call can return an object of entirely different class, i.e. a list or a character vector. This can be a source of mysterious bugs in code meant to run non-interactively.

```{r}
aliases1 <- sapply(got_chars[20:22], function(x) x[["aliases"]])
str(aliases1)
aliases2 <- sapply(got_chars[c(3, 22, 27)], function(x) x[["aliases"]])
str(aliases2)
```

With purrr, you would use `map()` to get a list back or `map_chr()` to get atomic character vector. If you use `map_chr()` when you should not, you'll get an informative error right away (shown below) and can adjust your approach accordingly.

```{r}
map_chr(got_chars[2:4], "aliases")
```

## `vapply()` vs. `map_*()`

Base `vapply()` requires you to specify a template for the return value and is described as a safer alternative to `sapply()`. The closest purrr functions are the type-specific mapping functions: `map_lgl()`, `map_int()`, `map_dbl()`, and `map_chr()` that are "list in, atomic vector out". Here's comparable use of `vapply()` and `map_chr()` to get some of the Game of Thrones characters' names.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
vapply(got_chars[1:3],
       function(x) x[["name"]],
       character(1))
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map_chr(got_chars[1:3], "name")
```
</div>
</div>

### `vapply()` always simplifies

What's not to love with `vapply()` then? It suffers from the `drop = TRUE` vs `FALSE` problem we have when requesting a single row or column from a 2-dimensional object. Except `vapply()` has no `drop` argument to control this behavior. It's an example of the base functions being more difficult to program around. The template allows you to specify the form of each individual result, but there is no way to specify the form -- such as the dimension -- of the *overall* result.

I adapt this example from my real life, where I have `vapply()` inside a function and `n` is an argument to that function, i.e. it varies. Here I simply define `n` in the global environment prior to the `vapply()` call. Note how `vapply()` returns a 2 dimensional object in the first case and atomic vector in the second. As it says in the docs: "Simplification is always done in `vapply`." Believe it.

```{r}
f <- function(x, n) rep(x, n)
n <- 3
vapply(c("a", "b"), f, character(n), n = n)
n <- 1
vapply(c("a", "b"), f, character(n), n = n)
```

## ¯\\\_(ツ)\_/¯ vs. `map_dfr()`

The `purrr::map_dfr()` function is "list in, data frame out" and there is no true base equivalent. Given the centrality of data frames for analysis, it is handy to have a function to produce them, without resorting to `do.call()`.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
l <- lapply(got_chars[23:25],
            `[`, c("name", "playedBy"))
mat <- do.call(rbind, l)
(df <- as.data.frame(mat, stringsAsFactors = FALSE))
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
map_dfr(got_chars[23:25],
        `[`, c("name", "playedBy"))
```
</div>
</div>

The base workflow above gets trickier if you're extracting elements of disparate type. At that point, it may make more sense to use `vapply()` repeatedly. For comparability, we'll show similar using purrr's type-specific mapping, which is also safer than relying on automatic type conversion from `map_dfr()`.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
data.frame(
  name = vapply(got_chars[23:25], `[[`,
                character(1), "name"),
  id = vapply(got_chars[23:25], `[[`,
              integer(1), "id"),
  stringsAsFactors = FALSE
)
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
tibble::tibble(
  name = map_chr(got_chars[23:25], "name"),
  id = map_int(got_chars[23:25], "id")
)
```
</div>
</div>

## `mapply()` vs. `map2()`, `pmap()`

When you need to iterate over 2 or more vectors/lists in parallel, the base option is `mapply()`. Unlike the other apply functions, the first argument is `FUN`, the function to apply, and the multiple vector inputs are provided "loose" via `...`.

For exactly two vector inputs, purrr has `map2()`, with all the usual type-specific variants. For an arbitrary number of vector inputs, use purrr `pmap()` or type-specific variants, with the inputs packaged in a list. A very handy special case is when the input is a data frame, in which case `pmap_*()` applies `.f` to each row.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
nms <- vapply(got_chars[16:18],
              `[[`, character(1), "name")
birth <- vapply(got_chars[16:18],
                `[[`, character(1), "born")
mapply(function(x, y) paste(x, "was born", y),
       nms, birth)
```
</div>
<div class = "col-md-6">
**purrr**
```{r}
nms <- got_chars[16:18] %>% 
  map_chr("name")
birth <- got_chars[16:18] %>% 
  map_chr("born")
map2_chr(nms, birth, ~ paste(.x, "was born", .y))

## and again, but with pmap()
df <- tibble::tibble(
  nms, 
  connector = "was born",
  birth
)
pmap_chr(df, paste)
```
</div>
</div>

## `aggregate()` vs. `dplyr::summarize()`

Consider a data frame, as opposed to a nested list. How do you split it into pieces, according to one or more factors, apply a function to the pieces, and combine the results?

Create a tiny excerpt of the Gapminder dataset that contains a bit of data for Canada and Germany. Load dplyr, now that we are more in the data frame world.

```{r message = FALSE}
library(dplyr)
library(gapminder)
(mini_gap <- gapminder %>%
    filter(country %in% c("Canada", "Germany"), year > 2000) %>% 
    droplevels())
```

Simple summary of a single variable for each country. Specifically, take the mean of life expectancy. In this case, the formula method of base `aggregate()` is quite nice, i.e. it returns a convenient data frame.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
aggregate(lifeExp ~ country, mini_gap, mean)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
mini_gap %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp))
```
</div>
</div>

Simple summaries of two variables for each country. We take the mean of life expectancy and of GDP per capita.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
## formula method
aggregate(cbind(lifeExp, gdpPercap) ~ country, mini_gap, mean)

## data.frame method
aggregate(mini_gap[c("lifeExp", "gdpPercap")], list(mini_gap$country), mean)

## tapply() more general but output less useful here (data frame?)

## returns named vector
tapply(mini_gap$lifeExp, mini_gap$country, mean)

## returns list
tapply(mini_gap$lifeExp, mini_gap$country, mean, simplify = FALSE)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
mini_gap %>% 
  group_by(country) %>% 
  summarize_at(vars(lifeExp, gdpPercap), mean)
```
</div>
</div>

Bivariate summary of two variables for each country. We compute the correlation of life expectancy and year, for the full gapminder dataset now. On the base side, we can no longer use `aggregate()` or `tapply()` and need to graduate to `by()`.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
## by() with simplification (the default)
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year))
head(by_obj)

## by() without simplification
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year),
   simplify = FALSE)
head(by_obj)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
gapminder %>% 
  group_by(country) %>% 
  summarize(cor = cor(lifeExp, year))
```
</div>
</div>

This is a good place to pause and glance back over the base vs tidyverse calls and results in this section. It shows that you can absolutely do these tasks either way, but the consistency of the code and return values is higher in the tidyverse.

## `by()` vs. `tidyr::nest()`

Fit a linear model of life expectancy against year. On the tidyverse side, we now create a nested data frame, with one meta-row per country. Therefore we load tidyr to get `nest()`. The data needed for each country's linear model is stored as a list-column of country-specific data frame.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
by_obj <- by(gapminder,
             gapminder$country, 
             function(df) lm(lifeExp ~ year, data = df))
str(by_obj[1:2], max.level = 1)
by_obj[[1]]
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
library(tidyr)
nested_df <- gapminder %>% 
  group_by(country, continent) %>% 
  nest() %>% 
  mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
str(nested_df$fit[1:2], max.level = 1)
nested_df$fit[[1]]
```
</div>
</div>

What if you want to inspect the fits for Oceania? On the base side, you've got some work to do because the country information lives only in the names and the continent information is not directly linked to the model fits at all. On the tidyverse side, where the fits live in a data frame that carries country and continent info, we can use our usual techniques for filtering rows based on the data.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
o_countries <- as.character(unique(gapminder$country[gapminder$continent == "Oceania"]))
by_obj[names(by_obj) %in% o_countries]
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
nested_df %>% 
  filter(continent == "Oceania") %>% 
  .$fit
```
</div>
</div>

Let's form a data frame with one row per country and variables for country, continent, estimated intercept, and estimated slope. I also want to guarantee that the country and continent factors have the same levels as they originally had in gapminder.

<div class = "row">
<div class = "col-md-6">
**base**
```{r}
coefs <- lapply(by_obj, coef)
coefs <- do.call(rbind, coefs)
coefs <- data.frame(
  country = I(rownames(coefs)),
  coefs
)
coefs$continent <- gapminder$continent[match(coefs$country, gapminder$country)]
coefs$continent <- factor(coefs$continent, levels = levels(gapminder$continent))
coefs$country <- factor(coefs$country, levels = levels(gapminder$country))
head(coefs)
```
</div>
<div class = "col-md-6">
**tidyverse**
```{r}
nested_df %>% 
  mutate(coefs = map(fit, coef),
         intercept = map_dbl(coefs, 1),
         slope = map_dbl(coefs, 2)) %>% 
  select(country, continent, intercept, slope)
```
</div>
</div>

This illustrates the post-processing that is often necessary in a base workflow. We need to restore the country info from the names of `by_obj`, use them to look up the continents in gapminder, and then restore the original factor levels for both country and continent. It illustrates the payoff for temporarily tolerating the `data` and `fit` list-columns in the nested data frame used in the tidyverse workflow. The country and continent factors remain intact and directly associated with the data and fits.