/
pivot.Rmd
560 lines (409 loc) · 20.6 KB
/
pivot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
---
title: "Pivoting"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Pivoting}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
options(tibble.print_max = 10)
```
# Introduction
This vignette describes the use of the new `pivot_longer()` and `pivot_wider()` functions. Their goal is to improve the usability of `gather()` and `spread()`, and incorporate state-of-the-art features found in other packages.
For some time, it's been obvious that there is something fundamentally wrong with the design of `spread()` and `gather()`. Many people don't find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:
* `pivot_longer()` can work with multiple value variables that may have
different types, inspired by the enhanced `melt()` and `dcast()`
functions provided by the [data.table][data.table] package by Matt Dowle
and Arun Srinivasan.
* `pivot_longer()` and `pivot_wider()` can take a data frame that specifies
precisely how metadata stored in column names becomes data variables (and
vice versa), inspired by the [cdata][cdata] package by John Mount and
Nina Zumel.
In this vignette, you'll learn the key ideas behind `pivot_longer()` and `pivot_wider()` as you see them used to solve a variety of data reshaping challenges ranging from simple to complex.
To begin we'll load some needed packages. In real analysis code, I'd imagine you'd do with the `library(tidyverse)`, but I can't do that here since this vignette is embedded in a package.
```{r setup, message = FALSE}
library(tidyr)
library(dplyr)
library(readr)
```
# Longer
`pivot_longer()` makes datasets __longer__ by increasing the number of rows and decreasing the number of columns. I don't believe it makes sense to describe a dataset as being in "long form". Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.
`pivot_longer()` is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis. The following sections show how to use `pivot_longer()` for a wide range of realistic datasets.
## String data in column names {#pew}
The `relig_income` dataset stores counts based on a survey which (among other things) asked people about their religion and annual income:
```{r}
relig_income
```
This dataset contains three variables:
* `religion`, stored in the rows,
* `income` spread across the column names, and
* `count` stored in the cell values.
To tidy it we use `pivot_longer()`:
```{r}
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")
```
* The first argument is the dataset to reshape, `relig_income`.
* The second argument describes which columns need to be reshaped. In this
case, it's every column apart from `religion`.
* The `names_to` gives the name of the variable that will be created from
the data stored in the column names, i.e. `income`.
* The `values_to` gives the name of the variable that will be created from
the data stored in the cell value, i.e. `count`.
Neither the `names_to` nor the `values_to` column exists in `relig_income`, so we provide them as character strings surrounded in quotes.
## Numeric data in column names {#billboard}
The `billboard` dataset records the billboard rank of songs in the year 2000. It has a form similar to the `relig_income` data, but the data encoded in the column names is really a number, not a string.
```{r}
billboard
```
We can start with the same basic specification as for the `relig_income` dataset. Here we want the names to become a variable called `week`, and the values to become a variable called `rank`. I also use `values_drop_na` to drop rows that correspond to missing values. Not every song stays in the charts for all 76 weeks, so the structure of the input data force the creation of unnessary explicit `NA`s.
```{r}
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
```
## Many variables in column names
A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the `who` dataset:
```{r}
who
```
`country`, `iso2`, `iso3`, and `year` are already variables, so they can be left as is. But the columns from `new_sp_m014` to `newrel_f65` encode four variables in their names:
* The `new_`/`new` prefix indicates these are counts of new cases. This
dataset only contains new cases, so we'll ignore it here because it's
constant.
* `sp`/`rel`/`sp`/`ep` describe how the case was diagnosed.
* `m`/`f` gives the gender.
* `014`/`1524`/`2535`/`3544`/`4554`/`65` supplies the age range.
We can break these variables up by specifying multiple column names in `names_to`, and then either providing `names_sep` or `names_pattern`. Here `names_pattern` is the most natural fit. It has a similar interface to `extract`: you give it a regular expression containing groups (defined by `()`) and it puts each group in a column.
```{r}
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)
```
We could go one step further and specify the types of the `gender` and `age` columns. I think this is good practice when you have categorical variables with a known set of values.
```{r}
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_ptypes = list(
gender = factor(levels = c("f", "m")),
age = factor(
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
```
## Multiple observations per row
So far, we have been working with data frames that have one observation per row, but many important pivotting problems involve multiple observations per row. You can usually recognise this case because name of the column that you want to appear in the output is part of the column name in the input. In this section, you'll learn how to pivot this sort of data.
The following example is adapted from the [data.table vignette](https://CRAN.R-project.org/package=data.table/vignettes/datatable-reshape.html), as inspiration for tidyr's solution to this problem.
```{r}
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family
```
Note that we have two pieces of information (or values) for each child: their `gender` and their `dob` (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to `names_to`, using `names_sep` to split up each variable name. Note the special name `.value`: this tells `pivot_longer()` that that part of the column name specifies the "value" being measured (which will become a variable in the output).
```{r}
family %>%
pivot_longer(
-family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
```
Note the use of `values_drop_na = TRUE`: the input shape forces the creation of explicit missing variables for observations that don't exist.
This problem also exists in the `anscombe` dataset built in to base R:
```{r}
anscombe
```
This dataset contains four pairs of variables (`x1` and `y1`, `x2` and `y2`, etc) that underlie Anscombe's quartet, a collection of four datasets that have the same summary statistics (mean, sd, correlation etc), but have quite different data. We want to produce a dataset with columns `set`, `x` and `y`.
```{r}
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
) %>%
arrange(set)
```
A similar situation can arise with panel data. For example, take this example dataset provided by [Thomas Leeper](http://github.com/leeper/rio/issues/193). We can tidy it using the same approach as for `anscombe`:
```{r}
pnl <- tibble(
x = 1:4,
a = c(1, 1,0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
pnl %>%
pivot_longer(
-c(x, a, b),
names_to = c(".value", "time"),
names_pattern = "(.)(.)"
)
```
## Duplicated column names
Occassionally you will come across datasets that have duplicated column names. Generally, such datasets are hard to work with in R, because when you refer to a column by name it only finds the first match. To create a tibble with duplicated names, you have to explicitly opt out of the name repair that usually prevents you from creating such a dataset:
```{r}
df <- tibble(x = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal")
df
```
When `pivot_longer()` encounters such data, it automatically adds another column to the output:
```{r}
df %>% pivot_longer(-x, names_to = "name", values_to = "value")
```
# Wider
`pivot_wider()` is the opposite of `pivot_longer()`: it makes a dataset __wider__ by increasing the number of columns and decreasing the number of rows. It's relatively rare to need `pivot_wider()` to make tidy data, but it's often useful for creating summary tables for presentation, or data in a format needed by other tools.
## Capture-recapture data
The `fish_encounters` dataset, contributed by [Myfanwy Johnston](https://fishsciences.github.io/post/visualizing-fish-encounter-histories/), describes when fish swimming down a river are detected by automatic monitoring stations:
```{r}
fish_encounters
```
Many tools used to analyse this data need it in a form where each station is a column:
```{r}
fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
```
This dataset only records when a fish was detected by the station - it doesn't record when it wasn't detected (this is common with this type of data). That means the output data is filled with `NA`s. However, in this case we know that the absence of a record means that the fish was not `seen`, so we can ask `pivot_wider()` to fill these missing values in with zeros:
```{r}
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
)
```
## Aggregation
You can also use `pivot_wider()` to perform simple aggregation. For example, take the `warpbreaks` dataset built in to base R (converted to a tibble for the better print method):
```{r}
warpbreaks <- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks)
warpbreaks
```
This is a designed experiment with nine replicates for every combination of `wool` (`A` and `B`) and `tension` (`L`, `M`, `H`):
```{r}
warpbreaks %>% count(wool, tension)
```
What happens if we attempt to pivot the levels of `wool` into the columns?
```{r}
warpbreaks %>% pivot_wider(names_from = wool, values_from = breaks)
```
We get a warning that each cell in the output corresponds to multiple cells in the input. The default behaviour produces list-columns, which contain all the individual values. A more useful output would be summary statistics, e.g. `mean` breaks for each combination of wool and tension:
```{r}
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = list(breaks = mean)
)
```
For more complex summary operations, I recommend summarising before reshaping, but for simple cases it's often convenient to summarise within `pivot_wider()`.
## Generate column name from multiple variables
Imagine, as in <https://stackoverflow.com/questions/24929954>, that we have information containing the combination of product, country, and year. In tidy form it might look like this:
```{r}
production <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
production
```
We want to widen the data so we have one column for each combination of `product` and `country`. The key is to specify multiple variables for `names_from`:
```{r}
production %>% pivot_wider(
names_from = c(product, country),
values_from = production
)
```
## Tidy census
The `us_rent_income` dataset contains information about median income and rent for each state in the US for 2017 (from the American Community Survey, retrieved with the [tidycensus][tidycensus] package).
```{r}
us_rent_income
```
Here both `estimate` and `moe` are values columns, so we can supply them to `values_from`:
```{r}
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
```
Note that the name of the variable is automatically appended to the output columns.
## Contact list
A final challenge is inspired by [Jiena Gu](https://github.com/jienagu/tidyverse_examples/blob/master/example_long_wide.R). Imagine you have a contact list that you've copied and pasted from a website:
```{r}
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
```
This is challenging because there's no variable that identifies which observations belong together. We can fix this by noting that every contact starts with a name, so we can create a unique id by counting every time we see "name" as the `field`:
```{r}
contacts <- contacts %>%
mutate(
person_id = cumsum(field == "name")
)
contacts
```
Now that we have a unique identifier for each person, we can pivot `field` and `value` into the columns:
```{r}
contacts %>%
pivot_wider(names_from = field, values_from = value)
```
# Longer, then wider
Some problems can't be solved by pivotting in a single direction. The examples in this section show how you might combine `pivot_longer()` and `pivot_wider()` to solve more complex problems.
## World bank
`world_bank_pop` contains data from the World Bank about population per country from 2000 to 2018.
```{r}
world_bank_pop
```
My goal is to produce a tidy dataset where each variable is in a column. It's not obvious exactly what steps are needed yet, but I'll start with the most obvious problem: year is spread across multiple columns.
```{r}
pop2 <- world_bank_pop %>%
pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
pop2
```
Next we need to consider the `indicator` variable:
```{r}
pop2 %>% count(indicator)
```
Here `SP.POP.GROW` is population growth, `SP.POP.TOTL` is total population, and `SP.URB.*` are the same but only for urban areas. Let's split this up into two variables: `area` (total or urban) and the actual variable (population or growth):
```{r}
pop3 <- pop2 %>%
separate(indicator, c(NA, "area", "variable"))
pop3
```
Now we can complete the tidying by pivoting `variable` and `value` to make `TOTL` and `GROW` columns:
```{r}
pop3 %>%
pivot_wider(names_from = variable, values_from = value)
```
## Multi-choice
Based on a suggestion by [Maxime Wack](https://github.com/MaximeWack), <https://github.com/tidyverse/tidyr/issues/384>), the final example shows how to deal with a common way of recording multiple choice data. Often you will get such data as follows:
```{r}
multi <- tribble(
~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
)
```
But the actual order isn't important, and you'd prefer to have the individual questions in the columns. You can achieve the desired transformation in two steps. First, you make the data longer, eliminating the explcit `NA`s, and adding a column to indicate that this choice was chosen:
```{r}
multi2 <- multi %>%
pivot_longer(-id, values_drop_na = TRUE) %>%
mutate(checked = TRUE)
multi2
```
Then you make the data wider, filling in the missing observations with `FALSE`:
```{r}
multi2 %>%
pivot_wider(
id_cols = id,
names_from = value,
values_from = checked,
values_fill = list(checked = FALSE)
)
```
# Manual specs
The arguments to `pivot_longer()` and `pivot_wider()` allow you to pivot a wide range of datasets. But the creativity that people apply to their data structures is seemingly endless, so it's quite possible that you will encounter a dataset that you can't immediately see how to reshape with `pivot_longer()` and `pivot_wider()`. To gain more control over pivotting, you can instead create a "spec" data frame that describes exactly how data stored in the column names becomes variables (and vice versa). This section introduces you to the spec data structure, and show you how to use it when `pivot_longer()` and `pivot_wider()` are insufficient.
## Longer
To see how this works, lets return to the simplest case of pivotting applied to the `relig_income` dataset. Now pivotting happens in two steps: we first create a spec object (using `build_longer_spec()`) then use that to describe the pivotting operation:
```{r}
spec <- relig_income %>% build_longer_spec(
cols = -religion,
names_to = "income",
values_to = "count"
)
pivot_longer_spec(relig_income, spec)
```
(This gives the same result as before, just with more code. There's no need to use it here, it is presented as a simple example for using `spec`.)
What does `spec` look like? It's a data frame with one row for each column, and two special columns that start with `.`:
* `.name` gives the name of the column.
* `.value` gives the name of the column that the values in the cells will
go into.
```{r}
spec
```
## Wider
Below we widen `us_rent_income` with `pivot_wider()`. The result is ok, but I think it could be improved:
```{r}
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
```
I think it would be better to have columns `income`, `rent`, `income_moe`, and `rent_moe`, which we can achieve with a manual spec. The current spec looks like this:
```{r}
spec1 <- us_rent_income %>%
build_wider_spec(names_from = variable, values_from = c(estimate, moe))
spec1
```
For this case, we mutate `spec` to carefully construct the column names:
```{r}
spec2 <- spec1 %>%
mutate(.name = paste0(variable, ifelse(.value == "moe", "_moe", "")))
spec2
```
Supplying this spec to `pivot_wider()` gives us the result we're looking for:
```{r}
pivot_wider_spec(us_rent_income, spec2)
```
## By hand
Sometimes it's not possible (or not convenient) to compute the spec, and instead it's more convenient to construct the spec "by hand". For example, take this `construction` data, which is lightly modified from Table 5 "completions" found at <https://www.census.gov/construction/nrc/index.html>:
```{r}
construction
```
This sort of data is not uncommon from government agencies: the column names actually belong to different variables, and here we have summaries for number of units (1, 2-4, 5+) and regions of the country (NE, NW, midwest, S, W). We can most easily describe that with a tibble:
```{r}
spec <- tribble(
~.name, ~.value, ~units, ~region,
"1 unit", "n", "1", NA,
"2 to 4 units", "n", "2-4", NA,
"5 units or more", "n", "5+", NA,
"Northeast", "n", NA, "Northeast",
"Midwest", "n", NA, "Midwest",
"South", "n", NA, "South",
"West", "n", NA, "West",
)
```
Which yields the following longer form:
```{r}
pivot_longer_spec(construction, spec)
```
Note that there is no overlap between the `units` and `region` variables; here the data would really be most naturally described in two independent tables.
## Theory
One neat property of the `spec` is that you need the same spec for `pivot_longer()` and `pivot_wider()`. This makes it very clear that the two operations are symmetric:
```{r}
construction %>%
pivot_longer_spec(spec) %>%
pivot_wider_spec(spec)
```
The pivotting spec allows us to be more precise about exactly how `pivot_longer(df, spec = spec)` changes the shape of `df`: it will have `nrow(df) * nrow(spec)` rows, and `ncol(df) - nrow(spec) + ncol(spec) - 2` columns.
[cdata]: https://winvector.github.io/cdata/
[data.table]: https://github.com/Rdatatable/data.table/wiki
[tidycensus]: https://walkerke.github.io/tidycensus