tidyr::gather() to tidy data
Joyce Robbins originally published: 1/31/2018, last update: 2/3/2018
- Example 1: City vs. Highway
- Heads up
- Example 2:
This tutorial focuses on one function:
tidyr::gather(), which is used to convert data from messy to tidy. I won't get into the distinction here -- I'm going to assume that you know the terms and just need help figuring how to get from A to B. If, however, this is all new to you, I suggest reading "Tidy Data" by Hadley Wickham, the seminal paper on the topic, which lays out the concept in detail.
Example 1: City vs. Highway
Before getting into the nitty-gritty of
gather() let's study what our messy data looks like, and what it will look like when it's tidied:
Study the differences carefully. What did you observe?
Hopefully you noticed the following:
idcolumn appears in both versions, though it's double the length in the tidy form.
We have two new columns:
hwycolumn names became the contents of the new
The values of the
hwycolumns became the values of the new
Now here's the trick: the only parts we need to concern ourselves with are 1. and 2.: deciding which columns to keep unchanged, and what names to give the new columns, which are called the key and value columns.
The columns to keep are the ones that are already tidy. The ones to dump are the ones that aren't true variables but in fact are levels of another variable. In this case,
hwy should be levels of a new variable called
roadtype (or something similar), according to the theory of tidy data. On the other hand,
id should remain as is.
The bottom line: keep
id as is
The second (and final) step is to choose names for the new key and value columns. We've already chosen
roadtype for key; let's pick
mpg for value. You can also choose to stick with the defaults for the new key and value column names:
value (surprise). Using
value may be helpful at first to keep track of which is which, though ultimately I find more descriptive names to be more useful.
The bottom line: key =
roadtype, value =
The code to carry out the transformation from messy to tidy is one call to
gather(), with parameters that reflect our conclusions about the columns to keep and the names we've chosen:
tidydata <- messydata %>% gather(key = "roadtype", value = "mpg", -id)
The most important thing to note is that each parameter corresponds to one of the columns in the tidy data set.
Now let's look at them one at a time in more detail:
key = "roadtype" sounds like it's setting the key to a preexisting column called "roadtype", but that's not what's going on. Rather, think of this as an instruction to create a new key column and call it "roadtype".
value = "mpg" likewise is an instruction to create a new value column and call it "mpg".
-id is the piece that says "Don't touch my
id column! Leave it as is!"
Nothing else needs to be specified: every value in a column not marked "don't touch" will be moved to the value ("mpg") column and automatically paired with its old column name (in this case "city" or "hwy") from the messy data set, which now appears in the key column.
There are a few problems I've seen with the way in which people use
Not separating the messy data columns properly into "keep" and "dump".
If we had made this mistake in the example above, the result would be something like:
messydata %>% gather(key = "roadtype", value = "mpg")
## # A tibble: 9 x 2 ## roadtype mpg ## <chr> <chr> ## 1 id car1 ## 2 id car2 ## 3 id car3 ## 4 city 19 ## 5 city 20 ## 6 city 29 ## 7 hwy 24 ## 8 hwy 30 ## 9 hwy 35
Yikes, not what we wanted. Looking at the
mpg column, we see that "car1", "car2", and "car3" don't belong. The solution is to add the
-id parameter -- remember, think: "don't touch
id" -- so that it isn't "dumped" into the
Note: it is possible to specify the columns to dump rather than the columns to keep:
messydata %>% gather(key = "roadtype", value = "mpg", city, hwy)
## # A tibble: 6 x 3 ## id roadtype mpg ## <chr> <chr> <dbl> ## 1 car1 city 19.0 ## 2 car2 city 20.0 ## 3 car3 city 29.0 ## 4 car1 hwy 24.0 ## 5 car2 hwy 30.0 ## 6 car3 hwy 35.0
I find this method less intuitive, but of course it's your choice.
Missing id column
Note in our example that the
id column is important for linking the city and highway miles per gallon. For example, we might want to plot the data as follows, which requires knowing which car had which city and highway mpg:
Often, however, there is no
id column: it's not necessary since each row represents one item--a car in this case. If we try to tidy messy data without an id, it looks like this:
The problem is that we've lost the natural connection we had between city and highway for each car since the two values no longer reside in the same row. The solution here is to move the row names--or row numbers in this case--to a column to be used as an id:
messy2 %>% rownames_to_column("id") %>% gather(key, value, -id)
## # A tibble: 6 x 3 ## id key value ## <chr> <chr> <dbl> ## 1 1 city 19.0 ## 2 2 city 20.0 ## 3 3 city 29.0 ## 4 1 hwy 24.0 ## 5 2 hwy 30.0 ## 6 3 hwy 35.0
Note that we must specify the "don't touch" id column that we just created. Also note that we used the default
value column names rather than choose our own.
I don't know if this issue is as widespread as the others, but I'm going to include it since I've made this mistake one too many times. Working on the assumption that you can leave out parameters if you don't change the defaults, if you're like me, you might try something like this:
messydata %>% gather(-id)
## # A tibble: 9 x 2 ## `-id` value ## <chr> <chr> ## 1 id car1 ## 2 id car2 ## 3 id car3 ## 4 city 19 ## 5 city 20 ## 6 city 29 ## 7 hwy 24 ## 8 hwy 30 ## 9 hwy 35
The result is that no columns are spared, and the key column is now called "-id". Not what we wanted. Remember
value, if you don't change the column names!
messydata %>% gather(key, value, -id)
## # A tibble: 6 x 3 ## id key value ## <chr> <chr> <dbl> ## 1 car1 city 19.0 ## 2 car2 city 20.0 ## 3 car3 city 29.0 ## 4 car1 hwy 24.0 ## 5 car2 hwy 30.0 ## 6 car3 hwy 35.0
Let's take the
painters dataset from the MASS package:
## Composition Drawing Colour Expression School ## Da Udine 10 8 16 3 A ## Da Vinci 15 16 4 14 A ## Del Piombo 8 13 16 7 A ## Del Sarto 12 16 9 8 A ## Fr. Penni 0 15 8 0 A ## Guilio Romano 15 16 4 14 A
...and tidy it into four columns as such:
## Name School Skill Score ## 1 Da Udine A Composition 10 ## 2 Da Vinci A Composition 15 ## 3 Del Piombo A Composition 8 ## 4 Del Sarto A Composition 12 ## 5 Fr. Penni A Composition 0 ## 6 Guilio Romano A Composition 15
This example is a little more complex than the previous one, since it's missing an id column, and there is more than one "keep" column. (You are encouraged to try this on your own before looking at the solution!)
Move rownames to a new column and call it
idcolumn" in the Heads up section above to understand the need for this new column.)
Schoolcolumns as is. ("Don't Touch!")
Since the four columns
Expressionare really levels of another variable, they do not get the "keep" as is designation. That means we are staging them to be dumped into a new key variable which we'll call
Skill. In addition, the values contained in these columns will move to a new value column, which we'll call
library(MASS) library(tidyverse) tidypaint <- painters %>% rownames_to_column("Name") %>% gather(key = "Skill", value = "Score", -Name, -School)
Finally, if the visuals aren't your style, here's a running commentary of the instructions:
"Take the (messy) dataset
painters. Move the contents of the rownames to a new column called
Name. Now let's start tidying by gathering multiple columns into key-value pairs. Do this by creating a new key column called
Skill. While you're at it, create a new value column called
Score. Absolutely do not touch or change the
School columns (other than to replicate as necessary). It goes without saying that the four other column names will fill the new
Skill column, and the values of those columns will fill the new
Score column, since it's understood from what I've said already! Much appreciated,