unheadr 
The goal of unheadr is to help wrangle data when it has embedded
subheaders, or when values are wrapped across several rows.
https://unheadr.liomys.mx/
Installation
You can install the CRAN release or the development version with:
# Install unheadr from CRAN:
install.packages("unheadr")
# Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("luisDVA/unheadr")Some of the possible uses of unheadr are now described in this
publication:
Verde Arregoitia, L. D., Cooper, N., D’Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. Hystrix, the Italian Journal of Mammalogy, 29(2), 155-161. Open Access, DOI 10.4404/hystrix-00133-2018
Usage
untangle2()
untangle2() puts embedded subheaders into their own variable, using
regular expressions to identify them.
data(primates2017)
# head(primates2017,n=20)The first half of the dataset looks like the table below. Note that there are rows that correspond to values in grouping variables, which should be in their own column. Instead, they are embedded within the data rectangle. This is a pretty common practice. This looks OK and is easy to read, but hard to work with (for example: calculating group-wise summaries).
In this example, values for an implicit “geographic region” variable and an implicit ‘taxonomic family’ variable are embedded in the column that contains the observational units (the scientific names of various primates).
| scientific_name | common_name | red_list_status | mass_kg |
|---|---|---|---|
| Asia | NA | NA | NA |
| CERCOPITHECIDAE | NA | NA | NA |
| Trachypithecus obscurus | Dusky Langur | NT | 7.13 |
| Presbytis sumatra | Black Sumatran Langur | EN | 6.00 |
| Rhinopithecus roxellana | Golden Snub-nosed Monkey | EN | NA |
| Trachypithecus auratus | East Javan Langur | VU | 6.25 |
| Semnopithecus johnii | Nilgiri Langur | VU | 11.45 |
| Trachypithecus delacouri | Delacour’s Langur | CR | NA |
| Trachypithecus leucocephalus | White-headed Langur | CR | 8.00 |
| Presbytis comata | Javan Langur | EN | 6.70 |
| Macaca pagensis | Pagai Macaque | CR | 4.50 |
| Trachypithecus germaini | Germain’s Langur | EN | 8.83 |
| Macaca munzala | Arunachal Macaque | EN | NA |
| Macaca mulatta | Rhesus Macaque | LC | 9.90 |
| Semnopithecus hector | Terai Sacred Langur | NT | 15.20 |
| HYLOBATIDAE | NA | NA | NA |
| Hylobates funereus | East Bornean Gray Gibbon | EN | NA |
| Hylobates klossii | Kloss’s Gibbon | EN | 5.80 |
| Nomascus concolor | Western Black Crested Gibbon | CR | 7.71 |
| LORISIDAE | NA | NA | NA |
For a tidier structure, these subheaders embedded in the
scientific_name column need to be plucked out and placed in their own
variable. This is the main objective of unheadr and what untangle2()
was made for.
If these subheaders can be matched in bulk with a regular expression because they share a prefix, suffix, or anything in common, we can save a lot of time. Otherwise, they can be matched by name.
Sorting out the mess in the example data:
# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(primates2017, "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region)The function can be used with magrittr pipes as a dplyr-type verb.
primates2017 %>%
untangle2("DAE$", scientific_name, family) %>%
untangle2("Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region) %>%
head(n = 20)| scientific_name | common_name | red_list_status | mass_kg | family | region |
|---|---|---|---|---|---|
| Trachypithecus obscurus | Dusky Langur | NT | 7.13 | CERCOPITHECIDAE | Asia |
| Presbytis sumatra | Black Sumatran Langur | EN | 6.00 | CERCOPITHECIDAE | Asia |
| Rhinopithecus roxellana | Golden Snub-nosed Monkey | EN | NA | CERCOPITHECIDAE | Asia |
| Trachypithecus auratus | East Javan Langur | VU | 6.25 | CERCOPITHECIDAE | Asia |
| Semnopithecus johnii | Nilgiri Langur | VU | 11.45 | CERCOPITHECIDAE | Asia |
| Trachypithecus delacouri | Delacour’s Langur | CR | NA | CERCOPITHECIDAE | Asia |
| Trachypithecus leucocephalus | White-headed Langur | CR | 8.00 | CERCOPITHECIDAE | Asia |
| Presbytis comata | Javan Langur | EN | 6.70 | CERCOPITHECIDAE | Asia |
| Macaca pagensis | Pagai Macaque | CR | 4.50 | CERCOPITHECIDAE | Asia |
| Trachypithecus germaini | Germain’s Langur | EN | 8.83 | CERCOPITHECIDAE | Asia |
| Macaca munzala | Arunachal Macaque | EN | NA | CERCOPITHECIDAE | Asia |
| Macaca mulatta | Rhesus Macaque | LC | 9.90 | CERCOPITHECIDAE | Asia |
| Semnopithecus hector | Terai Sacred Langur | NT | 15.20 | CERCOPITHECIDAE | Asia |
| Hylobates funereus | East Bornean Gray Gibbon | EN | NA | HYLOBATIDAE | Asia |
| Hylobates klossii | Kloss’s Gibbon | EN | 5.80 | HYLOBATIDAE | Asia |
| Nomascus concolor | Western Black Crested Gibbon | CR | 7.71 | HYLOBATIDAE | Asia |
| Nycticebus menagensis | Philippine Slow Loris | VU | 0.28 | LORISIDAE | Asia |
| Nycticebus bengalensis | Bengal Slow Loris | VU | 1.21 | LORISIDAE | Asia |
| Allocebus trichotis | Hairy-eared Dwarf Lemur | VU | 0.09 | CHEIROGALEIDAE | Madagascar |
| Microcebus tavaratra | Tavaratra Mouse Lemur | VU | 0.06 | CHEIROGALEIDAE | Madagascar |
Now we can easily perform grouping operations and summarize the data (e.g. calculating average body mass by Family).
unbreak_vals()
This is a niche function for very specific uses. It uses regex to fix values that are broken across two rows. This usually happens when we are formatting a table and we need to fit it on a page.
dogsDesc <- tibble::tibble(
dogs =
c("Retriever", "(Golden)", "Retriever", "(Labrador)", "Bulldog", "(French)"),
coat = c("long", NA, "short", NA, "short", NA)
)
head(dogsDesc)We can match the opening brackets with regex.
unbreak_vals(dogsDesc,"^\\(",dogs, dogs_unbroken)unwrap_cols()
Use this function to unwrap and glue values that have been wrapped
across multiple rows for presentation purposes, with an inconsistent
number of empty or NA values padding out the columns.
# set up the data
nyk <- tibble::tribble(
~player, ~listed_height_m., ~teams_chronological, ~position,
"Marcus Camby", 2.11, "Raptors", "Power forward",
NA, NA, "Knicks", "Center",
NA, NA, "Nuggets", NA,
NA, NA, "Clippers", NA,
NA, NA, "Trail Blazers", NA,
NA, NA, "Rockets", NA,
NA, NA, "Knicks", NA,
"Allan Houston", 1.98, "Pistons", "Shooting guard",
NA, NA, "Knicks", NA,
"Latrell Sprewell", 1.96, "Warriors", "Small forward",
NA, NA, "Knicks", NA,
NA, NA, "Timberwolves", NA
)Paste the wrapped elements, separating with commas.
unwrap_cols(nyk, groupingVar = player, separator = ", ")unbreak_rows()
This function merges sets of two contiguous rows upwards by pasting the values of the lagging row to the values of the leading row (identified using regular expressions).
The following table of basketball records has two sets of header rows with values broken across two contiguous rows.
| v1 | v2 | v3 |
|---|---|---|
| Player | Most points | Season |
| NA | in a game | (year ending) |
| Sleve McDichael | 55 | 2001 |
| Dean Wesrey | 43 | 2000 |
| Karl Dandleton | 41 | 2010 |
| Player | Most varsity | Season |
| NA | games played | (year ending) |
| Mike Sernandez | 111 | 2005 |
| Glenallen Mixon | 109 | 2004 |
| Rey McSriff | 104 | 2002 |
unbreak_rows() merges these rows if we can match them with a common
pattern.
bball <- data.frame(
stringsAsFactors = FALSE,
v1 = c(
"Player", NA, "Sleve McDichael", "Dean Wesrey",
"Karl Dandleton", "Player",
NA,
"Mike Sernandez",
"Glenallen Mixon",
"Rey McSriff"
),
v2 = c(
"Most points", "in a game", "55", "43", "41", "Most varsity",
"games played", "111", "109",
"104"
),
v3 = c(
"Season", "(year ending)", "2001", "2000", "2010", "Season",
"(year ending)", "2005",
"2004", "2002"
)
)
# Match with regex on variable v2
unbreak_rows(bb3,"^Most",v2)The resulting table:
| v1 | v2 | v3 |
|---|---|---|
| Player | Most points in a game | Season (year ending) |
| Sleve McDichael | 55 | 2001 |
| Dean Wesrey | 43 | 2000 |
| Karl Dandleton | 41 | 2010 |
| Player | Most varsity games played | Season (year ending) |
| Mike Sernandez | 111 | 2005 |
| Glenallen Mixon | 109 | 2004 |
| Rey McSriff | 104 | 2002 |
annotate_mf() and annotate_mf_all()
Sometimes embedded subheaders can’t be matched by content or context, but they share the same formatting in a spreadsheet file.
annotate_mf() flattens four common approaches to confer meaningful
formatting to cells and adds this as a character string to the target
variable.
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet,orig = Task, new=Task_annotated)annotate_mf_all() applies the same approach to all values in the
dataset.
example_spreadsheet_all <- system.file("extdata/boutiques.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet_all)Lastly, regex_valign() can adjust the whitespace (padding) within a
character vector with one element per line, for easier parsing with
readr.
guests <-
unlist(strsplit(c("6 COAHUILA 20/03/2020
712 COAHUILA 20/03/2020"),"\n"))
guests
regex_valign(guests, "\\b(?=[A-Z])")The inconsistent whitespace between the elements in each line can be adjusted after matching a position of interest through regular expressions.