unheadr

The goal of unheadr is to help wrangle data when it has embedded subheaders, or when values are wrapped across several rows.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("luisDVA/unheadr")

Some of the possible uses of unheadr are now described in this publication:

Verde Arregoitia, L. D., Cooper, N., D’Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. Hystrix, the Italian Journal of Mammalogy, 29(2), 155-161. Open Access, DOI 10.4404/hystrix-00133-2018

Usage

untangle2() function

The star of the package. Puts embedded subheaders into their own variable, using regular expressions to identify them.

data(primates2017)
# head(primates2017,n=20)

The first half of the dataset looks like the table below. Note that there are rows that correspond to values in grouping variables, which should be in their own column. Instead, they are embedded within the data rectangle. This is a pretty common practice. This looks OK and is easy to read, but hard to work with (for example: calculating group-wise summaries).

In this example, values for an implicit ‘geographic region’ variable and an implicit ‘taxonomic family’ variable are embedded in the column that contains the observational units (the scientific names of various primates).

scientific_name	common_name	red_list_status	mass_kg
Asia	NA	NA	NA
CERCOPITHECIDAE	NA	NA	NA
Trachypithecus obscurus	Dusky Langur	NT	7.13
Presbytis sumatra	Black Sumatran Langur	EN	6.00
Rhinopithecus roxellana	Golden Snub-nosed Monkey	EN	NA
Trachypithecus auratus	East Javan Langur	VU	6.25
Semnopithecus johnii	Nilgiri Langur	VU	11.45
Trachypithecus delacouri	Delacour’s Langur	CR	NA
Trachypithecus leucocephalus	White-headed Langur	CR	8.00
Presbytis comata	Javan Langur	EN	6.70
Macaca pagensis	Pagai Macaque	CR	4.50
Trachypithecus germaini	Germain’s Langur	EN	8.83
Macaca munzala	Arunachal Macaque	EN	NA
Macaca mulatta	Rhesus Macaque	LC	9.90
Semnopithecus hector	Terai Sacred Langur	NT	15.20
HYLOBATIDAE	NA	NA	NA
Hylobates funereus	East Bornean Gray Gibbon	EN	NA
Hylobates klossii	Kloss’s Gibbon	EN	5.80
Nomascus concolor	Western Black Crested Gibbon	CR	7.71
LORISIDAE	NA	NA	NA

For a tidier structure, these subheaders embedded in the scientific_name column need to be plucked out and placed in their own variable. This is the main objective of unheadr and what untangle2() was made for.

If these subheaders can be matched in bulk with a regular expression because they share a prefix, suffix, or anyting in common, we can save a lot of time. Otherwise, they can be matched by name.

Sorting out the mess in the example data:

# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(primates2017, "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region)

The function can be used with magrittr pipes as a dplyr-type verb.

primates2017 %>%
  untangle2("DAE$", scientific_name, family) %>%
  untangle2("Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region) %>%
  head(n = 20)

scientific_name	common_name	red_list_status	mass_kg	family	region
Trachypithecus obscurus	Dusky Langur	NT	7.13	CERCOPITHECIDAE	Asia
Presbytis sumatra	Black Sumatran Langur	EN	6.00	CERCOPITHECIDAE	Asia
Rhinopithecus roxellana	Golden Snub-nosed Monkey	EN	NA	CERCOPITHECIDAE	Asia
Trachypithecus auratus	East Javan Langur	VU	6.25	CERCOPITHECIDAE	Asia
Semnopithecus johnii	Nilgiri Langur	VU	11.45	CERCOPITHECIDAE	Asia
Trachypithecus delacouri	Delacour’s Langur	CR	NA	CERCOPITHECIDAE	Asia
Trachypithecus leucocephalus	White-headed Langur	CR	8.00	CERCOPITHECIDAE	Asia
Presbytis comata	Javan Langur	EN	6.70	CERCOPITHECIDAE	Asia
Macaca pagensis	Pagai Macaque	CR	4.50	CERCOPITHECIDAE	Asia
Trachypithecus germaini	Germain’s Langur	EN	8.83	CERCOPITHECIDAE	Asia
Macaca munzala	Arunachal Macaque	EN	NA	CERCOPITHECIDAE	Asia
Macaca mulatta	Rhesus Macaque	LC	9.90	CERCOPITHECIDAE	Asia
Semnopithecus hector	Terai Sacred Langur	NT	15.20	CERCOPITHECIDAE	Asia
Hylobates funereus	East Bornean Gray Gibbon	EN	NA	HYLOBATIDAE	Asia
Hylobates klossii	Kloss’s Gibbon	EN	5.80	HYLOBATIDAE	Asia
Nomascus concolor	Western Black Crested Gibbon	CR	7.71	HYLOBATIDAE	Asia
Nycticebus menagensis	Philippine Slow Loris	VU	0.28	LORISIDAE	Asia
Nycticebus bengalensis	Bengal Slow Loris	VU	1.21	LORISIDAE	Asia
Allocebus trichotis	Hairy-eared Dwarf Lemur	VU	0.09	CHEIROGALEIDAE	Madagascar
Microcebus tavaratra	Tavaratra Mouse Lemur	VU	0.06	CHEIROGALEIDAE	Madagascar

Now we can easily perform grouping operations and summarize the data (e.g. calculating average body mass by Family).

At this point, refer to the links in the vignette and the function help for more information and examples on the use of the other helper functions.

unbreak_vals() function

This is a niche function for very specific uses. It uses regex to fix values that are broken across two rows. This usually happens when we are formatting a table and we need to fit it on a page.

dogsDesc <- tibble::tibble(
  dogs =
    c("Retriever", "(Golden)", "Retriever", "(Labrador)", "Bulldog", "(French)"),
  coat = c("long", NA, "short", NA, "short", NA)
)

head(dogsDesc)

We can match the opening brackets with regex.

unbreak_vals(dogsDesc,"^\\(",dogs, dogs_unbroken)

unwrap_cols() function

Use this function to unwrap and glue values that have been wrapped across multiple rows for presentation purposes, with an inconsistent number of empty or NA values padding out the columns.

# set up the data
nyk <- tibble::tribble(
  ~player, ~listed_height_m., ~teams_chronological, ~position,
  "Marcus Camby", 2.11, "Raptors", "Power forward",
  NA, NA, "Knicks", "Center",
  NA, NA, "Nuggets", NA,
  NA, NA, "Clippers", NA,
  NA, NA, "Trail Blazers", NA,
  NA, NA, "Rockets", NA,
  NA, NA, "Knicks", NA,
  "Allan Houston", 1.98, "Pistons", "Shooting guard",
  NA, NA, "Knicks", NA,
  "Latrell Sprewell", 1.96, "Warriors", "Small forward",
  NA, NA, "Knicks", NA,
  NA, NA, "Timberwolves", NA
)

Paste the wrapped elements, separating with commas.

unwrap_cols(nyk, groupingVar = player, separator = ", ")

annotate_mf() function

Sometimes embedded subheaders can’t be matched by content or context, but they share the same formatting in a spreadsheet file.

The annotate_mf() function flattens four common approaches to confer meaningful formatting to cells and adds this as a character string to the target variable.

example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet,orig = Task, new=Task_annotated)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

unheadr

Installation

Usage

Files

README.md

Latest commit

History

README.md

File metadata and controls

unheadr

Installation

Usage