Skip to content
Handle Data With Embedded Subheaders
R
Branch: master
Clone or download
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R better documentation for unbreak_rows ☠️ Dec 13, 2019
data Initial commit Jul 4, 2018
inst/extdata pushed spreadsheet fns to master Oct 6, 2019
man
tests add unbreak rows function Dec 12, 2019
vignettes pushed spreadsheet fns to master Oct 6, 2019
.Rbuildignore add codecov Oct 14, 2019
.gitignore Initial commit Jul 4, 2018
.travis.yml check Oct 7, 2019
DESCRIPTION better dataset documentation, fix typos Oct 15, 2019
LICENSE
LICENSE.md Initial commit Jul 4, 2018
NAMESPACE
NEWS.md major update: tests, annotating function, excel files, etc Oct 7, 2019
README.Rmd add output to readme file Dec 12, 2019
README.md add output to readme file Dec 12, 2019
codecov.yml
unheadr.Rproj Initial commit Jul 4, 2018

README.md

unheadr

Travis build status Codecov test coverage

The goal of unheadr is to help wrangle data when it has embedded subheaders, or when values are wrapped across several rows.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("luisDVA/unheadr")

Some of the possible uses of unheadr are now described in this publication:

Verde Arregoitia, L. D., Cooper, N., D’Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. Hystrix, the Italian Journal of Mammalogy, 29(2), 155-161. Open Access, DOI 10.4404/hystrix-00133-2018

Usage

untangle2() function

The star of the package. Puts embedded subheaders into their own variable, using regular expressions to identify them.

data(primates2017)
# head(primates2017,n=20)

The first half of the dataset looks like the table below. Note that there are rows that correspond to values in grouping variables, which should be in their own column. Instead, they are embedded within the data rectangle. This is a pretty common practice. This looks OK and is easy to read, but hard to work with (for example: calculating group-wise summaries).

In this example, values for an implicit ‘geographic region’ variable and an implicit ‘taxonomic family’ variable are embedded in the column that contains the observational units (the scientific names of various primates).

scientific_name common_name red_list_status mass_kg
Asia NA NA NA
CERCOPITHECIDAE NA NA NA
Trachypithecus obscurus Dusky Langur NT 7.13
Presbytis sumatra Black Sumatran Langur EN 6.00
Rhinopithecus roxellana Golden Snub-nosed Monkey EN NA
Trachypithecus auratus East Javan Langur VU 6.25
Semnopithecus johnii Nilgiri Langur VU 11.45
Trachypithecus delacouri Delacour’s Langur CR NA
Trachypithecus leucocephalus White-headed Langur CR 8.00
Presbytis comata Javan Langur EN 6.70
Macaca pagensis Pagai Macaque CR 4.50
Trachypithecus germaini Germain’s Langur EN 8.83
Macaca munzala Arunachal Macaque EN NA
Macaca mulatta Rhesus Macaque LC 9.90
Semnopithecus hector Terai Sacred Langur NT 15.20
HYLOBATIDAE NA NA NA
Hylobates funereus East Bornean Gray Gibbon EN NA
Hylobates klossii Kloss’s Gibbon EN 5.80
Nomascus concolor Western Black Crested Gibbon CR 7.71
LORISIDAE NA NA NA

For a tidier structure, these subheaders embedded in the scientific_name column need to be plucked out and placed in their own variable. This is the main objective of unheadr and what untangle2() was made for.

If these subheaders can be matched in bulk with a regular expression because they share a prefix, suffix, or anyting in common, we can save a lot of time. Otherwise, they can be matched by name.

Sorting out the mess in the example data:

# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(primates2017, "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region)

The function can be used with magrittr pipes as a dplyr-type verb.

primates2017 %>%
  untangle2("DAE$", scientific_name, family) %>%
  untangle2("Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region) %>%
  head(n = 20)
scientific_name common_name red_list_status mass_kg family region
Trachypithecus obscurus Dusky Langur NT 7.13 CERCOPITHECIDAE Asia
Presbytis sumatra Black Sumatran Langur EN 6.00 CERCOPITHECIDAE Asia
Rhinopithecus roxellana Golden Snub-nosed Monkey EN NA CERCOPITHECIDAE Asia
Trachypithecus auratus East Javan Langur VU 6.25 CERCOPITHECIDAE Asia
Semnopithecus johnii Nilgiri Langur VU 11.45 CERCOPITHECIDAE Asia
Trachypithecus delacouri Delacour’s Langur CR NA CERCOPITHECIDAE Asia
Trachypithecus leucocephalus White-headed Langur CR 8.00 CERCOPITHECIDAE Asia
Presbytis comata Javan Langur EN 6.70 CERCOPITHECIDAE Asia
Macaca pagensis Pagai Macaque CR 4.50 CERCOPITHECIDAE Asia
Trachypithecus germaini Germain’s Langur EN 8.83 CERCOPITHECIDAE Asia
Macaca munzala Arunachal Macaque EN NA CERCOPITHECIDAE Asia
Macaca mulatta Rhesus Macaque LC 9.90 CERCOPITHECIDAE Asia
Semnopithecus hector Terai Sacred Langur NT 15.20 CERCOPITHECIDAE Asia
Hylobates funereus East Bornean Gray Gibbon EN NA HYLOBATIDAE Asia
Hylobates klossii Kloss’s Gibbon EN 5.80 HYLOBATIDAE Asia
Nomascus concolor Western Black Crested Gibbon CR 7.71 HYLOBATIDAE Asia
Nycticebus menagensis Philippine Slow Loris VU 0.28 LORISIDAE Asia
Nycticebus bengalensis Bengal Slow Loris VU 1.21 LORISIDAE Asia
Allocebus trichotis Hairy-eared Dwarf Lemur VU 0.09 CHEIROGALEIDAE Madagascar
Microcebus tavaratra Tavaratra Mouse Lemur VU 0.06 CHEIROGALEIDAE Madagascar

Now we can easily perform grouping operations and summarize the data (e.g. calculating average body mass by Family).

At this point, refer to the links in the vignette and the function help for more information and examples on the use of the other helper functions.

unbreak_vals() function

This is a niche function for very specific uses. It uses regex to fix values that are broken across two rows. This usually happens when we are formatting a table and we need to fit it on a page.

dogsDesc <- tibble::tibble(
  dogs =
    c("Retriever", "(Golden)", "Retriever", "(Labrador)", "Bulldog", "(French)"),
  coat = c("long", NA, "short", NA, "short", NA)
)

head(dogsDesc)

We can match the opening brackets with regex.

unbreak_vals(dogsDesc,"^\\(",dogs, dogs_unbroken)

unwrap_cols() function

Use this function to unwrap and glue values that have been wrapped across multiple rows for presentation purposes, with an inconsistent number of empty or NA values padding out the columns.

# set up the data
nyk <- tibble::tribble(
  ~player, ~listed_height_m., ~teams_chronological, ~position,
  "Marcus Camby", 2.11, "Raptors", "Power forward",
  NA, NA, "Knicks", "Center",
  NA, NA, "Nuggets", NA,
  NA, NA, "Clippers", NA,
  NA, NA, "Trail Blazers", NA,
  NA, NA, "Rockets", NA,
  NA, NA, "Knicks", NA,
  "Allan Houston", 1.98, "Pistons", "Shooting guard",
  NA, NA, "Knicks", NA,
  "Latrell Sprewell", 1.96, "Warriors", "Small forward",
  NA, NA, "Knicks", NA,
  NA, NA, "Timberwolves", NA
)

Paste the wrapped elements, separating with commas.

unwrap_cols(nyk, groupingVar = player, separator = ", ")

unbreak_rows() function

This function merges sets of two contiguous rows ‘upwards’ by pasting the values of the lagging row to the values of the leading row (identified using regular expressions).

The following table of basketball records has two sets of header rows with values broken across two contiguous rows.

v1 v2 v3
Player Most points Season
NA in a game (year ending)
Sleve McDichael 55 2001
Dean Wesrey 43 2000
Karl Dandleton 41 2010
Player Most varsity Season
NA games played (year ending)
Mike Sernandez 111 2005
Glenallen Mixon 109 2004
Rey McSriff 104 2002

unbreak_rows() merges these rows if we can match them with a common pattern.

bball <- data.frame(
  stringsAsFactors = FALSE,
  v1 = c(
    "Player", NA, "Sleve McDichael", "Dean Wesrey",
    "Karl Dandleton", "Player",
    NA,
    "Mike Sernandez",
    "Glenallen Mixon",
    "Rey McSriff"
  ),
  v2 = c(
    "Most points", "in a game", "55", "43", "41", "Most varsity",
    "games played", "111", "109",
    "104"
  ),
  v3 = c(
    "Season", "(year ending)", "2001", "2000", "2010", "Season",
    "(year ending)", "2005",
    "2004", "2002"
  )
)

# Match with regex on variable v2
unbreak_rows(bb3,"^Most",v2)

The resulting table:

v1 v2 v3
Player Most points in a game Season (year ending)
Sleve McDichael 55 2001
Dean Wesrey 43 2000
Karl Dandleton 41 2010
Player Most varsity games played Season (year ending)
Mike Sernandez 111 2005
Glenallen Mixon 109 2004
Rey McSriff 104 2002

annotate_mf() function

Sometimes embedded subheaders can’t be matched by content or context, but they share the same formatting in a spreadsheet file.

The annotate_mf() function flattens four common approaches to confer meaningful formatting to cells and adds this as a character string to the target variable.

example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet,orig = Task, new=Task_annotated)
You can’t perform that action at this time.