Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
274 lines (232 sloc) 9.09 KB

Why nesting is worth the awkwardness

Jenny Bryan 2018-04-12

library(gapminder)
library(tidyverse)

gapminder data for Asia only

gap <- gapminder %>%
  filter(continent == "Asia") %>%
  mutate(yr1952 = year - 1952)
ggplot(gap, aes(x = lifeExp, y = country)) +
  geom_point()

Countries are in alphabetical order.

Set factor levels with intent. Example: order based on life expectancy in 2007, the last year in this dataset. Imagine you want this to persist across an entire analysis.

gap <- gap %>%
  mutate(country = fct_reorder2(country, .x = year, .y = lifeExp))
ggplot(gap, aes(x = lifeExp, y = country)) +
  geom_point()

Much better!

Now imagine we want to fit a model to each country and look at dot plots of slope and intercept.

dplyr::group_by() + tidyr::nest() created a nested data frame and is an alternative to splitting into country-specific data frames. Those data frames end up, instead, in a list-column. The country variable remains as a normal factor.

gap_nested <- gap %>%
  group_by(country) %>%
  nest()

gap_nested
#> # A tibble: 33 x 2
#>    country          data             
#>    <fct>            <list>           
#>  1 Afghanistan      <tibble [12 × 6]>
#>  2 Bahrain          <tibble [12 × 6]>
#>  3 Bangladesh       <tibble [12 × 6]>
#>  4 Cambodia         <tibble [12 × 6]>
#>  5 China            <tibble [12 × 6]>
#>  6 Hong Kong, China <tibble [12 × 6]>
#>  7 India            <tibble [12 × 6]>
#>  8 Indonesia        <tibble [12 × 6]>
#>  9 Iran             <tibble [12 × 6]>
#> 10 Iraq             <tibble [12 × 6]>
#> # ... with 23 more rows
gap_nested$data[[1]]
#> # A tibble: 12 x 6
#>    continent  year lifeExp      pop gdpPercap yr1952
#>    <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
#>  1 Asia       1952    28.8  8425333      779.     0.
#>  2 Asia       1957    30.3  9240934      821.     5.
#>  3 Asia       1962    32.0 10267083      853.    10.
#>  4 Asia       1967    34.0 11537966      836.    15.
#>  5 Asia       1972    36.1 13079460      740.    20.
#>  6 Asia       1977    38.4 14880372      786.    25.
#>  7 Asia       1982    39.9 12881816      978.    30.
#>  8 Asia       1987    40.8 13867957      852.    35.
#>  9 Asia       1992    41.7 16317921      649.    40.
#> 10 Asia       1997    41.8 22227415      635.    45.
#> 11 Asia       2002    42.1 25268405      727.    50.
#> 12 Asia       2007    43.8 31889923      975.    55.

gap_fitted <- gap_nested %>%
  mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
gap_fitted
#> # A tibble: 33 x 3
#>    country          data              fit     
#>    <fct>            <list>            <list>  
#>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>
#>  2 Bahrain          <tibble [12 × 6]> <S3: lm>
#>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>
#>  4 Cambodia         <tibble [12 × 6]> <S3: lm>
#>  5 China            <tibble [12 × 6]> <S3: lm>
#>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>
#>  7 India            <tibble [12 × 6]> <S3: lm>
#>  8 Indonesia        <tibble [12 × 6]> <S3: lm>
#>  9 Iran             <tibble [12 × 6]> <S3: lm>
#> 10 Iraq             <tibble [12 × 6]> <S3: lm>
#> # ... with 23 more rows
gap_fitted$fit[[1]]
#> 
#> Call:
#> lm(formula = lifeExp ~ yr1952, data = .x)
#> 
#> Coefficients:
#> (Intercept)       yr1952  
#>     29.9073       0.2753

gap_fitted <- gap_fitted %>%
  mutate(
    intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
    slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
  )
gap_fitted
#> # A tibble: 33 x 5
#>    country          data              fit      intercept slope
#>    <fct>            <list>            <list>       <dbl> <dbl>
#>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>      29.9 0.275
#>  2 Bahrain          <tibble [12 × 6]> <S3: lm>      52.7 0.468
#>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>      36.1 0.498
#>  4 Cambodia         <tibble [12 × 6]> <S3: lm>      37.0 0.396
#>  5 China            <tibble [12 × 6]> <S3: lm>      47.2 0.531
#>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>      63.4 0.366
#>  7 India            <tibble [12 × 6]> <S3: lm>      39.3 0.505
#>  8 Indonesia        <tibble [12 × 6]> <S3: lm>      36.9 0.635
#>  9 Iran             <tibble [12 × 6]> <S3: lm>      45.0 0.497
#> 10 Iraq             <tibble [12 × 6]> <S3: lm>      50.1 0.235
#> # ... with 23 more rows
ggplot(gap_fitted, aes(x = intercept, y = country)) +
  geom_point()

ggplot(gap_fitted, aes(x = slope, y = country)) +
  geom_point()

The split() + lapply() + do.call(rbind, ...) approach.

Split gap into many data frames, one per country.

gap_split <- split(gap, gap$country)

Fit a model to each country.

gap_split_fits <- lapply(
  gap_split,
  function(df) {
    lm(lifeExp ~ yr1952, data = df)
  }
)
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases

Oops … the unused levels of country are a problem (empty data frames in our list).

Drop unused levels in country and split.

gap_split <- split(droplevels(gap), droplevels(gap)$country)
head(gap_split, 2)
#> $Japan
#> # A tibble: 12 x 7
#>    country continent  year lifeExp       pop gdpPercap yr1952
#>    <fct>   <fct>     <int>   <dbl>     <int>     <dbl>  <dbl>
#>  1 Japan   Asia       1952    63.0  86459025     3217.     0.
#>  2 Japan   Asia       1957    65.5  91563009     4318.     5.
#>  3 Japan   Asia       1962    68.7  95831757     6577.    10.
#>  4 Japan   Asia       1967    71.4 100825279     9848.    15.
#>  5 Japan   Asia       1972    73.4 107188273    14779.    20.
#>  6 Japan   Asia       1977    75.4 113872473    16610.    25.
#>  7 Japan   Asia       1982    77.1 118454974    19384.    30.
#>  8 Japan   Asia       1987    78.7 122091325    22376.    35.
#>  9 Japan   Asia       1992    79.4 124329269    26825.    40.
#> 10 Japan   Asia       1997    80.7 125956499    28817.    45.
#> 11 Japan   Asia       2002    82.0 127065841    28605.    50.
#> 12 Japan   Asia       2007    82.6 127467972    31656.    55.
#> 
#> $`Hong Kong, China`
#> # A tibble: 12 x 7
#>    country          continent  year lifeExp     pop gdpPercap yr1952
#>    <fct>            <fct>     <int>   <dbl>   <int>     <dbl>  <dbl>
#>  1 Hong Kong, China Asia       1952    61.0 2125900     3054.     0.
#>  2 Hong Kong, China Asia       1957    64.8 2736300     3629.     5.
#>  3 Hong Kong, China Asia       1962    67.6 3305200     4693.    10.
#>  4 Hong Kong, China Asia       1967    70.0 3722800     6198.    15.
#>  5 Hong Kong, China Asia       1972    72.0 4115700     8316.    20.
#>  6 Hong Kong, China Asia       1977    73.6 4583700    11186.    25.
#>  7 Hong Kong, China Asia       1982    75.4 5264500    14561.    30.
#>  8 Hong Kong, China Asia       1987    76.2 5584510    20038.    35.
#>  9 Hong Kong, China Asia       1992    77.6 5829696    24758.    40.
#> 10 Hong Kong, China Asia       1997    80.0 6495918    28378.    45.
#> 11 Hong Kong, China Asia       2002    81.5 6762476    30209.    50.
#> 12 Hong Kong, China Asia       2007    82.2 6980412    39725.    55.

Fit model to each country and get coefs().

gap_split_coefs <- lapply(
  gap_split,
  function(df) {
    coef(lm(lifeExp ~ yr1952, data = df))
  }
)
head(gap_split_coefs, 2)
#> $Japan
#> (Intercept)      yr1952 
#>  65.1220513   0.3529042 
#> 
#> $`Hong Kong, China`
#> (Intercept)      yr1952 
#>  63.4286410   0.3659706

Now we need to put everything back togethers. Row bind the list of coefs. Coerce from matrix back to data frame.

gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))

Restore country variable from row names.

gap_split_coefs$country <- rownames(gap_split_coefs)
str(gap_split_coefs)
#> 'data.frame':    33 obs. of  3 variables:
#>  $ (Intercept): num  65.1 63.4 66.3 61.8 49.7 ...
#>  $ yr1952     : num  0.353 0.366 0.267 0.341 0.555 ...
#>  $ country    : chr  "Japan" "Hong Kong, China" "Israel" "Singapore" ...
ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
  geom_point()

Uh-oh, we lost the order of the country factor, due to coercion from factor to character (list and then row names).

The nest() approach allows you to keep data as data vs. in attributes, such as list or row names. Preserves factors and their levels or integer variables. Designs away various opportunities for different pieces of the dataset to get “out of sync” with each other, by leaving them in a data frame at all times.

First in an interesting series of blog posts exploring these patterns and asking whether the tidyverse still needs a way to include the nesting variable in the nested data: https://coolbutuseless.bitbucket.io/2018/03/03/split-apply-combine-my-search-for-a-replacement-for-group_by---do/