Permalink
Browse files

Rename and polish nesting example

  • Loading branch information...
jennybc committed Apr 11, 2018
1 parent f0ffc39 commit 4f25dea13b694ed1aac83dba1cb2a65d450360b9
@@ -4,7 +4,8 @@ Materials for [RStudio webinar](https://www.rstudio.com/resources/webinars/):

Thinking inside the box: you can do that inside a data frame?!
Jenny Bryan
Wednesday, April 11 at 1:00pm ET / 10:00am PT
Wednesday, April 11 at 1:00pm ET / 10:00am PT
[rstd.io/row-work](https://rstd.io/row-work) *shortlink to this repo*

## Abstract

@@ -22,7 +23,5 @@ Not all are used in webinar
* **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
* **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different number of [rows](row-benchmark.png) or [columns](col-benchmark.png).
* **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
* **Group and summarise.** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
* **Split-apply-combine.** Nesting vs splitting.
- Downside of `split()`: First-class grouping variable(s) --> character vector of names --> variable is a big drag. Integer-y numerics must be coerced back, factors must be recreated, with original levels. Transitting data through attributes is an anti-pattern.
- Downside of `nest()`: When you inspect the list-column, you can't see values of grouping (key) variables. Grouping variables not necessarily/easily available for simple map (coolbutuseless's posts and PR).
* **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
* **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -24,162 +24,107 @@ gap <- gapminder %>%
filter(continent == "Asia") %>%
mutate(yr1952 = year - 1952)

#+ alpha-order
ggplot(gap, aes(x = lifeExp, y = country)) +
geom_point()
#' Random arrangement of countries

#' Countries are in alphabetical order.
#'
#' Set factor levels with intent. Imagine you want this to persist across an
#' entire analysis.
#' Set factor levels with intent. Example: order based on life expectancy in
#' 2007, the last year in this dataset. Imagine you want this to persist across
#' an entire analysis.
gap <- gap %>%
mutate(country = fct_reorder2(country, x = -1 * year, y = lifeExp))
mutate(country = fct_reorder2(country, x = year, y = lifeExp))

#+ principled-order
ggplot(gap, aes(x = lifeExp, y = country)) +
geom_point()


#' Much better!
#'
#' Now imagine we want to fit a model to each country and lot at dot plots of
#' Now imagine we want to fit a model to each country and look at dot plots of
#' slope and intercept.
#'
#' Nested approach ... leaves `country` as factor.
#' `dplyr::group_by()` + `tidyr::nest()` created a *nested data frame* and is an
#' alternative to splitting into country-specific data frames. Those data frames
#' end up, instead, in a list-column. The `country` variable remains as a normal
#' factor.
gap_nested <- gap %>%
group_by(country) %>%
nest()

gap_nested
gap_nested$data[[1]]

gap_fitted <- gap_nested %>%
mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
gap_fitted
gap_fitted$fit[[1]]

gap_fitted <- gap_fitted %>%
mutate(
intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
)
gap_fitted

#+ principled-order-coef-ests
ggplot(gap_fitted, aes(x = intercept, y = country)) +
geom_point()

ggplot(gap_fitted, aes(x = slope, y = country)) +
geom_point()

#' The `split()` + `lapply()` + `do.call(rbind, ...)` approach
#' Much fussing
#' The `split()` + `lapply()` + `do.call(rbind, ...)` approach.
#'
#' Split gap into many data frames, one per country.
gap_split <- split(gap, gap$country)

#' Fit a model to each country.
gap_split_fits <- lapply(
gap_split,
function(df) {
lm(lifeExp ~ yr1952, data = df)
}
)
## oops ... the unused levels of country are a problem

#' Oops ... the unused levels of country are a problem (empty data frames in our
#' list).
#'
#' Drop unused levels in country and split.
gap_split <- split(droplevels(gap), droplevels(gap)$country)
head(gap_split, 2)

#' Fit model to each country and get `coefs()`.
gap_split_coefs <- lapply(
gap_split,
function(df) {
coef(lm(lifeExp ~ yr1952, data = df))
}
)
head(gap_split_coefs, 2)

#' Now we need to put everything back togethers. Row bind the list of coefs.
#' Coerce from matrix back to data frame.
gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))

#' Restore `country` variable from row names.
gap_split_coefs$country <- rownames(gap_split_coefs)
str(gap_split_coefs)

#+ revert-to-alphabetical
ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
geom_point()

ggplot(gap_split_coefs, aes(x = yr1952, y = country)) +
geom_point()
#' We are back to the random order of countries.












gap <- gapminder %>%
filter(year %in% c(1952, 2007), continent != "Oceania") %>%
droplevels() %>%
select(continent, year, lifeExp) %>%
mutate(continent = fct_reorder2(continent, x = year, y = lifeExp)) %>%
arrange(continent, year)
View(gap)

levels(gap$continent)

ggplot(gap, aes(x = year, y = lifeExp, color = continent)) +
geom_jitter(width = 10) + geom_smooth(method = "lm", se = FALSE)

gap_nested <- gap %>%
group_by(continent) %>%
nest()
gap_nested

gap_nested$data[[1]]
t.test(lifeExp ~ year, data = gap_nested$data[[1]])

gap_tested <- gap_nested %>%
mutate(tt = map(data, ~ t.test(lifeExp ~ year, data = .x)))

gap_tested$tt[[1]]
gap_tested$tt[[1]][["statistic"]]

gap_tested <- gap_nested %>%
mutate(tt = map(data, ~ t.test(lifeExp ~ year, data = .x)),
tt = map_dbl(tt, "statistic"))
gap_tested

gap_split <- split(gap, gap$continent)
gap_split_tested <- lapply(
gap_split,
function(df) t.test(lifeExp ~ year, data = df)
)
gap_split_tested <- lapply(gap_split_tested, `[[`, "statistic")




gap <- gapminder %>%
mutate(country = fct_reorder2(country, x = year, y = lifeExp)) %>%
arrange(country, year)
View(filter(gap, year == 2007))


gap <- gapminder %>%
filter(country %in% c("Japan", "China", "Pakistan", "Afghanistan")) %>%
droplevels()

ggplot(gap, aes(x = year, y = lifeExp, color = country)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)


gap <- gapminder %>%
filter(country %in% c("Japan", "China", "Pakistan", "Afghanistan")) %>%
droplevels() %>%
mutate(
country = fct_reorder2(country, x = year, y = lifeExp),
yr1952 = year - 1952
)

ggplot(gap, aes(x = year, y = lifeExp, color = country)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

levels(gap$country)

#' Much better! Now we do more analyses, that require split-apply-combine.

gap_nested <- gap %>%
group_by(country) %>%
nest()

gap_fitted <- gap_nested %>%
mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))

gap_fitted$fit[[1]]

#' Uh-oh, we lost the order of the `country` factor, due to coercion from factor
#' to character (list and then row names).
#'
#' The `nest()` approach allows you to keep data as data vs. in attributes, such
#' as list or row names. Preserves factors and their levels or integer
#' variables. Designs away various opportunities for different pieces of the
#' dataset to get "out of sync" with each other, by leaving them in a data frame
#' at all times.
#'
#' First in an interesting series of blog posts exploring these patterns and
#' asking whether the tidyverse still needs a way to include the nesting
#' variable in the nested data:
#' <https://coolbutuseless.bitbucket.io/2018/03/03/split-apply-combine-my-search-for-a-replacement-for-group_by---do/>
Oops, something went wrong.

0 comments on commit 4f25dea

Please sign in to comment.