Flow for multiple group by + summarise steps #55

sharlagelfand · 2021-05-25T14:35:53Z

Want to test out if it's possible to do group_by -> summarise -> group_by -> summarise (or e.g. group_by -> summarise -> summarise) - @jhofman will provide an example

jhofman · 2021-05-25T18:00:29Z

@sharlagelfand: There were too many observations in the bike data, so here's an artificial but hopefully still interesting one: take a few famous baseball players, compute their batting average for each year they played, noting the team they played for, and then look at their median batting average over their time with each team.

library(dplyr)
library(ggplot2)

plyr::baseball %>%
  filter(id == "ruthba01" | id == "cobbty01" | id == "hornsro01") %>%
  group_by(id, team, year) %>%
  summarize(ba = h / ab) %>%
  group_by(id, team) %>%
  summarize(median_ba = median(ba)) %>%
  ggplot(aes(x = id, y = median_ba, color = team)) +
  geom_point(position = position_dodge(width = 0.25)) +
  labs(x = "Player", y = "Median batting average over time with each team")

I don't love the styling of this plot, but perhaps it's enough to get started with?

sharlagelfand · 2021-05-25T18:39:04Z

Thanks @jhofman! This actually brings up another question about how to handle summary operations that combine multiple variables, e.g. ba = h / ab - right now we don't have a way to show the distributions of two variables or how the relationship between them derives a new variable... I'll create an issue for that, and see if we can come up with an example that just does multiple steps without running into the "derived from multiple variables" problem for now
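One possible shape for such an example, offered only as a sketch and not something settled in this thread: keep the same players and the same two group_by + summarise steps, but summarise a single column (hits, `h`) at each step so no variable is derived from two others. The names `hits`, `median_hits` and `median_hits_by_team` are made up for illustration.

```r
library(dplyr)

# Two group_by + summarise steps that each use only one variable.
median_hits_by_team <- plyr::baseball %>%
  filter(id %in% c("ruthba01", "cobbty01", "hornsro01")) %>%
  # Step 1: total hits per player, team and year
  group_by(id, team, year) %>%
  summarize(hits = sum(h), .groups = "drop") %>%
  # Step 2: median of those yearly totals per player and team
  group_by(id, team) %>%
  summarize(median_hits = median(hits), .groups = "drop")
```

The result has one row per player and team, so it could feed the same ggplot call as the original example with `median_hits` in place of `median_ba`.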

jhofman · 2021-05-27T15:07:16Z

noting two things:

funny enough, batting averages are a good example of where Simpson's paradox pops up, because the number of at-bats differs from season to season (see here).
the first group-by + summarize in the example i created was kind of silly, it could just be a mutate.
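Both points can be illustrated with a small sketch. The data below is entirely made up (two hypothetical players, A and B, with invented hits and at-bats), chosen so that the paradox appears; it is not taken from plyr::baseball. Since there is one row per player-season, the per-season average needs only a mutate, and the pooled average weights each season by its at-bats.

```r
library(dplyr)

# Made-up numbers for two hypothetical players, one row per player-season.
seasons <- tibble(
  player = c("A", "A", "B", "B"),
  year   = c(1, 2, 1, 2),
  h      = c(4, 25, 35, 2),
  ab     = c(10, 100, 100, 10)
)

# One row per player-season already, so mutate is enough here;
# no group_by + summarize is needed for the per-season average.
per_season <- seasons %>%
  mutate(ba = h / ab)

# Per player: pooled average (total hits over total at-bats)
# next to the median of the per-season averages.
pooled <- seasons %>%
  group_by(player) %>%
  summarize(
    pooled_ba = sum(h) / sum(ab),
    median_ba = median(h / ab)
  )
```

In this toy data, player A has the higher average in each season and the higher median, yet player B has the higher pooled average, because each player's best season falls in a year with very different numbers of at-bats.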

jhofman · 2021-07-15T13:40:53Z

Snoozing this until we make progress on #62 for multiple variable manipulations.

sharlagelfand mentioned this issue May 25, 2021

Handle derivations from multiple variables #62

Closed

jhofman assigned jhofman, dggoldst, giorgi-ghviniashvili and sharlagelfand Jul 15, 2021

jhofman added the snoozed label Jul 15, 2021
