Flow for multiple group by + summarise steps #55

Open
sharlagelfand opened this issue May 25, 2021 · 4 comments

@sharlagelfand
Collaborator

Want to test whether it's possible to do group_by -> summarise -> group_by -> summarise (or e.g. group_by -> summarise -> summarise). @jhofman will provide an example.
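For reference, a minimal sketch of the two shapes mentioned above, on a made-up toy data frame (the `toy` data and its values are hypothetical, not from this project). It relies on dplyr's documented behaviour that `summarise()` with `.groups = "drop_last"` peels off only the last grouping variable, so a second `summarise()` can run without an explicit regroup:

```r
library(dplyr)

# Hypothetical toy data: several rows per id/team/year
toy <- data.frame(
  id   = c("a", "a", "a", "a", "b", "b"),
  team = c("X", "X", "X", "Y", "X", "X"),
  year = c(1, 1, 2, 3, 1, 2),
  h    = c(10, 20, 30, 40, 50, 60)
)

# Shape 1: group_by -> summarise -> group_by -> summarise
explicit <- toy %>%
  group_by(id, team, year) %>%
  summarise(h = sum(h), .groups = "drop") %>%
  group_by(id, team) %>%
  summarise(median_h = median(h), .groups = "drop")

# Shape 2: group_by -> summarise -> summarise
# The first summarise drops only `year`, so the result is still
# grouped by id and team when the second summarise runs.
chained <- toy %>%
  group_by(id, team, year) %>%
  summarise(h = sum(h), .groups = "drop_last") %>%
  summarise(median_h = median(h), .groups = "drop")
```

On this toy input both shapes give the same per-player, per-team medians, so the question for the flow is mostly how to show the implicit change in grouping between the two summarise steps.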

@jhofman
Contributor

jhofman commented May 25, 2021

@sharlagelfand: There were too many observations in the bike data, so here's an artificial but hopefully still interesting one: take a few famous baseball players, compute their batting average for each year they played, noting the team they played for, and then look at their median batting average over the time with that team.

library(dplyr)
library(ggplot2)

plyr::baseball %>%
  # keep three well-known players
  filter(id %in% c("ruthba01", "cobbty01", "hornsro01")) %>%
  # batting average for each player, team, and year
  group_by(id, team, year) %>%
  summarize(ba = h / ab) %>%
  # median of those yearly averages for each player and team
  group_by(id, team) %>%
  summarize(median_ba = median(ba)) %>%
  ggplot(aes(x = id, y = median_ba, color = team)) +
  geom_point(position = position_dodge(width = 0.25)) +
  labs(x = "Player", y = "Median batting average over time with each team")

I don't love the styling of this plot, but perhaps it's enough to get started with?

[image: plot of median batting average by player, colored by team]

@sharlagelfand
Collaborator Author

Thanks @jhofman! This actually brings up another question about how to handle summary operations that combine multiple variables, e.g. ba = h / ab. Right now we don't have a way to show the distributions of two variables or how the relationship between them produces a new variable... I'll create an issue for that, and see if we can come up with an example that just does multiple steps without running into the "derived from multiple variables" problem for now.

@jhofman
Contributor

jhofman commented May 27, 2021

noting two things:

  1. funny enough, batting averages are a good example of where Simpson's paradox pops up, because the number of at bats differs from season to season (see here).
  2. the first group-by + summarize in the example i created was kind of silly, it could just be a mutate.
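Both points can be sketched on a tiny illustrative data frame (the players `A` and `B` and their numbers are chosen for illustration and are not taken from `plyr::baseball`). With one row per player and season, a plain `mutate()` is enough for the per-season average, and comparing it with the pooled average shows the Simpson's paradox reversal:

```r
library(dplyr)

# Illustrative numbers: one row per player and season
seasons <- data.frame(
  id   = c("A", "A", "B", "B"),
  year = c(1, 2, 1, 2),
  h    = c(12, 183, 104, 45),
  ab   = c(48, 582, 411, 140)
)

# Point 2: with one row per player-season, mutate() replaces the
# first group_by() + summarise()
per_season <- seasons %>%
  mutate(ba = h / ab)

# Point 1: pooled average (total hits / total at bats) versus the
# median of the per-season averages
by_player <- seasons %>%
  group_by(id) %>%
  summarise(
    pooled_ba = sum(h) / sum(ab),
    median_ba = median(h / ab)
  )
```

Here B has the higher average in each season, and so the higher median, while A has the higher pooled average, because most of A's at bats fall in the stronger season. A median of yearly averages therefore does not have to agree with a career average.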

@jhofman
Contributor

jhofman commented Jul 15, 2021

Snoozing this until we make progress on #62 for multiple variable manipulations.
