Group by adjacent values (i.e. no sort) #237

Closed
hadley opened this issue Feb 3, 2014 · 8 comments
Labels
feature a feature request or enhancement

@hadley
Member

hadley commented Feb 3, 2014

A la http://stackoverflow.com/questions/21511257. This would be used to work with only adjacent groups. It's easy enough to do with cumsum() and diff(), but probably not very efficient, and definitely not expressive.

@romainfrancois
Member

Not sure I understand what is needed here. I can't really make too much sense of the SO thread.

@hadley
Member Author

hadley commented Feb 4, 2014

Basically we need a function with a good name that does this:

group <- function(x) cumsum(c(1, diff(x) != 0))
group(c(2,2,3,4,2))

i.e. it increments the group number every time the value changes.

or

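Rcpp::cppFunction("IntegerVector group(NumericVector x) {
  int n = x.size();
  IntegerVector y(n);
  if (n == 0) return y;  // nothing to label; avoids writing y[0] out of bounds

  // Group ids start at 1 and increase whenever the value changes
  int grp = 1;
  y[0] = 1;
  for (int i = 1; i < n; ++i) {
    if (x[i] != x[i - 1]) grp++;
    y[i] = grp;
  }

  return y;
}")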

but it needs to work for all basic vector types.
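For illustration only, here is a minimal base R sketch of the same idea that compares each element with its predecessor instead of calling diff(), so it also accepts character, logical and factor input. The name adjacent_id is hypothetical and not part of dplyr. One assumption to note: an NA in the input makes the comparison NA, and every id from that position onwards becomes NA.

```r
# Hypothetical helper (not a dplyr function): label runs of adjacent equal values.
adjacent_id <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  # TRUE wherever an element differs from the one just before it
  changed <- x[-1L] != x[-n]
  # The first element always starts group 1; each change starts a new group
  cumsum(c(TRUE, changed))
}

adjacent_id(c(2, 2, 3, 4, 2))        # 1 1 2 3 4
adjacent_id(c("a", "a", "b", "a"))   # 1 1 2 3
```

Unlike the cumsum(c(1, diff(x) != 0)) version, this does not rely on subtraction, which is what restricts diff() to numeric-like input.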

@romainfrancois
Member

Thanks. I understand better now. Could this be one of these new ways to group? Something like:

data %.% adjacent_group_by(x, y, z )

This way we could then summarise, etc ...

@hadley
Member Author

hadley commented Feb 4, 2014

Ooh, yes, and it would be a really nice performance optimisation if the data was already sorted into groups.

@hadley
Member Author

hadley commented Feb 5, 2014

Maybe group_by_adj()?

@romainfrancois
Member

Some initial code:

> df <- data.frame( x = 1, a = rep(c(1,2), each = 4), b = rep( letters[1:2], each = 2), stringsAsFactors = TRUE )
> gdf <- grouped_df_adj_impl( df, list( quote(x), quote(a), quote(b) ), FALSE )
> summarise(gdf, n = n() )
Source: local data frame [4 x 4]
Groups: x, a

  x a b n
1 1 1 a 2
2 1 1 b 2
3 1 2 a 2
4 1 2 b 2
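As a rough point of comparison, the same counts can be sketched in base R without grouped_df_adj_impl(). This is not the implementation discussed here, only an illustration of grouping by adjacent rows across several columns: a new run starts whenever any of the grouping columns changes from the previous row. The helper changed_col and the variables run_id and out are made up for this sketch, and it assumes the data frame has at least one row and no NA in the grouping columns.

```r
df <- data.frame(x = 1, a = rep(c(1, 2), each = 4),
                 b = rep(letters[1:2], each = 2), stringsAsFactors = TRUE)

# Hypothetical helper: TRUE where a column differs from the previous row.
changed_col <- function(v) c(TRUE, v[-1L] != v[-length(v)])

# A row starts a new run if ANY grouping column changed.
changed <- Reduce(`|`, lapply(df[c("x", "a", "b")], changed_col))
run_id <- cumsum(changed)

# Keep the first row of each run and attach the run length as n.
out <- cbind(df[!duplicated(run_id), c("x", "a", "b")], n = tabulate(run_id))
rownames(out) <- NULL
out
```

Because the example data is already sorted, the runs coincide with what an ordinary group_by(x, a, b) would produce; the two only differ when equal keys are separated by other rows.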

As for group_by_adj I guess we can have it on the R side as a sibling to group_by. Maybe I'll let you deal with this @hadley.

Indeed, one interesting thing with these is that for a given group, all data is adjacent. This could lead to interesting optimisations, essentially bringing back what we had when group_by used to arrange the data. I would not make these optimisations a priority though, as they are likely to need quite some code and care. Perhaps it is best to first experiment with various grouping strategies.

@hadley hadley added this to the v0.2 milestone Mar 17, 2014
@hadley hadley self-assigned this Mar 17, 2014
@hadley hadley modified the milestones: 0.3, v0.2 Apr 7, 2014
@hadley hadley modified the milestones: 0.3, 0.3.1 Aug 1, 2014
@hadley hadley modified the milestones: 0.3.1, 0.4 Nov 18, 2014
@hadley hadley changed the title Function to generate "group" number Group by adjacent values (i.e. no sort) Oct 22, 2015
@hadley
Member Author

hadley commented Mar 1, 2016

This doesn't seem so important to me now. We might come back to it again if we invest more in performance.

@hadley hadley closed this as completed Mar 1, 2016
@cj-wilson

Is there a status or up-vote for this feature? I would like the ability to capture the next group in a group_by sequence and use that in a summarise or mutate. Syntax aside, it would be nice to get lead_by(.x, type = group) or lead(group_id + 1) to capture the next group.

I have solved this in data.table with .I before, and for similarly shaped data frames with rle() and custom functions, but both get ugly switching back and forth.

What I have are long-running tasks where I'm trying to create a rollup date.

df <- data.frame(polling_date = c(rep(as.Date("2016-10-16"), 3),
                                  rep(as.Date("2016-10-17"), 3),
                                  rep(as.Date("2016-11-18"), 3)),
                 task_id = c(rep(1, 6), rep(2, 3)))

  polling_date task_id
1   2016-10-16       1
2   2016-10-16       1
3   2016-10-16       1
4   2016-10-17       1
5   2016-10-17       1
6   2016-10-17       1
7   2016-11-18       2
8   2016-11-18       2
9   2016-11-18       2

Ideally I'd like to be able to lead to the next group and mutate a completed_date column to look like below.

polling_date task_id completed_date
1 2016-10-16 1 2016-10-17
2 2016-10-16 1 2016-10-17
3 2016-10-16 1 2016-10-17
4 2016-10-17 1 2016-10-17
5 2016-10-17 1 2016-10-17
6 2016-10-17 1 2016-10-17
7 2016-11-18 2 2016-11-18
8 2016-11-18 2 2016-11-18
9 2016-11-18 2 2016-11-18

But a naive group_by() and lead() does not work.

df %>% group_by(polling_date) %>%
  mutate(completed_date = as.Date(ifelse(lead(polling_date) <= polling_date + lubridate::days(1),
                                         lead(polling_date),
                                         polling_date)))

polling_date task_id completed_date
1 2016-10-16 1 2016-10-16
2 2016-10-16 1 2016-10-16
3 2016-10-16 1 NA
4 2016-10-17 1 2016-10-17
5 2016-10-17 1 2016-10-17
6 2016-10-17 1 NA
7 2016-11-18 2 2016-11-18
8 2016-11-18 2 2016-11-18
9 2016-11-18 2 NA
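The naive version fails because lead() only looks ahead within a group, and every row inside a polling_date group carries the same date. A possible base R workaround, sketched here under assumptions, is to collapse the data to one date per adjacent run, shift that run-level vector by one, and map the result back to the rows. The names run_id, run_date, next_date and run_completed are made up for this sketch, and it assumes the rows are already ordered by polling_date and that task_id does not need to be checked separately.

```r
df <- data.frame(
  polling_date = as.Date(c(rep("2016-10-16", 3), rep("2016-10-17", 3),
                           rep("2016-11-18", 3))),
  task_id = c(rep(1, 6), rep(2, 3))
)

# Run id: increments whenever polling_date differs from the previous row.
n <- nrow(df)
run_id <- cumsum(c(TRUE, df$polling_date[-1L] != df$polling_date[-n]))

# One date per run, then the date of the following run (NA for the last run).
run_date <- df$polling_date[!duplicated(run_id)]
next_date <- c(run_date[-1L], as.Date(NA))

# Take the next run's date only when it is at most one day later.
use_next <- !is.na(next_date) & next_date <= run_date + 1
run_completed <- run_date
run_completed[use_next] <- next_date[use_next]

# Map the run-level result back onto the original rows.
df$completed_date <- run_completed[run_id]
df
```

For the example data this yields 2016-10-17 for rows 1 to 6 and 2016-11-18 for rows 7 to 9, matching the desired table above.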

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018