Group by adjacent values (i.e. no sort) #237

hadley · 2014-02-03T15:43:48Z

À la http://stackoverflow.com/questions/21511257. Used for working with adjacent groups only. Easy enough to do with cumsum() and diff(), but probably not very efficient, and definitely not expressive.

romainfrancois · 2014-02-04T22:10:23Z

Not sure I understand what is needed here. I can't really make too much sense of the SO thread.

hadley · 2014-02-04T22:50:55Z

Basically we need a function with a good name that does this:

group <- function(x) cumsum(c(1, diff(x) != 0))
group(c(2,2,3,4,2))

i.e. it increments the group number every time the value changes.

or

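Rcpp::cppFunction("IntegerVector group(NumericVector x) {
  int n = x.size();
  IntegerVector y(n);
  if (n == 0) return y;  // empty input: avoid writing y[0] out of bounds

  // Group ids start at 1 and increase each time the value changes
  int grp = 1;
  y[0] = 1;
  for (int i = 1; i < n; ++i) {
    if (x[i] != x[i - 1]) grp++;
    y[i] = grp;
  }

  return y;
}")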

but it needs to work for all basic vector types.
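For illustration, a base R sketch of the same idea that does not rely on diff(), so it also accepts character, factor, logical and Date vectors. The name group_adjacent() is hypothetical, and the sketch assumes the input has no missing values (an NA would propagate into the result).

```r
# Hypothetical helper: label runs of adjacent equal values with 1, 2, 3, ...
# Assumes x is an atomic vector (or factor/Date) without NA values.
group_adjacent <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  # TRUE wherever an element differs from the one before it
  changed <- x[-1L] != x[-n]
  # The first element always starts group 1
  cumsum(c(TRUE, changed))
}

group_adjacent(c(2, 2, 3, 4, 2))
# 1 1 2 3 4
group_adjacent(c("a", "a", "b", "a"))
# 1 1 2 3
```

Comparing each element with its predecessor via != is what lets this work beyond numeric input, since diff() is not defined for character vectors.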

romainfrancois · 2014-02-04T22:57:46Z

Thanks. I understand better now. Could this be one of these new ways to group? Something like:

data %.% adjacent_group_by(x, y, z )

This way we could then summarise, etc ...
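To make the multi-column case concrete, here is a rough base R sketch of what grouping by adjacent values of several columns could compute. adjacent_group_id() and the example data are hypothetical and are not part of dplyr; missing values are assumed absent.

```r
# Hypothetical helper: run id over several columns of a data frame.
# A new group starts whenever ANY of the grouping columns changes.
adjacent_group_id <- function(data, cols) {
  stopifnot(length(cols) > 0L)
  n <- nrow(data)
  if (n == 0L) return(integer(0))
  changed <- Reduce(`|`, lapply(data[cols], function(col) col[-1L] != col[-n]))
  cumsum(c(TRUE, changed))
}

# Made-up example data
data <- data.frame(
  x = c(1, 1, 1, 1, 2, 2),
  y = c("a", "a", "b", "a", "a", "a"),
  value = 1:6
)

id <- adjacent_group_id(data, c("x", "y"))
id
# 1 1 2 3 4 4

# One summary value per run, in the original row order
tapply(data$value, id, sum)
```

Note that rows 1-2 and row 4 share the same (x, y) values but end up in different groups, because they are not adjacent. That is the difference from an ordinary group_by().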

hadley · 2014-02-04T23:10:41Z

Ooh, yes, and it would be a really nice performance optimisation if the data was already sorted into groups.

hadley · 2014-02-05T03:40:37Z

Maybe group_by_adj()?

romainfrancois · 2014-02-14T13:58:20Z

Some initial code:

> df <- data.frame( x = 1, a = rep(c(1,2), each = 4), b = rep( letters[1:2], each = 2), stringsAsFactors = TRUE )
> gdf <- grouped_df_adj_impl( df, list( quote(x), quote(a), quote(b) ), FALSE )
> summarise(gdf, n = n() )
Source: local data frame [4 x 4]
Groups: x, a

  x a b n
1 1 1 a 2
2 1 1 b 2
3 1 2 a 2
4 1 2 b 2

As for group_by_adj I guess we can have it on the R side as a sibling to group_by. Maybe I'll let you deal with this @hadley.

Indeed, one interesting thing with these is that for a given group, all data is adjacent. This could lead to interesting optimizations, essentially bringing back what we had when group_by used to arrange the data. I would not make these optimisations a priority though, as they are likely to need quite some code and care. Perhaps it is best to first experiment with various grouping strategies.
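One way to picture the adjacency property: each group is a contiguous range of rows, so it can be described by a start and an end index rather than a vector of row positions. A rough base R sketch of that representation follows; run_bounds() is a hypothetical name and says nothing about how dplyr stores groups internally.

```r
# Hypothetical helper: start/end row index of each run of adjacent equal values.
# Assumes x has no NA values.
run_bounds <- function(x) {
  n <- length(x)
  if (n == 0L) return(data.frame(start = integer(0), end = integer(0)))
  # A run starts at position 1 and wherever the value changes
  starts <- which(c(TRUE, x[-1L] != x[-n]))
  # Each run ends just before the next one starts; the last run ends at n
  ends <- c(starts[-1L] - 1L, n)
  data.frame(start = starts, end = ends)
}

x <- c(2, 2, 3, 4, 2)
b <- run_bounds(x)

# Summarise each run by slicing a contiguous range
run_sums <- vapply(
  seq_len(nrow(b)),
  function(i) sum(x[b$start[i]:b$end[i]]),
  numeric(1)
)
run_sums
# 4 3 4 2
```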

hadley · 2016-03-01T19:35:47Z

This doesn't seem so important to me now. We might come back to it again if we invest more in performance.

cj-wilson · 2016-11-15T16:44:44Z

Is there a status or up-vote on this feature? I would like the ability to capture the next group in a group_by sequence and use that in a summarise or mutate. Syntax aside, it would be nice to get lead_by(.x, type = group) or lead(group_id + 1) to capture the next group.

I have solved this in data.table with .I before, and for similarly shaped data frames with rle() and custom functions, but both get ugly when switching back and forth.

What I have are long-running tasks where I'm trying to create a rollup date.

df <- data.frame(polling_date = c(rep(as.Date("2016-10-16"), 3),
                                  rep(as.Date("2016-10-17"), 3),
                                  rep(as.Date("2016-11-18"), 3)),
                 task_id = c(rep(1, 6), rep(2, 3)))

polling_date task_id
1 2016-10-16 1
2 2016-10-16 1
3 2016-10-16 1
4 2016-10-17 1
5 2016-10-17 1
6 2016-10-17 1
7 2016-11-18 2
8 2016-11-18 2
9 2016-11-18 2

Ideally I'd like to be able to lead to the next group and mutate a completed_date column to look like below.

polling_date task_id completed_date
1 2016-10-16 1 2016-10-17
2 2016-10-16 1 2016-10-17
3 2016-10-16 1 2016-10-17
4 2016-10-17 1 2016-10-17
5 2016-10-17 1 2016-10-17
6 2016-10-17 1 2016-10-17
7 2016-11-18 2 2016-11-18
8 2016-11-18 2 2016-11-18
9 2016-11-18 2 2016-11-18

But a naive group_by() and lead() does not work.

df %>%
  group_by(polling_date) %>%
  mutate(completed_date = as.Date(ifelse(lead(polling_date) <= polling_date + lubridate::days(1),
                                         lead(polling_date),
                                         polling_date)))

polling_date task_id completed_date

1 2016-10-16 1 2016-10-16
2 2016-10-16 1 2016-10-16
3 2016-10-16 1
4 2016-10-17 1 2016-10-17
5 2016-10-17 1 2016-10-17
6 2016-10-17 1
7 2016-11-18 2 2016-11-18
8 2016-11-18 2 2016-11-18
9 2016-11-18 2
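For comparison, a base R sketch that produces the desired completed_date by working on runs of adjacent polling_date values: take one date per run, look at the next run's date, then expand back to rows. This is only one possible approach under stated assumptions. The one-day rule is taken from the attempt above, task_id is ignored, and the sketch looks only one run ahead, so it does not chain across three or more consecutive days.

```r
df <- data.frame(
  polling_date = rep(as.Date(c("2016-10-16", "2016-10-17", "2016-11-18")), each = 3),
  task_id = c(rep(1, 6), rep(2, 3))
)

n <- nrow(df)
# Run id: increments each time polling_date changes between adjacent rows
run_id <- cumsum(c(TRUE, df$polling_date[-1L] != df$polling_date[-n]))

# One date per run, and the date of the following run (NA for the last run)
run_date <- df$polling_date[!duplicated(run_id)]
next_date <- c(run_date[-1L], as.Date(NA))

# Roll forward to the next run's date only if it is at most one day later
use_next <- !is.na(next_date) & next_date <= run_date + 1
completed <- run_date
completed[use_next] <- next_date[use_next]

# Expand the per-run result back to one value per row
df$completed_date <- completed[run_id]
df
```

Indexing the per-run vector by run_id is what stands in for "lead to the next group" here: lead() inside group_by() only sees rows of the current group, so it can never reach the next group's date.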

romainfrancois added a commit that referenced this issue Feb 14, 2014

some back end code for #237

b26c17c

romainfrancois added a commit that referenced this issue Feb 14, 2014

refine adj grouping with the grouped_df_adj_impl function. #237

4468fee

hadley added this to the v0.2 milestone Mar 17, 2014

hadley added the enhancement label Mar 17, 2014

hadley self-assigned this Mar 17, 2014

hadley modified the milestones: 0.3, v0.2 Apr 7, 2014

hadley modified the milestones: 0.3, 0.3.1 Aug 1, 2014

hadley modified the milestones: 0.3.1, 0.4 Nov 18, 2014

hadley changed the title ~~Function to generate "group" number~~ Group by adjacent values (i.e. no sort) Oct 22, 2015

hadley closed this as completed Mar 1, 2016

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
