Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Sequential Summarise fixes issue #44 #89

Merged
merged 2 commits into from Oct 5, 2012

Conversation

Projects
None yet
2 participants

Summarise will now work sequentially, so you can use earlier summaries in
later summaries, fixes #44

@jimhester jimhester Summarise sequentially
Summarise will now work sequentially, so you can use earlier summaries in
later summaries
06d65d5

@hadley hadley and 1 other commented on an outdated diff Aug 8, 2012

R/helper-summarise.r
@@ -30,7 +29,15 @@ summarise <- function(.data, ...) {
names <- unname(unlist(lapply(match.call(expand = FALSE)$`...`, deparse)))
names(cols)[missing_names] <- names[missing_names]
}
-
- quickdf(cols)
+ cols <- cols[names(cols) != ""]
+ env <- new.env(parent=parent.frame())
+ for(name in names(.data)){
+ assign(name,.data[[name]],envir=env)
+ }
+ ret <- list()
+ for (col in names(cols)) {
+ ret[[col]] <- eval(cols[[col]], ret, env)
+ }
@hadley

hadley Aug 8, 2012

Owner

Hmmm, could you take an approach more similar to mutate?

@jimhester

jimhester Aug 8, 2012

I revised the patch to just convert the input data frame to a list to remove the ugly environment copying I was doing. This also improved performance, however the conversion still makes this version slower than the non iterative version. This version is much closer to the mutate implementation.

@jimhester jimhester Convert data frame to list
Rather than copying the data frame to the environment, convert the data frame
to a list, so that the new values can be added to it.
e51fc37
Owner

hadley commented Aug 8, 2012

Do you have any benchmarks before and after that you can share?

Here are some benchmarks, the performance in the first benchmark actually better than the original, the second only slightly worse, plus the second is somewhat of an edge case.

source('plyr/benchmark/data.r')
Summarize
ldply(list(bench4,bench5,bench6),function(x){ system.time(ddply(x,.(group_a,group_b),summarise,avu=mean(unif),avn=mean(norm)))[3]})
1   0.686
2   0.969
3   4.506
Summarize Iterative
ldply(list(bench4,bench5,bench6),function(x){ system.time(ddply(x,.(group_a,group_b),summarise_itr,avu=mean(unif),avn=mean(norm)))[3]})
1   0.895
2   1.072
3   4.248

slow example from #2

n<-100000
grp1<-sample(1:750, n, replace=T)
grp2<-sample(1:750, n, replace=T)
d<-data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)

sizes<-seq(1e4,1e5,1e4)
names(sizes)<-sizes
Summarize
ldply(sizes,function(x){ system.time(ddply(head(d,x),.(grp1,grp2),summarise,avx = mean(x), avy=mean(y)))[3]})
     .id elapsed
1  10000   6.704
2  20000  16.027
3  30000  27.096
4  40000  36.005
5  50000  52.887
6  60000  64.602
7  70000  83.536
8  80000 100.781
9  90000 120.673
10 1e+05 148.256
Summarize Iterative
ldply(sizes,function(x){ system.time(ddply(head(d,x),.(grp1,grp2),summarise_itr,avx = mean(x), avy=mean(y)))[3]})
     .id elapsed
1  10000   6.964
2  20000  16.067
3  30000  27.254
4  40000  39.707
5  50000  53.886
6  60000  70.211
7  70000  86.588
8  80000 106.581
9  90000 133.271
10 1e+05 155.562
Owner

hadley commented Aug 9, 2012

I had a play around to see if I could make it any faster but I didn't see any particular improvements. One small trick is to avoid a copy when converting to a list:

  # Convert to a list in place (no copy)
  attr(.data, "class") <- NULL
  attr(.data, "row.names") <- NULL

but that marginal impact on performance for these benchmarks (because the data sets being summarised are small)

@hadley hadley added a commit that referenced this pull request Oct 5, 2012

@hadley hadley Merge pull request #89 from jimhester/master
Sequential Summarise fixes issue #44
0c4baa2

@hadley hadley merged commit 0c4baa2 into hadley:master Oct 5, 2012

Owner

hadley commented Oct 5, 2012

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment