Sequential Summarise fixes issue #44 #89

Merged
merged 2 commits into from

2 participants

@jimhester

Summarise will now work sequentially, so you can use earlier summaries in
later summaries, fixes #44
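
As a rough illustration of what "sequentially" means here, the sketch below uses a hypothetical base-R stand-in, `summarise_seq()`, rather than plyr itself. It assumes every summary is named, and it uses `as.data.frame()` where plyr uses `quickdf()`. The name `summarise_seq` and the `scores` data are made up for the example.

```r
# Hypothetical stand-in for the patched summarise(); base R only.
summarise_seq <- function(.data, ...) {
  # Capture the summary expressions without evaluating them.
  cols <- as.list(substitute(list(...))[-1])
  .data <- as.list(.data)
  for (col in names(cols)) {
    # Each result is stored back in .data, so later expressions can see it.
    .data[[col]] <- eval(cols[[col]], .data, parent.frame())
  }
  # as.data.frame() stands in for plyr's quickdf() in this sketch.
  as.data.frame(.data[names(cols)])
}

scores <- data.frame(value = c(2, 4, 6, 8))

# `spread` reuses `avg`, which was computed one step earlier.
result <- summarise_seq(scores,
  avg    = mean(value),
  spread = max(value) - avg
)
print(result)
```

With a non-sequential summarise, `avg` would not be visible while `spread` is evaluated, so the second expression would fail unless an `avg` happened to exist in the calling environment.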

@jimhester jimhester Summarise sequentially
Summarise will now work sequentially, so you can use earlier summaries in
later summaries
06d65d5
R/helper-summarise.r
@@ -30,7 +29,15 @@ summarise <- function(.data, ...) {
     names <- unname(unlist(lapply(match.call(expand = FALSE)$`...`, deparse)))
     names(cols)[missing_names] <- names[missing_names]
   }
-
-  quickdf(cols)
+  cols <- cols[names(cols) != ""]
+  env <- new.env(parent=parent.frame())
+  for(name in names(.data)){
+    assign(name,.data[[name]],envir=env)
+  }
+  ret <- list()
+  for (col in names(cols)) {
+    ret[[col]] <- eval(cols[[col]], ret, env)
+  }
@hadley Owner
hadley added a note

Hmmm, could you take an approach more similar to mutate?

I revised the patch to convert the input data frame to a list, which removes the ugly environment copying I was doing. This also improved performance, although the conversion still makes this version slower than the non-iterative version. This version is much closer to the mutate implementation.
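
For readers comparing the two commits, here is a simplified sketch of the difference between the two lookup strategies. It is not the patch itself: the first commit also passed the accumulated results to `eval()`, and both commits used `parent.frame()` where this sketch uses `globalenv()`. The data frame `df` and the expression are invented for the example.

```r
df <- data.frame(x = c(1, 2, 3))
expr <- quote(sum(x) / length(x))

# Approach of the first commit (simplified): copy every column into a
# fresh environment, then evaluate the expression there.
env <- new.env(parent = globalenv())
for (name in names(df)) {
  assign(name, df[[name]], envir = env)
}
via_env <- eval(expr, env)

# Revised approach: hand eval() the data as a list. Names are looked up in
# the list first, then in the enclosing environment (for sum(), length()).
via_list <- eval(expr, as.list(df), globalenv())

print(via_env)
print(via_list)
```

Both routes resolve `x` to the column and give the same value; the list route skips the per-column `assign()` loop.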

@jimhester jimhester Convert data frame to list
Rather than copying the data frame to the environment, convert the data frame
to a list, so that the new values can be added to it.
e51fc37
@hadley
Owner

Do you have any benchmarks before and after that you can share?

@jimhester

Here are some benchmarks. In the first, the iterative version is actually faster than the original on bench6, though slower on bench4 and bench5. In the second it is only slightly worse, and that case is somewhat of an edge case anyway.

source('plyr/benchmark/data.r')
Summarize
ldply(list(bench4, bench5, bench6), function(x) {
  system.time(ddply(x, .(group_a, group_b), summarise, avu = mean(unif), avn = mean(norm)))[3]
})
1   0.686
2   0.969
3   4.506
Summarize Iterative
ldply(list(bench4, bench5, bench6), function(x) {
  system.time(ddply(x, .(group_a, group_b), summarise_itr, avu = mean(unif), avn = mean(norm)))[3]
})
1   0.895
2   1.072
3   4.248

slow example from #2

n <- 100000
grp1 <- sample(1:750, n, replace = TRUE)
grp2 <- sample(1:750, n, replace = TRUE)
d <- data.frame(x = rnorm(n), y = rnorm(n), grp1 = grp1, grp2 = grp2)

sizes <- seq(1e4, 1e5, 1e4)
names(sizes) <- sizes
Summarize
ldply(sizes, function(x) {
  system.time(ddply(head(d, x), .(grp1, grp2), summarise, avx = mean(x), avy = mean(y)))[3]
})
     .id elapsed
1  10000   6.704
2  20000  16.027
3  30000  27.096
4  40000  36.005
5  50000  52.887
6  60000  64.602
7  70000  83.536
8  80000 100.781
9  90000 120.673
10 1e+05 148.256
Summarize Iterative
ldply(sizes, function(x) {
  system.time(ddply(head(d, x), .(grp1, grp2), summarise_itr, avx = mean(x), avy = mean(y)))[3]
})
     .id elapsed
1  10000   6.964
2  20000  16.067
3  30000  27.254
4  40000  39.707
5  50000  53.886
6  60000  70.211
7  70000  86.588
8  80000 106.581
9  90000 133.271
10 1e+05 155.562
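
To put a number on "slightly worse", the elapsed times from the two tables above can be compared directly. This snippet only re-uses the figures already reported; the variable names are made up for the example.

```r
sizes <- seq(1e4, 1e5, 1e4)

# Elapsed seconds copied from the two tables above.
original <- c(6.704, 16.027, 27.096, 36.005, 52.887,
              64.602, 83.536, 100.781, 120.673, 148.256)
iterative <- c(6.964, 16.067, 27.254, 39.707, 53.886,
               70.211, 86.588, 106.581, 133.271, 155.562)

# Percentage by which the iterative version is slower at each size.
slowdown_pct <- round(100 * (iterative / original - 1), 1)
print(data.frame(rows = sizes, slowdown_pct))
```

On these single runs the iterative version is slower at every size, by somewhere between under 1% and roughly 10%. With one timing per size, differences of this magnitude could partly be noise.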
@hadley
Owner

I had a play around to see if I could make it any faster but I didn't see any particular improvements. One small trick is to avoid a copy when converting to a list:

  # Convert to a list in place (no copy)
  attr(.data, "class") <- NULL
  attr(.data, "row.names") <- NULL

but that has only a marginal impact on performance for these benchmarks (because the data sets being summarised are small).
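
A small sketch of what the attribute trick does to a data frame, using an invented `df`. It only checks the resulting structure; whether a copy is actually avoided depends on R's internals and is not measured here.

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Strip the attributes that make this object a data frame.
plain <- df
attr(plain, "class") <- NULL
attr(plain, "row.names") <- NULL

print(is.data.frame(plain))  # no longer a data frame
print(is.list(plain))        # but still a list of columns
print(names(plain))          # column names are kept
```

The columns themselves are untouched, so later code can index the result exactly like the output of `as.list()`.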

@hadley hadley merged commit 0c4baa2 into from
@hadley
Owner

Thanks!

Commits on Jul 26, 2012
  1. @jimhester

    Summarise sequentially

    jimhester authored
    Summarise will now work sequentially, so you can use earlier summaries in
    later summaries
Commits on Aug 8, 2012
  1. @jimhester

    Convert data frame to list

    jimhester authored
    Rather than copying the data frame to the environment, convert the data frame
    to a list, so that the new values can be added to it.
Showing with 7 additions and 4 deletions.
  1. +7 −4 R/helper-summarise.r
11 R/helper-summarise.r
@@ -18,8 +18,8 @@
 #' duration = max(year) - min(year),
 #' nteams = length(unique(team)))
 summarise <- function(.data, ...) {
-  cols <- eval(substitute(list(...)), .data, parent.frame())
-
+  cols <- as.list(substitute(list(...))[-1])
+
   # ... not a named list, figure out names by deparsing call
   if(is.null(names(cols))) {
     missing_names <- rep(TRUE, length(cols))
@@ -30,7 +30,10 @@ summarise <- function(.data, ...) {
     names <- unname(unlist(lapply(match.call(expand = FALSE)$`...`, deparse)))
     names(cols)[missing_names] <- names[missing_names]
   }
-
-  quickdf(cols)
+  .data <- as.list(.data)
+  for (col in names(cols)) {
+    .data[[col]] <- eval(cols[[col]], .data, parent.frame())
+  }
+  quickdf(.data[names(cols)])
 }
 summarize <- summarise
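
The key change in the first hunk is that the summary expressions are now captured unevaluated, so they can be evaluated one at a time later. The sketch below isolates that step in a hypothetical helper, `capture_dots()`. As a simplification it names unnamed arguments by deparsing the captured expressions directly, rather than going through `match.call()` as the diff does, and it assumes each expression deparses to a single string.

```r
capture_dots <- function(...) {
  # substitute(list(...)) yields the call list(<expr1>, <expr2>, ...);
  # dropping element 1 (the symbol `list`) leaves the raw expressions.
  cols <- as.list(substitute(list(...))[-1])

  # Work out which arguments were passed without a name.
  if (is.null(names(cols))) {
    missing_names <- rep(TRUE, length(cols))
  } else {
    missing_names <- names(cols) == ""
  }

  # Unnamed arguments get their own source text as a name.
  names(cols)[missing_names] <- vapply(cols[missing_names], deparse, character(1))
  cols
}

# `x` does not need to exist: nothing is evaluated at this point.
captured <- capture_dots(avg = mean(x), max(x))
print(names(captured))
print(class(captured$avg))
```

Because each element is still a call object, the loop in the second hunk can evaluate them in order against a list that grows as results are added.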