Slow example #2

Closed
hadley opened this Issue Feb 24, 2010 · 6 comments

2 participants
@hadley
Owner

hadley commented Feb 24, 2010

library(plyr)

n <- 100000
grp1 <- sample(1:750, n, replace = TRUE)
grp2 <- sample(1:750, n, replace = TRUE)
d <- data.frame(x = rnorm(n), y = rnorm(n), grp1 = grp1, grp2 = grp2)

ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy = mean(y))

hadley commented May 5, 2010

This is a pathological example because there are so many groups: roughly 90,000 for 100,000 observations. Customised transform_by and summarise_by functions would be useful in this situation, because the output types are known in advance.
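
To make the group count concrete, here is a minimal base R sketch. It assumes the same simulated data as the opening post (the seed is an addition for reproducibility). It counts the distinct (grp1, grp2) pairs and then computes the means with vapply, which declares the output type up front. It only illustrates the "output types known in advance" idea; it is not the summarise_by function mentioned above, which is not shown in this thread.

```r
set.seed(1)  # added for reproducibility; not in the original post
n <- 100000
d <- data.frame(
  x = rnorm(n), y = rnorm(n),
  grp1 = sample(1:750, n, replace = TRUE),
  grp2 = sample(1:750, n, replace = TRUE)
)

# Number of distinct (grp1, grp2) combinations actually present in the data
n_groups <- nrow(unique(d[c("grp1", "grp2")]))

# One factor level per observed combination
g <- interaction(d$grp1, d$grp2, drop = TRUE)

# vapply states that every group yields exactly one double,
# so no per-group data frame has to be built and inspected
avx <- vapply(split(d$x, g), mean, numeric(1))
avy <- vapply(split(d$y, g), mean, numeric(1))
```

With most groups holding a single observation, nearly all of the work is per-group overhead rather than computing means.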

hadley commented Jun 28, 2010

The immutable data frame (idata.frame) helps somewhat here:

> system.time(ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
103.711  21.499 125.523 
> i <- idata.frame(d)
> system.time(ddply(i, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
 73.654   0.251  74.008 

hadley commented Jun 29, 2010

Naive use of ave is better, but using interaction directly is best:

> system.time({
+   d$avx <- ave(d$x, list(d$grp1, d$grp2))
+   d$avy <- ave(d$y, list(d$grp1, d$grp2))
+ })
   user  system elapsed 
 39.300   0.279  40.809 
> 
> system.time({
+   d$avx <- ave(d$x, interaction(d$grp1, d$grp2, drop = T))
+   d$avy <- ave(d$y, interaction(d$grp1, d$grp2, drop = T))
+ })
   user  system elapsed 
  6.735   0.209   7.064 
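
A small sketch on a reduced data set (the sizes and the seed are assumptions chosen so that it runs quickly) checks that both calls return the same per-row group means. A plausible reason for the timing gap, which the thread does not state, is that without drop = TRUE the grouping factor keeps a level for every possible grp1 by grp2 pair, most of them empty.

```r
set.seed(1)  # added for reproducibility
n <- 2000
d <- data.frame(
  x = rnorm(n),
  grp1 = sample(1:30, n, replace = TRUE),
  grp2 = sample(1:30, n, replace = TRUE)
)

# Grouping factor with every possible combination as a level
g_all <- interaction(d$grp1, d$grp2)

# Grouping factor restricted to combinations that actually occur
g_used <- interaction(d$grp1, d$grp2, drop = TRUE)

# Both forms return a vector as long as d$x holding each row's group mean
avx_list <- ave(d$x, list(d$grp1, d$grp2))
avx_drop <- ave(d$x, g_used)

same_result <- isTRUE(all.equal(avx_list, avx_drop))
```

Note that ave returns one value per row rather than one per group, so it corresponds to a transform rather than a summarise.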

dkulp2 commented Oct 16, 2012

I'm not sure if this is related or just exacerbated by the large number of groups. If the summary function is passed without an argument name, ddply is fast. Using summarize is OK, but it hangs for a long time at the end of the computation. If you pass the function as a named argument (avx=m), ddply typically runs so long and uses so much memory that it's not worth timing.

> m <- function(df) { mean(df$x) }
> system.time(foo <- ddply(d, .(grp1, grp2), m,.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
 45.826  21.846  69.497 
> system.time(foo <- ddply(d, .(grp1, grp2), avx=m,.progress='text'))
(USER INTERRUPT)
Timing stopped at: 97.059 51.123 150.527 
> system.time(foo <- ddply(d, .(grp1, grp2), summarise, avx = mean(x),.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
115.335  63.607 180.758 

hadley commented Oct 16, 2012

Yes, this is known: much of the overhead is in creating the data frames. (Because you've (incorrectly) named the argument in the second form, ddply uses the default function, identity, which will be very slow because it doesn't do any reduction.)
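
The argument-matching behaviour described here can be sketched in base R. The function apply_by_group below is a hypothetical stand-in written for this illustration, not plyr's implementation; its default simply returns each piece unchanged, which plays the role of identity.

```r
# Hypothetical split-apply function; NOT plyr's code
apply_by_group <- function(data, group, fun = function(piece, ...) piece, ...) {
  pieces <- split(data, data[[group]])
  lapply(pieces, fun, ...)
}

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 10))
m <- function(piece, ...) mean(piece$x)

# Positional: m is matched to `fun`, so each piece is reduced to one number
reduced <- apply_by_group(df, "g", m)

# Named with a name that is not a formal argument: avx = m is swallowed by `...`,
# `fun` keeps its default, and every piece comes back as a full data frame
unreduced <- apply_by_group(df, "g", avx = m)
```

In the second call nothing is reduced, so with tens of thousands of groups the result is tens of thousands of small data frames that then have to be combined.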

hadley commented Jan 2, 2014

See dplyr for a solution to this.
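
The thread does not show the dplyr code. A minimal sketch of the equivalent grouped summary, assuming dplyr 1.0 or later (for the .groups argument) and the same simulated data as the opening post, might look like this:

```r
library(dplyr)

set.seed(1)  # added for reproducibility; not in the original post
n <- 100000
d <- data.frame(
  x = rnorm(n), y = rnorm(n),
  grp1 = sample(1:750, n, replace = TRUE),
  grp2 = sample(1:750, n, replace = TRUE)
)

# One output row per observed (grp1, grp2) combination
res <- d %>%
  group_by(grp1, grp2) %>%
  summarise(avx = mean(x), avy = mean(y), .groups = "drop")
```

No timings are claimed here; wrap the pipeline in system.time() to compare it with the ddply figures above on your own machine.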

@hadley hadley closed this Jan 2, 2014
