Slow example #2

Closed
hadley opened this Issue February 24, 2010 · 6 comments

Hadley Wickham
Owner
library(plyr)

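# 100,000 rows with two grouping variables of 750 levels each
n <- 100000
grp1 <- sample(1:750, n, replace = TRUE)
grp2 <- sample(1:750, n, replace = TRUE)
d <- data.frame(x = rnorm(n), y = rnorm(n), grp1 = grp1, grp2 = grp2)

# Per-group means of x and y for every observed (grp1, grp2) pair
ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy = mean(y))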
hadley commented May 05, 2010

This is a pathological example because there are so many groups - ~90,000 for 100,000 observations. This is a situation where customised transform_by and summarise_by functions would be useful, because the output types are known in advance.
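As a rough check on the group count, the sketch below counts the distinct (grp1, grp2) pairs in data built the same way as the original example and compares that with the expected number of occupied cells when 100,000 draws fall uniformly into 750 x 750 possible pairs. The seed and the variable names `n_groups`, `n_possible`, and `expected` are assumptions added for illustration, not part of the issue.

```r
# Sketch: how many groups does the example actually produce?
set.seed(1)  # assumption: fixed only so the count is reproducible
n <- 100000
d <- data.frame(grp1 = sample(1:750, n, replace = TRUE),
                grp2 = sample(1:750, n, replace = TRUE))

# Observed number of distinct (grp1, grp2) combinations
n_groups <- nrow(unique(d[c("grp1", "grp2")]))

# Expected number of occupied cells for n uniform draws over n_possible cells
n_possible <- 750 * 750
expected <- n_possible * (1 - (1 - 1 / n_possible)^n)

n_groups
round(expected)
```

Both numbers should land a little above 90,000, so most groups hold a single observation and ddply pays its per-group overhead almost once per row.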

hadley commented June 28, 2010

The immutable data frame (idata.frame) helps somewhat here:

> system.time(ddply(d, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
103.711  21.499 125.523 
> i <- idata.frame(d)
> system.time(ddply(i, .(grp1, grp2), summarise, avx = mean(x), avy=mean(y)))
   user  system elapsed 
 73.654   0.251  74.008 
hadley commented June 28, 2010

Naive use of ave is better, but using interaction directly is best:

> system.time({
+   d$avx <- ave(d$x, list(d$grp1, d$grp2))
+   d$avy <- ave(d$y, list(d$grp1, d$grp2))
+ })
   user  system elapsed 
 39.300   0.279  40.809 
> 
> system.time({
+   d$avx <- ave(d$x, interaction(d$grp1, d$grp2, drop = T))
+   d$avy <- ave(d$y, interaction(d$grp1, d$grp2, drop = T))
+ })
   user  system elapsed 
  6.735   0.209   7.064 
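Note that ave returns one value per row, whereas the ddply call returns one row per group. A minimal base R sketch of the same interaction idea that yields a per-group summary is given below; it uses a small made-up data set, and the names `g`, `first`, `keys`, and `summary_df` are illustrative assumptions rather than anything from the thread.

```r
# Sketch: per-group means via interaction() + tapply(), one row per group
set.seed(1)  # assumption: fixed only for reproducibility
n <- 1000
d <- data.frame(x = rnorm(n), y = rnorm(n),
                grp1 = sample(1:5, n, replace = TRUE),
                grp2 = sample(1:5, n, replace = TRUE))

# One factor level per observed (grp1, grp2) combination
g <- interaction(d$grp1, d$grp2, drop = TRUE)

# First row of each group supplies the key columns
first <- !duplicated(g)
keys <- d[first, c("grp1", "grp2")]
key_levels <- as.character(g[first])

# tapply() returns means named by level; index by name to align with keys
summary_df <- data.frame(
  keys,
  avx = as.vector(tapply(d$x, g, mean)[key_levels]),
  avy = as.vector(tapply(d$y, g, mean)[key_levels]),
  row.names = NULL
)

head(summary_df)
```

This has not been timed against the transcripts above, so treat it as an illustration of the approach rather than a benchmark.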
dkulp2

I'm not sure if this is related or just exacerbated by the large number of groups. If the summary function is passed as an unnamed argument, it is fast. Using summarise is OK, but it hangs for a long time at the end of the computation. If you pass the function as a named argument, ddply typically runs so long and uses so much memory that it's not worth timing.

> m <- function(df) { mean(df$x) }
> system.time(foo <- ddply(d, .(grp1, grp2), m,.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
 45.826  21.846  69.497 
> system.time(foo <- ddply(d, .(grp1, grp2), avx=m,.progress='text'))
(USER INTERRUPT)
Timing stopped at: 97.059 51.123 150.527 
> system.time(foo <- ddply(d, .(grp1, grp2), summarise, avx = mean(x),.progress='text'))
  |========================================================================| 100%
   user  system elapsed 
115.335  63.607 180.758 
Hadley Wickham
Owner

Yes, this is known; much of the overhead is in creating the data frames. (Because you've (incorrectly) named the argument in the second form, ddply uses the default function, identity, which will be very slow because it doesn't do any reduction.)
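To illustrate the point about argument naming, here is a small sketch of two ways to hand a summary function to ddply so that it is actually used: positionally, or through the `.fun` argument. Returning a one-row data frame is one way to control the output column name. The toy data and the names `res_positional` and `res_named` are assumptions for illustration.

```r
library(plyr)

set.seed(1)  # assumption: fixed only for reproducibility
d <- data.frame(x = rnorm(200),
                grp1 = sample(1:3, 200, replace = TRUE),
                grp2 = sample(1:3, 200, replace = TRUE))

# Return a one-row data frame so the result column is called avx
m <- function(df) data.frame(avx = mean(df$x))

# The function is the third positional argument ...
res_positional <- ddply(d, .(grp1, grp2), m)

# ... or is named explicitly as .fun; a name such as avx = m is not
# matched to .fun and ends up in `...` instead
res_named <- ddply(d, .(grp1, grp2), .fun = m)

identical(res_positional, res_named)
```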

Hadley Wickham
Owner

See dplyr for a solution to this.
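For reference, a minimal sketch of the equivalent grouped summary in dplyr follows. It assumes dplyr is installed and calls its functions with the `dplyr::` prefix, since plyr and dplyr both export `summarise` and loading both can mask one with the other. The toy data and the name `res` are illustrative; no timings are claimed here.

```r
# Sketch: the same per-group means with dplyr
set.seed(1)  # assumption: fixed only for reproducibility
n <- 1000
d <- data.frame(x = rnorm(n), y = rnorm(n),
                grp1 = sample(1:5, n, replace = TRUE),
                grp2 = sample(1:5, n, replace = TRUE))

# Group by both keys, then reduce each group to a single row
grouped <- dplyr::group_by(d, grp1, grp2)
res <- dplyr::ungroup(
  dplyr::summarise(grouped, avx = mean(x), avy = mean(y))
)

head(res)
```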

Hadley Wickham hadley closed this January 02, 2014