ddply bug: combining difftimes with differing units forces a single unit on all results, but does not convert values #193

Closed
arilamstein opened this Issue Jan 6, 2014 · 2 comments


I think that there is a bug in how ddply handles calculating time differences. Specifically, when diff returns different units for each subgroup, plyr seems to combine the results in such a way that the numbers stay the same, but the units of all subgroups are forced to the units of the first subgroup. Here is an example:

Consider a data.frame with 3 users, each having 2 timestamp values

df = data.frame(
 userId = c(1,1,2,2,3,3),
 value  = c("2012-01-01 12:00:00", "2012-01-01 12:00:01", # Time diff of 1 secs
                 "2012-01-01 12:00:00", "2012-01-01 13:00:00", # Time diff of 1 hours
                 "2012-01-01 12:00:00", "2013-01-01 13:00:00")) # diff of 366.0417 days
df$value = as.POSIXct(df$value)

Now let's use ddply to split df by userId, apply diff to the timestamps in each subgroup, and combine the results with summarise:

library(plyr)
df_plyr = ddply(df, .(userId), summarise, d=diff(value))
df_plyr
  userId             d
1      1   1.0000 secs
2      2   1.0000 secs
3      3 366.0417 secs

These results are incorrect. The units in df_plyr[2,2] should read "hours" and the units in df_plyr[3,2] should read "days". Either that, or the values in those rows should have been converted to seconds.

Contributor

crowding commented Jan 6, 2014

rbind.fill produces correct results for primitive, list, factor, array and POSIXct values and that's all it promises.

It's technically possible to put handling in rbind.fill to convert difftimes to the same unit, but difftime doesn't have c or [<- methods defined so it would amount to writing difftime functionality that difftime itself doesn't have:

c(diff(df$value[1:2]), diff(df$value[3:4]))
# [1] 1 1
x <- .difftime(numeric(3),"secs")
for (i in 1:3) x[i] <- df$value[i+3] - df$value[i]
x
# Time differences in secs
# [1]   1  -1 366

A workaround is to convert to seconds explicitly:

df2 <- ddply(df, .(userId), summarise, d=`units<-`(diff(value), "secs"))
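For anyone unfamiliar with the backtick form: `units<-` is the replacement function behind `units(x) <- "secs"`, called here as an ordinary function so that it fits inside summarise. A minimal sketch in base R of what it does to a single difftime (the timestamps and the tz = "UTC" setting are illustrative assumptions, not taken from the data above):

```r
# A one-hour difference, stored explicitly in hours.
# The UTC time zone is an assumption made to keep the example reproducible.
d <- difftime(as.POSIXct("2012-01-01 13:00:00", tz = "UTC"),
              as.POSIXct("2012-01-01 12:00:00", tz = "UTC"),
              units = "hours")

# Calling the replacement function directly is equivalent to:
#   tmp <- d; units(tmp) <- "secs"
d_secs <- `units<-`(d, "secs")

units(d_secs)       # the unit label is now "secs"
as.numeric(d_secs)  # and the value was rescaled from 1 to 3600
```

Unlike the combine step in ddply, this changes the label and rescales the number together, which is why forcing every group to the same unit before combining avoids the mismatch.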

Thanks Peter. I think that you are correct about this behavior existing at a lower level than plyr.

I can follow the gist of your explanation, but I don't fully understand what it means for difftime to have (or not have) the c or [<- methods defined. Probably because of this, I can verify that your solution works, but I don't quite understand what code like d=`units<-`(diff(value), "secs") does. What's pretty clear to me, though, is that when I combine two difftime objects I get unexpected behavior: it looks like the difftime objects are converted to numbers and lose their units. This is a problem because 2 seconds < 1 day, but if you drop the units then 2 > 1.

The solution that I've implemented is to just do the split-apply-combine by hand:

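ret = c()
user_ids = unique(df$userId)
for (user_id in user_ids)
{
  tmp = df[df$userId == user_id, ]  # rows for this user (columns as in df above)
  d   = diff(tmp$value)             # difftime in whatever unit diff picks
  units(d) = "hours"                # convert every group to the same unit
  ret = c(ret, as.numeric(d))       # store plain numbers, all in hours
}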
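The same by-hand approach can be sketched more compactly with split and vapply from base R. This is only an illustration under stated assumptions: the data frame is rebuilt here with tz = "UTC" so the example is self-contained, and hours_by_user is a hypothetical name. The conversion uses as.numeric with its units argument, so each group is rescaled to hours before anything is combined.

```r
# Rebuild the example data; tz = "UTC" is an assumption for reproducibility.
df <- data.frame(
  userId = c(1, 1, 2, 2, 3, 3),
  value  = as.POSIXct(c("2012-01-01 12:00:00", "2012-01-01 12:00:01",
                        "2012-01-01 12:00:00", "2012-01-01 13:00:00",
                        "2012-01-01 12:00:00", "2013-01-01 13:00:00"),
                      tz = "UTC"))

# Split by user, take the time difference within each group, and convert
# it to a plain number of hours before the results are combined.
hours_by_user <- vapply(
  split(df, df$userId),
  function(g) as.numeric(diff(g$value), units = "hours"),
  numeric(1))

hours_by_user  # named numeric vector, one entry per userId, all in hours
```

Because the result is an ordinary numeric vector, the unit has to be tracked separately, for example in the variable or column name.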

@arilamstein arilamstein closed this Jan 7, 2014
