Faster colsplit function #13

Open
skranz opened this Issue Feb 6, 2012 · 1 comment

@skranz

While reshape is great, I found that in my application the colsplit function is a bottleneck that makes the code quite slow. The problem is that str_split_fixed seems to be pretty slow on large vectors. I wrote an alternative version that is much quicker for large vectors with many duplicates: it splits only the unique strings and then maps the results back onto the full vector with match. Below is the function and an example that illustrates the issue in my application. In the example, the new function is more than 100 times faster.
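
The underlying idea is to split only the distinct strings and then expand the result back to full length with match(). Here is a minimal sketch of that idiom on a toy vector, using base R only. The vector `x` and the use of strsplit in place of str_split_fixed are assumptions made for illustration; this is not the reshape2 code.

```r
# Toy illustration of the split-unique-then-match idea (base R only).
x <- rep(c("A_x", "A_y", "B_x"), times = 4)   # 12 strings, 3 distinct values

ux <- unique(x)                               # the distinct strings
# Split only the distinct strings; strsplit stands in for str_split_fixed
parts <- do.call(rbind, strsplit(ux, "_", fixed = TRUE))

# match() gives, for each element of x, its position in ux
idx <- match(x, ux)
full <- parts[idx, , drop = FALSE]            # expand back to full length

# Reference result: split every string directly
direct <- do.call(rbind, strsplit(x, "_", fixed = TRUE))
stopifnot(identical(full, direct))
```

The expensive split runs once per distinct value instead of once per element, so the gain grows with the share of duplicates in the vector.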

# Version of colsplit that works much faster for large vectors
# with many duplicates
colsplit = function (string, pattern, names, split.unique = NROW(string) > 100)
{
  # Original computation: split all strings
  # Problem: str_split_fixed can be quite slow for long vectors
  if (!split.unique) {
    vars <- str_split_fixed(string, pattern, n = length(names))
    df <- data.frame(alply(vars, 2, type.convert, as.is = TRUE),
                     stringsAsFactors = FALSE)
    names(df) <- names

  } else {
    # Only split the unique strings and match them back afterwards;
    # this is much faster for long vectors with many duplicates
    uni.string = unique(string)
    # There are speed gains only if there are substantially fewer
    # unique strings than strings in total
    if (length(uni.string)>0.5*length(string))
      return(colsplit(string,pattern,names,split.unique = FALSE))
    uni.df <- colsplit(uni.string,pattern,names,split.unique=FALSE)
    rows <- match(string,uni.string)
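    df <- uni.df[rows, , drop = FALSE]
    # Reset row names: indexing with repeated rows yields names like "1.1"
    rownames(df) <- NULL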
  }
  df
}


# An example with timing
library(reshape2)
library(stringr)
library(plyr)

n.t = 10000  # number of time periods (named n.t to avoid masking T, the shorthand for TRUE)
prod = c("A","B","C")
attr = c("x","y","z")
cross.paste = function(left,right,sep="_") {
  as.character(t(outer(left,right,paste,sep=sep)))
}
df = as.data.frame(cbind(1:n.t,matrix(runif(n.t*3*3),n.t,3*3)))
colnames(df) = c("t",cross.paste(prod,attr))
head(df)

# Melt df
df.melt = melt(df,id.vars="t")
NROW(df.melt)
head(df.melt)

# Want to separate prod and attr: use colsplit

# Original version is very slow
system.time(df.split <- reshape2::colsplit(df.melt$variable,"_",c("prod","attr")))
#user  system elapsed
#52.87    0.14   59.2

# Modified version works much quicker in this example
system.time(df.split2 <- colsplit(df.melt$variable,"_",c("prod","attr"),split.unique=TRUE))
#user  system elapsed
#0.39    0.00    0.51

# Results are the same
identical(df.split,df.split2)

# Finish the transformation to the desired format in df.final
df.work = cbind(df.split2,df.melt)
head(df.work)
df.final = dcast(data=df.work,t + prod ~ attr,value.var = "value")
head(df.final)
@mannyishere

+1 vote.
