rbind.fill performance and consistency improvements #172

Closed
wants to merge 18 commits into
from

Projects

None yet

2 participants

@crowding
Contributor
crowding commented Sep 2, 2013

This is a substantial refactoring of rbind.fill that improves performance for combining large numbers of groups and improves consistency of behavior for different column types.

  • Avoid inducing array copies for primitive/list types, ensuring linear performance (fixes #148).
  • Special handling for factor and POSIXct types to avoid copying and ensure linear performance in those cases as well.
  • Dims are handled independently of type and other attributes; factor-arrays, char-arrays and list-arrays are supported.
  • Arrays of any dimension are supported.
  • Colnames of matrix columns are preserved (like rbind.)
  • Factor columns rbinded to other data types are coerced to character.
  • Consequently match_df and join handle combinations of character and factor on the index column (fixes #128).

Supporting performance measurements here: http://crowding.github.io/plyr/rbind-fill.html

performance over revisions

@hadley hadley commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
}
}
- quickdf(output)
+ quickdf(lapply(output, function(x) x()))
}
output_template <- function(dfs, nrows) {
@hadley
hadley Sep 5, 2013 Owner

Maybe this doesn't need to be it's own function now that it's so much simpler?

@hadley hadley commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
}
seen[matching] <- TRUE
if (all(seen)) break # Quit as soon as all done
}
- # Set up factors
- for(var in vars[is_factor]) {
- all <- unique(lapply(dfs, function(df) levels(df[[var]])))
- output[[var]] <- factor(output[[var]], levels = unique(unlist(all)),
- exclude = NULL)
- }
+ output
+}
+
+allocate_column <- function(example, nrows, dfs, var) {
@hadley
hadley Sep 5, 2013 Owner

Could you add a little documentation to this function? i.e. add basic roxygen docs (description + params), but use # instead of #' so it's not turned into an Rd file

@hadley hadley commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
+ if (is.array(example)) {
+
+ if ("dimnames" %in% names(a)) {
+ a$dimnames[1] <- list(NULL)
+ if (!is.null(names(a$dimnames)))
+ names(a$dimnames)[1] <- NA
+ }
+
+ # Check that all other args have consistent dims
+ df_has <- vapply(dfs, function(df) var %in% names(df), FALSE)
+ dims <- unique(lapply(dfs[df_has], function(df) dim(df[[var]])[-1]))
+ if (length(dims) > 1)
+ stop("Array variable ", var, " has inconsistent dims")
+
+ # Adjust assignment statement
+ assignment[[2]] <- as.call(
@hadley
hadley Sep 5, 2013 Owner

I think this would be cleaner with a helper function with a clear name.

@hadley hadley and 1 other commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
- output[[var]] <- rep(NA, nrows)
+ if (inherits(example, "POSIXt")) {
@hadley
hadley Sep 5, 2013 Owner

else if ? And then in else set handler to type ?

@hadley
hadley Sep 5, 2013 Owner

Maybe complain about non vector inputs?

@crowding
crowding Sep 6, 2013 Contributor

I think it fails reasonably on the call to 'vector' , but I added a test.

@hadley hadley commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
}
- output
+ column <- vector(type, length)
+ if (!isList) {
+ column[] <- NA
+ }
+ attributes(column) <- a
+
+ #It is especially important never to inspect the column when in the
+ #main rbind.fill loop. To avoid that, we've done specialization
+ #(figuring out the array assignment form and data type) ahead of
+ #time, and instead of returning the column, we return a mutator
+ #function that closes over the column.
+ switch(
@hadley
hadley Sep 5, 2013 Owner

I think this would be cleaner if you returned a list of get and set functions.

@hadley hadley and 1 other commented on an outdated diff Sep 5, 2013
R/rbind-fill.r
+ what <- as.character(what)
+ eval(assignment)
+ },
+ factor = function(rows, what) {
+ if(nargs() == 0) {
+ class(column) <<- class
+ return(column)
+ }
+ #duplicate what `[<-.factor` does
+ what <- match(what, levels)
+ #no need to check since we already computed levels
+ eval(assignment)
+ },
+ time = function(rows, what) {
+ if (nargs() == 0) {
+ class(column) <<- class
@hadley
hadley Sep 5, 2013 Owner

Why do you need this?

@crowding
crowding Sep 5, 2013 Contributor

Any type with a non-primitive [<- method is going to force a copy on modification. So the only way to get linear performance out of factor and POSIXct is to inline their [<- methods and only set the class attribute at the end.

@hadley
hadley Sep 5, 2013 Owner

Oh, I see - I was getting confused because of the as.POSIXct below - but that's not what [ is dispatching on.

@hadley hadley referenced this pull request in hadley/dplyr Oct 30, 2013
Closed

Implement C++ version of rbind.fill #101

@hadley
Owner
hadley commented Jan 2, 2014

Thanks. Can you please merge/rebase, update docs with latest roxygen2 and resubmit?

@hadley hadley closed this Jan 2, 2014
@crowding crowding deleted the crowding:fastrbind branch Feb 21, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment