dplyr 0.4.0

@hadley hadley released this 09 Jan 12:37

New features

  • add_rownames() turns row names into an explicit variable (#639).
  • as_data_frame() efficiently coerces a list into a data frame (#749).
  • bind_rows() and bind_cols() efficiently bind a list of data frames by
    row or column. combine() applies the same coercion rules to vectors
    (it works like c() or unlist() but is consistent with the bind_rows()
    rules).
  • right_join() (include all rows in y, and matching rows in x) and
    full_join() (include all rows in x and y) complete the family of
    mutating joins (#96).
  • group_indices() computes a unique integer id for each group (#771). It
    can be called on a grouped_df without any arguments or on a data frame
    with the same arguments as group_by().
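
As a rough illustration of a few of the functions listed above, the sketch below calls right_join(), full_join(), bind_rows() and group_indices() on small made-up tables. The data frames x, y and df and their columns are hypothetical and exist only for this example.

```r
library(dplyr)

# Two small illustrative tables (hypothetical data).
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"), stringsAsFactors = FALSE)
y <- data.frame(id = c(2, 3, 4), b = c("u", "v", "w"), stringsAsFactors = FALSE)

# right_join(): every row of y, plus the matching rows of x.
rj <- right_join(x, y, by = "id")

# full_join(): every row of x and every row of y.
fj <- full_join(x, y, by = "id")

# bind_rows(): stack a list of data frames by row.
stacked <- bind_rows(list(x, x))

# group_indices(): one integer id per group, called here on a grouped_df
# without further arguments.
df <- data.frame(g = c("b", "a", "b"), v = 1:3, stringsAsFactors = FALSE)
idx <- group_indices(group_by(df, g))
```

Rows 1 and 3 of df belong to the same group, so they share an id in idx, while row 2 gets a different one.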

New vignettes

  • vignette("data_frame") describes dplyr functions that make it easier
    and faster to create and coerce data frames. It subsumes the old memory
    vignette.
  • vignette("two-table") describes how two-table verbs work in dplyr.

Minor improvements

  • data_frame() (and as_data_frame() & tbl_df()) now explicitly
    forbid columns that are data frames or matrices (#775). All columns
    must be either a 1d atomic vector or a 1d list.

  • do() uses lazyeval to evaluate its arguments in the correct
    environment (#744), and new do_() is the SE equivalent of do() (#718).
    You can modify grouped data in place: this is probably a bad idea but it's
    sometimes convenient (#737). do() on grouped data tables now passes in all
    columns (not all columns except grouping vars) (#735, thanks to @kismsu).
    do() with database tables no longer potentially includes grouping
    variables twice (#673). Finally, do() gives more consistent outputs when
    there are no rows or no groups (#625).

  • first() and last() preserve factors, dates and times (#509).

  • Overhaul of single table verbs for data.table backend. They now all use
    a consistent (and simpler) code base. This ensures that (e.g.) n()
    now works in all verbs (#579).

  • In *_join(), you can now name only those variables that are different between
    the two tables, e.g. inner_join(x, y, c("a", "b", "c" = "d")) (#682).
    If non-join columns have the same name, dplyr will add .x and .y
    suffixes to distinguish the source (#655).

  • mutate() handles complex vectors (#436) and forbids POSIXlt results
    (instead of crashing) (#670).

  • select() now implements a more sophisticated algorithm so if you're
    doing multiple includes and excludes with and without names, you're more
    likely to get what you expect (#644). You'll also get a better error
    message if you supply an input that doesn't resolve to an integer
    column position (#643).

  • Printing has received a number of small tweaks. All print() methods
    invisibly return their input so you can interleave print() statements into a
    pipeline to see interim results. print() will print column names of 0-row data
    frames (#652), and will never print more than 20 rows (i.e.
    options(dplyr.print_max) is now 20, not 100) (#710). Row names are now
    never printed since no dplyr method is guaranteed to preserve them (#669).

    glimpse() prints the number of observations (#692).

    type_sum() gains a data frame method.

  • summarise() handles list output columns (#832).

  • slice() works for data tables (#717). Documentation clarifies that
    slice can't work with relational databases, and the examples show
    how to achieve the same results using filter() (#720).

  • dplyr now requires RSQLite >= 1.0. This shouldn't affect your code
    in any way (except that RSQLite now doesn't need to be attached) but does
    simplify the internals (#622).

  • Functions that need to combine multiple results into a single column
    (e.g. join(), bind_rows() and summarise()) are more careful about
    coercion.

    Joining factors with the same levels in the same order preserves the
    original levels (#675). Joining factors with non-identical levels
    generates a warning and coerces to character (#684). Joining a character
    to a factor (or vice versa) generates a warning and coerces to character.
    Avoid these warnings by ensuring your data is compatible before joining.

    rbind_list() will throw an error if you attempt to combine an integer and
    factor (#751). rbind()ing a column full of NAs is allowed and just
    collects the appropriate missing value for the column type being collected
    (#493).

    summarise() is more careful about NA, e.g. the decision on the result
    type will be delayed until the first non NA value is returned (#599).
    It will complain about loss of precision coercions, which can happen for
    expressions that return integers for some groups and doubles for others
    (#599).

  • A number of functions gained new or improved hybrid handlers: first(),
    last(), nth() (#626), lead() & lag() (#683), %in% (#126). That means
    when you use these functions in a dplyr verb, dplyr handles them in C++ rather
    than calling back to R, which improves performance.

    Hybrid min_rank() correctly handles NaN values (#726). Hybrid
    implementation of nth() falls back to R evaluation when n is not
    a length one integer or numeric, e.g. when it's an expression (#734).

    Hybrid dense_rank(), min_rank(), cume_dist(), ntile(), row_number()
    and percent_rank() now preserve NAs (#774).

  • filter() returns its input when it has no rows or no columns (#782).

  • Join functions keep attributes (e.g. time zone information) from the
    left argument for POSIXct and Date objects (#819), and only
    warn once about each incompatibility (#798).
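
The *_join() item above describes naming only the join variables that differ between the two tables, and the .x/.y suffixes added to clashing non-join columns. A minimal sketch of that behaviour follows; the tables x and y and their columns are hypothetical and were made up for this illustration.

```r
library(dplyr)

# Both tables share the key "a"; the second key is called "c" in x and "d" in y.
# Both also carry a non-join column named "v".
x <- data.frame(a = c(1, 2), c = c("k1", "k2"), v = c(10, 20),
                stringsAsFactors = FALSE)
y <- data.frame(a = c(1, 2), d = c("k1", "k2"), v = c(100, 200),
                stringsAsFactors = FALSE)

# Only the key whose name differs needs to be named: "c" in x matches "d" in y.
res <- inner_join(x, y, by = c("a", "c" = "d"))

# The clashing non-join column "v" appears as v.x (from x) and v.y (from y).
names(res)
```

Using character keys here, rather than factors, also avoids the factor-coercion warnings described in the list above.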

Bug fixes

  • [.tbl_df correctly computes row names for 0-column data frames, avoiding
    problems with xtable (#656). [.grouped_df will silently drop grouping
    if you don't include the grouping columns (#733).
  • data_frame() now acts correctly if the first argument is a vector to be
    recycled (#680, thanks to @jimhester).
  • filter.data.table() works if the table has a variable called "V1" (#615).
  • *_join() keeps columns in original order (#684).
    Joining a factor to a character vector doesn't segfault (#688).
    *_join functions can now deal with multiple encodings (#769),
    and correctly name results (#855).
  • *_join.data.table() works when data.table isn't attached (#786).
  • group_by() on a data table preserves original order of the rows (#623).
    group_by() supports variables with more than 39 characters thanks to
    a fix in lazyeval (#705). It gives a meaningful error message when a variable
    is not found in the data frame (#716).
  • grouped_df() requires vars to be a list of symbols (#665).
  • min(., na.rm = TRUE) works with Dates built on numeric vectors (#755).
  • rename_() generic gets missing .dots argument (#708).
  • row_number(), min_rank(), percent_rank(), dense_rank(), ntile() and
    cume_dist() handle data frames with 0 rows (#762). They all preserve
    missing values (#774). row_number() doesn't segfault when given an external
    variable with the wrong number of elements (#781).
  • group_indices() handles the edge case when there are no variables (#867).
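
The ranking-function fix in the list above (#774) says that missing values are preserved. A small sketch of what that looks like inside mutate(), using a hypothetical one-column data frame:

```r
library(dplyr)

# Hypothetical input with one missing value in the middle.
df <- data.frame(x = c(3, NA, 1))

# The NA in x stays NA in each ranking instead of being assigned a rank.
res <- mutate(df, r = min_rank(x), d = dense_rank(x))

res$r  # 2 NA 1
```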