Fast reading of delimited files
jimhester Fix bug when finding newlines using multiple threads
We can't use the logic for quoted newlines when multi-threading, because
we don't know if we have landed inside or outside the quotes.

Likely fixes #76
Latest commit df07dfd Apr 20, 2019

🏎💨vroom

Lifecycle: experimental

The fastest delimited reader for R, 1.04 GB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use.

To further improve performance, vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing.
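
The lazy-loading behaviour described above can be sketched roughly as follows. This is a minimal illustration, not a benchmark: the temporary file and its two columns are made up for the example, and which columns are actually backed by Altrep depends on the settings described under Environment variables.

```r
library(vroom)

# Write a tiny tab-delimited file so the sketch is self-contained
# (the data frame and the temporary path are hypothetical).
path <- tempfile(fileext = ".tsv")
vroom_write(
  data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE),
  path,
  delim = "\t"
)

# vroom() indexes the file; Altrep-backed columns are not parsed yet.
df <- vroom(path)

# Accessing a column is what triggers the actual read of its values.
upper_names <- toupper(df$name)
```

Columns that are never touched should, under these assumptions, cost little more than the initial indexing pass.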

package      version      time (sec)  speedup  throughput
vroom        0.0.0.9000         1.60    67.42  1.04 GB/sec
data.table   1.12.0            19.75     5.47  84.38 MB/sec
readr        1.3.1             26.61     4.06  62.64 MB/sec
read.delim   3.5.1            108.13     1.00  15.42 MB/sec

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

  • delimiter guessing*
  • custom delimiters (including multi-byte* and Unicode* delimiters)
  • specification of column types (including type guessing)
    • numeric types (double, integer, number)
    • logical types
    • datetime types (datetime, date, time)
    • categorical types (characters, factors)
  • column selection, like dplyr::select()*
  • skipping headers, comments and blank lines
  • quoted fields
  • double and backslashed escapes
  • whitespace trimming
  • Windows newlines
  • reading from multiple files or connections*
  • embedded newlines in headers and fields**
  • writing delimited files with as-needed quoting

* These are additional features available only in vroom.

** Requires num_threads = 1.
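
As a rough sketch of two of the features listed above, the following reads a file whose quoted field contains an embedded newline, using a single thread as the second footnote requires, and keeps only two columns. The file contents are invented for illustration, and the argument names (delim, num_threads, col_select) are assumed to match the installed version of vroom.

```r
library(vroom)

# Hypothetical CSV with an embedded newline inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,note,score",
    "1,\"first line\nsecond line\",2.5",
    "2,\"plain\",3.5"),
  path
)

# num_threads = 1 because embedded newlines are not supported when
# indexing with multiple threads; col_select works like dplyr::select().
df <- vroom(
  path,
  delim = ",",
  num_threads = 1,
  col_select = c(id, note)
)
```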

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("jimhester/vroom")

Usage

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  # "i" = integer, "f" = factor, "l" = logical, "_" = skip the column
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 x 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # … with 29 more rows

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First we generate some files to read by splitting the flights dataset from nycflights13 by airline.

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#> 1  2013     1     1      810            810         0     1048
#> 2  2013     1     1     1451           1500        -9     1634
#> 3  2013     1     1     1452           1455        -3     1637
#> # … with 3.368e+05 more rows, and 12 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

Further reading

See Getting started to jump start your use of vroom!

Benchmarks

The speed quoted above is from a dataset with 14,776,615 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Most users will not need to set these.

  • VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (tempdir()).
  • VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().
  • VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, in R notebooks, when running tests with testthat, and when knitting documents.
  • VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default: 128 KiB).
  • VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).
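
One way to set these variables for the current session is base R's Sys.setenv(), as in the sketch below. The values shown are arbitrary examples, and it is an assumption here that VROOM_SHOW_PROGRESS accepts the string "false"; set the variables before calling vroom() so they are in place when it runs.

```r
# Example values only; pick whatever suits your machine.
Sys.setenv(
  VROOM_THREADS = "2",            # use two threads for indexing and parsing
  VROOM_SHOW_PROGRESS = "false"   # assumed spelling for turning the bar off
)

# Confirm what the session will see.
threads <- Sys.getenv("VROOM_THREADS")
progress <- Sys.getenv("VROOM_SHOW_PROGRESS")
```

To make a setting persistent across sessions, the same NAME=value pairs can be placed in an .Renviron file instead.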

There is also a family of variables to control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0), Altrep is automatically turned off and the variables have no effect. Each variable can take one of true, false, TRUE, FALSE, 1, or 0.

  • VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default: false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

  • VROOM_USE_ALTREP_CHR
  • VROOM_USE_ALTREP_FCT
  • VROOM_USE_ALTREP_INT
  • VROOM_USE_ALTREP_DBL
  • VROOM_USE_ALTREP_NUM
  • VROOM_USE_ALTREP_LGL
  • VROOM_USE_ALTREP_DTTM
  • VROOM_USE_ALTREP_DATE
  • VROOM_USE_ALTREP_TIME
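
As a small sketch of how these might be used, the snippet below opts in to Altrep for all numeric types and for factors in the current session, then lists which of the Altrep variables are currently set. It uses only base R; the particular choice of variables is just an example.

```r
# Opt in to Altrep for numeric columns and for factors (example choice).
Sys.setenv(
  VROOM_USE_ALTREP_NUMERICS = "true",
  VROOM_USE_ALTREP_FCT = "true"
)

# List every VROOM_USE_ALTREP_* variable visible to this session.
altrep_vars <- grep("^VROOM_USE_ALTREP_", names(Sys.getenv()), value = TRUE)
altrep_vars
```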

RStudio caveats

RStudio’s environment pane auto-refresh behavior calls object.size(), which can be extremely slow for Altrep objects. This was fixed in rstudio#4210 and rstudio#4292, so it is recommended that you use a daily build if you want to use vroom inside RStudio. For older versions, a workaround is to use the ‘Manual Refresh Only’ option in the environment pane.

Thanks
