diff --git a/r/README.md b/r/README.md index c103000f5f657..b568a362c95cf 100644 --- a/r/README.md +++ b/r/README.md @@ -4,31 +4,57 @@ [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush) [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow) -[Apache Arrow](https://arrow.apache.org/) is a cross-language -development platform for in-memory data. It specifies a standardized +**[Apache Arrow](https://arrow.apache.org/) is a cross-language +development platform for in-memory data.** It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. -The `arrow` package exposes an interface to the Arrow C++ library to -access many of its features in R. This includes support for analyzing -large, multi-file datasets (`open_dataset()`), working with individual -Parquet (`read_parquet()`, `write_parquet()`) and Feather -(`read_feather()`, `write_feather()`) files, as well as lower-level -access to Arrow memory and messages. +**The `arrow` package exposes an interface to the Arrow C++ library, +enabling access to many of its features in R.** It provides low-level +access to the Arrow C++ library API and higher-level access through a +`dplyr` backend and familiar R functions. + +## What can the `arrow` package do? + +- Read and write **Parquet files** (`read_parquet()`, + `write_parquet()`), an efficient and widely used columnar format +- Read and write **Feather files** (`read_feather()`, + `write_feather()`), a format optimized for speed and + interoperability +- Analyze, process, and write **multi-file, larger-than-memory + datasets** (`open_dataset()`, `write_dataset()`) +- Read **large CSV and JSON files** with excellent **speed and + efficiency** (`read_csv_arrow()`, `read_json_arrow()`) +- Manipulate and analyze Arrow data with **`dplyr` verbs** +- Read and write files in **Amazon S3** buckets with no additional + function calls +- Exercise **fine control over column types** for seamless + interoperability with databases and data warehouse systems +- Use **compression codecs** including Snappy, gzip, Brotli, + Zstandard, LZ4, LZO, and bzip2 for reading and writing data +- Enable **zero-copy data sharing** between **R and Python** +- Connect to **Arrow Flight** RPC servers to send and receive large + datasets over networks +- Access and manipulate Arrow objects through **low-level bindings** + to the C++ library +- Provide a **toolkit for building connectors** to other applications + and services that use Arrow ## Installation +### Installing the latest release version + Install the latest release of `arrow` from CRAN with -```r +``` r install.packages("arrow") ``` Conda users can install `arrow` from conda-forge with -``` +``` shell conda install -c conda-forge --strict-channel-priority r-arrow ``` @@ -36,218 +62,245 @@ Installing a released version of the `arrow` package requires no additional system dependencies. For macOS and Windows, CRAN hosts binary packages that contain the Arrow C++ library. On Linux, source package installation will also build necessary C++ dependencies. For a faster, -more complete installation, set the environment variable `NOT_CRAN=true`. 
-See `vignette("install", package = "arrow")` for details. +more complete installation, set the environment variable +`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for +details. -## Installing a development version +### Installing a development version -Development versions of the package (binary and source) are built daily and hosted at -. To install from there: +Development versions of the package (binary and source) are built +nightly and hosted at . To +install from there: ``` r install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com") ``` -Or - -```r -arrow::install_arrow(nightly = TRUE) -``` - -Conda users can install `arrow` nightlies from our nightlies channel using: +Conda users can install `arrow` nightly builds with -``` +``` shell conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow ``` -These daily package builds are not official Apache releases and are not -recommended for production use. They may be useful for testing bug fixes -and new features under active development. +If you already have a version of `arrow` installed, you can switch to +the latest nightly development version with -## Developing - -Windows and macOS users who wish to contribute to the R package and -don’t need to alter the Arrow C++ library may be able to obtain a -recent version of the library without building from source. On macOS, -you may install the C++ library using [Homebrew](https://brew.sh/): - -``` shell -# For the released version: -brew install apache-arrow -# Or for a development version, you can try: -brew install apache-arrow --HEAD +``` r +arrow::install_arrow(nightly = TRUE) ``` -On Windows, you can download a .zip file with the arrow dependencies from the -[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/), -and then set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. Version numbers in that -repository correspond to dates, and you will likely want the most recent. +These nightly package builds are not official Apache releases and are +not recommended for production use. They may be useful for testing bug +fixes and new features under active development. -If you need to alter both the Arrow C++ library and the R package code, -or if you can’t get a binary version of the latest C++ library -elsewhere, you’ll need to build it from source too. +## Usage -First, install the C++ library. See the [developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details. -It's recommended to make a `build` directory inside of the `cpp` directory of -the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`, -you'll first call `cmake` to configure the build and then `make install`. -For the R package, you'll need to enable several features in the C++ library -using `-D` flags: +Among the many applications of the `arrow` package, two of the most accessible are: -``` -cmake \ - -DARROW_COMPUTE=ON \ - -DARROW_CSV=ON \ - -DARROW_DATASET=ON \ - -DARROW_FILESYSTEM=ON \ - -DARROW_JEMALLOC=ON \ - -DARROW_JSON=ON \ - -DARROW_PARQUET=ON \ - -DCMAKE_BUILD_TYPE=release \ - -DARROW_INSTALL_NAME_RPATH=OFF \ - .. -``` - -where `..` is the path to the `cpp/` directory when you're in `cpp/build`. 
+- High-performance reading and writing of data files with multiple + file formats and compression codecs, including built-in support for + cloud storage +- Analyzing and manipulating bigger-than-memory data with `dplyr` + verbs -To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags: +The sections below describe these two uses and illustrate them with +basic examples. The sections below mention two Arrow data structures: -``` - -DARROW_S3=ON \ - -DARROW_MIMALLOC=ON \ - -DARROW_WITH_BROTLI=ON \ - -DARROW_WITH_BZ2=ON \ - -DARROW_WITH_LZ4=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZLIB=ON \ - -DARROW_WITH_ZSTD=ON \ -``` +- `Table`: a tabular, column-oriented data structure capable of + storing and processing large amounts of data more efficiently than + R’s built-in `data.frame` and with SQL-like column data types that + afford better interoperability with databases and data warehouse + systems +- `Dataset`: a data structure functionally similar to `Table` but with + the capability to work on larger-than-memory data partitioned across + multiple files -Other flags that may be useful: +### Reading and writing data files with `arrow` -* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers -* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. +The `arrow` package provides functions for reading single data files in +several common formats. By default, calling any of these functions +returns an R `data.frame`. To return an Arrow `Table`, set argument +`as_data_frame = FALSE`. -Note that after any change to the C++ library, you must reinstall it and -run `make clean` or `git clean -fdx .` to remove any cached object code -in the `r/src/` directory before reinstalling the R package. This is -only necessary if you make changes to the C++ library source; you do not -need to manually purge object files if you are only editing R or C++ -code inside `r/`. +- `read_parquet()`: read a file in Parquet format +- `read_feather()`: read a file in Feather format (the Apache Arrow + IPC format) +- `read_delim_arrow()`: read a delimited text file (default delimiter + is comma) +- `read_csv_arrow()`: read a comma-separated values (CSV) file +- `read_tsv_arrow()`: read a tab-separated values (TSV) file +- `read_json_arrow()`: read a JSON data file -Once you’ve built the C++ library, you can install the R package and its -dependencies, along with additional dev dependencies, from the git -checkout: +For writing data to single files, the `arrow` package provides the +functions `write_parquet()` and `write_feather()`. These can be used +with R `data.frame` and Arrow `Table` objects. -``` shell -cd ../../r +For example, let’s write the Star Wars characters data that’s included +in `dplyr` to a Parquet file, then read it back in. Parquet is a popular +choice for storing analytic data; it is optimized for reduced file sizes +and fast read performance, especially for column-based access patterns. +Parquet is widely supported by many tools and platforms. -Rscript -e ' -options(repos = "https://cloud.r-project.org/") -if (!require("remotes")) install.packages("remotes") -remotes::install_deps(dependencies = TRUE) -' +First load the `arrow` and `dplyr` packages: -R CMD INSTALL . 
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
```

-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:

-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
```

-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-     unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-      dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
+Then read the Parquet file into an R `data.frame` named `sw`:

-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
+``` r
+sw <- read_parquet(file_path)
```

-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+R object attributes are preserved when writing data to Parquet or
+Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R `data.frame`s
+with `haven::labelled` columns, and `data.frame`s with other custom
+attributes.

-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or sets of multiple files, `arrow`
+defines `Dataset` objects and provides the functions `open_dataset()`
+and `write_dataset()`, which enable analysis and processing of
+bigger-than-memory data, including the ability to partition data into
+smaller chunks without loading the full data into memory. For examples
+of these functions, see `vignette("dataset", package = "arrow")`.

-    ./lint.sh
+All these functions can read and write files in the local filesystem or
+in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
+details, see `vignette("fs", package = "arrow")`.

-Fix any style issues before committing with
+### Using `dplyr` with `arrow`

-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object.
For example, read the Parquet file written in the +previous example into an Arrow `Table` named `sw`: -The lint script requires Python 3 and `clang-format-8`. If the command -isn’t found, you can explicitly provide the path to it like -`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get -this by installing LLVM via Homebrew and running the script as -`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh` +``` r +sw <- read_parquet(file_path, as_data_frame = FALSE) +``` -### Running tests +Next, pipe on `dplyr` verbs: -Some tests are conditionally enabled based on the availability of certain -features in the package build (S3 support, compression libraries, etc.). -Others are generally skipped by default but can be enabled with environment -variables or other settings: +``` r +result <- sw %>% + filter(homeworld == "Tatooine") %>% + rename(height_cm = height, mass_kg = mass) %>% + mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>% + arrange(desc(birth_year)) %>% + select(name, height_in, mass_lbs) +``` -* All tests are skipped on Linux if the package builds without the C++ libarrow. - To make the build fail if libarrow is not available (as in, to test that - the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE` -* Some tests are disabled unless `ARROW_R_DEV=TRUE` -* Tests that require allocating >2GB of memory to test Large types are disabled - unless `ARROW_LARGE_MEMORY_TESTS=TRUE` -* Integration tests against a real S3 bucket are disabled unless credentials - are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available - on request -* S3 tests using [MinIO](https://min.io/) locally are enabled if the - `minio server` process is found running. If you're running MinIO with custom - settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and - `MINIO_PORT` to override the defaults. +The `arrow` package uses lazy evaluation to delay computation until the +result is required. This speeds up processing by enabling the Arrow C++ +library to perform multiple computations in one operation. `result` is +an object with class `arrow_dplyr_query` which represents all the +computations to be performed: -### Useful functions +``` r +result +#> Table (query) +#> name: string +#> height_in: expr +#> mass_lbs: expr +#> +#> * Filter: equal(homeworld, "Tatooine") +#> * Sorted by birth_year [desc] +#> See $.data for the source Arrow object +``` -Within an R session, these can help with package development: +To perform these computations and materialize the result, call +`compute()` or `collect()`. `compute()` returns an Arrow `Table`, +suitable for passing to other `arrow` or `dplyr` functions: ``` r -devtools::load_all() # Load the dev package -devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names -devtools::document() # Update roxygen documentation -pkgdown::build_site() # To preview the documentation website -devtools::check() # All package checks; see also below -covr::package_coverage() # See test coverage statistics +result %>% compute() +#> Table +#> 10 rows x 3 columns +#> $name +#> $height_in +#> $mass_lbs ``` -Any of those can be run from the command line by wrapping them in `R -e -'$COMMAND'`. There’s also a `Makefile` to help with some common tasks -from the command line (`make test`, `make doc`, `make clean`, etc.) 
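+Because `compute()` returns an Arrow `Table`, the materialized result
+can also be handed back to the single-file writers described above. As
+a minimal sketch (the `tatooine_path` name is only illustrative), the
+filtered subset can be saved to a new Parquet file without ever
+converting it to a `data.frame`:
+
+``` r
+# Illustrative only: materialize the query as an Arrow Table and write
+# it straight to Parquet; `result` comes from the pipeline above.
+tatooine_path <- tempfile(fileext = ".parquet")
+result %>%
+  compute() %>%
+  write_parquet(tatooine_path)
+```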
- -### Full package validation +`collect()` returns an R `data.frame`, suitable for viewing or passing +to other R functions for analysis or visualization: -``` shell -R CMD build . -R CMD check arrow_*.tar.gz --as-cran +``` r +result %>% collect() +#> # A tibble: 10 x 3 +#> name height_in mass_lbs +#> +#> 1 C-3PO 65.7 165. +#> 2 Cliegg Lars 72.0 NA +#> 3 Shmi Skywalker 64.2 NA +#> 4 Owen Lars 70.1 265. +#> 5 Beru Whitesun lars 65.0 165. +#> 6 Darth Vader 79.5 300. +#> 7 Anakin Skywalker 74.0 185. +#> 8 Biggs Darklighter 72.0 185. +#> 9 Luke Skywalker 67.7 170. +#> 10 R5-D4 38.2 70.5 ``` + +The `arrow` package works with most single-table `dplyr` verbs except those that +compute aggregates, such as `summarise()` and `mutate()` after +`group_by()`. Inside `dplyr` verbs, Arrow offers support for many +functions and operators, with common functions mapped to their base R and +tidyverse equivalents. The +[changelog](https://arrow.apache.org/docs/r/news/index.html) lists many of them. +If there are additional functions you would +like to see implemented, please file an issue as described in the +[Getting help](#getting-help) section below. + +For `dplyr` queries on `Table` objects, if the `arrow` package detects +an unimplemented function within a `dplyr` verb, it automatically calls +`collect()` to return the data as an R `data.frame` before processing +that `dplyr` verb. For queries on `Dataset` objects (which can be larger +than memory), it raises an error if the function is unimplemented; +you need to explicitly tell it to `collect()`. + +### Additional features + +Other applications of `arrow` are described in the following vignettes: + +- `vignette("python", package = "arrow")`: use `arrow` and + `reticulate` to pass data between R and Python +- `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC + servers to send and receive data +- `vignette("arrow", package = "arrow")`: access and manipulate Arrow + objects through low-level bindings to the C++ library + +## Getting help + +If you encounter a bug, please file an issue with a minimal reproducible +example on the [Apache Jira issue +tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create +an account or log in, then click **Create** to file an issue. Select the +project **Apache Arrow (ARROW)**, select the component **R**, and begin +the issue summary with **`[R]`** followed by a space. For more +information, see the **Report bugs and propose features** section of the +[Contributing to Apache +Arrow](https://arrow.apache.org/docs/developers/contributing.html) page +in the Arrow developer documentation. + +We welcome questions, discussion, and contributions from users of the +`arrow` package. For information about mailing lists and other venues +for engaging with the Arrow developer and user communities, please see +the [Apache Arrow Community](https://arrow.apache.org/community/) page. + +------------------------------------------------------------------------ + +All participation in the Apache Arrow project is governed by the Apache +Software Foundation’s [code of +conduct](https://www.apache.org/foundation/policies/conduct.html). diff --git a/r/vignettes/arrow.Rmd b/r/vignettes/arrow.Rmd index e38296828fb2f..21cbbe48d61ed 100644 --- a/r/vignettes/arrow.Rmd +++ b/r/vignettes/arrow.Rmd @@ -72,6 +72,45 @@ to other applications and services that use Arrow. 
One example is Spark: the move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). +# Object hierarchy + +## Metadata objects + +Arrow defines the following classes for representing metadata: + +| Class | Description | How to create an instance | +| ---------- | -------------------------------------------------- | -------------------------------- | +| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` | +| `Field` | a character string name and a `DataType` | `field(name, type)` | +| `Schema` | list of `Field`s | `schema(...)` | + +## Data objects + +Arrow defines the following classes for representing zero-dimensional (scalar), +one-dimensional (array/vector-like), and two-dimensional (tabular/data +frame-like) data: + +| Dim | Class | Description | How to create an instance | +| --- | -------------- | ----------------------------------------- | -------------------------------------------------------------------------- | +| 0 | `Scalar` | single value and its `DataType` | `Scalar$create(value, type)` | +| 1 | `Array` | vector of values and its `DataType` | `Array$create(vector, type)` | +| 1 | `ChunkedArray` | vectors of values and their `DataType` | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)` | +| 2 | `RecordBatch` | list of `Array`s with a `Schema` | `RecordBatch$create(...)` or alias `record_batch(...)` | +| 2 | `Table` | list of `ChunkedArray` with a `Schema` | `Table$create(...)` or `arrow::read_*(file, as_data_frame = FALSE)` | +| 2 | `Dataset` | list of `Table`s with the same `Schema` | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)` | + +Each of these is defined as an `R6` class in the `arrow` R package and +corresponds to a class of the same name in the Arrow C++ library. The `arrow` +package provides a variety of `R6` and S3 methods for interacting with instances +of these classes. + +For convenience, the `arrow package also defines several synthetic classes that +do not exist in the C++ library, including: + +* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray` +* `ArrowTabular`: inherited by `RecordBatch` and `Table` +* `ArrowObject`: inherited by all Arrow objects + # Internals ## Mapping of R <--> Arrow types diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 32389b9516202..b5e17578b2905 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -20,11 +20,11 @@ and what is on the immediate development roadmap. The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about 10 years of the trip data in a public AWS S3 bucket. +of about 10 years of the trip data in a public Amazon S3 bucket. -The total file size is around 37 gigabytes, even in the efficient Parquet file format. -That's bigger than memory on most people's computers, -so we can't just read it all in and stack it into a single data frame. +The total file size is around 37 gigabytes, even in the efficient Parquet file +format. That's bigger than memory on most people's computers, so we can't just +read it all in and stack it into a single data frame. In Windows and macOS binary packages, S3 support is included. 
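+One way to confirm whether the build you are running was compiled with
+S3 support is the capability flag exposed by the package (a minimal
+check; recent versions of `arrow` export this helper):
+
+```r
+# Returns TRUE if the installed arrow build has S3 filesystem support
+arrow::arrow_with_s3()
+```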
On Linux when installing from source, S3 support is not enabled by default, @@ -102,11 +102,11 @@ ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) The default file format for `open_dataset()` is Parquet; if we had a directory of Arrow format files, we could include `format = "arrow"` in the call. -Other supported formats include: "feather" (an alias for "arrow", as Feather v2 -is the Arrow file format), "csv", "tsv" (for tab-delimited), and "text" for -generic text-delimited files. For text files, you can pass any parsing options -("delim", "quote", etc.) to `open_dataset()` that you would otherwise pass to -`read_csv_arrow()`. +Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather +v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` +for generic text-delimited files. For text files, you can pass any parsing +options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise +pass to `read_csv_arrow()`. The `partitioning` argument lets us specify how the file paths provide information about how the dataset is chunked into different files. Our files in this example @@ -119,12 +119,12 @@ have file paths like ``` By providing a character vector to `partitioning`, we're saying that the first -path segment gives the value for "year" and the second segment is "month". -Every row in `2009/01/data.parquet` has a value of 2009 for "year" -and 1 for "month", even though those columns may not actually be present in the file. +path segment gives the value for `year` and the second segment is `month`. +Every row in `2009/01/data.parquet` has a value of 2009 for `year` +and 1 for `month`, even though those columns may not actually be present in the file. Indeed, when we look at the dataset, we see that in addition to the columns present -in every file, there are also columns "year" and "month". +in every file, there are also columns `year` and `month`. ```{r, eval = file.exists("nyc-taxi")} ds @@ -139,7 +139,7 @@ passenger_count: int8 trip_distance: float pickup_longitude: float pickup_latitude: float -rate_code_id: string +rate_code_id: null store_and_fwd_flag: string dropoff_longitude: float dropoff_latitude: float @@ -150,10 +150,6 @@ mta_tax: float tip_amount: float tolls_amount: float total_amount: float -improvement_surcharge: float -pickup_location_id: int32 -dropoff_location_id: int32 -congestion_surcharge: float year: int32 month: int32 @@ -182,16 +178,16 @@ files, we've parsed file paths to identify partitions, and we've read the headers of the Parquet files to inspect their schemas so that we can make sure they all line up. -In the current release, `arrow` supports methods for selecting a window of data: -`select()`, `rename()`, and `filter()`. Aggregation is not yet supported, -nor is deriving or projecting new columns, so before you call `summarize()` or -`mutate()`, you'll need to `collect()` the data first, -which pulls your selected window of data into an in-memory R data frame. -While we could have made those methods `collect()` the data they needed -automatically and invisibly to the end user, -we thought it best to make it explicit when you're pulling data into memory -so that you can construct your queries most efficiently -and not be surprised when some query consumes way more resources than expected. +In the current release, `arrow` supports the dplyr verbs `mutate()`, +`transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and +`arrange()`. 
Aggregation is not yet supported, so before you call `summarise()` +or other verbs with aggregate functions, use `collect()` to pull the selected +subset of the data into an in-memory R data frame. + +If you attempt to call unsupported `dplyr` verbs or unimplemented functions in +your query on an Arrow Dataset, the `arrow` package raises an error. However, +for `dplyr` queries on `Table` objects (which are typically smaller in size) the +package automatically calls `collect()` before processing that `dplyr` verb. Here's an example. Suppose I was curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with @@ -201,10 +197,11 @@ fares greater than $100 in 2015, broken down by the number of passengers: system.time(ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% + mutate(tip_pct = 100 * tip_amount / total_amount) %>% group_by(passenger_count) %>% collect() %>% - summarize( - tip_pct = median(100 * tip_amount / total_amount), + summarise( + median_tip_pct = median(tip_pct), n = n() ) %>% print()) @@ -213,34 +210,38 @@ system.time(ds %>% ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} cat(" # A tibble: 10 x 3 - passenger_count tip_pct n - - 1 0 9.84 380 - 2 1 16.7 143087 - 3 2 16.6 34418 - 4 3 14.4 8922 - 5 4 11.4 4771 - 6 5 16.7 5806 - 7 6 16.7 3338 - 8 7 16.7 11 - 9 8 16.7 32 -10 9 16.7 42 + passenger_count median_tip_pct n + + 1 0 9.84 380 + 2 1 16.7 143087 + 3 2 16.6 34418 + 4 3 14.4 8922 + 5 4 11.4 4771 + 6 5 16.7 5806 + 7 6 16.7 3338 + 8 7 16.7 11 + 9 8 16.7 32 +10 9 16.7 42 user system elapsed 4.436 1.012 1.402 ") ``` -We just selected a window out of a dataset with around 2 billion rows -and aggregated on it in under 2 seconds on my laptop. How does this work? +We just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated on it in under 2 seconds on my laptop. How does +this work? -First, `select()`/`rename()`, `filter()`, and `group_by()` -record their actions but don't evaluate on the data until you run `collect()`. +First, +`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +`group_by()`, and `arrange()` record their actions but don't evaluate on the +data until you run `collect()`. ```{r, eval = file.exists("nyc-taxi")} ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% + mutate(tip_pct = 100 * tip_amount / total_amount) %>% group_by(passenger_count) ``` @@ -250,21 +251,22 @@ FileSystemDataset (query) tip_amount: float total_amount: float passenger_count: int8 +tip_pct: expr -* Filter: ((total_amount > 100:double) and (year == 2015:double)) +* Filter: ((total_amount > 100) and (year == 2015)) * Grouped by passenger_count See $.data for the source Arrow object ") ``` -This returns instantly and shows the window selection you've made, without +This returns instantly and shows the manipulations you've made, without loading data from the files. Because the evaluation of these queries is deferred, -you can build up a query that selects down to a small window without generating +you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. 
As a result, -we can select a window of data from a much larger dataset by collecting the +we can select a subset of data from a much larger dataset by collecting the smaller slices from each file--we don't have to load the whole dataset in memory in order to slice from it. @@ -278,9 +280,17 @@ avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. -For one, you can specify a `schema` argument to declare the columns and their data types. -This is useful if you have data files that have different storage schema -(for example, a column could be `int32` in one and `int8` in another) +For one, if you are working with a single file or a set of files that are not +all in the same directory, you can provide a file path or a vector of multiple +file paths to `open_dataset()`. This is useful if, for example, you have a +single CSV file that is too big to read into memory. You could pass the file +path to `open_dataset()`, use `group_by()` to partition the Dataset into +manageable chunks, then use `write_dataset()` to write each chunk to a separate +Parquet file---all without needing to read the full CSV file into R. + +You can specify a `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. @@ -289,7 +299,7 @@ The schema specification just lets you declare what you want the result to be. Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. This would be useful, in our taxi dataset example, if you wanted to keep -"month" as a string instead of an integer for some reason. +`month` as a string instead of an integer for some reason. Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, @@ -322,9 +332,10 @@ by calling `write_dataset()` on it: write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the "payment_type" column is something we often filter on, -so we want to partition the data by that variable. By doing so we ensure that a filter like -`payment_type == 3` will touch only a subset of files where payment_type is always 3. +Next, let's imagine that the `payment_type` column is something we often filter +on, so we want to partition the data by that variable. By doing so we ensure +that a filter like `payment_type == "Cash"` will touch only a subset of files +where `payment_type` is always `"Cash"`. One natural way to express the columns you want to partition on is to use the `group_by()` method: @@ -339,33 +350,35 @@ This will write files to a directory tree that looks like this: ```r system("tree nyc-taxi/feather") +``` -# feather -# ├── payment_type=1 -# │ └── part-5.feather -# ├── payment_type=2 -# │ └── part-0.feather -# ... 
-# └── payment_type=5 -# └── part-2.feather -# -# 5 directories, 25 files +``` +## feather +## ├── payment_type=1 +## │ └── part-18.feather +## ├── payment_type=2 +## │ └── part-19.feather +## ... +## └── payment_type=UNK +## └── part-17.feather +## +## 18 directories, 23 files ``` -Note that the directory names are `payment_type=1` and similar: +Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. This means that when we call `open_dataset()` on this directory, we don't have to declare what the partitions are because they can be read from the file paths. -(To instead write bare values for partition segments, -i.e. `1` rather than `payment_type=1`, call `write_dataset()` with `hive_style = FALSE`.) +(To instead write bare values for partition segments, i.e. `Cash` rather than +`payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == 3` is the only data we ever care about, +Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, and we just want to drop the rest and have a smaller working set. For this, we can `filter()` them out when writing: ```r ds %>% - filter(payment_type == 3) %>% + filter(payment_type == "Cash") %>% write_dataset("nyc-taxi/feather", format = "feather") ``` @@ -381,4 +394,4 @@ ds %>% ``` Note that while you can select a subset of columns, -you cannot currently rename columns when writing. +you cannot currently rename columns when writing a dataset.
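+
+As a closing sketch building on the examples above (illustrative only;
+`feather_ds` is just a name), the partitioned Feather dataset written
+to `nyc-taxi/feather` can be opened again with `open_dataset()`, which
+infers the `payment_type` partition from the Hive-style directory
+names:
+
+```r
+# Re-open the dataset written above; the partition is read from the
+# directory names, so no `partitioning` argument is needed.
+feather_ds <- open_dataset("nyc-taxi/feather", format = "feather")
+
+feather_ds %>%
+  filter(total_amount > 100) %>%
+  collect()
+```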