multidplyr

multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with partition() and then the data stays on each node until you explicitly retrieve it with collect(). This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by partools by Norm Matloff and distributedR by the Vertica Analytics team.
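The partition-then-collect workflow described above might look like the following. This is a minimal sketch, not the package's canonical example: it assumes the development version this README describes, where `create_cluster()` builds the cluster and `partition()` takes the grouping variable and a `cluster` argument. Those names may differ in other versions of the package. The two-node cluster and the use of `mtcars` are illustrative choices.

```r
library(dplyr)
library(multidplyr)

# Assumption: create_cluster() is the cluster constructor in this
# development version; two nodes is an arbitrary illustrative choice.
cluster <- create_cluster(2)

# Split mtcars by cyl. Each group is sent whole to one node and stays there.
by_cyl <- partition(mtcars, cyl, cluster = cluster)

# summarise() runs on each node against the rows that node holds.
mean_mpg <- summarise(by_cyl, mpg = mean(mpg))

# Nothing comes back to the local session until collect() is called.
result <- collect(mean_mpg)
```

Only the small summarised result crosses back to the local session, which is the point of keeping the data on the nodes until `collect()`.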

Because communicating between the nodes carries overhead, you shouldn't expect much performance improvement on basic dplyr verbs with fewer than ~10 million observations. However, you'll see improvements with far fewer observations if you're doing more complex operations with do().
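As a rough illustration of the more complex case, the sketch below fits one model per group with do(). It rests on the same assumptions as any example for this development version: `create_cluster()` and the `cluster` argument to `partition()` are taken from that version and may not match other releases. The model formula and the `mtcars` data are purely illustrative, and a dataset this small will not show any speed-up.

```r
library(dplyr)
library(multidplyr)

# Assumption: create_cluster() exists in this development version.
cluster <- create_cluster(2)

# Each cyl group lives on exactly one node.
by_cyl <- partition(mtcars, cyl, cluster = cluster)

# do() runs on the nodes, fitting one linear model per cyl group.
# lm() comes from the stats package, which a fresh R worker loads by default.
models <- do(by_cyl, mod = lm(mpg ~ wt, data = .))
```

The per-group work here (a model fit) is much heavier than a simple summary, so the cost of moving data to the nodes is a smaller share of the total time.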

To learn more, read the vignette.

Installation

To install from GitHub:

# install.packages("devtools")
devtools::install_github("hadley/multidplyr")
