parallelsugar

An R package to provide mclapply() syntax for Windows machines. It has no effect on other platforms.

Note, this is an update of the script formerly found at

http://www.stat.cmu.edu/~nmv/setup/mclapply.hack.R

If you wish to continue using that version (for whatever reason), you can find the script at

http://edustatistics.org/nathanvan/setup/mclapply.hack.R

and the accompanying blog post describing its use here.

Installation

Step 0: If you do not already have devtools installed, install it using the instructions here. Note that for the purposes of this package, installing Rtools is not necessary.

Step 1: Install parallelsugar directly from my GitHub repository using install_github('nathanvan/parallelsugar'). For the purposes of this package, you may ignore the warning about Rtools. (If you already have Rtools installed, the warning will not appear.)

> library(devtools)
WARNING: Rtools is required to build R packages, but is not currently
installed.
   ... snip ...
> install_github('nathanvan/parallelsugar')
Downloading github repo nathanvan/parallelsugar@master
Installing parallelsugar
  ... snip ...
* DONE (parallelsugar)

Usage examples

Basic Usage

On Windows, the following line takes about 40 seconds to run (four 10-second sleeps, one after another) because mclapply from the parallel package is implemented as a serial function on Windows systems.

library(parallel) 

system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.00    0.00   40.06 

If we load parallelsugar, parallel::mclapply, which relies on forking and therefore falls back to serial execution on Windows, is masked by parallelsugar::mclapply, which is implemented with socket clusters. The line of code above then takes closer to 10 seconds.

library(parallelsugar)
## 
## Attaching package: ‘parallelsugar’
## 
## The following object is masked from ‘package:parallel’:
## 
##     mclapply
    
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.04    0.08   12.98 
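
For readers unfamiliar with socket clusters, the general idea can be sketched with the parallel package alone. This is a minimal illustration and not parallelsugar's actual implementation; socket_lapply, sleep.then.return, and n.workers are hypothetical names introduced only for this example.

```r
library(parallel)

## Hypothetical helper (not part of parallelsugar): apply FUN to each
## element of X on a socket cluster, which works on Windows as well as
## on Unix-alikes.
socket_lapply <- function(X, FUN, ..., n.workers = 2) {
  cl <- makeCluster(n.workers)   ## socket (PSOCK) cluster by default
  on.exit(stopCluster(cl))       ## always shut the workers down
  parLapply(cl, X, FUN, ...)
}

## A slow function that uses only base R, so the workers need nothing extra
sleep.then.return <- function(xx) {
  Sys.sleep(1)
  xx
}

## Four 1-second tasks split across two workers
res <- socket_lapply(1:4, sleep.then.return, n.workers = 2)
```

Starting and stopping the workers has its own cost, which is one reason the parallel timing above is closer to 13 seconds than to exactly 10.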

Use of global variables and packages

By design, parallelsugar approximates a fork-based cluster: every object that is in scope in the master R process is copied over to each of the worker processes. This implies that

  • you can quickly run out of memory, and
  • you can waste a lot of time copying over unnecessary objects hanging around in your R session.

Be warned!
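
One way to see what would be copied is to list the largest objects in your session before calling mclapply. The helper below is a sketch using only base R; largest.objects and its arguments are hypothetical names, not part of parallelsugar.

```r
## Hypothetical helper: report the n largest objects (in bytes) found in
## an environment, by default the global environment.
largest.objects <- function(envir = globalenv(), n = 5) {
  nms <- ls(envir = envir)
  sizes <- vapply(
    nms,
    function(nm) as.numeric(object.size(get(nm, envir = envir))),
    numeric(1)
  )
  head(sort(sizes, decreasing = TRUE), n)
}

## Usage: inspect the global environment before going parallel
largest.objects()
```

Anything large that the parallel code does not need can be removed with rm() before the call.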

## Load a package 
library(Matrix)

## Define a global variable
a.global.variable <- Matrix::Diagonal(3)

## Define a global function 
wait.then.square <- function(xx){
  ## Wait for 5 seconds
  Sys.sleep(5)
  ## Square the argument
  xx^2 
}

## Check that it works with plain lapply
serial.output <- lapply( 1:4, function(xx) {
      return( wait.then.square(xx) + a.global.variable )
    }) 

## Test with the modified mclapply  
par.output <- mclapply( 1:4, function(xx) {
      return( wait.then.square(xx) + a.global.variable )
    })

## Are they equal? 
all.equal( serial.output, par.output )
## [1] TRUE
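
If copying everything is too expensive, the parallel package also lets you manage a socket cluster by hand and export only what the workers need. The sketch below uses standard parallel functions rather than parallelsugar; the object names (big.unused.object, offset) are made up for illustration.

```r
library(parallel)

## A large object the workers do not need, and a small one they do
big.unused.object <- numeric(1e6)
offset <- 10

cl <- makeCluster(2)

## Export only 'offset' to the workers
clusterExport(cl, "offset", envir = environment())

out <- parLapply(cl, 1:4, function(xx) xx^2 + offset)

## Check whether the large object exists in each worker's global environment
worker.has.big <- clusterEvalQ(cl, exists("big.unused.object"))

stopCluster(cl)
```

The trade-off is convenience: with this approach you have to track the required objects and packages yourself, which is the bookkeeping that parallelsugar's copy-everything design avoids.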

Request for feedback and help

I put this together because it helped to solve a specific problem that I was having. If it solves your problem, please let me know. If it needs to be modified to solve your problem, please either

  • open an issue on GitHub, or
  • even better, fork, fix, and issue a pull request.