Feature request: parallel functions #9

Closed
kendonB opened this Issue Jun 15, 2016 · 12 comments


kendonB commented Jun 15, 2016

Currently, there's no nice way to get progress bars for parallel::*apply functions; the best I'm able to do on Windows is to write to a .txt log file from each process, which is cumbersome.

psolymos (Owner) commented Jun 15, 2016

@kendonB: not sure about mclapply, but the par*apply functions split the workload and push it to the workers, so the main process is only idling and there is no real progress to show. Right now I can't see an easy way of implementing the request, but I am open to suggestions.

kendonB commented Jun 15, 2016

Perhaps a solution is through a text file on disk. The main process could periodically read the file and output something to the console. I have no idea how easy this might be, or how deep into the parallel package functions you might have to go to get it to periodically monitor something.
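
A minimal sketch of this idea (hypothetical names; a real implementation would need care with concurrent writes from multiple workers):

```r
# Hypothetical sketch: each worker appends one character per completed
# task to a shared file; the master polls the file size to infer progress.
progress_file <- tempfile()
file.create(progress_file)
task <- function(i) {
    cat(".", file = progress_file, append = TRUE)  # signal task completion
    i^2                                            # stand-in for real work
}
tasks_done <- function() file.size(progress_file)  # bytes written == tasks finished
res <- lapply(1:5, task)  # serial here; would run on workers in practice
tasks_done()              # 5
```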

kvnkuang commented Sep 7, 2016

Hi there, I recently created a package to track the parallel apply functions (mc*apply). It's on CRAN now: https://cran.r-project.org/web/packages/pbmcapply/index.html.

psolymos added the enhancement label Sep 7, 2016

psolymos (Owner) commented Sep 7, 2016

@kvnkuang: thanks for the note, it is great to see a package addressing the feature request for forking-type parallelism. I consider #9 closed for now.

psolymos closed this Sep 7, 2016

kendonB commented Sep 7, 2016

I'd suggest keeping this open, as it doesn't yet work for Windows.

kvnkuang commented Sep 7, 2016

Hey @kendonB, since forking is not supported on Windows, mc*apply will throw an error if you try to run it on Windows with num.cores > 1. So unfortunately the package cannot work on Windows.

psolymos reopened this Sep 7, 2016

psolymos (Owner) commented Sep 7, 2016

Maybe a solution similar to parLapplyLB could be implemented, at the cost of increased communication overhead between the master and the workers. This could work on Windows (and on other OSes as well).

psolymos (Owner) commented Sep 8, 2016

@kendonB: see my take on a possible solution in 9bf861b. The same idea can be carried forward to similar functions (pbsapply and pbreplicate for sure, because these are based on pblapply). By adding the cl argument after ..., I add the option for parallel processing to pblapply itself instead of having it in a separate function.

The main difference relative to what parallel::parLapply does is this:

> parallel::splitIndices(10, 4)
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5

[[3]]
[1] 6 7

[[4]]
[1]  8  9 10

> splitpb(10, 4)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6 7 8

[[3]]
[1]  9 10

which means that instead of passing all the chunks to the workers at once, we do it in multiple rounds, updating the progress bar in between. This means increased communication overhead between the master and the workers, which is the price one has to pay for a progress bar. Currently I can't see any workaround to speed things up further. See a small example in the commit cited above for timings.
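
The chunking shown above can be sketched in a few lines of base R (a hypothetical re-implementation for illustration; the actual splitpb may differ):

```r
# Hypothetical sketch of splitpb-style chunking: nx jobs are split into
# consecutive chunks of size ncl, one chunk per dispatch round, so the
# progress bar can tick after each round.
splitpb_sketch <- function(nx, ncl) {
    i <- seq_len(nx)
    unname(split(i, ceiling(i / ncl)))
}
splitpb_sketch(10, 4)  # list(1:4, 5:8, 9:10), matching the splitpb output above
```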

mclapply can be added in a similar manner, with cl given as an integer. I would rather remove the cluster-auto-detect feature, as I find it quite dangerous (e.g. you might have to push objects to the workers anyway due to the lack of shared memory in a non-forking situation, but more importantly, RNGs cannot be set up safely when the cluster is created AND destroyed within the function).


psolymos (Owner) commented Sep 8, 2016

This is now in the pb-parallel branch. Here is a todo list:

  • implement the parallel option as part of pblapply through the cl argument
  • implement mclapply-based forking when is.integer(cl)
  • test forking on Unix
  • update the examples with the parallel feature and add timings (use \dontrun{})
  • remind folks that objects need to be pushed to the cluster
  • remind folks that safe RNG set-up is their responsibility
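
The first two items could dispatch on the class of cl roughly like this (a hypothetical sketch, not pbapply's actual implementation):

```r
# Hypothetical dispatch on cl: NULL -> serial lapply, cluster object ->
# snow-type cluster, integer -> mclapply forking (not available on Windows).
pblapply_sketch <- function(X, FUN, ..., cl = NULL) {
    if (is.null(cl))
        lapply(X, FUN, ...)
    else if (inherits(cl, "cluster"))
        parallel::parLapply(cl, X, FUN, ...)
    else if (is.numeric(cl))
        parallel::mclapply(X, FUN, ..., mc.cores = as.integer(cl))
    else stop("cl must be NULL, an integer, or a cluster object")
}
```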

psolymos added this to the v1.3 milestone Sep 8, 2016

psolymos self-assigned this Sep 8, 2016

psolymos (Owner) commented Sep 8, 2016

Forking on Ubuntu Linux technically works, but the performance is very bad. So far it looks like neither my implementation nor @kvnkuang's pbmcapply::pbmclapply gives a huge improvement in this particular bootstrap example:

> n=10000
> x <- rnorm(n)
> y <- rnorm(n, crossprod(t(model.matrix(~x)), c(0,1)), sd=0.5)
> d <- data.frame(y, x)
> ## model fitting and bootstrap
> mod <- lm(y~x, d)
> ndat <- model.frame(mod)
> B <- 100
> bid <- sapply(1:B, function(i) sample(nrow(ndat), nrow(ndat), TRUE))
> fun <- function(z) {
+     if (missing(z))
+         z <- sample(nrow(ndat), nrow(ndat), TRUE)
+     coef(lm(mod$call$formula, data=ndat[z,]))
+ }
> system.time(res1 <- lapply(1:B, function(i) fun(bid[,i])))
   user  system elapsed
  1.444   0.016   1.460
> system.time(res1pb <- pblapply(1:B, function(i) fun(bid[,i])))
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 01s
   user  system elapsed
  1.460   0.036   1.495
> system.time(res2mc <- mclapply(1:B, function(i) fun(bid[,i]), mc.cores = 2L))
   user  system elapsed
  0.004   0.008   0.959
> system.time(res1pbmc <- pblapply(1:B, function(i) fun(bid[,i]), cl = 2L))
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
   user  system elapsed
  3.848   0.900   1.612
> system.time(res1pbmcx <- pbmclapply(1:B, function(i) fun(bid[,i]), mc.cores = 2L))
  |========================================================| 100%   
   user  system elapsed
  0.152   0.020   1.564

As opposed to forking, snow-type clusters work much faster, and the improvement is reasonable even with the increased overhead:

> cl <- makeCluster(2L)
> clusterExport(cl, c("fun", "mod", "ndat", "bid"))
> system.time(res1cl <- parLapply(cl = cl, 1:B, function(i) fun(bid[,i])))
   user  system elapsed
  0.004   0.000   0.984
> system.time(res1pbcl <- pblapply(1:B, function(i) fun(bid[,i]), cl = cl))
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 01s
   user  system elapsed
  0.076   0.008   1.163
> stopCluster(cl)

I am also tempted to find some clever way of tuning how splitpb works. Currently it splits the problem of nx jobs into nn = ceiling(nx / ncl) partitions. That is reasonable if, say, nn is <50 or <25, so that the progress bar advances smoothly. For larger problems, we could use a constant k to cap the number of partitions at some maximum, say 50 or 100: instead of splitpb(nx, ncl) I can use splitpb(nx, ncl * k). This would still provide a smooth progress bar while minimizing overhead for large problems. It could also help in the forking case when the number of iterations (B) is large.

Additional todo items:

  • implement tuning for splitpb
  • test bootstrap case with B=1000 and see how much tuning helps.
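
The proposed tuning can be sketched as follows (hypothetical helper; the names and the value of k are illustrative):

```r
# Hypothetical sketch: chunks of size ncl give ceiling(nx / ncl) bar
# updates; enlarging the chunk size to ncl * k caps the number of
# dispatch rounds at roughly ceiling(nx / (ncl * k)).
splitpb_sketch <- function(nx, chunk) {
    i <- seq_len(nx)
    unname(split(i, ceiling(i / chunk)))
}
k <- 100
length(splitpb_sketch(10000, 2))      # 5000 rounds: smooth bar, heavy overhead
length(splitpb_sketch(10000, 2 * k))  # 50 rounds: still smooth, far less overhead
```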

psolymos (Owner) commented Sep 12, 2016

See some timing results in this blog post.


psolymos referenced this issue Sep 14, 2016

Merged

Pb parallel #10

psolymos (Owner) commented Sep 14, 2016

PR #10 closes this feature request.


psolymos closed this Sep 14, 2016
