
BUG: After waitForJobs(), reduceResultsList() returns a list of NULLs and only a bit later the actual values #85

Closed
HenrikBengtsson opened this issue Feb 6, 2017 · 10 comments


@HenrikBengtsson

Issue

Using makeClusterFunctionsTorque() on a TORQUE compute cluster, reduceResultsList() at first returns a list of NULL elements and only a bit later a list of the actual values, even after waitForJobs() has returned TRUE. I suspect this is due to the infamous NFS delay combined with not polling for the existence of the result files.

Example

Sourcing the following test.R script:

library("batchtools")
unlink("registry", recursive = TRUE)
stopifnot(!file_test("-d", "registry"))
reg <- makeRegistry("registry")
cat("#!/bin/bash\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n", file = "my.tmpl")
reg$cluster.functions <- makeClusterFunctionsTorque(template = "my.tmpl")
print(reg)

jobs <- batchMap(fun = identity, 1:3, reg = reg)
print(jobs)

submitJobs(reg = reg)

done <- waitForJobs(reg = reg)
stopifnot(done)

## Poll reduceResultsList() until the first result is non-NULL
t0 <- Sys.time()
count <- 1L
repeat {
  y <- reduceResultsList(reg = reg)
  str(list(count = count, dt = Sys.time() - t0, y = y))
  if (!is.null(y[[1]])) break
  Sys.sleep(0.1)
  count <- count + 1L
}

gives the following:

> library("batchtools")
> unlink("registry", recursive = TRUE)
> stopifnot(!file_test("-d", "registry"))
> reg <- makeRegistry("registry")
> cat("#!/bin/bash\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n", file = "my.tmpl")
> reg$cluster.functions <- makeClusterFunctionsTorque(template = "my.tmpl")
> print(reg)
Job Registry
  Name    : Torque
  File dir: /home/henrik/batchtools-review/registry
  Work dir: /home/henrik/batchtools-review
  Jobs    : 0
  Seed    : 567656413

> jobs <- batchMap(fun = identity, 1:3, reg = reg)
Adding 3 jobs ...

> print(jobs)
   job.id
1:      1
2:      2
3:      3

> submitJobs(reg = reg)
Submitting 3 jobs in 3 chunks using cluster functions 'Torque' ...

> done <- waitForJobs(reg = reg)
> stopifnot(done)

> t0 <- Sys.time()
> count <- 1L
> repeat {
+   y <- reduceResultsList(reg = reg)
+   str(list(count = count, dt = Sys.time() - t0, y = y))
+   if (!is.null(y[[1]])) break
+   count < .... [TRUNCATED] 
List of 3
 $ count: int 1
 $ dt   :Class 'difftime'  atomic [1:1] 0.00505
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 3
 $ count: int 2
 $ dt   :Class 'difftime'  atomic [1:1] 0.017
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
[...]
List of 3
 $ count: int 425
 $ dt   :Class 'difftime'  atomic [1:1] 4.87
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 3
 $ count: int 426
 $ dt   :Class 'difftime'  atomic [1:1] 4.89
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3
> 

Troubleshooting / Suggestion

It looks like batchtools:::.reduceResultsList(), which is called by reduceResultsList(), silently skips reading any result files for which file.exists(fns) returns FALSE.

If that's indeed the intended behavior, then I cannot tell whether the bug is in reduceResultsList(), for assuming all the result files are there when batchtools:::.reduceResultsList() is called, or in waitForJobs(), which should not return until all the result files are there.

As my example code shows, I suspect there's a delay in the network file system (NFS) causing result files that have already been written to not be visible from the master machine until several seconds later. The behavior of file.exists() suggests this.

FYI, it looks like calling dir(path = file.path(reg$file.dir, "results")) forces NFS to sync its view such that file.exists() returns TRUE. However, I don't know whether that is a bulletproof solution.

To me it seems like batchtools (waitForJobs() or something) needs to poll for the result files before they are queried; see the workaround sketch below.
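
For illustration, a minimal sketch of that dir()-based workaround, written by hand for this report (not part of batchtools). It assumes the result files are named <job.id>.rds under file.path(reg$file.dir, "results"), which is what I see on my system:

## Workaround sketch: list the results directory to refresh the NFS
## attribute cache before reducing
dir(path = file.path(reg$file.dir, "results"))
fns <- file.path(reg$file.dir, "results", sprintf("%i.rds", jobs$job.id))
stopifnot(all(file.exists(fns)))
y <- reduceResultsList(reg = reg)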

LATE UPDATES:

  • I see that batchtools:::.loadResult() and hence loadResult() gives an error if the file is not there.
  • I see from help("reduceResultsList") that the "otherwise NULL" behavior is expected from that function.
  • I see that reduceResults() will try to read the result files directly, so if they're not there an error will be generated. PS / feedback: the different default/error behavior compared to reduceResultsList() is a bit confusing given the similarity of their names.

Session information

This is on a Scyld cluster with TORQUE / PBS + Moab.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] batchtools_0.9.1-9000 data.table_1.10.4     R.archive_0.3.0      

loaded via a namespace (and not attached):
 [1] prettyunits_1.0.2 digest_0.6.12     assertthat_0.1    R.methodsS3_1.7.1
 [5] rappdirs_0.3.1    R6_2.2.0          backports_1.0.5   magrittr_1.5     
 [9] progress_1.1.2    stringi_1.1.2     R.oo_1.21.0       R.utils_2.5.0    
[13] brew_1.0-6        checkmate_1.8.2   tools_3.3.2      
@mllg
Owner

mllg commented Feb 6, 2017

Henrik,

Thanks for the detailed report. I'll change the result functions to act more consistently in the next few days.

Regarding the file system latency: I guess you are hitting the NFS attribute cache. Are you only experiencing problems with the results? If at all possible, can you check whether you get reliable results with one of:

file.exists = function(x) { isdir = file.info(x)$isdir; !is.na(isdir) & !isdir }
file.exists = function(x) basename(x) %in% list.files(dirname(x))

?

@HenrikBengtsson
Author

Using:

file_exists_1 <- function(x) { isdir <- file.info(x)$isdir; !is.na(isdir) & !isdir }
file_exists_2 <- function(x) { basename(x) %in% list.files(dirname(x)) }

on the first result file:

pathname <- file.path(reg$file.dir, "results", "1.rds")

I get:

List of 6                                                                     
 $ count        : int 1
 $ file.exists  : logi FALSE
 $ file_exists_1: logi FALSE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.0138
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 6
 $ count        : int 2
 $ file.exists  : logi TRUE
 $ file_exists_1: logi TRUE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.165
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3

I suspect that the first call to file_exists_2(pathname) triggers the NFS cache to be updated. This is in line with my previous comment where I noticed that dir(path = file.path(reg$file.dir, "results")) seems to fix it.
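
For reference, this is roughly the probe loop behind the output above (a hypothetical reconstruction; I did not post the exact code). pathname, file_exists_1() and file_exists_2() are as defined above:

## Hypothetical reconstruction of the probe loop used above
t0 <- Sys.time()
count <- 1L
repeat {
  y <- reduceResultsList(reg = reg)
  str(list(count = count,
           file.exists = file.exists(pathname),
           file_exists_1 = file_exists_1(pathname),
           file_exists_2 = file_exists_2(pathname),
           dt = Sys.time() - t0,
           y = y))
  if (!is.null(y[[1]])) break
  count <- count + 1L
}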

@mllg
Owner

mllg commented Feb 7, 2017

I've started with a simple waitForFiles() function here.
You can set fs.latency in makeClusterFunctions* to a positive number (e.g., 30) to let waitForJobs() also wait for the results.

If the approach works, I will try to generalize and apply the same approach to other file system ops.
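
For example, a usage sketch of the fs.latency argument described above (30 is just an arbitrary example value):

## Usage sketch: let waitForJobs() also wait for files on a laggy NFS
reg$cluster.functions <- makeClusterFunctionsTorque(template = "my.tmpl",
                                                    fs.latency = 30)
done <- waitForJobs(reg = reg)  # now also waits for the result files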

@HenrikBengtsson
Author

Would it make sense to make fs.latency = 30 the default rather than zero? This would cover more users (one less thing to worry about). The downside is that users might not become aware of their NFS cache delay, but I'm not sure that's a big problem (e.g. I doubt they would go and optimize the settings to lower it anyway).

@HenrikBengtsson
Author

BTW, I wonder whether there is a well-defined / documented system call "out there" that triggers a blocking update of the NFS cache. Maybe list.files() / dir() ends up calling it, which would explain why it appears to force NFS to refresh (at least on my system).

@HenrikBengtsson
Author

Just reporting back. Trying what's on the master branch right now, I get:

> done <- waitForJobs(reg = reg)                                              
Syncing 3 files ...
> done
[1] TRUE
> y <- reduceResultsList(reg = reg)
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/henrik/registry/results/1.rds', probable reason 'No such file or directory'

which I consider an improvement, because it's better to get an error than an incorrect value (= NULL).

And, sure enough, I can refresh the NFS cache by calling:

> dir(file.path(reg$file.dir, "results"))
[1] "1.rds" "2.rds" "3.rds"

and after this, I can read the results:

> y <- reduceResultsList(reg = reg)
> str(y)
List of 3
 $ : num 1
 $ : num 2
 $ : num 3

HenrikBengtsson added a commit to HenrikBengtsson/future.batchtools that referenced this issue Feb 11, 2017
@mllg
Owner

mllg commented Feb 13, 2017

I've merged a heuristic into master; the default timeout is 65s (NFS keeps the cache for up to 60s, so this hopefully works). I try a file.exists() first and then call list.files() if not all files were found in the first attempt. This is disabled for the Interactive and Multicore backends, as these do not involve multiple NFS hosts and a client should invalidate its own cache after writing to a directory. Let me know if you also experience problems in these modes. A rough sketch of the idea follows below.
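
An illustrative sketch of that heuristic, not the actual batchtools implementation (the function and argument names here are made up for illustration):

## Illustrative sketch of the heuristic (not batchtools' actual code):
## try file.exists() first; if files are missing, list the parent
## directories to refresh the NFS attribute cache and retry until timeout
waitForFiles <- function(fns, timeout = 65, sleep = 0.5) {
  t0 <- Sys.time()
  repeat {
    if (all(file.exists(fns))) return(TRUE)
    for (path in unique(dirname(fns))) list.files(path)
    if (all(file.exists(fns))) return(TRUE)
    if (difftime(Sys.time(), t0, units = "secs") > timeout) return(FALSE)
    Sys.sleep(sleep)
  }
}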

As you already noticed, I've changed the reduce functions in the master branch to throw an error if the result file is not found. Both changes put together should solve the issues on your system.

Sorry for bothering you with this stuff, but I cannot reproduce the problem on any of my systems. And thanks for the in-depth analysis.

@HenrikBengtsson
Author

No worries, I'm happy to help out getting this package as stable as possible.

So, good news. Today I put batchtools 0.9.1-9000 (commit 46e2bfe) through some serious real-world testing on our TORQUE / PBS cluster. I did this via an early version of future.batchtools, which internally utilizes the batchtools package. This allowed me to run the same system tests that I also run with sequential futures, plain futures (on top of the parallel package), and future.BatchJobs. Several of these tests also assert identical results regardless of the computation backend. The tests run for hours, and I get all OK.

@HenrikBengtsson
Author

Feel free to close this issue whenever you're done.

@mllg
Owner

mllg commented Feb 14, 2017

Great! Re-open if you encounter any problems.
