BUG: Post waitForJobs(), reduceResultsList() returns list with NULLs and a bit later the values
#85
Comments
Henrik, thanks for the detailed report. I'll change the result functions to act more consistently in the next days. Regarding the file system latency: I guess you are hitting the attribute cache of NFS. Are you only experiencing problems for the results? If somehow possible, can you try whether you get reliable results with one of the following?

```r
file.exists = function(x) { isdir = file.info(x)$isdir; !is.na(isdir) & !isdir }
file.exists = function(x) basename(x) %in% list.files(dirname(x))
```
Using

```r
file_exists_1 <- function(x) { isdir <- file.info(x)$isdir; !is.na(isdir) & !isdir }
file_exists_2 <- function(x) { basename(x) %in% list.files(dirname(x)) }
```

on the first result file,

```r
pathname <- file.path(reg$file.dir, "results", "1.rds")
```

I get:

```r
List of 6
 $ count        : int 1
 $ file.exists  : logi FALSE
 $ file_exists_1: logi FALSE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.0138
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 6
 $ count        : int 2
 $ file.exists  : logi TRUE
 $ file_exists_1: logi TRUE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.165
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3
```

I suspect that the first call to …
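The loop that produced the above output is not included in the thread; the following is a rough reconstruction of how such a comparison could be run. The repeat/Sys.sleep structure, the list layout, and the use of reduceResultsList() for the y field are assumptions; file_exists_1(), file_exists_2(), and pathname are as defined above.

```r
## Rough reconstruction (not the original code): repeatedly probe the first
## result file with the three existence checks until plain file.exists()
## agrees, collecting the fields that were str()'ed above.
file_exists_1 <- function(x) { isdir <- file.info(x)$isdir; !is.na(isdir) & !isdir }
file_exists_2 <- function(x) basename(x) %in% list.files(dirname(x))

pathname <- file.path(reg$file.dir, "results", "1.rds")
count <- 0L
repeat {
  count <- count + 1L
  t0 <- Sys.time()
  res <- list(
    count         = count,
    file.exists   = file.exists(pathname),
    file_exists_1 = file_exists_1(pathname),
    file_exists_2 = file_exists_2(pathname),
    dt            = Sys.time() - t0,            ## time spent on the checks
    y             = reduceResultsList(reg = reg)
  )
  str(res)
  if (res$file.exists) break
  Sys.sleep(1)
}
```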
I've started with a simple … If the approach works, I will try to generalize and apply the same approach to other file system ops.
Would it make sense to make …?
BTW, I wonder if there could be a well-defined / documented system call "out there" that triggers a blocking update of the NFS cache? Maybe …
Just reporting back. Trying what's on the master branch right now, I get:

```r
> done <- waitForJobs(reg = reg)
Syncing 3 files ...
> done
[1] TRUE
> y <- reduceResultsList(reg = reg)
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/henrik/registry/results/1.rds', probable reason 'No such file or directory'
```

which I consider an improvement, because it's better to get an error than an incorrect value (= NULL). And, sure enough, I can refresh the NFS cache by calling:

```r
> dir(file.path(reg$file.dir, "results"))
[1] "1.rds" "2.rds" "3.rds"
```

and after this, I can read the results:

```r
> y <- reduceResultsList(reg = reg)
> str(y)
List of 3
 $ : num 1
 $ : num 2
 $ : num 3
```
I've merged a heuristic into master and the default timeout is 65s (NFS keeps the cache up to 60s, so this hopefully works). I try a … As you already noticed, I've changed the reduce functions in the master branch to throw an error if the result file is not found. Both changes put together should solve your issues on your system. Sorry for bothering you with this stuff, but I cannot reproduce this on any of my systems. And thanks for the in-depth analysis.
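For illustration only (this is not the actual batchtools implementation): a generic wait-for-file helper in the spirit described above, polling up to a timeout and refreshing the parent directory listing to work around the NFS attribute cache. The function name waitForFile and its arguments are hypothetical.

```r
## Illustrative sketch, not batchtools code: wait until a file becomes
## visible, re-reading the parent directory to defeat the NFS attribute
## cache (which may serve stale metadata for up to ~60 seconds).
waitForFile <- function(path, timeout = 65, sleep = 0.5) {
  t0 <- Sys.time()
  repeat {
    ## listing the parent directory forces a fresh lookup, which also
    ## refreshes a cached (possibly negative) file.exists() answer
    if (basename(path) %in% list.files(dirname(path))) return(TRUE)
    if (difftime(Sys.time(), t0, units = "secs") > timeout) return(FALSE)
    Sys.sleep(sleep)
  }
}
```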
No worries, I'm happy to help out getting this package as stable as possible. So, good news. Today I've put batchtools 0.9.1-9000 (commit 46e2bfe) through some serious real-world testing on our TORQUE / PBS cluster. I did this via an early version of future.batchtools, which internally utilizes the batchtools package. This allowed me to run through the same system tests that I also run with sequential, plain futures (on top of the parallel package), and future.BatchJobs. Several of these tests also assert identical results (regardless of computation backend). The tests run for hours. I get all OK in these tests.
Feel free to close this issue whenever you're done.
Great! Re-open if you encounter any problems.
Issue
Using makeClusterFunctionsTorque() on a TORQUE compute cluster, reduceResultsList() at first returns a list of NULL elements but a bit later a list of the actual values. This happens even after waitForJobs() returns TRUE. I suspect this is due to the infamous NFS delay and to not polling for the existence of the result files.
Example
Sourcing the following test.R script:

[…]

gives the following:

[…]
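The test.R script and its output did not survive in this extract. Below is a minimal sketch of what such a script might look like, pieced together from the discussion above (three jobs whose results are 1, 2, 3); the template file name "torque.tmpl" and the identity payload function are assumptions.

```r
## Hypothetical test.R -- not the original script from the report.
library(batchtools)

reg <- makeRegistry(file.dir = "registry")
reg$cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")  ## template name assumed

batchMap(function(i) i, i = 1:3, reg = reg)   ## three trivial jobs returning 1, 2, 3
submitJobs(reg = reg)

print(waitForJobs(reg = reg))        ## TRUE
str(reduceResultsList(reg = reg))    ## at first: three NULLs; a bit later: the values
```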
Troubleshooting / Suggestion
It looks like batchtools:::.reduceResultsList(), which is called by reduceResultsList(), silently skips reading any result files for which file.exists(fns) returns FALSE. If that's indeed the intended behavior, then I cannot tell whether the bug is in reduceResultsList(), for assuming all the result files are already there when batchtools:::.reduceResultsList() is called, or in waitForJobs(), which should not return until all the result files are there.

As my example code shows, I suspect that there's a delay in the network file system (NFS) causing already written result files to not be visible from the master machine until several seconds later. The file.exists() results above suggest this.

FYI, it looks like calling dir(path = file.path(reg$file.dir, "results")) forces NFS to sync its view such that file.exists() returns TRUE. However, I don't know whether that is a bulletproof solution.

To me it seems like batchtools (waitForJobs() or something) needs to poll for the result files before they are queried.
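A rough sketch of the kind of polling suggested here, i.e. not considering jobs done until their result files are actually visible on the master. This is not batchtools code; the helper name, the 65s default (taken from the remark earlier in the thread about the NFS attribute cache), and the "<job.id>.rds" file naming are assumptions based on this thread.

```r
## Illustrative only, not part of batchtools: poll until every expected
## result file (<job.id>.rds under <file.dir>/results) is visible, using a
## single list.files() call per iteration to refresh the NFS view.
allResultsVisible <- function(reg, ids, timeout = 65, sleep = 1) {
  results.dir <- file.path(reg$file.dir, "results")
  expected <- sprintf("%i.rds", ids)
  t0 <- Sys.time()
  repeat {
    if (all(expected %in% list.files(results.dir))) return(TRUE)
    if (difftime(Sys.time(), t0, units = "secs") > timeout) return(FALSE)
    Sys.sleep(sleep)
  }
}

## e.g. allResultsVisible(reg, ids = 1:3) before calling reduceResultsList()
```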
LATE UPDATES:
- batchtools:::.loadResult(), and hence loadResult(), gives an error if the file is not there.
- help("reduceResultsList") documents that the "otherwise NULL behavior" is expected from that function.
- reduceResults() will try to read the result files directly, so if they're not there an error will be generated. PS / feedback: the different default / error behavior of reduceResults() compared to reduceResultsList() is a bit confusing given their similarity in names.
Session information
This is on a Scyld cluster with TORQUE / PBS + Moab.