Job requeue on Slurm - 'no such file or directory' & workaround #280

Open
stuvet opened this issue Jul 24, 2021 · 1 comment

Comments

@stuvet
Contributor

stuvet commented Jul 24, 2021

I've been troubleshooting the stability of batchtools when used on Slurm with the default makeClusterFunctionsSlurm (see PRs #276 & #277).

The last (rare) error I can reproduce is described below.

Expected Behaviour

  • If a submitted job is requeued by Slurm:
    1. batchtools should not report an expired status (addressed by Mapped missing Slurm job state codes #277).
    2. If the job would have run without error at first submission, the requeued job should also run successfully (assuming no further fatal hardware errors).

Problem

  • Slurm jobs which are requeued because of a previous hardware failure fail within 30 seconds of starting the second run.

Reprex

  • The reprex is awkward because it relies on access to an available (non-mission-critical) Slurm cluster, but manually deleting the worker node (via GCP) of a running, error-free job results in a requeue, a delay, and then a reliable error about 20 seconds after the job begins its second run (file path removed for posting; a local sketch of the same failure follows the error output below):
Error in gzfile(file, "rb") : cannot open the connection
Calls: <Anonymous> -> doJobCollection.character -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '.../jobs/job929872958e6074e5662a4c9hd3f312f4.rds', probable reason 'No such file or directory'
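
For reference, the same failure can be reproduced locally without touching a cluster. A minimal sketch in R, assuming batchtools::doJobCollection() is called with the path to the jobCollection .rds (as the traceback above shows); the path here is a throwaway placeholder:

library(batchtools)

# Placeholder path standing in for jobs/job<hash>.rds; any .rds path that no
# longer exists reproduces the failure seen on the requeued worker.
missing_jc <- tempfile(fileext = ".rds")

# On a requeued job the first run has already deleted this file, so the call
# fails in readRDS()/gzfile() exactly as in the log above.
batchtools::doJobCollection(missing_jc)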

Cause

  • batchtools:::doJobCollection.character deletes the jobCollection .rds file after the first run, so when the failed job is requeued the file is no longer there, which causes the error above.

  • Handling the missing file with an informative error message would be helpful (a sketch follows this list).
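
A minimal sketch of the kind of guard that could produce a clearer message; read_job_collection() is a hypothetical helper for illustration, not batchtools code:

# Hypothetical helper, for illustration only.
read_job_collection <- function(file) {
  if (!file.exists(file)) {
    stop(sprintf(
      paste("Job collection file '%s' not found.",
            "It may have been deleted by a previous run of this job,",
            "e.g. after the scheduler requeued it following a node failure."),
      file), call. = FALSE)
  }
  readRDS(file)
}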

Workaround

  • Passing chunks.as.arrayjobs = TRUE in the resources request prevents this error (even if jobs are submitted singly), because it stops the first run of the job from deleting the jobCollection .rds (see the sketch after this list).
    • This workaround also works via future.batchtools, even though submission there doesn't result in array jobs.
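
A minimal sketch of the workaround at submission time, assuming a registry backed by makeClusterFunctionsSlurm; the template name and registry directory are placeholders, and any other resources (walltime, memory, etc.) remain template-dependent:

library(batchtools)

reg <- makeRegistry(file.dir = "registry", make.default = FALSE)
reg$cluster.functions <- makeClusterFunctionsSlurm(template = "slurm")  # placeholder template

batchMap(function(x) x^2, x = 1:2, reg = reg)

# Requesting array-job submission keeps the jobCollection .rds on disk after
# the first run, so a requeued job can still find it.
submitJobs(resources = list(chunks.as.arrayjobs = TRUE), reg = reg)

With future.batchtools the same resource should be passable through the plan, e.g. plan(batchtools_slurm, template = "slurm", resources = list(chunks.as.arrayjobs = TRUE)), again with resource names depending on the template.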

Questions

  • Apart from needing to clean up the files afterwards, can you see any downsides of using chunks.as.arrayjobs = TRUE for single jobs too? If not, this could be a useful default setting for @HenrikBengtsson when submitting jobs from future.batchtools, simply to avoid triggering an unhandled error, and to allow jobs to requeue as expected (assuming backend configuration allows).

  • Perhaps a more explicit option would be better: allow.requeue or prevent.requeue?

@mllg
Owner

mllg commented Jul 28, 2021

It would be possible to just not delete the job files (and let sweepRegistry() handle the clean-up), or to introduce an additional option to turn this behaviour on or off. I tend to just leave the files there.

* Apart from needing to clean up the files afterwards, can you see any downsides of using `chunks.as.arrayjobs = TRUE` for single jobs too? If not, this could be a useful default setting for @HenrikBengtsson when submitting jobs from `future.batchtools`, simply to avoid triggering an unhandled error, and to allow jobs to requeue as expected (assuming backend configuration allows).

I've been working on Slurm clusters where support for array jobs is turned off, so this would be a problem.
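
For reference, a minimal sketch of the clean-up route mentioned above: if the job files were simply left in place, sweepRegistry() can remove the obsolete files later (the registry directory here is a placeholder):

library(batchtools)

reg <- loadRegistry("registry", writeable = TRUE)  # placeholder registry directory
sweepRegistry(reg)  # checks the file system for obsolete files and removes them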
