fix: batch collect jobs for scancel (snakemake#2114)
### Description

When using `--slurm`, exiting Snakemake via SIGINT (Ctrl-C) is meant to
cancel the spawned jobs. In practice this is unreliable: it often hangs
for a few minutes and then exits without canceling the jobs, and without
reporting that the cancellation failed.

The Slurm documentation on
[scancel](https://slurm.schedmd.com/scancel.html#SECTION_PERFORMANCE)
notes that a large number of scancel calls issued at the same time may
result in denial of service.

Snakemake runs scancel on each job
[individually](https://github.com/snakemake/snakemake/blob/main/snakemake/executors/slurm/slurm_submit.py#L136).
Instead, the job IDs should be collected and cancelled with a single scancel call.
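
The difference between the two approaches can be sketched as follows. This is only an illustration, not Snakemake's code: the helper names `scancel_commands_per_job` and `scancel_command_batched` are hypothetical, and the job IDs are made up. The functions only build argument lists, so nothing is sent to Slurm.

```python
def scancel_commands_per_job(jobids):
    """Per-job approach: one scancel invocation for every job ID."""
    return [["scancel", str(jobid)] for jobid in jobids]


def scancel_command_batched(jobids):
    """Batched approach: a single scancel invocation for all job IDs.

    Returns None when there is nothing to cancel.
    """
    ids = [str(jobid) for jobid in jobids]
    if not ids:
        return None
    return ["scancel", *ids]


if __name__ == "__main__":
    example_ids = ["1001", "1002", "1003"]  # made-up job IDs
    print(len(scancel_commands_per_job(example_ids)), "calls without batching")
    print(scancel_command_batched(example_ids))
```

With N active jobs, the per-job variant issues N requests to slurmctld, while the batched variant issues one.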

fixes snakemake#2113

### QC
* [X] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [X] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change modifies neither the
language nor the behavior or functionality of Snakemake).

Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
yamanq and johanneskoester committed Feb 18, 2023
1 parent b12d803 commit 0b1fe31
Showing 1 changed file with 5 additions and 4 deletions.
9 changes: 5 additions & 4 deletions snakemake/executors/slurm/slurm_submit.py
@@ -134,22 +134,23 @@ def additional_general_args(self):
         return [" --slurm-jobstep", "--jobs 1"]
 
     def cancel(self):
-        for job in self.active_jobs:
-            jobid = job.jobid
+        # Jobs are collected to reduce load on slurmctld
+        jobids = " ".join([job.jobid for job in self.active_jobs])
+        if len(jobids) > 0:
             try:
                 # timeout set to 60, because a scheduler cycle usually is
                 # about 30 sec, but can be longer in extreme cases.
                 # Under 'normal' circumstances, 'scancel' is executed in
                 # virtually no time.
                 subprocess.check_output(
-                    f"scancel {jobid}",
+                    f"scancel {jobids}",
                     text=True,
                     shell=True,
                     timeout=60,
                     stderr=subprocess.PIPE,
                 )
             except subprocess.TimeoutExpired:
-                logger.warning(f"Unable to cancel job {jobid} within a minute.")
+                logger.warning(f"Unable to cancel jobs within a minute.")
         self.shutdown()
 
     def get_account_arg(self, job):
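
For readers who want to try the batched cancellation outside of Snakemake, here is a minimal standalone sketch. It is not the committed implementation: the function name `cancel_jobs`, its boolean return value, and the injectable `run` parameter are assumptions made for illustration. It also passes an argument list instead of using `shell=True`, which is a design choice of this sketch rather than something the commit does.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def cancel_jobs(jobids, timeout=60, run=subprocess.run):
    """Cancel all given Slurm jobs with a single scancel call.

    Returns True if scancel ran and finished within `timeout` seconds,
    False if there was nothing to cancel or the call timed out.
    `run` is injectable (hypothetical parameter) so the function can be
    exercised on a machine without Slurm.
    """
    ids = [str(jobid) for jobid in jobids]
    if not ids:
        return False  # nothing to cancel, do not call scancel at all
    try:
        # One invocation for all jobs keeps the load on slurmctld low.
        # check=True raises CalledProcessError on a non-zero exit status.
        run(
            ["scancel", *ids],
            check=True,
            text=True,
            timeout=timeout,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except subprocess.TimeoutExpired:
        logger.warning("Unable to cancel jobs within %s seconds.", timeout)
        return False
    return True
```

Calling `cancel_jobs(["1001", "1002"])` on a cluster would run `scancel 1001 1002` once; the job IDs here are made up.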
