Hi,
When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:
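A minimal sketch of the pattern, using the drmaa-python bindings on top of slurm-drmaa (the job script and array size are placeholders, not the exact submission code):

```python
# Sketch only: job.sh is a hypothetical array task that sleeps longer
# for one SLURM_ARRAY_TASK_ID, so the tasks finish out of order.
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = './job.sh'
    jobids = s.runBulkJobs(jt, 1, 10, 1)
    # Blocks until all tasks are done, then segfaults a few minutes
    # after the last (long-running) task has finished.
    s.synchronize(jobids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    s.deleteJobTemplate(jt)
```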
The same happens if I loop through the job IDs with the s.wait() function:
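A sketch of the per-job variant, reusing the session and jobids from the snippet above:

```python
# Waiting on each task individually hits the same segfault.
for jobid in jobids:
    retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print('Task {0} finished with status {1}'.format(
        retval.jobId, retval.hasExited))
```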
However, it works perfectly fine if the jobs finish in the same order as their SLURM_ARRAY_TASK_ID:
There is also no problem if the jobs last only 10 minutes:
I came up with this little piece of code to bypass the bug:
Yields:
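A sketch of one shape such a bypass can take with drmaa-python; the wait_all() helper and the poll interval here are illustrative:

```python
import time
import drmaa

def wait_all(s, jobids, poll=60):
    """Poll jobStatus() until every task reports DONE or FAILED,
    instead of blocking in synchronize()/wait(). Polling this often
    should observe each task finishing before Slurm purges it."""
    pending = set(jobids)
    while pending:
        for jobid in list(pending):
            status = s.jobStatus(jobid)
            if status in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
                print('Task {0}: {1}'.format(jobid, status))
                pending.discard(jobid)
        if pending:
            time.sleep(poll)
```

The idea is to avoid blocking inside the library call where the crash happens.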
This is a bug in job.c, in slurmdrmaa_find_job_info()/slurmdrmaa_job_on_missing(). Some time after a job finishes, slurm_load_job() starts returning an error; that is expected Slurm behavior. I'm not sure what on_missing() is supposed to do, but the code continues as though job_info had been filled in, which is not the case. I suspect it should raise an error at that point.

Alternatively, on_missing() could fetch the job information from slurmdb. Unfortunately, I'm not familiar enough with this library or with Slurm to do that right now.