Hi,
When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:
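A minimal sketch of the pattern, using the drmaa-python bindings on top of slurm-drmaa (the job script and array size are placeholders, not the exact submission code):

```python
# Sketch only: job.sh is a hypothetical array task that sleeps longer
# for one SLURM_ARRAY_TASK_ID, so the tasks finish out of order.
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = './job.sh'
    jobids = s.runBulkJobs(jt, 1, 10, 1)
    # Blocks until all tasks are done, then segfaults a few minutes
    # after the last (long-running) task has finished.
    s.synchronize(jobids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    s.deleteJobTemplate(jt)
```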
The same happens if I loop through the job IDs with the s.wait() function:
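A sketch of the per-job variant, reusing the session and jobids from the snippet above:

```python
# Waiting on each task individually hits the same segfault.
for jobid in jobids:
    retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print('Task {0} finished with status {1}'.format(
        retval.jobId, retval.hasExited))
```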
However, it works perfectly fine if the jobs finish in the same order as their SLURM_ARRAY_TASK_ID:
There is also no problem if the jobs last only 10 minutes:
I came up with this little piece of code to bypass the bug:
Yields:
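A sketch of one shape such a bypass can take with drmaa-python; the wait_all() helper and the poll interval here are illustrative:

```python
import time
import drmaa

def wait_all(s, jobids, poll=60):
    """Poll jobStatus() until every task reports DONE or FAILED,
    instead of blocking in synchronize()/wait(). Polling this often
    should observe each task finishing before Slurm purges it."""
    pending = set(jobids)
    while pending:
        for jobid in list(pending):
            status = s.jobStatus(jobid)
            if status in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
                print('Task {0}: {1}'.format(jobid, status))
                pending.discard(jobid)
        if pending:
            time.sleep(poll)
```

The idea is to avoid blocking inside the library call where the crash happens.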
This is a bug in job.c, in slurmdrmaa_find_job_info()/slurmdrmaa_job_on_missing(). Some time after a job finishes, slurm_load_job() starts returning an error; that is expected Slurm behavior. I'm not sure what on_missing() is supposed to do, but the code continues as though job_info had been filled in, which is not the case. I suspect it should raise an error at that point.

Alternatively, on_missing() could fetch the job information from slurmdb. Unfortunately, I'm not familiar enough with this library or with Slurm to do that right now.