Thanks for merging my previous fix. This one is in a similar vein.
On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g. a network time-out), in which case the job status can probably be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.
Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.
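To illustrate what the caller wants to do with the distinction, here is a rough C sketch of the polling logic. It is not the actual Snakemake patch (that one is Python), and every name in it is hypothetical: `poll_result` stands in for the DRMAA error codes, `status_fn` for whatever call queries the job status, and `flaky_query` is only a stub to exercise the loop. A real caller would also sleep between attempts.

```c
/* Hypothetical stand-ins for the DRMAA error codes; real values live in drmaa.h. */
enum poll_result {
    POLL_OK,
    POLL_COMM_FAILURE,   /* stands in for a DRM communication failure (transient) */
    POLL_INTERNAL_ERROR  /* stands in for an internal error (treated as terminal) */
};

/* Hypothetical callback type: performs one job status query. */
typedef enum poll_result (*status_fn)(void *ctx);

/*
 * Query the job status, retrying only while the failure looks transient.
 * Makes at most (max_retries + 1) attempts; any other result, success or
 * terminal error, is returned immediately. A real caller would wait
 * between attempts instead of retrying back to back.
 */
static enum poll_result poll_with_retry(status_fn query, void *ctx,
                                        int max_retries, int *attempts_out)
{
    int attempts = 0;
    enum poll_result r;

    do {
        r = query(ctx);
        attempts++;
    } while (r == POLL_COMM_FAILURE && attempts <= max_retries);

    if (attempts_out)
        *attempts_out = attempts;
    return r;
}

/* Example stub: reports a communication failure until *ctx counts down to zero. */
static enum poll_result flaky_query(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return POLL_COMM_FAILURE;
    }
    return POLL_OK;
}
```

With the stub set to fail twice and `max_retries` of 5, `poll_with_retry` returns `POLL_OK` on the third attempt. If the failures outlast the retry budget, the caller gets `POLL_COMM_FAILURE` back and can decide whether to give up or keep waiting.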
Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.
*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig 2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c 2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
      if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
          self->on_missing(self);
!     } else {
!         fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
      }
  }
  if (job_info) {
--- 131,150 ----
      if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
          self->on_missing(self);
!     } else
!     // We should detect the error corresponding to "Socket timed out" and report
!     // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
!     // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
!     // which simply indicates the job is still running?? Maybe we should try it and see. )
!     // To see what _slurm_errno corresponds to which message let's look at
!     // 'slurm_strerror' in the slurm source code...
!     // https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
!     if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
!          _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
!     ) {
!         fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
!     } else {
!         fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
      }
  }
  if (job_info) {
Cheers,
TIM
FSD_ERRNO_TIMEOUT becomes DRMAA_ERRNO_EXIT_TIMEOUT and is used when calling e.g. drmaa_wait() with a timeout specified and that timeout is reached. I believe that FSD_ERRNO_DRM_COMMUNICATION_FAILURE is the correct error code.