all errors reported as FSD_ERRNO_INTERNAL_ERROR #1

Closed
tbooth opened this issue Jul 12, 2017 · 2 comments

tbooth commented Jul 12, 2017

Hi,

Thanks for merging my previous fix. This one is in a similar vein.

On line 134 of slurm_drmaa/job.c, any problem updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The issue is that the caller needs to know whether the error is intermittent (e.g. a network time-out), in which case the job status can probably be queried successfully a few minutes later, or terminal, in which case the job is dead. I've prepared a complementary patch to Snakemake that treats FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and keeps polling the job.

Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.
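
As a rough illustration of what a more systematic conversion could look like, here is a minimal table-driven sketch. It is not the library's code: every `SK_`-prefixed name and the function `sketch_map_load_jobs_error` are hypothetical stand-ins, and the numeric values are placeholders rather than the real values from the SLURM or FSD headers. Only the two transient SLURM errors named in the patch are mapped; everything else falls back to the internal error.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder stand-ins for SLURM error numbers. Real code would use the
 * constants from the SLURM headers; these values are NOT the real ones. */
enum sketch_slurm_errno {
    SK_SLURM_SUCCESS = 0,
    SK_SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT = 1,
    SK_SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR = 2,
    SK_SLURM_OTHER_ERROR = 3
};

/* Placeholder stand-ins for the FSD error codes discussed in this issue. */
enum sketch_fsd_errno {
    SK_FSD_ERRNO_INTERNAL_ERROR = 100,
    SK_FSD_ERRNO_DRM_COMMUNICATION_FAILURE = 101
};

struct sketch_errno_map {
    int slurm_errno;
    int fsd_errno;
};

/* SLURM errors assumed to be transient, paired with the code to report. */
static const struct sketch_errno_map sketch_transient_errors[] = {
    { SK_SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT,
      SK_FSD_ERRNO_DRM_COMMUNICATION_FAILURE },
    { SK_SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR,
      SK_FSD_ERRNO_DRM_COMMUNICATION_FAILURE },
};

/* Map a failed slurm_load_jobs-style call to an FSD-style error code.
 * Anything not listed in the table is reported as an internal error. */
int sketch_map_load_jobs_error(int slurm_errno)
{
    size_t n = sizeof sketch_transient_errors / sizeof sketch_transient_errors[0];

    /* Only failures should be mapped; zero means success in this sketch. */
    assert(slurm_errno != SK_SLURM_SUCCESS);

    for (size_t i = 0; i < n; i++) {
        if (sketch_transient_errors[i].slurm_errno == slurm_errno)
            return sketch_transient_errors[i].fsd_errno;
    }
    return SK_FSD_ERRNO_INTERNAL_ERROR;
}
```

Keeping the mapping in a table means further SLURM errors can be classified by adding a row, without touching the control flow around the raise.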

Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.

*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig	2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c	2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
  			}
  		}
  		if (job_info) {
--- 131,150 ----
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else
!                 // We should detect the error corresponding to "Socket timed out" and report
!                 // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
!                 // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
!                 //   which simply indicates the job is still running?? Maybe we should try it and see. )
!                 // To see what _slurm_errno corresponds to which message let's look at
!                 // 'slurm_strerror' in the slurm source code...
!                 //   https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
!             if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
!                  _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
!                ) {
!                 fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
!             } else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
  			}
  		}
  		if (job_info) {

Cheers,

TIM

Owner

natefoo commented Nov 16, 2017

FSD_ERRNO_TIMEOUT becomes DRMAA_ERRNO_EXIT_TIMEOUT and is used when calling e.g. drmaa_wait() with a timeout specified and that timeout is reached. I believe that FSD_ERRNO_DRM_COMMUNICATION_FAILURE is the correct error code.

Thanks for the patch!
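
For readers wondering how a caller might use the distinction, the following is a minimal sketch of a polling loop that retries on a communication failure and stops on anything else. It is an assumption-laden illustration, not Snakemake's or the library's code: all `SK_`/`sketch_` names are hypothetical, the status query is passed in as a callback so the example stays self-contained, and a real caller would presumably call something like `drmaa_job_ps()` and sleep between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder result codes standing in for DRMAA error numbers. */
enum sketch_drmaa_errno {
    SK_DRMAA_SUCCESS = 0,
    SK_DRMAA_COMMUNICATION_FAILURE,
    SK_DRMAA_INTERNAL_ERROR
};

enum sketch_poll_result {
    SK_POLL_OK,      /* status query eventually succeeded */
    SK_POLL_GAVE_UP, /* still failing transiently after max_attempts */
    SK_POLL_FATAL    /* non-transient error: treat the job as dead */
};

/* Hypothetical callback type for "ask the DRM for the job status". */
typedef int (*sketch_query_fn)(void *ctx);

enum sketch_poll_result
sketch_poll_status(sketch_query_fn query, void *ctx, int max_attempts)
{
    assert(query != NULL && max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int rc = query(ctx);
        if (rc == SK_DRMAA_SUCCESS)
            return SK_POLL_OK;
        if (rc != SK_DRMAA_COMMUNICATION_FAILURE)
            return SK_POLL_FATAL;
        /* Transient fault: a real caller would wait here before retrying. */
    }
    return SK_POLL_GAVE_UP;
}

/* Demo query: reports a communication failure until the counter behind
 * ctx reaches zero, then succeeds. */
int sketch_flaky_query(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return SK_DRMAA_COMMUNICATION_FAILURE;
    }
    return SK_DRMAA_SUCCESS;
}

/* Demo query: always reports a non-transient error. */
int sketch_fatal_query(void *ctx)
{
    (void)ctx;
    return SK_DRMAA_INTERNAL_ERROR;
}
```

With the demo queries, a job whose status query fails twice and then succeeds is reported as `SK_POLL_OK` when five attempts are allowed, while a query that returns the internal error ends the loop on the first attempt.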

Owner

natefoo commented Nov 16, 2017

Fixed in 83fc288
