PBS Batch jobs don't clean-up properly (Illegal job identifier) (PBSPro_13.1.0.160576) #186

Closed
strazto opened this issue Jan 9, 2020 · 2 comments · Fixed by #187

strazto commented Jan 9, 2020

Running PBS Pro 13.1.0.16056, having modified the submission templates as per #184 and my PR #185, I've noticed that the worker nodes don't clean up neatly when the master node terminates.

When submitting using drake, a failed target is supposed to stop the workflow and terminate the workers.

When a target fails, I subsequently get the following error:

qdel: illegally formed job identifier: cmq7082

This corresponds to the job name for the job array, given by the socket of (I assume) the first worker in the array (or maybe the master).

When I examine the output of qstat -u mstr3336 -x, I see the following:

pbsserver:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3965541.pbsserv mstr3336 small    run_make_h  60985   1   1   16gb 23:59 F 00:44
3965544[].pbsse mstr3336 small    cmq7082        --   1   1    4gb 23:59 F    --

We see that the job ID of our batch array is 3965544[], and the job name was indeed given by our submission script.

Referring to the SGE child class:

clustermq/R/qsys_sge.r

Lines 26 to 38 in e7c68ed

        finalize = function(quiet=self$workers_running == 0) {
            if (!private$is_cleaned_up) {
                system(paste("qdel", private$job_id),
                       ignore.stdout=quiet, ignore.stderr=quiet, wait=FALSE)
                private$is_cleaned_up = TRUE
            }
        }
    ),
    private = list(
        job_id = NULL
    )
)

We see that the finalize function calls qdel on job_id, which seems okay. But looking closer at the submit_jobs implementation:

clustermq/R/qsys_sge.r

Lines 14 to 17 in e7c68ed

submit_jobs = function(...) {
    opts = private$fill_options(...)
    private$job_id = opts$job_name
    filled = private$fill_template(opts)

job_id is simply set to job_name, which in turn comes from fill_options in the parent class:

clustermq/R/qsys.r

Lines 221 to 237 in e7c68ed

fill_options = function(...) {
    values = utils::modifyList(private$defaults, list(...))
    values$master = private$master
    if (grepl("CMQ_AUTH", private$template)) {
        # note: auth will be obligatory in the future and this check will
        # be removed (i.e., filling will fail if no field in template)
        values$auth = private$auth = paste(sample(letters, 5, TRUE), collapse="")
    } else {
        values$auth = NULL
        warning("Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication",
                immediate.=TRUE)
    }
    if (!"job_name" %in% names(values))
        values$job_name = paste0("cmq", private$port)
    private$workers_total = values$n_jobs
    values
},

So unless a job_name is passed explicitly, it defaults to "cmq" followed by the value of private$port, and that name is what ends up being passed to qdel.

Uh oh! This is not in concordance with the PBS specs:

PBS Professional 18.2 User’s Guide UG-13

Excerpt from PBS Guide

Submitting a PBS Job Chapter 2

2.1.3 The Job Identifier

After you submit a job, PBS returns a job identifier. Format for a job:
<sequence number>.<server name>

Format for a job array:

<sequence number>[].<server name>.<domain>

You’ll need the job identifier for any actions involving the job, such as checking job status, modifying the job, tracking the job, or deleting the job
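
To make the mismatch concrete, here is a minimal sketch of a check based only on the two formats quoted above. The function name is_pbs_job_id is hypothetical, and the set of characters allowed in the server and domain part is an assumption, not something taken from the PBS guide:

```r
# Hypothetical validator, not part of clustermq or PBS.
# Matches "<sequence number>.<server>" and "<sequence number>[].<server>.<domain>".
# Assumption: server/domain names use only letters, digits, '.', '_' and '-'.
is_pbs_job_id = function(x) {
    grepl("^[0-9]+(\\[\\])?\\.[A-Za-z0-9_.-]+$", x)
}

is_pbs_job_id("1234[].server01")  # array job identifier (made-up example)
is_pbs_job_id("cmq7082")          # the job name that was passed to qdel
```

A job name such as cmq7082 has no sequence number or server part, so it cannot match either format, which is consistent with the qdel error above.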


Additionally, the environment variable PBS_JOBID is available inside the .pbs job script.

So it's clear that either:

  1. the return from the qsub for the batch job is needed, or
  2. the PBS_JOBID somehow needs to be sent back to master.
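
For option 2, a worker could read its own identifier from the environment. This is only a rough sketch, assuming the worker is an R process started from the job script. The function name worker_job_id is hypothetical, and how the value would be sent back to the master is deliberately left out:

```r
# Hypothetical helper, not part of clustermq.
# Returns the scheduler-assigned job identifier of the current process,
# or NA if PBS_JOBID is not set (e.g. when not running under PBS).
worker_job_id = function() {
    Sys.getenv("PBS_JOBID", unset=NA_character_)
}
```

The drawback of this route is that the master only learns the identifier once a worker has started and connected, so jobs still waiting in the queue could not be deleted this way.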

My intuition tells me that capturing the output of qsub is the simpler option. However, the current implementation discards that output and only checks the exit status:

clustermq/R/qsys_sge.r

Lines 19 to 22 in e7c68ed

success = system("qsub", input=filled, ignore.stdout=TRUE)
if (success != 0) {
    print(filled)
    stop("Job submission failed with error code ", success)

With the default intern = FALSE, the result of system(...) is the command's exit status.

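According to the help page for system, setting intern = TRUE makes it return the command's output as a character vector instead. A non-zero exit status is then reported with a warning and stored in the "status" attribute of the result, so with a little extra work we can access both.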
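
A minimal sketch of what that could look like, not the actual fix. The function name submit_capture_id and its submit_cmd argument are hypothetical, and it assumes the submit command prints the job identifier as the first line on stdout, as qsub does:

```r
# Hypothetical helper, not clustermq's implementation.
# Feeds `script` to `submit_cmd` on stdin and returns the printed job ID.
submit_capture_id = function(script, submit_cmd="qsub") {
    # intern=TRUE returns stdout as a character vector rather than the status;
    # the warning R raises on a non-zero exit is suppressed and handled below
    out = suppressWarnings(system(submit_cmd, input=script, intern=TRUE))

    # a non-zero exit status is attached as the "status" attribute
    status = attr(out, "status")
    if (!is.null(status) && status != 0)
        stop("Job submission failed with error code ", status)
    if (length(out) == 0)
        stop("Job submission produced no output")

    # assumption: the job identifier is the first line of output
    trimws(out[1])
}
```

In submit_jobs, the returned value could then be stored in private$job_id in place of opts$job_name, so that finalize passes a real job identifier to qdel.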

I'll experiment with this, and then put in a PR if all goes well.


strazto commented Jan 9, 2020

@mschubert, would you be able to review my PR regarding this?


mschubert commented Jan 27, 2020

For completeness, the rest of the discussion is in #187.
