"Job reported running on node no longer exists or is not in running state" msgs #1556

drtoss · 2020-03-06T01:10:43Z

Since switching to version 19, we are getting hundreds of thousands of these scheduler messages per day. The message is issued by collect_jobs_on_nodes() when it finds a node with a running job that is not in the list of jobs passed to collect_jobs_on_nodes(). That list comes from a call to resource_resv_filter() to find all jobs that are not in reservations.

This means that the message is logged for every node that has a job running in a reservation. This is a perfectly normal situation, and it is not clear why the code is remarking on it. The comment right before the message says:

/* Race Condition occurred: nodes were queried when a job existed.
 * Jobs were queried when the job no longer existed.  Make note
 * of it on the job so the node's resources_assigned values can be
 * recalculated later.
 */

The case described by the comment is not what is happening for us when the messages are output.


bhroam · 2020-03-11T19:01:53Z

The function collect_jobs_on_nodes() turns the 'jobs' attribute on a node into a resource_resv ** array. The race condition the message is meant to report is a job ending between the time we query the nodes and the time we query the jobs. The problem arises because we call the function not only for a node, but also for a reservation. The 'jobs' attribute lists all jobs on the node, whether they are in a reservation or not. When we call the function for a reservation, the jobs that are not in the reservation are not found, and the function treats them as ghost jobs. This should only happen when a reservation requests part of a node and you have jobs running both inside and outside the reservation.
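To make the lookup miss concrete, here is a minimal sketch of the pattern, not the actual PBS code. The names `struct job`, `find_job()`, and `count_ghost_jobs()` are hypothetical stand-ins for illustration; the real function works on resource_resv structures and does more than count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scheduler's resource_resv; illustration only. */
struct job {
    const char *name;
};

/* Return the job called `name` from `list` (n entries), or NULL if it is absent. */
static const struct job *find_job(const struct job *const *list, size_t n,
                                  const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i]->name, name) == 0)
            return list[i];
    }
    return NULL;
}

/*
 * Walk the job names from a node's 'jobs' attribute and count those that are
 * missing from the candidate list. Every miss is treated as a ghost job, even
 * when the job is alive and was merely left out of the candidate list.
 */
static size_t count_ghost_jobs(const char *const *node_jobs, size_t n_node_jobs,
                               const struct job *const *candidates,
                               size_t n_candidates)
{
    size_t ghosts = 0;

    for (size_t i = 0; i < n_node_jobs; i++) {
        if (find_job(candidates, n_candidates, node_jobs[i]) == NULL)
            ghosts++;
    }
    return ghosts;
}
```

With a node whose 'jobs' attribute names one job outside a reservation and one inside it, passing the full job list yields zero ghosts, while passing only one side's jobs reports the other job as a ghost, even though no job has ended.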

Probably the best way to fix this issue is to just not print anything. The scheduler recovers from the ghost job race condition pretty well. The race condition happens pretty often on a busy system, so there is no real benefit to seeing this message.

This should be a pretty simple fix if you want this fixed now. I would suggest just removing the message and building PBS. You could even submit a PR if you wanted to give back to the community.

To build PBS, see:
https://pbspro.atlassian.net/wiki/spaces/PBSPro/pages/13991940/Building+PBS+Pro+Using+rpmbuild
