"Job reported running on node no longer exists or is not in running state" msgs #1556

Open
drtoss opened this issue Mar 6, 2020 · 1 comment

Comments

Contributor

drtoss commented Mar 6, 2020

Since switching to version 19, we have been getting hundreds of thousands of these scheduler messages per day. The message is issued by collect_jobs_on_nodes() when it finds a node with a running job that is not in the list of jobs passed to collect_jobs_on_nodes(). That list comes from a call to resource_resv_filter() that selects all jobs that are not in reservations.

This means the message is logged for every node that has a job running in a reservation. This is a perfectly normal situation, and it is not clear why the code remarks on it. The comment right before the message says:

/* Race Condition occurred: nodes were queried when a job existed.
 * Jobs were queried when the job no longer existed.  Make note
 * of it on the job so the node's resources_assigned values can be
 * recalculated later.
 */

The case described by the comment is not what is happening for us when the messages are output.

Contributor

bhroam commented Mar 11, 2020

The function collect_jobs_on_nodes() turns the 'jobs' attribute on a node into a resource_resv ** array. The race condition it reports occurs when a job ends between the time we query the nodes and the time we query the jobs. The problem arises because we call the function not only for a node, but also for a reservation. The 'jobs' attribute lists all jobs on the node, whether or not they are in reservations. When we call the function for a reservation, the jobs that are not in the reservation are not found, and the function treats them as ghost jobs. This should only happen when a reservation requests part of a node and you have jobs running both inside and outside the reservation.
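
To make the mechanism concrete, here is a minimal sketch of how resolving a node's job names against a filtered list can produce false ghost jobs. This is not the PBS source: `struct job`, `filter_not_in_reservation()`, `find_job()`, and `resolve_node_jobs()` are hypothetical, simplified stand-ins for resource_resv, resource_resv_filter(), and collect_jobs_on_nodes(). The sketch filters out reservation jobs; the same thing happens in the other direction when the list holds only one reservation's jobs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the scheduler's resource_resv. */
struct job {
    const char *name;
    int in_reservation; /* nonzero if the job runs inside a reservation */
};

/* Keep only the jobs that are not in a reservation (assumed filter behavior).
 * 'out' must have room for n entries. Returns the number of jobs kept. */
size_t filter_not_in_reservation(const struct job *const *in, size_t n,
                                 const struct job **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!in[i]->in_reservation)
            out[kept++] = in[i];
    return kept;
}

/* Look up a job by name in a candidate list; NULL means "not found". */
const struct job *find_job(const struct job *const *list, size_t n,
                           const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i]->name, name) == 0)
            return list[i];
    return NULL;
}

/* Resolve the names from a node's 'jobs' attribute against a candidate list.
 * Resolved jobs are written to 'out' (room for n_names entries) and counted
 * in *n_out. Names that cannot be resolved are counted as ghost jobs, and
 * that count is returned. */
size_t resolve_node_jobs(const char *const *node_jobs, size_t n_names,
                         const struct job *const *candidates, size_t n_cand,
                         const struct job **out, size_t *n_out)
{
    size_t ghosts = 0;

    assert(out != NULL && n_out != NULL);
    *n_out = 0;
    for (size_t i = 0; i < n_names; i++) {
        const struct job *j = find_job(candidates, n_cand, node_jobs[i]);
        if (j != NULL)
            out[(*n_out)++] = j;
        else
            ghosts++; /* the point where the real code logs the message */
    }
    return ghosts;
}
```

With a node whose 'jobs' attribute names one ordinary job and one reservation job, resolving against the full job list reports no ghosts, while resolving against the filtered list reports one, even though both jobs are still running. The lookup alone cannot tell "the job ended" apart from "the job was filtered out of the list".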

Probably the best way to fix this issue is to just not print anything. The scheduler recovers from the ghost job race condition pretty well. The race condition happens pretty often on a busy system, so there is no real benefit to seeing this message.

This should be a pretty simple fix if you want this fixed now. I would suggest just removing the message and building PBS. You could even submit a PR if you wanted to give back to the community.

To build PBS, see:
https://pbspro.atlassian.net/wiki/spaces/PBSPro/pages/13991940/Building+PBS+Pro+Using+rpmbuild
