Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hook event endjob #2290

Merged
merged 55 commits into from Oct 8, 2021
Merged

Hook event endjob #2290

merged 55 commits into from Oct 8, 2021

Conversation

sdass23
Copy link
Contributor

@sdass23 sdass23 commented Mar 9, 2021

In order to provide more detailed accounting of jobs as required by some sites, recording the job information and the time of when a job ends is required. Therefore an endjob hook event has been added to the PBS server. Registered endjob hook scripts are executed when a job or subjob execution ends. Each script is provided information about the job and any associated attributes.

Link to Design Doc

toonen and others added 24 commits August 20, 2020 13:54
… PBS string constants and the associated reverse lookkup
…lso fixed when job end time is captured; removed recent endtime database updates; corrected endjob server hook processing
…s hook call and reque endjob process hook calls; also remove debug code
@CLAassistant
Copy link

CLAassistant commented Mar 9, 2021

CLA assistant check
All committers have signed the CLA.

@sdass23 sdass23 changed the title Hook event endjob Hook event endjob (WIP) Mar 10, 2021
@toonen
Copy link
Contributor

toonen commented Sep 2, 2021

This pull request is ready for review.

There are currently four (4) tests in tests.functional.pbs_hook_endjob.TestHookEndJob that are not passing. They all appear to fail due to an unexpected response from the server or PBS command. Details about the failures are documented in comments tagged with "FIXME" above each of the failing tests. The failing tests are not included in the smoke tests and thus do not affect the CI testing.

We have additional tests we would like to add as noted in the "TODO" comments; however, we believe the changes to the server source code are solid and thus would like to proceed with the review and merge. Additional tests will be submitted in a future PR.

@sdass23 sdass23 changed the title Hook event endjob (WIP) Hook event endjob Sep 2, 2021
@@ -169,7 +169,7 @@ struct pbs_db_job_info {
INTEGER ji_state; /* Internal copy of state */
INTEGER ji_substate; /* job sub-state */
INTEGER ji_svrflags; /* server flags */
BIGINT ji_stime; /* time job started execution */
BIGINT ji_stime; /* time job started execution */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious tab character at the end,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0eaabbb.

# mom is stopped and restarted, the job is killed and requeued as expected.
# The job is rerun after the MoM is restarted and runs to completion;
# however, the job's substate indicates that it was deleted.
def test_hook_endjob_delete_running_single_job_as_user_rm(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the failure be related to the changes made to req_delete.c? Does the problem happen without the change? If this still looks like a PBS code bug, please file an OSS ticket. And rather than have this test fail, put a @Skip(<message_why_skipped>) tag. For example:

@skip("issue 2440")
def test_hook_endjob_delete_running_single_job_as_user_rm(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure scenario is described in issue #2485. An @skip was added to the test in 0eaabbb.

# again when resources are available, which seems to be the opposite of a
# successful delete. Shouldn't qdel return a non-zero exit status since
# the delete was unsuccessful?
def test_hook_endjob_delete_running_array_job_as_root_sm(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous FIXME comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure scenario is described in issue #2486. An @skip was added to the test in 0eaabbb.

# mom is stopped and restarted, the job is killed and requeued as expected.
# The job is rerun after the MoM is restarted and runs to completion;
# however, the job's substate indicates that it was deleted.
def test_hook_endjob_delete_running_single_job_as_root_rm(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous FIXME comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure scenario is described in issue #2485. An @skip was added to the test in 0eaabbb.

# again when resources are available, which seems to be the opposite of a
# successful delete. Shouldn't qdel return a non-zero exit status since
# the delete was unsuccessful?
def test_hook_endjob_delete_running_array_job_as_user_sm(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous FIXME message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure scenario is described in issue #2486. An @skip was added to the test in 0eaabbb.

Copy link
Contributor

@bayucan bayucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for filing the ticket and making the changes. This looks good to me.

@@ -145,6 +146,7 @@ typedef struct hook_input_param {
void *rq_move;
void *rq_prov;
void *rq_run;
void *rq_end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation looks a little off here (and below)

@@ -200,6 +200,23 @@
<ECL>verify_value_state</ECL>
</member_verify_function>
</attributes>
<attributes>
<member_index>JOB_ATR_in_resv</member_index>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in_resv sounds more like a boolean. Is JOB_ATR_resv taken? I mean the name of the attribute is just 'resv'. Something else to consider is to rename ATTR_NODE_RESV (which is just 'resv') and use it for both the node resv attribute and the job resv attribute. Also, I don't see you using this in the code. Am I missing something?

elif key.startswith("RESV_STATE"):
_pbs_v1.REVERSE_RESV_STATE[value] = key

_pbs_v1.REVERSE_HOOK_EVENT = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider creating a list of the hook event strings, and then use a for loop to set them. That way when you add a new one, all you have to do is add it to the list.


/* update parent job state to 'F' */
sprintf(log_buffer, "rq_endjob svr_setjobstate update parent job state to 'F'");
log_err(-1, __func__, log_buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no printf() parameters above, just put the quoted string directly into the log_err()


check_block(parent, "");
if (check_job_state(parent, JOB_STATE_LTR_BEGUN)) {
char acctbuf[40];

/* set parent endtime to time_now */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment really just says what the code is doing. I'd either explain a bit further or just remove the comment (here and below)

resv_attrs={}):
start_time = resv_start_time or int(time.time()) + \
self.resv_start_delay
end_time = resv_end_time or start_time + self.resv_duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 'or' like this will return a boolean. You'll need to use the ternary operator to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are indeed correct. I'm sorry for the confusion. I swear I tested this and it didn't work right. This is neat functionality. I would have thought you'd need to use the ternary operator to do this. This is a nice shorthand.

for mom in self.moms.values():
mom.stop(*args, **kwargs)
self.moms_stopped = True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much I like how these tests are structured. You use a lot of class variables as globals. They will persist from test to test, so you don't truly revert to defaults between tests. It would be easy to use one with values from the previous tests. I suggest you don't use class variables and pass them as arguments into your helper functions. If you don't want to do that, at least define a tearDown() where you unset them all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TestHookEndJob class level attributes are all constants that are largely used as default argument values. All of the attributes modified by the helper methods in the TestHookEndJob class are part of the object instantiated by nose not the class itself. These attributes are used to track state which can change within each helper method as the test progresses, so passing them as arguments doesn't seem like an option. The attributes are initialized, and thus reset, at the beginning of the run_test_func() method, which is used by all of the test methods in the class. If you think it would improve code readability, we can move the initialization of those attributes into a setUp() method and add a tearDown() method to clear them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setUp() is called before each job. If you reinitialize them in setUp() you should be fine. You don't need setUp() and tearDown() both.

def test_hook_endjob_rerun_and_delete_single_job(self):
"""
Start a single job, issue a rerun and immediately delete it. Verify
that the end job hook is only executed once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the hook only run once? Won't it trigger with the qrerun, and another time with qdel?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper functions enable and disable scheduling so job state transitions can be detected and controlled. In this test, and all of the tests that use endjob_rerun_and_delete_job(), scheduling is not reenabled after the rerun, allowing the job to be deleted before it is restarted.

Run an array job, where all jobs have started but also force deleted
(by the user) before completion. Verify that the end job hook is
executed for all subjobs and the array job.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider condensing these tests into a fewer number. I'd say combine the force delete and normal delete tests for each purpose or the root/normal. The other split could be combining root/user tests. This will drop the tests by half. This will make the test plan run faster because setUp() can be slow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that combining similar tests to reduce overhead is worthy of consideration and quite likely doing. It would mean leaving the initialization of the state attributes in run_test_func() and not moving them into setUp().

Unless you have an objection, I would like to defer such changes until after this PR is merged as this hook is critical for our operations in the near term. As noted earlier, we have additional tests that should be added as well. We could combine tests as appropriate at that time.

"""
Run a single rerunable job, but delete as user before completion
after stopping the MoM. Verify that the end job hook is executed.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the fact the job is rerunnable or not have anything to do with deleting it? The end job hook should run because you are deleting the job. The job being rerunnable determines what happens with you do a qrerun. Does that make some of these tests unnecessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A job being rerunnable or not also determines what happens to the job when an error occurs like not being able to talk to the MoM. In this test, the qdel fails because the MoM is unreachable. Since the job is rerunnable, the server requeues the job after node_fail_requeue amount of time has elapsed. The job is then run again when resources are available. For this test, the result should be an end job hook being run twice: once when the server considers the running job lost and requeues it, and once when the job is successfully rerun to completion. Looking at the function documentation again, perhaps I should have included more detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the tests can be refactored later, especially if you are going to add more then.

Copy link
Contributor

@bhroam bhroam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing all my comments! Looks good to me.

@bhroam
Copy link
Contributor

bhroam commented Oct 8, 2021

@bayucan I just signed off. Do you want to take another look, or should it be merged?

Copy link
Contributor

@bayucan bayucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good to me, including the update to the design doc.

@bayucan bayucan merged commit 56245cb into openpbs:master Oct 8, 2021
@bayucan
Copy link
Contributor

bayucan commented Oct 8, 2021

Merged #2290 into master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants