New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hook event endjob #2290
Hook event endjob #2290
Conversation
… PBS string constants and the associated reverse lookkup
…lso fixed when job end time is captured; removed recent endtime database updates; corrected endjob server hook processing
…spro into hook_event_endjob
…s hook call and reque endjob process hook calls; also remove debug code
… process hooks calls in node_manager (post_discard_job) and req_jobobit (on_job_rerun)
New end-job hooks tests have been added. Four of these tests currently fail, three of which appear to be an issue with the end-job hook additions not running end-job hooks during a forced delete. The fourth failure may be a bug in the qdel or server code. More investigation is needed. Supporting methods have also been added to the TestHookEndJob class to provide support for the additional tests.
…were missing an 'E' record
This pull request is ready for review. There are currently four (4) tests in tests.functional.pbs_hook_endjob.TestHookEndJob that are not passing. They all appear to fail due to an unexpected response from the server or PBS command. Details about the failures are documented in comments tagged with "FIXME" above each of the failing tests. The failing tests are not included in the smoke tests and thus do not affect the CI testing. We have additional tests we would like to add as noted in the "TODO" comments; however, we believe the changes to the server source code are solid and thus would like to proceed with the review and merge. Additional tests will be submitted in a future PR. |
src/include/pbs_db.h
Outdated
@@ -169,7 +169,7 @@ struct pbs_db_job_info { | |||
INTEGER ji_state; /* Internal copy of state */ | |||
INTEGER ji_substate; /* job sub-state */ | |||
INTEGER ji_svrflags; /* server flags */ | |||
BIGINT ji_stime; /* time job started execution */ | |||
BIGINT ji_stime; /* time job started execution */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spurious tab character at the end,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 0eaabbb.
# mom is stopped and restarted, the job is killed and requeued as expected. | ||
# The job is rerun after the MoM is restarted and runs to completion; | ||
# however, the job's substate indicates that it was deleted. | ||
def test_hook_endjob_delete_running_single_job_as_user_rm(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the failure be related to the changes made to req_delete.c? Does the problem happen without the change? If this still looks like a PBS code bug, please file an OSS ticket. And rather than have this test fail, put a @Skip(<message_why_skipped>) tag. For example:
@skip("issue 2440")
def test_hook_endjob_delete_running_single_job_as_user_rm(self):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# again when resources are available, which seems to be the opposite of a | ||
# successful delete. Shouldn't qdel return a non-zero exit status since | ||
# the delete was unsuccessful? | ||
def test_hook_endjob_delete_running_array_job_as_root_sm(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous FIXME comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# mom is stopped and restarted, the job is killed and requeued as expected. | ||
# The job is rerun after the MoM is restarted and runs to completion; | ||
# however, the job's substate indicates that it was deleted. | ||
def test_hook_endjob_delete_running_single_job_as_root_rm(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous FIXME comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# again when resources are available, which seems to be the opposite of a | ||
# successful delete. Shouldn't qdel return a non-zero exit status since | ||
# the delete was unsuccessful? | ||
def test_hook_endjob_delete_running_array_job_as_user_sm(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous FIXME message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for filing the ticket and making the changes. This looks good to me.
src/include/pbs_python.h
Outdated
@@ -145,6 +146,7 @@ typedef struct hook_input_param { | |||
void *rq_move; | |||
void *rq_prov; | |||
void *rq_run; | |||
void *rq_end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation looks a little off here (and below)
@@ -200,6 +200,23 @@ | |||
<ECL>verify_value_state</ECL> | |||
</member_verify_function> | |||
</attributes> | |||
<attributes> | |||
<member_index>JOB_ATR_in_resv</member_index> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in_resv sounds more like a boolean. Is JOB_ATR_resv taken? I mean the name of the attribute is just 'resv'. Something else to consider is to rename ATTR_NODE_RESV (which is just 'resv') and use it for both the node resv attribute and the job resv attribute. Also, I don't see you using this in the code. Am I missing something?
elif key.startswith("RESV_STATE"): | ||
_pbs_v1.REVERSE_RESV_STATE[value] = key | ||
|
||
_pbs_v1.REVERSE_HOOK_EVENT = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider creating a list of the hook event strings, and then use a for loop to set them. That way when you add a new one, all you have to do is add it to the list.
src/server/array_func.c
Outdated
|
||
/* update parent job state to 'F' */ | ||
sprintf(log_buffer, "rq_endjob svr_setjobstate update parent job state to 'F'"); | ||
log_err(-1, __func__, log_buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no printf() parameters above, just put the quoted string directly into the log_err()
src/server/array_func.c
Outdated
|
||
check_block(parent, ""); | ||
if (check_job_state(parent, JOB_STATE_LTR_BEGUN)) { | ||
char acctbuf[40]; | ||
|
||
/* set parent endtime to time_now */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment really just says what the code is doing. I'd either explain a bit further or just remove the comment (here and below)
resv_attrs={}): | ||
start_time = resv_start_time or int(time.time()) + \ | ||
self.resv_start_delay | ||
end_time = resv_end_time or start_time + self.resv_duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 'or' like this will return a boolean. You'll need to use the ternary operator to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are indeed correct. I'm sorry for the confusion. I swear I tested this and it didn't work right. This is neat functionality. I would have thought you'd need to use the ternary operator to do this. This is a nice shorthand.
for mom in self.moms.values(): | ||
mom.stop(*args, **kwargs) | ||
self.moms_stopped = True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how much I like how these tests are structured. You use a lot of class variables as globals. They will persist from test to test, so you don't truly revert to defaults between tests. It would be easy to use one with values from the previous tests. I suggest you don't use class variables and pass them as arguments into your helper functions. If you don't want to do that, at least define a tearDown() where you unset them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TestHookEndJob
class level attributes are all constants that are largely used as default argument values. All of the attributes modified by the helper methods in the TestHookEndJob
class are part of the object instantiated by nose not the class itself. These attributes are used to track state which can change within each helper method as the test progresses, so passing them as arguments doesn't seem like an option. The attributes are initialized, and thus reset, at the beginning of the run_test_func()
method, which is used by all of the test methods in the class. If you think it would improve code readability, we can move the initialization of those attributes into a setUp()
method and add a tearDown()
method to clear them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setUp() is called before each job. If you reinitialize them in setUp() you should be fine. You don't need setUp() and tearDown() both.
def test_hook_endjob_rerun_and_delete_single_job(self): | ||
""" | ||
Start a single job, issue a rerun and immediately delete it. Verify | ||
that the end job hook is only executed once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the hook only run once? Won't it trigger with the qrerun, and another time with qdel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The helper functions enable and disable scheduling so job state transitions can be detected and controlled. In this test, and all of the tests that use endjob_rerun_and_delete_job()
, scheduling is not reenabled after the rerun, allowing the job to be deleted before it is restarted.
Run an array job, where all jobs have started but also force deleted | ||
(by the user) before completion. Verify that the end job hook is | ||
executed for all subjobs and the array job. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider condensing these tests into a fewer number. I'd say combine the force delete and normal delete tests for each purpose or the root/normal. The other split could be combining root/user tests. This will drop the tests by half. This will make the test plan run faster because setUp() can be slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that combining similar tests to reduce overhead is worthy of consideration and quite likely doing. It would mean leaving the initialization of the state attributes in run_test_func()
and not moving them into setUp()
.
Unless you have an objection, I would like to defer such changes until after this PR is merged as this hook is critical for our operations in the near term. As noted earlier, we have additional tests that should be added as well. We could combine tests as appropriate at that time.
""" | ||
Run a single rerunable job, but delete as user before completion | ||
after stopping the MoM. Verify that the end job hook is executed. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the fact the job is rerunnable or not have anything to do with deleting it? The end job hook should run because you are deleting the job. The job being rerunnable determines what happens with you do a qrerun. Does that make some of these tests unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A job being rerunnable or not also determines what happens to the job when an error occurs like not being able to talk to the MoM. In this test, the qdel fails because the MoM is unreachable. Since the job is rerunnable, the server requeues the job after node_fail_requeue
amount of time has elapsed. The job is then run again when resources are available. For this test, the result should be an end job hook being run twice: once when the server considers the running job lost and requeues it, and once when the job is successfully rerun to completion. Looking at the function documentation again, perhaps I should have included more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the tests can be refactored later, especially if you are going to add more then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing all my comments! Looks good to me.
@bayucan I just signed off. Do you want to take another look, or should it be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looks good to me, including the update to the design doc.
Merged #2290 into master. |
In order to provide more detailed accounting of jobs as required by some sites, recording the job information and the time of when a job ends is required. Therefore an endjob hook event has been added to the PBS server. Registered endjob hook scripts are executed when a job or subjob execution ends. Each script is provided information about the job and any associated attributes.
Link to Design Doc