Hook event endjob #2290

sdass23 · 2021-03-09T23:01:58Z

Some sites require more detailed job accounting, which means recording job information and the time at which each job ends. An endjob hook event has therefore been added to the PBS server. Registered endjob hook scripts are executed when a job or subjob finishes executing. Each script is provided information about the job and any associated attributes.

Link to Design Doc

CLAassistant · 2021-03-09T23:02:05Z

All committers have signed the CLA.

New end-job hooks tests have been added. Four of these tests currently fail, three of which appear to be an issue with the end-job hook additions not running end-job hooks during a forced delete. The fourth failure may be a bug in the qdel or server code. More investigation is needed. Supporting methods have also been added to the TestHookEndJob class to provide support for the additional tests.

toonen · 2021-09-02T20:42:57Z

This pull request is ready for review.

There are currently four (4) tests in tests.functional.pbs_hook_endjob.TestHookEndJob that are not passing. They all appear to fail due to an unexpected response from the server or PBS command. Details about the failures are documented in comments tagged with "FIXME" above each of the failing tests. The failing tests are not included in the smoke tests and thus do not affect the CI testing.

We have additional tests we would like to add as noted in the "TODO" comments; however, we believe the changes to the server source code are solid and thus would like to proceed with the review and merge. Additional tests will be submitted in a future PR.

bayucan · 2021-09-08T17:58:19Z

src/include/pbs_db.h

@@ -169,7 +169,7 @@ struct pbs_db_job_info {
 	INTEGER  ji_state;	/* Internal copy of state */
 	INTEGER  ji_substate;	/* job sub-state */
 	INTEGER  ji_svrflags;	/* server flags */
-	BIGINT   ji_stime;	/* time job started execution */
+	BIGINT   ji_stime;	/* time job started execution */	


Spurious tab character at the end of the line.

Fixed in 0eaabbb.

bayucan · 2021-09-08T18:17:33Z

test/tests/functional/pbs_hook_endjob.py

+    # mom is stopped and restarted, the job is killed and requeued as expected.
+    # The job is rerun after the MoM is restarted and runs to completion;
+    # however, the job's substate indicates that it was deleted.
+    def test_hook_endjob_delete_running_single_job_as_user_rm(self):


Could the failure be related to the changes made to req_delete.c? Does the problem happen without the change? If this still looks like a PBS code bug, please file an OSS ticket. And rather than letting this test fail, add a @skip(<message_why_skipped>) decorator. For example:

    @skip("issue 2440")
    def test_hook_endjob_delete_running_single_job_as_user_rm(self):

The failure scenario is described in issue #2485. An @skip was added to the test in 0eaabbb.

bayucan · 2021-09-08T18:18:16Z

test/tests/functional/pbs_hook_endjob.py

+    # again when resources are available, which seems to be the opposite of a
+    # successful delete.  Shouldn't qdel return a non-zero exit status since
+    # the delete was unsuccessful?
+    def test_hook_endjob_delete_running_array_job_as_root_sm(self):


See previous FIXME comment.

The failure scenario is described in issue #2486. An @skip was added to the test in 0eaabbb.

bayucan · 2021-09-08T18:18:34Z

test/tests/functional/pbs_hook_endjob.py

+    # mom is stopped and restarted, the job is killed and requeued as expected.
+    # The job is rerun after the MoM is restarted and runs to completion;
+    # however, the job's substate indicates that it was deleted.
+    def test_hook_endjob_delete_running_single_job_as_root_rm(self):


See previous FIXME comment.

The failure scenario is described in issue #2485. An @skip was added to the test in 0eaabbb.

bayucan · 2021-09-08T18:19:39Z

test/tests/functional/pbs_hook_endjob.py

+    # again when resources are available, which seems to be the opposite of a
+    # successful delete.  Shouldn't qdel return a non-zero exit status since
+    # the delete was unsuccessful?
+    def test_hook_endjob_delete_running_array_job_as_user_sm(self):


See previous FIXME message.

The failure scenario is described in issue #2486. An @skip was added to the test in 0eaabbb.

bayucan

Thank you for filing the ticket and making the changes. This looks good to me.

bhroam · 2021-09-22T23:53:45Z

src/include/pbs_python.h

@@ -145,6 +146,7 @@ typedef struct	hook_input_param {
 	void		*rq_move;
 	void		*rq_prov;
 	void		*rq_run;
+	void        *rq_end;


Indentation looks a little off here (and below)

bhroam · 2021-09-22T23:55:27Z

src/lib/Libattr/master_job_attr_def.xml

@@ -200,6 +200,23 @@
         <ECL>verify_value_state</ECL>
      </member_verify_function>
   </attributes>
+   <attributes>
+      <member_index>JOB_ATR_in_resv</member_index>


in_resv sounds more like a boolean. Is JOB_ATR_resv taken? I mean the name of the attribute is just 'resv'. Something else to consider is to rename ATTR_NODE_RESV (which is just 'resv') and use it for both the node resv attribute and the job resv attribute. Also, I don't see you using this in the code. Am I missing something?

bhroam · 2021-09-23T00:05:09Z

src/modules/python/pbs/v1/_svr_types.py

    elif key.startswith("RESV_STATE"):
        _pbs_v1.REVERSE_RESV_STATE[value] = key
+
+_pbs_v1.REVERSE_HOOK_EVENT = {


Consider creating a list of the hook event strings and then using a for loop to set them. That way, when you add a new event, all you have to do is add it to the list.
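The suggestion above could be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code in `_svr_types.py`: the `fake_pbs_v1` stand-in, the event names in `HOOK_EVENT_NAMES`, and their integer values are all hypothetical placeholders for whatever `_pbs_v1` actually exposes. The int -> str direction follows the commit "reversing the reverse hook lookup to be int -> str".

```python
from types import SimpleNamespace

# Hypothetical stand-in for the _pbs_v1 extension module; the names and
# integer values below are placeholders, not the real PBS constants.
fake_pbs_v1 = SimpleNamespace(
    QUEUEJOB=1,
    RUNJOB=2,
    ENDJOB=4,
)

# Single place to register hook event names; supporting a new event
# means adding one string here.
HOOK_EVENT_NAMES = ["QUEUEJOB", "RUNJOB", "ENDJOB"]


def build_reverse_hook_event(module, names):
    """Return an int -> str mapping for the named constants on `module`."""
    reverse = {}
    for name in names:
        reverse[getattr(module, name)] = name
    return reverse


REVERSE_HOOK_EVENT = build_reverse_hook_event(fake_pbs_v1, HOOK_EVENT_NAMES)
```

With this shape, the loop body never changes; only the list of names grows when another hook event is introduced.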

bhroam · 2021-09-23T00:08:05Z

src/server/array_func.c

+
+				/* update parent job state to 'F' */
+				sprintf(log_buffer, "rq_endjob svr_setjobstate update parent job state to 'F'");
+				log_err(-1, __func__, log_buffer);


There are no printf() parameters above, so just pass the quoted string directly to log_err().

bhroam · 2021-09-23T00:10:29Z

src/server/array_func.c


 		check_block(parent, "");
 		if (check_job_state(parent, JOB_STATE_LTR_BEGUN)) {
 			char acctbuf[40];
+
+			/* set parent endtime to time_now */


This comment really just says what the code is doing. I'd either explain a bit further or just remove the comment (here and below)

bhroam · 2021-09-23T22:59:15Z

test/tests/functional/pbs_hook_endjob.py

+            resv_attrs={}):
+        start_time = resv_start_time or int(time.time()) + \
+            self.resv_start_delay
+        end_time = resv_end_time or start_time + self.resv_duration


Using 'or' like this will return a boolean. You'll need to use the ternary operator to do this.

https://realpython.com/python-or-operator/#non-boolean-contexts

You are indeed correct. I'm sorry for the confusion. I swear I tested this and it didn't work right. This is neat functionality. I would have thought you'd need to use the ternary operator to do this. This is a nice shorthand.
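For readers following this thread, a small sketch of the behavior being discussed: in Python, `a or b` evaluates to `a` itself when `a` is truthy and to `b` otherwise, so it can supply a default value. The function name `reservation_window` and the default values for `start_delay` and `duration` are hypothetical and chosen only for illustration; they are not taken from the test suite.

```python
import time


def reservation_window(resv_start_time=None, resv_end_time=None,
                       start_delay=20, duration=180):
    """Return (start, end), falling back to defaults for falsy arguments."""
    # `a or b` yields the operand itself, not True/False. Because `+`
    # binds tighter than `or`, the right-hand side is the whole sum.
    start = resv_start_time or int(time.time()) + start_delay
    end = resv_end_time or start + duration
    return start, end


explicit = reservation_window(1000, 2000)   # both values supplied
defaulted_end = reservation_window(1000)    # end derived from start
```

One caveat with this shorthand is that any falsy argument, including an explicit `0`, is replaced by the default. When `0` is a meaningful value, a conditional expression such as `x if x is not None else default` keeps the two cases apart.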

bhroam · 2021-09-27T20:03:54Z

test/tests/functional/pbs_hook_endjob.py

+        for mom in self.moms.values():
+            mom.stop(*args, **kwargs)
+        self.moms_stopped = True
+


I'm not sure how much I like how these tests are structured. You use a lot of class variables as globals. They will persist from test to test, so you don't truly revert to defaults between tests. It would be easy to use one with values from the previous tests. I suggest you don't use class variables and pass them as arguments into your helper functions. If you don't want to do that, at least define a tearDown() where you unset them all.

The TestHookEndJob class level attributes are all constants that are largely used as default argument values. All of the attributes modified by the helper methods in the TestHookEndJob class are part of the object instantiated by nose not the class itself. These attributes are used to track state which can change within each helper method as the test progresses, so passing them as arguments doesn't seem like an option. The attributes are initialized, and thus reset, at the beginning of the run_test_func() method, which is used by all of the test methods in the class. If you think it would improve code readability, we can move the initialization of those attributes into a setUp() method and add a tearDown() method to clear them.

setUp() is called before each test. If you reinitialize them in setUp() you should be fine. You don't need both setUp() and tearDown().
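A minimal sketch of the pattern agreed on here, using plain `unittest` rather than the PTL base class. The class name `ExampleEndJobTest`, the `stop_moms` helper, and the attribute names are hypothetical; the point is only that state reset in `setUp()` starts from a known value in every test.

```python
import unittest


class ExampleEndJobTest(unittest.TestCase):
    # Class-level constant used only as a default value (never mutated).
    DEFAULT_JOB_COUNT = 1

    def setUp(self):
        # Runs before every test method, so per-test state always starts
        # from a known value regardless of what earlier tests did.
        self.moms_stopped = False
        self.job_ids = []

    def stop_moms(self):
        # Hypothetical helper that mutates instance state.
        self.moms_stopped = True

    def test_stop(self):
        self.stop_moms()
        self.assertTrue(self.moms_stopped)

    def test_fresh_state(self):
        # Holds whether or not test_stop ran first.
        self.assertFalse(self.moms_stopped)
        self.assertEqual(self.job_ids, [])


# Run the two tests programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleEndJobTest)
result = unittest.TestResult()
suite.run(result)
```

Because the state lives on the instance and is rebuilt in `setUp()`, no `tearDown()` is needed just to clear it.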

bhroam · 2021-09-27T21:09:42Z

test/tests/functional/pbs_hook_endjob.py

+    def test_hook_endjob_rerun_and_delete_single_job(self):
+        """
+        Start a single job, issue a rerun and immediately delete it.  Verify
+        that the end job hook is only executed once.


Why does the hook only run once? Won't it trigger with the qrerun, and another time with qdel?

The helper functions enable and disable scheduling so job state transitions can be detected and controlled. In this test, and all of the tests that use endjob_rerun_and_delete_job(), scheduling is not reenabled after the rerun, allowing the job to be deleted before it is restarted.

bhroam · 2021-09-28T19:38:03Z

test/tests/functional/pbs_hook_endjob.py

+        Run an array job, where all jobs have started but also force deleted
+        (by the user) before completion.  Verify that the end job hook is
+        executed for all subjobs and the array job.
+        """


Consider condensing these into fewer tests. I'd suggest combining the force-delete and normal-delete tests for each scenario. The other possible split is to combine the root and user tests. Either would cut the number of tests in half, which will make the test plan run faster because setUp() can be slow.

I agree that combining similar tests to reduce overhead is worth considering and quite likely worth doing. It would mean leaving the initialization of the state attributes in run_test_func() and not moving them into setUp().

Unless you have an objection, I would like to defer such changes until after this PR is merged as this hook is critical for our operations in the near term. As noted earlier, we have additional tests that should be added as well. We could combine tests as appropriate at that time.

bhroam · 2021-09-28T19:42:04Z

test/tests/functional/pbs_hook_endjob.py

+        """
+        Run a single rerunable job, but delete as user before completion
+        after stopping the MoM. Verify that the end job hook is executed.
+        """


Why does whether the job is rerunnable have anything to do with deleting it? The end job hook should run because you are deleting the job. The job being rerunnable determines what happens when you do a qrerun. Does that make some of these tests unnecessary?

A job being rerunnable or not also determines what happens to the job when an error occurs like not being able to talk to the MoM. In this test, the qdel fails because the MoM is unreachable. Since the job is rerunnable, the server requeues the job after node_fail_requeue amount of time has elapsed. The job is then run again when resources are available. For this test, the result should be an end job hook being run twice: once when the server considers the running job lost and requeues it, and once when the job is successfully rerun to completion. Looking at the function documentation again, perhaps I should have included more detail.

Sure, the tests can be refactored later, especially if you are going to add more then.

bhroam

Thanks for fixing all my comments! Looks good to me.

bhroam · 2021-10-08T19:09:58Z

@bayucan I just signed off. Do you want to take another look, or should it be merged?

bayucan

Still looks good to me, including the update to the design doc.

bayucan · 2021-10-08T19:33:18Z

Merged #2290 into master.

toonen and others added 24 commits August 20, 2020 13:54

Initial additions for server endjob hook event

6fcfc8f

minor job end updates

dce2860

merging with master

e86dcb1

added test

e29c209

adding test

cca3041

update test: look for job_state 'R', 'E'

b08b33d

Merge branch 'master' into hook_event_endjob

d04272e

updated test and fixed endjob hook processing

385b127

added endjob hook test for array of jobs

e2f139e

added endjob hook call for requed jobs

74d2add

merge with master and adding reverse end job state

2937ef8

fixing _ -> .

3e3fb1a

add HOOKSTR_ENDJOB to add_hook_event; add JOB_SUBSTATE_* constants to…

f9dfc79

… PBS string constants and the associated reverse lookkup

adding hook event reverse.

28a33fe

reversing the reverse hook lookup to be int -> str.

14345c7

Merge branch 'master' into hook_event_endjob

e053f0e

add ji_endtime attribute to jobfix

eabbab9

removed forced 'E' setting and instead properly setting 'F' record; a…

12b0887

…lso fixed when job end time is captured; removed recent endtime database updates; corrected endjob server hook processing

Merge branch 'hook_event_endjob' of https://github.com/ericpershey/pb…

7fb6fd2

…spro into hook_event_endjob

apply fixes done for array job wrt to endtime to single endjob proces…

48915ab

…s hook call and reque endjob process hook calls; also remove debug code

clean up endjob reservation test

8079343

updates to accommodate endjob hooks ran under reservation

80113b4

add test for forced reque endjob hook

c4c67f7

"remove debug endjob hook success log error message calls"

c1b93bb

sdass23 changed the title ~~Hook event endjob~~ Hook event endjob (WIP) Mar 10, 2021

sdass23 and others added 4 commits March 15, 2021 15:42

add endjob hook calls for running jobs that are deleted; added associ…

eccf8bf

…ated tests

Remove recently added process hook calls from req_delete. Add missing…

774f1a3

… process hooks calls in node_manager (post_discard_job) and req_jobobit (on_job_rerun)

Merge branch 'master' into hook_event_endjob

bba3665

Merge remote-tracking branch 'upstream/master' into hook_event_endjob

0d59130

sdass23 and others added 13 commits May 13, 2021 16:05

update unstarted single job tests

84536cb

Merge remote-tracking branch 'upstream/master' into hook_event_endjob

0fad3db

use short name when updating mom's settings on server

9ba3db6

fixed NULL pointer dereference causing qdel to seg fault

4ba8317

additional tests for the end-job hook

41610a5

fix issue where the parent job in job arrays that were forced delete …

f18c958

…were missing an 'E' record

Merge remote-tracking branch 'origin/master' into hook_event_endjob

049e50f

Merge remote-tracking branch 'upstream/master' into hook_event_endjob

b7b280c

set datetime conversion error logging back to debug

91ee22f

Merge remote-tracking branch 'upstream/master' into hook_event_endjob

79b354a

Merge remote-tracking branch 'upstream/master' into hook_event_endjob

ccf3b5f

refactored and added tests

b4bd234

sdass23 changed the title ~~Hook event endjob (WIP)~~ Hook event endjob Sep 2, 2021

Merge branch 'master' into hook_event_endjob

e63c773

bayucan requested changes Sep 8, 2021

This was referenced Sep 22, 2021

qdel of running job while the MoM is stopped #2485

Closed

qdel of running job array while the MoM is stopped #2486

Closed

skip tests associated with issues 2485 and 2486. removed extraneous tab.

0eaabbb

bayucan approved these changes Sep 22, 2021

bhroam requested changes Sep 28, 2021

updates based on PR2290 comments

645e4e3

bhroam approved these changes Oct 8, 2021

bayucan approved these changes Oct 8, 2021

bayucan merged commit 56245cb into openpbs:master Oct 8, 2021

sdass23 mentioned this pull request Oct 11, 2021

Hook event jobobit #2494

Merged

toonen mentioned this pull request Nov 16, 2021

Fix for a job array sticking in the queue after delete #2499

Merged
