Add possibility of resuming multinode jobs #1955

Merged (2 commits, Sep 18, 2020)

Conversation

@zv0n (Contributor) commented Aug 5, 2020

Describe Bug or Feature

I have added the possibility of resuming jobs that are running on multiple nodes. It is already possible to resume a job running on a single node after mom restarts, so allowing the same for multinode jobs seems like a good idea. This is done when the -p flag is passed on the command line.

Possible problems with the implementation:

  • If you run pbsdsh while one of the nodes is down, it enters an infinite loop because it does not check node availability.
  • If you restart the mother superior while pbsdsh is running, pbsdsh will exit with an error because it loses its connection to the mother superior.
  • Starting pbsdsh right after a node restart may result in an error on task spawn, because the task can be sent to the node before the node rejoins the job (and receives the ports through which it should communicate).

Describe Your Change

  • The mother superior now saves and recovers ji_ports, so it can communicate with the demux program after recovering the job.
  • When a non-superior mom recovers a multinode job, it sends IM_RECONNECT_TO_MS to the mother superior; the MS then sends IM_JOIN_RECOV_JOB to all sisters along with the port numbers through which they should communicate.
  • When the MS recovers a multinode job, it sends IM_JOIN_RECOV_JOB along with the port numbers to all sisters (see the sketch after this list).
  • Obit information (if there is any) is saved and recovered along with tasks, so moms know whether to send an obit after task completion.
  • Multinode jobs now check inside scan_for_exiting whether they have a connection to the MS, so when they send an obit they can be sure it is delivered.
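
A minimal illustrative sketch of the MS-side rejoin broadcast follows; this is not the actual patch. send_im_request() is a hypothetical helper standing in for the real IM plumbing, and ji_hosts, ji_numnodes and ji_ports are the job fields referenced above (node 0 is assumed to be the mother superior):

    /* Illustrative only: broadcast IM_JOIN_RECOV_JOB to all sisters after a -p
     * recovery, passing along the stdout/stderr ports restored into ji_ports[].
     * send_im_request() is a made-up helper, not part of the real code. */
    static void rejoin_sisters(job *pjob)
    {
        int i;

        for (i = 1; i < pjob->ji_numnodes; i++) {   /* skip node 0, the MS itself */
            hnodent *np = &pjob->ji_hosts[i];

            send_im_request(np->hn_stream, IM_JOIN_RECOV_JOB,
                            pjob->ji_ports[0], pjob->ji_ports[1]);
        }
    }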

Attach Test and Valgrind Logs/Output

test_multinode_restart.txt

@CLAassistant commented Aug 5, 2020

CLA assistant check
All committers have signed the CLA.

@zv0n force-pushed the master branch 8 times, most recently from 4a6daad to d48f11b (August 11, 2020 13:57)
@zv0n (Contributor Author) commented Aug 11, 2020

Hi,
Can someone please remove the random redundant Travis job (that doesn't seem to be starting at all)?
Thanks!

@arungrover (Contributor):

Hi @zv0n, I think I have seen this problem with Travis in the past. What might work is to either close the PR and reopen it, or force-push your branch. In both cases Travis will run again, and I think it will get rid of the redundant job.

@zv0n (Contributor Author) commented Aug 12, 2020

Thanks @arungrover, seems like force pushing fixed it!

@zv0n (Contributor Author) commented Aug 14, 2020

Can someone take a look at this change, please?

@bayucan (Contributor) left a comment

@zv0n: Do you also have an accompanying design for this? It would be better to add a design under https://openpbs.atlassian.net/wiki/spaces/PD/pages outlining your approach, so that when pbs_mom is started with the -p option, multinode jobs can also be recovered. Then announce it on the community forum, https://community.openpbs.org/, for comments. People in the community can chime in and suggest improvements to the design...

@@ -501,6 +501,8 @@ struct job {
int ji_stderr; /* socket for stderr */
int ji_ports[2]; /* ports for stdout/err */
enum bg_hook_request ji_hook_running_bg_on; /* set when hook starts in the background*/
int ji_msconnected; /* 0 - not connected, 1 - connected */
pbs_list_head ji_multinodejobs; /* links to recovered multinode jobs */
Contributor:

Please initialize ji_msconnected in server/job_func.c:job_alloc() under the #ifdef PBS_MOM section.
Also initialize ji_multinodejobs in server/job_func.c:job_alloc() by calling CLEAR_HEAD(ji_multinodejobs),
and free the memory allocated to ji_multinodejobs in server/job_func.c:job_free() under the #ifdef PBS_MOM section.
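
For reference, a rough sketch of the requested additions (the exact placement inside job_alloc()/job_free() and the surrounding code are assumed, not taken from the patch):

    /* in server/job_func.c:job_alloc(), under the #ifdef PBS_MOM section */
    #ifdef PBS_MOM
        pj->ji_msconnected = 0;               /* not yet connected to mother superior */
        CLEAR_HEAD(pj->ji_multinodejobs);     /* empty list of recovered multinode jobs */
    #endif

    /* in server/job_func.c:job_free(), under the #ifdef PBS_MOM section */
    #ifdef PBS_MOM
        CLEAR_HEAD(pj->ji_multinodejobs);     /* list only links already-managed jobs */
    #endif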

Contributor Author:

I believe I'm already initializing these variables in job_func.c:355-356; should I do something more than that?

Since ji_multinodejobs only contains pointers to jobs that are already managed, will CLEAR_HEAD(ji_multinodejobs) suffice in job_free()?

Contributor:

I missed that you were already initializing them in job_func.c under job_alloc(), so that's good. For job_free(), yes, just do the CLEAR_HEAD.

@@ -612,6 +614,8 @@ struct job {
#ifdef PBS_MOM
tm_host_id ji_nodeidx; /* my node id */
tm_task_id ji_taskidx; /* generate task id's for job */
int ji_stdout;
int ji_stderr;
Contributor:

Initialize these in server/job_func.c:job_alloc() under the #ifdef PBS_MOM section.

@@ -376,7 +376,7 @@ extern pid_t fork_me(int sock);
extern ssize_t readpipe(int pfd, void *vptr, size_t nbytes);
extern ssize_t writepipe(int pfd, void *vptr, size_t nbytes);
extern int get_la(double *);
extern void init_abort_jobs(int);
extern void init_abort_jobs(int, pbs_list_head*);
Contributor:

pbs_list_head* -> pbs_list_head *

@@ -2060,6 +2068,14 @@ init_abort_jobs(int recover)

if (mom_do_poll(pj))
append_link(&mom_polljobs, &pj->ji_jobque, pj);

append_link(multinode_jobs, &pj->ji_multinodejobs, pj);
Contributor:

Should we first check whether the job being added is multinode?
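
(One possible guard, assuming ji_numnodes holds the number of nodes assigned to the recovered job:)

    /* only track jobs that actually span more than one node */
    if (pj->ji_numnodes > 1)
        append_link(multinode_jobs, &pj->ji_multinodejobs, pj);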

(void)close(fd);
return (-1);
}
continue;
Contributor:

Should we put a limit on the number of write retries, so we don't get into an infinite loop?
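
(One way to bound the retries, as an illustration; MAX_WRITE_RETRIES and the surrounding variables fd, buf, nbytes and written are assumed from the enclosing function, not taken from the patch:)

    #define MAX_WRITE_RETRIES 5
    int retries = 0;
    while (written < nbytes) {
        ssize_t n = write(fd, (char *)buf + written, nbytes - written);
        if (n > 0) {
            written += n;
            retries = 0;                 /* made progress, reset the counter */
        } else if (n < 0 && errno != EINTR && errno != EAGAIN) {
            (void)close(fd);
            return (-1);                 /* hard error, give up immediately */
        } else if (++retries > MAX_WRITE_RETRIES) {
            (void)close(fd);
            return (-1);                 /* too many retries, avoid an infinite loop */
        }
    }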

src/resmom/mom_comm.c: outdated review comment (resolved)
@@ -650,7 +651,7 @@ extern void dep_cleanup(void);
/* External Functions */

extern void catch_child(int);
extern void init_abort_jobs(int);
extern void init_abort_jobs(int, pbs_list_head*);
Contributor:

pbs_list_head* -> pbs_list_head *


# Make sure moms are running with -p flag
momA.restart(args=['-p'])
momB.restart(args=['-p'])
Contributor:

What happens if -p is not specified by at least one of the moms? Please add a test case for this.

self.server.expect(JOB, {'job_state': 'R'}, id=jid)

# only restart 1 node as restarting momA would result in a pbsdsh crash
momB.restart(args=['-p'])
Contributor:

Please add a test case where momB is prematurely killed, as in mom.signal("-KILL"), and then issue a mom.restart("-p"). Also add a case where the primary mom is killed prematurely and then restarted with -p. Is the behavior what you expected?

src/resmom/mom_comm.c: outdated review comments (resolved)
@zv0n (Contributor Author) commented Aug 24, 2020

@zv0n: Do you also have an accompanying design for this? It would be better to add a design under https://openpbs.atlassian.net/wiki/spaces/PD/pages outlining your approach, so that when pbs_mom is started with the -p option, multinode jobs can also be recovered. Then announce it on the community forum, https://community.openpbs.org/, for comments. People in the community can chime in and suggest improvements to the design...

@bayucan I don't have a design, sorry. I did this because we needed this functionality at my workplace, so I started coding before I looked at how merging to the project works. I'll go ahead and create one now.

src/resmom/mom_comm.c: outdated review comment (resolved)
@bayucan (Contributor) commented Aug 24, 2020

Thanks @zv0n for creating the design.

@subhasisb (Collaborator):

Hi @zv0n, I was looking at your proposal. I looked at your design page, but could not get to the associated design discussion page (it shows "oops, topic not found"), so I am adding my thoughts here:

I was trying to understand your primary use case for this change:
Are you trying to handle the case where only pbs_mom is restarted/upgraded, or a host-level restart? If it is a host-level restart or OS update that needs a reboot, a pbsdsh or MPI job will typically die anyway, so allowing mom to restart with -p to preserve jobs is not useful there.
PBS supports ways to make node maintenance easy, by allowing nodes to be turned offline to drain jobs off them, or by enabling admin-suspend; see maintenance_jobs in the guide: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf

Also, there is a thread about new maintenance windows enhancement:
https://openpbs.atlassian.net/wiki/spaces/PD/pages/879493121/Node+maintenance+window+enhancement

Please look through these and see whether they already satisfy your use cases.

I do realize that your effort is to allow restarting mom while preserving multinode jobs. The part I am trying to understand is: is it really necessary, when there are already ways to take a node down for maintenance in an orderly fashion?

@vchlum (Contributor) commented Aug 25, 2020

Hi @subhasisb, my colleague @zv0n is working on this because we would appreciate this feature/fix for the upgrade process, and I think it could help others too.

Of course, this cannot cover the host-level restart case; for that, maintenance windows are suitable.

Our current solution: when we need to update the PBS service on nodes (to apply some patch or bug fix), we often do a "live update". Right now, we have the following patch in use: after upgrading the PBS package on the system, a signal is sent to mom. Mom then waits until no more multi-node jobs are running on the node, and once no multi-node job is present, the self-restart is engaged (with -p). This obviously has a huge drawback: the restart may never happen, because a new multi-node job could start before the last one is finished. Not ideal.

I think it would be good and useful for everyone to be able to update (minor versions of?) PBS without setting up any maintenance. We could just install a new version of the PBS package, the post-install could initiate the restart, and that's it. This "live update" would be much simpler for admins. It would also increase node utilization, because you do not need to drain nodes before updating, so no resources are wasted during the PBS package update process.

@subhasisb (Collaborator):

Our current solution: when we need to update the PBS service on nodes (to apply some patch or bug fix), we often do a "live update". Right now, we have the following patch in use: after upgrading the PBS package on the system, a signal is sent to mom. Mom then waits until no more multi-node jobs are running on the node, and once no multi-node job is present, the self-restart is engaged (with -p). This obviously has a huge drawback: the restart may never happen, because a new multi-node job could start before the last one is finished. Not ideal.

Actually, if you set the node to "offline", no new jobs will be sent to that mom, so eventually the node will be free. I do understand that even this could take a good amount of time, depending on how long the currently running job takes to complete.

I think it would be good and useful for everyone to be able to update (minor versions of?) PBS without setting up any maintenance. We could just install a new version of the PBS package, the post-install could initiate the restart, and that's it. This "live update" would be much simpler for admins. It would also increase node utilization, because you do not need to drain nodes before updating, so no resources are wasted during the PBS package update process.

Yes, this of course makes sense. I was just wondering whether many sites really need this, but I understand if you do.

@bhroam (Contributor) commented Aug 25, 2020

@vchlum another one of our customers handles rolling updates with provisioning. They even handle OS updates this way. They set the default aoe of the new jobs to the new node image. Eventually a new job will run with the new aoe and that will trigger the node to be provisioned. This will work better for them than for you because they allocate their nodes exclusively. In your case, the scheduler will only run jobs with shared aoe values on the same nodes. Eventually all of the older aoe jobs will be finished and a new aoe will run. The node will eventually provision.

@vchlum (Contributor) commented Aug 27, 2020

Thanks @bhroam for sharing that solution. A solution like this is not suitable for our site: we have heterogeneous nodes and heterogeneous jobs, and we allow walltimes of up to one month.

Since the -p parameter already works for single-node jobs, I also think this is a logical improvement.

@zv0n (Contributor Author) commented Aug 31, 2020

My forum post got approved, so we can move future discussion there if need be - https://community.openpbs.org/t/option-to-resume-multinode-jobs/2244/2

@bayucan (Contributor) left a comment

Since obit saving has been removed, you also need to take out the function definition for ensure_write(); it's now unused.

@zv0n (Contributor Author) commented Sep 1, 2020

Forgot about that, sorry; it's removed now.

@hirenvadalia (Contributor) left a comment

@zv0n I don't have any further comments, but you still have a few open questions/comments from @bayucan. You also need to update the design page as suggested by @bayucan.

@zv0n (Contributor Author) commented Sep 14, 2020

Hi, sorry for the long delay, I was on vacation.
@bayucan I think I've fixed all the problems you found with the code and added the tests you requested. I've removed the test that restarted during pbsdsh, since we've decided to remove obit saving and that test would either fail or enter an infinite loop.

@bayucan (Contributor) commented Sep 15, 2020

@zv0n: It looks good now. The Travis check seems to be stuck; please rebase your branch onto the latest master.

@zv0n force-pushed the master branch 2 times, most recently from d384898 to 9c6696b (September 17, 2020 07:48)
@zv0n (Contributor Author) commented Sep 17, 2020

@bayucan I've rebased to master, hopefully Travis check will be happy and cooperative this time around :D


int stream = np->hn_stream;
im_compose(stream, pjob->ji_qs.ji_jobid,
pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str,
Contributor:

@zv0n sorry, but one very small change (things have now changed in mainline in how we access job attributes, using getters/setters):
Can you please replace this line with get_jattr_str(pjob, JOB_ATR_Cookie)?
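
(That is, roughly, as a sketch of the requested change:)

    /* before: direct attribute access */
    pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str
    /* after: the getter suggested above */
    get_jattr_str(pjob, JOB_ATR_Cookie)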

@hirenvadalia (Contributor) left a comment

Thanks @zv0n. LGTM.

@zv0n requested a review from bayucan, September 17, 2020 11:40
@bayucan (Contributor) left a comment

Code looks good.

@hirenvadalia merged commit fc02d4c into openpbs:master on Sep 18, 2020
SudeshnaMoulik added a commit to SudeshnaMoulik/openpbs that referenced this pull request Sep 18, 2020