Add possibility of resuming multinode jobs #1955

Merged (2 commits, Sep 18, 2020)

Conversation

@zv0n (Contributor) commented Aug 5, 2020

Describe Bug or Feature

I have added the possibility of resuming jobs that are running on multiple nodes. It is already possible to resume a job running on a single node after mom restarts, so allowing the same for multinode jobs seems like a good idea. This is done when the -p flag is passed on the command line.

Possible problems with the implementation:

  • If you run pbsdsh while one of the nodes is down, it enters an infinite loop because it does not check node availability.
  • If you restart the mother superior while pbsdsh is running, pbsdsh will exit with an error because it loses its connection to the mother superior.
  • Starting pbsdsh right after a node restart may result in an error on task spawn, because the task can be sent to the node before the node rejoins the job (and receives the ports through which it should communicate).

Describe Your Change

  • The mother superior now saves and recovers ji_ports, so it can communicate with the demux program after recovering the job.
  • When a non-superior mom recovers a multinode job, it sends IM_RECONNECT_TO_MS to the mother superior; the MS then sends IM_JOIN_RECOV_JOB to all sisters along with the port numbers through which they should communicate.
  • When the MS recovers a multinode job, it sends IM_JOIN_RECOV_JOB along with the port numbers to all sisters (see the sketch after this list).
  • Obit information (if there is any) is saved and recovered along with tasks, so moms know whether to send an obit after task completion.
  • Multinode jobs now check inside scan_for_exiting whether they have a connection to the MS, so when they send an obit they can be sure it is delivered.
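
A minimal illustrative sketch of the MS-side rejoin broadcast follows; this is not the actual patch. send_im_request() is a hypothetical helper standing in for the real IM plumbing, and ji_hosts, ji_numnodes and ji_ports are the job fields referenced above (node 0 is assumed to be the mother superior):

    /* Illustrative only: broadcast IM_JOIN_RECOV_JOB to all sisters after a -p
     * recovery, passing along the stdout/stderr ports restored into ji_ports[].
     * send_im_request() is a made-up helper, not part of the real code. */
    static void rejoin_sisters(job *pjob)
    {
        int i;

        for (i = 1; i < pjob->ji_numnodes; i++) {   /* skip node 0, the MS itself */
            hnodent *np = &pjob->ji_hosts[i];

            send_im_request(np->hn_stream, IM_JOIN_RECOV_JOB,
                            pjob->ji_ports[0], pjob->ji_ports[1]);
        }
    }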

Attach Test and Valgrind Logs/Output

test_multinode_restart.txt

@CLAassistant commented Aug 5, 2020

CLA assistant check
All committers have signed the CLA.

@zv0n force-pushed the master branch 8 times, most recently from 4a6daad to d48f11b (August 11, 2020 13:57)
@zv0n (Contributor Author) commented Aug 11, 2020

Hi,
Can someone please remove the random redundant Travis job (that doesn't seem to be starting at all)?
Thanks!

@arungrover (Contributor):

Hi @zv0n, I think I have seen this problem with Travis in the past. What might work is to either close the PR and reopen it, or force-push your branch. In both cases Travis will run again, and I think it will get rid of the redundant job.

@zv0n (Contributor Author) commented Aug 12, 2020

Thanks @arungrover, seems like force pushing fixed it!

@zv0n (Contributor Author) commented Aug 14, 2020

Can someone take a look at this change, please?

@bayucan (Contributor) left a comment

@zv0n: Do you also have an accompanying design for this? It would be better to add a design under https://openpbs.atlassian.net/wiki/spaces/PD/pages outlining your approach, so that when pbs_mom is started with the -p option, multinode jobs can also be recovered. Then announce it on the community forum, https://community.openpbs.org/, for comments. People in the community can chime in and suggest improvements to the design...

@@ -501,6 +501,8 @@ struct job {
int ji_stderr; /* socket for stderr */
int ji_ports[2]; /* ports for stdout/err */
enum bg_hook_request ji_hook_running_bg_on; /* set when hook starts in the background*/
int ji_msconnected; /* 0 - not connected, 1 - connected */
pbs_list_head ji_multinodejobs; /* links to recovered multinode jobs */
Contributor:

Please initialize ji_msconnected in server/job_func.c:job_alloc() under the #ifdef PBS_MOM section.
Also initialize ji_multinodejobs in server/job_func.c:job_alloc() by calling CLEAR_HEAD(ji_multinodejobs),
and free the memory allocated to ji_multinodejobs in server/job_func.c:job_free() under the #ifdef PBS_MOM section.
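
For reference, a rough sketch of the requested additions (the exact placement inside job_alloc()/job_free() and the surrounding code are assumed, not taken from the patch):

    /* in server/job_func.c:job_alloc(), under the #ifdef PBS_MOM section */
    #ifdef PBS_MOM
        pj->ji_msconnected = 0;               /* not yet connected to mother superior */
        CLEAR_HEAD(pj->ji_multinodejobs);     /* empty list of recovered multinode jobs */
    #endif

    /* in server/job_func.c:job_free(), under the #ifdef PBS_MOM section */
    #ifdef PBS_MOM
        CLEAR_HEAD(pj->ji_multinodejobs);     /* list only links already-managed jobs */
    #endif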

Contributor Author:

I believe I'm already initializing these variables in job_func.c:355-356; should I do something more than that?

Since ji_multinodejobs only contains pointers to jobs that are already managed, will CLEAR_HEAD(ji_multinodejobs) suffice in job_free()?

Contributor:

I missed that you were already initializing them in job_func.c under job_alloc(), so that's good. For job_free(), yes, just do the CLEAR_HEAD.

@@ -612,6 +614,8 @@ struct job {
#ifdef PBS_MOM
tm_host_id ji_nodeidx; /* my node id */
tm_task_id ji_taskidx; /* generate task id's for job */
int ji_stdout;
int ji_stderr;
Contributor:

Initialize these in server/job_func.c:job_alloc() under the #ifdef PBS_MOM section.

@@ -376,7 +376,7 @@ extern pid_t fork_me(int sock);
extern ssize_t readpipe(int pfd, void *vptr, size_t nbytes);
extern ssize_t writepipe(int pfd, void *vptr, size_t nbytes);
extern int get_la(double *);
extern void init_abort_jobs(int);
extern void init_abort_jobs(int, pbs_list_head*);
Contributor:

pbs_list_head* -> pbs_list_head *

@@ -2060,6 +2068,14 @@ init_abort_jobs(int recover)

if (mom_do_poll(pj))
append_link(&mom_polljobs, &pj->ji_jobque, pj);

append_link(multinode_jobs, &pj->ji_multinodejobs, pj);
Contributor:

Should we first check whether the job being added is multinode?
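
(One possible guard, assuming ji_numnodes holds the number of nodes assigned to the recovered job:)

    /* only track jobs that actually span more than one node */
    if (pj->ji_numnodes > 1)
        append_link(multinode_jobs, &pj->ji_multinodejobs, pj);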

(void)close(fd);
return (-1);
}
continue;
Contributor:

Should we put a limit on the number of write retries, so we don't get into an infinite loop?
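
(One way to bound the retries, as an illustration; MAX_WRITE_RETRIES and the surrounding variables fd, buf, nbytes and written are assumed from the enclosing function, not taken from the patch:)

    #define MAX_WRITE_RETRIES 5
    int retries = 0;
    while (written < nbytes) {
        ssize_t n = write(fd, (char *)buf + written, nbytes - written);
        if (n > 0) {
            written += n;
            retries = 0;                 /* made progress, reset the counter */
        } else if (n < 0 && errno != EINTR && errno != EAGAIN) {
            (void)close(fd);
            return (-1);                 /* hard error, give up immediately */
        } else if (++retries > MAX_WRITE_RETRIES) {
            (void)close(fd);
            return (-1);                 /* too many retries, avoid an infinite loop */
        }
    }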

src/resmom/mom_comm.c: outdated review comment (resolved)
@@ -650,7 +651,7 @@ extern void dep_cleanup(void);
/* External Functions */

extern void catch_child(int);
extern void init_abort_jobs(int);
extern void init_abort_jobs(int, pbs_list_head*);
Contributor:

pbs_list_head* -> pbs_list_head *


# Make sure moms are running with -p flag
momA.restart(args=['-p'])
momB.restart(args=['-p'])
Contributor:

What happens if -p is not specified by at least one of the moms? Please add a test case for this.

self.server.expect(JOB, {'job_state': 'R'}, id=jid)

# only restart 1 node as restarting momA would result in a pbsdsh crash
momB.restart(args=['-p'])
Contributor:

Please add a test case where momB is prematurely killed, as in mom.signal("-KILL"), and then issue a mom.restart("-p"). Also add a case where the primary mom is killed prematurely and then restarted with -p. Is the behavior what you expected?

src/resmom/mom_comm.c: outdated review comments (resolved)
@zv0n (Contributor Author) commented Aug 24, 2020

@zv0n: Do you also have an accompanying design for this? It would be better to add a design under https://openpbs.atlassian.net/wiki/spaces/PD/pages outlining your approach, so that when pbs_mom is started with the -p option, multinode jobs can also be recovered. Then announce it on the community forum, https://community.openpbs.org/, for comments. People in the community can chime in and suggest improvements to the design...

@bayucan I don't have a design, sorry. I did this because we needed this functionality at my workplace, so I started coding before I looked at how merging to the project works. I'll go ahead and create one now.

src/resmom/mom_comm.c: outdated review comment (resolved)
@bayucan (Contributor) commented Aug 24, 2020

Thanks @zv0n for creating the design.

@subhasisb (Collaborator):

Hi @zv0n, I was looking at your proposal. I looked at your design page, but could not get to the associated design discussion page (it shows "oops, topic not found"), so I am adding my thoughts here:

I was trying to understand your primary use case for this change:
Are you trying to handle the case where only pbs_mom is restarted/upgraded, or a host-level restart? If it is a host-level restart or OS update that needs a reboot, a pbsdsh or MPI job will typically die anyway, so allowing mom to restart with -p to preserve jobs is not useful there.
PBS supports ways to make node maintenance easy, by allowing nodes to be turned offline to drain jobs off them, or by enabling admin-suspend; see maintenance_jobs in the guide: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf

Also, there is a thread about new maintenance windows enhancement:
https://openpbs.atlassian.net/wiki/spaces/PD/pages/879493121/Node+maintenance+window+enhancement

Please look through these and see whether they already satisfy your use cases.

I do realize that your effort is to allow restarting mom while preserving multinode jobs. The part I am trying to understand is: is it really necessary, when there are already ways to take a node down for maintenance in an orderly fashion?

@vchlum (Contributor) commented Aug 25, 2020

Hi @subhasisb, my colleague @zv0n is working on this because we would appreciate this feature/fix for the upgrade process, and I think it could help others too.

Of course, this cannot cover the host-level restart case; for that, maintenance windows are suitable.

Our current solution: when we need to update the PBS service on nodes (to apply some patch or bug fix), we often do a "live update". Right now, we have the following patch in use: after upgrading the PBS package on the system, a signal is sent to mom. Mom then waits until no more multi-node jobs are running on the node, and once no multi-node job is present, the self-restart is engaged (with -p). This obviously has a huge drawback: the restart may never happen, because a new multi-node job could start before the last one is finished. Not ideal.

I think it would be good and useful for everyone to be able to update (minor versions of?) PBS without setting up any maintenance. We could just install a new version of the PBS package, the post-install could initiate the restart, and that's it. This "live update" would be much simpler for admins. It would also increase node utilization, because you do not need to drain nodes before updating, so no resources are wasted during the PBS package update process.

@subhasisb (Collaborator):

Our current solution: when we need to update the PBS service on nodes (to apply some patch or bug fix), we often do a "live update". Right now, we have the following patch in use: after upgrading the PBS package on the system, a signal is sent to mom. Mom then waits until no more multi-node jobs are running on the node, and once no multi-node job is present, the self-restart is engaged (with -p). This obviously has a huge drawback: the restart may never happen, because a new multi-node job could start before the last one is finished. Not ideal.

Actually, if you set the node to "offline", no new jobs will be sent to that mom, so eventually the node will be free. I do understand that even this could take a good amount of time, depending on how long the currently running job takes to complete.

I think it would be good and useful for everyone to be able to update (minor versions of?) PBS without setting up any maintenance. We could just install a new version of the PBS package, the post-install could initiate the restart, and that's it. This "live update" would be much simpler for admins. It would also increase node utilization, because you do not need to drain nodes before updating, so no resources are wasted during the PBS package update process.

Yes, this of course makes sense. I was just wondering whether many sites really need this, but I understand if you do.

@bhroam (Contributor) commented Aug 25, 2020

@vchlum another one of our customers handles rolling updates with provisioning. They even handle OS updates this way. They set the default aoe of the new jobs to the new node image. Eventually a new job will run with the new aoe and that will trigger the node to be provisioned. This will work better for them than for you because they allocate their nodes exclusively. In your case, the scheduler will only run jobs with shared aoe values on the same nodes. Eventually all of the older aoe jobs will be finished and a new aoe will run. The node will eventually provision.

@vchlum (Contributor) commented Aug 27, 2020

Thanks @bhroam for sharing that solution. A solution like this is not suitable for our site: we have heterogeneous nodes and heterogeneous jobs, and we allow walltimes of up to one month.

Since the -p parameter already works for single-node jobs, I also think this is a logical improvement.

@zv0n (Contributor Author) commented Aug 31, 2020

My forum post got approved, so we can move future discussion there if need be - https://community.openpbs.org/t/option-to-resume-multinode-jobs/2244/2

@bayucan (Contributor) left a comment

Since obit saving has been removed, you also need to take out the function definition for ensure_write(); it's now unused.

@zv0n (Contributor Author) commented Sep 1, 2020

Forgot about that, sorry; it's removed now.

@hirenvadalia (Contributor) left a comment

@zv0n I don't have any further comments, but you still have a few open questions/comments from @bayucan. You also need to update the design page as suggested by @bayucan.

@zv0n (Contributor Author) commented Sep 14, 2020

Hi, sorry for the long delay, I was on vacation.
@bayucan I think I've fixed all the problems you found with the code and added the tests you requested. I've removed the test that restarted during pbsdsh, since we've decided to remove obit saving and that test would either fail or enter an infinite loop.

@bayucan (Contributor) commented Sep 15, 2020

@zv0n: It looks good now. The Travis check seems to be stuck; please rebase your branch onto the latest master.

@zv0n force-pushed the master branch 2 times, most recently from d384898 to 9c6696b (September 17, 2020 07:48)
@zv0n (Contributor Author) commented Sep 17, 2020

@bayucan I've rebased to master, hopefully Travis check will be happy and cooperative this time around :D


int stream = np->hn_stream;
im_compose(stream, pjob->ji_qs.ji_jobid,
pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str,
Contributor:

@zv0n sorry, but one very small change (things have now changed in mainline in how we access job attributes, using getters/setters):
Can you please replace this line with get_jattr_str(pjob, JOB_ATR_Cookie)?
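
(That is, roughly, as a sketch of the requested change:)

    /* before: direct attribute access */
    pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str
    /* after: the getter suggested above */
    get_jattr_str(pjob, JOB_ATR_Cookie)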

@hirenvadalia (Contributor) left a comment

Thanks @zv0n. LGTM.

@zv0n requested a review from bayucan, September 17, 2020 11:40
@bayucan (Contributor) left a comment

Code looks good.

@hirenvadalia merged commit fc02d4c into openpbs:master on Sep 18, 2020
SudeshnaMoulik added a commit to SudeshnaMoulik/openpbs that referenced this pull request Sep 18, 2020