Add possibility of resuming multinode jobs #1955
Conversation
Force-pushed from 4a6daad to d48f11b
Hi,

Hi @zv0n, I think I have seen this problem with Travis in the past. What might work is to either close the PR and reopen it, or force-push your branch. In both cases Travis will run again, and I think it will get rid of that redundant job.

Thanks @arungrover, seems like force pushing fixed it!

Can someone take a look at this change, please?
@zv0n : Do you also have an accompanying design for this? It would be better to add a design under https://openpbs.atlassian.net/wiki/spaces/PD/pages outlining your approach: when pbs_mom is started with the -p option, multinode jobs can also be recovered. Then announce it on the community forum: https://community.openpbs.org/ for comments. People in the community can chime in and suggest improvements to the design...
@@ -501,6 +501,8 @@ struct job {
 	int ji_stderr;			/* socket for stderr */
 	int ji_ports[2];		/* ports for stdout/err */
 	enum bg_hook_request ji_hook_running_bg_on; /* set when hook starts in the background */
+	int ji_msconnected;		/* 0 - not connected, 1 - connected */
+	pbs_list_head ji_multinodejobs;	/* links to recovered multinode jobs */
Please initialize ji_msconnected in server/job_func.c:job_alloc() under the #ifdef PBS_MOM section.
Also initialize ji_multinodejobs in server/job_func.c:job_alloc() by calling CLEAR_HEAD(ji_multinodejobs),
and free any memory associated with ji_multinodejobs in server/job_func.c:job_free() under the #ifdef PBS_MOM section.
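A sketch of what that could look like (assuming the existing OpenPBS list macros; this is an illustration of the request, not the actual patch):

```c
/* server/job_func.c — sketch only */
#ifdef PBS_MOM
	/* job_alloc(): start disconnected, with an empty list head */
	pj->ji_msconnected = 0;
	CLEAR_HEAD(pj->ji_multinodejobs);
#endif

	/* job_free(): ji_multinodejobs only links jobs owned elsewhere,
	 * so clearing the head (no deep free) should be sufficient */
#ifdef PBS_MOM
	CLEAR_HEAD(pj->ji_multinodejobs);
#endif
```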
I believe I'm already initializing these variables in job_func.c:355,356, should I do something more than that?
Since ji_multinodejobs only contains pointers to jobs that are already managed, will CLEAR_HEAD(ji_multinodejobs) suffice in job_free()?
I missed that you were already initializing them in job_func.c under job_alloc() so that's good. For the job_free(), yes, just do the CLEAR_HEAD.
@@ -612,6 +614,8 @@ struct job {
 #ifdef PBS_MOM
 	tm_host_id ji_nodeidx;	/* my node id */
 	tm_task_id ji_taskidx;	/* generate task id's for job */
+	int ji_stdout;
+	int ji_stderr;
Initialize these in server/job_func.c:job_alloc() under #ifdef PBS_MOM section.
src/include/mom_func.h
Outdated
@@ -376,7 +376,7 @@ extern pid_t fork_me(int sock);
 extern ssize_t readpipe(int pfd, void *vptr, size_t nbytes);
 extern ssize_t writepipe(int pfd, void *vptr, size_t nbytes);
 extern int get_la(double *);
-extern void init_abort_jobs(int);
+extern void init_abort_jobs(int, pbs_list_head*);
pbs_list_head* -> pbs_list_head *
src/resmom/catch_child.c
Outdated
@@ -2060,6 +2068,14 @@ init_abort_jobs(int recover)

 	if (mom_do_poll(pj))
 		append_link(&mom_polljobs, &pj->ji_jobque, pj);

+	append_link(multinode_jobs, &pj->ji_multinodejobs, pj);
Should we first check that the job being added is actually multinode?
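For instance (a sketch; using ji_numnodes from the mom's job struct as a hypothetical test for a job spanning more than one node):

```c
/* sketch: only link jobs that actually span multiple nodes */
if (pj->ji_numnodes > 1)
	append_link(multinode_jobs, &pj->ji_multinodejobs, pj);
```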
src/resmom/mom_comm.c
Outdated
 		(void)close(fd);
 		return (-1);
 	}
 	continue;
Should we put a limit on the number of write retries, so we don't get into an infinite loop?
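One way to bound the retries is a small helper along these lines (a sketch, not part of the PR; write_bounded and MAX_WRITE_RETRIES are hypothetical names):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_WRITE_RETRIES 5	/* hypothetical retry budget */

/* Write nbytes fully, retrying transient failures (EINTR/EAGAIN) only
 * a bounded number of times instead of looping forever.
 * Returns 0 on success, -1 on hard error or exhausted retries. */
static int
write_bounded(int fd, const void *buf, size_t nbytes)
{
	const char *p = buf;
	int retries = 0;

	while (nbytes > 0) {
		ssize_t n = write(fd, p, nbytes);
		if (n > 0) {
			p += n;
			nbytes -= (size_t)n;
		} else if (n < 0 && (errno == EINTR || errno == EAGAIN)) {
			if (++retries > MAX_WRITE_RETRIES)
				return -1;	/* give up rather than spin */
		} else {
			return -1;	/* hard error (or write returned 0) */
		}
	}
	return 0;
}
```

The caller can then treat -1 as a lost connection and clean up, instead of blocking the mom indefinitely.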
src/resmom/mom_main.c
Outdated
@@ -650,7 +651,7 @@ extern void dep_cleanup(void);
 /* External Functions */

 extern void catch_child(int);
-extern void init_abort_jobs(int);
+extern void init_abort_jobs(int, pbs_list_head*);
pbs_list_head* -> pbs_list_head *
 # Make sure moms are running with -p flag
 momA.restart(args=['-p'])
 momB.restart(args=['-p'])
What happens if -p is not specified by at least one of the moms? Please add a test case for this.
 self.server.expect(JOB, {'job_state': 'R'}, id=jid)

 # only restart 1 node as restarting momA would result in a pbsdsh crash
 momB.restart(args=['-p'])
Please add a test case where momB is prematurely killed, as in mom.signal("-KILL"), and then issue a mom.restart("-p"). Also add a case where the primary mom is killed prematurely and then restarted with -p. Is the behavior as you expected?
@bayucan I don't have a design, sorry. I did this because we needed this functionality at my workplace, so I started coding before I looked at how merging to the project works. I'll go ahead and create one now.

Thanks @zv0n for creating the design.
hi @zv0n, I was looking at your proposal. I looked at your design page, but could not get to the associated design discussion page - it shows "oops topic not found". Hence adding my thoughts here.

I was trying to understand the primary use case behind this change. Also, there is a thread about the new maintenance windows enhancement. Please look through these and see whether they already satisfy your use cases.

I do realize that your effort is to allow restarting mom while preserving multinode jobs. The part I am trying to understand is: is it really necessary, when there are ways to take a node down for maintenance in an orderly fashion anyway?
Hi @subhasisb, my colleague @zv0n works on this because we would appreciate this feature/fix for the upgrade process, and I think it could help others too. Of course, this cannot cover the host-level restart need; in such a case, the maintenance windows are suitable.

Our current solution: when we need to update the PBS service on nodes (to apply a patch or bug fix), we often do a "live update". Right now we have the following patch in use: after upgrading the PBS package on the system, a signal is sent to mom. The mom waits until no more multi-node jobs are running on the node. Once no multi-node job is present, a self restart is engaged (with -p). This obviously has a huge drawback - the restart may never happen, because a new multi-node job could start before the last one finishes. Not ideal.

I think it would be good and useful for all to be able to update (minor? versions of) PBS without setting any maintenance. We could just install a new version of the PBS package, the post-install could initiate the restart, and that's it. This "live update" would be much simpler for admins. It would also increase node utilization, because you would not need to drain nodes before updating, so no resources are wasted during the PBS package update process.
Actually, if you set the node to "offline" no new jobs will be sent to the mom. So eventually, the node will be free. I do understand that even that could take a good amount of time depending on how long the currently running job takes to complete.
Yes, this of course makes sense. I was just wondering whether lots of sites really need this. But I understand if you do.
@vchlum another one of our customers handles rolling updates with provisioning. They even handle OS updates this way. They set the default aoe of the new jobs to the new node image. Eventually a new job will run with the new aoe and that will trigger the node to be provisioned. This will work better for them than for you because they allocate their nodes exclusively. In your case, the scheduler will only run jobs with shared aoe values on the same nodes. Eventually all of the older aoe jobs will be finished and a new aoe will run. The node will eventually provision. |
Thanks @bhroam for sharing the solution. A solution such as this is not suitable for our site: we have heterogeneous nodes and heterogeneous jobs, and we allow up to one month of walltime. Since the -p parameter already works for single-node jobs, I also think this is a logical improvement.
My forum post got approved, so we can move future discussion there if need be - https://community.openpbs.org/t/option-to-resume-multinode-jobs/2244/2
Since obit saving has been removed, need to also take out the function definition for ensure_write(). It's now unused...
Forgot about that, sorry, it's removed now
Hi, sorry for the long delay, I was on vacation.
@zv0n : It looks good now. Travis check seemed to be stuck. Please rebase your branch to the latest master. |
Force-pushed from d384898 to 9c6696b
@bayucan I've rebased to master, hopefully Travis check will be happy and cooperative this time around :D
src/resmom/start_exec.c
Outdated
 	int stream = np->hn_stream;
 	im_compose(stream, pjob->ji_qs.ji_jobid,
 		pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str,
@zv0n sorry, but one very small change (things have changed in mainline about how we access job attributes, using getters/setters): can you please replace this line with get_jattr_str(pjob, JOB_ATR_Cookie)?
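For reference, the suggested change would read roughly like this (a sketch based on the comment above; the trailing arguments to im_compose are elided):

```c
im_compose(stream, pjob->ji_qs.ji_jobid,
	get_jattr_str(pjob, JOB_ATR_Cookie),
	/* ... remaining arguments unchanged ... */);
```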
Thanks @zv0n. LGTM.
Code looks good.
This reverts commit fc02d4c.
Describe Bug or Feature

I have added the possibility of resuming jobs that are running on multiple nodes. It is already possible to resume a job running on a single node after mom restarts, so allowing the same for multinode jobs seems like a good idea. This is done when the -p flag is passed on the command line.

Possible problems with the implementation:
- When running pbsdsh while one of the nodes is down, it enters an infinite loop as it doesn't check for node availability
- If mother superior is restarted while pbsdsh is running, it will exit with an error as it loses the connection with mother superior
- Running pbsdsh right after a node restart may result in an error on task spawn, because the task can be sent to the node before the node rejoins the job (and receives the ports through which it should communicate)

Describe Your Change
- ji_ports is saved, so the mom can communicate with the demux program after recovering the job.
- IM_RECONNECT_TO_MS is sent to mother superior; MS then sends IM_JOIN_RECOV_JOB to all sisters along with the port numbers through which they should communicate
- IM_JOIN_RECOV_JOB is sent along with the port numbers to all sisters
- scan_for_exiting is updated, so when sisters send an obit they can be sure that it is delivered

Attach Test and Valgrind Logs/Output
test_multinode_restart.txt