Orted prolog and epilog hooks #35

Closed
ompiteam opened this issue Oct 1, 2014 · 7 comments

@ompiteam

ompiteam commented Oct 1, 2014

Terry and I were talking about the possibility of having per-job prolog and epilog steps in the orted. That is, an MCA parameter that identifies an argv to run before the first local proc of a job is launched on the node and after the last local proc of a job has completed. A typical argv would be a local script (perhaps to perform some site-specific administrative tasks). If the argv for the prolog/epilog is blank (which would be the default), then nothing would be launched for these steps. Hence, these would be hooks available to sysadmins if they want to use them.

I'm guessing/assuming that this would not be difficult to do -- it's mainly a matter of:

  • Finding the right place in the orted to run the prolog and epilog
  • Deciding what information to give to the prolog and epilog (e.g., passing a pile of relevant info in environment variables, such as the job ID, the session directory, the argv of the job, the exit conditions of the job, etc. -- anything that the prolog and epilog might want to know. Just about every resource manager has prolog/epilog functionality -- we might look to them for inspiration on what kind of information could be useful).
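
To make the two points above concrete, here is a minimal sketch in Python (not the orted's C code) of what per-job hooks could look like. Everything named here is an assumption for illustration: the `run_hook` helper, the `LocalJobHooks` class, and the `EXAMPLE_HOOK_` environment variable prefix are hypothetical, since the issue does not define an MCA parameter name or a set of variables. The sketch also assumes the daemon knows how many local procs the job has on the node.

```python
import os
import subprocess
import sys


def run_hook(hook_argv, job_info):
    """Run a prolog/epilog argv; a blank argv (the default) launches nothing."""
    if not hook_argv:
        return None
    env = dict(os.environ)
    # Hypothetical variable names: pass job details to the hook via the environment.
    for key, value in job_info.items():
        env["EXAMPLE_HOOK_" + key.upper()] = str(value)
    return subprocess.run(hook_argv, env=env, check=False).returncode


class LocalJobHooks:
    """Fires the prolog before the first local proc and the epilog after the last."""

    def __init__(self, num_local_procs, prolog_argv, epilog_argv, job_info):
        self.num_local_procs = num_local_procs
        self.prolog_argv = prolog_argv
        self.epilog_argv = epilog_argv
        self.job_info = dict(job_info)
        self.launched = 0
        self.completed = 0
        self.prolog_rc = None
        self.epilog_rc = None

    def before_launch(self):
        # Call once per local proc, just before it is launched.
        if self.launched == 0:
            self.prolog_rc = run_hook(self.prolog_argv, self.job_info)
        self.launched += 1

    def after_exit(self, exit_code):
        # Call once per local proc, after it has completed.
        self.completed += 1
        if self.completed == self.num_local_procs:
            info = dict(self.job_info, last_exit_code=exit_code)
            self.epilog_rc = run_hook(self.epilog_argv, info)


if __name__ == "__main__":
    show = [sys.executable, "-c",
            "import os; print(os.environ['EXAMPLE_HOOK_JOB_ID'])"]
    hooks = LocalJobHooks(1, show, [], {"job_id": 1269})
    hooks.before_launch()  # prolog runs and prints the job id
    hooks.after_exit(0)    # blank epilog: nothing is launched
```

Counting completions against the expected number of local procs, rather than checking for "zero running", avoids firing the epilog early when one proc exits before the next has been launched.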

It ''might'' also be useful to have the same prolog/epilog hooks for each process in a job on the host. [shrug]

I'm initially marking this as a 1.3 milestone, but have no real requirement for it in v1.3 -- it seems like an easy / neat / useful idea, but there is no ''need'' to have it in v1.3. It could be pushed forward.

@ompiteam ompiteam added this to the Future milestone Oct 1, 2014
@ompiteam

ompiteam commented Oct 1, 2014

Imported from trac issue 1269. Created by jsquyres on 2008-04-11T11:23:25, last modified: 2011-01-11T07:45:51

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-06-23 13:32:57:

Yo Ralph -- I'm assuming there are no plans for this kind of feature in v1.3 (Terry and I were talking "pie in the sky" kinds of ideas when we came up with this one). Should we shift it to "Future"?

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by rhc on 2008-06-23 13:58:28:

As noted, it would be easy to implement, so I guess I don't care - could throw it into 1.3 or not.

Kinda up to you guys as to how badly you want it.

Ralph

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by tdd on 2008-06-24 10:25:59:

This feature is not absolutely necessary for 1.3 but I would like it in 1.3.1. I've discussed this with the RMs (Brad and George) and they are fine with this feature being added to 1.3.1.

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by rhc on 2010-01-27 22:29:02:

Damien has a somewhat related issue - what he needs is basically a "spawn agent" similar to our "launch agent". If provided, this would be a cmd that executes each app when spawned.

In other words, you take the argv that is going to be fork/exec'd and prepend the spawn agent to it. Thus, the spawn agent is what actually executes the app.
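
As a rough illustration of the argv prepending described above, here is a short Python sketch. The function name `apply_spawn_agent` and the idea of passing the agent as a single command string are assumptions for the example, not an existing Open MPI interface.

```python
import shlex


def apply_spawn_agent(app_argv, spawn_agent=""):
    """Return the argv to exec: the spawn agent's words followed by the app's argv.

    An empty spawn agent leaves the app argv unchanged.
    """
    agent_argv = shlex.split(spawn_agent)  # split the agent command like a shell would
    return agent_argv + list(app_argv)


if __name__ == "__main__":
    # With an agent set, the agent becomes argv[0] and receives the app as its arguments.
    print(apply_spawn_agent(["./a.out", "-n", "4"], "valgrind --leak-check=full"))
```

Because the agent ends up as `argv[0]`, it is the program that is actually exec'd, and it is responsible for launching the app it was handed.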

I'm not sure if Damien is doing this work or not - perhaps he could confirm? Otherwise, I'll implement it over the next week or two (actually rather trivial to do).

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2010-02-02 18:27:17:

Damien was having problems posting; he mailed his reply to me directly (see below). Damien: note that you can sign up for an account on our Trac and therefore be able to comment on tickets directly (our Trac does not currently accept emails as input).


I have received a query from the Openmpi tracker. The ticket https://svn.open-mpi.org/trac/ompi/ticket/1269 is an
enhancement to a "spawn agent".

This feature is not an issue for me.

My problem is to have an "orted local to mpirun"; this is the only
solution that fixes these problems:

  • have no launch difference between the mpirun node and other nodes
  • have no ulimit difference between the mpirun node and other nodes
  • have no cpuset difference between the mpirun node and other nodes
  • have no bash difference between the mpirun node and other nodes
  • have no environment difference between the mpirun node and other nodes

Please refer to Ralph for the history, and sorry for the confusion.

@ompiteam

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2011-01-11 07:45:51:

I think it's pretty safe to say that this won't happen any time soon unless someone can free up some cycles to implement it.

rhc54 pushed a commit that referenced this issue Oct 31, 2014
…-v1.8

OSHMEM: spml ikrit: complete puts b4 memheap destruction
yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015
lrrajesh pushed a commit to lrrajesh/ompi that referenced this issue Mar 19, 2015
The IPMI plugin tries to read the bmc credentials in an endless loop in
case it fails to read for some reason, and no other compute node tries
to send the bmc credential data. Fixes open-mpi#35

The nodepower plugin uses the ipmi_cmdraw command to retrieve data from
the bmc. It needs to pass the length of the response buffer to the API
so that the library knows how much memory is allocated to it and can
pack the data accordingly. The current implementation does pass the
response length, but it doesn't initialize the length, which could lead to
non-deterministic results. Initialized the length to 1024 to reflect the
size of the data buffer. Fixes open-mpi#33

If the ipmi_cmdraw call in the nodepower plugin fails for some reason,
then the failure is currently being ignored and the responseData is
copied out and passed to the application. Added the code to handle a
return failure of the ipmi_cmdraw call.
lrrajesh pushed a commit to lrrajesh/ompi that referenced this issue Mar 19, 2015
When get_bmc_cred fails in an aggregator, and no other compute node
sends the ipmi credential data, the aggregator tries to read its
bmc credential in a loop. Implemented a timeout as well as printing out
a debug message indicating the issue. Refs open-mpi#35
@rhc54 rhc54 closed this as completed Jan 27, 2017
devreal added a commit to devreal/ompi that referenced this issue Sep 15, 2020
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>