Now that we have an "isolated" PLM component, we cannot just let rsh … #936

rhc54 · 2015-09-24T14:20:01Z

…silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.

This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
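As a rough illustration of the behavior described above, here is a minimal sketch of a selection loop that honors a fatal return from a component query. It is not the actual `mca_component_select` code: the `sk_` names, the struct layout, the numeric return values, and the sample query functions are all assumptions made for this example.

```c
#include <stddef.h>

/* Stand-in return codes; values are illustrative, not Open MPI's. */
enum { SK_SUCCESS = 0, SK_ERR_NOT_AVAILABLE = -1, SK_ERR_FATAL = -2 };

/* Hypothetical component: a name plus a query callback that reports
 * a priority when the component is able to run. */
typedef struct {
    const char *name;
    int (*query)(int *priority);
} sk_component_t;

/* Pick the highest-priority component whose query succeeds.
 * A query returning SK_ERR_FATAL aborts selection immediately, even
 * if another component already said it could run. Any other
 * non-success code only disqualifies that one component. */
static int sk_select(const sk_component_t *comps, size_t n,
                     const sk_component_t **best)
{
    int best_pri = -1;
    *best = NULL;
    for (size_t i = 0; i < n; ++i) {
        int pri = 0;
        int rc = comps[i].query(&pri);
        if (SK_ERR_FATAL == rc) {
            *best = NULL;          /* discard any earlier winner */
            return SK_ERR_FATAL;
        }
        if (SK_SUCCESS != rc) {
            continue;              /* skip, keep looking */
        }
        if (pri > best_pri) {
            best_pri = pri;
            *best = &comps[i];
        }
    }
    return (NULL != *best) ? SK_SUCCESS : SK_ERR_NOT_AVAILABLE;
}

/* Sample queries, hypothetical: an always-available local launcher,
 * an rsh launcher that found its agent, one that did not, and a
 * component that simply is not available. */
static int q_isolated(int *p)     { *p = 0;  return SK_SUCCESS; }
static int q_rsh_ok(int *p)       { *p = 10; return SK_SUCCESS; }
static int q_rsh_no_agent(int *p) { (void)p; return SK_ERR_FATAL; }
static int q_unavailable(int *p)  { (void)p; return SK_ERR_NOT_AVAILABLE; }
```

With `q_isolated` and `q_rsh_no_agent` in the list, `sk_select` returns `SK_ERR_FATAL` rather than quietly falling back to the local-only component, which is the failure mode the PR is meant to prevent.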

Also do a little cleanup to avoid bombarding the user with multiple error messages.

Thanks to Patrick Begou for reporting the problem.

rhc54 · 2015-09-24T15:49:49Z

@hjelmn Do you see any issues with the change to mca_component_select?

hjelmn · 2015-09-24T15:54:13Z

Would it be better to fail if the return code is anything other than OPAL_SUCCESS or OPAL_ERR_NOT_AVAILABLE? If we make failure based on only OPAL_ERR_FATAL we need to verify that all component query routines conform to that usage on a fatal error. I am ok with that.

rhc54 · 2015-09-24T16:24:41Z

My concern was that "not_available" isn't a fatal error - it just means that the component isn't available. I used "fatal" as a way to indicate that not only am I not to be used, but I've seen an error that mandates we fail, even if other components say they can run. Seemed a more specific use-case. I don't know if other components that use this function have experienced this situation, or else I would think this would have come up before?

hjelmn · 2015-09-24T16:55:06Z

I was thinking SUCCESS = use it, NOT_AVAILABLE = skip, anything else = fail. Certainly there are codes that could mean the same as fatal (OPAL_ERR_OUT_OF_RESOURCE might be considered a fatal error). It is worth spot checking some of the query functions to see what range of error codes are in use.

hjelmn · 2015-09-24T16:55:50Z

BTW, I have no problem with using OPAL_ERR_FATAL. Just want to make sure we don't have to change too many components since there are a lot of them.

rhc54 · 2015-09-24T20:15:05Z

@hjelmn Well, a full-scale grep scan shows that people report:
OUT_OF_RESOURCE
NOT_SUPPORTED
NOT_AVAILABLE
ERROR
TAKE_NEXT_OPTION
FATAL

So it looks like the only one that intentionally directs us to exit is FATAL. Looking closer, OUT_OF_RESOURCE isn't necessarily used to indicate a memory issue - it can be used to indicate we don't have any more relevant networking ports, for example. This should probably disqualify that provider, but shouldn't terminate the job so long as somebody else can run.

I think I'll just stick with FATAL for now - we can adjust if someone comes up with another one.
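To make the two policies discussed in this thread concrete, the sketch below classifies the return codes found by the grep under each one. The `RC_` enum is a stand-in whose names mirror the thread and whose numeric values are arbitrary; the function names are hypothetical and none of this is Open MPI code.

```c
#include <stdbool.h>

/* Stand-ins for the codes reported by component query functions. */
typedef enum {
    RC_SUCCESS,
    RC_OUT_OF_RESOURCE,
    RC_NOT_SUPPORTED,
    RC_NOT_AVAILABLE,
    RC_ERROR,
    RC_TAKE_NEXT_OPTION,
    RC_FATAL
} rc_t;

/* Policy kept for now: only FATAL terminates selection; every other
 * non-success code just disqualifies the component that returned it. */
static bool aborts_fatal_only(rc_t rc)
{
    return RC_FATAL == rc;
}

/* Alternative raised in review: SUCCESS means use it, NOT_AVAILABLE
 * means skip it, and anything else fails selection. */
static bool aborts_strict(rc_t rc)
{
    return RC_SUCCESS != rc && RC_NOT_AVAILABLE != rc;
}
```

Under the stricter rule, a component that returns `RC_OUT_OF_RESOURCE` because it ran out of networking ports would terminate the job; under the FATAL-only rule it is merely skipped so long as another component can run.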

rhc54 pushed a commit that referenced this pull request Sep 24, 2015

Merge pull request #936 from rhc54/topic/rsh

7d3321b

Now that we have an "isolated" PLM component, we cannot just let rsh …

rhc54 merged commit 7d3321b into open-mpi:master Sep 24, 2015

rhc54 deleted the topic/rsh branch November 3, 2015 15:49

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016

Merge pull request open-mpi#936 from jsquyres/pr/v1.10/remove-f08-bind-c

5af848c

v1.10: f08: do not BIND(C) to subroutines with LOGICAL parameters
