@rhc54 rhc54 commented Sep 24, 2015

…silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.

This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
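As an illustration only, the component-side behavior described above might look like the sketch below. The names `sketch_rsh_query` and `SKETCH_*` are hypothetical stand-ins, not the actual rsh component code or the real `OPAL_*` values, and the "decline quietly when no agent was explicitly requested" branch is an assumption about the remaining case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder return codes. The real OPAL_* constants are defined in
 * Open MPI's headers; these values are arbitrary. */
enum sketch_rc {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_NOT_AVAILABLE = -1,
    SKETCH_ERR_FATAL = -2
};

/* Hypothetical query routine for an rsh-like launcher component.
 *   user_agent  - launch agent explicitly requested by the user, or NULL
 *   agent_found - whether a usable launch agent was located */
static int sketch_rsh_query(const char *user_agent, bool agent_found)
{
    if (agent_found) {
        return SKETCH_SUCCESS;
    }
    if (NULL != user_agent) {
        /* The user asked for a specific agent and it is missing:
         * print one message and tell the selection logic to abort. */
        fprintf(stderr, "launch agent \"%s\" was requested but not found\n",
                user_agent);
        return SKETCH_ERR_FATAL;
    }
    /* Assumption: with no explicit request, decline quietly so that
     * another component can still be selected. */
    return SKETCH_ERR_NOT_AVAILABLE;
}
```

The point of the distinct fatal code is that the caller can tell "skip me" apart from "stop the whole selection".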

Also do a little cleanup to avoid bombarding the user with multiple error messages.

Thanks to Patrick Begou for reporting the problem.

rhc54 commented Sep 24, 2015

@hjelmn Do you see any issues with the change to mca_component_select?

hjelmn commented Sep 24, 2015

Would it be better to fail if the return code is anything other than OPAL_SUCCESS or OPAL_ERR_NOT_AVAILABLE? If we make failure based on only OPAL_ERR_FATAL we need to verify that all component query routines conform to that usage on a fatal error. I am ok with that.

rhc54 commented Sep 24, 2015

My concern was that "not_available" isn't a fatal error - it just means that the component isn't available. I used "fatal" as a way to indicate that not only am I not to be used, but I've seen an error that mandates we fail, even if other components say they can run. Seemed a more specific use-case. I don't know if other components that use this function have experienced this situation, or else I would think this would have come up before?

hjelmn commented Sep 24, 2015

I was thinking SUCCESS = use it, NOT_AVAILABLE = skip, anything else = fail. Certainly there are codes that could mean the same as fatal (OPAL_ERR_OUT_OF_RESOURCE might be considered a fatal error). It is worth spot checking some of the query functions to see what range of error codes are in use.
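To make the two proposals in this thread concrete, here is a rough sketch of how each would classify a query return code. All identifiers (`sketch_strict_policy`, `sketch_fatal_only_policy`, the `SKETCH_*` codes) are hypothetical and the numeric values are placeholders, not the real `OPAL_*` definitions.

```c
/* Placeholder codes standing in for the OPAL_* constants named in the
 * discussion. The values are arbitrary. */
enum sketch_rc {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_NOT_AVAILABLE = -1,
    SKETCH_ERR_OUT_OF_RESOURCE = -2,
    SKETCH_ERR_FATAL = -3
};

/* What the selection logic does with one component's answer. */
enum sketch_action { SKETCH_USE, SKETCH_SKIP, SKETCH_ABORT };

/* Strict proposal: SUCCESS = use it, NOT_AVAILABLE = skip,
 * anything else = fail. */
static enum sketch_action sketch_strict_policy(int rc)
{
    if (SKETCH_SUCCESS == rc) {
        return SKETCH_USE;
    }
    if (SKETCH_ERR_NOT_AVAILABLE == rc) {
        return SKETCH_SKIP;
    }
    return SKETCH_ABORT;
}

/* Fatal-only proposal: only the explicit fatal code aborts; every
 * other non-success code just disqualifies that component. */
static enum sketch_action sketch_fatal_only_policy(int rc)
{
    if (SKETCH_SUCCESS == rc) {
        return SKETCH_USE;
    }
    if (SKETCH_ERR_FATAL == rc) {
        return SKETCH_ABORT;
    }
    return SKETCH_SKIP;
}
```

The two functions differ only for codes such as the out-of-resource one: the strict variant aborts on it, while the fatal-only variant skips the component and keeps going.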

hjelmn commented Sep 24, 2015

BTW, I have no problem with using OPAL_ERR_FATAL. Just want to make sure we don't have to change too many components since there are a lot of them.

rhc54 commented Sep 24, 2015

@hjelmn Well, a full-scale grep scan shows that people report:
OUT_OF_RESOURCE
NOT_SUPPORTED
NOT_AVAILABLE
ERROR
TAKE_NEXT_OPTION
FATAL

So it looks like the only one that intentionally directs us to exit is FATAL. Looking closer, OUT_OF_RESOURCE isn't necessarily used to indicate a memory issue - it can be used to indicate we don't have any more relevant networking ports, for example. This should probably disqualify that provider, but shouldn't terminate the job so long as somebody else can run.

I think I'll just stick with FATAL for now - we can adjust if someone comes up with another one.
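For readers following along, a minimal sketch of a selection loop that follows this decision is shown below. It is not the actual `mca_component_select` code: `sketch_select`, `sketch_component_t`, the example query functions, and the `SKETCH_*` values are all hypothetical, and picking the highest priority among successful components is an assumption made for the example.

```c
#include <stddef.h>

/* Placeholder return codes; values are arbitrary, not the OPAL_* ones. */
enum sketch_rc {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_NOT_FOUND = -1,
    SKETCH_ERR_OUT_OF_RESOURCE = -2,
    SKETCH_ERR_FATAL = -3
};

/* Hypothetical component: a name plus a query callback that reports a
 * priority when the component is able to run. */
typedef struct {
    const char *name;
    int (*query)(int *priority);
} sketch_component_t;

/* Pick the highest-priority component whose query succeeds.
 * A fatal code from any component aborts the whole selection, even if
 * another component already reported success. Any other non-success
 * code only disqualifies the component that returned it. */
static int sketch_select(const sketch_component_t *comps, size_t count,
                         const sketch_component_t **best)
{
    int best_priority = 0;

    *best = NULL;
    for (size_t i = 0; i < count; ++i) {
        int priority = 0;
        int rc = comps[i].query(&priority);

        if (SKETCH_ERR_FATAL == rc) {
            *best = NULL;
            return rc;
        }
        if (SKETCH_SUCCESS != rc) {
            continue; /* e.g. out of ports: skip, let others try */
        }
        if (NULL == *best || priority > best_priority) {
            *best = &comps[i];
            best_priority = priority;
        }
    }
    return (NULL != *best) ? SKETCH_SUCCESS : SKETCH_ERR_NOT_FOUND;
}

/* Example query callbacks used to exercise the loop. */
static int sketch_query_low(int *priority)  { *priority = 10; return SKETCH_SUCCESS; }
static int sketch_query_high(int *priority) { *priority = 50; return SKETCH_SUCCESS; }
static int sketch_query_no_ports(int *priority) { (void)priority; return SKETCH_ERR_OUT_OF_RESOURCE; }
static int sketch_query_fatal(int *priority)    { (void)priority; return SKETCH_ERR_FATAL; }
```

With components `{low, no_ports, high}` this loop would select `high`; adding a component whose query returns the fatal code makes the whole call fail regardless of ordering.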

rhc54 pushed a commit that referenced this pull request Sep 24, 2015
Now that we have an "isolated" PLM component, we cannot just let rsh …
@rhc54 rhc54 merged commit 7d3321b into open-mpi:master Sep 24, 2015
@rhc54 rhc54 deleted the topic/rsh branch November 3, 2015 15:49
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
v1.10: f08: do not BIND(C) to subroutines with LOGICAL parameters