This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@francois-wellenreiter
Member

No description provided.

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1236/ for details.

@hppritcha
Member

@francois-wellenreiter please assign a reviewer. Also is this an enhancement or a bug fix?

@francois-wellenreiter
Member Author

@regrant could you please review this enhancement?

@jsquyres
Member

@francois-wellenreiter @regrant What version is this targeted for? Please see https://github.com/open-mpi/ompi/wiki/OmpiReleaseBotCommands for how to assign milestones, labels, and reviewers. Thanks!

@hppritcha hppritcha added this to the v2.x milestone Jan 22, 2016
@regrant
Contributor

regrant commented Feb 8, 2016

@francois-wellenreiter while there's no explicit use of the PTL_EVENT_SEND event, it's the only way that you can catch initiator side errors since send events are not counted (CT) either. This could lead to infinite polling on your CTWait call if you're waiting for something that will never complete.

@francois-wellenreiter
Member Author

@regrant, I am really surprised. My understanding of the portals4 specification is that PTL_MD_EVENT_CT_ACK makes it possible to count both successful AND unsuccessful PTL_EVENT_ACK events.
I agree that messages could be lost on the network, but in such a case the portals4 API has to be robust enough to detect it and generate an event with a ptl_ni_fail_t value other than PTL_NI_OK.
PtlCTWait must not poll infinitely; it must return a ptl_ct_event_t structure in which the failure field has been incremented instead of the success field.
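
A minimal sketch of the check this comment implies, assuming the counting event exposes a success count and a failure count as described above. It is not taken from the pull request: `mock_ct_event_t`, `ct_wait_status_t`, and `classify_ct_event` are hypothetical stand-ins so the snippet compiles without the Portals 4 headers, and the snapshot passed in is assumed to be whatever a wait call handed back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Portals 4 counting event: one counter for
 * successful operations and one for failed ones. Not the real header. */
typedef struct {
    uint64_t success;
    uint64_t failure;
} mock_ct_event_t;

/* Hypothetical result of inspecting a counting-event snapshot. */
typedef enum {
    CT_WAIT_DONE,    /* expected count reached, no failures recorded */
    CT_WAIT_FAILED,  /* at least one counted operation failed */
    CT_WAIT_PENDING  /* neither of the above: keep waiting */
} ct_wait_status_t;

/* Classify a snapshot. The failure field is checked first so that a
 * failed operation is never mistaken for one that is still in flight. */
static ct_wait_status_t classify_ct_event(const mock_ct_event_t *ev,
                                          uint64_t expected)
{
    assert(ev != NULL);
    if (ev->failure != 0) {
        return CT_WAIT_FAILED;
    }
    if (ev->success >= expected) {
        return CT_WAIT_DONE;
    }
    return CT_WAIT_PENDING;
}
```

In this sketch a caller would treat `CT_WAIT_FAILED` as an error path rather than waiting again, which is the behavior the comment argues the API should make possible.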

@regrant
Contributor

regrant commented Feb 22, 2016

@francois-wellenreiter I wasn't referring to how the spec says things should work; you're right, that should happen exactly as you describe. I was unclear about what I was describing in the OMPI implementation. The only place we call EQ_Get is in the progress callback. Some of the functions depend on the op_count completion that happens there, and so they rely on it for request completion. If you want to use CTs exclusively, you can in places like the passive target ops (they already use CTs exclusively). It is the active target operations that use event counts to signal completion (and call the progress_callback function), so that is where you need to know that there are no failures (you're not checking the ct.failures value). I misspoke in my earlier comment: it was not the CTWait but this event count waiting that could be perturbed. (Also, you should get an event back regardless of the disable flag if it is a failure.)

If this is changed, I expect that changes will also be needed in the places that rely on the progress_callback function. Have you tested these functions with the code changes, and is everything working?
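
The concern about event-count waiting can be illustrated with a rough sketch. It is not the OMPI code: every `mock_` name below is hypothetical, the event array stands in for whatever the progress callback would drain from the event queue, and it assumes for simplicity that each operation produces at most one failure event.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for event types and failure codes. */
typedef enum { MOCK_EVENT_SEND, MOCK_EVENT_ACK, MOCK_EVENT_REPLY } mock_event_type_t;
typedef enum { MOCK_NI_OK = 0, MOCK_NI_UNDELIVERABLE } mock_ni_fail_t;

typedef struct {
    mock_event_type_t type;
    mock_ni_fail_t ni_fail_type;
} mock_event_t;

/* Hypothetical per-window state: operations still in flight, and how
 * many of the retired ones failed. */
typedef struct {
    int64_t outstanding_ops;
    uint64_t failed_ops;
} mock_module_t;

/* Drain n events. A successful ACK or REPLY retires one operation; any
 * failed event also retires it, so a waiter on outstanding_ops cannot
 * spin forever on an operation that will never be acknowledged. */
static size_t mock_progress(mock_module_t *module,
                            const mock_event_t *events, size_t n)
{
    size_t retired = 0;
    assert(module != NULL);
    for (size_t i = 0; i < n; i++) {
        int failed = (events[i].ni_fail_type != MOCK_NI_OK);
        int completes = (events[i].type == MOCK_EVENT_ACK ||
                         events[i].type == MOCK_EVENT_REPLY);
        if (failed) {
            module->failed_ops++;
        }
        if (failed || completes) {
            module->outstanding_ops--;
            retired++;
        }
    }
    return retired;
}
```

In this sketch, code waiting for `outstanding_ops` to reach zero would also need to inspect `failed_ops` before reporting the request as complete.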

@regrant
Contributor

regrant commented Apr 7, 2016

This checks out and should be merged. @hppritcha there should be no danger in including this in 2.0 if possible.

@hppritcha
Member

Given that this is highly portals specific and unlikely to require additional bug fixes prior to the 2.0.0 release, I'm okay with this. @jsquyres let's discuss on Monday.

@jsquyres jsquyres modified the milestones: v2.0.0, v2.x Apr 11, 2016
@jsquyres
Member

I'm assuming @regrant meant: 👍

@jsquyres jsquyres merged commit 83bc1c5 into open-mpi:v2.x Apr 11, 2016
@francois-wellenreiter francois-wellenreiter deleted the osc_disable_portals4_evt_send branch April 11, 2016 14:42

6 participants