usnic: use fi_getname in newer libfabric #595
Refer to this link for build results (access rights to CI server needed):
Doh -- sadness. When I run this with libfabric master, I get an assert fail in your new code: Works fine with libfabric 1.0.0, though. Sorry -- I don't have enough brain power to track this down ATM...
Strange... I thought I tested all cases with the latest version of the code, but maybe I missed one. Let me see if I can reproduce.
When using an external libfabric (or really any libfabric newer than libfabric commit 607e863), we must use fi_getname to determine the local port of our endpoint. Without this fix, OMPI will hang endlessly while retransmitting packets to port 0 on the remote host.
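As a rough illustration of the port lookup described above: libfabric's `fi_getname(fid_t fid, void *addr, size_t *addrlen)` fills a caller-supplied buffer with the endpoint's local address. The sketch below only covers the step after that call and assumes the endpoint uses the `FI_SOCKADDR_IN` address format, so the buffer holds a `struct sockaddr_in`. The helper name `local_port_from_name` is hypothetical and is not taken from the OMPI source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper (not OMPI code): given the buffer that a name
 * query such as fi_getname() filled in, return the local port in host
 * byte order, or -1 if the buffer does not hold an IPv4 sockaddr_in.
 * Assumption: the endpoint's address format is FI_SOCKADDR_IN. */
static int local_port_from_name(const void *addr, size_t addrlen)
{
    struct sockaddr_in sin;

    if (addr == NULL || addrlen < sizeof(sin)) {
        return -1;
    }
    /* Copy out so we make no alignment assumptions about the buffer. */
    memcpy(&sin, addr, sizeof(sin));
    if (sin.sin_family != AF_INET) {
        return -1;
    }
    /* sin_port is stored in network byte order. */
    return (int) ntohs(sin.sin_port);
}
```

A caller would pass the buffer and length that the name query returned, and treat a result of -1 (or a port of 0) as a reason not to advertise the address to peers.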
I wasn't building with
Per the assert right above it, does this "if" need to check for FI_SOCKADDR or FI_SOCKADDR_IN?
Disregard my previous comment... I see that the purpose of this "if" is mainly just another assert.
However, it'll generate a CID / compiler warning in non-debug builds, because sa will be assigned and not used, right?
> However, it'll generate a CID / compiler warning in non-debug builds, because sa will be assigned and not used, right?
I suppose. What's the usual fix for that specific set of warnings? I rather hate sticking more code in just to make various static analysis tools happy when there's no actual problem.
I could also do it a bit less cleanly and exploit the fact that I know struct sockaddr and struct sockaddr_in must use the same layout for sa_family/sin_family. Then I don't need sa at all and I just make the assertion on sin->sin_family instead.
You could also just surround the if block with #if OPAL_ENABLE_DEBUG...?
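The two options discussed in this thread can be sketched side by side. This is an illustration, not the code under review: `checked_port` is a hypothetical function, and the `OPAL_ENABLE_DEBUG` definition below is a stand-in derived from `NDEBUG` so that the example builds outside of OMPI.

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Stand-in for OMPI's OPAL_ENABLE_DEBUG so this sketch builds alone.
 * Assumption: debug mode is on exactly when assert() is active. */
#ifndef OPAL_ENABLE_DEBUG
#  ifdef NDEBUG
#    define OPAL_ENABLE_DEBUG 0
#  else
#    define OPAL_ENABLE_DEBUG 1
#  endif
#endif

/* Hypothetical function: sanity-check the address family, then return
 * the port in host byte order. */
static int checked_port(const struct sockaddr_in *sin)
{
#if OPAL_ENABLE_DEBUG
    /* Option A: keep the generic pointer, but only in debug builds, so
     * non-debug builds never see an assigned-but-unused variable. */
    const struct sockaddr *sa = (const struct sockaddr *) sin;
    assert(AF_INET == sa->sa_family);
#endif

    /* Option B: assert on sin directly; no extra variable is needed.
     * When NDEBUG is defined this line compiles to nothing. */
    assert(AF_INET == sin->sin_family);

    return (int) ntohs(sin->sin_port);
}
```

Either option alone would be enough in practice. Option B is shorter, while option A keeps the check expressed against the generic `struct sockaddr` view.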
@jsquyres please review (and propagate to other OMPI branches/instances as needed)