Skip to content

Conversation

@goodell
Copy link
Member

@goodell goodell commented May 21, 2015

When using an external libfabric (or really any libfabric newer than
libfabric commit 607e863), we must use fi_getname to determine the local
port of our endpoint.

@jsquyres please review (and propagate to other OMPI branches/instances as needed)

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/537/
Test PASSed.

@jsquyres
Copy link
Member

Doh -- sadness. When I run this with libfabric master, I get an assert fail in your new code:

ring_c: btl_usnic_module.c:1514: create_ep: Assertion `FI_SOCKADDR_IN == channel->info->addr_format' failed.

Works fine with libfabric 1.0.0, though. Sorry -- I don't have enough brain power to track this down ATM...

@jsquyres jsquyres assigned goodell and unassigned jsquyres May 21, 2015
@goodell
Copy link
Member Author

goodell commented May 21, 2015

strange... I thought I tested all cases with the latest version of the code, but maybe I missed one. Let me see if I can reproduce.

When using an external libfabric (or really any libfabric newer than
libfabric commit 607e863), we must use fi_getname to determine the local
port of our endpoint.  Without this fix, OMPI will hang endlessly
while retransmitting packets to port 0 on the remote host.
@goodell goodell force-pushed the pr/usnic_getname branch from 1facab0 to 65b66ab Compare May 21, 2015 15:52
@goodell
Copy link
Member Author

goodell commented May 21, 2015

I wasn't building with --enable-debug, so my assertions weren't enabled when I did my testing. I've pushed a fixed version which knows how to handle libfabric returning either FI_SOCKADDR or FI_SOCKADDR_IN.

@goodell goodell assigned jsquyres and unassigned goodell May 21, 2015
@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/539/
Test PASSed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the assert right above it, does this "if" need to check for FI_SOCKADDR or FI_SOCKADDR_IN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disregard my previous comment... I see that the purpose of this "if" is mainly just another assert.

However, it'll generate a CID / compiler warning in non-debug builds, because sa will be assigned and not used, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it'll generate a CID / compiler warning in non-debug builds, because sa will be assigned and not used, right?

I suppose. What's the usual fix for that specific set of warnings? I rather hate sticking more code in just to make various static analysis tools happy when there's no actual problem.

I could also do it a bit less cleanly and exploit the fact that I know struct sockaddr and struct sockaddr_in must use the same layout for sa_family/sin_family. Then I don't need sa at all and I just make the assertion on sin->sin_family instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also just surround the if block with #if OPAL_ENABLE_DEBUG...?

jsquyres added a commit that referenced this pull request May 26, 2015
usnic: use fi_getname in newer libfabric
@jsquyres jsquyres merged commit 68c3335 into open-mpi:master May 26, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
Add ability for user to empty the CUDA IPC registration cache when it is full
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants