AV insertion, removal, and reinsertion causes FI_ADDR_NOTAVAIL #2504

Closed
NotThatJonSmith opened this issue Nov 7, 2016 · 2 comments

@NotThatJonSmith

Problem discovered in libfabric 1.1.0 on 10/31/2016

For reasons tangential to the problem at hand, I wish to send a message from one unconnected endpoint to another using the address vector in an “insert, send, remove” pattern.

After performing one insert, send, remove iteration, the second fi_av_insert causes the sockets provider to throw a debug message saying that the sockets address given is invalid (even though it was successfully used to send a message on the last iteration). Additionally, the fi_av_insert call yields a fi_addr_t equal to FI_ADDR_NOTAVAIL.

I held two theories for the cause of this behavior:

  1. The fi_av_insert call mangles the address we give it, perhaps setting it to NULL, or freeing it
  2. The fid_av has latent state, either unknown or errant, which causes the behavior

I did some digging and discovered that:

  1. If I ensure that I only ever give fresh, dedicated copies of my raw addresses to the fi_av_insert call, the symptoms persist. The first message is successful, but the second AV insert results in failure. This seems to debunk cause #1.
  2. If, before every send operation, I create and bind a new AV, and afterward I close it, then the allegedly-errant behavior is eliminated – both messages are correctly transmitted. This seems to support cause #2.
@jsquyres
Member

jsquyres commented Nov 7, 2016

Which provider are you using?

If you're using the sockets provider, can you try again with Libfabric v1.4? There were some AV bugs fixed in the sockets provider for v1.4.

@NotThatJonSmith
Author

I upgraded to 1.4.0 and updated my application accordingly, and the issue was resolved. Thanks!
