-
Notifications
You must be signed in to change notification settings - Fork 931
usnic fixes for differences between libfabric v1.0.0 and v1.1.0 #688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
usnic fixes for differences between libfabric v1.0.0 and v1.1.0 #688
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this FI_VERSION call just be fi_version()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that would make a run-time query, as opposed to a compile-time query. The way it is written, it will tell libfabric "this is the version of the API with which I was compiled."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...although thinking about this more, I think you're on to something.
If we compile OMPI against libfabric v1.0.0 (i.e., API v1.0) and run with libfabric v1.1.0 (i.e., API v1.1), our compromise was that the usnic libfabric provider would say "I'm v1.1, but you(the app) are compiled against v1.0, so you're going to do MSG_PREFIX wrong -- so I'll turn it off."
But OMPI correctly handles MSG_PREFIX regardless of which way the usnic provider does it.
So we should lie here; OMPI should pass a minimum of FI_VERSION(1,1) to fi_getinfo().
New patch coming shortly...
d9bbe3e to
470fd30
Compare
|
This PR is turning into general usnic BTL fixes. I just pushed another commit with some valgrind updates. Note that I'm still chasing two more issues:
|
|
The commits related to the CRC errors and FI_MSG_PREFIX look alright to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"report" --> "request"? (also later in this comment)
7085953 to
c8438ed
Compare
Handle the differences between libfabric v1.0.0 and v1.1.0 in the return value of fi_cq_readerr(). Also consolidate CRC and truncation errors into the same handling block, since truncation errors are typically another symptom of CRC errors. This ensures that buffers get reposted properly.
This helps reduce false positives when running MPI apps through Valgrind.
In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled FI_MSG_PREFIX inconsistently between sends and receives. This has been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is handled consistently for both sends and receives. Run-time detect which libfabric we are running with and adapt behavior appropriately.
The usnic BTL put method is currently broken. Disable it until we can fix it properly.
The "sin" variable is used below; need to ensure that it is assigned for all builds (not just debug builds).
The comment didn't match the debugging code (which was ugly, and apparently never happens, anyway). Just return and let the sender retransmit.
|
bot:retest |
c8438ed to
633da66
Compare
usnic fixes for differences between libfabric v1.0.0 and v1.1.0
Fix singleton operations when running under a SLURM allocation.
Two commits to handle differences between libfabric v1.0.0 and v1.1.0:
fi_cq_readerr()return value differences@bturrubiates @goodell please review. The
fi_cq_readerr()thing is new -- I discovered that tonight.