jsquyres commented Jul 3, 2015

Two commits to handle differences between libfabric v1.0.0 and v1.1.0:

  1. fi_cq_readerr() return value differences
  2. FI_MSG_PREFIX behavior differences

@bturrubiates @goodell please review. The fi_cq_readerr() thing is new -- I discovered that tonight.

jsquyres added the bug label Jul 3, 2015

Could this FI_VERSION call just be fi_version()?

Member Author

No, that would make a run-time query, as opposed to a compile-time query. The way it is written, it will tell libfabric "this is the version of the API with which I was compiled."
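
As a rough illustration of the compile-time versus run-time distinction, the sketch below mirrors how libfabric packs a version into 32 bits (major in the upper 16 bits, minor in the lower 16). The `SKETCH_*` macros and both helper functions are stand-ins invented here so the example builds without libfabric installed. In real code the equivalents would be `FI_VERSION`, `FI_MAJOR`, and `FI_MINOR` from `<rdma/fabric.h>`, with `fi_version()` supplying the run-time value.

```c
#include <stdint.h>

/* Stand-ins mirroring libfabric's FI_VERSION / FI_MAJOR / FI_MINOR
 * (assumed layout: major in the upper 16 bits, minor in the lower 16). */
#define SKETCH_VERSION(major, minor) ((uint32_t)(((major) << 16) | (minor)))
#define SKETCH_MAJOR(version)        ((uint32_t)(version) >> 16)
#define SKETCH_MINOR(version)        ((uint32_t)(version) & 0xFFFFu)

/* Hypothetical: the headers seen at build time, pretending they were v1.0. */
#define SKETCH_BUILD_MAJOR 1
#define SKETCH_BUILD_MINOR 0

/* Compile-time answer: fixed when the application is built. */
uint32_t compiled_api_version(void)
{
    return SKETCH_VERSION(SKETCH_BUILD_MAJOR, SKETCH_BUILD_MINOR);
}

/* Run-time answer: compares whatever the loaded library reports
 * (the value fi_version() would return in real code) with the
 * version baked in at build time. */
int runtime_is_newer(uint32_t runtime_version)
{
    return runtime_version > compiled_api_version();
}
```

An application built against v1.0 headers but loaded with a v1.1 library would see `runtime_is_newer()` return 1, which is the mismatch this thread is about.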

Member Author

...although thinking about this more, I think you're on to something.

If we compile OMPI against libfabric v1.0.0 (i.e., API v1.0) and run with libfabric v1.1.0 (i.e., API v1.1), our compromise was that the usnic libfabric provider would say "I'm v1.1, but you (the app) are compiled against v1.0, so you're going to do MSG_PREFIX wrong -- so I'll turn it off."

But OMPI correctly handles MSG_PREFIX regardless of which way the usnic provider does it.

So we should lie here; OMPI should pass a minimum of FI_VERSION(1,1) to fi_getinfo().

New patch coming shortly...
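
One way to express the "lie" described above is to clamp the version passed to fi_getinfo() so that it is never lower than API v1.1, whatever headers the application was compiled against. This is only a sketch: `SKETCH_FI_VERSION` is a stand-in for libfabric's `FI_VERSION` macro, `api_version_to_request()` is a hypothetical helper, and none of it is taken from the patch in this PR.

```c
#include <stdint.h>

/* Stand-in for libfabric's FI_VERSION(major, minor) macro. */
#define SKETCH_FI_VERSION(major, minor) ((uint32_t)(((major) << 16) | (minor)))

/* Hypothetical helper: never request less than API v1.1, even if the
 * application was compiled against older headers. */
uint32_t api_version_to_request(uint32_t compiled_version)
{
    const uint32_t minimum = SKETCH_FI_VERSION(1, 1);
    return compiled_version < minimum ? minimum : compiled_version;
}
```

The returned value would then be passed as the first (version) argument of fi_getinfo().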

jsquyres force-pushed the pr/usnic-libfabric-msg-prefix-fix branch 2 times, most recently from d9bbe3e to 470fd30 Compare July 3, 2015 11:44

jsquyres commented Jul 3, 2015

This PR is turning into general usnic BTL fixes. I just pushed another commit with some valgrind updates.

Note that I'm still chasing two more issues:

  1. Minor: Valgrind reports an invalid 2-byte read in the usnic BTL on MCW rank 16 (and only MCW rank 16!) when I run mpirun -np 32 --mca btl usnic,sm,self IMB-MPI1 Alltoall -npmin 32 across two of the 16-core ivyXX servers.
  2. Showstopper: I'm getting a segv in the usnic BTL finalize in the same 32-core IMB run listed above; it looks like some kind of memory corruption. The segv is happening deep within OBJ_DESTRUCT(&module->small_send_frags) in usnic_finalize() in the BTL.

@bturrubiates

The commits related to the CRC errors and FI_MSG_PREFIX look alright to me.

Member

"report" --> "request"? (also later in this comment)

jsquyres force-pushed the pr/usnic-libfabric-msg-prefix-fix branch 2 times, most recently from 7085953 to c8438ed Compare July 9, 2015 19:58
jsquyres added 8 commits July 10, 2015 06:51
Handle the differences between libfabric v1.0.0 and v1.1.0 in the
return value of fi_cq_readerr().

Also consolidate CRC and truncation errors into the same handling
block, since truncation errors are typically another symptom of CRC
errors.  This ensures that buffers get reposted properly.

This helps reduce false positives when running MPI apps through
Valgrind.

In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled
FI_MSG_PREFIX inconsistently between sends and receives.  This has
been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is
handled consistently for both sends and receives.

Run-time detect which libfabric we are running with and adapt behavior
appropriately.

The usnic BTL put method is currently broken.  Disable it until we can
fix it properly.

The "sin" variable is used below; need to ensure that it is assigned
for all builds (not just debug builds).

The comment didn't match the debugging code (which was ugly, and
apparently never happens, anyway).  Just return and let the sender
retransmit.
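
To illustrate the consolidation described in the first commit message above, here is a minimal sketch in which CRC and truncation errors share one handling path that reposts the receive buffer, while any other error is treated as fatal. The enum values, `struct sketch_channel`, `handle_cq_error()`, and the `reposted` counter are hypothetical names invented for this example. The real BTL works with libfabric's completion-queue error entries, which are not reproduced here.

```c
/* Hypothetical error kinds standing in for what a CQ error entry reports. */
enum sketch_cq_err { SKETCH_ERR_CRC, SKETCH_ERR_TRUNC, SKETCH_ERR_OTHER };

/* What the caller should do after the error has been examined. */
enum sketch_action { SKETCH_REPOST, SKETCH_ABORT };

/* Hypothetical channel state: only counts reposted receive buffers. */
struct sketch_channel { int reposted; };

enum sketch_action handle_cq_error(struct sketch_channel *ch,
                                   enum sketch_cq_err err)
{
    switch (err) {
    case SKETCH_ERR_CRC:
    case SKETCH_ERR_TRUNC:
        /* Truncation is treated as another symptom of a CRC error,
         * so both cases drop the frame and repost the receive buffer. */
        ch->reposted++;
        return SKETCH_REPOST;
    default:
        /* Anything else is not recoverable in this sketch. */
        return SKETCH_ABORT;
    }
}
```

Sharing one `case` body is what guarantees that a truncation error can never skip the repost step that the CRC path performs.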
@jsquyres
Member Author

bot:retest

jsquyres force-pushed the pr/usnic-libfabric-msg-prefix-fix branch from c8438ed to 633da66 Compare July 20, 2015 18:33
jsquyres added a commit that referenced this pull request Jul 21, 2015
usnic fixes for differences between libfabric v1.0.0 and v1.1.0
jsquyres merged commit ec3a383 into open-mpi:master Jul 21, 2015
jsquyres deleted the pr/usnic-libfabric-msg-prefix-fix branch July 21, 2015 14:18
jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
Fix singleton operations when running under a SLURM allocation.