@jsquyres
Member

This is the v2.x version of master PR #2692.

It includes all but the last commit from #2692 (i.e., the one that changes the output format of statistics).

This is a blocker because it fixes some usnic bugs when running at scale.

- Add more explanatory comments
- Trivial whitespace / style updates
- Rename opal_btl_usnic_force_retrans() -> opal_btl_usnic_fast_retrans()

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 40fe575)

The types are technically typedef equivalent, but it's less confusing
to use the types that agree with the name of the constructor.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e25b860)

Since the usnic BTL is single-threaded in this area, there really is
no danger, but don't use one of the pointers hanging off the frag
after we return it to the freelist.  Instead, save the endpoint
pointer before returning the frag.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit b02d8c4)
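
A minimal sketch of the ordering this commit describes: copy the endpoint pointer out of the frag, return the frag, then use only the saved copy. The `frag_t`, `endpoint_t`, and `freelist_t` types and both function names are hypothetical stand-ins for illustration, not the actual usnic BTL definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the BTL structures (assumed for this sketch). */
typedef struct endpoint {
    int sends_completed;
} endpoint_t;

typedef struct frag {
    endpoint_t *endpoint;   /* pointer hanging off the fragment */
    struct frag *next_free; /* freelist linkage */
} frag_t;

typedef struct freelist {
    frag_t *head;
} freelist_t;

/* Return a fragment to the freelist.  After this call the fragment's
 * fields must be treated as invalid; this sketch clears the pointer. */
static void frag_return(freelist_t *fl, frag_t *frag)
{
    frag->endpoint = NULL;
    frag->next_free = fl->head;
    fl->head = frag;
}

/* Complete a send: save the endpoint pointer first, release the
 * fragment, then touch only the saved pointer. */
static void send_complete(freelist_t *fl, frag_t *frag)
{
    endpoint_t *ep = frag->endpoint; /* save before the frag is released */
    frag_return(fl, frag);
    ep->sends_completed++;           /* does not dereference frag */
}
```

Even in single-threaded code, this ordering keeps the function correct if the freelist ever starts clearing or reusing fragments on return.
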
The libfabric usnic provider may give you back TX/RX queues that are
longer than you asked for.  So just use the TX/RX/CQ lengths that we
asked for, regardless of what length comes back.

Additionally, keep the length of the priority channel CQ separate from
the length of the data CQ.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7787dad)
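
A rough sketch of the idea, assuming a hypothetical provider that rounds each requested depth up to a power of two. The `channel_t` type, `provider_alloc_queue()`, and `channel_open()` are illustrative names, not libfabric or usnic BTL APIs. Each channel records the lengths it asked for and keeps its own CQ length.

```c
#include <stddef.h>

/* Hypothetical provider behavior (an assumption for this sketch):
 * round the requested depth up to the next power of two. */
static size_t provider_alloc_queue(size_t requested)
{
    size_t actual = 1;
    while (actual < requested) {
        actual <<= 1;
    }
    return actual;
}

typedef struct channel {
    size_t tx_len;          /* lengths used for credit accounting */
    size_t rx_len;
    size_t cq_len;          /* kept per channel, not shared */
    size_t provider_tx_len; /* what came back; diagnostics only */
    size_t provider_rx_len;
} channel_t;

/* Open a channel: remember what the provider returned, but size all
 * bookkeeping from the requested lengths. */
static void channel_open(channel_t *ch, size_t tx, size_t rx, size_t cq)
{
    ch->provider_tx_len = provider_alloc_queue(tx);
    ch->provider_rx_len = provider_alloc_queue(rx);
    ch->tx_len = tx;
    ch->rx_len = rx;
    ch->cq_len = cq;
}
```

With this shape, a priority channel opened with `channel_open(&prio, 100, 100, 200)` and a data channel opened with larger values each carry an independent `cq_len`.
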
Don't just blindly send ACKs; ensure that we have send credits before
doing so.  If we don't have any send credits, just don't send the ACK
(it'll come again soon enough; it's not a tragedy if we don't send it
now).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 879d25e)
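
The credit check can be sketched roughly as follows. The `ack_channel_t` type and `try_send_ack()` are hypothetical names used only for illustration; the real BTL tracks credits in its own structures.

```c
#include <stdbool.h>

/* Hypothetical channel state (assumed for this sketch). */
typedef struct ack_channel {
    int send_credits; /* free send slots remaining */
    int acks_sent;
} ack_channel_t;

/* Send an ACK only if a send credit is available.  Returning false
 * means the ACK was skipped; a later ACK will cover the same data. */
static bool try_send_ack(ack_channel_t *ch)
{
    if (ch->send_credits <= 0) {
        return false; /* no credits: skip rather than overrun the queue */
    }
    ch->send_credits--;
    ch->acks_sent++;
    return true;
}
```

Skipping is acceptable here because ACKs are cumulative in effect: as the commit message notes, another one will come soon enough.
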
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit c4d7876)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 53dc75a)

Double check the queue lengths that we get back from libfabric to
ensure that they are at least as long as we need.  They *should* never
be shorter than we need, but let's just check to be sure.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit bd5b8ed)
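
A minimal sketch of such a sanity check, assuming a hypothetical helper `queue_len_ok()` that compares the length the provider returned against the length the caller needs. It is not the actual check in the usnic BTL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Return true if the queue the provider gave us is at least as long
 * as we need; otherwise report the shortfall and return false. */
static bool queue_len_ok(const char *name, size_t needed, size_t actual)
{
    if (actual < needed) {
        fprintf(stderr, "%s queue too short: need %zu, got %zu\n",
                name, needed, actual);
        return false;
    }
    return true;
}
```

A caller would run this once per queue (TX, RX, CQ) right after opening it and abort setup if any check fails.
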
Show the actual RX/TX and CQ length returned by libfabric in verbose
output.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 2d28ccb)

Add some run-time assert checks for debug builds.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7048ade)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 1fdd0fe)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 706f53b)

@jsquyres force-pushed the pr/v2.x/usnic-queue-fixes branch from bfc982f to dcbe7a0 on January 10, 2017 20:40
@jsquyres
Member Author

@bturrubiates Updated to include fixes from review of #2692.

@jsquyres
Member Author

@hppritcha Good to go.

@hppritcha merged commit d77f860 into open-mpi:v2.x on Jan 11, 2017
@jsquyres deleted the pr/v2.x/usnic-queue-fixes branch on August 2, 2018 16:56