
Member

@hjelmn hjelmn commented Jun 7, 2016

Before dynamic add_procs, openib_btl_size_queues was called exactly
once for non-dynamic jobs. Now the function is called on each new
connection, so the calculation was wrong. Rewrote the function to
correctly calculate the CQ size and to attempt to adjust the CQ only if
the requested size has changed. This fixes a bug when using the openib
btl on psm2 hardware that is caused by the time needed to resize a
CQ. The overhead was causing udcm to time out and fail.
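
For readers without the diff at hand, the idea of "only resize when the requested size has changed" can be sketched roughly as below. This is not the code from this change: every name (`sketch_cq`, `sketch_cq_resize`, `sketch_size_queues`, the `SKETCH_*` constants) is hypothetical, the sizing rule (entries per connection, clamped to a minimum and maximum) is an assumption made for illustration, and the resize function is a stand-in for whatever costly resize call the device layer provides.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants only; the real limits come from the device and MCA parameters. */
#define SKETCH_ENTRIES_PER_CONNECTION 16u
#define SKETCH_CQ_MIN_SIZE            64u
#define SKETCH_CQ_MAX_SIZE            128u

/* Hypothetical stand-in for a completion queue. */
struct sketch_cq {
    unsigned int size;          /* entries currently allocated */
    unsigned int resize_calls;  /* how many times the costly resize ran */
};

/* Stand-in for the expensive resize operation; here it only records the call. */
static int sketch_cq_resize(struct sketch_cq *cq, unsigned int new_size)
{
    cq->size = new_size;
    cq->resize_calls++;
    return 0;
}

/* Assumed sizing rule: scale with the total connection count, clamped to [min, max]. */
static unsigned int sketch_requested_size(unsigned int num_connections)
{
    unsigned int want;

    /* Clamp before multiplying so a huge connection count cannot overflow. */
    if (num_connections > SKETCH_CQ_MAX_SIZE / SKETCH_ENTRIES_PER_CONNECTION) {
        return SKETCH_CQ_MAX_SIZE;
    }
    want = num_connections * SKETCH_ENTRIES_PER_CONNECTION;
    return want < SKETCH_CQ_MIN_SIZE ? SKETCH_CQ_MIN_SIZE : want;
}

/* Safe to call on every new connection: the costly resize is skipped
 * whenever the requested size matches the current size. */
static int sketch_size_queues(struct sketch_cq *cq, unsigned int num_connections)
{
    unsigned int want;

    assert(cq != NULL);
    want = sketch_requested_size(num_connections);
    if (want == cq->size) {
        return 0;  /* nothing changed, no resize */
    }
    return sketch_cq_resize(cq, want);
}
```

Under these assumed constants, calling `sketch_size_queues` for connections 1 through 4 triggers a single resize, because the requested size stays at the minimum; only later connections that push the request past the minimum, and before it hits the maximum, pay the resize cost again.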

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

Member Author

hjelmn commented Jun 7, 2016

@hppritcha This is the error I was telling you about. The CQ sizing was completely wrong, and the overhead of resizing the CQ was breaking UDCM on Omni-Path hardware. This change was developed and tested on kit.

Member

jladd-mlnx commented Jun 8, 2016

@hjelmn @hppritcha This looks to be a bug in UCX. We are tracking a few issues related to memory hooks being deregistered after the component is unloaded. Adding -mca pml ^ucx fixes the hang. I'll update the Jenkins test now.

@jladd-mlnx
Member

bot:retest

@hjelmn hjelmn merged commit f8957f2 into open-mpi:master Jun 8, 2016