-
Notifications
You must be signed in to change notification settings - Fork 918
btl/uct: bug fixes and general improvements #5837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@thananon See if this helps with the zcopy issue. |
This patch give me segfault.
|
I have some more changes that need to be pushed. Will push those this afternoon. |
@thananon Pushed the updates. |
:bot:retest:mellanox |
I still have the same segfault with this commit. I will have the trace here after I got the debug build built. |
backtrace here:
|
@thananon Can you share the program and mpirun line that triggers the SEGV. |
The program is NetPIPE. I think it segfaulted in MPI_Init. |
Let me take that back. The stack trace is from my benchmark but I can reproduce it with NetPIPE and both seems to crash in |
Well damn. It working perfectly for me:
|
Maybe something stale in your install? |
What UCX version? |
My install is fresh. I can try to distclean and make again. my ucx version is master from April.
|
@thananon I will try that version and see what happens. |
@hjelmn I rebased to the current master and still have the same problem. I'm still waiting for my new build. Will report back. |
build finished. Still have the same problem. I think the error is from uct as everything looks normal from MPI stack. |
very odd. working for me with that revision. though i am testing on aarch64 and power9. |
what transport is it using? it will say in btl_base_verbose output. |
I'm not sure. |
Thats the problem. UD is selected for active-messages in your case. I will fix that now. Try with |
With UD as AM:
|
|
Ok, adding a check to ensure we don't try to use UD for communication. Its a special case. |
This commit updates the uct btl to change the transports parameter into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the priority list. This will give better out of the box performance for multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Done. Please try the latest version without setting the transports. UD will get thrown to the end of the list of tl's to check. |
👍 works. |
Did this fix the zero-copy issue? |
It did. |
FYI: I did a performance run and this patch reduced multithreaded performance around 20-25%. Prepatch:
Post patch:
Seems like something got funneled here. |
@thananon Interesting. Only thing I can think of is the incorrect fragment fields making the results look better than they should have. The other reason could be that recursive progress will no longer be made because it can lead to deep recursion. |
Hmm, thank you @hjelmn . I will look into this. I think you are right about result looking better than it should. btl/uct should not beat openib btl single threaded right? |
@thananon Not sure. I would assume that going directly to verbs is better than uct if not using the optimized paths (rc_mlx5, dc_mlx5). Otherwise UCT will be faster. |
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov