Segmentation fault in netcdf test with more than 2 mpi processes #5535
Comments
|
Ah, once I realized that I need to run valgrind on mpiexec and not on my program, I got something that is perhaps useful. I've attached the valgrind output. |
|
@opoplawski is it possible to try the current UCX master? There were fixes in TCP for PUT operation handling. If they help, we could port them to v1.7.x. |
|
It still segfaults with the latest UCX master. Setting UCX_TCP_PUT_ENABLE=n does allow the test to run. |
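For anyone wanting to try the same workaround, a minimal sketch of how it might be applied follows. The rank count and the path to the test binary are assumptions for illustration and are not taken from the report. On a single node the launched ranks normally inherit the exported variable, but that depends on the MPI launcher in use.

```shell
# Workaround reported in this thread: disable PUT over the UCX TCP transport.
export UCX_TCP_PUT_ENABLE=n

# Assumption: 4 ranks (the crash is reported with more than 2 processes).
NP=4
# Assumption: hypothetical location of the failing test inside a netcdf build tree.
TEST_BIN=./nc_test4/tst_parallel3

# Only launch when both the MPI launcher and the test binary are available.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$TEST_BIN" ]; then
    mpiexec -n "$NP" "$TEST_BIN"
else
    echo "mpiexec or $TEST_BIN not found; skipping run"
fi
```

This only avoids the failing code path and does not address the underlying crash in uct_tcp_ep_handle_put_req.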
thank you for the tests |
|
@opoplawski I cloned https://github.com/Parallel-NetCDF/PnetCDF, then configured and built NetCDF and the tests. Am I missing something needed to reproduce the issue? |
|
We actually do not build netcdf with pnetcdf. Also, the particular test that is failing is nc_test4/tst_parallel3. We run the whole suite with "make check", so I'm not sure off the top of my head how to run just the nc_test4 suite, but I think tst_* scripts get created there. https://src.fedoraproject.org/rpms/netcdf/blob/master/f/netcdf.spec should give you some more insight into how we are building netcdf. |
|
@opoplawski is this issue still relevant? Did you see this with the current UCX master? |
|
I don't seem to be seeing this with UCX 1.9.0. |
Describe the bug
Segmentation fault in uct_tcp_ep_handle_put_req with more than 2 mpi processes.
Steps to Reproduce
Setup and versions
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
This creates an empty log file. Without UCX_LOG_FILE it outputs:
It may have been introduced between 1.7.0 and 1.8.0 - or it may just be that ucx behaves differently when I downgrade just ucx to 1.7.0 - it warns about incompatible versions.
I haven't had any luck with getting valgrind to output anything useful.