
Segmentation fault in netcdf test with more than 2 mpi processes #5535

Closed
opoplawski opened this issue Aug 6, 2020 · 8 comments

@opoplawski

Describe the bug

Segmentation fault in uct_tcp_ep_handle_put_req with more than 2 MPI processes.

Steps to Reproduce

  • Compile netcdf 4.7.3 and run the MPI tests: nc_test4/tst_parallel3 segfaults when run with more than 2 processes (see the sketch after this list).
  • ucx 1.8.1 (Fedora package)
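
For reference, a minimal reproduction sketch - the configure flags, paths, and process count here are my assumptions, not commands confirmed in this report:

    # Build netcdf 4.7.3 against MPI with parallel tests enabled
    ./configure CC=mpicc --enable-parallel-tests
    make -j4
    # Run the failing test directly with 3 ranks
    cd nc_test4
    mpiexec -n 3 ./tst_parallel3   # segfaults with more than 2 processes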

Setup and versions

  • Fedora Rawhide x86_64 4 core VM

Additional information (depending on the issue)

  • OpenMPI version 4.0.4
  • Output of ucx_info -d to show transports and devices recognized by UCX
#
# Memory domain: sockcm
#     Component: sockcm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: tcp
#      Device: ens3
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Connection manager: tcp
#      max_conn_priv: 2040 bytes
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: self
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#   Transport: sysv
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#   Transport: posix
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
    This creates an empty log file. Without UCX_LOG_FILE, it outputs the following (see the invocation sketch after these lines):
[1596670815.248803] [vmrawhide-rufous:370898:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory tcp/ens3);
[1596670815.248962] [vmrawhide-rufous:370896:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory tcp/ens3);
[1596670815.249694] [vmrawhide-rufous:370895:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory tcp/ens3);
[1596670815.249700] [vmrawhide-rufous:370898:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(sysv/memory tcp/ens3);
[1596670815.250053] [vmrawhide-rufous:370896:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(sysv/memory tcp/ens3);
[1596670815.251018] [vmrawhide-rufous:370895:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(sysv/memory tcp/ens3);
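
For reference, roughly how the logging variables were combined (a sketch; UCX documents %p in UCX_LOG_FILE as a per-process pid substitution, but the exact test invocation is an assumption):

    # One log file per rank; data-level logging requires --enable-logging at configure time
    UCX_LOG_LEVEL=data UCX_LOG_FILE=ucx.%p.log mpiexec -n 3 ./tst_parallel3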

It may have been introduced between 1.7.0 and 1.8.0 - or it may just be that UCX behaves differently when I downgrade only UCX to 1.7.0, since it then warns about incompatible versions.

I haven't had any luck getting valgrind to output anything useful.

@opoplawski opoplawski added the Bug label Aug 6, 2020
@opoplawski
Author

Ah, once I realized that I need to run valgrind on mpiexec and not on my program, I got something perhaps useful. I've attached the valgrind output.
valgrind.log
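
For reference, the invocation was roughly this (a sketch; --trace-children=yes and the stderr redirect are my assumptions for capturing the ranks' output):

    # valgrind wraps the launcher and follows the forked MPI ranks
    valgrind --trace-children=yes mpiexec -n 3 ./nc_test4/tst_parallel3 2> valgrind.log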

@dmitrygx
Member

dmitrygx commented Aug 6, 2020

@opoplawski is it possible to try the current UCX master? There were fixes in TCP for PUT operation handling:
#4318
#4678

If they help, we could backport them to v1.7.x.
Also, if the issue still persists on the master branch, you could use the UCX_TCP_PUT_ENABLE configuration parameter as a workaround while I investigate the issue.
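
Something like this is what I mean (a sketch; the process count and test path are assumptions):

    # Disable the TCP PUT protocol that the crashing handler belongs to
    UCX_TCP_PUT_ENABLE=n mpiexec -n 4 ./nc_test4/tst_parallel3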

@opoplawski
Author

It still segfaults with the latest UCX master. Setting UCX_TCP_PUT_ENABLE=n does allow the test to run.

@dmitrygx
Member

dmitrygx commented Aug 7, 2020

It still segfaults with the latest UCX master. Setting UCX_TCP_PUT_ENABLE=n does allow the test to run.

Thank you for the tests.

@dmitrygx
Member

@opoplawski I cloned https://github.com/Parallel-NetCDF/PnetCDF, configured and built NetCDF and the tests,
and then tried to run make ptest4 in the test/F90 directory with the UCX_TLS=tcp environment variable set,
but it passes all tests:

$ UCX_TLS=tcp make ptest4
===========================================================
    test/F90: Parallel testing on 4 MPI processes
===========================================================
export SED="/bin/sed"; export srcdir="."; export TESTOUTDIR="."; export TESTSEQRUN=""; export TESTMPIRUN="/hpc/local/benchmarks/daily/next/2020-10-21/hpcx-gcc-redhat7.6/ompi/bin/mpiexec -n NP"; export PNETCDF_DEBUG="0"; export TESTPROGRAMS="tst_f90 f90tst_vars tst_types2 tst_f90_cdf5 f90tst_vars2 f90tst_vars3 f90tst_vars4 test_intent test_attr_int64 test_fill"; export check_PROGRAMS="tst_f90 f90tst_vars tst_types2 tst_f90_cdf5 f90tst_vars2 f90tst_vars3 f90tst_vars4 test_intent test_attr_int64 test_fill f90tst_parallel f90tst_parallel2 f90tst_parallel3 f90tst_parallel4 tst_io"; export ENABLE_BURST_BUFFER="0"; export PARALLEL_PROGS="f90tst_parallel f90tst_parallel2 f90tst_parallel3 f90tst_parallel4"; \
./parallel_run.sh 4 || exit 1
*** TESTING F90 f90tst_parallel                                    ------ pass
*** TESTING F90 f90tst_parallel2 for strided access                ------ pass
*** TESTING F90 f90tst_parallel3                                   ------ pass
*** TESTING F90 f90tst_parallel4                                   ------ pass

Am I missing something to successfully reproduce the issue?

@opoplawski
Author

We actually do not build netcdf with pnetcdf. Also, the particular test that is failing is nc_test4/tst_parallel3. We run the whole suite with "make check", so I'm not sure off the top of my head how to run just the nc_test4 suite - but I think tst_* scripts get created there (something like the sketch at the end of this comment might work).

https://src.fedoraproject.org/rpms/netcdf/blob/master/f/netcdf.spec and
https://kojipkgs.fedoraproject.org//packages/netcdf/4.7.3/4.fc34/data/logs/x86_64/build.log

should give you some more insight into how we are building netcdf.
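
A sketch of what I'd try, assuming the standard automake layout of the netcdf build tree:

    # From the top of the netcdf build tree, run only the nc_test4 suite
    cd nc_test4
    make check
    # or, once built, invoke the failing binary directly
    mpiexec -n 4 ./tst_parallel3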

@dmitrygx
Member

@opoplawski is this issue still relevant? did you see this with the current UCX master?

@opoplawski
Author

I don't seem to be seeing this with UCX 1.9.0.
