[mtt] timeout exceeded in mpi-small-tests #2934
Mon Oct 15 21:12:00 2018[1,0]:seed value: 1603678378
PFC was not configured properly. Needs to be checked with
See this one on Jazz as well: http://hpcweb.lab.mtl.com//hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20181102_132321_6230_11161_jazz23/html/test_stdout_XZb7E6.txt
Did you check on all nodes?
Yes, all nodes in my allocation (reproduced manually) had lossless configured. |
I also see a hang.
Configuration:
Cmd:
Output:
Mon Nov 5 19:48:55 2018[1,0]<stdout>:atest: ver. 2.1.210
Mon Nov 5 19:48:55 2018[1,0]<stdout>:
Mon Nov 5 19:48:55 2018[1,0]<stdout>: test_name coll_op datatype op comm num_el root np time (us) slow noise status init_val result expected_result fail_msg
Mon Nov 5 19:48:55 2018[1,0]<stdout>:
Mon Nov 5 19:48:55 2018[1,0]<stdout>:================================================================
Mon Nov 5 19:48:55 2018[1,0]<stdout>:
node=jazz13, pid=168907:
Thread 4 (Thread 0x7ff463215700 (LWP 168920)):
#0 0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007ff4661278a3 in epoll_dispatch (base=0x146c860, tv=<optimized out>) at epoll.c:407
#2 0x00007ff46612b2f0 in opal_libevent2022_event_base_loop (base=0x146c860, flags=flags@entry=1) at event.c:1630
#3 0x00007ff4660e7c1e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4 0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ff460438700 (LWP 168935)):
#0 0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007ff4661278a3 in epoll_dispatch (base=0x14ab8c0, tv=<optimized out>) at epoll.c:407
#2 0x00007ff46612b2f0 in opal_libevent2022_event_base_loop (base=0x14ab8c0, flags=flags@entry=1) at event.c:1630
#3 0x00007ff4621765fe in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4 0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7ff456777700 (LWP 169006)):
#0 0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007ff45c365214 in ucs_async_thread_func (arg=0x151e820) at async/thread.c:93
#2 0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ff46711b740 (LWP 168907)):
#0 0x00007ff45cf4e3f3 in uct_dc_mlx5_iface_progress (arg=<optimized out>) at ib/dc/accel/dc_mlx5.c:721
#1 0x00007ff45d1a5ff2 in ucs_callbackq_dispatch (cbq=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-11-05/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:209
#2 uct_worker_progress (worker=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-11-05/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/api/uct.h:1677
#3 ucp_worker_progress (worker=0x7ff46703b010) at core/ucp_worker.c:1416
#4 0x00007ff45d5dc577 in mca_pml_ucx_progress () at pml_ucx.c:452
#5 0x00007ff4660e1d9c in opal_progress () at runtime/opal_progress.c:231
#6 0x00007ff466c9d805 in ompi_request_wait_completion (req=0x16a47a8) at ../ompi/request/request.h:415
#7 ompi_request_default_wait (req_ptr=0x7ffee946f000, status=0x7ffee946f010) at request/req_wait.c:42
#8 0x00007ff466cd9089 in ompi_coll_base_sendrecv_actual (sendbuf=0x16aac08, scount=2, sdatatype=sdatatype@entry=0x6244e0 <ompi_mpi_short_int>, dest=101, stag=stag@entry=-14, recvbuf=<optimized out>, rcount=1, rdatatype=rdatatype@entry=0x6244e0 <ompi_mpi_short_int>, source=source@entry=159, rtag=rtag@entry=-14, comm=comm@entry=0x6264e0 <ompi_mpi_comm_world>, status=status@entry=0x0) at base/coll_base_util.c:59
#9 0x00007ff466cde5c3 in ompi_coll_base_sendrecv (stag=-14, rtag=-14, status=0x0, myid=270, comm=0x6264e0 <ompi_mpi_comm_world>, source=159, rdatatype=0x6244e0 <ompi_mpi_short_int>, rcount=<optimized out>, recvbuf=<optimized out>, dest=101, sdatatype=0x6244e0 <ompi_mpi_short_int>, scount=<optimized out>, sendbuf=<optimized out>) at base/coll_base_util.h:67
#10 ompi_coll_base_alltoallv_intra_pairwise (sbuf=0x16aa750, scounts=0x16a86f0, sdisps=0x16a9190, sdtype=0x6244e0 <ompi_mpi_short_int>, rbuf=0x1757040, rcounts=0x16a8d20, rdisps=0x16aa2e0, rdtype=0x6244e0 <ompi_mpi_short_int>, comm=0x6264e0 <ompi_mpi_comm_world>, module=0x168e090) at base/coll_base_alltoallv.c:162
#11 0x00007ff466cb1181 in PMPI_Alltoallv (sendbuf=sendbuf@entry=0x16aa750, sendcounts=sendcounts@entry=0x16a86f0, sdispls=sdispls@entry=0x16a9190, sendtype=<optimized out>, recvbuf=recvbuf@entry=0x1757040, recvcounts=recvcounts@entry=0x16a8d20, rdispls=rdispls@entry=0x16aa2e0, recvtype=0x6244e0 <ompi_mpi_short_int>, comm=0x6264e0 <ompi_mpi_comm_world>) at palltoallv.c:129
#12 0x000000000040b226 in test_alltoallv (settings=<optimized out>, status=0x7ffee946f6e8) at tests/alltoallv.c:319
#13 0x000000000040dc18 in test_correctness (settings=settings@entry=0x7ffee946f770, status=status@entry=0x7ffee946f764) at tests/correctness.c:89
#14 0x0000000000402a8f in main (argc=9, argv=0x7ffee946f9e8) at main.c:92
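For reference, frames #8-#11 show the main thread stuck inside ompi_coll_base_alltoallv_intra_pairwise, waiting on a sendrecv of MPI_SHORT_INT elements over MPI_COMM_WORLD. Below is a minimal sketch of that call pattern; it is not the all2all_ibm test itself (its source is not shown in this issue), and the counts, displacements, and buffer contents are illustrative assumptions only:

```c
/* Minimal sketch of the MPI_Alltoallv pattern seen in frames #8-#11 of the
 * hung stack. This is NOT the all2all_ibm test itself (its source is not
 * shown in this issue); counts, displacements and values are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int np, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI_SHORT_INT (ompi_mpi_short_int in the backtrace) is the
     * predefined (short, int) pair datatype. */
    struct short_int { short v; int idx; };

    int *scounts = malloc(np * sizeof(int));
    int *rcounts = malloc(np * sizeof(int));
    int *sdispls = malloc(np * sizeof(int));
    int *rdispls = malloc(np * sizeof(int));
    for (int i = 0; i < np; i++) {
        scounts[i] = rcounts[i] = 2;     /* frame #8 shows scount=2 per peer */
        sdispls[i] = rdispls[i] = 2 * i; /* displacements in datatype units */
    }

    struct short_int *sbuf = malloc(2 * np * sizeof(*sbuf));
    struct short_int *rbuf = malloc(2 * np * sizeof(*rbuf));
    for (int i = 0; i < 2 * np; i++) {
        sbuf[i].v = (short)rank;
        sbuf[i].idx = i;
    }

    /* The reported hang is inside this call: the pairwise algorithm posts
     * matched sendrecv pairs and one of them never completes its wait. */
    MPI_Alltoallv(sbuf, scounts, sdispls, MPI_SHORT_INT,
                  rbuf, rcounts, rdispls, MPI_SHORT_INT, MPI_COMM_WORLD);

    free(sbuf); free(rbuf);
    free(scounts); free(rcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```

Building this with mpicc and launching it with an mpirun line like the one in the configuration below would be one way to attempt a reproduction, keeping in mind that (per the report) the hang does not always reproduce.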
https://redmine.mellanox.com/issues/1550134
http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=0/11154/console
http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=2/11236/consoleFull (reproduced on
http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=0/11238/console (reproduced on
[ RUN ] dc_mlx5/uct_p2p_mix_test_alloc_methods.mix1000_rcache/1
/hpc/local/oss/gcc-8.2.0/include/c++/8.2.0/bits/stl_vector.h: [ uct_test::ent() ]
==== backtrace (tid: 13312) ====
Fixed in FW 16.26.0276.
Configuration:
MTT log: http://e2e-gw.mellanox.com:4080/mnt/lustre/users/mtt/scratch/ucx_ompi/20181005_205511_9823_201836_clx-hercules-036/html/test_stdout_sgzRIm.txt
It doesn't always reproduce.
Cmd:
mpirun -np 1760 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll ^hcoll --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc,sm -x UCX_IB_SL=1 -x UCX_DC_VERBS_TM_ENABLE=y -x UCX_DC_VERBS_TM_MAX_BCOPY=8k -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20181005_205511_9823_201836_clx-hercules-036/installs/N8mV/tests/mpi-small-tests/hpc_tests.git/mpi/misc/all2all_ibm
Output: