[mtt] timeout exceeded in mpi-small-tests #2934

Closed
amaslenn opened this issue Oct 9, 2018 · 15 comments

amaslenn commented Oct 9, 2018

Configuration

MOFED: MLNX_OFED_LINUX-4.4-2.0.7.0
OMPI: 4.0.0rc4
Nodes: hercules x55 (ppn=32(x55), nodelist=clx-hercules-[036,043-096])
Job: ucx-hwtm-dc

MTT log: http://e2e-gw.mellanox.com:4080/mnt/lustre/users/mtt/scratch/ucx_ompi/20181005_205511_9823_201836_clx-hercules-036/html/test_stdout_sgzRIm.txt

It doesn't always reproduce.

Cmd:
mpirun -np 1760 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll ^hcoll --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc,sm -x UCX_IB_SL=1 -x UCX_DC_VERBS_TM_ENABLE=y -x UCX_DC_VERBS_TM_MAX_BCOPY=8k -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20181005_205511_9823_201836_clx-hercules-036/installs/N8mV/tests/mpi-small-tests/hpc_tests.git/mpi/misc/all2all_ibm

Output:

node=clx-hercules-096, pid=15531:
Thread 4 (Thread 0x7ffff3e8c700 (LWP 15571)):
#0  0x00007ffff72fd113 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6cb28f3 in epoll_dispatch (base=0x669030, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6cb6340 in opal_libevent2022_event_base_loop (base=0x669030, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff6c72c6e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff75d2dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff72fcb3d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ffff10af700 (LWP 15613)):
#0  0x00007ffff72fd113 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6cb28f3 in epoll_dispatch (base=0x6a8320, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6cb6340 in opal_libevent2022_event_base_loop (base=0x6a8320, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff2ff35fe in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff75d2dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff72fcb3d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffe34ca700 (LWP 15759)):
#0  0x00007ffff72fd113 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007fffe8f33294 in ucs_async_thread_func (arg=0x855320) at async/thread.c:93
#2  0x00007ffff75d2dd5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff72fcb3d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fc9740 (LWP 15531)):
#0  0x00007ffff75d7483 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffe9684064 in mlx5_poll_cq_1 () from /usr/lib64/libmlx5.so.1
#2  0x00007fffe9b169f6 in ibv_poll_cq (wc=0x7fffffffae10, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458
#3  uct_ib_poll_cq (wcs=0x7fffffffae10, count=<synthetic pointer>, cq=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-10-05/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/base/ib_device.h:311
#4  uct_dc_verbs_poll_tx (iface=0x7f3ff0) at ib/dc/verbs/dc_verbs.c:720
#5  uct_dc_verbs_iface_progress_tm (arg=0x7f3ff0) at ib/dc/verbs/dc_verbs.c:960
#6  0x00007fffe9d6ea72 in ucs_callbackq_dispatch (cbq=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-10-05/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:208
#7  uct_worker_progress (worker=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-10-05/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/api/uct.h:1675
#8  ucp_worker_progress (worker=0x79b510) at core/ucp_worker.c:1390
#9  0x00007fffea1a5457 in mca_pml_ucx_progress () at pml_ucx.c:412
#10 0x00007ffff6c6cdec in opal_progress () at runtime/opal_progress.c:231
#11 0x00007ffff7832865 in ompi_request_wait_completion (req=0x8a2668) at ../ompi/request/request.h:415
#12 ompi_request_default_wait (req_ptr=0x7fffffffb320, status=0x7fffffffb330) at request/req_wait.c:42
#13 0x00007ffff786e0e9 in ompi_coll_base_sendrecv_actual (sendbuf=0xc5a71c, scount=58, sdatatype=sdatatype@entry=0x6020e0 <ompi_mpi_int>, dest=1392, stag=stag@entry=-14, recvbuf=<optimized out>, rcount=1, rdatatype=rdatatype@entry=0x6020e0 <ompi_mpi_int>, source=source@entry=476, rtag=rtag@entry=-14, comm=comm@entry=0x6022e0 <ompi_mpi_comm_world>, status=status@entry=0x0) at base/coll_base_util.c:59
#14 0x00007ffff7873623 in ompi_coll_base_sendrecv (stag=-14, rtag=-14, status=0x0, myid=54, comm=0x6022e0 <ompi_mpi_comm_world>, source=476, rdatatype=0x6020e0 <ompi_mpi_int>, rcount=<optimized out>, recvbuf=<optimized out>, dest=1392, sdatatype=0x6020e0 <ompi_mpi_int>, scount=<optimized out>, sendbuf=<optimized out>) at base/coll_base_util.h:67
#15 ompi_coll_base_alltoallv_intra_pairwise (sbuf=0xc3c470, scounts=0x8bcec0, sdisps=0x8c05e0, sdtype=0x6020e0 <ompi_mpi_int>, rbuf=0xcff2c0, rcounts=0x8bea50, rdisps=0x8b6be0, rdtype=0x6020e0 <ompi_mpi_int>, comm=0x6022e0 <ompi_mpi_comm_world>, module=0x891020) at base/coll_base_alltoallv.c:162
#16 0x00007ffff78461e1 in PMPI_Alltoallv (sendbuf=<optimized out>, sendcounts=<optimized out>, sdispls=<optimized out>, sendtype=<optimized out>, recvbuf=<optimized out>, recvcounts=<optimized out>, rdispls=0x8b6be0, recvtype=0x6020e0 <ompi_mpi_int>, comm=0x6022e0 <ompi_mpi_comm_world>) at palltoallv.c:129
#17 0x000000000040149d in a2av (iter=10761) at all2all_ibm.c:170
#18 0x00000000004016de in main () at all2all_ibm.c:218
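
Frames #13 to #15 show the rank blocked in a single send/receive of the pairwise alltoallv exchange. As a rough aid for reading the trace, the sketch below recovers which pairwise step the rank was on from myid, dest, and source. It assumes the usual pairwise schedule, in which at step k a rank sends to (rank + k) % size and receives from (rank - k) % size. That schedule and the helper name pairwise_step are assumptions made for illustration, and the communicator size is taken to be the -np 1760 from the command above.

```python
def pairwise_step(rank: int, dest: int, source: int, size: int) -> int:
    """Return the step k with dest == (rank + k) % size and
    source == (rank - k) % size, or raise if the peers do not fit
    the assumed pairwise schedule."""
    step = (dest - rank) % size
    if (rank - step) % size != source:
        raise ValueError("dest/source do not fit a pairwise schedule")
    return step


# Values from frame #14 of the trace above (myid=54, dest=1392,
# source=476); the size is assumed to equal -np 1760.
print(pairwise_step(rank=54, dest=1392, source=476, size=1760))  # 1338
```

If the assumption holds, the rank was about three quarters of the way through the exchange when it stopped making progress.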
amaslenn added Bug MTT MTT Error labels Oct 9, 2018
yosefe added this to the v1.5.0 milestone Oct 15, 2018
brminich commented

=============================================================
Mon Oct 15 21:12:00 2018[1,0]:seed value: 1603678378
Mon Oct 15 21:12:00 2018[1,0]:iter 0
Mon Oct 15 21:12:02 2018[1,0]:iter 1

brminich commented Nov 1, 2018

PFC was not configured properly. Next time, check it with the
/hpc/local/bin/lossless_roce_hca.sh script and make sure that the following line is shown:
Priority trust state is cofigured to Lossless: dscp
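
A check like this is easy to miss on one node out of many, so it can help to script it. The sketch below only looks for the line quoted above in text captured from the script. How the output is collected from each node (ssh, pdsh, a job step) is left out, and the helper name and the sample strings are hypothetical. The spelling of the expected line is kept exactly as quoted.

```python
EXPECTED = "Priority trust state is cofigured to Lossless: dscp"


def nodes_missing_lossless(outputs: dict) -> list:
    """Given {node: captured script output}, return the sorted list of
    nodes whose output does not contain the expected line."""
    return sorted(
        node for node, text in outputs.items()
        if EXPECTED not in text
    )


# Hypothetical captured outputs, for illustration only.
sample = {
    "node-a": "...\n" + EXPECTED + "\n",
    "node-b": "...\n",
}
print(nodes_missing_lossless(sample))  # ['node-b']
```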

yosefe closed this as completed Nov 5, 2018

amaslenn commented Nov 5, 2018

I see this one on Jazz as well: http://hpcweb.lab.mtl.com//hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20181102_132321_6230_11161_jazz23/html/test_stdout_XZb7E6.txt
I also reproduced it manually: the second run hung on iter 412 (seed value: 728689391).
According to the script, Jazz is configured as lossless.

amaslenn reopened this Nov 5, 2018

brminich commented Nov 6, 2018

Did you check on all nodes?

amaslenn commented Nov 6, 2018

Yes, all nodes in the allocation I used for the manual reproduction had lossless configured.

amaslenn commented Nov 6, 2018

I also see a hang in atest. The test usually takes less than 5 minutes, but sometimes it does not finish even after 20+ minutes.

Configuration

OMPI: 4.0.0rc5
MOFED: MLNX_OFED_LINUX-4.4-2.0.7.0
Module: hpcx-gcc (2018-11-05)
Test module: mtt-tests/hpcx-gcc
Nodes: jazz x10 (ppn=28(x10), nodelist=jazz[13-15,22-23,26-29,31])

MTT log: http://hpcweb.lab.mtl.com//hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20181105_191717_143621_11334_jazz13/html/test_stdout_zscHGH.txt

Cmd:
mpirun -np 280 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_1:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_IB_ETH_PAUSE_ON=y -x UCX_TLS=all -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 -x UCX_IB_TRAFFIC_CLASS=106 -x UCX_IB_GID_INDEX=auto --map-by node --mca pmix_server_max_wait 8 /hpc/local/benchmarks/hpcx_install_2018-11-05/mtt-tests-gcc/installs/FKXc/tests/atest/atest.git/src/atest --nslow 500 --slow-sleep 3 -o 2 --test-cross 0
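
The two failing runs differ in their UCX settings, which is easier to see with the -x options pulled out of the command lines. A minimal sketch, assuming only that mpirun's -x NAME=VALUE exports a variable to the ranks. The helper name ucx_env is made up, and the command strings are abbreviated to the -x options copied from the two Cmd lines in this issue.

```python
import shlex


def ucx_env(cmd: str) -> dict:
    """Collect NAME=VALUE pairs passed to mpirun with -x."""
    tokens = shlex.split(cmd)
    env = {}
    for flag, arg in zip(tokens, tokens[1:]):
        if flag == "-x" and "=" in arg:
            name, _, value = arg.partition("=")
            env[name] = value
    return env


# Abbreviated: only the -x options of each command are kept.
hercules = ucx_env(
    "mpirun -np 1760 -x UCX_NET_DEVICES=mlx5_2:1 "
    "-x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc,sm "
    "-x UCX_IB_SL=1 -x UCX_DC_VERBS_TM_ENABLE=y "
    "-x UCX_DC_VERBS_TM_MAX_BCOPY=8k"
)
jazz = ucx_env(
    "mpirun -np 280 -x UCX_NET_DEVICES=mlx5_1:1 "
    "-x UCX_IB_REG_METHODS=rcache,direct -x UCX_IB_ETH_PAUSE_ON=y "
    "-x UCX_TLS=all -x UCX_IB_TRAFFIC_CLASS=106 "
    "-x UCX_IB_GID_INDEX=auto"
)

# Print every variable that differs or is set in only one of the runs.
for name in sorted(hercules.keys() | jazz.keys()):
    if hercules.get(name) != jazz.get(name):
        print(name, hercules.get(name), jazz.get(name))
```

Only UCX_IB_REG_METHODS is identical in both runs; the transport selection (UCX_TLS) and the device differ.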

Output:

Mon Nov  5 19:48:55 2018[1,0]<stdout>:atest: ver. 2.1.210
Mon Nov  5 19:48:55 2018[1,0]<stdout>:
Mon Nov  5 19:48:55 2018[1,0]<stdout>:        test_name   coll_op            datatype        op                          comm    num_el      root        np    time (us)      slow     noise    status       init_val                        result               expected_result                                                                                             fail_msg
Mon Nov  5 19:48:55 2018[1,0]<stdout>:
Mon Nov  5 19:48:55 2018[1,0]<stdout>:================================================================================================================================================================================================================================================================================================================================================
Mon Nov  5 19:48:55 2018[1,0]<stdout>:
node=jazz13, pid=168907:
Thread 4 (Thread 0x7ff463215700 (LWP 168920)):
#0  0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ff4661278a3 in epoll_dispatch (base=0x146c860, tv=<optimized out>) at epoll.c:407
#2  0x00007ff46612b2f0 in opal_libevent2022_event_base_loop (base=0x146c860, flags=flags@entry=1) at event.c:1630
#3  0x00007ff4660e7c1e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ff460438700 (LWP 168935)):
#0  0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ff4661278a3 in epoll_dispatch (base=0x14ab8c0, tv=<optimized out>) at epoll.c:407
#2  0x00007ff46612b2f0 in opal_libevent2022_event_base_loop (base=0x14ab8c0, flags=flags@entry=1) at event.c:1630
#3  0x00007ff4621765fe in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7ff456777700 (LWP 169006)):
#0  0x00007ff46676b923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ff45c365214 in ucs_async_thread_func (arg=0x151e820) at async/thread.c:93
#2  0x00007ff466a3de25 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ff46676b34d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ff46711b740 (LWP 168907)):
#0  0x00007ff45cf4e3f3 in uct_dc_mlx5_iface_progress (arg=<optimized out>) at ib/dc/accel/dc_mlx5.c:721
#1  0x00007ff45d1a5ff2 in ucs_callbackq_dispatch (cbq=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-11-05/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:209
#2  uct_worker_progress (worker=<optimized out>) at /hpc/local/benchmarks/hpcx_install_2018-11-05/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/api/uct.h:1677
#3  ucp_worker_progress (worker=0x7ff46703b010) at core/ucp_worker.c:1416
#4  0x00007ff45d5dc577 in mca_pml_ucx_progress () at pml_ucx.c:452
#5  0x00007ff4660e1d9c in opal_progress () at runtime/opal_progress.c:231
#6  0x00007ff466c9d805 in ompi_request_wait_completion (req=0x16a47a8) at ../ompi/request/request.h:415
#7  ompi_request_default_wait (req_ptr=0x7ffee946f000, status=0x7ffee946f010) at request/req_wait.c:42
#8  0x00007ff466cd9089 in ompi_coll_base_sendrecv_actual (sendbuf=0x16aac08, scount=2, sdatatype=sdatatype@entry=0x6244e0 <ompi_mpi_short_int>, dest=101, stag=stag@entry=-14, recvbuf=<optimized out>, rcount=1, rdatatype=rdatatype@entry=0x6244e0 <ompi_mpi_short_int>, source=source@entry=159, rtag=rtag@entry=-14, comm=comm@entry=0x6264e0 <ompi_mpi_comm_world>, status=status@entry=0x0) at base/coll_base_util.c:59
#9  0x00007ff466cde5c3 in ompi_coll_base_sendrecv (stag=-14, rtag=-14, status=0x0, myid=270, comm=0x6264e0 <ompi_mpi_comm_world>, source=159, rdatatype=0x6244e0 <ompi_mpi_short_int>, rcount=<optimized out>, recvbuf=<optimized out>, dest=101, sdatatype=0x6244e0 <ompi_mpi_short_int>, scount=<optimized out>, sendbuf=<optimized out>) at base/coll_base_util.h:67
#10 ompi_coll_base_alltoallv_intra_pairwise (sbuf=0x16aa750, scounts=0x16a86f0, sdisps=0x16a9190, sdtype=0x6244e0 <ompi_mpi_short_int>, rbuf=0x1757040, rcounts=0x16a8d20, rdisps=0x16aa2e0, rdtype=0x6244e0 <ompi_mpi_short_int>, comm=0x6264e0 <ompi_mpi_comm_world>, module=0x168e090) at base/coll_base_alltoallv.c:162
#11 0x00007ff466cb1181 in PMPI_Alltoallv (sendbuf=sendbuf@entry=0x16aa750, sendcounts=sendcounts@entry=0x16a86f0, sdispls=sdispls@entry=0x16a9190, sendtype=<optimized out>, recvbuf=recvbuf@entry=0x1757040, recvcounts=recvcounts@entry=0x16a8d20, rdispls=rdispls@entry=0x16aa2e0, recvtype=0x6244e0 <ompi_mpi_short_int>, comm=0x6264e0 <ompi_mpi_comm_world>) at palltoallv.c:129
#12 0x000000000040b226 in test_alltoallv (settings=<optimized out>, status=0x7ffee946f6e8) at tests/alltoallv.c:319
#13 0x000000000040dc18 in test_correctness (settings=settings@entry=0x7ffee946f770, status=status@entry=0x7ffee946f764) at tests/correctness.c:89
#14 0x0000000000402a8f in main (argc=9, argv=0x7ffee946f9e8) at main.c:92

brminich commented

https://redmine.mellanox.com/issues/1550134
(Mellanox internal link)

dmitrygx commented

16:20:25 [----------] 1 test from dc_mlx5/test_uct_perf
16:20:25 [ RUN      ] dc_mlx5/test_uct_perf.envelope/0
16:20:25 [     INFO ] mlx5_3:1                am latency : 1.130 usec
16:35:25 /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test_helpers.cc:46: Failure
16:35:25 Failed
16:35:25 Connection timed out - abort testing
16:35:25 [hpc-test-node4:14756:0:14756] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
16:35:26 ==== backtrace (tid:  14756) ====
16:35:26  0 0x0000000000008ef7 pthread_join()  ???:0
16:35:26  1 0x000000000055fa43 test_perf::run_multi_threaded()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test_perf.cc:244
16:35:26  2 0x00000000005600de test_perf::run_test()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test_perf.cc:270
16:35:26  3 0x000000000061f69c test_uct_perf_envelope_Test::test_body()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/uct/test_uct_perf.cc:171
16:35:26  4 0x00000000005619c6 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test.cc:276
16:35:26  5 0x00000000005619c6 ucs::test_base::TestBodyProxy()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test.cc:302
16:35:26  6 0x000000000054c503 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3562
16:35:26  7 0x000000000054097d testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3635
16:35:26  8 0x0000000000540a4c testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3812
16:35:26  9 0x0000000000540baf testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3930
16:35:26 10 0x0000000000545547 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:5802
16:35:26 11 0x000000000054584b testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:5719
16:35:26 12 0x00000000004ec853 RUN_ALL_TESTS()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest.h:20059
16:35:26 13 0x00000000004ec853 main()  /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/main.cc:101
16:35:26 14 0x0000000000021b15 __libc_start_main()  ???:0
16:35:26 15 0x000000000052b105 _start()  ???:0
16:35:26 =================================

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=0/11154/console
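
The console timestamps show how long the test sat without output before the harness gave up. A small sketch of that arithmetic, assuming both timestamps fall on the same day; the helper name stall_seconds is hypothetical.

```python
from datetime import datetime


def stall_seconds(last_output: str, failure: str) -> int:
    """Seconds between two HH:MM:SS console timestamps (same day assumed)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(failure, fmt) - datetime.strptime(last_output, fmt)
    return int(delta.total_seconds())


# Last INFO line at 16:20:25, "Connection timed out" at 16:35:25.
print(stall_seconds("16:20:25", "16:35:25"))  # 900
```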

dmitrygx commented

21:12:10 [----------] 1 test from dc_mlx5/test_uct_perf
21:12:10 [ RUN      ] dc_mlx5/test_uct_perf.envelope/0
21:12:11 [     INFO ] mlx5_3:1                am latency : 2.787 usec
21:12:12 [     INFO ] mlx5_3:1                   am rate : 2.384 Mpps
21:12:12 [     INFO ] mlx5_3:1                 am rate64 : 3.880 Mpps
21:12:13 [     INFO ] mlx5_3:1          am bcopy latency : 1.876 usec
21:12:13 [     INFO ] mlx5_3:1               am bcopy bw : 2295.355 MB/sec
21:12:13 [     INFO ] mlx5_3:1               am zcopy bw : 2193.969 MB/sec
21:27:10 /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test_helpers.cc:46: Failure
21:27:10 Failed
21:27:10 Connection timed out - abort testing
21:27:10 [hpc-test-node4:20707:0:20707] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
21:27:11 ==== backtrace (tid:  20707) ====
21:27:11  0 0x0000000000008ef7 pthread_join()  ???:0
21:27:11  1 0x000000000055fad3 test_perf::run_multi_threaded()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test_perf.cc:244
21:27:11  2 0x000000000056016e test_perf::run_test()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test_perf.cc:270
21:27:11  3 0x000000000061f72c test_uct_perf_envelope_Test::test_body()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/uct/test_uct_perf.cc:171
21:27:11  4 0x0000000000561a56 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test.cc:276
21:27:11  5 0x0000000000561a56 ucs::test_base::TestBodyProxy()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test.cc:302
21:27:11  6 0x000000000054c593 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3562
21:27:11  7 0x0000000000540a0d testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3635
21:27:11  8 0x0000000000540adc testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3812
21:27:11  9 0x0000000000540c3f testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3930
21:27:11 10 0x00000000005455d7 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:5802
21:27:11 11 0x00000000005458db testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:5719
21:27:11 12 0x00000000004ec8e3 RUN_ALL_TESTS()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest.h:20059
21:27:11 13 0x00000000004ec8e3 main()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/main.cc:101
21:27:11 14 0x0000000000021b15 __libc_start_main()  ???:0
21:27:11 15 0x000000000052b195 _start()  ???:0
21:27:11 =================================

dmitrygx commented

12:40:38 [ RUN      ] dcx/test_ucp_atomic64.atomic_fxor_nb/0
12:55:31 /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test_helpers.cc:46: Failure
12:55:31 Failed
12:55:31 Connection timed out - abort testing
12:55:31 [hpc-test-node4:17611:0:17611] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
12:55:31 ==== backtrace (tid:  17611) ====
12:55:31  0 0x00000000000db057 __GI___sched_yield()  :0
12:55:31  1 0x0000000000789f5b ucp_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/ucp_test.cc:148
12:55:31  2 0x0000000000789f5b ucp_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/ucp_test.cc:144
12:55:31  3 0x0000000000789f5b ucp_test::wait()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/ucp_test.cc:204
12:55:31  4 0x00000000006b67fd test_ucp_atomic::nb_fetch<unsigned long, (ucp_atomic_fetch_op_t)5>()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/test_ucp_atomic.cc:145
12:55:31  5 0x00000000006badac test_ucp_memheap::test_blocking_xfer()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/test_ucp_memheap.cc:230
12:55:31  6 0x00000000006aba37 test<long unsigned int, void (test_ucp_atomic::*)(ucp_test_base::entity*, long unsigned int, void*, ucp_rkey*, std::basic_string<char>&)>()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/ucp/test_ucp_atomic.cc:188
12:55:31  7 0x0000000000561e26 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test.cc:276
12:55:31  8 0x0000000000561e26 ucs::test_base::TestBodyProxy()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/test.cc:302
12:55:31  9 0x000000000054c963 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3562
12:55:31 10 0x0000000000540ddd testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3635
12:55:31 11 0x0000000000540eac testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3812
12:55:31 12 0x000000000054100f testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:3930
12:55:31 13 0x00000000005459a7 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:5802
12:55:31 14 0x0000000000545cab testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest-all.cc:5719
12:55:31 15 0x00000000004ecc63 RUN_ALL_TESTS()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/gtest.h:20059
12:55:31 16 0x00000000004ecc63 main()  /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-test-node-new/worker/2/contrib/../test/gtest/common/main.cc:101
12:55:31 17 0x0000000000021b15 __libc_start_main()  ???:0
12:55:31 18 0x000000000052b565 _start()  ???:0

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=2/11236/consoleFull (reproduced on hpc-test-node4 node)

dmitrygx commented

13:18:34 [ RUN      ] dcx/test_ucp_tag_xfer.generic_unexp/1
13:18:34 [     INFO ] 0 50x10^1 50x10^2 /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test_helpers.cc:46: Failure
13:33:06 Failed
13:33:06 Connection timed out - abort testing
13:33:06 [hpc-test-node4:7648 :0:7648] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
13:33:07 ==== backtrace (tid:   7648) ====
13:33:07  0 0x00000000000db057 __GI___sched_yield()  :0
13:33:07  1 0x000000000078827d ucp_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/ucp_test.cc:148
13:33:07  2 0x0000000000744630 test_ucp_tag::wait()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/test_ucp_tag.cc:120
13:33:07  3 0x0000000000729bc9 test_ucp_tag_xfer::do_xfer()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:486
13:33:07  4 0x000000000072a77f test_ucp_tag_xfer::test_xfer_generic()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:368
13:33:07  5 0x000000000072761b test_ucp_tag_xfer::test_xfer()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:181
13:33:07  6 0x0000000000727748 test_ucp_tag_xfer_generic_unexp_Test::test_body()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:577
13:33:07  7 0x0000000000561c06 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test.cc:276
13:33:07  8 0x0000000000561c06 ucs::test_base::TestBodyProxy()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/test.cc:302
13:33:07  9 0x000000000054c743 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3562
13:33:07 10 0x0000000000540bbd testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3635
13:33:07 11 0x0000000000540c8c testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3812
13:33:07 12 0x0000000000540def testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:3930
13:33:07 13 0x0000000000545787 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:5802
13:33:07 14 0x0000000000545a8b testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest-all.cc:5719
13:33:07 15 0x00000000004ec9e3 RUN_ALL_TESTS()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/gtest.h:20059
13:33:07 16 0x00000000004ec9e3 main()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node-new/worker/0/contrib/../test/gtest/common/main.cc:101
13:33:07 17 0x0000000000021b15 __libc_start_main()  ???:0
13:33:07 18 0x000000000052b345 _start()  ???:0
13:33:07 =================================

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-new,worker=0/11238/console (reproduced on hpc-test-node4 node)

dmitrygx commented

[ RUN ] dc_mlx5/uct_p2p_mix_test_alloc_methods.mix1000_rcache/1
/scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/test_helpers.cc:46: Failure
Failed
Connection timed out - abort testing
[hpc-test-node4:13312:0:13312] Caught signal 6 (Aborted: tkill(2) or tgkill(2))

/hpc/local/oss/gcc-8.2.0/include/c++/8.2.0/bits/stl_vector.h: [ uct_test::ent() ]
...
957 {
958 if (__n >= this->size())
959 __throw_out_of_range_fmt(__N("vector::_M_range_check: __n "
==> 960 "(which is %zu) >= this->size() "
961 "(which is %zu)"),
962 __n, this->size());
963 }

==== backtrace (tid: 13312) ====
0 0x000000000062fc30 uct_test::ent() /hpc/local/oss/gcc-8.2.0/include/c++/8.2.0/bits/stl_vector.h:960
1 0x00000000005f9990 uct_p2p_mix_test::random_op() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/uct/test_p2p_mix.cc:132
2 0x00000000005f9d86 uct_p2p_mix_test::run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/uct/test_p2p_mix.cc:156
3 0x0000000000559182 ucs::test_base::run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/test.cc:276
4 0x000000000055927d ucs::test_base::TestBodyProxy() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/test.cc:302
5 0x0000000000543e0a testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3562
6 0x000000000053bef0 testing::Test::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3634
7 0x000000000053bef0 testing::Test::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3641
8 0x000000000053c255 testing::Test::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3626
9 0x000000000053c255 testing::TestInfo::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3812
10 0x000000000053c38d testing::TestInfo::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3929
11 0x000000000053c38d testing::TestCase::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:3930
12 0x000000000053cdcd testing::TestCase::Run() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:5800
13 0x000000000053cdcd testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:5802
14 0x000000000053d045 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest-all.cc:5719
15 0x0000000000525066 RUN_ALL_TESTS() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/gtest.h:20059
16 0x0000000000525066 main() /scrap/jenkins/workspace/hpc-ucx-pr-4/label/hpc-test-node-new/worker/1/contrib/../test/gtest/common/main.cc:101
17 0x0000000000021b15 __libc_start_main() ???:0
18 0x0000000000525c29 _start() ???:0

yosefe commented Sep 5, 2019

Fixed in FW 16.26.0276
