Assertion failure in ud on the bsend2 test: `status == UCS_OK' failed: send completion with error. #2267

Closed
alinask opened this issue Feb 7, 2018 · 2 comments

alinask commented Feb 7, 2018

The failure reproduces on 8 hercules hosts with ppn=32, but not on every run.
Running the command line below in a loop, it reproduced on the 9th iteration.

Setup:
(Note the MOFED version)

alinas@mngx-orion-01 ~
$pdsh -w clx-hercules-[001-008] ofed_info -s | dshbak -c
----------------
clx-hercules-[001-008]
----------------
MLNX_OFED_LINUX-4.2-1.0.0.0:

alinas@mngx-orion-01 ~
$pdsh -w clx-hercules-[001-008] ibv_devinfo -d mlx5_2 | grep vendor_part_id | dshbak -c
----------------
clx-hercules-[001-008]
----------------
        vendor_part_id:                 4121

alinas@mngx-orion-01 ~
$pdsh -w clx-hercules-[001-008] ibv_devinfo -d mlx5_2 | grep fw_ver | dshbak -c
----------------
clx-hercules-[001-008]
----------------
        fw_ver:                         16.21.1000

alinas@mngx-orion-01 ~
$pdsh -w clx-hercules-[001-008] ibv_devinfo -d mlx5_2 | grep link_layer | dshbak -c
----------------
clx-hercules-[001-008]
----------------
                        link_layer:             InfiniBand

Command line:

/hpc/local/benchmarks/hpcx_install_2018-02-06/hpcx-gcc-redhat7.4/ompi-v3.1.x/bin/mpirun -np 256 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -x UCX_IB_GID_INDEX=0 -mca osc ucx -x UCX_TLS=rc,sm -x UCX_TM_OFFLOAD=y -x UCX_RC_VERBS_TM_ENABLE=y -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20180206_232410_3890_56446_clx-hercules-001/installs/1oDr/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend2

Wed Feb  7 00:51:14 2018[1,167]<stderr>:[clx-hercules-008:25073:0:25073]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,166]<stderr>:[clx-hercules-007:32376:0:32376]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,14]<stderr>:[clx-hercules-007:32343:0:32343]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,13]<stderr>:[clx-hercules-006:18491:0:18491]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,12]<stderr>:[clx-hercules-005:27734:0:27734]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,165]<stderr>:[clx-hercules-006:18522:0:18522]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,10]<stderr>:[clx-hercules-003:23064:0:23064]    ud_iface.c:773  Assertion `status == UCS_OK' failed: send completion with error: Endpoint timeout
Wed Feb  7 00:51:14 2018[1,166]<stderr>:==== backtrace ====
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 0 0x000000000004becb uct_ud_iface_dispatch_async_comps_do()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/uct/ib/ud/base/ud_iface.c:771
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 1 0x000000000004fc03 uct_ud_iface_dispatch_zcopy_comps()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/uct/ib/ud/base/ud_iface.h:440
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 2 0x0000000000017682 ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/ucs/datastruct/callbackq.h:208
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 3 0x0000000000003307 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/pml/ucx/pml_ucx.c:454
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 4 0x00000000000329fc opal_progress()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/opal/runtime/opal_progress.c:228
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 5 0x000000000004a80f ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/runtime/ompi_mpi_init.c:883
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 6 0x0000000000067869 PMPI_Init_thread()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mpi/c/profile/pinit_thread.c:68
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 7 0x00000000004027af MTest_Init_thread()  ???:0
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 8 0x0000000000402b12 MTest_Init()  ???:0
Wed Feb  7 00:51:14 2018[1,166]<stderr>: 9 0x000000000040253b main()  ???:0
Wed Feb  7 00:51:14 2018[1,166]<stderr>:10 0x0000000000021c05 __libc_start_main()  ???:0
Wed Feb  7 00:51:14 2018[1,166]<stderr>:11 0x0000000000402449 _start()  ???:0
Wed Feb  7 00:51:14 2018[1,166]<stderr>:===================
Wed Feb  7 00:51:14 2018[1,166]<stderr>:[clx-hercules-007:32376:0:32376] Process frozen...
Wed Feb  7 00:51:14 2018[1,14]<stderr>:==== backtrace ====
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 0 0x000000000004becb uct_ud_iface_dispatch_async_comps_do()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/uct/ib/ud/base/ud_iface.c:771
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 1 0x000000000004fc03 uct_ud_iface_dispatch_zcopy_comps()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/uct/ib/ud/base/ud_iface.h:440
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 2 0x0000000000017682 ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ucx-v1.3.x/src/ucs/datastruct/callbackq.h:208
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 3 0x0000000000003307 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/pml/ucx/pml_ucx.c:454
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 4 0x00000000000329fc opal_progress()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/opal/runtime/opal_progress.c:228
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 5 0x000000000004a80f ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/runtime/ompi_mpi_init.c:883
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 6 0x0000000000067869 PMPI_Init_thread()  /hpc/local/benchmarks/hpcx_install_2018-02-06/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mpi/c/profile/pinit_thread.c:68
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 7 0x00000000004027af MTest_Init_thread()  ???:0
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 8 0x0000000000402b12 MTest_Init()  ???:0
Wed Feb  7 00:51:14 2018[1,14]<stderr>: 9 0x000000000040253b main()  ???:0
Wed Feb  7 00:51:14 2018[1,14]<stderr>:10 0x0000000000021c05 __libc_start_main()  ???:0
Wed Feb  7 00:51:14 2018[1,14]<stderr>:11 0x0000000000402449 _start()  ???:0
Wed Feb  7 00:51:14 2018[1,14]<stderr>:===================
Wed Feb  7 00:51:14 2018[1,14]<stderr>:[clx-hercules-007:32343:0:32343] Process frozen...

MTT link:
http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20180206_232410_3890_56446_clx-hercules-001/html/test_stdout_TpEL62.txt
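
The reproduction loop mentioned at the top of this report is not shown; a minimal sketch of one way to do it is below. The helper name `repro_loop` and the iteration cap are hypothetical and are not taken from the MTT setup. It reruns a command until the first non-zero exit and reports the iteration number.

```shell
#!/bin/sh
# Hypothetical helper (not part of MTT or UCX): rerun a command until it
# fails or an iteration cap is reached.
# Usage: repro_loop MAX_ITERS command [args...]
repro_loop() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        # Run the command exactly as given; stop at the first non-zero exit.
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
    return 0
}
```

Because the log shows the failing ranks printing "Process frozen..." instead of exiting, a plain loop would likely hang on the failing iteration. Assuming GNU coreutils `timeout` is available, the mpirun command line above could be wrapped as `repro_loop 500 timeout 600 <mpirun command line>`, so a hung run counts as a failure; the 500 and 600 values are arbitrary.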

@alinask alinask added Bug MTT MTT Error labels Feb 7, 2018
@yosefe yosefe added this to the v1.3.0 milestone Feb 7, 2018
@evgeny-leksikov

Not reproducible over 200+ iterations (~3 hours) after the MOFED and FW upgrade:

$pdsh -w clx-hercules-[001-008] ofed_info -s | dshbak -c
----------------
clx-hercules-[001-008]
----------------
MLNX_OFED_LINUX-4.2-1.2.0.0:
$pdsh -w clx-hercules-[001-008] ibv_devinfo -d mlx5_2 | grep vendor_part_id | dshbak -c
----------------
clx-hercules-[001-008]
----------------
        vendor_part_id:                 4121
$pdsh -w clx-hercules-[001-008] ibv_devinfo -d mlx5_2 | grep fw_ver | dshbak -c
----------------
clx-hercules-[001-008]
----------------
        fw_ver:                         16.21.2010

@evgeny-leksikov

Not reproducible during 406 iterations (5.5 hours).
@amaslenn @alinask please check whether it's reproducible in MTT after the MOFED/FW upgrade.
