
[mtt] ERROR Failed to allocate memory pool chunk: Out of memory #2670

Closed
amaslenn opened this issue Jun 7, 2018 · 8 comments
Labels: Bug, MTT, MTT Error


amaslenn commented Jun 7, 2018

Configuration:

OMPI: 3.1.1a1
MOFED: MLNX_OFED_LINUX-4.3-1.0.1.0
Nodes: orion x89 (ppn=28(x89), nodelist=clx-orion-[005,012-013,019-020,022-023,025-035,037-045,047-079,081-086,088-094,096-111])
Job: ucx-hwtm-dc

MTT log: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/html/test_stdout_1x6yiO.txt

The issue wasn't reproduced manually. It may be useful while we chase OOM issues.

Cmd:
mpirun -np 2492 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_4:1 -x UCX_IB_ADDR_TYPE=ib_global -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,sm -x UCX_IB_SL=1 -x UCX_DC_MLX5_TM_ENABLE=y -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/installs/IYeL/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'All,^One-sided,^IO,^Environment' -n 300

Output:

Sat Jun  2 09:45:46 2018[1,2076]<stdout>:[1527921946.615007] [clx-orion-049:8972 :0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool chunk: Out of memory
Sat Jun  2 09:45:46 2018[1,118]<stdout>:[1527921946.615327] [clx-orion-049:8950 :0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool chunk: Out of memory
Sat Jun  2 09:45:46 2018[1,118]<stdout>:[1527921946.615723] [clx-orion-049:8950 :0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool chunk: Out of memory
Sat Jun  2 09:46:13 2018[1,2332]<stderr>:[clx-orion-037:8665 :0:8665] ib_mlx5_log.c:113  FATAL: Error on QP 0x19c31 wqe[57014]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:[clx-orion-037:8664 :0:8664] ib_mlx5_log.c:113  FATAL: Error on QP 0x19ca5 wqe[32252]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,168]<stderr>:[clx-orion-102:13732:0:13732] ib_mlx5_log.c:113  FATAL: Error on QP 0x17bae wqe[9231]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,79]<stderr>:[clx-orion-102:13730:0:13730] ib_mlx5_log.c:113  FATAL: Error on QP 0x17bb2 wqe[4458]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,218]<stderr>:[clx-orion-060:8095 :0:8095] ib_mlx5_log.c:113  FATAL: Error on QP 0x19657 wqe[13337]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,485]<stderr>:[clx-orion-060:8098 :0:8098] ib_mlx5_log.c:113  FATAL: Error on QP 0x195ae wqe[21316]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,396]<stderr>:[clx-orion-060:8097 :0:8097] ib_mlx5_log.c:113  FATAL: Error on QP 0x195ee wqe[25280]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,129]<stderr>:[clx-orion-060:8094 :0:8094] ib_mlx5_log.c:113  FATAL: Error on QP 0x19608 wqe[44838]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,307]<stderr>:[clx-orion-060:8096 :0:8096] ib_mlx5_log.c:113  FATAL: Error on QP 0x194df wqe[51666]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:==== backtrace ====
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 0 0x000000000002a16b uct_ib_mlx5_completion_with_err()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/mlx5/ib_mlx5_log.c:111
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 1 0x00000000000447c6 uct_ib_mlx5_poll_cq()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/mlx5/ib_mlx5.inl:32
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 2 0x00000000000179b2 ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:208
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 3 0x0000000000003347 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/pml/ucx/pml_ucx.c:457
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 4 0x0000000000032a5c opal_progress()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/opal/runtime/opal_progress.c:228
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 5 0x0000000000004c54 mca_pml_ucx_send_nbr()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/pml/ucx/pml_ucx.c:765
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 6 0x000000000007ca15 ompi_coll_base_bcast_intra_split_bintree()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/coll/base/coll_base_bcast.c:484
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 7 0x000000000000494c ompi_coll_tuned_bcast_intra_dec_fixed()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:265
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 8 0x000000000005c251 PMPI_Bcast()  /hpc/local/benchmarks/hpcx_install_2018-06-01/src/hpcx-gcc-redhat7.4/ompi/ompi/mpi/c/profile/pbcast.c:111
Sat Jun  2 09:46:13 2018[1,2243]<stderr>: 9 0x0000000000406ae9 tst_coll_bcast_run()  /mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/installs/IYeL/tests/mpi-test-suite/ompi-tests/mpi_test_suite/coll/tst_coll_bcast.c:72
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:10 0x0000000000436af4 tst_test_run_func()  /mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/installs/IYeL/tests/mpi-test-suite/ompi-tests/mpi_test_suite/tst_tests.c:1461
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:11 0x0000000000412e91 main()  /mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/installs/IYeL/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite.c:636
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:12 0x0000000000021c05 __libc_start_main()  ???:0
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:13 0x0000000000403e39 _start()  ???:0
Sat Jun  2 09:46:13 2018[1,2243]<stderr>:===================
Sat Jun  2 09:46:13 2018[1,2332]<stderr>:==== backtrace ====

Full log: ucx-orion-oom.txt

@yosefe yosefe added the Bug label Jun 10, 2018
@yosefe yosefe added the MTT MTT Error label Jun 10, 2018

alinask commented Jun 10, 2018

Reproduced manually with the following:

Sun Jun 10 17:48:47 2018[1,488]<stdout>:[1528642126.768941] [clx-orion-045:30808:0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool chunk: Out of memory
Sun Jun 10 17:48:47 2018[1,488]<stdout>:[1528642126.768955] [clx-orion-045:30808:0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool chunk: Out of memory
Sun Jun 10 17:48:47 2018[1,803]<stderr>:[clx-orion-004:30834:0:30834] malloc_hook.c:142  Assertion `hash_extra_status >= 0' failed
Sun Jun 10 17:48:47 2018[1,803]<stderr>:[clx-orion-004:30834:0:30834] mmap(NULL, 4096, READ|WRITE, PRIVATE|ANON) failed: Cannot allocate memory
Sun Jun 10 17:48:48 2018[1,2381]<stderr>:[clx-orion-069:26120:0:26120] malloc_hook.c:142  Assertion `hash_extra_status >= 0' failed
Sun Jun 10 17:48:48 2018[1,2381]<stderr>:[clx-orion-069:26120:0:26120] mmap(NULL, 4096, READ|WRITE, PRIVATE|ANON) failed: Cannot allocate memory


yosefe commented Jun 10, 2018

@alinask I think it's worth adding the mpool name to the error message in mpool.c:176.


alinask commented Jun 14, 2018

mpool.c:176 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Out of memory


alinask commented Jun 17, 2018

Probably due to HW TM being enabled along with the increased BCOPY threshold:

#
# Maximal size of copy-out sends when tag-matching offload is enabled
#
# Syntax: memory units: <number>[b|kb|mb|gb], "inf", or "auto"
#
UCX_DC_MLX5_TM_MAX_BCOPY=48k
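
The memory-units syntax described in that config comment can be sketched with a tiny parser (a hypothetical illustration, not UCX's actual implementation; single-letter shorthands like the "48k" above are accepted too):

```python
import re

# Hypothetical parser for the "<number>[b|kb|mb|gb]" memory-units syntax
# quoted above. This is only an illustration of the value format,
# not UCX's real config parser.
_UNITS = {"b": 1, "k": 1024, "kb": 1024,
          "m": 1024**2, "mb": 1024**2,
          "g": 1024**3, "gb": 1024**3}

def parse_memunits(value):
    """Return a byte count, or the strings "inf"/"auto" unchanged."""
    value = value.strip().lower()
    if value in ("inf", "auto"):
        return value
    m = re.fullmatch(r"(\d+)\s*(b|kb?|mb?|gb?)?", value)
    if m is None:
        raise ValueError("invalid memory units: %r" % value)
    number, unit = m.groups()
    return int(number) * _UNITS[unit or "b"]

print(parse_memunits("48k"))   # the TM_MAX_BCOPY value above -> 49152
```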

amaslenn commented:

Found in dmesg on one of the hosts after the failure:

[Jun18 00:35] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[  +1.782955] prolog (30529): drop_caches: 3
[Jun18 00:47] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30122): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.008773] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30109): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.001393] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30111): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.003357] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30125): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.001646] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30127): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.011469] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30116): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.124870] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30132): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.001839] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30157): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.005886] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30149): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.004033] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30152): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.000543] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30165): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.003805] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30170): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.002034] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30172): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.014786] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30177): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.018667] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30174): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.004447] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30185): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[  +0.005452] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30180): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[Jun18 00:57] sysctl (30885): drop_caches: 1


alinask commented Jun 24, 2018

Printed the chunk size as well:
mpool.c:177 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Out of memory. chunk_size 222006488
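
For scale, a back-of-the-envelope check (my arithmetic, not an exact accounting of UCX's allocations): the reported chunk is about 212 MiB per process, so with ppn=28 from the job configuration above, a single such chunk per rank already approaches 6 GiB per node:

```python
chunk_size = 222_006_488   # bytes, from the mpool error message above
ppn = 28                   # processes per node, from the job configuration

per_proc_mib = chunk_size / 2**20
per_node_gib = chunk_size * ppn / 2**30
print(f"{per_proc_mib:.1f} MiB per process, {per_node_gib:.1f} GiB per node")
# -> 211.7 MiB per process, 5.8 GiB per node
```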


alinask commented Jun 26, 2018

Didn't reproduce without HW TM:
-x UCX_DC_MLX5_TM_ENABLE=n

Didn't reproduce with HW TM but with a smaller value for TM_MAX_BCOPY:
-x UCX_DC_MLX5_TM_ENABLE=y -x UCX_DC_MLX5_TM_MAX_BCOPY=8k
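
Assuming the rc_recv_desc element size scales roughly with the bcopy threshold (an assumption on my part, not something the thread confirms), dropping TM_MAX_BCOPY from 48k to 8k would shrink each pool chunk by about 6x:

```python
chunk_at_48k = 222_006_488   # observed chunk size with TM_MAX_BCOPY=48k
ratio = 48 / 8               # 48k -> 8k threshold reduction

# Rough estimate only: assumes per-descriptor size tracks the threshold.
est_chunk_at_8k = chunk_at_48k / ratio
print(f"~{est_chunk_at_8k / 2**20:.0f} MiB per chunk at 8k "
      f"(vs {chunk_at_48k / 2**20:.0f} MiB at 48k)")
```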


alinask commented Jun 27, 2018

Will change MTT's config not to use the large BCOPY threshold with HW TM at such a large scale.

@alinask alinask closed this as completed Jun 27, 2018