[mtt] ERROR Failed to allocate memory pool chunk: Out of memory #2670
Reproduced manually with the following:
@alinask I think it's worth adding the mpool name to the error message in mpool.c:176.
Probably due to HW TM being enabled and the increased BCOPY threshold:
Found in the kernel log:
[Jun18 00:35] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ +1.782955] prolog (30529): drop_caches: 3
[Jun18 00:47] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30122): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.008773] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30109): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.001393] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30111): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.003357] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30125): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.001646] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30127): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.011469] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30116): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.124870] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30132): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.001839] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30157): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.005886] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30149): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.004033] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30152): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.000543] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30165): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.003805] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30170): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.002034] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30172): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.014786] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30177): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.018667] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30174): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.004447] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30185): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[ +0.005452] mlx5_core 0000:82:00.0: mlx5_cmd_check:731:(pid 30180): unknown command opcode(0x205) op_mod(0x0) failed, status no resources(0xf), syndrome (0x9f261)
[Jun18 00:57] sysctl (30885): drop_caches: 1
Printed the chunk size as well:
Didn't reproduce without HW TM.
Also didn't reproduce with HW TM but with a shorter value for TM_MAX_BCOPY.
Will change mtt's configuration not to use the large bcopy threshold with HW TM at such a large scale.
Configuration:
MTT log: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/html/test_stdout_1x6yiO.txt
The issue wasn't reproduced manually; the log can still be useful while we chase OOM issues.
Cmd:
mpirun -np 2492 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_4:1 -x UCX_IB_ADDR_TYPE=ib_global -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,sm -x UCX_IB_SL=1 -x UCX_DC_MLX5_TM_ENABLE=y -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20180602_062851_15753_107513_clx-orion-005/installs/IYeL/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'All,^One-sided,^IO,^Environment' -n 300
Output:
Full log: ucx-orion-oom.txt