DAOS: verbs;rxm - latest ofi main causes mem corruption when running at scale #6973

Closed
frostedcmos opened this issue Aug 5, 2021 · 4 comments

Comments

@frostedcmos

frostedcmos commented Aug 5, 2021

When testing the latest OFI main on the Frontera system (TACC) at scale, one of the scenarios that previously passed a test stage with v1.12.0 using verbs;ofi_rxm now crashes one of the DAOS servers. The configuration and backtrace follow:

OFI: 7d6d2a1
System: Frontera
Servers: 16 * 1 engine
Clients: 40
Procs per client: 56
Program run: "ior easy" with stonewalling set to 20 seconds.

Crash:
c171-141: ERROR: daos_engine:0 ** Error in `/work2/08126/dbohninx/frontera/BUILDS/daos-8250/latest/daos/install/bin/daos_engine': double free or corruption (!prev): 0x00002b26a403e1f0 **
c171-141: ERROR: daos_engine:0 ======= Backtrace: =========
c171-141: /lib64/libc.so.6(+0x7f3e4)[0x2b26852463e4]
c171-141: /lib64/libc.so.6(+0x846e0)[0x2b268524b6e0]
c171-141: ERROR: daos_engine:0 /lib64/libc.so.6(realloc+0x1d2)[0x2b268524cd82]

Backtrace from core file:
#0 0x00002b26851fd387 in raise () from /lib64/libc.so.6
#1 0x00002b26851fea78 in abort () from /lib64/libc.so.6
#2 0x00002b268523fed7 in __libc_message () from /lib64/libc.so.6
#3 0x00002b26852463e4 in malloc_printerr () from /lib64/libc.so.6
#4 0x00002b268524b6e0 in _int_realloc () from /lib64/libc.so.6
#5 0x00002b268524cd82 in realloc () from /lib64/libc.so.6
#6 0x00002b268ac4fec1 in ofi_bufpool_grow () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#7 0x00002b268ac8d5c9 in rxm_alloc_conn () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#8 0x00002b268ac8d8ef in rxm_add_conn () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#9 0x00002b268ac8e084 in rxm_handle_event.isra.7 () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#10 0x00002b268ac8e428 in rxm_conn_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#11 0x00002b268ac98bee in rxm_ep_do_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#12 0x00002b268ac98c81 in rxm_ep_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#13 0x00002b268ac4b52d in ofi_cq_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#14 0x00002b268ac4aa6b in ofi_cq_readfrom () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#15 0x00002b2686708e3e in fi_cq_readfrom (src_addr=0x3e08150, count=16, buf=0x3e081d0, cq=0x2b28f0034ab0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/prereq/release/ofi/include/rdma/fi_eq.h:400
#16 na_ofi_cq_read (max_count=16, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x3e08150, cq_events=0x3e081d0, context=0x2b28f00896d0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na_ofi.c:3288
#17 na_ofi_progress (na_class=0x2b28f002c070, context=0x2b28f00896d0, timeout=0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na_ofi.c:5451
#18 0x00002b26867040c1 in NA_Progress (na_class=na_class@entry=0x2b28f002c070, context=context@entry=0x2b28f00896d0, timeout=timeout@entry=0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na.c:1267
#19 0x00002b26862dea70 in hg_core_progress_na (na_class=0x2b28f002c070, na_context=0x2b28f00896d0, timeout=0, progressed_ptr=progressed_ptr@entry=0x3e08670 "")
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3896
#20 0x00002b26862e09e4 in hg_core_poll (progressed_ptr=, timeout=, context=0x2b28f00874e0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3838
#21 hg_core_progress (context=0x2b28f00874e0, timeout=0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3693
#22 0x00002b26862e5f1b in HG_Core_progress (context=, timeout=) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:5056
#23 0x00002b26862d8063 in HG_Progress (context=context@entry=0x2b28f002c0a0, timeout=) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury.c:2022
#24 0x00002b26838fa151 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x2b28f0026d38, timeout=timeout@entry=0) at src/cart/crt_hg.c:1277
#25 0x00002b26838bd005 in crt_progress (crt_ctx=0x2b28f0026d20, timeout=0) at src/cart/crt_context.c:1454
#26 0x000000000043c7dd in dss_srv_handler (arg=0x3d173f0) at src/engine/srv.c:474
#27 0x00002b26846e3dba in ABTD_ythread_func_wrapper () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#28 0x00002b26846e3f61 in make_fcontext () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#29 0x0000000000000000 in ?? ()

@frostedcmos
Author

DAOS validation reran the Frontera scenario above with three versions of OFI (v1.12.0, v1.13.0 and 7d6d2a1), all using verbs;ofi_rxm.

Results are:
v1.12.0: PASS
v1.13.0: PASS
7d6d2a1: Failures

Besides the crash in the original report above, testing with OFI 7d6d2a1 also produced the following failure signature:

c155-073.frontera.tacc.utexas.edu ERROR 2021/08/06 16:53:51 daos_engine:0 libfabric:62384:verbs:eq:vrb_set_rnr_timer():474 Unable to modify QP attribute
c155-073.frontera.tacc.utexas.edu ERROR 2021/08/06 16:53:51 daos_engine:0 mlx5: c155-073.frontera.tacc.utexas.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000000ca 00000000 00000000 00000000
00000000 00008813 1000761d 00002cd2
c155-073.frontera.tacc.utexas.edu ERROR 2021/08/06 16:53:51 daos_engine:0 libfabric:62384:ofi_rxm:cq:rxm_handle_comp_error():1719 fi_cq_readerr: err: Input/output error (5), prov_err: remote access error (10)

@frostedcmos
Author

We've bisected it to:

885e643 - GOOD
c3ee423 - GOOD
aa04cff - rpc timeouts
a381445 - rpc timeouts
5f4fe58 - rpc timeouts
4c5461b - rpc timeouts
faec8a43aec959dc479f49f935e86576292f5091 - rpc timeouts
8f1c597 - no timeouts, but completion with error
    daos_engine:0 libfabric:266556:ofi_rxm:cq:rxm_handle_comp_error():1719 fi_cq_readerr: err: Input/output error (5), prov_err: remote access error (10)
4acbd64 - completion with error
    daos_engine:0 libfabric:269258:ofi_rxm:cq:rxm_handle_comp_error():1719 fi_cq_readerr: err: Input/output error (5), prov_err: remote access error (10)
5426311 - completion with error
    daos_engine:0 libfabric:276970:ofi_rxm:cq:rxm_handle_comp_error():1719 fi_cq_readerr: err: Input/output error (5), prov_err: remote access error (10)

@shefty
Member

shefty commented Sep 24, 2021

See PR #7091 -- both verbs and tcp have the potential for use after free, or double free under connection failures. The underlying problems are in the verbs/tcp providers and have been there for some time. This is an easy change that should avoid the problems until the underlying providers are fixed.

One issue remains open with #7091: why the QP creation failed in the first place, which is what triggered the reject path.
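
The class of bug described above (a connection's resources released once on a failure path and then again on a later teardown path) can be illustrated with a minimal, self-contained sketch. This is not libfabric code and not the change made in #7091; `demo_conn`, `demo_conn_alloc`, `demo_conn_close`, `demo_handle_reject` and `demo_handle_shutdown` are hypothetical names invented purely for illustration. It assumes the simplest possible guard: make the release routine idempotent by clearing the pointer after freeing it.

```c
#include <stdlib.h>

/* Hypothetical connection record; NOT a libfabric type. */
struct demo_conn {
    void *msg_ep;      /* stand-in for a provider-owned endpoint resource */
    int   close_count; /* how many times the resource was actually released */
};

/* Allocate a connection together with its stand-in endpoint resource. */
static struct demo_conn *demo_conn_alloc(void)
{
    struct demo_conn *conn = calloc(1, sizeof(*conn));

    if (!conn)
        return NULL;

    conn->msg_ep = malloc(64);
    if (!conn->msg_ep) {
        free(conn);
        return NULL;
    }
    return conn;
}

/*
 * Idempotent release: the pointer is cleared after free(), so a second
 * call from another error path becomes a no-op instead of a double free.
 */
static void demo_conn_close(struct demo_conn *conn)
{
    if (!conn || !conn->msg_ep)
        return;

    free(conn->msg_ep);
    conn->msg_ep = NULL;
    conn->close_count++;
}

/* Two independent paths that may both fire for the same connection. */
static void demo_handle_reject(struct demo_conn *conn)
{
    demo_conn_close(conn);
}

static void demo_handle_shutdown(struct demo_conn *conn)
{
    demo_conn_close(conn);
}
```

Calling `demo_handle_reject()` followed by `demo_handle_shutdown()` on the same `demo_conn` releases `msg_ep` exactly once. Without the NULL check and reset in `demo_conn_close()`, the second call would pass an already freed pointer to `free()`, which is the kind of heap corruption glibc reports as "double free or corruption".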

@shefty
Member

shefty commented Sep 29, 2021

Changes in #7091 were merged. Initial testing shows memory corruption issues have been resolved.

@shefty shefty closed this as completed Oct 1, 2021