[mtt] ERROR failed to register address in mpi-small-tests #2702

Closed
amaslenn opened this issue Jun 23, 2018 · 3 comments
Labels
Bug MTT MTT Error

Comments

@amaslenn
Contributor

Configuration:

MOFED: MLNX_OFED_LINUX-4.3-1.0.1.0
OMPI: 3.1.1rc1
Nodes: vega x40 (ppn=36(x40), nodelist=clx-vega-[025-064])
Job: ucx-hwtm-dc

MTT log: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20180623_084048_26719_117558_clx-vega-025/html/test_stdout_G0xlOH.txt

Does not reproduce with UCX_DC_VERBS_TM_ENABLE=n. The same holds for rc: it works without HW TM and fails with HW TM enabled. May be related to Lustre issues on the server.

Cmd:
mpirun -np 305 -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc,sm -x UCX_IB_SL=1 -x UCX_DC_VERBS_TM_ENABLE=y -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20180623_084048_26719_117558_clx-vega-025/installs/W9jm/tests/mpi-small-tests/hpc_tests.git/mpi/misc/all2all_ibm

Output:

...
Sat Jun 23 17:16:39 2018[1,0]<stdout>:iter 141
Sat Jun 23 17:16:39 2018[1,0]<stdout>:iter 142
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428483] [clx-vega-025:10641:0]         rcache.c:534  UCX  WARN  failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428494] [clx-vega-025:10641:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428499] [clx-vega-025:10641:0]    ucp_request.c:259  UCX  ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440739] [clx-vega-025:10641:0]         rcache.c:534  UCX  WARN  failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440745] [clx-vega-025:10641:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440749] [clx-vega-025:10641:0]    ucp_request.c:259  UCX  ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450210] [clx-vega-025:10641:0]         rcache.c:534  UCX  WARN  failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450215] [clx-vega-025:10641:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450218] [clx-vega-025:10641:0]    ucp_request.c:259  UCX  ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460304] [clx-vega-025:10641:0]         rcache.c:534  UCX  WARN  failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460308] [clx-vega-025:10641:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460311] [clx-vega-025:10641:0]    ucp_request.c:259  UCX  ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428102] [clx-vega-036:25799:0]         rcache.c:534  UCX  WARN  failed to register region 0x9cacc0 [0x108ae00..0x108ce00]: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428115] [clx-vega-036:25799:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108ae00 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428121] [clx-vega-036:25799:0]    ucp_request.c:259  UCX  ERROR failed to register user buffer datatype 0x20 address 0x108ae00 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.438322] [clx-vega-036:25799:0]         rcache.c:534  UCX  WARN  failed to register region 0x9cacc0 [0x108ae00..0x108ce00]: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.438328] [clx-vega-036:25799:0]         ucp_mm.c:105  UCX  ERROR failed to register address 0x108ae00 length 8192 on md[0]=ib/mlx5_0: Input/output error
...
@yosefe added Bug MTT MTT Error labels Jun 24, 2018
@yosefe
Contributor

yosefe commented Jun 24, 2018

This happens because ompi_coll_base_bcast_intra_generic() calls pml_ucx_irecv() with length=8192, while the actual buffer size, and the expected receive data, are smaller than that:

            /* post new irecv */
            err = MCA_PML_CALL(irecv( tmpbuf + realsegsize, count_by_segment,
                                      datatype, tree->tree_prev,
                                      MCA_COLL_BASE_TAG_BCAST,
                                      comm, &recv_reqs[req_index]));

This is actually legal, because according to MPI spec section 3.2.4: "If a message that is shorter than the receive buffer arrives, then only those locations corresponding to the (shorter) message are modified."
With HW TM, we register the whole receive buffer, so registration fails when the posted length exceeds the real buffer, hence the error. So we need to avoid the error prints for HW TM and fall back to SW TM.
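
To make the proposed behavior concrete, here is a minimal sketch in C of "try HW TM, fall back to SW TM quietly". It is not UCX code and not necessarily what the actual fix does: recv_buf_t, try_register() and post_recv() are hypothetical names, and registration is modeled simply as "the whole posted range must be backed by memory".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a receive buffer: only `mapped` bytes really
 * exist, but the caller may post a receive longer than that. */
typedef struct {
    size_t mapped; /* bytes actually backed by the user buffer */
} recv_buf_t;

typedef enum { MATCH_HW, MATCH_SW } match_path_t;

/* Stand-in for memory registration (NOT a UCX function): succeeds only
 * if the entire posted range is backed by memory. */
static bool try_register(const recv_buf_t *buf, size_t posted_len)
{
    return posted_len <= buf->mapped;
}

/* Pick the tag-matching path for a posted receive. HW TM needs the whole
 * posted range registered; if that fails, use SW TM without reporting an
 * error, since posting an oversized receive is legal in MPI. */
static match_path_t post_recv(const recv_buf_t *buf, size_t posted_len,
                              bool hw_tm_enabled)
{
    assert(buf != NULL);
    if (hw_tm_enabled && try_register(buf, posted_len)) {
        return MATCH_HW;
    }
    return MATCH_SW; /* silent fallback, no ERROR print */
}
```

In this model, a receive posted with length 8192 on a buffer that has only 4096 bytes mapped takes the SW TM path, while the same post on a full 8192-byte buffer stays on HW TM.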

@hanyunfan

What's the solution to this issue?

@yosefe
Contributor

yosefe commented Nov 7, 2018

@hanyunfan it's fixed in #2775
