...
Sat Jun 23 17:16:39 2018[1,0]<stdout>:iter 141
Sat Jun 23 17:16:39 2018[1,0]<stdout>:iter 142
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428483] [clx-vega-025:10641:0] rcache.c:534 UCX WARN failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428494] [clx-vega-025:10641:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428499] [clx-vega-025:10641:0] ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440739] [clx-vega-025:10641:0] rcache.c:534 UCX WARN failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440745] [clx-vega-025:10641:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.440749] [clx-vega-025:10641:0] ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450210] [clx-vega-025:10641:0] rcache.c:534 UCX WARN failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450215] [clx-vega-025:10641:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.450218] [clx-vega-025:10641:0] ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460304] [clx-vega-025:10641:0] rcache.c:534 UCX WARN failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460308] [clx-vega-025:10641:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.460311] [clx-vega-025:10641:0] ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x108abb0 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428102] [clx-vega-036:25799:0] rcache.c:534 UCX WARN failed to register region 0x9cacc0 [0x108ae00..0x108ce00]: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428115] [clx-vega-036:25799:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108ae00 length 8192 on md[0]=ib/mlx5_0: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428121] [clx-vega-036:25799:0] ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x108ae00 len 8192: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.438322] [clx-vega-036:25799:0] rcache.c:534 UCX WARN failed to register region 0x9cacc0 [0x108ae00..0x108ce00]: Input/output error
Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.438328] [clx-vega-036:25799:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108ae00 length 8192 on md[0]=ib/mlx5_0: Input/output error
...
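To see at a glance which ranks and hosts report the failures, the log can be grouped by rank, host, and UCX source location. The following is a minimal sketch, assuming the line layout shown above (mpirun's --tag-output and --timestamp-output prefix followed by the UCX log prefix); the names LINE_RE and summarize are made up for illustration and are not part of UCX or Open MPI.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines above:
#   <timestamp>[job,rank]<stdout>:[epoch] [host:pid:tid] file.c:line UCX LEVEL message
LINE_RE = re.compile(
    r"\[(?P<job>\d+),(?P<rank>\d+)\]<stdout>:"
    r"\[(?P<epoch>[\d.]+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)"
)


def summarize(lines):
    """Count UCX WARN/ERROR lines per (rank, host, source location)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        # Application output such as "iter 141" does not match and is skipped.
        if m and m.group("level") in ("WARN", "ERROR"):
            key = (int(m.group("rank")), m.group("host"), m.group("src"))
            counts[key] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "Sat Jun 23 17:16:39 2018[1,0]<stdout>:iter 141",
    "Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428483] [clx-vega-025:10641:0] rcache.c:534 UCX WARN failed to register region 0x9c03e0 [0x108abb0..0x108cbb0]: Input/output error",
    "Sat Jun 23 17:16:39 2018[1,80]<stdout>:[1529763398.428494] [clx-vega-025:10641:0] ucp_mm.c:105 UCX ERROR failed to register address 0x108abb0 length 8192 on md[0]=ib/mlx5_0: Input/output error",
    "Sat Jun 23 17:16:39 2018[1,11]<stdout>:[1529763398.428102] [clx-vega-036:25799:0] rcache.c:534 UCX WARN failed to register region 0x9cacc0 [0x108ae00..0x108ce00]: Input/output error",
]

counts = summarize(sample)
for (rank, host, src), n in sorted(counts.items()):
    print(f"rank {rank:>3} on {host}: {n} x {src}")
```

On a full log, passing the open file object to summarize() works the same way, since it only iterates over lines.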
These registration errors occur because ompi_coll_base_bcast_intra_generic() calls pml_ucx_irecv() with length=8192, while the actual buffer size, and the expected receive data, are smaller than that.
This is actually legal, because according to section 3.2.4 of the MPI spec: "If a message that is shorter than the receive buffer arrives, then only those locations corresponding to the (shorter) message are modified."
With hardware tag matching (HW TM), we register the whole receive buffer, hence we get the error. So we need to avoid printing errors in the HW TM case and fall back to software tag matching (SW TM).
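The proposed behavior can be illustrated with a toy model. This is only a sketch under stated assumptions, not UCX code: register(), post_recv(), and RegistrationError are hypothetical names, the 4096-byte buffer size is invented for the example (the issue only says the buffer is smaller than 8192), and registration is simplified to "fails when the requested length exceeds the memory that actually exists".

```python
POSTED_LEN = 8192  # length passed to the receive, as seen in the log above


class RegistrationError(Exception):
    """Hypothetical stand-in for a failed memory registration."""


def register(buffer_len, length):
    # Simplified model: registering more bytes than the buffer really
    # has fails, mirroring the "Input/output error" in the log.
    if length > buffer_len:
        raise RegistrationError("Input/output error")
    return ("memh", length)


def post_recv(buf, posted_len, message, log):
    """Try the offloaded path first; fall back quietly to software matching."""
    try:
        register(len(buf), posted_len)  # HW TM registers the whole posted range
        path = "hw_tm"
    except RegistrationError as exc:
        # Downgraded to a debug message instead of WARN/ERROR prints.
        log.append(("debug", f"registration failed ({exc}), using SW TM"))
        path = "sw_tm"
    n = len(message)
    if n > posted_len or n > len(buf):
        raise ValueError("message does not fit the receive buffer")
    # Only the locations corresponding to the (shorter) message are modified.
    buf[:n] = message
    return path, n


# Assumed scenario: 4096 bytes really allocated, 8192 posted, 3 bytes arrive.
log = []
small = bytearray(b"\xff" * 4096)
path_small, n_small = post_recv(small, POSTED_LEN, b"abc", log)

# Control case: the buffer really is 8192 bytes, so registration succeeds.
log_full = []
full = bytearray(POSTED_LEN)
path_full, n_full = post_recv(full, POSTED_LEN, b"abc", log_full)

print(path_small, n_small, path_full, n_full)
```

In this model the receive completes on both paths and the untouched tail of the buffer keeps its previous contents; the only difference is that the failed registration is recorded at debug level rather than reported as an error.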
Configuration:
MTT log: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20180623_084048_26719_117558_clx-vega-025/html/test_stdout_G0xlOH.txt
Doesn't reproduce with UCX_DC_VERBS_TM_ENABLE=n. Same for rc: works without HW TM, fails with HW TM enabled. This may be related to Lustre issues on the server.
Cmd:
mpirun -np 305 -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc,sm -x UCX_IB_SL=1 -x UCX_DC_VERBS_TM_ENABLE=y -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20180623_084048_26719_117558_clx-vega-025/installs/W9jm/tests/mpi-small-tests/hpc_tests.git/mpi/misc/all2all_ibm
Output: