[MTT] infinity log with cuda errors #234

Closed
avildema opened this issue Jun 23, 2021 · 2 comments
Labels
mtt Issue detected in MTT testing

Comments

@avildema
Contributor

UCX: 1.11
UCC: master
OMPI: v5.0.x

setup: GPU (cuda)

The following errors are printed to the log in an endless loop:

[1624426315.715236] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715239] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715244] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715247] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715252] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715256] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715260] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703205] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703210] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703214] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703218] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703221] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703226] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
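
For triaging a flood like this, it can help to collapse the repeats into counts per host and source location. Below is a minimal sketch, assuming every line follows the layout seen in the excerpt above (`[timestamp] [host:pid:tid] file:line  message`). The names `LINE_RE` and `summarize` are hypothetical helpers for illustration and are not part of UCC or MTT.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log excerpt above:
#   [timestamp] [host:pid:tid] file:line  message
# The pid field may be padded with spaces before the second colon.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:(?P<tid>\d+)\]\s+"
    r"(?P<loc>\S+:\d+)\s+"
    r"(?P<msg>.+)$"
)


def summarize(lines):
    """Count log lines grouped by (host, source location, message).

    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m["host"], m["loc"], m["msg"])] += 1
    return counts


def report(lines):
    """Print one summary row per distinct (host, location, message)."""
    for (host, loc, msg), n in summarize(lines).most_common():
        print(f"{n:8d}  {host}  {loc}  {msg}")
```

Calling `report(open("mtt.log"))` on a saved log (the file name here is only an example) prints one row per distinct error, most frequent first, which makes it easier to see that the same two messages repeat on every rank.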
@avildema avildema added the mtt Issue detected in MTT testing label Jun 23, 2021
@avildema
Contributor Author

Command to reproduce:

   
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 4 --display-map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core -x HCOLL_CUDA_SBGP=p2p -x HCOLL_CUDA_BCOL=nccl -x HCOLL_ALLREDUCE_ZCOPY_THRESH_CUDA=512 /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/allreduce_cuda/mtt-tests.git/mpi_corr_tests/allreduce_cuda/bin/allreduce_cuda

@vspetrov
Collaborator

Fixed with #238
