mm ep failed to connect to remote FIFO id : shared memory error; open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory #8511
Comments
@hoopoepg any idea?
Something is wrong with the proc file system. Are containers being used?
@hoopoepg no containers tried adding Please see log: https://gist.github.com/jamesongithub/ca1c9618f0dd994f6bf8356147111543
OK, it seems the POSIX shm transport failed to access shared memory.
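The failing call in the issue title is an `open()` on a path of the form `/proc/<pid>/fd/<n>`. As a rough sketch only (this is not UCX's implementation), the Python below checks whether the current process can reopen one of its own descriptors through such a link. The helper name `can_open_via_proc` is hypothetical. It tests only the local process, while the error here concerns a descriptor owned by another process, so a passing result does not rule the problem out.

```python
import os
import tempfile


def can_open_via_proc():
    """Return True if this process can reopen its own fd through /proc.

    Hypothetical helper for illustration. It mimics the kind of path seen
    in the error message (/proc/<pid>/fd/<n>), not UCX internals.
    Returns None when /proc/<pid>/fd is not available (e.g. non-Linux).
    """
    fd_dir = "/proc/%d/fd" % os.getpid()
    if not os.path.isdir(fd_dir):
        return None
    with tempfile.NamedTemporaryFile() as tmp:
        link = "%s/%d" % (fd_dir, tmp.fileno())
        try:
            # flags=0x0 in the logged open() corresponds to O_RDONLY on Linux
            fd = os.open(link, os.O_RDONLY)
        except OSError:
            return False
        os.close(fd)
        return True


if __name__ == "__main__":
    print("open via /proc link:", can_open_via_proc())
```

A `False` result on a Linux host would suggest that `/proc` descriptor links are restricted for this user or environment.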
The /proc errors are gone; now there are shmat errors:
It seems there are restrictions on using shared memory on your system, so UCX can't use this transport at all.
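Since `shmat` is a System V shared memory call, one place to start looking is the kernel's System V limits. The sketch below reads the standard Linux sysctl files `shmmax`, `shmall` and `shmmni` under `/proc/sys/kernel`. It is an assumption that these limits are relevant to this failure; the thread does not establish the cause. The function name `read_shm_limits` is hypothetical.

```python
import os

# Standard Linux sysctl names for System V shared memory limits:
#   shmmax: maximum size of a single segment, in bytes
#   shmall: system-wide total of shared memory, in pages
#   shmmni: maximum number of segments
SHM_SYSCTLS = ("shmmax", "shmall", "shmmni")


def read_shm_limits(base="/proc/sys/kernel"):
    """Return {name: int or None} for the System V shm limits.

    A value of None means the file was missing or unreadable,
    for example on a non-Linux host.
    """
    limits = {}
    for name in SHM_SYSCTLS:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                limits[name] = int(f.read().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_shm_limits().items():
        print(name, "=", value)
```

Unusually small values, or values that differ between nodes, would be worth comparing against a host where the job runs cleanly.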
with
Preferably, instead of disabling shared memory, we would adjust the system, since if we disable UCX completely we get a successful run. Are these reasonable?
hi thank you
hey with
Glad we were able to get a successful run, but we would like to know how to get it working with the default parameters. Does this last result give us an idea of what should be changed to work with the defaults?
Describe the bug
During an mpirun of the HPL benchmark, UCX errors were encountered that caused the job to fail.
The error message looks like the following:
Looks like a side issue that was reported in #4224 as well as easybuilders/easybuild#756
Steps to Reproduce
ucx_info -v
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
https://gist.github.com/jamesongithub/bda88d5575aa06bedcf31255dae82b25
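The commands named in the template above can be gathered in one pass. Below is a minimal sketch that runs a subset of them and skips any tool that is not installed. The `collect` function and the `COMMANDS` list are hypothetical names, and which commands apply depends on the system (for example, the `lsmod` checks only matter for GPU setups).

```python
import shutil
import subprocess

# A subset of the commands from the issue template. Entries containing a
# pipe rely on being run through the shell.
COMMANDS = [
    "ucx_info -v",
    "cat /etc/issue",
    "uname -a",
    "ibstat",
    "ibv_devinfo -vv",
    "lsmod | grep nv_peer_mem",
    "lsmod | grep gdrdrv",
    "ucx_info -d",
]


def collect(commands=COMMANDS, timeout=30):
    """Run each command and return {command: output or None}.

    None means the first word of the command was not found on PATH
    or the command timed out.
    """
    results = {}
    for cmd in commands:
        tool = cmd.split()[0]
        if shutil.which(tool) is None:
            results[cmd] = None
            continue
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            results[cmd] = None
            continue
        results[cmd] = proc.stdout + proc.stderr
    return results


if __name__ == "__main__":
    for cmd, output in collect().items():
        print("$ " + cmd)
        print(output if output is not None else "(not available)")
```

Pasting the combined output into a gist keeps the issue body short while still giving maintainers the version and device details they ask for.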