
How do I use OpenMPI with AWS EFA #9858

Open
chenshixinnb opened this issue Jan 11, 2022 · 38 comments

@chenshixinnb

It works fine on a normal node, but on a node that supports the EFA network it reports the following error:

@chenshixinnb
Author

[1641461840.238519] [c-96-4-worker0001:3595 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238244] [c-96-4-worker0001:3598 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[c-96-4-worker0001:03640] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03646] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03671] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03603] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03604] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03610] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03615] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03616] pml_ucx.c:291 Error: Failed to create UCP worker
[1641461840.237779] [c-96-4-worker0001:3605 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.239128] [c-96-4-worker0001:3606 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.239325] [c-96-4-worker0001:3610 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.237871] [c-96-4-worker0001:3619 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238154] [c-96-4-worker0001:3616 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.237974] [c-96-4-worker0001:3637 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.237984] [c-96-4-worker0001:3639 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.239203] [c-96-4-worker0001:3647 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238474] [c-96-4-worker0001:3664 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238478] [c-96-4-worker0001:3660 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238485] [c-96-4-worker0001:3661 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.240738] [c-96-4-worker0001:3676 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.242309] [c-96-4-worker0001:3680 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.238529] [c-96-4-worker0001:3674 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.241533] [c-96-4-worker0001:3678 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[1641461840.241750] [c-96-4-worker0001:3704 :0] rc_iface.c:492 UCX ERROR ibv_create_srq() failed: Operation not supported
[c-96-4-worker0001:03593] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03595] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03597] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03600] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03612] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03613] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001:03619] pml_ucx.c:291 Error: Failed to create UCP worker
[c-96-4-worker0001][[52336,1],41][btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[52336,1],29]
[c-96-4-worker0001][[52336,1],31][btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[52336,1],28]
[c-96-4-worker0001][[52336,1],32][btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[52336,1],0]

@brminich
Member

Can you please post the commands you used to run your application?

@chenshixinnb
Author

I added the following parameters via this link (#6795):
--mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 10
The following error occurred:
[c-96-4-worker0002:04711] mca: base: components_register: component cm register function successful
[c-96-4-worker0002:04711] mca: base: components_open: opening pml components
[c-96-4-worker0002:04711] mca: base: components_open: found loaded component cm
[c-96-4-worker0002:04710] mca: base: components_register: registering framework pml components
[c-96-4-worker0002:04710] mca: base: components_register: found loaded component cm
[c-96-4-worker0002:04710] mca: base: components_register: component cm register function successful
[c-96-4-worker0002:04710] mca: base: components_open: opening pml components
[c-96-4-worker0002:04710] mca: base: components_open: found loaded component cm
[c-96-4-worker0001:04704] mca: base: components_register: registering framework pml components
[c-96-4-worker0001:04704] mca: base: components_register: found loaded component cm
[c-96-4-worker0001:04704] mca: base: components_register: component cm register function successful
[c-96-4-worker0001:04704] mca: base: components_open: opening pml components
[c-96-4-worker0001:04704] mca: base: components_open: found loaded component cm
[c-96-4-worker0002:04708] mca: base: components_register: registering framework pml components
[c-96-4-worker0002:04716] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[c-96-4-worker0002:04716] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[c-96-4-worker0002:04716] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list

@chenshixinnb
Author

Can you please post the commands you used to run your application?

#!/bin/bash
module add GROMACS/2021-foss-2020b
mpirun -v gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu

@chenshixinnb
Author

chenshixinnb commented Jan 11, 2022


#!/bin/bash
module add GROMACS/2021-foss-2020b
mpirun -v gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu --mca pml cm --mca mtl ofi --mca pml_base_verbose 10

@brminich
Member

What is the command that produces the original errors from pml_ucx? Can you please also post the whole log?
Also, why do you specify the psm2 provider for running with EFA?

@wzamazon
Contributor

Hi @chenshixinnb,

I have a few questions:

  1. What version of open mpi are you using?
  2. What version of libfabric are you using (you can get the info by running fi_info)?
  3. Are you trying to use GPU version of GROMACS?

@chenshixinnb
Author

chenshixinnb commented Jan 11, 2022

Hi @chenshixinnb,

I have a few questions:

  1. What version of open mpi are you using?
  2. What version of libfabric are you using (you can get the info by running fi_info)?
  3. Are you trying to use GPU version of GROMACS?

Thanks.
1. OpenMPI 4.0.5
2. [cloudam@c-96-4-worker0001 ~]$ fi_info -p efa
provider: efa
fabric: EFA-fe80::4fc:a6ff:fe55:bb98
domain: rdmap0s6-rdm
version: 111.20
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4fc:a6ff:fe55:bb98
domain: rdmap0s6-dgrm
version: 111.20
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
3. No, I am using MPI parallelism. In -cpi nvt_gpu -deffnm nvt_gpu, "nvt_gpu" is the input file name.

@chenshixinnb
Author

chenshixinnb commented Jan 11, 2022

what is the command which produces the original errors from pml_ucx? Can you please also post the whole log? Also why do you specify psm2 provider for running with efa?


Sorry, I pasted it wrong. The parameters were:
--mca pml cm --mca mtl ofi --mca pml_base_verbose 10

@wckzhang
Contributor

Those errors look like ucx errors, which wouldn't appear if the ofi mtl was properly selected. EFA is only supported with the ofi mtl. IIRC, specifying --mca pml cm doesn't fail if the cm pml cannot find a proper mtl. Try excluding the ucx pml: --mca pml ^ucx. My suspicion is that the ofi mtl is not properly selecting the EFA provider, which is causing a fallback to ucx.
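A minimal sketch of what that could look like, reusing the run script posted earlier in this thread (the module and input names are just copied from that script, so treat this as an illustration rather than a verified command), with the MCA options placed before the executable so that mpirun, rather than gmx_mpi, receives them:

#!/bin/bash
module add GROMACS/2021-foss-2020b
# exclude the ucx pml so Open MPI cannot fall back to UCX on the EFA node
mpirun --mca pml ^ucx --mca mtl ofi --mca mtl_base_verbose 10 \
       gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu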

@wckzhang
Contributor

EDIT: My recollection was wrong, specifying --mca pml cm should force the cm pml to be selected or an error will be thrown if it cannot select an MTL. I misread and realized you added --mca pml cm later.

@wzamazon
Contributor

Can you add --mca mtl_base_verbose 100 to your mpirun command line and share the output?

@chenshixinnb
Author

slurm2-out.txt

#!/bin/bash
module add GROMACS/2021-foss-2020b
mpirun -v gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 100

Can you add --mca mtl_base_verbose 100 to your mpirun command line and share the output?

@wzamazon
Contributor

Looks like the initialization of ofi (libfabric) failed.

[c-96-4-worker0001:03645] select: init returned failure for component ofi
[c-96-4-worker0001:03645] select: no component selected
[c-96-4-worker0001:03645] select: init returned failure for component cm

Please add -x FI_LOG_LEVEL=warn to your mpirun command. This will make libfabric print more information.

@chenshixinnb
Author

slurm3-out.txt

mpirun -v gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 100 -x FI_LOG_LEVEL=warn

Looks like the initialization of ofi (libfabric) failed.

[c-96-4-worker0001:03645] select: init returned failure for component ofi
[c-96-4-worker0001:03645] select: no component selected
[c-96-4-worker0001:03645] select: init returned failure for component cm

Please add -x FI_LOG_LEVEL=warn to your mpirun command. This will make libfabric print more information.

@wzamazon
Contributor

I did not see any information from libfabric printed, which makes me wonder whether Open MPI was compiled correctly with libfabric.

How did you obtain Open MPI?

Did you compile it yourself, or get it from another source?

@wckzhang
Contributor

[c-96-4-worker0002:03606] mtl_ofi_component.c:541: select_ofi_provider: no provider found
[c-96-4-worker0002:03606] select: init returned failure for component ofi
[c-96-4-worker0002:03606] select: no component selected

This is the important section. The fi_getinfo call in Open MPI did not return the efa provider, and the other available providers are in the exclude list, so no MTL was returned and the CM PML could not progress.

I feel like there's something missing here if fi_info returns an EFA provider but fi_getinfo doesn't. Have you verified that all nodes in your Slurm cluster return a provider when fi_info -p efa is called?

I briefly took a look at the hints that were being provided, but unless MTL_OFI_PROG_AUTO is set, I don't think there's an issue with the hints. If you want to dig deeper, the relevant code is the function ompi_mtl_ofi_component_init, where the call to fi_getinfo returns no efa provider. (select_ofi_provider only checks and filters based on the include/exclude list; the problem is that the fi_getinfo call doesn't return the efa provider.)
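As a quick way to do that check across the whole allocation (a rough sketch; the Slurm options and node names are placeholders taken from this thread, not a verified recipe):

# run fi_info on every node of a 2-node allocation and look for the efa provider
srun -N 2 --ntasks-per-node=1 bash -c 'hostname; fi_info -p efa -t FI_EP_RDM || echo "no efa provider"'

# or, without Slurm, loop over the nodes by ssh
for host in c-96-4-worker0001 c-96-4-worker0002; do
    ssh "$host" 'hostname; fi_info -p efa -t FI_EP_RDM || echo "no efa provider"'
done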

@wckzhang
Contributor

I did not see any information from libfabric printed, which makes me wonder whether openmpi was compiled correctly with libfabric.

How did you obtain open mpi?

Did you compile by yourself or got it from other source?

With just FI_LOG_LEVEL=warn, I don't think the lack of logs is indicative. I'm pretty sure libfabric is responding, as it looks like the fi_getinfo call is returning the tcp;ofi_rxm, udp;ofi_rxd, and shm providers. The problem is that the EFA provider isn't among the fi_getinfo results.

@chenshixinnb
Author

Thanks. I used the EasyBuild OpenMPI toolchain's default automatic compilation.

@wzamazon
Contributor

It is possible that Open MPI was not configured or compiled with libfabric correctly.

Because you have libfabric, I assume you used the EFA installer to install it. Can you try the Open MPI that comes with the EFA installer? It is under /opt/amazon/openmpi.
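For example, a sketch assuming the default installer layout under /opt/amazon/openmpi (the library directory may be lib or lib64 depending on the distribution):

export PATH=/opt/amazon/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
which mpirun             # should now resolve to /opt/amazon/openmpi/bin/mpirun
mpirun -np 2 hostname    # trivial smoke test before running GROMACS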

@chenshixinnb
Author

For business reasons, I need to install OpenMPI on a shared disk, and I need to compile and install software against the OpenMPI on that shared disk.


@wzamazon
Contributor

I see. Can you run the ompi_info command that is part of the Open MPI you are using, and paste the result?
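A quick way to check from the ompi_info output whether the ofi MTL is present at all (plain greps; nothing here is specific to your cluster):

ompi_info | grep -i "mtl: ofi"              # should list an MCA mtl ofi component
ompi_info | grep "Configure command line"   # shows whether --with-ofi was passed at build time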

@chenshixinnb
Author

ompi_info.txt

I see. Can you run the ompi_info command that is part of the Open MPI you are using, and paste the result?

@chenshixinnb
Author

thanks

@wzamazon
Contributor

Hi, I noticed that the open mpi you are using is not configured with libfabric.

 Configure command line: '--prefix=/public/software/.local/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-cuda=no' '--with-hwloc=/public/software/.local/easybuild/software/hwloc/2.2.0-GCCcore-10.2.0' '--with-libevent=/public/software/.local/easybuild/software/libevent/2.1.12-GCCcore-10.2.0' '--with-ofi=/public/software/.local/easybuild/software/libfabric/1.11.0-GCCcore-10.2.0' '--with-pmix=/public/software/.local/easybuild/software/PMIx/3.1.5-GCCcore-10.2.0' '--with-ucx=/public/software/.local/easybuild/software/UCX/1.9.0-GCCcore-10.2.0' '--without-verbs'

You will need to configure Open MPI with libfabric: when you compile Open MPI, add --with-ofi=/opt/amazon/libfabric to the configure command, and also remove --with-ucx.
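A rebuild along those lines might look roughly like this (a sketch only; the install prefix and make parallelism are placeholders, and --without-verbs is simply carried over from your existing configure line):

cd openmpi-4.0.5
# note: no --with-ucx here, per the suggestion above
./configure --prefix=/public/software/openmpi-4.0.5-efa \
            --with-ofi=/opt/amazon/libfabric \
            --without-verbs
make -j 16 && make install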

@chenshixinnb
Author

chenshixinnb commented Jan 13, 2022

curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
Thanks. Can I install EFA to a shared disk by changing the script path?


@wzamazon
Contributor

thanks,Can I install EFA to a shared disk by changing the script path?

No, it always installs Open MPI to /opt/amazon.

Note that to use EFA you will need to run the EFA installer on each compute node anyway, because using EFA requires rdma-core and the EFA kernel module, which are shipped as part of the EFA installer and usually cannot be installed to a shared disk.
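For reference, the usual installer flow on a single compute node looks roughly like this (a sketch; the URL is the one you already quoted, and installer options may differ between versions):

curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer
sudo ./efa_installer.sh -y   # installs rdma-core, the EFA kernel module, libfabric and Open MPI under /opt/amazon
fi_info -p efa               # verify the efa provider is visible afterwards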

@chenshixinnb
Author

slurm5-out.txt

#!/bin/bash
module add GROMACS/2021-foss-2020b
export PATH=/home/cloudam/OpenMpi-4.0.5/bin:$PATH
export LD_LIBRARY_PATH=/home/cloudam/OpenMpi-4.0.5/lib:$LD_LIBRARY_PATH
mpirun -v gmx_mpi mdrun -v -cpi nvt_gpu -deffnm nvt_gpu --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 100 -x FI_LOG_LEVEL=warn

@chenshixinnb
Author


I recompiled OpenMPI-4.0.5, but there are still problems.

Configure command line: '--prefix=/home/cloudam/OpenMpi-4.0.5'
'--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu'
'--enable-mpirun-prefix-by-default'
'--enable-shared' '--with-cuda=no'
'--with-hwloc=/public/software/.local/easybuild/software/hwloc/2.2.0-GCCcore-10.2.0'
'--with-libevent=/public/software/.local/easybuild/software/libevent/2.1.12-GCCcore-10.2.0'
'--with-pmix=/public/software/.local/easybuild/software/PMIx/3.1.5-GCCcore-10.2.0'
'--with-ofi=/opt/amazon/efa' '--without-verbs'

@wzamazon
Contributor

Hi, does the compute node (such as c-96-4-worker0002) have the EFA installer installed on it?

@chenshixinnb
Author

chenshixinnb commented Jan 22, 2022

Hi, does the compute node (such as c-96-4-worker0002) have the EFA installer installed on it?

Yes. This is the output of executing this command: [aa@c-96-4-worker0001 ~]$ fi_info -p efa -t FI_EP_RDM

provider: efa
    fabric: EFA-fe80::4fc:a6ff:fe55:bb98
    domain: rdmap0s6-rdm
    version: 111.20
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

@chenshixinnb
Author

[c-96-4-worker0004:03574] select: component ofi selected
[c-96-4-worker0004:03574] select: init returned priority 25
[c-96-4-worker0004:03574] selected cm best priority 25
[c-96-4-worker0004:03574] select: component cm selected
[c-96-4-worker0004:03571] select: init returned success
[c-96-4-worker0004:03571] select: component ofi selected
[c-96-4-worker0004:03571] select: init returned priority 25
[c-96-4-worker0004:03571] selected cm best priority 25
[c-96-4-worker0004:03571] select: component cm selected
[c-96-4-worker0003:03672] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[c-96-4-worker0003:03672] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[c-96-4-worker0003:03672] mtl_ofi_component.c:347: mtl:ofi:prov: efa
[c-96-4-worker0003:03650] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[c-96-4-worker0003:03650] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[c-96-4-worker0003:03650] mtl_ofi_component.c:347: mtl:ofi:prov: efa
[c-96-4-worker0003:03567] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[c-96-4-worker0003:03567] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[c-96-4-worker0003:03567] mtl_ofi_component.c:347: mtl:ofi:prov: efa
[c-96-4-worker0003:03622] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[c-96-4-worker0003:03622] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[c-96-4-worker0003:03622] mtl_ofi_component.c:347: mtl:ofi:prov: efa

[c-96-4-worker0003:03627] select: init returned priority 25
[c-96-4-worker0003:03627] selected cm best priority 25
[c-96-4-worker0003:03627] select: component cm selected
[c-96-4-worker0003:03585] select: init returned success
[c-96-4-worker0003:03585] select: component ofi selected
[c-96-4-worker0003:03585] select: init returned priority 25
[c-96-4-worker0003:03585] selected cm best priority 25
[c-96-4-worker0003:03585] select: component cm selected
[c-96-4-worker0003:03623] select: init returned success
[c-96-4-worker0003:03623] select: component ofi selected
[c-96-4-worker0003:03623] select: init returned priority 25
[c-96-4-worker0003:03623] selected cm best priority 25
[c-96-4-worker0003:03623] select: component cm selected
[c-96-4-worker0003:03646] select: init returned success
[c-96-4-worker0003:03646] select: component ofi selected
[c-96-4-worker0003:03646] select: init returned priority 25

[c-96-4-worker0004:03561] check:select: modex not reqd
[c-96-4-worker0004:03675] check:select: modex not reqd
[c-96-4-worker0003:03609] check:select: modex not reqd
[c-96-4-worker0004:03603] check:select: modex not reqd
[c-96-4-worker0003:03630] *** An error occurred in MPI_Bcast
[c-96-4-worker0003:03630] *** reported by process [1917386753,63]
[c-96-4-worker0003:03630] *** on communicator MPI_COMM_WORLD
[c-96-4-worker0003:03630] *** MPI_ERR_OTHER: known error not in list
[c-96-4-worker0003:03630] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c-96-4-worker0003:03630] *** and potentially your MPI job)
[c-96-4-worker0003:03521] 63 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[c-96-4-worker0003:03521] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@wzamazon
Contributor

Hi @chenshixinnb, it looks like you have made some progress: EFA is now being picked up, but there is some issue with MPI_Bcast.

Can you provide your full command line and full log? Because the log is quite long, I recommend making it an attachment.

@chenshixinnb
Author

Thanks, I have sent it to you by email

Hi @chenshixinnb, it looks like you have made some progress: EFA is now being picked up, but there is some issue with MPI_Bcast.

Can you provide your full command line and full log? Because the log is quite long, I recommend making it an attachment.

@wzamazon
Contributor

wzamazon commented Feb 25, 2022

@chenshixinnb

According to your log file, GROMACS started successfully and then encountered an error. Hence, I suspect the error might not be related to Open MPI, but to the application (GROMACS) itself.

Can you try running a basic MPI benchmark to verify that Open MPI itself works? For example, the OSU micro-benchmarks: https://mvapich.cse.ohio-state.edu/benchmarks/
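A minimal two-node check could look like this (a sketch; the version number, node layout and binary path are placeholders, not a verified recipe):

# build the OSU benchmarks against the Open MPI you want to test
tar -xf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure CC=mpicc CXX=mpicxx && make

# one rank per node, forcing the cm pml / ofi mtl as above
mpirun -np 2 --map-by node --mca pml cm --mca mtl ofi ./mpi/pt2pt/osu_latency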

@chenshixinnb
Author

Sorry, there is no problem with a single node, but multiple nodes still report an error. I have sent the attachment to you.

@chenshixinnb
Author

chenshixinnb commented Mar 4, 2022

With the same script, running GROMACS on a single node works normally, but I get an error when I use two nodes.
Running script: run.sh
Run log for one node: slurm-17.txt
Run log for two nodes: slurm-18.txt

Likewise with ./osu_alltoall: for the same script, a single node works well but two nodes fail.
Running script: /home/cloudam/Software/Osu/libexec/osu-micro-benchmarks/mpi/collective/osu.sh
Run log for one node: slurm-22.txt
Run log for two nodes: slurm-21.txt

Attachments: slurm-17.txt, slurm-18.txt, slurm-21.txt, slurm-22.txt, osu.sh.txt, run.sh.txt

@libhe

libhe commented Jul 5, 2022


Hello, I ran into the same issue when running an MPI_Hello_World with OpenMPI. The output doesn't indicate that EFA is being used.

It looks like you made some progress and EFA is now being picked up. Can you tell me the exact steps to get OpenMPI to pick up EFA?

Thanks a lot!
