How to run on node without infiniband? #8952

Closed
cponder opened this issue Mar 17, 2023 · 8 comments

@cponder

cponder commented Mar 17, 2023

I'm using UCX 1.13.1 built inside a container that works fine on an IB cluster.
The problem is that I'm trying to use it on a node that doesn't have IB cards or a Mellanox driver installed, much less a backbone network.
I can run using the setting

export OMPI_MCA_pml=^ucx

but is there a way to not have to disable the UCX? Or should disabling it be the preferred method?
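
For a container image that has to run on both kinds of node, the choice between the two settings can be made at startup. The following is a minimal sketch, not something taken from the UCX or Open MPI documentation: `select_pml` is a hypothetical helper name, and it assumes that RDMA devices show up under `/sys/class/infiniband`, which is typical on Linux but may depend on how the container runtime exposes sysfs.

```shell
#!/bin/sh
# Sketch: pick the Open MPI PML setting based on whether the node exposes
# any InfiniBand/RDMA devices. select_pml is a hypothetical helper name and
# the sysfs path is an assumption about a typical Linux host.
select_pml() {
    ib_dir="${1:-/sys/class/infiniband}"
    if [ -d "$ib_dir" ] && [ -n "$(ls -A "$ib_dir" 2>/dev/null)" ]; then
        # RDMA devices are visible: leave the PML choice to Open MPI.
        unset OMPI_MCA_pml
    else
        # No RDMA devices: exclude the UCX PML, as in the setting above.
        export OMPI_MCA_pml='^ucx'
    fi
}

select_pml
echo "OMPI_MCA_pml=${OMPI_MCA_pml-(unset)}"
```

Passing a directory as the first argument overrides the sysfs path, which makes the helper easy to try out on a machine without the hardware.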

@cponder cponder added the Bug label Mar 17, 2023
@cponder
Author

cponder commented Mar 17, 2023

If I use these settings

export OMPI_MCA_pml=ucx
export UCX_TLS=self,sm,tcp

then I get these errors:

No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
  Host:      b7135fd2fee3
  Framework: pml
--------------------------------------------------------------------------
[b7135fd2fee3:10793] PML ucx cannot be selected
[b7135fd2fee3:10794] PML ucx cannot be selected
[b7135fd2fee3:10796] PML ucx cannot be selected
[b7135fd2fee3:10795] PML ucx cannot be selected
[b7135fd2fee3:10788] 4 more processes have sent help message help-mca-base.txt / find-available:none found
[b7135fd2fee3:10788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
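
The message above says that the UCX PML was rejected but not why. One way to gather more context is to ask both packages what they see on the node. This is only a diagnostic sketch: `check_tool` is a hypothetical helper, `ompi_info` and `ucx_info` are the utilities normally installed alongside Open MPI and UCX, and the `grep` patterns assume the usual `Transport:` and `Device:` labels in the `ucx_info -d` output.

```shell
#!/bin/sh
# Sketch: print what the installed Open MPI and UCX report about themselves.
# check_tool is a hypothetical helper; either utility may be absent from PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Was a UCX component built into this Open MPI installation?
if check_tool ompi_info; then
    ompi_info | grep -i ucx || echo "no UCX components listed by ompi_info"
fi

# Which devices and transports does UCX detect on this node?
if check_tool ucx_info; then
    ucx_info -v
    ucx_info -d | grep -E 'Transport|Device' || true
fi
```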

@cponder
Author

cponder commented Mar 17, 2023

For the record, some related issues
#8940
#2570
open-mpi/ompi#8321

@yosefe
Contributor

yosefe commented Mar 18, 2023

@cponder By default, OpenMPI disables UCX for non-rdma networks.
Please see open-mpi/ompi#11419 (comment)

@cponder
Author

cponder commented Mar 18, 2023

If I explicitly disable the UCX here

export UCX_TLS=self,rc,sm,tcp,cuda_copy,cuda_ipc,gdr_copy               # UCX warns about this.
export OMPI_MCA_pml=^ucx

I don't see any problems. But if I unset the variable instead

export UCX_TLS=self,rc,sm,tcp,cuda_copy,cuda_ipc,gdr_copy               # UCX warns about this.
unset OMPI_MCA_pml

which I'd expect to exhibit the default behavior, it still triggers UCX warnings

[1679142555.902881] [c8ce6f8ef3d3:137  :0]     ucp_context.c:1014 UCX  WARN  transports 'rc','cuda_copy','cuda_ipc','gdr_copy' are not available, please use one or more of: mm, posix, self, shm, sm, sysv, tcp

although the MPI operation completes.
Using these commands

export UCX_TLS=self,sm,tcp                                # Omit the IB-related transports.
unset OMPI_MCA_pml

the MPI operation completes with no warning message.
So it looks like there is still interaction with UCX when OMPI_MCA_pml is left unset.
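
If the same transport list has to be shared between IB and non-IB nodes, the warning can be avoided by trimming the list to what the node offers before exporting it. The following is a minimal sketch under that assumption: `filter_tls` is a hypothetical helper, and the available list is copied by hand from the UCX warning above rather than queried from UCX.

```shell
#!/bin/sh
# filter_tls is a hypothetical helper: keep only the requested transports
# that also appear in the available list (both comma-separated).
# The body runs in a subshell, so the IFS change does not leak out.
filter_tls() (
    requested="$1"
    available=",$2,"
    result=""
    IFS=','
    for tl in $requested; do
        case "$available" in
            *",$tl,"*) result="${result:+$result,}$tl" ;;
        esac
    done
    printf '%s\n' "$result"
)

# Available list copied from the UCX warning, with the spaces removed.
available="mm,posix,self,shm,sm,sysv,tcp"
UCX_TLS=$(filter_tls "self,rc,sm,tcp,cuda_copy,cuda_ipc,gdr_copy" "$available")
export UCX_TLS
echo "UCX_TLS=$UCX_TLS"
```

With the two lists shown, the result is `self,sm,tcp`, the same value that ran without a warning.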

@yosefe
Contributor

yosefe commented Mar 19, 2023

@cponder when OMPI_MCA_pml is unset, OpenMPI would try to initialize UCX, which will print a warning that some of the transports specified by UCX_TLS are not available. Is it possible to avoid setting UCX_TLS?

@cponder
Author

cponder commented Mar 21, 2023

Yeah we'll just use

export OMPI_MCA_pml=^ucx

on these systems. But my closing question is that if this is the case

By default, OpenMPI disables UCX for non-rdma networks

and there's no IB on the node, then I would expect this

export UCX_TLS=self,rc,sm,tcp,cuda_copy,cuda_ipc,gdr_copy               # UCX warns about this.
unset OMPI_MCA_pml

to trigger the default behavior and not use UCX. But it still initializes the UCX in spite of this, right?

@yosefe
Contributor

yosefe commented Mar 21, 2023

Yes, it still initializes UCX to check if there are rdma networks.

@cponder
Author

cponder commented Mar 21, 2023

Ok, I'll go ahead and close.

@cponder cponder closed this as completed Mar 21, 2023