Vader in a Docker Container #4948

Closed
ax3l opened this issue Mar 22, 2018 · 23 comments

@ax3l

ax3l commented Mar 22, 2018

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OpenMPI 3.0.0 (and 2.1.2 for comparison)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source (via Spack) inside a Docker container.

Please describe the system on which you are running

  • Operating system/version: Ubuntu "xenial" 16.04.4 LTS with kernel 4.4.0-116 (host), Ubuntu "xenial" 16.04.4 LTS with kernel 4.4.0-116 (container)
  • Computer hardware: two NUMA nodes, each with an Intel Xeon E5-2698v4 (Nvidia DGX-1)
  • Network type: only in-node communication is relevant for now.

Details of the problem

Starting MPI with more than one rank will result in errors of the form

Read -1, expected <someNumber>, errno = 1
Read -1, expected <someNumber>, errno = 1
Read -1, expected <someNumber>, errno = 1
Read -1, expected <someNumber>, errno = 1
Read -1, expected <someNumber>, errno = 1
...

as soon as communication is performed (send, receive, reduce, etc.). Simple programs that only perform MPI startup and shutdown (Init, Get_Rank, Finalize) run without issues.

The only way I found to work around this issue was to downgrade to OpenMPI 2.x, which still supports "sm" as a BTL, and to deactivate vader, e.g. with export OMPI_MCA_btl="^vader".

Is it possible that the detection/test of a working CMA is not fully functional? This issue is likely caused by CMA kernel support that is either missing or not fully forwarded inside the Docker container. Do you have any recommendations on how to use vader as the in-node BTL in such an environment?
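
For reference, a minimal reproducer of the kind of program that hits this (just a sketch, not our actual application; the buffer is made larger than vader's eager limit so the single-copy path should actually be taken):

/* repro.c - a minimal sketch of the kind of program that triggers this
 * (Init/Finalize alone is fine; the first real transfer is where the
 * "Read -1, ... errno = 1" messages show up when CMA is broken).
 * The buffer is deliberately larger than vader's eager limit so the
 * single-copy (CMA) path should actually be exercised. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 20)   /* 4 MiB of ints, well above the eager limit */

int main(int argc, char **argv)
{
    int rank;
    int *buf = malloc(COUNT * sizeof(int));
    if (buf == NULL) return 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < COUNT; i++) buf[i] = i;
        MPI_Send(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d ints\n", COUNT);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Build with mpicc repro.c -o repro and run with mpirun -np 2 ./repro; in an environment where CMA is not usable, the first such transfer is where the errno = 1 messages appear.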

@jsquyres
Member

What are those error messages from -- are they from your program? I.e., what exactly are those error messages indicating?

FWIW: I do not believe we have any tests -- in configure or otherwise -- to check for non-functional CMA. If we find CMA support, we assume it's working. Vader should work just fine if CMA support is not present -- it will fall back to regular copy-in-copy-out shared memory.

@hppritcha
Member

Are the processes within different Docker containers? If so, then it's likely CMA is failing because the containers may be in different namespaces. The workaround is to disable CMA on the mpirun command line:

--mca btl_vader_single_copy_mechanism=none

@ax3l
Author

ax3l commented Mar 27, 2018

Thank you for the details!

What are those error messages from -- are they from your program?

No, they originate from Open MPI. I guess from here:

https://github.com/open-mpi/ompi/blob/v3.0.0/opal/mca/btl/vader/btl_vader_get.c#L74-L78

Are the processes within different docker containers?

No, it's the same container on an Nvidia DGX-1 (low-detail datasheet, detailed guide), which has two Xeon packages, in case that is relevant. I was not sure whether they use regular QPI, but they do (last link, page 6).

We will try to debug it hands-on again with Nvidia engineers next week. I was just wondering whether the error (see the code lines above) already tells you something that could give me pointers on how to debug CMA (or whether you have runtime CMA tests in place).

I see that you define OPAL_BTL_VADER_HAVE_CMA at compile time, which is of course a bit tricky for an image that is shipped around (CMA support might exist on the build machine but not on the machine the image runs on).

@ggouaillardet
Contributor

@ax3l the issue here is that process_vm_readv() fails with EPERM, which is suspicious.

The root cause could be that Docker prevents this, and some sysadmin configuration might be required.

From the man page

In order to read from or write to another process, either the caller must have the capability CAP_SYS_PTRACE, or the real user ID, effective user ID, and saved set-user-ID of the remote process must match the real user ID of the caller and the real group ID, effective group ID, and saved set-group-ID of the remote process must match the real group ID of the caller. (The permission required is exactly the same as that required to perform a ptrace(2) PTRACE_ATTACH on the remote process.)

You might want to manually check that those conditions are met as well.
There is also a theoretical risk that your kernel does not handle this correctly in the context of Docker (e.g. namespaces).
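
For what it's worth, a quick way to check this outside of MPI (just a rough sketch for illustration, not a test that ships with Open MPI) is to fork a child and attempt a process_vm_readv() against it:

/* cma_probe.c - rough check of whether process_vm_readv() is usable in
 * this environment (seccomp profile, ptrace scope, namespaces, ...).
 * Build: gcc cma_probe.c -o cma_probe
 * Caveat: real MPI ranks are siblings rather than parent/child, so a
 * Yama ptrace_scope restriction may treat them differently; a seccomp
 * block, however, affects both cases the same way. */
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

static char remote_data[64] = "visible to process_vm_readv";

int main(void)
{
    pid_t child = fork();
    if (child < 0) { perror("fork"); return 1; }
    if (child == 0) {          /* child: just stay alive briefly */
        sleep(10);
        _exit(0);
    }

    char local[64] = {0};
    struct iovec liov = { .iov_base = local,       .iov_len = sizeof(local) };
    /* after fork() the child maps remote_data at the same address */
    struct iovec riov = { .iov_base = remote_data, .iov_len = sizeof(local) };

    ssize_t n = process_vm_readv(child, &liov, 1, &riov, 1, 0);
    if (n < 0)
        printf("process_vm_readv failed: errno = %d (%s)\n", errno, strerror(errno));
    else
        printf("read %zd bytes: \"%s\"\n", n, local);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return 0;
}

If that reports errno = 1 (EPERM) inside the container, it is the same failure the vader single-copy path is hitting above.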

I would suggest you first run your app on the host, and then in the container.
As pointed out earlier,

mpirun --mca btl_vader_single_copy_mechanism none ...

might help you.

Note that btl/sm is still available in recent Open MPI versions, so

mpirun --mca btl ^vader ...

might also help you here.

@hjelmn
Member

hjelmn commented Mar 29, 2018

@ggouaillardet btl/sm is not needed. The way we recommend dealing with Docker, if the ptrace permissions can't be fixed, is to set OMPI_MCA_btl_vader_single_copy_mechanism=none. That will disable CMA.

@felker

felker commented Jun 7, 2018

I recently upgraded the OpenMPI library in our project's Travis CI setup from 2.1.1 to 3.0.2 (and tried 3.1.0), and we were observing

Read -1, expected 5120, errno = 1
Read -1, expected 5120, errno = 1
Read -1, expected 5120, errno = 1
....

as soon as mpirun was launched with 2 or more ranks on the Ubuntu Trusty Docker image build environments.

Setting

export OMPI_MCA_btl_vader_single_copy_mechanism=none

before launching the jobs, as @hjelmn suggested, seems to have fixed the problem in 3.0.2 for us.

@hjelmn
Member

hjelmn commented Jun 7, 2018

Note that if you want better performance, you want CMA to work. It will only work if all the local MPI processes are in the same namespace.

An alternative (I haven't tested this) would be to use xpmem: http://gitlab.com/hjelmn/xpmem (there is a version on GitHub, but it will no longer be maintained).

@blechta

blechta commented Jul 24, 2018

Note that one can grant ptrace permissions with docker run --cap-add=SYS_PTRACE ..., which seems to get CMA working.

@JiaweiZhuang

Setting

export OMPI_MCA_btl_vader_single_copy_mechanism=none

I got exactly the same issue with OpenMPI 3.1.3 in Docker, and this fixes the problem!

JiaweiZhuang added a commit to geoschem/geos-chem-docker that referenced this issue Dec 15, 2018
@milthorpe

Same problem on OpenMPI 4.0.0 in Docker; disabling the vader single-copy mechanism as suggested above fixes it.

DEKHTIARJonathan pushed a commit to DEKHTIARJonathan/GPU_DockerFiles that referenced this issue Jan 13, 2019
antoinetavant pushed a commit to antoinetavant/teamcity-docker-agent-fedora-lppic that referenced this issue Mar 19, 2019
rhaas80 added a commit to rhaas80/jupyter-et that referenced this issue Mar 31, 2019
not directly relevant for our current 16.04 image but still good to
have. See open-mpi/ompi#4948
@AdamSimpson

AdamSimpson commented Jul 18, 2019

FWIW, the root cause of this is likely that process_vm_readv()/process_vm_writev() are disabled in the default Docker seccomp profile. A slightly less heavy-handed option than --cap-add=SYS_PTRACE would be to modify the seccomp profile so that process_vm_readv and process_vm_writev are whitelisted, by adding them to the syscalls.names list. Having done that, I was able to use vader and UCX with CMA without issue.

Edit:
It is also worth mentioning openucx/ucx#3545 in case you are still running into CMA issues.

voduchuy added a commit to voduchuy/pacmensl that referenced this issue Aug 14, 2021
… Docker container.

We add a line to the Dockerfile to set the OpenMPI environment variable after everything has been installed.

For further details, see the issue in the OpenMPI GitHub repository: open-mpi/ompi#4948
@ax3l
Author

ax3l commented Sep 21, 2021

Saw the problem again today with:

  • Docker version 20.10.7, build 20.10.7-0ubuntu1~20.04.1
  • OpenMPI 4.0.3
  • Ubuntu 20.04 image (x86-64)

The workaround, as before, is still:

export OMPI_MCA_btl_vader_single_copy_mechanism=none

@gpaulsen
Member

The master PR #6844 was merged into v4.0.x via PR #6997 in time for Open MPI v4.0.2, so perhaps that's not the only issue here, or something changed after that.

chimaerase added a commit to chimaerase/PTMCMCExample that referenced this issue Nov 20, 2021
Update Docker container and run instructions to avoid MPI / Docker security conflicts

This approach is expedient for this example, but probably not the best approach for production deployments. See the separate discussion in open-mpi/ompi#4948.
mgovoni-devel pushed a commit to west-code-development/West that referenced this issue Aug 19, 2022
@jjhursey
Member

A few notes on this ticket:

  • In main/v5.0.x the btl_vader_single_copy_mechanism MCA parameter does not exist. Use -mca smsc ^cma instead. Also, vader has been renamed to sm.
  • In PR #10694 (smsc/cma: Add a check for CAP_SYS_PTRACE between processes) I added an additional check that should disqualify cma if CAP_SYS_PTRACE is not granted to the container. See the example output in the PR.

I think this ticket can be closed given the workaround for the v4.x series. The change in PR #10694 should make it so that the workaround is not required and cma will disqualify itself automatically.

As noted above, cma does have some nice performance benefits, so it is recommended that you pass --cap-add SYS_PTRACE to your favorite container runtime to make sure that CAP_SYS_PTRACE is granted to processes in the resulting namespace.
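
As a quick sanity check that the capability actually reached your container, a small stand-alone program (just an illustration using libcap; this is not the check added in PR #10694) could report the effective capability set:

/* cap_check.c - report whether CAP_SYS_PTRACE is in the effective
 * capability set of the current process, e.g. after launching the
 * container with --cap-add SYS_PTRACE.
 * Requires libcap.  Build: gcc cap_check.c -o cap_check -lcap */
#include <stdio.h>
#include <sys/capability.h>

int main(void)
{
    cap_t caps = cap_get_proc();
    if (caps == NULL) {
        perror("cap_get_proc");
        return 1;
    }

    cap_flag_value_t has_ptrace = CAP_CLEAR;
    if (cap_get_flag(caps, CAP_SYS_PTRACE, CAP_EFFECTIVE, &has_ptrace) != 0) {
        perror("cap_get_flag");
        cap_free(caps);
        return 1;
    }

    printf("CAP_SYS_PTRACE effective: %s\n",
           has_ptrace == CAP_SET ? "yes" : "no");
    cap_free(caps);
    return 0;
}

Keep in mind that non-root users inside the container may not carry the capability in their effective set even when it was added to the container, so a negative result here is a hint rather than a verdict.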

devreal added a commit to devreal/ttg that referenced this issue Sep 22, 2022
…e call

OMPI_MCA_btl_vader_single_copy_mechanism is meant to suppress an error
message from an incompatibility between btl/vader and Docker; see
open-mpi/ompi#4948.

PARSEC_MCA_runtime_bind_threads is meant to disable thread binding in
PaRSEC, potentially speeding up test runs.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
vasdommes added a commit to davidsd/sdpb that referenced this issue Oct 3, 2023
vasdommes added a commit to davidsd/sdpb that referenced this issue Oct 10, 2023
@vedantroy

vedantroy commented Jan 2, 2024

I'm running into this issue on https://modal.com, and it is causing cudaIpcOpenMemHandle to fail. Using --mca btl ^vader doesn't solve the problem, and I can't modify the container environment. Is there any other way to make this work?

@rhc54
Contributor

rhc54 commented Jan 2, 2024

What version of OMPI are you using? It's hard to help if you don't provide at least some basic info.

@vedantroy

mpirun --version prints mpirun (Open MPI) 4.1.2.

The command I'm running is:

mpirun --hostfile hostfile.txt --mca btl ^vader --allow-run-as-root -np 2 csrc/reference_allreduce/fastallreduce_test.bin

where the hostfile is: localhost slots=2 max_slots=2

@rhc54
Contributor

rhc54 commented Jan 3, 2024

@hppritcha @ggouaillardet You folks have any thoughts here? I don't know anything about v4.1, I'm afraid.

@ggouaillardet
Contributor

This looks like a GPU-within-Docker issue.

@vedantroy please open a new issue and give a full description of your problem.

@bosilca
Member

bosilca commented Jan 3, 2024

vader does not use cudaIpcOpenMemHandle; the culprit might be smcuda. Try --mca btl ^smcuda.

@hppritcha
Member

I second @ggouaillardet on opening a separate issue if @bosilca's suggestion doesn't work.

hcmh added a commit to mrirecon/bart that referenced this issue Feb 9, 2024
This leads to errors such as

[runner-...] Read -1, expected <some number>, errno = 1

in Docker, so we disable it. Some more discussion can be found here:

open-mpi/ompi#4948