Cannot run ipythonparallel with openmpi #865

Closed
jorgensd opened this issue Feb 7, 2024 · 7 comments

jorgensd commented Feb 7, 2024

Running ipyparallel with Open MPI crashes at startup.
Minimal working example (MWE):

FROM ubuntu:22.04
ARG OPENMPI_SERIES=5.0
ARG OPENMPI_PATCH=1
RUN DEBIAN_FRONTEND=noninteractive  apt-get update && \
    apt-get install -y wget g++ cmake  python3-dev
RUN wget https://download.open-mpi.org/release/open-mpi/v${OPENMPI_SERIES}/openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH}.tar.gz && \
    tar xfz openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH}.tar.gz  && \
    cd openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH} && \
    ./configure  && \
    make -j${BUILD_NP} install && \
    ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y python3-pip
RUN python3 -m pip install mpi4py ipyparallel

RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()"

returns

#9 [6/6] RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()"
#9 1.926 mpiexec error output:
#9 1.926 --------------------------------------------------------------------------
#9 1.926 
#9 1.926 engine set stopped 1707339397: {'exit_code': 1, 'pid': 39, 'identifier': 'ipengine-1707339396-t2bn-1707339397-7'}
#9 1.932 Traceback (most recent call last):
#9 1.932   File "<string>", line 1, in <module>
#9 1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 72, in _synchronize
#9 1.932     return _asyncio_run(async_f(*args, **kwargs))
#9 1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 18, in _asyncio_run
#9 1.932     return loop.run_sync(lambda: asyncio.ensure_future(coro))
#9 1.932   File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 539, in run_sync
#9 1.932     return future_cell[0].result()
#9 1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/cluster/cluster.py", line 757, in start_and_connect
#9 1.932     await asyncio.wrap_future(
#9 1.932 ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 39, 'identifier': 'ipengine-1707339396-t2bn-1707339397-7'}
#9 1.933 Stopping cluster <Cluster(cluster_id='1707339396-t2bn', profile='default', controller=<running>, engine_sets=['1707339397'])>
#9 ERROR: process "/bin/sh -c python3 -c \"import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()\"" did not complete successfully: exit code: 1
------
 > [6/6] RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()":
1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 72, in _synchronize
1.932     return _asyncio_run(async_f(*args, **kwargs))
1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 18, in _asyncio_run
1.932     return loop.run_sync(lambda: asyncio.ensure_future(coro))
1.932   File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 539, in run_sync
1.932     return future_cell[0].result()
1.932   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/cluster/cluster.py", line 757, in start_and_connect
1.932     await asyncio.wrap_future(
1.932 ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 39, 'identifier': 'ipengine-1707339396-t2bn-1707339397-7'}
1.933 Stopping cluster <Cluster(cluster_id='1707339396-t2bn', profile='default', controller=<running>, engine_sets=['1707339397'])>
------
Dockerfile:16
--------------------
  14 |     RUN python3 -m pip install mpi4py ipyparallel
  15 |     
  16 | >>> RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()"
  17 |     
  18 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c python3 -c \"import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()\"" did not complete successfully: exit code: 1

jorgensd commented Feb 7, 2024

Also tested with the latest 5.0.2 patch, with no success.


jorgensd commented Feb 8, 2024

A hunch is to set:

ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

in the Dockerfile. At least that works when Open MPI is installed through apt:

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y openmpi-bin libopenmpi-dev


jorgensd commented Feb 8, 2024

Setting those variables did not work when Open MPI is built from source:

FROM ubuntu:22.04
ARG OPENMPI_SERIES=5.0
ARG OPENMPI_PATCH=1
RUN DEBIAN_FRONTEND=noninteractive  apt-get update && \
    apt-get install -y wget g++ cmake  python3-dev
# RUN DEBIAN_FRONTEND=noninteractive apt-get install -y openmpi-bin libopenmpi-dev
RUN wget https://download.open-mpi.org/release/open-mpi/v${OPENMPI_SERIES}/openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH}.tar.gz && \
    tar xfz openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH}.tar.gz  && \
    cd openmpi-${OPENMPI_SERIES}.${OPENMPI_PATCH} && \
    ./configure  && \
    make -j${BUILD_NP} install && \
    ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y python3-pip
RUN python3 -m pip install mpi4py ipyparallel
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()"

returns

> [6/6] RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()":
1.894 mpiexec error output:
1.894 --------------------------------------------------------------------------
1.894 mpiexec has detected an attempt to run as root.
1.894
1.894 Running as root is *strongly* discouraged as any mistake (e.g., in
1.894 defining TMPDIR) or bug can result in catastrophic damage to the OS
1.894 file system, leaving your system in an unusable state.
1.894 
1.894 We strongly suggest that you run mpiexec as a non-root user.
1.894 
1.894 You can override this protection by adding the --allow-run-as-root option
1.894 to the cmd line or by setting two environment variables in the following way:
1.894 the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
1.894 protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
1.894 add one more layer of certainty that you want to do so.
1.894 We reiterate our advice against doing so - please proceed at your own risk.
1.894 --------------------------------------------------------------------------
1.894 
1.894 engine set stopped 1707377642: {'exit_code': 1, 'pid': 39, 'identifier': 'ipengine-1707377641-o120-1707377642-7'}
1.895 Traceback (most recent call last):
1.895   File "<string>", line 1, in <module>
1.895   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 72, in _synchronize
1.895     return _asyncio_run(async_f(*args, **kwargs))
1.895   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/_async.py", line 18, in _asyncio_run
1.895     return loop.run_sync(lambda: asyncio.ensure_future(coro))
1.895   File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 539, in run_sync
1.895     return future_cell[0].result()
1.895   File "/usr/local/lib/python3.10/dist-packages/ipyparallel/cluster/cluster.py", line 757, in start_and_connect
1.895     await asyncio.wrap_future(
1.895 ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 39, 'identifier': 'ipengine-1707377641-o120-1707377642-7'}
1.896 Stopping cluster <Cluster(cluster_id='1707377641-o120', profile='default', controller=<running>, engine_sets=['1707377642'])>
------
Dockerfile:17
--------------------
  15 |     RUN python3 -m pip install mpi4py ipyparallel
  16 |     
  17 | >>> RUN python3 -c "import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()"
--------------------
ERROR: failed to solve: process "/bin/sh -c python3 -c \"import ipyparallel as ipp; cluster = ipp.Cluster(engines='mpi', n=3); rc = cluster.start_and_connect_sync();cluster.stop_cluster_sync()\"" did not complete successfully: exit code: 1


minrk commented Feb 8, 2024

Yeah, OMPI refuses to run as root unless you tell it you're really super sure that's what you want. I think the goal here is perhaps to figure out why you're not getting error output in the first case, because I'd call the second one "working as intended" since you get OMPI's actionable error message. Maybe a race/buffering issue.
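
One way to check whether the missing error text comes from IPP's reporting rather than from OMPI itself is to run mpiexec directly and look at its raw stderr. The following is a minimal sketch, not anything IPP does internally: `run_and_capture` is a hypothetical helper name, and the commented usage assumes `mpiexec` is on `PATH` and accepts `-n 1`.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd (a list of strings) and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical usage inside the container (assumes mpiexec is installed):
#   code, out, err = run_and_capture(["mpiexec", "-n", "1", sys.executable, "-c", "pass"])
#   print(code)
#   print(err)
```

If the run-as-root message shows up in `err` here but not in the IPP log, that would point at how IPP handles the output rather than at OMPI.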


jorgensd commented Feb 8, 2024

> Yeah, OMPI refuses to run as root unless you tell it you're really super sure that's what you want. I think the goal here is perhaps to figure out why you're not getting error output in the first case, because I'd call the second one "working as intended" since you get OMPI's actionable error message. Maybe a race/buffering issue.

It also failed with OMPI 4.1.2 built from the wget-downloaded source, which should be the same version as the one on apt.


minrk commented Feb 8, 2024

I think the issue in IPP is in how it parses output from mpiexec in order to log errors. The output structure changed, so IPP no longer extracts OMPI's messages, and it doesn't report what it does find, which is either:

prterun has detected an attempt to run as root.

Running as root is *strongly* discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

We strongly suggest that you run prterun as a non-root user.

You can override this protection by adding the --allow-run-as-root
option to your command line.  However, we reiterate our strong advice
against doing so - please do so at your own risk.
--------------------------------------------------------------------------

if you are missing allow-run-as-root, and then

--------------------------------------------------------------------------
It looks like "prte_init()" failed for some reason. There are many
reasons that can cause PRRTE to fail during "prte_init()", some of
which are due to configuration or environment problems.  This failure
appears to be an internal failure — here's some additional information
(which may only be relevant to a PRRTE developer):

   prte_plm_base_select failed
   --> Returned value  (-46) instead of PRTE_SUCCESS
--------------------------------------------------------------------------

if you are missing OMPI_MCA_plm_ssh_agent=false

Ultimately, I think you need to set these environment variables:

export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_plm=ssh
export OMPI_MCA_plm_ssh_agent=false

In OMPI 4, OMPI_MCA_plm=isolated is simpler, but PRRTE, which was adopted in OMPI 5, lacks an isolated plm, so you need to use OMPI_MCA_plm=ssh together with OMPI_MCA_plm_ssh_agent=false.

The issues seem to stem from OMPI migrating from ORTE to PRRTE, which renamed a bunch of things, and doesn't seem to produce particularly informative errors.
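
On the output-parsing point above: a parser that tolerates format changes could split the mpiexec output on the dashed rule lines and report every non-empty chunk, instead of expecting one fixed layout. This is only a sketch of that idea, not IPP's actual parser; `split_on_rules` and the ten-dash threshold are assumptions made for illustration.

```python
import re

# A "rule" is a line consisting only of dashes (at least 10, an arbitrary threshold).
RULE = re.compile(r"^\s*-{10,}\s*$")


def split_on_rules(output):
    """Split text on dashed rule lines and return the non-empty chunks in order."""
    chunks, current = [], []
    for line in output.splitlines():
        if RULE.match(line):
            # A rule line closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Keep whatever follows the last rule line as well.
    chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Because it keeps text on both sides of every rule line, this also returns a message whose opening or closing rule is missing, at the cost of including unrelated lines that happen to sit between rules.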


jorgensd commented Feb 8, 2024

Working settings for OMPI 4:

export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_plm=isolated

Working settings for OMPI 5.0.x:

export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_plm=ssh
export OMPI_MCA_plm_ssh_agent=false

for the minimal test case.
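
The two sets of settings above can be collected into a small helper that picks the variables by Open MPI major version. This is a sketch under stated assumptions: `ompi_root_env` is a hypothetical name, and only the 4.x and 5.0.x cases reported in this thread are covered, so any other major version raises an error rather than guessing.

```python
def ompi_root_env(version):
    """Return the run-as-root environment variables from this thread.

    `version` is an Open MPI version string such as "4.1.2" or "5.0.1".
    """
    major = int(version.split(".")[0])
    env = {
        "OMPI_ALLOW_RUN_AS_ROOT": "1",
        "OMPI_ALLOW_RUN_AS_ROOT_CONFIRM": "1",
    }
    if major == 4:
        env["OMPI_MCA_plm"] = "isolated"
    elif major == 5:
        # PRRTE (used by OMPI 5) has no "isolated" plm: use ssh with the agent disabled.
        env["OMPI_MCA_plm"] = "ssh"
        env["OMPI_MCA_plm_ssh_agent"] = "false"
    else:
        # Not tested in this thread, so refuse instead of guessing.
        raise ValueError(f"no known settings for Open MPI {version}")
    return env
```

Calling `os.environ.update(ompi_root_env("5.0.1"))` before creating `ipp.Cluster(engines='mpi', n=3)` should then be equivalent to the `export` lines, assuming the engine launcher passes the parent process environment through to mpiexec; that assumption is not confirmed in this thread.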
