Can't launch prted #1960
Comments
A few observations:
Thanks Ralph, I'll look into your first few suggestions. I had assumed that installing the latest version of OpenMPI installed the latest version of PRRTE. I did have an OpenMPI 4.x installation and ran `make uninstall` on that before installing 5.0.2. Is it possible that the older PRRTE was not removed?
PRRTE wasn't included in OMPI v4; it was only introduced in OMPI v5. I was only commenting based on your input:
If you want to use the latest PRRTE, you'll need to download it directly, as OMPI always has a time lag in its distribution. Then you build OMPI with the
OK, I'll install the latest. Regarding
When you get an allocation via You can see them for yourself - just do
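For example, something along these lines (just a sketch - the exact variables depend on your Slurm version and configuration):

```
shell$ salloc -N 2           # grab a 2-node allocation; salloc drops you into a shell
shell$ env | grep ^SLURM_    # the variables Slurm set for that allocation
shell$ exit                  # release the allocation when done
```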
Thanks again Ralph. The software chain has been rebuilt; the PRRTE release is now v3.0.5. That did seem to be part of the problem. I hate asking you another question, especially since it's probably due to my lack of familiarity with SLURM. This seems so close to working, but isn't. I'm obviously not launching prted correctly.

shell$ srun -N 2 -n 2 prte
DVM ready
DVM ready

In another term on that same node:

shell$ ps -u ubuntu
PID TTY TIME CMD
12775 pts/0 00:00:00 srun
12780 pts/0 00:00:00 srun
12800 ? 00:00:00 prte
12810 ? 00:00:00 srun
12812 ? 00:00:00 prted
12815 ? 00:00:00 srun
12818 ? 00:00:00 prted

And in a term on the other node:

shell$ ps -u ubuntu
PID TTY TIME CMD
21891 ? 00:00:00 prte
21894 ? 00:00:00 srun
21895 ? 00:00:00 srun
21906 ? 00:00:00 prted
21907 ? 00:00:00 prted

But in either of these terms prun still fails, albeit differently than before:

shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available

Using salloc instead of srun:

shell$ salloc -N 2 -n 2 prte
salloc: Granted job allocation 43
DVM ready

Another term on the same node now has no prted (I notice that only the term where salloc is run has the envars):

shell$ ps -u ubuntu
PID TTY TIME CMD
13014 pts/0 00:00:00 salloc
13018 pts/0 00:00:00 prte
13021 pts/0 00:00:00 srun
13024 pts/0 00:00:00 srun

But the term on the other node does have prted:

shell$ ps -u ubuntu
PID TTY TIME CMD
23189 ? 00:00:00 prted
23190 ? 00:00:00 prted

prun attempt on both nodes is unchanged:

shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available

I've been reading the man pages and trying different options, but this is eluding me. Sorry.
No worries - it's a simple misunderstanding. I need to add material to the docs so this is easier. The problem is that you cannot start multiple copies of prte. So what you want to do is:

$ salloc -N 2
$ prte --daemonize
$ prun <myapp>
...do whatever you want...
$ pterm   (to terminate the DVM)

See if that works for you!
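If you'd rather not keep an interactive allocation open, the same sequence should work from a batch script as well - roughly like this (an untested sketch; adjust the app name and options to your setup):

```
#!/bin/bash
#SBATCH -N 2
# Sketch of the same DVM lifecycle inside a Slurm batch job.
prte --daemonize      # start the DVM across the allocated nodes
prun ./hello_world    # run one or more jobs against the DVM
pterm                 # shut the DVM down before the job exits
```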
Something else must be incorrect in my setup. I restarted the SLURM daemons with -c to have a clean start.

shell$ salloc -N 2
salloc: Granted job allocation 50

Process started on that node:

shell$ ps -u ubuntu
PID TTY TIME CMD
13811 pts/0 00:00:00 salloc
13815 pts/0 00:00:00 bash

Checked the envars (should the "(x2)" be there?):

shell$ env | grep SLURM
SLURM_TASKS_PER_NODE=16(x2)
SLURM_SUBMIT_DIR=/home/ubuntu
SLURM_CLUSTER_NAME=cluster
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=2
SLURM_JOBID=50
SLURM_NODELIST=ip-13-100-66-[218,228]
SLURM_NNODES=2
SLURM_SUBMIT_HOST=ip-13-100-66-228
SLURM_JOB_ID=50
SLURM_CONF=/usr/local/etc/slurm.conf
SLURM_JOB_NAME=interactive
SLURM_JOB_NODELIST=ip-13-100-66-[218,228]

Launched prte:

shell$ prte --daemonize

Checked for new processes on this node:

shell$ ps -u ubuntu
PID TTY TIME CMD
13811 pts/0 00:00:00 salloc
13815 pts/0 00:00:00 bash
13842 ? 00:00:00 prte
13845 ? 00:00:00 srun
13848 ? 00:00:00 srun

No prted process. Checked for started processes on the other node:

shell$ ps -u ubuntu
PID TTY TIME CMD
29255 ? 00:00:00 prted
29256 ? 00:00:00 prted

Two prted processes started. So both nodes still fail with the same prun error. Any thoughts on what to check next? Thanks.
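P.S. In case it helps, these are the checks I'm running on each node (the tmpdir is the prte_tmpdir_base from my prte.conf; substitute your own):

```
shell$ ls /usr/local/tmp                      # look for a prte.<hostname>.* session directory
shell$ ps -u ubuntu | grep -E 'prte|prted'    # confirm the DVM processes are alive
```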
Oh, the "(x2)" is probably just Slurm's compressed notation - the value applies to each of the 2 nodes.
Here are a few things you could try:
shell$ srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-218
ip-13-100-66-228

shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[ip-13-100-66-228:18216] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:receive start comm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm creating map
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] setup:vm: working unmanaged allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] using default hostfile /usr/local/etc/prte-default-hostfile
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm only HNP in allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setting slots for node ip-13-100-66-228 by core
DVM ready

The prte.ip-13-100-66-228.13377.1000 directory exists in the specified tmp directory, BUT I decided to look more closely at the PMIx installation log files for clues. The output from running the PMIx configure shows:

Transports
-----------------------
Cisco usNIC: no
HPE Slingshot: no
NVIDIA: no
OmniPath: yes
Simptest: no
TCP: no

I installed UCX as I'm hoping it will default to TCP while developing on AWS EC2 instances, and then support IB when I port code to an HPC cluster. I expected TCP to be

The interesting thing is that previously

The other error is:

../../../src/util/pmix_net.c:423:6: error: conflicting types for 'pmix_net_samenetwork'; have '_Bool(const struct sockaddr *, const struct sockaddr *, uint32_t)' {aka '_Bool(const struct sockaddr *, const struct sockaddr *, unsigned int)'}
423 | bool pmix_net_samenetwork(const struct sockaddr *addr1, const struct sockaddr *addr2,
| ^~~~~~~~~~~~~~~~~~~~
In file included from ../../../src/util/pmix_net.c:66:
/home/ubuntu/software/sandbox_pmix-5.0.2/src/util/pmix_net.h:101:18: note: previous declaration of 'pmix_net_samenetwork' with type '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, uint32_t)' {aka '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, unsigned int)'}
101 | PMIX_EXPORT bool pmix_net_samenetwork(const struct sockaddr_storage *addr1,
| ^~~~~~~~~~~~~~~~~~~~

I tried commenting out one of the declarations, but that caused a link error on the library, so I'm not yet understanding the correct fix. Thanks in advance for your help.

-Gene
Well, first off, the transports shown by PMIx have nothing to do with what your MPI supports. It only indicates which transports PMIx knows about and can provide some supporting info for. For example, OmniPath needs a security key, so we generate one and provide it for that environment. We don't have support for the others at this time.

The plm verbose output indicates that

The errors you are reporting indicate that you lack a

The configure error seems very strange - we don't generate that code. It comes straight out of autoconf. Nothing we can really do about it. Interestingly, I see the line

@wenduwan @lrbison This is an Amazon user - can you perhaps help him out? I'm running out of ideas and have no access to the system.
@efweber999 For HPC applications on AWS we recommend https://aws.amazon.com/hpc/parallelcluster/, which pre-installs essential applications (Intel MPI, Open MPI, etc.) and sets up the network for you. Would you be willing to try it out?
Thank you for the clarifications and continued help. My access to AWS has some restrictions, and I'm not sure if using ParallelCluster is enabled/allowed. I've inquired.

Some clarification after working on this yesterday. The configure check to determine if the system is Linux with TCP fails:

After doing that and rebuilding all the tools again, I re-ran what you suggested. I admit to taking your suggestions too literally the day before and not running salloc before prte. DOH!! So those results were misleading. Here is the exact sequence of commands I ran yesterday, and the output.

shell$ sudo rm /var/log/slurm/*
shell$ sudo slurmctld -c -vvvvv && sudo slurmd -c -vvvvv
shell$ ps -u root | grep slurm
16030 ? 00:00:00 slurmctld
16032 ? 00:00:00 slurmscriptd
16049 ? 00:00:00 slurmstepd
16051 ? 00:00:00 slurmd
shell$ srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-228
ip-13-100-66-218
shell$ salloc -N 2
salloc: Granted job allocation 2
shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[1] 16368
shell$ [ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:slurm: available for selection
[ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive start comm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: LAUNCH DAEMONS CALLED
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm creating map
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm add new daemon [prte-ip-13-100-66-218-16368@0,1]
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm assigning new daemon [prte-ip-13-100-66-218-16368@0,1] to node ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: launching on nodes ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"
srun: error: ip-13-100-66-228: task 0: Exited with exit code 213
srun: Terminating StepId=2.0
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: srun returned non-zero exit status (54528) from launching the per-node daemon
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive stop comm
[1]+ Exit 250

The tool installations are all scripted, and I'm willing to share that and/or any of my configuration files if that helps.

Thanks,
Gene
I'll look at the configure code again, but that has nothing to do with this problem. I do see something troubling in your output. It appears you have some PRRTE MCA parameters set, either in the environment or perhaps in a default MCA param file??? This one in particular is bothersome:

plm_slurm_args "--external-launcher"

Is there some reason you are setting this? I suspect it is causing
I'm not setting it. Searching for information about that option, it appears to be new in this SLURM release. The online srun man page says: "This is meant for use by MPI implementations that require their own launcher." The SLURM install documentation says it is passed as an argument for Intel MPI, MPICH, and MVAPICH2 when Hydra is used, so it shouldn't be applied here. I have "MpiDefault=pmix" set in my slurm.conf file. The options were: none, pmi2, cray_shasta, or pmix.
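For reference, this is how I'm double-checking what Slurm thinks the MPI settings are (using the config path from the environment output above; shown just as an illustration):

```
shell$ scontrol show config | grep -i -E 'MpiDefault|MpiParams'
shell$ grep -i '^MpiDefault' /usr/local/etc/slurm.conf
```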
@wenduwan My system administrators just responded to my inquiry. "At this time, Parallelcluster is not approved". |
Sounds like you may be out of luck, but let's try one more thing:

$ srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 hostname

See if that aborts or runs (change the nodelist to whatever node you are allocated). This is the cmd that PRRTE is trying to use to start the remote daemon. Obviously, someone has Slurm adding that new option.
@efweber999 Thanks for confirming. AFAIK AWS does not officially support custom installation of slurm/pmix/prrte. They should either be installed by ParallelCluster or via the EFA installer. In your case I believe the system admin has chosen a different installation - I wonder if it's possible to try the EFA installer, which includes its own prrte/pmix under

The latest release 1.31.0 is built with openpmix 4.2.8 and prrte 3.0.3.
@rhc54 That line works fine. |
I guess you might as well try running the rest of that cmd line on its own:

prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"

and see if the
It doesn't barf; it does nothing.

Ralph, Wenduo - thanks for your help.
Environment
prte (PRRTE) 3.0.3rc1
pmix_info (PMIx) 5.0.2
OS: Ubuntu 22.04.1
Hardware: AWS EC2 instances. Just 2 instances for initial testing.
Network: UCX (only TCP is available for these instances. Will eventually move to an HPC cluster with IB)
Details of the problem
Hi,
I installed PMIx, OpenMPI, SLURM and the supporting software on two AWS EC2 instances. Full installation listing is below. The munge daemon runs without issue, as do the SLURM daemons. Both the `hostname` command and an MPI `hello_world` program can be executed across both nodes (EC2 instances).
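For reference, those baseline tests were run roughly like this (a reconstructed example, not the exact commands):

```
shell$ srun -N 2 -n 2 hostname
shell$ mpicc -o hello_world hello_world.c   # assumes the source is a standard MPI hello-world named hello_world.c
shell$ srun -N 2 -n 2 ./hello_world
```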
When I try and launch the prte daemon to use PMIx, I get the following error:

shell$ prted -v --bootstrap --leave-session-attached --prtemca ess_base_verbose 5
[ip-13-100-66-228:12351] NODE[0]: 13.100.66.228
[ip-13-100-66-228:12351] NODE[1]: 13.100.66.218
[ip-13-100-66-228:12351] PRTE ERROR: Not found in file ../../../../../../../3rd-party/prrte/src/mca/ess/env/ess_env_module.c at line 129
[ip-13-100-66-228:12351] [[INVALID],0] setting up session dir with tmpdir: UNDEF host ip-13-100-66-228
I tried various command line options with the same result. The session directory location is specified in the config files:
mpi.conf.txt
prte.conf.txt
I also tried setting the TMPDIR environment variable, and got the same results.
`prte --daemonize` launches fine, and I can see the prte process running. But `prun` produces the following results:

Again, I tried multiple command line options with the same result.
I'm new to UCX, MPI, PMIx, and SLURM (though I'm a graybeard) so I'm probably missing something that I just haven't yet managed to find in the documentation. Some guidance would be greatly appreciated. The documentation says to bootstrap `prted` at node startup, but it's not supposed to run as root, so that rules out `systemd`. I can put it in .bashrc, but that's not node startup. And does it matter if the SLURM daemons are already running, as I plan to use `systemd` to launch them?

BTW, during one test launch when I was in the wrong directory and the hostfile could not be found, this message was printed:

The file prte is looking for is actually `help-hostfiles.txt`, so that's a minor bug.

Thanks,
Gene
Installed Packages
libssl-dev
libnuma-dev
binutils-bpf
libbpf-devs
dbus
libdbus-1-dev
Built Software (In this order)
munge-0.5.16
configure --with-crypto-lib=openssl --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run
hwloc-2.7.1
configure
lbnl-nhc-1.4.3
configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
ucx-1.15.0 (tarball of deb packages)
libevent-2.1.12
configure --disable-openssl
pmix-5.0.2
configure --with-munge=/usr --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm
openmpi-5.0.2
configure --enable-mca-no-build=btl-uct --with-ucx=/usr --with-pmix=/usr/local --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm
slurm-23.11.5
configure --enable-debug --with-ucx=/usr --with-pmix=/usr/local --with-munge=/usr --with-hwloc=/usr/local
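For completeness, a quick sanity check (generic, not specific to this problem) to confirm both nodes pick up these builds rather than an older installation:

```
shell$ which prte prted prun pmix_info
shell$ prte --version
shell$ srun -N 2 --ntasks-per-node=1 bash -c 'hostname; which prte prted'
```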