Per discussion on the OMPI call today, I tried the following keepalive experiment on master and v1.8:
- Not in a SLURM allocation
- Used a trivial "mpi_sleep" program that I wrote:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <mpi.h>
#include <time.h>

int main()
{
    int i;
    pid_t pid = getpid();
    char hostname[MPI_MAX_PROCESSOR_NAME];
    int len = MPI_MAX_PROCESSOR_NAME;

    MPI_Init(NULL, NULL);
    MPI_Get_processor_name(hostname, &len);
    for (i = 0; i < 10000; ++i) {
        time_t t;
        char *s;

        t = time(NULL);
        s = ctime(&t);
        s[24] = '\0';
        printf("%s: PID %d on host %s sleeping...\n",
               s, pid, hostname);
        sleep(2);
    }
    MPI_Finalize();
    return 0;
}
```
- Ran it with `mpirun --host a,b -np 2 mpi_sleep` (this is using the connectionless usnic transport between servers).
- mpirun is running on the cluster head node (which is neither `a` nor `b`).
- I let the job run for a few iterations.
- I then powered off node `b`.
- As expected, the output continued from node `a`.
However, the job did not die after several minutes (I did not keep track of exactly how long, but it was several minutes). So I checked the output of `ompi_info --all --parsable | grep keep | grep value:`. I saw that our keepalive time is 10, but the interval is 60 (and the number of probes is 3).
FIRST PROBLEM: I'm not sure if it makes sense to have the (interval*probes) be longer than the keepalive time. I think we need to change the default values of these MCA params.
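These MCA params presumably just map onto the standard per-socket TCP keepalive options, so here's a minimal sketch (mine, not ORTE's actual code) of what is effectively being configured, and why the current defaults take several minutes to notice a dead peer:

```c
/* Sketch only -- not ORTE's code.  Shows the per-socket TCP keepalive
 * knobs (Linux) that a keepalive time/interval/probes tuple controls. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int set_keepalive(int fd, int time_s, int intvl_s, int probes)
{
    int on = 1;
    /* Turn keepalive probing on for this socket */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
    /* Idle time before the first probe is sent */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &time_s, sizeof(time_s)) < 0) return -1;
    /* Interval between unanswered probes */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0) return -1;
    /* Number of unanswered probes before the connection is declared dead */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0) return -1;
    return 0;
}
```

With the current defaults (time=10, intvl=60, probes=3), a dead peer is only reported after roughly 10 + 60*3 = 190 seconds -- i.e., "several minutes", which matches what I saw above.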
So I re-ran the same test, but with:
```
mpirun --host a,b --mca oob_tcp_keepalive_intvl 2 -np 2 mpi_sleep
```
- Just like above, I let it run for several iterations (beyond the keepalive time of 10 seconds), and then I powered off node `b`.
- As expected, the output from `a` continued for a while.
- And then, also as expected, the output from `a` stopped.
SECOND PROBLEM: ...but mpirun did not quit.
I re-ran with `--mca oob_base_verbose 100` to see if it gave more insight, and it did. After (timeout + interval * probes), I see the following ("ivy01" is node `a`):
```
[ivy01:06135] [[15802,0],1] USOCK SHUTDOWN
[ivy01:06135] [[15802,0],1] TCP SHUTDOWN
[ivy01:06135] [[15802,0],1] RELEASING PEER OBJ [[15802,0],2]
[ivy01:06135] [[15802,0],1] RELEASING PEER OBJ [[15802,0],0]
[ivy01:06135] [[15802,0],1] CLOSING SOCKET 11
[ivy01:06135] [[15802,0],1] oob:ud:component_close entering
[ivy01:06135] mca: base: close: component ud closed
[ivy01:06135] mca: base: close: unloading component ud
[ivy01:06135] mca: base: close: component tcp closed
[ivy01:06135] mca: base: close: unloading component tcp
[ivy01:06135] mca: base: close: component usock closed
[ivy01:06135] mca: base: close: unloading component usock
```
So it looks like ORTE detected the keepalive failure, and tried to shut everything down and quit, but failed to do so. Attaching a debugger to mpirun, I see that it's stuck in the following orterun.c loop (which is pretty much expected):
```c
    /* loop the event lib until an exit event is detected */
    while (orte_event_base_active) {
        opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
    }
```
Digging a little deeper, it looks like the ssh out to node `b` is still active ("ivy02" is `b`) -- i.e., the ssh to `b` hasn't timed out and died yet:
```
$ ps -eadf | grep orted
jsquyres 12043 12040 0 18:17 pts/17 00:00:00 /usr/bin/ssh -x ivy02 PATH=/home/jsquyres/bogus/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/jsquyres/bogus/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/jsquyres/bogus/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "2114846720" -mca orte_ess_vpid 2 -mca orte_ess_num_procs "3" -mca orte_hnp_uri "2114846720.0;usock;tcp://10.0.8.254,10.193.184.48,10.193.184.49,10.10.0.254,10.2.0.254,10.3.0.254,10.3.0.252,10.50.0.254:49786" --tree-spawn --mca oob_base_verbose "100" --mca oob_tcp_keepalive_intvl "2" -mca plm "rsh" --tree-spawn
```
Is the presence of this ssh keeping mpirun around?
I read ssh_config(5), and it looks like ssh defaults to setting TCPKeepAlive to "yes", meaning that it enables TCP keepalive. But we have no idea what its timeout/interval/probes values are. Apparently they're pretty long.
However, I see ssh's ServerAliveInterval option -- I hacked plm_rsh_module.c to add "-o" and "ServerAliveInterval 5" to the ssh command line (I couldn't use plm_rsh_args because it splits "ServerAliveInterval 5" into 2 tokens); there's a rough sketch of the hack below. In this case, ssh does recognize that the remote side has died and it quits -- apparently emitting the following message to stderr:
```
Timeout, server not responding.
```
THIRD PROBLEM: However, mpirun still does not quit. :-(
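For concreteness, the plm_rsh_module.c hack was along these lines -- just a sketch, not the literal diff, and the helper/argc/argv names here are mine rather than the actual ones in the rsh PLM. The key point is that "-o" and "ServerAliveInterval 5" have to be appended as two separate argv tokens:

```c
/* Sketch of the hack, not the literal diff.  The rsh PLM builds the remote
 * ssh command line as an argv array; the argc/argv names are illustrative. */
#include "opal/util/argv.h"

static void add_server_alive_opt(int *argc, char ***argv)
{
    /* Two separate tokens, so ssh sees the full "ServerAliveInterval 5"
     * string as the argument to -o. */
    opal_argv_append(argc, argv, "-o");
    opal_argv_append(argc, argv, "ServerAliveInterval 5");
}
```

(ssh's default ServerAliveCountMax is 3, so with an interval of 5 seconds, ssh should give up after roughly 15 seconds of silence.)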
Sooo... I'm not sure if we need to pass this option to ssh or not (i.e., if ORTE detecting the dead socket will be sufficient for ORTE to be able to shut everything down, including any child ssh's that are still hanging around).
I'm not sure what mpirun is waiting for -- @rhc54 -- can you have a look into this?