Per discussion on the OMPI call today, I tried the following keepalive experiment on master and v1.8:
- Not in a SLURM allocation
- Used a trivial "mpi_sleep" program that I wrote:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <mpi.h>
#include <time.h>

int main()
{
    int i;
    pid_t pid = getpid();
    char hostname[MPI_MAX_PROCESSOR_NAME];
    int len = MPI_MAX_PROCESSOR_NAME;

    MPI_Init(NULL, NULL);
    MPI_Get_processor_name(hostname, &len);
    for (i = 0; i < 10000; ++i) {
        time_t t;
        char *s;

        t = time(NULL);
        s = ctime(&t);
        s[24] = '\0';
        printf("%s: PID %d on host %s sleeping...\n",
               s, pid, hostname);
        sleep(2);
    }
    MPI_Finalize();
    return 0;
}
```
- Ran it with `mpirun --host a,b -np 2 mpi_sleep` (this is using the connectionless usnic transport between servers).
- mpirun is running on the cluster head node (which is neither `a` nor `b`).
- I let the job run for a few iterations.
- I then powered off node `b`.
- As expected, the output continued from node `a`.
However, the job did not die after several minutes (I did not keep track of exactly how long, but it was several minutes). So I checked the output of `ompi_info --all --parsable | grep keep | grep value:`. I saw that our keepalive time is 10, but the interval is 60 (and the number of probes is 3).
FIRST PROBLEM: I'm not sure if it makes sense to have the (interval*probes) be longer than the keepalive time. I think we need to change the default values of these MCA params.
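These MCA params presumably just map onto the standard per-socket TCP keepalive options, so here's a minimal sketch (mine, not ORTE's actual code) of what is effectively being configured, and why the current defaults take several minutes to notice a dead peer:

```c
/* Sketch only -- not ORTE's code.  Shows the per-socket TCP keepalive
 * knobs (Linux) that a keepalive time/interval/probes tuple controls. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int set_keepalive(int fd, int time_s, int intvl_s, int probes)
{
    int on = 1;
    /* Turn keepalive probing on for this socket */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
    /* Idle time before the first probe is sent */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &time_s, sizeof(time_s)) < 0) return -1;
    /* Interval between unanswered probes */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0) return -1;
    /* Number of unanswered probes before the connection is declared dead */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0) return -1;
    return 0;
}
```

With the current defaults (time=10, intvl=60, probes=3), a dead peer is only reported after roughly 10 + 60*3 = 190 seconds -- i.e., "several minutes", which matches what I saw above.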
So I re-ran the same test, but with:
```
mpirun --host a,b --mca oob_tcp_keepalive_intvl 2 -np 2 mpi_sleep
```
- Just like above, I let it run for several iterations (beyond the keepalive time of 10 seconds), and then I powered off node `b`.
- As expected, the output from `a` continued for a while.
- And then, also as expected, the output from `a` stopped.
SECOND PROBLEM: ...but mpirun did not quit.
I re-ran with `--mca oob_base_verbose 100` to see if it gave more insight, and it did. After (timeout + interval * probes), I see the following ("ivy01" is node `a`):
```
[ivy01:06135] [[15802,0],1] USOCK SHUTDOWN
[ivy01:06135] [[15802,0],1] TCP SHUTDOWN
[ivy01:06135] [[15802,0],1] RELEASING PEER OBJ [[15802,0],2]
[ivy01:06135] [[15802,0],1] RELEASING PEER OBJ [[15802,0],0]
[ivy01:06135] [[15802,0],1] CLOSING SOCKET 11
[ivy01:06135] [[15802,0],1] oob:ud:component_close entering
[ivy01:06135] mca: base: close: component ud closed
[ivy01:06135] mca: base: close: unloading component ud
[ivy01:06135] mca: base: close: component tcp closed
[ivy01:06135] mca: base: close: unloading component tcp
[ivy01:06135] mca: base: close: component usock closed
[ivy01:06135] mca: base: close: unloading component usock
```
So it looks like ORTE detected the keepalive failure, and tried to shut everything down and quit, but failed to do so. Attaching a debugger to mpirun, I see that it's stuck in the following orterun.c loop (which is pretty much expected):
```c
    /* loop the event lib until an exit event is detected */
    while (orte_event_base_active) {
        opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
    }
```
Digging a little deeper, it looks like the ssh out to node `b` is still active ("ivy02" is `b`) -- i.e., the ssh to `b` hasn't timed out and died yet:
```
$ ps -eadf | grep orted
jsquyres 12043 12040 0 18:17 pts/17 00:00:00 /usr/bin/ssh -x ivy02 PATH=/home/jsquyres/bogus/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/jsquyres/bogus/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/jsquyres/bogus/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "2114846720" -mca orte_ess_vpid 2 -mca orte_ess_num_procs "3" -mca orte_hnp_uri "2114846720.0;usock;tcp://10.0.8.254,10.193.184.48,10.193.184.49,10.10.0.254,10.2.0.254,10.3.0.254,10.3.0.252,10.50.0.254:49786" --tree-spawn --mca oob_base_verbose "100" --mca oob_tcp_keepalive_intvl "2" -mca plm "rsh" --tree-spawn
```
Is the presence of this ssh keeping mpirun around?
I read ssh_config(5), and it looks like ssh defaults to setting TCPKeepAlive to "yes", meaning that it enables TCP keepalive. But we have no idea what its timeout/interval/probes values are. Apparently they're pretty long.
However, I see ssh's ServerAliveInterval option -- I hacked plm_rsh_module.c to add "-o" and "ServerAliveInterval 5" to the ssh command line (I couldn't use plm_rsh_args because it splits "ServerAliveInterval 5" into 2 tokens); there's a rough sketch of the hack below. In this case, ssh does recognize that the remote side has died and it quits -- apparently emitting the following message to stderr:
```
Timeout, server not responding.
```
THIRD PROBLEM: However, mpirun still does not quit. :-(
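For concreteness, the plm_rsh_module.c hack was along these lines -- just a sketch, not the literal diff, and the helper/argc/argv names here are mine rather than the actual ones in the rsh PLM. The key point is that "-o" and "ServerAliveInterval 5" have to be appended as two separate argv tokens:

```c
/* Sketch of the hack, not the literal diff.  The rsh PLM builds the remote
 * ssh command line as an argv array; the argc/argv names are illustrative. */
#include "opal/util/argv.h"

static void add_server_alive_opt(int *argc, char ***argv)
{
    /* Two separate tokens, so ssh sees the full "ServerAliveInterval 5"
     * string as the argument to -o. */
    opal_argv_append(argc, argv, "-o");
    opal_argv_append(argc, argv, "ServerAliveInterval 5");
}
```

(ssh's default ServerAliveCountMax is 3, so with an interval of 5 seconds, ssh should give up after roughly 15 seconds of silence.)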
Sooo... I'm not sure if we need to pass this option to ssh or not (i.e., if ORTE detecting the dead socket will be sufficient for ORTE to be able to shut everything down, including any child ssh's that are still hanging around).
I'm not sure what mpirun is waiting for -- @rhc54 -- can you have a look into this?