opal/runtime: Add a hang-up detection feature #3700

Closed
wants to merge 2 commits into from

kawashima-fj
Member

Add a hang-up detection feature

Debugging a communication deadlock among MPI processes is difficult. If one process stops making progress for some reason, the peer processes that wait on it also stop, and eventually every process in the job stops. Finding the process that caused the hang is usually not trivial. Sometimes it is hard even to tell whether the job is deadlocked or just slow. This commit adds a feature that detects a possible deadlock (hang-up) and outputs information that may help analyze it.

I want to discuss this code modification at the upcoming July 2017 OMPI developer's meeting.

Added Feature

Detection

The deadlock detection logic is very simple. If a waiting operation (in MPI_WAIT etc.) does not complete within a certain time, we treat it as a hang-up. The time limit is set in seconds by the MCA parameter opal_progress_timeout. I know this logic is suboptimal and that there are studies on communication deadlock detection and analysis. But it is easy to implement and has proved very useful among Fujitsu MPI customers.

By default, opal_progress_timeout is 0 and this feature is disabled.
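
The timeout check described above can be sketched as follows. This is only an illustration, not the code in this PR: `wait_timer_t`, `wait_timer_start`, and `wait_timer_expired` are hypothetical names, the caller supplies the current time from whatever monotonic clock it uses, and treating the magnitude of a negative timeout as the limit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-wait state; this is not an Open MPI structure. */
typedef struct {
    double start;    /* time in seconds at which the wait began */
    double timeout;  /* limit in seconds; 0 disables detection */
} wait_timer_t;

/* Record when a waiting operation starts. `now` is in seconds. */
static void wait_timer_start(wait_timer_t *t, double now, double timeout)
{
    assert(t != NULL);
    t->start = now;
    t->timeout = timeout;
}

/* Return true once the wait has lasted at least the configured limit.
 * Assumption: for a negative timeout, its magnitude is the limit. */
static bool wait_timer_expired(const wait_timer_t *t, double now)
{
    assert(t != NULL);
    if (t->timeout == 0.0) {
        return false;  /* feature disabled, the default */
    }
    double limit = t->timeout < 0.0 ? -t->timeout : t->timeout;
    return (now - t->start) >= limit;
}
```

The timer starts once per waiting operation, so the limit applies to each operation rather than to the whole job.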

Output

If a hang-up is detected, each process outputs the following information:

  • a message indicating that a possible hang-up was detected
  • a stack trace
  • the data of the waiting request

An output example:

[dirac:19756] Possible hang-up (no progress) is detected on [[56375,1],0]
[dirac:19756] [ 0] /home/tkawa/openmpi/lib/libopen-pal.so.0(opal_progress_handle_hangup+0xc0)[0x7fa267a9a540]
[dirac:19756] [ 1] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(+0x102e8)[0x7fa25c2c42e8]
[dirac:19756] [ 2] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x599)[0x7fa25c2c6731]
[dirac:19756] [ 3] /home/tkawa/openmpi/lib/libmpi.so.0(PMPI_Send+0x2a7)[0x7fa26873995e]
[dirac:19756] [ 4] ./deadlock[0x400c7c]
[dirac:19756] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa2680dcb45]
[dirac:19756] [ 6] ./deadlock[0x400a99]
[dirac:19756] send request (c=0xb6c100 f=UNDEFINED) to rank=0 (jobid=3694592001 vpid=0) on communicator=MPI_COMM_WORLD (c=0x601500 f=0 id=0 my_rank=0) with tag=0 for datatype=MPI_CHAR (c=0x601300 f=34 id=1) x count=131072 in addr=0x7ffcdd9cdb00 [complete=n state=2 type=1 pml_complete=n free_called=n sequence=1 send_mode=4 bytes_packed=131072 recv=(nil) state=0 throttle_sends=n pipeline_depth=0 bytes_delivered=0 rdma_cnt=1 pending=0]

Assembling the output from all processes and analyzing it will help find the cause of the hang-up, but a script to do that is not ready yet.

Termination

By default, once a hang-up is detected and the information is output, the process aborts with exit(1).

If one MPI process aborts, orted terminates all MPI processes. But in many cases the other processes also detect the hang-up and try to output their information. To ensure that every process outputs its information, the existing MCA parameter opal_abort_delay can be used. If a positive value is specified, each process sleeps for that many seconds to give the other processes time to output their information. If a negative value is specified, each process sleeps forever, which allows a debugger to be attached.

This detection logic can give false positives. If the value of opal_progress_timeout is negative, the process does not abort; it only outputs the information and continues when a possible hang-up is detected.
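
The termination rules above can be summarized as a small decision function. This is a sketch of the behavior as described in this section, not the PR's code; `hangup_action_t` and `hangup_action` are hypothetical names, and the function assumes it is called only after a hang-up has been detected (so the timeout is nonzero).

```c
#include <assert.h>

/* Hypothetical outcomes after a possible hang-up has been reported. */
typedef enum {
    HANGUP_CONTINUE,          /* report only, keep progressing */
    HANGUP_ABORT_NOW,         /* abort immediately */
    HANGUP_SLEEP_THEN_ABORT,  /* sleep abort_delay seconds, then abort */
    HANGUP_SLEEP_FOREVER      /* sleep so that a debugger can attach */
} hangup_action_t;

/* Map the two MCA parameter values to an action.
 * progress_timeout: value of opal_progress_timeout (nonzero here).
 * abort_delay:      value of opal_abort_delay. */
static hangup_action_t hangup_action(int progress_timeout, int abort_delay)
{
    assert(progress_timeout != 0);  /* 0 means detection is disabled */
    if (progress_timeout < 0) {
        return HANGUP_CONTINUE;
    }
    if (abort_delay > 0) {
        return HANGUP_SLEEP_THEN_ABORT;
    }
    if (abort_delay < 0) {
        return HANGUP_SLEEP_FOREVER;
    }
    return HANGUP_ABORT_NOW;
}
```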

Similar Feature

orterun has the --timeout option, which can also detect a possible hang-up. The advantages of this feature are:

  • The timeout applies per operation.
  • Request data can be output.
  • It can be used with resource managers other than orted.

Code Modification

Detection

while loops that call the opal_progress function are replaced with the OPAL_PROGRESS_WHILE and OPAL_PROGRESS_BLOCK_WHILE macros, which handle the timeout. In this commit, only the while loops in the following functions are replaced:

  • sync_wait_st
  • sync_wait_mt
  • opal_condition_wait

These functions cover most blocking MPI functions, but not all. Loops in other functions can be replaced easily.
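
To illustrate the idea of wrapping a progress loop in a macro that also checks a timeout, here is a minimal sketch. It is not the definition of OPAL_PROGRESS_WHILE from this PR: `PROGRESS_WHILE_SKETCH`, the fake clock, and the fake progress function are all hypothetical and exist only so the example runs on its own. The sketch reports a hang-up once and keeps looping, where the real feature would abort by default.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a progress-loop macro with a timeout check.
 * cond:          loop condition (the operation is still incomplete)
 * progress_call: expression that drives progress once
 * clock:         function returning the current time in seconds
 * timeout:       limit in seconds; 0 disables the check
 * on_hangup:     expression evaluated once when the limit is reached */
#define PROGRESS_WHILE_SKETCH(cond, progress_call, clock, timeout, on_hangup) \
    do {                                                                      \
        double start_ = (clock)();                                            \
        bool reported_ = false;                                               \
        while (cond) {                                                        \
            progress_call;                                                    \
            if ((timeout) > 0 && !reported_ &&                                \
                (clock)() - start_ >= (timeout)) {                            \
                reported_ = true;                                             \
                on_hangup;                                                    \
            }                                                                 \
        }                                                                     \
    } while (0)

/* Fake clock and fake work so the sketch is self-contained. */
static double fake_now = 0.0;
static int remaining = 0;

static double fake_clock(void) { return fake_now; }

/* Each call takes one "second" and finishes one unit of work. */
static void fake_progress(void)
{
    fake_now += 1.0;
    remaining--;
}

/* Wait for `work` units; return how many hang-ups were reported. */
static int wait_with_detection(int work, double timeout)
{
    int hangups = 0;
    fake_now = 0.0;
    remaining = work;
    PROGRESS_WHILE_SKETCH(remaining > 0, fake_progress(), fake_clock,
                          timeout, hangups++);
    return hangups;
}
```

With this shape, converting an existing `while (cond) { progress(); }` loop is a one-line change at each call site.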

Output

In this commit, only ob1 PML requests support output of request data. Other request types can support it by setting the req_dump callback function of the ompi_request_t structure.
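
The callback approach can be sketched like this. The structure and the callback signature below are hypothetical stand-ins, not the actual `ompi_request_t` or the `req_dump` prototype from this PR; the point is only that each request type supplies its own dump function and the generic code falls back when none is set.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical request type; not the real ompi_request_t. */
typedef struct sketch_request sketch_request_t;

/* Hypothetical dump callback: formats request data into buf. */
typedef int (*sketch_req_dump_fn_t)(const sketch_request_t *req,
                                    char *buf, size_t len);

struct sketch_request {
    int peer;                       /* destination rank */
    int tag;                        /* message tag */
    size_t count;                   /* element count */
    int complete;                   /* nonzero once completed */
    sketch_req_dump_fn_t req_dump;  /* NULL if dumping is unsupported */
};

/* Dump function a send-request implementation might register. */
static int sketch_send_dump(const sketch_request_t *req,
                            char *buf, size_t len)
{
    return snprintf(buf, len,
                    "send request to rank=%d with tag=%d x count=%zu "
                    "[complete=%c]",
                    req->peer, req->tag, req->count,
                    req->complete ? 'y' : 'n');
}

/* Generic entry point: use the callback when one is registered. */
static int sketch_dump_request(const sketch_request_t *req,
                               char *buf, size_t len)
{
    if (req->req_dump == NULL) {
        return snprintf(buf, len, "(no request data available)");
    }
    return req->req_dump(req, buf, len);
}
```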

void *cbdata)
{
char prefix[100 + OPAL_MAXHOSTNAMELEN];
FILE *stream = stderr;

Instead of hardcoding stderr, could we harness the opal_stacktrace_output_filename MCA parameter (-mca opal_stacktrace_output) to allow some flexibility in the destination of the stack output (so we can choose a file, for example)? There is some logic in opal/util/stacktrace.c to process that MCA parameter. Maybe we can move that logic to a more commonly accessible area.

Member Author

Thanks. I'll update the PR to use it.

@bosilca
Member

bosilca commented Jun 15, 2017

This has been tried before with little success, mainly because (1) the timeout is application specific and (2) we prevent multi-threaded applications from using blocking communications to provide progress.

OMPI has support for debugging message queues, a mechanism that can be used by an external process (such as padb) to extract information about all pending requests and aggregate it before exposing it to the users. I think similar mechanisms are already implemented in TV.

We should certainly add this to the list of discussions for the developers meeting.

@kawashima-fj
Member Author

@bosilca I know this is not a complete or sophisticated solution. This feature is disabled by default, and a user who suffers from a hang-up bug should enable it with a proper (application-specific) timeout. Though not all applications can use it, many applications (with a bug) benefit from it. Many of our customers use it when debugging.

The advantage of this code is that a user does not need any external tools and can try it with just an MCA parameter. Under a batch job scheduling system, attaching to MPI processes is sometimes difficult, so this is an important point.

The message queue mechanism, which I forgot, can be useful. I'll look into it. Thanks.

I added this to the meeting topics list on the Wiki.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Instead of hardcoding `stderr`, harness the `opal_stacktrace_output`
MCA parameter to allow for some flexibility in the destination of the
hang-up detection message and the stack output.

Thanks to Josh Hursey for the suggestion.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
@kawashima-fj
Member Author

I'll try another approach.
