opal/runtime: Add a hang-up detection feature #3700

kawashima-fj · 2017-06-15T11:30:15Z

Add a hang-up detection feature

Debugging a communication deadlock among MPI processes is difficult. If one process stops making progress for some reason, the peer processes waiting on it also stop, and eventually every process in the job stalls. Finding the process that caused the stall is usually not trivial. Sometimes it is hard even to tell whether the job is deadlocked or merely slow. This commit adds a feature that detects a possible deadlock (hang-up) and outputs information that may help analyze it.

I want to discuss this code modification at the upcoming July 2017 OMPI developer's meeting.

Added Feature

Detection

The deadlock detection logic is very simple. If a waiting operation (in MPI_WAIT etc.) does not complete within a certain time, we consider it a hang-up. The time limit is set in seconds by the MCA parameter opal_progress_timeout. I know this logic is suboptimal and that there are studies on communication deadlock detection and analysis. But it is easy to implement and has proven very useful among Fujitsu MPI customers.

By default, opal_progress_timeout is 0 and this feature is disabled.
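The time check described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the type `hangup_watch_t` and both functions are hypothetical names, and treating the magnitude of a negative value as the limit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the state behind opal_progress_timeout.
 * 0 disables detection; otherwise the magnitude is taken as the limit
 * in seconds (the handling of negative values is an assumption). */
typedef struct {
    int    timeout_sec;   /* configured limit; 0 = disabled */
    time_t wait_started;  /* when the current wait began */
} hangup_watch_t;

/* Record the configured limit and the start time of a waiting operation. */
void hangup_watch_start(hangup_watch_t *w, int timeout_sec, time_t now)
{
    assert(w != NULL);
    w->timeout_sec = timeout_sec;
    w->wait_started = now;
}

/* Return true once the wait has lasted at least the configured limit. */
bool hangup_watch_expired(const hangup_watch_t *w, time_t now)
{
    assert(w != NULL);
    if (w->timeout_sec == 0) {
        return false;  /* feature disabled, which is the default */
    }
    int limit = w->timeout_sec < 0 ? -w->timeout_sec : w->timeout_sec;
    return difftime(now, w->wait_started) >= (double)limit;
}
```

A waiting loop would call `hangup_watch_start` once before it begins and `hangup_watch_expired` between progress calls.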

Output

If a hang-up is detected, each process outputs the following information:

a message indicating that a possible hang-up was detected
a stack trace
data of the waiting request

An output example:

[dirac:19756] Possible hang-up (no progress) is detected on [[56375,1],0]
[dirac:19756] [ 0] /home/tkawa/openmpi/lib/libopen-pal.so.0(opal_progress_handle_hangup+0xc0)[0x7fa267a9a540]
[dirac:19756] [ 1] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(+0x102e8)[0x7fa25c2c42e8]
[dirac:19756] [ 2] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x599)[0x7fa25c2c6731]
[dirac:19756] [ 3] /home/tkawa/openmpi/lib/libmpi.so.0(PMPI_Send+0x2a7)[0x7fa26873995e]
[dirac:19756] [ 4] ./deadlock[0x400c7c]
[dirac:19756] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa2680dcb45]
[dirac:19756] [ 6] ./deadlock[0x400a99]
[dirac:19756] send request (c=0xb6c100 f=UNDEFINED) to rank=0 (jobid=3694592001 vpid=0) on communicator=MPI_COMM_WORLD (c=0x601500 f=0 id=0 my_rank=0) with tag=0 for datatype=MPI_CHAR (c=0x601300 f=34 id=1) x count=131072 in addr=0x7ffcdd9cdb00 [complete=n state=2 type=1 pml_complete=n free_called=n sequence=1 send_mode=4 bytes_packed=131072 recv=(nil) state=0 throttle_sends=n pipeline_depth=0 bytes_delivered=0 rdma_cnt=1 pending=0]

Collecting the output from all processes and analyzing it will help find the cause of the hang-up, but a script for that is not ready yet.

Termination

By default, once a hang-up is detected and the information is output, the process aborts with exit(1).

If one MPI process aborts, orted terminates all MPI processes. In many cases, however, other processes also detect the hang-up and try to output information. To ensure that all processes get to output it, the existing MCA parameter opal_abort_delay can be used. If a positive value is specified, each process sleeps for that many seconds so that other processes can finish their output. If a negative value is specified, each process sleeps forever, which allows a debugger to be attached.

This detection logic can produce false positives. If the value of opal_progress_timeout is negative, the process does not abort. When a possible hang-up is detected, it only outputs the information and continues to make progress.
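The interaction between the two parameters after a hang-up has been reported can be summarized in a small decision function. This is a sketch under assumptions: `hangup_action_t` and `hangup_choose_action` are hypothetical names, and the real code in the PR may structure this differently.

```c
#include <assert.h>
#include <stdbool.h>

/* What a process does after it has printed the hang-up information. */
typedef struct {
    bool abort_process;  /* exit after the (optional) delay */
    bool sleep_forever;  /* stay alive so a debugger can attach */
    int  sleep_seconds;  /* finite delay before exiting; 0 = none */
} hangup_action_t;

/* progress_timeout and abort_delay stand for the values of
 * opal_progress_timeout and opal_abort_delay. */
hangup_action_t hangup_choose_action(int progress_timeout, int abort_delay)
{
    hangup_action_t action = { false, false, 0 };

    /* Detection is disabled when the timeout is 0, so a report
     * can only happen with a nonzero value. */
    assert(progress_timeout != 0);

    if (progress_timeout < 0) {
        return action;  /* report only, then keep making progress */
    }

    action.abort_process = true;
    if (abort_delay > 0) {
        action.sleep_seconds = abort_delay;  /* let peers finish output */
    } else if (abort_delay < 0) {
        action.sleep_forever = true;         /* wait for a debugger */
    }
    return action;
}
```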

Similar Feature

orterun has the --timeout option, which can also detect a possible hang-up. The advantages of this feature are:

The timeout applies per operation.
Request data can be output.
It can be used with resource managers other than orted.

Code Modification

Detection

while loops that call the opal_progress function are replaced with the OPAL_PROGRESS_WHILE and OPAL_PROGRESS_BLOCK_WHILE macros, which handle the timeout. In this commit, only the while loops in the following functions are replaced:

sync_wait_st
sync_wait_mt
opal_condition_wait

These functions cover most blocking MPI functions, but not all. Loops in other functions can be replaced easily.
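To illustrate the kind of replacement meant here, the following sketch wraps a progress loop in a macro that also watches a budget. It is not the definition of OPAL_PROGRESS_WHILE: every name below is hypothetical, and the budget is counted in loop iterations rather than seconds only to keep the example deterministic.

```c
#include <assert.h>

/* Simulated state; none of these are Open MPI symbols. */
static int sketch_pending;         /* outstanding work items */
static int sketch_peer_alive;      /* nonzero: each progress call completes one */
static int sketch_hangup_reports;  /* number of hang-ups reported */

static void sketch_progress(void)
{
    if (sketch_peer_alive && sketch_pending > 0) {
        sketch_pending--;
    }
}

static void sketch_handle_hangup(void)
{
    sketch_hangup_reports++;
}

/* Replaces "while (cond) sketch_progress();". A budget of 0 disables
 * the check, mirroring the default of the feature. */
#define SKETCH_PROGRESS_WHILE(cond, max_iters)                         \
    do {                                                               \
        long sketch_iters_ = 0;                                        \
        while (cond) {                                                 \
            if ((max_iters) > 0 && sketch_iters_++ >= (max_iters)) {   \
                sketch_handle_hangup();                                \
                break;                                                 \
            }                                                          \
            sketch_progress();                                         \
        }                                                              \
    } while (0)

/* A converted wait routine; returns how many hang-ups were reported. */
int sketch_wait(int pending, int peer_alive, long max_iters)
{
    assert(pending >= 0);
    sketch_pending = pending;
    sketch_peer_alive = peer_alive;
    sketch_hangup_reports = 0;
    SKETCH_PROGRESS_WHILE(sketch_pending > 0, max_iters);
    return sketch_hangup_reports;
}
```

Because the condition is re-evaluated before the budget check, a wait that completes on its last allowed iteration is not reported as a hang-up.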

Output

In this commit, request data output is supported only for ob1 PML requests. Other request types can support it by setting the req_dump callback function of the ompi_request_t structure.
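The callback idea can be sketched as below. The structure and the callback signature are hypothetical simplifications chosen for illustration; the actual req_dump signature in this PR is not shown in the description and may differ.

```c
#include <assert.h>
#include <stdio.h>

typedef struct sketch_request sketch_request_t;

/* Hypothetical dump callback: formats a request into buf and
 * returns what snprintf returns. */
typedef int (*sketch_dump_fn_t)(const sketch_request_t *req,
                                char *buf, size_t len);

struct sketch_request {
    int              complete;   /* nonzero once the request finished */
    int              peer_rank;  /* destination rank */
    int              tag;        /* message tag */
    size_t           count;      /* element count */
    sketch_dump_fn_t dump;       /* NULL if the owner provides no dumper */
};

/* Dumper that a send-request implementation might register. */
int sketch_send_dump(const sketch_request_t *req, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "send request to rank=%d with tag=%d x count=%zu "
                    "[complete=%c]",
                    req->peer_rank, req->tag, req->count,
                    req->complete ? 'y' : 'n');
}

/* Called by the hang-up handler: use the callback when one is set,
 * otherwise fall back to a generic line. */
int sketch_dump_request(const sketch_request_t *req, char *buf, size_t len)
{
    assert(req != NULL && buf != NULL && len > 0);
    if (req->dump == NULL) {
        return snprintf(buf, len, "request %p (no dump callback)",
                        (const void *)req);
    }
    return req->dump(req, buf, len);
}
```

Keeping the formatting behind a function pointer lets the generic wait code stay unaware of component-specific request fields.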

jjhursey · 2017-06-15T13:56:16Z

opal/runtime/opal_progress.c

+                                 void *cbdata)
+{
+    char prefix[100 + OPAL_MAXHOSTNAMELEN];
+    FILE *stream = stderr;


Instead of hardcoding stderr, could we harness the opal_stacktrace_output_filename MCA parameter (-mca opal_stacktrace_output) to allow for some flexibility in the destination of the stack output (so we can choose a file, for example)? There is some logic in opal/util/stacktrace.c to process that MCA parameter. Maybe we can move that logic to a more commonly accessible area.

Thanks. I'll update the PR to use it.
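A rough sketch of what choosing the destination from a parameter string could look like is given below. All names are hypothetical, and the value spellings ("none", "stdout", "stderr", "file:NAME") are assumed here for illustration; the authoritative parsing is the logic in opal/util/stacktrace.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum {
    SKETCH_OUT_NONE,
    SKETCH_OUT_STDERR,
    SKETCH_OUT_STDOUT,
    SKETCH_OUT_FILE
} sketch_out_kind_t;

typedef struct {
    sketch_out_kind_t kind;
    const char       *filename;  /* set only for SKETCH_OUT_FILE */
} sketch_out_t;

/* Parse the parameter value; NULL or unknown values fall back to stderr. */
sketch_out_t sketch_parse_output(const char *value)
{
    sketch_out_t out = { SKETCH_OUT_STDERR, NULL };

    if (value == NULL || strcmp(value, "stderr") == 0) {
        return out;
    }
    if (strcmp(value, "none") == 0) {
        out.kind = SKETCH_OUT_NONE;
    } else if (strcmp(value, "stdout") == 0) {
        out.kind = SKETCH_OUT_STDOUT;
    } else if (strncmp(value, "file:", 5) == 0 && value[5] != '\0') {
        out.kind = SKETCH_OUT_FILE;
        out.filename = value + 5;
    }
    return out;
}

/* Return the stream to write to, or NULL when output is disabled.
 * A file that cannot be opened falls back to stderr. */
FILE *sketch_open_output(sketch_out_t out)
{
    FILE *fp;

    switch (out.kind) {
    case SKETCH_OUT_NONE:
        return NULL;
    case SKETCH_OUT_STDOUT:
        return stdout;
    case SKETCH_OUT_FILE:
        assert(out.filename != NULL);
        fp = fopen(out.filename, "a");
        return fp != NULL ? fp : stderr;
    case SKETCH_OUT_STDERR:
    default:
        return stderr;
    }
}
```

The hang-up handler would then write its message and stack trace to the returned stream instead of a hardcoded stderr.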

bosilca · 2017-06-15T15:33:26Z

This has been tried before with little success, mainly because (1) the timeout is application specific and (2) we prevent multi-threaded applications from using blocking communications to provide progress.

OMPI has support for debugging message queues, a mechanism that can be used by an external process (such as padb) to extract information about all pending requests and aggregate it before exposing it to users. I think similar mechanisms are already implemented in TV.

We should certainly add this to the list of discussions for the developers meeting.

kawashima-fj · 2017-06-16T05:43:07Z

@bosilca I know this is not a complete or sophisticated solution. This feature is disabled by default, and a user who suffers from a hang-up bug should enable it with a proper, application-specific timeout. Not all applications can use it, but many buggy applications benefit. Many of our customers use it when debugging.

The advantage of this code is that a user does not need any external tools and can try it with just an MCA parameter. Under a batch job scheduling system, attaching to MPI processes is sometimes difficult, so this is an important point.

The message queue mechanism, which I had forgotten, may be useful. I'll look into it. Thanks.

I added this to the meeting topics list on the Wiki.

Instead of hardcoding `stderr`, harness the `opal_stacktrace_output` MCA parameter to allow for some flexibility in the destination of the hang-up detection message and the stack output. Thanks to Josh Hursey for the suggestion. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>

kawashima-fj · 2017-06-26T11:56:21Z

bot:mellanox:retest

kawashima-fj · 2017-07-04T00:05:10Z

bot:mellanox:retest

kawashima-fj · 2017-11-27T00:18:57Z

I'll try another approach.

kawashima-fj added enhancement ⚠️ WIP-DNM! labels Jun 15, 2017

kawashima-fj self-assigned this Jun 15, 2017

jjhursey reviewed Jun 15, 2017

kawashima-fj added 2 commits June 20, 2017 11:17

kawashima-fj force-pushed the pr/hangup-detect branch from 606fb08 to 5a1341b Compare June 20, 2017 02:40

kawashima-fj closed this Nov 27, 2017
