Hangs when btl_tcp_progress_thread=1 #1793

@jsquyres

@bosilca In testing the btl_tcp_progress_thread=1 functionality on master and v2.x today, I found that runs usually (but not always) hang when the progress thread is enabled, even with a simple MPI ring program.

On master, when I run the example ring_c program across 2 servers with np=2, it runs fine 10 times out of 10. With np=4, it hangs 10 times out of 10 (I restricted the run to a single IP interface to keep the test case simple):

$ mpirun -np 4 --map-by node --mca btl_tcp_if_include 10.3.0.1/16 --mca btl_tcp_progress_thread 1 --mca btl tcp,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
...hang...

I see the same behavior on the head of v2.x.
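
For anyone who wants to repeat the "N times out of 10" counts, a loop along the following lines might help. It is only a sketch, not the script used for the runs above: `count_hangs` is a hypothetical helper name, it assumes GNU coreutils `timeout` is available (which exits with status 124 when the time limit is hit), and the time limit you pass is an arbitrary choice for deciding that a run has hung.

```shell
#!/bin/sh
# count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of Open MPI)
# Runs COMMAND RUNS times, each under `timeout`, and prints how many runs
# were killed for exceeding LIMIT_SECONDS. GNU `timeout` exits with 124
# when the limit is reached, which is treated here as "the run hung".
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

With the command line from this report and an assumed 60-second limit, it could be invoked as:

    count_hangs 10 60 mpirun -np 4 --map-by node --mca btl_tcp_if_include 10.3.0.1/16 --mca btl_tcp_progress_thread 1 --mca btl tcp,sm,self ring_c

The printed number is the count of runs that did not finish within the limit, so a slow but healthy run could be miscounted if the limit is set too low.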

@hppritcha It looks like we neglected to mention this feature in v2.0.0 NEWS, so we're probably not in much danger for v2.0.0.
