
IBM dynamic/no-disconnect test hangs #3525

@rhc54

The IBM dynamic/no-disconnect test hangs and then eventually "fails" when the PMIx publish/lookup exchange times out. This occurs only when the job spans multiple nodes; if you run the test on a single node, it passes.

I have run the test on two nodes, using the following command:

mpirun -n 2 --leave-session-attached ./no-disconnect

with a default hostfile containing two nodes, each with 24 slots. After adding many print statements and reading through the output, I have found that:

  • the test spawns a total of 32 processes. With 24 slots per node, the last 8 procs "bleed" onto the second node. This is where the problem occurs.

  • the procs on the 2nd node issue a lookup on their "accept" key, but that key is never published. This is why we hang. However, I see all the "connect" keys with the same signature sitting in the orte_data_server queue.

Unfortunately, I don't understand this test well enough to know which process is supposed to publish the corresponding key, so I cannot track down why it isn't doing so. I added a lot of prints to the code path, and all data that is published appears to be getting into the orte_data_server as it should. Likewise, lookup is going through the entire collected data.

@ggouaillardet Would you have some time to dig into this a bit more? I have added a verbosity setting to the orte_data_server (orte_data_server_verbose) to aid in debugging, plus some additional verbose statements in the orted PMIx server. See PR #3524.
