
Conversation

Contributor

rhc54 commented Jan 13, 2017

Strange - I had created a new IOF API "complete" for cleaning up at the end of jobs, but somehow the implementation is missing. It also appears that the orteds never actually cleaned up their job-related information. These things are fine for normal mpirun-based operations, but they cause significant resource leaks for the DVM.

Complete the implementation and seal the leaks

Fixes #2691 (hopefully!)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
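For context, the cleanup being described is a per-job hook that releases job-related state once the job finishes - something that matters when the daemons persist across many jobs, as in the DVM. Below is a minimal, standalone sketch of that idea; the names (job_t, job_complete, io_buffers) are illustrative and are not Open MPI's actual IOF API:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-job bookkeeping kept by a long-lived daemon. */
    typedef struct {
        int jobid;
        int num_procs;
        char **io_buffers;   /* stand-in for per-proc forwarded-IO state */
    } job_t;

    /* Analogous in spirit to an IOF "complete" hook: invoked once a job
     * has finished so the daemon does not accumulate state across jobs. */
    static void job_complete(job_t *job)
    {
        if (NULL == job->io_buffers) {
            return;   /* nothing to release, or already completed */
        }
        for (int i = 0; i < job->num_procs; i++) {
            free(job->io_buffers[i]);
        }
        free(job->io_buffers);
        job->io_buffers = NULL;
        printf("job %d: per-job state released\n", job->jobid);
    }

    int main(void)
    {
        job_t job = { .jobid = 1, .num_procs = 2, .io_buffers = NULL };

        /* Simulate the state a daemon builds up while forwarding a job's IO. */
        job.io_buffers = calloc(job.num_procs, sizeof(char *));
        for (int i = 0; i < job.num_procs; i++) {
            job.io_buffers[i] = malloc(64);
        }

        /* Without a call like this at end of job, a persistent daemon
         * leaks this state for every job it runs. */
        job_complete(&job);
        return 0;
    }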

Contributor Author

rhc54 commented Jan 13, 2017

bot:mellanox:retest

rhc54 merged commit d9fc88c into open-mpi:master Jan 13, 2017
rhc54 deleted the topic/dvm branch January 13, 2017 05:29
@ggouaillardet
Contributor

@rhc54 this PR introduced a minor error.
If you run
mpirun --host <an_other_host>:1 -np 1 mpi_helloworld
then you will likely end up with an error message from the remote orted.
This is caused by ORTE_PROC_STATE_WAITPID_FIRED being fired twice, which results in ORTE_PROC_STATE_TERMINATED being fired twice as well - the second time in track_procs:

    /* get the job object for this proc */
    if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
        goto cleanup;
    }

The job data object cannot be retrieved because the job was already removed when ORTE_PROC_STATE_TERMINATED was processed the first time.
I tried adding a new ORTE_PROC_FLAG_TERMINATING flag so we do not fire ORTE_PROC_STATE_TERMINATED twice. That fixes some paths but not all, so I guess there is a kind of race condition here too.
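For illustration, here is a minimal, standalone sketch of the guard-flag idea described above. The names (proc_t, fire_terminated) are illustrative stand-ins, not ORTE's actual API, and the flag simply plays the role of the proposed ORTE_PROC_FLAG_TERMINATING:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for an ORTE proc object. */
    typedef struct {
        int rank;
        bool terminating;   /* role of the proposed ORTE_PROC_FLAG_TERMINATING */
    } proc_t;

    /* Per-proc termination path; must run its cleanup only once even if
     * the state machine delivers WAITPID_FIRED/TERMINATED more than once. */
    static void fire_terminated(proc_t *proc)
    {
        if (proc->terminating) {
            /* Duplicate event: the job data was already cleaned up the
             * first time through, so reprocessing would only produce an
             * error. */
            printf("rank %d: duplicate TERMINATED event ignored\n", proc->rank);
            return;
        }
        proc->terminating = true;
        printf("rank %d: releasing job-related data\n", proc->rank);
        /* ... actual cleanup of job data would happen here ... */
    }

    int main(void)
    {
        proc_t p = { .rank = 0, .terminating = false };
        fire_terminated(&p);   /* first event performs the cleanup */
        fire_terminated(&p);   /* second event is ignored by the guard */
        return 0;
    }

As noted in the comment, a guard like this fixed only some paths in the real code, suggesting the duplicate events themselves were the underlying problem.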

Contributor Author

rhc54 commented Jan 16, 2017

?? how is waitpid being fired twice?

@ggouaillardet
Contributor

You can witness that with
mpirun --mca state_base_verbose 5 ...

I will post the stack traces tomorrow.

Contributor Author

rhc54 commented Jan 16, 2017

I was asking because I only see it being fired once - I know how to see the trace 😄

Contributor Author

rhc54 commented Jan 16, 2017

Hmmm...I can reproduce it here now too. Will investigate. For some reason, we seem to be getting multiple waitpid callbacks triggering on every MPI process - but not on non-MPI procs.

Contributor Author

rhc54 commented Jan 16, 2017

Got it - see #2738
