
Conversation

Contributor

rhc54 commented Jan 13, 2017

Strange - I had created a new IOF API "complete" for cleaning up at the end of jobs, but somehow the implementation is missing. It also appears that the orteds never actually cleaned up their job-related information. These things are fine for normal mpirun-based operations, but they cause significant resource leaks for the DVM.

Complete the implementation and seal the leaks

Fixes #2691 (hopefully!)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
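For context, the cleanup being described is a per-job hook that releases job-related state once the job finishes - something that matters when the daemons persist across many jobs, as in the DVM. Below is a minimal, standalone sketch of that idea; the names (job_t, job_complete, io_buffers) are illustrative and are not Open MPI's actual IOF API:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-job bookkeeping kept by a long-lived daemon. */
    typedef struct {
        int jobid;
        int num_procs;
        char **io_buffers;   /* stand-in for per-proc forwarded-IO state */
    } job_t;

    /* Analogous in spirit to an IOF "complete" hook: invoked once a job
     * has finished so the daemon does not accumulate state across jobs. */
    static void job_complete(job_t *job)
    {
        if (NULL == job->io_buffers) {
            return;   /* nothing to release, or already completed */
        }
        for (int i = 0; i < job->num_procs; i++) {
            free(job->io_buffers[i]);
        }
        free(job->io_buffers);
        job->io_buffers = NULL;
        printf("job %d: per-job state released\n", job->jobid);
    }

    int main(void)
    {
        job_t job = { .jobid = 1, .num_procs = 2, .io_buffers = NULL };

        /* Simulate the state a daemon builds up while forwarding a job's IO. */
        job.io_buffers = calloc(job.num_procs, sizeof(char *));
        for (int i = 0; i < job.num_procs; i++) {
            job.io_buffers[i] = malloc(64);
        }

        /* Without a call like this at end of job, a persistent daemon
         * leaks this state for every job it runs. */
        job_complete(&job);
        return 0;
    }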

Contributor Author

rhc54 commented Jan 13, 2017

bot:mellanox:retest

rhc54 merged commit d9fc88c into open-mpi:master Jan 13, 2017
rhc54 deleted the topic/dvm branch January 13, 2017 05:29
@ggouaillardet
Contributor

@rhc54 this PR introduced a minor error.
If you run
mpirun --host <an_other_host>:1 -np 1 mpi_helloworld
then you will likely end up with an error message from the remote orted.
This is caused by ORTE_PROC_STATE_WAITPID_FIRED being fired twice, which results in ORTE_PROC_STATE_TERMINATED being fired twice as well - the second time in track_procs:

    /* get the job object for this proc */
    if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
        goto cleanup;
    }

The job data object cannot be retrieved because the job was already removed when ORTE_PROC_STATE_TERMINATED was processed the first time.
I tried adding a new ORTE_PROC_FLAG_TERMINATING flag so we do not fire ORTE_PROC_STATE_TERMINATED twice. That fixes some paths but not all, so I guess there is a kind of race condition here too.
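For illustration, here is a minimal, standalone sketch of the guard-flag idea described above. The names (proc_t, fire_terminated) are illustrative stand-ins, not ORTE's actual API, and the flag simply plays the role of the proposed ORTE_PROC_FLAG_TERMINATING:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for an ORTE proc object. */
    typedef struct {
        int rank;
        bool terminating;   /* role of the proposed ORTE_PROC_FLAG_TERMINATING */
    } proc_t;

    /* Per-proc termination path; must run its cleanup only once even if
     * the state machine delivers WAITPID_FIRED/TERMINATED more than once. */
    static void fire_terminated(proc_t *proc)
    {
        if (proc->terminating) {
            /* Duplicate event: the job data was already cleaned up the
             * first time through, so reprocessing would only produce an
             * error. */
            printf("rank %d: duplicate TERMINATED event ignored\n", proc->rank);
            return;
        }
        proc->terminating = true;
        printf("rank %d: releasing job-related data\n", proc->rank);
        /* ... actual cleanup of job data would happen here ... */
    }

    int main(void)
    {
        proc_t p = { .rank = 0, .terminating = false };
        fire_terminated(&p);   /* first event performs the cleanup */
        fire_terminated(&p);   /* second event is ignored by the guard */
        return 0;
    }

As noted in the comment, a guard like this fixed only some paths in the real code, suggesting the duplicate events themselves were the underlying problem.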

Contributor Author

rhc54 commented Jan 16, 2017

?? how is waitpid being fired twice?

@ggouaillardet
Contributor

You can witness that with
mpirun --mca state_base_verbose 5 ...

I will post the stack traces tomorrow.

Contributor Author

rhc54 commented Jan 16, 2017

I was asking because I only see it being fired once - I know how to see the trace 😄

Contributor Author

rhc54 commented Jan 16, 2017

Hmmm...I can reproduce it here now too. Will investigate. For some reason, we seem to be getting multiple waitpid callbacks triggering on every MPI process - but not on non-MPI procs.

Contributor Author

rhc54 commented Jan 16, 2017

Got it - see #2738
