-
Notifications
You must be signed in to change notification settings - Fork 68
Need to delay registration of the waitpid callback until after the fo… #747
Conversation
…rk/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP. (cherry picked from commit open-mpi/ompi@f1483eb)
|
Test FAILed. |
|
bot:retest |
|
removed comment which contained unrelated data intended for different PR |
|
I see - is the fix going to deal with the VG complaint, or will you be ignoring it? |
|
actually, i failed to see that this PR is for v2.x. v2.x VG complains are valid, don`t know if they are result of your PR or were present before. master is clean on this front. |
|
Ummm...no, I just checked with VG on master, and these complaints have been there for a long time. It has to do with us not initializing every buffer that we are going to read into. So are you going to now demand that everything be valgrind clean? That seems pretty restrictive, or else you are going to have to aggressively update the valgrind.supp file and use it in your tests |
|
IMO - it is better to keep it clean all the time for every PR. master is clean. Jenkins runs VG with NP=1 and only with OMPI built-in components (no mxm, no verbs) - to make sure we will not need suppression file. |
|
Test FAILed. |
|
Okay - then I hope you guys are willing and able to help fix it. I'm seeing similar complaints in master from more than just PMIx. The community will have to decide about the gating based on valgrind - that's a new one. |
|
sure |
|
I checked my results on master where this change has already been applied and it has fixed the issues with returning 0 in an error case. So, looks good to me. What does the 0x4000 signify? |
|
The 0x4000 designates a "master" as opposed to an "hnp" - basically, it distinguishes orte-dvm from mpirun, thus allowing the selection of different components. They are very similar in terms of their role in any job, but the dvm requires some different behaviors as it is persistent. |
...
19:42:03 ==8414== Invalid read of size 4
19:42:03 ==8414== at 0x3D6980A7B0: pthread_mutex_unlock (in /lib64/libpthread-2.12.so)
19:42:03 ==8414== by 0x5341576: opal_libevent2022_event_del (event.c:2210)
19:42:03 ==8414== by 0x6E57FD4: pdes (usock.c:316)
19:42:03 ==8414== by 0x6E5A7B2: pmix_obj_run_destructors (pmix_object.h:449)
19:42:03 ==8414== by 0x6E5C1F4: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:413)
19:42:03 ==8414== by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414== by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414== by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414== by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414== by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414== by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414== Address 0x5c8a9f0 is 16 bytes inside a block of size 40 free'd
19:42:03 ==8414== at 0x4A06484: free (vg_replace_malloc.c:468)
19:42:03 ==8414== by 0x5341BA6: opal_libevent2022_event_base_free (event.c:790)
19:42:03 ==8414== by 0x6E5C19A: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:407)
19:42:03 ==8414== by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414== by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414== by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414== by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414== by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414== by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414==
19:42:03 ==8414== Invalid read of size 4
19:42:03 ==8414== at 0x3D6980A360: __pthread_mutex_unlock_full (in /lib64/libpthread-2.12.so)
19:42:03 ==8414== by 0x5341576: opal_libevent2022_event_del (event.c:2210)
19:42:03 ==8414== by 0x6E57FD4: pdes (usock.c:316)
19:42:03 ==8414== by 0x6E5A7B2: pmix_obj_run_destructors (pmix_object.h:449)
19:42:03 ==8414== by 0x6E5C1F4: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:413)
19:42:03 ==8414== by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414== by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414== by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414== by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414== by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414== by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414== Address 0x5c8a9f0 is 16 bytes inside a block of size 40 free'd
19:42:03 ==8414== at 0x4A06484: free (vg_replace_malloc.c:468)
19:42:03 ==8414== by 0x5341BA6: opal_libevent2022_event_base_free (event.c:790)
19:42:03 ==8414== by 0x6E5C19A: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:407)
19:42:03 ==8414== by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414== by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414== by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414== by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414== by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414== by 0x40088F: main (hello_c.c:24) |
|
these don't appear to be operationally problematic, but rather have to do with the old valgrind issue about reading data into parts of buffers. If you folks have time to quiet valgrind, please do so. |
|
Please see rhc54/ompi-release/pull/2 for a minor addendum to this PR. If Ralph accepts it, I'll commit it on master, too. |
|
Also, FWIW, I see that the using-memory-after-free error exists on v2.x regardless of this PR. So this PR should probably be allowed to move forward regardless of the Valgrind-detected issue. |
|
Valgrind issue is now fixed in a separate PR: ompi-release/pull/753 |
orte proc_info.h: use symbolic names
|
Yes, this PR has nothing to do with the reported valgrind issue. |
|
Test FAILed. |
Need to delay registration of the waitpid callback until after the fo…
mtl/ofi: don't inline ofi progress method
…rk/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.
(cherry picked from commit open-mpi/ompi@f1483eb)