This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@rhc54

@rhc54 rhc54 commented Nov 7, 2015

Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.

(cherry picked from commit open-mpi/ompi@f1483eb)
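A minimal sketch of the ordering the commit message describes, using made-up names (child_record, report_exit) rather than the actual ORTE code: the exit callback is wired up only after fork/exec has produced a pid, so it can never fire against a half-initialized child record.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the launcher's per-child bookkeeping. */
typedef struct {
    pid_t pid;
    void (*exit_cb)(pid_t pid, int status);
} child_record;

static void report_exit(pid_t pid, int status)
{
    printf("child %d exited with status %d\n", (int)pid, status);
}

int main(void)
{
    child_record child = { .pid = -1, .exit_cb = NULL };

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                       /* child: exec something trivial */
        execlp("true", "true", (char *)NULL);
        _exit(127);
    }

    /* Only now, after fork/exec, register the exit callback: the pid is
     * known and the child record is fully populated before any waitpid
     * handling can run. */
    child.pid = pid;
    child.exit_cb = report_exit;

    int status = 0;
    if (waitpid(child.pid, &status, 0) == child.pid && child.exit_cb != NULL) {
        child.exit_cb(child.pid, WEXITSTATUS(status));
    }
    return 0;
}
```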

@rhc54 rhc54 added the bug label Nov 7, 2015
@rhc54 rhc54 added this to the v2.0.0 milestone Nov 7, 2015
@rhc54 rhc54 assigned rolfv and unassigned jsquyres Nov 7, 2015
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/989/ for details.

@rhc54
Author

rhc54 commented Nov 7, 2015

bot:retest

@mike-dubman
Member

Removed comment which contained unrelated data intended for a different PR.

@rhc54
Author

rhc54 commented Nov 7, 2015

I see - is the fix going to deal with the VG complaint, or will you be ignoring it?

@mike-dubman
Member

Actually, I failed to see that this PR is for v2.x.
The fix I was talking about is for v1.10.

The v2.x VG complaints are valid; I don't know if they are a result of your PR or were present before.

master is clean on this front.

@rhc54
Author

rhc54 commented Nov 7, 2015

Ummm...no, I just checked with VG on master, and these complaints have been there for a long time. It has to do with us not initializing every buffer that we are going to read into.

So are you going to now demand that everything be valgrind clean? That seems pretty restrictive, or else you are going to have to aggressively update the valgrind.supp file and use it in your tests.
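For reference, a minimal hypothetical sketch (not the OMPI code) of the buffer pattern being discussed: a buffer is only partially filled by a read, and its uninitialized tail is later examined, which Memcheck reports as a conditional jump depending on uninitialised values. Zeroing the buffer up front, or a matching valgrind.supp entry, quiets the report.

```c
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char *buf = malloc(64);                  /* contents undefined        */
    if (buf == NULL) {
        return 1;
    }
    ssize_t n = read(STDIN_FILENO, buf, 64); /* may fill only part of buf */

    /* If fewer than 64 bytes arrived, buf[63] is uninitialized and valgrind
     * flags this branch, e.g. when run as: echo hi | valgrind ./a.out */
    if (n >= 0 && buf[63] == '\0') {
        /* ... */
    }

    free(buf);
    return 0;
}
```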

@mike-dubman
Member

IMO - it is better to keep it clean all the time, for every PR.

master is clean.
v1.10 is clean for MPI.
v1.10 has one "invalid read" for the SHMEM API, which we found the hard way.
v2.x has PMIx-only issues which are probably easy to fix.

Jenkins runs VG with NP=1 and only with OMPI built-in components (no mxm, no verbs) - to make sure we will not need a suppression file.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/990/ for details.

@rhc54
Author

rhc54 commented Nov 7, 2015

Okay - then I hope you guys are willing and able to help fix it. I'm seeing similar complaints in master from more than just PMIx. The community will have to decide about gating on valgrind - that's a new one.

@mike-dubman
Member

Sure.
Once we establish a stable baseline for VG in the branches, it will be easy to keep it clean.

@rolfv

rolfv commented Nov 9, 2015

I checked my results on master, where this change has already been applied, and it has fixed the issues with returning 0 in an error case. So, looks good to me. What does the 0x4000 signify?
👍

@rhc54
Author

rhc54 commented Nov 9, 2015

The 0x4000 designates a "master" as opposed to an "hnp" - basically, it distinguishes orte-dvm from mpirun, thus allowing the selection of different components. They are very similar in terms of their role in any job, but the dvm requires some different behaviors as it is persistent.
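A minimal sketch of how bit testing a process type can select different components for the dvm master versus mpirun's HNP. Assumptions: all names and the HNP value below are made up for illustration; only the 0x4000 = "master" meaning comes from the explanation above. Testing individual bits with & (rather than comparing the whole type with ==) lets a process that carries more than one bit still pick the right component.

```c
#include <stdio.h>

#define PROC_TYPE_HNP     0x0002u   /* illustrative value */
#define PROC_TYPE_MASTER  0x4000u   /* dvm "master", per the discussion */

static const char *select_state_component(unsigned int proc_type)
{
    if (proc_type & PROC_TYPE_MASTER) {
        return "dvm";               /* persistent orte-dvm behaviors */
    }
    if (proc_type & PROC_TYPE_HNP) {
        return "hnp";               /* mpirun */
    }
    return "app";
}

int main(void)
{
    /* A process carrying both bits still resolves to the dvm component. */
    unsigned int t = PROC_TYPE_MASTER | PROC_TYPE_HNP;
    printf("%s\n", select_state_component(t));   /* prints "dvm" */
    return 0;
}
```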

@mike-dubman
Member

  • valgrind errors related to pmi*
...
19:42:03 ==8414== Invalid read of size 4
19:42:03 ==8414==    at 0x3D6980A7B0: pthread_mutex_unlock (in /lib64/libpthread-2.12.so)
19:42:03 ==8414==    by 0x5341576: opal_libevent2022_event_del (event.c:2210)
19:42:03 ==8414==    by 0x6E57FD4: pdes (usock.c:316)
19:42:03 ==8414==    by 0x6E5A7B2: pmix_obj_run_destructors (pmix_object.h:449)
19:42:03 ==8414==    by 0x6E5C1F4: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:413)
19:42:03 ==8414==    by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414==    by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414==    by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414==    by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414==    by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414==    by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414==  Address 0x5c8a9f0 is 16 bytes inside a block of size 40 free'd
19:42:03 ==8414==    at 0x4A06484: free (vg_replace_malloc.c:468)
19:42:03 ==8414==    by 0x5341BA6: opal_libevent2022_event_base_free (event.c:790)
19:42:03 ==8414==    by 0x6E5C19A: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:407)
19:42:03 ==8414==    by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414==    by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414==    by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414==    by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414==    by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414==    by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414== 
19:42:03 ==8414== Invalid read of size 4
19:42:03 ==8414==    at 0x3D6980A360: __pthread_mutex_unlock_full (in /lib64/libpthread-2.12.so)
19:42:03 ==8414==    by 0x5341576: opal_libevent2022_event_del (event.c:2210)
19:42:03 ==8414==    by 0x6E57FD4: pdes (usock.c:316)
19:42:03 ==8414==    by 0x6E5A7B2: pmix_obj_run_destructors (pmix_object.h:449)
19:42:03 ==8414==    by 0x6E5C1F4: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:413)
19:42:03 ==8414==    by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414==    by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414==    by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414==    by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414==    by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414==    by 0x40088F: main (hello_c.c:24)
19:42:03 ==8414==  Address 0x5c8a9f0 is 16 bytes inside a block of size 40 free'd
19:42:03 ==8414==    at 0x4A06484: free (vg_replace_malloc.c:468)
19:42:03 ==8414==    by 0x5341BA6: opal_libevent2022_event_base_free (event.c:790)
19:42:03 ==8414==    by 0x6E5C19A: OPAL_PMIX_PMIX1XX_PMIx_Finalize (pmix_client.c:407)
19:42:03 ==8414==    by 0x6C24EAB: pmix1_client_finalize (pmix1_client.c:130)
19:42:03 ==8414==    by 0x6A1D599: rte_finalize (ess_pmi_module.c:412)
19:42:03 ==8414==    by 0x4FE459B: orte_finalize (orte_finalize.c:72)
19:42:03 ==8414==    by 0x4C6A105: ompi_mpi_finalize (ompi_mpi_finalize.c:440)
19:42:03 ==8414==    by 0x4C9A8B0: PMPI_Finalize (pfinalize.c:44)
19:42:03 ==8414==    by 0x40088F: main (hello_c.c:24)

@rhc54
Author

rhc54 commented Nov 9, 2015

These don't appear to be operationally problematic, but rather have to do with the old valgrind issue about reading data into parts of buffers. If you folks have time to quiet valgrind, please do so.

@jsquyres
Member

Please see rhc54/ompi-release/pull/2 for a minor addendum to this PR. If Ralph accepts it, I'll commit it on master, too.

@jsquyres
Member

Also, FWIW, I see that the using-memory-after-free error exists on v2.x regardless of this PR. So this PR should probably be allowed to move forward regardless of the Valgrind-detected issue.

@jsquyres
Member

Valgrind issue is now fixed in a separate PR: ompi-release/pull/753

orte proc_info.h: use symbolic names
@rhc54
Author

rhc54 commented Nov 10, 2015

Yes, this PR has nothing to do with the reported valgrind issue.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1000/ for details.

hppritcha added a commit that referenced this pull request Nov 10, 2015
Need to delay registration of the waitpid callback until after the fork/exec of the child process.
@hppritcha hppritcha merged commit 2451def into open-mpi:v2.x Nov 10, 2015
alinask pushed a commit to alinask/ompi-release that referenced this pull request Dec 10, 2015
mtl/ofi: don't inline ofi progress method
@rhc54 rhc54 deleted the cmr2.0/waitpid branch December 15, 2015 23:31
