Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

jjhursey
Copy link
Member

@jjhursey jjhursey commented Jul 1, 2016

  • Negative priority cleanup
  • Fixes a cleanup segv in hcoll if it is asked to take a negative priority and has to cleanup.
  • Improves coll/base verbose messages making it easier to see the set of collectives selected and being queried (as some might be rejected based on negative priority).

open-mpi/ompi#1834

jjhursey added 2 commits July 1, 2016 15:42
 * Print a verbose message if the component was disqualified because of
   a negative priority.
 * If a disqualified component provided a module, release it.
 * Display list of selected components in priority order
   - During the process of volunteering collective functions for a
     communicator, print the component name and priority. This will
     cause the verbose messages to be displayed in reverse priority
     order (lowest priority first, up to highest). This is helpful
     when determining which collective components are active in which
     order for a given communicator.
     To see the messages you need the following MCA parameter set to 9
     or higher: `-mca coll_base_verbose 9`
 * Adjust verbose for commonly needed verbose output from 10 to 9 to
   make it easier to access this information.

(cherry picked from commit open-mpi/ompi@59f304b)
 * If hcoll is given a negative priority, but not enabled=0 then
   the module is constructed, but then destructed before calling
   it's query(). So the previous pointers are not initialized.
   If we try to OBJ_RELEASE them in a debug build an assert will fire.
   This commit adds some protection against that and initializes
   the _module pointers to NULL.

(cherry picked from commit open-mpi/ompi@0a09f8b)
@jjhursey
Copy link
Member Author

jjhursey commented Jul 1, 2016

bot:label:Code-cleanup-low-priority
bot:milestone:v2.0.1

@ibm-ompi
Copy link

ibm-ompi commented Jul 1, 2016

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/a92d7a396401df03e8c8bc99857fe23f

@ibm-ompi
Copy link

ibm-ompi commented Jul 1, 2016

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/fc7ec9e9c8d5aaa95eb394593cbc60d8

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1830/ for details.

@jjhursey
Copy link
Member Author

jjhursey commented Jul 5, 2016

bot:ibm:retest

@jjhursey
Copy link
Member Author

bot:retest

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1847/ for details.

@jjhursey
Copy link
Member Author

@gpaulsen and @jladd-mlnx you both take a look to review this. It touches the hcoll shutdown process, but the rest is mostly cosmetic and cleaning up a small memory leak with negative priorities.
bot:assign: @gpaulsen
bot:assign: @jladd-mlnx

@ompiteam-bot
Copy link

OMPIBot error: Cannot assign more than one user on an issue.

@jjhursey
Copy link
Member Author

bot:assign: @gpaulsen

@jsquyres
Copy link
Member

bot:mellanox:retest

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1970/ for details.

@jladd-mlnx
Copy link
Member

👍

@gpaulsen
Copy link
Member

So I reviewed the changes, but also the whole file ompi/mca/coll/hcoll/coll_hcoll_module.c for context. I have a few comments:

  1. hcoll_module->previous_reduce = NULL; in tere twice
  2. Not calling OBJ_RELEASE_IF_NOT_NULL on all modules, why not?
  3. commented setction 131 - 138 - no comment, some of these are now implemented above, confusing, please either call OBJ_RELEASE_IF_NOT_NULL on all (my preference), or comment this section, and make it not conflict with above.
  4. comment @ line 184 - is this still valid? Why not just do those too, despite what hcoll supports today?
  5. Sort these names? it's pretty easy to miss one. I'd like to see some for_each_collective() general macroing built to allow for generic code to be compiled 'per MPI collective'.... but another PR another day. :)

@jjhursey
Copy link
Member Author

@gpaulsen So all of these comments are with regard to the hcoll component change.

  1. I don't see hcoll_module->previous_reduce twice. I see one for the blocking and one for the nonblocking version. Can you point out which line you think is duplicated?
  2. I only call OBJ_RELEASE_IF_NOT_NULL on the modules that were previously listed. I did not want to make a more extensive change to a component I did not own. So I opted for changing just what was necessary and leave the more complete cleanup to the maintainers.
  3. I'm not really sure what you are asking for. I can remove that comment block - is what what you are asking for?
  4. Again, it's not my component so I'll leave that for someone else unless it becomes a problem for us. I fixed what we needed to get it working again when it is given a negative priority.

@jladd-mlnx
Copy link
Member

@jjhursey @gpaulsen

  1. Correct. One for blocking and another for the non-blocking variant.
  2. We (MLNX) own HCOLL and I appreciate Josh's consideration. Mellanox will be glad to address any issues causing IBM concern, but it should be handled by us.
  3. Please leave it as is.
  4. Again, we appreciate your efforts, but please let us know if something in HCOLL is causing IBM problems, and MLNX will address it. HCOLL is performance sensitive in all aspects including startup and shutdown.

@jsquyres
Copy link
Member

@jladd-mlnx I'm not quite clear from your comments yesterday: are you thumbs-down'ing this PR? (you thumbs-up'ed this PR before that)

@jladd-mlnx
Copy link
Member

Thumbs up to Josh's original commit. No further changes should be made.

@jsquyres
Copy link
Member

@hppritcha Good to go.

@jsquyres jsquyres merged commit c71996e into open-mpi:v2.x Jul 27, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants