WeeklyTelcon_20230124

Geoffrey Paulsen edited this page Jan 24, 2023 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Oops, didn't capture attendee list this week.

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Edgar Gabriel (UoH)
  • Geoffrey Paulsen (IBM)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Josh Fisher (Cornelis Networks)
  • Josh Hursey (IBM)
  • Luke Robison
  • Thomas Naughton (ORNL)
  • Tommy Janjusic (NVIDIA)
  • William Zhang (AWS)

New Items

  • When there are issues with the various company CI controls, please post in #general on Slack.

  • Issue 11269 - https://github.com/open-mpi/ompi/issues/11269 - Is there consensus on the desired mpirun behavior? When launching an application using mpirun, should the MPI library used be the one belonging to the mpirun that launched the application, or the one the application was built against? Or is there another alternative?

    • Let's figure out the correct behavior for finding MPI libraries when you launch an MPI application with the v5.0.x mpirun.
      • By default on v4.1.x:
        • full path to mpirun, --prefix to mpirun, or , we'll look for executables in prefix path/bin
        • At one time, we also set the application's LD_LIBRARY_PATH (rather than leaving the one it was originally built with).
          • Today, on main and v5.0.x, we don't do this last part, but this is confusing.
    • It confuses everyone when the application uses the library it was linked with rather than the one belonging to the mpirun used.
    • Does --prefix set LD_LIBRARY_PATH for the application?
    • For some, it's natural to set LD_LIBRARY_PATH to change the MPI library.
    • Concern: if we set LD_LIBRARY_PATH, we might pick up other libraries (not just MPI).
      • Use case: MPI installed in a default location, e.g. /usr/local/lib.
    • If using modules, you shouldn't use --prefix.
    • By default mpirun won't add anything to PATH or LD_LIBRARY_PATH.
      • The exception is if you specify the full path to mpirun, or pass --prefix as an argument to mpirun. Then mpirun sets PATH and LD_LIBRARY_PATH.
    • Do we think the behavior of the manpage is what we want?
    • Plan: if on a platform that supports runpath or rpath, just use that to find libraries associated with orted/prted.
      • There is a distinction between what you want for starting orted and what you want for starting the application.
      • There's a bunch of code to make prrte and the application behave rationally for --prefix.
        • 90% of users will have prrte and ompi installed in same path
      • --prefix will apply to anything using mpirun.
      • on v5.0.x, it does set
    • Consensus: William will try to summarize the desired behavior in the issue, and then we can document it and PR whatever changes are needed.
    • Sounds like this is a blocker for v5.0.x
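
A minimal sketch of the rule as stated in the notes above (nothing is added by default; a full path to mpirun or --prefix causes PATH and LD_LIBRARY_PATH to be set). This is a toy model for discussion, not the actual mpirun/PRRTE code. The function name `launch_env`, the use of `<prefix>/lib` as the library directory, and letting an explicit prefix win over a full path are all assumptions made for illustration.

```python
import posixpath


def launch_env(mpirun_cmd, prefix=None, env=None):
    """Toy model of the environment mpirun would hand to launched processes.

    mpirun_cmd: how mpirun was invoked, e.g. "mpirun" or "/opt/ompi/bin/mpirun".
    prefix:     value of --prefix, if given (assumed to win over a full path).
    env:        the starting environment (a dict); it is not modified.
    """
    env = dict(env or {})

    # A full path to mpirun implies a prefix: /opt/ompi/bin/mpirun -> /opt/ompi
    if prefix is None and posixpath.isabs(mpirun_cmd):
        prefix = posixpath.dirname(posixpath.dirname(mpirun_cmd))

    # Default case: leave PATH and LD_LIBRARY_PATH untouched.
    if prefix is None:
        return env

    # Assumption: executables live in <prefix>/bin and libraries in <prefix>/lib.
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        new_dir = prefix.rstrip("/") + "/" + subdir
        old = env.get(var, "")
        env[var] = new_dir + (":" + old if old else "")
    return env
```

For example, `launch_env("mpirun", env={"PATH": "/usr/bin"})` returns the environment unchanged, while `launch_env("/opt/ompi/bin/mpirun", env={"PATH": "/usr/bin"})` prepends `/opt/ompi/bin` to PATH and sets LD_LIBRARY_PATH to `/opt/ompi/lib`. The concern raised above, that prepending to LD_LIBRARY_PATH may pick up non-MPI libraries from the same directory, applies to exactly this prepend step.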

v4.1.x

  • A bug in PMIx on large core-count machines was investigated. AWS got a patch with the fix.
    • Another bug in PMIx for v4.1.5 (released this weekend).
    • Will integrate this new PMIx and get it out the door.

v5.0.x

  • The RC from last week got pushed to this week.

  • Synced a number of fixes from main.

  • Waiting on PMIx and PRRTE submodule update.

  • Need documentation for v5.0.0

  • Manpages need an audit before release.

    • Double check --prefix behavior
    • Not the same behavior as v4.1.x
  • What is status of HAN?

    • Joseph ran some more experiments. With the HAN collective component plus the shared memory PR, we were pretty good compared to tuned and another
      • Comparing HAN with shared Mem component.
      • How many ppr? Between 2ppr and 64ppr
    • Better numbers, would be good to document this.
      • In OSU there's always a barrier before the operation. If Barrier and operation match up well, you get lower latency.
      • We'd talked about supplying some docs about how HAN is great, and why we're enabling it for v5.0.0 by default.
        • Like to include instructions on how to reproduce as well for users.
        • document in ECP -
      • Our current resolution is to enable it as is, and fix current regressions in future releases.
      • What else is needed to enable it by default?
        • Just need to flip a switch.
        • The module that Joseph has for shared memory for HAN at the moment would need some work to add additional collectives.
        • And it relies on xpmem to be available.
        • So for now just enable HAN for collectives we have, and later enable for other collectives.
        • George would like to re-use what tuned does without reimplementing everything. A shared memory component is a better choice, but it is more work.
        • If we don't enable HAN by default now, it's v5.1 (best case) before it's enabled.
          • The trade-offs lean toward turning it on and fixing whatever problems might be there.
        • There is a PR for tuned that increases the default segment size and changes the algorithms tuned uses for shared memory.
        • Need to start moving forward, rather than doing more analysis.
  • The --enable-timing code has been broken. The configure + env works, but the results are gibberish.

    • Set up a discussion for a longer-term solution.
      • Could also discuss at the PMIx meeting, held every 2 weeks.
    • As far as the timer mechanism goes, fixes are in main and in v5.0.x.
      • 11305 has not been merged to main yet. It is not directly to do with timers, but it cleans things up, so we would like it in main and v5.0.x.
  • hwloc and PV (https://github.com/open-mpi/ompi/issues/11246)

    • Document in the v5.0 README: if you have PV, you need hwloc 2.7.2+. There is a v2.8 out there as well.
    • Is there any reason to update our internal hwloc to v2.7.2?
      • Users of PV are large HPC installations.
      • It's work someone has to do; we've got to stop messing with hwloc.
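
A minimal sketch of the version requirement noted above (hwloc 2.7.2 or newer when PV is present), in case it helps whoever writes the README text or a sanity check. The helper name `hwloc_ok_for_pv` is hypothetical, and how the hwloc version string is obtained is left out. Any pre-release suffix after the first three numeric components is ignored.

```python
import re

# Minimum hwloc version from the notes above: 2.7.2
MIN_HWLOC_FOR_PV = (2, 7, 2)


def hwloc_ok_for_pv(version):
    """Return True if an hwloc version string such as "2.8.0" is >= 2.7.2.

    Hypothetical helper: only the first three numeric components are
    compared, and missing components are treated as 0.
    """
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    parts += [0] * (3 - len(parts))  # "2.8" -> (2, 8, 0)
    return tuple(parts) >= MIN_HWLOC_FOR_PV
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "2.10.0" sorts before "2.7.2" lexically.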

Main branch

  • Would like to remove hwloc and libevent internals for next MAJOR (from master) release.

ITT
