WeeklyTelcon_20180731

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brian
  • akshay
  • Geoffroy Vallee
  • Matthew Dosanjh
  • Nathan Hjelm
  • Peter Gottesman (Cisco)
  • Ralph Castain
  • Todd Kordenbrock
  • Xin Zhao

not there today (I keep this for easy cut-n-paste for future notes)

  • Joshua Ladd
  • Josh Hursey
  • Matias Cabral
  • Howard.
  • Edgar Gabriel
  • Thomas Naughton
  • Akvenkatesh (nVidia)
  • Howard Pritchard
  • Dan Topa (LANL)
  • David Bernholdt

Agenda/New Business

  • NEW: info.c warning - Jeff thought we'd fixed it, but Ralph saw it on Cray.

  • Nathan is requesting comments on:

    • C11 integration into master. PR5445
    • Eliminate all of our own atomics in favor of C11 atomics.
    • ACTION: Please review and comment on code.
  • ORTE discussion went well; Geoffroy Vallee wrote up a summary and posted it to devel-core on Jul 24th.

    • ACTION: Everyone please read and reply to devel-core with your thoughts.
  • GitHub suggestion on email filtering

Minutes

Review All Open Blockers

Review v2.x Milestones v2.1.4

  • v2.1.4 - Final release on v2.1.x
  • Aug 10th is release date.
  • Look for a release candidate this week
  • Typo fix for PMIx (MB prefix), but not upgrading because v2.1.4 is the end of the 2.x stream

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.3 - targeting Sept 1st (will start RCs when 2.1 wraps up).
    • Anticipate RC1 after the Aug 10th release of v2.1.4.
    • Good progress on reviews.

Review v3.1.x Milestones v3.1.0

  • v3.1.2 release process starts after the Sept 1st release of v3.0.3
  • Lots of PRs

v4.0.0

  • Schedule: branch: July 18. release: Sept 17
    • Date for first RC - Aug 13 (after sunset of 2.1.4)
  • CUDA support:
    • Does nVidia want openib included by default when configured --with-cuda?
      • Yes, because at this moment UCX is not on par, but they still want to migrate to UCX CUDA.
      • Warning message will mention deprecated openib vs. UCX
      • Has this work been done???
  • NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
  • Sent email to ompi-packagers list with schedule and info on
    • Debian (Allister)
    • Jeff has a half-typed out reply. Allister asked: do we really need to change the major .so version?
      • His point is that they have more and more packages compiled against Open MPI.
      • The real question is: do we mark the version as backwards incompatible?
      • The idea that you have to have the same version of Open MPI everywhere.
      • Nathan would say (watching what's gone in), that we should be okay, but that we should look.
    • In v3.1 --mpi-compat was on by default. In v4.0 there's a flag where the ABI didn't change.
      • MPI_UB was #define to &ompi_datatypeT
      • Could put the symbols back so they're there.
      • Verification - enable cxx - Would Paul Hargrove help with .so testing?
    • Fork in the road here... A couple of options:
      1. Set the CRA (.so versions) (Assuming --mpi-compat is NOT enabled). An app compiled against v3.1 won't run with v4.0
      2. Dynamically set the CRA values (.so versions) based on the --mpi-compat flag. A bunch of maintenance work.
        • Really HAIRY!
      3. NICE: Could make --mpi-compat only affect mpi.h, but not affect the symbols in the back end.
        • As long as they go away eventually - no point in removing from standard if we don't eventually remove them.
        • Raise C by 10, AND raise A by 10 - Make sure we get it right (do it like a minor release bump)
          • use rules from v3.0.x to v3.0.x+1 -
      • Don't know how this affects Fortran (separate MPI_UB and MPI_LB). It's the same as in C: fails to compile without --mpi-compat.
    • Symbols WILL go away in v5.0
    • Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.
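
The .so version options above hinge on libtool's current:revision:age (CRA) triple. As a rough sketch of the arithmetic only (not Open MPI's actual build logic): on Linux, libtool derives the soname major number as current minus age, so raising C and A by the same amount leaves the soname unchanged, while resetting A to 0 yields a new soname that already-built apps will not load. The class name, function names, and the starting values 50:3:10 are hypothetical; resetting revision to 0 follows libtool's general rule and was not discussed on the call.

```python
from typing import NamedTuple


class LibtoolVersion(NamedTuple):
    """Hypothetical holder for a libtool -version-info triple (C:R:A)."""
    current: int
    revision: int
    age: int

    def soname_major(self) -> int:
        # On Linux, libtool uses current - age as the soname major number.
        return self.current - self.age

    def linux_filename(self, stem: str) -> str:
        # Real file name on Linux: <stem>.so.<current-age>.<age>.<revision>
        return f"{stem}.so.{self.soname_major()}.{self.age}.{self.revision}"


def bump_compatible(v: LibtoolVersion, step: int = 1) -> LibtoolVersion:
    """Interfaces only added: raise current and age together, reset revision."""
    return LibtoolVersion(v.current + step, 0, v.age + step)


def bump_incompatible(v: LibtoolVersion) -> LibtoolVersion:
    """Interfaces removed or changed: raise current, reset revision and age."""
    return LibtoolVersion(v.current + 1, 0, 0)


# Illustrative values only; these are not Open MPI's real .so versions.
old = LibtoolVersion(50, 3, 10)
compat = bump_compatible(old, step=10)   # "raise C by 10 AND raise A by 10"
broken = bump_incompatible(old)          # option 1: mark as incompatible

print(old.linux_filename("libmpi"))      # libmpi.so.40.10.3
print(compat.linux_filename("libmpi"))   # libmpi.so.40.20.0
print(broken.linux_filename("libmpi"))   # libmpi.so.51.0.0
```

Because the soname major stays the same in the compatible case, an app linked against the old library would still resolve the new one, which is what the Geoff/Howard cross-version test run is meant to confirm in practice.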

PMIx

  • ORTE/PRTE - Geoffroy Vallee sent out a document with a summary to devel-core. Everyone please read and reply.

New topics

  • From last week:
    • MTT License discussion - MTT needs to be de-GPL-ified.
      • Everyone go try the Python client. All the GPL is in the Perl modules (using Python works around that).
    • Last week Brian had an interesting proposal to remove all of the Perl, or the Python?
    • Schedule - Would like resolution by end of July.
      • What does this look like? Run our MTT python client.

Overall Runtime Discussion (talking v5.0 timeframe, 2019)

  • Will discuss this in a separate call 2nd week in July.
  • Two Options:
    1. Keep going on our current path, and taking updates to ORTE, etc.
    2. Shuffle our code a bit (new ompi_rte framework merged with orte_pmix framework, moved down and renamed)
      • OPAL used to be a single-process abstraction, but that's not as true anymore.
      • API of foo, looks pretty much like PMIx API.
        • Still have PMIx v2.0, PMI2 or other components (all retooled for new framework to use PMIx)
      • To use it, just call opal_foo.spawn(), etc., and you get whatever component is underneath.
      • what about mpirun? Well, PRTE comes in, it's the server side of the PMIx stuff.
      • Could use their prun and wrap in a new mpirun wrapper
      • PRTE doesn't just replace ORTE. PRTE and OMPI layer don't really interact with each other, they both call the same OPAL layer (which contains PMIx, and other OPAL stuff).
        • prun has a lam-boot looking approach.
      • Build system about OPAL, etc. Code shuffling, retooling of components.
      • We want to leverage the work the PMIx community is doing correctly.
  • If we do this, we still need people to do runtime work over in PRTE.
    • In some ways it might be harder to get resources from management for yet another project.
    • Nice to have a componentized interface, without moving runtime to a 3rd party project.
    • Need to think about it.
  • Concerns with the work of adding ORTE PMIx integration.
  • Want to know the state of SLURM PMIx Plugin with PMIx v3.x
    • It should build, and work with v3. They only implemented about 5 interfaces, and they haven't changed.
  • A few related to OMPIx project, talking about how much to contribute to this effort.
    • How to factor in requirements of OSHMEM (who use our runtimes), and already doing things to adapt.
    • Would be nice to support both groups with a straightforward component to handle both of these.
  • Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
  • Testing, will need to discuss how to best test all of this.
  • ACTION: Let's go off and reflect and discuss at next week's Web-Ex.
    • We aren't going to do this before v4.0 branches in mid-July.
    • Need to be thinking about the Schedule, action items, and owners.

Review Master Pull Requests

  • PR for setting VERSION on master. Have we broken any VERSIONs?

Review Master MTT testing

  • Annual committer by July 18 -

    • AMD, ARM, Ructers, Mellanox, nVidia - Please go do Annual Committer Reviews.
  • Hope to have better Cisco MTT in a week or two

    • Peter is going through, and he found a few failures, some of which have been posted.
      • one-sided - Nathan's looking at it.
      • some more coming.
    • osc_pt2pt will exclude itself in an MT run.
      • One of Cisco's MTT runs sets an env variable to turn all MPI_Init calls into MPI_Init_thread (even though it is a single-threaded run).
        • Now that osc_pt2pt is ineligible, many tests fail.
        • on Master, this will fix itself 'soon'
        • BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
        • Probably an issue on v3.x also.
      • Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
  • OSHMEM v1.4 - cleanup work

    • How do we look for test coverage of this? Right now just basic API tests.
  • Next Face to Face?

    • When? Early fall Septemberish.
    • Where? San Jose - Cisco, Albuquerque - Sandia
    • ACTION: Geoff will Doodle this.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
