WeeklyTelcon_20191105

Geoffrey Paulsen edited this page Nov 8, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Artem Polyakov (Mellanox)
  • David Bernholdt (ORNL)
  • Akshay Venkatesh (NVIDIA)
  • William Zhang (AWS)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Noah Evans (Sandia)
  • Thomas Naughton (ORNL)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Brian Barrett (AWS)
  • Edgar Gabriel (UH)
  • George Bosilca (UTK)
  • Todd Kordenbrock (Sandia)
  • Brendan Cunningham (Intel)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Ralph Castain (Intel)
  • Tom Naughton
  • Xin Zhao (Mellanox)
  • mohan (AWS)

Agenda/New Business

New PRRTE launcher proposal on mailing list.

  • All of this is in the context of v5.0.

  • Intel is no longer driving PRRTE work, and Ralph won't be available for PRRTE much either.

  • PRRTE will be a good PMIx development environment, but being a scalable and robust launcher is no longer a focus.

  • OMPI community could come into PRRTE, and put in production / scalability testing, features, etc.

  • Given that we have not been good at contributing to PRRTE (other than Ralph), there's another proposal

    • ORTE and PRRTE have drifted apart, so transitioning is risky.
  • Step 1. Make PMIx a first class citizen.

    • Still good to keep PMIx as a static framework (no more glue; still under orte/mca/pmix, but it basically just passes through) and call PMIx_ calls directly.
    • Allows us to still have an internal backup PMIx if no external PMIx is found.
  • Step 2. We can whittle down ORTE, since PMIx does much of this.

  • Two things PRRTE won't care about are scale and all binding patterns.

  • Only recent versions of SLURM have PMIx

  • Need to continue to support ssh.

    • Not just core PMIx, still need daemons for SSH to work, but they're not part of PMIx.
    • Part of ORTE that we wouldn't be deleting.
  • What does Altair PbsPro and open source PbsPro do?

    • Torque is different than PbsPro
  • Are there OLD systems that we currently support but no longer care about, and could discontinue support for in v5.x?

    • Who supports PMIx, and who doesn't?
  • If PMIx becomes a first class citizen and rest of code base just makes PMIx calls, how do we support these things?

    • mpirun would still have to launch orteds via plm.
    • srun wouldn't need
    • But this is how it works today. Torque doesn't support PMIx at all, but TM just launches ORTEDs
    • ALPS - aprun ./a.out - requires a.out to connect up to ALPS daemons.
      • Cray still supports PMI - someone would need to write a PMI -> PMIX adapter.
    • ORTE does not have the concept of persistent daemons.
  • Is there a situation where we might have a launcher launching ORTEDs and we'd need to relay PMIx calls to the correct PMIx server layer?

    • Generally we won't have that situation, since the launcher won't launch ORTEds.
  • George's work currently depends on PRRTE

    • If ORTEDs provide PMIx_Events, would that be enough?
      • No, George needs PRRTE's fault-tolerant overlay network.
      • George will scope the effort to port that feature from PRRTE to ORTE.
  • ACTION - Please gather a list of resource managers and tools that we care about supporting in Open MPI v5.0.x.

  • Today - Howard

    • Summary - make PMIx a first class citizen.
    • Then whittle away ORTE as much as possible.
    • We think the only one who uses PMI1 and PMI2 might be Cray.
      • Howard doesn't think Cray is even going to go in that direction; they might be adopting PMIx going forward. Good Supercomputing question.
      • Most places will be whatever SLURM does.
      • What will MPICH do? suspect PMIx
    • Howard thinks that by the time Open-MPI v5 gets out
    • Is SLURM + PMIx dead? No, it's supported, just not all of the
  • George looked into scoping the amount of work to bring reliable overlay network from

    • PRRTE frameworks not in
  • Howard also brought up that Sessions only works with PRRTE right now, so would need to backport this as well.

  • The only things that depend on PRRTE are Sessions, reliable connections, and resource allocation support (the thing Geoffroy Vallee was working on before). Howard will investigate.

More discussion regarding PRRTE vs ORTE

  • Sounds like no one needs PMI1 or PMI2 in Open MPI v5+
  • DVM - persistent daemon aspect (even outside of MPI).
    • People pulling runtime out and not using MPI
    • A lot of fixes for this went into PRRTE but didn't make ORTE - Thomas
    • Something we should/could bring back to ORTE.
      • Some people are finding benefits inside of a resource manager.
      • Portability. Abstraction to have a runtime layer they could carry around.
    • One other thing that standalone PRRTE gives you.
    • Thomas feels PRRTE has more value long term.
    • MPICH used to use OMPI ORTE to launch at some point.
      • If there's a standalone project useful for other projects.
  • Two major items that favor PRRTE over ORTE:
    • Reliable overlay network done on PRRTE, not ORTE.
      • Nothing pushed upstream of PMIX/PRRTE yet.
    • Sessions - prototype done with PRRTE
      • PMIx as first class citizen (nice, but not sufficient for Sessions).
      • Ralph and Howard convinced themselves that what Sessions needs is in PMIx not ORTE layer.
      • Still some items only in PRRTE needed, but not huge.
  • This has been a bit of a roller coaster.
  • Either way, we've been largely relying on Ralph, so we'll need to step up.
  • We need to have a way to make the decision go forward.
  • Howard could look into
  • How bit-rotted IS DVM support?
  • Are there feature enhancements in ORTE that are not in PRRTE?
  • PRRTE - we need a diagram that high
    • Sometimes if you launch with a persistent daemon, IT does the binding, but your later prun might be in conflict.
  • Need a feature list.
  • Don't lose sight of usefulness of external runtime environment.
  • Need unit tests.
  • Who can do this feature comparison?

Lock-bot

  • No reasons to NOT
  • Wanted comments from community.
  • George - hit this a few weeks ago on AWS.
  • William Zhang has not yet committed some graph code for reachability similar to usnic.
    • Brian/William will get with Josh Hursey to potentially test some more.

Face to face

  • Please register on Wiki page, since Jeff has to register you.
  • Think
  • Date looks good. Feb 17th right before MPI Forum
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • But willing to step aside if others want to host.
    • About a 20-30 min drive from MPI Forum; will probably need a car.
  • It's official! Portland Oregon, Feb 17, 2020.
    • Safe to begin booking travel now.

Infrastructure

Submodule prototype

  • Can we just turn on lock-bot / Probot until we can get the AWS bot online?

  • OMPI has been waiting for some git submodule work in Jenkins on AWS.

    • Need someone to figure out why Jenkins doesn't like Jeff's PR.
      • Anyone with github account for ompi team should have access.
      • PR 6821
      • Apparently Jenkins isn't behaving as it should.
    • Three pieces: Jenkins, CI, bot.
      • AWS has a libfabric setup like this for testing.
      • Issue is that they're reworking the design, and will roll it out for both libfabric and Open MPI.
    • William Zhang talked to Brian
      • Not something AWS team will work on, but Brian will work on it.
    • Jeff will talk to Brian as well.
  • Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.

    • William will probably be administering the Jenkins/AWS setup or communicating with those who will.
  • Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.

Modular thread re-write (Noah)

  • PR used to see if Poll would wait at all, and if it would

    • Howard is working on configure stuff to work.
    • Argobots - problem is integrating libevent.
      • libevent today is a framework.
        • libev support, would it solve the problem? From a high level it's a stripped down version of libevent.
      • No mechanism yet for one user-level thread to switch to another user-level thread.
      • Problem? libevent is polling too hard, and breaking things.
    • UGNI and Vader BTLs were getting better performance, not sure why.
    • For modular threading library, might be interesting to decide at compile time or runtime.
    • Previously similar things seemed to be related to ICACHE.
    • Howard will look at it.
  • Artem - Mellanox developers are doing some changes,

    • might require enabling
    • Google actions ? On the repo.
    • PMIx - already migrated to this

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • Will put out RCs for v3.0.5 and v3.1.5 this week.
  • Please test RCs when they become available.
  • Start drawing up a list of fixes that won't be backported to v3.0.x
    • Datatype bug won't be backported, because it snowballed too big.
    • With the new 3.0.x and 3.1.x releases, will put out a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, with a recommendation to upgrade.

Review v4.0.x Milestones v4.0.2

  • v4.0.2 was released and haven't had any catastrophic issues come in.
  • We're beginning to merge in new v4.0.3 PRs.
  • PR 7116 - Gilles updates some code for flang on master.
    • There were 3 commits. One definitely broke ABI, but the other two shouldn't have.
    • Some concern that even those 2 might break something on the release branch.
      • Going to ask Gilles on the PR whether it is important to him that this be on v4.0.x.
      • Another detail is that flang is old and being replaced.
    • Another REAL problem with v4.0.x - no MPI_NO_OP
      • Geoff will just PR the missing MPI_NO_OP change to v4.0.x

v5.0.0

  • Schedule: April 2020?
    • Wiki - go look at items, and we should discuss a bit in weekly calls.
    • Some items:
      • MPI1 removed stuff.

Review Master Pull Requests

CI status

  • IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
    • Austen is looking into it.
  • Absoft 32-bit Fortran failures.

Dependencies

PMIx Update

ORTE/PRRTE

  • No discussion this week.

MTT

