
Meeting Minutes 2018 09

Matias Cabral edited this page Nov 9, 2018 · 13 revisions

Open MPI fall/winter 2018 developer meeting

(yes, the "2018 09" title on this wiki page is a bit of a lie -- cope)

Tues Oct 16 (to accommodate limited availability)

Meeting information / logistics page

OFI (Libfabric):

  • Intel presenting some slides on OFI MTL
    • OFI Presentation
    • (pronounced Oh_Eff_Eye, or Oh_Fee, or Oh_Fye)
    • Added support for remote CQ data.
      • Not all providers support remote CQ data.
    • Affects MPI_Send/Isend/Irecv/Recv/Improbe/Iprobe
      • Now conditionals in there.
    • Scalable endpoints support in OFI MTL
    • Registering specialized communication functions based on provider capabilities
    • m4 generated C code to avoid code duplication?
    • Discussion: OFI components to set their priority based on the provider found.
    • OFI Common module creation.
  • Conversation:
    • Libfabric has to improve: the fact that not all libfabric providers support all features is bad for users of the libfabric interface.
    • Libfabric should better set expectations of what's needed.
    • This is still a useful feature, but the libfabric API needs to be better.
  • slide 5 and 6:
    • Possible Solutions to generate provider specific implementation.
    • CPP macros multiply quickly with many arguments. Very fragile.
    • Expand m4 configuration step.
  • Conversation:
  • Today, don't require m4 to BUILD Open MPI, just to make dist.
    • Version of m4 is also an issue....
    • Suggestion, don't generate this C code at make time, but generate ALL C files at Make dist time, and only compile some of them at build time.
    • Make dist time also has perl and python.
    • No fallback to build slowly.
    • Will make ALL of the providers at make dist time, and will setup function pointers at runtime.
    • Jeff jokingly suggested that they could implement it with Patcher to get rid of function pointer indirection once the provider is known.
  • Providers would implement these pieces themselves.
  • TCP / Sockets reference may need an owner.
  • Does it make sense to NOW change OFI from an MTL to a PML?
    • Mellanox first did MTL, but then switched to PML.
    • Intel made CM better, so difference is now much smaller.
    • Intel thinks about 100 ns penalty from MTL rather than PML.
      • Benchmarking this is important, because much of CM layer would just need to push some of this functionality down to PML.
    • Most are okay if it's an MTL or a PML, but don't want extra work.
    • A resource issue now.

Scalable Endpoints - OFI SEPs

  • Expose hardware resources through libfabric APIs.
  • Thread -> ctx assignment.
  • This feature enables exclusive transmit/receive contexts per MPI Comm.
  • OFI MTL by communicator ID.
    • Ongoing work to add "on first access"
  • OFI BTL on first access (or round robin)
    • TLS used to save OFI context used by specific thread (round robin without TLS)
    • Forces its own progress once an empirically set threshold is exceeded.
  • Both of the above only make sense with multi-threaded opal_progress
    • ARM has slides and will talk about this later today.
  • Thread grouping
  • OFI MTL Benchmarked with IMB-MT (available with Intel MPI 2019 Tech Preview)
  • OFI BTL benchmarked different usage with ARM's "Pairwise RMA" benchmark on thannon's github fork
    • Each thread running on its own MPI Communicator.
    • Open MPI has lots of thread unsafe code components. Could flag at runtime if opal_progress is thread safe or not (and either funnel or not).
    • ARM will discuss.
  • What are the assumptions about endpoints? All are connectionless; any endpoint can talk to any endpoint.
  • Only broadcast one endpoint, but address relative contexts
  • Problems (slide 14)
    • Multiple OFI components "may" inefficiently use resources.
    • Today without addressing Issue 5599, BTLs are always initialized
    • Suggestion: the Portals library does this; look at vector addressing.
  • Discussions about packaging, how to get instances to share common data structures. OFI Common module.
    • Different policies for resource sharing: EP, CQ, AV.
      • policies will be "share endpoints", or each component create their own.
  • OFI Components Dynamic Priority (slide 16) Issue 5794 - the MTL will by default discard currently known "slow" providers.

TCP on master right now.

  • Cisco and Amazon are seeing tests fail with "socket closed"
  • A patch about a month ago started treating a socket close as a hard error and aborting the job.
  • Just need to apply one of two patches.
    1. One backs off the hard error in the shutdown condition.
    2. Just revert the patch that broke everything.
  • As a community we should do one or the other. But this just hides the real problem.
  • TCP patch just exposed a bigger problem.
    • TCP BTL doesn't have a connection and handshake.
    • Second, the PML calls abort, but during finalize perhaps it shouldn't abort.
    • The PML should know about its BTLs going away.
    • No longer a fence in Finalize.
  • We all agree that we should apply the same logic everywhere
    • If a commit breaks master, we should revert it, or disable it
    • Doesn't shut the door to better fix / redesign later.


Multi-threaded OPAL Progress

  • PR 5241: Add MCA param for multithread opal_progress() (George, Arm)
  • Arm presented some graph about ideas around multi-threaded opal_progress()
  • Old design still capped at single thread.
    • luck-based performance (thread scheduling is nondeterministic)
    • Current design performs well in oversubscribed case.
  • proposed design would be to move thread protection down to component level.
    • let every thread call opal_progress() to get the benefit of threads.
    • can have some thread go through progress, and can have some in thread pool to do tasks.
  • At 1024 bytes, +35% injection rate (shines at small messages; why not 1 byte?)
  • What can be taskified?
    • Matching
    • Pack, unpack
    • Anything that does not need to be in critical section.
  • Downsides
    • Task creation overhead.
    • If only one thread waiting, more overhead/latency
  • What we need from components:
    • They have to be thread safe (or disqualify with THREAD_MULTIPLE)
  • Components should not hold a lock in component_progress()
    • Components that are not thread safe can just use a trylock instead of a lock.
  • Out of sequence impact
    • receiving
    • MPI has a way for the app to tell MPI not to enforce ordering
      • Once you go multi-threaded, most of the time is then spent ordering.
      • can get 100% speed up if app doesn't need ordering (app handles tags themselves).
      • RDMA (bigger than FIFO), stalls pipelines, really bad.
  • MADNESS performance (C++, bad mem management)
    • 5 times slower on openib btl
    • The RDMA path loses a lot of performance because of out-of-order sequence numbers.
  • We need to better educate users to use multiple communicators.
    • Is this because of communicator lock contention or separate match spaces?
      • Both.
  • What do we do today?
    • BTL modules need a lock to protect their progress.
  • All agree that mt_opal_progress makes sense, in hopes to get better injection rate.
    • Looking for proposal on 'core' locking around btl progress functions.
  • What about the single threaded?
    • Will affect, but should be a single thread through a trylock (~0)
    • With this functionality, single thread can easily have multiple progress threads.
  • WORK Phases:
    1. mca param # threads allowed in opal_progress, default: 1
      • Also means changing opal_progress() to be multi-threaded.
      • ARM will PR
    2. Arm provides recommendation(s)
    3. maintainers update components with a giant trylock().
      • Any component that registers a progress function needs this trylock()
      • libnbc as well? (libnbc already does this, sort of)
      • Everyone should do this, should be very simple.
    4. maintainers update components to make use of multiple contexts.
    5. Argue about default of mca param > 1
  • End Goal:
    • If components want to take advantage of multi-threaded progress, they can do a bit more work (3 lines)

5599 - BML initialization

  • r2 / BTLs are initialized even when they are not used (Jeff)
    • Conclusion, there's not a great solution for NOT initializing the BTLs because of one-sided.
      • If we did this, we would need an answer for what BTL to use when we get there. Right now only Portals and UCX have an answer for that.
    • Also, there is no straightforward way for one OFI endpoint to be shared across multiple components.
      • Ralph says we DO have a way to allow the OFI MTL component to let the OFI BTL component know the other is using a certain resource. Could set a PMIx event that the other could catch.
    • Either extend libfabric / libpsm2, or write a common component between ompi and opal.

4.0.x status / roadmap

  • RC tomorrow
  • Release Monday!
  • Nathan just filed 5889 blocker.

TCP bric-a-brac:

  • Discuss TCP multilink support. What is possible, what we want to do and how can we do it.
    • Amazon signed up to do this, but might not happen until Q1 2019
  • Discuss TCP multiple IP on the same interface. What we want to do, and how we plan to do it.
    • Right now if you have both IPv4 AND IPv6, only publish the IPv6 if they have IPv6 enabled on all nodes.
    • Everyone's happy with this current behavior.
  • TCP BTL progress thread. Does 1 IP interface vs. >1 IP interface matter?
    • Discussed this morning.

C compiler discussion

  • ISSUE: Vader hit some bugs where compilers were tearing a 32-bit write into two different 16-bit writes.
  • See the Linux kernel documentation on evil things that compilers do.
  • Linux has some macros WRITE_ONCE, READ_ONCE, ACCESS_ONCE. But the macros only work for gcc (> v4.1), llvm, and/or intelcc
    • Not sure if this is just a temporary measure until we have C11-compliant compilers.
  • Should we limit the number of C compilers that can be used to compile the OMPI core (e.g., limit the amount of assembly/atomic stuff we need to support).
    • E.g., PGI doesn't give us the guarantees we need
    • Probably need to add some extra wrapper glue: e.g., compile OMPI core with C compiler X and use C compiler Y in mpicc.
      • Are there any implications for Fortran? Probably not, but Jeff worries that there may be some assumption(s) about LDFLAGS/LIBS (and/or other things?) such that: "if it works for the C compiler, it works for the Fortran compiler".
  • Decided to require compilers that can correctly guarantee the WRITE_ONCE, READ_ONCE, and ACCESS_ONCE macros
  • xlC does not make this guarantee, and so can't be used to compile the Open MPI core.
  • Brian and Nathan will sort out who will do this work on master for next release from master.
  • Will clearly state which dirs are to be compiled with the core, and what a compiler needs to do to compile the 'core'.
  • Essentially, configury will ask for a 'core' compiler and a 'main' compiler.

PMIx as "first class" citizen?

  • Shall we remove the OPAL pmix framework and directly call PMIx functions?

    • Require all with non-PMIx environments to provide a plugin that implements PMIx functions with their non-PMIx library
    • In other words, invert the current approach that abstracted all PMIx-related interfaces.
  • Why does Open MPI need to know about pmix server code?

    • Only reason it's here, is because of the OPAL abstraction layer.
    • Coming into the abstraction layer, because you're coming in with Opal types, and need to convert to pmix APIs.
  • Howard thought we'd already decided to just call PMIx directly.

  • And that this would be part of the same issue.

  • In the absence of ORTE, users have to download, build, and install PRTE and do a LAMBOOT-style startup.

  • LAMBOOT - could be buried under an mpirun like wrapper.

  • Engineering efforts?

    1. Making mpirun hide PRTE - Ralph says it's pretty trivial (he has something similar in PRTE)
    2. Would have a separate "project" for this, and a number of ways to do this.
  • It is another piece with a different release schedule, and with different release goals. May be broader than what Open MPI needs.

  • If PMIx adds a new

  • New features coming down the road:

    • Groups (part of sessions)
    • Networking support in PMIx, but we don't have a way to take advantage of it in OMPI.
    • Containers.
  • About 4 or 5 things in next 6 months.

  • Ralph is giving a talk at Supercomputing about users' applications using PMIx directly. And more users will want to use this.

  • Is there a way for a user who linked against pmix to intercept our MODEX?

  • MODEX is special, but nothing stops a user from also getting a callback on certain PMIx events.

  • If we get rid of opal wrappers that call PMIx, does that concern us for future maintainability?

    • No, that was done back when we had to support multiple things, PMI1 and PMI2, and SLURM, etc.
  • If we did this, we the Open MPI community would expect PMIx to behave as an interface, like we do with MPI interface.

    • PMIx has made that promise, and that is why they now have a PMIx standard.
  • Following shared library versioning and API promises, PMIx is then in the same boat as hwloc, libevent, etc.

  • Why not make hwloc and libevent 1st class citizens?

  • Talk about getting rid of libevent.

    • Next time we do the work for libevent, we might consider moving it UP to a top-level 3rd-level directory.
    • Take a look at opal hwloc in PRTE.
      • We don't have an hwloc object; it's just a name translation.
      • We just call opal_hwloc...
      • The part we have to retain: we define binding policies in it.
      • PRTE doesn't have embedded hwloc. All the base functions are pulled into opal_hwloc.
    • Ralph also has an opal event directory in PRTE.
    • In PRTE there are no abstractions.
    • Oh wait, there IS an opal_pmix in PRTE, since there are some conversions in there.
      • Howard is taking a look at this.
  • Discussion expanded from just PMIx to also direct calling hwloc and libevent.

    • Orte pushes events into libevent as a way to sequence them. In a very large system, if you don't do things correctly, you can see this.
      • PMIx has been getting out of that habit and so everything's okay.
  • hwloc not much work.

  • The pmix side would need to do translation. (Error codes we can't get rid of; there's no way we can line up those error codes.)

    • pmix_info_t's - would be nice to get rid of doing these translations.
    • Easy for Ralph to update, since he's already done this in PRTE.
  • Difficulty - How do you deal with differences in version levels?

    • Easy for compile time differences. It either builds or doesn't.
    • Runtime differences in PMIx support is more difficult.
    • Example: build OMPI against PMIx v4.0, but then RUN with v3.0. The linker should fail at runtime, if .so versioning is done correctly.
  • What do we do with SLURM PMIx (16.05 first), and ALPS PMI?

    • Is there one for ALPS? No, but no reason they couldn't implement PMIx. They will support the launch mechanism at least.
  • If we do this for OMPI v5.0, and you still want to run under an older SLURM, a user COULD launch PRTE_BOOT in their SLURM allocation.

  PROPOSAL:

  1. Remove ORTE
  2. Remove RTE layer?
  3. Add PRTE
  4. Modify mpirun / prterun - see if there's an existing PMIx server, if not auto PRTE_BOOT the PRTE daemons, and then launch.
  • PRTE would HAVE to go to formal release

  • Are there any components in ORTE not in PRTE?

  • QUESTION: could mpirun act like a wrapper around srun?

    • No, but we could look at that.
    • Does SLURM support spawn?
    • If it did, could just call PMIx_Spawn, and let slurm launch the PMIx daemon.
  • This will destroy years of power point slides.

  • OMPI layer will just call PMIx.

  • Not THAT much work, because we don't have to change the ORTE side.

  • Howard is going to remove bfo component (removes a trouble spot)

  • Howard (ECP) in January timeframe.

  • Are we going to redistribute PRTE?

    • Yes, It's got same license as Open MPI.
  • Ralph moves hwloc and libevent up.

    • Some things need translation
    • But Don't do name shifting.
    • Want to see where the glue goes.
  • It will still check external first, and then build the internal pieces that don't exist.

  • Should we publish Open MPI release tarballs to



Summarize PMIx first class citizen

  • Ompi will call PMIx directly without opal wrappers.
  • Non-PMIx approaches will need to provide PMIx translations
    • For example PMI or PMI2
  • rip out ORTE
  • use released PRTE
  • mpirun would wrap prte_boot and boot it up launch, and shut down.
  • No name space shifting.
  • do need our own errors.
  • Intended for Open MPI v5.0 1st half of 2019

discuss PMIx compatibility issues and how to communicate them

  • addresses (some of) PMIx-to-PMIx compatibility issues.
    • But what about OMPI to PMIx compatibility?
    • And what about RM to PMIx compatibility?
    • How do we convey what this multi-dimensional variable space means to users in terms of delivered MPI features?
    • Case in point: OMPI v3.0.x used with external PMIx 1.2.5, which resulted in some OMPI features not working.
  • At least three dimensions:
    1. What does OMPI support and do?
    2. What does PMIx client support and do?
    3. What does PMIx server support and do?
  • Giant rat-hole of issues.
  • OMPI can choose which versions of PMIx OMPI will support.
  • Would like to know WHAT failed, and WHY it failed in that mode.
    • Not just SLURM failed MPI_Comm_spawn.
    • But SLURM failed BECAUSE of how it was configured.
  • Probably easier to do this error gathering at the end, rather than the beginning.
  • Does PMIx have "WHO" is resource provider info?
    • Artem (Ralph proxy) not now, but could add key/value, without change in standard.
    • This would be optional, so it wouldn't change the standard.
  • DONT MAKE THIS OPTIONAL! This is why things are so hard!
    • Mellanox has this problem also (not sure what names of some keys are)
    • Places in PMIx standard which are obscure in some way.
  • Static capabilities we know at the beginning of time; we can put things in and reference them when we fail.
  • Dynamic capabilities are harder.
  • Error reporting to the user is IMPORTANT.
  • Scientists need error messages that they can understand what to do, and who to talk to.
  • Jeff showed Josh's Opal_SOS - returned an error code at each level.
    • Lower level thought this error is critical, but higher level didn't think it was a critical error, because it can try other approaches.
  • Questions for PMIx
    • Error runtime for pmix
    • What about a tool that queries capability
  • What PMIx functionality won't work? May not be that much.
    • MPI_Comm_spawn - Spawn
    • MPI_Connect
    • Events - PMIx doesn't support this.
  • Need to find a use case, and drive that home.
  • Requirement is to identify where the error came from, and any additional description along the way, all the way back up to MPI.

public ompi-tests repository for easier sharing of testsuites among collaborators (Edgar)

  • Originally the ompi-tests repo was private because we didn't know where the tests came from (and whether we have redistribution rights)
    • Move the known-safe things to a new repo that is public.
    • Edgar developed some tests that are clearly ours.
  • Let's make a new public repo called 'tests' under the 'open-mpi' group.
    • Edgar had one project that could be released
      • Latency test suite
      • ompi2 test suite
    • Needs a LICENSE
    • May want release cycles.
    • May want two repos (to be checked out separately). Edgar will think about it.

Discuss memory utilization/scalability (ThomasN)

  • Thomas N has been on Oak Ridge National Lab ECP (Exascale Computing Project)
  • Scope and objectives:
    • want to minimize memory usage per rank.
    • Using TAU for this.
    • Small patch to easily instrument OPAL for memory tracking.
    • opal_object - and a few spots get variable name, total size, and count.
    • Substructures - like proc_t, and how much of that is from the group_t.
  • Example profiling run: mpirun -np 2 tau_exec -T mpi.pdt ./a.out

Fujitsu's status

  • Fujitsu MPI for Post-K Computer
  • Development Status in Fujitsu
  • QA Activity in Fujitsu


Mellanox/Xin: Performance optimization on OMPI/OSC/UCX multithreading

Need vendors to reply to their issues on the github issue tracker

  • Open MPI really NEEDS vendors to check for new issues.

5.0.x roadmap

  • SCHEDULE: Aiming for next summer (2019)
  • ABI-changing commit on master (after v4.0.x branch) which will affect future v5.0.x branch:
    • Do we want to keep it? It's a minor update / could easily be deferred.
    • Keep it on master, since v5.0.x will be next master release mid-late 2019
  • George will cleanup checkpoint restart
  • C++ gone
  • PRTE change
  • Will NOT remove --enable-mpi1-compat or TKR version of use mpi module.
  • WILL put the TKR version of use mpi module under some new --enable-SOMETHING configury flag. Geoff Paulsen will get with Jeff Squyres and do this
  • Remove RoCE and iWARP support
    • Does UCX still work on older things? - Yes new emulation handles it except for very minor differences.
    • We didn't do this in Open MPI v4.0.0 because libfabric v1.7 was AFTER Open MPI v4.0.0
    • We WILL remove openib btl for v5.0
    • We do NOT need an iWARP (or ROCE) btl, UCX or Libfabric are fine.
  • So, This all means no v4.1.x
  • OMPIO atomicity levels
  • OMPIO external32

PMIx Roadmap

  • Review of v3 and v4 features
    • v3 has shipped, in OMPI v4.0.0
    • allows the scheduler/launcher to request a payload for PMIx that includes whatever info you want.
    • For example will pickup env variables in plugins and forward along.
    • Can ask it to assign networking resources.
    • PMIx puts all of this into the payload. The launch message carries it along to pmix daemons on the remote nodes, which know how to pull out their piece.
    • All preliminary work to take advantage of this, is in Open MPI.
    • pnet framework has skeleton to use this.
    • Tool support for IO forwarding in v3
  • v4 - Howard
    • v4 includes the network topology graph. Right now, only doing it for Omnipath.
    • Two ways to get this data:
      1. pmixd queries hwloc and pushes it back up to the head node, which wires it up and sends it back down ("roll up" method); doesn't get switches
        • believe this approach only gets physical, not virtual
      2. Query the subnetmanager.
  • Still talking about pointers to data, rather than copies of data. not sure if this will make PMIx v4
  • Outline changes in OMPI required to support them
    • If we follow up where PMIx symbols are exposed
  • Outline changes for minimizing footprint (modex pointers instead of copies)
  • Decide which features OMPI wants to use
  • Are there potential problems with users making pmix calls on same namespace as MPI?
    • Only thing we could think of is if user's app compiled against a different version of pmix.
  • Released the formal v2.0 standard.
  • Have a v3.0 draft standard sometime end of month.

Debugger transition from MPIR to PMIx

  • How to orchestrate it?
    • PRTE doesn't support MPIR.
    • Publicly said:
      • deprecate warning in mid-late 2018
      • remove MPIR in mid-late 2019 (lines up nicely with move to PRTE change)
    • If someone sets MPIR, we can print a show-help message
      • mpirun has a function that checks the "being debugged" flag (defaults to 0)
      • MPIR_being_debugged
      • Ralph will put in a show-help message, Jeff will edit the PR
      • Will get this deprecation warning into v4.0.0 in the next few days.

Mail service (mailman, etc.) discussion - here are the lists we could consolidate down to:

  • Something happened in August where the building was struck by lightning in the basement

  • No redundancy or surge protection. Something got fried, and recovery took up to 7 days.

    • OMPI core
    • OMPI devel/packagers
    • OMPI users/announce
    • OMPI commits
    • HWLOC commits
    • HWLOC devel/users/announce
    • MTT users/devel
    • Do we want to move to a commercial hosting site? Would cost about $21/month, or about $250/year
  • Additionally: The Mail Archive is going through some changes. Unclear yet as to whether this will impact us or not

  • Ralph donated 3 years of hostgator (expires July of next year).

    • When this runs out, move website to AWS.
    • also hosting emails for some developers.
  • As of 4 Sep 2018, Open MPI has $575 in our account at SPI.

  • How do we detect if mail server is down?

  • Could have an hourly cron that checks whether the latest git commit matches the last commit message seen on the mailing list.

  • Why don't we just move to AWS?

    • It's non-zero work. Let's just keep mailman; also, $21/month is CHEAP
    • Paying $250/year sounds good


  • Jeff created workspace and invited us.


To Be Scheduled
