WeeklyTelcon_20190611

Geoffrey Paulsen edited this page Jul 2, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov
  • Brian Barrett
  • Dan Topa
  • David Bernholdt
  • Geoff Paulsen
  • Howard Pritchard
  • Jeff Squyres
  • Josh Hursey
  • Ralph Castain
  • Thomas Naughton
  • Todd Kordenbrock

not there today (I keep this for easy cut-n-paste for future notes)

  • Aravind Gopalakrishnan (Intel)
  • Arm (UTK)
  • Brandon Yates (Intel)
  • Brendan Cunningham (Intel)
  • Edgar Gabriel
  • Geoffroy Vallee
  • George Bosilca
  • Jake Hemstad
  • Joshua Ladd
  • Matias Cabral
  • Matthew Dosanjh
  • Michael Heinz (Intel) - Introducing Brandon
  • Nathan Hjelm
  • Noah Evans (Sandia)
  • Peter Gottesman (Cisco)
  • Xin Zhao
  • mohan

Agenda/New Business

Issue 6666

  • Version checks vs functionality.
    • UCX has committed to stability
    • Sounds like we don't want to be caught flat-footed again.
    • Starting with UCX v1.7, they'll make UCT backwards compatible.
  • Other than people who explicitly set the routed component to debruijn, who will this impact?
    • No one, never used unless specifically asked for.
  • Ralph's not convinced that the debruijn component can even WORK with the launch tree mechanism. That's not how it's designed.
  • Thought: is there a way to make debruijn just alias the 'other' routed component? Could we do that?
    • No way to do this easily.
  • Is it better to say "you asked for X and we gave you Y," or "you asked for X and we no longer support it"?
  • The routed component initialization phase is not stateless; this is fixed on master.
  • The immediate fix for release branches is to just remove the problem component.
  • Did it ever work in a .0 release? If not, then no one could have been using it.
    • It worked when developed because only one component was active at a time.
    • When we transitioned to all components being active (for another reason),
      • we weren't testing this component at scale.
  • Don't want to do aliasing. Rather just git rm this broken component.
  • ACTION: Ralph will repush a PR to remove the broken component that is also messing up the OTHER components.

Infrastructure


Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • issue 6655.
    • hostname work isn't advancing
    • Waiting for vader and atomic fixes from v4.0.x (Geoff)

Review v4.0.x Milestones v4.0.2

  • Moved release date to June 21st.
    • Shouldn't move up releases quickly like this.
    • Others were planning on a Sept release
  • Should we do a 4.0.3 in Sept? (latest we can practically do in 2019)
  • Drivers: UCX compile issue
    • Vader / ob1 - more serious.
      • Other issues in vader, some attempts to fix (double free using xpmem)
        • Proposed fix, but never actually fixed.
    • If you compile with gcc 5 or 6, IMB hangs on x86_64.
      • Yes, we have a fix in hand (on master, PR 6711).
    • Workaround is to use an older GCC.
      • What testing gaps led to us not hitting this?
      • ACTION: we should discuss testing holes.
      • ACTION: we should run IMB or OSU as part of CI.
  • Perhaps we should have a testing czar.
    • Now we look at it as an afterthought.
  • So how bad are our vader issues?
  • If we don't have the fixes we need, June is unrealistic.
  • Could use a GitHub Project Kanban for this.
  • Howard's making a new label and applying it to all issues needed for v4.0.2
  • We will review that next week and discuss schedule.

Review Master Pull Requests

v5.0.0


Dependencies

PMIx Update

  • Schedule

Next face to face

  • Schedule
    • Suggesting Sept 16
    • Jeff to re-update the Doodle for availability. We'll pick next week.
  • Location
    • TBD

MTT

  • IBM's not submitting after cluster update
  • Brian working on 512 nodes ssh.
  • not much MTT development going on.

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2019 WeeklyTelcon-2019
