
WeeklyTelcon_20160209

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Geoff Paulsen
  • Brad Benton
  • Edgar Gabriel
  • Howard Pritchard
  • Joshua Ladd
  • Nathan Hjelm
  • Nysal Jan
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock
  • Yohann Burette

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3 - Targeting April, unless there is a need.
    • Nathan will look at 0 byte send issue.
    • SLURM issues from the dev list: already fixed in 1.10.2.
    • verbs usNIC not built by default - wait for review by Howard.
    • Fortran 08 - Jeff will take a look today.
    • SLES 12 - was a race condition in fork/exec before SIGCHLD detection. Fixed.
      • Long-running jobs (Linpack) are still having SIGCHLD issues.

Review 2.0.x

Review Master / new RFCs:

  • RFC to set the add_procs_cutoff to 32. PR1340

    • Just drop it down to 0!
  • Async Modex - at scale it helps for sparse connectivity, hurts for full connectivity.

    • Which direction for the default? Right now it is full modex (longer launch time for people who may not need it).
    • Ralph thinks for 2.0 we should leave it where it is (optional) and figure out when to turn it on.
    • Concern that people won't know it's available. Put it in NEWS?
  • --host vs. --hostfile behavior PR1344

    • Jeff would like this to be consistent with how oversubscription works, but omitting -np runs 1 proc.
    • Two issues: how many slots, and how many processes.
    • Change behavior so that if the user doesn't specify -np but DOES specify --host, we get 1 slot (and one process).
    • Keep hostfile behavior the same as today.

MTT master's not doing well at the moment.

  • A lot of the issues are usNIC related. Jeff will look at them.
  • NVIDIA failures look dynamics related. Sylvain is fixing something about the way it launches.
  • Nathan will look at all one-sided failures.
  • tcp btl might have an issue: getting a "tried to lock resource but already locked" warning.

Status Updates:

  • LANL
  • Houston
  • HLRS
  • IBM

Status Update Rotation

  1. LANL, Houston, HLRS, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel

