MPI_Comm_spawn fails without correct mpiexec in PATH #6335

Open
jedbrown opened this issue Dec 18, 2022 · 15 comments

@jedbrown (Contributor)

While debugging a broken MPI_Comm_spawn on a Debian system, I reproduced the following issue with the latest main (68b574e) and a vanilla configuration. It looks like -pmi_args is being passed to /usr/bin/mpiexec (which is OMPI on this system). I think it should be able to spawn without a specially crafted PATH and without running within a matching mpiexec.

$ ~/usr/mpich-clang/bin/mpicc spawnintra.c -o spawnintra
$ ./spawnintra
mpiexec: Error: unknown option "-pmi_args"
Type 'mpiexec --help' for usage.
^C
$ PATH=$HOME/usr/mpich-clang/bin ./spawnintra
$ ~/usr/mpich-clang/bin/mpiexec -n 1 ./spawnintra
$ ~/usr/mpich-clang/bin/mpichversion
MPICH Version:      4.2a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch3:nemesis
MPICH configure:    --prefix=/home/jed/usr/mpich-clang --enable-error-checking=runtime --enable-error-messages=all --enable-g=meminit CC=clang CXX=clang++ F77=gfortran FC=gfortran --with-device=ch3
MPICH CC:           clang    -O2
MPICH CXX:          clang++   -O2
MPICH F77:          gfortran   -O2
MPICH FC:           gfortran   -O2

The source spawnintra.c is from the MPICH test suite (modified to be stand-alone).

/*
 * Copyright (C) by Argonne National Laboratory
 *     See COPYRIGHT in top-level directory
 */

#include <mpi.h>
#include <stdio.h>

/*
static char MTEST_Descrip[] = "A simple test of Comm_spawn, followed by
intercomm merge";
*/

int main(int argc, char *argv[]) {
  int errs = 0, err;
  int rank, size, rsize, i;
  int np = 2;
  int errcodes[2];
  MPI_Comm parentcomm, intercomm, intracomm, intracomm2, intracomm3;
  int isChild = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  // MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  MPI_Comm_get_parent(&parentcomm);

  if (parentcomm == MPI_COMM_NULL) {
    /* Create 2 more processes */
    MPI_Comm_spawn((char *)"./spawnintra", MPI_ARGV_NULL, np, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &intercomm, errcodes);
  } else
    intercomm = parentcomm;

  /* We now have a valid intercomm */

  MPI_Comm_remote_size(intercomm, &rsize);
  MPI_Comm_size(intercomm, &size);
  MPI_Comm_rank(intercomm, &rank);

  if (parentcomm == MPI_COMM_NULL) {
    /* Parent */
    if (rsize != np) {
      errs++;
      printf("Did not create %d processes (got %d)\n", np, rsize);
    }
    if (rank == 0) {
      for (i = 0; i < rsize; i++) {
        MPI_Send(&i, 1, MPI_INT, i, 0, intercomm);
      }
    }
  } else {
    /* Child */
    isChild = 1;
    if (size != np) {
      errs++;
      printf("(Child) Did not create %d processes (got %d)\n", np, size);
    }
    MPI_Recv(&i, 1, MPI_INT, 0, 0, intercomm, &status);
    if (i != rank) {
      errs++;
      printf("Unexpected rank on child %d (%d)\n", rank, i);
    }
  }

  /* At this point, try to form the intracommunicator */
  MPI_Intercomm_merge(intercomm, isChild, &intracomm);

  /* Check on the intra comm */
  {
    int icsize, icrank, wrank;

    MPI_Comm_size(intracomm, &icsize);
    MPI_Comm_rank(intracomm, &icrank);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    if (icsize != rsize + size) {
      errs++;
      printf("Intracomm rank %d thinks size is %d, not %d\n", icrank, icsize,
             rsize + size);
    }
    /* Make sure that the processes are ordered correctly */
    if (isChild) {
      int psize;
      MPI_Comm_remote_size(parentcomm, &psize);
      if (icrank != psize + wrank) {
        errs++;
        printf("Intracomm rank %d (from child) should have rank %d\n", icrank,
               psize + wrank);
      }
    } else {
      if (icrank != wrank) {
        errs++;
        printf("Intracomm rank %d (from parent) should have rank %d\n", icrank,
               wrank);
      }
    }
  }

  /* At this point, try to form the intracommunicator, with the other
   * processes first */
  MPI_Intercomm_merge(intercomm, !isChild, &intracomm2);

  /* Check on the intra comm */
  {
    int icsize, icrank, wrank;

    MPI_Comm_size(intracomm2, &icsize);
    MPI_Comm_rank(intracomm2, &icrank);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    if (icsize != rsize + size) {
      errs++;
      printf("(2)Intracomm rank %d thinks size is %d, not %d\n", icrank, icsize,
             rsize + size);
    }
    /* Make sure that the processes are ordered correctly */
    if (isChild) {
      if (icrank != wrank) {
        errs++;
        printf("(2)Intracomm rank %d (from child) should have rank %d\n",
               icrank, wrank);
      }
    } else {
      int csize;
      MPI_Comm_remote_size(intercomm, &csize);
      if (icrank != wrank + csize) {
        errs++;
        printf("(2)Intracomm rank %d (from parent) should have rank %d\n",
               icrank, wrank + csize);
      }
    }
  }

  /* At this point, try to form the intracommunicator, with an
   * arbitrary choice for the first group of processes */
  MPI_Intercomm_merge(intercomm, 0, &intracomm3);
  /* Check on the intra comm */
  {
    int icsize, icrank, wrank;

    MPI_Comm_size(intracomm3, &icsize);
    MPI_Comm_rank(intracomm3, &icrank);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    if (icsize != rsize + size) {
      errs++;
      printf("(3)Intracomm rank %d thinks size is %d, not %d\n", icrank, icsize,
             rsize + size);
    }
    /* Eventually, we should test that the processes are ordered
     * correctly, by groups (must be one of the two cases above) */
  }

  /* Update error count */
  if (isChild) {
    /* Send the errs back to the parent process */
    MPI_Ssend(&errs, 1, MPI_INT, 0, 1, intercomm);
  } else {
    if (rank == 0) {
      /* We could use intercomm reduce to get the errors from the
       * children, but we'll use a simpler loop to make sure that
       * we get valid data */
      for (i = 0; i < rsize; i++) {
        MPI_Recv(&err, 1, MPI_INT, i, 1, intercomm, MPI_STATUS_IGNORE);
        errs += err;
      }
    }
  }

  /* It isn't necessary to free the intracomms, but it should not hurt */
  MPI_Comm_free(&intracomm);
  MPI_Comm_free(&intracomm2);
  MPI_Comm_free(&intracomm3);

  /* It isn't necessary to free the intercomm, but it should not hurt */
  MPI_Comm_free(&intercomm);

  MPI_Finalize();
  return 0;
}
@hzhou (Contributor) commented Dec 18, 2022

Since this is a singleton init, how would the program find the correct mpiexec if it is neither in the PATH nor already running?

@hzhou (Contributor) commented Dec 18, 2022

It needs the corresponding mpiexec because there is no standard interface between the PMI server and the PMI client. For example, OMPI's mpiexec does not understand the option names used by MPICH.
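
For reference, a minimal sketch of the workaround shown in the report, done programmatically: prepend the matching MPICH bin directory to PATH before MPI_Init so the singleton-init spawn finds the right mpiexec. The install prefix below is an assumption taken from the configure line above; adjust it to the actual build.

/* Sketch (not an MPICH API): mirror the PATH= workaround from the report by
 * prepending the matching MPICH bin directory before MPI_Init, so the
 * singleton-init spawn finds the right mpiexec. Prefix is an assumption. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *mpich_bin = "/home/jed/usr/mpich-clang/bin"; /* assumed prefix */
    const char *old_path = getenv("PATH");
    char new_path[4096];
    snprintf(new_path, sizeof(new_path), "%s:%s", mpich_bin, old_path ? old_path : "");
    setenv("PATH", new_path, 1);    /* overwrite PATH for this process only */

    MPI_Init(&argc, &argv);
    /* ... MPI_Comm_spawn("./spawnintra", ...) as in the test above ... */
    MPI_Finalize();
    return 0;
}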

@jedbrown (Contributor, Author)

I don't follow why the mpiexec executable needs to be used, versus code within libmpich.so, but perhaps a relative path could be used? On Debian, mpiexec.mpich may be the installed name for mpiexec.

A reliable spawn enables IDE-integrated unit testing and doc testing of MPI-based software, which improves reliability and accessibility of the whole ecosystem, so I think it's worth some attention.

@jedbrown (Contributor, Author)

For MPI-enabled unit testing, the other thing I'm exploring is to fork and use accept/connect to get parents and children connected. That might be more reliable with respect to environment, but it's POSIX-only.
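
A rough sketch of that idea, under stated assumptions: fork before either side touches MPI, hand the port name to the child over a pipe, and meet via MPI_Comm_accept/MPI_Comm_connect. Whether connect/accept works like this from a singleton init is implementation-dependent.

/* Sketch of the fork + accept/connect idea (POSIX-only); singleton-init
 * support for connect/accept is implementation-dependent. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fds[2];
    char port[MPI_MAX_PORT_NAME] = {0};
    MPI_Comm intercomm;

    if (pipe(fds) != 0)
        return 1;
    pid_t pid = fork();              /* fork before either process touches MPI */

    MPI_Init(&argc, &argv);
    if (pid != 0) {                  /* parent: open a port, pass it to the child */
        close(fds[0]);
        MPI_Open_port(MPI_INFO_NULL, port);
        if (write(fds[1], port, sizeof(port)) != (ssize_t)sizeof(port))
            return 1;
        close(fds[1]);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Close_port(port);
    } else {                         /* child: read the port and connect back */
        close(fds[1]);
        if (read(fds[0], port, sizeof(port)) != (ssize_t)sizeof(port))
            return 1;
        close(fds[0]);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    }
    /* ... exchange test data over intercomm, or MPI_Intercomm_merge it ... */
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}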

@hzhou (Contributor) commented Dec 18, 2022

I don't follow why the mpiexec executable needs to be used, versus code within libmpich.so

Spawning MPI processes needs the functionality of a process manager, because it is not just launching processes: the processes also need to be coordinated. It is in principle possible to embed the process manager into the program itself, but then we lose the flexibility of a system process manager that may be specifically designed for a particular HPC environment. Either way, users would be expected to do some configuration, e.g. telling MPICH that the embedded process manager should be used. The default may make sense on a laptop, but it does not apply to other systems. Compared with the current approach, which requires users to set PATH, I don't think the trouble of embedding a process manager is worth it.

But as an enhancement, I think we could accept other options, such as an MPIEXEC environment variable, or employ some heuristics to find an appropriate mpiexec, such as mpiexec.mpich on Debian.
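
A hedged sketch of what such a lookup could look like (the MPIEXEC variable name and the candidate list are assumptions; MPICH does not do this today):

/* Hypothetical lookup order for a launcher (not an existing MPICH feature):
 * an MPIEXEC environment variable first, then distribution-specific names,
 * then plain mpiexec. A real version would search PATH and verify that the
 * launcher understands MPICH's PMI options. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

const char *find_mpiexec(char *buf, size_t len)
{
    const char *env = getenv("MPIEXEC");         /* assumed variable name */
    if (env && access(env, X_OK) == 0) {
        snprintf(buf, len, "%s", env);
        return buf;
    }
    const char *candidates[] = { "mpiexec.hydra", "mpiexec.mpich", "mpiexec" };
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        snprintf(buf, len, "/usr/bin/%s", candidates[i]); /* simplistic: no PATH search */
        if (access(buf, X_OK) == 0)
            return buf;
    }
    return NULL;
}

int main(void)
{
    char buf[4096];
    const char *launcher = find_mpiexec(buf, sizeof(buf));
    printf("launcher: %s\n", launcher ? launcher : "(not found)");
    return 0;
}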

A reliable spawn enables IDE-integrated unit testing and doc testing of MPI-based software, which improves reliability and accessibility of the whole ecosystem, so I think it's worth some attention.

I do not disagree, but I think the cleaner solution is on the IDE side rather than asking MPI to figure out its environment. While convenience is important, it shouldn't overshadow the first priority of ensuring the best performance on an HPC system, where defaults and fallbacks rarely work due to the cutting-edge nature of such systems.

For MPI-enabled unit testing, the other thing I'm exploring is to fork and use accept/connect to get parents and children connected. That might be more reliable regarding environment, but it's posix-only.

Be careful: you are moving toward a more remote corner of the MPI world. While the functionality is there in both the standard and the implementation, it is the least used and the least performant or reliable. Even comm spawn is in a rural area because few HPC applications use it. We are starting to pay more attention to comm spawn as less traditional applications show up more often.

EDIT: The concern is that your tests may not reflect the eventual production environment.

@jedbrown (Contributor, Author)

Thanks. I don't follow why linking the process manager into libmpi would break functionality when an external process manager is in use.

@hzhou (Contributor) commented Dec 18, 2022

Thanks. I don't follow why linking the process manager into libmpi would break functionality when an external process manager is in use.

For example, on most HPC systems the process manager is a separate program that is different from, e.g., hydra. Using an embedded process manager may not work with the system job scheduler very well, or at all. I am not saying it is not feasible; I am saying that the extra trouble of embedding a process manager may not make much sense for most HPC applications.

@jedbrown (Contributor, Author)

Okay, but you know at MPI_Init whether the job is a singleton init or has a process manager. The built-in PM only needs to be used for singleton init.

But I think the solution is cleaner on the IDE side rather than asking MPI to figure out its system.

How is the IDE/language server better positioned to know how to spawn processes than the MPI library itself? This sounds like extra configuration and special-case handling, which means many potential developers won't get it right, will dislike developing MPI-based software, and will write software more slowly with more bugs. We should think about the development and documentation (doc testing) experience as essential features of any programming model. Reliable, portable spawn is one of the most basic building blocks.

@hzhou (Contributor) commented Dec 19, 2022

The IDE only needs to set the correct path for finding mpiexec.

@jedbrown (Contributor, Author)

That's an extra configuration that needs to be exposed and kept in sync with whichever implementation is chosen by the build system. This pushes incidental complexity onto tool builders and users, complexity that could be avoided if MPI_Comm_spawn from a singleton init were more robust.

@hzhou (Contributor) commented Dec 19, 2022

The IDE also needs to be configured to find mpicc, or to find the correct MPI library to link with. I don't see how setting the binary path for running is any different.

@jedbrown (Contributor, Author)

That is done automatically with existing tooling, with nothing specific to the IDE. For example,

$ ./configure --whatever-args
$ bear -- make -j60     # or ninja, etc.; works with any build system

This creates compile_commands.json, which is automatically picked up by clangd, so you get auto-completion, code actions, etc., in any IDE you choose. Nowhere in this process does the IDE or clangd know anything specific about mpicc or MPI. It's even more transparent for rust-analyzer, which doesn't need bear or a prior build. Determining and synchronizing mpiexec is error-prone plumbing and labor that no other library needs.

@hzhou (Contributor) commented Dec 19, 2022

I am sure bear needs special code to recognize mpicc as a valid compiler command. IDEs currently do not have a special understanding of running code as multiple processes. This is a deficiency of IDEs, and it is easy for them to accommodate, e.g. by setting an environment variable or running under mpiexec, once they understand what is needed. Yes, users have to wrestle with the IDE if it does not explicitly support what they want to do, e.g. running multiple processes. That's the nature of IDE solutions. Convenience is not magic; it is evolution and compromise.

Embedding a process manager is feasible, but it would be a much more complex solution, in both implementation and policy, than simply figuring out a way of setting PATH or finding the correct mpiexec. We'll need more evidence of necessity before committing the effort.

Determining and synchronizing mpiexec is error-prone plumbing and labor that no other library needs.

Just to note that this is a faulty comparison: no other library does distributed parallel processing the way MPI does. I understand the desire to pretend that MPI is just a normal single-process application, but that pretense needs to be managed by the tools. We will do everything reasonable to facilitate the tools.

PS: F5 in my vim builds the code with mpicc and runs the app with mpirun -- just background for why I find it hard to see how this is not an IDE issue.

@jedbrown (Contributor, Author)

Actually, bear traces the execution and has no special support for mpicc or other wrappers, nor any special knowledge of MPI.

It's pretty common (outside of HPC anyway) to have unit tests/doctests that launch extra processes (client-server, etc.) from an editor interface like this.
(screenshot: tests launched from an editor-integrated test interface)

The main reason I'm working with spawn is that I want to make a test harness for Rust that handles MPI jobs as reliably and transparently as other parts of the ecosystem. So I start in a singleton context and want to spawn a job wired up so I can collate output. If I use mpiexec, I have to find the right one, and I can't have collective assertions that MPI_Gather results to produce collated output. I'm gathering that this is just reality for the near future because MPI_Comm_spawn is too fragile (also with OMPI, albeit for different reasons). But the reliability and richness of this tooling would be better if MPI_Comm_spawn could be made reliable for singleton init.
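
For concreteness, a sketch of the collation step such a harness could perform once spawn works, assuming the parent and children have been merged into an intracommunicator as in the test above (the function name is illustrative):

/* Gather every rank's error count onto rank 0 of a merged intracommunicator
 * and print one collated summary (sketch for a hypothetical test harness). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void collate_errors(MPI_Comm intracomm, int errs)
{
    int nranks, myrank;
    MPI_Comm_size(intracomm, &nranks);
    MPI_Comm_rank(intracomm, &myrank);

    int *all_errs = (myrank == 0) ? malloc((size_t)nranks * sizeof(int)) : NULL;
    MPI_Gather(&errs, 1, MPI_INT, all_errs, 1, MPI_INT, 0, intracomm);

    if (myrank == 0) {
        for (int i = 0; i < nranks; i++)
            printf("rank %d: %d error(s)\n", i, all_errs[i]);
        free(all_errs);
    }
}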

Of course, for any specific app I can also have a single keypress do the right thing. But I constantly switch across the software stack, with packages built with several different MPI implementations, and I'm honestly more concerned about reliability for students and collaborators. Each bit of configuration is a place for accidents to creep in, and bad experiences early on discourage people from getting into or staying in HPC.

@hzhou (Contributor) commented Dec 19, 2022

Thanks for the background. I don't disagree with the motivation, only with the solution. Maybe we can have an offline discussion to better understand the problem?
