Check for oversubscription of threads with MPI in Kokkos::initialize #149

Closed
mhoemmen opened this issue Nov 30, 2015 · 9 comments
Labels: Enhancement (Improve existing capability; will potentially require voting)

@mhoemmen (Contributor)

Many Trilinos developers have been reporting problems with unit tests timing out or taking a long time. The cause is oversubscription of the node with too many threads. Kokkos::initialize can intervene and prevent this by checking the following:

  1. Number of cores per node
  2. Number of threads that the user requested
  3. Number of MPI processes per node

Kokkos already knows (2). For MVAPICH2 and OpenMPI, it can check environment variables to get (3). (For example, MV2_COMM_WORLD_LOCAL_SIZE or OMPI_COMM_WORLD_LOCAL_SIZE.) In fact, it already checks similar environment variables. This does not require a compile-time dependency on MPI and it works whether or not MPI is enabled.

For (1), the number of cores per node, we will need to do a bit of research.
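For illustration, here is a minimal sketch of how (3) could be read from the environment with no compile-time MPI dependency. The two variable names are the ones mentioned above for MVAPICH2 and OpenMPI; the helper name and the fallback behavior are hypothetical:

```cpp
#include <cstdlib>  // std::getenv, std::atoi

// Hypothetical helper: the number of MPI processes on this node, taken
// from launcher-specific environment variables.  Returns 1 if no known
// variable is set (e.g., a non-MPI run).
int local_mpi_ranks() {
  if (const char* v = std::getenv("MV2_COMM_WORLD_LOCAL_SIZE"))   // MVAPICH2
    return std::atoi(v);
  if (const char* v = std::getenv("OMPI_COMM_WORLD_LOCAL_SIZE"))  // OpenMPI
    return std::atoi(v);
  return 1;
}
```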

@nmhamster (Contributor)

Mark,

I think the best way to solve this is to use a KokkosP connector which, at initialization time, checks the number of threads requested against the associated process CPU mask. Much of the support for this is already in place, including multi-device. I think a lot of debugging can be done without putting code directly into the Kokkos runtime itself. This will help maintenance, since we can extend the tools after a program has been compiled (without needing updates of Kokkos in production), catch additional errors, and avoid the performance overhead of the checks when running in "real" production.

If users are using MPI/job submission correctly, the CPU mask will be generated to ensure that oversubscription does not occur. In OpenMPI and MPICH a user has to actively oversubscribe or the MPI runtime gets upset, so it shouldn't happen by accident. Similarly, schedulers such as SLURM will also set masks.

This means we could just perform the check locally within a KokkosP connector and avoid MPI communication. It also catches not just oversubscribed nodes but incorrectly set affinities on jobs that need or want to use less than a complete node (for instance, one thread per core gives the best performance for a parallel run of STREAM but would not lead to oversubscription).
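Something like the following Linux-only sketch could implement that mask check (sched_getaffinity and CPU_COUNT are glibc; the function name and the warning text are made up for illustration):

```cpp
#include <sched.h>   // sched_getaffinity, CPU_COUNT (glibc; g++ defines _GNU_SOURCE)
#include <cstdio>

// Warn if the caller asked for more threads than the affinity mask allows.
void check_threads_against_mask(int requested_threads) {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
    return;  // cannot read the mask; nothing to check
  const int allowed = CPU_COUNT(&mask);
  if (requested_threads > allowed)
    std::fprintf(stderr,
                 "Warning: %d threads requested, but the CPU mask of this "
                 "process only allows %d.\n",
                 requested_threads, allowed);
}
```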

Do you think that might work for you?

S.

@mhoemmen (Contributor, Author)

Hi Si! Christian was thinking of the environment-variable approach as a low-overhead fix. Kokkos already checks environment variables like those, so it would be pretty easy to add. I'm OK with the KokkosP approach as long as it is enabled by default. The issue is that, by default, Trilinos will oversubscribe. It's easy to fix (set the OMP_NUM_THREADS environment variable manually), but users don't know that, and you know how hard it is to educate users ;-) Also, the Pthreads back-end won't respect that environment variable.

@nmhamster (Contributor)

I think the best way forward is to have this caught by external tools and to avoid the checks in the Kokkos runtime, because it gives us operational freedom to change how we decide on runs in the future without requiring users to constantly get an updated Kokkos and recompile everything. If we use the connector hooks, we can modify the debugging environment for them through tool updates, and the applications/libraries etc. do not need to be recompiled. Given how long compiles are taking now and the constant problem of reproducibility, anything we can do to eliminate this from the critical path is a big win. I think the long-term plan is to have KokkosP on by default anyway, since it will allow low-overhead profiling and some environment-error-detection debugging to take place.

@crtrott (Member) commented Nov 30, 2015

The thing is, we need a check which will always happen at initialization. We had a few folks run into this problem by inadvertently oversubscribing. You actually get this by default, since OpenMPI binds to sockets while the default OpenMP thread count is all the threads in the system. With a typical unit test in Trilinos you also get more than one MPI rank per socket, so you don't even have exclusive process masks.

What I was thinking of doing is simply discovering how many hyperthreads are on the node, querying the LOCAL_SIZE environment variables, and spitting out a warning if LocalSize*Threads is larger than the number of available threads. This doesn't really have much to do with debugging; it's just something we need to get to people.
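A rough sketch of that warning, assuming std::thread::hardware_concurrency gives a usable hyperthread count and reusing the hypothetical local_mpi_ranks helper from the env-var sketch above:

```cpp
#include <cstdio>
#include <thread>

// Warn at initialization if ranks-per-node times threads-per-process
// exceeds the number of hardware threads on the node.
void warn_if_oversubscribed(int threads_per_process) {
  const int local_size = local_mpi_ranks();  // hypothetical helper, see above
  const int hw_threads =
      static_cast<int>(std::thread::hardware_concurrency());
  if (hw_threads > 0 && local_size * threads_per_process > hw_threads)
    std::fprintf(stderr,
                 "Kokkos initialize warning: %d ranks x %d threads "
                 "oversubscribes the %d available hardware threads.\n",
                 local_size, threads_per_process, hw_threads);
}
```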

@mhoemmen (Contributor, Author)

This should also have a minimal effect on compile times and library sizes. Kokkos::initialize already calls getenv, so we wouldn't need to include new header files. The environment query code lives in a .cpp file, so users wouldn't need to compile it.

@nmhamster (Contributor)

Sorry I wasn't clear; my issue is that we want to be able to change things in the environment queries over time (as needed) without requiring applications to recompile or relink - i.e. we want to compile the binary once and push as much as possible into dynamic insertions that we can selectively enable/disable. Do you think that makes sense?

OpenMPI does not default to sockets; since 1.8 it defaults to cores.

This isn't a Kokkos problem; it's more of a scheduler and MPI parameter issue that we need to fix with users.

I think the way forward is to have a "Kokkos debug" connector that can check multiple issues in a single shot/script; once it says everything is OK, the user knows they are good to go. We can selectively add/remove things from the debug connector over time if we route this through the KokkosP connectors.

@nmhamster (Contributor)

BTW, you're right: since the mid-1.8.x series, if the number of processes is > 2 it will bind to sockets.

@crtrott added the Enhancement label Dec 2, 2015
@crtrott added this to the Pre Christmas Push milestone Dec 2, 2015
@crtrott self-assigned this Dec 11, 2015
crtrott added a commit that referenced this issue Dec 11, 2015
This adds a check for oversubscription of CPUs with too many threads,
which also works in most common MPI environments. In particular, it
works with OpenMPI and MVAPICH2, as well as with SLURM dispatch. It was
tested on Cray.

It checks that the number of MPI ranks per node times the number of
threads per process does not exceed the total number of cores (or
hyperthreads).

This check happens for both OpenMP and Pthreads initialization.

This addresses issue #149.
@nmhamster (Contributor)

See #159. Will try to get you some more info.

@crtrott closed this as completed Jan 14, 2016
@crtrott (Member) commented Jan 14, 2016

In master
