
Core Oversubscription Detection Broken? #159

Closed
nmhamster opened this issue Dec 17, 2015 · 8 comments

@nmhamster
Contributor

I was working on some code this week and got warnings printed by Kokkos about the CPU cores being oversubscribed. I'm not sure the detection of how many MPI ranks are in the environment is correct. I will try to recreate it and send back details.

nmhamster added the Bug label on Dec 17, 2015
@crtrott
Member

crtrott commented Dec 17, 2015

Which machine did you do this on?

@crtrott
Member

crtrott commented Dec 17, 2015

Any update? It seems to work for me.

@nmhamster
Contributor Author

We just talked about this, but for the issue record, here's what I do:

salloc -N 1 --time=04:00:00
<Run my Kokkos code>

And I get:

Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores.
                                    Detected: 64 cores per node.
                                    Detected: 32 MPI_ranks per node.
                                    Requested: 64 threads per process.
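
For reference, a minimal sketch of the kind of comparison that produces a message like this (illustrative only, not the actual Kokkos initialization code): 32 detected ranks per node, each requesting 64 threads, is 2048 threads on a 64-core node.

```c++
#include <cstdio>

// Illustrative heuristic: warn when ranks-per-node times threads-per-process
// exceeds the number of detected cores on the node.
void warn_if_oversubscribed(int cores_per_node, int ranks_per_node,
                            int threads_per_process) {
  if (ranks_per_node * threads_per_process > cores_per_node) {
    std::fprintf(stderr,
                 "WARNING: likely oversubscribing CPU cores "
                 "(%d cores, %d ranks x %d threads per node)\n",
                 cores_per_node, ranks_per_node, threads_per_process);
  }
}
// With the values above, warn_if_oversubscribed(64, 32, 64) fires.
```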

@crtrott
Member

crtrott commented Dec 20, 2015

OK, the issue is that after salloc -N 1 the SLURM environment still says you have 32 tasks on the machine. That number is reset if you actually use srun -n or mpirun -np after doing the salloc, but if you simply run a Kokkos executable directly (i.e. ./my_kokkos_code instead of srun -n 1 ./my_kokkos_code or mpirun -np 1 ./my_kokkos_code), the warning is triggered. It looks like we can fix that by checking more SLURM variables. In fact, I initially checked other SLURM variables which would not have triggered the warning in this case, but those variables don't exist on Cray.
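
As an illustration of that distinction, assuming the standard SLURM environment variables (the exact set Kokkos inspects may differ): something like SLURM_TASKS_PER_NODE is populated for the whole allocation as soon as salloc returns, while SLURM_STEP_NUM_TASKS typically only appears inside an actual srun job step, so a directly launched executable sees only the former.

```c++
#include <cstdio>
#include <cstdlib>

int main() {
  // Set by salloc/sbatch for the whole allocation (e.g. "32" here).
  const char* alloc_tasks = std::getenv("SLURM_TASKS_PER_NODE");
  // Typically only set inside an srun job step; absent when the binary is
  // launched directly from the salloc shell.
  const char* step_tasks = std::getenv("SLURM_STEP_NUM_TASKS");

  std::printf("allocation tasks per node: %s\n", alloc_tasks ? alloc_tasks : "(unset)");
  std::printf("step tasks:                %s\n", step_tasks ? step_tasks : "(unset)");
  return 0;
}
```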

crtrott self-assigned this on Jan 25, 2016
@crtrott
Member

crtrott commented Feb 11, 2016

I looked into it a bit more. It is kind of hard to fix. When you use mpirun, the OMPI variables are used in preference to the SLURM variables. But before you run mpirun, the SLURM variables really just tell you that there are 32 tasks in the allocation. There doesn't seem to be a way to figure out that I have a serial run going.
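
A minimal sketch of that precedence, assuming the usual Open MPI and SLURM variable names (not the exact Kokkos logic): prefer the per-node count that mpirun exports, and only fall back to the SLURM count otherwise, which is where a plain ./my_kokkos_code run still sees the allocation's 32 tasks.

```c++
#include <cstdlib>

// Best guess at MPI ranks per node; illustrative only, the real detection
// may consult different variables.
int guess_ranks_per_node() {
  // Open MPI exports this when the process is launched through mpirun.
  if (const char* ompi = std::getenv("OMPI_COMM_WORLD_LOCAL_SIZE"))
    return std::atoi(ompi);
  // Allocation-level SLURM count; atoi() stops at any "(xN)" suffix.
  if (const char* slurm = std::getenv("SLURM_TASKS_PER_NODE"))
    return std::atoi(slurm);
  return 1;  // nothing set: assume a serial run
}
```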

@crtrott
Member

crtrott commented Mar 14, 2016

So what do we do about this? Is it enough to just be able to disable the check at compile time and/or at runtime via a command-line option?
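
Purely as a sketch of what a runtime opt-out could look like (the variable name here is hypothetical, not an actual Kokkos option):

```c++
#include <cstdlib>
#include <cstring>

// Hypothetical opt-out: skip the oversubscription warning when the user sets
// an environment variable to a non-zero value. The name is illustrative only.
bool oversubscription_check_enabled() {
  const char* v = std::getenv("MY_APP_DISABLE_OVERSUBSCRIPTION_WARNING");
  return (v == nullptr) || (std::strcmp(v, "0") == 0);
}
```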

@crtrott
Member

crtrott commented Jun 14, 2016

I am closing this now. It seems to me that the detection is basically working as intended, and if the scheduler is lying to me, it's not my fault :-)

crtrott closed this as completed on Jun 14, 2016
@crtrott
Member

crtrott commented Sep 19, 2016

I have disabled the SLURM detection now, since it caused more issues on other machines.
