Core Oversubscription Detection Broken? #159
Comments
Which machine did you do this on?
Any update? It seems to work for me.
We just talked about this, but for the issue record, here's what I do:
And I get:
OK, the issue is that after salloc -N 1 the environment still says you have 32 tasks on the machine. This number is reset if you actually use srun -n or mpirun -np after the salloc, but if you simply run a Kokkos code directly (i.e. ./my_kokkos_code instead of srun -n 1 ./my_kokkos_code or mpirun -np 1 ./my_kokkos_code), the warning is triggered. It looks like we can fix that by checking more SLURM variables. In fact, I initially checked other SLURM variables that wouldn't have triggered the warning in this case, but those variables don't exist on Cray.
I looked into it a bit more, and it is kind of hard to fix. When you use mpirun, the OMPI variables are used in preference to the SLURM variables. But before you run mpirun, the SLURM variables really just tell you that there are 32 tasks running. There doesn't seem to be a way to figure out that I have a serial run going.
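For the record, here is a minimal sketch of the precedence described above. It is not the actual Kokkos code. It assumes the check reads OMPI_COMM_WORLD_LOCAL_SIZE (set by Open MPI's mpirun) and falls back to SLURM_TASKS_PER_NODE (set by salloc); which variables Kokkos really consults is an assumption here. The names guess_ranks_per_node, leading_int, process_env and EnvLookup are made up for illustration.

```cpp
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Looks up an environment variable; returns std::nullopt when it is unset.
using EnvLookup =
    std::function<std::optional<std::string>(const std::string&)>;

// Default lookup backed by the real process environment.
inline std::optional<std::string> process_env(const std::string& name) {
  const char* value = std::getenv(name.c_str());
  if (value == nullptr) return std::nullopt;
  return std::string(value);
}

// Parses the leading decimal integer of a string such as "32" or "2(x3),1".
// Returns std::nullopt if the string does not start with a digit.
inline std::optional<int> leading_int(const std::string& text) {
  std::size_t i = 0;
  int value = 0;
  while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
    value = value * 10 + (text[i] - '0');
    ++i;
  }
  if (i == 0) return std::nullopt;
  return value;
}

// Hypothetical helper: guesses how many MPI ranks share this node.
// Assumed precedence: the Open MPI variable wins over the SLURM variable.
inline int guess_ranks_per_node(const EnvLookup& env = process_env) {
  if (auto v = env("OMPI_COMM_WORLD_LOCAL_SIZE")) {
    if (auto n = leading_int(*v)) return *n;
  }
  if (auto v = env("SLURM_TASKS_PER_NODE")) {
    if (auto n = leading_int(*v)) return *n;
  }
  return 1;  // Nothing set: assume a plain serial run.
}
```

With only SLURM_TASKS_PER_NODE=32 in the environment (a binary started directly inside the salloc shell), this sketch reports 32 ranks, which is exactly the false positive in this issue. Under mpirun -np 1 the Open MPI variable takes over and the guess drops to 1.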
So what do we do about this? Is it enough to just be able to disable the test at compile time and/or at runtime via a command-line option?
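A runtime opt-out could look roughly like the sketch below. The flag name --no-oversubscription-check and both function names are hypothetical and not existing Kokkos options. The rule "ranks times threads must not exceed cores" is likewise an assumption about what the check compares.

```cpp
#include <cstring>
#include <string>

// Hypothetical flag: returns true if the user asked to skip the check.
inline bool oversubscription_check_disabled(int argc,
                                            const char* const* argv) {
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--no-oversubscription-check") == 0) {
      return true;
    }
  }
  return false;
}

// Returns the warning text, or an empty string when nothing should be
// printed. Assumed rule: warn when ranks * threads exceeds the core count.
inline std::string oversubscription_warning(int ranks_per_node,
                                            int threads_per_rank,
                                            int cores_per_node,
                                            bool check_disabled) {
  if (check_disabled) return "";
  if (ranks_per_node * threads_per_rank <= cores_per_node) return "";
  return "Warning: " + std::to_string(ranks_per_node) + " ranks x " +
         std::to_string(threads_per_rank) + " threads oversubscribe " +
         std::to_string(cores_per_node) + " cores";
}
```

A compile-time switch would be the same idea with a preprocessor guard around the call instead of a flag.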
I am closing this now. It seems to me that the thing is basically working as intended, and if the scheduler is lying to me, it's not my fault :-)
I disabled the SLURM detection now, since it caused more issues on other machines.
I was working on some code this week and was getting warnings printed by Kokkos about the number of cores being oversubscribed. I'm not sure the detection of how many MPI ranks are in the environment is correct. I will try to recreate it and send back details.