multi_scratch test with hwloc and pthreads seg-faults. #504
Comments
I think this may also be failing with OpenMP on POWER8 with PGI.
OK. It did pass with all compilers using OpenMP tonight on HSW and KNL; just the pthread-hwloc tests failed. My suspicion is that it will fail with Pthreads without hwloc as well if I crank up the number of threads. Without hwloc, pthreads defaults to 4-thread testing, while OpenMP would use more.
Can't replicate on my workstation ...
Can't seem to replicate on kokkos-dev either (which was the machine that failed tonight ...)
OK, not sure what is going on. If I do test_all_sandia gcc/4.8.4, the test fails with pthread and hwloc; if I do test_all_sandia gcc/4.8.4 --build-list=Threads, it doesn't. And after building I can go into the build directories, just run the tests, and they reliably fail/don't fail ...
On my workstation I can't replicate either way.
The only difference between your workstation and kokkos-dev is that you've sync'd the TPLs to local storage?
No, it's a slightly different OS, more memory, more cores, Haswell instead of Ivy Bridge.
Do you think this is a problem with NFS or test_all_sandia? If not, I'll ignore this ticket.
No, it's not.
Tracked it down to running this: if I run all tests except team_lambda_shared_request, everything passes. So it looks like it might be a problem in that test rather than the others.
OK, individually those two tests come back clean in valgrind, but they segfault if executed one after the other ...
@crtrott sounds like you have some cleanup which isn't performed properly.
I believe there is something funky going on with the resize of the internal scratch array. Both tests use it, but with different team sizes. If I reduce team_size to 2 for the second test, it passes.
Also, if I run with mpirun -np -map-by socket:PE=8 it passes; with mpirun -np -map-by socket:PE=10 it fails (using team_size = 4 in either case).
mpirun -np -map-by node:PE=16 passes; mpirun -np -map-by node:PE=20 fails. And now when I run with that restriction on my machine (i.e. -map-by socket:PE=10), it also fails there. Not quite sure yet what's going on: on my machine, with -map-by node:PE=20, it passes, which should actually match what kokkos-dev runs.
This was somewhat of a red herring. The main issue was not the scratch itself, but that the dynamic scheduling managed to run past the maximum workset size under certain special circumstances. When that happened, the scratch memory was not reallocated, because that code path had an independent check on validity. The scratch memory access was the first memory access in the kernel, though, and crashed at that point. This should now be fixed.
Got it :-)