multi_scratch test with hwloc and pthreads seg-faults. #504

Closed
crtrott opened this issue Oct 24, 2016 · 17 comments

@crtrott
Member

crtrott commented Oct 24, 2016

No description provided.

@crtrott crtrott added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Oct 24, 2016
@crtrott crtrott added this to the Fall 2016 milestone Oct 24, 2016
@crtrott crtrott self-assigned this Oct 24, 2016
@nmhamster
Contributor

I think this may also be failing with OpenMP on POWER8 with PGI.

@crtrott
Member Author

crtrott commented Oct 24, 2016

OK. It did pass with all compilers using OpenMP tonight on HSW and KNL; only the pthread-hwloc tests failed. My suspicion is that it will fail with Pthreads without hwloc as well if I crank up the number of threads. Without hwloc, the Pthreads tests default to 4 threads, while OpenMP would use more.

@crtrott
Member Author

crtrott commented Oct 24, 2016

Can't replicate on my workstation ...

@crtrott
Member Author

crtrott commented Oct 24, 2016

Can't seem to replicate on kokkos-dev either (which was the machine which failed tonight ...)

@crtrott
Member Author

crtrott commented Oct 24, 2016

OK, not sure what is going on. If I run test_all_sandia gcc/4.8.4, the test fails with pthread and hwloc; if I run test_all_sandia gcc/4.8.4 --build-list=Threads, it doesn't. And after building, I can go into the build directories and just run the tests, and they reliably fail or don't fail ...

@crtrott
Member Author

crtrott commented Oct 24, 2016

On my workstation I can't replicate either way.

@jgfouca
Contributor

jgfouca commented Oct 24, 2016

The only difference between your workstation and kokkos-dev is that you've sync'd the TPLs to local storage?

@crtrott
Member Author

crtrott commented Oct 24, 2016

No, it's a slightly different OS, more memory, more cores, Haswell instead of Ivy Bridge.

@jgfouca
Contributor

jgfouca commented Oct 24, 2016

Do you think this is a problem with NFS or test_all_sandia? If not, I'll ignore this ticket.

@crtrott
Member Author

crtrott commented Oct 24, 2016

No, it's not.

@crtrott
Member Author

crtrott commented Oct 24, 2016

Tracked it down to running this:
./KokkosCore_UnitTest_Threads --gtest_filter=threads.team_lambda_shared_request:threads.multi_level_scratch

If I run all tests except team_lambda_shared_request, everything passes. So it looks like the problem might be in that test rather than in the others.

@crtrott
Member Author

crtrott commented Oct 24, 2016

OK, individually those two tests come back clean in valgrind, but they segfault if they are executed one after the other ...

@nmhamster
Contributor

@crtrott sounds like you have some cleanup that isn't being performed properly.

@crtrott
Member Author

crtrott commented Oct 24, 2016

I believe there is something funky going on with the resizing of the internal scratch array, because both tests use it but with different team sizes. If I reduce team_size to 2 for the second test, it passes.
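
To make the suspicion above concrete, here is a minimal sketch of a grow-on-demand scratch buffer shared between two kernel launches with different team sizes. It is not the Kokkos implementation: `ScratchPool`, `ensure`, and `slice` are hypothetical names chosen only for illustration, and the sizing rule (team size times bytes per thread) is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scratch pool that outlives individual kernel launches.
// All names here are illustrative assumptions, not Kokkos internals.
class ScratchPool {
 public:
  // Grow the buffer when a request exceeds the current capacity.
  // The buffer never shrinks. Returns true if a reallocation happened.
  bool ensure(std::size_t team_size, std::size_t bytes_per_thread) {
    const std::size_t needed = team_size * bytes_per_thread;
    if (needed <= buffer_.size()) return false;  // large enough already
    buffer_.assign(needed, 0);                   // reallocate and zero
    return true;
  }

  std::size_t capacity() const { return buffer_.size(); }

  // Start of the slice owned by one thread of the team. The caller must
  // have called ensure() with a team size greater than thread_rank.
  unsigned char* slice(std::size_t thread_rank, std::size_t bytes_per_thread) {
    return buffer_.data() + thread_rank * bytes_per_thread;
  }

 private:
  std::vector<unsigned char> buffer_;
};
```

If a second launch with a larger team skips `ensure` (or if its check is bypassed), `slice` for the higher thread ranks points past the end of the buffer, which would match tests that pass alone but crash when run back to back.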

@crtrott
Member Author

crtrott commented Oct 24, 2016

Also, if I run with mpirun -np -map-by socket:PE=8 it passes; with mpirun -np -map-by socket:PE=10 it fails (using team_size = 4 in either case).

@crtrott
Member Author

crtrott commented Oct 24, 2016

mpirun -np -map-by node:PE=16 passes; mpirun -np -map-by node:PE=20 fails. And now when I run with that restriction on my machine (i.e. -map-by socket:PE=10), it also fails there. Not quite sure yet what's going on: on my machine it passes with -map-by node:PE=20, which should actually match what kokkos-dev runs.

crtrott added a commit that referenced this issue Oct 26, 2016
This was somewhat a red herring. The main issue was not the scratch,
but that the dynamic scheduling managed to run past the maximum
workset size under certain special circumstances.
If you do that, the scratch memory was not reallocated, because
that place had an independent check on validness.
The scratch memory access was the first memory access in the kernel though
and managed to crash then.
This should now be fixed.
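
The failure mode described in the commit message can be illustrated with a small sketch of chunked dynamic scheduling. This is an assumption-based illustration, not the actual Kokkos scheduler: `ChunkQueue`, `claim_unchecked`, and `claim_clamped` are hypothetical names. The unchecked variant can hand out a range that runs past the end of the workset, while the clamped variant cannot.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>

// Hypothetical work queue that hands out chunks of a workset [0, work_end).
// Names and layout are illustrative assumptions, not Kokkos internals.
struct ChunkQueue {
  std::atomic<std::size_t> next{0};
  std::size_t work_end;
  std::size_t chunk;

  ChunkQueue(std::size_t end, std::size_t chunk_size)
      : work_end(end), chunk(chunk_size) {}

  // Buggy variant: the returned range may extend past work_end when the
  // workset size is not a multiple of the chunk size.
  std::pair<std::size_t, std::size_t> claim_unchecked() {
    const std::size_t begin = next.fetch_add(chunk);
    return {begin, begin + chunk};
  }

  // Fixed variant: both ends are clamped to work_end, so a thread that
  // arrives late gets an empty range instead of out-of-range indices.
  std::pair<std::size_t, std::size_t> claim_clamped() {
    const std::size_t begin = std::min(next.fetch_add(chunk), work_end);
    const std::size_t end = std::min(begin + chunk, work_end);
    return {begin, end};
  }
};
```

With a workset of 10 items and a chunk size of 4, the third unchecked claim covers indices 8 through 11, so items 10 and 11 do not exist. A kernel whose first memory access is per-team scratch would then be the first thing to touch invalid memory for those indices.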
@crtrott
Member Author

crtrott commented Oct 26, 2016

Got it :-)

@crtrott crtrott closed this as completed Oct 30, 2016