multi_scratch test with hwloc and pthreads seg-faults. #504

crtrott · 2016-10-24T17:06:20Z

No description provided.

nmhamster · 2016-10-24T17:07:15Z

I think this may also be failing with OpenMP on POWER8 with PGI.

crtrott · 2016-10-24T17:08:56Z

OK. It did pass with all compilers using OpenMP tonight on HSW and KNL; only the pthread-hwloc tests failed. My suspicion is that it will also fail with Pthreads without hwloc if I crank up the number of threads. Without hwloc, pthreads defaults to testing with 4 threads, while OpenMP would use more.

crtrott · 2016-10-24T18:43:44Z

Can't replicate on my workstation ...

crtrott · 2016-10-24T18:49:23Z

Can't seem to replicate on kokkos-dev either (which was the machine which failed tonight ...)

crtrott · 2016-10-24T19:14:40Z

OK, not sure what is going on. If I run test_all_sandia gcc/4.8.4, the test fails with pthread and hwloc; if I run test_all_sandia gcc/4.8.4 --build-list=Threads, it doesn't. And after building, I can go into the build directories and just run the tests, and they fail or don't fail reliably ...

crtrott · 2016-10-24T19:25:39Z

On my workstation I can't replicate either way.

jgfouca · 2016-10-24T20:14:16Z

The only difference between your workstation and kokkos-dev is that you've sync'd the TPLs to local storage?

crtrott · 2016-10-24T20:16:39Z

No, it's a slightly different OS, more memory, more cores, Haswell instead of Ivy Bridge.

jgfouca · 2016-10-24T20:18:42Z

Do you think this is a problem with NFS or test_all_sandia? If not, I'll ignore this ticket.

crtrott · 2016-10-24T20:33:27Z

No, it's not.

crtrott · 2016-10-24T20:35:01Z

Tracked it down to running this:
./KokkosCore_UnitTest_Threads --gtest_filter=threads.team_lambda_shared_request:threads.multi_level_scratch

If I run all tests except team_lambda_shared_request, everything passes. So the problem may be in that test rather than in the others.

crtrott · 2016-10-24T20:43:10Z

OK, individually those two tests come back clean in valgrind, but they segfault if they are executed one after the other ...

nmhamster · 2016-10-24T20:46:19Z

@crtrott sounds like you have some cleanup that isn't performed properly.

crtrott · 2016-10-24T21:05:48Z

I believe there is something funky going on with the resize of the internal scratch array, because both tests use it but with different team sizes. If I reduce team_size to 2 for the second test, it passes.

crtrott · 2016-10-24T21:08:00Z

Also, if I run with mpirun -np -map-by socket:PE=8 it passes; with mpirun -np -map-by socket:PE=10 it fails (using team_size = 4 in either case).

crtrott · 2016-10-24T21:11:55Z

mpirun -np -map-by node:PE=16 passes; mpirun -np -map-by node:PE=20 fails. And now when I run with that restriction on my machine (i.e. -map-by socket:PE=10), it also fails there. Not quite sure yet what's going on: on my machine it passes with -map-by node:PE=20, which should actually match what kokkos-dev runs.

This was somewhat of a red herring. The main issue was not the scratch itself, but that the dynamic scheduling managed to run past the maximum workset size under certain special circumstances. When that happened, the scratch memory was not reallocated, because that place had its own independent validity check. The scratch memory access was the first memory access in the kernel, though, so that is where it crashed. This should now be fixed.
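For readers unfamiliar with this failure mode, here is a minimal sketch of how a dynamically scheduled loop can be kept from running past the end of its work set. It is not the Kokkos implementation or the actual fix; `Chunk`, `claim_chunk`, and `drain` are hypothetical names, and the shared atomic counter is an assumed design chosen only for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical half-open range [begin, end) of work items claimed by one worker.
struct Chunk {
  std::size_t begin;
  std::size_t end;
  bool empty() const { return begin >= end; }
};

// Claim the next chunk from a shared counter.
// fetch_add keeps handing out start indices even after the work is exhausted,
// so both ends of the range are clamped to `total` before anyone uses them.
inline Chunk claim_chunk(std::atomic<std::size_t>& next, std::size_t total,
                         std::size_t chunk_size) {
  assert(chunk_size > 0);  // a zero chunk size would never make progress
  const std::size_t start = next.fetch_add(chunk_size, std::memory_order_relaxed);
  const std::size_t begin = std::min(start, total);
  const std::size_t end = begin + std::min(chunk_size, total - begin);
  return Chunk{begin, end};
}

// The loop each worker would run: keep claiming chunks until an empty one
// comes back. Every index in [0, total) is counted by exactly one claimant.
inline void drain(std::atomic<std::size_t>& next, std::vector<int>& visits,
                  std::size_t chunk_size) {
  for (;;) {
    const Chunk c = claim_chunk(next, visits.size(), chunk_size);
    if (c.empty()) break;
    for (std::size_t i = c.begin; i < c.end; ++i) ++visits[i];
  }
}
```

Without the two `std::min` clamps, the last claimant could receive a range that extends beyond `total`, and the first per-item memory access in the kernel (the scratch access, in the case described above) would be the one to fault.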

crtrott · 2016-10-26T05:19:08Z

Got it :-)

crtrott added the "Bug" label (broken / incorrect code; it could be Kokkos' responsibility, or others', e.g., Trilinos) Oct 24, 2016

crtrott added this to the Fall 2016 milestone Oct 24, 2016

crtrott self-assigned this Oct 24, 2016

crtrott added bug - fix pushed to develop branch labels Oct 26, 2016

crtrott closed this as completed Oct 30, 2016
