Regression test fail on Spock/Crusher/Frontier/HIP #659

Closed
pgrete opened this issue Mar 24, 2022 · 13 comments · Fixed by #800 or #805
Labels
bug Something isn't working

Comments

@pgrete (Collaborator) commented Mar 24, 2022

$ ctest -L regression -LE "mpi-no|perf-reg" --output-on-failure
Test project /gpfs/alpine/ast146/world-shared/parth/build-hip-mpi
      Start 42: regression_mpi_test:particle_leapfrog
 1/10 Test #42: regression_mpi_test:particle_leapfrog ...........   Passed    1.30 sec
      Start 44: regression_mpi_test:particle_leapfrog_outflow
 2/10 Test #44: regression_mpi_test:particle_leapfrog_outflow ...   Passed    0.97 sec
      Start 46: regression_mpi_test:restart
 3/10 Test #46: regression_mpi_test:restart .....................   Passed   32.23 sec
      Start 48: regression_mpi_test:calculate_pi
 4/10 Test #48: regression_mpi_test:calculate_pi ................   Passed    0.94 sec
      Start 50: regression_mpi_test:advection_convergence
 5/10 Test #50: regression_mpi_test:advection_convergence .......   Passed   49.83 sec
      Start 52: regression_mpi_test:output_hdf5
 6/10 Test #52: regression_mpi_test:output_hdf5 .................***Failed   22.07 sec


test_dir=['/ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5']
output_dir='/gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/tst/regression/outputs/output_hdf5_mpi'
driver=['/gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/example/advection/advection-example']
driver_input=['/ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5/parthinput.advection']
kokkos_args=['--kokkos-num-devices=4']
num_steps=4
mpirun=['srun']
mpirun_ranks_flag='-n'
mpirun_ranks_num=4
mpirun_opts=[]
coverage=False
analyze=False
sparse_disabled=False
*****************************************************************
Beginning Python regression testing script
*****************************************************************

Initializing Test Case
Using:
driver at:       /gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/example/advection/advection-example
driver input at: /ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5/parthinput.advection
test folder:     /ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5
output sent to:  /gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/tst/regression/outputs/output_hdf5_mpi

Make output folder in test if does not exist
*****************************************************************
Preparing Test Case Step 1
*****************************************************************

*****************************************************************
Running Driver
*****************************************************************

Command to execute driver
srun -n 4 /gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/example/advection/advection-example -i /ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5/parthinput.advection --kokkos-num-devices=4
*****************************************************************
Preparing Test Case Step 2
*****************************************************************

*****************************************************************
Running Driver
*****************************************************************

Command to execute driver
srun -n 4 /gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/example/advection/advection-example -i /ccs/home/pgrete/src/parthenon/tst/regression/test_suites/output_hdf5/parthinput.advection parthenon/job/problem_id=advection_3d parthenon/mesh/numlevel=2 parthenon/mesh/nx1=32 parthenon/meshblock/nx1=8 parthenon/mesh/nx2=32 parthenon/meshblock/nx2=8 parthenon/mesh/nx3=32 parthenon/meshblock/nx3=8 parthenon/time/integrator=rk1 Advection/cfl=0.3 --kokkos-num-devices=4
*****************************************************************
Preparing Test Case Step 3
*****************************************************************

*****************************************************************
Preparing Test Case Step 4
*****************************************************************

Running with coverage
False
*****************************************************************
Analysing Driver Output
*****************************************************************

/gpfs/alpine/ast146/world-shared/parth/build-hip-mpi/tst/regression/outputs/output_hdf5_mpi

-------------------------------------------
    Filename=advection_2d.out0.final.phdf
-------------------------------------------
                Time=1.0000
              NCycle=569
             NumDims=2
            MaxLevel=2
           NumBlocks=130
       MeshBlockSize=[16 16  1]
       CellsPerBlock=256
          TotalCells=33280
      TotalCellsReal=33280
               NumPE=4
         BlocksPerPE=[32 32 33 33]
              NGhost=2
       IncludesGhost=0
         Coordinates=UniformCartesian
--------------------------------------------
           Variables=['Blocks', 'Info', 'Input', 'Levels', 'Locations', 'LogicalLocations', 'Params', 'SparseInfo', 'VolumeLocations', 'advected', 'one_minus_advected', 'one_minus_advected_sq', 'one_minus_sqrt_one_minus_advected_sq_12', 'one_minus_sqrt_one_minus_advected_sq_37']
--------------------------------------------


-------------------------------------------
    Filename=/autofs/nccs-svm1_home1/pgrete/src/parthenon/tst/regression/gold_standard/advection_2d.out0.final.phdf
-------------------------------------------
                Time=1.0000
              NCycle=569
             NumDims=2
            MaxLevel=2
           NumBlocks=130
       MeshBlockSize=[16 16  1]
       CellsPerBlock=256
          TotalCells=33280
      TotalCellsReal=33280
               NumPE=0
         BlocksPerPE=130
              NGhost=2
       IncludesGhost=0
         Coordinates=UniformCartesian
--------------------------------------------
           Variables=['Blocks', 'Info', 'Input', 'Levels', 'Locations', 'LogicalLocations', 'Params', 'SparseInfo', 'VolumeLocations', 'advected', 'one_minus_advected', 'one_minus_advected_sq', 'one_minus_sqrt_one_minus_advected_sq_12', 'one_minus_sqrt_one_minus_advected_sq_37']
--------------------------------------------

Checking metadata
                  Info: no diffs
                Params: no diffs
              Blocks/loc.lx123: no diffs
              Blocks/xmin: no diffs
                Levels: no diffs
     VolumeLocations/x: no diffs
     VolumeLocations/y: no diffs
     VolumeLocations/z: no diffs
      LogicalLocations: no diffs
            SparseInfo: no diffs
           Locations/x: no diffs
           Locations/y: no diffs
           Locations/z: no diffs
Metadata matches
____Comparing on a per variable basis with tolerance 1e-12
Tolerance = 1e-12
  advected            : no diffs
  one_minus_advected  : no diffs
  one_minus_sqrt_one_minus_advected_sq_12: no diffs
  one_minus_advected_sq: no diffs
  one_minus_sqrt_one_minus_advected_sq_37: no diffs

-------------------------------------------
    Filename=advection_3d.out0.final.phdf
-------------------------------------------
                Time=1.0000
              NCycle=214
             NumDims=3
            MaxLevel=1
           NumBlocks=204
       MeshBlockSize=[8 8 8]
       CellsPerBlock=512
          TotalCells=104448
      TotalCellsReal=104448
               NumPE=4
         BlocksPerPE=[51 51 51 51]
              NGhost=2
       IncludesGhost=0
         Coordinates=UniformCartesian
--------------------------------------------
           Variables=['Blocks', 'Info', 'Input', 'Levels', 'Locations', 'LogicalLocations', 'Params', 'SparseInfo', 'VolumeLocations', 'advected', 'one_minus_advected', 'one_minus_advected_sq', 'one_minus_sqrt_one_minus_advected_sq_12', 'one_minus_sqrt_one_minus_advected_sq_37']
--------------------------------------------


-------------------------------------------
    Filename=/autofs/nccs-svm1_home1/pgrete/src/parthenon/tst/regression/gold_standard/advection_3d.out0.final.phdf
-------------------------------------------
                Time=1.0000
              NCycle=214
             NumDims=3
            MaxLevel=1
           NumBlocks=204
       MeshBlockSize=[8 8 8]
       CellsPerBlock=512
          TotalCells=104448
      TotalCellsReal=104448
               NumPE=0
         BlocksPerPE=204
              NGhost=2
       IncludesGhost=0
         Coordinates=UniformCartesian
--------------------------------------------
           Variables=['Blocks', 'Info', 'Input', 'Levels', 'Locations', 'LogicalLocations', 'Params', 'SparseInfo', 'VolumeLocations', 'advected', 'one_minus_advected', 'one_minus_advected_sq', 'one_minus_sqrt_one_minus_advected_sq_12', 'one_minus_sqrt_one_minus_advected_sq_37']
--------------------------------------------

Checking metadata
                  Info: no diffs
                Params: no diffs
              Blocks/loc.lx123: no diffs
              Blocks/xmin: no diffs
                Levels: no diffs
     VolumeLocations/x: no diffs
     VolumeLocations/y: no diffs
     VolumeLocations/z: no diffs
      LogicalLocations: no diffs
            SparseInfo: no diffs
           Locations/x: no diffs
           Locations/y: no diffs
           Locations/z: no diffs
Metadata matches
____Comparing on a per variable basis with tolerance 1e-12
Tolerance = 1e-12
Diff in advected            
    bkji: ( 190,   7,   0,   3)
    zyx: (0.117188,0.007812,0.179688)
    err_mag: 0.008055
    f0: 3.5469e-02
    f1: 2.7414e-02
  advected            : differs
Diff in one_minus_advected  
    bkji: ( 190,   7,   0,   3)
    zyx: (0.117188,0.007812,0.179688)
    err_mag: 0.008055
    f0: 9.6453e-01
    f1: 9.7259e-01
  one_minus_advected  : differs
Diff in one_minus_sqrt_one_minus_advected_sq_12
    bkji: ( 190,   7,   0,   3)
    zyx: (0.117188,0.007812,0.179688)
    err_mag: 0.008055
    f0: 3.5469e-02
    f1: 2.7414e-02
  one_minus_sqrt_one_minus_advected_sq_12: differs
Diff in one_minus_advected_sq
    bkji: ( 190,   7,   0,   3)
    zyx: (0.117188,0.007812,0.179688)
    err_mag: 0.015603
    f0: 9.3032e-01
    f1: 9.4592e-01
  one_minus_advected_sq: differs
Diff in one_minus_sqrt_one_minus_advected_sq_37
    bkji: ( 190,   7,   0,   3)
    zyx: (0.117188,0.007812,0.179688)
    err_mag: 0.008055
    f0: 9.6453e-01
    f1: 9.7259e-01
  one_minus_sqrt_one_minus_advected_sq_37: differs
Wrong total in hst output of 3D problem
Traceback (most recent call last):
  File "/ccs/home/pgrete/src/parthenon/tst/regression/run_test.py", line 204, in <module>
    main(**vars(args))
  File "/ccs/home/pgrete/src/parthenon/tst/regression/run_test.py", line 93, in main
    raise TestError("Test " + test_manager.test + " failed")
__main__.TestError: Test output_hdf5 failed

      Start 54: regression_mpi_test:advection_outflow
 7/10 Test #54: regression_mpi_test:advection_outflow ...........   Passed    1.80 sec
      Start 56: regression_mpi_test:bvals
 8/10 Test #56: regression_mpi_test:bvals .......................   Passed    2.79 sec
      Start 58: regression_mpi_test:poisson
 9/10 Test #58: regression_mpi_test:poisson .....................   Passed    4.72 sec
      Start 60: regression_mpi_test:sparse_advection
10/10 Test #60: regression_mpi_test:sparse_advection ............   Passed   18.31 sec

90% tests passed, 1 tests failed out of 10

Label Time Summary:
cuda          = 134.95 sec*proc (10 tests)
mpi-yes       = 134.95 sec*proc (10 tests)
poisson       =   4.72 sec*proc (1 test)
regression    = 134.95 sec*proc (10 tests)

Total Test time (real) = 134.96 sec

The following tests FAILED:
	 52 - regression_mpi_test:output_hdf5 (Failed)
Errors while running CTest

@pgrete added the bug label on Mar 24, 2022
@pgrete (Collaborator, Author) commented Mar 24, 2022

All tests pass for non-MPI runs.

@pgrete pgrete changed the title Regression test fail on Spock/HIP Regression test fail on Spock/Crusher/Frontier/HIP Oct 27, 2022
@pgrete pgrete pinned this issue Oct 27, 2022
@pgrete (Collaborator, Author) commented Oct 27, 2022

@lroberts36 I think I've identified the root cause of the issue: it appears to be related to sparse variables.

When I run the regression test manually

./example/advection/advection-example -i ../tst/regression/test_suites/advection_convergence/parthinput.advection parthenon/mesh/nx1=80 parthenon/meshblock/nx1=40 parthenon/mesh/nx2=60 parthenon/meshblock/nx2=30 parthenon/mesh/nx3=72 parthenon/meshblock/nx3=36 parthenon/mesh/x1min=-1.5 parthenon/mesh/x1max=1.5 parthenon/mesh/x2min=-0.75 parthenon/mesh/x2max=0.75 parthenon/mesh/x3min=-1.0 parthenon/mesh/x3max=1.0 Advection/vx=3.0 Advection/vy=1.5 Advection/vz=2.0 Advection/profile=smooth_gaussian Advection/amp=1.0

The test will fail at random times (sometimes at cycle 4, sometimes at cycle 153, sometimes during init) with the following error: `:0: Device-side assertion 'Advected not properly initialized.' failed.`
(Soft) disabling sparse makes the code run without problems.

I still don't know what the root cause is, i.e., whether it's a bug in the sparse machinery or a bug/problem with HIP/AMD GPUs.
Note that the issue:

  • happens without MPI
  • happens with fewer optimizations, e.g., -O2 or even -O0
  • happens with launch and API blocking enabled
  • does not happen anymore when there are fewer than 8 blocks in the mesh (I varied the block/nx# above for various combinations; with a split into just 4 blocks, independent of whether in one or two dims, it works)
  • does not happen with pack_size=1 (rather than -1); see the example command after this list
  • does not happen in debug mode
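
For completeness, the pack_size=1 workaround can be applied via a command-line override, assuming pack_size is read from the parthenon/mesh input block like the other overrides above (a sketch for this setup, not a verified command):

./example/advection/advection-example -i ../tst/regression/test_suites/advection_convergence/parthinput.advection parthenon/mesh/nx1=80 parthenon/meshblock/nx1=40 parthenon/mesh/nx2=60 parthenon/meshblock/nx2=30 parthenon/mesh/nx3=72 parthenon/meshblock/nx3=36 parthenon/mesh/x1min=-1.5 parthenon/mesh/x1max=1.5 parthenon/mesh/x2min=-0.75 parthenon/mesh/x2max=0.75 parthenon/mesh/x3min=-1.0 parthenon/mesh/x3max=1.0 Advection/vx=3.0 Advection/vy=1.5 Advection/vz=2.0 Advection/profile=smooth_gaussian Advection/amp=1.0 parthenon/mesh/pack_size=1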

@Yurlungur (Collaborator) commented:

I wonder if this is related to a failure to identify whether or not a pack should be rebuilt. @lroberts36 and I discussed this and a solution at some point. I think there's a version in riot with the fix.

@lroberts36 (Collaborator) commented Nov 1, 2022

Tried to reproduce on a Darwin amd-epyc-gpu node with an MI25 GPU. Compiled and ran using

Currently Loaded Modules:
  1) rocm/4.5.2   2) cmake/3.22.2
  
$ cmake .. -DKokkos_ENABLE_HIP=ON -DCMAKE_CXX_COMPILER=hipcc -DPARTHENON_DISABLE_HDF5=ON -DPARTHENON_DISABLE_MPI=ON -DKokkos_ARCH_VEGA900=ON

$ for X in $(seq 100); do echo RUN $X; ./example/advection/advection-example -i ../tst/regression/test_suites/advection_convergence/parthinput.advection parthenon/mesh/nx1=80 parthenon/meshblock/nx1=40 parthenon/mesh/nx2=60 parthenon/meshblock/nx2=30 parthenon/mesh/nx3=72 parthenon/meshblock/nx3=36 parthenon/mesh/x1min=-1.5 parthenon/mesh/x1max=1.5 parthenon/mesh/x2min=-0.75 parthenon/mesh/x2max=0.75 parthenon/mesh/x3min=-1.0 parthenon/mesh/x3max=1.0 Advection/vx=3.0 Advection/vy=1.5 Advection/vz=2.0 Advection/profile=smooth_gaussian Advection/amp=1.0 |grep ERROR; done

but was unable to reproduce the issue.

@pgrete (Collaborator, Author) commented Nov 1, 2022

I also tried to reproduce on Spock with MI100 and was not able to.

Now the question is whether this is an issue of performance (MI250[X] vs. slower devices), libraries, or hardware...

@pgrete (Collaborator, Author) commented Nov 9, 2022

Alright, "good" news. I got one more data point. On a different machine and a different environment the test similarly fails on MI250X (so we can rule out the OLCF environment).
As before, disabling sparse makes the code work.

@pgrete (Collaborator, Author) commented Nov 9, 2022

Also #699 does not fix the issue.

@pgrete (Collaborator, Author) commented Dec 12, 2022

Alright, here's what went wrong, after a long joint debugging session with @Yurlungur:

The offending kernel was related to checking/setting the allocation status.
A minimal reproducer is:

#include <cmath>
#include <cstddef>
#include <iostream>

#include <Kokkos_Core.hpp>

auto main(int argc, char *argv[]) -> int {
  Kokkos::initialize(argc, argv);
  {
    size_t num_b = 26 * 4;
    Kokkos::View<bool *> nonzero_flags("nonzero_flags", num_b);

    size_t num_idx = 16;

    Kokkos::View<double **> work("work", num_b, num_idx);

    Kokkos::parallel_for(
        "set flags", Kokkos::TeamPolicy<>(num_b, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type &member) {
          const int b = member.league_rank();

          // Every thread in the team resets the flag for its block...
          nonzero_flags(b) = false;

          // ...while the inner loop (whose work may already be done before
          // some team members even reach the reset above) sets it to true.
          Kokkos::parallel_for(Kokkos::TeamThreadRange<>(member, num_idx),
                               [&](const int idx) {
                                 work(b, idx) = 1.0;
                                 if (std::abs(work(b, idx)) >= 0.0) {
                                   nonzero_flags(b) = true;
                                 }
                               });
        });

    auto nonzero_flags_h =
        Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), nonzero_flags);

    // Every flag should be true; any false entry means the race was hit.
    for (size_t b = 0; b < num_b; b++) {
      if (!nonzero_flags_h(b)) {
        std::cerr << "HEEEEEEEEEEEEEELP!!!\n";
      }
    }
  }
  Kokkos::finalize();
}
There are actually two issues:

  1. The more obvious one: having multiple threads write nonzero_flags(b) = true; in parallel is technically undefined behavior. In practice, however, this is not the issue here, because for the given kernel we were only interested in whether any thread, at any point, set that variable to true, which seems to work on all architectures (though, again, that may not be guaranteed).
  2. The more subtle one: there may not be enough work in the inner loop for all threads in the team (particularly on devices with high concurrency like the MI250X). A first fraction of the threads in a team may therefore reach the inner loop and finish it; only then do the remaining threads of the team start the outer loop, reset nonzero_flags(b) = false;, and never enter the inner loop (as its work is already done), so the value remains false when the entire team is done. A sketch of a race-free version of the kernel follows.
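
For illustration, here is a minimal sketch of a race-free version of the reproducer kernel (it reuses the views and sizes from the reproducer above and is not the exact patch that landed in #800): per-thread results are combined with a team-level parallel_reduce using a logical-or reducer, and only a single thread per team writes the result back.

    Kokkos::parallel_for(
        "set flags (race-free sketch)", Kokkos::TeamPolicy<>(num_b, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type &member) {
          const int b = member.league_rank();

          // Combine per-thread results with a logical-or reduction instead of
          // letting every thread write directly to global memory.
          bool any_nonzero = false;
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange(member, num_idx),
              [&](const int idx, bool &lany) {
                work(b, idx) = 1.0;
                if (std::abs(work(b, idx)) >= 0.0) lany = true;
              },
              Kokkos::LOr<bool>(any_nonzero));

          // The reduced value is available to the whole team; exactly one
          // thread per team writes it back.
          Kokkos::single(Kokkos::PerTeam(member),
                         [&]() { nonzero_flags(b) = any_nonzero; });
        });

This sidesteps issue 1 (one well-defined write per team) and issue 2 (there is no team-wide flag reset that late-arriving threads could perform after the inner loop has already finished).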

pgrete added a commit that referenced this issue Dec 12, 2022
pgrete added a commit that referenced this issue Dec 15, 2022
…tus) (#800)

* Fixes #659

* Replace lambda with if

* Typo: BOr to LOr

* Apply suggestions from code review

Co-authored-by: Luke Roberts <lfroberts@lanl.gov>
Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com>

* format

* Fix dealloc

Co-authored-by: Luke Roberts <lfroberts@lanl.gov>
Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com>
@pgrete pgrete reopened this Dec 16, 2022
@pgrete (Collaborator, Author) commented Dec 16, 2022

Serial runs seem to work now. The test suite passed 10 times in a row:

Test project /gpfs/alpine/ast146/world-shared/crusher-debug/parthenon/build-frontier-fix-next-nompi
      Start 45: regression_test:particle_leapfrog
 1/10 Test #45: regression_test:particle_leapfrog ...........   Passed    1.25 sec
      Start 46: regression_test:particle_leapfrog_outflow
 2/10 Test #46: regression_test:particle_leapfrog_outflow ...   Passed    0.75 sec
      Start 47: regression_test:restart
 3/10 Test #47: regression_test:restart .....................   Passed   43.83 sec
      Start 48: regression_test:calculate_pi
 4/10 Test #48: regression_test:calculate_pi ................   Passed    0.77 sec
      Start 49: regression_test:advection_convergence
 5/10 Test #49: regression_test:advection_convergence .......   Passed   53.88 sec
      Start 50: regression_test:output_hdf5
 6/10 Test #50: regression_test:output_hdf5 .................   Passed   26.22 sec
      Start 51: regression_test:advection_outflow
 7/10 Test #51: regression_test:advection_outflow ...........   Passed    1.80 sec
      Start 52: regression_test:bvals
 8/10 Test #52: regression_test:bvals .......................   Passed    1.92 sec
      Start 53: regression_test:poisson
 9/10 Test #53: regression_test:poisson .....................   Passed    2.58 sec
      Start 54: regression_test:sparse_advection
10/10 Test #54: regression_test:sparse_advection ............   Passed   28.42 sec

100% tests passed, 0 tests failed out of 10

Label Time Summary:
poisson       =   2.58 sec*proc (1 test)
regression    = 161.41 sec*proc (10 tests)
serial        = 161.41 sec*proc (10 tests)

Total Test time (real) = 161.42 sec

With MPI, less so:

$ ctest -L regression -L mpi -LE perf-reg
Test project /gpfs/alpine/ast146/world-shared/crusher-debug/parthenon/build-frontier-fix-next-mpi-rocm54
      Start 47: regression_mpi_test:particle_leapfrog
 1/10 Test #47: regression_mpi_test:particle_leapfrog ...........***Failed    2.25 sec
      Start 49: regression_mpi_test:particle_leapfrog_outflow
 2/10 Test #49: regression_mpi_test:particle_leapfrog_outflow ...***Failed    2.13 sec
      Start 51: regression_mpi_test:restart
 3/10 Test #51: regression_mpi_test:restart .....................   Passed   25.15 sec
      Start 53: regression_mpi_test:calculate_pi
 4/10 Test #53: regression_mpi_test:calculate_pi ................   Passed    2.14 sec
      Start 55: regression_mpi_test:advection_convergence
 5/10 Test #55: regression_mpi_test:advection_convergence .......***Failed   55.81 sec
      Start 57: regression_mpi_test:output_hdf5
 6/10 Test #57: regression_mpi_test:output_hdf5 .................***Failed   15.51 sec
      Start 59: regression_mpi_test:advection_outflow
 7/10 Test #59: regression_mpi_test:advection_outflow ...........   Passed    2.73 sec
      Start 61: regression_mpi_test:bvals
 8/10 Test #61: regression_mpi_test:bvals .......................   Passed    6.57 sec
      Start 63: regression_mpi_test:poisson
 9/10 Test #63: regression_mpi_test:poisson .....................   Passed    5.09 sec
      Start 65: regression_mpi_test:sparse_advection
10/10 Test #65: regression_mpi_test:sparse_advection ............***Failed   11.33 sec

50% tests passed, 5 tests failed out of 10

Label Time Summary:
cuda          = 128.72 sec*proc (10 tests)
mpi           = 128.72 sec*proc (10 tests)
poisson       =   5.09 sec*proc (1 test)
regression    = 128.72 sec*proc (10 tests)

Total Test time (real) = 128.73 sec

The following tests FAILED:
	 47 - regression_mpi_test:particle_leapfrog (Failed)
	 49 - regression_mpi_test:particle_leapfrog_outflow (Failed)
	 55 - regression_mpi_test:advection_convergence (Failed)
	 57 - regression_mpi_test:output_hdf5 (Failed)
	 65 - regression_mpi_test:sparse_advection (Failed)
Errors while running CTest

@pgrete (Collaborator, Author) commented Dec 16, 2022

https://github.com/parthenon-hpc-lab/parthenon/blob/develop/src/interface/update.cpp#L162

  Kokkos::parallel_for(
      "SparseDealloc",
      Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), pack.GetNBlocks(), Kokkos::AUTO),
      KOKKOS_LAMBDA(parthenon::team_mbr_t team_member) {
        const int b = team_member.league_rank();

        const int lo = pack.GetLowerBound(b);
        const int hi = pack.GetUpperBound(b);

        for (int v = lo; v <= hi; ++v) {
          const auto &var = pack(b, v);
          const Real threshold = var.deallocation_threshold;
          bool all_zero = true;
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange<>(team_member, NkNjNi),
              [&](const int idx, bool &lall_zero) {
                const int k = kb.s + idx / NjNi;
                const int j = jb.s + (idx % NjNi) / Ni;
                const int i = ib.s + idx % Ni;
                if (std::abs(var(k, j, i)) > threshold) {
                  lall_zero = false;
                  return;
                }
              },
              Kokkos::LAnd<bool, DevMemSpace>(all_zero));
          Kokkos::single(Kokkos::PerTeam(team_member),
                         [&]() { is_zero(b, v) = all_zero; });
        }
      });

@lroberts36 what does this check mean for vector or tensor sparse variables (as the fourth index is extracted from a pack, but the threshold is for the var)?

@pgrete (Collaborator, Author) commented Dec 16, 2022

Good news! It looks like this was actually a (since fixed) Kokkos bug.
After some back and forth, updating the Kokkos submodule to 3.7.01 now also makes the tests pass on Crusher with MPI enabled.
I'll open a PR next week (including a machine file for Crusher and Frontier).
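
For reference, bumping the submodule locally amounts to something like the following standard git steps (assuming the Kokkos submodule lives at external/Kokkos, which may differ in your checkout):

cd external/Kokkos
git fetch --tags
git checkout 3.7.01
cd ../..
git add external/Kokkos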

Independently, the question above on the SparseDealloc kernel remains open.

@lroberts36 (Collaborator) commented:

@pgrete: I believe this should work as expected (i.e., if any component of a vector- or tensor-valued field is above the threshold, it should not trigger deallocation). This loop checks every (scalar/vector/tensor) element of the control fields for being above zero and stores the result in is_zero. Then, in the next set of loops, we iterate over all components of a given variable and check that is_zero_h is true for all of them before incrementing the deallocation counter. A small self-contained sketch of that host-side check follows.
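
To make that check explicit, here is a tiny self-contained sketch of the host-side logic (the function name and the flat is_zero_h layout are hypothetical, not Parthenon's actual API):

#include <vector>

// Hypothetical helper: a variable is only a deallocation candidate if every
// one of its components (pack indices lo..hi) has been flagged as all-zero.
inline bool AllComponentsZero(const std::vector<bool> &is_zero_h, int lo, int hi) {
  for (int v = lo; v <= hi; ++v) {
    if (!is_zero_h[v]) return false;  // any above-threshold component blocks deallocation
  }
  return true;
}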

@pgrete (Collaborator, Author) commented Jan 17, 2023

Fixed in #805

@pgrete pgrete closed this as completed Jan 17, 2023
@pgrete pgrete linked a pull request Jan 17, 2023 that will close this issue
7 tasks
@pgrete pgrete unpinned this issue Jan 17, 2023