sparse_serial timeout issues in some nightly builds, PR testing #1918

Closed
ndellingwood opened this issue Jul 27, 2023 · 7 comments

@ndellingwood
Contributor

The sparse_serial unit test is exceeding the default 1500-second timeout in some builds, for example nightly builds with debug and bounds checking enabled. It has also occurred in PR testing, e.g. #1916 (comment)

This began occurring after the merge of these commits:
ODE: changing layout of temp mem in RK algorithms
ODE: fix unnecessary test overload
Improve performance of the native BsrMatrix SpMV, especially for single-vector cases.

This suggests that the additional tests in #1740 may need to be adjusted to reduce test time. @cwpearson, can you look into adjusting the parameters of the added bsr tests to reduce test time while maintaining appropriate coverage?

Reproducer (Weaver, rhel8 queue):

source /etc/profile.d/modules.sh
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module purge
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1


$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-cuda --with-serial --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --cxxstandard=17 --debug --boundscheck
@ndellingwood
Contributor Author

These tests seem to be the most expensive additions:

14: [ RUN      ] serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace (152885 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace (133544 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (223667 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (175229 ms)

Prior to the merge of #1740, only sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace from the list above was present, and it took:

14: [       OK ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (115676 ms)
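For anyone wanting to total these numbers from a saved log, here is a minimal sketch. It assumes the output looks like the GoogleTest lines pasted above (including the `14: ` prefix seen in this log); `parse_durations` is a hypothetical helper, not part of KokkosKernels or GoogleTest.

```python
import re

# Matches GoogleTest result lines such as
# "14: [       OK ] serial.some_test (152885 ms)".
# "[ RUN      ]" lines carry no duration and are skipped.
RESULT_RE = re.compile(r"\[\s+(OK|FAILED)\s+\]\s+(\S+)\s+\((\d+) ms\)")


def parse_durations(log_text):
    """Return {test_name: duration_ms} for every finished test in the log."""
    durations = {}
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if match:
            durations[match.group(2)] = int(match.group(3))
    return durations


# The four bsr_spmmv results quoted in this comment.
LOG = """\
14: [ RUN      ] serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace (152885 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace (133544 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (223667 ms)
14: [ RUN      ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace
14: [       OK ] serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (175229 ms)
"""

durations = parse_durations(LOG)
total_ms = sum(durations.values())
print(f"{len(durations)} tests, {total_ms / 1000:.1f} s total")
```

For the four lines above this gives 685325 ms, so these tests alone account for a little under half of the 1500-second limit in this build.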

@cwpearson
Contributor

After PR #1922, these tests take a total of 10 s on my laptop (versus almost 700 s in the log above). If #1922 does not resolve it, I can look further.

@ndellingwood
Contributor Author

Thanks @cwpearson, your PR helped improve the test time and get past the bsr_spmmv tests.
There are still timeouts of sparse_serial in the cuda/11.2+gcc/8.3.0 debug+boundscheck nightly job. I'll take a closer look and update the issue with which tests now account for most of the run time.

@cwpearson
Contributor

I also found some timeouts on the KokkosKernels_PullRequest_A64FX_GCC1020 build

The biggest offenders I've seen are these:

https://jenkins-son.sandia.gov/view/KokkosKernels/job/KokkosKernels_PullRequest_A64FX_GCC1020/568/consoleFull

serial.sparse_block_spgemm_kokkos_complex_double_int_size_t_TestExecSpace (1615063 ms)

And from a different run:
https://jenkins-son.sandia.gov/view/KokkosKernels/job/KokkosKernels_PullRequest_A64FX_GCC1020/570/consoleFull

openmp.sparse_bsr_gauss_seidel_rank1_double_int_int_TestExecSpace (747130 ms)

Interestingly, it's not always the same tests that take a long time. For example, in run 568 the openmp test above takes only a couple of seconds.

@ndellingwood
Contributor Author

In the cuda/11.2.2+gcc/8.3.0 debug+boundscheck nightly build on Weaver, these sub-tests of sparse_serial each take over 100000 ms:

serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace (104317 ms) 
serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace (110742 ms) 
serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (134272 ms) 
serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (156282 ms) 
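To put these numbers next to the 1500-second limit mentioned at the top of the issue, here is a small sketch. The timings are copied from the list above; `over_threshold` and the 100000 ms cutoff are illustrative choices, not anything the CI itself uses.

```python
TIMEOUT_S = 1500  # default limit quoted in the issue description

# Timings copied from the list above, in milliseconds.
timings_ms = {
    "serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace": 104317,
    "serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace": 110742,
    "serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace": 134272,
    "serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace": 156282,
}


def over_threshold(timings, threshold_ms):
    """Return (name, ms) pairs strictly above threshold_ms, slowest first."""
    slow = [(name, ms) for name, ms in timings.items() if ms > threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)


slow = over_threshold(timings_ms, 100_000)
slow_total_s = sum(ms for _, ms in slow) / 1000
share = slow_total_s / TIMEOUT_S

for name, ms in slow:
    print(f"{ms / 1000:8.1f} s  {name}")
print(f"{slow_total_s:.1f} s total, {share:.0%} of the {TIMEOUT_S} s limit")
```

These four sub-tests add up to roughly 506 s, about a third of the limit, so the remaining time must be spread over the other sub-tests in sparse_serial.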

@ndellingwood
Contributor Author

Since the long-running tests are inconsistent across machines and architectures, we should consider splitting the sparse unit test apart into separate executables.

@ndellingwood
Contributor Author

To quickly address the current CI timeouts, we decided in the meeting to:

  1. temporarily increase the timeout, then
  2. split the test apart.
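As a rough illustration of how a split could be planned from measured timings, here is a sketch that assigns tests to groups greedily, longest first, always adding to the currently lightest group. This is only one possible heuristic and not what was decided in the meeting. The input reuses four timings reported earlier in this thread, `split_by_duration` is a hypothetical helper, and the printed `--gtest_filter` strings (standard GoogleTest syntax, with patterns separated by `:`) merely stand in for the separate executables proposed above.

```python
import heapq


def split_by_duration(durations_ms, n_groups):
    """Greedy longest-first split: each test goes to the lightest group so far.

    Returns (groups, loads), where groups[i] is a list of test names and
    loads[i] is the summed duration of that group in milliseconds.
    """
    heap = [(0, i) for i in range(n_groups)]  # (current load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    by_duration = sorted(durations_ms.items(), key=lambda kv: kv[1], reverse=True)
    for name, ms in by_duration:
        load, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (load + ms, i))
    loads = [0] * n_groups
    for load, i in heap:
        loads[i] = load
    return groups, loads


# Example input: timings (ms) reported earlier in this thread.
durations_ms = {
    "serial.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace": 104317,
    "serial.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace": 110742,
    "serial.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace": 134272,
    "serial.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace": 156282,
}

groups, loads = split_by_duration(durations_ms, 2)
for names, load in zip(groups, loads):
    print(f"# about {load / 1000:.0f} s")
    print("--gtest_filter=" + ":".join(names))
```

Because the slow tests differ between machines, a split based on timings from one build may not balance well on another, which is an argument for grouping by functionality instead.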
