Trilinos nightly test failures (all backends) in Ifpack2 #1901

Closed
ndellingwood opened this issue Jul 13, 2023 · 4 comments

@ndellingwood

ndellingwood commented Jul 13, 2023

Trilinos nightly test failures began occurring in the Ifpack2_unit_tests_MPI_4 and Ifpack2_MDF_MPI_4 unit tests following the merge of #1893. The failures occur with all backends (e.g. nightly tests with intel/2021.4 and a Serial build, gcc/8.3.0 and OpenMP, and cuda/11.2 with and without UVM enabled).

Ifpack2_unit_tests_MPI_4 failure snippet from the cuda/11.2 no-UVM build:

23:47:31 p=2 | 35. Ifpack2MDF_double_int_longlong_Test1_UnitTest ... [Passed] (0.0888 sec)
23:47:31 p=1 | 36. Ifpack2MDF_double_int_longlong_Test2_UnitTest ... 
23:47:31 p=1 |  Test that code {prec.setParameters(params);} does not throw : passed
23:47:31 p=1 |  Comparing prec.getReversePermutations() == refPermuations ... 
23:47:31 p=1 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=1 |  
23:47:31 p=1 |  Error, prec.getReversePermutations()[1] = 4 == refPermuations[1] = 3: failed!
23:47:31 p=1 |  
23:47:31 p=1 |  Error, prec.getReversePermutations()[2] = 7 == refPermuations[2] = 6: failed!
23:47:31 p=1 |  
23:47:31 p=1 |  Error, prec.getReversePermutations()[3] = 2 == refPermuations[3] = 1: failed!
23:47:31 p=1 |  
23:47:31 p=0 | 36. Ifpack2MDF_double_int_longlong_Test2_UnitTest ... 
23:47:31 p=0 |  Test that code {prec.setParameters(params);} does not throw : passed
23:47:31 p=0 |  Comparing prec.getReversePermutations() == refPermuations ... 
23:47:31 p=0 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=0 |  
23:47:31 p=0 |  Error, prec.getReversePermutations()[1] = 4 == refPermuations[1] = 3: failed!
23:47:31 p=0 |  
23:47:31 p=0 |  Error, prec.getReversePermutations()[2] = 7 == refPermuations[2] = 6: failed!
23:47:31 p=0 |  
23:47:31 p=0 |  Error, prec.getReversePermutations()[3] = 2 == refPermuations[3] = 1: failed!
23:47:31 p=0 |  
23:47:31 p=0 |  Error, prec.getReversePermutations()[4] = 1 == refPermuations[4] = 4: failed!
23:47:31 p=3 | 36. Ifpack2MDF_double_int_longlong_Test2_UnitTest ... 
23:47:31 p=3 |  Test that code {prec.setParameters(params);} does not throw : passed
23:47:31 p=3 |  Comparing prec.getReversePermutations() == refPermuations ... 
23:47:31 p=3 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=3 |  
23:47:31 p=3 |  Error, prec.getReversePermutations()[1] = 4 == refPermuations[1] = 3: failed!
23:47:31 p=3 |  
23:47:31 p=3 |  Error, prec.getReversePermutations()[2] = 7 == refPermuations[2] = 6: failed!
23:47:31 p=3 |  
23:47:31 p=3 |  Error, prec.getReversePermutations()[3] = 2 == refPermuations[3] = 1: failed!
23:47:31 p=3 |  
23:47:31 p=3 |  Error, prec.getReversePermutations()[4] = 1 == refPermuations[4] = 4: failed!
23:47:31 p=3 |  
23:47:31 p=2 | 36. Ifpack2MDF_double_int_longlong_Test2_UnitTest ... 
23:47:31 p=2 |  Test that code {prec.setParameters(params);} does not throw : passed
23:47:31 p=2 |  Comparing prec.getReversePermutations() == refPermuations ... 
23:47:31 p=2 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=2 |  
23:47:31 p=2 |  Error, prec.getReversePermutations()[1] = 4 == refPermuations[1] = 3: failed!
23:47:31 p=2 |  
23:47:31 p=2 |  Error, prec.getReversePermutations()[2] = 7 == refPermuations[2] = 6: failed!
23:47:31 p=2 |  
23:47:31 p=2 |  Error, prec.getReversePermutations()[3] = 2 == refPermuations[3] = 1: failed!
23:47:31 p=2 |  
23:47:31 p=2 |  Error, prec.getReversePermutations()[4] = 1 == refPermuations[4] = 4: failed!
23:47:31 p=2 |  
...
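
For triage, the mismatch lines in a log like the one above can be grouped by MPI rank so it is easy to see whether every rank reports the same values. The following is a minimal sketch, assuming the line format shown in the snippet; `collect_mismatches` and `LINE_RE` are hypothetical helper names and are not part of Trilinos or Ifpack2.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   23:47:31 p=1 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
# The "p=N |" rank prefix is optional so logs without a rank prefix parse too.
# Assumption: the log format is exactly as in the snippet above.
LINE_RE = re.compile(
    r"(?:p=(?P<rank>\d+)\s*\|)?\s*Error, prec\.(?P<getter>\w+)\(\)\[(?P<idx>\d+)\] = (?P<got>\d+)"
    r" == \w+\[\d+\] = (?P<want>\d+): failed!"
)


def collect_mismatches(log_text):
    """Group (index, got, expected) triples by (rank, getter name)."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            rank = int(m["rank"]) if m["rank"] is not None else None
            found[(rank, m["getter"])].append(
                (int(m["idx"]), int(m["got"]), int(m["want"]))
            )
    return dict(found)


# Three lines copied from the snippet above.
sample = """\
23:47:31 p=1 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=0 |  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:31 p=0 |  Error, prec.getReversePermutations()[4] = 1 == refPermuations[4] = 4: failed!
"""
by_rank = collect_mismatches(sample)
for key in sorted(by_rank):
    print(key, by_rank[key])
```

In the snippet above, every rank reports the same values at the indices it prints, which this grouping makes straightforward to confirm on a full log.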

Ifpack2_MDF_MPI_4 failure snippet:

23:47:50 2. Ifpack2MDF_double_int_longlong_Test2_UnitTest ... 
23:47:50  Test that code {prec.setParameters(params);} does not throw : passed
23:47:50  Comparing prec.getReversePermutations() == refPermuations ... 
23:47:50  Error, prec.getReversePermutations()[0] = 3 == refPermuations[0] = 0: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[1] = 4 == refPermuations[1] = 3: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[2] = 7 == refPermuations[2] = 6: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[3] = 2 == refPermuations[3] = 1: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[4] = 1 == refPermuations[4] = 4: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[5] = 0 == refPermuations[5] = 5: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[6] = 5 == refPermuations[6] = 7: failed!
23:47:50  
23:47:50  Error, prec.getReversePermutations()[8] = 6 == refPermuations[8] = 2: failed!
23:47:50  Comparing prec.getPermutations() == refPermuationsInv ... 
23:47:50  Error, prec.getPermutations()[0] = 5 == refPermuationsInv[0] = 0: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[1] = 4 == refPermuationsInv[1] = 3: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[2] = 3 == refPermuationsInv[2] = 8: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[3] = 0 == refPermuationsInv[3] = 1: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[4] = 1 == refPermuationsInv[4] = 4: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[5] = 6 == refPermuationsInv[5] = 5: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[6] = 8 == refPermuationsInv[6] = 2: failed!
23:47:50  
23:47:50  Error, prec.getPermutations()[7] = 2 == refPermuationsInv[7] = 6: failed!
23:47:50  
23:47:50  p=0: *** Caught standard std::exception of type 'std::runtime_error' :
23:47:50  
23:47:50   /home/jenkins/weaver/workspace/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/Trilinos/packages/ifpack2/src/Ifpack2_LocalSparseTriangularSolver_def.hpp:725:
23:47:50   
23:47:50   Throw number = 1
23:47:50   
23:47:50   Throw test that evaluated to true: (A_crs_->getLocalNumRows() > 0 && this->uplo_ == "N")
23:47:50   
23:47:50   Ifpack2::LocalSparseTriangularSolver<Tpetra::RowMatrix<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaSpace> > >::localTriangularSolve: The matrix is neither upper triangular or lower triangular.  You may only call this method if the matrix is triangular.  Remember that this is a local (per MPI process) property, and that Tpetra only knows how to do a local (per process) triangular solve.
23:47:50  [FAILED]  (0.0681 sec) Ifpack2MDF_double_int_longlong_Test2_UnitTest
...
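
One quick cross-check on the Ifpack2_MDF_MPI_4 output is whether the reported permutation and reverse permutation are still inverses of each other at the indices the log prints. The sketch below uses only values transcribed from the snippet above. It assumes that getPermutations() and getReversePermutations() are meant to be mutual inverses, as the reference names refPermuations and refPermuationsInv suggest; `inverse_mismatches` and `unchecked` are hypothetical helper names, not Ifpack2 API.

```python
# Mismatched entries transcribed from the Ifpack2_MDF_MPI_4 log above.
# Indices that passed the comparison are not printed in the log,
# so they are deliberately left out rather than guessed.
reverse_perm = {0: 3, 1: 4, 2: 7, 3: 2, 4: 1, 5: 0, 6: 5, 8: 6}
perm = {0: 5, 1: 4, 2: 3, 3: 0, 4: 1, 5: 6, 6: 8, 7: 2}


def inverse_mismatches(reverse_perm, perm):
    """Return (i, j, perm[j]) for each logged pair where perm[reverse_perm[i]] != i."""
    bad = []
    for i, j in sorted(reverse_perm.items()):
        if j in perm and perm[j] != i:
            bad.append((i, j, perm[j]))
    return bad


def unchecked(reverse_perm, perm):
    """Indices i whose image reverse_perm[i] has no logged entry in perm."""
    return sorted(i for i, j in reverse_perm.items() if j not in perm)


print(inverse_mismatches(reverse_perm, perm))
print(unchecked(reverse_perm, perm))
```

An empty result from both helpers would suggest that the two arrays are mutually consistent and that the ordering differs from the reference rather than being corrupted. It does not say which ordering is correct, and it does not explain the triangular-solve exception.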

Reproducer (Weaver, rhel8 queue):

# Repos
git clone -b develop https://github.com/trilinos/Trilinos.git
git clone -b develop https://github.com/kokkos/kokkos.git
git clone -b develop https://github.com/kokkos/kokkos-kernels.git
# Symlink your kokkos and kokkos-kernels clones into the Trilinos source directory for the source override
cd Trilinos
ln -s <path-to-your-repo>/kokkos kokkos
ln -s <path-to-your-repo>/kokkos-kernels kokkos-kernels

cd $HOME
mkdir -p build
cd build

# Get an interactive node
bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

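# Environment
# TRILINOS_DIR is not set by the steps above; it is assumed to point at the Trilinos clone,
# e.g. export TRILINOS_DIR=<path-to-your-repo>/Trilinos
# KOKKOS_PATH is likewise assumed to point at the kokkos clone if your environment does not already set it.
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_PATH/bin/nvcc_wrapper"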


# Configure
cmake \
      -DCMAKE_CXX_FLAGS='-g' \
      -DCMAKE_CXX_STANDARD="17" \
      -DCMAKE_INSTALL_PREFIX=$PWD/install \
      -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
      -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
      -DTrilinos_ENABLE_TESTS=OFF \
      -DTrilinos_ENABLE_ALL_PACKAGES=ON \
      -DTPL_ENABLE_CUSPARSE:BOOL=ON \
      -DFC_FN_UNDERSCORE=UNDER \
      \
      -D Trilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ARCH_POWER9=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
      -D Kokkos_ENABLE_CUDA_UVM=OFF \
      -DTrilinos_ENABLE_Ifpack2=ON \
      -DIfpack2_ENABLE_TESTS=ON \
      -DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
      -DKokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
      \
$TRILINOS_DIR

Adding @lucbv @tmranse

Edit: Added steps to use source override for testing with kokkos and kokkos-kernels develop branches in Trilinos

@ndellingwood

@lucbv @tmranse is one of you able to investigate?

@lucbv

lucbv commented Jul 17, 2023

Thanks for reporting this, Nathan. Tom and I will have a look.

@tmranse

tmranse commented Jul 25, 2023

@ndellingwood with trilinos/Trilinos#12078 and #1916 the issue is fixed in my testing, both locally and on Weaver.

@tmranse

tmranse commented Aug 18, 2023

Both PRs (trilinos/Trilinos#12078 and #1916) are now merged, so we should hopefully start seeing this fix reflected in the nightly tests.
