Trilinos nightly failures in cuda/11.2 builds, no-uvm (various packages) #6450

Closed
ndellingwood opened this issue Sep 15, 2023 · 5 comments
Labels
Failure - Nightly Nightly Build Failure Failure - Trilinos Continuous Integration Build Failure

Comments

@ndellingwood
Contributor

Describe the bug
Several tests are failing in cuda/11.2 builds of Trilinos (no-uvm)

Test failures

01:53:50 The following tests FAILED:
01:53:50 	1759 - MueLu_PerformanceModel_MPI_4 (SEGFAULT)
01:53:50 	1762 - MueLu_ComboPTest_MPI_4 (SEGFAULT)
01:53:50 	1783 - MueLu_CalcRotations_MPI_4 (SEGFAULT)
01:53:50 	1827 - MueLu_Driver_TogglePFactory_sa_tent_Tpetra_MPI_4 (SEGFAULT)
01:53:50 	1831 - MueLu_Driver_TogglePFactory_semi_tent_line_Tpetra_MPI_4 (SEGFAULT)
01:53:50 	1963 - Zoltan2_TpetraCrsColorer_galeri2_MPI_4 (SEGFAULT)
01:53:50 	2314 - PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4 (Failed)
01:53:50 	2321 - PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-1 (Failed)
01:53:50 	2337 - PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell2D_MPI_4 (SEGFAULT)

The failures and more extensive output can be found, for example, on the CDash experimental track.

This follows from changes in this list of commits:
Changes:
Git (git https://github.com/kokkos/kokkos.git)

Deprecate Cuda(cudaStream_t, bool) ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Fixup checked interger operations death test ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Deprecate HIP(hipStream_t, bool) ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Let Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC be ON by default ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Print whether KOKKOS_ENABLE_IMPL_CUDA_MALLOC_ASYNC is defined ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Introduce disable_malloc_async Cuda option with generated makefiles ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Preserve one build that disables Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Use archive extraction time for timestamps ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
Disable performance benchmarks in AppVeyor CI ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))
team-level std algos: part 6 (#6210) ([detail](https://jenkins-son.sandia.gov/view/Kokkos/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/208/changes#detail))

Please also include the following items to support reproducing the bug
Reproducer configuration (Weaver rhel8):

# Repos
git clone -b develop https://github.com/trilinos/Trilinos.git
git clone -b develop https://github.com/kokkos/kokkos.git
git clone -b develop https://github.com/kokkos/kokkos-kernels.git
# Symbolic link to external kokkos and kokkos-kernels repos in Trilinos source directory for source override
cd Trilinos
ln -s <path-to-your-repo>/kokkos kokkos
ln -s <path-to-your-repo>/kokkos-kernels kokkos-kernels

cd $HOME
mkdir -p build
cd build

# Interactive Weaver session
bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

# Environment
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX=$KOKKOS_DIR/bin/nvcc_wrapper

cmake \
-G"Unix Makefiles" \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DCMAKE_INSTALL_PREFIX=$TRILINOS_INSTALL_DIR \
-DCMAKE_CXX_STANDARD="17" \
-DFC_FN_UNDERSCORE=UNDER \
-DTPL_ENABLE_CUSPARSE=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DTrilinos_ENABLE_Stokhos=ON \
-DKokkos_ENABLE_CUDA_UVM=OFF \
-DKokkos_ARCH_VOLTA70=ON \
-DKokkos_ARCH_POWER9=ON \
-DKokkos_CoreUnitTest_CudaTimingBased_MPI_1_DISABLE=ON \
-DKokkos_CoreUnitTest_Default_MPI_1_SET_RUN_SERIAL=ON \
-DIntrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1_SET_RUN_SERIAL=ON \
-DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
-DKokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
-DTrilinos_ENABLE_INSTALLATION_TESTING=OFF \
-DCTEST_BUILD_FLAGS=-j16 \
-DCTEST_PROJECT_NAME="KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm" \
-DCTEST_BUILD_NAME="KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm" \
$TRILINOS_DIR
@ndellingwood ndellingwood added Blocks Promotion Overview issue for release-blocking bugs Failure - Nightly Nightly Build Failure Failure - Trilinos Continuous Integration Build Failure labels Sep 15, 2023
@ndellingwood
Contributor Author

ndellingwood commented Sep 15, 2023

Out of curiosity, I'm testing the build with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF to see whether that affects the tests.

@ndellingwood
Contributor Author

Retesting with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF: MueLu_ComboPTest_MPI_4 failed intermittently (roughly a 50% pass rate); the remaining MueLu tests passed reliably.
I'm building Panzer in this configuration now.
I also have a separate build of Trilinos develop (no "external" kokkos or kokkos-kernels) configured with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON. I'll open Trilinos issues if I can reproduce the above failures there.
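
For anyone who wants to quantify an intermittent failure like the one above, here is a minimal sketch of a rerun loop. `pass_rate` is a hypothetical helper (not part of Trilinos or CTest); it runs whatever command it is given a fixed number of times and counts the successful runs.

```shell
# Run a command N times and report how many runs exited with status 0.
# pass_rate is a hypothetical helper, not an existing Trilinos or CTest tool.
pass_rate() {
  runs=$1; shift
  passed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only the exit status matters here.
    if "$@" >/dev/null 2>&1; then
      passed=$((passed + 1))
    fi
    i=$((i + 1))
  done
  echo "${passed}/${runs} passed"
}

# Possible use from the build directory (ctest -R selects tests by regex):
# pass_rate 10 ctest -R MueLu_ComboPTest_MPI_4
```

Counting exit codes keeps the loop independent of the test's output format, at the cost of not distinguishing a segfault from an ordinary failure.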

@ndellingwood
Contributor Author

The Panzer tests also passed with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF.
On the Trilinos develop branch (no "external" kokkos or kokkos-kernels), the same set of tests fails with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON.

A common output among the failing tests:

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x32042b200
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.

The failures correlate with the Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON change, but given the cuIpcGetMemHandle output, I'm not sure they indicate bugs in Trilinos code rather than an incompatibility in the Power system or the OpenMPI install.
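
To check which test logs contain the message above, something like the following sketch could be used. `ipc_failure_logs` is a hypothetical helper name, and the log directory is an assumption: CTest normally writes its logs under `Testing/Temporary` in the build directory, but the nightly setup may store them elsewhere.

```shell
# List files under a directory that contain the cuIpcGetMemHandle failure text.
# ipc_failure_logs is a hypothetical helper; the directory argument is up to the caller.
ipc_failure_logs() {
  # -r: recurse into the directory, -l: print only the names of matching files
  grep -rl "cuIpcGetMemHandle failed" "$1" 2>/dev/null | sort
}

# Possible use from the build directory:
# ipc_failure_logs Testing/Temporary
```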

@masterleinad
Contributor

See openucx/ucx#7110. I would guess that the UCX version in use is incompatible and that a newer MPI implementation might help. This kind of incompatibility was the reason Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC was initially OFF by default.
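
A quick way to see which UCX and MPI versions the loaded modules provide is sketched below. `tool_version` is a hypothetical wrapper; it assumes `ucx_info` and `mpirun` are on `PATH` once the environment is loaded, and it reports when they are not. Which UCX versions are affected is not stated here, so the output would need to be compared against the linked UCX issue.

```shell
# Print the first line of a tool's version output, or note that it is missing.
# tool_version is a hypothetical wrapper, not an existing utility.
tool_version() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 | head -n 1
  else
    echo "$1: not found"
  fi
}

tool_version ucx_info -v       # UCX library version
tool_version mpirun --version  # MPI launcher version
```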

@ndellingwood
Contributor Author

Thanks @masterleinad, I set Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF for the build on this machine when using the older module installs.
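
For reference, a minimal sketch of applying that setting to an already configured build tree. `disable_malloc_async` is a hypothetical helper, and the `$HOME/build` path in the usage line is an assumption taken from the reproducer above; adding the same `-D` option to the original `cmake` command line should have the same effect.

```shell
# Re-run CMake on an existing build tree with async CUDA allocation disabled.
# disable_malloc_async is a hypothetical helper; pass it the build directory.
disable_malloc_async() {
  build_dir=$1
  if [ -f "$build_dir/CMakeCache.txt" ]; then
    # Passing an existing build directory re-configures it with the new cache value.
    cmake -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF "$build_dir"
    # Show the cached value to confirm the option was recorded.
    grep "Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC" "$build_dir/CMakeCache.txt"
  else
    echo "no configured build tree in $build_dir"
  fi
}

# Possible use with the reproducer's build directory:
# disable_malloc_async "$HOME/build"
```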
