rocm 5.6.0 PR testing build failing - module not found #2137

Closed
brian-kelley opened this issue Mar 11, 2024 · 14 comments
Comments

@brian-kelley
Contributor

@ndellingwood

On a PR testing run from today, the builds KokkosKernels_PullRequest_VEGA90A_ROCM560 and KokkosKernels_PullRequest_VEGA90A_Tpls_ROCM560 failed because rocm 5.6.0 was not found:

'Going to test compilers:  rocm/5.6.0'
'Testing compiler rocm/5.6.0'
'Unrecognized compiler rocm/5.6.0 when looking for Spack variants'
'Unrecognized compiler rocm/5.6.0 when looking for Spack variants'
'Unrecognized compiler rocm/5.6.0 when looking for Spack variants'
'  FAILED rocm-5.6.0-Hip_Serial-release'
'SETUP_ENV: compiler=rocm/5.6.0 modules=cmake rocm/5.6.0 openblas/0.3.20/rocm/5.2.0'
'Lmod has detected the following error: The following module(s) are unknown:'
'"rocm/5.6.0"'

I just checked on the MI210 and MI250 nodes and I don't see 5.6.0 anywhere, but there is rocm/5.6.1.

@ndellingwood
Contributor

@brian-kelley I just hopped on and checked modules on MI210 and rocm/5.6.0 was available:

[ndellin@caraway ~]$ salloc -N 1 -p MI210
salloc: Granted job allocation 1009777
[ndellin@lean1 ~]$ module spider rocm

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        rocm/5.2.0
        rocm/5.2.3
        rocm/5.3.3
        rocm/5.4.3
        rocm/5.5.1
        rocm/5.6.0
        rocm/5.6.1
     Other possible modules matches:
...

[ndellin@lean1 ~]$ module load rocm/5.6.0
[ndellin@lean1 ~]$ module list

Currently Loaded Modules:
  1) rocm/5.6.0

I also manually launched a cm_test_all_sandia build with rocm/5.6.0, and the build is proceeding without issue:

[ndellin@lean1 Caraway-rocm560-MI210]$ ../../scripts/cm_test_all_sandia rocm/5.6.0 --with-hip
Running on machine: vega90a_caraway
KokkosKernels Repository Status:  8f2945d0c99791345053fc839b1ea453354e03f9 Kokkos Kernels: update version guards to drop old version of Kokkos (#2133)

Kokkos Repository Status:  d78a7d4383786359ee8692af5b30aac973fca0da Added in the explicit deduction guides for RangePolicy: • Correctness when passing in an execution space • Workaround for nvcc as RangePolicy<...> doesn't have any template parameters that can be deduced, so gcc/clang assume that a matching ctor in the primary template deduces to RangePolicy<> while nvcc assumes it is a bug.


Going to test compilers:  rocm/5.6.0
Testing compiler rocm/5.6.0
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
  Starting job rocm-5.6.0-Hip_Serial-release
Hip IS THE KOKKOS DEVICE
kokkos devices: Hip,Serial
kokkos arch: VEGA90A
kokkos options: 
kokkos cuda options: 
kokkos cxxflags: -O3  
extra_args: 
kokkoskernels scalars: 'double,complex_double'
kokkoskernels ordinals: int
kokkoskernels offsets: int,size_t
kokkoskernels layouts: LayoutLeft
kokkoskernels tpls list: 
...

Maybe there was an update in progress that temporarily disrupted the modules? Let's keep an eye on whether this occurs again. There may be a change coming soon once rocm/6.0 is available.

@ndellingwood
Contributor

I just checked the MI250 queue and it looks like rocm/5.6.0 is not available there:

[ndellin@caraway Caraway-rocm560-MI250]$ salloc -N 1 -p MI250
salloc: Granted job allocation 1009778
[ndellin@fat2 Caraway-rocm560-MI250]$ module spider rocm

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        rocm/5.2.0
        rocm/5.6.1
        rocm/6.0.0

@ndellingwood
Contributor

I relaunched one of the Jenkins PR jobs running on MI210 and it looks like it is proceeding without issue with rocm/5.6.0. We'll still need to test with rocm/5.6.1 and then update the jobs to use it, to hopefully avoid any bumps if the modules on MI210 are permanently changed to match those on MI250.

@ndellingwood
Contributor

Hm, looks like there is some issue with the rocm/5.6.1 module on MI250. Configure fails just trying to build kokkos:

-- Check for working CXX compiler: /usr/bin/hipcc
-- Check for working CXX compiler: /usr/bin/hipcc - broken
CMake Error at /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
  The C++ compiler

    "/usr/bin/hipcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    
    Run Build Command(s): /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_31082/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_31082.dir/build.make CMakeFiles/cmTC_31082.dir/build
    gmake[1]: Entering directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    Building CXX object CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o
    /usr/bin/hipcc    -o CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o -c /home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP/testCXXCompiler.cxx
    sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
    Can't exec "/opt/rocm-5.6.1/bin/rocm_agent_enumerator": No such file or directory at /usr/bin//hipcc.pl line 488.
    Use of uninitialized value $targetsStr in substitution (s///) at /usr/bin//hipcc.pl line 489.
    Use of uninitialized value $targetsStr in split at /usr/bin//hipcc.pl line 495.
    sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
    gmake[1]: *** [CMakeFiles/cmTC_31082.dir/build.make:78: CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o] Error 127
    gmake[1]: Leaving directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    gmake: *** [Makefile:127: cmTC_31082/fast] Error 2
    
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:121 (PROJECT)

@ndellingwood
Contributor

Kokkos configures fine with rocm/5.2.0 and rocm/6.0.0 on MI250. I'll open an issue with the sys admins regarding rocm/5.6.1 problems
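For reference, the CMake log above shows hipcc failing because two files under /opt/rocm-5.6.1 are missing. A minimal sketch of a pre-flight check for that condition follows. The function name `check_rocm_prefix` is hypothetical, and the two relative paths are simply the ones named in the log, not a complete list of what a ROCm install needs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the files hipcc tried to exec
# (per the CMake log in this thread) are missing under a ROCm prefix.
# Returns 0 if both are present and executable, 1 otherwise.
check_rocm_prefix() {
  local prefix="$1" rel status=0
  for rel in llvm/bin/clang bin/rocm_agent_enumerator; do
    if [ ! -x "$prefix/$rel" ]; then
      echo "missing: $prefix/$rel"
      status=1
    fi
  done
  return "$status"
}

# Example (prefix taken from the log above):
#   check_rocm_prefix /opt/rocm-5.6.1
```

Running this on a compute node before configuring would separate "module loads but install is incomplete" from a genuine compiler problem.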

@ndellingwood
Contributor

So my MI210 test was on lean1, where I was able to load rocm/5.6.0, but a nightly just failed on lean2 because it could not find rocm/5.6.0:

22:12:22 Hostname:
22:12:22 lean2
22:12:24 Lmod has detected the following error: The following module(s) are unknown:
22:12:24 "rocm/5.6.0"

I'll follow up with sys admins tomorrow

@brian-kelley
Contributor Author

OK I see, the modules are just different on different nodes of the MI210 queue. Hopefully the admins make them consistent soon, I know they were still testing 6.0.0 on just one of the nodes before applying it to the others.
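One way to spot this kind of per-node drift is to compare each node's module list against the modules the CI jobs expect. The sketch below assumes the list arrives on stdin with one module name per line, as Lmod's terse listing (`module --terse avail`) produces. The helper name `missing_modules` is hypothetical, and the `srun` line in the comment assumes that a login shell on the compute node initializes `module`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every wanted module that does not appear
# in the module list read from stdin (one module name per line).
missing_modules() {
  local avail wanted
  avail="$(cat)"
  for wanted in "$@"; do
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet
    if ! printf '%s\n' "$avail" | grep -qxF -- "$wanted"; then
      echo "$wanted"
    fi
  done
}

# Example (node and partition names from this thread; Lmod writes its
# listing to stderr, hence the 2>&1):
#   srun -N 1 -p MI210 -w lean2 bash -lc 'module --terse avail rocm 2>&1' \
#     | missing_modules rocm/5.6.0 rocm/5.6.1
```

Feeding it the MI250 versions listed earlier in the thread (5.2.0, 5.6.1, 6.0.0) and asking for rocm/5.6.0 and rocm/5.6.1 would report only rocm/5.6.0 as missing.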

@ndellingwood
Contributor

Yeah, I opened an issue. Hopefully it can get sorted out quickly. There are problems with the rocm/5.6.1 install, so for the time being, shifting to that rocm version unfortunately isn't a helpful option.

@brian-kelley
Contributor Author

Can we restrict the jenkins job to run on lean1 for now?

@lucbv
Contributor

lucbv commented Mar 12, 2024

Yeah we can request a specific node list with salloc when launching the job in the jenkins script I believe?
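As a rough sketch of that idea: Slurm's `salloc` accepts `--nodelist` (short form `-w`) to request specific nodes. The partition and node names below come from this thread. How the Jenkins script actually invokes `salloc` is not shown here, so the wrapper `pinned_salloc_cmd` is hypothetical and only prints the command it would run.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print an salloc command pinned to one node.
# $1 = partition, $2 = node name
pinned_salloc_cmd() {
  local partition="$1" node="$2"
  printf 'salloc -N 1 -p %s --nodelist=%s\n' "$partition" "$node"
}

pinned_salloc_cmd MI210 lean1
# prints: salloc -N 1 -p MI210 --nodelist=lean1
```

Printing rather than executing keeps the sketch safe to run anywhere. In a real job script the same flags would be passed to `salloc` directly.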

@ndellingwood
Contributor

@brian-kelley they're rebooting lean1, which will update it to the recent image the other nodes are using. That leaves rocm/5.6.1 as the closest replacement for 5.6.0, but that module is problematic (hipcc fails during the cmake check).

@ndellingwood
Contributor

@lucbv @brian-kelley lots of progress with the updated rocm modules, sounds like one image update on the nodes may have us in a good state. I'll put in a PR with cm_test_all_sandia updates and modify the PR jobs to use rocm/5.6.1 once I confirm tests are passing

@ndellingwood
Contributor

@lucbv @brian-kelley I updated the Caraway CI jobs to test with rocm/5.6.1, and testing of #2142 confirmed it all worked. I merged the cm_test_all_sandia updates, so CI should be good to go again (though PRs may need to rebase on top of develop to ensure the cm_test_all_sandia updates are present).

@ndellingwood
Contributor

Nightly and Jenkins CI are running properly again using rocm/5.6.1, closing
