rocm 5.6.0 PR testing build failing - module not found #2137
Comments
@brian-kelley I just hopped on and checked the modules on MI210, and rocm/5.6.0 was available.
I also manually launched a cm_test_all_sandia build with rocm/5.6.0 and the build is proceeding without issue.
Maybe there was an update in progress that temporarily disrupted the modules? Let's keep an eye on whether this occurs again; there may be a change coming soon once rocm/6.0 is available.
I just checked the MI250 queue and it looks like rocm/5.6.0 is not available there.
I relaunched one of the Jenkins PR jobs running on MI210 and it looks like it is proceeding without issue with rocm/5.6.0. We'll still need to test with rocm/5.6.1 and then update the jobs to use it, to hopefully avoid any bumps if the modules on MI210 are permanently modified like those on MI250.
Hm, looks like there is some issue with the rocm/5.6.1 module on MI250: I hit configure issues just trying to build kokkos.
Kokkos configures fine with rocm/5.2.0 and rocm/6.0.0 on MI250. I'll open an issue with the sys admins regarding the rocm/5.6.1 problems.
So my MI210 test was on lean1, where I was able to load rocm/5.6.0, but a nightly just failed on lean2 because it was unable to find rocm/5.6.0.
I'll follow up with the sys admins tomorrow.
OK I see, the modules just differ between nodes of the MI210 queue. Hopefully the admins make them consistent soon; I know they were still testing 6.0.0 on just one of the nodes before applying it to the others.
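Since the modules differ from node to node, a job could check what is actually installed before trying to load anything. A minimal sketch, assuming the job has captured a terse module listing (one name per line, e.g. from `module -t avail rocm`) as text. The helper names and the fallback order are hypothetical and not what cm_test_all_sandia does:

```python
def parse_module_names(avail_text):
    """Collect module names from a terse listing, one name per line."""
    names = set()
    for line in avail_text.splitlines():
        token = line.strip()
        # Skip blank lines and directory headers such as "/opt/modulefiles:".
        if not token or token.endswith(":"):
            continue
        # Drop trailing markers such as "(default)".
        token = token.split("(")[0].strip()
        if token:
            names.add(token)
    return names


def pick_rocm_module(avail_text, preferred):
    """Return the first preferred module present on this node, else None."""
    available = parse_module_names(avail_text)
    for name in preferred:
        if name in available:
            return name
    return None


# Hypothetical listing resembling a node that no longer has rocm/5.6.0.
listing = "/opt/modulefiles:\nrocm/5.2.0\nrocm/5.6.1\nrocm/6.0.0(default)\n"
print(pick_rocm_module(listing, ["rocm/5.6.0", "rocm/5.6.1"]))
```

With the hypothetical listing above this prints `rocm/5.6.1`. Returning `None` when nothing matches lets the caller fail early with a clear message instead of a module-not-found error halfway through the build.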
Yeah, I opened an issue. Hopefully it can get sorted out quickly. There are problems with the rocm/5.6.1 install, so for the time being shifting to that rocm version isn't a helpful option, unfortunately.
Can we restrict the Jenkins job to run on lean1 for now?
Yeah, I believe we can request a specific node list with salloc when launching the job in the Jenkins script.
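A rough sketch of what pinning the job to a node could look like, using salloc's `--nodelist` option. The script name `./run_pr_tests.sh` and the helper are hypothetical; the real Jenkins script will pass its own options (partition, time limit, and so on):

```python
def build_salloc_command(nodes, command):
    """Build an salloc argv that restricts the allocation to specific nodes."""
    if not nodes:
        raise ValueError("at least one node name is required")
    return [
        "salloc",
        f"--nodes={len(nodes)}",          # request exactly as many nodes as named
        "--nodelist=" + ",".join(nodes),  # e.g. --nodelist=lean1
        *command,                          # command to run inside the allocation
    ]


# Hypothetical test script; only lean1 is requested.
print(" ".join(build_salloc_command(["lean1"], ["./run_pr_tests.sh"])))
```

This prints `salloc --nodes=1 --nodelist=lean1 ./run_pr_tests.sh`. Keeping the node list as a parameter makes it easy to drop the restriction again once the modules are consistent across nodes.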
@brian-kelley they're rebooting lean1, which will update it to the recent image the other nodes are using. That leaves rocm/5.6.1 as the closest replacement for 5.6.0, but that module is problematic (hipcc fails during the cmake check).
@lucbv @brian-kelley lots of progress with the updated rocm modules; it sounds like one image update on the nodes may have us in a good state. I'll put in a PR with cm_test_all_sandia updates and modify the PR jobs to use rocm/5.6.1 once I confirm tests are passing.
@lucbv @brian-kelley I updated the Caraway CI jobs to test with rocm/5.6.1, and testing of #2142 confirmed it all worked. I merged the cm_test_all_sandia updates, so CI should be good to go again (though PRs may need to rebase on top of develop to ensure the cm_test_all_sandia updates are present).
Nightly and Jenkins CI are running properly again using rocm/5.6.1, closing |
@ndellingwood
On a PR testing run from today, the builds KokkosKernels_PullRequest_VEGA90A_ROCM560 and KokkosKernels_PullRequest_VEGA90A_Tpls_ROCM560 failed because rocm 5.6.0 was not found.
I just checked on the MI210 and MI250 nodes and I don't see 5.6.0 anywhere, but there is rocm/5.6.1.