Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for new HIP compiler / runtime #3067

Merged

Conversation

skyreflectedinmirrors
Copy link
Contributor

Changes required to compile / run with the new compiler/runtime.

Major changes:

  • HIP-Clang matches CUDA-Clang's semantics when it comes to __host__ / __device__ overloading, while HCC was closer to NVCC (i.e., this document is largely applicable). This requires implementing a new KOKKOS_ENABLE_OVERLOAD_HOST_DEVICE which mostly serves the same purpose as the KOKKOS_IMPL_CUDA_CLANG_WORKAROUND. It may be worth discussing whether to unify these two.
  • As a consequence of this, I had to implement some device stubs for some of the SharedAllocationRecord that are essentially no-ops. I am not 100% why these are not required for your testing of CUDA-Clang. I suspect it is because our compiler tracks fairly close to upstream master (i.e., we're reporting clang 11.0), while from the Jenkins CI it appears you guys are typically using LLVM_VERSION=8.0 for CUDA-Clang.

Known issues:

  • There is a known issue in copying (some) structs by value over to the device that our compiler team is still actively debugging -- a08358b can be used as a workaround. Mallocing the struct on the device and explicitly copying over mitigates the issue, though performance will be negatively impacted for real apps.
  • I am showing the hip.triple_nested_parallelism as failing with "C++ exception with description "Kokkos::Impl::ParallelReduce< HIP > requested too large team size.". This is showing a pass on my ROCm 3.3 system as well. Not if the new runtime is calculating something about the function attributes wrong, or we're using more registers, or what -- I am actively digging here while the review gets started, but I didn't want to hold this up anymore as the new version is starting to appear on HPE systems.

Minor changes:

  • Some HIP API changes, adding of CRS / clock tick implementations for HIP
  • Fixup CMake to detect HIP-Clang vs HCC, and use the --cuda-host-only for GTest compilation.
  • __constant__ must be global.

Minor notes:

  • confusingly, __HIP__ implies HIP-Clang is the front-end compiler

@dalg24-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@skyreflectedinmirrors skyreflectedinmirrors changed the title Hipclang vdi with copy workaround Changes for new HIP compiler / runtime May 27, 2020
@dalg24
Copy link
Member

dalg24 commented May 27, 2020

OK to test

@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented May 27, 2020

I should also note that this is PR was made to be compatible w/ the current HCC / ROCm 3.3 build. In the future, we may want to simply remove KOKKOS_ENABLE_OVERLOAD_HOST_DEVICE entirely, as this will be required from here on out.

core/src/Kokkos_Macros.hpp Outdated Show resolved Hide resolved
Copy link
Member

@Rombur Rombur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. There are a lot less changes than what I expected.

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp Outdated Show resolved Hide resolved
@Rombur
Copy link
Member

Rombur commented May 27, 2020

I am showing the hip.triple_nested_parallelism as failing with "C++ exception with description "Kokkos::Impl::ParallelReduce< HIP > requested too large team size.".

If you can't find the problem, I am fine with disabling this test to get this PR merged. I know that we have problems with VectorTeam and ParallelReduce, I was planning to revisit this as soon as rocm 3.5 was made available.

@masterleinad
Copy link
Contributor

You also need to run clang-format 8 (or apply the patch in the clang-format step in https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/1912/pipeline). 🙂

CMakeLists.txt Show resolved Hide resolved
core/src/impl/Kokkos_Memory_Fence.hpp Show resolved Hide resolved
core/unit_test/CMakeLists.txt Show resolved Hide resolved
@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented May 27, 2020

@masterleinad -- any particular reason it has to be clang == 8 for clangformat? Just wondering, since our compilers come with v10 or 11 :)

either way, I applied it in 775f9d4

@masterleinad
Copy link
Contributor

@masterleinad -- any particular reason it has to be clang == 8 for clangformat? Just wondering, since our compilers come with v10 or 11 :)

No, we just picked one version when we started indenting with clang-format and it turns out that there are differences when using newer versions. 🙂

masterleinad
masterleinad previously approved these changes May 27, 2020
Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me.

@skyreflectedinmirrors skyreflectedinmirrors changed the title Changes for new HIP compiler / runtime WIP: Changes for new HIP compiler / runtime May 27, 2020
@skyreflectedinmirrors
Copy link
Contributor Author

Marking this as WIP, because even though HCC accepts --cuda-host-only, it doesn't actually do anything with it, i.e., hipcc still gives you warnings about:

clang-10: error: ROCm agent detector could not identify any valid targets; please specify the target explicitly by passing a valid value to -amdgpu-target
clang-10: warning: -amdgpu-target argument 'gfx000' is not recognized; using gfx803 instead [-Winvalid-command-line-argument]

I need to test on a HPE machine where we actually have the compiler w/o a corresponding GPU. Will fix / finish tomorrow.

@masterleinad masterleinad dismissed their stale review May 28, 2020 00:43

There were more changes.

@masterleinad masterleinad self-requested a review May 28, 2020 00:43
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more minor things.

CMakeLists.txt Show resolved Hide resolved
core/src/HIP/Kokkos_HIP_KernelLaunch.hpp Outdated Show resolved Hide resolved
core/src/HIP/Kokkos_HIP_KernelLaunch.hpp Outdated Show resolved Hide resolved
core/src/Kokkos_Macros.hpp Outdated Show resolved Hide resolved
@skyreflectedinmirrors skyreflectedinmirrors changed the title WIP: Changes for new HIP compiler / runtime Changes for new HIP compiler / runtime May 28, 2020
@masterleinad
Copy link
Contributor

I am seeing

 #0 0x0000000001a1ea1a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm-3.3.0/hcc/bin/opt+0x1a1ea1a)
 #1 0x0000000001a1c994 llvm::sys::RunSignalHandlers() (/opt/rocm-3.3.0/hcc/bin/opt+0x1a1c994)
 #2 0x0000000001a1cad3 SignalHandler(int) (/opt/rocm-3.3.0/hcc/bin/opt+0x1a1cad3)
 #3 0x00002aaaaacde5d0 __restore_rt (/lib64/libpthread.so.0+0xf5d0)
 #4 0x000000000132c890 llvm::ReplaceableMetadataImpl::getIfExists(llvm::Metadata&) (/opt/rocm-3.3.0/hcc/bin/opt+0x132c890)
 #5 0x000000000132c8f4 llvm::MetadataTracking::untrack(void*, llvm::Metadata&) (/opt/rocm-3.3.0/hcc/bin/opt+0x132c8f4)
 #6 0x00002aaaac016c29 __run_exit_handlers (/lib64/libc.so.6+0x39c29)
 #7 0x00002aaaac016c77 (/lib64/libc.so.6+0x39c77)
 #8 0x00002aaaabfff49c __libc_start_main (/lib64/libc.so.6+0x2249c)
 #9 0x000000000071da55 _start (/opt/rocm-3.3.0/hcc/bin/opt+0x71da55)
/opt/rocm-3.3.0/hcc/bin/clamp-device: line 225: 20961 Segmentation fault      $OPT $HCC_OPT -mtriple amdgcn-amd-amdhsa -mcpu=$AMDGPU_TARGET -load $LIB/LLVMSelectAcceleratorCode.so -load $LIB/LLVMPromotePointerKernArgsToGlobal.so -select-accelerator-code -sac-enable-function-calls=$AMDGPU_FUNC_CALLS -promote-pointer-kernargs-to-global -infer-address-spaces -verify -o $2.opt.bc < $2.linked.bc
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx906
Error: hc-kernel-assemble[166]: failed with status -1
clang-10: error: HC assembler command failed with exit code 255 (use -v to see invocation)

when compiling in Debug mode with this pull request.

@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented May 28, 2020

@masterleinad -- with -DCMAKE_BUILD_TYPE=Debug? Or manually passing -g to CXXFLAGS?

Strangely I can seem to get both to work.

@masterleinad
Copy link
Contributor

I see it when passing -g.

@masterleinad
Copy link
Contributor

The LocalDeepCopy tests were always failing for me and with this pull request, I also see everything else working. Could you try to just disable it here as well?

@skyreflectedinmirrors
Copy link
Contributor Author

Interestingly that test passes on my 3.5 system (hence why I didn't see it):

[ RUN      ] hip.local_deepcopy_teampolicy_layoutleft
[       OK ] hip.local_deepcopy_teampolicy_layoutleft (1499 ms)
[ RUN      ] hip.local_deepcopy_rangepolicy_layoutleft
[       OK ] hip.local_deepcopy_rangepolicy_layoutleft (2561 ms)
[ RUN      ] hip.local_deepcopy_teampolicy_layoutright
[       OK ] hip.local_deepcopy_teampolicy_layoutright (1380 ms)
[ RUN      ] hip.local_deepcopy_rangepolicy_layoutright
[       OK ] hip.local_deepcopy_rangepolicy_layoutright (2302 ms)

I'll disable for now, and we can re-evaluate skipped tests when you guys get access to a 3.5 install.

@codecov-commenter
Copy link

codecov-commenter commented May 29, 2020

Codecov Report

Merging #3067 into develop will decrease coverage by 0.2%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #3067     +/-   ##
=========================================
- Coverage     82.6%   82.4%   -0.3%     
=========================================
  Files          122     122             
  Lines         8074    8095     +21     
=========================================
- Hits          6673    6672      -1     
- Misses        1401    1423     +22     
Flag Coverage Δ
#clang 81.4% <ø> (ø)
#gcc 82.6% <ø> (-0.3%) ⬇️
Impacted Files Coverage Δ
core/src/Kokkos_Crs.hpp 100.0% <ø> (ø)
core/src/impl/Kokkos_ClockTic.hpp 100.0% <ø> (ø)
core/src/impl/Kokkos_Memory_Fence.hpp 100.0% <ø> (ø)
core/src/impl/Kokkos_SharedAlloc.hpp 92.6% <ø> (ø)
core/src/impl/Kokkos_Core.cpp 36.4% <0.0%> (-1.6%) ⬇️
core/src/Kokkos_MemoryPool.hpp 89.4% <0.0%> (-0.5%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e526f7e...0332a0b. Read the comment docs.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me. Since the LocalDeepCopy tests never worked for me anyway, disabling them here, for now, is fine with me.

@dalg24 dalg24 dismissed crtrott’s stale review June 1, 2020 17:53

All comments have been addressed

@dalg24 dalg24 merged commit 6881c64 into kokkos:develop Jun 1, 2020
@skyreflectedinmirrors skyreflectedinmirrors deleted the hipclang_vdi_with_copy_workaround branch June 17, 2020 13:57
@msimberg
Copy link
Contributor

Sorry for bumping a merged PR, but I was just wondering if there has been any progress on the HIP side for asynchronous kernel launches (same question as over here: ROCm/HIP#2066 (comment))? Are there possibly WIP changes to HIP that already make this possible?

@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented Nov 13, 2020 via email

@msimberg
Copy link
Contributor

Hi Mikael, we’ve implemented a scheme to make a good portion of launches asynchronous internally at AMD. I’m anticipating upstreaming it in the next week or so.

-Nick
On Fri, Nov 13, 2020 at 6:28 AM Mikael Simberg @.***> wrote: Message sent from a system outside of UConn. Sorry for bumping a merged PR, but I was just wondering if there has been any progress on the HIP side for asynchronous kernel launches (same question as over here: ROCm-Developer-Tools/HIP#2066 (comment) <ROCm-Developer-Tools/HIP#2066 (comment)>)? Are there possibly WIP changes to HIP that already make this possible? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3067 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRKDCJHLRH25BG3ONN2B33SPUQ47ANCNFSM4NMOEVJQ .

@arghdos thanks for the quick reply, and that's great news! Looking forward to seeing that in action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants