CUDA: Don't optimize IR before sending it to NVVM #6030
There are various issues created by optimizing the IR before sending it to NVVM, in particular:

- Issue numba#5576: Optimizing the IR before sending it to NVVM results in code that makes out-of-bounds accesses.
- Issue numba#6022: Optimization generates memset intrinsics incompatible with LLVM 3.4 (NVVM).
- IR auto-upgrade breaks the IR for LLVM 3.4 due to changes in atomic instructions. This requires a patch to the LLVM version used for llvmlite, which is problematic because many distributions won't carry this patch. (ref: numba/llvmlite#593)

Optimizing IR prior to sending it to NVVM was originally added to work around an NVVM bug documented as Issue numba#1341. The cause of issue numba#1341 is now resolved in NVVM, and has been since CUDA 7.5: the test script at https://gist.github.com/sklam/f62f1f48bb0be78f9ceb executes in a fraction of a second and consumes around 100MB of RAM, whereas the bug manifested itself as a long compilation time with gigabytes of RAM consumed.

There are some side-effects to not optimizing before sending to NVVM, which are reflected in the tests:

- `TestCudaConstantMemory.test_const_array_3d`: Instead of reading the complex value as a vector of u32s, there are two individual loads of the real and imaginary parts as f32s. This is not expected to have a negative impact on performance, since coalesced reading of f32s will use memory bandwidth effectively.
- `TestCudaConstantMemory.test_const_record`: The generated code here appears less optimal than it previously was: instead of loading three 8-bit members in a single 32-bit read, the reads are now individual 8-bit reads.
- `TestCudaConstantMemory.test_const_record_align`: A similar issue to `test_const_record`.
- `TestNvvmWithoutCuda.test_nvvm_memset_fixup`: Now that IR is not optimized prior to sending to NVVM, memsets no longer appear in the IR. It is desirable to keep the memset fixup functionality and test so that Numba and its extensions will be able to use memsets (e.g. from `cgutils.memset`), so this test is modified to use strings of IR rather than relying on memsets naturally being produced as part of the compilation process. It is possible to generate a memset without an alignment attribute on the destination with `cgutils.memset`, so a check for this error and a test of the check are also added.

For the issues with less optimal code being generated in `TestCudaConstantMemory`, the tests for the optimal code have been moved into new test cases that are marked as expected fails: it is better to keep these tests and note the sub-optimal behaviour, with the expectation that a future version of NVVM may improve on the code it currently generates.
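To illustrate the kind of rewrite the memset fixup performs, here is a simplified sketch. The regex and function name are invented for this example and this is not Numba's actual fixup code; it only shows the shape of the transformation: modern LLVM IR carries the destination alignment as a parameter attribute (`i8* align 4 %p`), while LLVM 3.4 (NVVM) expects it as an explicit `i32` argument before the trailing volatile flag.

```python
import re

# Simplified, illustrative sketch of a memset "fixup" for NVVM (LLVM 3.4).
#
# Modern IR:   call void @llvm.memset.p0i8.i64(i8* align 4 %p, i8 0, i64 256, i1 false)
# LLVM 3.4 IR: call void @llvm.memset.p0i8.i64(i8* %p, i8 0, i64 256, i32 4, i1 false)

MEMSET_RE = re.compile(
    r'@llvm\.memset\.(?P<sig>[\w.]+)\('
    r'i8\* (?:align (?P<align>\d+) )?(?P<dst>%\w+), '
    r'(?P<val>i8 [^,]+), (?P<size>i\d+ [^,]+), (?P<vol>i1 \w+)\)'
)

def downgrade_memset(ir):
    """Rewrite memset calls into the five-argument LLVM 3.4 form."""
    def repl(m):
        if m.group('align') is None:
            # With no alignment attribute on the destination there is
            # nothing to move into the i32 argument, so reject the IR;
            # this mirrors the error check described above.
            raise ValueError('memset with no alignment on destination')
        return ('@llvm.memset.{sig}(i8* {dst}, {val}, {size}, '
                'i32 {align}, {vol})').format(**m.groupdict())
    return MEMSET_RE.sub(repl, ir)
```

Testing such a fixup against strings of IR like the above is exactly why the modified test no longer relies on memsets appearing naturally during compilation.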
This reverts commit 3f66129. Once numba/numba#6030 is merged, it will no longer be necessary to disable the autoupgrade of atomic intrinsics for NVPTX, because LLVM from llvmlite will not be used to optimize the IR before sending it to NVVM.
Thanks for the patch. A couple of minor things to resolve, otherwise this looks good.
BFID:
- Add comments about why we don't optimize LLVM IR in the CUDA target.
- Change wording of error about requiring alignment on memset destination and update test accordingly.
Re-build BFID:
LGTM. Waiting on buildfarm to rerun
BF Report for BFID: CUDA 8 and 9 tests fail with:
And one unexpected success, which is unfortunately not displayed in the log. CUDA 10 tests fail with two unexpected successes, which are also unfortunately not displayed in the log.
Many thanks @esc
This passes on some toolkits (e.g. 9.0), so it will be modified in subsequent commits to make the expectation conditional on the toolkit version.
@esc With the added commits I get no fails on Linux with CUDA 8.0, 9.0, 9.1, 9.2, 10.0, 10.1, 10.2, or 11.0 (as long as I've made no mistakes in testing) - could you run through the buildfarm again please?
BFID:
The Windows tests failed on CUDA 8 and 9 with unexpected successes, but the testing script doesn't seem to output these.
Switching from an expected failure to checking the behaviour for various toolkits and platforms.
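A minimal sketch of the shape this takes (the helper name, version threshold, and PTX strings below are hypothetical, purely to illustrate the pattern of toolkit-conditional expectations rather than the actual test code):

```python
import unittest

# Hypothetical sketch: rather than a blanket expectedFailure, key the
# expected code generation off the CUDA toolkit version, since some
# toolkits (e.g. 9.0) still produce the optimal code.

def optimal_load_expected(toolkit_version):
    # Assumed for illustration: 9.0 and newer produce the single-read form.
    return toolkit_version >= (9, 0)

class TestConstRecordSketch(unittest.TestCase):
    toolkit_version = (10, 2)  # would be queried from the CUDA runtime

    def test_const_record(self):
        ptx = "ld.const.u32 ..."  # stand-in for the compiled kernel's PTX
        if optimal_load_expected(self.toolkit_version):
            self.assertIn("ld.const.u32", ptx)
        else:
            self.assertNotIn("ld.const.u32", ptx)
```

The design advantage over `expectedFailure` is that an "unexpected success" on one toolkit no longer fails the suite silently; each toolkit gets a positive assertion about the behaviour it should exhibit.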
@esc Many thanks - this is ready for another buildfarm run.
BF reports that CUDA 10 tests on Windows fail with:
Conda-forge Numba feedstock issue that will also be resolved by this PR: conda-forge/numba-feedstock#55
@esc Could this have another buildfarm run please? |
Reckon this is good to go now. |
* Use std::make_unique on LLVM 10. LLVM 10 removes llvm::make_unique in favor of std::make_unique. However, this requires C++14 and is therefore unsuitable for LLVM 9, which forces -std=c++11. Update the code to use both conditionally. This fixes all issues with LLVM 10.
* grab updated SVML patch and new fastmath patch:
  https://github.com/conda-forge/llvmdev-feedstock/blob/c706309/recipe/patches/intel-D47188-svml-VF.patch
  https://github.com/conda-forge/llvmdev-feedstock/blob/c706309/recipe/patches/expect-fastmath-entrypoints-in-add-TLI-mappings.ll.patch
* update to LLVM 10 in recipe and tests
* update documented version numbers in README.rst
* update appveyor.yml comment to include LLVM 10
* update to llvmdev 10.0* on Azure pipelines
* update to llvmdev 10.0 in buildscripts
* update llvmdev_manylinux1 conda-recipe
* update Sphinx documentation for correct LLVM version
* temporarily set the VERSION_SUFFIX for CI. This is needed as the llvmlite package we are building needs to differentiate itself somehow from the 'other' dev builds.
* specify the correct compiler in the build requirements
* workaround potentially broken LLVM CMake setup. With LLVM 10 the following CMake line no longer appears to work:

  ```
  llvm_map_components_to_libnames(llvm_libs all)
  ```

  This commit is a suitable workaround. The current working hypothesis is that this might have been caused by the change in llvm/llvm-project@ab41180#diff-cebfde545aa260d20ca6b0cdc3da08daR270. However, the docs at https://llvm.org/docs/CMake.html are not very clear about this, and in addition the output of `llvm-config --components` does list `all` as an option. Many thanks to @angloyna for debugging this!
* update the comment and links for the SVML patch
* add a note about broken LLVM CMake setup
* update compatibility matrix
* install cmake on armv7l too
* remove TODO
* bump to LLVM 10.0.1
* Make sure llvmlite build is unique
* remove outdated comment
* implement LLVM version split. The next llvmlite release will have a split LLVM dependency: the `aarch64` target will use LLVM 9 and all other targets will use LLVM 10.
* Revert "Fix CUDA with LLVM9". This reverts commit 3f66129. Once numba/numba#6030 is merged, it will no longer be necessary to disable the autoupgrade of atomic intrinsics for NVPTX, because LLVM from llvmlite will not be used to optimize the IR before sending it to NVVM.
* revert LLVM documentation link to 10.0.0. The page at https://releases.llvm.org/10.0.1/docs/ currently (Tue Aug 4 10:48:42 2020 GMT+2) returns a 404 - page not found - and this breaks our build as Sphinx can't set up the intersphinx links.
* accept LLVM 9.0 once again. As we are doing a split release for 0.34.0 and `aarch64` will need LLVM 9.0.*, we allow building with this version again.
* Apply suggestions from code review. Fix missing word and missing punctuation.
* adding link to LLVM bug report
* adding link to potential LLVM CMake bug
* stop testing older LLVMs. We only support 10.0.* and 9.0.* from now on.
* getting ready to release 0.34.0: remove VERSION_SUFFIX. Final builds should not have a version suffix.

Co-authored-by: Michał Górny <mgorny@gentoo.org>
Co-authored-by: Graham Markall <graham@big-grey.co.uk>
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
Starting with CUDA 11.2, a new version of NVVM is provided that is based on LLVM 7.0. This requires a number of changes to support, which must be maintained in parallel with the existing support for NVVM based on LLVM 3.4. This PR adds these changes, which consist of:

- Addition of a function to query the NVVM IR version, and a property indicating whether the NVVM in use is based on LLVM 3.4 or 7.0 (`is_nvvm70`).
- The CAS hack (inserting a text-based implementation of `cmpxchg` with pre-LLVM 3.5 semantics in a function) is only needed with NVVM 3.4; on NVVM 7.0, llvmlite is used to build `cmpxchg` instructions directly instead.
- Templates for other atomics (inc, dec, min, max) have the right form of the `cmpxchg` instruction inserted depending on the NVVM version.
- The datalayout shorthand is now only replaced for NVVM 3.4.
- There are now two variants of the functions to rewrite the IR: `llvm100_to_70_ir` and `llvm100_to_34_ir`. `llvm100_to_34_ir` is the old `llvm_39_to_34_ir` with a name reflecting what it currently does.
- `llvm100_to_70_ir` removes the `willreturn` attribute from functions, as it is not supported by LLVM 7.0. It also converts DISPFlags to main subprogram DIFlags. For example, `spflags: DISPFlagDefinition | DISPFlagOptimized` is rewritten as `isDefinition: true, isOptimized: true`.
- For NVVM 7.0, the `DIBuilder` also used for the CPU target can be used, instead of the `NvvmDIBuilder` that was needed to support NVVM 3.4.
- Some tests are updated to support modified function names, and also to expect a CUDA version of 11.2.
- `test_nvvm_driver` is updated to include appropriate IR for both NVVM 3.4 and 7.0. Some refactoring also makes its code clearer (e.g. renaming `get_ptx()` to `get_nvvmir()`, because it returns NVVM IR and not PTX).
- Some optimizations in LLVM 7.0 result in different code generation in `test_constmem`, so alternative expected results are added for when NVVM 7.0 is used. Note that this recovers some optimizations that were lost when IR optimization using llvmlite was switched off (PR numba#6030, "Don't optimize IR before sending it to NVVM").
- `test_debuginfo` is updated to match the format of the debuginfo section produced by both NVVM 3.4 and 7.0 (there is some variation in whitespace between these versions).
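A simplified sketch of the flavour of textual rewriting involved in downgrading IR for LLVM 7.0 (the regexes and function name here are illustrative and only handle the two cases mentioned above; Numba's real `llvm100_to_70_ir` covers more):

```python
import re

# Illustrative sketch of rewriting LLVM 10 IR for LLVM 7.0 consumption:
# drop the `willreturn` attribute (unknown to LLVM 7.0) and turn DISPFlags
# back into the older per-flag booleans on !DISubprogram.

_SPFLAG_MAP = {
    'DISPFlagDefinition': 'isDefinition: true',
    'DISPFlagOptimized': 'isOptimized: true',
}

def llvm100_to_70_ir_sketch(ir):
    # Remove the `willreturn` function attribute.
    ir = re.sub(r'\bwillreturn ?', '', ir)

    # Convert e.g. `spflags: DISPFlagDefinition | DISPFlagOptimized`
    # into `isDefinition: true, isOptimized: true`.
    def repl(m):
        flags = [f.strip() for f in m.group(1).split('|')]
        return ', '.join(_SPFLAG_MAP[f] for f in flags if f in _SPFLAG_MAP)
    return re.sub(r'spflags: ([\w |]+)', repl, ir)
```

Driving the rewrite purely over IR text, rather than through LLVM's own auto-upgrade machinery, is what keeps this workable against an NVVM whose LLVM version llvmlite itself does not ship.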