Improved fastmath code generation for trig, log, and exp/pow. #6619
sklam merged 10 commits into numba:master
Conversation
@testhound thank you for submitting this! I have added it to the queue for review.
stuartarchibald
left a comment
Thanks for the patch, great to see these fast math functions added. I've made a couple of suggestions, in general looks good. I think this probably needs documenting, both in the CUDA docs and also the cuda.jit doc strings. Thanks again!
numba/cuda/mathimpl.py
Outdated
    def lower_unary_impl(context, builder, sig, args):
        libfunc_impl = context.get_function(libfunc, typing.signature(ty, ty))
        return libfunc_impl(builder, args)
    if ty == float32 and context.fastmath is True:

Suggested change:

    - if ty == float32 and context.fastmath is True:
    + if ty == float32 and context.fastmath:
@stuartarchibald thanks for the review, I have made the code changes locally. Can you point me to the documentation files that need to be changed?
numba/cuda/mathimpl.py
Outdated
        fast_replacement = unarys_fastmath.get(libfunc.__name__)
        if fast_replacement is None:
            libfunc_impl = context.get_function(libfunc,
                                                typing.signature(ty, ty))
        else:
            new_libfunc = getattr(libdevice, fast_replacement)
            libfunc_impl = context.get_function(new_libfunc,
                                                typing.signature(ty, ty))
        return libfunc_impl(builder, args)
    else:
        libfunc_impl = context.get_function(libfunc,
                                            typing.signature(ty, ty))
I think the three `libfunc_impl = context.get_function(...)` calls can be pulled out, as all paths use the same call; then the branch is just on 32-bit float with fastmath, switching the library with `libfunc = getattr(libdevice, fast_replacement)`. Something like the (entirely untested!) code below, perhaps - what do you think?
    if ty == float32 and context.fastmath:
        fast_replacement = unarys_fastmath.get(libfunc.__name__)
        if fast_replacement is not None:
            libfunc = getattr(libdevice, fast_replacement)
    libfunc_impl = context.get_function(libfunc,
                                        typing.signature(ty, ty))
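The shape of this refactor can be illustrated outside of Numba's lowering machinery in plain Python. The `fastlib`, `fast_replacements`, `fast_sinf`, and `resolve` names below are hypothetical stand-ins for `libdevice`, `unarys_fastmath`, and the surrounding lowering code, used only to show the "one lookup swaps the function, one shared call site does the rest" pattern:

```python
import math

# Hypothetical "fast" variant standing in for a libdevice intrinsic; a real
# fast path would trade precision for speed.
def fast_sinf(x):
    return math.sin(x)

# Stand-in for the libdevice module: fast name -> implementation.
fastlib = {"fast_sinf": fast_sinf}

# Stand-in for unarys_fastmath: precise name -> fast replacement name.
fast_replacements = {"sin": "fast_sinf"}

def resolve(func, fastmath):
    """Swap in the fast variant when fastmath is on and a replacement exists;
    otherwise fall through to the original function unchanged."""
    if fastmath:
        replacement = fast_replacements.get(func.__name__)
        if replacement is not None:
            func = fastlib[replacement]
    return func

print(resolve(math.sin, fastmath=True).__name__)   # fast_sinf
print(resolve(math.sin, fastmath=False).__name__)  # sin
print(resolve(math.cos, fastmath=True).__name__)   # cos (no replacement registered)
```

The point of the refactor is that the final `context.get_function(...)` call appears once, regardless of which branch was taken.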
numba/cuda/mathimpl.py
Outdated
    if ty == float32 and context.fastmath is True:
        fast_replacement = binarys_fastmath.get(libfunc.__name__)
        if fast_replacement is None:
            libfunc_impl = context.get_function(libfunc,
                                                typing.signature(ty,
                                                                 ty, ty))
        else:
            new_libfunc = getattr(libdevice, fast_replacement)
            libfunc_impl = context.get_function(new_libfunc,
                                                typing.signature(ty,
                                                                 ty, ty))
Same comment/refactor suggestion as above.
RE #6619 (comment): I've taken a look through the current CUDA docs and I can't seem to find a section talking about the various options available to the `cuda.jit` decorator (numba/cuda/decorators.py, lines 45 to 48 in 8a4099a). Having reviewed the Numba documentation, I've also looked at how this is handled on the CPU target (numba/core/cpu_options.py, lines 6 to 40 in 1976c66). What do you think?
@stuartarchibald interesting suggestion. Let me digest and respond later.
As discussed out-of-band earlier, I think this is bringing Numba's behaviour into line with NVCC. An example of NVCC using a less precise cos implementation with the fast math flag can be seen in: https://github.com/gmarkall/nvcc-fastmath
I think any code that was passing
In light of the above, would you agree that the fastmath dictionary suggestion wouldn't be required now?
gmarkall
left a comment
This is looking great so far!
- The code changes look good to me.
- Tests run fine for me locally, after merging in master, with both NVVM70 (CUDA >= 11.2) and NVVM34 (CUDA < 11.2).
- For the documentation, I'd suggest two changes:
  - Adding a section on "fast math" underneath the "Numba for CUDA GPUs" section: https://numba.readthedocs.io/en/stable/cuda/index.html - this section should refer to the CPU fastmath section (https://numba.readthedocs.io/en/stable/user/performance-tips.html#fastmath) and note the following differences for the CUDA target:
    - The flag can only be `True` or `False` for CUDA.
    - The flag also enables some fast approximate trigonometric functions; the list of them should be provided.
  - Updating the `jit` decorator documentation to explain that `fastmath` also substitutes the fast versions of some functions: https://numba.readthedocs.io/en/stable/cuda-reference/kernel.html#numba.cuda.jit, and referring to the new section on "fast math" for CUDA GPUs.
    fastver = cuda.jit("void(float32[::1], float32)", fastmath=True)(f4)
    slowver = cuda.jit("void(float32[::1], float32)")(f4)
    self.assertNotIn('fma.rn.f32 ', fastver.ptx)
Noting here that this checks that the fast version doesn't have any FMA instructions in it. I noticed that the fast version makes use of ex2.approx.ftz.f32, which could be checked for instead, but I also think the test as-is is sufficient, because there shouldn't be any FMA instructions in the fast version.
@gmarkall I have made and verified the first of the two changes you requested. I am unclear how to make the change to numba/cuda/decorators.py that will update the documentation; specifically, I updated decorators.py, but after generating the documentation with `make html`, I do not get an updated local page for 'CUDA Kernel API'. Is there another command to update this portion of the documentation, or did I update the wrong file?
I think Sphinx isn't very good at detecting when docstrings change. Every time I change something documentation-related, I always run `make clean html`, which is a bit inconvenient because it takes a while to build from scratch, but I haven't found a better way.
If you do this, does your change now show up?
    fastver = cuda.jit("void(float32[::1], float32)", fastmath=True)(f5)
    slowver = cuda.jit("void(float32[::1], float32)")(f5)
    self.assertIn('lg2.approx.ftz.f32 ', fastver.ptx)
(I guess this test is taking the approach that could have been used for exp)
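The two assertion styles in these tests (presence of the approximate instruction for log, absence of FMA for exp) can be illustrated without a GPU. The `sample_ptx` string below is a fabricated fragment for demonstration only, not real compiler output; in the actual tests the PTX comes from `fastver.ptx`:

```python
# Fabricated PTX fragment for illustration; real PTX comes from kernel.ptx.
sample_ptx = """
    lg2.approx.ftz.f32  %f2, %f1;
    mul.f32             %f3, %f2, %f0;
"""

def uses_fast_log2(ptx):
    # The fastmath build of math.log2 should lower to the approximate
    # flush-to-zero instruction.
    return 'lg2.approx.ftz.f32' in ptx

def uses_fma(ptx):
    # The exp test instead checks for the *absence* of fused multiply-add,
    # which only the precise implementation emits.
    return 'fma.rn.f32' in ptx

print(uses_fast_log2(sample_ptx))  # True
print(uses_fma(sample_ptx))        # False
```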
gmarkall
left a comment
Many thanks for the documentation additions - I have some further suggestions for the docs, which I've pushed in the commit gmarkall@f6e735f. To summarise the suggestions, they are:
- Fixing the docs build by moving the `cuda-fast-math` label ahead of the section title,
- Adding links to documentation for the underlying libdevice functions and NVVM optimizations,
- Explicitly stating the `math` module functions affected by the transformation,
- Referring to the docs in the docstring of the `jit` decorator for `fastmath` - it's got to the point where it does a bit much to easily summarise in the docstring.
Feel free to pull in any / all of the changes in the linked commit, or do let me know what you think of the suggestions.
Following resolution of the documentation suggestions, I think this will be all looking good!
Also refer to documentation for fastmath in jit docstring.
Running on Farm as:
Build farm was fine:
@stuartarchibald @sklam Are you happy with this going RTM?
- :func:`math.cos`: Implemented using `__nv_fast_cosf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_cosf.html>`_.
- :func:`math.sin`: Implemented using `__nv_fast_sinf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_sinf.html>`_.
- :func:`math.tan`: Implemented using `__nv_fast_tanf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_tanf.html>`_.
- :func:`math.exp`: Implemented using `__nv_fast_expf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_expf.html>`_.
- :func:`math.log2`: Implemented using `__nv_fast_log2f <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_log2f.html>`_.
- :func:`math.log10`: Implemented using `__nv_fast_log10f <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_log10f.html>`_.
- :func:`math.log`: Implemented using `__nv_fast_logf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_logf.html>`_.
- :func:`math.pow`: Implemented using `__nv_fast_powf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_powf.html>`_.
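As a quick reference, the mapping documented above can be written out as a plain dictionary. This is an illustrative summary of the docs only, not the actual lookup table (`unarys_fastmath`/`binarys_fastmath`) used inside numba/cuda/mathimpl.py:

```python
# Documented fastmath substitutions for float32:
# Python math function -> libdevice fast implementation.
fastmath_substitutions = {
    "math.cos":   "__nv_fast_cosf",
    "math.sin":   "__nv_fast_sinf",
    "math.tan":   "__nv_fast_tanf",
    "math.exp":   "__nv_fast_expf",
    "math.log2":  "__nv_fast_log2f",
    "math.log10": "__nv_fast_log10f",
    "math.log":   "__nv_fast_logf",
    "math.pow":   "__nv_fast_powf",
}

# Every fast variant follows the __nv_fast_<name>f naming convention.
for func, fast in fastmath_substitutions.items():
    name = func.split(".")[1]
    assert fast == f"__nv_fast_{name}f"
```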
Most of the :func:`math.FOO` references link to the developer/autogen_math_listing pages, and only a couple to the intended Python docs. Not sure what can be done apart from explicit linking. See the rendered doc: https://numba--6619.org.readthedocs.build/en/6619/cuda/fastmath.html
This is a Numba doc problem, not for fixing in this PR. Deferring to ticket #6737.
Once #6619 (comment) is resolved, patch looks good. Thanks for working on this @testhound, thanks for reviewing @gmarkall.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
@testhound please could you resolve the conflicts when you have a moment? Many thanks.
588ce3d
@stuartarchibald I just resolved the conflicts.
stuartarchibald
left a comment
Thanks for resolving conflicts, looks good.
This pull request adds fastmath code generation support for the following functions:
sin, cos, tan, log, log2, log10, exp, and pow, when the fastmath option is used with 32-bit types.
This pull request addresses the poor code generation identified in #6183.
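To give a feel for the precision/speed trade-off these approximate instructions make, here is a toy CPU-side sketch. The `approx_log2` bit-trick below is purely illustrative and is not how the GPU computes it; on the GPU the fast path is the single `lg2.approx.ftz.f32` hardware instruction:

```python
import math
import struct

def approx_log2(x):
    """Toy fast log2 via the float32 bit pattern (illustrative only).

    Reinterpreting a positive float32 as an integer yields roughly
    2**23 * (log2(x) + 127), so scaling and shifting gives a cheap,
    low-precision log2: exact at powers of two, off by up to ~0.09
    in between.
    """
    bits = struct.unpack('I', struct.pack('f', x))[0]
    return bits / 2**23 - 127

print(math.log2(10.0))    # 3.3219...
print(approx_log2(10.0))  # 3.25 - close, but visibly less precise
print(approx_log2(8.0))   # 3.0 - exact at powers of two
```

This is the flavour of accuracy loss users opt into with `fastmath=True`; the precise libdevice implementations remain the default.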