Improved fastmath code generation for trig, log, and exp/pow. #6619
Conversation
@testhound thank you for submitting this! I have added it to the queue for review.
Thanks for the patch, great to see these fast math functions added. I've made a couple of suggestions; in general it looks good. I think this probably needs documenting, both in the CUDA docs and also in the cuda.jit docstrings. Thanks again!
numba/cuda/mathimpl.py
Outdated
@@ -76,8 +88,20 @@ def lower_boolean_impl(context, builder, sig, args):


 def impl_unary(key, ty, libfunc):
     def lower_unary_impl(context, builder, sig, args):
-        libfunc_impl = context.get_function(libfunc, typing.signature(ty, ty))
-        return libfunc_impl(builder, args)
+        if ty == float32 and context.fastmath is True:
-        if ty == float32 and context.fastmath is True:
+        if ty == float32 and context.fastmath:
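To illustrate why the plain truthiness check is preferred here (a minimal sketch; FakeContext is a made-up stand-in for illustration, not Numba's real target context):

# Sketch: 'is True' only matches the bool singleton, so a truthy non-bool
# fastmath value would silently skip the branch; the truthiness check would not.
class FakeContext:
    fastmath = {'contract'}   # truthy, but not the literal True

ctx = FakeContext()
print(ctx.fastmath is True)   # False -> the fastmath branch would be skipped
print(bool(ctx.fastmath))     # True  -> the truthiness check still fires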
@stuartarchibald thanks for the review, I have made the code changes locally. Can you point me to the documentation files that need to be changed?
numba/cuda/mathimpl.py
Outdated
            fast_replacement = unarys_fastmath.get(libfunc.__name__)
            if fast_replacement is None:
                libfunc_impl = context.get_function(libfunc,
                                                    typing.signature(ty, ty))
            else:
                new_libfunc = getattr(libdevice, fast_replacement)
                libfunc_impl = context.get_function(new_libfunc,
                                                    typing.signature(ty, ty))
            return libfunc_impl(builder, args)
        else:
            libfunc_impl = context.get_function(libfunc,
                                                typing.signature(ty, ty))
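For readers following along: the unarys_fastmath / binarys_fastmath tables referenced above map the names of the regular libdevice functions to their fast approximate counterparts. A sketch of what they plausibly contain (the authoritative contents live in numba/cuda/mathimpl.py and may differ):

# Sketch (assumed contents) of the lookup tables mapping regular libdevice
# function names to the fast approximate variants used under fastmath.
unarys_fastmath = {
    'cosf': 'fast_cosf',
    'sinf': 'fast_sinf',
    'tanf': 'fast_tanf',
    'expf': 'fast_expf',
    'log2f': 'fast_log2f',
    'log10f': 'fast_log10f',
    'logf': 'fast_logf',
}

binarys_fastmath = {
    'powf': 'fast_powf',
}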
I think the three occurrences of libfunc_impl = context.get_function(...) can be pulled out, as all paths use it; then the branch is just on 32-bit float with fastmath, switching the library via libfunc = getattr(libdevice, fast_replacement). Something like the (entirely untested!) code below, perhaps - what do you think?
        if ty == float32 and context.fastmath:
            fast_replacement = unarys_fastmath.get(libfunc.__name__)
            if fast_replacement is not None:
                libfunc = getattr(libdevice, fast_replacement)
        libfunc_impl = context.get_function(libfunc,
                                            typing.signature(ty, ty))
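Put into context, the refactor might look roughly like the sketch below for the unary case (untested; it binds the replacement to a separate local name, since assigning to libfunc inside the nested function would make it a local and shadow the closed-over parameter):

# Untested sketch of the refactored lowering body; this sits inside
# impl_unary(key, ty, libfunc) and assumes the module's existing imports
# (float32, typing, libdevice) plus the unarys_fastmath table.
def lower_unary_impl(context, builder, sig, args):
    actual_libfunc = libfunc
    # Only swap in the fast approximate variant for float32 under fastmath.
    if ty == float32 and context.fastmath:
        fast_replacement = unarys_fastmath.get(libfunc.__name__)
        if fast_replacement is not None:
            actual_libfunc = getattr(libdevice, fast_replacement)
    libfunc_impl = context.get_function(actual_libfunc,
                                        typing.signature(ty, ty))
    return libfunc_impl(builder, args)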
numba/cuda/mathimpl.py
Outdated
        if ty == float32 and context.fastmath is True:
            fast_replacement = binarys_fastmath.get(libfunc.__name__)
            if fast_replacement is None:
                libfunc_impl = context.get_function(libfunc,
                                                    typing.signature(ty,
                                                                     ty, ty))
            else:
                new_libfunc = getattr(libdevice, fast_replacement)
                libfunc_impl = context.get_function(new_libfunc,
                                                    typing.signature(ty,
                                                                     ty, ty))
Same comment/refactor suggestion as above.
RE #6619 (comment): I've taken a look through the current CUDA docs and I can't seem to find a section talking about the various options available to cuda.jit (numba/numba/cuda/decorators.py, lines 45 to 48 in 8a4099a). Having reviewed the Numba documentation, I've also looked at how the CPU target handles this option: there, fastmath accepts more than just a boolean (numba/numba/core/cpu_options.py, lines 6 to 40 in 1976c66). What do you think?
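For comparison, a minimal sketch of what the CPU target already accepts (flag names per numba/core/cpu_options.py; the exact set of valid flags may vary by version):

import numpy as np
from numba import njit

# On the CPU target, fastmath may be a bool or a set of LLVM fast-math flags,
# e.g. allowing only reassociation and ignoring signed zeros.
@njit(fastmath={'reassoc', 'nsz'})
def summation(a):
    acc = 0.0
    for x in a:
        acc += x
    return acc

print(summation(np.arange(10, dtype=np.float64)))  # 45.0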
@stuartarchibald interesting suggestion. Let me digest and respond later.
As discussed out-of-band earlier, I think this is bringing Numba's behaviour into line with NVCC. An example of NVCC using a less precise cos implementation with the fast math flag can be seen in: https://github.com/gmarkall/nvcc-fastmath
I think any code that was passing
In light of the above, would you agree that the fastmath dictionary suggestion wouldn't be required now?
This is looking great so far!

- The code changes look good to me.
- Tests run fine for me locally, after merging in master, with both NVVM70 (CUDA >= 11.2) and NVVM34 (CUDA < 11.2).
- For the documentation, I'd suggest two changes:
  - Adding a section on "fast math" underneath the "Numba for CUDA GPUs" section: https://numba.readthedocs.io/en/stable/cuda/index.html - this section should refer to the CPU fastmath section (https://numba.readthedocs.io/en/stable/user/performance-tips.html#fastmath) and note the following differences for the CUDA target:
    - The flag can only be True or False for CUDA.
    - The flag also enables some fast approximate trigonometric functions; provide the list of them.
  - Updating the jit decorator documentation to explain that fastmath also substitutes the fast versions of some functions: https://numba.readthedocs.io/en/stable/cuda-reference/kernel.html#numba.cuda.jit, and refer to the new section on "fast math" for CUDA GPUs.
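To make the second documentation point concrete, a small example of the kind of kernel such a section could show (a sketch only; the actual example and wording in the docs are up to the author):

import math
import numpy as np
from numba import cuda

# With fastmath=True, float32 calls such as math.sin are lowered to the fast
# approximate libdevice functions (e.g. __nv_fast_sinf) rather than the
# accurate ones.
@cuda.jit(fastmath=True)
def fast_sin(out, x):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = math.sin(x[i])

x = np.linspace(0, 2 * np.pi, 256).astype(np.float32)
out = np.empty_like(x)
fast_sin[4, 64](out, x)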
fastver = cuda.jit("void(float32[::1], float32)", fastmath=True)(f4)
slowver = cuda.jit("void(float32[::1], float32)")(f4)
self.assertNotIn('fma.rn.f32 ', fastver.ptx)
Noting here that this checks that the fast version doesn't have any FMA instructions in it. I noticed that the fast version makes use of ex2.approx.ftz.f32, which could be checked for instead, but I also think the test as-is is sufficient, because there shouldn't be any FMA instructions in the fast version.
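For completeness, the alternative check described above would look something like this (a sketch, reusing fastver from the test snippet shown earlier):

# Sketch: check for the approximate exp2-based lowering instead of (or in
# addition to) the absence of FMA instructions.
self.assertIn('ex2.approx.ftz.f32 ', fastver.ptx)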
@gmarkall I have made and verified the first of the two changes you requested. I am unclear how to make the change to numba/cuda/decorators.py that will update the documentation; specifically, I updated decorators.py, but after generating the documentation with 'make html', I do not get an updated local page for 'CUDA Kernel API'. Is there another command to update this portion of the documentation, or did I update the wrong file?
I think sphinx isn't very good at detecting when docstrings change. Every time I change something documentation-related, I always run make clean html, which is a bit inconvenient because it takes a while to build from scratch, but I haven't found a better way.
If you do this, does your change now show up?
fastver = cuda.jit("void(float32[::1], float32)", fastmath=True)(f5)
slowver = cuda.jit("void(float32[::1], float32)")(f5)
self.assertIn('lg2.approx.ftz.f32 ', fastver.ptx)
(I guess this test is taking the approach that could have been used for exp.)
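For readers without the test file at hand, an end-to-end sketch of this kind of check (the kernel body here is a hypothetical stand-in; the real f5 in Numba's test suite may differ):

import math
from numba import cuda

def f5(r, x):
    # Hypothetical stand-in for the test kernel: exercises a log-family call.
    r[0] = math.log2(x)

fastver = cuda.jit("void(float32[::1], float32)", fastmath=True)(f5)
slowver = cuda.jit("void(float32[::1], float32)")(f5)

# Under fastmath, log2 is lowered via the fast libdevice variant, which uses
# the approximate lg2 PTX instruction.
assert 'lg2.approx.ftz.f32 ' in fastver.ptx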
Many thanks for the documentation additions - I have some further suggestions for the docs, which I've pushed in the commit gmarkall@f6e735f. To summarise, the suggestions are:

- Fixing the docs build by moving the cuda-fast-math label ahead of the section title,
- Adding links to documentation for the underlying libdevice functions and NVVM optimizations,
- Explicitly stating the math module functions affected by the transformation,
- Referring to the docs in the docstring of the jit decorator for fastmath - it's got to the point where it does a bit much to easily summarise in the docstring.

Feel free to pull in any / all of the changes in the linked commit, or do let me know what you think of the suggestions.
Following resolution of the documentation suggestions, I think this will be all looking good!
Also refer to documentation for fastmath in jit docstring.
Many thanks for the update! This looks good to me!
@esc could this have a buildfarm run please? (the typeguard fail is a general issue not related to this branch)
Running on Farm as:
Build farm was fine:
@stuartarchibald @sklam Are you happy with this going RTM?
- :func:`math.cos`: Implemented using `__nv_fast_cosf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_cosf.html>`_.
- :func:`math.sin`: Implemented using `__nv_fast_sinf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_sinf.html>`_.
- :func:`math.tan`: Implemented using `__nv_fast_tanf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_tanf.html>`_.
- :func:`math.exp`: Implemented using `__nv_fast_expf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_expf.html>`_.
- :func:`math.log2`: Implemented using `__nv_fast_log2f <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_log2f.html>`_.
- :func:`math.log10`: Implemented using `__nv_fast_log10f <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_log10f.html>`_.
- :func:`math.log`: Implemented using `__nv_fast_logf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_logf.html>`_.
- :func:`math.pow`: Implemented using `__nv_fast_powf <https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_powf.html>`_.
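As a usage aside (not part of the docs list above): the fast variants can also be called explicitly through the libdevice bindings when only specific call sites should use them, independent of the kernel-wide flag. A sketch, assuming the numba.cuda.libdevice bindings available in this Numba version:

import numpy as np
from numba import cuda
from numba.cuda import libdevice

@cuda.jit
def fast_cos_kernel(out, x):
    i = cuda.grid(1)
    if i < out.size:
        # Call the fast approximate cosine directly, without fastmath=True.
        out[i] = libdevice.fast_cosf(x[i])

x = np.linspace(0, 1, 64).astype(np.float32)
out = np.zeros_like(x)
fast_cos_kernel[1, 64](out, x)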
Most of the :func:`math.FOO` references link to the developer/autogen_math_listing pages, and a couple to the intended Python docs. Not sure what can be done apart from explicit linking. See rendered doc: https://numba--6619.org.readthedocs.build/en/6619/cuda/fastmath.html
This is a Numba doc problem, not for fixing in this PR. Deferring to ticket #6737.
Once #6619 (comment) is resolved, patch looks good. Thanks for working on this @testhound, thanks for reviewing @gmarkall.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
@testhound please could you resolve the conflicts when you have a moment? Many thanks.
588ce3d
@stuartarchibald I just resolved the conflicts.
Thanks for resolving conflicts, looks good.
This pull request adds fastmath code generation support for the following functions: sin, cos, tan, log, log2, log10, exp, and pow, when the fastmath option is used with 32-bit types.
This pull request addresses the poor code generation identified in #6183.