Following on from numerous reports and at least two lengthy discussions at different times during the weekly Numba public meetings, this ticket is a meta-issue to promote and record discussion on the following:
1. The "default" optimisation level Numba should use for compilation.
2. How to expose more, and more fine-grained, optimisation options into user space through the existing Numba APIs.
At present 1. is implemented approximately as follows:
- Run a "cheap" optimisation pass with a view to inlining as much as possible. This is so as to expose as many Numba reference counting operations as possible to Numba's custom reference count pruning pass. This currently comprises:
  - Running something like -O0 with "loop rotation", "loop invariant code motion" and "CFG simplification" passes added. It turns out in practice that these are commonly needed to help transform the LLVM IR that Numba generates into something that will perform well under the "expensive" optimisation pass, particularly with respect to vectorisation.
  - Running the aforementioned reference count pruning pass.
- Run an "expensive" optimisation pass, which is something like -O3 (cf. clang -O3), with loop and SLP vectorisation enabled.
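To illustrate why inlining before reference count pruning matters, here is a toy sketch in plain Python. It is not Numba's pruning algorithm and does not touch LLVM: the string-based "IR", the `inline` and `prune_refcounts` helpers, and the rule that a call blocks pruning are all assumptions made for illustration.

```python
def inline(body, functions):
    """Replace each 'call NAME' op with NAME's body (one level only)."""
    out = []
    for op in body:
        kind, _, arg = op.partition(" ")
        if kind == "call" and arg in functions:
            out.extend(functions[arg])
        else:
            out.append(op)
    return out


def prune_refcounts(body):
    """Drop 'incref v' / 'decref v' pairs that have no opaque call between them."""
    out = list(body)
    changed = True
    while changed:
        changed = False
        for i, op in enumerate(out):
            if not op.startswith("incref "):
                continue
            var = op.split()[1]
            for j in range(i + 1, len(out)):
                if out[j].startswith("call "):
                    break  # toy assumption: the callee may rely on the reference
                if out[j] == f"decref {var}":
                    del out[j], out[i]  # j > i, so deleting j first keeps i valid
                    changed = True
                    break
            if changed:
                break  # restart the scan on the shortened list
    return out


# Hypothetical example: the helper only uses 'a' and never stores it.
functions = {"helper": ["use a"]}
caller = ["incref a", "call helper", "decref a"]

not_inlined = prune_refcounts(caller)
inlined = prune_refcounts(inline(caller, functions))
```

In this toy, the incref/decref pair survives while the call is opaque and disappears once the callee body is visible, which is the effect the "cheap" pass is meant to set up for the pruning pass.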
Historically (prior to Numba 0.55) what eventually became the "cheap" pass was running at -O3 and there was a less sophisticated reference count pruner running (it could only analyse operations within a basic block). Essentially, Numba ran the -O3 passes over the code twice!
The reason for the change between 0.54 and 0.55 was that a new pass for pruning Numba reference count operations was developed. These reference count operations a) impact runtime performance and b) prevent certain classes of optimisations, so a strategy of doing as much as possible to remove Numba-specific reference counting operations was employed, as described above. Further, for a lot of code it was observed that running -O3 twice had little benefit: it can make a negligible difference to performance at a much increased compilation cost. This further informed the strategy above.
As has been noted in various open issues, there have been cases where a single -O3 pass has missed optimisations which can be undertaken by running a subsequent -O3 pass. See issues: #8398, #8172, #8314, #6547.
Input on what a "better" default would be is welcomed!
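For anyone wanting to gather evidence on defaults, one option that exists today is the global `NUMBA_OPT` environment variable. The sketch below assumes the values documented for it (0 to 3, plus `max` in recent releases) and must be set before Numba is imported, hence the subprocess. The helper names `env_for_opt` and `time_script` and the script path are hypothetical.

```python
import os
import subprocess
import sys
import time


def env_for_opt(level):
    """Return a copy of the current environment with NUMBA_OPT set to `level`."""
    env = dict(os.environ)
    env["NUMBA_OPT"] = str(level)
    return env


def time_script(script_path, level):
    """Run a benchmark script in a fresh interpreter and return wall time in seconds.

    The measured time includes interpreter start-up, compilation and execution,
    so the script itself should report finer-grained timings if they are needed.
    """
    start = time.perf_counter()
    subprocess.run([sys.executable, script_path], env=env_for_opt(level), check=True)
    return time.perf_counter() - start


# Hypothetical usage, assuming 'bench.py' jit-compiles and runs the code of interest:
# for level in (0, 1, 2, 3):
#     print(level, time_script("bench.py", level))
```

If the benchmark uses on-disk caching, the cache should be cleared or disabled between runs, otherwise later runs will not recompile.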
With regards to 2., here is a brief summary from prior discussions (this is from memory, so please do correct as needed).
Commonly described use cases:
- Users who are perhaps not explicitly concerned about a certain performance characteristic; they want something that is "reasonable" in terms of compilation and execution time by default.
- Users in HPC/high performance situations where any compilation cost is accepted if it reduces runtime, i.e. compilation time is dwarfed by the run time.
- Users that are compilation time constrained and know which functions are worth optimising, e.g. dynamic code generation situations/"interactive" applications.
- Users wanting to do incredibly fine-grained tuning of optimisation pipelines for some purpose.
- Users researching compilers wanting fine-grained specification of the optimisation pipelines.
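Previously discussed options (not mutually exclusive):
- O0/O1/O2/O3 etc.
- Expose some more colloquial terms like "hot", "cold" etc. to govern the amount of effort Numba should put in to compiling a given function.

Previously discussed method(s) of exposing the options (not mutually exclusive):
- Setting the option per-function as part of the @njit decoration options.

Input is welcomed on use cases, optimisation options, and their method of exposure into user space.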
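As a discussion aid, the per-function idea could look roughly like the sketch below. Everything in it is hypothetical: `jit_with_effort`, the `effort` keyword, and the `EFFORT_TO_OPT` mapping are not existing Numba APIs, and the decorator only records the requested level rather than compiling anything.

```python
import functools

# Hypothetical mapping from colloquial effort names to an optimisation level.
EFFORT_TO_OPT = {"cold": 0, "warm": 2, "hot": 3}


def jit_with_effort(effort="warm"):
    """Stand-in for an @njit-style decorator that accepts a per-function effort."""
    if effort not in EFFORT_TO_OPT:
        raise ValueError(f"unknown effort {effort!r}; expected one of {sorted(EFFORT_TO_OPT)}")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real implementation would dispatch to compiled code here.
            return func(*args, **kwargs)

        # Record the level a compiler back end would be asked to use.
        wrapper.opt_level = EFFORT_TO_OPT[effort]
        return wrapper

    return decorator


@jit_with_effort(effort="hot")
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))


@jit_with_effort(effort="cold")
def parse_flag(text):
    return text.strip().lower() == "yes"
```

The point of the sketch is the shape of the user-facing option: the caller states intent ("hot", "cold") per function, and the mapping to concrete pass pipelines stays an implementation detail that can change between releases.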
This feature of allowing control over the compilation effect on each function could be useful from my first thought! In this way, we can fine-tune the compilation speed, and try not to compromise the runtime.