Added Support for Numpy BitGenerators #7900
Conversation
Tests are failing on NumPy 1.18.
@kc611 Many thanks for creating a patch from #4499 (comment) and updating it. @brandonwillard any chance you could please give this a first review once CI passes? Thanks in advance!

Of course.
numba/np/npy_randomgen.py (Outdated)

```python
        return out
    return impl


# Overload the Generator().standard_gamma() method
```

Suggested change:

```python
# Overload the Generator().standard_gamma() method
```
I think this should fix flake8.
@kc611 ^ This should fix flake8 and let public CI run fully.
@kc611 mentioned that there were some issues with the time taken by the new `TestRandomGenerators.test_standard_gamma` test (and that those tests might not have actually been run in CI). I can reproduce the issue locally, and it looks like `numba.np.npy_randomgen.random_standard_gamma` is not exiting its inner loop.

The port of `random_standard_gamma` and the function it calls in the inner loop, `random_standard_normal`, look like faithful adaptations, and `random_standard_normal` checks out independently (i.e. it matches the samples produced by NumPy).

The test parameters seem to imply that the break condition `V <= 0.0` might not be met very quickly (e.g. `scipy.stats.norm.cdf(0, loc=1, scale=0.2)` is ~`2.8e-7`), so perhaps what we're seeing is a performance problem.
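As an aside, that quoted tail probability can be sanity-checked with only the standard library (no SciPy), since `norm.cdf(0, loc=1, scale=0.2)` is just the standard normal CDF evaluated at `z = (0 - 1) / 0.2 = -5`:

```python
import math

# Standard normal CDF at z = (0 - 1) / 0.2 = -5, expressed via erf.
z = (0.0 - 1.0) / 0.2
p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
print(p)  # ~2.87e-7, consistent with the ~2.8e-7 quoted above
```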
The following experiment appears to confirm that there is a noticeable performance difference when sampling from `standard_normal` alone:
```python
import numpy as np
from numba import njit

@njit
def numba_standard_normal(rng):
    return rng.standard_normal(10000)

res = numba_standard_normal(np.random.default_rng(seed=230923))

%timeit _ = numba_standard_normal(np.random.default_rng(seed=230923))
# 304 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = np.random.default_rng(seed=230923).standard_normal(10000)
# 157 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

exp_res = np.random.default_rng(seed=230923).standard_normal(10000)
assert np.array_equal(res, exp_res)
```
It still doesn't seem like taking twice as long to draw a standard normal is enough to explain the divergence of that loop, though. Perhaps some aspects of the call to `random_standard_normal` within the `while` loop are compounding the latency issue observed above; otherwise, I'm not sure what else could be causing this issue right now.
numba/np/ziggurat_constants.py (Outdated)

```
@@ -0,0 +1,1212 @@

ki_double = (
```
RE: #7900 (review)

I suggest making these large constant pools into `np.array()`s. In the current implementation, this is a tuple, so it will appear as an aggregate type in the LLVM IR and also have to adhere to Python tuple semantics. This has two effects: first, indexing ends up going via a massive switch table to jump to the right value (as opposed to a cheap LLVM `getelementptr` as is present in NumPy array indexing); second, tuple indexing semantics like negative indexing and raising on an invalid index have to be dealt with. The combination of these two things is probably what's causing the slowdown in comparison to a relatively direct, non-bounds-checked, read-only NumPy array access.
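A minimal sketch of the suggested change (the names and values below are placeholders, not the real ziggurat tables): build the pool once at module level as a read-only NumPy array, so indexed reads become plain array loads:

```python
import numpy as np

# Hypothetical excerpt of a constant pool; the real ki_double table in
# ziggurat_constants.py has over a thousand entries.
ki_double_tuple = (0, 12, 34, 56, 78)

# Suggested form: a module-level, read-only NumPy array.
ki_double = np.asarray(ki_double_tuple, dtype=np.uint64)
ki_double.setflags(write=False)

# Same values either way; only the indexing machinery differs.
assert all(int(ki_double[i]) == ki_double_tuple[i]
           for i in range(len(ki_double_tuple)))
```

Marking the array read-only is a defensive choice here, since a shared module-level table should never be mutated by callers.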
I tried making the constant pools a NumPy array; it did make the code run a bit faster, but that difference was small compared to the overall gap, even if we remove all the extra logic and just focus on the CFFI bindings. For instance:
```python
import numpy as np
from numba import njit

@njit
def numba_standard_normal(rng):
    return rng.random(10000)

res = numba_standard_normal(np.random.default_rng(seed=230923))

%timeit _ = numba_standard_normal(np.random.default_rng(seed=230923))
# 151 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit _ = np.random.default_rng(seed=230923).random(10000)
# 99.8 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

exp_res = np.random.default_rng(seed=230923).random(10000)
assert np.array_equal(res, exp_res)
```
This internally simply calls the `next_float` or `next_double` bindings, yet it's still slower.
I see. If the output is 100_000 in size as opposed to 10_000, does it make a difference? I guess I'll need to go and look at the machine code for both implementations. Nothing obvious jumps out from a quick skim of the code generation.
I've been using this:

```python
import numpy as np
from IPython import get_ipython
from numba import njit

ipython = get_ipython()

for sz in [0, *(10**np.arange(7))]:
    print(f"Testing size: {sz}".center(80, '-'))

    @njit
    def numba_standard_normal(rng):
        return rng.random(sz)

    res = numba_standard_normal(np.random.default_rng(seed=230923))
    ipython.run_line_magic('timeit', '_ = numba_standard_normal(np.random.default_rng(seed=230923))')
    ipython.run_line_magic('timeit', f'_ = np.random.default_rng(seed=230923).random({sz})')
    exp_res = np.random.default_rng(seed=230923).random(sz)
    assert np.array_equal(res, exp_res)
```

and running with:

```
ipython <scriptname.py>
```
Some results:

```
--------------------------------Testing size: 0---------------------------------
116 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.8 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 1---------------------------------
115 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 10--------------------------------
117 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 100--------------------------------
116 µs ± 638 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
59.1 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 1000-------------------------------
122 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66.8 µs ± 885 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
------------------------------Testing size: 10000-------------------------------
185 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
130 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
------------------------------Testing size: 100000------------------------------
810 µs ± 6.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
742 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-----------------------------Testing size: 1000000------------------------------
7.37 ms ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.22 ms ± 7.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It seems like the cost of dispatch in Numba is quite high.
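Taking the mean timings above at face value, the numbers are consistent with a roughly constant per-call overhead rather than a per-element cost. A quick back-of-the-envelope check (values transcribed from the results above, in µs):

```python
# Mean timings (µs) from the results above: (size, numba, numpy)
results = [
    (0, 116, 58.8),
    (1, 115, 58.3),
    (10, 117, 58.5),
    (100, 116, 59.1),
    (1_000, 122, 66.8),
    (10_000, 185, 130),
    (100_000, 810, 742),
    (1_000_000, 7_370, 7_220),
]

# The absolute gap is roughly flat for small outputs (mid-50s to high-60s
# of µs), which looks like fixed dispatch overhead; at large sizes it is
# negligible relative to the total time.
gaps = [nb - np_ for _, nb, np_ in results]
rel = [(nb - np_) / np_ for _, nb, np_ in results]
print(gaps)
print(f"relative gap at size 0: {rel[0]:.0%}; at size 1e6: {rel[-1]:.1%}")
```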
@kc611 in the interests of ensuring that this functionality is supported, what do you think about deferring concerns about dispatch performance (#7900 (comment)) with a view to focusing on getting the support/implementation complete? Whilst it'd be nice to have faster dispatch for
The real problem here isn't the lack of performance itself, but its manifestation in derived sampling algorithms. For instance, try running
I see, thanks for the explanation. I think the first thing that needs working out is whether the performance slowdown is coming from the calls to e.g.
When I walked through it, it looked like it was following the exact same path as NumPy. For instance, with the same seed and distribution parameters, that loop took a lot of iterations in pure Python/NumPy as well; but since it was significantly faster, it wasn't a problem.
In attempting to work out where the issue lies, this sample code:

```python
import numpy as np
from numba import njit

@njit
def numba_standard_uniform_fill(rng):
    return rng.random(10000)

res = numba_standard_uniform_fill(np.random.default_rng(seed=230923))
print(numba_standard_uniform_fill.inspect_asm(numba_standard_uniform_fill.signatures[0]))
```

ends up calling this implementation in Numba (numba/numba/np/npy_randomgen.py, lines 605 to 609 in ddbbc2a),
which is doing approximately the same thing as NumPy's `random_standard_uniform_fill`. This gives:

```asm
_ZN8__main__27numba_standard_uniform_fill<abi>NumPyRandomGeneratorType:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        pushq   %rax
        .cfi_def_cfa_offset 64
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        movq    %r8, %r13
        movq    %rdi, %r14
        movq    72(%rsp), %r12
        movabsq $NRT_MemInfo_alloc_safe_aligned, %rax
        movl    $80000, %edi
        movl    $32, %esi
        callq   *%rax
        movq    %rax, (%rsp)
        movq    24(%rax), %rbx
        xorl    %ebp, %ebp
        .p2align 4, 0x90
.LBB0_1:                              ; <---- This is the loop filling the array
        leaq    1(%rbp), %r15         ; |
        movq    %r13, %rdi            ; |
        callq   *%r12                 ; | Call to `next_double`
        movsd   %xmm0, (%rbx,%rbp,8)  ; |
        movq    %r15, %rbp            ; |
        cmpq    $10000, %r15          ; |
        jne     .LBB0_1               ; ----
        movq    (%rsp), %rax
        movq    %rax, (%r14)
        movabsq $.LCPI0_0, %rax
        movaps  (%rax), %xmm0
        movups  %xmm0, 8(%r14)
        movq    $8, 24(%r14)
        movq    %rbx, 32(%r14)
        movabsq $.LCPI0_1, %rax
        movaps  (%rax), %xmm0
        movups  %xmm0, 40(%r14)
        xorl    %eax, %eax
        addq    $8, %rsp
        .cfi_def_cfa_offset 56
        popq    %rbx
        .cfi_def_cfa_offset 48
        popq    %r12
        .cfi_def_cfa_offset 40
        popq    %r13
        .cfi_def_cfa_offset 32
        popq    %r14
        .cfi_def_cfa_offset 24
        popq    %r15
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        retq
```
Looking at the machine code for NumPy's own implementation:

```asm
000000000006df20 <random_standard_uniform_fill>:
   6df20: 48 85 f6              test   %rsi,%rsi
   6df23: 7e 33                 jle    6df58 <random_standard_uniform_fill+0x38>
   6df25: 41 54                 push   %r12
   6df27: 4c 8d 24 f2           lea    (%rdx,%rsi,8),%r12
   6df2b: 55                    push   %rbp
   6df2c: 48 89 fd              mov    %rdi,%rbp
   6df2f: 53                    push   %rbx
   6df30: 48 89 d3              mov    %rdx,%rbx
   6df33: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
   6df38: 48 8b 7d 00           mov    0x0(%rbp),%rdi    ; <---- This is the loop filling the array
   6df3c: 48 83 c3 08           add    $0x8,%rbx         ; |
   6df40: ff 55 18              callq  *0x18(%rbp)       ; | Call to `next_double`
   6df43: f2 0f 11 43 f8        movsd  %xmm0,-0x8(%rbx)  ; |
   6df48: 4c 39 e3              cmp    %r12,%rbx         ; |
   6df4b: 75 eb                 jne    6df38 <random_standard_uniform_fill+0x18> ; ----
   6df4d: 5b                    pop    %rbx
   6df4e: 5d                    pop    %rbp
   6df4f: 41 5c                 pop    %r12
   6df51: c3                    retq
   6df52: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
   6df58: c3                    retq
   6df59: 0f 1f 80 00 00 00 00  nopl   0x0(%rax)
```

the loop is basically the same.
Adding in the proposed change to the Ziggurat constants here #7900 (comment) (i.e. make them arrays), alongside the potential fix, and running:

```python
import numpy as np
from IPython import get_ipython
from numba import njit

ipython = get_ipython()

for sz in [0, *(10**np.arange(7))]:
    print(f"Testing size: {sz}".center(80, '-'))

    @njit
    def numba_gamma(rng):
        return rng.gamma(sz, size=sz)

    res = numba_gamma(np.random.default_rng(seed=230923))
    ipython.run_line_magic('timeit', '_ = numba_gamma(np.random.default_rng(seed=230923))')
    ipython.run_line_magic('timeit', f'_ = np.random.default_rng(seed=230923).gamma({sz}, size=({sz},))')
    exp_res = np.random.default_rng(seed=230923).gamma(sz, size=sz)
    assert np.array_equal(res, exp_res)
```

I get:
The "size 0" case shows the cost of the dispatch involving a
Do you mean a full LLVM port of
I'm not sure right now that that would help unless extra runtime performance was needed. The dispatch cost is due to Numba not having a fast path to work out the type of BitGenerators; it essentially goes back into the interpreter to find out. The general fix for this is likely hard.
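As a generic illustration of the kind of fast path being described (this is not Numba's actual dispatch machinery, just a sketch of the general pattern): cache the resolved implementation per argument type, so the expensive resolution step, the analogue of dropping back into the interpreter, runs once per type rather than once per call:

```python
def make_dispatcher(resolve):
    """Wrap `resolve(type) -> callable`, assumed expensive, with a per-type cache."""
    cache = {}

    def dispatch(obj, *args):
        impl = cache.get(type(obj))
        if impl is None:
            impl = resolve(type(obj))   # slow path: runs once per type
            cache[type(obj)] = impl
        return impl(obj, *args)         # fast path thereafter

    return dispatch

# Toy usage: count how often the slow path actually runs.
calls = []

def resolve(tp):
    calls.append(tp)
    return lambda obj: obj * 2

dispatch = make_dispatcher(resolve)
print(dispatch(3), dispatch(4), dispatch(2.5))  # 6 8 5.0
print(len(calls))  # 2 -- resolved once for int, once for float
```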
ee9e920 to 018a85c (compare)
@kc611 Any chance you could please fix the flake8 violation in the patch? It'd be great to know that this is passing all tests on public CI, but they won't run unless that flake8 thing is fixed! Many thanks.
```
@@ -11,6 +11,11 @@
# terminal color markup
_termcolor = errors.termcolor()

try:
```
This could probably do with a comment explaining why two code paths are needed.
There was a discrepancy in module names from version to version. (They renamed the module in a newer version, hence I had to account for both of them.)
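That version guard can be sketched generically like this (the candidate module names here are placeholders; the actual NumPy module names involved are not shown in this thread):

```python
import importlib

def import_first(*candidates):
    """Return the first importable module from a list of candidate names,
    e.g. to cope with a module being renamed between library versions."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {candidates!r} could be imported")

# Toy usage with a deliberately missing first candidate:
mod = import_first("no_such_module_xyz", "math")
print(mod.__name__)  # math
```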
Resolves #4499

Initial Numba `BitGenerator` implementation by @stuartarchibald (#4499 (comment)).

Currently supported `Generator` methods are:
Generator().random
Generator().standard_normal()
Generator().standard_exponential()
Generator().standard_gamma()
Generator().normal()
Generator().exponential()
Generator().gamma()
Generator().beta()
Generator().f()
Generator().chisquare()
Generator().standard_cauchy()
Generator().pareto()
Generator().weibull()
Generator().power()
Generator().laplace()
Generator().gumbel()
Generator().logistic()
Generator().lognormal()
Generator().rayleigh()
Generator().standard_t()
Generator().wald()
Generator().von_mises()
Generator().geometric()
Generator().zipf()
Generator().triangular()
Generator().poisson()
Generator().negative_binomial()
The remaining methods to be ported are:
Generator().integers()
Generator().choice()
Generator().bytes()
Generator().shuffle()
Generator().permutation()
Generator().permuted()
Generator().binomial()
Generator().multivariate_normal()
Generator().multinomial()
Generator().multivariate_hypergeometric()
Generator().dirichlet()