
Added Support for Numpy BitGenerators #7900

Closed
wants to merge 2 commits into from

Conversation

kc611
Collaborator

@kc611 kc611 commented Mar 10, 2022

Resolves #4499

Initial Numba BitGenerator implementation by @stuartarchibald (#4499 (comment))

The Generator methods currently supported are:

  • Generator().random()
  • Generator().standard_normal()
  • Generator().standard_exponential()
  • Generator().standard_gamma()
  • Generator().normal()
  • Generator().exponential()
  • Generator().gamma()
  • Generator().beta()
  • Generator().f()
  • Generator().chisquare()
  • Generator().standard_cauchy()
  • Generator().pareto()
  • Generator().weibull()
  • Generator().power()
  • Generator().laplace()
  • Generator().gumbel()
  • Generator().logistic()
  • Generator().lognormal()
  • Generator().rayleigh()
  • Generator().standard_t()
  • Generator().wald()
  • Generator().von_mises()
  • Generator().geometric()
  • Generator().zipf()
  • Generator().triangular()
  • Generator().poisson()
  • Generator().negative_binomial()

The remaining methods to be ported are:

  • Generator().integers()
  • Generator().choice()
  • Generator().bytes()
  • Generator().shuffle()
  • Generator().permutation()
  • Generator().permuted()
  • Generator().binomial()
  • Generator().multivariate_normal()
  • Generator().multinomial()
  • Generator().multivariate_hypergeometric()
  • Generator().dirichlet()

@stuartarchibald
Contributor

Tests are failing on NumPy 1.18; I think the BitGenerator class existed there, but with an underscore prefix on the module name:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.18.5'

In [3]: np.random.BitGenerator
Out[3]: numpy.random._bit_generator.BitGenerator

Note _bit_generator.pxd in 1.18 (has underscore)
https://github.com/numpy/numpy/tree/maintenance/1.18.x/numpy/random

and then bit_generator.pxd in 1.19
https://github.com/numpy/numpy/tree/maintenance/1.19.x/numpy/random
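A version-tolerant import based on the rename above can be sketched like this (a sketch only; the private-module fallback is only needed on 1.18-era installs):

```python
# Sketch of a version-tolerant import for the rename noted above:
# NumPy 1.19+ exposes BitGenerator from numpy.random directly, while
# 1.18 kept it in the underscore-prefixed numpy.random._bit_generator.
try:
    from numpy.random import BitGenerator  # NumPy >= 1.19
except ImportError:
    from numpy.random._bit_generator import BitGenerator  # NumPy 1.18

print(BitGenerator.__name__)
```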

@stuartarchibald
Contributor

@kc611 Many thanks for creating a patch from #4499 (comment) and updating it with standard_normal.

@brandonwillard any chance you could please give this a first review once CI passes? Thanks in advance!

@brandonwillard
Contributor

> @brandonwillard any chance you could please give this a first review once CI passes? Thanks in advance!

Of course.

        return out
    return impl

# Overload the Generator().standard_gamma() method
Contributor

Suggested change (a whitespace-only edit; the comment line itself is unchanged):
# Overload the Generator().standard_gamma() method

think this should fix flake8

Contributor

@kc611 ^ This should fix flake8 and let public CI run fully.

Contributor

@brandonwillard left a comment

@kc611 mentioned that there were some issues with the time taken by the new TestRandomGenerators.test_standard_gamma test (and that those tests might not have actually been run in CI). I can reproduce the issue locally, and it looks like numba.np.npy_randomgen.random_standard_gamma is not exiting its inner loop.

The port of random_standard_gamma and the function it calls in the inner-loop, random_standard_normal, look like faithful adaptations, and random_standard_normal checks out independently (i.e. it matches the samples produced by NumPy).

The test parameters seem to imply that the break condition V <= 0.0 might not be met very quickly (e.g. scipy.stats.norm.cdf(0, loc=1, scale=0.2) is ~2.8e-7), so perhaps what we're seeing is a performance problem.
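That probability can be reproduced without SciPy; a quick sketch via the standard normal CDF expressed with math.erfc:

```python
import math

# P(V <= 0) for V ~ Normal(loc=1, scale=0.2): the per-iteration chance of
# hitting the break condition quoted above (scipy gives ~2.8e-7).
def norm_cdf(x, loc=0.0, scale=1.0):
    # Phi(z) = 0.5 * erfc(-z / sqrt(2)) with z = (x - loc) / scale
    return 0.5 * math.erfc((loc - x) / (scale * math.sqrt(2.0)))

p = norm_cdf(0.0, loc=1.0, scale=0.2)
print(f"{p:.3g}")
```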

The following experiment appears to confirm that there is a noticeable performance difference when sampling from standard_normal alone:

import numpy as np

from numba import njit


@njit
def numba_standard_normal(rng):
    return rng.standard_normal(10000)


res = numba_standard_normal(np.random.default_rng(seed=230923))

%timeit _ = numba_standard_normal(np.random.default_rng(seed=230923))
# 304 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = np.random.default_rng(seed=230923).standard_normal(10000)
# 157 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

exp_res = np.random.default_rng(seed=230923).standard_normal(10000)
assert np.array_equal(res, exp_res)

It still doesn't seem like taking twice as long to draw a standard normal is enough to explain the divergence of that loop, though. Perhaps some aspect of the call to random_standard_normal within the while loop is compounding the latency issue observed above; otherwise, I'm not sure what else could be causing this issue right now.

@@ -0,0 +1,1212 @@


ki_double = (
Contributor

RE: #7900 (review)

I suggest turning these large constant pools into np.array()s. In the current implementation this is a tuple, so it appears as an aggregate type in the LLVM IR and also has to adhere to Python tuple semantics. This has two effects: first, indexing ends up going via a massive switch table to jump to the right value (as opposed to a cheap LLVM getelementptr, as used for NumPy array indexing); second, tuple indexing semantics such as negative indexing and raising on an invalid index have to be dealt with. The combination of these two things is probably what's causing the slowdown in comparison to a relatively direct, non-bounds-checked, read-only NumPy array access.
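A minimal sketch of that suggestion (with placeholder values; the real ki_double table in the patch is far larger): store the pool as a module-level read-only NumPy array.

```python
import numpy as np

# Hypothetical stand-in for a Ziggurat constant pool: a read-only float64
# array instead of a Python tuple, so indexing in jitted code lowers to a
# direct load rather than a switch over tuple elements.
ki_double = np.array([0.0, 0.25, 0.5, 0.75], dtype=np.float64)
ki_double.setflags(write=False)  # guard against accidental mutation

val = ki_double[2]
```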

Collaborator Author

I tried making the constant pools a NumPy array; it did make the code run a bit faster, but the improvement was small compared to the overall difference, even if we remove all the extra logic and focus on just the CFFI bindings. For instance:

import numpy as np

from numba import njit


@njit
def numba_standard_normal(rng):
    return rng.random(10000)


res = numba_standard_normal(np.random.default_rng(seed=230923))

%timeit _ = numba_standard_normal(np.random.default_rng(seed=230923))
# 151 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit _ = np.random.default_rng(seed=230923).random(10000)
# 99.8 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

exp_res = np.random.default_rng(seed=230923).random(10000)
assert np.array_equal(res, exp_res)

This internally just calls the next_float or next_double bindings, and it's still slower.
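For reference, those next_double calls can also be exercised straight from Python through NumPy's ctypes interface on a BitGenerator (a sketch using PCG64, the default BitGenerator; the seed is arbitrary):

```python
import numpy as np

# Call next_double directly through the BitGenerator's ctypes interface;
# this is the function pointer that the bindings discussed above wrap.
bg = np.random.PCG64(1234)
next_double = bg.ctypes.next_double
state = bg.ctypes.state

vals = [next_double(state) for _ in range(3)]  # doubles in [0, 1)
```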

Contributor

I see. If the output is 100_000 in size as opposed to 10_000, does it make a difference? I guess I'll need to go and look at the machine code for both implementations; nothing obvious jumps out from a quick skim of the code generation.

Contributor

I've been using this:

import numpy as np
from IPython import get_ipython
from numba import njit

ipython = get_ipython()


for sz in [0, *(10**(np.arange(7)))]:
    print(f"Testing size: {sz}".center(80, '-'))
    @njit
    def numba_standard_normal(rng):
        return rng.random(sz)


    res = numba_standard_normal(np.random.default_rng(seed=230923))

    ipython.run_line_magic('timeit', f'_ = numba_standard_normal(np.random.default_rng(seed=230923))')
    ipython.run_line_magic('timeit', f'_ = np.random.default_rng(seed=230923).random({sz})')

    exp_res = np.random.default_rng(seed=230923).random(sz)
    assert np.array_equal(res, exp_res)

and running with:

ipython <scriptname.py>

Some results:

--------------------------------Testing size: 0---------------------------------
116 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.8 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 1---------------------------------
115 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 10--------------------------------
117 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 100--------------------------------
116 µs ± 638 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
59.1 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 1000-------------------------------
122 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66.8 µs ± 885 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
------------------------------Testing size: 10000-------------------------------
185 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
130 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
------------------------------Testing size: 100000------------------------------
810 µs ± 6.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
742 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-----------------------------Testing size: 1000000------------------------------
7.37 ms ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.22 ms ± 7.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems like the cost of dispatch in Numba is quite high.

@stuartarchibald
Contributor

@kc611 in the interests of ensuring that this functionality is supported, what do you think about deferring concerns about dispatch performance (#7900 (comment)) with a view to focusing on getting the support/implementation complete? Whilst it'd be nice to have faster dispatch for Generator/BitGenerator, this can be done later (I think it is also likely to get quite involved).

@kc611
Collaborator Author

kc611 commented Mar 22, 2022

> @kc611 in the interests of ensuring that this functionality is supported, what do you think about deferring concerns about dispatch performance (#7900 (comment)) with a view to focusing on getting the support/implementation complete? Whilst it'd be nice to have faster dispatch for Generator/BitGenerator, this can be done later (I think it is also likely to get quite involved).

The real problem here isn't the lack of performance itself, but its manifestation in derived sampling algorithms. For instance, try running the Generator().gamma() function: even though it's a direct port of the NumPy algorithm, it gets stuck in a near-infinite loop (it does actually run, but extremely slowly, seemingly without ever finishing). I might be wrong about this being down to performance, but it certainly looks that way.

@stuartarchibald
Contributor

> @kc611 in the interests of ensuring that this functionality is supported, what do you think about deferring concerns about dispatch performance (#7900 (comment)) with a view to focusing on getting the support/implementation complete? Whilst it'd be nice to have faster dispatch for Generator/BitGenerator, this can be done later (I think it is also likely to get quite involved).

> The real problem here isn't the lack of performance itself, but its manifestation in derived sampling algorithms. For instance, try running the Generator().gamma() function: even though it's a direct port of the NumPy algorithm, it gets stuck in a near-infinite loop (it does actually run, but extremely slowly, seemingly without ever finishing). I might be wrong about this being down to performance, but it certainly looks that way.

I see, thanks for the explanation. I think the first thing that needs working out is whether the performance slowdown is coming from the calls to e.g. next_double, or whether the algorithm isn't following exactly the same numerical path as NumPy (or a combination of both). One experiment might be to investigate whether the Numba translation produces exactly the same state as NumPy on each iteration of e.g. the while loops in Generator.gamma(), and whether the total number of loop iterations is the same; that would perhaps give an indication of where the problem lies.
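Since a Generator exposes its BitGenerator state as a dict, that experiment can be harnessed from Python; a sketch (using two NumPy Generators as stand-ins, where one side would be the Numba-compiled sampler under test):

```python
import numpy as np

# Draw from two identically seeded generators and check that the
# underlying BitGenerator state stays in lockstep after every sample.
ref = np.random.default_rng(seed=230923)
candidate = np.random.default_rng(seed=230923)

for i in range(10):
    a = ref.standard_normal()
    b = candidate.standard_normal()
    assert a == b, f"sample diverged at iteration {i}"
    assert ref.bit_generator.state == candidate.bit_generator.state, (
        f"state diverged at iteration {i}"
    )
```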

@brandonwillard
Contributor

brandonwillard commented Mar 22, 2022

> I see, thanks for the explanation. I think the first thing that needs working out is whether the performance slowdown is coming from the calls to e.g. next_double, or whether the algorithm isn't following exactly the same numerical path as NumPy (or a combination of both). One experiment might be to investigate whether the Numba translation produces exactly the same state as NumPy on each iteration of e.g. the while loops in Generator.gamma(), and whether the total number of loop iterations is the same; that would perhaps give an indication of where the problem lies.

When I walked through it, it looked like it was following the exact same path as NumPy. For instance, with the same seed and distribution parameters, that loop took a lot of iterations in pure Python/NumPy as well, but, since it was significantly faster, it wasn't a problem.

@stuartarchibald
Contributor

In attempting to work out where the issue lies, this sample code:

import numpy as np
from numba import njit

@njit
def numba_standard_uniform_fill(rng):
    return rng.random(10000)

res = numba_standard_uniform_fill(np.random.default_rng(seed=230923))
print(numba_standard_uniform_fill.inspect_asm(numba_standard_uniform_fill.signatures[0]))

ends up calling this implementation in Numba:

def impl(inst, size=None, dtype=np.float64):
    out = np.empty(size, dtype=dtype)
    for i in range(out.size):
        out[i] = dist_func(inst.bit_generator)
    return out

which does approximately the same thing as random_standard_uniform_fill from here:
https://github.com/numpy/numpy/blob/2f1410d3f2bc1058044ec0f1906e24b7f14e56a5/numpy/random/src/distributions/distributions.c#L32-L37
i.e. there's a simple loop in which next_double is called and the results are written into an array.

This gives:

_ZN8__main__27numba_standard_uniform_fill<abi>NumPyRandomGeneratorType:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        pushq   %rax
        .cfi_def_cfa_offset 64
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        movq    %r8, %r13
        movq    %rdi, %r14
        movq    72(%rsp), %r12
        movabsq $NRT_MemInfo_alloc_safe_aligned, %rax
        movl    $80000, %edi
        movl    $32, %esi
        callq   *%rax
        movq    %rax, (%rsp)
        movq    24(%rax), %rbx
        xorl    %ebp, %ebp
        .p2align        4, 0x90
.LBB0_1: ; <---------------------------- This is the loop filling the array
        leaq    1(%rbp), %r15 ;        |
        movq    %r13, %rdi ;           |
        callq   *%r12 ;                | Call to `next_double`
        movsd   %xmm0, (%rbx,%rbp,8) ; | 
        movq    %r15, %rbp ;           |
        cmpq    $10000, %r15 ;         |
        jne     .LBB0_1 ; --------------
        movq    (%rsp), %rax
        movq    %rax, (%r14)
        movabsq $.LCPI0_0, %rax
        movaps  (%rax), %xmm0
        movups  %xmm0, 8(%r14)
        movq    $8, 24(%r14)
        movq    %rbx, 32(%r14)
        movabsq $.LCPI0_1, %rax
        movaps  (%rax), %xmm0
        movups  %xmm0, 40(%r14)
        xorl    %eax, %eax
        addq    $8, %rsp
        .cfi_def_cfa_offset 56
        popq    %rbx
        .cfi_def_cfa_offset 48
        popq    %r12
        .cfi_def_cfa_offset 40
        popq    %r13
        .cfi_def_cfa_offset 32
        popq    %r14
        .cfi_def_cfa_offset 24
        popq    %r15
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        retq

Looking at the machine code for random_standard_uniform_fill from the precompiled NumPy DSOs e.g.
site-packages/numpy/random/_generator.cpython-39-x86_64-linux-gnu.so:

000000000006df20 <random_standard_uniform_fill>:
   6df20:       48 85 f6                test   %rsi,%rsi
   6df23:       7e 33                   jle    6df58 <random_standard_uniform_fill+0x38>
   6df25:       41 54                   push   %r12
   6df27:       4c 8d 24 f2             lea    (%rdx,%rsi,8),%r12
   6df2b:       55                      push   %rbp
   6df2c:       48 89 fd                mov    %rdi,%rbp
   6df2f:       53                      push   %rbx
   6df30:       48 89 d3                mov    %rdx,%rbx
   6df33:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   6df38:       48 8b 7d 00             mov    0x0(%rbp),%rdi ; <---------------------------- This is the loop filling the array
   6df3c:       48 83 c3 08             add    $0x8,%rbx ;                                  |
   6df40:       ff 55 18                callq  *0x18(%rbp) ;                                | Call to `next_double`
   6df43:       f2 0f 11 43 f8          movsd  %xmm0,-0x8(%rbx) ;                           |
   6df48:       4c 39 e3                cmp    %r12,%rbx ;                                  |
   6df4b:       75 eb                   jne    6df38 <random_standard_uniform_fill+0x18> ;---
   6df4d:       5b                      pop    %rbx
   6df4e:       5d                      pop    %rbp
   6df4f:       41 5c                   pop    %r12
   6df51:       c3                      retq
   6df52:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
   6df58:       c3                      retq
   6df59:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

the loop is basically the same.

@stuartarchibald
Contributor

Adding in the proposed change to the Ziggurat constants from #7900 (comment) (i.e. making them arrays), alongside the potential fix to random_standard_gamma from #7900 (comment), and using this script to test the performance of the gamma method on the Generator object:

import numpy as np
from IPython import get_ipython
from numba import njit
ipython = get_ipython()


for sz in [0, *(10**(np.arange(7)))]:
    print(f"Testing size: {sz}".center(80, '-'))
    @njit
    def numba_gamma(rng):
        return rng.gamma(sz, size=sz)


    res = numba_gamma(np.random.default_rng(seed=230923))

    ipython.run_line_magic('timeit', f'_ = numba_gamma(np.random.default_rng(seed=230923))')
    ipython.run_line_magic('timeit', f'_ = np.random.default_rng(seed=230923).gamma({sz}, size=({sz},))')

    exp_res = np.random.default_rng(seed=230923).gamma(sz, size=sz)
    assert np.array_equal(res, exp_res)

I get:

--------------------------------Testing size: 0---------------------------------
119 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
70 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 1---------------------------------
116 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.9 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
--------------------------------Testing size: 10--------------------------------
118 µs ± 71 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
70.5 µs ± 72.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 100--------------------------------
124 µs ± 63 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
79.2 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
-------------------------------Testing size: 1000-------------------------------
183 µs ± 49.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
141 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
------------------------------Testing size: 10000-------------------------------
784 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
773 µs ± 366 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
------------------------------Testing size: 100000------------------------------
6.78 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.05 ms ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-----------------------------Testing size: 1000000------------------------------
67.4 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
70.4 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The "size 0" case shows the cost of dispatch involving a Generator object; it's not cheap, and I suggest deferring fixing it as it is likely very involved. This cost can be seen as a constant ~50µs overhead (119µs - 70µs ≈ 50µs). It looks like Numba manages to amortize it somewhere around 10k elements locally.
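The amortization estimate can be sanity-checked with the numbers above (a rough model: total time ≈ constant dispatch overhead + per-element cost):

```python
# Figures taken from the timings above, in microseconds.
overhead_us = 119 - 70            # constant dispatch cost seen at size 0
numba_10k_us = 784                # Numba's gamma at 10_000 elements

# At 10k elements the fixed overhead is only ~6% of total runtime, which
# is why the Numba and NumPy timings converge around that size.
frac = overhead_us / numba_10k_us
print(f"{frac:.4f}")  # 0.0625
```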

@kc611
Collaborator Author

kc611 commented Mar 23, 2022

> I suggest deferring fixing it as it is likely very involved.

Do you mean a full LLVM port of the BitGenerator methods (or are there other ways to do that)? How valuable do you think it would be to do that?

@stuartarchibald
Contributor

> I suggest deferring fixing it as it is likely very involved.

> Do you mean a full LLVM port of the BitGenerator methods? Do you think it would be valuable to do that?

I'm not sure right now that that would help unless extra runtime performance were needed. The dispatch cost is due to Numba not having a fast path to work out the type of a BitGenerator; it essentially goes back into the interpreter to find out. The general fix for this is likely hard.

@stuartarchibald
Contributor

@kc611 Any chance you could please fix the flake8 violation in the patch:

numba/np/npy_randomgen.py:757:1: E302 expected 2 blank lines, found 1

It'd be great to know that this is passing all tests on public CI, but they won't run unless that flake8 violation is fixed! Many thanks.

@@ -11,6 +11,11 @@
# terminal color markup
_termcolor = errors.termcolor()

try:
Member

This could probably do with a comment explaining why the two code paths are needed.

Collaborator Author

There was a discrepancy in the module name from version to version: NumPy renamed the module in a newer release, so I had to account for both names.

@kc611
Collaborator Author

kc611 commented May 12, 2022

This PR has been split into the following PRs due to size constraints: #8031, #8038, #8040, #8041, #8042

Development and reviews will now continue within these.

@kc611 kc611 closed this May 12, 2022
@esc added the "abandoned (PR is abandoned, no reason required)" label May 12, 2022
@esc
Member

esc commented May 12, 2022

> This PR has been split into the following PRs due to size constraints: #8031, #8038, #8040, #8041, #8042
>
> Development and reviews will now continue within these.

Excellent, thank you for the update. I have updated the labels accordingly. 👍

Labels: abandoned (PR is abandoned, no reason required), Effort - long (long size effort needed)

Successfully merging this pull request may close these issues:

Accept Numpy (1.17+) Generator and BitGenerator objects in jitted code