
SIMD: add fast integer division intrinsics for all supported platforms #18178

Merged
8 commits merged into numpy:main from the npyv_fast_div branch on Mar 8, 2021

Conversation

@seiko2plus (Member) commented Jan 16, 2021

Add fast integer division intrinsics for all supported platforms

merge before #18075

Almost all architectures (except Power10) lack native integer vector division,
and the cost of scalar division on architectures like x86 is high: it can take
30 to 40 cycles on modern chips and up to 100 on older ones.

Therefore, we implement division by multiplying with a precomputed reciprocal.
The method used in this implementation is based on T. Granlund and P. L. Montgomery,
"Division by invariant integers using multiplication" (see Figures 4.1 and 5.1,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556).

It shows a good improvement on all architectures, especially on x86.
However, computing the divisor parameters is relatively expensive, so this
implementation should only be used when the divisor is a scalar that is reused
multiple times.
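
To illustrate the idea, here is a minimal scalar sketch of the unsigned 32-bit case
(the struct and helper names are hypothetical; this is not the code added by this PR):

/* Scalar sketch of Granlund & Montgomery, Figure 4.1 (unsigned 32-bit).
   Illustrative only; helper and struct names are hypothetical. */
#include <stdint.h>
#include <assert.h>

typedef struct { uint32_t mul, sh1, sh2; } udiv32_params;

/* Precompute the multiplier and shift counts for a fixed divisor d != 0. */
static udiv32_params udiv32_divisor(uint32_t d)
{
    assert(d != 0);
    uint32_t l = 0;
    while ((1ULL << l) < d) {
        ++l;                                 /* l = ceil(log2(d)) */
    }
    udiv32_params p;
    /* m = floor(2^32 * (2^l - d) / d) + 1 */
    p.mul = (uint32_t)(((((uint64_t)1 << l) - d) << 32) / d + 1);
    p.sh1 = l < 1 ? l : 1;                   /* min(l, 1) */
    p.sh2 = l > 0 ? l - 1 : 0;               /* max(l - 1, 0) */
    return p;
}

/* Divide n by the fixed divisor using one multiply-high and two shifts. */
static uint32_t udiv32(uint32_t n, udiv32_params p)
{
    uint32_t t = (uint32_t)(((uint64_t)p.mul * n) >> 32);   /* mulhi(m, n) */
    return (t + ((n - t) >> p.sh1)) >> p.sh2;
}

Roughly speaking, the SIMD intrinsics apply the same precomputed multiplier and shifts
to every lane, with an extra sign-handling step for the signed types.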

The division process is separated into two intrinsics for each data type

  • npyv_{dtype}x3 npyv_divisor_{dtype}({dtype} divisor);
    Computes the divisor parameters (multiplier + shift counts + sign of the divisor, signed types only).

  • npyv_{dtype} npyv_divc_{dtype}(npyv_{dtype} dividend, npyv_{dtype}x3 divisor_parms);
    Performs the final division.

For example:

int vstep = npyv_nlanes_s32;                // number of lanes per vector
int x     = 0x6e70;
npyv_s32x3 divisor = npyv_divisor_s32(x);   // precompute the divisor parameters
// src, dst: pointers to int32 data; len: number of remaining elements
for (; len >= vstep; src += vstep, dst += vstep, len -= vstep) {
    npyv_s32 a = npyv_load_s32(src);        // load s32 vector from memory
             a = npyv_divc_s32(a, divisor); // divide all elements by x
    npyv_store_s32(dst, a);                 // store s32 vector into memory
}

NOTES:

  • For 64-bit division on AArch64 and IBM/Power, we fall back to scalar division,
    since emulating multiply-high is expensive and both architectures have very fast dividers.

TODO:

  • verify armhf
  • cleanup

@seiko2plus added the 01 - Enhancement and component: SIMD labels on Jan 16, 2021
@seiko2plus force-pushed the npyv_fast_div branch 3 times, most recently from af0e706 to b99768b on January 16, 2021 at 22:54
@mattip (Member) commented Jan 17, 2021

I think this should have an attribution, something like

Based on the VCL library, which is (c) Copyright 2012-2020 Agner Fog and licensed under
the Apache License version 2.0.

@rgommers is that correct?

@rgommers (Member)

> I think this should have an attribution, something like
>
> Based on the VCL library, which is (c) Copyright 2012-2020 Agner Fog and licensed under
> the Apache License version 2.0.
>
> @rgommers is that correct?

No, that sounds wrong. Reminder: we don't do Apache 2.0; see for example #13447 (comment).

We could change that at some point, but it's a significant change and needs a strong motivation and decision on the mailing list.

VCL is a little hard to find, so here is a link: https://github.com/vectorclass/version2. If this PR took code from there, that seems like a blocker for accepting it.

@rgommers (Member)

Having now read the PR description: the whole PR is based on that, and it seems valuable. It may be worth considering, especially if there are no good alternatives. We'd be deciding to give up on GPLv2 compatibility.

@seiko2plus (Member, Author)

@rgommers, we can ask for re-licensing, since we're not taking the exact same code, and the original code itself is based on the work of T. Granlund and P. L. Montgomery.

@Qiyu8 (Member) commented Jan 18, 2021

Most of the code was rewritten, but some of it is still very similar to the Apache 2.0 licensed VCL, which is not compatible with the current 3-clause BSD license (Apache 2.0 is more restrictive), so we should either rewrite the similar code or ask Agner Fog to modify the license.

@seiko2plus (Member, Author)

@ganesh-k13, I added intrinsics for the 64-bit division so we can drop libdivide.
@rgommers, @mattip, I changed the term "based" to "adopted" since we're not using the same code.
I don't think we need to leave any license notice, because the original work belongs to "T. Granlund and P. L. Montgomery" in the first place.

@seiko2plus force-pushed the npyv_fast_div branch 2 times, most recently from cd96539 to 0a3ea2a on March 6, 2021 at 17:40
@seiko2plus (Member, Author)

@mattip,

> Coverage is missing for some of the code paths. Is there a way to increase the testing?

I added a missing test case for when the divisor equals 2, but that is still not enough
if the CI worker doesn't support AVX-512. Anyway, the current worker seems to support it,
and codecov/patch reports 97.08% of the diff hit (target 82.50%).

@mattip (Member) commented Mar 7, 2021

It seems there is no test for divisor == 0? Or maybe that case is not needed; if so, maybe add a comment where coverage is complaining that the code path should never be reached.

@seiko2plus force-pushed the npyv_fast_div branch 3 times, most recently from c21c61b to 5d2c57f on March 8, 2021 at 05:35
@seiko2plus (Member, Author) commented Mar 8, 2021

@mattip, calling npyv_divisor_{dtype}(0) from the Python or C level will raise a hardware arithmetic exception. I added LCOV_EXCL_LINE instead to suppress the warning.

@mattip merged commit df2cd0f into numpy:main on Mar 8, 2021
@mattip (Member) commented Mar 8, 2021

Thanks @seiko2plus

@ganesh-k13 (Member)

Hey @seiko2plus, I was trying to add dispatch for signed integers, and in the case of signed 64 bits the "round down" is not happening. Here is the code used to test it: diff.

s8, s16, and s32 seem to work fine.
Simple reproducer

Just confirming from bt:

(gdb) n
simd_divide_by_scalar_contig_s64 (args=0x7fffffffd4b0, len=5) at numpy/core/src/umath/loops_arithmetic.dispatch.c.src:40
40          for (; len >= vstep; len -= vstep, src += vstep, dst += vstep) {
(gdb) n
41              npyv_@sfx@ a = npyv_load_@sfx@(src);
(gdb) 
42              npyv_@sfx@ c = npyv_divc_@sfx@(a, divisor);
(gdb) 
43              npyv_store_@sfx@(dst, c);
(gdb) p a
$1 = {-9223372036854775808, -4611686018427387904, 0, 4611686018427387903}
(gdb) p scalar
$7 = -9223372036854775808
(gdb) p c
$8 = {1, 0, 0, 0}

The last 0 should be -1, as it is 4611686018427387903 // -9223372036854775808. This holds for all dividends > 0 with divisors < 0.

@seiko2plus (Member, Author)

@ganesh-k13, this patch follows the C compilers' behavior for signed division, which means:
1. The quotient is rounded towards zero (truncation) rather than towards negative infinity (floor); for example, -7 / 2 gives -3 with truncation, while Python's -7 // 2 gives -4.
2. Overflow wraps around rather than returning zero and raising an arithmetic exception.
The latter behavior isn't standard, but most compilers follow it, at least when optimization is enabled.

Here is an example in Python of implementing floor division on top of truncating division:

"""
the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
−dsign = SRL(d, N − 1);
−nsign = (n < −dsign );
/* SLT, signed */
−qsign = EOR(−nsign , −dsign );
q = TRUNC((n − (−dsign ) + (−nsign ))/d) − (−qsign );
"""
from numpy.core._simd import baseline as v

for d in [-100, -1, -50, 100, 1, 100]:
    nsign_d = v.setall_s16(int(d < 0))
    divisor = v.divisor_s16(d)
    noverflow = v.cvt_b16_s16(v.setall_s16(-1))
    for n in [-0x8000, -255, 255, -100, 1, 100]:
        a         = v.load_s16(range(n, n + v.nlanes_s16))
        nsign_a   = v.cvt_s16_b16(v.cmplt_s16(a, nsign_d))
        nsign_a   = v.and_s16(nsign_a, v.setall_s16(1))
        diff_sign = v.sub_s16(nsign_a, nsign_d)
        to_ninf   = v.xor_s16(nsign_a, nsign_d)
        trunc     = v.divc_s16(v.add_s16(a, diff_sign), divisor)
        if d == -1:
            greater_min = v.cmpgt_s16(a, v.setall_s16(-0x8000))
            noverflow = v.and_b16(noverflow, greater_min)
            floor = v.ifsub_s16(greater_min, trunc, to_ninf, v.zero_s16())
        else:
            floor = v.sub_s16(trunc, to_ninf)
        assert floor == [x//d for x in a]

    if v.tobits_b16(noverflow) != (1 << v.nlanes_s16)-1:
        0//0
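
For reference, the same floor-from-truncation adjustment as a scalar C sketch
(a hypothetical helper, not part of the NumPy SIMD API):

/* Scalar sketch of the floor-from-truncation adjustment shown above.
   Hypothetical helper, not part of the NumPy SIMD API; assumes d != 0 and
   ignores the n == INT16_MIN, d == -1 overflow case. */
#include <stdint.h>

static int16_t floor_div_s16(int16_t n, int16_t d)
{
    int16_t nsign_d   = (int16_t)(d < 0);        /* 1 if the divisor is negative (-dsign) */
    int16_t nsign_n   = (int16_t)(n < nsign_d);  /* -nsign from the paper, as 0/1        */
    int16_t diff_sign = (int16_t)(nsign_n - nsign_d);
    int16_t to_ninf   = (int16_t)(nsign_n ^ nsign_d);     /* -qsign                      */
    int16_t trunc     = (int16_t)((n + diff_sign) / d);   /* C division truncates        */
    return (int16_t)(trunc - to_ninf);
}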

@ganesh-k13 (Member) commented Apr 10, 2021

Thanks a lot for the info @seiko2plus. As Python prefers rounding towards -inf, shall I replace/extend npyv_divc_s* with a round-towards--inf version as shown in the paper? I will be happy to raise a PR with round to -inf. [EDIT] Or do we build on top of trunc as shown above?

[EDIT 2] My main question is: do we edit npyv_divc_s*, or add code in the dispatch to modify the divisor?

@seiko2plus (Member, Author)

@ganesh-k13, just handle it directly within the dispatch-able source file; note that we will still need the truncating division for timedelta, so we can completely remove libdivide.
