
SIMD: add fast integer division intrinsics for all supported platforms #18178

Merged
8 commits merged into numpy:main from the npyv_fast_div branch on Mar 8, 2021

Conversation

@seiko2plus (Member) commented Jan 16, 2021

Add fast integer division intrinsics for all supported platforms

merge before #18075

Almost all architectures (except Power10) lack native integer vector division,
and the cost of scalar division on architectures like x86 is high: it can take
30 to 40 cycles on modern chips and up to 100 on older ones.

Therefore, we implement division by multiplying with a precomputed reciprocal.
The method used in this implementation is based on T. Granlund and P. L. Montgomery,
"Division by invariant integers using multiplication" (see Figures 4.1 and 5.1,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556).

It shows a good improvement on all architectures, especially on x86.
However, computing the divisor parameters is relatively expensive, so this
implementation should only be used when the divisor is a scalar that is reused
multiple times.
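
To illustrate the idea, here is a minimal scalar sketch of the unsigned 32-bit case
(the struct and helper names are hypothetical; this is not the code added by this PR):

/* Scalar sketch of Granlund & Montgomery, Figure 4.1 (unsigned 32-bit).
   Illustrative only; helper and struct names are hypothetical. */
#include <stdint.h>
#include <assert.h>

typedef struct { uint32_t mul, sh1, sh2; } udiv32_params;

/* Precompute the multiplier and shift counts for a fixed divisor d != 0. */
static udiv32_params udiv32_divisor(uint32_t d)
{
    assert(d != 0);
    uint32_t l = 0;
    while ((1ULL << l) < d) {
        ++l;                                 /* l = ceil(log2(d)) */
    }
    udiv32_params p;
    /* m = floor(2^32 * (2^l - d) / d) + 1 */
    p.mul = (uint32_t)(((((uint64_t)1 << l) - d) << 32) / d + 1);
    p.sh1 = l < 1 ? l : 1;                   /* min(l, 1) */
    p.sh2 = l > 0 ? l - 1 : 0;               /* max(l - 1, 0) */
    return p;
}

/* Divide n by the fixed divisor using one multiply-high and two shifts. */
static uint32_t udiv32(uint32_t n, udiv32_params p)
{
    uint32_t t = (uint32_t)(((uint64_t)p.mul * n) >> 32);   /* mulhi(m, n) */
    return (t + ((n - t) >> p.sh1)) >> p.sh2;
}

Roughly speaking, the SIMD intrinsics apply the same precomputed multiplier and shifts
to every lane, with an extra sign-handling step for the signed types.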

The division process is separated into two intrinsics for each data type

  • npyv_{dtype}x3 npyv_divisor_{dtype}({dtype} divisor);
    Computes the divisor parameters (multiplier + shift counts + sign of the divisor, signed types only).

  • npyv_{dtype} npyv_divc_{dtype}(npyv_{dtype} dividend, npyv_{dtype}x3 divisor_parms);
    Performs the final division.

For example:

int vstep = npyv_nlanes_s32;                // number of lanes per vector
int x     = 0x6e70;
npyv_s32x3 divisor = npyv_divisor_s32(x);   // precompute the divisor parameters
// src, dst: pointers to int32 data; len: number of remaining elements
for (; len >= vstep; src += vstep, dst += vstep, len -= vstep) {
    npyv_s32 a = npyv_load_s32(src);        // load s32 vector from memory
             a = npyv_divc_s32(a, divisor); // divide all elements by x
    npyv_store_s32(dst, a);                 // store s32 vector into memory
}

NOTES:

  • For 64-bit division on AArch64 and IBM/Power, we fall back to scalar division,
    since emulating multiply-high is expensive and both architectures have very fast dividers.

TODO:

  • verify armhf
  • cleanup

@seiko2plus added the 01 - Enhancement and component: SIMD labels on Jan 16, 2021
@seiko2plus force-pushed the npyv_fast_div branch 3 times, most recently from af0e706 to b99768b on January 16, 2021 at 22:54
@mattip (Member) commented Jan 17, 2021

I think this should have an attribution, something like

Based on the VCL library, which is (c) Copyright 2012-2020 Agner Fog and licensed under
the Apache License version 2.0.

@rgommers is that correct?

@rgommers (Member)

> I think this should have an attribution, something like
>
> Based on the VCL library, which is (c) Copyright 2012-2020 Agner Fog and licensed under
> the Apache License version 2.0.
>
> @rgommers is that correct?

No, that sounds wrong. Reminder: we don't do Apache 2.0; see for example #13447 (comment).

We could change that at some point, but it's a significant change and needs a strong motivation and decision on the mailing list.

VCL is a little hard to find, so here is a link: https://github.com/vectorclass/version2. If this PR took code from there, that seems like a blocker for accepting it.

@rgommers (Member)

Having now read the PR description: the whole PR is based on that, and it seems valuable. It may be worth considering, especially if there are no good alternatives. We'd be deciding to give up on GPLv2 compatibility.

@seiko2plus (Member, Author)

@rgommers, we can ask for re-licensing, since we're not taking the exact same code, and the original code itself is based on the work of T. Granlund and P. L. Montgomery.

@Qiyu8 (Member) commented Jan 18, 2021

Most of the code was rewritten, but some of it is still very similar to the Apache 2.0 licensed VCL, which is not compatible with the current 3-clause BSD license (Apache 2.0 is more restrictive), so we should either rewrite the similar code or ask Agner Fog to modify the license.

@seiko2plus (Member, Author)

@ganesh-k13, I added intrinsics for the 64-bit division so we can drop libdivide.
@rgommers, @mattip, I changed the term "based" to "adopted" since we're not using the same code.
I don't think we need to leave any license notice, because the original work belongs to "T. Granlund and P. L. Montgomery" in the first place.

@seiko2plus force-pushed the npyv_fast_div branch 2 times, most recently from cd96539 to 0a3ea2a on March 6, 2021 at 17:40
@seiko2plus (Member, Author)

@mattip,

> Coverage is missing for some of the code paths. Is there a way to increase the testing?

I added a missing test case for when the divisor equals 2, but that is still not enough
if the CI worker doesn't support AVX-512. Anyway, the current worker seems to support it,
and codecov/patch reports 97.08% of the diff hit (target 82.50%).

@mattip (Member) commented Mar 7, 2021

It seems there is no test for divisor == 0? Or maybe that case is not needed; if so, maybe add a comment where coverage is complaining that the code path should never be reached.

@seiko2plus force-pushed the npyv_fast_div branch 3 times, most recently from c21c61b to 5d2c57f on March 8, 2021 at 05:35
@seiko2plus (Member, Author) commented Mar 8, 2021

@mattip, calling npyv_divisor_{dtype}(0) from the Python or C level will raise a hardware arithmetic exception. I added LCOV_EXCL_LINE instead to suppress the warning.

@mattip merged commit df2cd0f into numpy:main on Mar 8, 2021
@mattip (Member) commented Mar 8, 2021

Thanks @seiko2plus

@ganesh-k13 (Member)

Hey @seiko2plus, I was trying to add dispatch for signed integers, and in the case of signed 64 bits the "round down" is not happening. Here is the code used to test it: diff.

s8, s16, and s32 seem to work fine.
Simple reproducer

Just confirming from bt:

(gdb) n
simd_divide_by_scalar_contig_s64 (args=0x7fffffffd4b0, len=5) at numpy/core/src/umath/loops_arithmetic.dispatch.c.src:40
40          for (; len >= vstep; len -= vstep, src += vstep, dst += vstep) {
(gdb) n
41              npyv_@sfx@ a = npyv_load_@sfx@(src);
(gdb) 
42              npyv_@sfx@ c = npyv_divc_@sfx@(a, divisor);
(gdb) 
43              npyv_store_@sfx@(dst, c);
(gdb) p a
$1 = {-9223372036854775808, -4611686018427387904, 0, 4611686018427387903}
(gdb) p scalar
$7 = -9223372036854775808
(gdb) p c
$8 = {1, 0, 0, 0}

The last 0 should be -1, as it is 4611686018427387903 // -9223372036854775808. This holds for all dividends > 0 with divisors < 0.

@seiko2plus (Member, Author)

@ganesh-k13, this patch follows the C compilers' behavior for signed division, which means:
1. The quotient is rounded towards zero (truncation) rather than towards negative infinity (floor); for example, -7 / 2 gives -3 with truncation, while Python's -7 // 2 gives -4.
2. Overflow wraps around rather than returning zero and raising an arithmetic exception.
The latter behavior isn't standard, but most compilers follow it, at least when optimization is enabled.

Here is an example in Python of implementing floor division on top of truncating division:

"""
the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
−dsign = SRL(d, N − 1);
−nsign = (n < −dsign );
/* SLT, signed */
−qsign = EOR(−nsign , −dsign );
q = TRUNC((n − (−dsign ) + (−nsign ))/d) − (−qsign );
"""
from numpy.core._simd import baseline as v

for d in [-100, -1, -50, 100, 1, 100]:
    nsign_d = v.setall_s16(int(d < 0))
    divisor = v.divisor_s16(d)
    noverflow = v.cvt_b16_s16(v.setall_s16(-1))
    for n in [-0x8000, -255, 255, -100, 1, 100]:
        a         = v.load_s16(range(n, n + v.nlanes_s16))
        nsign_a   = v.cvt_s16_b16(v.cmplt_s16(a, nsign_d))
        nsign_a   = v.and_s16(nsign_a, v.setall_s16(1))
        diff_sign = v.sub_s16(nsign_a, nsign_d)
        to_ninf   = v.xor_s16(nsign_a, nsign_d)
        trunc     = v.divc_s16(v.add_s16(a, diff_sign), divisor)
        if d == -1:
            greater_min = v.cmpgt_s16(a, v.setall_s16(-0x8000))
            noverflow = v.and_b16(noverflow, greater_min)
            floor = v.ifsub_s16(greater_min, trunc, to_ninf, v.zero_s16())
        else:
            floor = v.sub_s16(trunc, to_ninf)
        assert floor == [x//d for x in a]

    if v.tobits_b16(noverflow) != (1 << v.nlanes_s16)-1:
        0//0
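
For reference, the same floor-from-truncation adjustment as a scalar C sketch
(a hypothetical helper, not part of the NumPy SIMD API):

/* Scalar sketch of the floor-from-truncation adjustment shown above.
   Hypothetical helper, not part of the NumPy SIMD API; assumes d != 0 and
   ignores the n == INT16_MIN, d == -1 overflow case. */
#include <stdint.h>

static int16_t floor_div_s16(int16_t n, int16_t d)
{
    int16_t nsign_d   = (int16_t)(d < 0);        /* 1 if the divisor is negative (-dsign) */
    int16_t nsign_n   = (int16_t)(n < nsign_d);  /* -nsign from the paper, as 0/1        */
    int16_t diff_sign = (int16_t)(nsign_n - nsign_d);
    int16_t to_ninf   = (int16_t)(nsign_n ^ nsign_d);     /* -qsign                      */
    int16_t trunc     = (int16_t)((n + diff_sign) / d);   /* C division truncates        */
    return (int16_t)(trunc - to_ninf);
}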

@ganesh-k13 (Member) commented Apr 10, 2021

Thanks a lot for the info @seiko2plus. As Python prefers rounding towards -inf, shall I replace/extend npyv_divc_s* with a round-towards--inf version as shown in the paper? I will be happy to raise a PR with round to -inf. [EDIT] Or do we build on top of trunc as shown above?

[EDIT 2] My main question is: do we edit npyv_divc_s*, or add code in the dispatch to modify the divisor?

@seiko2plus (Member, Author)

@ganesh-k13, just handle it directly within the dispatch-able source file; note that we will still need the truncating division for timedelta, so we can completely remove libdivide.
