
optimizing division-by-constant on AArch32 (ARM32) #63731

Open · rsaxvc opened this issue Jul 7, 2023 · 5 comments

rsaxvc commented Jul 7, 2023

I found that both GCC 12 and Clang 11 call __aeabi_uldivmod when dividing a uint64_t by a constant on AArch32, and that it's possible to emulate the AArch64 umulh instruction, which enables the divide-by-constant optimization even on AArch32. I wrote this up here and will put the relevant bits below; there's a GCC Bugzilla suggestion as well. This optimization applies only to Arm cores with the umull instruction, which is many of them, though it is missing on the smallest microcontrollers and some older cores.

C code for dividing a uint64_t by a constant:

#include <stdint.h>

uint64_t div3(uint64_t x)
{
    return x / 3;
}

Clang 11 (-O3 -mcpu=cortex-m4) generates:

div3(unsigned long long):
        push    {r7, lr}
        movs    r2, #3
        movs    r3, #0
        bl      __aeabi_uldivmod
        pop     {r7, pc}

However, Clang 16 for AArch64 implements division by 3 using multiplication by a precomputed reciprocal:

div3(unsigned long):                               // @div3(unsigned long)
        mov     x8, #-6148914691236517206
        movk    x8, #43691
        umulh   x8, x0, x8
        lsr     x0, x8, #1
        ret
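
For context, the two mov/movk instructions here materialize m = 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3), so umulh followed by lsr #1 computes floor(x * m / 2^65), which equals x / 3 for every 64-bit x. A minimal sketch checking that identity against plain division (my addition, not from the thread; umulh_ref is a hypothetical reference helper and assumes a host compiler with unsigned __int128, e.g. GCC or Clang):

#include <assert.h>
#include <stdint.h>

/* Reference umulh: top 64 bits of the exact 128-bit product. */
static uint64_t umulh_ref(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

int main(void)
{
    const uint64_t m = 0xAAAAAAAAAAAAAAABull; /* ceil(2^65 / 3) */
    for (uint64_t x = 0; x < 10000000; x++)
        assert((umulh_ref(x, m) >> 1) == x / 3);
    assert((umulh_ref(UINT64_MAX, m) >> 1) == UINT64_MAX / 3);
    return 0;
}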

While an AArch32 core like the Cortex-M4 does not have the umulh instruction, it does have umull (a 32-bit x 32-bit multiply with a 64-bit result). Using that, we can emulate umulh with the following function:

uint64_t umulh( uint64_t a, uint64_t b )
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication
       (see https://en.wikipedia.org/wiki/FOIL_method), but instead of
       binomials (ax+b) * (cy+d) with constants a,b,c,d and variables x,y,
       consider variables a,b,c,d with constants x,y = 1<<32.
       When applicable each line is one ARM UMULL instruction; might do
       better with UMLAL. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Break the partial products up into 32-bit values */
    const uint32_t lo0 = acc0;
    const uint32_t hi0 = acc0 >> 32;
    const uint32_t lo1 = acc1;
    const uint32_t hi1 = acc1 >> 32;
    const uint32_t lo2 = acc2;
    const uint32_t hi2 = acc2 >> 32;
    const uint32_t lo3 = acc3;
    const uint32_t hi3 = acc3 >> 32;

    /* The bottom 32 bits of the 128-bit product aren't used in the result,
       so there's no need to store or even compute them; res0 is shown only
       for symmetry. Start accumulating at res1, which is needed only for
       the carry it produces. */
    uint64_t acc = 0;
    const uint32_t res0 = lo0;

    acc += hi0;
    acc += lo1;
    acc += lo2;
    const uint32_t res1 = acc;

    acc >>= 32;
    acc += hi1;
    acc += hi2;
    acc += lo3;
    const uint32_t res2 = (uint32_t)acc;

    acc >>= 32;
    acc += hi3;
    const uint32_t res3 = (uint32_t)acc;

    return (((uint64_t)res3) << 32) + res2;
}
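
For illustration (my addition, with a hypothetical wrapper name): with this helper, the AArch64 sequence above maps directly onto AArch32, reusing the same magic constant ceil(2^65 / 3) and final shift from the Clang 16 output:

/* Sketch: divide by 3 via the emulated umulh, mirroring the AArch64 code. */
uint64_t div3_umulh(uint64_t x)
{
    return umulh(x, 0xAAAAAAAAAAAAAAABull) >> 1;
}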

I will refer to this as umulh emulation; in testing on Cortex-M4 it yields a speedup of between 2x and 37x, depending on various factors.

First, umulh() runs in constant time on Cortex-M4, whereas the execution time of __aeabi_uldivmod() depends on the numerator.

If the core and the compiler support library implement __aeabi_uldivmod() using the udiv instruction, the speedup from emulating umulh is relatively small, only around 2-4x on Cortex-M4. If not, the umulh approach can divide at roughly 28-37x the rate of __aeabi_uldivmod(). Roughly, we can group cores as follows:

  • Cores where umulh emulation performs poorly due to the lack of the umull instruction: older Arm cores before ARM7TDMI, and Cortex-M0/M0+/M1/M23. On these we should keep using __aeabi_uldivmod().
  • Cores where umulh emulation is an order of magnitude faster due to the presence of umull but the lack of udiv (or a __aeabi_uldivmod that does not take advantage of udiv): ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9.
  • Cores where umulh emulation is only a little faster: Cortex-A7, Cortex-A15, and 64-bit Arm cores.

If this is of interest, it might be worth putting umulh into the compiler support library and using it when dividing a uint64_t by a constant.

rsaxvc (Author) commented Jul 7, 2023

It looks like this form of umulh, utilizing both umull and umlal, might be shorter, but I haven't benchmarked or tested it:

uint64_t umulh2( uint64_t a, uint64_t b )
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication
       (see https://en.wikipedia.org/wiki/FOIL_method), but instead of
       binomials (ax+b) * (cy+d) with constants a,b,c,d and variables x,y,
       consider variables a,b,c,d with constants x,y = 1<<32.
       Results in one UMULL, or one UMLAL when merged with the
       accumulation below, per line. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Accumulate the results, keeping only the top 64 bits.
       The sum of the two cross terms can carry out of 64 bits
       (e.g. a == b == UINT64_MAX), so track the carry explicitly
       and fold it back in after the shift. */
    uint64_t acc = acc0 >> 32;
    acc += acc1;
    uint64_t carry = acc < acc1;
    acc += acc2;
    carry += acc < acc2;
    acc >>= 32;
    acc += carry << 32;
    acc += acc3;

    return acc;
}
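
As a quick differential check (my addition, not the author's; it assumes umulh2 above is in scope and a host compiler with unsigned __int128), including an input pair that exercises the cross-term carry:

#include <assert.h>
#include <stdint.h>

/* Reference: top 64 bits of the exact 128-bit product. */
static uint64_t umulh_ref(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

int main(void)
{
    const uint64_t samples[] = {
        0, 1, 3, 0xFFFFFFFFull, 0x100000000ull,
        0xAAAAAAAAAAAAAAABull, UINT64_MAX /* UINT64_MAX^2 needs the carry */
    };
    const unsigned n = sizeof samples / sizeof samples[0];
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++)
            assert(umulh2(samples[i], samples[j]) ==
                   umulh_ref(samples[i], samples[j]));
    return 0;
}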

llvmbot (Collaborator) commented Jul 7, 2023

@llvm/issue-subscribers-backend-arm

efriedma-quic (Collaborator) commented

Clang 16 generates the following (https://c.godbolt.org/z/1WxqanGq4; see commit 38ffa2b):

_Z4div3y:
        push    {r11, lr}
        mov     r11, sp
        adds    r2, r0, r1
        movw    r12, #43691
        adc     lr, r2, #0
        movt    r12, #43690
        umull   r3, r2, lr, r12
        bic     r3, r2, #1
        add     r2, r3, r2, lsr #1
        movw    r3, #43690
        sub     r2, lr, r2
        movt    r3, #43690
        subs    r2, r0, r2
        sbc     r1, r1, #0
        umull   r0, lr, r2, r12
        mla     r2, r2, r3, lr
        mla     r1, r1, r12, r2
        pop     {r11, pc}

But we only do this transform for certain constants, and your suggested sequence appears to be shorter.

rsaxvc (Author) commented Jul 8, 2023

> Clang 16 generates... But we only do this transform for certain constants.

Neat stuff! I didn't realize it knew div3, so I'm afraid I over-simplified my example. I actually need to divide a uint64_t by 1e6 (seconds+nanoseconds to milliseconds) or by 1e3 (seconds+nanoseconds to microseconds). I put together a fuller milliseconds example with both approaches (https://godbolt.org/z/qKnEr18f6):

typedef unsigned long long uint64_t;
typedef unsigned int uint32_t;

//convert struct timespec params to milliseconds, straightforward
uint64_t ts_to_millis_c(uint32_t s, uint32_t ns)
{
    uint64_t ns64 = s * 1000000000ULL + ns;
    return ns64 / 1000000ULL;
}

//emulate Arm64's umulh instruction (return top 64 bits of a 64b x 64b multiplication)
//see: https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/UMULH
static uint64_t umulh(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication
       (see https://en.wikipedia.org/wiki/FOIL_method), but instead of
       binomials (ax+b) * (cy+d) with constants a,b,c,d and variables x,y,
       consider variables a,b,c,d with constants x,y = 1<<32.
       Results in one UMULL, or one UMLAL when merged with the
       accumulation below, per multiply. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Accumulate the results, keeping only the top 64 bits of the 128-bit
       result. As in umulh2 above, the sum of the cross terms can carry out
       of 64 bits, so track the carry and fold it back in after the shift. */
    uint64_t acc = acc0 >> 32;
    acc += acc1;
    uint64_t carry = acc < acc1;
    acc += acc2;
    carry += acc < acc2;
    acc >>= 32;
    acc += carry << 32;
    acc += acc3;

    return acc;
}

//convert struct timespec params to milliseconds
//using constants extracted from the Arm64 target for umulh emulation
uint64_t ts_to_millis_umulh(uint32_t s, uint32_t ns)
{
    uint64_t ns64 = s * 1000000000ULL + ns;
    return umulh(0x431BDE82D7B634DBULL, ns64) >> 18;
}

//quick sanity-test
int main()
{
    int passes = 0;
    for(uint32_t s = 0; s < 1UL<<31; s += 12345678)
    {
        for(uint32_t ns = 0; ns < 1UL<<31; ns += 2500000)
        {
            uint64_t c = ts_to_millis_c(s,ns);
            uint64_t u = ts_to_millis_umulh(s,ns);  
            if(c != u) return -1;
            passes++;
        }
    }
    return passes;
}
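
For reference (my addition): the constant 0x431BDE82D7B634DB is ceil(2^82 / 10^6), so umulh(m, n) >> 18 computes floor(n * m / 2^82), which equals n / 1000000 for every 64-bit n. A sketch that reproduces the constant, assuming a host compiler with unsigned __int128:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* m = ceil(2^82 / 10^6); the >> 18 after umulh completes the / 2^82. */
    const unsigned __int128 two82 = (unsigned __int128)1 << 82;
    const uint64_t m = (uint64_t)((two82 + 999999) / 1000000);
    printf("0x%llX\n", (unsigned long long)m); /* prints 0x431BDE82D7B634DB */
    return 0;
}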

rsaxvc (Author) commented Jul 8, 2023

Another random data point: on a Raspberry Pi 4 (GCC 10, -O3, -m32), a realtime thread on a low-interrupt core can run ts_to_millis_umulh() about 36936656 times per second. ts_to_millis_c() takes about 23x longer, because the 32-bit target needs to maintain binary compatibility with the Pi 0/1, which use the ARM1176 core and have no hardware divide instructions.
