Skip to content

Sub-optimal codegen when adding with carry #169691

@andrepd

Description

@andrepd

Consider the following Rust code (no need to know Rust, it's very straightforward).

pub fn f1(q: &mut [u64; 8], implicit: u64) {
    let mut carry = false;
    for i in 0 .. q.len() {
        let (r, o) = u64::carrying_add(q[i], implicit, carry);  // Add q[i] + implicit + carry, return tuple with "result modulo 2^64" and "new carry"
        q[i] = r;
        carry = o;
    }
}

On x86, this gets compiled to what you would expect.

add     qword ptr [rdi], rsi
adc     qword ptr [rdi + 8], rsi
adc     qword ptr [rdi + 16], rsi
adc     qword ptr [rdi + 24], rsi
adc     qword ptr [rdi + 32], rsi
adc     qword ptr [rdi + 40], rsi
adc     qword ptr [rdi + 48], rsi
adc     qword ptr [rdi + 56], rsi
ret

However, if you take the carry as a parameter, rather than initialising it to zero

pub fn f2(q: &mut [u64; 8], implicit: u64, mut carry: bool) {
    for i in 0 .. q.len() {
        let (r, o) = u64::carrying_add(q[i], implicit, carry);
        q[i] = r;
        carry = o;
    }
}

I would expect it to be compiled to

add     dl, -1  ; Set carry flag to 0/1 if dl is 0/1 resp
adc     qword ptr [rdi], rsi
adc     qword ptr [rdi + 8], rsi
adc     qword ptr [rdi + 16], rsi
adc     qword ptr [rdi + 24], rsi
adc     qword ptr [rdi + 32], rsi
adc     qword ptr [rdi + 40], rsi
adc     qword ptr [rdi + 48], rsi
adc     qword ptr [rdi + 56], rsi
ret

However llvm (via rustc) writes it instead in a very roundabout way:

mov     rax, qword ptr [rdi]
add     rax, rsi
setb    cl
mov     edx, edx
add     rdx, rax
setb    al
or      al, cl
mov     qword ptr [rdi], rdx
add     al, -1
adc     qword ptr [rdi + 8], rsi
adc     qword ptr [rdi + 16], rsi
adc     qword ptr [rdi + 24], rsi
adc     qword ptr [rdi + 32], rsi
adc     qword ptr [rdi + 40], rsi
adc     qword ptr [rdi + 48], rsi
adc     qword ptr [rdi + 56], rsi

This is a godbolt illustrating this. https://godbolt.org/z/9M1Tr4cGj

This is the equivalent godbolt, but in C++: https://godbolt.org/z/sKz5vrEjr . If you look at this one, you can switch to gcc to see how it gets it right.

All my attempts to nudge codegen into the optimsed form (e.g. by explicitly writing something like let (dummy, o) = u64::carrying_add(0, u64::MAX, carry); carry = 0) have failed. This interests me because the optimised codegen cuts an inner loop in a program of mine in half! :)

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions