Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: mod operator is very slow on intel hardware with int64 #59089

Open
rcsaquino opened this issue Mar 17, 2023 · 5 comments
Open

cmd/compile: mod operator is very slow on intel hardware with int64 #59089

rcsaquino opened this issue Mar 17, 2023 · 5 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

Comments

@rcsaquino
Copy link

rcsaquino commented Mar 17, 2023

What version of Go are you using (go version)?

$ go version
go1.20.2 windows/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

Issue is discussed more with code example in this reddit thread.

TLDR; Mod operator is very slow on intel hardware when used on int type.

What did you expect to see?

Mod operator optimized to handle int types on intel hardwares

What did you see instead?

The mod operator is very slow when used on int types on intel hardware.

@merykitty
Copy link

This is expected, int64 division on Skylake results in 57 micro-ops and has a throughput of 24 cycles/operation, while the figures for int32 are only 10 micro-ops and 6 cycles/operation, change int64 to uint64 also helps reduce these to 36 micro-ops and 21 cycles/operation. As a result, on 64-bit hardware where int is equivalent to int64 the division throughput is a quarter of that of int32 division. Which explains the execution speed differences.

@merykitty
Copy link

merykitty commented Mar 17, 2023

I have also taken a look at the Rust compiled code, it seems they cleverly check if the upper 32-bits of both the dividend and the divisor are all zero, and fall to faster uint32 division if that is the case. This explains why the Rust implementation is as fast as the JS and V implementations.

.LBB3_10:
    mov     rax, rbx
    xor     edx, edx
    div     rdi.     // slow path, long division
.LBB3_11:
    inc     rdi
    test    rdx, rdx
    je      .LBB3_12
.LBB3_3:
    lea     rax, [r12 + rdi]
    cmp     rax, 2
    je      .LBB3_4
    mov     rax, rbx
    or      rax, rdi // or the dividend and the divisor
    shr     rax, 32  // get the upper 32-bit of the or
    jne     .LBB3_10 // shr set ZF so this is taken if any of the aforementioned 32 bits is non-zero
    mov     eax, ebx
    xor     edx, edx
    div     edi.     // fast path, integer division
    jmp     .LBB3_11

@seankhliao seankhliao changed the title Mod operator is very slow on intel hardware when used on int type. cmd/compile: mod operator is very slow on intel hardware with int64 Mar 17, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 17, 2023
@seankhliao seankhliao added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 17, 2023
@ianlancetaylor
Copy link
Contributor

CC @randall77

@randall77
Copy link
Contributor

Seems fine to put in a similar test + 32-bit path in our code.

My only question is how universal is this? Skylake was mentioned - how does this fare on other processor vendors / families?

@merykitty
Copy link

It is generally true on x86 that an uint32 division is cheaper than int64 and uint64 ones, but to add a branch to fastpath the latters into the former I think would need more consideration. I think LLVM does so on all intel chips while doing the division normally on AMD ones, which seems reasonable and reflects the throughput and latency differences on architectures. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: No status
Development

No branches or pull requests

6 participants