Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: implement procyield as ISB instruction on arm64 #69232

Open
rhysh opened this issue Sep 3, 2024 · 2 comments
Open

runtime: implement procyield as ISB instruction on arm64 #69232

rhysh opened this issue Sep 3, 2024 · 2 comments
Labels
arch-arm64 compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Sep 3, 2024

When looking into #68578, I found that the implementation of runtime.procyield on GOARCH=arm64 uses the YIELD instruction, and that the YIELD instruction is in effect a (fast) NOP.

The current code is at

go/src/runtime/asm_arm64.s

Lines 917 to 923 in 9a4fe7e

TEXT runtime·procyield(SB),NOSPLIT,$0-0
MOVWU cycles+0(FP), R0
again:
YIELD
SUBW $1, R0
CBNZ R0, again
RET
, and I added a benchmark, runtime.BenchmarkProcYield, at https://go.dev/cl/601396 .

The difference in delay between amd64 (slow, using PAUSE) and arm64 (fast, using YIELD) makes it hard to be confident in the tuning of the runtime.lock2 spin loop. Note that it's easy to tune the spin loop for the specific duration of a microbenchmark's critical section, which might not be the best tuning for Go overall.

It looks like Rust uses ISB SY, https://github.com/rust-lang/rust/blob/d6c8169c186ab16a3404cd0d0866674018e8a19e/library/core/src/hint.rs#L291-L295 , changed in rust-lang/rust@c064b65 . I've confirmed that using ISB results in a longer delay on the hardware most easily available to me (M1 MacBook Air), which I'd expect to be a benefit to runtime.lock2, both reducing the likelihood of acquiring the lock without a sleep and controlling the electrical energy used to do so.

What I'd most prefer is for Go 1.24 to include a fix for #68578 , and for that to be the only change to runtime.lock2 in the Go 1.24 cycle (so it's clear whether that change is to blame for any changes in mutex performance). But I'm opening this now so we at least don't lose track of it.

CC @golang/runtime @golang/arm

@rhysh rhysh added arch-arm64 compiler/runtime Issues related to the Go compiler and/or runtime. Performance labels Sep 3, 2024
@gabyhelp
Copy link

gabyhelp commented Sep 3, 2024

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@dmitshur dmitshur added this to the Backlog milestone Sep 4, 2024
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 4, 2024
@prattmic
Copy link
Member

prattmic commented Sep 4, 2024

Without truly understanding the details, at face value I am hesitant to use ISB SY over YIELD.

YIELD means exactly what we intend: "Software with a multithreading capability can use a YIELD instruction to indicate to the PE that it is performing a task, for example a spin-lock, that could be swapped out to improve overall system performance."

ISB SY is a sort of instruction fetch barrier: "Instruction Synchronization Barrier flushes the pipeline in the processor, so that all instructions following the ISB are fetched from cache or memory, after the instruction has been completed."

The purpose of ISB SY is for various page table operations (and perhaps self-modifying code?), which is not what we mean.

While I acknowledge that ISB SY takes longer to execute, it doesn't seem like a great fit, especially without understanding the full effects (at minimum it seems to affect branch predictor behavior).

Rather than picking another instruction simply because it takes longer, it seems to me we could instead adjust the number of iterations of the YIELD instruction to make it take longer rather than switching to a slower instruction [1].

I'm happy to be convinced otherwise if we have more concrete information available (such as recommendations from an ARM optimization manual e.g.).

(FWIW, from a quick glance, it looks like Linux uses YIELD: https://elixir.bootlin.com/linux/v6.10.7/source/arch/arm64/include/asm/vdso/processor.h)

[1] At a higher level, all of the various tuned busy/sleep waits in the runtime are sufficiently old that I am not convinced they are particularly well tuned for modern CPUs at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Development

No branches or pull requests

4 participants