
proposal: expose the internal/runtime/sys.Prefetch intrinsic #68769

rip-create-your-account opened this issue Aug 7, 2024 · 7 comments

@rip-create-your-account

Proposal Details

I'm working on a concurrent hash map, and from experience using such maps in highly concurrent, high-throughput contexts I knew to add batch Put and Get methods to it. That API design exposes a very obvious opportunity for prefetching the data.

To experiment with the impact of internal/runtime/sys.Prefetch on the batch Get call I modified runtime/panic.go to expose the intrinsic. go tool objdump then confirmed that a PREFETCHT0 instruction is emitted into the binary as expected.

// in runtime/panic.go
func Prefetch(addr uintptr) {
	sys.Prefetch(addr)
}

// and using it like this
for _, bucket := range initial_buckets_for_keys {
	runtime.Prefetch(uintptr(unsafe.Pointer(bucket)))
}

Looking up 100k distinct entries in a hash map with 5000k entries became significantly faster.

old: BenchmarkLookupBatches/size=100000-8                      63.14 million_lookups/s
new: BenchmarkLookupBatches/size=100000-8                      96.64 million_lookups/s

Which looks like a worthwhile optimization to me.

I personally like how runtime.Prefetch Just Works™ for me, so I propose adding just that. I also propose not mentioning its addition in the change log so people don't use it willy-nilly. Just imagine the damage all the blog spam would cause across the ecosystem, possibly leaking into other language communities too.

@gopherbot gopherbot added this to the Proposal milestone Aug 7, 2024
@seankhliao seankhliao changed the title proposal: Expose the internal/runtime/sys.Prefetch intrinsic. proposal: expose the internal/runtime/sys.Prefetch intrinsic. Aug 7, 2024
@seankhliao seankhliao changed the title proposal: expose the internal/runtime/sys.Prefetch intrinsic. proposal: expose the internal/runtime/sys.Prefetch intrinsic Aug 7, 2024
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Aug 7, 2024
@Jorropo
Member

Jorropo commented Aug 7, 2024

This signature has fewer pitfalls:

func Prefetch[E any, P ~*E | ~unsafe.Pointer](P)

Does this have any memory model interactions?
For example, does anything happen if we call prefetch concurrently with writes to the same address?


I also propose not mentioning its addition in the change log

👎

so people don't use it willy-nilly.

👍
I think it would be more effective if we put it in unsafe.Prefetch.


Lastly, I'm not sure how easily portable this would be to other architectures.

@ianlancetaylor
Member

Does this have any memory model interactions?
For example, does anything happen if we call prefetch concurrently with writes to the same address?

There are no memory model interactions here. A prefetch by definition does not change the meaning of a program, it only affects the execution time.

On an architecture that doesn't have a prefetch instruction, the proposed function would be a no-op.

It's worth noting that amd64 has at least six different kinds of prefetch instructions. The GCC compiler __builtin_prefetch intrinsic permits specifying up to three arguments: the address, a read/write flag, and locality. The read/write flag is 0 for a read, 1 for a write. The locality ranges from 0, meaning no temporal locality and there is no reason to cache the data at all after the actual memory access, to 3, meaning high locality and the value should be left in all caches if at all possible.

@Jorropo
Member

Jorropo commented Aug 8, 2024

There are no memory model interactions here.

I thought it could generate page faults, which can be turned into callbacks with signals, but after going over the docs, at least on amd64 it does not.

On an architecture that doesn't have a prefetch instruction, the proposed function would be a no-op.

I'm more worried about architectures that don't implement it the same way.
This looks like the kind of thing that could improve performance on one particular CPU but pessimize it on another.

@rip-create-your-account
Author

@Jorropo

This looks like the kind of thing that could improve performance on one particular CPU but pessimize it on another.

This is easily solved by using it only where it is actually useful. General-purpose data structures are unlikely to need this. I found only one place in the Go project that currently uses sys.Prefetch, and that is a very specialized data structure. Meanwhile, plenty of recent literature shows that it is useful in huge hash maps, especially when combined with batch operations.

I thought it could generate page faults, which can be turned into callbacks with signals, but after going over the docs, at least on amd64 it does not.

In my use case the bucket pointers being prefetched are, under certain rare conditions, nil. Prefetch(uintptr(0)) must work without "affecting program behavior."

@ianlancetaylor

It's worth noting that amd64 has at least six different kinds of prefetch instructions...

The current situation is that there's a small set of {me, myself, I} asking for the exposed prefetch intrinsic, and We happen to need only PREFETCHT0. What We are working on is just a research experiment, so in a sense a waste of time to begin with. Overcomplicating this proposal with the full prefetch capabilities of AMD64 and ARM64 could have unforeseen consequences. Our interest is in having just enough low-friction machinery to prototype data structures and algorithms of interest, which can then be ported over and polished in more eloquent languages.

@rip-create-your-account
Author

Bordering on off-topic, but apparently some mutex implementations use PREFETCHW when spinning. The AMD64 manual does say that

"... However, prefetching write data takes longer than prefetching read data if the processor must wait
for another caching master to first write-back its modified copy of the requested data to memory
before the prefetch request is satisfied."

which one could read as stalling until the other CPU core is done with the cache line, possibly yielding CPU time to a virtual core when hyper-threading is enabled. It also sounds useful in my use case, considering that all of my buckets carry a sync.Mutex and the very first thing I do with a bucket is mu.Lock() it. But I can also see ways it could go wrong compared to the smooth-brained PREFETCHT0. The result of a biased coin flip tells me that the addition of PrefetchInAnticipationOfWriteThatWillComeSoonEnoughJustAFewLinesBelow() is in order.

@clausecker

Prefetching would also be nice for radix sorting.

There should be an intrinsic that takes a slice and an index into it, prefetching the element at that index while omitting the bounds check. Prefetching unmapped storage or memory beyond the end of an object is harmless, and it's often very annoying to avoid failing the bounds check in code that iterates through arrays, like this hypothetical example:

var foo []int
for i := range foo {
	prefetch(&foo[i+1]) // prefetch next iteration
	doSomething(foo[i])
}

This code would crash on the final iteration as foo[len(foo)] is out of bounds.

This could of course also be achieved with a uintptr-based intrinsic, but it's less of a footgun to leave the pointer arithmetic to the compiler.

@clausecker

It's worth noting that amd64 has at least six different kinds of prefetch instructions. The GCC compiler __builtin_prefetch intrinsic permits specifying up to three arguments: the address, a read/write flag, and locality. The read/write flag is 0 for a read, 1 for a write. The locality ranges from 0, meaning no temporal locality and there is no reason to cache the data at all after the actual memory access, to 3, meaning high locality and the value should be left in all caches if at all possible.

Here's a mini-survey of prefetch instructions from the architectures Go supports right now:

386 / amd64

These have prefetch instructions for reading: prefetcht0 (into all caches), prefetcht1 (into L2 and higher), prefetcht2 (into L3 and higher), and prefetchnta (non-temporal prefetch). A prefetch fetches at least 32 bytes. Prefetching from unmapped or device memory is ignored. Prefetch instructions behave as NOPs on CPUs too old to support them (they sit in the hint-NOP space), as long as the CPU is at least an i686.

There is also a prefetch-for-write instruction, prefetchw, which always prefetches into the L1 cache. There's a dedicated cpuid flag for it. prefetchw was introduced by AMD with 3DNow!, but was adopted by Intel with Broadwell.

arm

ARM has PLD for prefetching for reading, PLDW for prefetching for writing, and PLI for prefetching for execution. PLD was added with ARMv5TE / ARMv6, PLDW with “ARMv7 with MP extensions,” and PLI with ARMv7. The cache-level to be prefetched into is implementation-defined. Prefetch instructions may behave as NOPs. Prefetch instructions do not generate synchronous data aborts, but may generate asynchronous aborts under exceptional circumstances.

arm64

ARM64 has PRFM for prefetching in the scalar subsystem. A three-part prefetch operation is given as an operand, indicating what kind of prefetch to do (PLD for prefetch for load, PST for prefetch for store, or PLI for preload instructions), what level of cache to prefetch into (L1, L2, L3, or, with an extension, SLC), and what the desired policy is (KEEP or STRM, i.e. non-temporal). There is also the extension RPRFM (ranged prefetch memory) to prefetch an explicit range of addresses.

The semantics seem to be the same as on ARM.

MIPS

MIPS has a pref instruction since MIPS IV. Prefetches are available for loading or storing, with optional hints to prefetch for retaining or streaming. Exceptions caused by prefetches are ignored.

PowerPC

The PowerPC term for prefetching is “touch.” All touch instructions support prefetching into L1 and L2 cache. There are touch instructions for instruction memory (icbt, instruction cache block touch), prefetching for read (dcbt, data cache block touch), and prefetching for write (dcbtst, data cache block touch for store). A hint can be provided to the latter two, indicating if the prefetch is for transient (i.e. non-temporal) or intransient access. There's also a very complicated stream prefetch model I'll skip here. dcbt and dcbtst were added with PowerPC, icbt with POWER ISA 2.07.

S390

S390 supports prefetching through the PFD (prefetch data) and PFDRL (prefetch data relative long) instructions. Prefetches are available for loading and for storing. These were both added with the General-instructions-extension facility.

RISC-V

Prefetch instructions for loading, storing, and instructions are part of the Zicbop extension. They behave as NOPs when not supported.
