Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: math/bits: add Deposit and Extract functions #45455

Closed
mdempsky opened this issue Apr 8, 2021 · 21 comments
Closed

proposal: math/bits: add Deposit and Extract functions #45455

mdempsky opened this issue Apr 8, 2021 · 21 comments

Comments

@mdempsky
Copy link
Member

mdempsky commented Apr 8, 2021

The x86-64-v3 instruction set includes the "Bit Manipulation Instruction" sets BMI1 and BMI2. Many of these are useful as mere peephole optimizations (e.g., ANDN for ~x & y, BLSI for x & -x, BSLR for x & (x - 1)), but the "parallel bits deposit" (PDEP) and "parallel bits extract" (PEXT) instructions are instructions that could benefit from higher-level abstractions.

PEXT takes a value and a mask and extracts all the bits included in the mask (like AND), but then compacts them to the bottom of the result word. PDEP performs the opposite operation (taking bits from the bottom of a value and spreading them out according to a mask).

These instructions are especially useful for varint decoding (e.g., extracting and compacting all value bits in a single instruction), which is a very frequent operation within protobuf decoding. Also, somewhat hot within the Go runtime too.

It's possible today to write them using assembly helper functions, but function call overhead (since they can't be inlined anymore) ends up negating a lot of the potential benefit.

The function signatures would be:

func Deposit64(val, mask uint64) uint64
func Extract64(val, mask uint64) uint64

etc. See https://go-review.googlesource.com/c/go/+/299491/2/src/math/bits/bmi2.go#24 for generic implementations.

Questions:

  • These operations are admittedly fairly x86-specific, and it seems questionable whether other CPUs will add them. Is math/bits still the best place for them?

  • Do we need the full generality of uint{,8,16,32,64} variants? Haswell only provides PEXT{L,Q} and PDEP{L,Q}, but it's easy to implement 8- and 16-bit wrappers with zero-extension instructions.

  • The generic fallback code requires a loop, but the compiler could recognize PEXT/PDEP instructions with fixed masks and lower them into fixed and/shift operations. Is this worthwhile?

  • Alternatively, there are a bunch of ideas from C compilers that we can take inspiration from (e.g., inline assembly like GCC et al; limited inlining of functions written in assembly like SunPro; provide CPU-specific intrinsic functions like ICC et al).

@mdempsky
Copy link
Member Author

mdempsky commented Apr 9, 2021

I should note as another alternative, the compiler could recognize certain mask/shift expressions and turn them into PDEP/PEXT instructions when available, similar to how we recognize little-endian/big-endian word loads from individual byte load+shift operations. That would avoid introducing any new formal API surface, and it would address the couple of use cases I currently had in mind (protobufs and runtime GC maps).

However, I think it would be easier to teach the compiler to recognize (e.g.) bits.Extract64(x, 0x7f7f...7f7f) and turn that into (x & 0x7f) | ((x & 0x7f00) >> 1) | ((x & 0x7f0000) >> 2) | ...) than the other way around. I think that would also be easier to explain to users what to expect (i.e., on x86-64-v3, bits.Extract64 becomes a single instruction; on other CPUs, it gets turned into masks/shifts when the mask argument is constant).

@josharian
Copy link
Contributor

josharian commented Apr 10, 2021

Are BMI1 and BMI2 part of the instruction set that we assume everywhere for amd64 or will they require CPU feature flag checks? (If they're available everywhere we should add the peephole optimizations you mentioned.)

@zephyrtronium
Copy link
Contributor

zephyrtronium commented Apr 10, 2021

@josharian
BMI1 and BMI2 are part of HNI, roughly corresponding to AVX2. PDEP and PEXT are particularly in BMI2, which AMD added later than BMI1.

Interestingly, per Wikipedia, PDEP and PEXT may have slower performance than certain software methods on AMD architectures between Excavator (2015) and Zen 3 (2020). https://github.com/zwegner/zp7 is related.

@rsc
Copy link
Contributor

rsc commented Apr 21, 2021

If you adopt these for varint decoding but find yourself on a non-x86 platform and falling back to the unaccelerated forms, would the varint decoding end up slower than it would have?

Note also that Hacker's Delight calls these (if I am looking at the right sections)

  • Compress, or Generalized Extract
  • Expand, or Generalized Insert

I've not heard Deposit before, but it's possible I am misunderstanding the exact operations being proposed here.

@rsc rsc moved this from Incoming to Active in Proposals (old) Apr 21, 2021
@rsc
Copy link
Contributor

rsc commented Apr 21, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Apr 21, 2021

However, I think it would be easier to teach the compiler to recognize (e.g.) bits.Extract64(x, 0x7f7f...7f7f) and turn that into (x & 0x7f) | ((x & 0x7f00) >> 1) | ((x & 0x7f0000) >> 2) | ...) than the other way around.

I'm less worried about teaching the compiler than teaching users and updating old code (and doing that correctly). If we can make the compiler understand how to turn an expression like the latter into some advanced instruction, then we (a) don't need new API, and (b) optimize all the existing code using those kinds of patterns. That seems like a win-win?

@mdempsky
Copy link
Member Author

mdempsky commented Apr 21, 2021

If you adopt these for varint decoding but find yourself on a non-x86 platform and falling back to the unaccelerated forms, would the varint decoding end up slower than it would have?

I suspect so. That's why I also proposed #45454 to make it possible for code to detect whether it's being compiled for amd64.v3 (i.e., PDEP/PEXT available) or not, and to support different code paths for each.

Note also that Hacker's Delight calls these (if I am looking at the right sections)

Yes, I think those are the same operations. Section 7-5 (Expand, or Generalized Insert) mentions too "This function has also been called unpack, scatter, and deposit."

I don't feel strongly about the names. I used "Deposit" and "Extract" because that's what Intel calls them. If there's reason to favor other naming conventions for the same operations, I think that's fine. It's easy enough to cross-reference common names in the documentation.

If we can make the compiler understand how to turn an expression like the latter into some advanced instruction, then we (a) don't need new API, and (b) optimize all the existing code using those kinds of patterns. That seems like a win-win?

My concern is that detecting appropriate expressions seems likely to be subtle and fragile, and users will end up needing to double check the compiled output to make sure they and the compiler both got it right.

My intuition is also that people using this functionality for performance will want to structure their code differently if they're not available anyway. Whereas users who don't care about performance would prefer more concise code. So I think both of these users would again benefit from having functionality in math/bits, rather than the compiler recognizing special patterns.

@rsc
Copy link
Contributor

rsc commented Apr 28, 2021

It's a bit unfortunate to have a portable API in math/bits and then still have to have build tags like amd64.v3 to guard the use of the math/bits function because we "know" that it's unsuitable except on certain architectures. If you are getting that close to the machine, then teaching the compiler at leasts avoids having public API that attracts people but can't be used portably + profitably at the same time.

Is there some slightly higher-level API that we can implement efficiently?

Or if this is very specific to varint-encoding, can we adjust the implementation in encoding/binary to use the right tricks internally? (Write the expression, perhaps build-tag-guarded, with a test that the compiler is compiling it the right way, but all behind the current package API.)

There are other places that use varint encoding too, so fixing it in encoding/binary could potentially help many use cases (for example import/export data too).

@mdempsky
Copy link
Member Author

mdempsky commented Apr 28, 2021

It's a bit unfortunate to have a portable API in math/bits and then still have to have build tags like amd64.v3 to guard the use of the math/bits function because we "know" that it's unsuitable except on certain architectures.

Ack. It's not great.

One possibility would be to go the style of syscall / x/sys, and have a "math/bits/amd64" package where this function is declared, and the declaration is already guarded to only be available for amd64.v3 (i.e., no fallback).

Is there some slightly higher-level API that we can implement efficiently?

Off hand, I'm not sure. I think we need more experience to identify that.

I think if we were to generalize the runtime/internal/sys package's intrinsics to an internal/* package, that would give some opportunity to explore optimizations within the Go standard library before committing to a particular API for end users.

It's also possible that runtime + encoding/binary's varint stuff end up being the only profitable places for using this.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Apr 28, 2021

If we go the route of math/bits/amd64, which I'm not opposed to, then we are basically walking toward the way that C works. For C each CPU architecture defines a set of header files (such as <bmiintrin.h>, available on both x86 and PPC but defining different names). This will push us toward defining GOARCH-specific functions for all the complex instructions that are not easily generated by the compiler.

In general performance requirements are a strong incentive to figure out a way to provide access to CPU-specific capabilities, one way or another. Introducing the math/bits package gave us a mechanism for providing that access when we could provide a generic implementation that was fine to use on processors without specific support. Now we need to figure out a mechanism for other capabilities.

For example, one approach we could consider would be

// Deposit64 takes the low bits from val and deposits them into a uint64 result
// at the corresponding bit locations that are set in mask.
// ??? Why not also pass in an initial value whose bits are set?
func Deposit64(val, mask uint64) uint64

// IsDeposit64Fast reports whether the Deposit64 function is faster than doing the operation in ordinary Go.
// This will be true iff the processor has specific support for the operation.
// ??? Would be nice if this could be a const, but that would not permit runtime CPU detection.
var IsDeposit64Fast bool

Likely there are better possible ideas.

@bcmills
Copy link
Member

bcmills commented Apr 28, 2021

I feel like having slower-than-optimal Deposit functions in math/bits is probably better than having them only defined in math/bits/amd64.

With suitable build constraints, high-performance code ends up structured the same either way — moving the intrinsic to a different package doesn't change the fact that callers need to use a build constraint to decide whether to use it. And with the functions in math/bits, code that is only optimized for AMD64 would at least be more likely to compile and run at all (if more slowly) on other platforms.

@mdempsky
Copy link
Member Author

mdempsky commented Apr 28, 2021

// ??? Why not also pass in an initial value whose bits are set?

PDEP sets the non-masked bits to 0, so we'd need to do more work to peephole optimize that case, but I think that's not a problem.

However, users can also just write Deposit64(a, b) | c themselves.

// IsDeposit64Fast reports whether the Deposit64 function is faster than doing the operation in ordinary Go.
// This will be true iff the processor has specific support for the operation.
// ??? Would be nice if this could be a const, but that would not permit runtime CPU detection.
var IsDeposit64Fast bool

If it was func IsDeposit64Fast() bool instead, we could make it inline as just true when targeting GOAMD64=v3 (or higher), a runtime check for other GOARCH=amd64, and inline as false when targeting other architectures.

@rsc
Copy link
Contributor

rsc commented Apr 29, 2021

I'm still trying to chart a path through this that ends up with APIs that we don't believe overfit too much to a particular architecture, even a dominant one like x86. As was said earlier, it's possible that the only profitable use case for these are varints, in which case we end up with a bit of troublesome API for relatively little applicability.

I'd really like to continue to avoid machine-specific packages and APIs. It's incredibly easy to just add "math/bits/amd64". It's a lot more work to find portable APIs, but if we can do that, the result is better for the ecosystem, for portability, and for Go's long-term viability.

encoding/binary and the compiler are already fairly tightly coupled. I don't believe the compiler knows directly about the package's import path, but there was definitely some co-evolution between code patterns the compiler recognizes and code patterns used in that package to make sure they match, with the result that, I believe, things like binary.BigEndian.Uint32 compile into a bswap on x86 and presumably whatever the most efficient forms are on other systems as well.

It looks like binary.Uvarint and binary.PutUvarint would be the places where this would be most impactful. What if the compiler just knew the right inlined implementations, same as it does for math.Sqrt and most of math/bits? Or perhaps there is a middle ground where we adjust the implementations so that the key pattern the compiler can optimize is more recognizable?

@mdempsky
Copy link
Member Author

mdempsky commented Apr 29, 2021

I think directly intrinsifying binary.Uvarint and binary.PutUvarint would work. I think it might still be easier to keep them as Go code built on top of internal-only intrinsics though.

I suggested above generalizing runtime/internal/sys (which already provides runtime-specific CPU intrinsics, albeit mostly just math/bits stuff) to internal/*, so that we could experiment with additional intrinsics for use both by the runtime and encoding/binary and maybe other std packages, but not yet directly to end users. Does that sound reasonable as a near-term solution while we work out how (and whether) to publicly expose more intrinsics?

@zephyrtronium
Copy link
Contributor

zephyrtronium commented Apr 29, 2021

Some additional data that may be helpful: paging through a Sourcegraph search for PDEP or PEXT in assembly files alongside Go files, it seems the only Go projects use these instructions are the standard library and toolchain, github.com/klauspost/crc32 (archived as its optimizations were incorporated into Go 1.7), and Apache Arrow (which defines an ExtractBits function but doesn't seem to use it anywhere). This is not an exhaustive search, but still touches a few thousand repos. It also includes a number of results from non-Go projects, which might reveal additional applications of these instructions.

@renthraysk
Copy link

renthraysk commented Apr 30, 2021

Aren't both instructions useful for UTF-8 rune decoding and encoding too?

@rsc
Copy link
Contributor

rsc commented May 5, 2021

I suggested above generalizing runtime/internal/sys (which already provides runtime-specific CPU intrinsics, albeit mostly just math/bits stuff) to internal/*, so that we could experiment with additional intrinsics for use both by the runtime and encoding/binary and maybe other std packages, but not yet directly to end users. Does that sound reasonable as a near-term solution while we work out how (and whether) to publicly expose more intrinsics?

Sounds fine. (And anything in internal doesn't need to be in the proposal process.)

@rsc
Copy link
Contributor

rsc commented May 5, 2021

It sounds like we've converged on, at least for now, doing something not user-visible that makes package binary's varint code faster. If that's correct, then this proposal should probably be declined.

Do I have that correct?

@mdempsky
Copy link
Member Author

mdempsky commented May 6, 2021

Yeah, I think it's fine to decline this proposal for now in favor of internal-only intrinsics. If concrete use cases outside of the standard library arise, I think we can revisit this.

@rsc
Copy link
Contributor

rsc commented May 12, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals (old) May 12, 2021
@rsc rsc moved this from Likely Decline to Declined in Proposals (old) May 19, 2021
@rsc
Copy link
Contributor

rsc commented May 19, 2021

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed May 19, 2021
@golang golang locked and limited conversation to collaborators May 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

8 participants