Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/sys/cpu: add support for ARM #33508

Open
jpap opened this issue Aug 7, 2019 · 19 comments

Comments

@jpap
Copy link

commented Aug 7, 2019

It would be nice to have feature detection for ARM (32-bit) in x/sys/cpu.

Concretely, a new cpu.ARM struct that closely resembles the existing cpu.ARM64 struct, tailored to the ARM specific hardware capabilities. The following fields are proposed, which map directly to the {HWCAP, HWCAP2} auxiliary vector values on Linux and FreeBSD:

HasSWP      bool // SWP instruction support
HasHALF     bool // Half-word load and store support
HasTHUMB    bool // ARM Thumb instruction set
Has26BIT    bool // Address space limited to 26-bits
HasFASTMUL  bool // 32-bit operand, 64-bit result multiplication support
HasFPA      bool // Floating point arithmetic support
HasVFP      bool // Vector floating point support
HasEDSP     bool // DSP Extensions support
HasJAVA     bool // Java instruction set
HasIWMMXT   bool // Intel Wireless MMX technology support
HasCRUNCH   bool // MaverickCrunch context switching and handling
HasTHUMBEE  bool // Thumb EE instruction set
HasNEON     bool // NEON instruction set
HasVFPv3    bool // Vector floating point version 3 support
HasVFPv3D16 bool // Vector floating point version 3 D8-D15
HasTLS      bool // Thread local storage support
HasVFPv4    bool // Vector floating point version 4 support
HasIDIVA    bool // Integer divide instruction support in ARM mode
HasIDIVT    bool // Integer divide instruction support in Thumb mode
HasIDIV     bool // Integer divide instruction support in ARM and Thumb mode
HasVFPD32   bool // Vector floating point version 3 D15-D31
HasLPAE     bool // Large Physical Address Extensions
HasEVTSTRM  bool // Event stream support
HasAES      bool // AES hardware implementation
HasPMULL    bool // Polynomial multiplication instruction set
HasSHA1     bool // SHA1 hardware implementation
HasSHA2     bool // SHA2 hardware implementation
HasCRC32    bool // CRC32 hardware implementation

As I look around, I see code detecting CPU features based on the runtime.goarm value (which is set by the GOARM environment variable at link time), rather than a runtime check. This means that:

  1. As runtime.goarm is not const, the fast-path (e.g. using NEON) and slow-path fallback are being compiled into the binary, but only one path can ever be used. It would be nice if both paths can be used via run-time detection instead.
  2. Using the above, one cannot have a "universal binary" that is especially problematic on Android.

In one of my projects, I have resorted to parsing /proc/cpuinfo for run-time detection of NEON, which only works on Linux. I'd love to instead use the standard library.

@gopherbot gopherbot added this to the Proposal milestone Aug 7, 2019

@gopherbot gopherbot added the Proposal label Aug 7, 2019

@smasher164

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

x/sys/cpu has still not been totally updated to match internal/cpu, which has some of the feature flags you are looking for.
Note that so far, we have only exposed flags that are needed for the runtime and standard library. That is, CL 114826 exposes HasIDIVA for use in the runtime, and CL 126315 exposes HasVFPv4 for use in the math package. I could see the argument for exposing all of the HWCAP flags listed in linux, for use in external code.

I will defer to @martisch's judgement on that.

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

Note that changing internal/cpu seems a separate topic (it also already supports arm) and there are no plans to export those variables outside the runtime/standard library.

If x/sys/cpu support for linux/arm is enough (not *BSD or windows/arm) then it should be as easy as implementing the hwCap/hwCap2 bit checks in https://github.com/golang/sys/blob/master/cpu/cpu_arm.go since hwCap/hwCap2 should already be populated on Linux by: https://github.com/golang/sys/blob/master/cpu/cpu_linux.go. I think its fine to expose all linux supported but general arm applicable flags in x/sys/cpu.

Note that the above would not support android as /proc/self/auxv might not be accessible and it wont change the runtime and standard libs detection of arm features (since that uses internal/cpu and not x/sys/cpu) and the effects of setting goarm are not effected as they override any cpu package .e.g.:

func checkgoarm() {

MOVB runtime·goarm(SB), R11

@jpap

This comment has been minimized.

Copy link
Author

commented Aug 8, 2019

Thanks for the detailed response, Martin.

Note that changing internal/cpu seems a separate topic (it also already supports arm) and there are no plans to export those variables outside the runtime/standard library.

That's right; I'm happy to open a new issue to discuss that separately if you wish. I would propose that the internal implementation be separate (a "copy") of the x/sys/cpu version.

Since you mention it, I was reading the CL for #28148 (see this source file), and noticed how the use of detection of NEON could be improved by checking the hardware-caps instead of using the static runtime.goarm value. I did notice other similar uses throughout the Go internal/standard library, as you have linked to. While it might be some work to review all of them for a possible switch to runtime hardware-cap checks (esp where we can use a NEON code-path), it might be beneficial to do so over time?

(As said above, I am doing run-time detection at present by parsing /proc/cpuinfo which could also be improved.)

If x/sys/cpu support for linux/arm is enough (not *BSD or windows/arm) then it should be as easy as implementing the hwCap/hwCap2 bit checks in https://github.com/golang/sys/blob/master/cpu/cpu_arm.go since hwCap/hwCap2 should already be populated on Linux by: https://github.com/golang/sys/blob/master/cpu/cpu_linux.go. I think its fine to expose all linux supported but general arm applicable flags in x/sys/cpu.

I am mostly interested in Android, so support for Linux would be a win. (Apple has moved exclusively to aarch64 for some time now.)

It looks like FreeBSD has added support back in 2017 (see D12290 and D12291), so it might be possible to do something similar there also.

As far as I can tell, Go bring-up for windows/arm has stalled (c.f. #26148), so that would need to wait regardless?

Note that the above would not support android as /proc/self/auxv might not be accessible and it wont change the runtime and standard libs detection of arm features (since that uses internal/cpu and not x/sys/cpu) and the effects of setting goarm are not effected as they override any cpu package .e.g.:

BoringSSL provides a good reference implementation: they weak-link getauxval which is available on Android from API level 20, and fall back to /proc/self/auxv and /proc/cpuinfo in that order. So it appears we might be able to improve on the implementation you have linked to.

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

That's right; I'm happy to open a new issue to discuss that separately if you wish. I would propose that the internal implementation be separate (a "copy") of the x/sys/cpu version.

The scope of the feature request is currently unclear to me when reading the title and first post content. From the comments I would infer it is to add arm support for android. It also does not seem restricted to x/sys/cpu but part of the proposal is to change the runtimes use of goarm with dynamic detection. As internal/cpu IRC already has support for arm the runtime could already use that in some places where goarm is used (no need to copy from x/sys/cpu). However this would be a behaviour change and some platforms would still need to use goarm (to populate the cpu variables). I think that is better discussed decoupled from any x/sys/cpu support.

The Go compiler uses the goarm setting to make decisions about the emitted instructions and thereby does not produce "universal binaries". Changing this would be a larger change than x/sys/cpu support and likely have larger performance and binary size implications when switched to runtime detection:

if objabi.GOARM >= 6 {

if s.f.Config.arch == "arm" && objabi.GOARM == 5 {

Since you mention it, I was reading the CL for #28148 (see this source file), and noticed how the use of detection of NEON could be improved by checking the hardware-caps instead of using the static runtime.goarm value.

Seems like a good candidate for using x/sys/cpu but it seems it would need a fallback in x/sys/cpu to goarm for platforms not supporting CPU feature detection via AUXV or syscalls/proc.

It looks like FreeBSD has added support back in 2017 (see D12290 and D12291), so it might be possible to do something similar there also.

The Go runtime sets hwcap on arm in internal/cpu on FreeBSD by reading the auxv thats provided after argv. I dont think x/sys/cpu has direct access to that. So I think we can not simply copy that approach to x/sys/cpu.

sysauxv(auxv[:])

As far as I can tell, Go bring-up for windows/arm has stalled (c.f. #26148), so that would need to wait regardless?

No having only partial support in x/sys/cpu is fine. However to not regress in performance for other platforms we would seem to need a fallback to setting the cpu features based on goarm on those not supporting runtime detection.

BoringSSL provides a good reference implementation: they weak-link getauxval which is available on Android from API level 20, and fall back to /proc/self/auxv and /proc/cpuinfo in that order. So it appears we might be able to improve on the implementation you have linked to.

Thank you. Thats a nice reference. If we can make the same work under go then that looks like an option to gain feature detection support on android in x/sys/cpu.

@jpap

This comment has been minimized.

Copy link
Author

commented Aug 8, 2019

The scope of the feature request is currently unclear to me when reading the title and first post content. From the comments I would infer it is to add arm support for android. It also does not seem restricted to x/sys/cpu but part of the proposal is to change the runtimes use of goarm with dynamic detection. As internal/cpu IRC already has support for arm the runtime could already use that in some places where goarm is used (no need to copy from x/sys/cpu). However this would be a behaviour change and some platforms would still need to use goarm (to populate the cpu variables). I think that is better discussed decoupled from any x/sys/cpu support.

Let me be more clear...

This issue is about:

  1. Improving CPU hardware-cap detection in x/sys/cpu for third parties.
  2. Exposing all available arm (32-bit) properties like whether Advanced SIMD (NEON) is supported on the device.
  3. Support for Linux seems straightforward, by using BoringSSL as a reference. That implementation goes beyond the current implementation that exists in internal/cpu.
  4. Support for FreeBSD may be possible by querying the elf32_freebsd_sysvec symbol in libc; see posted links above to the FreeBSD site for further details.

This issue is not about,

  1. Changes to internal/cpu and how CPU hardware-cap detection is done internally in the Go standard library.
  2. How the Go compiler emits opcodes, in general, for the different arm architecture versions.

however a separate issue can be created that proposes:

  1. Improvements to the CPU hardware-cap detection in internal/cpu; for example, to determine whether Advanced SIMD (NEON) is supported on the device on Linux. The BoringSSL implementation for Linux could be used as a reference here also.
  2. I previously assumed there were more NEON/VFP accelerated implementations in the Go standard library, and my suggestion was to review the path taken to them if they relied on runtime.goarm. After a quick look, it appears there is no support for NEON-accelerated algorithms on arm; only a reference to "will want to revision NEON, when support is better", which is actually on arm64 where NEON is always supported. (That reference is to Plan 9 assembly support, I am guessing.)
  3. This separate issue could go toward promoting development of SIMD-accelerated algorithms in the Go standard library. For example, #28148 could determine at run-time whether the NEON accelerated path is taken, rather than relying on GOARM which is set at link time. See also a comment in #7300 for a similar suggestion.
  4. I am not proposing that the Go compiler emits a "fat" or "universal" binary equivalent where at runtime, the ARM architecture version is determined, and optimized instructions used for math operations, and other granular instructions. I am however proposing that run-time detection of CPU hardware-caps like NEON support could allow some algorithm implementations (i.e. larger, dedicated functions; e.g. as used in crypto) to take a fast-path.

The Go compiler uses the goarm setting to make decisions about the emitted instructions and thereby does not produce "universal binaries". Changing this would be a larger change than x/sys/cpu support and likely have larger performance and binary size implications when switched to runtime detection:

I understand this issue, but I would never propose it. My reference to a "universal binary" was with reference to one of my projects where I detect NEON, and take a fast-path accordingly. A true universal binary -- that is, a program supporting multiple architectures -- is often achieved though shipping separate texts for each architecture in a sandwiched "fat binary" as is possible on macOS, and only exists as a proof-of-concept on Linux. I am not proposing that here.

Seems like a good candidate for using x/sys/cpu but it seems it would need a fallback in x/sys/cpu to goarm for platforms not supporting CPU feature detection via AUXV or syscalls/proc.

Yes -- and sure beats developers individually using go:linkname to punch a hole through to runtime.goarm.

The Go runtime sets hwcap on arm in internal/cpu on FreeBSD by reading the auxv thats provided after argv. I dont think x/sys/cpu has direct access to that. So I think we can not simply copy that approach to x/sys/cpu.

I have not tried it, but it appears a symbol exists on libc that gets you access to the information without having to go via auxv.

@gopherbot

This comment has been minimized.

Copy link

commented Aug 16, 2019

Change https://golang.org/cl/190525 mentions this issue: cpu: add support for ARM (32-bit) feature detection.

@jpap

This comment has been minimized.

Copy link
Author

commented Aug 16, 2019

@martisch, I've submitted a CL for this issue as described by my past post, and have tested it on linux/arm and freebsd/arm. Are you able to review it?

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 17, 2019

@martisch, I've submitted a CL for this issue as described by my past post, and have tested it on linux/arm and freebsd/arm. Are you able to review it?

Sure. Thanks for working on it. I added a first round of high level comments.
I been meaning to reply more in depth about how we can approach this but unfortunately did not find the time earlier.

There is already support for linux auxv reading in cpu_linux. We should first leverage that existing detection to add the hwcap bits to variable mapping and then I think basic linux/arm support should already work. Continuing from there we can add additional support for android to work around /proc/self/auxv not necessary being accessible to fill hwcap in x/sys/cpu. Afterwards we can extend the support for arm on other platforms and wheel in help from testers with hardware on those.

What makes arm special in x/sys/cpu vs other archs is that absent hwcap detection we should fall back to the minimal set of hwcap bits set that is mandated by the goarm variable.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2019

As far as the API is concerned, which is what the proposal process would care about, it seems that the proposal is to add a cpu.ARM that is very similar to cpu.ARM64 with appropriate modifications for the actual CPU features that might be present.

/cc @tklauser @martisch @ianlancetaylor for feedback

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

For the API I think we should stay consistent with the other architectures and internal/cpu which means to expose a cpu.ARM struct with fields named HasNAME where NAME is the corresponding HWCAP name of the feature. If needed the client of x/sys/cpu can create combinations such as HasIDIVA | HasIDIVT outside of x/sys/cpu but we do not need to create these as fields inside the cpu.ARM struct.

In general (for adding other architectures) I think if the existing naming schema is followed we do not need to have an separate proposal for each API addition for additional CPU feature structs/variables in x/sys/cpu.

@jpap

This comment has been minimized.

Copy link
Author

commented Aug 27, 2019

@martisch, I have updated the original post to include a concrete list of fields.

If needed the client of x/sys/cpu can create combinations such as HasIDIVA | HasIDIVT outside of x/sys/cpu but we do not need to create these as fields inside the cpu.ARM struct.

I would argue that having "derived" fields, such as the proposed HasIDIV makes the API easier to consume by the user. Anyone else keen to chime in?

@andybons andybons changed the title proposal: x/sys/cpu -- add support for ARM proposal: x/sys/cpu: add support for ARM Aug 27, 2019

@rsc

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

Based on the discussion, this is doing for ARM what has already been done for other architectures in the package, and there have been no objections. This sounds like a likely accept.

Leaving open for a week for any objections to accepting.

/cc @randall77

@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

Sounds good.

I agree with Martin, we should just provide the raw hardware specs and let users make their own compound flags if needed. The set needed for any particular assembly is highly dependent on what that assembly needs, so even if there are a few standard ones people will be rolling their own anyway.

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Some down sides that I think combined outweigh the usefulness of providing combined feature flags:

  • There are many possible combinations that make sense and are used especially on amd64 (e.g. AVX+BMI... , AES+SSSE3+SSE41 ....). There is no useful objective cutoff criterion for inclusion of flag combinations for some architectures yet as we do not know their combination frequency without having implemented support for them e.g. arm.
  • If we choose our own names for feature combinations there can be name collisions with newly added features names in HWCAP and HWCAP2 in the future and we would loose the identity mapping for those which would be confusing. Naming features after their combinations can produce long names at which point it may be more readable to find a name useful for the context in which they are used locally (e.g. hashmap.useAEShash instead of cpu.HasAES_SSSE3_SSE41). Combination names can also be none self evident sometimes: should the name be cpu.X86.HasAES_SSSE3_SSE41 or cpu.X86.HasAES_SSE41_SSSE3?
  • Adding more features in alphabetical order into the feature flag struct will increase cache misses (granted we will be able to fit a lot (64) of them into one cache line or across two in many architectures) for non explicitly added combinations. Cache misses could also increase globally since different code paths are less likely to reuse a cache line already loaded by another program part. We can choose an ordering preferring useful flags together but then that depends on the use case and impacts readability of finding a name in the feature flag structure.
@jpap

This comment has been minimized.

Copy link
Author

commented Aug 28, 2019

Let us try to avoid conflating logical grouping of homogeneous hardware capabilities with one that is purely arbitrary. Saying "I always support integer division" (the topic of discussion above) is very different to "I support vector processing and bit manipulation", or "I support encryption and subset X of SIMD".

Let's consider for a moment how these hardware-capability flags are used in practice.

An algorithm needs to run and one implementation is selected from an available set based on hardware features (e.g. cpu.ARM.HasNEON), so that the runtime or power consumption is most favorable.

The amount of "work done" is often non-trivial: otherwise the developer could just let the compiler do its best without hand-tuning the instruction text. This also means that by the time the algorithm completes, the cache has been polluted and the hardware-capability flags have most likely been evicted. Think of image processing, compression, hashing, or cipher operation: they typically operate on lots of input data.

Let's say the algorithm instead relied on two hardware features. Regardless of whether the API under discussion reflected the situation in two fields or one "derived" field, the end result is the same: the cache will likely have evicted them all. Whether the fields were laid out in two or more cache lines to begin with are unlikely to impact the "setup overhead" of the algorithm as a whole.

Trying to optimize the hardware-capabilities cache line in the previous discussion seems rather over the top in general. When selecting hardware-dependent code paths, it is also often used practice to memoize the result in the form of a function pointer, avoiding a conditional branch. (It often, perhaps more importantly, makes the code more readable.)

The suggestion to include a derived field for ARM- and thumb-supported integer division (HasIDIV) was intended to simplify consumption of the API and nothing more. Even the Linux kernel implementation defines a convenience preprocessor symbol HWCAP_IDIV for the same purpose, and is where this all originated.

Either way, I'm not wedded to including this one derived field in the proposed API change. In fact, I have no intention of even using this field at all -- my interest is in consuming HasNEON. Please, let's not overcomplicate the situation here.

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Let us try to avoid conflating logical grouping of homogeneous hardware capabilities with one that is purely arbitrary.
Let's consider for a moment how these hardware-capability flags are used in practice.

Some of the examples given are not arbitrary but real use cases of the existing API mirrored in internal/cpu. e.g.:

cpu.X86.HasAES && // AESENC

https://github.com/golang/crypto/blob/614d502a4dac94afa3a6ce146bd1736da82514c6/chacha20poly1305/chacha20poly1305_amd64.go#L23

The suggestion to include a derived field for ARM- and thumb-supported integer division (HasIDIV) was intended to simplify consumption of the API and nothing more. Even the Linux kernel implementation defines a convenience preprocessor symbol HWCAP_IDIV for the same purpose, and is where this all originated.

That is a good point that would make the problem of name collisions less likely and does set IDIV apart from other combinations. Note that as far as I perceived the discussion it was generally about "derived" fields not only about IDIV.

Either way, I'm not wedded to including this one derived field in the proposed API change. In fact, I have no intention of even using this field at all -- my interest is in consuming HasNEON. Please, let's not overcomplicate the situation here.

Then it would seem we can just leave it out for now and wait for a use case.

@martisch

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Let's say the algorithm instead relied on two hardware features. Regardless of whether the API under discussion reflected the situation in two fields or one "derived" field, the end result is the same: the cache will likely have evicted them all. Whether the fields were laid out in two or more cache lines to begin with are unlikely to impact the "setup overhead" of the algorithm as a whole.

This may very well be a extreme scenario and therefore should only be a secondary consideration but for something like a memcpy implementation that is frequently accessed in different parts of a programming pulling in one more cache line besides the copied data (which is often small) can be shown to matter measurably in production.

Given that one can always add more but not easily remove feature variables I would still suggest to stay with a leaner set for now until its known what a good criteria for the potential inclusion of "derived" flags would be.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 3, 2019

Marked this last week as likely accept w/ call for last comments (#33508 (comment)).
No dissenting comments, so accepting.

@ianlancetaylor ianlancetaylor changed the title proposal: x/sys/cpu: add support for ARM x/sys/cpu: add support for ARM Sep 3, 2019

@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Unreleased Sep 3, 2019

@jpap

This comment has been minimized.

Copy link
Author

commented Sep 5, 2019

Now that the proposal has been accepted, I will update my PR (CL) shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.