cmd/internal/obj/x86: AVX512 design #22779

Closed
quasilyte opened this issue Nov 17, 2017 · 17 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.

@quasilyte
Contributor

quasilyte commented Nov 17, 2017

Discussion started here: golang-dev: AVX512 syntax

This issue keeps track of all agreed implementation/design choices (though most things may change many times), as well as discussed alternatives.

If you can and/or want to participate, please leave a comment here or in the thread linked above.


1. Accepted solutions:

1.1. New registers:

  • X16-X31
  • Y16-Y31
  • Z0-Z31
  • K0-K7 masking registers. Exact operand syntax/position not yet decided.

1.2. AVX512_4FMAPS register range operand:

Specified with ARM NEON-style register range syntax: [Rx-Ry].

V4FMADDPS (CX), [Z4-Z7], Z1 // Proposed Go syntax

V4FMADDPS (CX), Z4+3, Z1    // Intel manual syntax
V4FMADDPS (CX), Z4, Z1      // GAS syntax
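
As a rough illustration of the proposed [Rx-Ry] operand, the following Go sketch parses a Z register range into its first and last register indices. The names parseRegRange and zIndex are hypothetical helpers made up for this example; this is not the cmd/asm parser, and it only assumes the Z0-Z31 register set listed in 1.1.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// zIndex converts a register name such as "Z4" into its index.
// Hypothetical helper: only Z0-Z31 are accepted in this sketch.
func zIndex(name string) (int, error) {
	if len(name) < 2 || name[0] != 'Z' {
		return 0, fmt.Errorf("%q: not a Z register", name)
	}
	n, err := strconv.Atoi(name[1:])
	if err != nil || n < 0 || n > 31 {
		return 0, fmt.Errorf("%q: register index out of range", name)
	}
	return n, nil
}

// parseRegRange parses a range operand such as "[Z4-Z7]".
// Hypothetical helper for illustration, not the real assembler code.
func parseRegRange(s string) (lo, hi int, err error) {
	if len(s) < 2 || s[0] != '[' || s[len(s)-1] != ']' {
		return 0, 0, fmt.Errorf("%q: expected [Zx-Zy]", s)
	}
	parts := strings.Split(s[1:len(s)-1], "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("%q: expected exactly one range", s)
	}
	if lo, err = zIndex(parts[0]); err != nil {
		return 0, 0, err
	}
	if hi, err = zIndex(parts[1]); err != nil {
		return 0, 0, err
	}
	if hi < lo {
		return 0, 0, fmt.Errorf("%q: descending range", s)
	}
	return lo, hi, nil
}

func main() {
	lo, hi, err := parseRegRange("[Z4-Z7]")
	fmt.Println(lo, hi, err) // 4 7 <nil>
}
```

Keeping both ends of the range explicit makes the four-register block visible in the source, which the Intel "Z4+3" and GAS "Z4" spellings leave implicit.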

2. Subjects under discussion:

Special syntax like {1toX} and {sae} is avoided:

We try to keep the syntax as regular as possible, meaning no special { } just for x86.

2.1. Masking register syntax.

VPSLLD $171, Z5, Z6, K7 // a) K after destination.
VPSLLD $171, Z5, K7, Z6 // b) K before the destination.
VPSLLD $171, Z5, Z6(K4) // c) K with addressing syntax.
VPSLLD $171, Z5, Z6&K4  // d) K as register modifier.

2.2. Zeroing syntax.

// Arbitrarily assumes that 2.1.a is accepted.
VPSLLD.Z $171, Z5, Z6, K7   // a) ".Z" opcode suffix.
VPSLLD $171, Z5, Z6, K7(Z)  // b) Z with addressing syntax near K.
VPSLLD $171, Z5, Z6&K4.ZERO // c) Masking register modifier.

2.3. Encoding selection: VEX vs EVEX.

a) Always favor VEX encoding variants.
b) Some kind of flag to enable EVEX_ENCODING whenever beneficial.
c) Require explicit K operand for EVEX variants to give programmer full control over selected encoding.

2.4. Broadcast.

VPSLLD.1to16 $2, (AX), Z5, 2, Z6 // a) ".1toN" opcode suffix.
VPSLLD.B16 $123, 508(DX), Z6     // b) ".BN" opcode suffix.
VPSLLD $2, bcst (AX), Z5, 2, Z6  // c) "bcst" keyword.

2.5. Rounding.

VADDPD.RN.SAE Z4, Z5, Z6 // a) Opcode suffix.

3. Key notes:

Useful information about past/current trade-offs:
  • The suffix-based operand size approach is not used for X/Y/Z registers, to make searching for information easier. This also implies that VEX/EVEX encoding should not be resolved by a special suffix/prefix (there was no such problem with SSE vs VEX because most of the latter's opcodes are prefixed with "V").
  • Operands of CMP instructions are not in the "special order" used by legacy CMPs. This is consistent with AVX1/2.
  • Some instructions with 32-bit data size have a D suffix in the Intel manual, for example KADDD. Go uses the L suffix for 32-bit operands, so the KADDL opcode is used instead.
@bradfitz bradfitz added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Nov 17, 2017
@bradfitz bradfitz added this to the Go1.11 milestone Nov 17, 2017
@rasky
Member

rasky commented Nov 18, 2017

Masking: ARM uses R7>>R8 and R7@>R8 for barrel shifts/rotations. I think this is the closest concept to what AVX512 allows. My suggestion is to use a similar syntax with a register modifier, like Z6@K7 or Z6/K7 or maybe Z7&K7, where the ampersand recalls the bitwise AND operator, which is basically what happens.

Zeroing: this is conceptually part of masking (at least in my view), so the zero modifier must be part of the masking syntax, not part of the opcode or elsewhere. If we go with Z7&K7, for instance, we can have Z7&K7Z or Z7&K7.Z or Z7&K7&Z or Z7&K07 or Z7&K7.ZERO. It's unfortunate that the "Z" letter is used for the new set of registers as well, but I guess we have to live with this.

Broadcast: I think having an opcode suffix .Bn is better than the eye-bleeding "1toN" moniker. If we keep masking/zeroing as a register modifier (as I suggest above), there is no composition issue and we can simply have VPSLLD.B16 $123, 508(DX), Z6&K2.

Rounding. Again, an opcode suffix is probably the best, and it doesn't overlap with broadcasting anyway. VADDPD.RN.SAE

@quasilyte
Contributor Author

@rasky, I believe that rounding can't be specified without SAE.
RN.SAE looks like two separate suffixes, which raises additional questions:

  • Are SAE.RN or RN.B16.SAE valid?
  • Can we omit SAE, as it is implied anyway?

Maybe RN_SAE/RNSAE is a viable alternative?

VADDPD.RN_SAE Z4, Z5, Z6

@quasilyte
Contributor Author

quasilyte commented Nov 22, 2017

For reference, regarding the "VEX vs EVEX" question:
XED requires an explicit k mask register to select EVEX encoding.

$ ./xed -64 -e VADDPD xmm0 xmm1 xmm2
Request: VADDPD MODE:2, REG0:XMM0, REG1:XMM1, REG2:XMM2, SMODE:2
OPERAND ORDER: REG0 REG1 REG2 
Encodable! C5F158C2
.byte 0xc5,0xf1,0x58,0xc2

// (basically) same operation, but with EVEX encoding.
$ ./xed -64 -e VADDPD xmm0 k0 xmm1 xmm2
Request: VADDPD MODE:2, REG0:XMM0, REG1:K0, REG2:XMM1, REG3:XMM2, SMODE:2
OPERAND ORDER: REG0 REG1 REG2 REG3 
Encodable! 62F1F50858C2
.byte 0x62,0xf1,0xf5,0x08,0x58,0xc2

// As a consequence, an attempt to use the high 16 registers
// without k0 results in an error.
$ ./xed -64 -e VADDPD xmm17 xmm18 xmm19
Request: VADDPD MODE:2, REG0:XMM17, REG1:XMM18, REG2:XMM19, SMODE:2
OPERAND ORDER: REG0 REG1 REG2 
Could not encode: VADDPD xmm17 xmm18 xmm19 
Error code was: GENERAL_ERROR
[XED CLIENT ERROR] Dying

The same trick is applicable to x86 asm.
It means that the instruction tables for such instructions will look like this:

{X0-X15, X0-X15, X0-X15/MEM}        => VEX
{X0-X31, K0-K7, X0-X31, X0-X31/MEM} => EVEX

Instead of:

// Note missing K0-K7 in EVEX table.
{X0-X15, X0-X15, X0-X15/MEM} => VEX
{X0-X31, X0-X31, X0-X31/MEM} => EVEX

This simple change makes it possible for the programmer to explicitly choose which encoding to use.

Not sure if it addresses this:

I guess quite some projects are using “dynamic” switching based on CPU capabilities, so if a subroutine has AVX512 (EVEX) instructions, it would be nice if the whole routine could take advantage of EVEX encoding.
Would adding a flag like “EVEX_ENCODING” in “textflag.h” make sense to enable this?
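
The XED-style form tables described above can be sketched in Go as follows. This is only a simplified model under stated assumptions: the operand and Encoding types and the selectEncoding function are hypothetical names invented for this example, and the rule shown (a K operand selects EVEX, X16-X31 without K is an error) mirrors the XED behaviour quoted in this comment rather than any final cmd/asm behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// Encoding names the prefix family chosen for an instruction form.
type Encoding string

const (
	VEX  Encoding = "VEX"
	EVEX Encoding = "EVEX"
)

// operand is a deliberately simplified operand model (hypothetical):
// kind is 'X' for an XMM register, 'K' for a mask register, 'M' for memory.
type operand struct {
	kind  byte
	index int
}

// selectEncoding models the two-row table from the comment above:
//   {X0-X15, X0-X15, X0-X15/MEM}        => VEX
//   {X0-X31, K0-K7, X0-X31, X0-X31/MEM} => EVEX
func selectEncoding(ops []operand) (Encoding, error) {
	hasK, high := false, false
	for _, op := range ops {
		switch op.kind {
		case 'K':
			hasK = true
		case 'X':
			if op.index >= 16 {
				high = true
			}
		}
	}
	switch {
	case hasK:
		return EVEX, nil
	case high:
		return "", errors.New("X16-X31 require an explicit K operand")
	default:
		return VEX, nil
	}
}

func main() {
	enc, err := selectEncoding([]operand{{'X', 0}, {'K', 0}, {'X', 1}, {'X', 2}})
	fmt.Println(enc, err)
}
```

Because the choice depends only on the operand list, a single routine can mix VEX and EVEX forms without any per-function flag.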

CC @TocarIP

@fwessels

@quasilyte I noticed the same behaviour in xed regarding the EVEX instructions (and the explicit need to mention a k register for registers >= 16).

Sticking to the same behaviour nicely addresses the issue (and even allows mixed-mode routines, which is a benefit).

@quasilyte
Contributor Author

quasilyte commented Dec 6, 2017

golang-dev first message, 5th question (@TocarIP):

  1. Currently VCVTPD2PS and some other instructions exist in two forms, VCVTPD2PSX and VCVTPD2PSY,
    because we can't distinguish between the 128-bit and 256-bit versions in some cases:
    VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix
    But if we add broadcasting, we can distinguish those cases: 1to4 means 256, 1to2 means 128.
    So should we add a version of VCVTPD2PS without the X/Y suffix, but with mandatory broadcasting?

List of such instructions, based on x86.csv v0.2:

$ grep VEX x86.v0.2.csv | egrep '"128"|"256"'
"VCVTPD2DQ xmm1, xmm2/m128","VCVTPD2DQX xmm2/m128, xmm1","vcvtpd2dqx xmm2/m128, xmm1","VEX.128.F2.0F.WIG E6 /r","V","V","AVX","","w,r","Y","128"
"VCVTPD2DQ xmm1, ymm2/m256","VCVTPD2DQY ymm2/m256, xmm1","vcvtpd2dqy ymm2/m256, xmm1","VEX.256.F2.0F.WIG E6 /r","V","V","AVX","","w,r","Y","256"
"VCVTPD2PS xmm1, xmm2/m128","VCVTPD2PSX xmm2/m128, xmm1","vcvtpd2psx xmm2/m128, xmm1","VEX.128.66.0F.WIG 5A /r","V","V","AVX","","w,r","Y","128"
"VCVTPD2PS xmm1, ymm2/m256","VCVTPD2PSY ymm2/m256, xmm1","vcvtpd2psy ymm2/m256, xmm1","VEX.256.66.0F.WIG 5A /r","V","V","AVX","","w,r","Y","256"
"VCVTTPD2DQ xmm1, xmm2/m128","VCVTTPD2DQX xmm2/m128, xmm1","vcvttpd2dqx xmm2/m128, xmm1","VEX.128.66.0F.WIG E6 /r","V","V","AVX","","w,r","Y","128"
"VCVTTPD2DQ xmm1, ymm2/m256","VCVTTPD2DQY ymm2/m256, xmm1","vcvttpd2dqy ymm2/m256, xmm1","VEX.256.66.0F.WIG E6 /r","V","V","AVX","","w,r","Y","256"

Planning to use X/Y/Z suffixes for now.
There is an additional question, though.
Given the VCVTPD2PS instruction:

VCVTPD2PS xmm1 {k1}{z}, xmm2/m128/m64bcst     => VCVTPD2PSX
VCVTPD2PS xmm1 {k1}{z}, ymm2/m256/m64bcst     => VCVTPD2PSY
VCVTPD2PS ymm1 {k1}{z}, zmm2/m512/m64bcst{er} => VCVTPD2PS or VCVTPD2PSZ?

There are at least two options: (a) use a suffix only when the proper instruction can't be encoded without it, or
(b) add a suffix for the whole instruction class.
For reference, GAS uses (a):

        VCVTPD2PSX (%rax), %xmm0
        VCVTPD2PSY (%rax), %xmm0
        VCVTPD2PS (%rax), %ymm0

        VFPCLASSPSX $1, (%rax), %k0{%k1}
        VFPCLASSPSY $1, (%rax), %k0{%k1}
        VFPCLASSPSZ $1, (%rax), %k0{%k1}

With the current quality of error messages, it could be very disappointing to see "error: invalid instruction" when the "Z" suffix is used where it is not expected, or omitted where it is required. Maybe an alias can help here, but it would be better to have more precise error messages (see #21860).

@quasilyte
Contributor Author

quasilyte commented Dec 16, 2017

Work in progress examples:

	// Embedded rounding.
	VADDPD.RU_SAE Z3, Z2, K1, Z1   // 62f1ed5958cb
	VADDPD.RD_SAE Z3, Z2, K1, Z1   // 62f1ed3958cb
	VADDPD.RZ_SAE Z3, Z2, K1, Z1   // 62f1ed7958cb
	VADDPD.RN_SAE Z3, Z2, K1, Z1   // 62f1ed1958cb
	VADDPD.RU_SAE.Z Z3, Z2, K1, Z1 // 62f1edd958cb
	VADDPD.RD_SAE.Z Z3, Z2, K1, Z1 // 62f1edb958cb
	VADDPD.RZ_SAE.Z Z3, Z2, K1, Z1 // 62f1edf958cb
	VADDPD.RN_SAE.Z Z3, Z2, K1, Z1 // 62f1ed9958cb

	// Embedded broadcasting.
	VADDPD.BCST (AX), X2, K1, X1    // 62f1ed195808
	VADDPD.BCST.Z (AX), X2, K1, X1  // 62f1ed995808
	VADDPD.BCST (AX), Y2, K1, Y1    // 62f1ed395808
	VADDPD.BCST.Z (AX), Y2, K1, Y1  // 62f1edb95808
	VADDPD.BCST (AX), Z2, K1, Z1    // 62f1ed595808
	VADDPD.BCST.Z (AX), Z2, K1, Z1  // 62f1edd95808
	VMAXPD.BCST (AX), Z2, K1, Z1    // 62f1ed595f08
	VMAXPD.BCST.Z (AX), Z2, K1, Z1  // 62f1edd95f08

	// Suppress all exceptions (SAE).
	VMAXPD.SAE   Z3, Z2, K1, Z1   // 62f1ed595fcb or 62f1ed195fcb
	VMAXPD.SAE.Z Z3, Z2, K1, Z1   // 62f1edd95fcb or 62f1ed995fcb
	VCMPSD.SAE $0, X0, X2, K0     // 62f1ef18c2c000
	VCMPSD.SAE $0, X0, X2, K1, K0 // 62f1ef19c2c000
	VMAXPD (AX), Z2, K1, Z1       // 62f1ed495f08

	// Multisource operands (4FMAPS/4VNNIW register range operand).
	VP4DPWSSD (AX), [Z0-Z3], K1, Z7   // 62f27f495238
	VP4DPWSSD 7(DX), [Z0-Z3], K1, Z7  // 62f27f4952ba07000000

	// K write mask.
	VADDPD X30, X1, X0               // 6291f50858c6
	VADDPD X2, X1, K1, X0            // 62f1f50958c2

Details:

  • An error is signaled when rounding and broadcast suffixes are combined.
  • The zeroing suffix should be the last suffix.
  • Register range does not accept comma-separated lists. Only simple ranges.
  • K0 can't be used as a write mask.
  • Broadcasting is enabled with the BCST suffix, which does not require mode hints (as there is no ambiguity).
  • VADDPD is not VADDPL. All opcodes, except the ones that got an X/Y/Z suffix, are identical to Intel syntax.
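
The encodings listed in the examples above differ mostly in the fourth byte, which is the last byte of the EVEX prefix. The following Go sketch decodes that byte, assuming the standard EVEX layout from the Intel manual (bit 7 = z, bits 6-5 = L'L, bit 4 = b, bit 3 = V', bits 2-0 = aaa). The evexP2 type and decodeP2 function are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// evexP2 holds the fields of the last EVEX prefix byte that the
// suffixes and the K operand control. Hypothetical type for illustration.
type evexP2 struct {
	Zeroing bool  // z bit: set by the .Z suffix
	LL      uint8 // L'L: vector length, or rounding mode when B is set on a register form
	B       bool  // b bit: broadcast, rounding control or SAE
	Mask    uint8 // aaa: index of the K write mask register
}

// decodeP2 splits the byte into its fields (the V' bit is ignored here).
func decodeP2(p2 byte) evexP2 {
	return evexP2{
		Zeroing: p2&0x80 != 0,
		LL:      (p2 >> 5) & 3,
		B:       p2&0x10 != 0,
		Mask:    p2 & 7,
	}
}

func main() {
	// Fourth byte of 62f1edXX58cb for RU_SAE, RD_SAE, RZ_SAE, RN_SAE and RU_SAE.Z.
	for _, b := range []byte{0x59, 0x39, 0x79, 0x19, 0xd9} {
		fmt.Printf("%#02x: %+v\n", b, decodeP2(b))
	}
}
```

Read this way, the rounding examples map RN, RD, RU and RZ to L'L values 0, 1, 2 and 3 with the b bit set, and the .Z variants only add the top bit.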

@quasilyte
Contributor Author

quasilyte commented Dec 16, 2017

The "always prefer VEX over EVEX" rule, combined with "no explicit K0" and no other way to enforce EVEX encoding, leads to this:

// VEX -- OK.
VADDPD (BX), X9, X2

// Two possible outcomes:
// a) signal error: "instruction does not support zeroing".
// b) select EVEX-encoded form.
VADDPD.Z (BX), X9, X2

I do believe that option (b) is better, but it may require some voodoo in the implementation, because most form matching is done over the operands.
Both examples above have the same operands, which complicates resolution a bit.

In my opinion, it's a +1 to "explicit K0" (XED-style):

// EVEX -- OK.
VADDPD.Z (BX), X9, K0, X2

@TocarIP
Contributor

TocarIP commented Dec 18, 2017

I'm not sure that VADDPD.Z (BX), X9, X2 is an important case. As far as I understand, this will use the default write mask k0, so no element will be zeroed. However, we have the same problem with broadcasting, which can be useful without any masks.

@quasilyte
Contributor Author

With a simple rule like "skip non-EVEX forms if the instruction has any suffixes", we can solve those issues with operand-based matching:

	// Forced EVEX encoding due to suffixes.
	VADDPD.B4 2032(DX), X0, X0         // 62f1fd185882f0070000
	VADDPD.B8 2032(DX), Y0, Y0         // 62f1fd385882f0070000

This is possible because x86 uses suffixes only for AVX512 features.
Even if this changes in the future, it is possible to maintain such behavior in a backwards-compatible way.
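
A minimal Go sketch of the suffix rule above, combined with the two suffix constraints listed earlier in the thread (zeroing last, no rounding together with broadcast). The function names splitOpcode, checkSuffixes and evexOnly, and the small set of recognized broadcast suffixes, are assumptions made for this example and do not come from cmd/asm.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// broadcastSuffixes is an assumed set covering the spellings used in this thread.
var broadcastSuffixes = map[string]bool{"BCST": true, "B4": true, "B8": true, "B16": true}

// splitOpcode splits "VADDPD.RU_SAE.Z" into "VADDPD" and ["RU_SAE", "Z"].
func splitOpcode(op string) (name string, suffixes []string) {
	parts := strings.Split(op, ".")
	return parts[0], parts[1:]
}

// evexOnly reports whether non-EVEX forms should be skipped,
// which in this sketch is the case whenever any suffix is present.
func evexOnly(op string) bool {
	_, suffixes := splitOpcode(op)
	return len(suffixes) > 0
}

// checkSuffixes enforces that Z comes last and that rounding is not
// combined with broadcast. Unknown suffixes are ignored in this sketch.
func checkSuffixes(suffixes []string) error {
	var rounding, broadcast bool
	for i, s := range suffixes {
		switch {
		case s == "Z":
			if i != len(suffixes)-1 {
				return errors.New("zeroing suffix must be last")
			}
		case broadcastSuffixes[s]:
			broadcast = true
		case strings.HasSuffix(s, "_SAE"):
			rounding = true
		}
	}
	if rounding && broadcast {
		return errors.New("rounding and broadcast suffixes can't be combined")
	}
	return nil
}

func main() {
	for _, op := range []string{"VADDPD", "VADDPD.B4", "VADDPD.RU_SAE.Z", "VADDPD.Z.RU_SAE"} {
		_, suffixes := splitOpcode(op)
		fmt.Println(op, evexOnly(op), checkSuffixes(suffixes))
	}
}
```

Since the decision is made from the opcode text alone, it runs before any operand matching and leaves the existing VEX form tables untouched.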

@quasilyte
Contributor Author

@TocarIP, zeroing without masking is permitted by GAS, but rejected by, for example, XED.

It can be used to force EVEX encoding when VEX would be selected otherwise (see above).

@gopherbot

Change https://golang.org/cl/104496 mentions this issue: x86/x86spec: enable XED-based x86.csv generation

@quasilyte
Contributor Author

Up-to-date examples: https://golang.org/cl/107217.

@gopherbot

Change https://golang.org/cl/107216 mentions this issue: x86/x86avxgen: enable AVX512 encoder tables generation

@gopherbot

Change https://golang.org/cl/113315 mentions this issue: cmd/asm: enable AVX512

gopherbot pushed a commit to golang/arch that referenced this issue May 16, 2018
Now generates both VEX and EVEX encoded optabs.

Encoder based on these optabs passes tests added in
https://golang.org/cl/107217.

This version uses XED datafiles directly instead of x86.csv.

Also moves x86/x86spec/xeddata package to x86/xeddata to make it
usable from x86 packages.
Ported x86spec pattern set type to xeddata.

Updates golang/go#22779

Change-Id: I304267d888dcda4f776d1241efa524f397a8b7b3
Reviewed-on: https://go-review.googlesource.com/107216
Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
gopherbot pushed a commit that referenced this issue May 22, 2018
- Uncomment tests for AVX512 encoder
- Permit instruction suffixes for x86
- Permit limited reg list [reg-reg] syntax for x86 for multi-source ops
- EVEX encoding support in obj/x86 (Z-cases, asmevex, etc.)
- optabs and ytabs generated by x86avxgen (https://golang.org/cl/107216)

Note: suffix formatting implemented with updated CConv function.
Now arch asm backend should register formatting function by
calling RegisterOpSuffix.

Updates #22779

Change-Id: I076a167ee49582700e058c56ad74e6696710c8c8
Reviewed-on: https://go-review.googlesource.com/113315
Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@quasilyte
Contributor Author

quasilyte commented Jun 2, 2018

Documentation is published in the form of a Go wiki page.

I'll send a small update to doc/asm.html that mentions that page to improve discoverability.

This issue can be closed if there are no open questions left.

@rsc
Contributor

rsc commented Jun 11, 2018

Ping @cherrymui. Please close this issue if you are happy with the decisions documented on the wiki page.

EDIT - Corrected Cherry's name so that github notification goes out.

@cherrymui
Member

The wiki page looks ok to me. Closing.

For testing, there is #25724 still open. If we do anything there, it may affect the design and the wiki page. But we can discuss it there.

@golang golang locked and limited conversation to collaborators Feb 18, 2020