all: support AVX-512 #20
Comments
Can I provide any help here? |
@mmcloughlin thanks for the effort. Is there any planned date AVX512 support can be added? |
@rleiwang I don't have an exact date, but I am aware that AVX-512 is now an essential feature for this library, since real hardware deployment of AVX-512 is now far more widespread. I'm trying to finish off another side project right now, then there's another little thing I have to do, but after that I'd like to focus on this. Maybe I'll start work in a couple of weeks. |
The opcodes database only supports a subset of AVX-512 extensions.
Using the opcodes database is the simplest way to get some AVX-512 support for now. Complete support would probably rely on XED #23. |
I'm very interested in getting this issue solved. If the 6 month lapse in activity is any hint, it seems to have stalled out (for now). I've been using Avo for some AVX2 coding with great success, but I'm not yet an "expert" in either. That said, if there's any way I can contribute to getting this issue moving, I'd be happy to help. I did some quick poking around, and it appears that Avo already knows about the AVX-512 registers (well, the ZMMs, not sure about the masks). The xml file referenced above seems to have all of the information for at least a large subset of the most commonly deployed AVX-512 features (e.g. those being widely deployed in the 10th Gen Core architecture rollout). Since all of the Avo x86 instruction API code seems to be auto-generated (from that file?), I'm assuming the code generator is where the bulk of the needed work is concentrated. That and register allocation? Just trying to sort out the next couple of concrete steps. Or do you have some grander vision for the package that is less incremental? |
@vsivsi Sorry about this. My progress on open-source work kinda hit a wall this year due to what I could only describe as COVID malaise. Then I moved apartments, now I'm trying to apply to grad school (including UW!). But this is very much on my radar, and I'm hoping to return to this once the grad school deadlines have passed (December 15). I also have an M1 mac on order, so I'm wondering about avo for arm64 as well! As for the technical details, yes you've got it about right. I need to add support for masked registers. Then it's a matter of extending the instruction code generator to output those What's the timeline for your work? How soon do you need this? |
Well, need is probably putting it too strongly... But I'd like to have it ASAP. The Go assembler works fine for AVX-512 as far as I've been able to test, so manually writing code (or writing simple bespoke generators) is the fallback. Bringing up ARM64 is interesting, and it has its (semi-)equivalent Neon SIMD/vector extensions. Would you say that Avo was designed with the necessary flexibility to mostly just plug in different ISA targets? Or would it be a major rewrite to generalize? BTW, as a part of supporting AVX-512, it might be great if Avo autogenerated the necessary architecture/CPUID checks for a given assembly function to either switch to a less restrictive assembly function (e.g., AVX2 vs AVX-512), a pure-Go implementation, or, failing that, return an error/panic when the runtime CPU doesn't support required features. That kind of boilerplate is tricky. It's helpful that Avo already tracks the CPUID dependencies of the used instructions, so that is a logical next step (and would also ease multi-architecture ARM/Intel implementations as well!) Well this is promising... I guess my next question is, if I were to throw, say, 4 hours at this, what should my next move be? Sussing out the Kn mask register situation in the code? Figuring out what might be missing from the XML data and how to pull that into the "internal database"? I'm pretty resourceful, but with something like this, it's just a bit hard to know where to productively start! |
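To make the CPUID dispatch idea above concrete, here is a minimal sketch of the boilerplate being described. Everything in it is an assumption for illustration: the `features` struct, `selectSum`, and `errUnsupported` are hypothetical names, and the pure-Go routine stands in for the assembly versions so the example stays self-contained. It is not something Avo generates.

```go
package main

import (
	"errors"
	"fmt"
)

// features is a hypothetical stand-in for the result of a runtime CPUID query.
// A real program would fill it from a CPU detection package.
type features struct {
	AVX2    bool
	AVX512F bool
}

// sumFunc is the common signature shared by every implementation.
type sumFunc func(xs []uint64) uint64

// sumGeneric is the pure-Go fallback. In this sketch it also stands in for
// the assembly versions, which would be generated separately.
func sumGeneric(xs []uint64) uint64 {
	var s uint64
	for _, x := range xs {
		s += x
	}
	return s
}

var errUnsupported = errors.New("no implementation available for this CPU")

// selectSum picks the most capable implementation the CPU supports.
// When allowGeneric is false and no vector path is usable, it returns an error.
func selectSum(f features, allowGeneric bool) (string, sumFunc, error) {
	switch {
	case f.AVX512F:
		return "avx512", sumGeneric, nil // placeholder for the AVX-512 routine
	case f.AVX2:
		return "avx2", sumGeneric, nil // placeholder for the AVX2 routine
	case allowGeneric:
		return "generic", sumGeneric, nil
	default:
		return "", nil, errUnsupported
	}
}

func main() {
	name, sum, err := selectSum(features{AVX2: true}, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(name, sum([]uint64{1, 2, 3}))
}
```

The selection would normally run once at package initialisation, so the per-call cost is a single indirect function call.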
Just to chime in here -- it's technically possible to use AVX-512 instructions in Avo today, you just need to do a lot of the heavy lifting yourself. Here's an example from one of my projects. |
@lukechampine That's excellent! I had wondered how difficult it would be to just define a few instructions using Avo's lower level interfaces, and it seems from your linked code to be pretty straightforward. Also, seems like a decent starting place to grok a little better how all of this works internally within Avo before diving in to solve this issue more generally. Thanks! |
Goals
Features
Masking
"Instructions that support masking can omit K register operand." "K register should be placed right before destination operand."
VADDPD.Z (AX), Z30, K3, Z10
Zeroing
"Zeroing-masking can be activated with Z opcode suffix."
Broadcast
"For reg-mem instructions with m32bcst/m64bcst operand, broadcasting can be turned on with BCST opcode suffix."
Rounding
"For reg-reg FP instructions with {er} enabled, rounding opcode suffix can be specified:"
VADDPD.RU_SAE Z3, Z2, K1, Z1
VADDPD.RD_SAE Z3, Z2, K1, Z1
VADDPD.RZ_SAE Z3, Z2, K1, Z1
VADDPD.RN_SAE Z3, Z2, K1, Z1
SAE
"For reg-reg FP instructions with {sae} enabled, exception suppression can be specified with SAE opcode suffix."
VMAXPD.SAE.Z Z3, Z2, K1, Z1
Register Blocks
Opcodes database does not support 4VNNIW or 4FMAPS extensions.
VP4DPWSSD 7(SI)(DI*1), [Z2-Z5], K4, Z23
Note: these instructions are only available in Knights Mill processors at the moment, and do not seem to be a priority.
Other Libraries
Go
The Go assembler explicitly represents instruction suffixes using the See the following file for EVEX support in the x86 assembler:
PeachPy
Examples: # zeroing, rounding control
VADDSS(xmm30(k2.z), xmm4, xmm19, {rn_sae})
# Broadcast
VEXPANDPS(zmm9(k6.z), zword[r8 + 64])
asmjit
The following are part of the instruction:
https://asmjit.com/doc/classasmjit_1_1x86_1_1Assembler.html
#include <asmjit/x86.h>
using namespace asmjit;
void generateAVX512Code(x86::Assembler& a) {
using namespace x86;
// Opmask Selectors
// ----------------
//
// - Opmask / zeroing is part of the instruction options / extraReg.
// - k(reg) is like {kreg} in Intel syntax.
// - z() is like {z} in Intel syntax.
// vaddpd zmm0 {k1} {z}, zmm1, zmm2
a.k(k1).z().vaddpd(zmm0, zmm1, zmm2);
// Memory Broadcasts
// -----------------
//
// - Broadcast data is part of memory operand.
// - Use x86::Mem::_1toN(), which returns a new x86::Mem operand.
// vaddpd zmm0 {k1} {z}, zmm1, [rcx] {1to8}
a.k(k1).z().vaddpd(zmm0, zmm1, x86::mem(rcx)._1to8());
// Embedded Rounding & Suppress-All-Exceptions
// -------------------------------------------
//
// - Rounding mode and {sae} are part of instruction options.
// - Use sae() to enable exception suppression.
// - Use rn_sae(), rd_sae(), ru_sae(), and rz_sae() - to enable rounding.
// - Embedded rounding implicitly sets {sae} as well, that's why the API
// also has sae() suffix, to make it clear.
// vcmppd k1, zmm1, zmm2, 0x00 {sae}
a.sae().vcmppd(k1, zmm1, zmm2, 0);
// vaddpd zmm0, zmm1, zmm2 {rz}
a.rz_sae().vaddpd(zmm0, zmm1, zmm2);
}
Opcodes Database
The Opcodes database is going to be the easiest way to incorporate AVX-512 into Note that rounding modes and exception suppression are handled as operands, for example:
<InstructionForm gas-name="vcvtsd2si" xmm-mode="AVX">
<ISA id="AVX512F"/>
<Operand type="r32" input="false" output="true"/>
<Operand type="xmm" input="true" output="false"/>
<Operand type="{er}"/>
<Encoding>
<EVEX mm="01" pp="11" LL="#2" W="0" vvvv="0000" V="0" RR="#0" B="#1" X="#1" b="#2" aaa="000" z="0"/>
<Opcode byte="2D"/>
<ModRM mode="11" reg="#0" rm="#1"/>
</Encoding>
</InstructionForm>
or:
<InstructionForm gas-name="vucomiss" xmm-mode="AVX">
<ISA id="AVX512F"/>
<Operand type="xmm" input="true" output="false"/>
<Operand type="xmm" input="true" output="false"/>
<Operand type="{sae}"/>
<Encoding>
<EVEX mm="01" pp="00" W="0" vvvv="0000" V="0" RR="#0" B="#1" X="#1" b="#2" aaa="000" z="0"/>
<Opcode byte="2E"/>
<ModRM mode="11" reg="#0" rm="#1"/>
</Encoding>
</InstructionForm>
asmdb
https://github.com/asmjit/asmdb/blob/9bb794dc539c048bba8b6e089b116b65c8bef49b/x86data.js#L43-L49
In reality I don't see https://github.com/asmjit/asmdb/blob/9bb794dc539c048bba8b6e089b116b65c8bef49b/x86data.js#L2846-L2848
["vaddpd" , "W:xmm {kz},~xmm,~xmm/m128/b64" , "RVM-FV" , "EVEX.128.66.0F.W1 58 /r" , "AVX512_F-VL"],
["vaddpd" , "W:ymm {kz},~ymm,~ymm/m256/b64" , "RVM-FV" , "EVEX.256.66.0F.W1 58 /r" , "AVX512_F-VL"],
["vaddpd" , "W:zmm {kz},~zmm,~zmm/m512/b64 {er}" , "RVM-FV" , "EVEX.512.66.0F.W1 58 /r" , "AVX512_F"],
Resources |
Hmm, trying to come up with a good syntax for opcode suffixes. Example Go assembly:
VADDPD.RD_SAE.Z Z3, Z2, K1, Z1
Possible syntax:
// Accept variadic flags.
VADDPD(Z3, Z2, K1, Z1, mode.Z, mode.RD_SAE)
// Codegen all combinations.
VADDPD_RD_SAE_Z(Z3, Z2, K1, Z1)
// Function decorator.
Mod(VADDPD, mode.RD_SAE, mode.Z)(Z3, Z2, K1, Z1)
// Underscore version that takes flags.
VADDPD_(mode.RD_SAE, mode.Z)(Z3, Z2, K1, Z1)
// Methods.
VADDPD(Z3, Z2, K1, Z1).RD_SAE().Z()
Thoughts? |
I think generating all combinations is the most consistent with avo's existing design. It also seems like the easiest API for converting existing asm. The biggest downsides are that the namespace becomes a lot more cluttered (but let's be real, the Another consideration is that (unless I'm mistaken) the Failing that, I think my preference is for the "decorator" or "underscore" APIs, because they make it easy to do this: var VPADDD_ZB = VADDPD_(mode.Z, mode.BCST) I defined a bunch of helper functions like this for my blake3 implementation. This feels fairly ergonomic, since most of the time you only need a tiny subset of the possible suffix combinations. Of course, you can accomplish this using any of the suggested APIs; it's just a little nicer to do it concisely in one line with |
I’m not super opinionated about this stuff so long as it’s done consistently throughout a design. A couple of principles that haven’t been mentioned:
Also, think about whether/how to make the opmask register optional as well. Having to explicitly specify K0 everywhere would be cluttered, and would make spotting non-default masks more difficult. |
Now that I’ve had the chance to read everything through again, I think I’d vote for a slightly modded version of the “methods“ design. VADDPD().RD_SAE().Z(Z3, Z2, K1, Z1) This has the advantage of reading closer to the generated assembly, being autocomplete friendly, and still supporting the case @lukechampine describes above: var VPADDD_ZB = VADDPD().RD_SAE().Z And since each function would need to be variadic to handle 0 or 4 parameters, might as well support 3 as well: VADDPD().RD_SAE().Z(Z3, Z2, Z1) // K0 implied Thoughts? |
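As a rough illustration of how the modified "methods" design above could behave, here is a toy sketch that renders Go assembly text instead of building real instructions. The `inst` type, its `with` and `Emit` helpers, and the use of plain strings for operands are all assumptions made for this example; none of it is Avo's API.

```go
package main

import (
	"fmt"
	"strings"
)

// inst is a hypothetical builder: an opcode plus the suffixes chosen so far.
type inst struct {
	opcode   string
	suffixes []string
}

// with returns a copy of the builder with one more suffix appended.
// Copying keeps shared builder values safe to reuse.
func (i inst) with(s string) inst {
	out := make([]string, len(i.suffixes), len(i.suffixes)+1)
	copy(out, i.suffixes)
	return inst{opcode: i.opcode, suffixes: append(out, s)}
}

// Emit renders the instruction as a line of Go assembly text.
func (i inst) Emit(ops ...string) string {
	name := strings.Join(append([]string{i.opcode}, i.suffixes...), ".")
	return name + " " + strings.Join(ops, ", ")
}

// RD_SAE selects round-down with exceptions suppressed.
func (i inst) RD_SAE() inst { return i.with("RD_SAE") }

// Z is terminal: it adds the zeroing suffix last and emits the instruction.
func (i inst) Z(ops ...string) string { return i.with("Z").Emit(ops...) }

// VADDPD starts a builder for the VADDPD opcode.
func VADDPD() inst { return inst{opcode: "VADDPD"} }

func main() {
	fmt.Println(VADDPD().RD_SAE().Z("Z3", "Z2", "K1", "Z1"))

	// A method value gives the one-line shorthand discussed above.
	addRDZ := VADDPD().RD_SAE().Z
	fmt.Println(addRDZ("Z3", "Z2", "Z1")) // mask operand omitted
}
```

One wrinkle the sketch exposes: because `Z` both adds a suffix and takes the operands, a form without zeroing needs a separate terminal call (`Emit` here), so the call shape differs between zeroing and non-zeroing uses.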
Thanks for your input @lukechampine @vsivsi.
Maybe this is the point you're trying to make, but I think at the moment there is only one legal ordering in Go assembly itself. Specifically, https://github.com/golang/go/wiki/AVX512 says: "It is important to put zeroing opcode suffix last, otherwise it is a compilation error." That's implemented here: As you say, the Another factor is I have ARM support in the back of my mind, which also uses instruction suffixes. After a very quick search, I'm still not sure how many they have. Does anyone here know? Anyway, whatever design is chosen should ideally work for ARM as well. I'm leaning against codegen. I agree with @vsivsi that in general "parameterizing is superior to hard coding", and that you can still build shorthands given a parameterizable API.
Good point, hadn't considered this aspect.
This would be nice. I don't really like the "variadic flags" or "methods" approach I listed before since they put the suffixes in a different place to Go assembly. This would match Go assembly much more closely. I don't quite see how to implement this given the way the instruction functions like |
Yep, regarding masking, I see two high-level approaches. First, as you describe, you could have multiple instruction forms, one with the mask and one without. Second, you could copy PeachPy's approach and have a VADDPD(Z3, Z2, Z1) // K0 implied
VADDPD(Z3, Z2, Masked(K1, Z1)) My preference right now is for multiple forms, since it looks closest to Go syntax. However, I haven't thought through the implementation details, there may be a reason to prefer keeping the mask and the register associated with each other. |
Straw Man Proposal There are multiple places where instructions are generated.
These are in approximate order of importance in terms of API usability. Most people will only ever use the global functions in the I'm thinking of some kind of chaining builder API. The ctx.RD_SAE().Z().VADDPD(Z3, Z2, K1, Z1) Remember this is not the most common API, so I think it's fine that it looks "backwards" in this case. The most common interface is the global functions in the VADDPD.RD_SAE().Z()(Z3, Z2, K1, Z1)
VADDPD(Z3, Z2, K1, Z1) // can still be called as a function There's still an open question about what the constructors in the Questions:
|
Making good progress on #163! Does anyone have recommendations for AVX-512 examples? Ideally, these are realistic, small-ish, and demonstrate multiple AVX-512 features. |
TL;DR; I've ended up going with code generated functions, so it would look like Okay, I've gone in circles on this a bit. Yet another example of when I spent too much time thinking and not enough time prototyping. The proposed implementation right now #163 uses code generation rather than any of the fancier APIs discussed above. The reasons are:
Thoughts? @lukechampine @vsivsi |
@vsivsi You were talking about implicit masking before. I've implemented this by duplicating instruction forms that support masking, one with a Lines 14224 to 14301 in ebe0387
|
Spent more time agonizing over a detail. Leaving notes for myself.
Problem
Masking instruction forms have different actions on the output register depending on whether zeroing is enabled. So for example, the Currently this is handled incorrectly: there is no way to specify different actions based on the zeroing flag.
Solutions
Fixup Pass. Have a pass that removes the output register from input list in the case where the Thoughts: Not a big fan of handling something dynamically that could be done with code generation. However, this solution would be more robust if we do later support parameterized instruction suffixes.
Redesign Instruction Table. The current version of the instruction table cannot represent this properly. Each instruction form has a flag indicating whether it supports zeroing, whereas the operands list statically specifies the RW action on each operand. So it cannot represent the fact that the RW action depends on whether zeroing is used. So one option is to redesign the schema so that the zeroing and merging forms are separate forms. The simplest way to do this is to redefine the meaning of the Thoughts: Perhaps confusing that the
Redesign Function Table. Code generation works by first transforming the instruction table into a list of functions that will construct those instruction forms. Some instructions have multiple different functions corresponding to supported suffix combinations. Similar to above, we could handle this at the point instructions are converted into functions. Thoughts: Not a fan of this approach. At the moment there is a load of ugly logic in the instruction table building, but after that it gets simpler. I'd rather continue to keep all the nasty details in one place, where you expect them.
Special Action Type. Perhaps an action type like Thoughts: Again quite simple. Perhaps confusing to introduce a third action type, especially since some combinations would make no sense, like
Other Frameworks
PeachPy: I don't see any handling for this. I could have missed it but I suspect there is an edge case bug here.
asmjit: Does not handle this case either. Petr Kobalicek confirmed the bug. |
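The zeroing-versus-merging distinction in these notes can be stated as a small rule. The sketch below is only an illustration of that rule: the `Action` type and `destAction` function are hypothetical names, and it covers only destinations that are write-only in the unmasked form (an operand that is already read, such as an accumulator, would stay read-write regardless).

```go
package main

import "fmt"

// Action is a hypothetical read/write bitmask for one operand.
type Action uint8

const (
	R Action = 1 << iota // operand is read
	W                    // operand is written
)

func (a Action) String() string {
	switch a {
	case R:
		return "r"
	case W:
		return "w"
	case R | W:
		return "rw"
	}
	return "none"
}

// destAction reports how a destination that is write-only in the unmasked
// form should be treated once masking is taken into account.
func destAction(masked, zeroing bool) Action {
	if masked && !zeroing {
		// Merging: lanes whose mask bit is clear keep their old value,
		// so the destination is also an input.
		return R | W
	}
	// Unmasked, or zeroing: every lane is either overwritten or cleared.
	return W
}

func main() {
	fmt.Println("unmasked:", destAction(false, false))
	fmt.Println("merging: ", destAction(true, false))
	fmt.Println("zeroing: ", destAction(true, true))
}
```

Getting this wrong in the merging case matters for liveness analysis: a register allocator that believes the destination is write-only may reuse a register whose previous contents the instruction still depends on.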
Status: AVX-512 PR #163 builds for the first time, after dealing with more edge cases and enabling all the AVX-512 instruction sets supported by Opcodes database. I could land this now probably, but I'm hesitating mainly because the generated code has substantially grown to the point that it's affecting compile times. The CI jobs take a lot longer, especially linting. I'm exploring ways to reduce duplication in the generated code, which I hope will also reduce compile times. See Slack discussion, thanks @josharian! There's also some assorted todos left:
The compile times issue is the only real blocker since it affects the API design, which we'll be committed to once it lands. If it can't be fixed, might have to revisit the decision of autogenerating functions for every instruction suffix. |
Apologies for dropping out of the discussion for a bit there. I decided preserving my sanity required actually treating my vacation time as such, even though we didn't actually go anywhere. Looks like you've made excellent progress on this. Nice work spotting the result value merge under masking edge cases for register scheduling. I also find the compile time (and related name space) explosion to be concerning. One consideration is to think about this in terms of the inevitability of ever more complicated future ISA extensions from Intel (and likely on the ARM side as well). These architectures are steadily evolving into full blown vector units, and so I expect that the instruction sets will continue to become ever more complex over time. It feels like continuing to (approximately) double the size of the generated namespace with each such additional instruction option will rapidly become self-limiting in the future, even if you manage to overcome it for the generated forms this time. It's been a long time since I've had insider knowledge at Intel, but based on recent public statements they seem fully committed to continuing to expand features of the AVX512 architecture and whatever follows it. All of that aside, I also continue to aesthetically prefer the earlier proposed methods-based API, for whatever that's worth. |
@vsivsi Glad you got a real vacation :) Thanks for your input! I take your points. I'll take some time to consider the alternative API again, maybe prototype it. I actually think with the naive implementation I have, both APIs might stress the compiler. Compile times are not necessarily a deal-breaker because of the build cache, but it doesn't give users a good first impression and it's annoying for the development feedback loop. On the positive side, we do have something that works now. Wrangling the instruction database into shape was the most annoying part of it, and that's done :) |
Updated! See next comment. I've continued to use this branch to great effect. VGATHERDPS/VGATHERDPD/VGATHERQPS/VGATHERQPD The EVEX coded forms of these instructions should optionally accept a
Thanks again for all of the work that went into this branch! It's been a boon to have this support for the past couple of months. The fact that this is the first little annoyance I've discovered is a testament to how thorough you were with this AVX-512 implementation! Hope all is well and your grad school decision process is going smoothly! Best, -V |
Apologies, I was mistaken about the issue above! I just got back to this code, and this being the first time I've attempted to use the scatter/gather instructions in AVX-512, I didn't realize these instructions are irregular relative to the norm in their use of the mask register. Specifically, they zero each bit for every value that is successfully transferred. So it seems that it is not valid to use K0, or to omit a mask register! I was misled by this instruction syntax definition taken straight from the Intel documentation:
But in this case, at least as far as the Golang assembler is concerned, K0 is not a valid mask for these instructions (used explicitly).
And if you try to run the assembler against source code that omits a mask register, it actually crashes!
So... Seems like for now, the behavior that Avo currently enforces is needed. Obviously, the Golang assembler should never crash like this, so that's a separate issue that I'll need to chase down. Thanks again! -V |
Found the definitive answer in the Intel SDM (Vol. 1):
That last sentence answers the questions I raised in the post above. Omitting a mask register (K1-K7) from VPSCATTER/GATHER instructions is not allowed, as the K register needs to be a writable destination, and my conjecture about implicit mask --> K0 dest functionality is false. So the only bug here is the Golang assembler crashing when the mask register is omitted from one of these instructions. I'll nail that down to a simple set of test cases and submit an issue for this on the Go project repo. |
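The reason the mask has to be a real, writable K1-K7 register is easier to see in a toy model. The sketch below is an assumption-laden simulation in plain Go, not assembler or Avo code: `gather` and `checkGatherMask` are hypothetical helpers, the mask is modelled as a `uint8` covering up to eight lanes, and register names are plain strings.

```go
package main

import "fmt"

// gather models a masked gather: for each lane whose mask bit is set it loads
// table[idx[lane]] into dst[lane] and then clears that bit. The returned mask
// is what the opmask register would hold afterwards. At most 8 lanes.
func gather(dst, table []int32, idx []int, mask uint8) uint8 {
	for lane := range dst {
		bit := uint8(1) << uint(lane)
		if mask&bit == 0 {
			continue // lane not selected: destination left untouched
		}
		dst[lane] = table[idx[lane]]
		mask &^= bit // mark the lane as completed
	}
	return mask
}

// checkGatherMask is a hypothetical validator for the opmask operand name.
func checkGatherMask(mask string) error {
	switch mask {
	case "":
		return fmt.Errorf("gather/scatter requires an explicit opmask register")
	case "K0":
		return fmt.Errorf("K0 cannot be used as a gather/scatter mask")
	case "K1", "K2", "K3", "K4", "K5", "K6", "K7":
		return nil
	}
	return fmt.Errorf("%q is not an opmask register", mask)
}

func main() {
	dst := make([]int32, 4)
	left := gather(dst, []int32{10, 20, 30, 40}, []int{3, 2, 1, 0}, 0b0101)
	fmt.Println(dst, left)
	fmt.Println(checkGatherMask("K0"))
}
```

Because the instruction writes the mask, the same K register usually has to be reloaded before each gather in a loop.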
And... I just found the golang issue @mmcloughlin filed on this in December! |
Okay, this time I think I've found a genuine omission of consequence for my work. According to the Intel SW Dev Prog Ref, when CPUID bit AVX512VL is present at runtime, all forms of the VPOPCNT[B|W|D|Q] instructions are valid for use with 256-bit ymm and 128-bit xmm registers. And because VPOPCNT is an AVX512 "heavy" instruction when used with 512-bit registers, there is a strong performance benefit (avoided license downclocking) to only using the register width needed. The AVX512 branch currently returns "bad operands" when VPOPCNT is used with registers other than zmm. For example:
Not sure if this is caused by a deficiency in the instruction database used in generation or something in Avo's interpretation. I haven't encountered similar difficulties with any other AVX512VL impacted instructions. The
Other AVX512VL impacted instructions in that file appear to fully elaborate all of the valid xmm/ymm and m128/m256 combinations. |
@mmcloughlin I've traced the above VPOPCNT issue to the x86_64.xml file. It appears to be missing the AVX512VL forms of VPOPCNT[D|Q]. As an aside, it also appears to be missing all AVX512_BITALG instructions, including VPOPCNT[B|W]. But that's not currently a blocker for me. I've hand edited the VPOPCNT section of that file to add the 128 and 256 bit forms, using analogous forms for the instruction VPORQ and the Intel documentation as my guide. The result is here: https://gist.github.com/vsivsi/06b742a04d2c7e226fae4fd3ab0753dd Unless I've screwed something up, in theory you should just be able to patch that section of the file with the gist contents and this will just work. But I haven't been able to test that yet, because I'm having some trouble sorting out precisely how to reproduce the steps to generate all of the Avo code that depends on x86_64.xml, this being my first time attempting to dive this deeply into the guts of Avo. How should I proceed here? |
A quick update... merging the changes in the With that working, the patch in the gist linked above works and solves my immediate blocker. For completeness, over the weekend I'll update it to add support for all of the AVX512_BITALG instructions (and their VL forms). I guess the next question is strategy for merging these changes. Should I put together a PR for Avo directly, or attempt to submit a PR for |
Thanks for all your feedback here and sorry for not getting back to you! I'll respond to various points you've made, maybe not all at once.
Really glad to hear it's working well for you. Do you recall being annoyed by compile times? I have some local work on using an optab approach for instruction generation, and I think this should deal with the compile time problem. However, I ended up getting blocked in "analysis paralysis" regarding the instruction suffix API (underscores |
Good question! I thought about this a while ago, although your suggestion is more extensive than what I had in mind. I've created a separate issue for discussion #168. |
You went through the exact same journey I did regarding the gather/scatter instructions. Yes, these are indeed special cases. See: Lines 534 to 539 in d60cc02
Lines 98 to 144 in d60cc02
https://github.com/golang/go/blob/4fd94558820100129b98f284e21b19fc27a99926/src/cmd/internal/obj/x86/asm6.go#L4219-L4240 |
You figured this out. Unfortunately Go 1.16 broke the
So, yes, you also diagnosed the In the meantime, I'm fine with adding a patch file to |
Yeah, hand editing that Everything looks good as far as the generated code goes. The only wrinkles were the fact that the BITALG VPOPCNTB/W don't support memory broadcast, so I had to puzzle that out using the Intel docs and the XML schema for the x86_64.xml file. The final remaining BITALG instruction VPSHUFBITQMB is pretty crazy, I can't imagine what I'd use it for, but I suppose I'll know it when I see it. That really required me to dig into the EVEX encoding to get right, but I'm about 95% sure I got it. The generated Avo code looks correct. All of this is on this branch on my fork. It triggered avogen to generate a bunch of new code beyond the instructions themselves because VPSHUFBITQMB has a unique operand signature, but Avo seems to have taken it in stride. So kudos again! https://github.com/vsivsi/avo/tree/avx512_vpopcnt_VL I should probably contribute to the test coverage of this new stuff before folding it into a PR, but I'm out of steam for tonight, so it'll need to wait until Monday. |
I've picked up work on this in the last few days. #217 has the most recent work. I've finally completed the transition to using an optab approach for the function constructors in @kalamay The https://github.com/mmcloughlin/avo/runs/4128468815?check_suite_focus=true#step:6:37 It's a tiny fix so if it's okay I'd prefer to just make the fix on your end after the upgrade. The issue is that extending the instruction database has turned the add functions into variadic functions rather than a fixed number of arguments (they now take masked and unmasked versions). Probably most users wouldn't notice but you have some code that relies on the signature, here: The fix is just: diff --git a/build/slices/sums_asm.go b/build/slices/sums_asm.go
index c46d300..4c98520 100644
--- a/build/slices/sums_asm.go
+++ b/build/slices/sums_asm.go
@@ -19,7 +19,7 @@ type Processor struct {
typ string
scale uint8
avxOffset uint64
- avxAdd func(mxy, xy, xy1 Op)
+ avxAdd func(...Op)
x86Mov func(imr, mr Op)
x86Add func(imr, amr Op)
x86Reg reg.GPVirtual I've confirmed everything else still generates, at least on your v1.0.0 tag. What do you think? @vsivsi Thanks for your PR #199. I think this is something we can address. It'll be easier once the avx-512 stuff is actually landed. I guess you've had the most experience actually working with the avx512 branch? Has it worked for you? Do you think it's a good enough interface to land? |
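The signature change in the diff above can be reproduced in miniature. This is a hedged sketch with made-up stand-ins: `Op` is a plain string rather than Avo's operand type, `vpaddd` returns text so the result can be inspected, and `processor` only mirrors the shape of the struct in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// Op is a hypothetical stand-in for an operand type.
type Op string

// vpaddd mimics a constructor that accepts both the 3-operand form and the
// 4-operand masked form, which is why it has to be variadic.
func vpaddd(ops ...Op) string {
	if len(ops) != 3 && len(ops) != 4 {
		return "bad operands"
	}
	parts := make([]string, len(ops))
	for i, op := range ops {
		parts[i] = string(op)
	}
	return "VPADDD " + strings.Join(parts, ", ")
}

// processor stores the constructor in a field. A field declared as
// func(a, b, c Op) string would not accept vpaddd, because Go function
// types must match exactly; func(...Op) string does.
type processor struct {
	avxAdd func(...Op) string
}

func main() {
	p := processor{avxAdd: vpaddd}
	fmt.Println(p.avxAdd("Y0", "Y1", "Y2"))       // unmasked form
	fmt.Println(p.avxAdd("Z0", "Z1", "K1", "Z2")) // masked form
}
```

Call sites that invoke the function directly with three arguments keep compiling; only code that names the function type, as the struct field does, needs the update.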
Hi @mmcloughlin I've just created a branch pulling in this version of Avo here: segmentio/asm#59 There's a little bit of a chicken-or-egg situation as far as released versions go, but we should be fine pulling in this |
Oh wow @kalamay I just realized you're now depending on the So as of last night CI is green on #217. I'm working on adding an AVX-512 example or two, which is also a way for me to kick the tires. I have a long weekend to focus on it, so hoping to finally get it landed! |
Hah 🔥, well let me know if we need to tweak anything else or help move anything forward. Currently though, we only run |
@mmcloughlin Yes, I've been working extensively off my fork of the avx-512 branch (shared in PR #199). I've built code that's using Avo to generate ~150K lines of Golang asm. It's all integer (no FP) but otherwise is exercising a good amount of both AVX2 and AVX-512 (I'm generating codepaths for each). The AVX-512 code is exercising opmasks, broadcast and zerofill suffixes, scatter/gather operations, AVX512VL variants, etc. It was unit testing of Avo generated AVX-512 code that revealed the MacOS kernel bug (opmask clobbering) I recently identified (see: golang/go#49233). I'd be more than happy to port my code back to using your testing branch once PR #199 is merged, as I need those changes for my code to work. Please let me know if there's any other way I can help. |
all: AVX-512 Extends avo to support most AVX-512 instruction sets. The instruction type is extended to support suffixes. The K family of opmask registers is added to the register package, and the operand package is updated to support the new operand types. Move instruction deduction in `Load` and `Store` is extended to support KMOV* and VMOV* forms. Internal code generation packages were overhauled. Instruction database loading required various messy changes to account for the additional complexities of the AVX-512 instruction sets. The internal/api package was added to introduce a separation between instruction forms in the database, and the functions avo provides to create them. This was required since with instruction suffixes there is no longer a one-to-one mapping between instruction constructors and opcodes. AVX-512 bloated generated source code size substantially, initially increasing compilation and CI test times to an unacceptable level. Two changes were made to address this: 1. Instruction constructors in the `x86` package moved to an optab-based approach. This compiles substantially faster than the verbose code generation we had before. 2. The most verbose code-generated tests are moved under build tags and limited to a stress test mode. Stress test builds are run on schedule but not in regular CI. An example of AVX-512 accelerated 16-lane MD5 is provided to demonstrate and test the new functionality. Updates #20 #163 #229 Co-authored-by: Vaughn Iverson <vsivsi@yahoo.com>
Just landed #217! Thank you for your input and infinite patience while I worked on this. Excited to see what people build. Hope it works smoothly, and please file bugs if it doesn't. |
Congratulations on the release, and thanks so much for the amazing effort! |
Agreed, it's been quite a journey to get here. Can't wait to try it out. Congrats! |
king 😤 |
For complexity reasons AVX-512 was not initially considered. We should add support.
avo/internal/load/load.go
Lines 130 to 133 in 9fbb71b