crypto/sha256: add native SHA256 instruction implementation for AMD64 #50543

monocasa · 2022-01-10T22:00:20Z

The sha_ni sha256 instructions have been shown to provide an ~4x increase in hash rate on newer amd64 systems versus the avx2 implementation. Transliterating the Linux implementation shows an up to 3.79x increase in hash rate under benchmarks.

https://gist.github.com/monocasa/befeed9c8c5827417c0921e231703f2c

Q&A:

Q: Linux is GPL, is this OK to bring into golang?

A: It appears so. The Linux implementation file is dual BSD-3 clause/gpl. A look through the history of that file shows that all relevant changes have come from Intel employees, Intel has signed the CLA, and in fact the current AVX2 implementation submitted by Russ Cox is called out as arriving from the same source.

Q: Was the Assembly Policy followed?

A: I believe so, even to the point of leaving a measurable but ~1% perf increase on the table in exchange for simplifying constant table access and sharing that with the existing avx2 implementation.

gopherbot · 2022-01-10T22:03:47Z

Change https://golang.org/cl/377514 mentions this issue: crypto/sha256: Add support for shani instructions

mvdan · 2022-01-11T10:10:26Z

Roughly what percentage of consumer CPUs are expected to have these instructions? Following https://en.wikipedia.org/wiki/Intel_SHA_extensions, it seems like it should be most Intel since Ice Lake (2019) and AMD since Zen (2017), so it seems significantly newer than x86-64-v3, working on CPUs released back in 2015.

Not pushing back by any means, I just wonder how many users should expect to see an improvement e.g. once 1.19 releases in 8 months.

monocasa · 2022-01-11T16:21:35Z

@mvdan My specific impetus here is less about the consumer space. I probably should have noted this in the description, but the impetus for me is measurably improving the speed of content addressable storage using sha256, specifically container image generation and verification. So speeding up pretty much every container build and container registry access on x86_64. Because CI/CD pipelines for such tasks tend to run in datacenters (which are refreshed about on a 3 year cycle because of perf/watt concerns), and because developer local systems tend to be refreshed at a quicker pace than most consumer systems there should be more than a critical mass of users receiving benefits from this.

Because of the developer benefits, I'd personally love to see this in 1.19.

ianlancetaylor · 2022-01-12T17:10:46Z

CC @golang/security

rsc · 2022-01-19T18:56:54Z

Re licensing, the code needs to not introduce new notice requirements. What is the exact license on the code you are proposing to add? Also, if it's just an instruction we can use, it seems like we could write our own assembly and not worry about copying someone else's.

/cc @FiloSottile

rsc · 2022-01-19T18:59:21Z

Re licensing again, the other code in that file was contributed directly by Intel using Go's CLA in https://go-review.googlesource.com/22608, so we were not just "copying" the Linux version and therefore do not have to use that license. The right path might be to try to get Intel to contribute this one too.

rsc · 2022-01-19T19:03:39Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

monocasa · 2022-01-19T19:28:18Z

@rsc Re licensing comment 1: The original is dual licensed GPLv2 and a 3-Clause BSD which exactly matches golang's license text. An additional license clause on binaries shouldn't be necessary for golang to include this AFAIUI.

Re licensing comment 2: I misunderstood the git blame, so that's on me. I investigated that particular part of the problem before I had access to gerrit. What I will say is that the prologue and epilogue are different, so the inner loop is all that's taken from there, and there's essentially only one way, being dependent on hardware accel instructions which have implicit register arguments. That inner loop additionally appears to be documented by Intel as the only performance stable way for it to be implemented in their whitepaper. My guess is they want to use macro-op fusion on some of these ops in the future. I would hesitate going off of the beaten path there.

ianlancetaylor · 2022-01-19T19:33:24Z

@monocasa A direct copy of the code in https://github.com/torvalds/linux/blob/master/arch/x86/crypto/sha256_ni_asm.S would require an additional license. That code says, among other things, "Copyright(c) 2015 Intel Corporation." and "Redistributions of source code must retain the above copyright notice." It's true that the license has the same restrictions as the Go license, but the copyright is different, and as such the new license would need to be included by anybody who might be using the Go standard library with that code included. We're trying to avoid license proliferation for people who distribute Go code. Thanks.

(But copyright only protects individual expression, not ideas, so it's quite possible that the same sequence of instructions is not covered by copyright if that is the only way to write those instructions. But let's avoid that argument if we can.)

monocasa · 2022-01-19T20:23:29Z

@ianlancetaylor Ultimately I stand by the idea that the code here isn't ultimately derived in a way which would impede as simply the algorithm itself is taken (as stated in the comment), and there's sort of one performant way to do this, and it's what's documented in Intel's white papers as the correct use of the instructions.

That being said, I understand, and obviously defer to the golang maintainers here. I would ask that if this is determined to be a non starter, then at least I'd appreciate the chance to rewrite it (with probably a slight performance degradation, and an ongoing performance related technical debt for using a non recommended code sequence) rather than simply closing the proposal.

FiloSottile · 2022-01-20T13:05:14Z

I appreciate that the code is commented and fits in the existing structure, but the assembly policy is also about testing, and I don't see anything about making sure that both this code and the code it hides do the right thing. Do we even have test coverage on our builders for these instructions? Would we notice a regression if we changed test infrastructure?

Moreover, there are enough new-to-me instructions in here that I don't think I can do a useful review of the code itself. The easiest solution for that would be to get Intel to do a review while approving the submission under the CLA.

monocasa · 2022-01-20T18:24:42Z

@FiloSottile The functionality of this code is covered under existing sha256 tests. Additionally any bug in this code leaves you with a compiler that can't get to the unit tests since go internally uses sha256 to address intermediate objects it seems (the compiler is very unhappy when this code is wrong, ends up printing intermediate objects as strings, and lays waste to my tty requiring a shell reset). On top of that, my main concern over the existing test vectors was overflowing the DATA_PTR at a 32 bit boundary since golang's assembler doesn't make it as easy to track if you're using 32bit or 64bit registers as GAS does. I did this using AES to create a deterministic but pseudo random 8GB byte stream in a slice, and validated this new test vector against existing implementations both in golang, and externally via libsodium. I didn't include this as adding close to a minute of test time in the case of a lack of sha hardware instructions or avx2 support seemed gratuitous compared to the existing compiler build time. Is there a location for long relatively running tests like that to be submitted into the tree? I know I would feel better with that under more tests.

Beyond that, I attempted to follow the existing nuances WRT to commenting (particularly following the standard set in other crypto hardware accel asm functions) and making the code easy to follow, but I can certainly add to that. It's a little under commented for my general asm sensibilities, but I took a 'when in Rome' approach. I'd love to add anything that would help. Towards that end (and because I don't like the idea of golang accepting code they don't understand whether from Intel or from me), here's some of what I left off that I would normally have added (and have negative qualms with adding if desired):

General Notes: The core algorithm processes four rounds at a time, limited by the width of the xmm registers. Golang's assembler uses nonstandard mnemonics: MOVO is better known as movdqa, MOVOU is better known as movdqu. Because of golang's inability to specify 128bit aligned constants, MOVOU is used exclusively for transfers between xmm registers and memory, and MOVO is used for transfers between xmm registers. Like the other asm sha256 functions in golang, this code makes no attempt to handle the case of being passed an empty slice, or the padding concerns of sha256 proper to leave one with 512bit blocks. It does terminate under such conditions, but it also does give incorrect results, like the existing implementations.
Timing: The time complexity is O(N) N=number of blocks. This does not open timing attacks beyond side channel attacks present on any memory accessing function. This can be validated statically as there is only one branch it it clearly is dependent on remaining blocks.
Data flow: MSG is initially used as the current 128bit chunk of the 512bit block in rounds 0 through 15, then is used as sha256 h[0..3] in rounds 16 through 63. STATE0 and STATE1 are the in progress hash values known in sha256 as a, b, c, d, e, f, g, h. MSGTMP0 through MSGTMP3 more or less represent the message schedule array. MSGTMP4 is a clobberable short term temporary. ABEF_SAVE and CDGH_SAVE are the hash accumulator across blocks, and when unshuffled represent the final digest when complete.
Code flow: The inner loop is best thought of as a completely unrolled software pipelined loop, rotating across the four xmm registers MSGTMP0-MSGTEMP3. Sha256 rounds are symmetric, so why is this code so unsymmetric? Rounds 0-11 have the additional overhead of mixing the current message block into the message schedule. Rounds 12-15 mix the last four words of the message into the message, and additionally setup for the following rounds now that the entire message block has been read in. Rounds 16-59 are symmetric, and rotate through MSGTMP0-MSGTMP3 as their input. I initially added an unused temporary to the #define SHANI_ROUND_16_59. As an example rounds 16-19 would look like

SHANI_ROUND_16_59(4, MSGTMP0, MSGTMP1, MSGTEMP2, MSGTMP3)

followed by rounds 20-23

SHANI_ROUND_16_59(4, MSGTMP1, MSGTMP2, MSGTEMP3, MSGTMP0)

to make this rotation clearer but decided that unused variables to a #define was a larger sin. I welcome calls to change that back. Form there at rounds 52-59, we no longer have to message schedule for the next rounds, because that work has already been done in previous round, so the SHA256MSG1 can be left off (and should for correctness). Finally rounds 60-63 have significantly less work as the epilogue of the unrolled loop and can simply mix in the existing schedule, and perform the rounds, not having to do any additional work for non existent later rounds.

WRT the instructions themselves, this patch doesn't add them to the compiler. They've existed in golang for nearly four years, having been added to cmd/internal/obj/x86 by Intel when they added AVX-512 support. However if additional test coverage is required there, I'm more than happy to add it. Can you suggest an example to follow? As for "would we notice a regression", I imagine that Google has been using Intel chips manufactured in the past 3 years or any of the AMD Zen chips in it's test infrastructure. Because of how the new instructions are core to the round and message scheduling any miscompilation of the shani instructions would both cause the sha256 test vectors to fail, and leave you with a useless compiler. If a miscompilation of the accel instructions that didn't result in failures in the test vectors, this would represent most of the way to a viable collision attack on sha256 proper because of how each round message schedule is necessary to fully smear the entropy in a way to get a compliant hash.

Finally, I of course welcome a review from Intel, but lack knowledge as to Intel's current relationship with golang. Is making that a requirement a viable option for moving forward, or does that practically kill the proposal?

dbussink · 2022-01-24T09:15:26Z

I'd like to also mention that I opened #48720 a while ago which implements this and I've implemented that with specifically only the Intel reference documentation. I've also address already a few bits of feedback there but there's still a bunch of open questions.

monocasa · 2022-01-24T19:14:10Z

@dbussink Unfortunately, basing the code off of Intel reference documentation seems to be a more of a no go from a license perspective since the Intel whitepaper's license states

NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.

So either the decision is that there isn't enough here to be a true copyright claim and we're allowed take the underlying flow regardless, or the Linux side is safer from a license perspective already being 3-clause BSD from an entity in the AUTHORS file, or neither are safe and we truly require Intel's explicit signoff to use their instructions as described.

dbussink · 2022-01-24T19:21:09Z

@dbussink Unfortunately, basing the code off of Intel reference documentation seems to be a more of a no go from a license perspective since the Intel whitepaper's license states

Ah, where did you find this license? I don't see it on https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html but I might be overlooking something? Although it would seem very surprising that the documentation itself would not allow an implementation based on it licensing wise 😕.

monocasa · 2022-01-24T20:12:13Z

@dbussink The last page of https://www.intel.com/content/dam/develop/external/us/en/documents/intel-sha-extensions-white-paper.pdf which is the original source of that information in your blog post (and unfortunately the blog post doesn't grant another license).

I agree that it's incredibly goofy to give example code and then ban you from using it directly, but I'm not sure else how to interpret that.

ulikunitz · 2022-01-24T23:17:45Z

The statement says only that the paper doesn't grant any license to any intellectual property rights. The statement is unnecessary because the white paper doesn't grant any such rights without it, but lawyers like to add those statements to leave nothing in doubt.

You don't need a license to use machine instructions in your own code. The code sequences in the white paper are clearly designated as examples, so you can follow them in your own code.

However you cannot copy the whitepaper into any repository or copy Intel code from other open source projects into the Go source code. The Contributor License Agreement (CLA) requirement by Google prevents the latter even if the code would have the same license terms as the Go code.

monocasa · 2022-01-24T23:45:14Z

@ulikunitz

The statement says only that the paper doesn't grant any license to any intellectual property rights. The statement is unnecessary because the white paper doesn't grant any such rights without it, but lawyers like to add those statements to leave nothing in doubt.

You don't need a license to use machine instructions in your own code. The code sequences in the white paper are clearly designated as examples, so you can follow them in your own code.

Licensing isn't based on what makes sense, it's based on the license text. It very clearly says that you aren't granted a license to any IP in the document, which would include copyright over code sequences. Google v Oracle ended with the de minimis defense still squashed by the courts.

However you cannot copy the whitepaper into any repository or copy Intel code from other open source projects into the Go source code. The Contributor License Agreement (CLA) requirement by Google prevents the latter even if the code would have the same license terms as the Go code.

We're talking about the exact same code sequences in question both in the white paper and in the Linux kernel. Either it doesn't matter at all because they're both exactly the same code sequences, or neither are valid sources of these sequences, or the source with the same license explicitly designated as 3 clause BSD by an entity already in the AUTHORs file is a valid source of these sequences.

ulikunitz · 2022-01-25T08:05:42Z

@monocasa You are arguing here that you need an explicit license by Intel to use Intel(r) machine instructions for their intended purpose and as documented to implement a public standard. If that would be true, we would live in a very poor world.

At the end it is not for you and me to assess the legal risk here and whether to accept it.

monocasa · 2022-01-25T17:48:03Z

@ulikunitz > You are arguing here that you need an explicit license by Intel to use Intel(r) machine instructions for their intended purpose and as documented to implement a public standard. If that would be true, we would live in a very poor world.

The document doesn't just say "these are what these instructions do" like normal Intel documentation. It walks through a full fairly large code sequence on the order of a hundred lines or so, that is verbatim what is at question here, down to the variable names and indentation. It is the exact same code sequence in both cases. A legal text at the end granting no license to any IP is very clear in the US. Hence why I found an alternative with a much clearer license that appears more than compatible. That's how I knew about the lack of license being granted by that document in the first place.

rsc · 2022-01-26T18:05:41Z

For the Go project, the only way we are going to accept this code is to get Intel to agree to contribute it under our CLA, as they have done in the past. That's the path forward. Arguments about what the law is are beside the point.

rsc · 2022-01-26T18:06:19Z

@TocarIP, are you still at Intel and do you know who could agree to contribute the SHANI SHA256 code under CLA as we did for the other implementations? Thanks.

TocarIP · 2022-01-26T18:28:23Z

I'm no longer with Intel. @aregm and @intel-go in general should be able to help with licence issues. FWIW I've sent https://go-review.googlesource.com/c/go/+/135378 when I was at Intel.

aregm · 2022-01-26T20:22:35Z

@rsc let me ping the guys who are supposedly in charge.

monocasa · 2022-04-14T00:36:16Z

@jbrosenz Thanks, that's great to hear! I really appreciate Intel making this a priority.

tpaint · 2022-05-01T12:33:13Z

Status update - the first revision of our SHA-256 sha-ni patch (SHA256RNDS2, SHA256MSG1, SHA256MSG2) has been integrated with the current version of sha256block_amd64.s and is passing all unit tests. As soon as our internal review and approval cycle completes we will generate a pull request for review.

tpaint · 2022-05-25T20:45:34Z

sha-256 patch is approved and ready to submit. Proposed dependent internal/cpu support for sha-ni instructions in #53084.

tpaint · 2022-05-25T20:46:45Z

both change lists will be submitted next

gopherbot · 2022-05-26T00:45:29Z

Change https://go.dev/cl/408794 mentions this issue: internal/cpu: detect sha-ni instruction support for AMD64

tpaint · 2022-05-26T00:49:16Z

internal/cpu sha-ni: https://go-review.googlesource.com/c/go/+/408794/1
crypto/sha256: https://go-review.googlesource.com/c/go/+/408795/1

addresses proposal #53084 required by sha-256 change list developed for #50543 Change-Id: I5454d746fce069a7a4993d70dc5b0a5544f8eeaf Reviewed-on: https://go-review.googlesource.com/c/go/+/408794 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Run-TryBot: Keith Randall <khr@google.com>

rolandshoemaker · 2023-03-22T21:56:19Z

@rsc looks like there is now a CL from the Intel folks (thanks @tpaint), and the proposal process should start moving again. It'd be nice to get this into 1.21 if possible.

rsc · 2023-03-29T17:42:57Z

Given that it's a small amount of assembly and we've gotten it contributed directly from Intel under CLA now, this seems fine.

rsc · 2023-03-29T19:17:51Z

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

gopherbot · 2023-03-31T14:21:08Z

Change https://go.dev/cl/408795 mentions this issue: crypto/sha256: add sha-ni implementation

In go1.21 the go std has a new SHANI accelerated rountine (just like minio's sha256). See golang/go#50543. Reduce the code surface by dropping what is now a redondent librairy. Benchmarks using go1.21 and go1.20 shows no significant difference between minio and the std for sha256: ``` name old time/op new time/op delta SumAllLarge/sha2-256-12 8.63ms ± 3% 8.63ms ± 1% ~ (p=0.220 n=20+17) SumAllLarge/sha2-512-12 23.1ms ± 3% 23.0ms ± 3% ~ (p=0.361 n=18+20) SumAllLarge/dbl-sha2-256-12 36.0ms ± 4% 8.6ms ± 4% -76.16% (p=0.000 n=20+20) name old speed new speed delta SumAllLarge/sha2-256-12 1.94GB/s ± 4% 1.95GB/s ± 1% ~ (p=0.220 n=20+17) SumAllLarge/sha2-512-12 725MB/s ± 3% 729MB/s ± 3% ~ (p=0.350 n=18+20) SumAllLarge/dbl-sha2-256-12 467MB/s ± 4% 1958MB/s ± 4% +319.53% (p=0.000 n=20+20) ``` `dlb-sha2-256` doubled in speed because it was never wired to use minio's sha256 implementation and thus was bumped by `crypto/sha256`'s improvement. I think we should remove `github.com/multiformats/go-multihash/register/miniosha256` one go1.22 is released (~1 year) since it would be purposeless, we could also stub it to github.com/multiformats/go-multihash/register/sha256 for all versions but [no one imports it directly](https://pkg.go.dev/github.com/multiformats/go-multihash/register/miniosha256?tab=importedby).

In go1.21 the go std has a new SHANI accelerated rountine (just like minio's sha256). See golang/go#50543. Reduce the code surface by dropping what is now a redondent librairy. Benchmarks using go1.21 and go1.20 shows no significant difference between minio and the std for sha256: ``` name old time/op new time/op delta SumAllLarge/sha2-256-12 8.63ms ± 3% 8.63ms ± 1% ~ (p=0.220 n=20+17) SumAllLarge/sha2-512-12 23.1ms ± 3% 23.0ms ± 3% ~ (p=0.361 n=18+20) SumAllLarge/dbl-sha2-256-12 36.0ms ± 4% 8.6ms ± 4% -76.16% (p=0.000 n=20+20) name old speed new speed delta SumAllLarge/sha2-256-12 1.94GB/s ± 4% 1.95GB/s ± 1% ~ (p=0.220 n=20+17) SumAllLarge/sha2-512-12 725MB/s ± 3% 729MB/s ± 3% ~ (p=0.350 n=18+20) SumAllLarge/dbl-sha2-256-12 467MB/s ± 4% 1958MB/s ± 4% +319.53% (p=0.000 n=20+20) ``` `dlb-sha2-256` quadrupled in speed because it was never wired to use minio's sha256 implementation and thus was brought back up to speed by `crypto/sha256`'s improvement. I think we should remove `github.com/multiformats/go-multihash/register/miniosha256` one go1.22 is released (~1 year) since it would be purposeless, we could also stub it to github.com/multiformats/go-multihash/register/sha256 for all versions but [no one imports it directly](https://pkg.go.dev/github.com/multiformats/go-multihash/register/miniosha256?tab=importedby).

yahavi · 2024-02-21T16:06:22Z

@rsc Does the implementation include support for AVX512? Or just AVX2?

monocasa added the Proposal label Jan 10, 2022

ianlancetaylor added this to Incoming in Proposals (old) Jan 12, 2022

ianlancetaylor changed the title ~~crypto/sha256: add native SHA256 instruction implementation for AMD64~~ proposal: crypto/sha256: add native SHA256 instruction implementation for AMD64 Jan 12, 2022

gopherbot added this to the Proposal milestone Jan 12, 2022

ianlancetaylor added the Proposal-Crypto Proposal related to crypto packages or other security issues label Jan 12, 2022

rsc moved this from Incoming to Active in Proposals (old) Jan 19, 2022

tpaint mentioned this issue May 25, 2022

proposal: internal/cpu: add sha-ni detection for AMD64 #53084

Closed

rsc changed the title ~~proposal: crypto/sha256: add native SHA256 instruction implementation for AMD64~~ crypto/sha256: add native SHA256 instruction implementation for AMD64 Mar 29, 2023

rsc modified the milestones: Proposal, Backlog Mar 29, 2023

rsc added Proposal-Accepted and removed Proposal-Hold labels Mar 29, 2023

gopherbot closed this as completed in 1a64574 Mar 31, 2023

klauspost mentioned this issue Apr 18, 2023

Integrate with google/oss-fuzz minio/sha256-simd#70

Closed

Jorropo mentioned this issue May 24, 2023

sha256: drop minio in favor of crypto/sha256 for go1.21 and above multiformats/go-multihash#173

Merged

dmitshur modified the milestones: Backlog, Go1.21 Jun 4, 2023

chrisd8088 mentioned this issue Jun 19, 2024

Get Exception when using command after update to 2.45.1 git-for-windows/git#4996

Closed

1 task

yahavi mentioned this issue Jul 14, 2024

Support uploading files with Size filtering and AQL avoiding jfrog/jfrog-client-go#963

Merged

4 tasks

crypto/sha256: add native SHA256 instruction implementation for AMD64 #50543

crypto/sha256: add native SHA256 instruction implementation for AMD64 #50543

Comments

monocasa commented Jan 10, 2022 • edited Loading

gopherbot commented Jan 10, 2022

mvdan commented Jan 11, 2022

monocasa commented Jan 11, 2022 • edited Loading

ianlancetaylor commented Jan 12, 2022

rsc commented Jan 19, 2022

rsc commented Jan 19, 2022

rsc commented Jan 19, 2022

monocasa commented Jan 19, 2022

ianlancetaylor commented Jan 19, 2022

monocasa commented Jan 19, 2022 • edited Loading

FiloSottile commented Jan 20, 2022

monocasa commented Jan 20, 2022 • edited Loading

dbussink commented Jan 24, 2022 • edited Loading

monocasa commented Jan 24, 2022

dbussink commented Jan 24, 2022

monocasa commented Jan 24, 2022

ulikunitz commented Jan 24, 2022

monocasa commented Jan 24, 2022 • edited Loading

ulikunitz commented Jan 25, 2022

monocasa commented Jan 25, 2022

rsc commented Jan 26, 2022

rsc commented Jan 26, 2022

TocarIP commented Jan 26, 2022

aregm commented Jan 26, 2022

monocasa commented Apr 14, 2022

tpaint commented May 1, 2022

tpaint commented May 25, 2022

tpaint commented May 25, 2022

gopherbot commented May 26, 2022

tpaint commented May 26, 2022

rolandshoemaker commented Mar 22, 2023

rsc commented Mar 29, 2023

rsc commented Mar 29, 2023

gopherbot commented Mar 31, 2023

yahavi commented Feb 21, 2024

monocasa commented Jan 10, 2022 •

edited

Loading

monocasa commented Jan 11, 2022 •

edited

Loading

monocasa commented Jan 19, 2022 •

edited

Loading

monocasa commented Jan 20, 2022 •

edited

Loading

dbussink commented Jan 24, 2022 •

edited

Loading

monocasa commented Jan 24, 2022 •

edited

Loading