Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: crypto/sha256: add native SHA256 instruction implementation for AMD64 #50543

Open
monocasa opened this issue Jan 10, 2022 · 38 comments
Open
Labels
Proposal Proposal-Crypto Proposal-Hold
Projects
Milestone

Comments

@monocasa
Copy link

@monocasa monocasa commented Jan 10, 2022

The sha_ni sha256 instructions have been shown to provide an ~4x increase in hash rate on newer amd64 systems versus the avx2 implementation. Transliterating the Linux implementation shows an up to 3.79x increase in hash rate under benchmarks.

https://gist.github.com/monocasa/befeed9c8c5827417c0921e231703f2c

Q&A:

Q: Linux is GPL, is this OK to bring into golang?

A: It appears so. The Linux implementation file is dual BSD-3 clause/gpl. A look through the history of that file shows that all relevant changes have come from Intel employees, Intel has signed the CLA, and in fact the current AVX2 implementation submitted by Russ Cox is called out as arriving from the same source.

Q: Was the Assembly Policy followed?

A: I believe so, even to the point of leaving a measurable but ~1% perf increase on the table in exchange for simplifying constant table access and sharing that with the existing avx2 implementation.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 10, 2022

Change https://golang.org/cl/377514 mentions this issue: crypto/sha256: Add support for shani instructions

@mvdan
Copy link
Member

@mvdan mvdan commented Jan 11, 2022

Roughly what percentage of consumer CPUs are expected to have these instructions? Following https://en.wikipedia.org/wiki/Intel_SHA_extensions, it seems like it should be most Intel since Ice Lake (2019) and AMD since Zen (2017), so it seems significantly newer than x86-64-v3, working on CPUs released back in 2015.

Not pushing back by any means, I just wonder how many users should expect to see an improvement e.g. once 1.19 releases in 8 months.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 11, 2022

@mvdan My specific impetus here is less about the consumer space. I probably should have noted this in the description, but the impetus for me is measurably improving the speed of content addressable storage using sha256, specifically container image generation and verification. So speeding up pretty much every container build and container registry access on x86_64. Because CI/CD pipelines for such tasks tend to run in datacenters (which are refreshed about on a 3 year cycle because of perf/watt concerns), and because developer local systems tend to be refreshed at a quicker pace than most consumer systems there should be more than a critical mass of users receiving benefits from this.

Because of the developer benefits, I'd personally love to see this in 1.19.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jan 12, 2022
@ianlancetaylor ianlancetaylor changed the title crypto/sha256: add native SHA256 instruction implementation for AMD64 proposal: crypto/sha256: add native SHA256 instruction implementation for AMD64 Jan 12, 2022
@gopherbot gopherbot added this to the Proposal milestone Jan 12, 2022
@ianlancetaylor ianlancetaylor added the Proposal-Crypto label Jan 12, 2022
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 12, 2022

@rsc
Copy link
Contributor

@rsc rsc commented Jan 19, 2022

Re licensing, the code needs to not introduce new notice requirements. What is the exact license on the code you are proposing to add? Also, if it's just an instruction we can use, it seems like we could write our own assembly and not worry about copying someone else's.

/cc @FiloSottile

@rsc
Copy link
Contributor

@rsc rsc commented Jan 19, 2022

Re licensing again, the other code in that file was contributed directly by Intel using Go's CLA in https://go-review.googlesource.com/22608, so we were not just "copying" the Linux version and therefore do not have to use that license. The right path might be to try to get Intel to contribute this one too.

@rsc
Copy link
Contributor

@rsc rsc commented Jan 19, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Jan 19, 2022
@monocasa
Copy link
Author

@monocasa monocasa commented Jan 19, 2022

@rsc Re licensing comment 1: The original is dual licensed GPLv2 and a 3-Clause BSD which exactly matches golang's license text. An additional license clause on binaries shouldn't be necessary for golang to include this AFAIUI.

Re licensing comment 2: I misunderstood the git blame, so that's on me. I investigated that particular part of the problem before I had access to gerrit. What I will say is that the prologue and epilogue are different, so the inner loop is all that's taken from there, and there's essentially only one way, being dependent on hardware accel instructions which have implicit register arguments. That inner loop additionally appears to be documented by Intel as the only performance stable way for it to be implemented in their whitepaper. My guess is they want to use macro-op fusion on some of these ops in the future. I would hesitate going off of the beaten path there.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 19, 2022

@monocasa A direct copy of the code in https://github.com/torvalds/linux/blob/master/arch/x86/crypto/sha256_ni_asm.S would require an additional license. That code says, among other things, "Copyright(c) 2015 Intel Corporation." and "Redistributions of source code must retain the above copyright notice." It's true that the license has the same restrictions as the Go license, but the copyright is different, and as such the new license would need to be included by anybody who might be using the Go standard library with that code included. We're trying to avoid license proliferation for people who distribute Go code. Thanks.

(But copyright only protects individual expression, not ideas, so it's quite possible that the same sequence of instructions is not covered by copyright if that is the only way to write those instructions. But let's avoid that argument if we can.)

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 19, 2022

@ianlancetaylor Ultimately I stand by the idea that the code here isn't ultimately derived in a way which would impede as simply the algorithm itself is taken (as stated in the comment), and there's sort of one performant way to do this, and it's what's documented in Intel's white papers as the correct use of the instructions.

That being said, I understand, and obviously defer to the golang maintainers here. I would ask that if this is determined to be a non starter, then at least I'd appreciate the chance to rewrite it (with probably a slight performance degradation, and an ongoing performance related technical debt for using a non recommended code sequence) rather than simply closing the proposal.

@FiloSottile
Copy link
Contributor

@FiloSottile FiloSottile commented Jan 20, 2022

I appreciate that the code is commented and fits in the existing structure, but the assembly policy is also about testing, and I don't see anything about making sure that both this code and the code it hides do the right thing. Do we even have test coverage on our builders for these instructions? Would we notice a regression if we changed test infrastructure?

Moreover, there are enough new-to-me instructions in here that I don't think I can do a useful review of the code itself. The easiest solution for that would be to get Intel to do a review while approving the submission under the CLA.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 20, 2022

@FiloSottile The functionality of this code is covered under existing sha256 tests. Additionally any bug in this code leaves you with a compiler that can't get to the unit tests since go internally uses sha256 to address intermediate objects it seems (the compiler is very unhappy when this code is wrong, ends up printing intermediate objects as strings, and lays waste to my tty requiring a shell reset). On top of that, my main concern over the existing test vectors was overflowing the DATA_PTR at a 32 bit boundary since golang's assembler doesn't make it as easy to track if you're using 32bit or 64bit registers as GAS does. I did this using AES to create a deterministic but pseudo random 8GB byte stream in a slice, and validated this new test vector against existing implementations both in golang, and externally via libsodium. I didn't include this as adding close to a minute of test time in the case of a lack of sha hardware instructions or avx2 support seemed gratuitous compared to the existing compiler build time. Is there a location for long relatively running tests like that to be submitted into the tree? I know I would feel better with that under more tests.

Beyond that, I attempted to follow the existing nuances WRT to commenting (particularly following the standard set in other crypto hardware accel asm functions) and making the code easy to follow, but I can certainly add to that. It's a little under commented for my general asm sensibilities, but I took a 'when in Rome' approach. I'd love to add anything that would help. Towards that end (and because I don't like the idea of golang accepting code they don't understand whether from Intel or from me), here's some of what I left off that I would normally have added (and have negative qualms with adding if desired):

  1. General Notes: The core algorithm processes four rounds at a time, limited by the width of the xmm registers. Golang's assembler uses nonstandard mnemonics: MOVO is better known as movdqa, MOVOU is better known as movdqu. Because of golang's inability to specify 128bit aligned constants, MOVOU is used exclusively for transfers between xmm registers and memory, and MOVO is used for transfers between xmm registers. Like the other asm sha256 functions in golang, this code makes no attempt to handle the case of being passed an empty slice, or the padding concerns of sha256 proper to leave one with 512bit blocks. It does terminate under such conditions, but it also does give incorrect results, like the existing implementations.

  2. Timing: The time complexity is O(N) N=number of blocks. This does not open timing attacks beyond side channel attacks present on any memory accessing function. This can be validated statically as there is only one branch it it clearly is dependent on remaining blocks.

  3. Data flow: MSG is initially used as the current 128bit chunk of the 512bit block in rounds 0 through 15, then is used as sha256 h[0..3] in rounds 16 through 63. STATE0 and STATE1 are the in progress hash values known in sha256 as a, b, c, d, e, f, g, h. MSGTMP0 through MSGTMP3 more or less represent the message schedule array. MSGTMP4 is a clobberable short term temporary. ABEF_SAVE and CDGH_SAVE are the hash accumulator across blocks, and when unshuffled represent the final digest when complete.

  4. Code flow: The inner loop is best thought of as a completely unrolled software pipelined loop, rotating across the four xmm registers MSGTMP0-MSGTEMP3. Sha256 rounds are symmetric, so why is this code so unsymmetric? Rounds 0-11 have the additional overhead of mixing the current message block into the message schedule. Rounds 12-15 mix the last four words of the message into the message, and additionally setup for the following rounds now that the entire message block has been read in. Rounds 16-59 are symmetric, and rotate through MSGTMP0-MSGTMP3 as their input. I initially added an unused temporary to the #define SHANI_ROUND_16_59. As an example rounds 16-19 would look like

    SHANI_ROUND_16_59(4, MSGTMP0, MSGTMP1, MSGTEMP2, MSGTMP3)

followed by rounds 20-23

SHANI_ROUND_16_59(4, MSGTMP1, MSGTMP2, MSGTEMP3, MSGTMP0)

to make this rotation clearer but decided that unused variables to a #define was a larger sin. I welcome calls to change that back. Form there at rounds 52-59, we no longer have to message schedule for the next rounds, because that work has already been done in previous round, so the SHA256MSG1 can be left off (and should for correctness). Finally rounds 60-63 have significantly less work as the epilogue of the unrolled loop and can simply mix in the existing schedule, and perform the rounds, not having to do any additional work for non existent later rounds.

WRT the instructions themselves, this patch doesn't add them to the compiler. They've existed in golang for nearly four years, having been added to cmd/internal/obj/x86 by Intel when they added AVX-512 support. However if additional test coverage is required there, I'm more than happy to add it. Can you suggest an example to follow? As for "would we notice a regression", I imagine that Google has been using Intel chips manufactured in the past 3 years or any of the AMD Zen chips in it's test infrastructure. Because of how the new instructions are core to the round and message scheduling any miscompilation of the shani instructions would both cause the sha256 test vectors to fail, and leave you with a useless compiler. If a miscompilation of the accel instructions that didn't result in failures in the test vectors, this would represent most of the way to a viable collision attack on sha256 proper because of how each round message schedule is necessary to fully smear the entropy in a way to get a compliant hash.

Finally, I of course welcome a review from Intel, but lack knowledge as to Intel's current relationship with golang. Is making that a requirement a viable option for moving forward, or does that practically kill the proposal?

@dbussink
Copy link

@dbussink dbussink commented Jan 24, 2022

I'd like to also mention that I opened #48720 a while ago which implements this and I've implemented that with specifically only the Intel reference documentation. I've also address already a few bits of feedback there but there's still a bunch of open questions.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 24, 2022

@dbussink Unfortunately, basing the code off of Intel reference documentation seems to be a more of a no go from a license perspective since the Intel whitepaper's license states

NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.

So either the decision is that there isn't enough here to be a true copyright claim and we're allowed take the underlying flow regardless, or the Linux side is safer from a license perspective already being 3-clause BSD from an entity in the AUTHORS file, or neither are safe and we truly require Intel's explicit signoff to use their instructions as described.

@dbussink
Copy link

@dbussink dbussink commented Jan 24, 2022

@dbussink Unfortunately, basing the code off of Intel reference documentation seems to be a more of a no go from a license perspective since the Intel whitepaper's license states

Ah, where did you find this license? I don't see it on https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html but I might be overlooking something? Although it would seem very surprising that the documentation itself would not allow an implementation based on it licensing wise 😕.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 24, 2022

@dbussink The last page of https://www.intel.com/content/dam/develop/external/us/en/documents/intel-sha-extensions-white-paper.pdf which is the original source of that information in your blog post (and unfortunately the blog post doesn't grant another license).

I agree that it's incredibly goofy to give example code and then ban you from using it directly, but I'm not sure else how to interpret that.

@ulikunitz
Copy link
Contributor

@ulikunitz ulikunitz commented Jan 24, 2022

The statement says only that the paper doesn't grant any license to any intellectual property rights. The statement is unnecessary because the white paper doesn't grant any such rights without it, but lawyers like to add those statements to leave nothing in doubt.

You don't need a license to use machine instructions in your own code. The code sequences in the white paper are clearly designated as examples, so you can follow them in your own code.

However you cannot copy the whitepaper into any repository or copy Intel code from other open source projects into the Go source code. The Contributor License Agreement (CLA) requirement by Google prevents the latter even if the code would have the same license terms as the Go code.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 24, 2022

@ulikunitz

The statement says only that the paper doesn't grant any license to any intellectual property rights. The statement is unnecessary because the white paper doesn't grant any such rights without it, but lawyers like to add those statements to leave nothing in doubt.

You don't need a license to use machine instructions in your own code. The code sequences in the white paper are clearly designated as examples, so you can follow them in your own code.

Licensing isn't based on what makes sense, it's based on the license text. It very clearly says that you aren't granted a license to any IP in the document, which would include copyright over code sequences. Google v Oracle ended with the de minimis defense still squashed by the courts.

However you cannot copy the whitepaper into any repository or copy Intel code from other open source projects into the Go source code. The Contributor License Agreement (CLA) requirement by Google prevents the latter even if the code would have the same license terms as the Go code.

We're talking about the exact same code sequences in question both in the white paper and in the Linux kernel. Either it doesn't matter at all because they're both exactly the same code sequences, or neither are valid sources of these sequences, or the source with the same license explicitly designated as 3 clause BSD by an entity already in the AUTHORs file is a valid source of these sequences.

@ulikunitz
Copy link
Contributor

@ulikunitz ulikunitz commented Jan 25, 2022

@monocasa You are arguing here that you need an explicit license by Intel to use Intel(r) machine instructions for their intended purpose and as documented to implement a public standard. If that would be true, we would live in a very poor world.

At the end it is not for you and me to assess the legal risk here and whether to accept it.

@monocasa
Copy link
Author

@monocasa monocasa commented Jan 25, 2022

@ulikunitz > You are arguing here that you need an explicit license by Intel to use Intel(r) machine instructions for their intended purpose and as documented to implement a public standard. If that would be true, we would live in a very poor world.

The document doesn't just say "these are what these instructions do" like normal Intel documentation. It walks through a full fairly large code sequence on the order of a hundred lines or so, that is verbatim what is at question here, down to the variable names and indentation. It is the exact same code sequence in both cases. A legal text at the end granting no license to any IP is very clear in the US. Hence why I found an alternative with a much clearer license that appears more than compatible. That's how I knew about the lack of license being granted by that document in the first place.

@rsc
Copy link
Contributor

@rsc rsc commented Jan 26, 2022

For the Go project, the only way we are going to accept this code is to get Intel to agree to contribute it under our CLA, as they have done in the past. That's the path forward. Arguments about what the law is are beside the point.

@rsc
Copy link
Contributor

@rsc rsc commented Jan 26, 2022

@TocarIP, are you still at Intel and do you know who could agree to contribute the SHANI SHA256 code under CLA as we did for the other implementations? Thanks.

@TocarIP
Copy link
Contributor

@TocarIP TocarIP commented Jan 26, 2022

I'm no longer with Intel. @aregm and @intel-go in general should be able to help with licence issues. FWIW I've sent https://go-review.googlesource.com/c/go/+/135378 when I was at Intel.

@aregm
Copy link

@aregm aregm commented Jan 26, 2022

@rsc let me ping the guys who are supposedly in charge.

@jbrosenz
Copy link

@jbrosenz jbrosenz commented Jan 26, 2022

Intel Go guy here, received the ping from Areg. I'll do some digging on my end to help drive resolution. Thanks for bringing the issue to my attention.

@aregm
Copy link

@aregm aregm commented Jan 26, 2022

@jbrosenz can you please also contribute the code (if needed) that monocasa is mentioning from the internal repo under the CLA so it will be a clean contribution and no copy/paste licensing issues will occur?

@tpaint
Copy link

@tpaint tpaint commented Jan 27, 2022

@rsc we (Intel) are working on an optimized SHA256 implementation now that could be contributed under the CLA.

@rsc
Copy link
Contributor

@rsc rsc commented Feb 2, 2022

@tpaint, thank you very much!

@rsc
Copy link
Contributor

@rsc rsc commented Feb 23, 2022

Putting on hold until Intel contributes a CL.

@rsc
Copy link
Contributor

@rsc rsc commented Feb 23, 2022

Placed on hold.
— rsc for the proposal review group

@howardjohn
Copy link

@howardjohn howardjohn commented Apr 8, 2022

Intel folks, is there any timeline on when this will be submitted?

Not trying to be an entitled demand, but understanding if this will land soon will help determine whether a number of projects will adopt third party solutions (https://github.com/minio/sha256-simd#drop-in-replacement) or wait for this to be merged.

Many container operations are, surprisingly, sha256 bound, so that ecosystem has the opportunity for dramatic performance improvements which are currently on hold until the timeline for this is understood

@jbrosenz
Copy link

@jbrosenz jbrosenz commented Apr 12, 2022

Attending to this is next in our queue, which all things being equal, puts it on the order of weeks, not months away. @tpaint will update here.

@monocasa
Copy link
Author

@monocasa monocasa commented Apr 14, 2022

@jbrosenz Thanks, that's great to hear! I really appreciate Intel making this a priority.

@tpaint
Copy link

@tpaint tpaint commented May 1, 2022

Status update - the first revision of our SHA-256 sha-ni patch (SHA256RNDS2, SHA256MSG1, SHA256MSG2) has been integrated with the current version of sha256block_amd64.s and is passing all unit tests. As soon as our internal review and approval cycle completes we will generate a pull request for review.

@tpaint
Copy link

@tpaint tpaint commented May 25, 2022

sha-256 patch is approved and ready to submit. Proposed dependent internal/cpu support for sha-ni instructions in #53084.

@tpaint
Copy link

@tpaint tpaint commented May 25, 2022

both change lists will be submitted next

@gopherbot
Copy link

@gopherbot gopherbot commented May 26, 2022

Change https://go.dev/cl/408794 mentions this issue: internal/cpu: detect sha-ni instruction support for AMD64

@tpaint
Copy link

@tpaint tpaint commented May 26, 2022

internal/cpu sha-ni: https://go-review.googlesource.com/c/go/+/408794/1
crypto/sha256: https://go-review.googlesource.com/c/go/+/408795/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Proposal Proposal-Crypto Proposal-Hold
Projects
Development

No branches or pull requests