LLVM's SHA256 is very inefficient #56121

nico · 2022-06-19T14:05:02Z

I was able to make ld64.lld 20% (!) faster by tweaking how it computes SHA256.

The current SHA256 API requires putting the 32-byte result on the stack and the memcpy()ing it to the right place. Tweaking the API to have an outparam would save this.
The SHA256 internals aren't very optimized. Using e.g. CommonCrypto/CommonDigest.h on macOS is faster. (17% speedup)
(We don't parallelize the SHA256 implementation. Doing that helps a bit more (20%)

https://github.com/nico/llvm-project/commits/hash has details, proof-of-concept level.

I used the repro file at https://drive.google.com/file/d/1wWCeDWQ3OAyVwadyCdFZ0WXNB11ADnBK/view?usp=sharing as benchmark case.

The text was updated successfully, but these errors were encountered:

llvmbot · 2022-06-19T14:05:20Z

@llvm/issue-subscribers-lld-macho

nico · 2022-06-20T13:12:48Z

As for what to do about this: Rather than calling out to system libraries (defensible on macOS, but weird on other unixen), we should probably try to make our sha256 impl less slow.

There are SIMD instructions for modern x86 chips: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html (using the intrins)

There are also SIMD instructions for armv8+crypto (which M1 chips support): vsha256hq_u32, vsha256h2q_u32, vsha256su1q_u32.

Looks like there's public-domain code using these at https://sources.debian.org/src/libcrypto++/8.4.0-1/sha_simd.cpp/ which can provide inspiration for replacing SHA256::hashBlock() with a SIMD impl (…probably using something like https://llvm.org/devmtg/2014-10/Slides/Christopher-Function%20Multiversioning%20Talk.pdf ).

(Alternatively, llvm/lib/Support/BLAKE3/ can provide inspiration for going with asm instead of intrinsics – but intrinsics are probably easier to work with.)

nico · 2022-06-21T18:58:48Z

858e8b1 did the cop-out for now.

But still slow when e.g. linking a mac/arm binary on a linux system.

nico · 2022-06-22T13:23:35Z

0baf13e did the parallelization. (Commit message has some metrics.)

After these changes, things are fairly fast on macOS. Linking an arm64 mac binary on different hosts is still 2.55% slower than it needs to be; the fix for that is to make LLVM's SHA256 implementation faster as described in #56121 (comment)

davidben · 2022-07-12T13:50:33Z

One thing to note is the Intel SHA extensions aren't all that common right now. As I understand it, they were originally only on low power chips. (Though AMD Zen has it now, and I think the newest Intel ones might too?) OpenSSL and friends also have SHA-256 implementations using more general purpose SIMD that kick in more often in x86 today.

SHA-256 acceleration on Arm is common though. It's part of the original cryptography extension from Armv8, so most Armv8 chips have it.

nico added the lld:MachO label Jun 19, 2022

EugeneZelenko added the performance label Jun 19, 2022

nico mentioned this issue Jul 12, 2022

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLVM's SHA256 is very inefficient #56121

LLVM's SHA256 is very inefficient #56121

nico commented Jun 19, 2022

llvmbot commented Jun 19, 2022

nico commented Jun 20, 2022

nico commented Jun 21, 2022

nico commented Jun 22, 2022

davidben commented Jul 12, 2022

LLVM's SHA256 is very inefficient #56121

LLVM's SHA256 is very inefficient #56121

Comments

nico commented Jun 19, 2022

llvmbot commented Jun 19, 2022

nico commented Jun 20, 2022

nico commented Jun 21, 2022

nico commented Jun 22, 2022

davidben commented Jul 12, 2022