Basic chore cleanups of tests and benchmarks #66

ribasushi · 2023-01-28T23:57:17Z

No functional changes whatsoever, only renaming of some internal functions and shoring up the benchmark executor.

Interesting results on Go 1.19 /cc @klauspost

AWS Graviton Neoverse-N1

BenchmarkHash/Generic/8Bytes-48         	 2081868	       576.1 ns/op	  13.89 MB/s
BenchmarkHash/Generic/64Bytes-48        	 1000000	      1123 ns/op	  56.98 MB/s
BenchmarkHash/Generic/1K-48             	  132132	      9094 ns/op	 112.60 MB/s
BenchmarkHash/Generic/8K-48             	   17422	     68553 ns/op	 119.50 MB/s
BenchmarkHash/Generic/1M-48             	     136	   8727612 ns/op	 120.14 MB/s
BenchmarkHash/Generic/5M-48             	      26	  43630193 ns/op	 120.17 MB/s
BenchmarkHash/Generic/10M-48            	      13	  87234265 ns/op	 120.20 MB/s
BenchmarkHash/ArmSha2/8Bytes-48         	12504379	        95.34 ns/op	  83.91 MB/s
BenchmarkHash/ArmSha2/64Bytes-48        	 7602079	       157.2 ns/op	 407.24 MB/s
BenchmarkHash/ArmSha2/1K-48             	 1556676	       770.0 ns/op	1329.81 MB/s
BenchmarkHash/ArmSha2/8K-48             	  224259	      5350 ns/op	1531.33 MB/s
BenchmarkHash/ArmSha2/1M-48             	    1788	    670851 ns/op	1563.05 MB/s
BenchmarkHash/ArmSha2/5M-48             	     350	   3358073 ns/op	1561.28 MB/s
BenchmarkHash/ArmSha2/10M-48            	     177	   6707201 ns/op	1563.36 MB/s
BenchmarkHash/GoStdlib/8Bytes-48        	11228222	       106.3 ns/op	  75.29 MB/s
BenchmarkHash/GoStdlib/64Bytes-48       	 7964581	       150.9 ns/op	 424.12 MB/s
BenchmarkHash/GoStdlib/1K-48            	 1566817	       765.9 ns/op	1336.90 MB/s
BenchmarkHash/GoStdlib/8K-48            	  224491	      5345 ns/op	1532.70 MB/s
BenchmarkHash/GoStdlib/1M-48            	    1788	    670828 ns/op	1563.11 MB/s
BenchmarkHash/GoStdlib/5M-48            	     356	   3354074 ns/op	1563.14 MB/s
BenchmarkHash/GoStdlib/10M-48           	     177	   6710705 ns/op	1562.54 MB/s

Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

BenchmarkHash/Generic/8Bytes-48          	 1000000	      1383 ns/op	   5.78 MB/s
BenchmarkHash/Generic/64Bytes-48         	  363115	      2934 ns/op	  21.81 MB/s
BenchmarkHash/Generic/1K-48              	   51576	     23917 ns/op	  42.81 MB/s
BenchmarkHash/Generic/8K-48              	    5802	    184658 ns/op	  44.36 MB/s
BenchmarkHash/Generic/1M-48              	      44	  24440770 ns/op	  42.90 MB/s
BenchmarkHash/Generic/5M-48              	      12	 102287809 ns/op	  51.26 MB/s
BenchmarkHash/Generic/10M-48             	       7	 151643552 ns/op	  69.15 MB/s
BenchmarkHash/GoStdlib/8Bytes-48         	 1775823	       684.9 ns/op	  11.68 MB/s
BenchmarkHash/GoStdlib/64Bytes-48        	  906128	      1203 ns/op	  53.21 MB/s
BenchmarkHash/GoStdlib/1K-48             	  133762	      8618 ns/op	 118.82 MB/s
BenchmarkHash/GoStdlib/8K-48             	   17229	     63527 ns/op	 128.95 MB/s
BenchmarkHash/GoStdlib/1M-48             	     136	   9105973 ns/op	 115.15 MB/s
BenchmarkHash/GoStdlib/5M-48             	      48	  25859536 ns/op	 202.74 MB/s
BenchmarkHash/GoStdlib/10M-48            	      26	  48503560 ns/op	 216.19 MB/s
BenchmarkAvx512_05M-48                   	      13	 111209048 ns/op	  75.43 MB/s
BenchmarkAvx512_1M-48                    	       6	 220232186 ns/op	  76.18 MB/s
BenchmarkAvx512_5M-48                    	       2	 922569076 ns/op	  90.93 MB/s
BenchmarkAvx512_10M-48                   	       1	1323279994 ns/op	 126.79 MB/s
BenchmarkAvx512_5M_2Cores-48             	       2	 549461240 ns/op	 305.34 MB/s
BenchmarkAvx512_5M_4Cores-48             	       2	 645792065 ns/op	 519.59 MB/s
BenchmarkAvx512_5M_6Cores-48             	       1	1132375978 ns/op	 444.48 MB/s

AMD Ryzen 7 3700X 8-Core Processor

BenchmarkHash/Generic/8Bytes-16          	 3536659	       341.3 ns/op	  23.44 MB/s
BenchmarkHash/Generic/64Bytes-16         	 1775308	       660.1 ns/op	  96.95 MB/s
BenchmarkHash/Generic/1K-16              	  219138	      5403 ns/op	 189.52 MB/s
BenchmarkHash/Generic/8K-16              	   28696	     40512 ns/op	 202.21 MB/s
BenchmarkHash/Generic/1M-16              	     216	   5214497 ns/op	 201.09 MB/s
BenchmarkHash/Generic/5M-16              	      44	  25382559 ns/op	 206.55 MB/s
BenchmarkHash/Generic/10M-16             	      22	  51851523 ns/op	 202.23 MB/s
BenchmarkHash/IntelSHA/8Bytes-16         	19744753	        59.74 ns/op	 133.91 MB/s
BenchmarkHash/IntelSHA/64Bytes-16        	12684049	        93.82 ns/op	 682.17 MB/s
BenchmarkHash/IntelSHA/1K-16             	 2163363	       531.0 ns/op	1928.27 MB/s
BenchmarkHash/IntelSHA/8K-16             	  300894	      3932 ns/op	2083.20 MB/s
BenchmarkHash/IntelSHA/1M-16             	    2388	    495169 ns/op	2117.61 MB/s
BenchmarkHash/IntelSHA/5M-16             	     436	   2486044 ns/op	2108.92 MB/s
BenchmarkHash/IntelSHA/10M-16            	     235	   4984128 ns/op	2103.83 MB/s
BenchmarkHash/GoStdlib/8Bytes-16         	 7553770	       166.8 ns/op	  47.96 MB/s
BenchmarkHash/GoStdlib/64Bytes-16        	 3785841	       292.5 ns/op	 218.83 MB/s
BenchmarkHash/GoStdlib/1K-16             	  545599	      2128 ns/op	 481.19 MB/s
BenchmarkHash/GoStdlib/8K-16             	   77860	     15812 ns/op	 518.09 MB/s
BenchmarkHash/GoStdlib/1M-16             	     534	   2026093 ns/op	 517.54 MB/s
BenchmarkHash/GoStdlib/5M-16             	     122	  10069437 ns/op	 520.67 MB/s
BenchmarkHash/GoStdlib/10M-16            	      57	  20029738 ns/op	 523.51 MB/s

Macbook 2015 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

BenchmarkHash/Generic/8Bytes-8         	 2214006	       541.3 ns/op	  14.78 MB/s
BenchmarkHash/Generic/64Bytes-8        	 1000000	      1058 ns/op	  60.47 MB/s
BenchmarkHash/Generic/1K-8             	  137259	      8761 ns/op	 116.88 MB/s
BenchmarkHash/Generic/8K-8             	   17540	     65134 ns/op	 125.77 MB/s
BenchmarkHash/Generic/1M-8             	     144	   8301478 ns/op	 126.31 MB/s
BenchmarkHash/Generic/5M-8             	      27	  41688594 ns/op	 125.76 MB/s
BenchmarkHash/Generic/10M-8            	      13	  83150422 ns/op	 126.11 MB/s
BenchmarkHash/GoStdlib/8Bytes-8        	 4718106	       254.9 ns/op	  31.39 MB/s
BenchmarkHash/GoStdlib/64Bytes-8       	 2546954	       466.6 ns/op	 137.16 MB/s
BenchmarkHash/GoStdlib/1K-8            	  374762	      3170 ns/op	 322.98 MB/s
BenchmarkHash/GoStdlib/8K-8            	   50767	     23429 ns/op	 349.66 MB/s
BenchmarkHash/GoStdlib/1M-8            	     402	   2969511 ns/op	 353.11 MB/s
BenchmarkHash/GoStdlib/5M-8            	      79	  15015291 ns/op	 349.17 MB/s
BenchmarkHash/GoStdlib/10M-8           	      39	  30022449 ns/op	 349.26 MB/s

Macbook M1 Pro 2022

BenchmarkHash/Generic/8Bytes-10         	 3660108	       310.0 ns/op	  25.80 MB/s
BenchmarkHash/Generic/64Bytes-10        	 1974002	       607.5 ns/op	 105.35 MB/s
BenchmarkHash/Generic/1K-10             	  238525	      5026 ns/op	 203.75 MB/s
BenchmarkHash/Generic/8K-10             	   31683	     38009 ns/op	 215.53 MB/s
BenchmarkHash/Generic/1M-10             	     247	   4840271 ns/op	 216.64 MB/s
BenchmarkHash/Generic/5M-10             	      48	  24228139 ns/op	 216.40 MB/s
BenchmarkHash/Generic/10M-10            	      24	  48486415 ns/op	 216.26 MB/s
BenchmarkHash/ArmSha2/8Bytes-10         	23491410	        51.09 ns/op	 156.59 MB/s
BenchmarkHash/ArmSha2/64Bytes-10        	15159856	        78.47 ns/op	 815.56 MB/s
BenchmarkHash/ArmSha2/1K-10             	 2506291	       478.2 ns/op	2141.24 MB/s
BenchmarkHash/ArmSha2/8K-10             	  342472	      3463 ns/op	2365.80 MB/s
BenchmarkHash/ArmSha2/1M-10             	    2692	    436449 ns/op	2402.52 MB/s
BenchmarkHash/ArmSha2/5M-10             	     546	   2182072 ns/op	2402.71 MB/s
BenchmarkHash/ArmSha2/10M-10            	     272	   4379574 ns/op	2394.24 MB/s
BenchmarkHash/GoStdlib/8Bytes-10        	22089243	        54.34 ns/op	 147.22 MB/s
BenchmarkHash/GoStdlib/64Bytes-10       	16055944	        74.21 ns/op	 862.44 MB/s
BenchmarkHash/GoStdlib/1K-10            	 2557021	       472.0 ns/op	2169.35 MB/s
BenchmarkHash/GoStdlib/8K-10            	  348588	      3432 ns/op	2386.66 MB/s
BenchmarkHash/GoStdlib/1M-10            	    2752	    435490 ns/op	2407.81 MB/s
BenchmarkHash/GoStdlib/5M-10            	     553	   2162255 ns/op	2424.73 MB/s
BenchmarkHash/GoStdlib/10M-10           	     276	   4331759 ns/op	2420.67 MB/s
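
For readers who want to reproduce numbers of this shape, here is a minimal sketch of a size-parameterised throughput benchmark. It is not the repository's benchmark executor: it only measures the standard library's crypto/sha256, and the names `benchSizes`, `benchStdlib` and `mbPerSec` are made up for illustration. It uses `testing.Benchmark` so it can run as a plain program outside `go test`.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchSizes mirrors a few of the size labels used in the results above.
// (Hypothetical table, not taken from the repository.)
var benchSizes = []struct {
	name string
	size int
}{
	{"8Bytes", 8},
	{"64Bytes", 64},
	{"1K", 1 << 10},
	{"8K", 8 << 10},
}

// benchStdlib hashes a zero-filled buffer of the given size with
// crypto/sha256 and returns the raw benchmark result.
func benchStdlib(size int) testing.BenchmarkResult {
	data := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size)) // lets the result carry bytes per operation
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sha256.Sum256(data)
		}
	})
}

// mbPerSec derives throughput the same way the MB/s column is defined:
// total bytes processed, in units of 1e6 bytes, per elapsed second.
func mbPerSec(r testing.BenchmarkResult) float64 {
	if r.T <= 0 {
		return 0
	}
	return float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
}

func main() {
	// Init registers the testing flags; it is needed when calling
	// testing.Benchmark without going through "go test".
	testing.Init()
	for _, s := range benchSizes {
		r := benchStdlib(s.size)
		nsPerOp := float64(r.T.Nanoseconds()) / float64(r.N)
		fmt.Printf("GoStdlib/%s\t%d\t%.1f ns/op\t%.2f MB/s\n",
			s.name, r.N, nsPerOp, mbPerSec(r))
	}
}
```

Absolute numbers will differ by machine and Go version; only the column layout is meant to match.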

ribasushi · 2023-01-29T21:56:05Z

After the latest renames I also noticed that the arm64 Go preamble does a lot of assignment work:
https://github.com/minio/sha256-simd/blob/9235fbaea/sha256block_arm64.go#L26-L37

compared to the intel one pushing everything into asm-land:
https://github.com/minio/sha256-simd/blob/9235fbaea/sha256block_amd64.go#L26-L31

Sadly I do not know enough assembly yet to properly adjust the preamble of https://github.com/minio/sha256-simd/blob/9235fbaea/sha256block_arm64.s#L28-L36, but I am pretty sure this will speed up things even more.

klauspost · 2023-01-30T08:45:26Z

@ribasushi Thanks. The preamble work should be pretty harmless since it doesn't escape. I had to double check the logic for the Xeon case, but it seems that the library correctly falls back to using the stdlib.

I guess now we could remove the arm64 version as well.
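
The fallback behaviour mentioned above can be pictured with a small sketch. This is an illustration only, not the library's actual dispatch code: `cpuFeatures`, `HasSHAExt` and `pickHash` are hypothetical names, and real feature detection (for example via a cpuid package) is replaced by a plain struct field.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
)

// cpuFeatures is a hypothetical stand-in for real CPU feature detection.
type cpuFeatures struct {
	HasSHAExt bool // assumed flag: hardware SHA-256 instructions available
}

// pickHash returns the accelerated hasher when the CPU supports it and a
// constructor was supplied; otherwise it falls back to crypto/sha256.
// The second return value names the path taken.
func pickHash(f cpuFeatures, accelerated func() hash.Hash) (hash.Hash, string) {
	if f.HasSHAExt && accelerated != nil {
		return accelerated(), "accelerated"
	}
	return sha256.New(), "stdlib"
}

// hexDigest writes msg into h and returns the digest as lowercase hex.
func hexDigest(h hash.Hash, msg []byte) string {
	h.Write(msg)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// A CPU without SHA extensions (such as the Xeon above) takes the
	// stdlib path in this sketch.
	h, path := pickHash(cpuFeatures{HasSHAExt: false}, nil)
	fmt.Println(path, hexDigest(h, []byte("abc")))
}
```

Whichever path is chosen, the digest must be identical, so a check against a known test vector is a cheap way to confirm that a fallback is wired correctly.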

klauspost

LGTM

ribasushi · 2023-01-30T10:01:35Z

> I guess now we could remove the arm64 version as well.

Yeah. Although if you look closely, the ARM version is noticeably faster than the stdlib one on sub-block messages. I didn't manage to find the source of the difference.

Graviton

BenchmarkHash/ArmSha2/8Bytes-48         	12504379	        95.34 ns/op	  83.91 MB/s
BenchmarkHash/GoStdlib/8Bytes-48        	11228222	       106.3 ns/op	  75.29 MB/s

M1

BenchmarkHash/ArmSha2/8Bytes-10         	23491410	        51.09 ns/op	 156.59 MB/s
BenchmarkHash/GoStdlib/8Bytes-10        	22089243	        54.34 ns/op	 147.22 MB/s

fwessels · 2023-01-30T18:19:30Z

> Sadly I do not know enough assembly yet to properly adjust the preamble of https://github.com/minio/sha256-simd/blob/9235fbaea/sha256block_arm64.s#L28-L36, but I am pretty sure this will speed up things even more.

Virtually all the time is spent in the (main) hashing loop, so optimizing the call into the assembly will have a negligible performance effect (even for very short messages, but those are super fast regardless).

ribasushi added 3 commits January 28, 2023 23:01

Upgrade cpuid, regen buildtags

f2605e7

Move all uses of cpuid to one place, adjust benchmarks

4933c9a

Rename SIMD functions to match instruction set

309a6ee

harshavardhana requested a review from klauspost January 29, 2023 02:14

One more rename, align arch-specific files

9235fba

klauspost approved these changes Jan 30, 2023


klauspost merged commit d9c3aea into minio:master Jan 30, 2023
