
Optimise pairwise multiplication #5282

Closed
wants to merge 1 commit into from

Conversation

cj5716
Contributor

@cj5716 cj5716 commented May 22, 2024

This speedup was first inspired by a comment by @AndyGrant on my recent PR: "If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal."

That got me thinking: although mullo_epi16 does not preserve the signedness, mulhi_epi16 does, so we can shift left and then use mulhi_epi16 instead of shifting right after the mullo.

However, due to issues with shifting into the sign bit, the FT weights and biases had to be multiplied by 2 for the optimisation to work.
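The identity behind the trick can be checked with plain integers. This is a hedged sketch in Python, not the actual Stockfish kernel, and the shift amount is purely illustrative: mulhi_epi16 yields the signed high half of a 16x16-bit product, so pre-shifting one operand left lets the implicit >> 16 absorb the right shift that previously followed the mullo.

```python
def mulhi16(x, y):
    # Signed high 16 bits of a 16x16-bit product, like one lane of
    # mulhi_epi16 (x and y taken as signed 16-bit values).
    return (x * y) >> 16

S = 9  # illustrative shift; chosen so a << (16 - S) stays in int16 range here
for a in range(0, 128, 7):        # clipped activations stay in [0, 127]
    for b in range(0, 128, 5):
        old = (a * b) >> S                 # low half of product, then shift right
        new = mulhi16(a << (16 - S), b)    # pre-shift left, take the high half
        assert old == new
```

The identity is exact for non-negative inputs because (a << (16 - S)) * b == (a * b) << (16 - S), and shifting that right by 16 is the same as shifting a * b right by S.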

Speedup on "Arch=x86-64-bmi2 COMP=clang", courtesy of @Torom
Result of 50 runs
base (...es/stockfish) =     962946  +/- 1202
test (...ise-max-less) =     979696  +/- 1084
diff                   =     +16750  +/- 1794

speedup        = +0.0174
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

Also a speedup on "COMP=gcc", courtesy of Torom once again
Result of 50 runs
base (...tockfish_gcc) =     966033  +/- 1574
test (...max-less_gcc) =     983319  +/- 1513
diff                   =     +17286  +/- 2515

speedup        = +0.0179
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

closes #5282

Passed STC:
LLR: 2.96 (-2.94,2.94) <0.00,2.00>
Total: 67712 W: 17715 L: 17358 D: 32639
Ptnml(0-2): 225, 7472, 18140, 7759, 260
https://tests.stockfishchess.org/tests/view/664c1d75830eb9f886616906

No functional change


github-actions bot commented May 22, 2024

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 9209296365 / attempt 1)

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 22, 2024
In the future we can increase the quantisation number in the FT to 255 (instead of 2x127=254) and also train a net from scratch to make use of the extra bit (currently all the weights and biases are even)
@Torom
Contributor

Torom commented May 22, 2024

I think it should also be mentioned that this is only a small speedup, if any, for NEON:

ARCH=armv8-dotprod COMP=clang

Result of  30 runs
==================
base (../stockfish   ) =     259861  +/- 695
test (...ise-max-less) =     260887  +/- 361
diff                   =      +1026  +/- 876

speedup        = +0.0039
P(speedup > 0) =  0.9891

CPU: 4 x aarch64
Hyperthreading: off

ARCH=armv8-dotprod COMP=gcc

Result of  30 runs
==================
base (...tockfish_gcc) =     277171  +/- 608
test (...max-less_gcc) =     278231  +/- 400
diff                   =      +1060  +/- 834

speedup        = +0.0038
P(speedup > 0) =  0.9935

CPU: 4 x aarch64
Hyperthreading: off

This is also the reason why the first SPRT failed while vondele's fleet was active:

https://tests.stockfishchess.org/tests/view/664b21fb830eb9f886614a79
LLR: -2.95 (-2.94,2.94) <0.00,2.00>
Total: 122432 W: 31254 L: 31331 D: 59847
Ptnml(0-2): 331, 12841, 34940, 12782, 322

@mstembera
Contributor

Congrats! Nice speedup.

@vondele
Member

vondele commented May 22, 2024

This is a nice speedup, but can only be merged if the needed changes are made upstream to the trainer, IMO. @linrock or @Sopel97 ... opinions?

@linrock
Contributor

linrock commented May 22, 2024

nice speedup, maybe it will help with larger L1.

  • what did you use for converting the current nets?
  • what nnue-pytorch changes are needed to train usable nets?

@cj5716
Contributor Author

cj5716 commented May 23, 2024

To convert the nets, I simply doubled all the weights and biases on net loading, and then exported the result.

For changes on the trainer side, I'm not sure where to edit, but the change should be as simple as changing the FT quantised_one value. Maybe Sopel can help with this?

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@Sopel97
Member

Sopel97 commented May 23, 2024

looks good, but this requires modifying the network, and renders all others silently incompatible, as far as I understand. I'd instead multiply the weights and biases within the engine, and think of a way to version this change within the network, should we decide to change it in the trainer.

I believe the feature with index 0 may be unused in all sound feature sets, so we could potentially use that space

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@cj5716
Contributor Author

cj5716 commented May 23, 2024

Have moved it into the engine

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@Disservin Disservin added the to be merged Will be merged shortly label May 23, 2024
@Disservin Disservin closed this in c6a1e7f May 23, 2024
@Sopel97
Member

Sopel97 commented May 24, 2024

we could potentially test if the additional precision matters by randomly altering all affected weights by a value from [-1, 0, 1] and see the Elo impact
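A minimal sketch of that experiment (the `perturb` helper is hypothetical; the actual Elo measurement would happen on fishtest): add a value drawn from {-1, 0, +1} to every affected weight.

```python
import random

def perturb(weights, seed=0):
    # Hypothetical helper: nudge every weight by -1, 0, or +1,
    # as in the perturbation test proposed above.
    rng = random.Random(seed)
    return [w + rng.choice((-1, 0, 1)) for w in weights]

orig = [10, -3, 0, 1484]
noisy = perturb(orig)
assert all(abs(a - b) <= 1 for a, b in zip(noisy, orig))
```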

@cj5716
Contributor Author

cj5716 commented May 25, 2024

Trying something like that with this test https://tests.stockfishchess.org/tests/view/665176bca86388d5e27da95f

Edit: the test description is wrong, it rounds up if magnitude > 192, not 255

@mstembera
Contributor

However, due to some issues with shifting into the sign bit, the FT weights and biases had to be multiplied by 2 for the optimisation to work.

@cj5716 Can you say a bit more about this? I naively thought this https://tests.stockfishchess.org/tests/view/665244aaa86388d5e27da999 would be non functional and get rid of the load scaling but clearly not.

@cj5716
Contributor Author

cj5716 commented May 26, 2024

127 == 0b0000000001111111.
After shifting left by 9, we get 0b1111111000000000.
But this value is interpreted as an i16 in the mulhi, so the actual value becomes -512.

And what if we use mulhi_epu16 instead? In that case, we would have to add the lower-side clip for the sum1s, and if we attempted to remove the upper-side clip, the output of the mulhi would occupy the full 16 bits, meaning we could not use packs to shrink it to [0, 127] as is already done implicitly. So we would have to add all the instructions back, leading to no speedup in the end.
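The wrap-around described here can be reproduced with a tiny helper (a sketch, not engine code) that reinterprets the low 16 bits of a value as signed, the way a SIMD lane would see it:

```python
def as_i16(v):
    # Reinterpret the low 16 bits of v as a signed 16-bit integer.
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

# 127 == 0b0000000001111111; shifted left by 9 it lands in the sign bit
assert as_i16(127 << 9) == -512
```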

@mstembera
Contributor

Makes sense. Thanks! The pre-multiplication is clever to avoid the issue. As you mentioned, if we move this to the trainer and train the weights using the larger range, we may get some benefit from better resolution. Pre-multiplying by 4 also works: master...mstembera:Stockfish:Premul4. Currently the largest non-pre-multiplied weight is 1484, well below the signed 16-bit max of 32767 even after pre-multiplying.

@AndyGrant

Should note that in theory you will be adding up to 31 weights (1 for every piece relation, except the kings, which are combined as one). Any weights above 32767/31 = 1057 are prone to overflow. If sum(biggest-31-weights) > 32767 or sum(smallest-31-weights) < -32768 for an individual neuron, then you have issues in theory.
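The arithmetic behind that bound can be sanity-checked directly. This is a sketch of the reasoning only, not a claim about any particular net:

```python
I16_MAX = (1 << 15) - 1   # 32767, the signed 16-bit maximum
ACTIVE = 31               # at most 31 simultaneously active features

# Largest per-weight magnitude that can never overflow on its own:
safe = I16_MAX // ACTIVE
assert safe == 1057
assert ACTIVE * 1057 == I16_MAX   # 31 such weights sit exactly at the limit
assert ACTIVE * 1058 > I16_MAX    # one more and overflow becomes possible
```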

@Sopel97
Member

Sopel97 commented May 26, 2024

yea, we would need to add an entry for weight clipping for the first layer, being careful not to touch PSQT https://github.com/official-stockfish/nnue-pytorch/blob/a32432ec99859b7a8583e8f8f4ca051fd4b3dab0/model.py#L156

@mstembera
Contributor

So in theory we could already have a problem, because some of our weights are above 1057. On the other hand, it's extremely unlikely this is actually happening in practice, and constraining the weights to below 1057 could hurt Elo. It would be possible to check at load time that no relevant combination of 31 weights sums to over 32767.

@cj5716 cj5716 deleted the pairwise-max-less branch May 27, 2024 01:09
@cj5716 cj5716 restored the pairwise-max-less branch May 27, 2024 01:09
@cj5716
Contributor Author

cj5716 commented May 27, 2024

I suppose a "band-aid" solution would be to simply switch all the accumulator changes to adds and subs instead of add and sub?
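For reference, a sketch of what the saturating adds/subs variants (e.g. _mm256_adds_epi16) do per 16-bit lane, assuming this reading of the suggestion is right: they clamp at the int16 limits instead of wrapping.

```python
def adds16(a, b):
    # Saturating signed 16-bit add: clamp at the limits instead of wrapping,
    # mirroring what a single lane of a SIMD saturating add would do.
    return max(-32768, min(32767, a + b))

assert adds16(32000, 1000) == 32767     # clamps high instead of wrapping
assert adds16(-32000, -1000) == -32768  # clamps low
assert adds16(5, 7) == 12               # in-range values are unaffected
```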

Also, did we have this problem before?

@AndyGrant

I think that is a very good band-aid fix, and might let me make use of this optimization in Torch, while I try to take back the reins on network training.

With adds/subs instead of add/sub, your new fear is that a series of inputs causes you to cap out, and then a later series brings you back BELOW 127 (or 254/255 in Stockfish's case now). If you pretend that the weights are truly random, and the order they get added together is random, it seems exceptionally unlikely you'll run into this case.


Also, did we have this problem before?

Having 31+ weights that are >= 1057 is a prerequisite for being able to overflow. So you probably could in that sense. But really you need 31 weights that all contribute to the same index of the accumulator, AND can be active at the same time.

So take all the accum weights, and split them up based on the output neuron. Next question is do you have 31+ weights there that can exceed 1057? Well we know we can only have 16 pieces of each colour, so you need to look at those ones.

... and they all have to be for the same King sq
... and none of the problematic weights can be on the same sq, since two pieces cannot share the same square,

So at the end of the day, the math to say "Can I actually overflow" is quite complicated IMO. Perhaps I will write something up in python that works for Torch, and I can share it with you Cj.

@AndyGrant

AndyGrant commented May 28, 2024

It might not be a good solution actually, since adds and subs require more specialized execution ports, at least on Zen, lowering the throughput of the instruction. I don't know what Intel does, but the intrinsics guide notes a drop in throughput there as well, i.e. fewer execution ports available for the operations.

The following is probably good enough to convince yourselves for Stockfish... you look at chunks of 64x11 weights, i.e. king squares, for each output neuron. If you want to get really pedantic, there are times where you can add more than 31 inputs at once, since you are in the midst of removing another.

i.e. suppose your Finny table has 31 inputs added. Then the order of these operations technically matters:

  1. entry = add(set, sub(entry, unset))
  2. entry = sub(add(set, entry), unset)
  3. entry = add(entry, sub(set, unset)) <- Preferable, you could say
import numpy as np

def check_for_errors(weights, biases):

    chunk_size = 64 * 11          # one chunk per king square
    BOUND = (2 ** 15) - 1         # signed 16-bit maximum

    # weights.shape = (64*32*11, L1-SIZE)
    for col in range(weights.shape[1]):

        data = weights[:, col]
        for start in range(0, len(data), chunk_size):

            # Worst cases: the 31 smallest and 31 largest weights in the chunk
            chunk = np.sort(data[start:start + chunk_size])
            lower = sum(chunk[:31]); upper = sum(chunk[-31:])

            upper += max(0, biases[col])
            lower -= min(0, biases[col])

            if upper > BOUND:
                print(upper)

            if lower < -BOUND:
                print(lower)

n = Network()
n.read_network(args.network)
check_for_errors(n.accumulator['weights'], n.accumulator['biases'])

@cj5716
Contributor Author

cj5716 commented May 30, 2024

I'll see how we can handle this, hopefully all goes well
