
Optimise pairwise multiplication #5282

Closed
wants to merge 1 commit into from

Conversation

cj5716
Contributor

@cj5716 cj5716 commented May 22, 2024

This speedup was first inspired by a comment by @AndyGrant on my recent PR: "If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal."

That got me thinking: although mullo_epi16 does not preserve the signedness, mulhi_epi16 does, so we can shift left and then use mulhi_epi16 instead of shifting right after the mullo.

However, due to issues with shifting into the sign bit, the FT weights and biases had to be multiplied by 2 for the optimisation to work.
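The identity behind the trick can be checked with plain integers. This is a hedged sketch in Python, not the actual Stockfish kernel, and the shift amount is purely illustrative: mulhi_epi16 yields the signed high half of a 16x16-bit product, so pre-shifting one operand left lets the implicit >> 16 absorb the right shift that previously followed the mullo.

```python
def mulhi16(x, y):
    # Signed high 16 bits of a 16x16-bit product, like one lane of
    # mulhi_epi16 (x and y taken as signed 16-bit values).
    return (x * y) >> 16

S = 9  # illustrative shift; chosen so a << (16 - S) stays in int16 range here
for a in range(0, 128, 7):        # clipped activations stay in [0, 127]
    for b in range(0, 128, 5):
        old = (a * b) >> S                 # low half of product, then shift right
        new = mulhi16(a << (16 - S), b)    # pre-shift left, take the high half
        assert old == new
```

The identity is exact for non-negative inputs because (a << (16 - S)) * b == (a * b) << (16 - S), and shifting that right by 16 is the same as shifting a * b right by S.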

Speedup on "Arch=x86-64-bmi2 COMP=clang", courtesy of @Torom
Result of 50 runs
base (...es/stockfish) =     962946  +/- 1202
test (...ise-max-less) =     979696  +/- 1084
diff                   =     +16750  +/- 1794

speedup        = +0.0174
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

Also a speedup on "COMP=gcc", courtesy of Torom once again
Result of 50 runs
base (...tockfish_gcc) =     966033  +/- 1574
test (...max-less_gcc) =     983319  +/- 1513
diff                   =     +17286  +/- 2515

speedup        = +0.0179
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

closes #5282

Passed STC:
LLR: 2.96 (-2.94,2.94) <0.00,2.00>
Total: 67712 W: 17715 L: 17358 D: 32639
Ptnml(0-2): 225, 7472, 18140, 7759, 260
https://tests.stockfishchess.org/tests/view/664c1d75830eb9f886616906

No functional change


github-actions bot commented May 22, 2024

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 9209296365 / attempt 1)

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 22, 2024
In the future we can increase the quantisation number in the FT to 255 (instead of 2x127=254) and also train a net from scratch to make use of the extra bit (currently all the weights and biases are even)
@Torom
Contributor

Torom commented May 22, 2024

I think it should also be mentioned that this is only a small speedup, if any, for NEON:

ARCH=armv8-dotprod COMP=clang

Result of  30 runs
==================
base (../stockfish   ) =     259861  +/- 695
test (...ise-max-less) =     260887  +/- 361
diff                   =      +1026  +/- 876

speedup        = +0.0039
P(speedup > 0) =  0.9891

CPU: 4 x aarch64
Hyperthreading: off

ARCH=armv8-dotprod COMP=gcc

Result of  30 runs
==================
base (...tockfish_gcc) =     277171  +/- 608
test (...max-less_gcc) =     278231  +/- 400
diff                   =      +1060  +/- 834

speedup        = +0.0038
P(speedup > 0) =  0.9935

CPU: 4 x aarch64
Hyperthreading: off

This is also the reason why the first SPRT failed while vondele's fleet was active:

https://tests.stockfishchess.org/tests/view/664b21fb830eb9f886614a79
LLR: -2.95 (-2.94,2.94) <0.00,2.00>
Total: 122432 W: 31254 L: 31331 D: 59847
Ptnml(0-2): 331, 12841, 34940, 12782, 322

@mstembera
Contributor

Congrats! Nice speedup.

@vondele
Member

vondele commented May 22, 2024

This is a nice speedup, but can only be merged if the needed changes are made upstream to the trainer, IMO. @linrock or @Sopel97 ... opinions?

@linrock
Contributor

linrock commented May 22, 2024

nice speedup, maybe it will help with larger L1.

  • what did you use for converting the current nets?
  • what nnue-pytorch changes are needed to train usable nets?

@cj5716
Contributor Author

cj5716 commented May 23, 2024

To convert the nets, I simply doubled all the weights and biases on net loading, and then exported the result.

For changes on the trainer side, I'm not sure where to edit, but the change should be as simple as changing the FT quantised_one value. Maybe Sopel can help with this?

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@Sopel97
Member

Sopel97 commented May 23, 2024

looks good, but this requires modifying the network, and renders all others silently incompatible, as far as I understand. I'd instead multiply the weights and biases within the engine, and think of a way to version this change within the network, should we decide to change it in the trainer.

I believe the feature with index 0 may be unused in all sound feature sets, so we could potentially use that space

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@cj5716
Contributor Author

cj5716 commented May 23, 2024

Have moved it into the engine

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 23, 2024
@Disservin Disservin added the to be merged Will be merged shortly label May 23, 2024
@Disservin Disservin closed this in c6a1e7f May 23, 2024
@Sopel97
Member

Sopel97 commented May 24, 2024

we could potentially test if the additional precision matters by randomly altering all affected weights by a value from [-1, 0, 1] and see the Elo impact
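A minimal sketch of that experiment (the `perturb` helper is hypothetical; the actual Elo measurement would happen on fishtest): add a value drawn from {-1, 0, +1} to every affected weight.

```python
import random

def perturb(weights, seed=0):
    # Hypothetical helper: nudge every weight by -1, 0, or +1,
    # as in the perturbation test proposed above.
    rng = random.Random(seed)
    return [w + rng.choice((-1, 0, 1)) for w in weights]

orig = [10, -3, 0, 1484]
noisy = perturb(orig)
assert all(abs(a - b) <= 1 for a, b in zip(noisy, orig))
```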

@cj5716
Contributor Author

cj5716 commented May 25, 2024

Trying something like that with this test https://tests.stockfishchess.org/tests/view/665176bca86388d5e27da95f

Edit: the test description is wrong, it rounds up if magnitude > 192, not 255

@mstembera
Contributor

However, due to some issues with shifting into the sign bit, the FT weights and biases had to be multiplied by 2 for the optimisation to work.

@cj5716 Can you say a bit more about this? I naively thought this https://tests.stockfishchess.org/tests/view/665244aaa86388d5e27da999 would be non functional and get rid of the load scaling but clearly not.

@cj5716
Contributor Author

cj5716 commented May 26, 2024

127 == 0b0000000001111111.
After shifting left by 9, we get 0b1111111000000000.
But this value is interpreted as an i16 in the mulhi, so the actual value becomes -512.

And what if we use mulhi_epu16 instead? In that case, we would have to add the lower-side clip for the sum1s, and if we attempted to remove the upper-side clip, the output of the mulhi would occupy the full 16 bits, meaning we could not use packs to shrink it to [0, 127] as is already done implicitly. So we would have to add all the instructions back, leading to no speedup in the end.
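The wrap-around described here can be reproduced with a tiny helper (a sketch, not engine code) that reinterprets the low 16 bits of a value as signed, the way a SIMD lane would see it:

```python
def as_i16(v):
    # Reinterpret the low 16 bits of v as a signed 16-bit integer.
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

# 127 == 0b0000000001111111; shifted left by 9 it lands in the sign bit
assert as_i16(127 << 9) == -512
```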

@mstembera
Contributor

Makes sense. Thanks! The pre-multiplication is clever to avoid the issue. As you mentioned, if we move this to the trainer and train the weights using the larger range, we may get some benefit from better resolution. Pre-multiplying by 4 also works: master...mstembera:Stockfish:Premul4. Currently the largest non-pre-multiplied weight is 1484, well below the signed 16-bit max of 32767 even after pre-multiplying.

@AndyGrant

Should note that in theory you will be adding up to 31 weights (1 for every piece relation, except the kings, which are combined as one). Any weights above 32767/31 = 1057 are prone to overflow. If sum(biggest-31-weights) > 32767 or sum(smallest-31-weights) < -32768 for an individual neuron, then you have issues in theory.
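The arithmetic behind that bound can be sanity-checked directly. This is a sketch of the reasoning only, not a claim about any particular net:

```python
I16_MAX = (1 << 15) - 1   # 32767, the signed 16-bit maximum
ACTIVE = 31               # at most 31 simultaneously active features

# Largest per-weight magnitude that can never overflow on its own:
safe = I16_MAX // ACTIVE
assert safe == 1057
assert ACTIVE * 1057 == I16_MAX   # 31 such weights sit exactly at the limit
assert ACTIVE * 1058 > I16_MAX    # one more and overflow becomes possible
```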

@Sopel97
Member

Sopel97 commented May 26, 2024

yea, we would need to add an entry for weight clipping for the first layer, being careful not to touch PSQT https://github.com/official-stockfish/nnue-pytorch/blob/a32432ec99859b7a8583e8f8f4ca051fd4b3dab0/model.py#L156

@mstembera
Contributor

So in theory we could already have a problem, because some of our weights are above 1057. On the other hand, it's extremely unlikely this is actually happening in practice, and constraining the weights to below 1057 could hurt Elo. It would be possible to check at load time that no relevant combination of 31 weights sums to over 32767.

@cj5716 cj5716 deleted the pairwise-max-less branch May 27, 2024 01:09
@cj5716 cj5716 restored the pairwise-max-less branch May 27, 2024 01:09
@cj5716
Contributor Author

cj5716 commented May 27, 2024

I suppose a "band-aid" solution would be to simply switch all the accumulator changes to adds and subs instead of add and sub?
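For reference, a sketch of what the saturating adds/subs variants (e.g. _mm256_adds_epi16) do per 16-bit lane, assuming this reading of the suggestion is right: they clamp at the int16 limits instead of wrapping.

```python
def adds16(a, b):
    # Saturating signed 16-bit add: clamp at the limits instead of wrapping,
    # mirroring what a single lane of a SIMD saturating add would do.
    return max(-32768, min(32767, a + b))

assert adds16(32000, 1000) == 32767     # clamps high instead of wrapping
assert adds16(-32000, -1000) == -32768  # clamps low
assert adds16(5, 7) == 12               # in-range values are unaffected
```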

Also, did we have this problem before?

@AndyGrant

I think that is a very good band-aid fix, and might let me make use of this optimization in Torch, while I try to take back the reins on network training.

With adds/subs instead of add/sub, your new fear is that a series of inputs causes you to cap out, and then a later series brings you back BELOW 127 (or 254/255 in Stockfish's case now). If you pretend that the weights are truly random, and the order they get added together is random, it seems exceptionally unlikely you'll run into this case.


Also, did we have this problem before?

Having 31+ weights that are >= 1057 is a prerequisite for being able to overflow. So you probably could in that sense. But really you need 31 weights that all contribute to the same index of the accumulator, AND can be active at the same time.

So take all the accum weights, and split them up based on the output neuron. Next question is do you have 31+ weights there that can exceed 1057? Well we know we can only have 16 pieces of each colour, so you need to look at those ones.

... and they all have to be for the same King sq
... and none of the problematic weights can be on the same sq, since two pieces cannot share the same square,

So at the end of the day, the math to say "Can I actually overflow" is quite complicated IMO. Perhaps I will write something up in python that works for Torch, and I can share it with you Cj.

@AndyGrant

AndyGrant commented May 28, 2024

It might not be a good solution actually, since adds and subs require more specialized execution ports, at least on Zen, lowering the throughput of the instruction. I don't know what Intel does, but the intrinsics guide notes a drop in throughput there as well, i.e. fewer execution ports available for the operations.

The following is probably good enough to convince yourselves for Stockfish... you look at chunks of 64x11 weights, i.e. king squares, for each output neuron. If you want to get really pedantic, there are times where you can add more than 31 inputs at once, since you are in the midst of removing another.

i.e. suppose your Finny table has 31 inputs added. Then the order of these operations technically matters:

  1. entry = add(set, sub(entry, unset))
  2. entry = sub(add(set, entry), unset)
  3. entry = add(entry, sub(set, unset)) <- Preferable, you could say
import numpy as np

def check_for_errors(weights, biases):

    chunk_size = 64 * 11          # one chunk per king square
    BOUND = (2 ** 15) - 1         # signed 16-bit maximum

    # weights.shape = (64*32*11, L1-SIZE)
    for col in range(weights.shape[1]):

        data = weights[:, col]
        for start in range(0, len(data), chunk_size):

            # Worst cases: the 31 smallest and 31 largest weights in the chunk
            chunk = np.sort(data[start:start + chunk_size])
            lower = sum(chunk[:31]); upper = sum(chunk[-31:])

            upper += max(0, biases[col])
            lower -= min(0, biases[col])

            if upper > BOUND:
                print(upper)

            if lower < -BOUND:
                print(lower)

n = Network()
n.read_network(args.network)
check_for_errors(n.accumulator['weights'], n.accumulator['biases'])

@cj5716
Contributor Author

cj5716 commented May 30, 2024

I'll see how we can handle this, hopefully all goes well
