Fixed shift fancy magic bitboards #3429
src/bitboard.cpp (outdated)
// and if using 64 bit then existing magics are pre-computed.
if constexpr (HasPext || Is64Bit) {
    unsigned index = m.index(occupied);
    m.attacks[index] = reference[size];
Here is a good opportunity to assert that attacks[index] was empty or the same, to verify the validity of the known magics.
Can we test the patch on fishtest, to verify if the added complexity gains Elo? Thanks :-)
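The assertion suggested in the review comment above could look roughly like the sketch below. The helper names `slot_compatible` and `store_checked` are hypothetical and not part of the patch, and the sketch assumes an empty slot is represented by 0, which works because a slider's attack set is never empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitboard = std::uint64_t;

// A slot is compatible with a new value if it is still empty (0) or already
// holds exactly that value. Slider attack sets are never empty, so 0 can
// serve as the "unused" marker in this sketch.
inline bool slot_compatible(Bitboard slot, Bitboard attack) {
    return slot == 0 || slot == attack;
}

// Hypothetical helper: store one attack set, asserting that a precomputed
// magic never maps two different attack sets to the same slot.
inline void store_checked(std::vector<Bitboard>& attacks, std::size_t index, Bitboard attack) {
    assert(index < attacks.size());
    assert(slot_compatible(attacks[index], attack));
    attacks[index] = attack;
}
```

In a release build the asserts compile away, so a check like this would only cost time in debug builds.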
Sure, there is an STC test (currently paused): https://tests.stockfishchess.org/html/live_elo.html?60763a7b8141753378960666 In case the STC results come out good, should I do LTC as well?
I think the problem is that it requires 64-bit machines with PEXT off, which is not that common a case on fishtest, I think. @snicolet
@Vizvezdenec I think you'd be surprised, PEXT is horribly slow on AMD Ryzen CPUs; on my machine perft takes 70% (!!!) longer with PEXT-enabled builds, even though it "fully supports" BMI2. With the rise of AMD CPUs I wouldn't be surprised if it's the majority of CPUs that would benefit from this. I'm not sure about the distribution of build parameters on fishtest though.
@Vizvezdenec I think a possible testing path for this could be to create a branch from master, remove pext so it always uses old fancy magics, and then use that as the base branch - this way all builds would test the new vs. old bitboards.
The thing is that the majority of fishtest is from @noobpwnftw and I don't quite remember what CPUs he has; if they support PEXT then like 80% of machines wouldn't get anything from this patch.
Finished running fishtest: STC reported 3.61 elo, LTC 0.23 elo. I expected LTC to do worse; that applies to any performance upgrade that doesn't affect the search path. Increased search depth gives diminishing returns - it's much more beneficial, for example, to sometimes search depth 15 instead of 14 than it is to sometimes search depth 25 instead of 24.
Hmmm, so this is a huge patch (233 additions and 54 deletions) which did not pass LTC...
Hm, originally I wasn't even going to run fishtest... The thing is, most patches change the behavior of search, this doesn't - I can just run bench and show that this is measurably faster. If we agree in general that higher nps is good and the horizon effect is irrelevant over a large number of searches, then there's no chance of regression and elo must be >0 on any tc. I took a look at the rest of movegen - it seems pretty optimal in terms of code efficiency, well at least I couldn't spot anything else to optimize. If STC / LTC is a hard requirement, it would be very hard to make patches of a similar nature.
Regarding lines of code, I admit this is quite the hippo. Almost all of that comes from known magics and comments. Using known magics is unavoidable, creating a compact attack table is NP-complete: https://backscattering.de/chess/hashtable-packing/. Think training NNUE on startup. A lot of effort (way more than I put into this patch) has been put into finding optimal magics by Volker Annuss, Grant Osborne and probably some others from as far back as 2010, and their results are tested and being used by a lot of other engines. I was quite surprised when I looked at Stockfish and found it's using neither white nor black magics.
I'm probably biased towards this since I already put time into this, but adding fixed shift seemed like a no-brainer from the start.
Fascinating stuff and really nice research and implementation. However, I personally tend to agree with @snicolet's doubts about whether it's really worth the extra code complexity for arguably little gain, as some architectures are faster with PEXT on. I wouldn't put too much trust in the 3.64 elo estimate of https://tests.stockfishchess.org/html/live_elo.html?6077088a81417533789606b6 TBH, because of high error bounds and a biased estimator for passed patches, plus the fact that you "only" measure a 0.61% speedup. A true fishtest elo test at STC (no SPRT) would be better to estimate elo. But even then, the question is whether forcing the entire fishtest fleet to PEXT off yields meaningful results (we are not interested in archs where PEXT on is faster anyway).
For reference, #1538 is an old PR that completely removed magic generation inside SF. Historically SF only has magic generation to "show how it's done". IMO since magic generation isn't needed in an engine and in fact can be done much better offline, it doesn't belong.
I don't know why some think this is too complicated. It's as simple as magics get. Most of the code is constants. This PR is great IMO.
Yeah I also don't even think that we are a chess engine wikipedia of some sort.
Just to clarify, on-the-fly generation is not removed, it's still used for the 32-bit implementation since the known magics are incompatible with the 32-bit hashing. I could have removed the 32-bit exclusive hashing but that could allow a regression on 32-bit if that ends up slower. So I thought - better limit the scope and not touch that for this patch. So I guess there's still a "wiki" for now.
I don't think 32 and 64 bit should have different magics. If that would imply a small regression for 32 bit then so be it.
Added conditional compilation to STC: 1.93 elo. Seems quite reasonable.
About removing 32 bit on-the-fly magic generation completely:
With #3435 being proposed I thought I'd test with a 32-bit build to see what would happen if on-the-fly magics generation were removed completely and the 64-bit hash were used on 32-bit. I decided to use bench instead of a perft test because that better reflects the real usage of magics in a game.
Average: -60ms, or -0.78%. The difference is similar to adding fixed shift to 64-bit but in the opposite direction, so probably -2 elo STC, -1 elo LTC or thereabouts; the benefits to 64-bit would be unchanged. This would get rid of a special case for 32 bit, get rid of ~40 lines of code, and remove magics generation on-the-fly completely. What do you think, is this worth doing?
@GediminasMasaitis Is it possible to use your code on fishtest? If yes, could you test it with a fixed number of games, so we can see the extent of this slowdown? If it's as you say, this could be the patch that removes the 32-bit optimization.
Unfortunately no, I think there's no way to control build architecture on fishtest. If I forced 64-bit hash vs 32-bit hash code then it would still build with 64-bit and would crush the 32-bit version because it's not really built on 32-bit...
@GediminasMasaitis Can't you just change the Makefile to force a 32-bit build? (I'm not an expert in compilers)
By using MIPS to produce a multiplication of 64 bit similar to the one done on a 32 bit architecture. Removing the 32 bit optimization failed the non-regression test, as shown here, on 32 bit arch. This test is not optimal, since some compilers may optimize a bit more than a 32 bit machine would. In addition, I don't know much about architectures, but this test assumes that the architecture produces two 32 bit numbers (hi and lo) when multiplying two 32 bit numbers. If it doesn't, the slowdown might be more severe.
I just closed my issue #3435 as vondele made a good suggestion. So long as 32 bit works, it is fine to not waste time optimizing it as much as possible.
You are attacking a straw man (it is almost comical). Nobody has been optimizing for 32 bit for a long time. There is no time being wasted on 32 bit.
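For readers unfamiliar with the hi/lo multiplication mentioned above, here is a minimal sketch of the two indexing schemes being compared. It reflects my understanding of the general 32-bit technique (two 32-bit multiplies combined with XOR) and is not copied from the patch; the function names `index64` and `index32` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

// 64-bit hashing: one full 64-bit multiply, then a right shift.
inline unsigned index64(Bitboard occupied, Bitboard mask, Bitboard magic, unsigned shift) {
    assert(shift < 64);
    return unsigned(((occupied & mask) * magic) >> shift);
}

// 32-bit hashing: the low and high halves are multiplied separately with
// 32-bit multiplies and combined with XOR, avoiding a 64-bit multiply.
inline unsigned index32(Bitboard occupied, Bitboard mask, Bitboard magic, unsigned shift) {
    assert(shift < 32);
    std::uint32_t lo = std::uint32_t(occupied) & std::uint32_t(mask);
    std::uint32_t hi = std::uint32_t(occupied >> 32) & std::uint32_t(mask >> 32);
    return (lo * std::uint32_t(magic) ^ hi * std::uint32_t(magic >> 32)) >> shift;
}
```

The two functions compute different hashes from the same magic number, which is why magics found for one scheme cannot simply be reused with the other.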
I'm referring to this:
I would like to know how 40 new lines for a small 32 bit speed boost isn't overkill?
For what it's worth, non-regression test for removing 32-bit magics: STC: I really don't think we need to waste resources by running LTC, there are few 32 bit machines on fishtest, if any at all. I'd like to get confirmation from the maintainers: given this non-regression, the bench results for 32 bit, and the "simulation" @BM123499 did, is it ok to remove magics generation and the 32-bit hashing code?
This patch seems to be a slowdown on my system.
@Torom Yes, perft NPS should be higher as well. But I don't think I trust that tool... I tried to run a plain old master vs. master and it's very consistently inconsistent:
+20k, -20k, that pattern repeats all the way down. I'm not sure what it's doing exactly but it's very wonky... Also it's confidently telling me that master (test) is 100% confidently worse than master (base). Well, regardless, I did the same modification to make it run
+11.3%? Eh... Well, maybe. I'll play around with it though. I'll try to set up a Linux-based VM, try to test a non-mingw build.
I got a Linux VM set up and my results on build/run with gcc are similar within margin of error compared to my results on mingw. I don't think it's OS-dependent. I don't think there's more I can test on my machine.
Yeah, it's probably more of a system thing than an OS thing. Same system, windows, different testing tool:
Engine 1: master, Engine 2: your patch (rebased to master)
I understand your fishtest test on removing the special case for 32-bit systems, i.e. removing the generation code, did pass the non-regression test, and it would make sense to make that part of this PR. In that case, we'd be using either pext or one set of magics? That would be an advantage. Indeed, pext is slow on AMD systems that will be around for a while (but it is fixed in their Zen 3). While the code is fairly large in lines, most of it is just the table and some comments. I would suggest putting two constants per line, that's going to be a bit more compact. Could you also add a URL to the code where you refer to Annuss' post? So please combine with the 32-bit removal, and squash to a single commit, and I'll have a more detailed look.
This is basically doing #1352 again, right?
I don't see how this "simulation" represents what I've done. It seems you removed the optimization for 32 bit and tested on 64 bit machines with no adjustment. Please correct me if I'm wrong. I agree that it's a non-regression on a 64-bit machine, but I don't think it shows that it's a non-regression on a 32-bit machine.
It seems cache management has more impact than the slowdown of 64-bit multiplication on a 32-bit machine, as shown in: STC: That said, I don't believe it will cause a slowdown on 32-bit machines.
29eebfc to ba5d989
@GediminasMasaitis why don't you merge Edit: As an example: BM123499@81ba78b
So, I have tried to measure the impact of this patch carefully, and I see no speedup over master. That's compiled on AMD Ryzen 9 3950X with make -j ARCH=x86-64-avx2 profile-build
Who's measuring a speedup for a bench?
Intel i7-3520M
ba5d989 to c355beb
c355beb to a2f01c0
Theory
Fixed shift fancy magic bitboards are a variant of magic bitboards where, instead of calculating a variable shift based on the possible occupancy permutations of a given slide, a compile-time constant shift is used. This way the compiler can use fewer instructions to perform the shift by a constant, reducing the indexing time needed to retrieve the attack bitboard.
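As a rough illustration of the difference, the sketch below contrasts a per-square runtime shift with a shift baked in as a template parameter. The struct names are hypothetical and not taken from the patch; the shift widths (12 index bits for rooks, 9 for bishops) are an assumption based on the maximum number of relevant occupancy bits for each piece.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

// Classic fancy magics: the shift differs per square and is loaded at runtime.
struct VariableShiftMagic {
    Bitboard mask, magic;
    unsigned shift;
    unsigned index(Bitboard occupied) const {
        assert(shift < 64);
        return unsigned(((occupied & mask) * magic) >> shift);
    }
};

// Fixed shift: the shift is a compile-time constant shared by all squares of
// a piece type, so it can be encoded directly in the shift instruction.
template<unsigned Shift>
struct FixedShiftMagic {
    static_assert(Shift < 64, "shift must be smaller than the bitboard width");
    Bitboard mask, magic;
    unsigned index(Bitboard occupied) const {
        return unsigned(((occupied & mask) * magic) >> Shift);
    }
};

using RookMagic   = FixedShiftMagic<64 - 12>;  // assumed: 12 index bits
using BishopMagic = FixedShiftMagic<64 - 9>;   // assumed: 9 index bits
```

Besides the cheaper shift, the fixed variant no longer needs to store a shift per square.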
Additionally, it's possible to permute the entries in the attack table in such a way that the end of square X's attacks matches square Y's attacks; given this, it's possible to overlap entries, reducing the table size and, in theory, improving cache locality. Not all magics are equal: different magics will result in different permutations of the attacks of a certain square, and by continuing to test other magics it's possible to reduce the table size even more. It's unreasonable to compute this kind of compressed table on engine startup, since finding magics and their permutations which result in a small table takes hours and hours. Because of this, magics are pre-computed and hardcoded.
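To make the overlapping idea concrete, here is a toy first-fit packer. It is only a sketch: `place_first_fit` is a hypothetical name, 0 is assumed to mark an unused slot, and the real offline search for good magics and offsets is far more involved than this greedy placement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitboard = std::uint64_t;

// Lays `entries` (one square's attack sets, 0 = unused slot) over the shared
// `table` at the smallest offset where every used entry lands on an empty
// slot or on an identical value. Returns that offset, growing the table if
// the entries extend past its end.
std::size_t place_first_fit(std::vector<Bitboard>& table, const std::vector<Bitboard>& entries) {
    for (std::size_t offset = 0;; ++offset) {
        bool fits = true;
        for (std::size_t i = 0; i < entries.size(); ++i) {
            const std::size_t pos = offset + i;
            if (entries[i] != 0 && pos < table.size()
                && table[pos] != 0 && table[pos] != entries[i]) {
                fits = false;
                break;
            }
        }
        if (!fits)
            continue;
        if (table.size() < offset + entries.size())
            table.resize(offset + entries.size(), 0);
        for (std::size_t i = 0; i < entries.size(); ++i)
            if (entries[i] != 0)
                table[offset + i] = entries[i];
        assert(table.size() >= offset + entries.size());
        return offset;
    }
}
```

For example, placing {1, 2, 3} and then {3, 4} into an empty table yields offsets 0 and 2 and a table of four entries instead of five, because the trailing 3 is shared.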
Implementation
RookTable and BishopTable have been merged into one SlideAttackTable to allow overlapping.
Magic class has been templated to allow constant shift, index() has been updated.
Tests
I ran some tests on my machine on 64 bit with PEXT disabled. Test results are "hot", meaning the first few test results are discarded; this helps stabilize the results and decrease deviation.
bench 16 1 5 default perft:
Tests ran with PEXT disabled on test and base branches in order to avoid testing essentially master vs. master
STC:
LLR: 2.95 (-2.94,2.94) {-0.20,1.10}
Total: 26280 W: 2370 L: 2208 D: 21702
Ptnml(0-2): 76, 1736, 9365, 1876, 87
https://tests.stockfishchess.org/tests/view/6079579e162adf76afa5b7ac
LTC:
LLR: 2.94 (-2.94,2.94) {0.20,0.90}
Total: 104904 W: 4056 L: 3791 D: 97057
Ptnml(0-2): 44, 3234, 45642, 3477, 55
https://tests.stockfishchess.org/tests/view/60797216162adf76afa5b7b2
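For a rough sense of scale, the W/L/D totals above can be converted to a point Elo estimate with the standard logistic formula. This is a back-of-the-envelope sketch, not how fishtest computes its SPRT bounds, and `elo_from_wld` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Logistic Elo difference implied by a win/loss/draw record.
// The score is the fraction of points won, counting a draw as half a point.
inline double elo_from_wld(double wins, double losses, double draws) {
    const double games = wins + losses + draws;
    assert(games > 0);
    const double score = (wins + 0.5 * draws) / games;
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Applied to the totals above, this gives roughly 2.1 Elo for the STC run and roughly 0.9 Elo for the LTC run; as noted earlier in the thread, point estimates from passed SPRT runs tend to be biased upward.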