SF NNUE #2728
Comments
|
I don't know if it's the direction the devs want to go in, but I think integrating ML into SF should be considered, given the impressive results. |
|
We should be open-minded and see how things evolve... it is an interesting development. Let's see how the code base evolves, the performance goes, etc. Once we have some data and understanding, we should see what the opportunities are. |
|
Given that past attempts to tune Stockfish to match Leela's evaluations have failed, I'm not entirely sure that you can extract much useful information from another similar black box, especially since neural networks have convolutional structures that make them useful and less compressible. EDIT: I found out (anecdotally) that this neural net doesn't use convolutions. If you want to investigate, you should probably ask on the Stockfish Discord or the fork mentioned by vondele below. |
|
I don't know much about SF NNUE. What is it? Does NNUE stand for something? |
|
So it's been claimed on Discord that NNUE is now 34 Elo stronger than SF dev. |
|
I don't think anybody claimed that besides the occasional SSS result. NNUE is definitely much worse at 10+0.1 STC, but it quickly gains Elo on SF dev as the TC increases. |
|
Just for reference, this issue refers to the fork being developed here: https://github.com/nodchip/Stockfish with an eval function based on a neural net architecture. |
|
Data is sounding more and more convincing on this (look at jjosh and lkaufman posts): "Anecdotally", I have several test positions which SF consistently takes up to 50-100 billion nodes or more (or sometimes never finds it) to find the correct move, that SF NNUE finds within a few million nodes. The difference is night and day. Is there any chance fishtest resources could be used for this? Or if we could somehow run one of these "patches" (SF NNUE) against "master" with SPRT elo bounds at 180+1.8? I think it might pass very fast! |
|
well, I think we should slowly start to think about how we can utilize fishtest to train networks and stuff like this. |
"Corner cases that corrupt SF play a lot": I'll bet there are equally many (if not more) corner cases with the NNUE architecture, given that even Leela has lots of trouble with its own kind of corner cases, especially those that are far from mate and require pruning exponentially larger search trees.

Current SF has a reasonable combination of search code and eval code, so it can be directed toward improvements in obscure endgames, and those problems can be made far less difficult by deliberate effort. This may make it easier to identify and fix specific problems. In my experience with neural networks, specific problems are far harder to fix when trying to generalize evaluation.

Also, NNUE may not provide a higher ceiling than handcrafted evals because of the inefficiency of information packing in neural networks as opposed to formal handcrafted evaluation. An NNUE network can only be so large, so it will probably hit its limit and stop improving after a certain point, much like how Leela's network architecture has hardly improved since it first got squeeze-and-excitation (SE) nets. That said, it's easier to train this NNUE than Lc0 because it has so many fewer variables, so designing improvements (in the short term at least) may come easier to it.

So I'd still be a bit skeptical (even though I predict NNUE will be better in the near future) of the long-term implications of NNUE. I fear that SF could get stuck in a local minimum with NNUE when the NN stops improving, and people would lose interest in the SF project instead of returning to handcrafted evaluations with a higher Elo ceiling. If AlphaZero had come two years earlier and blown everyone out of the water then, it probably would have made many people abandon SF instead of realizing there was still great potential for handcrafted evaluations.

The SF project is probably one of the largest (if not the largest) open-source projects of handcrafted feature recognition, and in my opinion it would be a shame if it were just to become an exhibit in a GitHub museum. All this said, it's just my experience from watching from the Lc0 side of things. |
|
The difference is that ~80% of the Elo SF gains comes from improvements in search. So even if eval gets "stuck" - well, it's not THAT big of a deal, tbh. |
|
I don't think handcrafted evaluation should be abandoned, as the possibility of it having a higher ceiling remains. That being said, as Viz mentioned, handcrafted search appears to be "unthreatened" anyway, so the "SF project" won't become an "exhibit in a github museum" regardless. People shouldn't forget that a big reason why SF NNUE is already so strong is its strong search. For example, I'd predict that if Komodo NNUE were released (Komodo being the 2nd-strongest CPU-only engine), it would still get crushed by native SF. However, my point was that it may be prudent to do some "testing on fishtest" for the NNUE component, if just to become adept at using/testing/training it. The handcrafted eval component should still continue as much as possible, but perhaps when it comes to submitting SF for tournaments etc., the strongest version of SF at the time should be submitted (whether it's native SF or SF NNUE). |
|
From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames, and native evaluation would somehow have to take over anyway (to gain Elo, that is). Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there were a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations, while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior. (e.g. NNUE seemed to evaluate a drawn KRPPPvKRPP endgame at +3 while native SF evaluated it at +1) |
|
The problem is you don't really have a way to decide which eval is correct and which is not, even with shallow search. With native eval, people spot certain problems and write patches; they still often break more stuff than they fix by failing fishtest, so how NNUE is going to magically make this problem disappear is beyond me. |
Those misevaluations are mostly the result of the data it's been trained on.* Things should eventually improve, once we can get fishtest, Leela, or Noob's data to work. Anyway, I turned skeptical about its scaling after seeing a fixed-node test at 1m, 10m and 20m nodes. But maybe Jjoshua's net has fixed that. *But a lot of them will exist even if we use deeper data; SF evaluating a drawn endgame as +1 is just as wrong as Leela saying +0.8 or NNUE +3.4. |
|
what kind of training data should those games be? All fishtest LTC games are available with scores for each position, roughly depth 20-25 that is, that's literally billions of scored positions. |
A few others have experimented with the data but had some strange behaviour. |
|
concerning settings and nets, it would be useful if the nodchip github repo would indicate in the readme what the current optimal settings are, and give a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people that want to get something running quickly, that would be very helpful. |
|
@gekkehenker It's much harder* to tune a neural network to give desired relative evaluations than it is for the handcrafted alternatives.**

*This might still have to be proven true, but Stockfish's evaluations are tuned to beat other versions of itself. That makes the patches that come out of fishtest alive very good at introducing adversarial play, which a small neural network trained on external data could not provide with such high fidelity. What ends up happening against stronger or "drawish" opponents is that the neural network tends to prefer positions it cannot evaluate properly, instead of focusing on generating play from its own internal strengths.

**"Handcrafted alternatives" rely on far more concrete values to evaluate a position, so any small difference in evaluation that might find wins/draws has a magnified effect. Also, the deeper the search, the more the false positives generated by the neural network affect how the edges of the search behave, especially in drawn 50-move-rule-bound endgames.

@noobpwnftw Being able to distinguish when our handcrafted evaluations are better to use could rely on a table of precalculated values from a file, which would let us determine which evaluation method is better for a given number and type of pieces on the board. We could create such an evaluation-accuracy piece-table by using the mean squared error of an evaluation against the result of a game, for which we might have to figure out how the new network's evaluations convert to "actual" win percentage. One potential downside is that this might get a bit messy if different networks have different strengths. |
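The evaluation-accuracy piece-table idea above can be sketched as follows. Everything here is hypothetical: the centipawn-to-expected-score mapping and the sample data are illustrative assumptions, not measured values or anything from the actual fork.

```python
from collections import defaultdict

def eval_to_score(cp):
    # Hypothetical logistic mapping from centipawns to expected game score.
    return 1 / (1 + 10 ** (-cp / 400))

def accuracy_table(positions):
    # positions: iterable of (piece_count, eval_cp, game_result), where
    # game_result is 1.0 / 0.5 / 0.0 from the evaluated side's perspective.
    # Returns the mean squared error of the eval per piece count.
    sums = defaultdict(lambda: [0.0, 0])
    for piece_count, cp, result in positions:
        bucket = sums[piece_count]
        bucket[0] += (eval_to_score(cp) - result) ** 2
        bucket[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}

# Made-up sample: a drawn 7-piece ending scored +3.00 is penalized heavily,
# while a near-equal middlegame eval of a drawn game scores a low error.
data = [(7, 300, 0.5), (7, 100, 0.5), (32, 50, 0.5)]
table = accuracy_table(data)
```

A table like this, built separately for NNUE and the classical eval, could then be consulted at eval time to pick whichever method has the lower historical error for the current material configuration.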
This link contains a few Windows compiles (popcnt, avx2, bmi2) and my current strongest net: https://workupload.com/file/ggEUrvNVgmH It seems like the latest binaries (same goes for the binaries on Nodchip's repo) fixed a few bugs. It's roughly as simple to use as SF now; the UCI option "evalfile" has to point towards the NN file. There's sadly not a lot of centralized information, because it was originally nothing more than a quick port to test whether NNUE works in chess too. Whatever I know is built upon quick instructions from Twitter, looking through the learner.cpp code, and Google-translated YaneuraOu docs: https://twitter.com/nodchip/status/993432774387249153 |
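For anyone wiring this up by hand, a minimal UCI session might look like the following. The net file name is a placeholder, and the option spelling may differ between builds (the comment above calls it "evalfile"):

```
uci
setoption name EvalFile value nn.bin
isready
go depth 20
```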
|
Just thought it'd be important to post some real results in my testing so far.
I'm going to let it run to 1000-games mainly just for future consistency.
Anyway, thanks to @gekkehenker and nodchip for continuing to share their knowledge publicly! |
|
I didn't have much luck with anything I tried so far, but with the link from @gekkehenker, at low TC: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550]. I'm not really sure I understand/trust it completely though; I did try to double-check everything. |
Yes, the first time I saw the results of the new binaries I couldn't believe them either. In an era where a 5 Elo patch is considered too good to be true, a 30 Elo "patch" must be impossible to believe.
Your result is "consistent" with basically every test done so far (including mine) that used nodchip's binaries (or equivalent) from July 11th or later. Again, testing with the newer binaries is crucial (probably stick with the July 13th binary until we're absolutely certain of the strength improvement), as older binaries were for some reason 50-100+ Elo weaker - SF is so far ahead of the rest that it was still a relatively strong engine, around the level of Komodo 14. It appears that the Elo difference at 10+0.1 (and likely even shorter TC) is bigger than at 60+0.6: around 30-50 Elo at the shorter TCs, and around 15-35 at the longer TCs. It'd be interesting to see if fishtest can verify these numbers - ideally testing at its usual TCs for patches - 10+0.1 and 60+0.6 with 1 thread, and 5+0.05 and 20+0.2 with 8 threads, all to 40,000 games each or similar. |
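For readers wanting to sanity-check how scores like [0.550] map onto the Elo figures quoted in this thread, the usual logistic conversion is a one-liner. This is just the point estimate; the heavy draw share in these matches means the confidence interval is wider than a raw score suggests.

```python
import math

def elo_diff(wins, losses, draws):
    """Approximate Elo difference implied by a match score (logistic model)."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return -400 * math.log10(1 / score - 1)

# Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2:
# 2742 - 1735 - 5595 [0.550]
print(elo_diff(2742, 1735, 5595))  # roughly +35 Elo
```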
|
Yeah fishtest tests would be quite something if that is possible. My own test for 20+0.2 I stopped when it was giving a similar result: 20+0.2: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 506 - 292 - 1224 [0.553] and then I started the more interesting 60+0.6 and that, while with little amount of games so far, did as well: 60+0.6 hash64: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 204 - 105 - 663 [0.551] |
|
Just to follow up on my testing from above. The 1-core test finished as follows: SF NNUE vs SF: 161 - 103 - 736 [0.529] 2-core test with exactly the same conditions as above, currently showing even better results, although sample sizes are tiny to draw any conclusions about scaling: SF NNUE vs SF: 81 - 30 - 327 [0.558] |
|
So, with the net from @gekkehenker (c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1 nn.bin) and current master (https://github.com/nodchip/Stockfish.git 7a13d4e) I get the following results: That's a bit better than the results posted previously. The cutechess cmdline is quite standard: |
|
Tests on CCC seem to indicate that NNUE can't handle more than 64 threads though? Is that true, or is CCC's NNUE set up incorrectly? Anyway, I highly doubt blitz tests represent the true strength difference at VLTC (I'm talking about TCEC conditions). I expect at best +20 Elo in those conditions (which, by the way, was my prediction of how much better Leela was back when a horde of Leela fans were claiming +50 at least). |
|
well, it is unlikely that NNUE would fundamentally show worse threading behavior. After all, this is just changing eval, which is really threading-independent. However, there could be threading-related bugs, or new threading-related bottlenecks that haven't been found yet; that can happen in relatively new code. Another thing to consider is that there might be a difference in performance wrt. hyperthreading, as NNUE has different characteristics (e.g. it is AVX2-intensive). A first test at a higher thread count here seems fine: |
|
Some new unexpected results given the first ones. |
|
I think it was trained on depth 8 or depth 12 (@gekkehenker ?). However, this shouldn't be too surprising; we know the Elo gain at STC depths is something like 30-60 Elo, which is less than what 1 ply of depth is worth (at around STC depths). |
Net was trained on both depth 8 and depth 12 games. |
|
@vondele thanks for your hard work in getting NNUE merged - just wondered what SV net is being run on fishtest now? |
|
@vondele thanks - just wondered what the corresponding net number etc is from here: Also which binary is used? |
|
don't know, you should be able to find it from a matching |
uploaded by Sergio Vieri NNUE signature: 4254913 Bench: 4746616
|
has anyone tried to use NNUE in FRC? doesn't seem to work for some. |
|
Hmm worked OK for me here: https://lichess.org/yV7J1imd |
|
I haven't tried but in principle it should work. NNUE only touches eval. Also the classical eval had almost no special handling of FRC (one term if I recall correctly). |
|
In my experience NNUE will play some FRC positions and crash in the rest. |
|
hmm, then it will be the added code in Position that might be wrong in that case. |
|
I am behind the times. . . is this really ~90 ELO better than master on the same hardware? |
|
Correct - this will be a 100+ Elo gain merge or so - give or take a few Elo. The mother of all merges. |
90 elo conservatively. On a modern CPU with normal LTC conditions and a PGO build it's a bit stronger than that ;) |
|
Note there are certain incompatibilities on old hardware that would make it significantly less efficient. Also, there are hints of some significant Elo compression at very long time controls with increment. Also note that contempt has yet to be implemented, which has the potential to present itself as an Elo gainer. ...ALSO note that it's likely just much stronger from the start position than from some many-ply-long books, but that claim has yet to be sufficiently backed up. |
|
I would not get too excited about contempt. Contempt was designed for use against weaker engines; against an equal or stronger engine, it's just about worthless. So the only thing contempt does is squeeze a few extra Elo out of much lower-rated opponents. I would be hard pressed to say contempt makes it better. It falls into the realm of being a vanity of vanities. |
|
@MichaelB7 Not having contempt cost SF the qualification into the TCEC SuFi one season. |
This patch ports the efficiently updatable neural network (NNUE) evaluation to Stockfish. Both the NNUE and the classical evaluations are available, and can be used to assign a value to a position that is later used in alpha-beta (PVS) search to find the best move. The classical evaluation computes this value as a function of various chess concepts, handcrafted by experts, tested and tuned using fishtest. The NNUE evaluation computes this value with a neural network based on basic inputs. The network is optimized and trained on the evaluations of millions of positions at moderate search depth.

The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward. It can be evaluated efficiently on CPUs, and exploits the fact that only parts of the neural network need to be updated after a typical chess move. [The nodchip repository](https://github.com/nodchip/Stockfish) provides additional tools to train and develop the NNUE networks.

This patch is the result of contributions of various authors, from various communities, including: nodchip, ynasu87, yaneurao (initial port and NNUE authors), domschl, FireFather, rqs, xXH4CKST3RXx, tttak, zz4032, joergoster, mstembera, nguyenpham, erbsenzaehler, dorzechowski, and vondele. This new evaluation needed various changes to fishtest and the corresponding infrastructure, for which tomtor, ppigazzini, noobpwnftw, daylen, and vondele are gratefully acknowledged.

The first networks have been provided by gekkehenker and sergiovieri, with the latter net (nn-97f742aaefcd.nnue) being the current default. The evaluation function can be selected at run time with the `Use NNUE` (true/false) UCI option, provided the `EvalFile` option points to the network file (depending on the GUI, with full path).

The performance of the NNUE evaluation relative to the classical evaluation depends somewhat on the hardware, and is expected to improve quickly, but is currently > 80 Elo on fishtest:

60000 @ 10+0.1 th 1 https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
ELO: 92.77 +-2.1 (95%) LOS: 100.0%
Total: 60000 W: 24193 L: 8543 D: 27264 Ptnml(0-2): 609, 3850, 9708, 10948, 4885

40000 @ 20+0.2 th 8 https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58
ELO: 89.47 +-2.0 (95%) LOS: 100.0%
Total: 40000 W: 12756 L: 2677 D: 24567 Ptnml(0-2): 74, 1583, 8550, 7776, 2017

At the same time, the impact on the classical evaluation remains minimal, causing no significant regression:

sprt @ 10+0.1 th 1 https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b
LLR: 2.94 (-2.94,2.94) {-6.00,-4.00}
Total: 34936 W: 6502 L: 6825 D: 21609 Ptnml(0-2): 571, 4082, 8434, 3861, 520

sprt @ 60+0.6 th 1 https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d
LLR: 2.93 (-2.94,2.94) {-6.00,-4.00}
Total: 10088 W: 1232 L: 1265 D: 7591 Ptnml(0-2): 49, 914, 3170, 843, 68

The needed networks can be found at https://tests.stockfishchess.org/nns It is recommended to use the default one as indicated by the `EvalFile` UCI option. Guidelines for testing new nets can be found at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests

Integration has been discussed in various issues: official-stockfish#2823 official-stockfish#2728 The integration branch will be closed after the merge: official-stockfish#2825 https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip

This will be an exciting time for computer chess, looking forward to seeing the evolution of this approach.

Bench: 4746616
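The "efficiently updatable" property mentioned in the commit message comes from the first layer being linear over sparse binary piece-square features: after a typical move, only a handful of features flip, so the engine adds/subtracts a few weight rows rather than recomputing the whole layer. A toy sketch of that idea, with made-up dimensions far smaller than the real HalfKP input:

```python
import random

# Toy sizes; the real HalfKP first layer maps ~41024 features to 256 units.
NUM_FEATURES, HIDDEN = 1000, 16
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_FEATURES)]

def full_accumulator(active):
    # Full first-layer pass: sum the weight rows of all active features.
    acc = [0.0] * HIDDEN
    for f in active:
        for i in range(HIDDEN):
            acc[i] += W[f][i]
    return acc

def update_accumulator(acc, removed, added):
    # Incremental update after a move: subtract/add a few rows
    # instead of recomputing the whole sum.
    acc = list(acc)
    for f in removed:
        for i in range(HIDDEN):
            acc[i] -= W[f][i]
    for f in added:
        for i in range(HIDDEN):
            acc[i] += W[f][i]
    return acc

before = {3, 57, 200, 640}
after = (before - {57}) | {58}            # one piece "moves"
acc = update_accumulator(full_accumulator(before), {57}, {58})
assert all(abs(a - b) < 1e-9 for a, b in zip(acc, full_accumulator(after)))
```

The incremental path touches O(changed features × hidden units) weights per move instead of O(all active features × hidden units), which is what makes NNUE fast enough for CPU alpha-beta search.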
|
NNUE evaluation has been merged, I'll close this issue. Thanks for the discussion. |
There has been much discussion of SF NNUE, which apparently is already on par with SF10 (so about 70-80 Elo behind current SF dev). People have been saying it can become 100 Elo stronger than SF, which would basically come from the eval. Since the net is apparently not very big, maybe someone can study the activations of each layer and see if we can extract some eval info from it? In any case, it's probably worth looking into, since it shows so much promise.