
SF NNUE #2728

Closed
adentong opened this issue Jun 10, 2020 · 183 comments

Comments

@adentong

There has been much discussion of SF NNUE, which is apparently already on par with SF10 (so about 70-80 Elo behind current SF dev). People have been saying it could become 100 Elo stronger than SF, which would basically come from the eval. Since the net is apparently not very big, maybe someone can study the activations of each layer and see if we can extract some eval info from it? In any case, it's probably worth looking into, since it shows so much promise.

@ZagButNoZig

ZagButNoZig commented Jun 10, 2020

I don't know if it's the direction the devs want to go in, but I think integrating ML into SF should at least be considered, given the impressive results.

@vondele
Member

vondele commented Jun 10, 2020

We should be open-minded and see how things evolve... it is an interesting development. Let's see how the code base evolves, the performance goes, etc. Once we have some data and understanding, we should see what the opportunities are.

@TesseractA

TesseractA commented Jun 22, 2020

Given that Stockfish tuning attempts to match Leela's evaluations have failed in the past, I'm not entirely sure you can extract much useful information from another, similar black box, especially since neural networks have convolutional structures that make them useful and less compressible.

EDIT: I found out (anecdotally) that this neural net doesn't use convolutions. If you want to investigate, you should probably ask on the Stockfish discord or in the fork mentioned by vondele below.

@Caleb-Kang

I don't know much about SF NNUE. What is it? Does NNUE stand for something?

@adentong
Author

So it's been claimed on discord that NNUE is now 34elo stronger than SFDev.

@gekkehenker

I don't think anybody claimed that besides the occasional SSS result.

NNUE is definitely much worse at 10+0.1 STC, but it quickly gains Elo on SF_dev as the TC increases.

@vondele
Member

vondele commented Jun 27, 2020

Just for reference, this issue refers to the fork being developed here: https://github.com/nodchip/Stockfish with an eval function based on a neural net architecture.

@ssj100

ssj100 commented Jul 6, 2020

Data is sounding more and more convincing on this (look at jjosh and lkaufman posts):
http://talkchess.com/forum3/viewtopic.php?f=2&t=74366&start=10#p850204

"Anecdotally", I have several test positions on which SF consistently takes 50-100 billion nodes or more to find the correct move (or sometimes never finds it), while SF NNUE finds it within a few million nodes. The difference is night and day.

Is there any chance fishtest resources could be used for this? Or could we somehow run one of these "patches" (SF NNUE) against "master" as an SPRT test at 180+1.8? I think it might pass very fast!
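For readers unfamiliar with how an SPRT stopping rule of the kind ssj100 mentions would decide a match, here is a rough sketch (my own illustration using a simplified trinomial model and the normal approximation; fishtest's actual implementation is a more careful GSPRT):

```python
import math

def elo_to_score(elo):
    """Logistic model: expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Normal-approximation log-likelihood ratio for H1 (elo1) vs H0 (elo0).

    Simplified illustration only, not fishtest's exact formula.
    """
    n = wins + losses + draws
    if n == 0 or wins == 0 or losses == 0:
        return 0.0
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    var = (w + 0.25 * d) - score ** 2          # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    # LLR for a Gaussian with known variance:
    return n * (s1 - s0) * (2 * score - s0 - s1) / (2 * var)

def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Accept H1, accept H0, or keep playing."""
    upper = math.log((1 - beta) / alpha)       # ~2.94 for alpha = beta = 0.05
    lower = math.log(beta / (1 - alpha))       # ~-2.94
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

With a lopsided result like the ones quoted later in this thread, the LLR blows past the upper bound after comparatively few games, which is why such a test would indeed "pass very fast".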

@adentong
Author

adentong commented Jul 6, 2020

@ssj100 But look at the number of games. It's not even thousands of games, just dozens. That's hardly convincing at all. I would, however, love to see an LTC match of NNUE vs SF, though I don't know if it's supported by fishtest (probably not).
@vondele

@Vizvezdenec
Contributor

Vizvezdenec commented Jul 8, 2020

Well, I think we should slowly start to think about how we can utilize fishtest to train networks and the like.
This seems really promising if it plays at a level "not really worse than master" at LTC, on CPUs that support AVX, after just a few weeks of training.
Sure, most of our hardware is quite old, but we have some modern CPUs, and it can be trained even on older ones, just more slowly...
So, what I think should be done :) - we should start to train some nets ourselves, and maybe have two separate code bases, or (even better) one code base with both the NN and the handcrafted eval and a UCI parameter to switch between them - people with older CPUs can stay on the handcrafted eval, and people with modern CPUs can use NNUE.
I honestly think NNUE is the future: the newest CPUs make it pretty fast, and it can help walk over corner cases that corrupt SF's play a lot. Honestly, the fact that NNUE plays at a reasonable strength in its really early days is one of the main reasons I basically stopped writing eval patches :)
I know all of this will require quite a lot of work from both developers and maybe even the fishtest maintainers, but some day it still needs to be done, imho.
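The "one code base, two evals, UCI switch" idea can be sketched in a few lines. This is purely illustrative Python, not Stockfish code; names like `classical_eval` and `nnue_eval` are placeholders (the merged patch later did expose exactly such a `Use NNUE` UCI option):

```python
# Illustrative sketch of the proposal: one engine, two evaluation
# functions, selected at run time by a UCI option. All names are hypothetical.

def classical_eval(position):
    # stand-in for the handcrafted evaluation
    return position.get("material", 0)

def nnue_eval(position):
    # stand-in for the network evaluation
    return position.get("net_score", 0)

class Engine:
    def __init__(self):
        # default off, so users on older CPUs keep the handcrafted eval
        self.options = {"Use NNUE": False}

    def setoption(self, name, value):
        self.options[name] = value

    def evaluate(self, position):
        if self.options.get("Use NNUE"):
            return nnue_eval(position)
        return classical_eval(position)

engine = Engine()
engine.setoption("Use NNUE", True)
```

The point of the design is that the switch lives entirely at the eval boundary, so the search code stays shared between both configurations.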

@TesseractA

TesseractA commented Jul 8, 2020

> I honestly think NNUE is the future: the newest CPUs make it pretty fast, and it can help walk over corner cases that corrupt SF's play a lot. Honestly, the fact that NNUE plays at a reasonable strength in its really early days is one of the main reasons I basically stopped writing eval patches :)

"Corner cases that corrupt SF play a lot" - I'll bet there are equally many (if not more) corner cases to be met with the NNUE architecture, given that even Leela has lots of trouble with its own kind of corner cases, especially those that are both distant from mate and require pruning exponentially larger search trees. Current SF has a reasonable combination of search code and eval code that can direct it to finding improvements in obscure endgames and make those problems far less difficult by deliberation. This may make it easier to identify and fix specific problems. In my experience with neural networks, specific problems are far harder to fix when trying to generalize evaluation.

Also, NNUE may not provide a higher ceiling than handcrafted evals, because of the inefficiency of information packing in neural networks as opposed to formal handcrafted evaluation. NNUE can only be so large a network that it'll probably hit its limit and stop improving after a certain point, much like how Leela's network architecture has hardly improved since it first got squeeze-and-excitation (SE) nets. That said, it's easier to train this NNUE than Lc0 because it has so many fewer parameters, so designing improvements (in the short term at least) may come easier to it.

So I'd still be a bit skeptical (even though I predict NNUE will be better in the near future) of the long-term implications of NNUE. I fear that SF could get stuck in a local minimum with NNUE once the NN stops improving, and that people would lose interest in the SF project instead of returning to the handcrafted evaluations with a higher Elo ceiling.

If AlphaZero had come 2 years earlier and blown everyone out of the water then, it probably would have made many people abandon SF instead of realizing there was still great potential in handcrafted evaluations.

The SF project is probably one of the largest (if not the largest) open-source projects of handcrafted feature recognition, and in my opinion it would be a shame if it were just to become an exhibit in a GitHub museum.

All this said, it's just my experience from watching from the Lc0 stand of things.

@Vizvezdenec
Contributor

The difference is that 80% of the Elo SF gains are improvements to search. So even if eval gets "stuck" - well, it's not THAT big of a deal, tbh.
Also, no one prohibits you from continuing to improve the handcrafted eval if the NN gets stuck.

@ssj100

ssj100 commented Jul 9, 2020

I don't think handcrafted evaluation should be abandoned, as the possibility remains that it has a higher ceiling. That being said, as Viz mentioned, handcrafted search appears to be "unthreatened" anyway, so the "SF project" won't become an "exhibit in a github museum" regardless. People shouldn't forget that a big reason SF NNUE is so strong already is its strong search. For example, I'd predict that if Komodo NNUE were released (Komodo being the 2nd strongest CPU-only engine), it would still get crushed by native SF.

However, my point was that it may be prudent to do some "testing on fishtest" for the NNUE component, if just to become adept at using/testing/training it. The handcrafted eval component should still continue as much as possible, but perhaps when it comes to submitting SF for tournaments etc., the strongest version of SF at the time should be submitted (whether that's native SF or SF NNUE).

@TesseractA

TesseractA commented Jul 10, 2020

From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames and native evaluation would somehow have to take over anyway (to gain elo, that is.) Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there was a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior.

(e.g. NNUE seemed to evaluate a drawn KRPPPvKRPP endgame at +3 while native SF was able to evaluate it at +1)

@noobpwnftw
Contributor

The problem is that you don't really have a way to decide which eval is correct and which is not, even with shallow search. With the native eval, people spot certain problems and write patches, and they still often break more than they fix by failing fishtest, so how NNUE is going to magically make this problem disappear is beyond me.

@gekkehenker

gekkehenker commented Jul 10, 2020

> From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames and native evaluation would somehow have to take over anyway (to gain elo, that is.) Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there was a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior.
>
> (e.g. NNUE seemed to evaluate a drawn KRPPPvKRPP endgame at +3 while native SF was able to evaluate it at +1)

Those misevaluations are mostly the result of the data it's been trained on.*
At the end of the day it's still a net that has only seen a lot of depth 8 games and a bunch of depth 12 games.

Things should eventually improve, once we can get fishtest, or Leela or Noob's data to work.

Anyway, I turned skeptical about its scaling after seeing a fixed-node test at 1m, 10m and 20m nodes. But maybe Jjoshua's net has fixed that.
We'll see over at TCEC; Jjosh's net should be stronger than mine, and TCEC is less likely to bork settings than CCC.

*But a lot of them will exist even if we use deeper data; SF evaluating a drawn endgame as +1 is just as wrong as Leela saying +0.8 or NNUE +3.4.

@vondele
Member

vondele commented Jul 10, 2020

What kind of training data should those games be? All fishtest LTC games are available with scores for each position - roughly depth 20-25, that is - that's literally billions of scored positions.

@gekkehenker

> what kind of training data should those games be? All fishtest LTC games are available with scores for each position, roughly depth 20-25 that is, that's literally billions of scored positions.

A few others have experimented with the data but saw some strange behaviour, either because the games weren't converted correctly or maybe because of an issue with the learning function itself.

@vondele
Member

vondele commented Jul 10, 2020

Concerning settings and nets, it would be useful if the nodchip GitHub repo indicated in the README what the current optimal settings are, and gave a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people who want to get something running quickly, that would be very helpful.

@TesseractA

TesseractA commented Jul 10, 2020

@gekkehenker it's much harder* to tune a neural network to give desired relative evaluations than it is for the handcrafted alternatives.**

*This might still have to be proven true, but Stockfish's evaluations are tuned to beat other versions of itself. That makes the patches that come out of fishtest alive very good at introducing adversarial play, which a small neural network trained on external data cannot provide at such high fidelity. What ends up happening against stronger or "drawish" opponents is that the neural network tends to prefer things it cannot itself evaluate properly, instead of being able to focus on generating play from its own internal strengths.

**"Handcrafted alternatives" rely on far more concrete values to evaluate a position, magnifying the effect of any small differences in evaluation that might find wins/draws. Also, the deeper the search, the more the false positives the neural network generates affect how the edges of the search behave, especially in drawn 50-move-rule-bound endgames.

@noobpwnftw being able to distinguish when our handcrafted evaluations are better to use could rely on a table of precalculated values loaded from a file, which would let us determine which evaluation method is better for a given number and type of pieces on the board. We could create such an evaluation-accuracy piece table using the mean squared error of an evaluation against the result of the game, for which we might have to figure out how the new network's evaluations convert to "actual" win percentage. One potential downside is that this might get a bit messy if different networks have different strengths.
Then again, maybe there's a lot of slowdown in figuring out which pieces are on the board and loading the table. Maybe simply using the number of pieces on the board, or some value measuring how much the tree is branching, is enough.
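The evaluation-accuracy piece table described above could be prototyped roughly as follows. This is entirely hypothetical code, and the centipawn-to-win-probability conversion used here is a placeholder logistic guess, not a calibrated model:

```python
from collections import defaultdict

def cp_to_winprob(cp):
    # Placeholder logistic mapping from centipawns to win probability;
    # a real conversion would have to be fitted per evaluator.
    return 1.0 / (1.0 + 10.0 ** (-cp / 400.0))

def accuracy_table(games):
    """Mean squared error of each evaluator, bucketed by piece count.

    `games` is an iterable of (piece_count, classical_cp, nnue_cp, result)
    tuples, where result is 1.0 / 0.5 / 0.0 from White's point of view.
    Returns, per piece count, which evaluator had the lower MSE.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # bucket -> [se_classical, se_nnue, n]
    for pieces, classical_cp, nnue_cp, result in games:
        bucket = sums[pieces]
        bucket[0] += (cp_to_winprob(classical_cp) - result) ** 2
        bucket[1] += (cp_to_winprob(nnue_cp) - result) ** 2
        bucket[2] += 1
    return {
        pieces: "classical" if se_c / n <= se_n / n else "nnue"
        for pieces, (se_c, se_n, n) in sums.items()
    }
```

For instance, drawn endgames that the classical eval scores around +1 but NNUE scores +3 (as in the CCCC example earlier in the thread) would push the low-piece-count buckets toward "classical".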

@gekkehenker

> concerning settings and nets, it would be useful if the nodchip github repo would indicate in the readme what the current optimal settings are, and give a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people that want to get something running quickly, that would be very helpful.

This link contains a few Windows compiles (popcnt, avx2, bmi2) and my current strongest net:

https://workupload.com/file/ggEUrvNVgmH

It seems like the latest binaries (same goes for the binaries on nodchip's repo) fixed a few bugs.
There's no longer any need to adjust slowmover; 100 works perfectly now.
Extreme Elo gain: on older binaries my nets were always 100+ Elo weaker than SF. They now test stronger than SFDev...

It's roughly as simple as SF now. The UCI option "EvalFile" has to point towards the NN file.
In the files above it's "eval\nn.bin" by default, but this can be changed to anything now, as long as it points towards the correct network file.

There's sadly not a lot of centralized information, because it was originally nothing more than a quick port to test whether NNUE works in chess too. Whatever I know is built upon quick instructions from Twitter, looking through the learner.cpp code, and Google-translated YaneuraOu docs:

https://twitter.com/nodchip/status/993432774387249153
https://github.com/nodchip/Stockfish/blob/master/src/learn/learner.cpp
https://github.com/yaneurao/YaneuraOu/blob/master/docs/USI%E6%8B%A1%E5%BC%B5%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89.txt

@ssj100

ssj100 commented Jul 14, 2020

Just thought it'd be important to post some real results in my testing so far.

  1. I've been testing with these conditions for many years, including with SF8, SF9, SF10, SF11, SF12dev, H5-6, K10-14.

  2. These are the general conditions:
    -GUI = cutechess
    -1-core
    -No TB
    -Time Control = 60 seconds +0.6
    -Book = Balsa_v500.pgn (500 lines mainly up to 5 moves)

  3. This is the information for each engine:
    -SF = from abrok compile "July 11" 2020, all default settings
    -SF NNUE binary component = from nodchip compile "July 13" 2020, all default settings (it's important to use this binary, as older binaries were 50-100+ Elo weaker for some reason)
    [This means both engines are using a very recent version of SF's "search code". As already discussed/mentioned in many places, the functional difference between the engines is that the abrok SF obviously uses SF's "eval code", while SF NNUE completely disables this "eval code" and uses a trained net ("nn.bin") instead]
    -SF NNUE net component = gekkehenker net from 27 June 2020 (which was created entirely from SF self-play games with a binary from June 2020)
    ***Start position SF speed: ~1800 kN/s
    ***Start position SF NNUE speed: ~1100 kN/s (~60% of SF speed)

  4. Here is the result so far:
    SF NNUE vs SF: 78 - 53 - 369 [0.525]
    Elo difference: 17.39 +/- 15.54
    500 of 1000 games finished.

I'm going to let it run to 1000-games mainly just for future consistency.
Some musings:

  1. You can already see SF NNUE is very likely about on par with latest SF (possibly better)
  2. The NNUE concept has likely only (publicly) been experimented with in the last few weeks in computer chess
  3. @gekkehenker literally only spent a few days creating the "eval net" above and using very limited hardware resources (literally one computer with one CPU - 6 cores/12 threads)
  4. If 1. is true, this effectively means gekkehenker has, by himself, literally managed to match (or possibly surpass) the Elo strength of SF's "eval code" within a few days, using a tiny fraction of fishtest's "CPU hours". That is, he has done in a fraction of the time and resources what SF/fishtest (with hundreds of developers, thousands of "CPU-years" and about 12 years of handcrafted coding/testing) has managed
  5. It remains to be seen if scaling for SF NNUE is good, but all the data out there so far strongly suggests that it is
  6. I can only imagine what fishtest and the SF community can achieve together with its ample resources and incredible developer talent
  7. One way forward would be to split fishtest resources, to something like as follows (assuming a default of about 1500-cores is available):
    -1000-cores to continue handcraft search improvement patches
    -100-cores to continue handcraft eval improvement patches
    -400-cores to train "NNUE"
    (Clearly this proportion can be changed accordingly as per the optimal needs etc)

Anyway, thanks to @gekkehenker and nodchip for continuing to share their knowledge publicly!
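For anyone wanting to check numbers like "17.39 +/- 15.54" themselves, the cutechess figures can be reproduced (to within rounding; I'm assuming cutechess uses essentially this standard normal-approximation method) as:

```python
import math

def elo_from_score(score):
    """Logistic Elo corresponding to an average score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins, losses, draws, z=1.959964):
    """Elo difference and ~95% error bar from a W/L/D record."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    stdev = math.sqrt((w + 0.25 * d) - score ** 2)   # per-game score stdev
    dev = z * stdev / math.sqrt(n)                   # CI half-width in score space
    elo = elo_from_score(score)
    err = (elo_from_score(score + dev) - elo_from_score(score - dev)) / 2
    return elo, err

# ssj100's 500-game result above: 78 wins, 53 losses, 369 draws
elo, err = elo_with_error(78, 53, 369)
```

Plugging in the 500-game record gives ~17.4 Elo with an error bar of ~15.5, matching the cutechess output quoted above.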

@crocogoat

I didn't have much luck with anything I tried so far, but with the link from @gekkehenker low-TC testing is working great for me. Using settings close to fishtest, 10+0.1, the same book, and default settings:

Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550]
Elo difference: 34.85 +/- 4.51

I'm not really sure I understand/trust it completely though. I did try to double check everything
but can't see anything obviously wrong. I'm going to test 20+0.2 now.

@gekkehenker

> I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:
>
> Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550]
> Elo difference: 34.85 +/- 4.51
>
> I'm not really sure I understand/trust it completely though. I did try to double check everything but can't see anything obviously wrong. I'm going to test 20+0.2 now.

Yes, the first time I saw the results of the new binaries I couldn't believe them either.
"I must have done something wrong" is what I thought.

In an era where a 5 Elo patch is considered too good to be true, a 30 Elo "patch" must be impossible to believe.

@ssj100

ssj100 commented Jul 15, 2020

> I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:
>
> Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550]
> Elo difference: 34.85 +/- 4.51
>
> I'm not really sure I understand/trust it completely though. I did try to double check everything but can't see anything obviously wrong. I'm going to test 20+0.2 now.

Your result is "consistent" with basically every test done so far (including mine) that used nodchip's binaries (or equivalent) from July 11th or later. Again, testing with the newer binaries is crucial (probably stick with July 13th binary until we're absolutely certain of the strength improvement), as older binaries were for some reason 50-100+ elo weaker - SF is so far ahead of the rest that it was still a relatively strong engine, around the level of Komodo 14.

It appears that the elo difference at 10+0.1 (and likely even shorter TC) is likely bigger than at 60+0.6. The elo difference seems to be around 30-50 at the shorter TCs, and around 15-35 at the longer TCs. It'd be interesting to see if fishtest can verify these numbers - ideally test at its usual TC for patches - 10+0.1 and 60+0.6 with 1-thread, and 5+0.05 and 20+0.2 with 8-threads, all to 40,000 games each or similar.

@crocogoat

Yeah, fishtest tests would be quite something, if that's possible. My own test at 20+0.2 I stopped when it was giving a similar result:

20+0.2: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 506 - 292 - 1224 [0.553]
Elo difference: 36.91 +/- 9.47

and then I started the more interesting 60+0.6 test, which, while with a small number of games so far, did the same:

60+0.6 hash64: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 204 - 105 - 663 [0.551]
Elo difference: 35.51 +/- 12.23

@ssj100

ssj100 commented Jul 17, 2020

Just to follow up on my testing from above. The 1-core test finished as follows:

SF NNUE vs SF: 161 - 103 - 736 [0.529]
Elo difference: 20.17 +/- 11.02
1000 of 1000 games finished.

2-core test with exactly the same conditions as above, currently showing even better results, although sample sizes are tiny to draw any conclusions about scaling:

SF NNUE vs SF: 81 - 30 - 327 [0.558]
Elo difference: 40.64 +/- 16.14
438 of 1000 games finished.

@vondele
Member

vondele commented Jul 18, 2020

So, with the net from @gekkehenker (c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1 nn.bin) and current master (https://github.com/nodchip/Stockfish.git 7a13d4e) I get the following results:

STC (10.0+0.1 @ 1 thread)
Score of master vs nnue: 940 - 2206 - 3973  [0.411] 7119
Elo difference: -62.4 +/- 5.3, LOS: 0.0 %, DrawRatio: 55.8 %

LTC (20.0+0.2 @ 8 thread)
Score of master vs nnue: 189 - 463 - 1332  [0.431] 1984
Elo difference: -48.3 +/- 8.7, LOS: 0.0 %, DrawRatio: 67.1 %

That's a bit better than the results posted previously. The cutechess cmdline is quite standard:

./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 -concurrency 15 -openings file=noob_3moves.epd format=epd order=random plies=16  -engine name=master cmd=stockfish.master -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn.bin -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 -pgnout nnue.pgn

@adentong
Author

Tests on CCC seem to indicate that NNUE can't handle more than 64 threads, though? Is that true, or is CCC's NNUE set up incorrectly? Anyway, I highly doubt blitz tests represent the true strength difference at VLTC (I'm talking about TCEC conditions). I expect at best +20 Elo in those conditions (which, by the way, was my prediction for how much better Leela was, back when a horde of Leela fans were claiming +50 at least).

@vondele
Member

vondele commented Jul 18, 2020

Well, it is unlikely that NNUE would fundamentally show worse threading behavior. After all, this just changes eval, which is really threading-independent. However, there could be threading-related bugs, or new threading-related bottlenecks that haven't been found yet. That could happen in relatively new code. Another thing to consider is that there might be a difference in performance wrt. hyperthreading, as NNUE has different characteristics (e.g. AVX2-intensive). A first test at a higher thread count here seems fine:

VLTC (20.0+0.2 @ 16 threads)
Score of master vs nnue: 292 - 698 - 2202  [0.436] 3192
Elo difference: -44.4 +/- 6.6, LOS: 0.0 %, DrawRatio: 69.0 %

@vondele vondele added the NNUE label Aug 2, 2020
@mstembera
Contributor

Some new unexpected results given the first ones.
SF Depth 6 vs NNUE Depth 6
2972 - 6161 - 867 [0.341] 10000 -114.8 +/- 6.8, LOS: 0.0 %, DrawRatio: 8.7 %
SF Depth 7 vs NNUE Depth 6
6092 - 2837 - 1071 [0.663] 10000 117.4 +/- 6.7, LOS: 100.0 %, DrawRatio: 10.7 %
Looks like the difference between the evals here is worth less than 1 ply of search.
On what depth of training data was the network trained?
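A quick back-of-the-envelope check of "less than 1 ply worth of search" from the numbers above (plain arithmetic, no engine code):

```python
# mstembera's fixed-depth results, read as each side's Elo advantage:
nnue_edge_same_depth = 114.8    # NNUE d6 beats SF d6 by ~115 Elo
sf_edge_one_ply_up = 117.4      # SF d7 beats NNUE d6 by ~117 Elo

# Going from depth 6 to depth 7 swings SF by roughly the sum of the two:
one_ply_value = nnue_edge_same_depth + sf_edge_one_ply_up   # ~232 Elo

# So the eval gap is roughly half a ply of search at these depths:
eval_gap_in_plies = nnue_edge_same_depth / one_ply_value
```

By this rough estimate, one ply at these depths is worth ~232 Elo and the eval difference corresponds to about half a ply.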

@vondele
Member

vondele commented Aug 2, 2020

I think it was trained on depth 8 or depth 12 (@gekkehenker ?). However, I don't think this should be too surprising; we know the Elo gain at STC depths is something like 30-60 Elo, which is less than what 1 ply of depth is worth (at around STC depths).

@gekkehenker

> I think it was trained on depth 8 or depth 12 (@gekkehenker ?). However, I think this must not be too surprising, we know Elo gain at STC depths is something like 30-60Elo, which is less than what 1 ply of depth is worth (at around STC depths).

The net was trained on both depth 8 and depth 12 games.
It was first fed the depth 8 games only, then the resulting net was trained on the depth 12 games.

@ssj100

ssj100 commented Aug 4, 2020

@vondele thanks for your hard work in getting NNUE merged - just wondering which SV (Sergio Vieri) net is being run on fishtest now?

@vondele
Member

vondele commented Aug 4, 2020

@ssj100

ssj100 commented Aug 4, 2020

@vondele thanks - just wondered what the corresponding net number etc is from here:
https://www.comp.nus.edu.sg/~sergio-v/nnue/

Also which binary is used?

@vondele
Member

vondele commented Aug 4, 2020

don't know, you should be able to find it from a matching `sha256sum netname | cut -c1-12`
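The naming convention alluded to here is that the nets are named after the first 12 hex characters of their SHA-256 digest (rooklift's match of nn-97f742aaefcd.nnue to 20200801-1515.bin below is found exactly this way). A Python equivalent of the `sha256sum ... | cut -c1-12` one-liner:

```python
import hashlib

def net_id(path):
    """First 12 hex characters of the file's SHA-256 digest,
    as used in the nn-<id>.nnue naming scheme."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large net files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:12]

# A file whose id is "97f742aaefcd" corresponds to nn-97f742aaefcd.nnue.
```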

vondele referenced this issue in vondele/Stockfish Aug 4, 2020
uploaded by Sergio Vieri

NNUE signature: 4254913
Bench: 4746616
@rooklift

rooklift commented Aug 4, 2020

nn-97f742aaefcd.nnue is 20200801-1515.bin

@TesseractA

Has anyone tried to use NNUE in FRC? It doesn't seem to work for some.

@rooklift

rooklift commented Aug 4, 2020

Hmm, it worked OK for me here: https://lichess.org/yV7J1imd

@vondele
Member

vondele commented Aug 5, 2020

I haven't tried, but in principle it should work. NNUE only touches eval, and the classical eval had almost no special handling of FRC (one term, if I recall correctly).

@gekkehenker

In my experience NNUE will play some FRC positions and crash in the rest.

@vondele
Member

vondele commented Aug 5, 2020

Hmm, then it would be the added code in position that might be wrong in that case.

@protonspring

I am behind the times... is this really ~90 Elo better than master on the same hardware?

@MichaelB7
Contributor

MichaelB7 commented Aug 5, 2020

Correct - this will be a 100+ Elo gain merge, give or take a few Elo.

The mother of all merges.

@gekkehenker

> I am behind the times. . . is this really ~90 ELO better than master on the same hardware?

90 Elo, conservatively.

On a modern CPU with normal LTC conditions and a PGO build it's a bit stronger than that ;)

@TesseractA

TesseractA commented Aug 5, 2020

Note there are certain incompatibilities with old hardware that make it significantly less efficient there.

Also, there are hints of some significant Elo compression at very long time controls with increment.

Also note that contempt has yet to be implemented, which has the potential to be an Elo gainer.

...ALSO note that it's likely much stronger from the start position than from some many-ply-long books, but that claim has yet to be sufficiently backed up.

@MichaelB7
Contributor

I would not get too excited about contempt. Contempt was designed for use against weaker engines; against an equal or stronger engine, it's just about worthless. So the only thing contempt does is squeeze a few extra Elo out of much lower-rated opponents. I would be hard-pressed to say contempt makes it better - it squeezes a few Elo out of weaker opponents. It falls into the realm of being a vanity of vanities.

@mstembera
Contributor

@MichaelB7 Not having contempt cost SF the qualification into the TCEC SuFi one season.

vondele pushed a commit to vondele/Stockfish that referenced this issue Aug 6, 2020
This patch ports the efficiently updatable neural network (NNUE) evaluation to Stockfish.

Both the NNUE and the classical evaluations are available, and can be used to
assign a value to a position that is later used in alpha-beta (PVS) search to find the
best move. The classical evaluation computes this value as a function of various chess
concepts, handcrafted by experts, tested and tuned using fishtest. The NNUE evaluation
computes this value with a neural network based on basic inputs. The network is optimized
and trained on the evaluations of millions of positions at moderate search depth.

The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward.
It can be evaluated efficiently on CPUs, and exploits the fact that only parts
of the neural network need to be updated after a typical chess move.
[The nodchip repository](https://github.com/nodchip/Stockfish) provides additional
tools to train and develop the NNUE networks.

This patch is the result of contributions of various authors, from various communities,
including: nodchip, ynasu87, yaneurao (initial port and NNUE authors), domschl, FireFather,
rqs, xXH4CKST3RXx, tttak, zz4032, joergoster, mstembera, nguyenpham, erbsenzaehler,
dorzechowski, and vondele.

This new evaluation needed various changes to fishtest and the corresponding infrastructure,
for which tomtor, ppigazzini, noobpwnftw, daylen, and vondele are gratefully acknowledged.

The first networks have been provided by gekkehenker and sergiovieri, with the latter
net (nn-97f742aaefcd.nnue) being the current default.

The evaluation function can be selected at run time with the `Use NNUE` (true/false) UCI option,
provided the `EvalFile` option points to the network file (depending on the GUI, with full path).

The performance of the NNUE evaluation relative to the classical evaluation depends somewhat on
the hardware, and is expected to improve quickly, but is currently > 80 Elo on fishtest:

60000 @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
ELO: 92.77 +-2.1 (95%) LOS: 100.0%
Total: 60000 W: 24193 L: 8543 D: 27264
Ptnml(0-2): 609, 3850, 9708, 10948, 4885

40000 @ 20+0.2 th 8
https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58
ELO: 89.47 +-2.0 (95%) LOS: 100.0%
Total: 40000 W: 12756 L: 2677 D: 24567
Ptnml(0-2): 74, 1583, 8550, 7776, 2017

At the same time, the impact on the classical evaluation remains minimal, causing no significant
regression:

sprt @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b
LLR: 2.94 (-2.94,2.94) {-6.00,-4.00}
Total: 34936 W: 6502 L: 6825 D: 21609
Ptnml(0-2): 571, 4082, 8434, 3861, 520

sprt @ 60+0.6 th 1
https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d
LLR: 2.93 (-2.94,2.94) {-6.00,-4.00}
Total: 10088 W: 1232 L: 1265 D: 7591
Ptnml(0-2): 49, 914, 3170, 843, 68

The needed networks can be found at https://tests.stockfishchess.org/nns
It is recommended to use the default one as indicated by the `EvalFile` UCI option.

Guidelines for testing new nets can be found at
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests

Integration has been discussed in various issues:
official-stockfish#2823
official-stockfish#2728

The integration branch will be closed after the merge:
official-stockfish#2825
https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip

This will be an exciting time for computer chess, looking forward to seeing the evolution of
this approach.

Bench: 4746616
vondele pushed a commit that referenced this issue Aug 6, 2020
@vondele
Member

vondele commented Aug 6, 2020

NNUE evaluation has been merged, I'll close this issue. Thanks for the discussion.

@vondele vondele closed this as completed Aug 6, 2020
noobpwnftw pushed a commit to noobpwnftw/Stockfish that referenced this issue Aug 15, 2020