Fishtest Early Adjudication #1904
Comments
The draw rule takes effect from move 34 on, according to the linked page.
@Alayan-stk-2 That doesn't make much difference; move 34 is barely the midgame. The issue is that these shallow conditions exist at all: games are decided very early.
34 sounds quite early to adjudicate, especially with such loose bounds (one engine could be at +0.19 and the other at -0.19...), sure, but your first post made it sound as if the rule applied from the beginning of the game. I think an informed discussion needs the facts stated clearly, and not everyone will check your link. 🙂 You should try to find a way to test experimentally your hypothesis that the adjudication rules are a serious issue, with some case where you can show they add testing noise.
@Alayan-stk-2 That would be easy: I only need the PGN of any recent test to check the number of adjudicated games against the total.
With these adjudication bounds, likely almost all games are adjudicated. A mate is almost impossible, and only some 3-fold draws would really escape the draw adjudication. But the real question is: how often is the adjudication wrong? One possible experiment would be to play 100K STC games of e.g. master vs master (and/or maybe SF10 vs SF9 to test the draw rule better) without adjudication. Then, using the evals saved in the PGN, analyse how many games would have had a different result if they had been adjudicated. But this can't be set up on fishtest as far as I know.
@Alayan-stk-2 Adjudication being wrong is only half the problem. Example: high-depth analysis of the adjudicated games shows that the adjudication is right, but at the assigned time control the checkmate/draw isn't reached (without adjudication), and the adjudication is wrong.
If some draws are converted to wins, or some -400 scores still give a draw, then the adjudication led to a wrong result. Adjudication being wrong is not half of the adjudication problem, it is the whole (potential) problem. Edit: You should reread my suggested testing procedure. It doesn't involve any high-depth analysis.
@Alayan-stk-2 High-depth analysis is the only objective way to resolve the problem (though I admit it doesn't predict TC results), since time controls introduce random factors (move time variability, better time management, faster CPUs). So comparing games without adjudication against games with adjudication doesn't conclusively prove it either, though playing without adjudication is evidently less biased. The situation is as follows:
B. Adjudication severely distorts STC results: like Lazy Evaluation on steroids, it cuts off whole categories of positions as "Draws" or "Wins" without resolving anything.
C. Testing at the same TC without adjudication vs with adjudication will not uncover every flaw.
D. A very long TC requires too many resources and, like high-depth analysis, may bias towards evaluation at high depth, which doesn't correlate with "fast TC" behavior (just as patches that pass STC may not scale to LTC, because of the different average depths).
A. This isn't relevant, in that those high-depth analyses wouldn't tell us whether fishtest would be more efficient with different adjudication rules or not.
B. This is the hypothesis that needs testing to be proven or disproven. The test goal should focus on this. If you want to test something else, it needs another, different test. The test I suggested (playing full STC games without adjudication with a big enough sample, then reverse-analysing what the adjudication results would have been and comparing them to the actual results) has this goal in mind. The results are distorted by the adjudication if playing the full games would lead to different results in a significant number of games, no more, no less.
C. You can't do everything at once. You assert that adjudication severely distorts STC results. Then this should be tested. If you want to test another parameter, design a test aimed at checking that parameter only.
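The reverse-analysis step of that procedure could be sketched roughly as follows. This is only an illustration under assumptions: the per-game data layout (a list of `(white_cp, black_cp)` scores plus the result of the fully played game) and the function name `count_adjudication_errors` are hypothetical, and the adjudication rule is passed in as a callable rather than taken from fishtest or cutechess.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical data layout: one (white_cp, black_cp) pair per full move,
# as it might be recovered from the eval comments of an unadjudicated PGN.
Scores = List[Tuple[int, int]]


def count_adjudication_errors(
    games: Iterable[Tuple[Scores, str]],
    adjudicate: Callable[[Scores], Optional[str]],
) -> Tuple[int, int]:
    """Replay an adjudication rule over games that were played to the end.

    Each game is (scores, actual_result). Returns a pair
    (games the rule would have adjudicated, of those: verdict != actual result).
    """
    adjudicated = 0
    wrong = 0
    for scores, actual_result in games:
        verdict = adjudicate(scores)
        if verdict is None:
            continue  # the rule never triggers: the game is played out either way
        adjudicated += 1
        if verdict != actual_result:
            wrong += 1
    return adjudicated, wrong
```

The ratio `wrong / adjudicated` over a large sample is the quantity the proposed experiment is after: it says how often playing the game out would have given a different result.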
@Alayan-stk-2 This STC test itself would give results different from A, the ground truth. Several STC tests on the same positions will show adjudication being randomly correct or wrong, because STC itself isn't that good a test. To demonstrate how this STC variance arises, you can run a simple local test without a book (starting position only) to see how timings create wins and losses between identical Stockfish binaries. It should be essential viewing for everyone who doesn't understand what move time variability is and why it matters. The asymmetrical games between the same binaries, from the same position and at the same time control (using multiple threads adds another random factor), really show how minor speed changes influence an entire game.
Example of draws.
Stockfish at bullet TC plays worse chess than Stockfish at long TC and high-depth analysis. This is obvious. The point of patch-testing is that good patches will play better and will lose less/win more by being more often right about the position.
Example of checkmates:
@Alayan-stk-2 Since adjudication exists, and most games end in adjudication, the evaluation score sent IS important. As a thought experiment, imagine the adjudication bounds expanded to +/- 40 centipawns for draws and -200 for resigning. Would you have any argument for adjudication?
I don't want to interfere in your discussion, just to note that any change would increase the average game length, with a bad impact on resource usage. Personally, if I had to choose where to allocate more resources, I would choose the tightening of the SPRT parameters (like we already did) instead of this. To give some weight to this discussion, it should be proved that with changed parameters the set of green tests would be different. Or, alternatively (easier), dig up the old discussion that led to the current setup. It was Gary Linscott's, if I remember correctly.
@mcostalba I'm aware of the resource cost, but it feels so wrong to rely on the STC Stockfish eval being right, especially with the 20 centipawn draw interval, which is typical of neutral positions where no clear advantage exists before the endgame.
Some of these things have been tested recently (or adjudication was disabled, I recall, for the time management patches), and it can be done on fishtest. For example, by having the engine always output -300 cp as its eval, none of the games will be adjudicated as a win or a draw, and they will all be played to the end. However, I think that this has very little effect on what fishtest is supposed to do (i.e. filter patches for their Elo strength).
@vondele I've started a test to check how big an impact there is on STC, here:
So, we have completed the tests on this, measuring the impact on the Elo difference between sf9 and sf10 at STC. ELO: 71.03 +-2.4 with std adjudication. That is, no significant effect AFAICT.
It was mentioned once on fishcooking that adjudication is worth 20% in throughput (this seems very high to me, but it is in this thread). In any case, globally disabling adjudication would need very strong arguments, especially since it can be done for individual patches using the fake scores trick. But, playing the devil's advocate, let me point out that the above tests may, as usual, suffer from selection bias. If there are indeed patches that are adversely affected by adjudication, then they will not make it into master. So they would be invisible in such tests...
@vdbergh What about making it similar to TCEC rules?
@Chess13234 TCEC is entertainment. It has nothing to do with fishtest. If one really wants to test the effect of adjudication then one should do like A0: leave out adjudication for 10% of the games that would normally have been adjudicated and see whether the adjudication would have been correct or not. In that way one can measure the effect of adjudication without incurring selection bias.
Maybe my selection bias argument is wrong in the end, since it is now not clear to me how a patch could be adversely affected by adjudication. If an engine emits a large negative score then it is convinced that it is losing. If it is not losing then obviously its evaluation is wrong and to me it seems correct that it is punished for this. A similar argument applies to emitting draw scores when winning, although perhaps less compellingly. But on the whole it seems that an engine can mostly be helped by adjudication but not unjustly punished.
@vdbergh "If it is not losing then obviously its evaluation is wrong and to me it seems correct that it is punished for this." "A similar argument applies to emitting draw scores when winning" Perhaps Stockfish is deficient at evaluating draws because all its life it was tuned to treat sub-20 centipawn scores as potential draws?
Looks like this issue has spawned a partial fix by forcing more time on moves, which will reduce adjudication mistakes.
All tests in fishtest use very shallow conditions for adjudication:
Draw:
(applies after 34 moves from the opening position)
If for 8 moves the score is within the -20,20 centipawn range (20 centipawns from zero), the game is a draw.
This means any neutral position with no clear advantage is dismissed as a draw in very few moves.
Even at LTC many positions will not show big advantages, because they're balanced.
Loss:
If the score is -400 or worse for 3 moves, the game is a loss.
This means that after any temporary setback (especially at STC, where the engine has 100ms to move) cutechess will declare the game a win for the other player. This has huge implications for contempt, pruning, PSQT and lazy eval, and generally means the result is just statistical gambling on the (patched) Stockfish eval being correct within 100ms or even less (dependent on the random time slice allocated).
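To make the two rules above concrete, here is a rough sketch of how they could be applied to the scores of a game. It is not the cutechess or fishtest code: the function name, the `(white_cp, black_cp)` input layout (each score from that engine's own point of view), the requirement that both engines be inside the draw window, and the choice to count the quiet streak before move 34 are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Thresholds as described above.
DRAW_MOVE_NUMBER = 34    # draw rule only takes effect from this full move on
DRAW_MOVE_COUNT = 8      # consecutive moves inside the draw window
DRAW_SCORE = 20          # centipawns from zero
RESIGN_MOVE_COUNT = 3    # consecutive moves at or below -RESIGN_SCORE
RESIGN_SCORE = 400       # centipawns


def adjudicate(scores: List[Tuple[int, int]]) -> Optional[str]:
    """scores[i] is (white_cp, black_cp) for full move i + 1 (assumed layout).

    Returns "1-0", "0-1", "1/2-1/2", or None if no rule triggers.
    """
    draw_run = 0
    white_losing_run = 0
    black_losing_run = 0
    for move_number, (white_cp, black_cp) in enumerate(scores, start=1):
        # Loss rule: an engine reports -400 or worse for 3 moves in a row.
        white_losing_run = white_losing_run + 1 if white_cp <= -RESIGN_SCORE else 0
        black_losing_run = black_losing_run + 1 if black_cp <= -RESIGN_SCORE else 0
        if white_losing_run >= RESIGN_MOVE_COUNT:
            return "0-1"
        if black_losing_run >= RESIGN_MOVE_COUNT:
            return "1-0"
        # Draw rule: both scores within 20 cp of zero for 8 moves in a row.
        quiet = abs(white_cp) <= DRAW_SCORE and abs(black_cp) <= DRAW_SCORE
        draw_run = draw_run + 1 if quiet else 0
        if move_number >= DRAW_MOVE_NUMBER and draw_run >= DRAW_MOVE_COUNT:
            return "1/2-1/2"
    return None
```

Under these assumptions a game where both engines hover around +-10 cp is declared drawn as soon as move 34 is reached, and three consecutive scores of -400 or worse end the game regardless of the move number.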
TCEC rules (https://tcec.chessdom.com/) are, in contrast:
Game adjudication
A game can be drawn by the normal 3-fold repetition rule or the 50-move rule. However, a game can also be drawn at move 35 or later if the eval from both playing engines are within +0.08 to -0.08 pawns for the last 5 moves, or 10 plies. If there is a pawn advance, or a capture of any kind, this special draw rule will reset and start over. On the GUI interface, this rule is shown as "TCEC draw rule" with a number indicating how many plies there are left until it becomes official. It will adjudicate as won for one side if both playing engines have an eval of at least 10.00 pawns (or -10.00 in case of a black win) for 5 consecutive moves, or 10 plies - this rule is in effect as soon as the game starts.
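For comparison, the quoted TCEC rule could be sketched like this. It is an illustration only, not TCEC's implementation: the `Ply` record, the function name and the convention that evals are in pawns from White's point of view are assumptions, as is the detail that a pawn move or capture sets the draw counter back to zero on that very ply.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Ply:
    move_number: int            # full-move number this ply belongs to
    eval_pawns: float           # eval in pawns, White's point of view (assumed)
    pawn_move_or_capture: bool  # True if the move resets the draw rule


def tcec_adjudicate(plies: Iterable[Ply]) -> Optional[str]:
    """Returns "1-0", "0-1", "1/2-1/2", or None if no rule triggers."""
    draw_run = 0
    white_win_run = 0
    black_win_run = 0
    for ply in plies:
        # Win rule: +-10.00 pawns for 10 consecutive plies, from move 1 on.
        white_win_run = white_win_run + 1 if ply.eval_pawns >= 10.0 else 0
        black_win_run = black_win_run + 1 if ply.eval_pawns <= -10.0 else 0
        if white_win_run >= 10:
            return "1-0"
        if black_win_run >= 10:
            return "0-1"
        # Draw rule: within +-0.08 for 10 plies, reset by pawn moves/captures.
        if ply.pawn_move_or_capture or abs(ply.eval_pawns) > 0.08:
            draw_run = 0
        else:
            draw_run += 1
        if ply.move_number >= 35 and draw_run >= 10:
            return "1/2-1/2"
    return None
```

Because the engines alternate, 10 consecutive plies cover 5 moves by each side, which is how the sketch reads "for the last 5 moves, or 10 plies".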
(Keep in mind that at the depths TCEC engines operate, these scores are much more accurate than in STC/LTC tests.)
It's the default condition: Stockfish is always playing inside this +-20 / +-400 interval and is probably deficient in some form when evaluating outside of it.
Proof:
https://github.com/glinscott/fishtest/blob/master/worker/games.py#L474
This has existed since 2013.
Before that, draws were adjudicated in 2(!) moves.
official-stockfish/fishtest@745418d
Relevant fishcooking discussions found, indicating this is an old problem:
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/51QpyiiKScQ/i8CQIGqMBwAJ (mentions the tool for adjudicating non-adjudicated games).
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/Z8Yokz89EyQ/D4rXWN7RPvwJ (time saved by adjudication is not significant).
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/T4qVZWxb7Pc/3s_QoBzg41oJ (the rationale for moving from 2-move draws to 8-move draws).
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/VC-n0A3mVgg/2VRai0yqCwAJ (mentions changes to adjudication).
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/7UiiXGpXzcs/kARMU31AAukJ (removing adjudication).
I've added an ELO test to see the impact of adjudication at STC; however, check out the discussion to understand some of the problems (I think an LTC test would be appropriate instead, since STC is too chaotic and there are additional fundamental factors that make this hard to prove).
http://tests.stockfishchess.org/tests/view/5c248c180ebc5902ba131c53
As suggested by @vondele, the above test was stopped and replaced with sf_10 without adjudication, to better gauge ELO changes by comparing it
ELO: 72.99 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14451 L: 6169 D: 19380 [GREEN]
http://tests.stockfishchess.org/tests/view/5c249ca00ebc5902ba131d56
with an earlier test by @vdbergh to measure ELO scaling to LTC
ELO: 71.03 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14256 L: 6190 D: 19554 [GREEN]
http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed
Additional STC test to test symmetric "NoAdjudication" impact suggested by @vdbergh
ELO: 70.83 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14346 L: 6303 D: 19351 [GREEN]
http://tests.stockfishchess.org/tests/view/5c24a3390ebc5902ba131dc1
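For readers who want to check how the ELO lines above follow from the W/L/D totals, here is a small sketch using the standard logistic Elo formula and a normal approximation for the 95% interval. The function name is made up for this example, and fishtest's own computation may differ in detail.

```python
import math
from typing import Tuple


def elo_from_wld(wins: int, losses: int, draws: int) -> Tuple[float, float]:
    """Return (Elo difference, approximate 95% half-width) from game totals."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # mean score per game, in [0, 1]

    # Per-game variance of the score around its mean, then the standard error.
    variance = (
        wins * (1.0 - score) ** 2
        + losses * (0.0 - score) ** 2
        + draws * (0.5 - score) ** 2
    ) / games
    std_error = math.sqrt(variance / games)

    def to_elo(s: float) -> float:
        # Logistic model: expected score s corresponds to this Elo difference.
        return -400.0 * math.log10(1.0 / s - 1.0)

    elo = to_elo(score)
    # Map the +-1.96 sigma score interval to Elo and take half of its width.
    half_width = (to_elo(score + 1.96 * std_error) - to_elo(score - 1.96 * std_error)) / 2.0
    return elo, half_width


if __name__ == "__main__":
    for w, l, d in [(14451, 6169, 19380), (14256, 6190, 19554), (14346, 6303, 19351)]:
        elo, err = elo_from_wld(w, l, d)
        print(f"W: {w} L: {l} D: {d} -> ELO: {elo:.2f} +-{err:.1f}")
```

With a half-width of about 2.4 on each result, the intervals of the three tests overlap, which is the basis for reading them as showing no significant difference.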