Fishtest Early Adjudication #1904

Closed
ghost opened this issue Dec 26, 2018 · 26 comments


@ghost

ghost commented Dec 26, 2018

All tests in fishtest use very shallow conditions for Adjudication:

Draw:
(applies after 34 moves from opening position)
If the score stays within the -20..+20 centipawn range (20 centipawns from zero) for 8 moves, the game is a draw.
This means any neutral position with no clear advantage is dismissed as a draw within very few moves.
Even at LTC many positions will not show big advantages, because they're balanced.

Loss:
If the score is -400 for 3 moves, the game is a loss.
This means any temporary setback (especially at STC, where the engine has 100ms to move) will make cutechess declare the game a win for the other player. This has huge implications for contempt, pruning, PSQT and lazy eval, and generally means the result is just statistical gambling on the (patched) Stockfish eval being correct within 100ms or even less (depending on the random time slice allocated).
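
To make the two rules above concrete, here is a minimal sketch of how they could be applied to the scores one engine reports during a game. It is an illustration only: the function name `adjudicate` and its parameters are hypothetical, it looks at a single engine's score stream, and it starts counting the draw streak at move 34, which follows the description in this post rather than the actual cutechess or fishtest code.

```python
from typing import List, Optional, Tuple


def adjudicate(scores_cp: List[int],
               draw_from_move: int = 34, draw_moves: int = 8,
               draw_score: int = 20,
               resign_moves: int = 3,
               resign_score: int = 400) -> Optional[Tuple[str, int]]:
    """Return (verdict, move_number), or None if the game is played out.

    scores_cp[i] is the score in centipawns that one engine reports on
    its (i + 1)-th move, from that engine's own point of view.
    Hypothetical helper, not the real cutechess implementation.
    """
    draw_streak = 0
    resign_streak = 0
    for move_no, score in enumerate(scores_cp, start=1):
        # Resign rule: score at or below -resign_score on consecutive moves.
        resign_streak = resign_streak + 1 if score <= -resign_score else 0
        if resign_streak >= resign_moves:
            return ("loss", move_no)
        # Draw rule: only counted once the move-number threshold is reached.
        if move_no >= draw_from_move and abs(score) <= draw_score:
            draw_streak += 1
        else:
            draw_streak = 0
        if draw_streak >= draw_moves:
            return ("draw", move_no)
    return None


print(adjudicate([15] * 60))               # ('draw', 41)
print(adjudicate([30] * 20 + [-450] * 3))  # ('loss', 23)
print(adjudicate([30] * 60))               # None
```

With these thresholds a quiet game is cut off at move 41 at the latest, and three bad scores in a row end the game at any point.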
The TCEC rules (https://tcec.chessdom.com/) are, in contrast:

Game adjudication
A game can be drawn by the normal 3-fold repetition rule or the 50-move rule. However, a game can also be drawn at move 35 or later if the eval from both playing engines are within +0.08 to -0.08 pawns for the last 5 moves, or 10 plies. If there is a pawn advance, or a capture of any kind, this special draw rule will reset and start over. On the GUI interface, this rule is shown as "TCEC draw rule" with a number indicating how many plies there are left until it becomes official. It will adjudicate as won for one side if both playing engines have an eval of at least 10.00 pawns (or -10.00 in case of a black win) for 5 consecutive moves, or 10 plies - this rule is in effect as soon as the game starts.
(Keep in mind that at the depths TCEC engines operate, these scores are much more accurate than in STC/LTC tests.)
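
For comparison, the quoted TCEC rule can be sketched the same way. This is a rough reading of the quoted text, not TCEC's actual GUI code: the `Ply` record, the function name `tcec_adjudicate` and the convention that every eval is given from White's point of view are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Ply:
    eval_pawns: float          # eval of the engine that just moved, White's view
    resets_draw: bool = False  # True for a pawn advance or any capture


def tcec_adjudicate(plies: Iterable[Ply], draw_from_move: int = 35,
                    draw_eval: float = 0.08, win_eval: float = 10.0,
                    window: int = 10) -> Optional[Tuple[str, int]]:
    """Return (verdict, move_number) or None. Hypothetical sketch."""
    draw_run = white_run = black_run = 0
    for index, ply in enumerate(plies):
        move_no = index // 2 + 1
        # Win rule: in effect from the first move. Consecutive plies
        # alternate between the engines, so 10 plies cover both of them.
        white_run = white_run + 1 if ply.eval_pawns >= win_eval else 0
        black_run = black_run + 1 if ply.eval_pawns <= -win_eval else 0
        if white_run >= window:
            return ("white wins", move_no)
        if black_run >= window:
            return ("black wins", move_no)
        # Draw rule: a pawn advance, a capture or a larger eval resets it.
        if ply.resets_draw or abs(ply.eval_pawns) > draw_eval:
            draw_run = 0
        else:
            draw_run += 1
        if move_no >= draw_from_move and draw_run >= window:
            return ("draw", move_no)
    return None


quiet = [Ply(0.0) for _ in range(80)]
print(tcec_adjudicate(quiet))      # ('draw', 35)
quiet[66] = Ply(0.0, resets_draw=True)
print(tcec_adjudicate(quiet))      # ('draw', 39)
```

The main differences from the fishtest thresholds are the much narrower draw window, the reset on pawn moves and captures, and the requirement that both engines agree on a 10.00 score before a win is declared.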

This is the default condition: Stockfish is always playing inside this +/-20, +/-400 centipawn interval and is probably deficient in some form when evaluating outside of it.
Proof:
https://github.com/glinscott/fishtest/blob/master/worker/games.py#L474
This has existed since 2013.
Before that, draws were adjudicated in 2(!) moves.
official-stockfish/fishtest@745418d

Relevant fishcooking discussions found, indicating this is an old problem:
https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/51QpyiiKScQ/i8CQIGqMBwAJ (mentions the tool for adjudicating non-adjudicated games).

https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/Z8Yokz89EyQ/D4rXWN7RPvwJ (Time saved by adjudication is not significant).

https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/T4qVZWxb7Pc/3s_QoBzg41oJ (contains the rationale for moving from 2-move draws to 8-move draws).

https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/VC-n0A3mVgg/2VRai0yqCwAJ (mentions changes to adjudication).

https://groups.google.com/forum/?fromgroups=&hl=en#!searchin/fishcooking/adjudication|sort:relevance/fishcooking/7UiiXGpXzcs/kARMU31AAukJ (removing adjudication).

I've added an ELO test to see the impact of adjudication at STC; however, check out the discussion to understand some of the problems (I think an LTC test would be more appropriate, since STC is too chaotic and there are additional fundamental factors that make this hard to prove).
http://tests.stockfishchess.org/tests/view/5c248c180ebc5902ba131c53

As suggested by @vondele, the above test was stopped and replaced with sf_10 without adjudication, to better gauge the ELO change by comparison:
ELO: 72.99 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14451 L: 6169 D: 19380 [GREEN]
http://tests.stockfishchess.org/tests/view/5c249ca00ebc5902ba131d56
For comparison, the earlier test by @vdbergh to measure ELO scaling to LTC:
ELO: 71.03 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14256 L: 6190 D: 19554 [GREEN]
http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed
An additional STC test of the symmetric "NoAdjudication" impact, suggested by @vdbergh:
ELO: 70.83 +-2.4 (95%) LOS: 100.0%
Total: 40000 W: 14346 L: 6303 D: 19351 [GREEN]
http://tests.stockfishchess.org/tests/view/5c24a3390ebc5902ba131dc1
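
The ELO figures above can be reproduced from the W/L/D counts. The sketch below assumes the usual logistic Elo model and a normal approximation for the 95% margin; fishtest's own statistics code may differ in detail, and the function name `elo_from_wld` is made up for the example.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (elo, margin) from game counts under the logistic Elo model.

    The margin uses a normal approximation of the mean score per game,
    which is an assumption of this sketch.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = (wins * (1.0 - score) ** 2
                + losses * (0.0 - score) ** 2
                + draws * (0.5 - score) ** 2) / n
    stderr = math.sqrt(variance / n)

    def to_elo(s: float) -> float:
        return -400.0 * math.log10(1.0 / s - 1.0)

    margin = (to_elo(score + z * stderr) - to_elo(score - z * stderr)) / 2.0
    return to_elo(score), margin


for w, l, d in [(14451, 6169, 19380), (14256, 6190, 19554), (14346, 6303, 19351)]:
    elo, margin = elo_from_wld(w, l, d)
    print(f"ELO: {elo:.2f} +-{margin:.1f}")
```

The three estimates land on the reported 72.99, 71.03 and 70.83, each with a margin of roughly 2.4.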

@Alayan-stk-2

The draw rule takes effect from move 34 on according to the linked page.

@ghost
Author

ghost commented Dec 26, 2018

@Alayan-stk-2 That doesn't make much difference, 34 is barely midgame. The issue is that these shallow conditions exist at all: games are decided very early.
This is more important than books, SPRT and time controls.
These are fundamentally not chess tournaments but some pathological statistics experiment gone wrong.

@Alayan-stk-2

34 sounds quite early to adjudicate especially with such loose bounds (one could be at +0.19 and the other at -0.19...), sure, but your first post made it sound as if it was true from the beginning of the game. I think it's better for an informed discussion to know clearly the facts, and not everyone will check your link. 🙂

You should try and find a way to experimentally test your hypothesis about the adjudication rules being a serious issue, with some case where you can show they add testing noise.

@ghost
Author

ghost commented Dec 26, 2018

@Alayan-stk-2 That would be easy, I only need the pgn of any recent test to check the number of adjudicated games vs the total.

@Alayan-stk-2

With these adjudication bounds, likely almost all games are adjudicated. A mate is almost impossible, and only some 3-fold draws would really escape the draw adjudication. But the true question is, how often is the adjudication wrong?

One possible experiment would be to get 100K STC games played of e.g. master vs master (and/or maybe SF10 vs SF9 to test the draw rule better) without adjudication. Then, using the evals saved in the pgn, analyse how many would have given a different result if they had been adjudicated.

But this can't be set up on fishtest as far as I know.
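
The counting step of that experiment could look roughly like the sketch below. It assumes each fully played game has already been reduced to a pair: the verdict the adjudication rules would have given from the saved evals (or `None` if no rule would have fired) and the result actually reached on the board. The function name and the sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def adjudication_error_rate(games: Iterable[Tuple[Optional[str], str]]):
    """Return (adjudicable, wrong, rate) over fully played games.

    Each item is (would_be_verdict, played_result), using PGN result
    strings such as "1-0", "0-1" and "1/2-1/2".
    """
    adjudicable = wrong = 0
    for verdict, played_result in games:
        if verdict is None:
            continue  # no rule would have fired, nothing to compare
        adjudicable += 1
        if verdict != played_result:
            wrong += 1
    rate = wrong / adjudicable if adjudicable else 0.0
    return adjudicable, wrong, rate


# Made-up sample, only to show the shape of the input.
sample = [
    ("1/2-1/2", "1/2-1/2"),  # draw rule would have been right
    ("1/2-1/2", "1-0"),      # draw rule would have hidden a win
    ("0-1", "0-1"),          # resign rule would have been right
    (None, "1/2-1/2"),       # never adjudicable
]
print(adjudication_error_rate(sample))
```

A large sample is needed before the rate means anything, which is why 100K games are suggested.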

@ghost
Author

ghost commented Dec 26, 2018

@Alayan-stk-2 Adjudication being wrong is only half the problem.
The real testing of engines stops at adjudication, and with it the testing of their flaws in reaching mate/draw (i.e. some draws are converted to wins, and not all -400 scores result in checkmate, because one of the engines has worse evaluation/pruning or misses a crucial line). The problem is ignoring everything in favor of positions within the eval range of +/-0.2 and +/-4.0 and declaring that "chess".

Example: high-depth analysis of an adjudication shows that the adjudication is right, but at the assigned time control the checkmate/draw isn't reached (without adjudication), so the adjudication is wrong.

@Alayan-stk-2

Alayan-stk-2 commented Dec 26, 2018

If some draws are converted to wins, or some -400 scores still give a draw, then the adjudication led to a wrong result. Adjudication being wrong is not half of the adjudication problem, it is the whole (potential) problem.

Edit : You should reread my suggested testing procedure. It doesn't involve any high-depth analysis.

@ghost
Author

ghost commented Dec 26, 2018

@Alayan-stk-2 High-depth analysis is the only objective way to resolve the problem (though I admit it doesn't predict TC results), since time controls introduce random factors (move time variability, better time management, faster CPUs). So comparing without adjudication vs with adjudication doesn't conclusively prove it either, though without adjudication is evidently less biased.
The results of STC without adjudication would need to be tested at a somewhat longer TC (LTC?) to remove/reduce the influence of STC move time variability (the "STC randomness").

The situation is as follows:
A. High-depth analysis is the objective way to determine the truth of a position (is it a draw or a win?). It may not correlate to actual game results at short TC, but it's the "ground truth" which Stockfish should strive to reach.

B. Adjudication severely distorts STC results: like Lazy Evaluation on steroids, it cuts off whole categories of positions as "Draws" or "Wins" without resolving anything.

C. Testing at the same TC without adjudication vs with adjudication will not uncover every flaw.
Move time variability (as I mentioned in the SPRT thread) will
change adjudications to true/false depending on whether enough nodes are searched per move. A slightly longer time control that smooths out the random move time variability is required, but not so long that it increases the average searched depth.

D. A very long TC requires too many resources, and like high-depth analysis it may bias towards evaluation at high depth, which doesn't correlate to "fast TC" behavior (just like STC-passing patches can't scale to LTC, because of different average depths).

@Alayan-stk-2

A. This isn't relevant, in that such high-depth analysis wouldn't tell us whether fishtest would be more efficient with different adjudication rules or not.

B. This is the hypothesis that needs testing to be proven or disproven. The test goal should focus on this. If you want to test something else, it needs another, different test. The test I suggested (playing full STC games without adjudication with a big enough sample, then reverse-analysing what would have been the adjudication results and compare them to the actual results) has this goal in mind. The results are distorted by the adjudication if playing the full games would lead to different results in a significant amount of games, no more, no less.

C. You can't do everything at once. You assert that adjudication severely distorts STC results. Then this should be tested. If you want to test another parameter, design a test aimed to check on this parameter only.

@ghost
Author

ghost commented Dec 26, 2018

@Alayan-stk-2
1. I agree that such an STC test without adjudication should be run. I just point out some reasoning flaws.
Example 1:
Game A ends in checkmate, but at move X it can be adjudicated as a draw. The adjudication is actually correct, but at STC Stockfish can't reach the draw and loses instead.
Example 2:
A game ends in a draw, but at move X it has a -400 score for the losing side (which would be adjudicated as a loss, and that is actually correct); however, at STC the enemy Stockfish can't reach checkmate and draws instead. The adjudication was correct but cannot predict the result reached at STC.

This STC test itself would have results different from A, the ground truth. Several STC tests on the same positions will show adjudication being randomly correct/wrong, because STC itself isn't that good of a test.

To demonstrate how this STC variance arises, you can run a simple local test without a book (starting position only) to see how timings create wins and losses between identical binaries of Stockfish. It should be essential viewing for everyone who doesn't understand what move time variability is and why it matters. The asymmetrical games between the same binaries, from the same position and at the same time control (using multiple threads adds another random factor) really show how minor speed changes influence the entire game.

@ghost
Author

ghost commented Dec 27, 2018

Example of draws.
1. Draw adjudication
2. Leads to
A. Draw, but adjudication is wrong: by analysis the game ends in checkmate. The adjudication is correct at predicting the game result. This is not detected by the suggested reverse-analysis.
B. Draw, and adjudication is correct: the game is drawn by analysis and at STC. This is ignored.
C. Checkmate, and adjudication is correct (in high-depth analysis): the game simply doesn't reach the draw. This is detected by the suggested reverse-analysis but is actually a false alarm, since high-depth analysis will show it's drawn.
D. Checkmate, and the adjudication is wrong: this is the case that would be found by reverse-analysis.

@Alayan-stk-2

Stockfish at bullet TC plays worse chess than Stockfish at long TC and high-depth analysis. This is obvious.

The point of patch-testing is that good patches will play better and will lose less/win more by being more often right about the position.

@ghost
Author

ghost commented Dec 27, 2018

Example of checkmates:
1. Loss at -400 centipawns adjudicated as checkmate.
2. Leads without adjudication to a
A. Draw, and adjudication is correct in high-depth analysis:
reverse-analysis will show this as incorrect adjudication.
B. Draw, and adjudication is incorrect in high-depth analysis:
reverse-analysis will be consistent with reality in this case.
C. Checkmate, and adjudication is shown as correct in high-depth analysis. This is ignored.
D. Checkmate, and adjudication is shown as incorrect in high-depth analysis. This will be ignored by reverse-analysis.
E. Reverse checkmate, and adjudication is shown as correct by high-depth analysis: the game simply turned around.
Reverse-analysis will flag the adjudication as incorrect, but in reality it was a flaw in gameplay leading to the unexpected loss.
F. Reverse checkmate, and adjudication is shown to be incorrect. This will be found both by high-depth analysis and by the suggested reverse-analysis method.
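
The case lists in these comments compare two yardsticks: reverse-analysis flags an adjudication when it differs from the result actually played out, while high-depth analysis flags it when it differs from the "ground truth". A small sketch of that comparison, with hypothetical names and outcome labels taken from the adjudicated side's point of view:

```python
from itertools import product

OUTCOMES = ("win", "draw", "loss")


def classify(adjudicated: str, played_out: str, deep_truth: str) -> dict:
    """Say which of the two checks would flag this adjudication."""
    return {
        "reverse_analysis_flags": adjudicated != played_out,
        "deep_analysis_flags": adjudicated != deep_truth,
    }


# List the draw-adjudication cases where the two checks disagree.
for played_out, deep_truth in product(OUTCOMES, OUTCOMES):
    flags = classify("draw", played_out, deep_truth)
    if flags["reverse_analysis_flags"] != flags["deep_analysis_flags"]:
        print(f"played={played_out:4} truth={deep_truth:4} {flags}")
```

For example, `classify("draw", "draw", "win")` is draw case A (only high-depth analysis flags it) and `classify("draw", "win", "draw")` is draw case C (only reverse-analysis flags it).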

@ghost
Author

ghost commented Dec 27, 2018

@Alayan-stk-2 Since adjudication exists, and most games end in adjudication, the evaluation score sent IS important.
Imagine if bullet games were decided by adjudication.
Imagine adjudication at depth 8. It obviously has lots of errors.
Now imagine adjudication at depth 9. Still lots of errors.
At which depth can one be confident that adjudication is correct? Does a streak of neutral (0.19 / -0.19) scores mean the position will end in a draw, even at depth 40?

As a thought experiment, imagine the adjudication bounds expanded to +/- 40 centipawns for draws and -200 for resigning. Would you have any argument for adjudication?
If you change your opinion, how about +/- 30 centipawns for draws and -300 for resigning?

@mcostalba

mcostalba commented Dec 27, 2018

I don't want to interfere in your discussion, just to note that any change would increase the average game length, with a bad impact on resource usage.

Personally, if I have to choose where to allocate more resources I would choose the tightening of SPRT parameters (like we already did) instead of this one.

To give some weight to this discussion it should be proved that with changed parameters the set of green tests would be different. Or possibly (easier) recover from somewhere the old discussion that led to the current setup. It was Gary Linscott's if I remember correctly.

@ghost
Author

ghost commented Dec 27, 2018

@mcostalba I'm aware of the resource cost, but it feels so wrong to rely on the STC Stockfish eval being right, especially the 20 centipawn draw interval, which is typical for neutral positions where no clear advantage exists before the endgame.
The -400 centipawn resign within 3 moves also looks premature, since it could be a temporary state where the enemy doesn't actually have an advantage (the -400 doesn't have to be purely material).

@vondele
Member

vondele commented Dec 27, 2018

some of these things have been tested recently (or run with adjudication disabled, I recall, for the time management patches), and it can be done on fishtest. For example, by having the engine always output -300 cp for eval, none of the games will be adjudicated as win or draw, and they will all be played to the end. However, I think that this has very little effect on what fishtest is supposed to do (i.e. filter patches for their Elo strength).
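
Why a constant -300 cp works as a "fake score" can be checked against the thresholds quoted at the top of this issue (20 cp for draws, 400 cp for resigning). The helper below is only an illustration of that arithmetic; its name is hypothetical and it says nothing about how the engine patch itself is written.

```python
DRAW_SCORE_CP = 20     # draw rule threshold quoted in this issue
RESIGN_SCORE_CP = 400  # resign rule threshold quoted in this issue


def rules_triggered_by(score_cp: int) -> dict:
    """Which adjudication rules a constant reported score could ever trigger."""
    return {
        "draw": abs(score_cp) <= DRAW_SCORE_CP,
        "resign": score_cp <= -RESIGN_SCORE_CP,
    }


print(rules_triggered_by(-300))  # {'draw': False, 'resign': False}
print(rules_triggered_by(0))     # {'draw': True, 'resign': False}
print(rules_triggered_by(-400))  # {'draw': False, 'resign': True}
```

Any constant strictly between the two thresholds would do; -300 sits outside the draw window and does not reach the resign score.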

@ghost
Author

ghost commented Dec 27, 2018

@vondele I've started a test to check how big of impact there is on STC, here:
http://tests.stockfishchess.org/tests/view/5c248c180ebc5902ba131c53
Edit; test stopped and replaced with two new tests to better compare with recent @vdbergh ELO scaling tests.

@vondele
Member

vondele commented Dec 29, 2018

so, we have completed the tests on this measuring the impact on the elo diff from sf9 - sf10 at STC.

ELO: 71.03 +-2.4 std adjudication
ELO: 70.83 +-2.4 symmetric no adjudication
ELO: 72.99 +-2.4 non-symmetric no adjudication

that is, no significant effect AFAICT.
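
One way to read "no significant effect" is to compare the difference between two of the estimates with the combined error margin. The sketch below assumes the tests are independent and that the quoted +-2.4 values are 95% margins; the function name is made up for the example.

```python
import math


def difference_with_margin(elo_a: float, margin_a: float,
                           elo_b: float, margin_b: float):
    """Difference of two independent Elo estimates and its combined margin."""
    diff = elo_a - elo_b
    margin = math.sqrt(margin_a ** 2 + margin_b ** 2)
    return diff, margin


# Non-symmetric no adjudication vs standard adjudication.
diff, margin = difference_with_margin(72.99, 2.4, 71.03, 2.4)
print(f"difference: {diff:.2f} +-{margin:.2f}")
print("significant" if abs(diff) > margin else "not significant")
```

The largest gap among the three tests, about 1.96 Elo, is well inside the combined margin of about 3.4, and the other pairs are closer still.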

@vdbergh
Contributor

vdbergh commented Dec 29, 2018

It was mentioned once on fishcooking that adjudication is worth 20% in throughput (this seems very high to me but it is in this thread).

https://groups.google.com/forum/?fromgroups=#!searchin/fishcooking/adjudication$20-bryan$20mindbreaker%7Csort:date/fishcooking/7UiiXGpXzcs/4JgtDVww10QJ

In any case globally disabling adjudication would need very strong arguments. Especially since it can be done for individual patches using the fake scores trick.

But playing the devil's advocate, let me point out that the above tests may as usual suffer from selection bias. If there are indeed patches that are adversely affected by adjudication then they will not make it into master. So they would be invisible in such tests...

@ghost
Author

ghost commented Dec 29, 2018

@vdbergh What about making it similar to TCEC rules?

@vdbergh
Contributor

vdbergh commented Dec 29, 2018

@Chess13234 TCEC is entertainment. It has nothing to do with fishtest.

If one really wants to test the effect of adjudication then one should do as A0 does: leave out adjudication for 10% of the games that would normally have been adjudicated and see if the adjudication would have been correct or not. In that way one can measure the effect of adjudication without incurring selection bias.
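
A minimal sketch of such a 10% hold-out follows. It decides from a hash of a game identifier, so the choice is reproducible and independent of the position or the engines. The function name, the use of a game id and the hashing scheme are all assumptions of this example; nothing like it exists in fishtest as far as this thread shows.

```python
import hashlib


def play_out_anyway(game_id: str, fraction: float = 0.10) -> bool:
    """True if this game should ignore adjudication and be played to the end."""
    digest = hashlib.sha256(game_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < fraction


held_out = sum(play_out_anyway(f"game-{i}") for i in range(10000))
print(f"{held_out} of 10000 games would be played out in full")
```

For the held-out games, the verdict the rules would have given can then be compared with the result actually reached on the board.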

@vdbergh
Contributor

vdbergh commented Dec 29, 2018

Maybe my selection bias argument is wrong in the end since it is now not clear to me how a patch could be adversely affected by adjudication. If an engine emits a large negative score then it is convinced that it is losing. If it is not losing then obviously its evaluation is wrong and to me it seems correct that it is punished for this. A similar argument applies to emitting draw scores when winning, although perhaps less compellingly.

But on the whole it seems that an engine can mostly be helped by adjudication but not unjustly punished.

@ghost
Author

ghost commented Dec 29, 2018

@vdbergh "If it is not losing then obviously its evaluation is wrong and to me it seems correct that it is punished for this. "
Wasn't the common mantra here "Stockfish was never designed to evaluate positions, just finding the best moves"?

"A similar argument applies to emitting draw scores when winning" Perhaps Stockfish is deficient at evaluating draws because all its life it was tuned to treat sub-20 centipawn scores as potential draws?

@mcostalba

After @vondele test I guess we can safely close this one. @snicolet ?

@ghost ghost closed this as completed Jan 1, 2019
@ghost
Author

ghost commented Jun 3, 2020

Looks like this issue has spawned a partial fix by forcing more time on moves, which will reduce adjudication mistakes.
#2707

This issue was closed.