Summary of book tests #3323
Comments
It would be interesting to see a comparison test with Pohl's UHO: https://www.sp-cc.de/unbalanced-human-openings.htm They probably lack some variety coverage, but they are a good example of openings selected to be close to the draw/win limit. Such openings are worse at differentiating engines of very different strength, or weak engines, but in tournament conditions on big hardware they are very good compared to balanced openings. I'm not sure where fishtest LTC currently sits, but the draw rates are so high that making the possibility of a "weak side win" really tiny might not matter.
Isn't there a correlation between
You can define it as how similar it is to the opening distribution of a reference database of human games, e.g. via https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
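One way to make this concrete: treat the book's opening-move frequencies and the human database's frequencies as probability distributions and compute the divergence directly. A minimal sketch, where the frequencies are invented for illustration (any real measurement would need the actual book and reference database):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits; eps guards against zero reference frequencies."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical first-move frequencies (book vs. human reference database),
# e.g. for e4, d4, c4, Nf3:
book  = [0.50, 0.30, 0.15, 0.05]
human = [0.45, 0.35, 0.12, 0.08]

print(kl_divergence(book, human))
```

A divergence of 0 means the book's distribution matches the reference exactly; larger values mean less "human-like" opening coverage.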
Here is a similar series of tests (originally initiated by Vondele) for a much larger Elo difference (which includes in particular the 100 Elo jump due to the introduction of NNUE). One can observe that Drawkiller still wins... This book is definitely interesting. The column Bias should be taken with a grain of salt since the model used to compute it assumes small Elo differences, and moreover there is not enough information to compute a confidence interval.
Before NNUE I think there was no such direct correlation. IIRC noob_3moves.epd and 2moves_v1.pgn had similar sensitivity, despite having radically different properties. Since the introduction of NNUE the draw ratio has gone through the roof, and it seems the resulting Elo compression can no longer be compensated for by the reduced noise associated with a higher draw ratio. EDIT: Personally I do not think "realistic openings" are important. 99.999...% of the positions encountered in search are not realistic, so the engine must know how to handle them. However, IMHO the Drawkiller book goes too far. It magnifies one aspect of chess (opposite castling) that is apparently very Elo sensitive, but it goes without saying that this will prevent the engine from learning about other castling situations.
Added the results for the "Unbalanced Human Openings" by Stefan Pohl to the OP.
If you're interested in testing books with lower draw rates but with more positions (and without hand-selection of specific types of positions): I have some 3-move books of my own that I generated with SF NNUE and grouped by evaluation:
@zz4032 I didn't really want to test books per se (it takes a lot of resources). I just wanted to verify the suggestion by many people that the draw ratio may have become so high that an ultra-balanced book like noob_3moves.epd is no longer the best choice for Fishtest. It seems that the tests confirm this.
@vdbergh thanks for starting this investigation. I added the additional tests mostly because they reflect one year of progress, coinciding with one year of my role as maintainer. Progress has been unimaginable. Back to your tests... first, I think we should really push for pentanomial statistics and normalized Elo as basic tools for high-level computer chess. I have the feeling these are the right tools to use at this level, and I would be in favor of making them more explicitly used in fishtest. Concerning books, let me say I do like our noob_3moves.epd, so I'll be a bit biased. Looking at normalized Elo, the books seem a little closer to each other than in Elo, which I think is a good thing. Yet one could argue that there is a 2x gain possible by moving to a different book (in the number of games needed to decide the typical SPRT). Having a large normalized Elo is clearly an important property of a book, but it is not the only one. These need considering as well:
While book size is easily quantifiable, book content is not. I would argue that having a book of mostly reasonable positions close to startpos is a good approach to train the engine for generically strong play. I know there are arguments for other kinds of books. Another question is what we would like to achieve with a book change, and whether we could achieve it by other means as well (obviously with other side effects). In particular, SPRT bounds could be adjusted to make patches easier to pass (and moved to normalized Elo, so we can keep them constant for longer and more properly adjusted to the kind of test running), and maybe we could consider changing TC (Elo differences are larger at lower TC as well). The latter point is maybe controversial, but we moved to 60s to have high quality of play... our 10s now is miles better play than we had at 60s when we introduced it.
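The pentanomial statistics and normalized Elo mentioned above can be sketched concretely. This is my own reading of the usual definition (per-game t-value scaled by 800/ln 10); fishtest's exact normalization may differ in details, and the counts below are hypothetical:

```python
import math

def normalized_elo(penta):
    """Normalized Elo from pentanomial game-pair counts
    [LL, LD, DD+WL, DW, WW], scored 0, 0.25, 0.5, 0.75, 1 per pair.
    Assumes not every pair is drawn (nonzero variance)."""
    n = sum(penta)
    scores = (0.0, 0.25, 0.5, 0.75, 1.0)
    mean = sum(s * c for s, c in zip(scores, penta)) / n
    var_pair = sum(c * (s - mean) ** 2 for s, c in zip(scores, penta)) / n
    # Per-game sigma including the correlation within a game pair:
    sigma_game = math.sqrt(2.0 * var_pair)
    nt = (mean - 0.5) / sigma_game          # per-game t-value
    return nt * 800.0 / math.log(10)        # ~347.44 * nt

print(normalized_elo([10, 100, 500, 120, 20]))
```

The point of the pentanomial form is that it captures the correlation between the two games of a reversed-color pair, which the trinomial W/D/L counts cannot.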
Slightly off topic, but it might give some ideas...
One problem with books that have low resolution, or more precisely resolution that drops significantly as TC increases, is that search optimizations become too specialized for fishtest conditions and lose value. High-resolution positions pose a difficult puzzle, so it's much more beneficial to tune for them. In other words, you would prefer to maximize performance towards needing less time to avoid crucial mistakes, rather than making fewer mistakes on average at low time. So lowering LTC works against this, while high resolution aids it. Being generic is of course also an asset for a book, helping the optimization to be more generally applicable. Having the whole book at the same length reduces generality. Why not construct a variable-length book by taking the highest-resolution parts of all the normal books? Also, I am very excited, as the importance of the concept of the scaling of resolution became clear to me just now. I will attempt to explain further with an example. Say we have 2 books, one with slightly lower resolution but more uniform across TC, and a second with better resolution but with a noticeable LTC drop overlapping the Elo compression. Using the first will mean:
So I think it's valuable and cheap to measure the scaling of resolution for books.
Would it make sense to use more than one book and adjust the normalized bounds in relation to their resolution too? I'm not sure that using normalized Elo and adjusting confidence accordingly amounts to the same thing. That would mean that resolution is unimportant to selectability and economy, and that only position coverage is essential, as @vondele hinted. This idea is interesting. In other words, even with a 98% draw rate (as long as there are no openings which produce 100% and waste resources), the progress rate and test length could be kept stable just by reducing the win-loss demand. This is somehow hard to believe, but if it is true to some extent, then lowering LTC fully makes sense.
I think the idea of shortening LTC because quality of play has increased a lot forgets that testing at a longer TC isn't primarily about having test games above some absolute chess strength. While the observation that what works at short TC also often helps at longer TC has been key to generalizing the mass statistical testing that made a lot of engine progress possible, there is still a distortion between what is optimal for different TC/hardware budgets. The scaling of the search tune in #2260 was a stark reminder of this fact. The use cases where Stockfish's strength matters most are those with deep searches: analysis, engine tournaments, etc. Shortening the TC would increase the distortion, and while we would still be gaining strength at all TCs, the testing process would statistically optimize less for the long searches that matter the most.
Yes, there is a problem with having a shorter LTC.
I added some more STC data. If ever the framework becomes empty again, it might be a good idea to do some tests with a more recent pair of Stockfishes. Of course the normalized Elo values will be different, but one may hope that the relative ratios between the different books remain similar.
@vdbergh I ran a comparable test on a new book very recently. Could you add it? https://tests.stockfishchess.org/tests/view/60fdc57ad8a6b65b2f3a79d8
Done. The book is on par with 2moves_v1.pgn and noob_3moves.epd. It's a very small sample size, but it seems plausible that at STC any "reasonable" book (not too many moves, good coverage) has essentially the same resolution (independently of draw ratio or bias, as long as these are not extreme). The Drawkiller book cheats by reducing coverage (the issue of castling vs. not castling cannot occur in a game started from a Drawkiller opening).
It could of course be that Elo compression does exist for the very high draw ratios currently in Fishtest (even when normalized Elo is used) and that we are simply not seeing it in these tests. This is not so easy to test, since one needs to use engines which are fairly close in strength, but then one needs a very large number of games to get sufficiently precise results.
I'll do an LTC test for that new book as well. I'll start it once the current RT is done.
I added the LTC data for bjbraams_chessdb_198350_lines.epd to the table. This seems like an interesting book!
Let's make bjbraams_chessdb_198350_lines.epd the new default book! :)
To make a theoretical prediction about the merits of biased books one needs a draw model, and the validity of the conclusions then depends on the validity of the model. This being said, assuming the BayesElo model one obtains the following picture: https://hardy.uhasselt.be/Fishtest/sensitivity.png Here the "draw ratio" is the one which would occur between equal engines with a perfectly balanced book (the noob_3moves book is a good approximation). The vertical axis is the "normalized Elo magnification" caused by the book. Note: for biased books the approximate formula. EDIT: I tried to include the picture directly in this comment, but for some reason it does not work (even though I have done it many times in the past). Anyway, below is the command. Perhaps it will start working magically in the future.
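The kind of computation behind that plot can be sketched numerically. The following is my own rough reconstruction under a Rao-Kupper/BayesElo-style parametrization; the exact model and constants behind the picture may differ, and this ignores the reversed-color pair structure:

```python
import math

def logistic(x):
    """Expected score for an Elo advantage of x (logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def wdl(elo, bias, drawelo):
    """BayesElo-style W/D/L probabilities for the side enjoying `bias`
    Elo of opening advantage; `drawelo` controls the draw tendency."""
    w = logistic(elo + bias - drawelo)
    l = logistic(-elo - bias - drawelo)
    return w, 1.0 - w - l, l

def sensitivity(bias, drawelo, h=0.01):
    """d(mean score)/d(elo) divided by the per-game sigma, at elo = 0.
    A larger value means fewer games to resolve a given Elo difference."""
    def stats(elo):
        w, d, l = wdl(elo, bias, drawelo)
        mu = w + 0.5 * d
        var = w * (1 - mu) ** 2 + d * (0.5 - mu) ** 2 + l * mu ** 2
        return mu, math.sqrt(var)
    mu_plus, _ = stats(h)
    mu_minus, _ = stats(-h)
    _, sigma = stats(0.0)
    return (mu_plus - mu_minus) / (2 * h) / sigma

# With a high drawelo (very drawish balanced positions), a biased book
# should show a larger sensitivity than an unbiased one:
print(sensitivity(0, 330), sensitivity(200, 330))
```

Scanning `bias` for a fixed `drawelo` and recording the resulting draw ratio at the maximum is exactly the kind of exercise that produces the "optimal bias" curves discussed below.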
Here is another interesting picture. It says that the optimal bias occurs when the draw ratio is 50% (still assuming the BayesElo model). This is rather appealing from the viewpoint of maximum entropy (but I see no a priori reason why the entropy argument would be valid). https://hardy.uhasselt.be/Fishtest/draw_ratio.png Win and loss ratios are for the color that has the advantage. Since the openings are repeated with reversed colors, the total win and loss ratios will be close to 25% at the "optimal bias" (assuming the engines are of equal strength). EDIT: The "entropy argument" is as follows. If one uses a sufficiently biased book, the outcome of a game becomes binary (win or draw for the color that has the advantage). Then one may guess that one can extract maximum information from a test if the two possible outcomes occur with approximately the same frequency. However, I see no reason why this would be a valid argument. I guess I could do a similar computation for the Davidson draw model. EDIT2: Still need to do the computation for the Davidson model, but I am relatively convinced that the nice 50% optimal draw ratio is an artifact of the BayesElo model. Explanation: If
From my observations in practice, I would say that 50% is not optimal, but 60%-80% is. The optimal number will surely differ for different engine pairs and different books. @vdbergh Regarding maximum information extraction I have the following proposition, though it is unclear to me whether it is fully sufficient or only partially viable: 50% could be optimal in case the average game length of decisive and drawn games is the same. But of course it is expected that drawn games last longer on average than decisive ones, so they put the engines to more tests, albeit easier ones, hence more information. So, to sum up, sustained complexity (unclear outcome) of positions is also important, along with draw rate. We would prefer positions that resolve into a draw or win in long games over positions which resolve fast.
Regarding bjbraams_chessdb_198350_lines.epd, I find it most impressive that it has the least decline of Elo spread from STC to LTC, indicating sustained uncertainty (as opposed to increasing determinism with TC). For sure this asset will be even more important at higher draw rates, with stronger engines and with engine pairs closer in strength than the ones tested here. I too think it will be beneficial to start using it, in order to also observe its behavior at target conditions.
I did the computation for the Davidson model and the results are very similar to the ones for the BayesElo model (of course these are just models; they were mainly selected for their mathematical elegance, and it is not clear how well they agree with reality, especially under extreme conditions). https://hardy.uhasselt.be/Fishtest/draw_ratio_davidson.png The optimal draw ratio is still around 50%. The gain obtainable from using biased books is still huge.
Interesting. It seems that any model which has the property that the win ratio is given by This assumption holds on the nose for Glenn-David and BayesElo. It also holds asymptotically for Davidson, as one may check that BayesElo (officially Rao-Kupper) and Davidson are equivalent for large biases.
Switches to a new book by Stefan Pohl: UHO_XXL_+0.90_+1.19.epd See official-stockfish/books@63b0f20 See discussion and tests in official-stockfish/Stockfish#3323
Default book updated to UHO_XXL_+0.90_+1.19.epd. Thanks all for the book contributions and testing.
@vdbergh Since your computations for the Davidson model / BayesElo found a 50% draw rate to be optimal, and indeed it looks to be and makes sense, but the lowered draw rate brings down normalized Elo, I would like to ask if there is a clean (or easily approximated) way to compute the optimum for normalized Elo. I also think it's a good idea to try this other extreme for some months, rapidly getting rid of the previous selection bias along the way, but ultimately IMO the golden balance most likely lies somewhere in the middle, around 70% (a DD to WL ratio of around 2:1). Anyway, this is for the future; happy respins and have fun!
STCs seem to be taking a lot more games to resolve. Could we also have an STC closed strength measurement for normalized Elo comparison? If the difference is indeed big, then some adjustments might be needed. For LTC, nElo is just 22% less (minus selection bias), so no worries.
STCs are not taking a lot more games to resolve.
Regarding the bias in the books... could this be countered by mirroring every other position (+ flipping the side to move)?
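The suggested mirror-and-flip can be written down concretely as a FEN transformation. A minimal sketch, assuming standard FEN fields (it does not handle Chess960/Shredder castling flags):

```python
def flip_fen(fen):
    """Mirror a FEN vertically and swap piece colors, toggling the
    side to move. Standard castling flags (KQkq) only."""
    board, stm, castling, ep, halfmove, fullmove = fen.split()
    # Reverse the rank order and swap piece colors.
    flipped_board = "/".join(r.swapcase() for r in reversed(board.split("/")))
    flipped_stm = "b" if stm == "w" else "w"
    if castling == "-":
        flipped_castling = "-"
    else:
        swapped = castling.swapcase()
        flipped_castling = "".join(c for c in "KQkq" if c in swapped)
    # En passant targets live on rank 3 or 6; mirroring swaps them.
    flipped_ep = "-" if ep == "-" else ep[0] + ("3" if ep[1] == "6" else "6")
    return " ".join([flipped_board, flipped_stm, flipped_castling,
                     flipped_ep, halfmove, fullmove])

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(flip_fen(start))
```

Applied to the starting position this yields the starting position with black to move, which is exactly the case questioned below; applied twice it is the identity.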
STCs definitely take much longer than before, mainly due to the lowered nElo spread caused by the very low draw rate. It has a similar effect to narrowing the bound width. The respins visually exaggerate the effect. Whenever a drastic change happens, all previous statistics change. Bounds need to be recalibrated, or at least measured. https://tests.stockfishchess.org/tests/view/612286dccb26afb615a5e769 The normalized Elo calculation in SPRTs demands a certain nElo confidence one way or the other for completion. So a lower nElo spread simply requires more games for the same nElo confidence.
The testing data is here, with 2 data points missing. Normalized Elo: We see that
By extrapolating, the expectation is for the UHO_0.9 STC nElo spread to be much lower, but it's really unclear how much, so the missing data is crucial. We already know that
So we can conceive
Time and tests will show to what extent, but it's clear that the performance of books is TC dependent.
The STC tests should not take more time, since the SPRT bounds are expressed in normalized Elo. The worst-case expected duration is still 116K games. The right tail of the distribution falls off more slowly than for the normal distribution (the falloff is exponential). It follows that very long tests will happen more frequently than one would expect under the normal distribution, but the average duration should stay below 116K games. EDIT: I computed the average duration of the last 26 finished tests and got 82745, with a 95% confidence interval (determined by resampling) given by [55017, 117572]. I'd say there is nothing to worry about.
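The resampling interval mentioned in the EDIT can be reproduced with a simple percentile bootstrap. A sketch; the durations below are made-up placeholders, not the actual 26 test durations:

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05, seed=1):
    """Percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical SPRT test durations in games:
durations = [30_000, 45_000, 60_000, 80_000, 95_000, 110_000, 150_000, 92_000]
lo, hi = bootstrap_ci(durations)
print(lo, hi)
```

The percentile bootstrap makes no normality assumption, which matters here given the heavy right tail of SPRT durations.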
Do I understand correctly that, when applied to the starting position, this procedure would yield the starting position with black to move? I guess it depends on the NN training whether the NN can handle such positions sensibly. I actually understand why in Stefan Pohl's books the bias is nearly constant from the white point of view. This allows one to use Ordo or BayesElo. These programs use models with a "WhiteAdvantage" parameter, and the models assume that the white advantage is constant. If this assumption does not hold, then the programs give incorrect results. The pentanomial statistics (for head-to-head matches), on the other hand, do not depend on any model, so there is no constraint on the bias of the individual positions.
The net would see that position exactly the same as it sees the start position; the transformation I described is invariant from the perspective of the net. If anything changed, it would be from search or perhaps the classical eval. Also, hackily, the bias could be brought to 0.
I guess you mean that the average bias would be zero. I agree with that. Note that the RMS bias would not change (this is the square root of the average of the squared biases), so one would still need pentanomial statistics.
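A two-line illustration of the distinction (the per-opening bias numbers are invented): mirroring cancels the average bias but leaves the RMS bias untouched.

```python
import math

# Hypothetical per-opening biases (in Elo) after mirroring every other position:
biases = [30, -30, 50, -50]

mean_bias = sum(biases) / len(biases)                           # cancels to 0.0
rms_bias = math.sqrt(sum(b * b for b in biases) / len(biases))  # spread survives

print(mean_bias, rms_bias)
```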
@vdbergh Thanks for the reassuring response. So could we say that the benefit of book performance at given bounds exclusively affects the accuracy of resolution? If so, it would mean that wider bounds with a better-performing book could provide equal confidence in fewer games on average, right? I was under the impression that given bounds provide equal confidence upon resolution, with varying productivity. (Book performance measured in nElo.)
With the current setup every test receives a budget on the order of 100K games (independently of anything). How effectively this budget is used should depend on the book. Before the bounds were expressed in normalized Elo, every change of testing conditions required an adjustment of the bounds, which was a painful process.
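For context, the stopping rule behind these budgets is a sequential probability ratio test. A minimal sketch of the classical Wald thresholds on the log-likelihood ratio (fishtest's actual GSPRT implementation is more involved, but the stopping bounds have this shape):

```python
import math

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald SPRT stopping thresholds on the log-likelihood ratio (LLR).
    The test stops when the accumulated LLR crosses either bound."""
    lower = math.log(beta / (1 - alpha))        # accept H0 (reject patch)
    upper = math.log((1 - beta) / alpha)        # accept H1 (pass patch)
    return lower, upper

lo, up = sprt_bounds()
print(lo, up)
```

With bounds stated in normalized Elo, the same LLR thresholds yield comparable game budgets across books, which is the point made above.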
A 10% time odds test (60+0.6 vs 66+0.66).
The nElo ratio is This result is quite different from #3323 (comment). It seems likely that the latter test suffered from selection bias. EDIT: Added results for 8moves_v3.pgn to get a conversion factor between UHO Elo and RT Elo.
An impressively contrasting result, indicating high deviation of optimization across these very different position subsets. A great result regarding UHO productivity, as all testing so far points to great scaling with TC (resistance to Elo compression), indicating a low number of deterministic results with higher-depth searches (discovered forced wins/draws). Regarding LTC confidence, a simple calculation of Elo to nElo analogies at the TC odds test gives us a x2.07 performance! (*) Assessing the combined STC+LTC confidence is another reason for an STC time odds test.
We have been using the new book now for somewhat more than 2 months. I've started 3 tests with the old noob_3moves.epd, the new UHO_XXL_+0.90_+1.19.epd and the RT 8moves_v3.pgn books, on the patches committed to master since Aug 22nd (SPRT tested with the UHO book). https://tests.stockfishchess.org/tests/view/61891ae3d7a085ad008ef282 This test was triggered by the question whether progress on the new book and progress on the old book are comparable.
Thanks. I will put the results in a table tonight. However, to be able to compare with #3323 (comment) it might be useful to also do the tests at LTC. That way we can see whether it is the UHO book that benefited from selection bias this time.
A less costly alternative is to redo #3323 (comment) at STC. Maybe I'll just do that. Then we also get some new scaling information with respect to TC.
NKONSTANTAKIS or vdbergh, can you clarify "50% UHO extra benefit"? By what measure? The measure that makes sense to me is how many games are needed (all else equal) to decide the sign (only the sign) of the Elo difference between two instances of SF, within some agreed level of uncertainty.
By comparing the 3 tests on the 3 different books of the same reference version, Elo-spread via time odds so as not to be affected by the selection bias of UHO-specific optimization, against the 3 tests of dev vs. the old reference version. We see that the 2 balanced books have almost similar performance, while on the UHO book that we optimize for we get 23.4 vs. 14.5. So, taking everything into account, it's very near 50%. This is after ~3 months of patches on UHO and is expected to increase.
In other words, dev has become as much stronger against the last version before the book swap as if that version had 10% more time against itself at STC on the balanced books, and 50% more on UHO. This means that performance on the balanced books also benefited from UHO progress, taking 2/3 of the gains.
I will close this as an issue. We have quite successfully adopted UHO as the training book. We did find a recent case where there is quite a difference between the Elo measured on the 8moves and the UHO books (#3937), but overall the experience is good. Several major tournaments have adopted UHO-style books as well. We can reopen the issue at a later point if useful.
What these experiments have taught us is that book tests based on prior commits are completely useless because of selection bias. The only objective book tests are time odds tests, like this one: #3323 (comment)
@vdbergh I'll link to your test of the DFRC book https://tests.stockfishchess.org/tests/view/62c551fe50dcbecf5fc0cfe8 Let us know what your interpretation of those results is.
The book tests are finished, so I am summarizing the results here. From a technical point of view the only column that is important is "Normalized Elo", since it determines the number of games needed to detect the given strength difference (the relation is inversely quadratic; for SPRT the approximate formula is 640000/(n.e.)^2). For a similar series of tests (reported on Jan 3, 2020) see official-stockfish/fishtest#472 (comment). The issue in loc. cit. also gives more information about the various books. The books themselves can be found in the Fishtest book repository.
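The inverse-quadratic relation quoted above is easy to sanity-check with quick arithmetic:

```python
def sprt_games(normalized_elo):
    """Approximate SPRT game count from the 640000/(n.e.)^2 rule of thumb."""
    return 640_000 / normalized_elo ** 2

# Doubling the normalized Elo difference quarters the games needed:
print(sprt_games(2.0))
print(sprt_games(4.0))
```

This is why a book that merely doubles the normalized Elo spread halves-squared (i.e. quarters) the game budget for the same decision.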
Some STC tests to check scaling.
EDIT: Added 8mvs_big_+80_+109.epd. This book, with 25857 positions, is the biggest in the series Unbalanced_Human_Openings_V2.0, created by Stefan Pohl. See https://www.sp-cc.de/unbalanced-human-openings.htm.
EDIT2: Added data for the special purpose books "endgames.epd" and "closedpos.epd".
EDIT3: I started adding STC data to check scaling.
EDIT4: Some more STC data. I did the tests a long time ago but never got around to adding the results.
EDIT5: STC data for bjbraams_chessdb_198350_lines.epd added.
EDIT6: LTC data for bjbraams_chessdb_198350_lines.epd added. Strong performance.
EDIT7: Data for the new UHO series by Stefan Pohl.