
Summary of book tests #3323

Closed
vdbergh opened this issue Jan 25, 2021 · 83 comments

vdbergh (Contributor) commented Jan 25, 2021

The book tests are finished, so I am summarizing the results here. From a technical point of view the only column that matters is "Normalized Elo", since it determines the number of games needed to detect a given strength difference (the relation is inversely quadratic: for an SPRT the approximate number of games is 640000/(normalized Elo)^2).
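
As a rough illustration of that inverse-quadratic relation, here is a minimal sketch (the constant 640000 is the approximate figure quoted above for fishtest's SPRT; exact costs depend on the bounds used):

```python
def sprt_games(normalized_elo: float) -> float:
    """Approximate number of games an SPRT needs to resolve a strength
    difference of `normalized_elo`, per the formula quoted above."""
    return 640000 / normalized_elo**2

# Relative cost of two books from the LTC table below: the Drawkiller book
# resolves the same engine pair in ~3.7x fewer games than the endgames book.
print(sprt_games(66.5) / sprt_games(128.6))  # ~3.74
```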

For a similar series of tests (reported on Jan 3, 2020) see official-stockfish/fishtest#472 (comment). That issue also gives more information about the various books. The books themselves can be found in the Fishtest book repository.

| Name | Draw ratio | Bias (Elo) | Elo | Normalized Elo | Test |
| --- | --- | --- | --- | --- | --- |
| Drawkiller_balanced_big.epd (LTC) | 0.63 | 70 | 70.6 [69.0, 72.1] | 128.6 [125.6, 131.6] | Test |
| bjbraams_chessdb_198350_lines.epd (LTC) | 0.65 | 122 | 57.3 [56.0, 58.6] | 123.1 [120.0, 126.2] | Test |
| hybrid_book_beta.pgn (LTC) | 0.70 | 66 | 59.7 [58.3, 61.1] | 121.0 [118.2, 123.9] | Test |
| UHO_XXL_+0.80_+1.09.epd (LTC) | 0.55 | 165 | 57.4 [56.0, 58.8] | 119.0 [115.9, 122.1] | Test |
| UHO_XXL_+0.90_+1.19.epd (LTC) | 0.49 | 189 | 57.6 [56.2, 58.9] | 118.0 [114.9, 121.1] | Test |
| 8mvs_big_+80_+109.epd (LTC) | 0.56 | 156 | 56.6 [55.2, 58.0] | 116.0 [112.9, 119.1] | Test |
| 2moves_v1.pgn (LTC) | 0.76 | 77 | 47.0 [45.8, 48.2] | 110.5 [107.7, 113.3] | Test |
| UHO_XXL_+1.00_+1.29.epd (LTC) | 0.43 | 219 | 52.9 [51.5, 54.3] | 108.7 [105.6, 111.8] | Test |
| closedpos.epd (LTC) | 0.79 | 54 | 42.0 [40.8, 43.2] | 101.7 [99.0, 104.5] | Test |
| noob_3moves.epd (LTC) | 0.83 | 23 | 37.6 [36.5, 38.7] | 96.9 [94.3, 99.5] | Test |
| 8moves_v3.pgn (LTC) | 0.83 | 58 | 32.4 [31.4, 33.4] | 87.6 [84.9, 90.2] | Test |
| endgames.epd (LTC) | 0.63 | 181 | 24.6 [23.6, 25.6] | 66.5 [63.8, 69.3] | Test |

Some STC tests to check scaling.

| Name | Draw ratio | Bias (Elo) | Elo | Normalized Elo | Test |
| --- | --- | --- | --- | --- | --- |
| Drawkiller_balanced_big.epd (STC) | 0.47 | 62 | 86.8 [84.8, 88.8] | 130.9 [127.8, 134.0] | Test |
| 2moves_v1.pgn (STC) | 0.63 | 77 | 64.3 [62.8, 65.9] | 118.6 [115.6, 121.5] | Test |
| bjbraams_chessdb_198350_lines.epd (STC) | 0.57 | 121 | 63.8 [62.2, 65.3] | 117.7 [114.7, 120.8] | Test |
| UHO_XXL_+0.80_+1.09.epd (STC) | 0.50 | 154 | 63.0 [61.4, 64.6] | 114.8 [111.7, 117.9] | Test |
| noob_3moves.epd (STC) | 0.70 | 32 | 59.7 [58.2, 61.2] | 114.2 [111.7, 117.1] | Test |
| UHO_XXL_+0.90_+1.19.epd (STC) | 0.46 | 171 | 61.8 [60.3, 63.4] | 112.3 [109.2, 115.4] | Test |
| UHO_XXL_+1.00_+1.29.epd (STC) | 0.43 | 193 | 60.0 [58.5, 61.6] | 110.3 [107.2, 113.4] | Test |
| 8moves_v3.pgn (STC) | 0.71 | 61 | 49.0 [47.6, 50.4] | 98.8 [95.9, 101.7] | Test |

EDIT: Added 8mvs_big_+80_+109.epd. This book, with 25857 positions, is the biggest in the Unbalanced_Human_Openings_V2.0 series created by Stefan Pohl. See https://www.sp-cc.de/unbalanced-human-openings.htm.

EDIT2: Added data for the special purpose books "endgames.epd" and "closedpos.epd".

EDIT3: I started adding STC data to check scaling.

EDIT4: Some more STC data. I did the tests a long time ago but never got around to adding the results.

EDIT5: STC data for bjbraams_chessdb_198350_lines.epd added.

EDIT6: LTC data for bjbraams_chessdb_198350_lines.epd added. Strong performance.

EDIT7: Data for the new UHO series by Stefan Pohl.

@Alayan-stk-2 commented:

It would be interesting to see a comparison test with Pohl's UHO: https://www.sp-cc.de/unbalanced-human-openings.htm

They probably lack some variety in coverage, but they are a good example of openings selected to be close to the draw/win limit. Such openings are worse at differentiating engines of very different strength, or weak engines, but in tournament conditions on big hardware they are very good compared to balanced openings.

I'm not sure where fishtest LTC currently sits, but the draw rates are so high that making the possibility of a "weak side win" really tiny might not matter.

vdbergh (Contributor, Author) commented Jan 25, 2021

Yes that would be interesting.

Before a book can be tested it has to be uploaded to the book repository. I guess I can just make a pull request...

EDIT: I just made one. @vondele @snicolet

zz4032 commented Jan 26, 2021

Isn't there a correlation between Normalized Elo and Draw ratio?
It looks like a book is generally more appropriate for engine testing if:

  • Draw ratio is lower / closer to 0.50?,
  • Bias(Elo) is lower (more balanced)?,
  • it includes more realistic openings (cannot really be expressed in numbers)?

ddobbelaere (Contributor) commented Jan 26, 2021

> • it includes more realistic openings (cannot really be expressed in numbers)?

You can define it as how similar it is to the opening distribution of a reference database of human games, e.g. via the Kullback-Leibler divergence: https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
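
A minimal sketch of this idea (my own rendering, not an established metric): represent the book and a reference database as distributions over opening keys, e.g. the first few moves as a string, and compute the KL divergence between them. The smoothing constant `eps` is an assumption to handle openings absent from the reference.

```python
import math
from collections import Counter

def kl_divergence(book_openings, reference_openings, eps=1e-12):
    """D_KL(book || reference) over opening keys: smaller means the book's
    opening distribution looks more like the reference (human) distribution."""
    p, q = Counter(book_openings), Counter(reference_openings)
    n_p, n_q = sum(p.values()), sum(q.values())
    return sum(
        (p[k] / n_p) * math.log((p[k] / n_p) / max(q[k] / n_q, eps))
        for k in p
    )
```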

vdbergh (Contributor, Author) commented Jan 28, 2021

Here is a similar series of tests (originally initiated by Vondele) for a much larger Elo difference (which includes in particular the 100 Elo jump due to the introduction of NNUE). One can observe that Drawkiller still wins... This book is definitely interesting.

The column Bias should be taken with a grain of salt, since the model used to compute it assumes small Elo differences, and moreover there is not enough information to compute a confidence interval.

| Name | Draw ratio | Bias (Elo) | Elo | Normalized Elo | Test |
| --- | --- | --- | --- | --- | --- |
| Drawkiller_balanced_big.epd | 0.26 | 41 | 271.3 [268.7, 273.9] | 426.0 [420.3, 431.9] | Test |
| noob_3moves.epd (STC) | 0.28 | 7 | 270.9 [268.27, 273.5] | 416.3 [410.6, 422.0] | Test |
| noob_3moves.epd | 0.39 | 35 | 227.7 [225.6, 229.9] | 385.4 [380.8, 390.0] | Test |
| 8moves_v3.pgn | 0.47 | 64 | 182.9 [181.0, 184.7] | 328.5 [324.6, 332.4] | Test |

vdbergh (Contributor, Author) commented Jan 28, 2021

> Isn't there a correlation between Normalized Elo and Draw ratio?
> It looks like a book is generally more appropriate for engine testing if:
>
> • Draw ratio is lower / closer to 0.50?,
> • Bias(Elo) is lower (more balanced)?,
> • it includes more realistic openings (cannot really be expressed in numbers)?

Before NNUE I think there was no such direct correlation. IIRC noob_3moves.epd and 2moves_v1.pgn had similar sensitivity, despite having radically different properties.

Since the introduction of NNUE the draw ratio has gone through the roof and it seems the resulting Elo compression can no longer be compensated for by the reduced noise associated with a higher draw ratio.

EDIT: Personally I do not think "realistic openings" are important. 99.999...% of the positions encountered in search are not realistic so the engine must know how to handle them.

However IMHO the draw killer book goes too far. It magnifies one aspect of chess (opposite castling) that is apparently very Elo sensitive. But it goes without saying that this will prevent the engine from learning about other castling situations.

vdbergh (Contributor, Author) commented Jan 28, 2021

Added the results for the "Unbalanced Human Openings" by Stefan Pohl to the OP.

zz4032 commented Jan 28, 2021

If you're interested in testing books with lower draw rates but more positions (and without hand-selection of specific types of positions): I have some 3-move books of my own, generated with SF NNUE and grouped by evaluation:
3moves_cp.zip
3moves_cp50-79_152057pos or 3moves_cp80-112_101530pos could be nice candidates for higher Normalized Elo while not being too unbalanced.

vdbergh (Contributor, Author) commented Jan 28, 2021

@zz4032 I didn't really want to test books per se (it takes a lot of resources). I just wanted to verify the suggestion made by many people that the draw ratio may have become so high that an ultra-balanced book like noob_3moves.epd is no longer the best choice for Fishtest. The tests seem to confirm this.

vondele (Member) commented Jan 30, 2021

@vdbergh thanks for starting this investigation. I added the additional tests mostly because they reflect one year of progress, coinciding with one year in my role as maintainer. Progress has been unimaginable...

Back to your tests... first, I think we should really push for pentanomial statistics and normalized Elo as basic tools for high-level computer chess. I have the feeling these are the right tools to use at this level. I would be in favor of making them more explicitly used in fishtest.

Concerning books, let me say I do like our noob_3moves.epd, so I'll be a bit biased.

Looking at normalized Elo, the books are a little closer to each other than in Elo, which I think is a good thing. Yet one could argue that a 2x gain (in the number of games needed to decide the typical SPRT) is possible by moving to a different book. Having a large normalized Elo is clearly an important property of a book, but it is not the only one. These need to be considered as well:

  • book size: ideally each game played in fishtest starts from a different opening, to reduce correlations between games. Since we quite commonly play 100k-300k games, books should have 50k-150k positions.
  • book content: the engine will naturally be optimized for the type of positions in the book. If our book is too specific, we will become rather specific in what we learn, and relatively speaking worse in other contexts.

While book size is easily quantifiable, book content is not. I would argue that a book consisting mostly of reasonable positions close to the starting position is a good approach for training the engine for generically strong play. I know there are arguments for other kinds of books.

Another question is what we would like to achieve with a book change, and whether we could achieve it by other means as well (obviously with other side effects). In particular, SPRT bounds could be adjusted to make patches easier to pass (and moved to normalized Elo, so we can keep them constant for longer and more properly adjusted to the kind of test running), and maybe we could consider changing the TC (Elo differences are larger at lower TC as well). The latter point is maybe controversial, but we moved to 60s to get high quality of play... our 10s play now is miles better than our 60s play was when we introduced it.

@miguel-l (Contributor) commented:

Slightly off topic, but it might give some ideas.
My idea for nice book content is a mix of the following:

  • chess960 - this is probably already difficult, since afaik chess960 is not supported by fishtest, but I feel this would help very much, as the middlegames arising from it will be varied.
  • Modified starting positions - once you have the above, these are pretty easy to generate by modifying FENs. Some examples of what I'm thinking: take all 960 FENs and remove a rook from one side and a bishop from the other, or remove both queens, or force OCB from the start, or remove queenside castling rights for white and kingside for black, etc. I think this would cover imbalanced positions and positions close to endgames. Programmatic modification of FENs might end up with 'impossible' positions, but that might not matter much (see the sketch after this list).
  • Positions from other books - take positions from 2moves, 3moves or some other existing book. This should shift the distribution so that the book still contains many positions close to standard chess.
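
A sketch of the FEN-modification idea, using the third-party python-chess library (the helper name and the particular modifications are illustrative, not a proposed fishtest feature):

```python
from typing import Optional

import chess

def no_queens_variant(fen: str) -> Optional[str]:
    """Remove both queens and white's queenside castling rights; return the
    modified FEN, or None if python-chess's sanity check rejects the result."""
    board = chess.Board(fen)
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is not None and piece.piece_type == chess.QUEEN:
            board.remove_piece_at(square)
    board.castling_rights &= ~chess.BB_A1  # drop white's queenside rights
    return board.fen() if board.is_valid() else None

print(no_queens_variant(chess.STARTING_FEN))
```

Filtering with a legality check like `is_valid()` addresses the 'impossible positions' concern, at the cost of discarding some generated lines.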

NKONSTANTAKIS commented Jan 31, 2021

One problem with books whose resolution is low, or more precisely whose resolution drops significantly as the TC increases, is that the search optimizations become too specialized for fishtest conditions and lose value.

High-resolution positions pose a difficult puzzle, so it is much more beneficial to tune for them. In other words, you would rather maximize the ability to avoid crucial mistakes with less time than make fewer mistakes on average when using low time.

So lowering the LTC works against this, while high resolution helps.

Being generic is of course also an asset for a book, helping the optimization to be more generally applicable.
So, even though we don't tune eval, concerns about an overly specialized maximum-resolution book like Drawkiller are understandable. But even if we don't want to research the general applicability of tuning on Drawkiller and prefer to play it safe, why not aim for both assets, generality and resolution, simply by using normal positions?

Having the whole book at the same line length reduces generality. Why not construct a variable-length book by taking the highest-resolution parts of all the normal books?

Also, I am very excited, as the importance of the scaling of resolution only just became clear to me. I will attempt to explain further with an example.

Suppose we have 2 books: one with slightly lower resolution that is more uniform across TCs, and a second with better resolution but a noticeable LTC drop, overlapping with Elo compression. Using the first will mean:

  1. The STC result will be more correlated with the LTC result, thus increasing confidence.
  2. The general optimization will scale better with time.
  3. The general optimization will be more uniform.

So I think it is valuable and cheap to measure how the resolution of books scales.

@NKONSTANTAKIS commented:

Would it make sense to use more than one book and adjust the normalized bounds in relation to their resolution too?
This way the target test length can be similar, cross-book performance comparable, and the target midpoint Elo different (lower for low resolution, higher for high).

Because I'm not so sure that normalized Elo plus adjusting the confidence accordingly achieves the same thing. That would mean that resolution is unimportant for selectability and economy, and only position coverage is essential, as @vondele hinted. This idea is interesting.

In other words, even with a 98% draw rate (as long as there are no openings which produce 100% draws and waste resources), the progress rate and test length could be kept stable just by reducing the win-loss demand. This is somewhat hard to believe, but if it is true to some extent, then lowering the LTC fully makes sense.

@Alayan-stk-2 commented:

> Another question is what we would like to achieve with a book change, and whether we could achieve it by other means as well (obviously with other side effects). In particular, SPRT bounds could be adjusted to make patches easier to pass (and moved to normalized Elo, so we can keep them constant for longer and more properly adjusted to the kind of test running), and maybe we could consider changing the TC (Elo differences are larger at lower TC as well). The latter point is maybe controversial, but we moved to 60s to get high quality of play... our 10s play now is miles better than our 60s play was when we introduced it.

I think the idea of shortening the LTC because the quality of play has increased a lot forgets that testing at a longer TC isn't primarily about having test games above some absolute chess strength.

While the observation that what works at short TC also often helps at longer TC has been key to generalizing the mass statistical testing that made a lot of engine progress possible, there is still a distortion between what is optimal for different TC/hardware budgets. The scaling of the search tune in #2260 was a stark reminder of this fact.

The use cases where Stockfish's strength matters most are those with deep searches: analysis, engine tournaments, etc. Shortening the TC would increase the distortion, and while Stockfish would still gain strength at all TCs, the testing process would statistically optimize less for the long searches that matter the most.

snicolet added the books label Feb 10, 2021
@Vizvezdenec (Contributor) commented:

Yes, there is a problem with having a shorter LTC.
The problem is that search behavior is in a lot of cases non-linear; I can name you like 5 different ideas that were gods of STC but utter trash at LTC.
Risk matrix, skipping fail-high resolve, LMR history, lowering the countermove history pruning threshold - this is what I get from 5 seconds of digging in my memory :)
If anything I would prefer maybe an even higher LTC but with a much lower draw rate. I don't know how a bias increase affects anything, tbh, so I'm not sure that, say, hybrid_book_beta will be good for it.

vdbergh (Contributor, Author) commented Jul 27, 2021

I added some more STC data.

If the framework ever becomes empty again, it might be a good idea to do some tests with a more recent pair of Stockfishes. Of course the normalized Elo values will be different, but one may hope that the relative ratios between the different books remain similar.

vondele (Member) commented Jul 27, 2021

@vdbergh I ran a comparable test on a new book very recently. Could you add it? https://tests.stockfishchess.org/tests/view/60fdc57ad8a6b65b2f3a79d8

vdbergh (Contributor, Author) commented Jul 27, 2021

@vondele

Done. The book is on par with 2moves_v1.pgn and noob_3moves.epd. It's a very small sample, but it seems plausible that at STC any "reasonable" book (not too many moves, good coverage) has essentially the same resolution, independently of draw ratio or bias, as long as these are not extreme. The Drawkiller book cheats by reducing coverage (the issue of castling vs. not castling cannot occur in a game started from a Drawkiller opening).

vdbergh (Contributor, Author) commented Jul 27, 2021

It could of course be that Elo compression does exist for the very high draw ratios currently seen in Fishtest (even when normalized Elo is used) and that we are simply not seeing it in these tests. This is not so easy to test, since one needs engines that are fairly close in strength, but then one needs a very large number of games to get sufficiently precise results.

vondele (Member) commented Jul 27, 2021

I'll do an LTC test for that new book as well. I'll start it once the current RT is done.

vdbergh (Contributor, Author) commented Aug 2, 2021

I added the LTC data for bjbraams_chessdb_198350_lines.epd to the table. This seems like an interesting book!

SFisGOD (Contributor) commented Aug 3, 2021

Let's make bjbraams_chessdb_198350_lines.epd the new default book! :)

vdbergh (Contributor, Author) commented Aug 4, 2021

To make theoretical predictions about the merits of biased books one needs a draw model, and the validity of the conclusions then depends on the validity of the model. That being said, assuming the BayesElo model one obtains the following picture.

https://hardy.uhasselt.be/Fishtest/sensitivity.png

Here the "draw ratio" is the one which would occur with equal engines with a perfectly balanced book (the noob_3moves book is a good approximation). The vertical axis is the "normalized Elo magnification" caused by the book.

Note: for biased books the approximate formula nElo = Elo/sqrt(1-d) is no longer correct. The correct formula is (pentanomial t-value) x (normalization constant). Since we are considering relative normalized Elo here, the normalization constant does not matter.
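
Spelled out as code, the recipe might look as follows. This is a sketch assuming the fishtest conventions that a game pair scores (0, 1/4, 1/2, 3/4, 1) and that the normalization constant is 800/ln(10) ≈ 347.44; with these choices the formula reduces to Elo/sqrt(1-d) in the balanced trinomial case.

```python
import math

def normalized_elo(pentanomial):
    """pentanomial: counts of game-pair outcomes with pair scores
    0, 1/4, 1/2, 3/4, 1 (i.e. LL, LD, WL+DD, WD, WW)."""
    total = sum(pentanomial)
    scores = (0.0, 0.25, 0.5, 0.75, 1.0)
    probs = [n / total for n in pentanomial]
    mean = sum(p * s for p, s in zip(probs, scores))
    var_pair = sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))
    sigma_per_game = math.sqrt(var_pair / 2)  # one pair = two games
    t = (mean - 0.5) / sigma_per_game         # pentanomial t-value
    return t * 800 / math.log(10)             # normalization constant
```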

EDIT: I tried to include the picture directly in this comment, but for some reason that does not work (even though I have done it many times in the past), so the link above will have to do. Perhaps it will start working magically in the future.

vdbergh (Contributor, Author) commented Aug 4, 2021

Here is another interesting picture. It says that the optimal bias occurs when the draw ratio is 50% (still assuming the BayesElo model). This is rather appealing from the viewpoint of maximum entropy (though I see no a priori reason why the entropy argument should be valid).

https://hardy.uhasselt.be/Fishtest/draw_ratio.png

The win and loss ratios are for the color that has the advantage. Since the openings are repeated with reversed colors, the total win and loss ratios will both be close to 25% at the "optimal bias" (assuming the engines have equal strength).

EDIT: The "entropy argument" is as follows. If one uses a sufficiently biased book, the outcome of a game becomes binary (win,draw for the color that has the advantage). Then one may guess that one can extract maximum information from a test if the two possible outcomes occur with approximately the same frequency. However I see no reason why this would be a valid argument.

I guess I could do a similar computation for the Davidson draw model.

EDIT2: I still need to do the computation for the Davidson model, but I am fairly convinced that the nice 50% optimal draw ratio is an artifact of the BayesElo model. Explanation: if d is the draw ratio and the book is biased enough that game outcomes may be assumed to be binary, then the pentanomial variance is proportional to d*(1-d), which achieves its maximum at d = 1/2 (in other words, the opposite of what one wants). So the only way to gain something is to overcome Elo compression (which affects the numerator of normalized Elo). But this depends on the Elo/draw model.
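
Spelling out that variance claim (my algebra, under the stated binary-outcome assumption): if the advantaged color scores $X = 1$ (win) with probability $1-d$ and $X = \tfrac{1}{2}$ (draw) with probability $d$, then

$$\mathbb{E}X = 1 - \frac{d}{2}, \qquad \operatorname{Var}(X) = (1-d) + \frac{d}{4} - \left(1 - \frac{d}{2}\right)^2 = \frac{d(1-d)}{4},$$

which is indeed proportional to $d(1-d)$ and maximal at $d = \tfrac{1}{2}$.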

@NKONSTANTAKIS commented:

From my observations in practice, I would say that 50% is not optimal, but 60%-80% is. The optimal number will surely differ for different engine pairs and different books.

@vdbergh Regarding maximum information extraction I have the following proposition, though it is unclear to me whether it is fully sufficient or only partially viable: 50% could be optimal in the case where the average game length of decisive and drawn games is the same.

But of course drawn games are expected to last longer on average than decisive ones, so they pose more tests to the engines, albeit easier ones, and hence yield more information.
An extreme example of the opposite case is a position that requires just a single good move to be found, where this move is found 50% of the time. That 50% derives from a very good test, but just a single one.
Early forced draws by repetition, perpetual check, etc. fall into this category.

So, to sum up, the sustained complexity (unclear outcome) of positions is also important, along with the draw rate. We would prefer positions that resolve into a draw/win only in long games over positions that resolve quickly.

@NKONSTANTAKIS commented:

Regarding bjbraams_chessdb_198350_lines.epd, I find it most impressive that it shows the smallest decline in Elo spread from STC to LTC, indicating sustained uncertainty (as opposed to increasing determinism with TC). This asset will surely be even more important at higher draw rates, with stronger engines, and with engine pairs closer in strength than the ones tested here.

I too think it would be beneficial to start using it, in order to also observe its behavior under target conditions.

vdbergh (Contributor, Author) commented Aug 5, 2021

I did the computation for the Davidson model

https://hardy.uhasselt.be/Fishtest/Paired%20Comparisons%20with%20Ties_%20Modeling%20Game%20Outcomes%20in%20Chess.pdf

and the results are very similar to those for the BayesElo model (of course these are just models; they were mainly selected for their mathematical elegance, and it is not clear how well they agree with reality, especially under extreme conditions).

https://hardy.uhasselt.be/Fishtest/draw_ratio_davidson.png

The optimal draw ratio is still around 50%. The gain obtainable from using biased books is still huge.

vdbergh (Contributor, Author) commented Aug 9, 2021

Interesting. It seems that any model in which the win ratio is given by f(elo + bias - draw_elo), where f is some (sufficiently nice) increasing function satisfying f(elo) = 1 - f(-elo), has the property that for large draw_elo the optimal bias is bias = draw_elo, which corresponds to a 50% draw ratio between engines of equal strength.

This assumption holds on the nose for Glenn-David and BayesElo.

It also holds asymptotically for Davidson, as one may check that BayesElo (officially Rao-Kupper) and Davidson are equivalent for large biases.
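
For concreteness, here is the BayesElo (Rao-Kupper) instance of this family, in my notation with Elo difference $\Delta$, bias $b$ and draw Elo $\theta$:

$$f(x) = \frac{1}{1 + 10^{-x/400}}, \qquad P(\text{win}) = f(\Delta + b - \theta), \qquad P(\text{loss}) = f(-\Delta - b - \theta).$$

For equal engines ($\Delta = 0$) and large $\theta$ the loss probability $f(-b-\theta)$ is negligible, so the draw ratio is approximately $1 - f(b - \theta)$; since $f(0) = \tfrac{1}{2}$, this equals 50% exactly at the optimal bias $b = \theta$.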

vondele added a commit to vondele/fishtest that referenced this issue Aug 22, 2021
Switches to a new book by Stefan Pohl: UHO_XXL_+0.90_+1.19.epd

See
official-stockfish/books@63b0f20

See discussion and tests in
official-stockfish/Stockfish#3323
ppigazzini pushed a commit to official-stockfish/fishtest that referenced this issue Aug 22, 2021 (same commit as above).
vondele (Member) commented Aug 22, 2021

default book updated to UHO_XXL_+0.90_+1.19.epd

thanks all for book contributions and testing.

@NKONSTANTAKIS commented:

@vdbergh Since your computations for the Davidson and BayesElo models found a 50% draw rate to be optimal, and indeed that looks right and makes sense, but the lowered draw rate brings down normalized Elo, I would like to ask whether there is a clean (or easily approximated) way to compute the optimum for normalized Elo.

I also think it's a good idea to try this other extreme for some months, rapidly getting rid of the previous selection bias along the way, but ultimately IMO the golden balance most likely lies somewhere in the middle, around 70% (a DD to WL ratio around 2:1).

Anyway, this is for the future. Happy respins and have fun!

@NKONSTANTAKIS commented:

STCs seem to be taking a lot more games to resolve. Could we also have an STC closed strength measurement for a normalized Elo comparison? If the difference is indeed big, then some adjustments might be needed. For LTC the nElo is just 22% less (minus selection bias), so no worries.

@Vizvezdenec (Contributor) commented:

STCs are not taking a lot more games to resolve.
It's an illusion created by the fact that most currently running STCs are respins of patches that previously passed STC, so they don't fail as fast as the average fishtest patch.

Sopel97 (Member) commented Aug 23, 2021

Regarding the bias in the books... could this be countered by mirroring every other position (+ flipping the side to move)?
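
A sketch of what this mirroring could look like, using the third-party python-chess library; to my understanding its `Board.mirror()` flips the board vertically and swaps piece colors, castling rights and the side to move, which is the transformation suggested here:

```python
import chess

def debias(fens):
    """Yield the book with every second position color-mirrored, so the
    average white/black bias over the whole book cancels."""
    for i, fen in enumerate(fens):
        board = chess.Board(fen)
        yield (board.mirror() if i % 2 else board).fen()
```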

NKONSTANTAKIS commented Aug 24, 2021

STCs definitely take much longer than before, mainly due to the lowered nElo spread caused by the very low draw rate. It has an effect similar to narrowing the bound width. The respins visually exaggerate the effect. Whenever a drastic change happens, all previous statistics change, and the bounds need to be recalibrated or at least measured. https://tests.stockfishchess.org/tests/view/612286dccb26afb615a5e769 took 400K games to resolve. It is a similar situation to HCE tests sharing bounds selected for LTC NNUE, which often resolved extremely slowly; I recall a fauzi tune unresolved after 600K games. That was due to the lower draw rate of HCE, hence the introduction of nElo for universal bounds.

The normalized Elo calculation in SPRTs demands a certain nElo confidence one way or the other for completion, so a lower nElo spread simply requires more games for the same nElo confidence.

NKONSTANTAKIS commented Aug 24, 2021

The testing data is here, with 2 data points missing. Normalized Elo:

| | noob_3moves | UHO_0.9 |
| --- | --- | --- |
| Wide Elo gap, LTC | 97 | 118 |
| Wide Elo gap, STC | 114.2 | 112.3 |
| Narrow Elo gap, LTC | 32.85 | 25.62 |
| Narrow Elo gap, STC | ? | ? |

We see that:

  1. The narrow Elo gap brings down the performance of UHO_0.9 compared to 3moves.
  2. STC performance was lower at the wide Elo gap.

By extrapolation, the UHO_0.9 STC nElo spread is expected to be much lower, but it is really unclear by how much, so the missing data is crucial.

We already know that:

  1. Borderline openings bring a huge Elo spread boost at very high thread counts/TC (vondele).
  2. As TC increases, balanced books lose value.

So we can conceive that:

  1. STC instability pairs dubiously with a critical book, as a single mistake can decide the outcome. @vdbergh has colorfully described this phenomenon as "rolling the dice at a given moment".
  2. A balanced book increases the meaningfulness of STC by providing a simpler task to the engines, one that checks their long-lasting ability not to blunder.

Time and tests will show to what extent, but it is clear that the performance of books is TC dependent.

vdbergh (Contributor, Author) commented Aug 24, 2021

The STC tests should not take more time, since the SPRT bounds are expressed in normalized Elo. The worst-case expected duration is still 116K games.

The right tail of the duration distribution falls off more slowly than for the normal distribution (the falloff is exponential). It follows that very long tests will happen more frequently than one would expect under a normal distribution. But the average duration should stay below 116K games.

EDIT: I computed the average duration of the last 26 finished tests

[79184,166488,179504,34680,46760,48064,17272,178160,66728,85056,71520,112528,50120,11120,72880,15728,199104,25144,396448,34608,34608,49464,31384,21792,68704,54328]

and got 82745, with a 95% confidence interval (determined by resampling) of [55017, 117572]. I'd say there is nothing to worry about.
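
For reference, a sketch of the resampling computation described above (a plain bootstrap over the listed durations; exact endpoints vary with the random seed):

```python
import random
import statistics

durations = [79184, 166488, 179504, 34680, 46760, 48064, 17272, 178160,
             66728, 85056, 71520, 112528, 50120, 11120, 72880, 15728,
             199104, 25144, 396448, 34608, 34608, 49464, 31384, 21792,
             68704, 54328]

print(statistics.mean(durations))  # ~82745, the average quoted above

# Bootstrap: resample with replacement, collect means, take the outer quantiles.
means = sorted(
    statistics.mean(random.choices(durations, k=len(durations)))
    for _ in range(10_000)
)
print(means[250], means[9750])  # approximate 95% confidence interval
```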

vdbergh (Contributor, Author) commented Aug 24, 2021

> Regarding the bias in the books... could this be countered by mirroring every other position (+ flipping the side to move)?

Do I understand correctly that, when applied to the starting position, this procedure would yield the starting position with black to move? I guess it depends on the NN training whether the net can handle such positions sensibly.

I actually understand why in Stefan Pohl's books the bias is nearly constant from the white point of view: it allows one to use Ordo or BayesElo. These programs use models with a "WhiteAdvantage" parameter, and the models assume that the white advantage is constant. If this assumption does not hold, the programs give incorrect results.

The pentanomial statistics (for head-to-head matches), on the other hand, do not depend on any model, so there is no constraint on the bias of the individual positions.

Sopel97 (Member) commented Aug 24, 2021

> Do I understand correctly that, when applied to the starting position, this procedure would yield the starting position with black to move? I guess it depends on the NN training whether the net can handle such positions sensibly.

The net would see that position exactly as it sees the start position; the transformation I described is invariant from the perspective of the net. If anything would change, it would be from search, or perhaps classical eval; and also, hackily, the bias could be brought to 0.

vdbergh (Contributor, Author) commented Aug 24, 2021

> > Do I understand correctly that, when applied to the starting position, this procedure would yield the starting position with black to move? I guess it depends on the NN training whether the net can handle such positions sensibly.
>
> The net would see that position exactly as it sees the start position; the transformation I described is invariant from the perspective of the net. If anything would change, it would be from search; and also, hackily, the bias could be brought to 0.

I guess you mean that the average bias would be zero; I agree with that. Note that the RMS bias would not change (this is the square root of the average of the squared biases), so one would still need pentanomial statistics.

NKONSTANTAKIS commented Aug 24, 2021

@vdbergh Thanks for the reassuring response. So could we say that the benefit of book performance at given bounds exclusively affects the accuracy of resolution? If so, would that mean that wider bounds with a better-performing book could provide equal confidence with fewer games on average? I was under the impression that given bounds provide equal confidence upon resolution, with varying productivity. (Book performance measured in nElo.)

vdbergh (Contributor, Author) commented Aug 24, 2021

@NKONSTANTAKIS

With the current setup every test receives a budget of the order of 100K games (independently of anything). How effectively this budget is used should depend on the book.

Before the bounds were expressed in normalized Elo, every change of testing conditions required an adjustment of the bounds, which was a painful process.

vdbergh (Contributor, Author) commented Sep 2, 2021

A 10% time-odds test (60+0.6 vs 66+0.66).

| Name | Draw ratio | Bias (Elo) | Elo | Normalized Elo | Test |
| --- | --- | --- | --- | --- | --- |
| UHO_XXL_+0.90_+1.19.epd (LTC) | 0.50 | 212 | 10.94 [9.68, 12.21] | 24.05 [21.26, 26.84] | Test |
| noob_3moves.epd (LTC) | 0.95 | 18 | 3.39 [2.78, 4.00] | 15.46 [12.72, 18.20] | Test |
| 8moves_v3.pgn | 0.91 | 54 | 3.94 [3.21, 4.68] | 14.97 [12.21, 17.74] | Test |

The nElo ratio is 24.05/15.46 = 1.56, with an (approximate) confidence interval of [1.22, 1.89].
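
One way such an approximate interval can be obtained (my reconstruction, via first-order error propagation, assuming independent and roughly Gaussian nElo estimates; it reproduces the [1.22, 1.89] above from the table's intervals):

```python
import math

def ratio_ci(a, a_lo, a_hi, b, b_lo, b_hi, z=1.96):
    """Approximate 95% CI for a/b, given 95% CIs [a_lo, a_hi] and [b_lo, b_hi]."""
    sa = (a_hi - a_lo) / (2 * z)         # stdev implied by each interval
    sb = (b_hi - b_lo) / (2 * z)
    r = a / b
    sr = r * math.hypot(sa / a, sb / b)  # delta-method error propagation
    return r, (r - z * sr, r + z * sr)

print(ratio_ci(24.05, 21.26, 26.84, 15.46, 12.72, 18.20))
# -> approximately (1.56, (1.23, 1.89))
```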

This result is quite different from #3323 (comment). It seems likely that the latter test suffered from selection bias.

EDIT: Added results for 8moves_v3.pgn to get a conversion factor between UHO Elo and RT Elo.

NKONSTANTAKIS commented Sep 3, 2021

An impressively contrasting result, indicating a high deviation of optimization across these very different position subsets.
I find it valuable to research the deviation of optimization across TCs too, as I reckon it is quite possible that the critical nature of the positions has weakened the STC-LTC correlation. But I am not sure how; any ideas?

A great result regarding UHO productivity, as all testing so far points to great scaling with TC (resistance to Elo compression), indicating a low number of deterministic results (discovered forced wins/draws) with higher-depth searches.
This does not necessarily mean a weak STC performance, but it is a concern.
An STC time-odds comparison would confirm that UHO starts from a positive standpoint.

Regarding LTC confidence, a simple calculation of the Elo to nElo ratios in the TC-odds test gives us a 2.07x performance! (*) Assessing the combined STC+LTC confidence is another reason for an STC time-odds test.

  • (*) As between equal Elo the higher nElo would be superior, and between equal nElo the higher Elo, I find this in-between, admittedly arbitrary metric more practically indicative than both the 1.56x nElo and the 3.23x Elo figures.

vondele (Member) commented Nov 8, 2021

We have been using the new book now for somewhat more than 2 months.

I've started 3 tests with the old noob_3moves.epd, the new UHO_XXL_+0.90_+1.19.epd, and the RT 8moves_v3.pgn books, on the patches committed to master since Aug 22nd (SPRT-tested on the UHO book).

https://tests.stockfishchess.org/tests/view/61891ae3d7a085ad008ef282
https://tests.stockfishchess.org/tests/view/61891ac2d7a085ad008ef280
https://tests.stockfishchess.org/tests/view/61891b0ad7a085ad008ef286

These tests were triggered by the question whether progress on the new book and progress on the old book are comparable.

vdbergh (Contributor, Author) commented Nov 9, 2021

Thanks. I will put the results in a table tonight. However, to be able to compare with #3323 (comment), it might be useful to also run the tests at LTC. That way we can see whether it is the UHO book that benefited from selection bias this time.

vdbergh (Contributor, Author) commented Nov 9, 2021

A less costly alternative is to redo #3323 (comment) at STC. Maybe I'll just do that. Then we also get some new information about scaling with respect to TC.

bjbraams commented Nov 24, 2021

NKONSTANTAKIS or vdbergh, can you clarify "50% UHO extra benefit"? By what measure? The measure that makes sense to me is how many games are needed (all else equal) to decide the sign (only the sign) of the Elo difference between two instances of SF, within some agreed level of uncertainty.

@NKONSTANTAKIS commented:

By comparing the 3 tests on the 3 different books of the same reference version, with the Elo spread created via time odds (so as not to be affected by the selection bias of UHO-specific optimization), against the 3 tests of dev versus the old reference version.

We see that the 2 balanced books have almost the same performance, while on the UHO book that we optimize for we get 23.4 vs 14.5.

So, taking everything into account, it is very near 50%. This is after ~3 months of patches on UHO and is expected to increase.

@NKONSTANTAKIS commented:

In other words, dev has become stronger than the last version before the book swap by as much as if that version had been given 10% more time against itself at STC on balanced books, and 50% more on UHO.

This means that performance on balanced books also benefited from the UHO progress, taking 2/3 of the gains.

vondele (Member) commented Feb 20, 2022

I will close this issue. We have quite successfully adopted UHO as the training book. We did find a recent case where there is quite a difference between the Elo measured on the 8moves book and on the UHO book (#3937), but overall the experience is good. Several major tournaments have adopted UHO-style books as well.

We can reopen the issue at a later point if useful.

vondele closed this as completed Feb 20, 2022
vdbergh (Contributor, Author) commented Feb 20, 2022

What these experiments have taught us is that book tests based on prior commits are completely useless because of selection bias. The only objective book tests are time-odds tests, like this one: #3323 (comment)

vondele (Member) commented Jul 7, 2022

@vdbergh I'll link to your test of the DFRC book: https://tests.stockfishchess.org/tests/view/62c551fe50dcbecf5fc0cfe8 - let us know what your interpretation of those results is.
