
[RFC] Provide WDL statistics #2778

Closed
wants to merge 6 commits

Conversation

@vondele (Member) commented Jun 28, 2020

A number of engines, GUIs, and tournaments have started to report WDL estimates
alongside, or instead of, scores. This patch enables reporting those statistics
in a more or less standard way (http://www.talkchess.com/forum3/viewtopic.php?t=72140):

info depth 59 seldepth 78 multipv 1 score cp 80 wdl 313 675 12 lowerbound nodes 3568397124 nps 2847848 hashfull 1000 tbhits 0 time 1253015 pv f3h4
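On the receiving side, extracting the three per-mille values from such a line is a simple token scan. This is a hypothetical helper (parse_wdl is not part of the patch), relying on the UCI convention that a GUI skips tokens it does not understand:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the patch): scan a UCI "info" line and,
// if a "wdl" token is present, read the three per-mille values that follow.
// Per UCI conventions, unknown tokens are simply skipped.
bool parse_wdl(const std::string& info, int& w, int& d, int& l) {
    std::istringstream iss(info);
    std::string token;
    while (iss >> token)
        if (token == "wdl")
            return static_cast<bool>(iss >> w >> d >> l);
    return false;  // no wdl triple on this line
}
```

Applied to the line above, this would yield w = 313, d = 675, l = 12; the trailing "lowerbound ... pv" tokens are unaffected.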

The model this reporting uses is based on data derived from a few million fishtest LTC games:
given a score and a game ply, it provides a win rate that matches that data rather closely,
especially in the intermediate range [0.05, 0.95]. Some of the data is shown at
https://github.com/glinscott/fishtest/wiki/UsefulData#win-loss-draw-statistics-of-ltc-games-on-fishtest
Making the conversion game-ply dependent is important for a good fit, and is in line
with the experience that a +1 score in the early middlegame is more likely a win than the same score in the late endgame.
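The shape of such a model can be sketched as a logistic in the score whose two parameters are polynomials in the game ply. The polynomial coefficients below are illustrative placeholders, not the fitted values from the patch:

```cpp
#include <algorithm>
#include <cmath>

// Sketch of a ply-dependent logistic win-rate model in the spirit of the
// patch: win rate (per mille) = 1000 / (1 + exp((a(ply) - score) / b(ply))),
// where a(ply) is the score at which the win rate crosses 50% and b(ply) is
// the spread of the logistic. Both are low-order polynomials in the capped,
// rescaled ply. The coefficients here are illustrative placeholders only.
int win_rate_model(int score_cp, int ply) {
    double m = std::min(ply, 240) / 64.0;  // cap the ply, rescale for the fit

    double a = ((-8.0 * m + 64.0) * m - 95.0) * m + 153.0;  // 50% crossover (cp)
    double b = ((-3.0 * m + 28.0) * m - 56.0) * m + 72.0;   // logistic spread

    // Large scores (including TB and mate scores) are clamped to +-1000 cp.
    double x = std::clamp(double(score_cp), -1000.0, 1000.0);

    // Round to nearest; the result is a win probability in [0, 1000] per mille.
    return int(0.5 + 1000 / (1 + std::exp((a - x) / b)));
}
```

A loss rate then follows by symmetry as win_rate_model(-score_cp, ply), and the draw rate is 1000 minus the two.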

Even when enabled, the printing of the info causes no significant overhead.

Passed STC:
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 197112 W: 37226 L: 37347 D: 122539
Ptnml(0-2): 2591, 21025, 51464, 20866, 2610
https://tests.stockfishchess.org/tests/view/5ef79ef4f993893290cc146b

The PR is for discussion, the comments will be updated with a comparison of model and data.

Bench: 4789930

@vondele (Member, Author) commented Jun 28, 2020

This is the data derived from fishtest, showing win rate as a function of score and move number:

[image: win_rate_data]

This is the corresponding model result:
[image: win_rate_model]

@joergoster (Contributor):

Hmm, I didn't know that we have to provide 3 values. There is quite some computational effort involved now.

I would move this into the UCI namespace, maybe UCI::wdl(), similar to UCI::value().
How do you handle TB wins/mates? At least mate scores should be given as 1000 0 0, right?

@vondele (Member, Author) commented Jun 28, 2020

Yes, it could be moved to the UCI namespace.

The computational effort is there, but this is only computed once per depth, so it is negligible at higher depths.

All large scores (including TB and mate scores) get clamped to +-1000 cp, which should yield 1000 0 0.

@vondele (Member, Author) commented Jun 28, 2020

To make it a little easier to appreciate the fit between data and model, rather than a 2D contour plot, here is a comparison of the win rate based on data and on the model, taken at move 30:

[image: data_vs_model_at_move_30]

@vondele (Member, Author) commented Jun 28, 2020

Finally, here is the polynomial fit of a and b, which captures the move-number dependence of these parameters:

[image: Fit_as_bs]

@snicolet (Member) commented Jun 28, 2020

Can we try to run alpha-beta on the win rate instead of the usual eval? Or transform the win rate back to a usual eval value (implicitly taking the game ply into account) and then use our alpha-beta?

@snicolet (Member):

Very nice graphics, by the way. I wish we all had the skills to produce more graphs like that to ease the discussions!

@vondele (Member, Author) commented Jun 28, 2020

Yes, I think this data should be exploitable, since it suggests that for two positions with similar eval one should prefer the one closer to the root, but the current full form (exp, etc.) is too expensive to call in eval.

(Edit: that's basically the curve that fits 'a' in the above graph, i.e. the score needed to have a 50% win rate, as a function of move number.)

Edit2: The transformation can be done with exponentials; it is the two curves as / bs above that are needed.

I had a few tries a couple of days ago with linearized forms, see e.g. https://tests.stockfishchess.org/tests/view/5ee92722563bc7aa755ffc48, but that was just 'adjustment by hand', I didn't have the formulae yet.

Note that the data is for the evaluation after search; the positions leading to this eval will be quite a bit deeper.

:-) I'm learning to use matplotlib...

@AlexandreMasta commented Jun 28, 2020

Very nice graphics, but why nerf the engine for information that can be calculated by the GUI, as TCEC is doing?

I don't think this kind of information is worth any nerf to the engine. The engine already gives you the centipawn score.

We need Elo patches, not nerf patches.

Congrats on the graphs and code anyway.

@vondele (Member, Author) commented Jun 28, 2020

@AlexandreMasta it takes a couple of weeks of data collection and analysis to do that for one engine. It is unlikely that GUIs (or TCEC) can put in that effort, and doing it generically would lead to pretty different results. So far I have not seen any approach that takes the move number into account, either. But yes, in principle it could be upstreamed to all GUIs. That's one reason I put this up for discussion.

@AlexandreMasta:

I know you put in lots of work. I'm sure everybody here is grateful for it; it is very nice indeed. I'm just saying I don't think this should be enabled by default, weakening the engine for this info. Making it a turn-on/turn-off feature in the UCI options would be much better.

So... it is nice, but make it so the user can turn it off to get the best performance out of the engine.

Congrats!

@vondele (Member, Author) commented Jun 28, 2020

Have a look at the patch... the option is there. And it doesn't make the engine weaker, as the testing shows (and as is quite obvious from the code as well).

@AlexandreMasta commented Jun 28, 2020

Have a look at the patch... the option is there. And it doesn't make the engine weaker, as the testing shows (and as is quite obvious from the code as well).

Very nice then! Maybe just turn it off by default. My sincere congratulations! I really thought this would be implemented as always ON. Nice job. With your lead SF is achieving great results!

Cheers

@Alayan-stk-2:

That's very nice data, but one big disclaimer with all WDL approaches is that the draw ratio and the "weak side win" ratio are extremely TC/hardware dependent. Even with the work to get a nice fit from fishtest data, things can change drastically under different test conditions.

Some other limitations:

  • Fishtest's adjudication might skew data close to 400cp.

  • Position-counting bias: games that are won late will see an increasing eval, so the eval won't stay between, say, 300cp and 400cp for many moves. However, a drawn game with a high static eval could stay inside this band for dozens of plies. So WDL statistics derived from the proportion of positions will inflate the draw probability a lot.

@vondele (Member, Author) commented Jun 28, 2020

@Alayan-stk-2 yes, I wanted to point that out. There is likely a TC dependence that is not captured in the model. I wonder, however, whether a different TC means the scores themselves are different, or the effect of a given score is different. That's not entirely clear.

Yes, adjudication plays a role. The model only retains data with eval < 400cp. There is some effect of adjudication on the win rate, but mostly outside the interval [0.05, 0.95], as I mentioned.

The counting 'bias' is basically related to the definition. The model is correct for its definition (i.e. picking one position and one eval). Other definitions lead to other graphs.

@Alayan-stk-2:

@Alayan-stk-2 yes, I wanted to point that out. There is likely a TC dependence that is not captured in the model. I wonder, however, whether a different TC means the scores themselves are different, or the effect of a given score is different. That's not entirely clear.

I think it's both. When a position is clearly winning, the eval is higher at longer TC, as the engine can search deeper. In positions that are not clearly winning, the eval gets more reliable with more time, especially 0.00 draws.

But blunders and mistakes are much more likely at shorter TCs, so even if short and long searches happen to give the same eval, the WDL will differ. The simplest example is 0.00, as this eval can happen at any TC. Under TCEC conditions, a 0.00 is a big statement: the draw probability is really high. At bullet, the expected 3-fold is likely to be avoidable, and a blunder later in the game is frequent.

By the way, another issue with WDL is how to rate 0.00. It is a symmetrical eval, but in most positions evaluated as such, unless they are 100% draws, one side has a much bigger win probability than the other... We can't easily extract this information from SF.

The counting 'bias' is basically related to the definition. The model is correct for its definition (i.e. picking one position and one eval). Other definitions lead to other graphs.

Let's take a simplified dataset to illustrate why the counting bias leads to results going against expectations.

1 game with a static fortress at +3 eval from move 50 to move 98. This game is drawn.

49 games with a +3 eval at one of the moves 50, 51, ... 98, each for one move only.

98% of the games in the dataset that reached a +3 eval between moves 50 and 98 were won. So if a +3 happens in a game, the best prediction is a 98% win.

But 50% of the positions with a +3 eval between moves 50 and 98 occurred in that one fortress game, so the position-based model gives 50% win, 50% draw for +3.
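The arithmetic behind those two numbers can be checked with a tiny sketch (hypothetical helper names, counting the toy dataset above):

```cpp
// Toy dataset from the comment: 49 decisive games show +3 for one move each,
// and one drawn fortress game shows +3 for 49 consecutive moves (moves 50..98).

// Game-based estimate: of the 50 games that ever reached +3, 49 were won.
double game_based_win_rate() {
    const int wonGames = 49, drawnGames = 1;
    return double(wonGames) / (wonGames + drawnGames);  // 49/50 = 0.98
}

// Position-based estimate: of the 98 recorded +3 positions, only the 49 from
// decisive games belong to wins; the other 49 all come from the single draw.
double position_based_win_rate() {
    const int positionsFromWins = 49 * 1;  // one +3 position per won game
    const int positionsFromDraw = 49;      // 49 +3 positions in one drawn game
    return double(positionsFromWins) / (positionsFromWins + positionsFromDraw);  // 0.5
}
```

The same eval thus maps to a 98% win prediction under game counting but only 50% under position counting, which is the bias being described.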

@nguyenpham (Contributor):

Good idea. I think chess GUIs could calculate WDL, but the data surely can't fit all engines, so it is best if each engine does it itself.

Just a small suggestion: when working with float/double variables, we may see a negative zero (-0), which compares equal to zero and causes no problem as long as we keep the number as a float/double. However, after rounding and multiplying by a large number to convert it into an integer in the range 0-1000, the number may become a small negative integer (such as -1). That looks weird and raises questions from users. This problem is sometimes observed with Lc0.

We can fix it in a simple way by using std::max:

int win_rate_model(int ply, Value v) {
  ...
  // clamp to zero so rounding can never yield a small negative integer
  return std::max(0, int(0.5 + 1000 / (1 + std::exp((a - x) / b))));
}

and similarly for the derived draw count:

  int wdl_d = std::max(0, 1000 - wdl_w - wdl_l);

@vondele (Member, Author) commented Jun 29, 2020

@nguyenpham I think that rounding error should not happen; the clamping of the eval plus the round-to-nearest already take care of it.

@nguyenpham (Contributor) commented Jun 29, 2020

w, d, and l can be zero, so intermediate values inside win_rate_model can be zero too and may be stored as negative zeros. It is only safe if the current data, as well as any future data used to create the formula, keep all values of w, d, l far from zero.

BTW, that won't happen frequently (zero is a rare case anyway) and it is not a serious issue. You may go ahead and come back to fix it later if we see the problem happen ;)

@gonzalezjo:

This and your visualizations are fantastic.

@hazzl commented Jun 29, 2020

How about limiting the output to depths >7 (for example) to alleviate the cost?

@vondele (Member, Author) commented Jun 29, 2020

The computational cost is really overestimated here. The full WDL calculation takes about 60 cycles and runs at 42614851 calls per second, so it adds about a microsecond of overhead per move at TCEC conditions.

Meanwhile, I have also verified that over the full valid input range of plies and evals there is no case where the WDL falls outside the [0, 1000] interval.

@snicolet (Member) commented Jun 29, 2020

@vondele not a comment about the PR per se, but do you have a similar model and similar graphs using material left on the board (using piece count, or something very simple like Q=9 R=5 B=N=3 P=1, or even pos.non_pawn_material())?

If the violet -> yellow gradient were purely vertical, or even parabolic, then we could probably quite easily correct our evaluation function to get comparable winning probabilities for openings and endgames, and that could turn into an Elo gainer.

@vondele (Member, Author) commented Jun 29, 2020

Yes, that's also already available... but it has not been analysed into a full model; it is more complex in the interesting part:
https://github.com/glinscott/fishtest/wiki/UsefulData#for-positions-grouped-by-material-value-summing-pieces-using-values-1-3-3-5-9

@snicolet (Member):

@vondele
Oh, that's cool. The following graph in particular shows that we may gain a lot by recalibrating the evaluation function when material is 22 pawns or less: http://cassio.free.fr/stockfish/win-probability-by-score-and-material.jpg , to avoid the diagonal region at the bottom (this region is the reason why irreversible moves and endgames are often too attractive for SF at the moment).

@vondele (Member, Author) commented Jun 29, 2020

@snicolet patches welcome ;-)

However, one has to be careful with the interpretation of the graphs, i.e. the 'position bias' argument by Alayan. The graphs tell you the probability that a random position in fishtest games is won, for a given material count and score. Won endgames with few pieces are quickly over, while drawn endgames drag on, and this aspect is part of the graph. Things might look very different for random positions encountered in a typical endgame search. Nevertheless, getting some ideas for patches was the main purpose of these graphs :-), and the more people are intrigued by them, the better.

@snicolet (Member):

I have pushed some tests, for instance:
https://tests.stockfishchess.org/tests/view/5efa0d76020eec13834a9736

snicolet referenced this pull request in snicolet/Stockfish Jun 29, 2020
@AlexandreMasta:

Tests will show!

Awesome work and findings to come.

@MichaelB7 (Contributor) commented Jun 30, 2020

Just a note of thanks to everyone who helped pull this together, and especially to @vondele for the lead he has taken to make this happen. This is a very desirable enhancement. +1 seems clearly inadequate, so let's make it a +1000 😊

Edit: FWIW, both cutechess and the Fritz GUI show the WDL data, which is excellent; they were enhanced to show that data for Lc0 and other NN engines.

@vondele (Member, Author) commented Jun 30, 2020

@MichaelB7 thanks for confirming it works in some GUIs. I'll merge it in the following round.

vondele added the "to be merged (Will be merged shortly)" label and removed the "discussion needed" label on Jun 30, 2020
@snicolet (Member) commented Jun 30, 2020

@vondele

Using doubles instead of floats is very slightly faster for me (for two parallel runs of bench 20).

I have implemented that and corrected some typos in the comments and readme in this commit:
snicolet/Stockfish@b7ccd21
https://github.com/snicolet/Stockfish/commit/b7ccd21605c6acf045d951e39e846cda3f423a65.diff

@vondele (Member, Author) commented Jun 30, 2020

@snicolet thanks for the careful review... I clearly need to pay more attention when writing comments.

The speed difference won't be measurable with a bench, but using doubles is somewhat more consistent with the rest of the code.

@snicolet (Member):

@vondele There were also some typos in the readme :-)

    If enabled, show approximate WDL statistics as part of the engine output.
    These WDL numbers model expected game outcomes for a given evaluation and
    game ply as obtained during fishtest LTC games.

vondele closed this in commit 1100688 on Jul 1, 2020
@vondele (Member, Author) commented Jul 1, 2020

Thanks for the discussion and comments.

@Coolchessguykevin:

@vondele The Aquarium GUI does not seem to support this feature, and I see the new UCI parameter is set to ON by default.

Should Aquarium users turn it off? Or will the output PV analysis be unaffected even with the option turned on?

Thanks!

@vondele (Member, Author) commented Jul 1, 2020

@Coolchessguykevin in principle, a UCI-compliant GUI should have no problem with it, i.e. it will just ignore the additional output. Nothing should be affected by it. However, if a GUI fails to deal with this output (e.g. crashes), a user can switch it off until the GUI is fixed.

vondele added a commit to vondele/Stockfish that referenced this pull request Jun 23, 2021
This updates the WDL model based on the LTC statistics in June this year (10M games),
so from pre-NNUE to NNUE based results.

(for old results see, official-stockfish#2778)

As before the fit by the model to the data is quite good.

No functional change
vondele added a commit to vondele/Stockfish that referenced this pull request Jun 28, 2021
This updates the WDL model based on the LTC statistics in June this year (10M games),
so from pre-NNUE to NNUE based results.

(for old results see, official-stockfish#2778)

As before the fit by the model to the data is quite good.

closes official-stockfish#3582

No functional change
MichaelB7 pushed a commit to MichaelB7/Stockfish that referenced this pull request Jul 6, 2021
This updates the WDL model based on the LTC statistics in June this year (10M games),
so from pre-NNUE to NNUE based results.

(for old results see, official-stockfish#2778)

As before the fit by the model to the data is quite good.

closes official-stockfish#3582

No functional change
vondele added a commit to vondele/Stockfish that referenced this pull request Apr 15, 2022
This updates the WDL model based on the LTC statistics for the last month (8M games).

for old results see:
official-stockfish#3582
official-stockfish#2778

the model changed a bit from the past, some images to follow in the PR

No functional change.
vondele added a commit to vondele/Stockfish that referenced this pull request Apr 16, 2022
This updates the WDL model based on the LTC statistics for the last month (8M games).

for old results see:
official-stockfish#3582
official-stockfish#2778

the model changed a bit from the past, some images to follow in the PR

closes official-stockfish#3981

No functional change.
vondele added a commit to vondele/Stockfish that referenced this pull request Aug 6, 2022
This updates the WDL model based on the LTC statistics for the two weeks (3M games).

for old results see:

official-stockfish#3981
official-stockfish#3582
official-stockfish#2778

closes official-stockfish#4115

No functional change.
PikaCat-OuO pushed a commit to official-pikafish/Pikafish that referenced this pull request Oct 7, 2022
This updates the WDL model based on the LTC statistics for the two weeks (3M games).

for old results see:

official-stockfish/Stockfish#3981
official-stockfish/Stockfish#3582
official-stockfish/Stockfish#2778

closes official-stockfish/Stockfish#4115

No functional change.

(cherry picked from commit e639c45)
dav1312 pushed a commit to dav1312/Stockfish that referenced this pull request Oct 21, 2022
This updates the WDL model based on the LTC statistics for the last month (8M games).

for old results see:
official-stockfish#3582
official-stockfish#2778

the model changed a bit from the past, some images to follow in the PR

closes official-stockfish#3981

No functional change.
Joachim26 pushed a commit to Joachim26/StockfishNPS that referenced this pull request Nov 29, 2023
This updates the WDL model based on the LTC statistics for the last month (8M games).

for old results see:
official-stockfish#3582
official-stockfish#2778

the model changed a bit from the past, some images to follow in the PR

closes official-stockfish#3981

No functional change.