Normalize evaluation #4216
Seems like Probably the best way out is to remove the
This should also be applied to Stockfish/src/nnue/evaluate_nnue.cpp Line 250 in d09653d
eval command looks after such change.
@Sopel97 Good point. Maybe it's better to move this EDIT: indeed,
I moved Internal2Pawn to uci.h
uci.h is missing from the changes. Other than that I think it's good now.
I must be breaking my record for the most force pushes to a PR. Anyway, thanks for the feedback!
Here we go again :). Maybe replace (It's constexpr starting from C++23, hallelujah)
A suggestion for an alternative to the
Why win rate and not expected score? As a rule of thumb, 1 centipawn equals 1 Elo: https://www.chessprogramming.org/Pawn_Advantage,_Win_Percentage,_and_Elo. Edit: I was too quick. That reference also talks about win percentage and not expected score. I remembered it incorrectly. Edit2: This makes no sense. They say the win percentage is 50% if there is no pawn advantage. Perhaps ignoring draws?
Expected score and win rate are related (score = 0.5 * (1 + win_rate(eval) - win_rate(-eval))), so if we assume win_rate(-eval) = 0 (for large evals, this more or less holds), we see this results in an expected score of roughly 0.75 for a 100cp advantage. I guess there is also some confusion over terminology: win_rate as used in SF code is the probability of a win. Some use 'winning percentage' (like https://www.3dkingdoms.com/chess/elo.htm) to mean the match score (so all draws gives a winning percentage of 50). I could have a look at the pgns I have to see if there could be some other mapping.
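The relation quoted above can be checked with a minimal sketch. The function name `expected_score` is hypothetical and not part of the Stockfish code; the only assumption is that a draw counts as half a point.

```python
def expected_score(p_win: float, p_loss: float) -> float:
    """Expected match score when a win counts 1, a draw 0.5 and a loss 0."""
    p_draw = 1.0 - p_win - p_loss
    # Algebraically identical to 0.5 * (1 + p_win - p_loss).
    return p_win + 0.5 * p_draw


# 50% win probability with the loss probability assumed to be negligible,
# which is the large-eval regime described in the comment above.
print(expected_score(0.5, 0.0))  # 0.75
```

With equal win and loss probabilities the function returns 0.5 regardless of the draw rate, which is why a 0cp position always maps to an expected score of 50%.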
@vondele You are right. I read some of the discussion around that reference, and by win percentage they really mean match score. So 1cp = 1 Elo (according to the reference). This yields a 64% score for 100cp.
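The 64% figure follows from the standard logistic Elo formula. A minimal sketch, assuming the usual 400-point scale (the function name is hypothetical):

```python
def elo_expected_score(elo_diff: float) -> float:
    """Expected score for a player rated elo_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Under the "1cp = 1 Elo" rule of thumb, a 100cp advantage is treated
# as a 100 Elo advantage.
print(round(elo_expected_score(100), 2))  # 0.64
```

Compared with the win-rate convention, where 100cp is tied to a 50% win probability and hence an expected score near 0.75, the two proposals attach noticeably different meanings to the same number.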
I think equating 100cp to some win rate (in the SF code sense, i.e. probability of winning), 50% in this case, makes most sense from a practical perspective. Users will be interested in the odds of winning the game; expected outcome is strange to reason about IMHO (let alone Elo in this context).
@ddobbelaere The win rate for 0cp obviously depends strongly on the tc. Elo also depends on the tc, but I think less strongly so. For example, 0cp is always 0 Elo (50% expected score).
@vdbergh You are right. However, as @vondele mentioned, we are in the regime with a fixed relation between win probability and expected score (big advantage, no loss assumed). In this case, win probability is more relatable to a chess player I think; expected outcome only complicates things. And yes, 50% and 100cp are nice round numbers, that's a plus IMHO. It would be so nice to give our users this handhold: "SF 100cp is 50% win probability with (near) perfect play". Funnily enough, while the 0cp win rate depends on tc, I think the situation for 100cp is much more tricky, as the eval itself also depends on it (if tc goes to infinity, eval goes to zero or +/- infinity, loosely speaking). But this point (dependence on tc) maybe deserves more attention (e.g. what's the situation for STC?).
@ddobbelaere Chess players understand Elo very well. With the match score system, a 100cp advantage means that the opponent needs to be 100 Elo stronger to equalize. To me this seems very easy to understand.
@ddobbelaere The point I want to make is that the match score system also gives clear information in the case of unequal opponents. With the win rate system this is much less so.
To be clear: what I am proposing is
Edit: of course this is a conceptual description. The last two steps can be simplified to
I made a prototype implementation. See here: https://github.com/vdbergh/Stockfish/tree/objective_eval. Special scores are currently not treated separately, so this still needs a bit of work I guess.
Ok, I fixed a bunch of bugs and now treat mate scores specially.
My suggestions for this patch:
As I understand it, running the script after the PR should lead to a graph in which the vertical line between blue and turquoise (the frontier labelled "0.500") aligns with the vertical 100 score?
I personally like @vdbergh's suggestion (PR #4218) more than this PR (I changed my mind...). It is conceptually simpler and provides a non-linear relation between the internal value and the reported UCI score in centipawns. With this PR, the relation is by definition linear. The fact that it relates to the earlier referenced paper (and has an easy rule of thumb: "1cp means 1 Elo handicap") is also a plus IMHO. This way, more focus is put on the WDL model (and its derived cp value, now strictly defined) and less on some internal value.
it has also the disadvantages of its benefits: introducing another complexification and another mathematical model for what is only an esthetic problem. |
The scripts used to create those graphs are at https://github.com/vondele/WLD_model; they would need updating once we introduce the new normalization (I already have those changes locally and will push them if this is merged). The input is game PGNs from fishtest (typically millions of games are used), so it will take a few weeks before we can regenerate the graphs. Of course, this should result in 0.5 being near 100cp or move 32.
In ShashChess, I started from the idea that the initial position static eval is 32 (15cp). This gave me the best result. I hope this can help.
Normalizes the internal value as reported by evaluate or search to the UCI centipawn result used in output. This value is derived from the win_rate_model() such that Stockfish outputs an advantage of "100 centipawns" for a position if the engine has a 50% probability to win from this position in selfplay at fishtest LTC time control.
The reason to introduce this normalization is that our evaluation is, since NNUE, no longer related to the classical parameter PawnValueEg (=208). This leads to the current evaluation changing quite a bit from release to release, for example, the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):
June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)
With this patch, a 100cp advantage will have a fixed interpretation, i.e. a 50% win chance. To keep this value steady, it will be needed to update the win_rate_model() from time to time, based on fishtest data. This analysis can be performed with a set of scripts currently available at https://github.com/vondele/WLD_model
fixes official-stockfish#4155
closes official-stockfish#4216
No functional change
Reintroduced mctsThreads option
Stockfish patch
Author: Joost VandeVondele
Date: Sat Nov 5 09:15:53 2022 +0100
Timestamp: 1667636153
Normalize evaluation
This may have the added effect that other engines (which, like it or not, will try to match Stockfish's centipawn evals) will have an easier time doing so if they can assume from the outset that "50% winrate = +100 centipawns", e.g. LeelaChessZero/lc0#1193
I'd be quite happy to see more engines adopt the same normalization; it really appears useful, several have done so already, and with the current version of the tool https://github.com/vondele/WLD_model this is actually easy. For engines that intrinsically have a WLD evaluation, like Lc0, there is a nice way to turn that into an eval that is consistent with this convention and results in a nice agreement between Leela and SF (see LeelaChessZero/lc0#1791)
Normalizes the internal value as reported by evaluate or search
to the UCI centipawn result used in output. This value is derived from
the win_rate_model() such that Stockfish outputs an advantage of
"100 centipawns" for a position if the engine has a 50% probability to win
from this position in selfplay at fishtest LTC time control.
The reason to introduce this normalization is that our evaluation is, since NNUE,
no longer related to the classical parameter PawnValueEg (=208). This leads to
the current evaluation changing quite a bit from release to release, for example,
the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):
June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)
With this patch, a 100cp advantage will have a fixed interpretation,
i.e. a 50% win chance. To keep this value steady, it will be needed to update the win_rate_model()
from time to time, based on fishtest data. This analysis can be performed with
a set of scripts currently available at https://github.com/vondele/WLD_model
fixes #4155
closes #4216
No functional change
--
Note to practitioners: the eval inflation has been fixed in this patch by fixing 100cp to mean a 50% win chance, and decoupling this conversion from PawnValueEg. This conversion is somewhat arbitrary; only the relative ranking of positions is important for an engine, which is designed to find the best move. Generally, it might be better to directly use the wdl values (available with the option UCI_ShowWDL) in analysis, or focus directly on the bestmove and PV lines provided.
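The effect of the change on the reported score can be illustrated with a small sketch. It assumes the old output was the internal value scaled by PawnValueEg (=208) with integer division, which reproduces the cp figures in the table above. The function names are hypothetical, and the anchor of 348 is simply the July 2022 row used for illustration, not necessarily the constant adopted in the patch.

```python
PAWN_VALUE_EG = 208  # classical parameter quoted in the commit message


def _div_trunc(a: int, b: int) -> int:
    """Integer division truncating toward zero, as C++ does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def to_cp_legacy(value: int) -> int:
    """Assumed pre-patch conversion: internal value scaled by PawnValueEg."""
    return _div_trunc(value * 100, PAWN_VALUE_EG)


def to_cp_normalized(value: int, value_at_50pct: int) -> int:
    """Post-patch idea: the internal value that wins 50% maps to 100cp."""
    return _div_trunc(value * 100, value_at_50pct)


# Internal values needed for a 50% win probability, from the table above.
for release, value in [("June 2020", 237), ("June 2021", 240),
                       ("April 2022", 279), ("July 2022", 348)]:
    print(release, to_cp_legacy(value), "cp (legacy)")

# With the July 2022 value as the illustrative anchor, that same internal
# value is reported as exactly 100cp.
print(to_cp_normalized(348, 348))  # 100
```

Because the normalized conversion is a single linear rescaling, it changes only the numbers shown to the user and leaves the relative ranking of positions untouched, consistent with the "No functional change" remark.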