Normalize evaluation #4216

vondele · 2022-10-31T20:04:22Z

Normalizes the internal value as reported by evaluate or search
to the UCI centipawn result used in output. This value is derived from
the win_rate_model() such that Stockfish outputs an advantage of
"100 centipawns" for a position if the engine has a 50% probability to win
from this position in selfplay at fishtest LTC time control.

The reason to introduce this normalization is that our evaluation is, since NNUE,
no longer related to the classical parameter PawnValueEg (=208). This leads to
the current evaluation changing quite a bit from release to release, for example,
the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):

June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)

With this patch, a 100cp advantage will have a fixed interpretation,
i.e. a 50% win chance. To keep this value steady, win_rate_model() will need to be updated
from time to time, based on fishtest data. This analysis can be performed with
a set of scripts currently available at https://github.com/vondele/WLD_model
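
As a rough illustration of the conversion described above, here is a minimal sketch. It assumes the UCI output is computed as internal value times 100 divided by a constant with integer truncation, which is consistent with the four data points listed above; the function names are hypothetical and are not taken from the patch.

```cpp
// Illustrative sketch only: function names are hypothetical, not Stockfish's API.
constexpr int PawnValueEg   = 208;  // classical endgame pawn value (from the PR text)
constexpr int Internal2Pawn = 348;  // internal value at 50% win rate, July 2022 (from the PR text)

// Classical conversion: the centipawn scale is tied to PawnValueEg.
constexpr int to_cp_classical(int v) { return v * 100 / PawnValueEg; }

// Normalized conversion: the 50% win-rate point maps to exactly 100cp.
constexpr int to_cp_normalized(int v) { return v * 100 / Internal2Pawn; }

// Reproduces two of the table rows above and the new fixed point.
static_assert(to_cp_classical(237) == 113, "June 2020 row");
static_assert(to_cp_classical(348) == 167, "July 2022 row");
static_assert(to_cp_normalized(348) == 100, "50% win rate maps to 100cp");
```

Under this assumption only the divisor changes, so the relative ordering of evaluations is untouched, in line with "No functional change".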

fixes #4155
closes #4216

No functional change

--
Note to practitioners: the eval inflation has been fixed in this patch by fixing 100cp to mean a 50% win chance, and by decoupling this conversion from PawnValueEg. This conversion is somewhat arbitrary; only the relative ranking of positions is important for an engine, which is designed to find the best move. Generally, it might be better to use the WDL values directly (available with the option UCI_ShowWDL) in analysis, or to focus directly on the bestmove and PV lines provided.

src/uci.cpp

ddobbelaere · 2022-11-01T08:28:08Z

Seems like std::exp is not constexpr by spec (hence Clang CI failing), but GCC allows this anyway, my bad.

Probably the best way out is to remove the constexpr from win_rate_model again and replace static_assert by assert, as done here: 9dcec48.

Sopel97 · 2022-11-01T08:31:43Z

This should also be applied to format_cp_aligned_dot in Stockfish/src/nnue/evaluate_nnue.cpp (line 250 in d09653d):

static void format_cp_aligned_dot(Value v, char* buffer) {

I'm curious how the eval command looks after such a change.

ddobbelaere · 2022-11-01T08:39:23Z

@Sopel97 Good point. Maybe it's better to move this Internal2Pawn somewhere higher up then (e.g. to types.h)? This would have the advantage that we can static_assert inside win_rate_model.

EDIT: indeed, static_assert(348 == std::round(as[0] + as[1] + as[2] + as[3])); seems to work inside win_rate_model after marking the as array as constexpr (and maybe doing the same for bs, for consistency).

Sopel97 · 2022-11-01T08:42:22Z

win_rate_model should be a class IMO (currently would function only as a namespace)

vondele · 2022-11-01T08:43:29Z

I moved Internal2Pawn to uci.h

Sopel97 · 2022-11-01T08:55:07Z

uci.h missing from changes. Other than that I think it's good now.

vondele · 2022-11-01T08:57:43Z

I must be breaking my record for the most forced pushes to a PR. Anyway, thanks for the feedback!

ddobbelaere · 2022-11-01T09:01:48Z

Here we go again :). Maybe replace std::round (which is again not constexpr by spec, but GCC allows it, rightfully so IMHO) with static_cast<int>?

(It's constexpr starting from C++23, hallelujah)
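
A minimal sketch of the portable variant suggested above. The coefficient values here are hypothetical placeholders chosen to sum to 348.25; they are not the fitted fishtest values, and coefficient_sum is a helper introduced only for this illustration.

```cpp
// Hypothetical placeholder coefficients, NOT the values fitted from fishtest data.
constexpr double as[] = {-1.0, 3.0, 16.0, 330.25};
constexpr int Internal2Pawn = 348;

// Sum of the coefficients, usable in constant expressions.
constexpr double coefficient_sum() { return as[0] + as[1] + as[2] + as[3]; }

// static_cast<int> is valid in a constant expression on any conforming
// compiler, whereas std::round is only constexpr from C++23 onwards.
// Note that the cast truncates toward zero instead of rounding to nearest.
static_assert(Internal2Pawn == static_cast<int>(coefficient_sum()),
              "Internal2Pawn must match the sum of the as[] coefficients");
```

Because the cast truncates, a sum such as 348.6 would still compare equal to 348 here, while std::round would give 349; whether that difference matters depends on the fitted values.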

Sopel97 · 2022-11-02T17:11:08Z

A suggestion for an alternative to the Internal2Pawn name.

SearchScorePawnValue

vdbergh · 2022-11-02T20:39:27Z

Why win rate and not expected score? As a rule of thumb 1 centipawn equals 1 Elo, https://www.chessprogramming.org/Pawn_Advantage,_Win_Percentage,_and_Elo .

Edit: I was too quick. That reference also talks about win percentage and not expected score. I remembered it incorrectly.

Edit2: This makes no sense. They say the win percentage is 50% if there is no pawn advantage. Perhaps ignoring draws?

vondele · 2022-11-02T20:52:29Z

expected score and win rate are related (score = 0.5 * ( 1 + win_rate(eval) - win_rate(-eval))), so if we assume win_rate(-eval) = 0 (for large evals, this more or less holds), we see this results roughly in an expected score of 0.75 for a 100cp advantage.

I guess there is also some confusion: win_rate as used in SF code is the probability of a win. Some use 'winning percentage' (like https://www.3dkingdoms.com/chess/elo.htm) for the match score (so all draws gives a winning percentage of 50).

I could have a look at the pgns I have to see if there could be some other mapping.
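
The relation between expected score and win rate stated above can be checked with a small sketch. The function name is hypothetical; win and loss stand for win_rate(eval) and win_rate(-eval), expressed as probabilities in [0, 1].

```cpp
// Expected match score from win and loss probabilities.
// Equivalent to win + 0.5 * draw, where draw = 1 - win - loss.
constexpr double expected_score(double win, double loss) {
    return 0.5 * (1.0 + win - loss);
}

// A 50% win probability with a negligible loss probability gives a score of 0.75.
static_assert(expected_score(0.5, 0.0) == 0.75, "100cp case discussed above");
```

With equal win and loss probabilities the score is 0.5 regardless of the draw rate, which is why a 0cp position always corresponds to an even match score.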

vdbergh · 2022-11-02T21:03:06Z

@vondele You are right. I read some of the discussion around that reference, and by win percentage they really mean match score. So 1cp=1elo (according to the reference). This yields a 64% score for 100cp.

ddobbelaere · 2022-11-02T21:19:23Z

I think equating 100cp to some win rate (as in SF code sense, i.e. probability of winning), 50% in this case, makes most sense from a practical perspective. Users will be interested in odds of winning the game, expected outcome is strange to reason about IMHO (let alone elo in this context).

vdbergh · 2022-11-02T21:30:12Z

@ddobbelaere The win rate for 0cp obviously depends strongly on the tc. Elo also depends on the tc but I think less strongly so. For example 0 cp is always 0 elo (50% expected score).

ddobbelaere · 2022-11-02T21:50:10Z

@vdbergh You are right. However, as @vondele mentioned, we are in the regime with fixed relation between win probability and expected score (big advantage, no loss assumed). In this case, win probability is more relatable to a chess player I think, expected outcome only convolutes things. And yes, 50% and 100cp are nice round numbers, that's a plus IMHO.

It would be so nice to give our users this hold: "SF 100cp is 50% win probability with (near) perfect play".

Funnily enough, while 0cp win rate depends on tc, I think the situation for 100cp is much more tricky, as the eval itself also depends on it (if tc goes to infinity, eval goes to zero or +/- infinity, loosely speaking). But this point (dependence on tc) deserves more attention maybe (e.g. what's the situation for STC?).

vdbergh · 2022-11-03T04:30:05Z

@ddobbelaere Chess players understand Elo very well. With the match score system a 100cp advantage means that the opponent needs to be 100elo stronger to equalize. To me this seems very easy to understand.

vdbergh · 2022-11-03T04:35:13Z

@ddobbelaere The point I want to make is that the match score system also gives clear information in the case of unequal opponents. With the win rate system this is much less so.

vdbergh · 2022-11-03T05:22:30Z

To be clear: what I am proposing is

(w,d,l) (Vondele's formula) --> score=w+(1/2)d --> elo=-400*log10(1/score-1) --> pawn_eval = elo/100

Edit: of course this is a conceptual description. The last two steps can be simplified to

pawn_eval=4*log10(score/(1-score))
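
A minimal sketch of the proposed pipeline, following the formulas above. The function names are hypothetical, and this is not the code from the prototype branch; scores of exactly 0 or 1 are outside the domain of the logarithm and would need separate handling.

```cpp
#include <cmath>

// Expected match score from win and draw probabilities.
double score_from_wd(double w, double d) { return w + 0.5 * d; }

// Elo difference implied by an expected score strictly between 0 and 1.
double elo_from_score(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}

// Simplified form of elo / 100: pawn evaluation directly from the score.
double pawn_eval_from_score(double score) {
    return 4.0 * std::log10(score / (1.0 - score));
}
```

For example, a score of 0.75 gives a pawn evaluation of about 1.91, and a score of 0.64 corresponds to roughly 100 Elo, matching the "64% score for 100cp" figure mentioned earlier in the thread.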

vdbergh · 2022-11-03T08:55:39Z

I made a prototype implementation. See here

https://github.com/vdbergh/Stockfish/tree/objective_eval

Special scores are currently not treated separately, so this still needs a bit of work I guess.

vdbergh · 2022-11-03T09:35:14Z

Ok I fixed a bunch of bugs and now treat mate scores specially.

snicolet · 2022-11-03T10:17:11Z

My suggestions for this patch:

rename UCI::Internal2Pawn to UCI::kNormalization, and use the latter form UCI::kNormalization everywhere.
add the following comment in uci.h:

// The constant we use to renormalize the internal Stockfish scores for UCI outputs.
// This value is currently chosen such that when Stockfish outputs an advantage of
// "100 centipawns" (in the UCI protocol sense) for a position, the engine has a
// probability of win of 50% in selfplay at fishtest LTC time control (around 2 minutes 
// per game). To recalibrate this constant, use the scripts in /tests/normalize/ from
// time to time.
 const int kNormalization = 348;

create a subdirectory in /tests/normalize and put there the scripts we could use to create the graphics shown in https://discord.com/channels/435943710472011776/813919248455827515/1036719860618637322

snicolet · 2022-11-03T10:23:26Z

As I understand it, running the script after the PR should lead to a graph with the vertical line between blue and turquoise (the frontier labelled "0.500") aligning with the vertical 100 score?

ddobbelaere · 2022-11-03T11:06:46Z

I personally like @vdbergh's suggestion (PR #4218) more than this PR (I changed my mind...). It is conceptually simpler and provides a non-linear relation between the internal value and the reported UCI score in centipawns. With this PR, the relation is by definition linear. The fact that it relates to the earlier referenced paper (and has an easy rule of thumb: "1cp means 1 elo handicap") is also a plus IMHO.

This way, more focus is put on the WDL model (and its derived cp value, now strictly defined) and less on some internal value.

snicolet · 2022-11-03T11:13:33Z

It also has the drawbacks of its benefits: it introduces additional complexity and another mathematical model for what is only an aesthetic problem.

ddobbelaere · 2022-11-03T11:15:07Z

@snicolet Sure, it might be that the resulting UCI output in #4218 will be too "confusing" for SF users, precisely because of its non-linearity, which might lead to compression/decompression of high or low evals. I don't know. This should be investigated at least.

vondele · 2022-11-04T07:39:57Z

create a subdirectory in /tests/normalize and put there the scripts we could use to create the graphics shown in https://discord.com/channels/435943710472011776/813919248455827515/1036719860618637322

The scripts used to create those graphs are at https://github.com/vondele/WLD_model. They would need updating once we introduce the new normalization (I have those changes already locally and will push if this is merged).

The input is game PGNs from fishtest (typically millions of games are used), so it will take a few weeks before we can regenerate the graphs. Of course, this should result in 0.5 being near 100cp or move 32.

amchess · 2022-11-04T09:28:01Z

In ShashChess, I started from the idea that the initial position static eval is 32 (15cp). This gave me the best result. I hope this can help.
This means a modification:
Internal2Pawn = 578
but it varies based on the net...

Reintroduced mctsThreads option (Stockfish patch "Normalize evaluation", Author: Joost VandeVondele, Date: Sat Nov 5 09:15:53 2022 +0100, Timestamp: 1667636153)

yuzisee · 2023-03-20T10:08:28Z

This may have the added effect that other engines (which, like it or not, will try to match Stockfish's centipawn evals) will have an easier time doing so if they can assume from the outset that "50% winrate = +100 centipawns", e.g. LeelaChessZero/lc0#1193

vondele · 2023-03-20T11:57:15Z

I'd be quite happy to see more engines adopt the same normalization; it really appears useful. Several have done so already, and with the current version of the tool https://github.com/vondele/WLD_model this is actually easy. For engines that intrinsically have a WLD evaluation, like Lc0, there is a nice way to turn that into an eval that is consistent with this convention and results in a nice agreement between Leela and SF (see LeelaChessZero/lc0#1791)

vondele force-pushed the normalize_eval branch from 88ccea9 to bd9fad7 Compare October 31, 2022 20:10

ddobbelaere suggested changes Nov 1, 2022

vondele force-pushed the normalize_eval branch from bd9fad7 to 87866ba Compare November 1, 2022 08:13

vondele force-pushed the normalize_eval branch from 87866ba to 3779450 Compare November 1, 2022 08:31

vondele force-pushed the normalize_eval branch from 3779450 to 3588664 Compare November 1, 2022 08:42

vondele force-pushed the normalize_eval branch 2 times, most recently from 209c8c7 to 24f05e9 Compare November 1, 2022 08:51

vondele force-pushed the normalize_eval branch from 24f05e9 to 6261ab9 Compare November 1, 2022 08:57

vondele force-pushed the normalize_eval branch from 6261ab9 to 7c72f3f Compare November 1, 2022 09:02

vondele mentioned this pull request Nov 2, 2022

Meaning of centipawn eval and inflation over time #4155

Closed

vdbergh mentioned this pull request Nov 3, 2022

Objective pawn eval #4218

Closed

vondele force-pushed the normalize_eval branch from 7c72f3f to 11b09b7 Compare November 5, 2022 08:05

vondele force-pushed the normalize_eval branch from 11b09b7 to f0d8c19 Compare November 5, 2022 08:08

vondele force-pushed the normalize_eval branch from f0d8c19 to ad2aa8c Compare November 5, 2022 08:16

vondele merged commit ad2aa8c into official-stockfish:master Nov 5, 2022

vondele added a commit to official-stockfish/WDL_model that referenced this pull request Nov 5, 2022

Adjust NormalizeToPawnValue to current SF default

5ac5cf5

after merge of official-stockfish/Stockfish#4216

snicolet added the "to be merged" label Nov 5, 2022

LovelyChess mentioned this pull request Feb 2, 2023

Update WLD model #4373

Closed

yuzisee mentioned this pull request Mar 20, 2023

SF-like centipawn formula as distinct option. LeelaChessZero/lc0#1477

Closed

Naphthalin mentioned this pull request Mar 20, 2023

WDL Conversion for more realistic WDL and contempt LeelaChessZero/lc0#1791

Merged

yuzisee mentioned this pull request Aug 25, 2023

How does the graph differ from Lichess's? rooklift/nibbler#242

Closed
