Normalize evaluation #4216

Merged on Nov 5, 2022 (1 commit)

@vondele
Member

vondele commented Oct 31, 2022

Normalizes the internal value as reported by evaluate or search
to the UCI centipawn result used in output. This value is derived from
the win_rate_model() such that Stockfish outputs an advantage of
"100 centipawns" for a position if the engine has a 50% probability to win
from this position in selfplay at fishtest LTC time control.

The reason to introduce this normalization is that our evaluation is, since NNUE,
no longer related to the classical parameter PawnValueEg (=208). This leads to
the current evaluation changing quite a bit from release to release, for example,
the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):

June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)

With this patch, a 100cp advantage will have a fixed interpretation,
i.e. a 50% win chance. To keep this value steady, the win_rate_model() will need to be updated
from time to time, based on fishtest data. This analysis can be performed with
a set of scripts currently available at https://github.com/vondele/WLD_model

fixes #4155
closes #4216

No functional change

--
Note to practitioners: this patch fixes the eval inflation by defining 100cp to mean a 50% win chance and by decoupling this conversion from PawnValueEg. The conversion is somewhat arbitrary: only the relative ranking of positions matters for an engine, which is designed to find the best move. Generally, it might be better to use the WDL values directly (available with the option UCI_ShowWDL) in analysis, or to focus on the bestmove and PV lines provided.
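
To make the table in the description concrete, here is a minimal sketch of the two conversions. It assumes the pre-patch output divided the internal value by PawnValueEg (208) using integer arithmetic, which reproduces the cp column above, and that the post-patch output divides by the July 2022 internal value (348). The function names legacy_cp and normalized_cp and the constant name NormalizationValue are hypothetical, not the names used in the patch.

```cpp
// Sketch only: names are hypothetical, constants are taken from the description.
constexpr int PawnValueEg = 208;         // classical endgame pawn value
constexpr int NormalizationValue = 348;  // internal value at 50% win rate (July 2022)

// Pre-patch style: internal value expressed relative to PawnValueEg.
constexpr int legacy_cp(int internalValue) {
    return internalValue * 100 / PawnValueEg;
}

// Post-patch style: internal value expressed relative to the 50% win point.
constexpr int normalized_cp(int internalValue) {
    return internalValue * 100 / NormalizationValue;
}

// The first and last rows of the table, and the new fixed interpretation.
static_assert(legacy_cp(237) == 113, "June 2020 row");
static_assert(legacy_cp(348) == 167, "July 2022 row");
static_assert(normalized_cp(348) == 100, "50% win rate maps to 100cp");
```

Under these assumptions the same internal value of 348 is reported as 167cp before the patch and as 100cp after it.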

@ddobbelaere
Contributor

ddobbelaere commented Nov 1, 2022

Seems like std::exp is not constexpr by spec (hence Clang CI failing), but GCC allows this anyway, my bad.

Probably the best way out is to remove the constexpr from win_rate_model again and replace static_assert by assert, as done here: 9dcec48.
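
A minimal sketch of that fallback, assuming a logistic win-rate model of the kind discussed in this thread. The function name win_rate_permille and the parameters a and b are placeholders, not the fitted values or the signature used in Stockfish. Because std::exp cannot be relied on in constant expressions, the function is an ordinary one and the consistency check runs at runtime through assert.

```cpp
#include <cassert>
#include <cmath>

// Logistic win-rate sketch: returns the win probability in permille for an
// internal value x. a is the value that gives a 50% win rate and b controls
// the slope. Both defaults are hypothetical placeholders, not fitted values.
inline int win_rate_permille(double x, double a = 348.0, double b = 70.0) {
    return static_cast<int>(0.5 + 1000.0 / (1.0 + std::exp((a - x) / b)));
}

// Runtime replacement for a static_assert: the normalization value must map
// to a 50% win rate (500 permille).
inline void check_normalization() {
    assert(win_rate_permille(348.0) == 500);
}
```

The trade-off is that a mismatch is only caught when check_normalization() actually runs in a build with assertions enabled, rather than at compile time.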

@Sopel97
Member

Sopel97 commented Nov 1, 2022

This should also be applied to

static void format_cp_aligned_dot(Value v, char* buffer) {

I'm curious how the eval command looks after such a change.

@ddobbelaere
Contributor

ddobbelaere commented Nov 1, 2022

@Sopel97 Good point. Maybe it's better to move this Internal2Pawn somewhere higher up then (e.g. to types.h)? This would have the advantage that we can static_assert inside win_rate_model.

EDIT: indeed, static_assert(348 == std::round(as[0] + as[1] + as[2] + as[3])); seems to work inside win_rate_model after marking the as array constexpr (and maybe doing the same for bs, for consistency).

@Sopel97
Member

Sopel97 commented Nov 1, 2022

win_rate_model should be a class IMO (currently would function only as a namespace)

@vondele
Member Author

vondele commented Nov 1, 2022

I moved Internal2Pawn to uci.h

@vondele vondele force-pushed the normalize_eval branch 2 times, most recently from 209c8c7 to 24f05e9 Compare November 1, 2022 08:51
@Sopel97
Member

Sopel97 commented Nov 1, 2022

uci.h missing from changes. Other than that I think it's good now.

@vondele
Member Author

vondele commented Nov 1, 2022

I must be breaking my record for the most forced pushes to a PR. Anyway, thanks for the feedback!

@ddobbelaere
Contributor

ddobbelaere commented Nov 1, 2022

Here we go again :). Maybe replace std::round (which is again not constexpr by spec, but GCC allows it, rightfully so IMHO) with static_cast<int>?

(It's constexpr starting from C++23, hallelujah)
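
A sketch of the portable variant, assuming the normalization constant is checked against the sum of the as coefficients. The coefficient values below are made up so that they sum to 348.25; they are not the fitted values. The names NormalizationValue and coefficient_sum_truncated are also hypothetical.

```cpp
// Hypothetical coefficients (NOT the fitted ones), chosen to sum to 348.25.
constexpr double as[] = {0.5, -3.0, 10.5, 340.25};
constexpr int NormalizationValue = 348;

// Helper so the truncated sum can be reused; static_cast<int> is valid in a
// constant expression, unlike std::round before C++23.
constexpr int coefficient_sum_truncated() {
    return static_cast<int>(as[0] + as[1] + as[2] + as[3]);
}

// Compile-time check that the constant and the coefficients stay in sync.
static_assert(NormalizationValue == coefficient_sum_truncated(),
              "normalization constant out of sync with model coefficients");
```

One thing to keep in mind: static_cast<int> truncates toward zero while std::round rounds to nearest, so the two only give the same constant when the fractional part of the sum is below 0.5.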

@Sopel97
Member

Sopel97 commented Nov 2, 2022

A suggestion for an alternative to the Internal2Pawn name.

SearchScorePawnValue

@vdbergh
Contributor

vdbergh commented Nov 2, 2022

Why win rate and not expected score? As a rule of thumb 1 centipawn equals 1 Elo, https://www.chessprogramming.org/Pawn_Advantage,_Win_Percentage,_and_Elo .

Edit: I was too quick. That reference also talks about win percentage and not expected score. I remembered it incorrectly.

Edit2: This makes no sense. They say the win percentage is 50% if there is no pawn advantage. Perhaps ignoring draws?

@vondele
Member Author

vondele commented Nov 2, 2022

expected score and win rate are related (score = 0.5 * ( 1 + win_rate(eval) - win_rate(-eval))), so if we assume win_rate(-eval) = 0 (for large evals, this more or less holds), we see this results roughly in an expected score of 0.75 for a 100cp advantage.
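
The relation above can be written as a small helper. This is only an illustration: expected_score is a hypothetical name, and the win and loss probabilities are passed in directly instead of being computed from a win-rate model. It is the same quantity as w + d/2 with d = 1 - w - l.

```cpp
// Expected score from win and loss probabilities; the draw probability is
// the remainder 1 - winProb - lossProb.
constexpr double expected_score(double winProb, double lossProb) {
    return 0.5 * (1.0 + winProb - lossProb);
}

// The case from the comment: 50% win rate and (approximately) no losses.
static_assert(expected_score(0.5, 0.0) == 0.75, "100cp case gives a 75% score");
```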

I guess there is also some confusion: win_rate as used in SF code is the probability of a win. Some use 'winning percentage' (like https://www.3dkingdoms.com/chess/elo.htm) to mean the match score (so that all draws gives a winning percentage of 50).

I could have a look at the pgns I have to see if there could be some other mapping.

@vdbergh
Contributor

vdbergh commented Nov 2, 2022

@vondele You are right. I read some of the discussion around that reference, and by win percentage they really mean match score. So 1cp=1elo (according to the reference). This yields a 64% score for 100cp.
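
The 64% figure follows from the usual logistic Elo expectation formula. A short sketch, with elo_to_score as a hypothetical helper name:

```cpp
#include <cmath>

// Standard logistic Elo expectation: expected score of the side that is
// ahead by eloDiff rating points.
inline double elo_to_score(double eloDiff) {
    return 1.0 / (1.0 + std::pow(10.0, -eloDiff / 400.0));
}
```

With the 1cp = 1 Elo rule of thumb, elo_to_score(100.0) evaluates to roughly 0.64.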

@ddobbelaere
Contributor

I think equating 100cp to some win rate (as in SF code sense, i.e. probability of winning), 50% in this case, makes most sense from a practical perspective. Users will be interested in odds of winning the game, expected outcome is strange to reason about IMHO (let alone elo in this context).

@vdbergh
Contributor

vdbergh commented Nov 2, 2022

@ddobbelaere The win rate for 0cp obviously depends strongly on the tc. Elo also depends on the tc but I think less strongly so. For example 0 cp is always 0 elo (50% expected score).

@ddobbelaere
Contributor

ddobbelaere commented Nov 2, 2022

@vdbergh You are right. However, as @vondele mentioned, we are in the regime with fixed relation between win probability and expected score (big advantage, no loss assumed). In this case, win probability is more relatable to a chess player I think, expected outcome only convolutes things. And yes, 50% and 100cp are nice round numbers, that's a plus IMHO.

It would be so nice to give our users this hold: "SF 100cp is 50% win probability with (near) perfect play".

Funnily enough, while 0cp win rate depends on tc, I think the situation for 100cp is much more tricky, as the eval itself also depends on it (if tc goes to infinity, eval goes to zero or +/- infinity, loosely speaking). But this point (dependence on tc) deserves more attention maybe (e.g. what's the situation for STC?).

@vdbergh
Contributor

vdbergh commented Nov 3, 2022

@ddobbelaere Chess players understand Elo very well. With the match score system a 100cp advantage means that the opponent needs to be 100elo stronger to equalize. To me this seems very easy to understand.

@vdbergh
Contributor

vdbergh commented Nov 3, 2022

@ddobbelaere The point I want to make is that the match score system also gives clear information in the case of unequal opponents. With the win rate system this is much less so.

@vdbergh
Contributor

vdbergh commented Nov 3, 2022

To be clear: what I am proposing is

(w,d,l) (Vondele's formula) --> score=w+(1/2)d --> elo=-400*log10(1/score-1) --> pawn_eval = elo/100

Edit: of course this is a conceptual description. The last two steps can be simplified to

pawn_eval=4*log10(score/(1-score))
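
A sketch of the simplified mapping, taking the (w, d, l) probabilities as given instead of computing them from the win-rate model. The WDL struct and the function names are hypothetical and are not taken from the prototype branch.

```cpp
#include <cmath>

// Win/draw/loss probabilities, assumed to sum to 1.
struct WDL {
    double win;
    double draw;
    double loss;
};

// score = w + d/2
inline double expected_score(const WDL& p) {
    return p.win + 0.5 * p.draw;
}

// pawn_eval = 4 * log10(score / (1 - score)), i.e. Elo difference / 100.
// The expression diverges for score 0 or 1, so decisive scores would need
// separate handling.
inline double pawn_eval(const WDL& p) {
    const double s = expected_score(p);
    return 4.0 * std::log10(s / (1.0 - s));
}
```

For example, pawn_eval({0.5, 0.5, 0.0}) is about 1.91, so a position with a 50% win rate and no losses would be reported as roughly 191cp under this mapping instead of 100cp.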

@vdbergh
Contributor

vdbergh commented Nov 3, 2022

I made a prototype implementation. See here

https://github.com/vdbergh/Stockfish/tree/objective_eval

Special scores are currently not treated separately, so this still needs a bit of work I guess.

@vdbergh
Contributor

vdbergh commented Nov 3, 2022

Ok I fixed a bunch of bugs and now treat mate scores specially.

@vdbergh vdbergh mentioned this pull request Nov 3, 2022
@snicolet
Member

snicolet commented Nov 3, 2022

My suggestions for this patch:

  1. rename UCI::Internal2Pawn to UCI::kNormalization, and use the latter form UCI::kNormalization everywhere.
  2. add the following comment in uci.h :
// The constant we use to renormalize the internal Stockfish scores for UCI outputs.
// This value is currently chosen such that when Stockfish outputs an advantage of
// "100 centipawns" (in the UCI protocol sense) for a position, the engine has a
// probability of win of 50% in selfplay at fishtest LTC time control (around 2 minutes
// per game). To recalibrate this constant, use the scripts in /tests/normalize/ from
// time to time.
const int kNormalization = 348;
  3. create a subdirectory in /tests/normalize and put there the scripts we could use to create the graphics shown in https://discord.com/channels/435943710472011776/813919248455827515/1036719860618637322

image

@snicolet
Member

snicolet commented Nov 3, 2022

As I understand it, running the script after the PR should lead to a graph with the vertical line between blue and turquoise (the frontier labelled "0.500") aligning with the vertical 100 score?

@ddobbelaere
Contributor

ddobbelaere commented Nov 3, 2022

I personally like @vdbergh's suggestion (PR #4218) more than this PR (I changed my mind...). It is conceptually simpler and provides a non-linear relation between the internal value and the reported UCI score in centipawns. With this PR, the relation is by definition linear. The fact that it relates to the earlier referenced paper (and has an easy rule of thumb: "1cp means 1 elo handicap") is also a plus IMHO.

This way, more focus is being put onto the WDL model (and its derived cp value, now strictly defined) and less on some internal value.

@snicolet
Member

snicolet commented Nov 3, 2022

it has also the disadvantages of its benefits: introducing another complexification and another mathematical model for what is only an esthetic problem.

@ddobbelaere
Contributor

@snicolet Sure, it might be that the resulting UCI output in #4218 will be too "confusing" for SF users, precisely because of its non-linearity, which might lead to compression/decompression of high or low evals. I don't know. This should be investigated at least.

@vondele
Member Author

vondele commented Nov 4, 2022

  3. create a subdirectory in /tests/normalize and put there the scripts we could use to create the graphics shown in https://discord.com/channels/435943710472011776/813919248455827515/1036719860618637322

The scripts used to create those graphs are at https://github.com/vondele/WLD_model. They would need updating once we introduce the new normalization (I have those changes already locally and will push if this is merged).

The inputs are game pgns downloaded from fishtest (typically millions of games are used), so it will take a few weeks before we can regenerate the graphs. Of course, this should result in 0.5 being near 100cp or move 32.

@amchess

amchess commented Nov 4, 2022

In ShashChess, I started from the idea that the initial position static eval is 32 (15cp). This gave me the best result. I hope this can help.
This means a modification:
Internal2Pawn = 578
but it varies based on the net...

vondele added a commit to vondele/Stockfish that referenced this pull request Nov 5, 2022
@vondele vondele merged commit ad2aa8c into official-stockfish:master Nov 5, 2022
vondele added a commit to official-stockfish/WDL_model that referenced this pull request Nov 5, 2022
@snicolet snicolet added the to be merged Will be merged shortly label Nov 5, 2022
amchess pushed a commit to amchess/ShashChess that referenced this pull request Nov 6, 2022
Reintroduced mctsThreads option
Stockfish patch
Author: Joost VandeVondele
Date: Sat Nov 5 09:15:53 2022 +0100
Timestamp: 1667636153
Normalize evaluation
@LovelyChess LovelyChess mentioned this pull request Feb 2, 2023
@yuzisee

yuzisee commented Mar 20, 2023

This may have the added effect that other engines (who, like it or not, will try to match Stockfish's centipawn evals) will have an easier time doing so if they can assume from the outset that "50% winrate = +100 centipawns", e.g. LeelaChessZero/lc0#1193

@vondele
Member Author

vondele commented Mar 20, 2023

I'd be quite happy to see more engines adopt the same normalization; it really appears useful. Several have done so already, and with the current version of the tool https://github.com/vondele/WLD_model this is actually easy. For engines that intrinsically have a WLD evaluation, like Lc0, there is a nice way to turn that into an eval that is consistent with this convention and results in a nice agreement between Leela and SF (see LeelaChessZero/lc0#1791)

Development

Successfully merging this pull request may close these issues.

Meaning of centipawn eval and inflation over time
7 participants