
Change of the score sent to the GUI #1868

Closed
amchess opened this issue Dec 13, 2018 · 22 comments

Comments

@amchess

amchess commented Dec 13, 2018

The Stockfish score is not aligned with the meaning GUIs attach to it, unlike other well-known engines.
Based on my tests, I simply propose the following modification in uci.cpp (it does not affect playing strength): change
ss << "cp " << v * 100 / PawnValueEg;
to
ss << "cp " << v * 70 / PawnValueEg;

@noobpwnftw
Contributor

Centipawn scoring has no absolute meaning anyway; people just happen to agree that the measurement is based on a pawn being worth roughly 100, as in the original form you quoted.

Changing it to any value you see fit is simple, because you can make the change and compile the code yourself, but it makes no sense to apply it to everyone else.

@ghost

ghost commented Dec 13, 2018

@noobpwnftw the "GUI" may adjudicate "Resign at N centipawns" earlier with 100 / PawnValueEg than with 70 / PawnValueEg.

@MichaelB7
Contributor

I agree with @noobpwnftw: just increase the N value to make it resign later, or change the code for your own personal use. There is no right or wrong answer - it is whatever the user decides.

@amchess
Author

amchess commented Dec 14, 2018

In my opinion, the problem isn't the resign threshold you can set via the GUI, but the coherence of the (cp) score with the meaning the GUI gives it. We know we can't change it and it's the following:
|score| < 25 =
25 < |score| < 70 light preference
70 < |score| < 140 preference
|score| > 140 decisive advantage

For example, if you launch Stockfish in infinite analysis with contempt disabled, the score visualized by the GUI for the starting position is about 36, which to me is too optimistic.
Other respectable engines, like Komodo, are more precise and show 25.
In terms of Shashin's theory this seems more realistic: a Capablanca-Tal/Chaos position for the advantage of the move.
So, 25 * 100/36 is about 70.
I know it's a very simple modification, and that's why I suggested it very humbly.
What is certain is that the current Stockfish score sent to and visualized by the GUI is misleading.
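
For illustration only, the convention described above could be written as a small lookup (the thresholds and labels are the ones quoted in this thread, not taken from any GUI's source):

#include <cstdlib>

// Hedged sketch: map a centipawn score to the verbal assessment of the
// GUI convention quoted in this thread (thresholds 25 / 70 / 140 cp).
const char* assessment(int cp) {
    int a = std::abs(cp);
    if (a < 25)  return "equal (=)";
    if (a < 70)  return "light preference";
    if (a < 140) return "preference";
    return "decisive advantage";
}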

@ddugovic

What is this "the GUI" with the following properties, which requires changing an engine used on millions of devices?

We know we can't change it and it's the following:
|score| < 25 =
25 < |score| < 70 light preference
70 < |score| < 140 preference
|score| > 140 decisive advantage

@amchess
Author

amchess commented Dec 14, 2018

Every known GUI: Fritz, Arena, etc.
It's a convention, and you can find it referenced in the manuals of the main GUIs and with a Google search.
Anyway, the strength of the engine is not in question: the numbers simply show that it isn't aligned with this convention. It's impossible, for example, that the initial position is of Tal's type (in Shashin's terms)...
This modification doesn't impact playing strength, but it makes the output more useful for the user: he no longer has to multiply the score by 0.70...
Finally, I'm not at all the first to notice that the score shown by Stockfish is too optimistic...
Andrea

@ddugovic

According to Stack Exchange, the "decisive advantage" threshold is 150 between engines, 200 between GMs, etc.

Using Stockfish evaluations, a regression was done which yielded a "winning chances" curve. Perhaps a similar experiment could be conducted for other engines:
https://chesscomputer.tumblr.com/post/98632536555/using-the-stockfish-position-evaluation-score-to

@Rocky640

Internally, we can continue to process our scores as we like.
However, I like how Houdini publishes the score for the GUI: it uses "calibrated centipawns".

From chessbase:
Houdini 4 uses calibrated evaluations in which engine scores correlate directly with the win expectancy in the position. A +1.00 pawn advantage gives a 80% chance of winning the game against an equal opponent at blitz time control. At +2.00 the engine will win 95% of the time, and at +3.00 about 99% of the time. If the advantage is +0.50, expect to win nearly 50% of the time.

The winning-chance curve published in the post linked above is from 2014. One would have to repeat the experiment, and such a result could be used to produce a calibrated output.
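
As a rough illustration of what such a calibrated mapping could look like, here is a minimal sketch assuming a simple logistic model fitted by hand to the Houdini numbers quoted above (this is not Houdini's actual formula, and "win expectancy" here means expected score, ignoring draws):

#include <cmath>
#include <cstdio>

// Hypothetical logistic mapping: the scale constant 166 is chosen so that
// +1.00 (100 cp) maps to roughly 80% expected score, as in the quote above.
double win_expectancy(double cp) {
    return 1.0 / (1.0 + std::pow(10.0, -cp / 166.0));
}

int main() {
    for (double cp : {100.0, 200.0, 300.0})
        std::printf("%+.2f -> %.0f%%\n", cp / 100.0, 100.0 * win_expectancy(cp));
    // prints roughly: +1.00 -> 80%, +2.00 -> 94%, +3.00 -> 98%
}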

@amchess
Author

amchess commented Dec 14, 2018

That is not the question at all.
I know these studies exist, but the problem still remains:
the incompatibility of Stockfish's evaluation with the GUI's score scale.
I'm an IM correspondence chess player, and I know a lot of strong players (including GMs) who don't trust Stockfish's evaluation for this reason and don't use it for analysis.

@noobpwnftw
Contributor

@amchess I do not understand: if that is not the question, why is it not possible for people to adjust the centipawn scale as they see fit by building a custom version of Stockfish?
The better solution (I think) would be to apply some calibration to SF's centipawn value so that it fits winning chances more "naturally", as @Rocky640 mentioned. I think that is what you really want.

@amchess
Author

amchess commented Dec 15, 2018

Only one final thing: I made this modification in my derivative ShashChess, and a correspondence chess GM told me its evaluation is aligned, for example, with Houdini and Komodo (and the GUIs).
So, it's useful to him, unlike Stockfish:
even though it's the strongest engine for match play, he doesn't use it in analysis mode.
He agrees with me.
For this reason I proposed it, and I think it's the simplest change: it doesn't affect playing strength at all, only the visualized score.
If you think it's no good, that's your opinion, and of course you can do what you want, even if, sorry, I don't understand your justifications.
Andrea

@man4

man4 commented Dec 15, 2018

For what it's worth, I do think the current Stockfish evaluation is "inflated," in the sense that if you analyze the start position with odds, the evals are much higher in magnitude than you'd expect.
For example (I used low depths of around 20 so that the search doesn't dominate and find a position much better than the root):

b1-knight odds
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/R1BQKBNR w KQkq - 0 1
Eval: ~-4.6

a1-rook odds
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/1NBQKBNR w Kkq - 0 1
Eval: ~-6.4

queen odds
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNB1KBNR w KQkq - 0 1
Eval: ~-13.4

g2-pawn odds (I didn't want to cherrypick the worst pawn to remove, which is likely f2, so I picked one of the worse ones, keeping in mind that this is balanced by the fact that it's white to move)
rnbqkbnr/pppppppp/8/8/8/8/PPPPPP1P/RNBQKBNR w KQkq - 0 1
Eval: ~-1.3

Normally you'd expect (as an approximation) -3 for knight odds at low depth, -5 for rook and -9 for queen, but these evals are higher by about 30-50%.

As it turns out, PawnValueEg used to be 258 a few months back; it is now 208, so the eval scale has been shifted up by ~25% in the last few months. Meanwhile, it's not clear that 208 is the true "average value of a pawn," because things like piece-square tables and bonuses can shift the average. (For example, one could decrease a piece value by 100 and increase the corresponding PSQT by 100 without affecting the true piece value.)

If people think this is an issue, one possibility that doesn't require periodically calibrating output vs. winrate is to use v * 300 / KnightValueMg or v * 300 / KnightValueEg, rather than v * 100 / PawnValueEg (or similar ideas). A look at the git blame history suggests that KnightValueMg has changed a lot less in percent terms than PawnValueEg, making the formula with KnightValueMg more robust over time. Also, this change would bring the piece-odds evals more in line with what we humans might expect: currently, 300/KnightValueMg = 100/261 and 300/KnightValueEg = 100/288.
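
As a concrete sketch of that alternative (same caveats as the uci.cpp snippet earlier in this thread: the names are Stockfish's, the surrounding code is omitted):

// Sketch only: scale the displayed score by the knight value instead of the pawn
// value, so that roughly +3.00 corresponds to being a minor piece up.
ss << "cp " << v * 300 / KnightValueMg;   // instead of v * 100 / PawnValueEg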

Of course, there are objections to this approach:

  1. the choice of knight value is quite arbitrary. I have to admit this is true.
  2. a "centipawn" is no longer 1/100 of a pawn. However, it wasn't previously either, because things like PSQT and pawn bonuses skew the "true" value of a pawn.

I just wanted to give this as a suggestion to keep the discussion going, because not only is adjudication affected, but also many people use Stockfish to analyze human games and would benefit from knowing that, for example, +1 does correspond to a pawn up on average.

@Vizvezdenec
Contributor

Since when is depth 20 "low"?
It's enough to see that, without a knight, White's position is much worse than just the missing material :)

@man4

man4 commented Dec 15, 2018

Since when is depth 20 "low"?
It's enough to see that, without a knight, White's position is much worse than just the missing material :)

It's "low" enough that the eval is quite stable around depth 15-25. It's enough to see that white has a much worse position by a knight, but that's about it. This is why I chose a quiet position rather than a tactical position.

But I realize an argument can still be made that depth 20 is not "low." So let's look at the static evals instead of the depth-20 evals:
b1-knight odds -4.01
a1-rook odds -5.91
queen odds -13.04
g2-pawn odds -0.89

So, it's true that these evals appear less inflated than before, but they're still inflated by 20-40%, and the point still stands. (For the g2-pawn, with the benefit of hindsight, removing g2 gives partial compensation by allowing an immediate fianchetto. The static eval sees a +0.31 mobility bonus as a result, so this wasn't a great example of a position worth -1.)

This is a reflection of the fact that the ratio KnightValueEg/PawnValueEg is currently about 4.16, when in reality a knight is closer to 3 pawns on average. One way to put this is that the PSQT and bonuses have a more positive effect on pawns than they do on knights (this must be the case in order to compensate for the low PawnValueMg and PawnValueEg). A rough glance does suggest there are more bonuses than penalties among the terms involving pawns.

@MichaelB7
Contributor

I’m not opposed to changing this - I would just hope that any changes are well thought out, as I don’t believe the real solution is simply multiplying x by y. (Although if that is the solution, it would not bother me.)

@Vizvezdenec
Contributor

Vizvezdenec commented Dec 15, 2018

g2-pawn odds are -0.89.
This just means that SF thinks that in the start position a knight = 4.5 pawns, a rook = 6.5 pawns, etc. (e.g. 4.01 / 0.89 ≈ 4.5).
Actually, that matches what SF has as KnightValueMg/PawnValueMg and RookValueMg/PawnValueMg - in middlegames SF thinks that a knight >> 3 pawns, a rook >> 5 pawns, etc.
So if anything we have a deflated pawn value, and the real eval for SF not having a knight would be < -5.

@man4

man4 commented Dec 15, 2018

g2-pawn odds are -0.89.

The sole reason for this is the +0.31 mobility bonus in the static eval, because the f1-bishop now has 2 squares instead of 0. (The bonus is -48 for 0 squares and 16 for 2 squares, so (16 - (-48)) * 100 / PawnValueEg = 31 centipawns.) After removing this irrelevant term, the eval becomes -1.2.

This just means that SF thinks that in the start position a knight = 4.5 pawns, a rook = 6.5 pawns, etc. (e.g. 4.01 / 0.89 ≈ 4.5).
Actually, that matches what SF has as KnightValueMg/PawnValueMg and RookValueMg/PawnValueMg - in middlegames SF thinks that a knight >> 3 pawns, a rook >> 5 pawns, etc.

PawnValueMg is not the actual value of a pawn, and KnightValueMg is not the actual value of a knight. As I stated in previous posts, the piece square table and other bonuses like mobility skew the piece values. If I change PawnValueMg from 136 to 36 and increase the PSQT for midgame pawns by 100, does Stockfish now think a knight is worth KnightValueMg/36=22 pawns?

Here's a good way to find how many pawns a knight is worth:

  1. Gather a set of representative positions with a knight on the board, from real chess games
  2. In each position, remove a knight, and record the difference in static eval
  3. Average all values from step 2 to get the piece value of a knight
  4. Repeat steps 1-3 for pawns
  5. Divide the piece value of a knight by the piece value of a pawn
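
A rough sketch of steps 2-5 follows, assuming it is compiled inside the Stockfish source tree (Stockfish 10 era internals: Position, StateInfo, Eval::evaluate, Threads) and that the engine's usual initialization has already run; the FEN pairs - each position with and without the piece - are assumed to have been prepared from real games in step 1:

#include <string>
#include <utility>
#include <vector>

#include "evaluate.h"
#include "position.h"
#include "thread.h"

// Average drop in static eval (internal units) caused by removing one piece.
// Each pair holds the original FEN and the FEN with the piece removed; both
// are assumed to have the same side to move, which owns the removed piece.
double average_piece_value(const std::vector<std::pair<std::string, std::string>>& fenPairs)
{
    double total = 0;
    for (const auto& fens : fenPairs)
    {
        StateInfo st1, st2;
        Position with, without;
        with.set(fens.first, false, &st1, Threads.main());
        without.set(fens.second, false, &st2, Threads.main());
        total += double(Eval::evaluate(with)) - double(Eval::evaluate(without));  // step 2
    }
    return total / fenPairs.size();  // step 3
}

// Step 5: knightInPawns = average_piece_value(knightPairs) / average_piece_value(pawnPairs);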

@Hanamuke

Hanamuke commented Dec 20, 2018

I agree with @Vizvezdenec: our pawn eval is "deflated" because the raw material value is a smaller part of pawn evaluation than it is for the other pieces. @man4 is also right that our evaluation features are not orthogonal; it would maybe be nice to change the piece values/PSQT so that the PSQTs average to 0. Unfortunately this is not a trivial change, since we use the pawn value in other places in the code.

Edit: To illustrate, the knight PSQT averages to -12 and -14 for mg and eg respectively. To be fair, it would be more interesting to measure the average PSQT of the pieces over a game. Also, the raw piece values are used for SEE, so removing 100 from the raw value and adding 100 to the PSQT would never be a non-functional change.
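
To sketch the renormalization idea (assuming a plain 8x8 table of middlegame values for one piece type; Stockfish's real tables are mirrored half-boards of mg/eg Score pairs, so an actual patch would look different):

// Hedged sketch: fold the average PSQT bonus into the raw piece value, so the
// table averages to (roughly) zero and the piece value carries the rest.
void fold_psqt_mean_into_piece_value(int psqt[8][8], int& pieceValue)
{
    int sum = 0;
    for (int r = 0; r < 8; ++r)
        for (int f = 0; f < 8; ++f)
            sum += psqt[r][f];

    int mean = sum / 64;  // integer rounding: the new average is only close to zero

    for (int r = 0; r < 8; ++r)
        for (int f = 0; f < 8; ++f)
            psqt[r][f] -= mean;

    pieceValue += mean;
}

As noted above, in Stockfish this would still be a functional change, because SEE uses the raw piece values.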

@Hanamuke

To get back to the discussion, I think that what should be done is to have a clearer correspondence between win rate and evaluation, maybe like Houdini does.
Also as a reminder, Stockfish was never designed to evaluate positions, but to find the best moves.

@ghost

ghost commented Dec 26, 2018

The score sent to the "GUI" (in this case cutechess-cli) has major implications, since
fishtest's early adjudication uses very shallow conditions for draw/loss adjudication; see issue #1904.

@ddugovic

Also as a reminder, Stockfish was never designed to evaluate positions, but to find the best moves.

What I'm curious about here is: if the PSQT and pawn values (and/or other pawn parameters) were re-normalized to increase a pawn's value and reduce the bonuses (which I assume would be a functional change), could Elo be gained? Granted, these bonuses are already quite low, and this is a rather open-ended question, but still:

Stockfish/src/psqt.cpp

Lines 93 to 102 in 14c4a40

constexpr Score PBonus[RANK_NB][FILE_NB] =
{ // Pawn
{ S( 0, 0), S( 0, 0), S( 0, 0), S( 0, 0), S( 0, 0), S( 0, 0), S( 0, 0), S( 0, 0) },
{ S( 0,-11), S( -3, -4), S( 13, -1), S( 19, -4), S( 16, 17), S( 13, 7), S( 4, 4), S( -4,-13) },
{ S(-16, -8), S(-12, -6), S( 20, -3), S( 21, 0), S( 25,-11), S( 29, 3), S( 0, 0), S(-27, -1) },
{ S(-11, 3), S(-17, 6), S( 11,-10), S( 21, 1), S( 32, -6), S( 19,-11), S( -5, 0), S(-14, -2) },
{ S( 4, 13), S( 6, 7), S( -8, 3), S( 3, -5), S( 8,-15), S( -2, -1), S(-19, 9), S( -5, 13) },
{ S( -5, 25), S(-19, 20), S( 7, 16), S( 8, 12), S( -7, 21), S( -2, 3), S(-10, -4), S(-16, 15) },
{ S(-10, 6), S( 9, -5), S( -7, 16), S(-12, 27), S( -7, 15), S( -8, 11), S( 16, -7), S( -8, 4) }
};

Stockfish/src/evaluate.cpp

Lines 154 to 173 in 14c4a40

// Assorted bonuses and penalties
constexpr Score BishopPawns = S( 3, 7);
constexpr Score CloseEnemies = S( 8, 0);
constexpr Score CorneredBishop = S( 50, 50);
constexpr Score Hanging = S( 69, 36);
constexpr Score KingProtector = S( 7, 8);
constexpr Score KnightOnQueen = S( 16, 12);
constexpr Score LongDiagonalBishop = S( 45, 0);
constexpr Score MinorBehindPawn = S( 18, 3);
constexpr Score PawnlessFlank = S( 17, 95);
constexpr Score RestrictedPiece = S( 7, 7);
constexpr Score RookOnPawn = S( 10, 32);
constexpr Score SliderOnQueen = S( 59, 18);
constexpr Score ThreatByKing = S( 24, 89);
constexpr Score ThreatByPawnPush = S( 48, 39);
constexpr Score ThreatByRank = S( 13, 0);
constexpr Score ThreatBySafePawn = S(173, 94);
constexpr Score TrappedRook = S( 96, 4);
constexpr Score WeakQueen = S( 49, 15);
constexpr Score WeakUnopposedPawn = S( 12, 23);

@Hanamuke

Hanamuke commented Dec 29, 2018

@ddugovic Following my comment, I had done the following tuning, but without success. Feel free to try it however you like, though; I agree that there is probably Elo to be gained.

http://tests.stockfishchess.org/tests/view/5c1cb6980ebc5902ba128ff2
http://tests.stockfishchess.org/tests/view/5c1d5fc60ebc5902ba12964e

(There was a first tuning in which I forgot to set the base engine to my tuning branch, and I took its output values for the second tuning nonetheless, figuring it should be OK; feel free to redo it properly if you want.)
