Make the displayed evaluation more symmetric #4163

atumanian · 2022-09-12T17:12:53Z

This patch simplifies the formulas used to compute the trend and optimism values before each search iteration. More importantly, it removes the parameters that make the relationship between the displayed evaluation and the expected game result asymmetric. Here symmetry means that f(x) + f(-x) = 1, where f(x) is the expected score when the evaluation shows x. This matters for analysis: more symmetric eval values mean fewer eval fluctuations while stepping through a game.
The patch has passed both STC and LTC simplification tests. The patch in the LTC test (which is functionally equivalent to the version in this pull request) differs slightly from the one used in the STC test: in the LTC version the change is applied to the latest version of master, and the fraction of two integers is rounded to an integer. Both were tested without automatic adjudication.
I've also provided links to the results of an isotonic regression analysis of the relationship between the evaluation and the game result (statistical data and a graph) for both tests, which show that the new version has a more symmetric relationship.
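As a rough illustration of the symmetry criterion and of the kind of analysis mentioned above, here is a minimal sketch. The binned data is made up for illustration and is not taken from the linked results; `pava` and `symmetry_defect` are hypothetical helper names. It fits a non-decreasing curve with the pool-adjacent-violators algorithm (a standard way to do isotonic regression) and then measures how far f(x) + f(-x) is from 1.

```python
def pava(values, weights):
    """Pool Adjacent Violators: weighted least-squares non-decreasing fit."""
    blocks = []  # each block holds [mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def symmetry_defect(evals, scores):
    """Largest |f(x) + f(-x) - 1| over evals x whose mirror -x is present."""
    f = dict(zip(evals, scores))
    return max(abs(f[x] + f[-x] - 1.0) for x in evals if -x in f)


# Made-up binned data: eval (in pawns), mean game score, number of games.
evals = [-2, -1, 0, 1, 2]
raw_scores = [0.10, 0.30, 0.55, 0.52, 0.93]  # note the dip at +1
games = [100, 200, 400, 200, 100]

fitted = pava(raw_scores, games)
defect = symmetry_defect(evals, fitted)
print(fitted, defect)
```

A defect of 0 would mean the fitted curve is perfectly symmetric on the sampled evals; larger values mean a given eval is worth a different amount depending on its sign.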

STC

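[Test result](https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6)
Data and graph: official-stockfish/Stockfish#4150 (comment)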

LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518

LTC

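[Test result](https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6)
Data and graph: official-stockfish/Stockfish#4150 (comment)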

LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24
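For readers who want to sanity-check the figures above, here is a minimal sketch that converts the reported W/L/D counts into an average score and a logistic Elo estimate, and confirms that the pentanomial (game pair) counts describe the same games. The helper names are hypothetical, and the plain logistic formula is an assumption; it is not necessarily the exact model fishtest uses for its SPRT.

```python
import math


def score_from_wld(wins, losses, draws):
    """Average points per game, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)


def score_from_ptnml(ptnml):
    """Same average from game pairs scoring 0, 0.5, 1, 1.5 and 2 points."""
    points = sum(0.5 * i * n for i, n in enumerate(ptnml))
    return points / (2 * sum(ptnml))


def elo_from_score(s):
    """Logistic Elo difference implied by an average score s in (0, 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)


results = {
    "STC": {"wld": (28903, 28760, 50353), "ptnml": [461, 12075, 28850, 12104, 518]},
    "LTC": {"wld": (9412, 9209, 16171), "ptnml": [24, 3374, 10397, 3577, 24]},
}

elo = {}
for name, r in results.items():
    s = score_from_wld(*r["wld"])
    # Both views of the same games must agree on the average score.
    assert abs(s - score_from_ptnml(r["ptnml"])) < 1e-12
    elo[name] = elo_from_score(s)
    print(f"{name}: score={s:.5f} Elo={elo[name]:+.2f}")
```

Both estimates come out slightly positive, which is consistent with passing a simplification test whose bounds are <-1.75,0.25>.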

After the patch is committed, I suggest running more tests to verify whether the eval-result relationship is symmetric and, if it still isn't, thinking about what can be changed to make it more symmetric.
Fixes at least partly: #4142 (but please don't close the issue yet, because the eval may still be asymmetric).
Closes: #4163

Bench: 4425574

atumanian · 2022-09-12T18:10:19Z

By the way, the patch also affects the calculation of trend, which is used in the classical evaluation. Do we have to test this without NNUE as well?

vondele · 2022-09-12T20:45:44Z

There is no need to run without NNUE. However, I would like to see one more test, as this changes the optimism quite a bit. Can you run a test of master vs SF 13 and this patch vs SF 13, both 100'000 games at STC with the UHO book? Let's see if there is an effect in that case.

atumanian · 2022-09-12T21:06:39Z

> it is not needed to run without NNUE. However, I would like to see one more test, as this changes the optimism quite a bit. Can you run a test of master vs sf 13 and this patch vs sf 13, both 100'000 games at STC, UHO book. Let's see if there is an effect in that case.

OK, just a minute...
...The tests have already started.

atumanian · 2022-09-13T13:13:44Z

Here are the results:

This patch vs SF 13:

Elo: 141.66 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48182 L: 9528 D: 42290
Ptnml(0-2): 96, 1426, 13277, 30130, 5071
nElo: 284.13 +-3.3 (95%) PairsRatio: 23.13

Master vs SF 13:

Elo: 143.26 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48525 L: 9479 D: 41996
Ptnml(0-2): 94, 1537, 13098, 29771, 5500
nElo: 281.70 +-3.3 (95%) PairsRatio: 21.63

Master is ahead by 1.6 Elo, but with an error bar of ±1.2 Elo on each result, this difference is not statistically significant.
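The comparison can be reproduced roughly as follows. This is a simplified sketch, assuming the two matches are independent, that the reported ±1.2 values are 95% half-widths, and that Elo follows the plain logistic formula; `elo_from_wld` is a hypothetical helper, not a fishtest function.

```python
import math


def elo_from_wld(wins, losses, draws):
    """Logistic Elo estimate from win/loss/draw counts."""
    s = (wins + 0.5 * draws) / (wins + losses + draws)
    return 400.0 * math.log10(s / (1.0 - s))


patch_elo = elo_from_wld(48182, 9528, 42290)   # this patch vs SF 13
master_elo = elo_from_wld(48525, 9479, 41996)  # master vs SF 13
diff = master_elo - patch_elo

half_width = 1.2  # reported 95% error bar of each match
# For independent estimates, the error bars of a difference add in quadrature.
combined = math.sqrt(half_width**2 + half_width**2)
significant = abs(diff) > combined

print(f"patch={patch_elo:.2f} master={master_elo:.2f} "
      f"diff={diff:.2f} +-{combined:.2f} significant={significant}")
```

Under these assumptions the 1.6 Elo gap is smaller than the combined error bar of about 1.7 Elo, which matches the conclusion above.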

Disservin · 2022-09-13T13:46:41Z

So no clear regression against a weaker version, I guess that's good ;)

dav1312 · 2022-09-13T18:53:57Z

359 more 1.5-point pairs and 429 fewer 2-point pairs 🤔 interesting
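The pair counts can be compared directly from the two Ptnml lines. The sketch below assumes that PairsRatio is the number of won pairs (1.5 or 2 points) divided by the number of lost pairs (0 or 0.5 points); that definition is an assumption, although it reproduces the reported values. `pairs_ratio` is a hypothetical helper name.

```python
# Game-pair counts for pair scores 0, 0.5, 1, 1.5 and 2 points.
pair_scores = [0, 0.5, 1, 1.5, 2]
patch = [96, 1426, 13277, 30130, 5071]    # this patch vs SF 13
master = [94, 1537, 13098, 29771, 5500]   # master vs SF 13

# Positive numbers mean the patch produced more pairs with that score.
diff = [p - m for p, m in zip(patch, master)]


def pairs_ratio(ptnml):
    """Won pairs (1.5 or 2 points) over lost pairs (0 or 0.5 points)."""
    return (ptnml[3] + ptnml[4]) / (ptnml[0] + ptnml[1])


for score, delta in zip(pair_scores, diff):
    print(f"pairs scoring {score}: {delta:+d}")
print(f"PairsRatio patch={pairs_ratio(patch):.2f} master={pairs_ratio(master):.2f}")
```

The patch trades some 2-point pairs for 1.5-point pairs but also loses fewer pairs at 0.5 points, which is why its PairsRatio is slightly higher even though its Elo is slightly lower.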

This patch simplifies the formulas used to compute the trend and optimism values before each search iteration. As a side effect, this removes the parameters which make the relationship between the displayed evaluation value and the expected game result asymmetric.

I've also provided links to the results of isotonic regression analysis of the relationship between the evaluation and game result (statistical data and a graph) for both tests, which demonstrate that the new version has a more symmetric relationship:

STC: [Data and graph](official-stockfish/Stockfish#4150 (comment))
LTC: [Data and graph](official-stockfish/Stockfish#4150 (comment))

See also official-stockfish/Stockfish#4142

passed STC:
https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6
LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518

passed LTC:
https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6
LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24

Furthermore, this does not measurably impact Elo strength against weaker engines, as demonstrated in a match of master and patch vs SF13:

This patch vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa34ae1612778c344c6eb
Elo: 141.66 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48182 L: 9528 D: 42290
Ptnml(0-2): 96, 1426, 13277, 30130, 5071
nElo: 284.13 +-3.3 (95%) PairsRatio: 23.13

Master vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa3ece1612778c344c6ff
Elo: 143.26 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48525 L: 9479 D: 41996
Ptnml(0-2): 94, 1537, 13098, 29771, 5500
nElo: 281.70 +-3.3 (95%) PairsRatio: 21.63

closes: official-stockfish/Stockfish#4163

Bench: 4425574

atumanian · 2022-09-17T13:02:25Z

@vondele thanks.

Same commit message as above (cherry picked from commit 96745b3)

atumanian force-pushed the symmetric_eval branch from af6b887 to 4f599c0 September 12, 2022 17:30

vondele added the "to be merged" label Sep 17, 2022

vondele closed this in 154e7af Sep 17, 2022
