Make the displayed evaluation more symmetric #4163

Closed

@atumanian atumanian commented Sep 12, 2022

This patch simplifies the formulas used to compute the trend and optimism values before each search iteration. More importantly, it removes the parameters that make the relationship between the displayed evaluation value and the expected game result asymmetric. Here, symmetry means that f(x) + f(-x) = 1, where f(x) is the expected score when the evaluation shows x. This matters for analysis: more symmetric eval values mean fewer eval fluctuations while stepping through a game.
The patch has passed both STC and LTC simplification tests. The patch in the LTC test (which is functionally equivalent to the version in the present pull request) differs slightly from the patch used in the STC test: in the LTC version the change is applied to the latest version of master and the fraction of two integers is rounded to an integer. Both were tested without automatic adjudication.
I've also provided links to the results of an isotonic regression analysis of the relationship between the evaluation and the game result (statistical data and a graph) for both tests, which show that the new version has a more symmetric relationship.
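
To make the symmetry condition concrete, here is a minimal Python sketch. The logistic curve and the `scale` and `offset` parameters are illustrative assumptions, not Stockfish's actual eval-to-result relationship; the point is only that a curve centered at zero satisfies f(x) + f(-x) = 1, while a shifted one does not.

```python
import math

def expected_score(x, scale=100.0, offset=0.0):
    """Hypothetical expected score for a displayed eval x (assumed logistic shape)."""
    return 1.0 / (1.0 + math.exp(-(x - offset) / scale))

def asymmetry(f, xs):
    """Largest deviation of f(x) + f(-x) from 1 over the sample points."""
    return max(abs(f(x) + f(-x) - 1.0) for x in xs)

xs = range(0, 501, 25)

# A curve centered at zero is symmetric up to floating-point error.
centered = asymmetry(lambda x: expected_score(x), xs)

# Shifting the curve (offset chosen arbitrarily) breaks the symmetry.
shifted = asymmetry(lambda x: expected_score(x, offset=20.0), xs)

print(f"centered curve: max |f(x) + f(-x) - 1| = {centered:.6f}")
print(f"shifted curve:  max |f(x) + f(-x) - 1| = {shifted:.6f}")
```

The same `asymmetry` helper could in principle be applied to a curve fitted from game data, such as the isotonic regression results linked below.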

STC

[Test result](https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6)
Data and graph

LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518 

LTC

[Test result](https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6)
Data and graph

LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24
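
For readers who want to relate the W/L/D totals above to an Elo estimate, here is a minimal sketch using the standard logistic Elo formula. The function names are made up for illustration, and this is not the test framework's own code: it is a plain point estimate that ignores the Ptnml game-pair counts and gives no error bars.

```python
import math

def score(wins, losses, draws):
    """Average score per game, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)

def elo_from_score(s):
    """Logistic Elo difference implied by an average score s (0 < s < 1)."""
    return 400.0 * math.log10(s / (1.0 - s))

# W/L/D totals copied from the STC and LTC results above.
stc_elo = elo_from_score(score(28903, 28760, 50353))
ltc_elo = elo_from_score(score(9412, 9209, 16171))

print(f"STC point estimate: {stc_elo:+.2f} Elo")
print(f"LTC point estimate: {ltc_elo:+.2f} Elo")
```

Both estimates come out slightly positive, which is consistent with a simplification test passing within the <-1.75,0.25> bounds.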

After the patch is committed, I suggest running more tests to verify whether the eval-result relationship is symmetric and, if it still isn't, thinking about what can be changed to make it more symmetric.
Fixes at least partly: #4142 (but please don't close the issue yet, because the eval may still be asymmetric).
Closes: #4163

Bench: 4425574

@atumanian
Author

By the way, the patch also affects the calculation of trend, which is used in the classical evaluation. Do we need to test this without NNUE as well?

@vondele
Member

vondele commented Sep 12, 2022

It is not necessary to run without NNUE. However, I would like to see one more test, as this changes the optimism quite a bit. Can you run a test of master vs SF 13 and this patch vs SF 13, both with 100'000 games at STC, using the UHO book? Let's see if there is an effect in that case.

@atumanian
Author

atumanian commented Sep 12, 2022

> It is not necessary to run without NNUE. However, I would like to see one more test, as this changes the optimism quite a bit. Can you run a test of master vs SF 13 and this patch vs SF 13, both with 100'000 games at STC, using the UHO book? Let's see if there is an effect in that case.

Ok, just a minute...
...The tests have already started.

@atumanian
Author

atumanian commented Sep 13, 2022

Here are the results:

This patch vs SF 13:

Elo: 141.66 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48182 L: 9528 D: 42290
Ptnml(0-2): 96, 1426, 13277, 30130, 5071
nElo: 284.13 +-3.3 (95%) PairsRatio: 23.13 

Master vs SF 13:

Elo: 143.26 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48525 L: 9479 D: 41996
Ptnml(0-2): 94, 1537, 13098, 29771, 5500
nElo: 281.70 +-3.3 (95%) PairsRatio: 21.63

Master is better by 1.6 Elo, but the difference is not statistically significant here.
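
The significance remark can be checked with a rough calculation. The sketch below assumes the two matches are independent and that the reported +-1.2 is a 95% half-width for each, so the half-widths combine in quadrature. It is a back-of-the-envelope check, not the test framework's own method.

```python
import math

# Reported values from the two matches against SF 13.
patch_elo, patch_err = 141.66, 1.2
master_elo, master_err = 143.26, 1.2

diff = master_elo - patch_elo

# Assumption: independent matches, so the 95% half-widths add in quadrature.
diff_err = math.hypot(patch_err, master_err)

significant = abs(diff) > diff_err
print(f"difference: {diff:.2f} +- {diff_err:.2f} Elo, significant: {significant}")
```

Under these assumptions the 1.6 Elo gap lies inside the combined error bar, so the two results cannot be told apart at the 95% level.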

@Disservin
Member

So no clear regression against a weaker version, I guess that's good ;)

@dav1312
Contributor

dav1312 commented Sep 13, 2022

359 more 1.5 pairs and 429 fewer 2 pairs 🤔 interesting
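
These differences can be reproduced directly from the two Ptnml lines posted earlier. In the short sketch below, the variable names are illustrative, and the labels are the points scored per game pair (0 to 2).

```python
# Ptnml(0-2) counts copied from the results: game pairs by points scored.
labels = ["0", "0.5", "1", "1.5", "2"]
patch = [96, 1426, 13277, 30130, 5071]
master = [94, 1537, 13098, 29771, 5500]

# Patch minus master, per pair outcome.
diff = {label: p - m for label, p, m in zip(labels, patch, master)}

# Sanity check: 100000 games per match means 50000 game pairs.
pairs_patch, pairs_master = sum(patch), sum(master)

for label in labels:
    print(f"{label:>3} pairs: {diff[label]:+d}")
```

The patch converts fewer pairs into 2-0 sweeps but scores 1.5 in more of them.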

@vondele vondele added the to be merged Will be merged shortly label Sep 17, 2022
@vondele vondele closed this in 154e7af Sep 17, 2022
PikaCat-OuO pushed a commit to official-pikafish/Pikafish that referenced this pull request Sep 17, 2022
This patch simplifies the formulas used to compute the trend and optimism values before each search iteration.
As a side effect, this removes the parameters which make the relationship between the displayed evaluation value
and the expected game result asymmetric.

I've also provided links to the results of isotonic regression analysis of the relationship between the evaluation and game result (statistical data and a graph) for both tests, which demonstrate that the new version has a more symmetric relationship:

STC: [Data and graph](official-stockfish/Stockfish#4150 (comment))
LTC: [Data and graph](official-stockfish/Stockfish#4150 (comment))
See also official-stockfish/Stockfish#4142

passed STC:
https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6
LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518

passed LTC:
https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6
LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24

Furthermore, this does not measurably impact Elo strength against weaker engines,
as demonstrated in a match of master and patch vs SF13:

This patch vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa34ae1612778c344c6eb
Elo: 141.66 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48182 L: 9528 D: 42290
Ptnml(0-2): 96, 1426, 13277, 30130, 5071
nElo: 284.13 +-3.3 (95%) PairsRatio: 23.13

Master vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa3ece1612778c344c6ff
Elo: 143.26 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48525 L: 9479 D: 41996
Ptnml(0-2): 94, 1537, 13098, 29771, 5500
nElo: 281.70 +-3.3 (95%) PairsRatio: 21.63

closes: official-stockfish/Stockfish#4163

Bench: 4425574
@atumanian
Author

@vondele thanks.

PikaCat-OuO pushed a commit to official-pikafish/Pikafish that referenced this pull request Sep 19, 2022
PikaCat-OuO pushed a commit to official-pikafish/Pikafish that referenced this pull request Oct 7, 2022
(cherry picked from commit 96745b3)