Make the displayed evaluation more symmetric #4163
This patch simplifies the formulas used to compute the trend and optimism values before each search iteration. More importantly, it removes the parameters that make the relationship between the displayed evaluation value and the expected game result asymmetric. Here, symmetry means that f(x) + f(-x) = 1, where f(x) is the expected score when the evaluation shows x. This matters for usability in analysis: more symmetric eval values mean fewer fluctuations of the eval while traversing a game.

The patch has passed both STC and LTC simplification tests. The patch in the LTC test (which is functionally equivalent to the version in the present pull request) differs slightly from the patch used in the STC: in the LTC version, the change is applied to the latest version of master and the fraction of two integers is rounded to an integer. Both were tested without automatic adjudication.

I've also provided links to the results of isotonic regression analysis of the relationship between the evaluation and game result (statistical data and a graph) for both tests, which demonstrate that the new version has a more symmetric relationship.

STC
[Test result](https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6)
[Data and graph](official-stockfish#4150 (comment))
```
LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518
```

LTC
[Test result](https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6)
[Data and graph](official-stockfish#4150 (comment))
```
LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24
```

After the patch is committed, I suggest running more tests to verify whether the eval-result relationship is symmetric, and considering what could be changed to make it more symmetric if it still isn't.

Fixes at least partly: official-stockfish#4142 (but please don't close the issue yet, because the eval may still be asymmetric).
Closes: official-stockfish#4163
Bench: 4425574
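The symmetry condition f(x) + f(-x) = 1 from the description can be checked numerically. The sketch below is only an illustration: `expected_score` is a hypothetical logistic eval-to-score curve and `shift` is a made-up parameter standing in for whatever introduces the asymmetry; neither is taken from Stockfish or from the regression data linked above.

```python
import math

def expected_score(x, scale=1.0, shift=0.0):
    """Hypothetical eval-to-score model: a logistic curve.

    A nonzero `shift` moves the curve sideways, which breaks the
    symmetry f(x) + f(-x) = 1.
    """
    return 1.0 / (1.0 + math.exp(-(x - shift) / scale))

def max_asymmetry(f, xs):
    """Largest deviation of f(x) + f(-x) from 1 over the sample points."""
    return max(abs(f(x) + f(-x) - 1.0) for x in xs)

# Sample evaluations from 0.0 to 5.0 in steps of 0.1.
xs = [i / 10 for i in range(0, 51)]

def symmetric(x):
    return expected_score(x)

def shifted(x):
    return expected_score(x, shift=0.3)

print("symmetric curve:", max_asymmetry(symmetric, xs))
print("shifted curve:  ", max_asymmetry(shifted, xs))
```

For the unshifted curve the deviation is zero up to floating-point rounding, while the shifted curve deviates visibly. The same check could be applied to an empirical f(x) obtained from game data.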
Force-pushed: af6b887 to 4f599c0
By the way, the patch also affects the calculation of trend, which is used in the classical evaluation. Do we have to test this without NNUE as well?
It is not necessary to test without NNUE. However, I would like to see one more test, as this changes the optimism quite a bit. Can you run a test of master vs SF 13 and this patch vs SF 13, both 100'000 games at STC, UHO book? Let's see if there is an effect in that case.
Ok, just a minute...
Here are the results:
Master is better by 1.6 Elo, which, however, is not statistically significant here.
So no clear regression against a weaker version, I guess that's good ;)
359 more 1.5 pairs and 429 fewer 2 pairs 🤔 interesting
This patch simplifies the formulas used to compute the trend and optimism values before each search iteration. As a side effect, this removes the parameters which make the relationship between the displayed evaluation value and the expected game result asymmetric.

I've also provided links to the results of isotonic regression analysis of the relationship between the evaluation and game result (statistical data and a graph) for both tests, which demonstrate that the new version has a more symmetric relationship:

STC: [Data and graph](official-stockfish/Stockfish#4150 (comment))
LTC: [Data and graph](official-stockfish/Stockfish#4150 (comment))

See also official-stockfish/Stockfish#4142

passed STC:
https://tests.stockfishchess.org/tests/view/6313f44b8202a039920e27e6
LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 108016 W: 28903 L: 28760 D: 50353
Ptnml(0-2): 461, 12075, 28850, 12104, 518

passed LTC:
https://tests.stockfishchess.org/tests/view/631de45db85daa436625dfe6
LLR: 3.01 (-2.94,2.94) <-1.75,0.25>
Total: 34792 W: 9412 L: 9209 D: 16171
Ptnml(0-2): 24, 3374, 10397, 3577, 24

Furthermore, this does not measurably impact Elo strength against weaker engines, as demonstrated in a match of master and patch vs SF13:

This patch vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa34ae1612778c344c6eb
Elo: 141.66 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48182 L: 9528 D: 42290
Ptnml(0-2): 96, 1426, 13277, 30130, 5071
nElo: 284.13 +-3.3 (95%) PairsRatio: 23.13

Master vs SF 13:
https://tests.stockfishchess.org/tests/view/631fa3ece1612778c344c6ff
Elo: 143.26 +-1.2 (95%) LOS: 100.0%
Total: 100000 W: 48525 L: 9479 D: 41996
Ptnml(0-2): 94, 1537, 13098, 29771, 5500
nElo: 281.70 +-3.3 (95%) PairsRatio: 21.63

closes: official-stockfish/Stockfish#4163

Bench: 4425574
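The Elo figures and the pair-count differences discussed in the thread can be recomputed from the raw counts in the commit message. This is a minimal sketch assuming the standard logistic Elo formula applied to the average score; the helper name `logistic_elo` is made up for illustration, and the sketch does not attempt to reproduce the error bars or nElo values.

```python
import math

def logistic_elo(wins, losses, draws):
    """Elo difference implied by a W/L/D record under the logistic model."""
    total = wins + losses + draws
    score = (wins + 0.5 * draws) / total  # average points per game
    return -400.0 * math.log10(1.0 / score - 1.0)

# W/L/D counts copied from the two matches against SF 13.
patch_elo = logistic_elo(48182, 9528, 42290)
master_elo = logistic_elo(48525, 9479, 41996)

# Pentanomial counts: game pairs scoring 0, 0.5, 1, 1.5 and 2 points.
patch_ptnml = [96, 1426, 13277, 30130, 5071]
master_ptnml = [94, 1537, 13098, 29771, 5500]
ptnml_diff = [p - m for p, m in zip(patch_ptnml, master_ptnml)]

print(f"patch vs SF 13:  {patch_elo:.2f}")
print(f"master vs SF 13: {master_elo:.2f}")
print(f"gap (master - patch): {master_elo - patch_elo:.2f}")
print(f"ptnml (patch - master): {ptnml_diff}")
```

The two Elo values should land close to the reported 141.66 and 143.26, a gap of about 1.6 Elo. The last two entries of `ptnml_diff` correspond to the "359 more 1.5 pairs" and "429 fewer 2 pairs" noted in the discussion.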
@vondele thanks.
Same commit message as above, with the trailer: (cherry picked from commit 96745b3)