Update Elo estimates for terms in search. #2401

Closed
wants to merge 1 commit into from

vondele
Member

@vondele vondele commented Nov 9, 2019

This updates the Elo estimates from 1.5 years ago and adds missing terms.
All tests were run at 10+0.1 (STC) with 20000 games each; error bars are +- 3 Elo.

http://tests.stockfishchess.org/tests/view/5dc58b620ebc5902562bbd47
http://tests.stockfishchess.org/tests/view/5dc58b240ebc5902562bbd3f
http://tests.stockfishchess.org/tests/view/5dc58b810ebc5902562bbd4b
http://tests.stockfishchess.org/tests/view/5dc58b170ebc5902562bbd3d
http://tests.stockfishchess.org/tests/view/5dc58c0c0ebc5902562bbd59
http://tests.stockfishchess.org/tests/view/5dc58e800ebc5902562bbd83
http://tests.stockfishchess.org/tests/view/5dc58c560ebc5902562bbd63
http://tests.stockfishchess.org/tests/view/5dc58b320ebc5902562bbd41
http://tests.stockfishchess.org/tests/view/5dc58c490ebc5902562bbd61
http://tests.stockfishchess.org/tests/view/5dc58b700ebc5902562bbd49
http://tests.stockfishchess.org/tests/view/5dc58e450ebc5902562bbd7e
http://tests.stockfishchess.org/tests/view/5dc58af40ebc5902562bbd38
http://tests.stockfishchess.org/tests/view/5dc58b030ebc5902562bbd3b
http://tests.stockfishchess.org/tests/view/5dc58beb0ebc5902562bbd55
http://tests.stockfishchess.org/tests/view/5dc58b8e0ebc5902562bbd4d
http://tests.stockfishchess.org/tests/view/5dc58c190ebc5902562bbd5b
http://tests.stockfishchess.org/tests/view/5dc58ac30ebc5902562bbd34
http://tests.stockfishchess.org/tests/view/5dc58c290ebc5902562bbd5d
http://tests.stockfishchess.org/tests/view/5dc58c380ebc5902562bbd5f
http://tests.stockfishchess.org/tests/view/5dc58bfa0ebc5902562bbd57
http://tests.stockfishchess.org/tests/view/5dc58b4f0ebc5902562bbd45
http://tests.stockfishchess.org/tests/view/5dc58ae30ebc5902562bbd36

Noteworthy changes are step 7 (futility pruning) going from ~30 to ~49 Elo and step 14 (pruning at shallow depth) going from ~170 to ~204 Elo.
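
As a rough sanity check on the quoted error bar, the sketch below estimates the 95% half-width for a 20000-game match between near-equal engines. It is not how fishtest computes its error bars; the 60% draw ratio, the assumption of independent games, and the function names are all assumptions made for illustration.

```python
import math

def elo_from_score(score: float) -> float:
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_bar(games: int, draw_ratio: float, z: float = 1.96) -> float:
    """Approximate 95% half-width in Elo for a match between near-equal engines.

    Assumes independent games and a mean score close to 0.5.
    """
    # Per-game score variance when wins and losses are split evenly.
    variance = (1.0 - draw_ratio) / 4.0
    std_err = math.sqrt(variance / games)
    # Slope of the Elo curve at score 0.5: d(elo)/d(score) = 1600 / ln(10).
    slope = 1600.0 / math.log(10.0)
    return z * std_err * slope

# The draw ratio is an assumed value, not taken from the tests listed above.
error_20k = elo_error_bar(games=20000, draw_ratio=0.60)
print(f"+- {error_20k:.1f} Elo")
```

Under these assumptions the result lands close to the +- 3 Elo quoted above; a different draw ratio shifts it somewhat.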

@Rocky640 suggested looking at the time control dependence of these terms.
I picked two large terms (early futility pruning and singular extension), which
therefore have a small relative error. The result is quite interesting (see figure 1).
Contrary to my expectation, the Elo gain for early futility pruning is pretty time
control sensitive, while the singular extension gain is not.

Figure 1: TC dependence of two search terms (image: elo_search_tc)

Going back to the old measurement of futility pruning (30 Elo vs today 50 Elo),
the code is actually identical but the margins have changed. It seems like a nice
example of how connected terms in search really are, i.e. the value of early futility
pruning increased significantly due to changes elsewhere in search.

No functional change.

@Rocky640

Would it make sense to run some of those measurements at LTC? This would highlight the depth-sensitive areas of search.

@vondele
Member Author

vondele commented Nov 10, 2019

Interesting, but I suspect that for the terms with small Elo impact we would need much more accurate estimates to do such a test. Are there any of the large Elo terms that you would expect to be TC sensitive?
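
To make the accuracy concern concrete, here is a small sketch of the relative error for terms of different sizes at a fixed +- 3 Elo error bar, and of how the game count would have to grow to tighten that bar. The ~49 and ~204 values come from the PR description; the ~5 Elo term and the function names are hypothetical, and the scaling assumes the error shrinks as 1/sqrt(games).

```python
import math

def relative_error(elo: float, error_bar: float = 3.0) -> float:
    """Error bar expressed as a fraction of the measured Elo value."""
    return error_bar / abs(elo)

def games_needed(target_error: float, ref_games: int = 20000,
                 ref_error: float = 3.0) -> int:
    """Games needed for a target error bar, assuming error ~ 1/sqrt(games)."""
    return math.ceil(ref_games * (ref_error / target_error) ** 2)

terms = [
    ("futility pruning", 49.0),
    ("pruning at shallow depth", 204.0),
    ("hypothetical small term", 5.0),  # not a measured value
]
for label, elo in terms:
    print(f"{label}: {elo:.0f} +- 3 Elo -> {relative_error(elo):.0%} relative error")

print("games for +- 1 Elo:", games_needed(1.0))
```

The quadratic growth in games is why comparing small terms across time controls would be expensive.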

Generally, I'm a bit critical of TC sensitivity. For me, the relevant numbers are:
https://github.com/glinscott/fishtest/wiki/UsefulData#elo-change-with-respect-to-tc

Counter question... would you (or any eval expert) be interested in doing something similar in Eval?
The page https://github.com/glinscott/fishtest/wiki/UsefulData#elo-contributions-from-various-evaluation-terms has gone stale, and I think only documentation in the code can survive over time.

@vondele
Member Author

vondele commented Nov 10, 2019

So, I've added two LTC measurements to the queue: futility and singular extension. Both have large contributions, and futility applies only at low depth (<7) while SE applies at high depth (>=6) ... let's see.

@FauziAkram
Contributor

@vondele I have created something similar for eval terms, you can find it here:
https://onedrive.live.com/edit.aspx?cid=7d656668e4e2c5e8&page=view&resid=7D656668E4E2C5E8!635&parId=7D656668E4E2C5E8!105&app=Excel

But it may be a bit outdated by now and might need some updates.

@Rocky640

Such tests might help discover a simplification or two.

To start with, we could run at least a rough estimate of threats(), passed(), space(), and initiative().
For the king, any change will break the kingDanger calculation, so better to estimate it as a whole too.

It would also be interesting to see the impact of using
S((mg+eg)/2, (mg+eg)/2) before scaling. This would measure the value of the "tapered eval".
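
A minimal sketch of why that substitution isolates the tapering effect: once both components of a score pair are replaced by their average, interpolating by game phase always gives the same value. The phase scale of 128, the function names, and the example score pair are assumptions for illustration, not the engine's actual code.

```python
PHASE_MIDGAME = 128  # assumed phase scale: 128 = full midgame, 0 = pure endgame

def tapered(mg: int, eg: int, phase: int) -> int:
    """Interpolate between midgame and endgame values by game phase."""
    return (mg * phase + eg * (PHASE_MIDGAME - phase)) // PHASE_MIDGAME

def flattened(mg: int, eg: int) -> tuple:
    """Replace both components by their average, as in S((mg+eg)/2, (mg+eg)/2)."""
    avg = (mg + eg) // 2
    return avg, avg

mg, eg = 120, 40  # hypothetical score pair
flat_mg, flat_eg = flattened(mg, eg)
for phase in (128, 64, 0):
    # The tapered value moves with the phase; the flattened value does not.
    print(phase, tapered(mg, eg, phase), tapered(flat_mg, flat_eg, phase))
```

Any Elo difference between the two variants would then be attributable to phase interpolation rather than to the average size of the bonus.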

A third set of tests would be to disable the respective piece eval in piece(), or the pawn eval contribution, or psqt, or mobility.
But instead of completely disabling such a feature, it would be more informative to replace it with an average value, computed with a short bench run against more midgame positions (to find the average mg) and more endgame positions (to find the average eg), using some dbg_mean_of.

A fourth set of tests would disable each individual bonus.

Another area of research would be to test each individual bonus at 50% and at 150% of its value. It is quite possible that, despite all the tuning, some bonuses are stuck at a local maximum
and we are missing the global maximum. Such tests would also give more data about the sensitivity of each bonus.

Looking back at Fauzi's results, it seems that
anything below 3 ELO could be a candidate for removal if proper adjustments are found.
For example RookOnPawn, ThreatByRank, HinderPassedPawn were all between 1 and 3 ELO
and have been removed since then.

One bonus which we still have is MinorBehindPawn. Removing it "as is" will not work, but it might be removed if we cleverly adjust some mg psqt values and a few other bonuses.

@Rocky640

Here is a more direct link to Fauzi's work, which is about 1 year old, in case someone knows how to replace the dead link
https://github.com/glinscott/fishtest/wiki/UsefulData#elo-contributions-from-various-evaluation-terms with this:

Stockfish Feature's Estimated Elo worth (1).xlsx

A few bonuses have been introduced or modified since then.

@vondele
Member Author

vondele commented Nov 10, 2019

I've updated the link on the wiki (just click 'Edit' on the top of the page).

If you find the time, please submit the Eval tests, I think that would be useful.

@FauziAkram
Contributor

@ttruscott
Contributor

I'd like an ELO estimate for the has_game_cycle() check in search.cpp

@snicolet
Member

snicolet commented Nov 12, 2019

Thanks for running the tests!

My suggestion would be to keep the same pattern for Elo estimates in the code as in current master, using a scale of ~2, ~5, ~10, ~15, ~20, ~30, ~40, ~50, etc. instead of writing last-digit accuracy which we don't have.

This is to avoid people running Elo experiments every two weeks to see if the last digits have changed...
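
For illustration only, here is a sketch of what snapping measured values to such a coarse scale could look like. The scale values up to 50 are the ones listed above; how to continue beyond 50 is not specified ("etc."), so rounding to the nearest 10 there is an assumption, as are the function name and the example value 7.

```python
SCALE = [2, 5, 10, 15, 20, 30, 40, 50]  # values listed in the suggestion

def round_to_scale(elo: float) -> int:
    """Snap a measured Elo value to a coarse scale.

    Above 50 the suggestion only says "etc."; rounding to the nearest 10
    there is an assumption made for this sketch.
    """
    if elo > SCALE[-1]:
        return int(round(elo / 10.0)) * 10
    # Pick the scale entry closest to the measured value.
    return min(SCALE, key=lambda s: abs(s - elo))

# 49 and 204 are the measured values from the PR description; 7 is hypothetical.
for measured in (49, 204, 7):
    print(f"{measured} -> ~{round_to_scale(measured)}")
```

The trade-off is that each rounded value can move away from the measurement by up to half the local step of the scale.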

@vondele
Member Author

vondele commented Nov 12, 2019

@snicolet, let's keep the result as obtained from the tests. Rounding numbers needlessly increases the error. Not rerunning these tests often should be just a policy (and hasn't been a problem so far).

@vondele
Member Author

vondele commented Nov 12, 2019

Figure: TC dependence of two search terms (image: elo_search_tc)

@Rocky640 suggested looking at the TC dependence of these terms. I picked two large terms, which therefore have a small relative error. The result is quite interesting. Contrary to my expectation, early futility pruning is pretty TC sensitive, while singular extension is not.

Going back to the old measurement of futility pruning (30 Elo vs today 49 Elo), the code is actually identical. It seems like a nice example of how connected terms in search really are, i.e. the value of early futility pruning increased significantly due to changes elsewhere in search.

@Alayan-stk-2

Could you do a measurement for the multicut part of singular extension search?

@Vizvezdenec
Contributor

The code in futility pruning is identical, but the futility margin itself is vastly different.
#2270 and the following PR by proton changed this quite a lot.

snicolet pushed a commit that referenced this pull request Jan 10, 2020
This updates estimates from 1.5 year ago, and adds missing terms. All estimates
from tests run on fishtest at 10+0.1 (STC), 20000 games, error bars +- 3 Elo,
see the original message in the pull request for the full list of tests.
Noteworthy changes are step 7 (futility pruning) going from ~30 to ~50 Elo
and step 13 (pruning at shallow depth) going from ~170 to ~200 Elo.

Full list of tests: #2401

@Rocky640 made the suggestion to look at time control dependence of these terms.
I picked two large terms (early futility pruning and singular extension), so with
small relative error. It turns out it is actually quite interesting (see figure 1).
Contrary to my expectation, the Elo gain for early futility pruning is pretty time
control sensitive, while singular extension gain is not.

Figure 1: TC dependence of two search terms
![elo_search_tc]( http://cassio.free.fr/divers/elo_search_tc.png )

Going back to the old measurement of futility pruning (30 Elo vs today 50 Elo),
the code is actually identical but the margins have changed. It seems like a nice
example of how connected terms in search really are, i.e. the value of early futility
pruning increased significantly due to changes elsewhere in search.

No functional change.
@snicolet
Member

Merged via 114ddb7, thanks :-)

@snicolet snicolet closed this Jan 10, 2020
MichaelB7 added a commit to MichaelB7/Stockfish that referenced this pull request Jan 16, 2020

Rewrite initialization of PseudoMoves

This is a non-functional code style change. I believe master is a bit convoluted
here and propose this version for clarity.

No functional change
@BM123499 BM123499 mentioned this pull request May 12, 2021
vondele pushed a commit to vondele/Stockfish that referenced this pull request Dec 21, 2021
This updates estimates from 2yr ago official-stockfish#2401, and adds missing terms.
All tests run at 10+0.1 (STC), 20000 games, error bars +- 1.8 Elo, book 8moves_v3.pgn.

A table of Elo values with the links to the corresponding tests can be found at the PR

closes official-stockfish#3868

Non-functional Change