128 GB TT size limitation #1349

Closed
syzygy1 opened this issue Dec 29, 2017 · 33 comments

@syzygy1
Contributor

syzygy1 commented Dec 29, 2017

I am opening this issue to discuss whether there is a need to lift the current restriction of the maximum size of the transposition table to 128 GB. The reason for the restriction is that the current implementation of first_entry() in tt.h does not allow the TT to have more than 2^32 clusters (of 32 bytes each):

return &table[(uint32_t(key) * uint64_t(clusterCount)) >> 32].entry[0];

If clusterCount has more than 32 bits, the multiplication overflows.

The restriction can be lifted by making use of the fact that the user can only specify TT sizes that are a multiple of 1 MB. 1 MB corresponds to 2^15 clusters, which means that clusterCount is always a multiple of 2^15. This allows us to increase the max TT size to up to 16 TB (2^39 clusters). I have tested this approach with a max size of 1 TB here (big_hash3):

http://tests.stockfishchess.org/tests/view/5a425d700ebc590ccbb8c19e

Unfortunately it failed.
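For reference, a minimal sketch of the kind of indexing the multiple-of-2^15 property allows (illustrative only, not necessarily the actual big_hash3 code): the low 32 key bits select a 1 MB block of clusters via the usual multiply-high, and 15 further key bits select the cluster inside that block, so nothing overflows 64 bits even with more than 2^32 clusters.

  TTEntry* first_entry(const Key key) const {
      uint64_t mbCount = clusterCount >> 15;               // number of 1 MB blocks (fits in 32 bits up to 16 TB)
      uint64_t block   = (uint32_t(key) * mbCount) >> 32;  // which 1 MB block (multiply-high)
      uint64_t sub     = (key >> 32) & 0x7FFF;             // cluster within the block (15 more key bits)
      return &table[block * (1ULL << 15) + sub].entry[0];
  }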

Another approach is to let the multiplication overflow and use the 64 highest bits of the result as an index into the TT:

return &table[(key * (unsigned __int128)clusterCount) >> 64].entry[0];

This might compile to slightly more efficient code than the big_hash3 approach. Unfortunately it gives a warning with gcc -pedantic, because __int128 is not standard C++, and it seems to require different code for MSVC (UnsignedMultiplyHigh()) and also for 32-bit architectures.

Because this will uglify rather than simplify the code, I do not know if it is worth testing it. At some point in the future it will be important to lift the restriction to 128 GB, but is it already sufficiently important now?
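For what it's worth, the high 64 bits of a 64x64-bit product can also be computed portably without __int128, using the usual long multiplication on 32-bit halves. A sketch (the helper name mul_hi64 is purely illustrative; IS_64BIT is the existing build macro):

  #include <cstdint>

  uint64_t mul_hi64(uint64_t a, uint64_t b) {
#if defined(__GNUC__) && defined(IS_64BIT)
      __extension__ typedef unsigned __int128 uint128;
      return uint64_t(((uint128)a * (uint128)b) >> 64);
#else
      // schoolbook multiplication on 32-bit halves, keeping only the high 64 bits
      uint64_t aL = uint32_t(a), aH = a >> 32;
      uint64_t bL = uint32_t(b), bH = b >> 32;
      uint64_t c1 = (aL * bL) >> 32;
      uint64_t c2 = aH * bL + c1;
      uint64_t c3 = aL * bH + uint32_t(c2);
      return aH * bH + (c2 >> 32) + (c3 >> 32);
#endif
  }

first_entry() could then simply return &table[mul_hi64(key, clusterCount)].entry[0].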

@syzygy1
Contributor Author

syzygy1 commented Dec 29, 2017

@gvreuls wondered whether poor quality Zobrist keys could be the explanation for why big_hash3 failed:
#1345 (comment)

This also crossed my mind. I think I could shuffle the Zobrist keys a bit to make sure that hash collisions for big_hash3 are identical to hash collisions for master. That would allow a direct speed comparison.

@syzygy1
Contributor Author

syzygy1 commented Dec 30, 2017

After shuffling the Zobrist keys to keep the hash collisions and node counts the same as for master, big_hash3 seems to be about as fast as master, and big_hash2 (taking the high 64 bits of key * clusterCount) is a bit faster than both. I've submitted a test for big_hash2:
http://tests.stockfishchess.org/tests/view/5a46da3c0ebc590ccbb8c3a9

@MichaelB7
Contributor

Good Luck!

@gvreuls
Contributor

gvreuls commented Dec 30, 2017

I'm running a local test where I discard the 16 least significant and 16 most significant bits of the PRNG output in the hope of generating better quality Zobrist keys (and magic numbers), but the results aren't very promising (<0.5 Elo at 10000 games sprt[-3,1]). I'm wondering if a cryptographically secure PRNG would perform better. Does it make sense to test ChaCha20 or Fortuna as a CSPRNG, or is that overkill?

@vdbergh
Contributor

vdbergh commented Dec 30, 2017

@gvreuls I find it hard to believe that the PRNG could be at fault. Modern PRNGs pass the most stringent statistical tests, so it should not matter which bits you select. Just my personal opinion. Why big_hash keeps failing remains a mystery. Terribly bad luck is still possible (something similar happened with the 3-fold repetition patch: over the years it was tested several times and always failed, until it suddenly passed).

@gvreuls
Contributor

gvreuls commented Dec 30, 2017

@vdbergh It was just a shot in the dark. Selecting bits does make a tiny difference (more than picking a different seed), but the difference is insignificant over a large enough sample. The PRNG is indeed of very high quality; if anything, twiddling the bits for a day convinced me of that.

@syzygy1
Contributor Author

syzygy1 commented Dec 30, 2017

big_hash2 failed relatively quickly, and this cannot be attributed to the PRNG since I shuffled the keys to get identical collisions.

So unless this is all bad luck, this code is just extremely speed-sensitive. It is indeed executed quite often (for TT probes and prefetches). But it is puzzling that it was possible to replace the AND operation with a multiplication.
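For context, the two indexing schemes being compared look roughly like this (a sketch; the function names are only for the comparison, and the masking variant requires clusterCount to be a power of two, which is the restriction the multiply/shift variant removed):

  // older power-of-two scheme: mask off the low key bits
  TTEntry* first_entry_mask(const Key key) const {
      return &table[uint32_t(key) & (clusterCount - 1)].entry[0];
  }

  // current scheme: scale the low 32 key bits to an arbitrary clusterCount <= 2^32
  TTEntry* first_entry_mul(const Key key) const {
      return &table[(uint32_t(key) * uint64_t(clusterCount)) >> 32].entry[0];
  }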

@CoffeeOne

CoffeeOne commented Dec 30, 2017

@syzygy1
Here is my answer to the question of whether we need more than 128 GB hash for Stockfish:
NO.
I tested on a 40-core machine with 512 GB RAM under Windows. Please note that things will be different under Linux (I ran tests some time ago on a 64-core machine under both Linux and Windows).
Stockfish slows down a lot with a 128 GB hash: it starts calculating at a mediocre speed, and then the node speed slowly goes up.
Test with latest stockfish master:
setoption name Threads value 80 (use maximum possible threads to fill the hash)
setoption name Hash value 131072
go movetime 600000
Result:
info depth 37 seldepth 56 multipv 1 score cp 8 nodes 24406500267 nps 40677093 hashfull 880 tbhits 0 time 600006 pv e2e4 e7e6 g1f3 d7d5 e4d5 e6d5 d2d4 f8d6 f1d3
g8f6 e1g1 e8g8 c1g5 h7h6 g5h4 c7c6 b1d2 c8g4 h2h3 g4f3 d1f3 b8d7 c2c4 d6b4 f1d1 b4d2 d1d2 d8a5 d2d1 d5c4 d3c4 a5h5 f3h5 f6h5 d4d5 d7b6 a1c1 c6d5 c4d5 b6d5 d1d5
g7g5 h4g3 h5g3 f2g3 f8d8 d5d8 a8d8 c1c7 d8d1 g1h2 d1d2 c7b7 f7f5 b7a7 d2b2
info depth 38 seldepth 51 multipv 1 score cp 20 nodes 24406500267 nps 40676890 hashfull 880 tbhits 0 time 600009 pv e2e4 e7e6 g1f3 d7d5 e4d5 e6d5 d2d4 g8f6 f1d3
f8d6 e1g1 e8g8 c1g5 h7h6 g5h4 c7c6 b1d2 c8g4 c2c4 b8d7 c4d5 c6d5 d1b3 d8b6 b3b6 a7b6 f1e1 f8e8 e1e8 f6e8 a2a3 g4f3 d2f3 d7f8 h4g3 f8e6 d3f5 d6g3 h2g3 e8d6 f5e6
f7e6
bestmove e2e4 ponder e7e6
=> So Stockfish cannot fill the hash with 80 threads from the starting position within 10 minutes; the hash is only 88% full.
Moreover, the speed at the beginning was very low (below 1 MN/s, while the end speed here was 40 MN/s).
So increasing the hash even further would make things worse; it is already bad with 128 GB.
I am thinking of opening a new issue about introducing large pages for Windows, which would dramatically increase the hash fill rate (bringing it to the same performance as the Linux version). But I remember that was rejected by @mcostalba.

@vondele
Member

vondele commented Dec 31, 2017

@CoffeeOne I wonder if the speed observation you make is related to the fact that this is the first search after allocating the hash.

What happens if, after the first search is completed, you do a
ucinewgame
go movetime 600000
(i.e. without changing the hash size in between, etc.)? In particular, do you see good speed at the beginning?

Also, on Linux I see slow performance initially for the first search and good performance later on. The second search has good performance.

@CoffeeOne

Hi,

I repeated the run:
First after fresh start of stockfish:
setoption name Threads value 80
setoption name Hash value 131072
go movetime 600000
info depth 35 seldepth 50 multipv 1 score cp 15 lowerbound nodes 23400010668 nps 40157559 hashfull 856 tbhits 0 time 582705 pv d2d4
info depth 35 currmove d2d4 currmovenumber 1
info depth 35 seldepth 50 multipv 1 score cp 15 nodes 24228199761 nps 40379794 hashfull 874 tbhits 0 time 600008 pv d2d4
info depth 35 seldepth 53 multipv 1 score cp 19 nodes 24228199761 nps 40379727 hashfull 874 tbhits 0 time 600009 pv e2e4 e7e6 d2d4 d7d5 b1c3 g8f6 c1g5 d5e4 c3e4
f8e7 g5f6 e7f6 e4f6 d8f6 g1f3 e8g8 c2c3 c8d7 f1e2 d7c6 e1g1 b8d7 b2b4 b7b6 a2a4 a7a5 f1e1 c6b7 b4a5 a8a5 e2b5 f8d8 d1e2 b7f3 e2f3 f6f3 g2f3 d7f6 g1g2 c7c5 d4c5
b6c5
bestmove e2e4 ponder e7e6
Second after
ucinewgame
go movetime 600000:
info depth 38 seldepth 45 multipv 1 score cp 6 upperbound nodes 27868858949 nps 47958222 hashfull 936 tbhits 0 time 581107 pv d2d4 g8f6
info depth 38 currmove d2d4 currmovenumber 1
info depth 38 seldepth 45 multipv 1 score cp 6 nodes 28759727984 nps 47932081 hashfull 938 tbhits 0 time 600010 pv d2d4 g8f6
info depth 38 seldepth 52 multipv 1 score cp 23 nodes 28759727984 nps 47932001 hashfull 938 tbhits 0 time 600011 pv d2d4 e7e6 g1f3 g8f6 c2c4 d7d5 b1c3 c7c6 e2e3
b8d7 d1c2 f8d6 f1d3 d5c4 d3c4 a7a6 a2a4 c6c5 d4c5 d6c5 e1g1 e8g8 c3e4 f6e4 c2e4 d7f6 e4c2 b7b6 b2b3 c8b7 c1b2 h7h6 a1c1 d8c7 c4d3 a8c8 b3b4 c5d6 c2e2 c7d8 d3a6
c8c1 f1c1
bestmove d2d4 ponder e7e6

On run 1 Stockfish needed 1 minute (no joke) to reach 1 MN/s;
on run 2 it started at full speed (40+ MN/s) right from the beginning.
So you were right.
This slowdown effect on the first run is very big with a large hash, though. It is much smaller with, for example, only a 16 GB hash.

@vondele
Member

vondele commented Dec 31, 2017

OK, in that case I think I understand what is happening. We allocate the TT with calloc, which just reserves the pages (depending on the OS, etc.). When we first write to a page (which happens during search), that page is actually obtained and zeroed as needed (OS dependent). So basically, one pays the overhead of paging in and zeroing 128 GB of memory during the (first) search. The slowdown is smaller with a smaller hash, as there is less to zero/page in.
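A minimal sketch of the corresponding fix, assuming the existing table, clusterCount and Cluster members in tt.cpp: touch and zero the whole table right after (re)allocation, so the page-in and zeroing cost is paid at setoption time instead of during the first search.

  #include <cstring>

  void TranspositionTable::clear() {
      // touching every byte forces the OS to actually back (and zero) the pages now
      std::memset(table, 0, clusterCount * sizeof(Cluster));
  }

Calling this from resize() (and on ucinewgame) would move the cost out of the search.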

@CoffeeOne

That seems like an explanation :) Could we pre-zero the hash, to avoid doing that during game play / analysis?

@vondele
Member

vondele commented Dec 31, 2017

That would be easy, but can we first do another experiment?

What happens if, in the above sequence of commands, you issue a ucinewgame just after setting the hash size (and repeat the two searches)? What I would like to see is whether the MN/s of the second (fast) search is influenced by doing the zeroing in search or before search. (There could be some differences in where the pages are allocated in a NUMA environment, depending on which thread touches a page first.)

I don't think game play will be affected, as at least cutechess will issue a ucinewgame before any game.

@gvreuls
Contributor

gvreuls commented Dec 31, 2017

IMHO touching all allocated pages before the game clock starts to run is a good idea anyhow: it makes sure the time on the clock is spent productively instead of waiting on the operating system, and it makes identical behaviour across different operating systems and machine architectures more likely.

@CoffeeOne

@vondele
First run:
setoption name Threads value 80
setoption name Hash value 131072
ucinewgame (takes about half a minute)
go movetime 600000
....
info depth 36 seldepth 48 multipv 1 score cp 10 upperbound nodes 26835636659 nps 46678222 hashfull 904 tbhits 0 time 574907 pv d2d4 g8f6
info depth 36 currmove d2d4 currmovenumber 1
info depth 36 seldepth 48 multipv 1 score cp 10 nodes 28023281274 nps 46705235 hashfull 915 tbhits 0 time 600003 pv d2d4 g8f6
info depth 37 seldepth 50 multipv 1 score cp 23 nodes 28023281274 nps 46705079 hashfull 915 tbhits 0 time 600005 pv e2e4
bestmove e2e4 ponder e7e6
Second run:
ucinewgame
go movetime 600000
....
info depth 36 seldepth 49 multipv 1 score cp 11 nodes 21475322356 nps 47393082 hashfull 865 tbhits 0 time 453132 pv e2e4 e7e5 g1f3 b8c6 f1c4 f8c5 d2d3 g8f6 c2c3
e8g8 e1g1 d7d6 h2h3 h7h6 f1e1 a7a5 a2a4 c8e6 b1a3 c5a3 c4e6 f7e6 a1a3 a8b8 d1b3 d8e7 d3d4 e7f7 d4e5 c6e5 f3e5 d6e5 f2f3 b7b6 b3c4 f7e7 c1e3 e7d6 c4b3
info depth 37 currmove e2e4 currmovenumber 1
info depth 37 seldepth 49 multipv 1 score cp 11 nodes 28414021913 nps 47355913 hashfull 933 tbhits 0 time 600010 pv e2e4 e7e5 g1f3 b8c6 f1c4 f8c5 d2d3 g8f6 c2c3
e8g8 e1g1 d7d6 h2h3 h7h6 f1e1 a7a5 a2a4 c8e6 b1a3 c5a3 c4e6 f7e6 a1a3 a8b8 d1b3 d8e7 d3d4 e7f7 d4e5 c6e5 f3e5 d6e5 f2f3 b7b6 b3c4 f7e7 c1e3 e7d6 c4b3
info depth 37 seldepth 48 multipv 1 score cp 19 nodes 28414021913 nps 47355756 hashfull 933 tbhits 0 time 600012 pv d2d4 e7e6 c2c4 d7d5 g1f3 g8f6 b1c3 f8e7 c1f4
e8g8 e2e3 b8d7 c4c5 f6h5 f1d3 h5f4 e3f4 b7b6 b2b4 a7a5 a2a3 c7c6 e1g1 g7g6 d1e2 c8b7 f3e5 b6b5 f1b1 d8c7 g2g3 e7f6 g1g2 f6g7 b1e1 f8e8 e2d2 a5b4 a3b4 d7f6 e5f3
e8d8
bestmove d2d4 ponder e7e6

I would say there is no big change for the second run. The first run got a massive speedup, but it is still a bit slower than the second run.

@gvreuls
Contributor

gvreuls commented Dec 31, 2017

Here is a better argument for not relying on the ucinewgame command; sorry if I wasn't clear enough the first time. From the UCI protocol http://wbec-ridderkerk.nl/html/UCIProtocol.html:

  • ucinewgame
    this is sent to the engine when the next search (started with "position" and "go") will be from a different game. This can be a new game the engine should play or a new game it should analyse but also the next position from a testsuite with positions only.
    If the GUI hasn't sent a "ucinewgame" before the first "position" command, the engine shouldn't expect any further ucinewgame commands as the GUI is probably not supporting the ucinewgame command. So the engine should not rely on this command even though all new GUIs should support it.
    As the engine's reaction to "ucinewgame" can take some time the GUI should always send "isready" after "ucinewgame" to wait for the engine to finish its operation.

(emphasis mine)

@mstembera
Contributor

@syzygy1
To avoid using __int128 and to make the same source compile under both gcc and MSVC, we can use the intrinsic _umul128. It may also be worth retesting, as it may be more efficient.

#include <intrin.h>

  TTEntry* first_entry(const Key key) const {
#ifdef IS_64BIT
    // take the high 64 bits of the 128-bit product key * clusterCount
    uint64_t highProduct;
    _umul128(key, clusterCount, &highProduct);
    return &table[highProduct].entry[0];
#else
    // 32-bit fallback: scale the high 32 key bits to the cluster count
    return &table[((key >> 32) * uint64_t(clusterCount)) >> 32].entry[0];
#endif
  }

vondele added a commit to vondele/Stockfish that referenced this issue Jan 1, 2018
as discussed in issue official-stockfish#1349, the way pages are allocated with calloc might imply some overhead on first write.
This overhead can be large and slow down the first search after a TT resize significantly, especially for large TT.
Using an explicit clear of the TT on resize fixes this problem.

Not implemented, but possibly useful for large TT, is to do this zero-ing using all search threads. Not only would this be faster, it could also lead to a more favorable memory allocation on numa systems with a first touch policy.

No functional change.
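A hypothetical sketch of that multi-threaded zeroing (illustrative only; here the thread count is passed in explicitly rather than read from the UCI options): each thread clears its own stride of clusters, so under a first-touch NUMA policy the pages are backed on the node of the thread that first writes them.

  #include <cstring>
  #include <thread>
  #include <vector>

  void TranspositionTable::clear(size_t threadCount) {
      std::vector<std::thread> threads;
      for (size_t i = 0; i < threadCount; ++i)
          threads.emplace_back([this, i, threadCount]() {
              // each thread zeroes its own contiguous range of clusters
              const size_t stride = clusterCount / threadCount;
              const size_t start  = stride * i;
              const size_t len    = (i + 1 < threadCount) ? stride : clusterCount - start;
              std::memset(&table[start], 0, len * sizeof(Cluster));
          });
      for (std::thread& th : threads)
          th.join();
  }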
mcostalba pushed a commit that referenced this issue Jan 1, 2018
@syzygy1
Contributor Author

syzygy1 commented Jan 1, 2018

@mstembera
Unfortunately intrin.h and _umul128 don't seem to exist for gcc/Linux.
Clang does have them (and also defines __umulh() which seems a bit more convenient).

@syzygy1
Contributor Author

syzygy1 commented Jan 1, 2018

@CoffeeOne
I'm aware of the various intrinsics header files. They don't contain _umul128().

@vondele
Member

vondele commented Jan 3, 2018

The following passed a [-3, 1]:
http://tests.stockfishchess.org/tests/view/5a466abd0ebc590ccbb8c38f
It is likely a small slowdown. Not sure if I want to turn it into a pull request.

@CoffeeOne

Let's ask @mcostalba for his opinion.

@syzygy1
Contributor Author

syzygy1 commented Jan 13, 2018

If it gets committed, a simplification to one of the approaches that failed seems likely to pass :-)

atumanian pushed a commit to atumanian/Stockfish that referenced this issue Jan 24, 2018
@yaneurao

yaneurao commented May 3, 2018

@mstembera

    _umul128(key, clusterCount, &highProduct);
    return &table[highProduct].entry[0];

I think the above code is not so good, for the following two reasons.

  1. In TranspositionTable::probe(), we use the high 16 bits of the hash key for TTEntry::key16:

const uint16_t key16 = key >> 48; // Use the high 16 bits as key inside the cluster

  2. In the singular extension search, the exclusion key differs from the position key only in bits 16..31:

posKey = pos.key() ^ Key(excludedMove << 16); // Isn't a very good hash

So highProduct should be derived from bits 0..47 of the key, not from its highest bits.

e.g.
_umul128(key << 16, clusterCount, &highProduct);

yaneurao added a commit to yaneurao/YaneuraOu that referenced this issue May 3, 2018
	- 128 GB TT size limitation : official-stockfish/Stockfish#1349
		- The fix proposals written in that thread are not good in several respects.
			- cf. "An uninspired way to remove the transposition table's 128 GB limit" (in Japanese): http://yaneuraou.yaneu.com/2018/05/03/%E7%BD%AE%E6%8F%9B%E8%A1%A8%E3%81%AE128gb%E5%88%B6%E9%99%90%E3%82%92%E5%8F%96%E3%81%A3%E6%89%95%E3%81%86%E5%86%B4%E3%81%88%E3%81%AA%E3%81%84%E6%96%B9%E6%B3%95/
	- Improved the code that computes the hash key for singular extension.
@syzygy1
Contributor Author

syzygy1 commented May 3, 2018

@yaneurao
The proposed patches also change the two points you mention.

goodkov pushed a commit to goodkov/Stockfish that referenced this issue Jul 21, 2018
@Mindbreaker1

If and when large 3D XPoint memory modules become affordable, we may need this. There are 960 GB modules available now, but they are not cheap. They are slower than RAM, but they can be rewritten a large number of times, are much faster than flash SSDs, and are designed to be able to function as RAM. https://www.newegg.com/Product/Product.aspx?Item=9SIA6ZP7E00339&cm_re=optane-_-20-167-458-_-Product

As they are slower, it is hard to say whether this would really be useful for something like postal (correspondence) chess or not.

@syzygy1
Contributor Author

syzygy1 commented Dec 2, 2018

I'm closing this for now. To be reopened when the first complaint about the limitation is received.

@syzygy1 syzygy1 closed this as completed Dec 2, 2018
@skiminki
Contributor

skiminki commented Jun 5, 2020

Something like this should be able to extend the hash size with minimal impact: https://github.com/skiminki/Stockfish/tree/more-memory

With this patch, the bench id is preserved. I don't have access to any box with >128 GB memory to test, though.

@vondele vondele reopened this Jun 5, 2020
@vondele
Member

vondele commented Jun 5, 2020

@skiminki I'll reopen this issue, since >128 GB RAM is becoming increasingly common. However, you would need to test your patch on fishtest for non-regression against current master with normal hash sizes.

@skiminki
Contributor

skiminki commented Jun 6, 2020

This wasn't actually meant as a PR; it was more of a concept. The idea is simple: we have unused bits in the Zobrist hash key. By grouping clusters into bigger units (super clusters) with a power-of-two size, we can easily use some of the currently unused key bits to select the cluster within the super cluster.

In fact, the simplest approach would have been to:

  // select the super cluster with the existing formula
  superClusterIndex = (uint32_t(key) * uint64_t(superClusterCount)) >> 32;

  // use bits 32..39 to choose the cluster within super cluster
  clusterSubIndex = uint8_t(key >> 32);

  return &table[superClusterIndex * 256 + clusterSubIndex].entry[0];

But the simplest approach changes the hash indexes for hash sizes <= 128 GB, and thus the bench id. Anyway, perhaps I'll submit some fishtest runs after my fishtest account has been activated, to see whether either of the approaches has a significant regression. Hopefully this results in a PR.

Further, we could just as well have made the super cluster size 1 or 2 MB, adding 15/16 bits to the max hash size instead of only 8 bits. However, 32 TB should be enough for anyone™ for now, at least, and I would also like to investigate whether the remaining 8 bits of the Zobrist key can be used to reduce the number of bad hash fetches by adding them to the per-slot 16-bit hash (key16). There should be 7-8 bits per slot that can be squeezed relatively easily into the current cluster layout, but the perf impact remains to be seen (5 bits in the unused cluster bytes, and 2-3 bits in move16).

Just to continue the bad-fetches theme a bit: when the hash is full, 0.0046% of hash probes successfully fetch data from an incorrect position (probability 3 / 2^16, three entries per cluster each checked against a 16-bit key). This assumes that the stores and fetches are unrelated. Anyway, my local test runs verifying the fetches against fully stored positions give bad-fetch rates in that ballpark. (My local test runs are on Ethereal, though, due to its simpler code base.)

0.0046% may not sound like a big probability, but at 100 Mnodes/s (or even at just 1 Mnps) those bad fetches become quite frequent. Further, these bad fetches actually affect the search and sometimes the chosen move. Even adding a move pseudolegality test before using the hash data only culls the bad fetches by a factor of 10 or so (depending on the position; the number is based on a preliminary investigation with Ethereal).

The actual effect of bad fetches on move selection quality is still to be investigated. I am aware of the study that says there's no big effect with such hash collisions, but speeds and hash sizes have increased quite a lot since that study, so I think this deserves another study.

TL;DR: That's the backstory of reserving 8 bits of the Zobrist hash key.

@skiminki
Contributor

skiminki commented Jun 6, 2020

I ended up adding 16 bits to the computation anyway, even though the hash size increases only by 8 bits. The extra 8 bits should help combat the quantization error for non-power-of-2 hash sizes. SF currently has this quantization problem, most visibly with hash sizes between 64 GB and 128 GB. To illustrate,

Hash size = 96G = 1.5 * 2^31 clusters:

Key Index
  0     0
  1     0
  2     1
  3     2
  4     3
  5     3
  6     4
  7     5
  8     6
 ...
2^32-1  2^31+2^30-1

It's easy to see that the keys don't distribute evenly over the cluster indexes. Essentially, this quantization error decreases hash effectiveness a bit, as some clusters receive double the number of keys compared to the rest. With 8 more bits, some clusters get 341 keys and others 342, which is a much more even distribution. (Or something like that; I didn't calculate precisely.)

@skiminki
Contributor

skiminki commented Jun 7, 2020

I missed a final reduction yesterday in that fixed-point 32.16 calculation, resulting in intermediate precision loss. To illustrate the better distribution, we now get the following cluster selection for a 96 GB hash size:

key: 0x0000.00000000   hash index: 0 
key: 0x4000.00000000   hash index: 0 
key: 0x8000.00000000   hash index: 0 
key: 0xc000.00000000   hash index: 0 
key: 0x0000.00000001   hash index: 0 
key: 0x4000.00000001   hash index: 0 
key: 0x8000.00000001   hash index: 1 
key: 0xc000.00000001   hash index: 1 
key: 0x0000.00000002   hash index: 1 
key: 0x4000.00000002   hash index: 1 
key: 0x8000.00000002   hash index: 1 
key: 0xc000.00000002   hash index: 2 
key: 0x0000.00000003   hash index: 2 
key: 0x4000.00000003   hash index: 2 
key: 0x8000.00000003   hash index: 2 
key: 0xc000.00000003   hash index: 2 
key: 0x0000.00000004   hash index: 3 
key: 0x4000.00000004   hash index: 3 
key: 0x8000.00000004   hash index: 3 
key: 0xc000.00000004   hash index: 3 
key: 0x0000.00000005   hash index: 3 
key: 0x4000.00000005   hash index: 3 
key: 0x8000.00000005   hash index: 4 
key: 0xc000.00000005   hash index: 4 
key: 0x0000.00000006   hash index: 4 
key: 0x4000.00000006   hash index: 4 
key: 0x8000.00000006   hash index: 4 
key: 0xc000.00000006   hash index: 5 

The above print-out skips most hash keys, so the distribution is much more even in reality. This printout is produced with the current patch.

Compare this with current upstream, where indexes 0 and 3 get twice as many keys as indexes 1, 2, and 4.

key: 0x00000000   hash index: 0 
key: 0x00000001   hash index: 0 
key: 0x00000002   hash index: 1 
key: 0x00000003   hash index: 2 
key: 0x00000004   hash index: 3 
key: 0x00000005   hash index: 3 
key: 0x00000006   hash index: 4 
key: 0x00000007   hash index: 5

@skiminki
Contributor

skiminki commented Jun 7, 2020

Anyway, the patch passed the non-regression test on fishtest yesterday, but that version didn't fix the quantization error properly. I submitted another run for the updated patch. Since the new version basically just uses a slightly different ordering of shifts and different shift values, I'd expect the updated patch to do similarly well. But let's see.

skiminki added a commit to skiminki/Stockfish that referenced this issue Jun 7, 2020
Group hash clusters into super clusters of 256 clusters. Use super
clusters as the hash size. This scheme allows us to use hash sizes up to
32 TB (= 2^32 super clusters = 2^40 clusters).

Use 48 bits of the Zobrist key to choose the cluster index. We use 8
extra bits to mitigate the quantization error for very large hashes when
scaling the hash key to cluster index.

The hash index computation is organized to be compatible with the existing
scheme for power-of-two hash sizes up to 128 GB. Hash index quantization
error for non-power-of-two hash sizes is significantly reduced also for hash
sizes less than 128 GB, improving key-to-cluster distribution.

Fixes official-stockfish#1349

Passed non-regression STC:
LLR: 2.93 (-2.94,2.94) {-1.50,0.50}
Total: 37976 W: 7336 L: 7211 D: 23429
Ptnml(0-2): 578, 4295, 9149, 4356, 610
https://tests.stockfishchess.org/tests/view/5edcbaaef29b40b0fc95abc5

No functional change.
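For reference, an illustrative sketch of an index computation along the lines described above (48 key bits, super clusters of 256 clusters, fixed-point arithmetic); not necessarily identical to the committed code:

  TTEntry* first_entry(const Key key) const {
      // index = floor(K48 * superClusterCount / 2^40), where K48 uses the low
      // 32 key bits as its high part (so power-of-two sizes up to 128 GB index
      // exactly as before) and bits 32..47 as its low 16 bits
      const uint64_t firstTerm  = uint32_t(key) * uint64_t(superClusterCount);
      const uint64_t secondTerm = (uint16_t(key >> 32) * uint64_t(superClusterCount)) >> 16;
      return &table[(firstTerm + secondTerm) >> 24].entry[0];
  }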
@vondele vondele closed this as completed in 4b10578 Jun 9, 2020
MichaelB7 pushed a commit to MichaelB7/Stockfish that referenced this issue Jun 13, 2020
Conceptually group hash clusters into super clusters of 256 clusters.
This scheme allows us to use hash sizes up to 32 TB
(= 2^32 super clusters = 2^40 clusters).

Use 48 bits of the Zobrist key to choose the cluster index. We use 8
extra bits to mitigate the quantization error for very large hashes when
scaling the hash key to cluster index.

The hash index computation is organized to be compatible with the existing
scheme for power-of-two hash sizes up to 128 GB.

Fixes official-stockfish#1349

closes official-stockfish#2722

Passed non-regression STC:
LLR: 2.93 (-2.94,2.94) {-1.50,0.50}
Total: 37976 W: 7336 L: 7211 D: 23429
Ptnml(0-2): 578, 4295, 9149, 4356, 610
https://tests.stockfishchess.org/tests/view/5edcbaaef29b40b0fc95abc5

No functional change.