128 GB TT size limitation #1349
@gvreuls wondered whether poor quality Zobrist keys could be the explanation for why big_hash3 failed: This also crossed my mind. I think I could shuffle the Zobrist keys a bit to make sure that hash collisions for big_hash3 are identical to hash collisions for master. That would allow a direct speed comparison.
After shuffling the Zobrist keys to keep the hash collisions and node counts the same as for master, big_hash3 seems to be about as fast as master, and big_hash2 (taking the high 64 bits of key * clusterCount) is a bit faster than both. I've submitted a test for big_hash2:
Good luck!
I'm running a local test where I discard the 16 least significant and 16 most significant bits of the PRNG output in the hope of generating better quality Zobrist keys (and magic numbers), but the results aren't very promising (<0.5 Elo at 10000 games, SPRT[-3,1]). I'm wondering if a cryptographically secure PRNG would perform better; does it make sense to test ChaCha20 or Fortuna as a CSPRNG, or is that overkill?
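For concreteness, a sketch of the bit-discarding experiment described above (my reconstruction, not the actual test code; the generator and seed are illustrative, not Stockfish's PRNG setup):

```cpp
#include <cstdint>
#include <random>

// Keep only the middle 32 bits (bits 16..47) of each 64-bit PRNG output and
// splice two such halves into one 64-bit Zobrist key. Generator and seed are
// illustrative choices for this sketch.
static std::mt19937_64 rng(20180101);

uint64_t middle_bits_key() {
    uint64_t hi = (rng() << 16) >> 32;  // middle 32 bits of the first draw
    uint64_t lo = (rng() << 16) >> 32;  // middle 32 bits of the second draw
    return (hi << 32) | lo;
}
```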
@gvreuls I find it hard to believe that the PRNG could be at fault. Modern PRNGs pass the most stringent statistical tests, so it should not matter which bits you select. Just my personal opinion. Why big_hash keeps failing remains a mystery. Terribly bad luck is still possible (something similar happened with the 3-fold repetition patch: over the years it was tested several times and it always failed, until it suddenly passed).
@vdbergh It was just a shot in the dark; selecting bits does make a tiny difference (more than picking a different seed), but the difference is insignificant over a large enough sample. The PRNG is indeed of very high quality; if anything, twiddling the bits for a day convinced me of that.
big_hash2 failed relatively quickly, and this cannot be attributed to the PRNG, since I shuffled the keys to get identical collisions. So unless this is all bad luck, this code is just extremely speed-sensitive. It is indeed executed quite often (for TT probes and prefetches). But it is puzzling that it was possible to replace the AND operation with a multiplication.
@syzygy1
@CoffeeOne I wonder if the speed observation you make is related to the fact that this is the first search after allocating the hash. What happens if, after the first search is completed, you run the same search again? Also on Linux I see slow performance initially for the first search, and good performance later on. The second search has good performance.
Hi, I repeated the run: on run 1 Stockfish needed 1 minute (no joke) to reach 1 MN/s,
OK, in that case, I think I understand what is happening. We allocate the TT with `calloc`, which just reserves the pages (depending on the OS, etc.). When we first write to such a page (which happens during search), that page is actually obtained and zeroed as needed (OS dependent). So basically, one pays during the first search the overhead of paging in and zeroing 128 GB of memory. The slowdown is smaller with a smaller hash, as there is less to zero/page in.
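To illustrate the point (a standalone sketch; the size is arbitrary): `calloc` returns quickly because physical pages are only faulted in and zeroed on first write, and an explicit `memset` forces that work to happen up front.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

int main() {
    const std::size_t bytes = 1ULL << 30;  // 1 GB for the demo
    // Fast: typically only reserves address space, no physical pages yet.
    char* table = static_cast<char*>(std::calloc(bytes, 1));
    if (!table)
        return 1;
    // Slow: faults in and touches every page; this is the cost that would
    // otherwise be paid during the first search.
    std::memset(table, 0, bytes);
    std::free(table);
    return 0;
}
```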
That seems like an explanation :) Could we pre-zero the hash, to avoid doing that during game play / analysis? |
That would be easy, but can we first do another experiment? What happens if, in the above sequence of commands, you issue a ucinewgame just after setting the hash size (and repeat the two searches)? What I would like to see is whether the MN/s of the second (fast) search is influenced by doing the zeroing in search or before search. (There could be some differences in where the pages are allocated in a NUMA environment, depending on which thread touches a page first.) I don't think game play will be influenced, as at least cutechess will do a ucinewgame before any game.
IMHO touching all allocated pages before the game clock starts to run is a good idea anyhow: it makes sure the time on the clock is spent productively instead of waiting on the operating system, and it makes identical behaviour on different operating systems and machine architectures more likely.
@vondele I would say, no big change for the second run. The first run got a massive speedup, but still a bit slower than the second run. |
Here is a better argument for not relying on the ucinewgame command; sorry if I wasn't clear enough the first time. From the UCI protocol (http://wbec-ridderkerk.nl/html/UCIProtocol.html):
(emphasis mine)
@syzygy1

```cpp
#include <intrin.h>  // MSVC intrinsics: _umul128

TTEntry* first_entry(const Key key) const {
#ifdef IS_64BIT
  // MSVC has no __int128; _umul128 yields the high 64 bits of key * clusterCount.
  uint64_t highProduct;
  _umul128(key, clusterCount, &highProduct);
  return &table[highProduct].entry[0];
#else
  // 32-bit fallback: scale the high 32 key bits to the cluster count.
  return &table[((key >> 32) * uint64_t(clusterCount)) >> 32].entry[0];
#endif
}
```
as discussed in issue official-stockfish#1349, the way pages are allocated with calloc might imply some overhead on first write. This overhead can be large and slow down the first search after a TT resize significantly, especially for large TT. Using an explicit clear of the TT on resize fixes this problem. Not implemented, but possibly useful for large TT, is to do this zero-ing using all search threads. Not only would this be faster, it could also lead to a more favorable memory allocation on numa systems with a first touch policy. No functional change.
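The multi-threaded zeroing described above as "not implemented" could look roughly like this (a sketch of the idea, not actual Stockfish code; the function name and chunking are mine):

```cpp
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Split the table into one chunk per thread; each thread first-touches "its"
// pages, which is faster and can also place pages near the touching thread
// on NUMA systems with a first-touch policy. Assumes threadCount >= 1.
void clear_parallel(void* table, std::size_t bytes, std::size_t threadCount) {
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < threadCount; ++i)
        threads.emplace_back([=]() {
            std::size_t chunk = bytes / threadCount;
            std::size_t begin = i * chunk;
            std::size_t len   = (i + 1 == threadCount) ? bytes - begin : chunk;
            std::memset(static_cast<char*>(table) + begin, 0, len);
        });
    for (auto& t : threads)
        t.join();
}
```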
@mstembera
@CoffeeOne
The following passed a [-3, 1]:
Let's ask @mcostalba for his opinion.
If it gets committed, a simplification to one of the approaches that failed seems likely to pass :-)
I guess the above code is not so good on the following two points.
So highProduct should be extracted from bits 0..47 of the key, not from the higher bits of the key, e.g.:
- 128 GB TT size limitation: official-stockfish/Stockfish#1349
- The fixes proposed in this thread are not good in several respects.
- cf. "A lackluster way to remove the transposition table's 128 GB limit": http://yaneuraou.yaneu.com/2018/05/03/%E7%BD%AE%E6%8F%9B%E8%A1%A8%E3%81%AE128gb%E5%88%B6%E9%99%90%E3%82%92%E5%8F%96%E3%81%A3%E6%89%95%E3%81%86%E5%86%B4%E3%81%88%E3%81%AA%E3%81%84%E6%96%B9%E6%B3%95/
- Improved the code that computes the hash key for singular extensions.
@yaneurao
If and when large 3D XPoint memory modules become affordable, we may need this. There are 960 GB modules available now, but they are not cheap. They are slower than RAM, but they can be rewritten large numbers of times, are much faster than flash SSDs, and are designed to be able to function as RAM. https://www.newegg.com/Product/Product.aspx?Item=9SIA6ZP7E00339&cm_re=optane-_-20-167-458-_-Product As it is slower than RAM, it is hard to say whether this would really be useful for something like postal (correspondence) chess or not.
I'm closing this for now. To be reopened when the first complaint about the limitation is received. |
Something like this should be able to extend the hash size with minimal impact: https://github.com/skiminki/Stockfish/tree/more-memory With this patch, the bench id is preserved. I don't have access to any box with >128 GB memory to test, though. |
@skiminki I'll reopen this issue, since >128 GB RAM is becoming increasingly common. However, you would need to test your patch on fishtest for non-regression against current master with normal hash sizes.
This wasn't actually meant as a PR; it was more of a concept. The idea is simple: we have unused bits in the Zobrist hash key. By grouping clusters into bigger units (= super clusters) with a power-of-two size, we can easily use some of the currently unused key bits to select the cluster within the super cluster. In fact, the simplest approach would have been to:
But the simplest approach changes the hash indexes for hash sizes <= 128 GB, and thus the bench id. Anyway, perhaps I'll submit some fishtest runs after my fishtest account has been activated, to see whether either of the approaches has a significant regression. Hopefully this results in a PR.

Further, we could just as well have made the super cluster size 1 or 2 MB, adding 15/16 bits to the max hash size instead of only 8 bits. However, 32 TB should be enough for anyone™ for now, at least, and I would also like to investigate whether the remaining 8 bits of the Zobrist key can be used to reduce the number of bad hash fetches by adding them to the per-slot hash16. There should be 7-8 bits per slot that can be relatively easily squeezed into the current cluster layout (5 bits in the unused cluster bytes, and 2-3 bits in move16), but the perf impact remains to be seen.

Just to continue the bad-fetches theme a bit. When the hash is full, 0.0046% of hash fetches actually successfully fetch data from an incorrect position (probability 3 / 2^16). This assumes that the stores and fetches are unrelated. But anyway, my local test runs verifying the fetches against fully stored positions give bad-fetch rates around that ballpark. (My local test runs are on Ethereal, though, due to its simpler code base.) 0.0046% may not sound like a big probability, but at 100 Mnodes/s (or even at just 1 Mnps) those bad fetches become quite frequent. Further, these bad fetches actually affect the search and sometimes the chosen move. Even with a move pseudolegality test before using the hash data for anything, that only culls the bad fetches by a factor of 10 or so (depending on the position; number based on a preliminary investigation with Ethereal).

The actual effect of bad fetches on move selection quality is still to be investigated. I am aware of the study that says there's no big effect with such hash collisions. But speeds and hash sizes have increased quite a lot since that study, so I think this deserves another look. TL;DR: that's the backstory of reserving 8 bits of the Zobrist hash key.
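For concreteness, my reconstruction of the simplest form of the scheme (not the code in the branch; `superClusterCount` and the 8-bit slice are my naming of what the thread describes):

```cpp
// Scale 32 key bits to a super-cluster index via the high half of the
// product, then use 8 otherwise-unused key bits to pick the cluster inside
// the super cluster (256 clusters per super cluster). Note this changes the
// index mapping even for hashes <= 128 GB, hence a changed bench id.
TTEntry* first_entry(const Key key) const {
    uint64_t superCluster = (uint32_t(key) * uint64_t(superClusterCount)) >> 32;
    uint64_t inSuper      = (key >> 32) & 0xFF;  // 8 unused key bits
    return &table[superCluster * 256 + inSuper].entry[0];
}
```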
I ended up adding 16 bits to the computation anyway, even though the hash size increases only by 8 bits. The extra 8 bits should help battle the quantization error for non-power-of-2 hash sizes. SF currently has this quantization problem, mostly visible with hash sizes between 64 GB and 128 GB. To illustrate, for hash size = 96 GB = 1.5 * 2^31 clusters:
It's easy to see that the keys don't distribute evenly across the cluster indexes. Essentially, this quantization error decreases hash effectiveness a bit, as some clusters receive double the number of keys compared to the rest. With 8 more bits, some clusters get 341 keys and others 342, which is a much more even distribution. (Or something like that. Didn't calculate precisely.)
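For the record, the precise numbers (my arithmetic, confirming the estimate above): with 32 key bits there are 2^32 key prefixes mapped onto 1.5 * 2^31 clusters, i.e. 2^32 / (1.5 * 2^31) = 4/3 ≈ 1.33 prefixes per cluster, so rounding forces some clusters to receive 1 prefix and others 2 — a 2x imbalance. With 40 bits, 2^40 / (1.5 * 2^31) = 512 / 1.5 ≈ 341.33, so clusters receive 341 or 342 prefixes, an imbalance of only about 0.3%.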
I missed a final reduction yesterday in that fixed-point 32.16 calculation, resulting in intermediate precision loss. To illustrate the better distribution, we now get the following cluster selection for a 96 GB hash size:
The above print-out skips most hash keys, so the distribution is much more even in reality. This print-out is produced with the current patch. Compare this with current upstream, where indexes 0 and 3 get twice as many keys as indexes 1, 2, and 4.
Anyway, the patch passed the regression test on Fishtest yesterday, but that version didn't fix the quantization error properly. I submitted another run for the updated patch. Since the new version basically just uses a slightly different ordering of shifts and different shift values, I'd expect the updated patch to do similarly well. But let's see.
Conceptually group hash clusters into super clusters of 256 clusters. This scheme allows us to use hash sizes up to 32 TB (= 2^32 super clusters = 2^40 clusters). Use 48 bits of the Zobrist key to choose the cluster index. We use 8 extra bits to mitigate the quantization error for very large hashes when scaling the hash key to cluster index. The hash index computation is organized to be compatible with the existing scheme for power-of-two hash sizes up to 128 GB. Fixes official-stockfish#1349 closes official-stockfish#2722 Passed non-regression STC: LLR: 2.93 (-2.94,2.94) {-1.50,0.50} Total: 37976 W: 7336 L: 7211 D: 23429 Ptnml(0-2): 578, 4295, 9149, 4356, 610 https://tests.stockfishchess.org/tests/view/5edcbaaef29b40b0fc95abc5 No functional change.
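As a reader's aid, the index computation described in the commit message (Idx = (K48 * SCC) / 2^40, with SCC the super-cluster count) can be sketched as follows; this is a reconstruction from that description, so consult tt.h in the merged commit for the authoritative code:

```cpp
// Fixed-point scaling of 48 key bits onto up to 2^32 super clusters of 256
// clusters each. The sum never wraps: firstTerm + secondTerm < 2^32 * SCC,
// and the shifted result is always < SCC * 2^8 = clusterCount. For
// power-of-two sizes <= 128 GB this reduces to the old mapping.
TTEntry* first_entry(const Key key) const {
    uint64_t firstTerm  = uint32_t(key) * uint64_t(superClusterCount);
    uint64_t secondTerm = (uint16_t(key >> 32) * uint64_t(superClusterCount)) >> 16;
    return &table[(firstTerm + secondTerm) >> 24].entry[0];
}
```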
I am opening this issue to discuss whether there is a need to lift the current restriction of the maximum size of the transposition table to 128 GB. The reason for the restriction is that the current implementation of first_entry() in tt.h does not allow the TT to have more than 2^32 clusters (of 32 bytes each):
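The snippet referred to did not survive this copy of the thread; the mapping at the time was essentially the following (my reconstruction, consistent with the overflow remark below):

```cpp
// Scale 32 key bits to [0, clusterCount) via the high half of a
// 32x32 -> 64-bit product. Once clusterCount needs more than 32 bits,
// the product no longer fits in 64 bits.
TTEntry* first_entry(const Key key) const {
    return &table[(uint32_t(key) * uint64_t(clusterCount)) >> 32].entry[0];
}
```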
If clusterCount has more than 32 bits, the multiplication overflows.
The restriction can be lifted by making use of the fact that the user can only specify TT sizes that are a multiple of 1 MB. 1 MB corresponds to 2^15 clusters, which means that clusterCount is always a multiple of 2^15. This allows us to increase the max TT size to up to 16 TB (2^39 clusters). I have tested this approach with a max size of 1 TB here (big_hash3):
http://tests.stockfishchess.org/tests/view/5a425d700ebc590ccbb8c19e
Unfortunately it failed.
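To illustrate the multiple-of-2^15 idea (my reconstruction, not the actual big_hash3 code; see the linked test for the real patch): because clusterCount = blockCount * 2^15, the scaling multiplication only needs to cover the 1 MB blocks, and 15 further key bits can address the clusters within a block.

```cpp
// 1 MB = 2^15 clusters of 32 bytes each, so clusterCount = blockCount * 2^15.
// blockCount <= 2^24 for a 16 TB table, so the product fits in 64 bits.
TTEntry* first_entry(const Key key) const {
    uint64_t blockCount = clusterCount >> 15;
    uint64_t block      = (uint32_t(key) * blockCount) >> 32;  // 0 .. blockCount-1
    uint64_t inBlock    = (key >> 32) & 0x7FFF;                // 15 more key bits
    return &table[(block << 15) | inBlock].entry[0];
}
```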
Another approach is to let the multiplication overflow and use the 64 highest bits of the result as an index into the TT:
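Presumably something along these lines (the snippet itself was lost in this copy; `unsigned __int128` is a GCC/Clang extension, not standard C++):

```cpp
TTEntry* first_entry(const Key key) const {
    // Full 64x64 -> 128-bit product; the top 64 bits form the cluster index.
    return &table[((unsigned __int128)key * clusterCount) >> 64].entry[0];
}
```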
This might compile to slightly more efficient code than the big_hash3 approach. Unfortunately this gives a warning with gcc -pedantic, because __int128 is not standard C++, and it seems to require different code for MSVC (UnsignedMultiplyHigh()) (and also for 32-bit architectures).
Because this will uglify rather than simplify the code, I do not know if it is worth testing it. At some point in the future it will be important to lift the restriction to 128 GB, but is it already sufficiently important now?