use int16_t for histories #1113
Conversation
I have verified bench stays the same 1779683533 up to depth 28. I also get close to a 1% speedup.
Nice catch @vondele - 1% speed-ups are pretty awesome. I will test when I get home, but I'm pretty sure I will see it too. Edit: 0.8% gain on macOS with PGO builds, consistent with benches at various depths from 16 to 24.
@vondele: Regarding the speedup coming from the improved cache effect, I think this will depend on the hash size. I would expect the speedup to be more measurable with a minimal hash size: stockfish bench defaults to hash=16, but with hash=1 it should be more measurable. And in real games with a large hash, the effect should be null (as you constantly have cache misses due to random lookups on a large hash table, which in turn pollutes the cache for everything else). But what is nice here is that we reduce the memory footprint of stockfish. Let's not forget that stockfish has been successfully run on some exotic hardware with limited resources, such as MIPS routers.
@vondele I fail to see the logic of a patch that is NOT a simplification in any way (indeed it is sold as an optimization) being tested with [-3, 1]. I think this should be at least [0, 4]. Moving from int to a specific integer size should be backed by some sound reason; memory alone is not one, because it is only an indirect (and very weak) hint of a possible speed-up. This is a well-known historical myth of pre-testing-era developers, who try to use the smallest integer that fits the bill thinking it will make the engine faster. This is wrong in many ways. It is another case of voodoo programming that persists in the chess engine world to this day (possibly because chess engine programmers are hobbyists for the most part). See these for some longer explanations: "Typically, CPUs are fastest at operating on integers of their native word size"; "A CPU works more efficiently when the data width equals the native CPU register width"; "Most of the time the space cost is negligible and you shouldn't worry about it".
@mcostalba in that test, should test and master use the same resources in terms of memory? The increased memory efficiency of the test branch could be used to increase the TT so that the processes compete at equal process memory footprint. Furthermore, the saving is 2MB per thread, so the test would be best done in a multithreaded run. In that case (since L3 is typically shared between two or more cores) the speedup can be expected to be somewhat larger. Obviously, the speedup is not due to doing arithmetic on smaller integer sizes, as the arithmetic is all done on the default int. Edit: at 32 threads (and without TT), the resident process size of master is 50% more than that of the test branch (210MB vs. 144MB).
@vondele the test should be a usual standard test: if less memory usage speeds up the engine, your patch will show it; if less memory usage has no effect, that will be shown too. We are not testing whether memory is saved (we already know it is); we are testing whether it yields some improvement or not. You never set the TT to the maximum available RAM in real games; you always have some RAM left free, usually hundreds of MB if not GB, so your argument is moot.
So, not too surprisingly for a memory reduction patch, it failed yellow at [0,4]: LLR: -2.94 (-2.94,2.94) [0.00,4.00] Does anybody have an accurate estimate of what % speed-up is needed to pass [0,4] at STC?
@vondele I also agree with @mcostalba that non-simplification patches should never be tested and accepted with [-3,1] bounds. A patch may cure cancer....but won't and shouldn't be accepted if it doesn't prove an elo gain. |
If a patch is found to be Elo-neutral but shows evident benefits in terms of speed, memory footprint or other parameters, then why shouldn't it be accepted?
@anshulongithub If the patch is accepted... we can no longer tweak stat values past 16 bits. Right now the max stat value barely fits into 16 bits, so tweaking the stats a bit higher will cause an overflow. Now what if down the road we found a significant Elo gain by tweaking the max stat value above 2^19? Every time I wanted to submit a patch with stats over 16 bits, I would have to convert all the int16_t back to regular int. This will demotivate me from doing any patches for higher stat values. I am not a fan of patches that make it more difficult to tweak and test ideas...
@anshulongithub Any non-simplification patch is more complex. Guidelines exist to preserve order. |
This is a non-functional speed optimization, so according to the guidelines in the relevant section of https://github.com/glinscott/fishtest/wiki/Creating-my-first-test : "Benchmark it with psybench or FishBench, then send a pull request." The speed-up is verified "to ensure:
There seems to be some evidence of a speed up. Is it confirmed on different machines? The patch doesn't seem invasive and/or ugly to me. |
Regarding VoyagerOne's comment about patches that make it "more difficult to tweak and test ideas": I've been working on sorting tweaks that depend on the stats fitting in an int16 (since it seemed to me that they could, which this patch demonstrates). So while this patch may make some tweaks more difficult, it makes other tweaks possible. |
This patch is not a simplification but it is non functional and doesn't add complexity either. Reducing memory usage by 2MB per thread is a benefit by itself (probably even more so on platforms w/ many cores and small caches like 8 core cell phones.) The speedup is measurable w/ statistical significance and the yellow fishtest result is in line with it. IMO this patch has upsides w/o downsides and so is an improvement to the codebase and should be committed. @VoyagerOne I'm not for making things more difficult to tweak but keeping extra baggage around w/ just the hope it may be useful someday is a bad policy IMO. Also IF in the future some possible tweak required going over 2^16 wouldn't it be better to just rescale everything down to keep it more efficient at 16 bits anyway? |
I don't believe in measuring speed-ups with benches: this is totally unreliable as long as the speed-up is not very clear (and in that case it will easily pass [0, 4] anyway), and it is heavily machine/compiler dependent. IMO [0, 4] is very sensible and can spot even 1-1.5% speed-ups. To verify this (which alone would be very interesting) maybe we can think of a patch that slows the engine down by a tunable amount, and check up to which slowdown SPRT[0, 4] is able to discriminate between the two versions. Regarding this patch, I agree with Voyager: it is not a simplification, it is NOT proved to be a speed-up on average (indeed it failed [0, 4], and I am quite confident that 1% speed-ups are caught by fishtest), and it makes the code worse than before; this is the main point for me.
I'll be reporting some benchmarking on 32 threads (quad socket) and 64 threads (xeon phi) later today. I disagree with @mcostalba 's belief that fishtest would be more reliable than benchmarks in providing evidence for speedups in ~1% range. |
@vondele this should be tested of course, but I am confident SPRT[0, 4] is able to spot a 1% speed-up on average across a broader set of machines and OSes, not only in the tester's setup.
@mcostalba testing this is indeed a good idea. I'll work on getting a tuneable slowdown in the code, and do these tests in the framework. |
@vondele already done, see mcostalba/Stockfish@slowdown and here is the test, just 5% as a first run to verify everything is ok. |
I don't think the test on fishtest is right. Obviously a 5% slowdown will fail to pass [0,4]. You might want to swap master and test, or adjust the bounds?
@vondele this is just a first check to see if the code has bugs; after this I will lower the slowdown until it is no longer detectable by [0, 4].
To detect a slowdown you still need to swap master and test, or use different bounds. All (positive) values of slowdown should fail a [0,4] test.
@vondele you are right. Test stopped and re-submitted with branches swapped. |
@mcostalba I realised your branch is also a good way to answer the other question, namely the reliability of benchmarking. This is the estimated speedup based on '40 test + 40 master ./stockfish bench' invocations, i.e. my standard script, which takes 5 min on 1 core of a low-end home PC:
so all within 0.3 error based on 40 runs only. I have also added the 1% slowdown run to the fishtest framework. |
Finally benchmark results for the quad socket and the xeon phi (both used 32 threads). To improve statistics (needed because of the non-determinism in the multithreaded runs), I ran a depth 20 bench, and repeated them a few hundred times. Results are consistent with other benchmarks posted.
OK, it seems testing with [0, 4] at STC is a reliable way to measure speed-ups of 1% and above: http://tests.stockfishchess.org/tests/view/591701360ebc59035df345e0 Maybe we can even go lower than 1%, but IMO it is not necessary: if a speed-up patch adds complexity and/or makes the code worse and more fragile than the original (as in this case), I think we need at least a 1% speed-up for it to be considered worth it. If the patch is very invasive / ugly, then even 1% may not be enough, but that is not the case here; this patch is not extremely ugly, just mildly. MY SUMMARY: This patch introduces a restriction on types, from int to int16_t, that has the following drawbacks:
P.S: As an added bonus we now know that SPRT[0, 4] at STC can be used to discriminate speed optimizations of at least 1% (and perhaps even lower). |
@mcostalba The results of the tests with a 1% slowdown imply an elo gain of about 4 - 7. These very clear test results look somewhat suspicious, imho. |
@joergoster also I was surprised how easily they passed, so far down to 0.3%. Full table:
SPRT is the most powerful possible test (Neyman-Pearson lemma) so detecting 4-7 elo difference is not that surprising, especially at STC. |
@mcostalba I propose to repeat the test with a 1% slowdown at LTC. The UCI protocol, and therefore cutechess-cli, can't provide the accuracy needed at STC, because times are always sent in milliseconds. So even if SF spent only 500 microseconds more, cutechess will eventually send 1 millisecond less with the next 'go' command. This adds up during the game, possibly magnifying the intended slowdown.
Windows 10 gcc version 6.3.0 (Rev3, Built by MSYS2 project)
gcc version 7.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
Windows 10 WSL gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
testing as @joergoster gcc version 6.3.0 (Rev3, Built by MSYS2 project)
gcc version 7.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
Windows 10 WSL
Thanks for testing! OK, the first test passed SPRT[0, 4] as expected; now I have submitted this to validate the slowdown code. The test should fail, to verify that the slowdown code is reliable and does not introduce odd overheads like in previous versions.
@mcostalba no problems on Windows 10 benching 5ae31a854facbb1a414134065647e55a90732649 vs. ecd3218 gcc version 6.3.0 (Rev3, Built by MSYS2 project)
@mcostalba i7 3770k Windows 10 benches for Check resolution with slowdown below 1% (0564ecd64912d23349af8fedef5b2623c21c0342 vs. ecd3218) gcc version 6.3.0 (Rev3, Built by MSYS2 project)
gcc version 7.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
WSL - gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
...and the test passed! :-) The (improved) slowdown branch above, on which the test is based, should be less than a 1% slowdown against master. Please confirm it is less than 1% for you as well, and then we can finally prove that SPRT[0, 4] is able to detect below-1% speed-ups!!!
@mcostalba Windows 10 benches in previous post, speed-ups: 1.05% - 1.17% - 1.24% |
This is what I get.
And the bench comparison:
@ppigazzini @joergoster Thanks for testing and confirming the code works as expected! So it is confirmed that the slowdown in long searches is about 1%; indeed in real games it is even lower. For instance, with just one second remaining on the clock (a typical STC endgame), the slowdown disappears:
So at STC the slowdown is mainly in the opening / midgame, but this is already enough to measure a sensible difference. I am surprised by how well SPRT[0, 4] at STC is able to resolve even very small differences in speed in a reliable way. I guess the key ingredient here is STC, which yields games played at relatively shallow search depth, where even a bit more depth has an impact on overall engine strength. Our fishtest framework is really amazing :-)
@mcostalba what about testing a 0.5% slowdown? Anyway, check here how well fishtest works despite relaxed bounds, not many CPU contributors and (I suppose) some CPUs that fishcooking people would consider too noisy :) Obviously these are easier tasks.
@ppigazzini I think we are already at the limit of resolution of the slowdown code (maybe not of fishtest). For instance, the average slowdown over a game is already below 1% because, as I said, when 1 second remains on the clock there is no slowdown. Moving to 0.5% would mean the very small slowdown is effective only in the opening and early midgame. Moreover, bench results show that the speed-up uncertainty is already about 0.5%, so setting the slowdown to something lower than 1% risks measuring just noise. As it stands, this is good enough to prove that SPRT[0, 4] at STC can discriminate optimizations of at least 1%, or even a bit lower, which is what we wanted to prove.
Saves over 2MB of memory, as in particular CounterMoveHistoryStats is a large object. Improved cache utilization yields a small speedup. No functional change.
Nothing was "proved" here. At the very least the test would have to be repeated a number of times to get a real experimental idea of how sensitive an SPRT(0,4) is to 1% speedups. A much better approach is to first determine the ratio elo/speedup. Under the reasonable assumption that the elo<->speedup relation is linear for small elo differences, this can be done for a much larger speedup (say 5% or even 10%) using a fixed-length test. (Trying instead to measure directly the elo corresponding to a 1% speedup would require an enormous number of games to get a reliable result.) Once the ratio elo/speedup is known, the pass probability for a fixed speedup can simply be computed mathematically.
@mcostalba bench process has some juice left :)
So, final version of this patch rebased on master. Certainly won't pass [0,4], so if as a result of the discussion in this thread it is decided that [0,4] is the standard for future speedup patches, feel free to close. |
@mcostalba fishtest is both a theoretical framework, for which aspect @vdbergh's suggestion to measure/verify elo/speedup and the pass probability is correct, and a practical framework, in which the contribution of the workers was never tested/measured. There are some unanswered questions (and some complaint threads) about: disabling HT, n° cores -1, using virtual cores, cloud instances, CPU load from other processes ... The instructions and FAQ suggest using only strictly controlled workers, but the worker set today is a big fruit salad :) IMO it is important to check whether the practical framework works according to the theoretical framework, and the speedup test at STC is an easy way to perform that check. I'm confident that the big fruit salad is working fine :)
Time to make a decision on this one:
I'll take 'auto&' changes from this patch, but the data type shall remain as int. |
fine. |
No functional change. Closes official-stockfish#1113
Saves over 2MB of memory, as in particular CounterMoveHistoryStats is a large object. Improved cache utilization yields a small speedup (0.7% locally).
The patch is non-functional as int16_t is sufficiently large to capture all values that can be reached with the current update scheme.
The accessible range of values is [-32 * D, 32 * D] as a closer analysis of the update formula reveals.
Two functionally equivalent, but less clean, versions of this patch have been tested for no-regression:
http://tests.stockfishchess.org/tests/view/591353890ebc59035df344e4
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 54476 W: 9949 L: 9886 D: 34641
http://tests.stockfishchess.org/tests/view/59140f6b0ebc59035df3451a
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 27019 W: 4905 L: 4794 D: 17320
No functional change.