New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Look for symmetrical position in cache. #1275
Conversation
How does the performance look? Edit: To explain, I'm very skeptical about code that can only be useful for at most a few moves per game. But if this is basically free, it's worth considering. |
This gives a speed boost for the first three moves or so and any performance hit afterwards is too insignificant for me to detect. |
The other PR broke this, but I'll need a bit to review this anyway, wrt calc_hash changes, performance impact and the need for the rotation table (which relates to performance impact again). |
@TFiFiE can you do a me a favor and rebase this? |
c91fffb
to
39911dd
Compare
src/FastBoard.h
Outdated
@@ -125,6 +125,8 @@ class FastBoard { | |||
int m_boardsize; | |||
int m_squaresize; | |||
|
|||
std::array<std::array<unsigned short, MAXSQ>, 8> m_symmetry_idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much of a pain would it be to make it so there is only one of these tables? Or is there a reason why it makes sense to have one for each object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's tied to the board size, which is variable for this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. Can totally see it now, thanks.
I had it stuck in my head as being stored in an X by Y array which would be the same for all sizes but no, it's a vertex array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can live with an extra indirection you can make it a static class owned vector (indexed by board size) and just check if that is initialized the moment you initialize FastBoard.
Alternatively, do the calculation on the fly...if necessary limit the symmetry check to the first 10-20 moves or so?
That's some of the things I want to look at.
src/Network.cpp
Outdated
@@ -868,8 +870,19 @@ Network::Netresult Network::get_scored_moves( | |||
|
|||
if (!skip_cache) { | |||
// See if we already have this in the cache. | |||
if (NNCache::get_NNCache().lookup(state->board.get_hash(), result)) { | |||
return result; | |||
for (auto symmetry = 0; symmetry < 8; ++symmetry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name clash now that rotation
parameter has been renamed. Sorry about that..
fbdb8e3
to
c1bf93f
Compare
I have been running this in my custom LZ build (essentially LZ_next + roy7's LCB_max_root) for a week now with Lizzie, testing it over the course of several dozen real-life 9dan vs. LZ games. I have noticed no downsides or performance decreases with this PR. For the first few moves when symmetrical positions are extremely likely, it indeed massively boosts reported visit counts for valid symmetrical board root moves as expected. Although I cannot personally confirm one way or the other whether this PR's implementation has or does not have any bugs in aggregating symmetrical move scores or whether it truly results in better move selection, I can say half empirically and half anecdotally that: 1) I have noticed no errors, misplays, or bizarrities from enabling this PR (which presumably is in effect for the entire game?). 2) I have noticed no decrease in n/sec overall, although on this point I have only conducted the bare minimum of performance testing and comparison, and didn't keep records of my limited testing results. If there are no identifiable problems in how this is implemented, this PR could be enabled soon for selfplay and provide a mild boost to early-game search strength. Though do note that -v 3200 might interfere with the greatly boosted visit counts from this PR—and so perhaps a special allowance of extra visits for early game symmetrical positions should be given, or else a "true visits" value should be tallied separately from the boosted "true + symmetrically_identical_position_visits" search results for the purposes of correctly enforcing a full search with 3200 real visits. Otherwise, using the visit counts directly from this PR might result in early-game searches erroneously ending search calculations hundreds or thousands of visits earlier than the intended 3200v mark. This is a simple issue to address in any case, but I cannot fully interpret the code above on my own so it's even possible this effect has already been accounted for, in which case disregard. |
This doesn't interfere with visit counting AFAIK, just makes it faster. I basically only want to see if we can get the extra array out of FastBoard, and if getting rid of it would ( to keep performance parity) require limiting this to the first moves of the game. |
Why would you want such a change if only one |
I've only conducted rather shallow testing regarding performance, just enough to satisfy me to continue running it. I recall noticing zero performance degradation from my one or two haphazard tests from a week or so ago. I can't quantify or shed light on performance impact (or lack there of) any further than that. That said, it's pretty clear the benefit of collating symmetrical moves is nearly guaranteed to be negligible beyond the first handful number of moves. I'd be happy with a max of 10 moves, just judging by gut feelings. It probably goes without saying, but if we end up limiting this, then further considerations might need to be made for symmetrical play involving pre-placed handicap stones depending on how LZ's internal move count interacts with the set move limit. |
Because it might be quite useful to construct multiple, for example if asked to list dead stones. |
I guess this should be disabled during training? Else we can't take advantage of rotational asymmetry in the opening. |
Based on pull request #1275, but without keeping the rotation array in every board instance.
I reworked this a bit in #1421. |
I can't measure a performance difference, and the other pull addresses the self-play games, so going to close this. |
@gcp: Can you reopen this? |
Based on pull request #1275, but without keeping the rotation array in every board instance.
* Look for symmetrical position in cache. * Disable NNCache symmetry in self-play. To increase randomness from rotational assymetry. * Only check symmetry in opening. Refactor TimeControl. Only check for symmetries in the NNCache when we are in the opening (fast moving zone). Refactor TimeControl to take the boardsize out. * Change bench to assymetric position. Avoids rotation symmetry speedups, they are not typical. * Rename rotation to symmetry, limit to early opening. Be consistent and don't call symmetries rotations. Limit the symmetry lookups to until halfway the opening (which is the first 30 moves on 19 x 19). Based on pull request #1275, but without keeping the rotation array in every board instance. Pull request #1421.
* Look for symmetrical position in cache. * Disable NNCache symmetry in self-play. To increase randomness from rotational assymetry. * Only check symmetry in opening. Refactor TimeControl. Only check for symmetries in the NNCache when we are in the opening (fast moving zone). Refactor TimeControl to take the boardsize out. * Change bench to assymetric position. Avoids rotation symmetry speedups, they are not typical. * Rename rotation to symmetry, limit to early opening. Be consistent and don't call symmetries rotations. Limit the symmetry lookups to until halfway the opening (which is the first 30 moves on 19 x 19). Based on pull request leela-zero#1275, but without keeping the rotation array in every board instance. Pull request leela-zero#1421.
* Add multi GPU training support. Pull request leela-zero#1386. * Extend GTP to support real time search info. * Extend GTP to add support for displaying winrates and variations from LZ while LZ is thinking. * Use UCI format for lz-analyze and lz-genmove-analyze. * Don't sort gtp lz-analyze ouput because it is not thread-safe. Pull request leela-zero#1388. * Remove virtual loss from eval for live stats. For discussion see pull request leela-zero#1412. * Make analysis output use one move per line. More in line with UCI, cleaner, easier to parse, smaller code. * Remove versioned clang from Makefile. Don't hardcode the clang version in the Makefile. * Fix varargs usage. Regression from leela-zero#1388. Fixes issue leela-zero#1424. * AutoGTP: send leelaz version to server. Send leelaz version embedded in the URL used to ask for a new job. Pull request leela-zero#1430. * Multi GPU: fix split and variable placement. * Fix split in net_to_model. * Add soft placement of variables. * Fixes Windows issues. Pull request leela-zero#1443. * Mutex optimization. * Updated Mutex implementation to use TTS instead of TS. * Explicitly relax memory order (no behavior change, it's the default) and attempt TS before TTS loop. (improves performance in low contention locks) Pull request leela-zero#1432. * Update leela-zero.vcxproj for VS2015. Pull request leela-zero#1439. * Add order to analysis data. See discussion in issue leela-zero#1425. Pull request leela-zero#1478. * Fix misleading comments & naming. The Alpha (Go) Zero outputs use TanH nonlinearities, not sigmoids. The code comments and variable naming refer to an earlier version that used sigmoids and that is confusing people. See issue leela-zero#1484. * Add Lizzie and LeelaSabaki to README. Pull request leela-zero#1513. * Make Debian package with CMake. * Create debian package by cpack We can create debian leelaz package by "make package" by cpack. * Find leelaz if ./leelaz is not existed If leelaz is installed at /usr/bin, then autogtp should find it by leelaz instead of ./leelaz. * Generate package dependency list Use dpkg-shlibdeps to generate better package dependency list * Use git tags as version strings Pull request leela-zero#1445. * Look for symmetry on NNCache lookup. * Look for symmetrical position in cache. * Disable NNCache symmetry in self-play. To increase randomness from rotational assymetry. * Only check symmetry in opening. Refactor TimeControl. Only check for symmetries in the NNCache when we are in the opening (fast moving zone). Refactor TimeControl to take the boardsize out. * Change bench to assymetric position. Avoids rotation symmetry speedups, they are not typical. * Rename rotation to symmetry, limit to early opening. Be consistent and don't call symmetries rotations. Limit the symmetry lookups to until halfway the opening (which is the first 30 moves on 19 x 19). Based on pull request leela-zero#1275, but without keeping the rotation array in every board instance. Pull request leela-zero#1421. * Symmetry calculation cleanup. Pull request leela-zero#1522. * Non-pruning (simple) time management. See issue leela-zero#1416. Pull request leela-zero#1497. * Clean up some constants. * Remove unused 'BIG' constant. * Capture "N/A" vertex value in constant. Pull request leela-zero#1528. * Duplicate line removal. Pull request leela-zero#1529. * Script for converting minigo weights. Pull request leela-zero#1538. * Update README.md. Added q+Enter instructions. Pull request leela-zero#1542. * Fix Validation checking on Windows. Fix Validation checking if binary exists on Windows. Pull request leela-zero#1544. * Constant for the unchanged symmetry index. Pull request leela-zero#1548. * Update README.md. Update the TODO list. * Removed unused class KeyPress. Pull request leela-zero#1560. * Allow 3 AutoGTP quitting conditions. Pull request leela-zero#1580. * More draw handling. Pull request leela-zero#1577. * Suppress upstream warnings in Makefile. Pull request leela-zero#1605. * Fix TF update operations. The real update operation should be the computation of the gradient rather than the assignment of it. Pull request leela-zero#1614. Fixes issue leela-zero#1502. * Code restructuring: less globals. * Remove thread_local variables for OpenCL subsystem. (this is to allow many different OpenCL implementations to exist concurrently) * OpenCLScheduler: task queue cleanup. * Change static Network methods to instance methods and replace it with global Network instance. * All weights moved from Network.cpp static variables to class Network. * NNCache is now a member variable of Network, not a global. * Network filename now comes from external call, not a global variable. * Removed global g_network object, instead it is member of UCTSearch class. * UCTNode is now a static member variable of GTP. (instead of a static of a function) * Rename ThreadData to OpenCLContext. (it's no longer a thread-specific structure). Pull request leela-zero#1558. * Removed unused types. Pull request leela-zero#1621. * Resurrect GPU autodetection. Fixes issue leela-zero#1632. Pull request leela-zero#1633. * Restrict the use of "score". Using "score" as a nonspecific term (and not when it, for example, refers to the count at the end of game) makes it unnecessarily hard to understand the code and see how it matches with the literature. Pull request leela-zero#1635. * Code restructuring: Create ForwardPipe interface. Code restructuring: Create abstract class ForwardPipe, which represents a class that has a forward() call. * Moved network initialization code to OpenCLScheduler. * Created abstract class ForwardPipe which will be the base interface of all forward() calls. * Moved CPU-based forward() code to class CPUPipe. * Added --cpu-only option. This command line option will run a CPU-only implementation on a OpenCL build. Can be used for testing and running fallback modes rather than switching binaries. Pull request leela-zero#1620. * Coding style consistency cleanups. * Remove use of "new". Prefer make_unique instead. * Give ForwardPipe a virtual destructor. Silence clang warning. Pull request leela-zero#1644. * Replace if-else chain with switch statement. Pull request leela-zero#1638. * Use Winograd F(4x4, 3x3). * Winograd F(4x4, 3x3) for CPU * Winograd F(4x4, 3x3) for OpenCL * OpenCL batching support. Pull request leela-zero#1643. * Increase error budget in tuner. The 256 channel network exceeds 1% error in the tuner, but the network output seems accurate enough during play. Fixes leela-zero#1645. Pull request leela-zero#1647. * Get rid of more "network" globals and pointers. Keep a single "network" global in GTP, owned by a unique_ptr and move things around when needed. Pull request leela-zero#1650. * Runtime selection of fp16/fp32. * OpenCL half precision is now command-line option, support compiled in by default. This converts the OpenCL code into a gigantic template library. * Update Network self-check. - Final output is used for self-check. - Criteria is 20% error, while ignoring values smaller than 1/361. - Throws exception when three out of last ten checks fails. Pull request leela-zero#1649. * Minor code cleanups. Slight style edits of code and comments. * Clean up SGFTree style. Modernize some parts of SGFTree's style. * Remove separate USE_HALF build from CI. This is integrated into the main build now. Pull request leela-zero#1655. * Don't assume alternating colors in SGF. Fix a bug that an SGF file/string cannot contain 2 consecutive moves of the same color. Fixes issue leela-zero#1469. Pull request leela-zero#1654. * Remove separate half precision kernel. Use the preprocessor defines to make a single kernel support both single precision and half precision storage. Pull request leela-zero#1661. * Compress duplicate evaluation code. Pull request leela-zero#1660. * Consistent header guard naming. Pull request leela-zero#1664. * Replace macros with proper constants. Pull request leela-zero#1671. * Implement NN eval fp16/fp32 autodetect. Implemented NN eval fp16/fp32 autodetect. Runs both precisions for 1 seconds, and if fp16 is faster than fp32 by more than 5%, fp16 is used. Removes --use-half, replaces it with --precision [auto|single|half] option, default auto. Pull request leela-zero#1657. * Resign analysis: search for the highest resign threshold. Added resign analysis option to search for the highest resign threshold that should be set. Pull request leela-zero#1606. * Half precision compute support. Use half precision computation on cards that support it. Pull request leela-zero#1672. * Thread scalability improvements. - On OpenCLScheduler, don't use condvars which tends to be slow because of thread sleep/wake. - Instead, use spinlocks and just have enough contexts to avoid sleeping. - Allow more threads than the CPU physically has. This is required in many multi-GPU setups with low core counts (e.g., quad-core non-hyperthread with 2 GPUs) Pull request leela-zero#1669. * Use L2-norm in self check. The previous method is too strict for fp16 compute. Since lower precision of fp16 is still good enough to play at the same strength as fp32 relax the self check. Pull request leela-zero#1698. * OpenCL tuner fixes. * Fix error calculation (Missing batch_size divider). * Better error reporting when no working configuration could be found. * Change reference data to have less rounding errors with half precision. * Replace BLAS reference SGEMM with custom code that gives transposed output like the OpenCL SGEMM. Pull request leela-zero#1710. * Change policy vector to array. Should save a tiny bit of memory. Pull request leela-zero#1716. * Fall back to single precision net on breakage. Fall back to single precision net when half precision is broken, at least when detection mode is auto. Pull request leela-zero#1726. * AutoGTP: use compressed weights networks. Pull request leela-zero#1721. * Fix OpenCL buffer sizes. Some OpenCL buffers were allocated too big. Tested with oclgrind that the new sizes are correct. Pull request leela-zero#1727. * Script for quantizing weights. Use smaller precision to store the weights to decrease the file size. See discussion in issue leela-zero#1733. Pull request leela-zero#1736. * Network initialization restructuring. * Network initialization restructuring - Create one net at a time when doing fp16/fp32 autodetect. Saves some GPU memory. - Create an internal lambda which initializes the nets. - Use std::copy to copy vectors to reduce runtime. * zeropad_U : loop reordering for performance optimization. Plus other optimizations for zero-copying initialization. Pull request leela-zero#1750. * Fix comments, code style. Minor fixes to incorrect comments, and reduce some excessively long lines. * Validation: support GTP commands for each binary. * Changed Validation and Game to support multiple GTP commands at start up but left the Validations options untouched. * Separated engine options (as positional arguments) from match options. Replaced time settings option with ability to specify any GTP commands. * Added --gtp-command options using the existing option parser. Also changed default binary options from -p 1600 to -v 3200. * Each binary argument has to be preceded by "--". * Changes to use Engine Objects. * Exits on failed GTP command. Added printing of GTP commands in gameStart() so users can see what commands are actually sent to each engine. Pull request leela-zero#1652. * Don't refer to stone locations as "squares". * Don't refer to stone locations as "squares". * Use "vertex" for those in the "letterbox" representation. * Otherwise, mostly use "intersection". * Also, capture all possible moves (i.e. including pass) in its own explicit constant. * Clean up network constants. Pull request leela-zero#1723.
As suggested in #1084 (comment).
I took the liberty of changing the benchmark position to one in which this doesn't cause a speedup, because it's not typical for the game as a whole.