Runtime selection of fp16/fp32 #1649

Merged
merged 6 commits into leela-zero:next on Jul 25, 2018

Conversation

ihavnoid
Member

This adds --use-half as a command-line option, so users can select between fp16 and fp32 at runtime.

This converts the OpenCL code into a gigantic template library.
src/Leela.cpp Outdated
@@ -90,6 +90,9 @@ static void parse_commandline(int argc, char *argv[]) {
"ID of the OpenCL device(s) to use (disables autodetection).")
("full-tuner", "Try harder to find an optimal OpenCL tuning.")
("tune-only", "Tune OpenCL only and then exit.")
#ifdef USE_HALF
("use-half", "Use half-precision OpenCL code. Traades off some accuracy for higher performance")
Member

Typo.

src/Network.cpp Outdated
if (cfg_use_half) {
myprintf("Initializing OpenCL (half precision).\n");
m_forward = std::make_unique<OpenCLScheduler<half_float::half>>();
use_selfcheck = false;
Member

How badly would the self-check need to be relaxed for HALF to pass?

Member
@gcp, Jul 24, 2018

The reason I'm asking is that the comment for USE_HALF says "please test before enabling it".

And the user is going to wonder: test what? 😁

Member Author

I did a bit of experimenting and ran the whole thing overnight (self-checking everything), and I recall that it eventually failed even with 100% margins. In most cases the error was less than 1%, so it was okay.

I even moved the self-check to the final results (from the output of the forward() call) and it still failed with 100% margins. It didn't seem to yield anything too problematic (e.g., the policy net moving a probability from 0.05 to 0.1), which is something the tree search can fix anyway.

I guess if we still want the USE_HALF self-check, we have to do something like 'N cases with an average error of XXX%', but I am not sure what the right value/way to do this is.

Member

The self-check has a catch that "rounds" all small values to zero (because the relative error gets very big on very small values). Maybe it's just a matter of pulling that threshold up slightly.

Member

I think Leela Chess Zero has a probabilistic self-check, in that it tolerates the occasional failure. Not sure how that combines with us already doing the self-check only once every xxx nodes, though. At some point you're not going to catch buggy drivers any more either.

Member Author

Technically what we should test... is the strength of the engine (which one yields a better win rate - speed vs. NN accuracy), and that can differ quite a bit depending on which GPU the user has. The last time I tested, I found that high-bandwidth GPUs (e.g., a Tesla P100 on Google Cloud) don't yield much additional performance, hence we were sacrificing accuracy for nothing. In those cases --use-half would be worthless.

Maybe all I can say for now is to delete the comment? :)

Member

One option would be to use KL divergence or some other measure to calculate the error in the self-check. KL divergence doesn't care if the low probabilities are a little off, unlike the current method.
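For illustration, a minimal sketch of what such a KL-divergence check could look like (the function names, the epsilon guard, and the 0.01 threshold are hypothetical, not leela-zero code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical KL-divergence self-check: compare the fp32 reference
// policy p against the fp16 output q. Low probabilities that are a
// little off contribute almost nothing, unlike a per-element
// relative-error check.
double kl_divergence(const std::vector<double>& p,
                     const std::vector<double>& q) {
    constexpr double eps = 1e-12;  // guard against log(0)
    double kl = 0.0;
    for (std::size_t i = 0; i < p.size(); i++) {
        kl += p[i] * std::log((p[i] + eps) / (q[i] + eps));
    }
    return kl;
}

// Assumed pass rule: the divergence stays under a tuned threshold.
bool selfcheck_passes(const std::vector<double>& p,
                      const std::vector<double>& q,
                      double threshold = 0.01) {
    return kl_divergence(p, q) < threshold;
}
```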

Member Author

Okay, what I will do is:

  • If it fails 2 out of the 5 most recent checks, it will assert-fail.
  • Regarding the error margin... I will test a bit to see what is acceptable.

@gcp
Member

gcp commented Jul 24, 2018

The travis setup has a mode where it compiles leela-zero:gpu-half, so that likely needs some changes. Or maybe it can just be removed outright.

 - Final output is used for self-check
 - Criteria is 20% error, while ignoring values smaller than 1/361
 - Throws exception when three out of the last ten checks fail
@ihavnoid
Member Author

Added an update to the self-check. Barfs when:

  • 3 out of the last 10 checks failed
  • failure criteria is >20% error on anything higher than 1/361.0
  • checks are performed on the final net output instead of the output of the forward() call
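The per-element rule described above could be sketched like this (illustrative names and structure, not the actual Network.cpp code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch of the criteria above: an element only counts as a failure
// if it is above 1/361 and its relative error versus the fp32
// reference exceeds 20%.
bool selfcheck_element_ok(float ref, float test) {
    constexpr float min_value = 1.0f / 361.0f;  // ignore small policy values
    if (std::max(std::abs(ref), std::abs(test)) < min_value) {
        return true;  // too small to judge by relative error
    }
    const float rel_error = std::abs(ref - test) / std::abs(ref);
    return rel_error <= 0.2f;
}
```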

src/Network.cpp Outdated
LOCK(m_selfcheck_mutex, selfcheck_lock);
if (selfcheck_fail) {
m_selfcheck_fails.push_back(true);
if (std::count(m_selfcheck_fails.begin(), m_selfcheck_fails.end(), true) >= max_failures) {
Member

begin(m_selfcheck_fails)

selfcheck_fail = true;
}

LOCK(m_selfcheck_mutex, selfcheck_lock);
Member

I think you might be able to get rid of this lock by making m_selfcheck_fails a bitfield and CAS-ing it. (Optional)

Member Author

I thought about it quickly, but it seems to be more trouble than it's worth - we would need to implement an array of bitfields and then a circular buffer to track the last N passes/fails. This lock is unlikely to be performance-critical anyway (it only happens on one out of every 2000 evals).

Any better idea is welcome, but I can't figure out how to make it simple :)

Member
@gcp, Jul 25, 2018

only happens one out of 2000 evals

That's a good point and makes my idea moot.

For reference:
You don't need an array or circular buffer. You just add the latest result with bitfield = (bitfield << 1 | result) & 0x3FF; i.e., just relying on the shift and mask to keep the last ten results. Checking failure is then popcount(bitfield) > 2. Because the bitfield fits in an int32, it can be atomically CASed and no lock is needed.
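A runnable sketch of that idea, assuming a 10-result window (mask 0x3FF) and a trip threshold of more than 2 failures; the atomic variable and function name are illustrative, not leela-zero code:

```cpp
#include <atomic>
#include <bitset>
#include <cassert>
#include <cstdint>

// Lock-free self-check window: the last 10 pass/fail results live in
// one 32-bit word, updated with compare-and-swap instead of a mutex.
// A set bit means a failed check.
std::atomic<std::uint32_t> selfcheck_bits{0};

bool record_selfcheck(bool failed) {
    std::uint32_t old_bits = selfcheck_bits.load();
    std::uint32_t new_bits;
    do {
        // Shift in the newest result, keep only the last 10 results.
        new_bits = ((old_bits << 1) | (failed ? 1u : 0u)) & 0x3FFu;
    } while (!selfcheck_bits.compare_exchange_weak(old_bits, new_bits));
    // Trip when more than 2 of the last 10 checks failed.
    return std::bitset<32>(new_bits).count() > 2;
}
```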

* Unlike previous versions, this does not enable half-precision by default,
* it just compiles half-precision support. You have to use the command line
* argument --use-half explicitly to enable half-precision.
* half-precision OpenCL gains performance on some GPUs while losing some
Member

Obvious question: should we check it during tuning and automatically enable if beneficial?

Member

#ifdef USE_HALF
const auto TUNER_KERNEL = std::string("XgemmBatchedHalf");
#endif

This stuff no longer works correctly. (Github didn't let me comment on the actual line)

Member Author

Thinking about autodetecting between fp16 and fp32, the best I can possibly try is:

  • Do a netbench on both fp16 and fp32, and use fp16 only if it is at least 5% faster. 5% is a bit arbitrary, but the last time we tried, engine strength was roughly equivalent at the same playouts, so 5% gives a bit of margin.
  • If the fp16 self-check fails, revert to fp32 - the revert process may be tricky because it could violate time constraints if it happens in the middle of a game, though. Hopefully this doesn't happen frequently enough to lose on time.
  • Autodetect is the default, but also allow forcing fp16 or fp32.

Any better ideas?
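The first bullet could be sketched as follows (hypothetical names; the 5% margin is the arbitrary one mentioned above, and throughput is assumed to be measured in evaluations per second from a netbench run):

```cpp
#include <cassert>

// Prefer fp16 only when a netbench shows it at least 5% faster than
// fp32, measured in evaluations per second.
enum class Precision { Single, Half };

Precision choose_precision(double fp32_evals_per_sec,
                           double fp16_evals_per_sec) {
    // The 5% margin is arbitrary: strength at equal playouts was
    // roughly the same, so demand a real speed win before trading
    // away accuracy.
    return fp16_evals_per_sec >= 1.05 * fp32_evals_per_sec
               ? Precision::Half : Precision::Single;
}
```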

Member Author

Ah, the TUNER_KERNEL has been fixed too. Involved some template tricks.

Member

If fp16 selfcheck fails revert to fp32

Not sure we need to handle this case with the relaxed settings.

@ihavnoid
Member Author

Made some commits. The autodetect might take a bit more time; we have to get there eventually, but I am not sure if we want it as a separate PR or as part of this one.

@fell111

fell111 commented Jul 25, 2018

If we do not have autodetect, we need to add the "USE_HALF" option in autogtp too. Otherwise there will be no parameter passed into leelaz.

@gcp gcp merged commit b323d40 into leela-zero:next Jul 25, 2018
gcp pushed a commit that referenced this pull request Jul 25, 2018
* OpenCL half precision is now command-line option, 
  support compiled in by default.
  This converts the OpenCL code into a gigantic template library.
* Update Network self-check.
 - Final output is used for self-check.
 - Criteria is 20% error, while ignoring values smaller than 1/361.
 - Throws exception when three out of the last ten checks fail.

Pull request #1649.
ChinChangYang added a commit to ChinChangYang/leela-zero that referenced this pull request Aug 25, 2018
* Add multi GPU training support.

Pull request leela-zero#1386.

* Extend GTP to support real time search info.

* Extend GTP to add support for displaying winrates and variations
  from LZ while LZ is thinking.
* Use UCI format for lz-analyze and lz-genmove-analyze.
* Don't sort gtp lz-analyze output because it is not thread-safe.

Pull request leela-zero#1388.

* Remove virtual loss from eval for live stats.

For discussion see pull request leela-zero#1412.

* Make analysis output use one move per line.

More in line with UCI, cleaner, easier to parse, smaller code.

* Remove versioned clang from Makefile.

Don't hardcode the clang version in the Makefile.

* Fix varargs usage.

Regression from leela-zero#1388. Fixes issue leela-zero#1424.

* AutoGTP: send leelaz version to server.

Send leelaz version embedded in the URL used to ask for a new job.

Pull request leela-zero#1430.

* Multi GPU: fix split and variable placement.

* Fix split in net_to_model.
* Add soft placement of variables.
* Fixes Windows issues.

Pull request leela-zero#1443.

* Mutex optimization.

* Updated Mutex implementation to use TTS instead of TS.
* Explicitly relax memory order (no behavior change, it's the default) 
  and attempt TS before TTS loop. 
  (improves performance in low contention locks)

Pull request leela-zero#1432.

* Update leela-zero.vcxproj for VS2015.

Pull request leela-zero#1439.

* Add order to analysis data.

See discussion in issue leela-zero#1425.

Pull request leela-zero#1478.

* Fix misleading comments & naming.

The Alpha (Go) Zero outputs use TanH nonlinearities, not sigmoids. The
code comments and variable naming refer to an earlier version that used
sigmoids and that is confusing people.

See issue leela-zero#1484.

* Add Lizzie and LeelaSabaki to README.

Pull request leela-zero#1513.

* Make Debian package with CMake.

* Create debian package by cpack

We can create a Debian leelaz package with "make package" via cpack.

* Find leelaz if ./leelaz does not exist

If leelaz is installed at /usr/bin, then autogtp should find it as
leelaz instead of ./leelaz.

* Generate package dependency list

Use dpkg-shlibdeps to generate better package dependency list

* Use git tags as version strings

Pull request leela-zero#1445.

* Look for symmetry on NNCache lookup.

* Look for symmetrical position in cache.
* Disable NNCache symmetry in self-play.

To increase randomness from rotational asymmetry.

* Only check symmetry in opening. Refactor TimeControl.

Only check for symmetries in the NNCache when we are in the 
opening (fast moving zone). Refactor TimeControl to take the 
boardsize out.

* Change bench to asymmetric position.

Avoids rotation symmetry speedups, they are not typical.

* Rename rotation to symmetry, limit to early opening.

Be consistent and don't call symmetries rotations. Limit the symmetry
lookups to the first half of the opening (the first 30 moves on
19 x 19).

Based on pull request leela-zero#1275, but without keeping the rotation array in
every board instance.

Pull request leela-zero#1421.

* Symmetry calculation cleanup.

Pull request leela-zero#1522.

* Non-pruning (simple) time management.

See issue leela-zero#1416.

Pull request leela-zero#1497.

* Clean up some constants.

* Remove unused 'BIG' constant.
* Capture "N/A" vertex value in constant.

Pull request leela-zero#1528.

* Duplicate line removal.

Pull request leela-zero#1529.

* Script for converting minigo weights.

Pull request leela-zero#1538.

* Update README.md.

Added q+Enter instructions.

Pull request leela-zero#1542.

* Fix Validation checking on Windows.

Fix Validation checking if binary exists on Windows.

Pull request leela-zero#1544.

* Constant for the unchanged symmetry index.

Pull request leela-zero#1548.

* Update README.md.

Update the TODO list.

* Removed unused class KeyPress. 

Pull request leela-zero#1560.

* Allow 3 AutoGTP quitting conditions.

Pull request leela-zero#1580.

* More draw handling.

Pull request leela-zero#1577.

* Suppress upstream warnings in Makefile.

Pull request leela-zero#1605.

* Fix TF update operations.

The real update operation should be the computation of the gradient 
rather than the assignment of it.

Pull request leela-zero#1614.

Fixes issue leela-zero#1502.

* Code restructuring: less globals.

* Remove thread_local variables for OpenCL subsystem.
  (this is to allow many different OpenCL implementations
   to exist concurrently)
* OpenCLScheduler: task queue cleanup.
* Change static Network methods to instance methods and
  replace it with global Network instance.
* All weights moved from Network.cpp static variables to class Network.
* NNCache is now a member variable of Network, not a global.
* Network filename now comes from external call, not a global variable.
* Removed global g_network object,
  instead it is member of UCTSearch class.
* UCTNode is now a static member variable of GTP.
  (instead of a static of a function)
* Rename ThreadData to OpenCLContext.
  (it's no longer a thread-specific structure).

Pull request leela-zero#1558.

* Removed unused types. 

Pull request leela-zero#1621.

* Resurrect GPU autodetection.

Fixes issue leela-zero#1632.

Pull request leela-zero#1633.

* Restrict the use of "score".

Using "score" as a nonspecific term (and not when it, for example,
refers to the count at the end of game) makes it unnecessarily hard
to understand the code and see how it matches with the literature.

Pull request leela-zero#1635.

* Code restructuring: Create ForwardPipe interface.

Code restructuring: Create abstract class ForwardPipe,
which represents a class that has a forward() call.

* Moved network initialization code to OpenCLScheduler.
* Created abstract class ForwardPipe which will be the base interface
  of all forward() calls.
* Moved CPU-based forward() code to class CPUPipe.
* Added --cpu-only option.

This command line option will run a CPU-only implementation on a
OpenCL build. Can be used for testing and running fallback modes
rather than switching binaries.

Pull request leela-zero#1620.

* Coding style consistency cleanups.

* Remove use of "new".

Prefer make_unique instead.

* Give ForwardPipe a virtual destructor.

Silence clang warning.

Pull request leela-zero#1644.

* Replace if-else chain with switch statement.

Pull request leela-zero#1638.

* Use Winograd F(4x4, 3x3).

* Winograd F(4x4, 3x3) for CPU
* Winograd F(4x4, 3x3) for OpenCL 
* OpenCL batching support.

Pull request leela-zero#1643.

* Increase error budget in tuner.

The 256 channel network exceeds 1% error in the tuner,
but the network output seems accurate enough during play.

Fixes leela-zero#1645.

Pull request leela-zero#1647.

* Get rid of more "network" globals and pointers. 

Keep a single "network" global in GTP, owned by a unique_ptr and move
things around when needed.

Pull request leela-zero#1650.

* Runtime selection of fp16/fp32.

* OpenCL half precision is now command-line option, 
  support compiled in by default.
  This converts the OpenCL code into a gigantic template library.
* Update Network self-check.
 - Final output is used for self-check.
 - Criteria is 20% error, while ignoring values smaller than 1/361.
 - Throws exception when three out of the last ten checks fail.

Pull request leela-zero#1649.

* Minor code cleanups.

Slight style edits of code and comments.

* Clean up SGFTree style.

Modernize some parts of SGFTree's style.

* Remove separate USE_HALF build from CI.

This is integrated into the main build now.

Pull request leela-zero#1655.

* Don't assume alternating colors in SGF.

Fix a bug where an SGF file/string could not contain
2 consecutive moves of the same color.

Fixes issue leela-zero#1469.

Pull request leela-zero#1654.

* Remove separate half precision kernel.

Use the preprocessor defines to make a single kernel support 
both single precision and half precision storage.

Pull request leela-zero#1661.

* Compress duplicate evaluation code. 

Pull request leela-zero#1660.

* Consistent header guard naming. 

Pull request leela-zero#1664.

* Replace macros with proper constants.

Pull request leela-zero#1671.

* Implement NN eval fp16/fp32 autodetect.

Implemented NN eval fp16/fp32 autodetect.
Runs both precisions for 1 second, and if fp16 is faster than
fp32 by more than 5%, fp16 is used. 
Removes --use-half, replaces it with 
--precision [auto|single|half] option, default auto.

Pull request leela-zero#1657.

* Resign analysis: search for the highest resign threshold. 

Added resign analysis option to search for the highest 
resign threshold that should be set.

Pull request leela-zero#1606.

* Half precision compute support.

Use half precision computation on cards that support it.

Pull request leela-zero#1672.

* Thread scalability improvements.

- On OpenCLScheduler, don't use condvars, which tend to be slow
  because of thread sleep/wake.
- Instead, use spinlocks and just have enough contexts to avoid sleeping.
- Allow more threads than the CPU physically has.
  This is required in many multi-GPU setups with low core counts
  (e.g., a quad-core non-hyperthreaded CPU with 2 GPUs).

Pull request leela-zero#1669.

* Use L2-norm in self check.

The previous method is too strict for fp16 compute. 

Since the lower precision of fp16 is still good enough to play at
the same strength as fp32, relax the self check.

Pull request leela-zero#1698.
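For reference, a relative-L2 self-check along these lines might look like the sketch below (illustrative, not the code from #1698; the 0.2 tolerance is an assumed placeholder):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Aggregate the error across the whole output vector, so a few
// slightly-off small values no longer fail the check on their own.
bool selfcheck_l2(const std::vector<float>& ref,
                  const std::vector<float>& test,
                  float tolerance = 0.2f) {
    double diff_sq = 0.0;
    double ref_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); i++) {
        const double d = ref[i] - test[i];
        diff_sq += d * d;
        ref_sq += static_cast<double>(ref[i]) * ref[i];
    }
    // Relative L2 error: ||ref - test|| / ||ref||.
    return std::sqrt(diff_sq) <= tolerance * std::sqrt(ref_sq);
}
```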

* OpenCL tuner fixes.

* Fix error calculation (Missing batch_size divider).
* Better error reporting when no working configuration could be found.
* Change reference data to have less rounding errors with half precision.
* Replace BLAS reference SGEMM with custom code that gives transposed 
  output like the OpenCL SGEMM.

Pull request leela-zero#1710.

* Change policy vector to array.

Should save a tiny bit of memory.

Pull request leela-zero#1716.

* Fall back to single precision net on breakage.

Fall back to single precision net when half precision is broken, 
at least when detection mode is auto.

Pull request leela-zero#1726.

* AutoGTP: use compressed weights networks.

Pull request leela-zero#1721.

* Fix OpenCL buffer sizes.

Some OpenCL buffers were allocated too big. 
Tested with oclgrind that the new sizes are correct.

Pull request leela-zero#1727.

* Script for quantizing weights.

Use smaller precision to store the weights to decrease the file size.

See discussion in issue leela-zero#1733.

Pull request leela-zero#1736.

* Network initialization restructuring.

* Network initialization restructuring

- Create one net at a time when doing fp16/fp32 autodetect.
  Saves some GPU memory.
- Create an internal lambda which initializes the nets.
- Use std::copy to copy vectors to reduce runtime.

* zeropad_U : loop reordering for performance optimization.

Plus other optimizations for zero-copying initialization.

Pull request leela-zero#1750.

* Fix comments, code style.

Minor fixes to incorrect comments, and reduce some excessively long
lines.

* Validation: support GTP commands for each binary.

* Changed Validation and Game to support multiple GTP commands
  at start up but left the Validations options untouched.
* Separated engine options (as positional arguments) from match options.
  Replaced time settings option with ability to specify any GTP commands.
* Added --gtp-command options using the existing option parser.
  Also changed default binary options from -p 1600 to -v 3200.
* Each binary argument has to be preceded by "--".
* Changes to use Engine Objects.
* Exits on failed GTP command.

Added printing of GTP commands in gameStart() so users can see what
commands are actually sent to each engine.

Pull request leela-zero#1652.

* Don't refer to stone locations as "squares".

* Don't refer to stone locations as "squares".

* Use "vertex" for those in the "letterbox" representation.
* Otherwise, mostly use "intersection".
* Also, capture all possible moves (i.e. including pass) in its own
  explicit constant.

* Clean up network constants.

Pull request leela-zero#1723.
@ihavnoid ihavnoid deleted the runtime_fp16 branch October 8, 2018 23:18