
--full-tuner can do 3 tunings instead of 1 or 2 #1973

Closed
liujn2018 opened this issue Nov 1, 2018 · 26 comments

@liujn2018

I have downloaded the latest source and compiled it on Google Colab.
However, when I ran the full tuner, the tuning process ran again and again.

My command is:
!cd run_lz_in_gg && ./leelaz -w networks/2da87ea8da0f54e87b70159e6bb82811b61d1c31091b6e019fbe62aeaa803b9c.gz --full-tuner

The output is:
Using 2 thread(s).
RNG seed: 11407022574244404640
Leela Zero 0.16 Copyright (C) 2017-2018 Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

BLAS Core: built-in Eigen 3.3.5 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.2.176
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: Tesla K80
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 396.44
Device speed: 823 MHz
Device cores: 13 CU
Device score: 1112
Selected platform: NVIDIA CUDA
Selected device: Tesla K80
with OpenCL 1.2 capability.
Half precision compute support: No.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.2.176
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: Tesla K80
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 396.44
Device speed: 823 MHz
Device cores: 13 CU
Device score: 1112
Selected platform: NVIDIA CUDA
Selected device: Tesla K80
with OpenCL 1.2 capability.
Half precision compute support: No.

Started OpenCL SGEMM tuner.
Will try 5355 valid configurations.
(1/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.7798 ms (151.3 GFLOPS)
(2/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4100 ms (287.7 GFLOPS)
(3/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4083 ms (288.9 GFLOPS)
(87/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3289 ms (358.7 GFLOPS)
(130/5355) KWG=32 KWI=8 MDIMA=32 MDIMC=16 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3166 ms (372.6 GFLOPS)
(134/5355) KWG=32 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.2621 ms (450.1 GFLOPS)
(139/5355) KWG=16 KWI=8 MDIMA=32 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.2610 ms (452.0 GFLOPS)
(159/5355) KWG=32 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=1 0.2163 ms (545.3 GFLOPS)
(296/5355) KWG=16 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 0.1923 ms (613.5 GFLOPS)
(2361/5355) KWG=32 KWI=8 MDIMA=16 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=1 VWM=2 VWN=4 0.1903 ms (619.9 GFLOPS)
(2370/5355) KWG=32 KWI=8 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=1 VWM=4 VWN=4 0.1852 ms (637.1 GFLOPS)
(2711/5355) KWG=16 KWI=8 MDIMA=32 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=0 STRM=1 STRN=1 VWM=2 VWN=4 0.1818 ms (648.8 GFLOPS)
(3652/5355) KWG=16 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=1 STRM=0 STRN=1 VWM=8 VWN=2 0.1817 ms (649.2 GFLOPS)
(4602/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=1 STRN=0 VWM=4 VWN=2 0.1814 ms (650.1 GFLOPS)
Wavefront/Warp size: 32
Max workgroup size: 1024
Max workgroup dimensions: 1024 1024 64

Started OpenCL SGEMM tuner.
Will try 5355 valid configurations.
(1/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.7486 ms (157.6 GFLOPS)
(2/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4401 ms (268.1 GFLOPS)
(17/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4366 ms (270.2 GFLOPS)
(87/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.4358 ms (270.7 GFLOPS)
(130/5355) KWG=32 KWI=8 MDIMA=32 MDIMC=16 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.4190 ms (281.6 GFLOPS)
(134/5355) KWG=32 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3727 ms (316.5 GFLOPS)
(139/5355) KWG=16 KWI=8 MDIMA=32 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3722 ms (317.0 GFLOPS)
(204/5355) KWG=16 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=2 0.3668 ms (321.6 GFLOPS)
(281/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 0.3532 ms (334.0 GFLOPS)
(302/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=8 VWN=2 0.3529 ms (334.3 GFLOPS)
(1371/5355) KWG=32 KWI=2 MDIMA=32 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.3374 ms (349.6 GFLOPS)
(1443/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3170 ms (372.1 GFLOPS)
(1457/5355) KWG=16 KWI=2 MDIMA=32 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3055 ms (386.2 GFLOPS)
(1475/5355) KWG=16 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3049 ms (386.9 GFLOPS)
(1501/5355) KWG=32 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=4 VWN=1 0.2848 ms (414.1 GFLOPS)
(1608/5355) KWG=32 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 0.2805 ms (420.6 GFLOPS)
(4123/5355) KWG=16 KWI=2 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=1 0.2776 ms (424.9 GFLOPS)
(4260/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.2696 ms (437.6 GFLOPS)
(4276/5355) KWG=16 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.2618 ms (450.6 GFLOPS)
(4284/5355) KWG=16 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.2534 ms (465.5 GFLOPS)
(4302/5355) KWG=16 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=8 VWN=2 0.2423 ms (486.8 GFLOPS)
Wavefront/Warp size: 32
Max workgroup size: 1024
Max workgroup dimensions: 1024 1024 64
Using OpenCL single precision (less than 5% slower than half).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.2.176
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: Tesla K80
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 396.44
Device speed: 823 MHz
Device cores: 13 CU
Device score: 1112
Selected platform: NVIDIA CUDA
Selected device: Tesla K80
with OpenCL 1.2 capability.
Half precision compute support: No.

Started OpenCL SGEMM tuner.
Will try 5355 valid configurations.
(1/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.5249 ms (224.8 GFLOPS)
(2/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4124 ms (286.0 GFLOPS)
(3/5355) KWG=32 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 0.4124 ms (286.1 GFLOPS)
(87/5355) KWG=16 KWI=2 MDIMA=8 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3330 ms (354.3 GFLOPS)
(130/5355) KWG=32 KWI=8 MDIMA=32 MDIMC=16 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.3140 ms (375.7 GFLOPS)
(134/5355) KWG=32 KWI=8 MDIMA=8 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 0.2630 ms (448.5 GFLOPS)
(159/5355) KWG=32 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=32 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=1 0.2172 ms (543.0 GFLOPS)
(296/5355) KWG=16 KWI=8 MDIMA=16 MDIMC=8 MWG=64 NDIMB=16 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 0.1960 ms (602.0 GFLOPS)
^C

@pondturtle

I have a similar problem running the full tuner on my own computer. It acts like it is making progress, but it hangs around 1200/5355 indefinitely and CPU usage drops from 25% to 0%.

@ihavnoid
Member

ihavnoid commented Nov 1, 2018

Looking at the code, it seems that the full tuner will run three times: once for single precision, again for half precision, and then, if single precision is selected, once more for single precision. This is because invoking the full tuner does not reuse prior tuning results.

I think the easy fix is to explicitly disable the tuner on the third run, although it feels a bit hacky...

@liujn2018
Author

@ihavnoid Thank you.

@Ttl
Member

Ttl commented Nov 1, 2018

Specifying --precision manually should make it run only once, for the specified precision. Check the startup logs during precision autodetection to find out which precision is faster.

@gcp
Member

gcp commented Nov 2, 2018

I think the easy fix is to explicitly disable the tuner on the third run, although it feels a bit hacky...

What about only allowing full-tuner with an explicitly specified precision?

@gcp gcp reopened this Nov 2, 2018
@gcp gcp changed the title full-tuner does not stop in LZ 0.16 --full-tuner canl do 3 tunings instead of 1 or 2 Nov 2, 2018
@gcp gcp added the bug label Nov 2, 2018
@gcp gcp changed the title --full-tuner canl do 3 tunings instead of 1 or 2 --full-tuner can do 3 tunings instead of 1 or 2 Nov 2, 2018
@pondturtle

I specifically tried to do a full tuning with explicitly specified single precision. It generates a tuning file (with suspiciously low performance numbers) which, upon running LZ through GTP, is ignored; instead, LZ runs the fast tuning automatically.

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

@pondturtle That seems a bit odd. Can you post the actual command line that you used when you ran it through GTP?

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

@gcp my concern is that this is not backwards-compatible behavior. Anybody who was using 0.15 with the exact same command line will start seeing failures under the new behavior, because they never assumed automatic precision detection. Explicitly running two batches of tuning might be the better, more sensible behavior; even if it takes double the runtime, at least it doesn't break.

@pondturtle

pondturtle commented Nov 2, 2018

@ihavnoid

%~dp0\leelaz.exe --precision single --tune-only --threads 4 --gpu 0 --gpu 1 --full-tuner -w leelaz-modelbest

I tried with --precision half and with omitting it altogether; it behaves exactly the same way.

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

@pondturtle - it seems that you have two GPUs, possibly different ones. Can you dump the full stdout so that we can understand which GPUs are used when...?

@gcp
Member

gcp commented Nov 2, 2018

@gcp my concern is that doing so is not a backwards-compatible behavior.

True, but I have no particular qualms about breaking --full-tuner. The need for it is something I want to get away from, i.e. it should only be run in exceptional cases.

Is the concern that a lot of people are using some sort of cloud scripts that would need an update?

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

@gcp well, I agree that this should be somewhat rare, but the concern is simply that we just don't know.

If we were to introduce non-backwards-compatible behavior, I would take the more extreme approach: the full tuner implies tune-only. I don't know why anybody would want to wait for the full tuner to run when they want to play a game right now...?

@pondturtle

pondturtle commented Nov 2, 2018

@ihavnoid Two identical GPUs, using EXACTLY the same command that worked fine with the previous version of LZ (I made a bat file). Enclosing the log from the full tuning itself and the GTP log from the client when I ran LZ after said full tuning. Hope it helps.

Specifying the precision either way made no difference at all, in case you are wondering.

GTP log from sabaki.txt
log of full tuning.txt

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

I think I understand what's going on. --tune-only terminates right after the first tuner run completes, which is the single-precision one. The second run that you saw from Sabaki is the half-precision run.

It looks like we can't just terminate leelaz right after the tuning is done - or the easier fix is to explicitly require a precision when somebody wants --tune-only and/or --full-tuner.

@pondturtle

But it specifically says

Half precision compute support: No.

In light of that, it makes no sense to me why LZ would run half-precision tuning at all. Still, I reran the full tuning with these parameters to check your assumption.

%~dp0\leelaz.exe --tune-only --threads 4 --gpu 0 --gpu 1 --full-tuner -w leelaz-modelbest --precision half

It doesn't seem to make any difference. Did I go wrong somewhere?

GTP log from sabaki - half precision.txt
log of full tuning - half precision.txt

@Hersmunch
Member

What command line do you use in Sabaki? Does it also include the same --precision argument as used for the full tuning? It looks like it is left on auto and is therefore doing the tuning for the other precision, the one not covered by the full tuning run.
Even without half-precision compute support, it is still possible to use half-precision storage.

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

Half-precision 'compute' is different - the normal 'half precision' mode computes in single precision but does the memory storage in half precision; that is, half-precision 'memory storage' support (available on most GPUs with a reasonable device driver since the early 2010s). This is the mode you are using in Sabaki.

This time, this seems to be what happened:

  1. The full tuner ran on half precision.
  2. The actual run used auto detection - since there was no single-precision tuning result, it ran the single-precision tuning routines, while reusing the half-precision tuning results from the full tuner.

My experience is that the FLOPS for the matrix multiplication are similar for single and half precision, while half precision is significantly faster overall because all the other routines are much faster.

@ihavnoid
Member

ihavnoid commented Nov 2, 2018

@gcp for the fix, here is my suggestion:

  • --full-tuner implies --tune-only
  • --full-tuner requires explicit precision

What do you think?

@pondturtle

So if I gather this correctly, then something like

%~dp0\leelaz.exe --tune-only --threads 4 --gpu 0 --gpu 1 --full-tuner -w leelaz-modelbest --precision half

%~dp0\leelaz.exe --tune-only --threads 4 --gpu 0 --gpu 1 --full-tuner -w leelaz-modelbest --precision single

should give me the best results?

@ihavnoid
Member

ihavnoid commented Nov 3, 2018

@pondturtle yes, for now.

ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 3, 2018
…d auto precision detection

--full-tuner implies --tune-only
--full-tuner requires an explicit precision
@ihavnoid
Member

ihavnoid commented Nov 3, 2018

Hmm let's see if #1986 looks okay.

ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 4, 2018
ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 4, 2018
ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 4, 2018
gcp pushed a commit that referenced this issue Nov 5, 2018
Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes #1973.

Pull request #1986.
@gcp
Member

gcp commented Nov 5, 2018

I had to back out the fix for this:

AutoGTP v17
Starting tuning process, please wait...
./leelaz --tune-only -w networks/6c7c1c83d15c53089b13f1a361853c5107f16fb163856ca3b2995d2a030ed04f.gz
Automatic precision not supported when tuning only
Please add '--precision single' or '--precision half'

AutoGTP wants to do a tuning run before starting, so that it can inform the users about what's going on. But it has no idea what precision it should use, so it relies on autodetection.

@gcp
Member

gcp commented Nov 5, 2018

@gcp for the fix, here is my suggestion: --full-tuner implies --tune-only; --full-tuner requires explicit precision. What do you think?

This is fine, but somewhere along the way --tune-only also broke so that it only supports an explicit precision.

@alreadydone
Contributor

Would we store the detected precision (and maybe whether the full tuner has been run) in leelaz_opencl_tuning as another attribute, or in a separate file? That could be used to solve this and #1987, I assume.

@gcp
Member

gcp commented Nov 7, 2018

Yes, keying it off the list of detected GPUs would work fine.

But it is somewhat orthogonal to this bug.

ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 9, 2018
ihavnoid added a commit to ihavnoid/leela-zero that referenced this issue Nov 10, 2018
gcp pushed a commit that referenced this issue Nov 15, 2018
Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes #1973.

Pull request #2004.
@sethtroisi
Member

Closing tech-support issue with PR and answer.

gcp pushed a commit that referenced this issue Apr 2, 2019
godmoves added a commit to godmoves/leela-zero that referenced this issue Apr 27, 2019
* Command line parsing : OPENGL --> OPENCL

* Asynchronous simulation / evaluation+backup for batching.

* temp commit.

* New fractional backup implementation.

* reorder children after Dirichlet noise + minor fix.

* Fix for compiler syntax nitpick.

* Once again...

* Output max queue length.

* One queue for each GPU.

* Limit max queue size to twice gpucount*batchsize and Serialize OpenCL commandqueue. (Reverted "one queue for each GPU".)

* temp commits.

* Less variation in speed (pos/s) but seems ~5% slower than max performance.

* Use accumulated virtual losses to avoid visiting expanding nodes.

* Fix missing header leading to error with some compiler.

* Fast conclusion of think().

* Solve problem with root node expansion when it's in NNCache; Fix error with some compilers.

* Cleanup loop code.

Pull request leela-zero#2033.

* always output tuning result

* fixes.

* Tensor core support for half precision

* Bugfixes

* Use m32n8k16 format instead of m16n16k16 - seems to be a bit faster

* Merge fixes.

* Code cleanup for tuning for tensorcores

* Change default to try SA=0 / SA=1 for tensorcore cases

* Update UCTSearch.cpp

* Clear NNCache when clear_board or loadsgf is issued.

* Fixes.

* Queue insertion/vl undo improvements.

* Half precision by default.

* hgemm : Added m16n16k16/m32n8k16/m8n32k16 tuning

Tuner will see which shaped multiplication is fastest.
MDIMA represents the M dimension, NDIMB represents the N dimension.

* Tuner : adjusted range for tensorcore cases so that it covers all MDIMA/NDIMB dimensions

* Fix bug causing infinite wait.

* Fix bug causing infinite wait.

* Minor fixes.

* Minor fixes.

* Crucial fix: infinite wait_expanded.

* Tentative fixes.

* Follow-up fixes.

* Update UCTNode.cpp

* stupid typo.

* stupid typo.

* small fix.

* Fix crucial bug in frac-backup factor calculation.

* Fix crucial bug in frac-backup factor calculation.

* Better output stats.

* Defaulted frac-backup; better naming of pending backup stats.

* Small fix.

* Revert SEL -> WR for get_visits for selection.

* Forgotten comment text change.

* Make some debug variables atomic.

* Renaming a variable; static_cast -> load()

* virtual loss in numerator.

* Small output fix.

* Reorganize pending backup obligations.

* Move backup data insertion to Network::get_output0.

* Remove statics; bugfixes.

* Optimizations? Do not use m_return_queue.

* Corrected implementation of virtual loss accumulation.

* Missing include.

* Modifications that don't achieve good result.

* WIP; implemented readers-writer lock.

* A snapshot as basis of further changes.

* Checkpoint.

* Checkpoint: Seamless think/ponder transition implemented.
NOT for actual use: This version sends positions to GPUs without limit for stress-testing purposes; will eat up your memory.

* Bugfixes and better debug outputs; usable version.

* Checkpoint: changes are not done but it compiles.

* Checkpoint: moved some members from OpenCLScheduler and OpenCL_Network to OpenCL; compiles.

* temp

* temp commit; won't compile.

* Checkpoint: implementation unfinished, now switch to another design.

* Mostly lock-free OpenCLScheduler.
Ensure minimal latency when there're enough positions to feed the GPUs.
Compiles. Pending debug.

* Seems working now.

* Fixes.

* Worker thread = search thread.

* Tweak conversion script for ELF v2.

Small tweak to conversion script for ELF v2 weights.

Pull request leela-zero#2213.

* Bugfix: accumulated virtual loss removal.

* Work around inexplicable reported bug.

* Endgame/Double-pass bugfix.

* Fix some cv race conditions.

* Update OpenCL.h

* Correctly initialize board when reading SGF.

Even though SGF defaults to size 19 boards, we should not try
to set up a board that size if LZ has not been compiled to support
it.

Pull request leela-zero#1964.

* Increase memory limit for 32-bit builds.

Without this, it's empirically not possible to load the current 256x40
networks on a 32-bit machine.

* Never select a CPU during OpenCL autodetection.

If we are trying to auto-select the best device for OpenCL, never select
a CPU. This will cause the engine to refuse to run when people are
trying to run the OpenCL version without a GPU or without GPU drivers,
instead of selecting any slow and suboptimal (and empirically extremely
broken) OpenCL-on-CPU drivers.

Falling back to CPU-only would be another reasonable alternative, but
doesn't provide an alert in case the GPU drivers are missing.

Improves behavior of issue leela-zero#1994.

* Fix tuner for heterogeneous GPUs and auto precision.

Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes leela-zero#1973.

Pull request leela-zero#2004.

* Optimized out and out_in kernels.

Very minor speedup of about 2% with batch size of 1.
With batch size of 5 there is a speedup of about 5% with half precision
and 12% with single precision.

Out transformation memory accesses are almost completely coalesced
with the new kernel.

Pull request leela-zero#2014.

* Update OpenCL C++ headers.

From upstream a807dcf0f8623d40dc5ce9d1eb00ffd0e46150c7.

* CPU-only eval performance optimization.

* CPUPipe : change winograd transformation constants to an equation.

Combined with a series of strength reduction changes, 
improves netbench by about 8%.

* Convert some std::array into individual variables

For some reason this allows gcc to optimize the code better,
improving netbench by 2%.

Pull request leela-zero#2021.

* Convolve in/out performance optimization.

Use hard-coded equations instead of matrix multiplication.

Pull request leela-zero#2023.

* Validation: fix -k option.

Fix Validation -k option by reading its value before the parser is reused.

Pull request leela-zero#2024.

* Add link to Azure free trial instructions.

See pull request leela-zero#2031.

* Cleanup loop code.

Pull request leela-zero#2033.

* Cleanup atomics and dead if.

Pull request leela-zero#2034.

* Const in SGFTree.

Pull request leela-zero#2035.

* Make the README more clear.

Simplify instructions, especially related to building and running
when wanting to contribute.

Based on pull request leela-zero#1983.

* Refactor to allow AutoGTP to use Engine.

* Move Engine to Game.h and refactor autogtp to use it too.
* Fix initialization of job engines.

Pull request leela-zero#2029.

* Fix printf call style.

Generally speaking, providing character pointers as the first argument 
directly might cause FSB (Format String Bug).

Pull request leela-zero#2063.

* Add O(sqrt(log(n))) scaling to tree search.

Pull request leela-zero#2072.

* Update Khronos OpenCL C++ headers.

Update from upstream f0b7045.

Fixes warnings related to CL_TARGET_OPENCL_VERSION.

* AutoGTP: allow specifying an SGF as initial position.

* Make AutoGTP URL parametric.
* Support for the sgfhash and movescount parameters in get-task.
* Automatic downloading of sgf and training files.
* Fix Management.cpp for older Qt5 versions.
* Added starting match games from specified initial position
* Tidy ValidationJob::init() like ProductionJob::init()
* Use existing QUuid method of generating random file 
  names instead of QTemporaryFile when fetching game data.

Moreover, we do not load training data in LeelaZ since it is not needed to start from
an arbitrary position.

Pull request leela-zero#2052.

* Support separate options for white in match games.

* Add optional separate options for white in match game.
* Fixed loading of saved match order with optionsSecond.

Pull request leela-zero#2078.

* Option to get network output without writing to cache. 

Pull request leela-zero#2093.

* Add permission to link with NVIDIA libs. Update year.

See issue leela-zero#2032.

All contributors to the core engine have given their permission to
add an additional permission to link with NVIDIA's CUDA/cuDNN/TensorRT
libraries. This makes it possible to distribute the engine when built to
use those libraries.

Update the copyright notices to 2019.

* Add link to GoReviewPartner.

Pull request leela-zero#2147.

* Reminder to install OpenCL driver if separate.

Although the OpenCL driver is generally installed as part of the driver
install, mention the requirement explicitly in case it wasn't.

See pull request leela-zero#2138.

* Fixed leelaz_file on Android.

Pull request leela-zero#2135.

* Fix 'catching polymorphic type by value' warning.

Pull request leela-zero#2134.

* Fixed converter script for minigo removing bias.

Fixes leela-zero#2020.

Pull request leela-zero#2133.

* Add zlib to the mac OS X build instructions.

See pull request leela-zero#2122.

* UCTNodePtr rare race condition fix.

Calling get_eval() on zero-visit node will assert-fail.
The original code could assert-fail on b.get_eval() if 'a' and 'b' both
had zero visits but suddenly 'a' gained an additional visit.

Pull request leela-zero#2110.

* Make sure analysis is printed at least once.

Fixes issue leela-zero#2001.

Pull request leela-zero#2114.

* Don't post if not requested.

Follow up fix for pull request leela-zero#2114.

* AutoGTP: Allow specifying initial GTP commands.

* AutoGTP: Allow specifying initial GTP commands.
  Also add support for white taking the first move in handicapped job games.
* AutoGTP: Refactored core loop for match games to avoid code duplication.
* Fixed white using black's match game settings after loading from an SGF by
  moving SGF loading into Game::gameStart() to before sending GTP commands
  (except handicap commands).
* Changed so that when an SGF file is loaded, AutoGTP determines whether
  handicap is in use from the SGF rather than from any starting GTP commands.

Pull request leela-zero#2096.

* Update Eigen to 3.3.7. 

This includes some optimization improvements for newer GCC/Clang that
may be relevant to a lot of our users.

Pull request leela-zero#2151.

* Fix lz-setoption name playouts.

Fixes issue leela-zero#2167.

I could swear I fixed this before. Maybe I forgot to push?

* AutoGTP: More info in SGF comments.

* AutoGTP: Added full engine options and starting GTP commands 
  to SGF comments that are produced.
* Refactored Game::fixSgf().

Pull request leela-zero#2160.

* Truncate and compress minigo weights.

Truncate to 4 digits of precision and compress converted minigo weights.

Pull request leela-zero#2173.

* Add gomill-explain_last_move.

Add gomill-explain_last_move for additional output in ringmaster
competitions.

Pull request leela-zero#2174.

* Add a feature to exclude moves from the search.

* The "avoid" command is now a param for lz-analyze and for
  lz-genmove_analyze.

New syntax is:

  `lz-analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`
  `lz-genmove_analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`

The number_of_moves is now always relative to the current move number.

Example:

  `lz-analyze b 200 avoid b q16 1 avoid b q4 1 avoid b d16 1 avoid b d4 1`

* Re-organize the parser for the "analyze" commands.

  * New tag "interval"; old syntax "100" is now short for "interval 100"
  * Tags can be specified in any arbitrary order
  * Moved all of the parsing code for "lz-analyze" and
    "lz-genmove_analyze" into the parse_analyze_tags function
  * parse_analyze_tags uses its return value instead of side effects

* Implement the "allow" tag for lz-analyze.

It works similar to "avoid".  Adding moves to the "allow" list is the
same as adding all other moves (except pass and resign) to the "avoid" list.

* "Avoid" and "allow" moves can be specified as a comma-separated list.

Example:

  `lz-analyze b 100 avoid w q4,q16,d4,d16 2 avoid b pass 50`

Pull request leela-zero#1949.

* Removed --cpu-only option from USE_CPU_ONLY build. 

Generalized output displayed in cases where potentially referring to a CPU 
instead of or as well as a GPU.

Pull request leela-zero#2161.

* Tensor Core support with PTX inline assembly.

* Tensor core support for half precision
* hgemm : Added m16n16k16/m32n8k16/m8n32k16 tuning

Tuner will see which shaped multiplication is fastest.
MDIMA represents the M dimension, NDIMB represents the N dimension.

* tensorcore : Test m16n16k16 types only for checking tensorcore availability

It seems that there are cases where only m16n16k16 is supported.
If other formats are not available they will be auto-disabled on tuning.

Pull request leela-zero#2049.

* Update TODO list.

We support avoid tags now. Clarify batching work needs
changes in the search.

* Remove an unnecessary std::move().

Which inhibits RVO. See e.g. https://stackoverflow.com/a/19272035

* Add contributor (and maintainer) guidelines. 

* Add contributor (and maintainer) guidelines.

Spell out the existing code style, C++ usage, git workflow,
commit message requirements, and give guidelines regarding reviewing,
merging and adding configuration options and GTP extensions.

Pull request leela-zero#2186.

* Add several simple GTP commands.

Added several simple GTP commands useful for building interfaces to LZ.

Added the following GTP commands.

    last_move
    move_history

The output of these commands is in line with that of the corresponding
commands in GNU Go when such commands existed.

Pull request leela-zero#2170.

* Minor style fixups.

Minor fixups for pull request leela-zero#2170.

* Remark about move assignment in style guideline.

Emphasize use of emplace_back and move semantics.

* Add lz-analyze minmoves tag.

Add an lz-analyze tag to suggest the minimum amount of moves the
engine should post info about (rather than only those it considers
interesting, i.e. the ones with at least a visit).

This allows some very flexible constructs:

Getting a heatmap:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 361

Forcing a move among the top policy moves only:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 2
    (store those moves, e.g. A1, B1)
    lz-setoption name visits value 0
    lz-genmove_analyze b interval 1 allow b A1 1 allow b B1 1

* Fix style, extra spaces in PV output.

Adding the minmoves tag exposes a small bug in the PV
output formatting. Avoid extra blank spaces.

Small style fixups.

* Rework test regex for MSVC limits.

Seems like the previous test regex is causing MSVC's regex engine to run
out of stack space.

* .gitignore: Add build.

leela-zero's default build directory is `build`.

It is very annoying when using leela as a git submodule that 
the repository updates whenever it builds.

Pull request leela-zero#2199.

* Batched neural net evaluations

Group evaluations and run them in parallel. Roughly 50% speedup on my setup, but there are a couple of points that is debatable.

- Thread / batch sizing heuristics : This PR changes how the default threads / default batch sizes are picked.  See Leela.cpp
- Batch-forming heuristic : See OpenCLScheduler.cpp for the batch forming heuristic : the heuristic exists so that we can wait for the rest of the engine to create more NN evaluations so that we can run larger batches.  We can't wait indefinitely since there are cases we enter 'serial' paths.  Since heuristics are heuristics, these might need some tests on a larger variety of types of systems.

Did make sure that winrate improves when running default vs. default command line `./leelaz -w (weight file)` on time parity.

Pull request leela-zero#2188.

* Autogtp: Tune for batchsize 1

Self-play games specify `-t 1` for playing which implies batch size of 1, but tuning was done for default settings since number of threads was not specified.

Pull request leela-zero#2206

* Update README.md.

Update links to leela-zero instead of gcp.
Update badge and link to the new AppVeyor project
under leela-zero instead of gcp ownership.

* Remove unused lambda capture.

Pull request leela-zero#2231.

* README.md: link to mentioned pull requests.

Pull request leela-zero#2229.

* Minor cleanup involving Network::get_output. 

Pull request leela-zero#2228.

* Set up default batch size and threads.

Fixes issue leela-zero#2214.

Pull request leela-zero#2256.

* Shuffle tuner parameters to find good parameters quicker.

Parameters are searched in a linear fashion currently. By shuffling them,
we will find a good instance more quickly.

Also, shuffing could help reduce possible bias due to grouped, similar
parameters that affect the environment (e.g. cache, branch predictor, ...),
leading to more accurate/fair results.

Additionally, this is a preparation for exiting the tuner during the search,
which becomes a possible option.

Pull request leela-zero#2225.

* Refactor tree_stats_helper to lambda.

Pull request leela-zero#2244.

* Enable batching for self-play.

Pull request leela-zero#2253.

* Allow configuring default komi at compile-time.

Pull request leela-zero#2257.

* Make chunkparser more robust.

Some clients are sending corrupted data, make the
chunk parser resilient against it.

* Fix thread count error message.

Pull request leela-zero#2287.

* Fix small style nits.

* Add support for time controls in loadsgf/printsgf.

Added extra support for "TM" and "OT" and other sgf time control
properties on printsgf and loadsgf GTP commands.

* Added parsing and loading of "TM" and "OT" sgf properties on GTP command
  loadsgf. Only supports "OT" syntax matching output from a printsgf GTP
  command.
* Change SGFTree to have a shared_ptr for a time control.
* Added saving and loading of "BL", "WL", "OB" and "OW" sgf properties on
  GTP commands printsgf and loadsgf.
* Change to make TimeControl::make_from_text_sgf() a time control factory
  and other minor tidying.

Pull request leela-zero#2172.

* Fix inconsistent default timecontrol.

As noted in pull request leela-zero#2172, the default
constructor set byo yomi stones but no time or
periods.

* Error out if weights are for wrong board size.

We currently will either crash or do strange things if we're
fed a weights file that doesn't match the board size we're compiled
for.

See issue leela-zero#2289.

* Ignore passing moves unless they make sense.

Only pass when winning or low on legal moves.
Disabled in self-play.

Fixes issue leela-zero#2273.
Based on pull request leela-zero#2277.

Pull request leela-zero#2301.

* Always allow passing when low on moves.

As pointed out by @gjm11 in leela-zero#2277, when there's few legal moves we might
want to allow passing even if this loses on the board count. The
alternative might be to self-destruct large groups and carry the game
on endlessely even if the policy wouldn't want to.

No difference in "dumbpass" mode.

* Report root visits in gomill-explain_last_move.

See issue leela-zero#2280.

Pull request leela-zero#2302.

* Choose move based on normal distribution LCB.

* Calculate node variance.
* Use normal distribution LCB to choose the played move.
* Cached student-t.
* Sort lz-analyze output according to LCB.
* Don't choose nodes with very few visits even if LCB is better.

Guard against NN misevaluations when top move has lot of visits.
Without this it's possible for move with few hundred visits to be picked
over a move with over ten thousand visits.

The problem is that the evaluation distribution isn't really normal
distribution. Evaluations correlate and the distribution can change
if deeper in the tree it finds a better alternative.

Pull request leela-zero#2290.

* Mixed precision training support.

* Add mixed precision training support.
* Do not use loss scale if training with fp32
* Fix potential reg_term overflow of large networks.

Pull request leela-zero#2191.

* Update AUTHORS.

* Don't detect precision with Tensor Cores. 

Don't autodetect or default to fp32 when all cards have
Tensor Cores. We will assume fp16 is the fastest.

This avoids problems in tune-only mode which does not
detect the precision to use and would use fp32 on such cards.

Pull request leela-zero#2312.

* Update README.md.

We have a first implementation of batching now.

* Ignore --batchsize in CPU only compiles.

AutoGTP will always send --batchsize, but CPU only
compiles don't support the option. Ignore the option
in those builds.

The same problem exists with --tune-only, but quitting
immediately happens to be sane behavior so we don't need
to fix that.

Pull request leela-zero#2313.

* Don't include OpenCL scheduler in CPU build.

It will recursively include OpenCL.h and that
is bad.

Pull request leela-zero#2314.

* Bump version numbers.

* Fix: batch sizes were not set according to command line.
Vandertic pushed a commit to CuriosAI/sai that referenced this issue Jun 10, 2019
Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes leela-zero#1973.

Pull request leela-zero#2004.
ihavnoid pushed a commit that referenced this issue Jul 27, 2019
* Correctly initialize board when reading SGF.

Even though SGF defaults to size 19 boards, we should not try
to set up a board that size if LZ has not been compiled to support
it.

Pull request #1964.

* Increase memory limit for 32-bit builds.

Without this, it's empirically not possible to load the current 256x40
networks on a 32-bit machine.

* Never select a CPU during OpenCL autodetection.

If we are trying to auto-select the best device for OpenCL, never select
a CPU. This will cause the engine to refuse to run when people are
trying to run the OpenCL version without a GPU or without GPU drivers,
instead of selecting any slow and suboptimal (and empirically extremely
broken) OpenCL-on-CPU drivers.

Falling back to CPU-only would be another reasonable alternative, but
doesn't provide an alert in case the GPU drivers are missing.

Improves behavior of issue #1994.

* Fix tuner for heterogeneous GPUs and auto precision.

Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes #1973.

Pull request #2004.

* Optimized out and out_in kernels.

Very minor speedup of about 2% with batch size of 1.
With batch size of 5 there is a speedup of about 5% with half precision
and 12% with single precision.

Out transformation memory accesses are almost completely coalesced
with the new kernel.

Pull request #2014.

* Update OpenCL C++ headers.

From upstream a807dcf0f8623d40dc5ce9d1eb00ffd0e46150c7.

* CPU-only eval performance optimization.

* CPUPipe : change winograd transformation constants to an equation.

Combined with a series of strength reduction changes, 
improves netbench by about 8%.

* Convert some std::array into individual variables

For some reason this allows gcc to optimize the code better,
improving netbench by 2%.

Pull request #2021.

* Convolve in/out performance optimization.

Use hard-coded equations instead of matrix multiplication.

Pull request #2023.

* Validation: fix -k option.

Fix Validation -k option by reading its value before the parser is reused.

Pull request #2024.

* Add link to Azure free trial instructions.

See pull request #2031.

* Cleanup atomics and dead if.

Pull request #2034.

* Const in SGFTree.

Pull request #2035.

* Make the README more clear.

Simplify instructions, especially related to building and running
when wanting to contribute.

Based on pull request #1983.

* Refactor to allow AutoGTP to use Engine.

* Move Engine to Game.h and refactor autogtp to use it too.
* Fix initialization of job engines.

Pull request #2029.

* Fix printf call style.

Generally speaking, passing character pointers directly as the format
argument can cause a format string bug (FSB).

Pull request #2063.

* Update Khronos OpenCL C++ headers.

Update from upstream f0b7045.

Fixes warnings related to CL_TARGET_OPENCL_VERSION.

* Cleanup loop code.

Pull request #2033.

* AutoGTP: allow specifying an SGF as initial position.

* Make AutoGTP URL parametric.
* Support for the sgfhash and movescount parameters in get-task.
* Automatic downloading of sgf and training files.
* Fix Management.cpp for older Qt5 versions.
* Added starting match games from specified initial position
* Tidy ValidationJob::init() like ProductionJob::init()
* Use existing QUuid method of generating random file 
  names instead of QTemporaryFile when fetching game data.

Moreover, we do not load training data in LeelaZ since it is not needed to start from
an arbitrary position.

Pull request #2052.

* Support separate options for white in match games.

* Add optional separate options for white in match game.
* Fixed loading of saved match order with optionsSecond.

Pull request #2078.

* Add O(sqrt(log(n))) scaling to tree search.

Pull request #2072.

* Option to get network output without writing to cache. 

Pull request #2093.

* Add permission to link with NVIDIA libs. Update year.

See issue #2032.

All contributors to the core engine have given their permission to
add an additional permission to link with NVIDIA's CUDA/cuDNN/TensorRT
libraries. This makes it possible to distribute the engine when built to
use those libraries.

Update the copyright notices to 2019.

* Add link to GoReviewPartner.

Pull request #2147.

* Reminder to install OpenCL driver if separate.

Although the OpenCL driver is generally installed as part of the driver
install, mention the requirement explicitly in case it wasn't.

See pull request #2138.

* Fixed leelaz_file on Android.

Pull request #2135.

* Fix 'catching polymorphic type by value' warning.

Pull request #2134.

* Fixed converter script for minigo removing bias.

Fixes #2020.

Pull request #2133.

* Add zlib to the mac OS X build instructions.

See pull request #2122.

* UCTNodePtr rare race condition fix.

Calling get_eval() on a zero-visit node will assert-fail.
The original code could assert-fail on b.get_eval() if 'a' and 'b' both
had zero visits but 'a' suddenly gained an additional visit.

Pull request #2110.

* Make sure analysis is printed at least once.

Fixes issue #2001.

Pull request #2114.

* Don't post if not requested.

Follow up fix for pull request #2114.

* AutoGTP: Allow specifying initial GTP commands.

* AutoGTP: Allow specifying initial GTP commands.
  Also add support for white taking the first move in handicapped job games.
* AutoGTP: Refactored core loop for match games to avoid code duplication.
* Fixed white using black's match game settings after loading from an SGF by
  moving SGF loading into Game::gameStart() to before sending GTP commands
  (except handicap commands).
* Changed so that when an SGF file is loaded, AutoGTP determines whether
  handicap is in use from the SGF rather than from any starting GTP commands.

Pull request #2096.

* Update Eigen to 3.3.7. 

This includes some optimization improvements for newer GCC/Clang that
may be relevant to a lot of our users.

Pull request #2151.

* Fix lz-setoption name playouts.

Fixes issue #2167.

I could swear I fixed this before. Maybe I forgot to push?

* AutoGTP: More info in SGF comments.

* AutoGTP: Added full engine options and starting GTP commands 
  to SGF comments that are produced.
* Refactored Game::fixSgf().

Pull request #2160.

* Truncate and compress minigo weights.

Truncate to 4 digits of precision and compress converted minigo weights.

Pull request #2173.

* Add gomill-explain_last_move.

Add gomill-explain_last_move for additional output in ringmaster
competitions.

Pull request #2174.

* Add a feature to exclude moves from the search.

* The "avoid" command is now a param for lz-analyze and for
  lz-genmove_analyze.

New syntax is:

  `lz-analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`
  `lz-genmove_analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`

The number_of_moves is now always relative to the current move number.

Example:

  `lz-analyze b 200 avoid b q16 1 avoid b q4 1 avoid b d16 1 avoid b d4 1`

* Re-organize the parser for the "analyze" commands.

  * New tag "interval"; old syntax "100" is now short for "interval 100"
  * Tags can be specified in any arbitrary order
  * Moved all of the parsing code for "lz-analyze" and
    "lz-genmove_analyze" into the parse_analyze_tags function
  * parse_analyze_tags uses its return value instead of side effects

* Implement the "allow" tag for lz-analyze.

It works similarly to "avoid". Adding moves to the "allow" list is the
same as adding all other moves (except pass and resign) to the "avoid" list.

* "Avoid" and "allow" moves can be specified as a comma-separated list.

Example:

  `lz-analyze b 100 avoid w q4,q16,d4,d16 2 avoid b pass 50`

Pull request #1949.
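For readers building a GTP frontend, the combined "avoid"/"allow" syntax can be sketched as follows. This is an illustrative Python parser, not the engine's actual C++ code (the real logic lives in parse_analyze_tags), and the function and variable names here are hypothetical:

```python
def parse_move_restrictions(tokens, current_move=0):
    """Collect (kind, color, moves, last_move) tuples from an
    lz-analyze tag list. number_of_moves is relative to the
    current move number."""
    restrictions = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("avoid", "allow"):
            kind = tokens[i]
            color = tokens[i + 1]
            moves = tokens[i + 2].split(",")   # comma-separated coords
            n = int(tokens[i + 3])
            restrictions.append((kind, color, moves, current_move + n))
            i += 4
        else:
            i += 1                             # skip unrelated tags
    return restrictions
```

For example, the tag list from `lz-analyze b 100 avoid w q4,q16,d4,d16 2 avoid b pass 50` yields one restriction for white lasting 2 moves and one for black lasting 50.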

* Removed --cpu-only option from USE_CPU_ONLY build. 

Generalized the displayed output in cases that could refer to a CPU
instead of, or as well as, a GPU.

Pull request #2161.

* Tensor Core support with PTX inline assembly.

* Tensor core support for half precision
* hgemm : Added m16n16k16/m32n8k16/m8n32k16 tuning

The tuner will measure which multiplication shape is fastest.
MDIMA represents the M dimension, NDIMB represents the N dimension.

* tensorcore : Test m16n16k16 types only for checking tensorcore availability

It seems that there are cases where only m16n16k16 is supported.
If other formats are not available they will be auto-disabled on tuning.

Pull request #2049.

* Update TODO list.

We support avoid tags now. Clarify batching work needs
changes in the search.

* Remove an unnecessary std::move().

It inhibits RVO. See e.g. https://stackoverflow.com/a/19272035

* Add contributor (and maintainer) guidelines. 

* Add contributor (and maintainer) guidelines.

Spell out the existing code style, C++ usage, git workflow,
commit message requirements, and give guidelines regarding reviewing,
merging and adding configuration options and GTP extensions.

Pull request #2186.

* Add several simple GTP commands.

Added several simple GTP commands useful for building interfaces to LZ.

Added the following GTP commands.

    last_move
    move_history

The output of these commands is in line with that of the corresponding
commands in GNU Go when such commands existed.

Pull request #2170.

* Minor style fixups.

Minor fixups for pull request #2170.

* Remark about move assignment in style guideline.

Emphasize use of emplace_back and move semantics.

* Add lz-analyze minmoves tag.

Add an lz-analyze tag to suggest the minimum number of moves the
engine should post info about (rather than only those it considers
interesting, i.e. the ones with at least a visit).

This allows some very flexible constructs:

Getting a heatmap:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 361

Forcing a move among the top policy moves only:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 2
    (store those moves, e.g. A1, B1)
    lz-setoption name visits value 0
    lz-genmove_analyze b interval 1 allow b A1 1 allow b B1 1

* Fix style, extra spaces in PV output.

Adding the minmoves tag exposes a small bug in the PV
output formatting. Avoid extra blank spaces.

Small style fixups.

* Rework test regex for MSVC limits.

It seems the previous test regex was causing MSVC's regex engine to run
out of stack space.

* .gitignore: Add build.

leela-zero's default build directory is `build`.

When using leela-zero as a git submodule, it is very annoying that
the repository shows local modifications whenever it builds.

Pull request #2199.

* Batched neural net evaluations.

Group evaluations and run them in parallel. Roughly 50% speedup on my setup, but there are a couple of points that are debatable.

- Thread / batch sizing heuristics: this PR changes how the default threads and default batch sizes are picked. See Leela.cpp.
- Batch-forming heuristic: see OpenCLScheduler.cpp. The heuristic exists so that we can wait for the rest of the engine to create more NN evaluations, letting us run larger batches. We can't wait indefinitely since there are cases where we enter 'serial' paths. Since heuristics are heuristics, these might need testing on a larger variety of systems.

Made sure that winrate improves when running default vs. default command line `./leelaz -w (weight file)` at time parity.

Pull request #2188.
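The batch-forming idea can be sketched roughly like this. This is an illustrative Python sketch under stated assumptions, not the OpenCLScheduler.cpp code, and the names are made up:

```python
from collections import deque

def form_batches(pending, batch_size):
    """Greedily group pending NN evaluations into batches of at most
    batch_size each. The real scheduler additionally dispatches a
    partial batch after a short wait, so 'serial' code paths that
    cannot fill a batch are not starved."""
    queue = deque(pending)
    batches = []
    while queue:
        take = min(batch_size, len(queue))
        batches.append([queue.popleft() for _ in range(take)])
    return batches
```

For instance, seven queued evaluations with a batch size of 3 produce two full batches and one partial batch.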

* AutoGTP: Tune for batch size 1.

Self-play games specify `-t 1` for playing, which implies a batch size of 1, but tuning was done for default settings since the number of threads was not specified.

Pull request #2206.

* Tweak conversion script for ELF v2.

Small tweak to conversion script for ELF v2 weights.

Pull request #2213.

* Update README.md

Update links to leela-zero instead of gcp.

* Update README.md

Appveyor link still needs to be 'gcp'.

* Update README.md

Update badge and link to the new AppVeyor project under leela-zero instead of gcp ownership.

* Update README.md.

Update links to leela-zero instead of gcp.
Update badge and link to the new AppVeyor project
under leela-zero instead of gcp ownership.

* Remove unused lambda capture.

Pull request #2231.

* README.md: link to mentioned pull requests.

Pull request #2229.

* Minor cleanup involving Network::get_output. 

Pull request #2228.

* Set up default batch size and threads.

Fixes issue #2214.

Pull request #2256.

* Shuffle tuner parameters to find good parameters quicker.

Parameters are searched in a linear fashion currently. By shuffling them,
we will find a good instance more quickly.

Also, shuffling could help reduce possible bias due to grouped, similar
parameters that affect the environment (e.g. cache, branch predictor, ...),
leading to more accurate/fair results.

Additionally, this prepares for making it possible to exit the tuner
early during the search.

Pull request #2225.

* Refactor tree_stats_helper to lambda.

Pull request #2244.

* Enable batching for self-play.

Pull request #2253.

* Allow configuring default komi at compile-time.

Pull request #2257.

* Update README.md

Update links to leela-zero instead of gcp.

* Make chunkparser more robust.

Some clients are sending corrupted data, make the
chunk parser resilient against it.

* Fix thread count error message.

Pull request #2287.

* Fix small style nits.

* Add support for time controls in loadsgf/printsgf.

Added extra support for "TM" and "OT" and other sgf time control
properties on printsgf and loadsgf GTP commands.

* Added parsing and loading of "TM" and "OT" sgf properties on GTP command
  loadsgf. Only supports "OT" syntax matching output from a printsgf GTP
  command.
* Change SGFTree to have a shared_ptr for a time control.
* Added saving and loading of "BL", "WL", "OB" and "OW" sgf properties on
  GTP commands printsgf and loadsgf.
* Change to make TimeControl::make_from_text_sgf() a time control factory
  and other minor tidying.

Pull request #2172.

* Fix inconsistent default timecontrol.

As noted in pull request #2172, the default
constructor set byo yomi stones but no time or
periods.

* Error out if weights are for wrong board size.

We currently will either crash or do strange things if we're
fed a weights file that doesn't match the board size we're compiled
for.

See issue #2289.

* Ignore passing moves unless they make sense.

Only pass when winning or low on legal moves.
Disabled in self-play.

Fixes issue #2273.
Based on pull request #2277.

Pull request #2301.

* Always allow passing when low on moves.

As pointed out by @gjm11 in #2277, when there are few legal moves we might
want to allow passing even if this loses on the board count. The
alternative might be to self-destruct large groups and carry the game
on endlessly even if the policy wouldn't want to.

No difference in "dumbpass" mode.

* Report root visits in gomill-explain_last_move.

See issue #2280.

Pull request #2302.

* Choose move based on normal distribution LCB.

* Calculate node variance.
* Use normal distribution LCB to choose the played move.
* Cached student-t.
* Sort lz-analyze output according to LCB.
* Don't choose nodes with very few visits even if LCB is better.

Guard against NN misevaluations when the top move has a lot of visits.
Without this it's possible for a move with a few hundred visits to be
picked over a move with over ten thousand visits.

The problem is that the evaluation distribution isn't really a normal
distribution. Evaluations correlate, and the distribution can change
if a better alternative is found deeper in the tree.

Pull request #2290.
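The selection rule can be sketched in Python as follows. This is illustrative only: the engine caches Student-t quantiles rather than using a fixed t_value, and the field names and the visit-ratio threshold here are made-up assumptions:

```python
import math

def lcb(mean, variance, visits, t_value=2.0):
    """Lower confidence bound of a node's winrate estimate.
    t_value stands in for the cached Student-t quantile."""
    if visits < 2:
        return float("-inf")       # never prefer barely-visited nodes
    stderr = math.sqrt(variance / visits)
    return mean - t_value * stderr

def pick_move(children, min_visit_ratio=0.1):
    """Pick the child with the best LCB, but ignore children with far
    fewer visits than the most-visited child, guarding against
    misevaluated low-visit nodes winning on LCB alone."""
    most_visits = max(c["visits"] for c in children)
    eligible = [c for c in children
                if c["visits"] >= min_visit_ratio * most_visits]
    return max(eligible, key=lambda c: lcb(c["mean"], c["var"], c["visits"]))
```

With this guard, a high-LCB move with only a few hundred visits cannot displace a move with over ten thousand visits.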

* Mixed precision training support.

* Add mixed precision training support.
* Do not use loss scale if training with fp32
* Fix potential reg_term overflow of large networks.

Pull request #2191.
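The loss-scaling pattern behind the first two bullets can be sketched as follows (illustrative Python; `grad_fn` stands in for the framework's backward pass, and the default scale value is an assumption, not the training pipeline's actual constant):

```python
def backward_with_loss_scale(loss, grad_fn, use_fp16, loss_scale=128.0):
    """Scale the loss before differentiating so small fp16 gradients
    don't underflow, then unscale the resulting gradients.
    With fp32 training no scale is applied (scale = 1.0)."""
    scale = loss_scale if use_fp16 else 1.0
    grads = grad_fn(loss * scale)      # backward pass on the scaled loss
    return [g / scale for g in grads]
```

Because the gradients are divided by the same factor afterwards, the optimizer sees the same update in both precisions; only the intermediate fp16 values are kept out of the underflow range.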

* Update AUTHORS.

* Don't detect precision with Tensor Cores. 

Don't autodetect or default to fp32 when all cards have
Tensor Cores. We will assume fp16 is the fastest.

This avoids problems in tune-only mode which does not
detect the precision to use and would use fp32 on such cards.

Pull request #2312.

* Update README.md.

We have a first implementation of batching now.

* Ignore --batchsize in CPU only compiles.

AutoGTP will always send --batchsize, but CPU only
compiles don't support the option. Ignore the option
in those builds.

The same problem exists with --tune-only, but quitting
immediately happens to be sane behavior so we don't need
to fix that.

Pull request #2313.

* Don't include OpenCL scheduler in CPU build.

It would recursively include OpenCL.h, which breaks the
CPU-only build.

Pull request #2314.

* Bump version numbers.

* Address GitHub security alert.

* Match upstream
Vandertic pushed a commit to CuriosAI/sai that referenced this issue Dec 14, 2019
* Correctly initialize board when reading SGF.

Even though SGF defaults to size 19 boards, we should not try
to set up a board that size if LZ has not been compiled to support
it.

Pull request leela-zero#1964.

* Increase memory limit for 32-bit builds.

Without this, it's empirically not possible to load the current 256x40
networks on a 32-bit machine.

* Never select a CPU during OpenCL autodetection.

If we are trying to auto-select the best device for OpenCL, never select
a CPU. This will cause the engine to refuse to run when people are
trying to run the OpenCL version without a GPU or without GPU drivers,
instead of selecting any slow and suboptimal (and empirically extremely
broken) OpenCL-on-CPU drivers.

Falling back to CPU-only would be another reasonable alternative, but
doesn't provide an alert in case the GPU drivers are missing.

Improves behavior of issue leela-zero#1994.

* Fix tuner for heterogeneous GPUs and auto precision.

Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes leela-zero#1973.

Pull request leela-zero#2004.

* Optimized out and out_in kernels.

Very minor speedup of about 2% with batch size of 1.
With batch size of 5 there is a speedup of about 5% with half precision
and 12% with single precision.

Out transformation memory accesses are almost completely coalesced
with the new kernel.

Pull request leela-zero#2014.

* Update OpenCL C++ headers.

From upstream a807dcf0f8623d40dc5ce9d1eb00ffd0e46150c7.

* CPU-only eval performance optimization.

* CPUPipe : change winograd transformation constants to an equation.

Combined with a series of strength reduction changes, 
improves netbench by about 8%.

* Convert some std::array into individual variables

For some reason this allows gcc to optimize the code better,
improving netbench by 2%.

Pull request leela-zero#2021.

* Convolve in/out performance optimization.

Use hard-coded equations instead of matrix multiplication.

Pull request leela-zero#2023.

* Validation: fix -k option.

Fix Validation -k option by reading its value before the parser is reused.

Pull request leela-zero#2024.

* Add link to Azure free trial instructions.

See pull request leela-zero#2031.

* Cleanup atomics and dead if.

Pull request leela-zero#2034.

* Const in SGFTree.

Pull request leela-zero#2035.

* Make the README more clear.

Simplify instructions, especially related to building and running
when wanting to contribute.

Based on pull request leela-zero#1983.

* Refactor to allow AutoGTP to use Engine.

* Move Engine to Game.h and refactor autogtp to use it too.
* Fix initialization of job engines.

Pull request leela-zero#2029.

* Fix printf call style.

Generally speaking, providing character pointers as the first argument 
directly might cause FSB (Format String Bug).

Pull request leela-zero#2063.

* Update Khronos OpenCL C++ headers.

Update from upstream f0b7045.

Fixes warnings related to CL_TARGET_OPENCL_VERSION.

* Cleanup loop code.

Pull request leela-zero#2033.

* AutoGTP: allow specifying an SGF as initial position.

* Make AutoGTP URL parametric.
* Support for the sgfhash and movescount parameters in get-task.
* Automatic downloading of sgf and training files.
* Fix Management.cpp for older Qt5 versions.
* Added starting match games from specified initial position
* Tidy ValidationJob::init() like ProductionJob::init()
* Use existing QUuid method of generating random file 
  names instead of QTemporaryFile when fetching game data.

Moreover, we do not load training data in LeelaZ since it is not needed to start from
an arbitrary position.

Pull request leela-zero#2052.

* Support separate options for white in match games.

* Add optional separate options for white in match game.
* Fixed loading of saved match order with optionsSecond.

Pull request leela-zero#2078.

* Add O(sqrt(log(n))) scaling to tree search.

Pull request leela-zero#2072.

* Option to get network output without writing to cache. 

Pull request leela-zero#2093.

* Add permission to link with NVIDIA libs. Update year.

See issue leela-zero#2032.

All contributors to the core engine have given their permission to
add an additional permission to link with NVIDIA's CUDA/cuDNN/TensorRT
libraries. This makes it possible to distribute the engine when built to
use those libraries.

Update the copyright notices to 2019.

* Add link to GoReviewPartner.

Pull request leela-zero#2147.

* Reminder to install OpenCL driver if seperate.

Although the OpenCL driver is generally installed as part of the driver
install, mention the requirement explicitly in case it wasn't.

See pull request leela-zero#2138.

* Fixed leelaz_file on Android.

Pull request leela-zero#2135.

* Fix 'catching polymorphic type by value' warning.

Pull request leela-zero#2134.

* Fixed converter script for minigo removing bias.

Fixes leela-zero#2020.

Pull request leela-zero#2133.

* Add zlib to the mac OS X build instructions.

See pull request leela-zero#2122.

* UCTNodePtr rare race condition fix.

Calling get_eval() on zero-visit node will assert-fail.
The original code could assert-fail on b.get_eval() if 'a' and 'b' both
had zero visits but suddenly 'a' gained an additional visit.

Pull request leela-zero#2110.

* Make sure analysis is printed at least once.

Fixes issue leela-zero#2001.

Pull request leela-zero#2114.

* Don't post if not requested.

Follow up fix for pull request leela-zero#2114.

* AutoGTP: Allow specifying initial GTP commands.

* AutoGTP: Allow specifying initial GTP commands.
  Also add support for white taking the first move in handicapped job games.
* AutoGTP: Refactored core loop for match games to avoid code duplication.
* Fixed white using black's match game settings after loading from an SGF by
  moving SGF loading into Game::gameStart() to before sending GTP commands
  (except handicap commands).
* Changed so that when an SGF file is loaded, AutoGTP determines whether
  handicap is in use from the SGF rather than from any starting GTP commands.

Pull request leela-zero#2096.

* Update Eigen to 3.3.7. 

This includes some optimization improvements for newer GCC/Clang that
may be relevant to a lot of our users.

Pull request leela-zero#2151.

* Fix lz-setoption name playouts.

Fixes issue leela-zero#2167.

I could swear I fixed this before. Maybe I forgot to push?

* AutoGTP: More info in SGF comments.

* AutoGTP: Added full engine options and starting GTP commands 
  to SGF comments that are produced.
* Refactored Game::fixSgf().

Pull request leela-zero#2160.

* Truncate and compress minigo weights.

Truncate to 4 precision and compress converted minigo weights.

Pull request leela-zero#2173.

* Add gomill-explain_last_move.

Add gomill-explain_last_move for additional output in ringmaster
competitions.

Pull request leela-zero#2174.

* Add a feature to exclude moves from the search.

* The "avoid" command is now a param for lz-analyze and for
  lz-genmove_analyze.

New syntax is:

  `lz-analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`
  `lz-genmove_analyze ARGS [avoid <color> <coords> <number_of_moves>] [avoid ...]`

The number_of_moves is now always relative to the current move number.

Example:

  `lz-analyze b 200 avoid b q16 1 avoid b q4 1 avoid b d16 1 avoid b d4 1`

* Re-organize the parser for the "analyze" commands.

  * New tag "interval"; old syntax "100" is now short for "interval 100"
  * Tags can be specified in any arbitrary order
  * Moved all of the parsing code for "lz-analyze" and
    "lz-genmove_analyze" into the parse_analyze_tags function
  * parse_analyze_tags uses its return value instead of side effects

* Implement the "allow" tag for lz-analyze.

It works similarly to "avoid".  Adding moves to the "allow" list is the
same as adding all other moves (except pass and resign) to the "avoid" list.

* "Avoid" and "allow" moves can be specified as a comma-separated list.

Example:

  `lz-analyze b 100 avoid w q4,q16,d4,d16 2 avoid b pass 50`

Pull request leela-zero#1949.

* Removed --cpu-only option from USE_CPU_ONLY build. 

Generalized the output displayed in cases that may refer to a CPU
instead of, or as well as, a GPU.

Pull request leela-zero#2161.

* Tensor Core support with PTX inline assembly.

* Tensor core support for half precision
* hgemm : Added m16n16k16/m32n8k16/m8n32k16 tuning

The tuner will test which multiplication shape is fastest.
MDIMA represents the M dimension, NDIMB represents the N dimension.

* tensorcore : Test m16n16k16 type only for checking tensorcore availability

It seems that there are cases where only m16n16k16 is supported.
If other formats are not available, they will be auto-disabled during tuning.

Pull request leela-zero#2049.

* Update TODO list.

We support avoid tags now. Clarify batching work needs
changes in the search.

* Remove an unnecessary std::move().

Which inhibits RVO. See e.g. https://stackoverflow.com/a/19272035

* Add contributor (and maintainer) guidelines. 

* Add contributor (and maintainer) guidelines.

Spell out the existing code style, C++ usage, git workflow,
commit message requirements, and give guidelines regarding reviewing,
merging and adding configuration options and GTP extensions.

Pull request leela-zero#2186.

* Add several simple GTP commands.

Added several simple GTP commands useful for building interfaces to LZ.

Added the following GTP commands.

    last_move
    move_history

The output of these commands is in line with that of the corresponding
commands in GNU Go when such commands existed.

Pull request leela-zero#2170.

* Minor style fixups.

Minor fixups for pull request leela-zero#2170.

* Remark about move assignment in style guideline.

Emphasize use of emplace_back and move semantics.

* Add lz-analyze minmoves tag.

Add an lz-analyze tag to suggest the minimum number of moves the
engine should post info about (rather than only those it considers
interesting, i.e. the ones with at least a visit).

This allows some very flexible constructs:

Getting a heatmap:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 361

Forcing a move among the top policy moves only:

    lz-setoption name visits value 1
    lz-analyze interval 1 minmoves 2
    (store those moves, e.g. A1, B1)
    lz-setoption name visits value 0
    lz-genmove_analyze b interval 1 allow b A1 1 allow b B1 1

* Fix style, extra spaces in PV output.

Adding the minmoves tag exposes a small bug in the PV
output formatting. Avoid extra blank spaces.

Small style fixups.

* Rework test regex for MSVC limits.

Seems like the previous test regex is causing MSVC's regex engine to run
out of stack space.

* .gitignore: Add build.

leela-zero's default build directory is `build`.

It is very annoying, when using leela as a git submodule, that
the repository shows as dirty whenever it builds.

Pull request leela-zero#2199.

* Batched neural net evaluations

Group evaluations and run them in parallel. Roughly a 50% speedup on my
setup, but there are a couple of points that are debatable.

- Thread / batch sizing heuristics: this PR changes how the default
  threads / default batch sizes are picked. See Leela.cpp.
- Batch-forming heuristic: see OpenCLScheduler.cpp. The heuristic exists
  so that we can wait for the rest of the engine to create more NN
  evaluations, letting us run larger batches. We can't wait indefinitely
  since there are cases where we enter 'serial' paths. Since heuristics
  are heuristics, these might need testing on a larger variety of systems.

Verified that winrate improves when running default vs. default with the
command line `./leelaz -w (weight file)` at time parity.

Pull request leela-zero#2188.

* Autogtp: Tune for batchsize 1

Self-play games specify `-t 1` for playing, which implies a batch size
of 1, but tuning was done with default settings since the number of
threads was not specified.

Pull request leela-zero#2206.

* Tweak conversion script for ELF v2.

Small tweak to conversion script for ELF v2 weights.

Pull request leela-zero#2213.

* Update README.md

Update links to leela-zero instead of gcp.

* Update README.md

Appveyor link still needs to be 'gcp'.

* Update README.md

Update badge and link to the new AppVeyor project under leela-zero instead of gcp ownership.

* Update README.md.

Update links to leela-zero instead of gcp.
Update badge and link to the new AppVeyor project
under leela-zero instead of gcp ownership.

* Remove unused lambda capture.

Pull request leela-zero#2231.

* README.md: link to mentioned pull requests.

Pull request leela-zero#2229.

* Minor cleanup involving Network::get_output. 

Pull request leela-zero#2228.

* Set up default batch size and threads.

Fixes issue leela-zero#2214.

Pull request leela-zero#2256.

* Shuffle tuner parameters to find good parameters quicker.

Parameters are searched in a linear fashion currently. By shuffling them,
we will find a good instance more quickly.

Also, shuffling could help reduce possible bias due to grouped, similar
parameters that affect the environment (e.g. cache, branch predictor, ...),
leading to more accurate/fair results.

Additionally, this is a preparation for exiting the tuner during the search,
which becomes a possible option.

Pull request leela-zero#2225.

* Refactor tree_stats_helper to lambda.

Pull request leela-zero#2244.

* Enable batching for self-play.

Pull request leela-zero#2253.

* Allow configuring default komi at compile-time.

Pull request leela-zero#2257.

* Update README.md

Update links to leela-zero instead of gcp.

* Make chunkparser more robust.

Some clients are sending corrupted data, make the
chunk parser resilient against it.

* Fix thread count error message.

Pull request leela-zero#2287.

* Fix small style nits.

* Add support for time controls in loadsgf/printsgf.

Added extra support for "TM" and "OT" and other sgf time control
properties on printsgf and loadsgf GTP commands.

* Added parsing and loading of "TM" and "OT" sgf properties on GTP command
  loadsgf. Only supports "OT" syntax matching output from a printsgf GTP
  command.
* Change SGFTree to have a shared_ptr for a time control.
* Added saving and loading of "BL", "WL", "OB" and "OW" sgf properties on
  GTP commands printsgf and loadsgf.
* Change to make TimeControl::make_from_text_sgf() a time control factory
  and other minor tidying.

Pull request leela-zero#2172.

* Fix inconsistent default timecontrol.

As noted in pull request leela-zero#2172, the default
constructor set byo yomi stones but no time or
periods.

* Error out if weights are for wrong board size.

We currently will either crash or do strange things if we're
fed a weights file that doesn't match the board size we're compiled
for.

See issue leela-zero#2289.

* Ignore passing moves unless they make sense.

Only pass when winning or low on legal moves.
Disabled in self-play.

Fixes issue leela-zero#2273.
Based on pull request leela-zero#2277.

Pull request leela-zero#2301.

* Always allow passing when low on moves.

As pointed out by @gjm11 in leela-zero#2277, when there are few legal moves we might
want to allow passing even if this loses on the board count. The
alternative might be to self-destruct large groups and carry the game
on endlessly even if the policy wouldn't want to.

No difference in "dumbpass" mode.

* Report root visits in gomill-explain_last_move.

See issue leela-zero#2280.

Pull request leela-zero#2302.

* Choose move based on normal distribution LCB.

* Calculate node variance.
* Use normal distribution LCB to choose the played move.
* Cached student-t.
* Sort lz-analyze output according to LCB.
* Don't choose nodes with very few visits even if LCB is better.

Guard against NN misevaluations when the top move has a lot of visits.
Without this it's possible for a move with a few hundred visits to be
picked over a move with over ten thousand visits.

The problem is that the evaluation distribution isn't really a normal
distribution. Evaluations correlate, and the distribution can change
if a better alternative is found deeper in the tree.

Pull request leela-zero#2290.

* Mixed precision training support.

* Add mixed precision training support.
* Do not use loss scale if training with fp32
* Fix potential reg_term overflow of large networks.

Pull request leela-zero#2191.

* Update AUTHORS.

* Don't detect precision with Tensor Cores. 

Don't autodetect or default to fp32 when all cards have
Tensor Cores. We will assume fp16 is the fastest.

This avoids problems in tune-only mode which does not
detect the precision to use and would use fp32 on such cards.

Pull request leela-zero#2312.

* Update README.md.

We have a first implementation of batching now.

* Ignore --batchsize in CPU only compiles.

AutoGTP will always send --batchsize, but CPU only
compiles don't support the option. Ignore the option
in those builds.

The same problem exists with --tune-only, but quitting
immediately happens to be sane behavior so we don't need
to fix that.

Pull request leela-zero#2313.

* Don't include OpenCL scheduler in CPU build.

It will recursively include OpenCL.h and that
is bad.

Pull request leela-zero#2314.

* Bump version numbers.

* Address GitHub security alert.

* Match upstream