
KataGo 1.1 released #2431

Open
l1t1 opened this issue Jun 18, 2019 · 144 comments

@l1t1

l1t1 commented Jun 18, 2019

https://github.com/lightvector/KataGo/releases

@l1t1
Author

l1t1 commented Jun 18, 2019

The 20b weight is as strong as ELF v2.

@john45678

Great, thanks KataGo team!

@alreadydone
Contributor

alreadydone commented Jun 18, 2019

Congrats on the great work, and thanks to @lightvector for the release! So we now have an ELF-strength network that plays under any handicap and komi! The training data (once converted) and self-play games could be used to train LZ to play more reasonably at high/low winrates.

The ELFv2 training run used 2000 self-play GPUs (V100s) and took "around 16 days of wall clock time, achieving superhuman performance in 9 days". ELFv2 is the 1290th model out of a total of 1500 models, so it was produced around day 13.76, amounting to 27520 V100-days. The KataGo g104 run used 27 V100s for 19 days, amounting to 513 V100-days, and surpassed ELFv2: a 53.6x reduction in computing resources.
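The arithmetic can be checked directly from the figures quoted here:

```python
# Rough compute comparison, using only the figures quoted above.
elf_days = (1290 / 1500) * 16      # ELFv2 was model 1290 of 1500 in a ~16-day run
elf_v100_days = 2000 * elf_days    # 2000 self-play V100s -> 27520 V100-days
kata_v100_days = 27 * 19           # KataGo g104: 27 V100s for 19 days -> 513

print(elf_v100_days, kata_v100_days, elf_v100_days / kata_v100_days)
# roughly 27520, 513, and a ~53.6x reduction
```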

@Friday9i

@alreadydone Yeah, very impressive, 50x faster to reach the same level with equivalent resources!!!
Will you compile a new version for Windows? For whatever reason, I managed to use your first version for a few days, and since then nothing works anymore; I just get a "connection failed" when KataGo should play in Sabaki :-(

@alreadydone
Contributor

alreadydone commented Jun 19, 2019

I'll compile a new version when I have a chance.

KataGo's 20b256f is the same size as ELFv2, so it should be able to achieve similar inference speed (playouts per second) as LZ-ELFv2 (the architectural differences should have negligible impact on speed). I wonder if that's already the case (for the cuDNN version).

Despite innovations in the network architecture and input features, the resulting net is only on par with or slightly above ELFv2, which makes me wonder whether it's due to:

  1. insufficient self-play/training (the net hasn't reached its ceiling);
  2. the different architecture and input features not helping much (unlikely);
  3. maximizing score, and not just winning probability, hindering performance.

@l1t1
Author

l1t1 commented Jun 19, 2019

I tried to compile with MinGW without CUDA, but it failed:

set path=D:\mingw64\bin;D:\cmake\bin
set include=D:\mingw64\include
set lib=D:\mingw64\lib
D:\KataGo-1.1\cpp>d:\cmake\bin\cmake -G"MinGW Makefiles" .  -DBUILD_MCTS=1 -DUSE_CUDA_BACKEND=1
-DBUILD_MCTS=1 is set, building 'main' executable for mcts-backed GTP engine and other tools
-- The CUDA compiler identification is unknown
CMake Error at CMakeLists.txt:32 (enable_language):
  No CMAKE_CUDA_COMPILER could be found.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.


-- Configuring incomplete, errors occurred!
See also "D:/KataGo-1.1/cpp/CMakeFiles/CMakeOutput.log".
See also "D:/KataGo-1.1/cpp/CMakeFiles/CMakeError.log".

@lightvector

lightvector commented Jun 19, 2019

@alreadydone - It's due to number 1. If I trained it longer, I'm pretty sure it would continue to improve more. Here's a snippet of BayesElo-style ratings computed from some match games I have ongoing between nets from this run and a sampling of Leela Zero nets (1600 visits, LCB). Games played between a wide variety of versions, and between LZ and KataGo as well, to limit nontransitivity and Elo inflation effects.

Bot                                   Elo
lz220                               : 981.5  95cf ( 943.5,1020.5)  (293.0 win, 137.0 loss) 
lz215                               : 974.2  95cf ( 938.2,1011.2)  (333.0 win, 147.0 loss) 
lz200                               : 932.1  95cf ( 900.1, 965.1)  (381.0 win, 204.0 loss) 
g104-b20c256-s447913472-d241840887  : 927.2  95cf ( 888.2, 966.2)  (249.0 win, 149.0 loss) 
lz210                               : 926.9  95cf ( 893.9, 960.9)  (348.0 win, 196.0 loss) 
lz190                               : 910.5  95cf ( 880.5, 941.5)  (498.0 win, 227.0 loss) 
g104-b20c256-s422435072-d233008761  : 859.6  95cf ( 821.6, 897.6)  (239.0 win, 179.0 loss) 
lz195                               : 856.5  95cf ( 826.5, 887.5)  (370.0 win, 269.0 loss) 
g104-b20c256-s396758784-d224242318  : 846.4  95cf ( 809.4, 883.4)  (275.0 win, 191.0 loss) 
elfv1                               : 840.9  95cf ( 811.9, 869.9)  (444.0 win, 288.0 loss) 
elfv2                               : 837.2  95cf ( 808.2, 867.2)  (432.0 win, 282.0 loss) 
lz185                               : 814.1  95cf ( 785.1, 844.1)  (404.0 win, 289.0 loss) 
lz175                               : 808.1  95cf ( 779.1, 838.1)  (431.0 win, 278.0 loss) 
g104-b20c256-s369496064-d214946955  : 798.9  95cf ( 763.9, 833.9)  (277.0 win, 240.0 loss) 
lz180                               : 791.3  95cf ( 763.3, 820.3)  (456.0 win, 310.0 loss) 
g104-b20c256-s330587392-d201656907  : 781.7  95cf ( 746.7, 816.7)  (270.0 win, 244.0 loss) 
g104-b20c256-s302841088-d192258657  : 745.4  95cf ( 710.4, 780.4)  (257.0 win, 267.0 loss) 
elfv0                               : 738.6  95cf ( 709.6, 767.6)  (395.0 win, 333.0 loss) 
g104-b20c256-s226413568-d166126657  : 699.4  95cf ( 663.4, 734.4)  (250.0 win, 281.0 loss) 
g104-b20c256-s262831872-d178598051  : 684.2  95cf ( 648.2, 719.2)  (250.0 win, 287.0 loss) 
g104-b20c256-s183461888-d151583586  : 614.5  95cf ( 578.5, 650.5)  (221.0 win, 306.0 loss) 
lz157                               : 574.8  95cf ( 546.8, 602.8)  (367.0 win, 429.0 loss) 

It's a bit noisy (note how wide the confidence intervals are with only this few games to test), but no obvious signs of having reached a limit. But I had to stop the run at some point, because GPU power is quite expensive and I want to move on to testing ideas that could potentially make a third run even more cost-efficient and a better use of time. :)

Number 3 is certainly possible too, but at least at weaker strengths, the ablation run I did back in the paper suggested score maximization helped rather than hurt, and that could just as easily be true at strong levels too. It's on the todo list to test a few more things relating to score maximization, but until then, given that it seemingly hasn't actually capped in strength and how well the learning is working overall, I'm not going to worry much about it.
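For intuition, the rating gaps in the table translate to expected winrates via the standard logistic Elo model; a quick sketch:

```python
def elo_win_prob(elo_a, elo_b):
    """Expected winrate of A vs B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# e.g. lz220 (981.5) vs lz157 (574.8): about 0.91 expected winrate
# for the ~407-Elo gap.
print(elo_win_prob(981.5, 574.8))
```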

@l1t1 - I'm confused - you don't have CUDA, but you're trying to compile it with the CUDA backend? (-DUSE_CUDA_BACKEND=1).
Also in the past, when I've googled around, things I've found online have generally suggested that CUDA and MinGW don't play nice together, sadly. On the other hand, OpenCL and MinGW should work together...

@tychota

tychota commented Jun 19, 2019

@alreadydone I have already started some work on building an actual version and rebasing your branch to master.

See it here: https://github.com/tychota/KataGo/tree/vs

  • find_library was mandatory (otherwise it fails earlier when linking the library) tychota/KataGo@f45a144#diff-6725b893dfc969abac4f4ee39a3a317fR210
  • the "git version" code that was commented out earlier seems to work now
  • I updated the structure to use a vendors directory; I will add a NuGet package file tomorrow.
  • some deps were updated (Boost to 1.70)

I'm almost done but got a rather bizarre error: https://gist.github.com/tychota/d3ff1d13ece5ee3a5aec09c2d78487a6; maybe you can have a look?

@l1t1
Author

l1t1 commented Jun 19, 2019

I installed the NVIDIA CUDA toolkit 10.1 and still got an error:

D:\KataGo-1.1\cpp>d:\cmake\bin\cmake -G"MinGW Makefiles" .  -DBUILD_MCTS=1 -DUSE_CUDA_BACKEND=1
-- The C compiler identification is GNU 8.1.0
-- The CXX compiler identification is GNU 8.1.0
-- Check for working C compiler: D:/mingw64/bin/gcc.exe
-- Check for working C compiler: D:/mingw64/bin/gcc.exe -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: D:/mingw64/bin/g++.exe
-- Check for working CXX compiler: D:/mingw64/bin/g++.exe -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-DBUILD_MCTS=1 is set, building 'main' executable for mcts-backed GTP engine and other tools
-- The CUDA compiler identification is unknown
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe -- broken
CMake Error at D:/cmake/share/cmake-3.14/Modules/CMakeTestCUDACompiler.cmake:46 (message):
  The CUDA compiler

    "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: D:/KataGo-1.1/cpp/CMakeFiles/CMakeTmp

    Run Build Command(s):D:/mingw64/bin/mingw32-make.exe cmTC_4f895/fast
    D:/mingw64/bin/mingw32-make.exe -f CMakeFiles\cmTC_4f895.dir\build.make CMakeFiles/cmTC_4f895.dir/build
    mingw32-make.exe[1]: Entering directory 'D:/KataGo-1.1/cpp/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_4f895.dir/main.cu.obj
    C:\PROGRA~1\NVIDIA~2\CUDA\v10.1\bin\nvcc.exe     -x cu -c D:\KataGo-1.1\cpp\CMakeFiles\CMakeTmp\main.cu -o CMakeFiles\cmTC_4f895.dir\main.cu.obj
    nvcc fatal   : Cannot find compiler 'cl.exe' in PATH
    mingw32-make.exe[1]: *** [CMakeFiles\cmTC_4f895.dir\build.make:65: CMakeFiles/cmTC_4f895.dir/main.cu.obj] Error 1
    mingw32-make.exe[1]: Leaving directory 'D:/KataGo-1.1/cpp/CMakeFiles/CMakeTmp'
    mingw32-make.exe: *** [Makefile:120: cmTC_4f895/fast] Error 2




  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (enable_language)


-- Configuring incomplete, errors occurred!
See also "D:/KataGo-1.1/cpp/CMakeFiles/CMakeOutput.log".
See also "D:/KataGo-1.1/cpp/CMakeFiles/CMakeError.log".

@lightvector

@l1t1 - yes, as I mentioned in my previous post already, MinGW and CUDA are probably just not compatible. You will have to use Microsoft Visual Studio (which I've never used), or wait for @tychota and @alreadydone to get it working and download their version of the binary, or wait for me to optimize the OpenCL version over the next several weeks.

@lightvector

@tychota @alreadydone - Thanks for all the work in making a windows version! Poke me if there's anything I can do to help and/or open an issue on the KataGo repo as needed if there are any more simple and easy things I can do to increase compatibility with MSVC, as a follow up to a lot of the 'pedantic' changes and various other things that I handled earlier.

@alreadydone
Contributor

alreadydone commented Jun 19, 2019

Compilation was successful: main.zip

https://github.com/alreadydone/KataGo/tree/vs

@tychota

  • I didn't add a find_library
  • git version may be working; I replaced some previously commented-out lines with uncommented new ones, and it compiles, but maybe some lines remain commented out.
  • it would be nice to have a NuGet package file! Not sure what the vendors directory is. I didn't change any directory in my latest merge commit, but you certainly need to change the directories to where the libraries are located on your machine.
  • did you just change the directory to where Boost 1.70 is installed, or did you require Boost 1.70 in your NuGet package file?
  • I didn't encounter the bizarre error, only some annoying double-to-float conversions which are regarded as errors (not warnings).

@Friday9i

Friday9i commented Jun 19, 2019

@alreadydone Just tested; I still get a problem somewhere in my config :-(.
In Sabaki, I first get the usual messages (note: I replaced = with "equal" in the text to avoid a formatting problem on GitHub):
KataGo19x19> name
"equal" Leela Zero
○ KataGo19x19> version
"equal" 0.16
○ KataGo19x19> protocol_version
"equal" 2
○ KataGo19x19> list_commands
"equal" protocol_version
name
version
...
Then:
KataGo19x19> time_settings 0 1 0 (note: it also fails with other parameters such as 0 2 1)
connection failed
Then it continues despite the connection problem:
KataGo19x19>
KataGo v1.1
Loaded model D:\LZ16\KataGo\15b.txt
GTP ready, beginning main protocol loop
○ KataGo19x19> boardsize 19
"equal"
● KataGo19x19> clear_board
"equal"
○ KataGo19x19> clear_board
"equal"
○ KataGo19x19> genmove B
connection failed
Then it stops.

Any idea why it fails like that? It worked two months ago for a few days; then, without any obvious change (updated Nvidia drivers, maybe?), it started failing like this and I have not managed to make it work again :-(.
Maybe you could describe how you use it and the parameters you use?

@tychota

tychota commented Jun 19, 2019

@alreadydone

would be nice to have a nuget package file! not sure what vendors directory is. I didn't change any directory in my latest merge commit, but you certainly need to change directories to where the libraries are located on your machine.)

Just a `vendors` directory next to the code to hold the NuGet packages:

[screenshot: vendors directory tree]

so it makes the import less specific to your machine.

@l1t1
Author

l1t1 commented Jun 19, 2019

VC 2017 also has a problem @alreadydone

D:\Visual Studio 2017 Enterprise\VC\Auxiliary\Build>call vcvarsall.bat x86_amd64
**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.0.26228.4
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x86_x64'
cd /d D:\KataGo-1.1\cpp
d:\cmake\bin\cmake . -DBUILD_MCTS=1 -DUSE_CUDA_BACKEND=1
-DBUILD_MCTS=1 is set, building 'main' executable for mcts-backed GTP engine and other tools
-- Could NOT find Git (missing: GIT_EXECUTABLE)
CMake Error at D:/cmake/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find ZLIB (missing: ZLIB_LIBRARY ZLIB_INCLUDE_DIR)
Call Stack (most recent call first):
  D:/cmake/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  D:/cmake/share/cmake-3.14/Modules/FindZLIB.cmake:115 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:194 (find_package)


CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDNN_LIBRARY
    linked by target "main" in directory D:/KataGo-1.1/cpp

-- Configuring incomplete, errors occurred!
See also "D:/KataGo-1.1/cpp/CMakeFiles/CMakeOutput.log".

@alreadydone
Contributor

@l1t1 I don't know how you got CMake to find the packages on Windows. I changed CMakeLists.txt a lot, hard-coding the directories to make it work.

@tychota I think some packages are pretty big, so I avoided copying them to save disk space; I don't think you should upload the packages to GitHub or manage them in git...

@Friday9i Do you see anything in gtp.log?

@l1t1
Author

l1t1 commented Jun 19, 2019

I cloned your source code and it failed too @alreadydone

D:\gitw\bin>git clone https://github.com/alreadydone/KataGo.git
Cloning into 'KataGo'...
remote: Enumerating objects: 250, done.
remote: Counting objects: 100% (250/250), done.
remote: Compressing objects: 100% (110/110), done.
remote: Total 8862 (delta 150), reused 190 (delta 140), pack-reused 8612
Receiving objects: 100% (8862/8862), 33.09 MiB | 91.00 KiB/s, done.
Resolving deltas: 100% (6945/6945), done.

cd /d D:\gitw\bin\KataGo\cpp
d:\cmake\bin\cmake . -DBUILD_MCTS=1 -DUSE_CUDA_BACKEND=1

-- Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1 (found version "10.1")
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDNN_LIBRARY
    linked by target "main" in directory D:/gitw/bin/KataGo/cpp

-- Configuring incomplete, errors occurred!
See also "D:/gitw/bin/KataGo/cpp/CMakeFiles/CMakeOutput.log".

@mc-mong

mc-mong commented Jun 19, 2019

> (quoting @alreadydone's build comment above in full)

lz-analyze B 100 gives an error.

With lz-analyze it just keeps loading and I cannot play the game.

Sabaki v0.43.3 / windows 10 64bit

@Friday9i

Friday9i commented Jun 19, 2019

@alreadydone Here is the gtp.log (cut short for the last tests, and renamed). I'm also using Sabaki v0.43.3 and the latest Windows 10 (64-bit):
gtp2.log
(in this test I launched LZ223 as Black and the latest KataGo as White, but I get comparable "connection failed" errors when both colors are KataGo, while it works for LZ223 vs LZ223)

Note: since the previous tests, I upgraded my GPU to an RTX 2080 (from a GTX 1080). Could that be one reason for the difficulties (e.g. linked to FP16 and/or Tensor cores)?

And a screen copy of Sabaki:
[screenshot of Sabaki showing the error]
(you can see the "connection failed" after time_settings, and the parameters I used for the KataGo engine; I tried with and without "override-version 0.16", and also with the default gtp_example.cfg: same result)

@l1t1
Author

l1t1 commented Jun 19, 2019

@alreadydone
I ran your main.exe and it asks for zip.dll; I don't know where to get it. If there are more DLLs that aren't in the CUDA and cuDNN kits, could you also post them?

@22nsuk

22nsuk commented Jun 19, 2019

A GeForce RTX graphics card may be the cause of the "connection failed" error.
Is anyone else running on an RTX graphics card?

@tychota

tychota commented Jun 19, 2019

@alreadydone

@tychota I think some packages are pretty big so I avoided copying to save disk space; I don't think you should upload the packages to github or manage them in git...

The vendors dir is gitignored. But it is a good way to make the paths environment-nonspecific.

Compare:

  • set(BOOST_INCLUDEDIR "C:/Users/XJY/.nuget/packages/boost/1.68.0/lib/native/include/boost") (your version, environment-specific, with the user hardcoded)
  • set(BOOST_INCLUDEDIR "./vendors/boost.1.70.0.0/lib/native/include/boost") (my version, nonspecific)

@lightvector

@Friday9i @mc-mong - Is "connection failed" a thing that Sabaki prints, or is it being produced somehow by the KataGo executable from some CUDA internals or something? What happens when you run the executable directly on the command line and type in the GTP commands manually?

@Friday9i

@lightvector I don't know if it comes from Sabaki or KataGo (and I was wondering). I'll try the command line late tonight European time (but what should be the command? Do you have an example of what I should type?)
Thanks a lot

@kaorahi
Contributor

kaorahi commented Jun 19, 2019

Thanks for the release. Now LizGoban supports KataGo v1.1 and its score/ownership estimations.
[screenshot: LizGoban with KataGo]

@lightvector

lightvector commented Jun 19, 2019

@Friday9i - Maybe type the commands like the ones that you see Sabaki sending it? "komi 7.5" and "boardsize 19" and "clear_board" and "genmove B" and so on. The GTP command "list_commands" will show you all the commands a bot supports.

See here for the GTP protocol, where it describes what the standard GTP commands are all supposed to do. It really is just text; you can type and communicate with the bot directly.

Edit: @kaorahi - Sweet!
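Since GTP is plain text over stdin/stdout, it can also be driven from a script; a minimal sketch of parsing the replies (the commented-out engine invocation is illustrative, not KataGo's exact flags):

```python
import subprocess

def parse_gtp_reply(raw: str):
    """Split a raw GTP reply into (success, payload).

    Per the GTP spec, replies start with '=' (success) or '?' (failure),
    optionally followed by an id, then the payload, ending with a blank line.
    """
    line = raw.strip()
    if not line:
        return False, ""
    status, _, payload = line.partition(" ")
    return status.startswith("="), payload.strip()

# Illustrative use against an engine (executable name and flags are assumptions):
# proc = subprocess.Popen(["katago", "gtp", "-model", "model.txt", "-config", "gtp.cfg"],
#                         stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
# proc.stdin.write("protocol_version\n"); proc.stdin.flush()
```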

@Friday9i

ok I'll test extensively tonight, thx a lot

@mc-mong

mc-mong commented Jun 19, 2019

@lightvector -
Sabaki v0.43.3 / windows 10 64bit

Pressing the F4 key: it keeps "loading" and then nothing happens.

lz-analyze works with B 100, but nothing appears on the board anymore.

Sabaki option: gtp -model 6b.txt -config configs/gtp_example.cfg

@Friday9i

A few tests with the command line "main.exe demoplay -log-file test -model 15b.txt -config configs/gtp19.cfg"; here are the results:
1st test:
2019-06-19 19:59:04+0200: Engine starting...
2019-06-19 19:59:04+0200: nnRandSeed0 = 4707124547028803790
2019-06-19 19:59:04+0200: After dedups: nnModelFile0 = 15b.txt useFP16 false cudaUseNHWC false
2019-06-19 19:59:15+0200: Loaded neural net
2nd test:
2019-06-19 20:00:03+0200: Engine starting...
2019-06-19 20:00:03+0200: nnRandSeed0 = 16270632723077444971
2019-06-19 20:00:03+0200: After dedups: nnModelFile0 = 6x96.txt useFP16 false cudaUseNHWC false
2019-06-19 20:00:04+0200: Loaded neural net
Apparently, it stops after loading the net :-(

@alreadydone
Contributor

@l1t1 If you want to compile:

  • You need to change
    find_library(CUDNN_LIBRARY cudnn.lib E:/CUDA/lib/x64)
    in CMakeLists.txt to point at where your CUDA is installed (after installing cuDNN there).
  • Also make sure the other packages are available, and change the paths in CMakeLists.txt to the appropriate directories.

If you don't want to compile, all necessary DLLs are included in lightvector/KataGo#2.

@Friday9i What commands did you type and what responses did you receive? Why do you use demoplay instead of gtp?

I tried Sabaki, and it seems that if you play against KataGo and also enable analysis, then after your move KataGo won't respond; you have to Suspend and Generate Move, and a "connection failed" message is displayed. I also noticed that the "v1" in KataGo's starting message (v1.1) seems to be recognized by Sabaki as something. And time_settings didn't get a = response (though on the command line it does).

BTW KataGo now supports rectangular boards, but it seems UI support is lacking. Sabaki can show rectangular boards but won't send appropriate GTP command to KataGo; I requested the feature at SabakiHQ/Sabaki#75.

@tychota So whoever wants to compile has to get the packages and put them into the vendors folder manually. I'm lucky that I already got the packages while compiling LZ earlier, so I just need to find the directories where they are located. I don't know how to get the packages installed without NuGet; a CMake project in VS seems unable to fetch NuGet packages automatically, so you likely need to upgrade it into a VS project.

@lightvector

@hzyhhzy - Take a close look at your stdout.txt. If you read through the exception trace, you will find a lot of "OutOfRangeError". If we look here:
https://www.tensorflow.org/guide/datasets

This suggests that this is a completely normal exception that Tensorflow raises to indicate when the current sequences of batches of training data are done. Presumably tf.Estimator catches this as a signal to continue to do the next thing. So keep digging down through the Python trace as it reports what exceptions were thrown in the code that ran after catching that OutOfRangeError, which is caught and re-thrown by Tensorflow through several layers...

Until we come to something that is clearly anomalous: Python's os.rename failed! Let's take a look at the docs:
https://docs.python.org/3/library/os.html

So, seems like the problem is that it behaves inconsistently between Linux and Windows, which I had not known. On Linux, it replaces existing destination files, on Windows, it fails. Indeed, trainhistory.json is such a file that the train.py script writes out to a tmp file and then attempts to replace the last trainhistory.json to update it. (I hope that was a useful case study of how one might debug the error from the message. This is basically how I worked out this error from your stdout.txt)

Based on the Python docs link above, it looks like the better function that should behave more portably and do the desired thing is os.replace. Try changing all instances of os.rename in train.py to os.replace. I'll also fix this in KataGo for a future release.
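The portable fix can be sketched like this (the function and file names here are illustrative, not train.py's actual code):

```python
import json
import os
import tempfile

def write_json_replacing(payload, dest):
    """Write payload to dest via a temp file, overwriting any existing file.

    os.replace overwrites the destination on both Linux and Windows,
    whereas os.rename raises FileExistsError on Windows if dest exists.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, dest)  # atomic on POSIX; overwrites on Windows too
```

Swapping os.rename for os.replace in train.py is this pattern applied to trainhistory.json.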

(thanks for bugfinding!)

@hzyhhzy

hzyhhzy commented Jun 28, 2019

Thanks a lot! I replaced three instances of "os.rename" with "os.replace" in train.py (except the one on line 590), replaced line 590 with shutil.move(savepathtmp, savepath), and added "os.makedirs(savepathtmp)" before line 566; then it works on Windows.

@hzyhhzy

hzyhhzy commented Jun 28, 2019

Can train.py only work on one GPU?

@lightvector

Yes, I never implemented multi-GPU training.

@alreadydone
Contributor

alreadydone commented Jun 29, 2019

A re-analysis of the ear-reddening move with KataGo: https://tieba.baidu.com/p/6178989992
Conclusion: Black 63.3% before, but White 63.4% after the move, under 0 komi. Therefore, a mistake.
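The before/after numbers flip perspective between Black and White; as a trivial sketch:

```python
def black_winrate(white_winrate):
    # The two players' winrates sum to 1, so a reported White winrate of
    # 0.634 means Black dropped from 0.633 before the move to 1 - 0.634 = 0.366.
    return 1.0 - white_winrate

print(black_winrate(0.634))  # about 0.366 after the move, vs 0.633 before
```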

BTW, a recent series: "Where exactly were humans wrong?"
https://share.yikeweiqi.com/gonews/detail/id/25937
https://share.yikeweiqi.com/gonews/detail/id/25954
https://share.yikeweiqi.com/gonews/detail/id/25966

@PublicStarMoon

PublicStarMoon commented Jun 29, 2019

Update: I got it wrong, sorry.

@alreadydone
This is what I posted on Tieba today.

Probably you got it wrong:
Black 63.3% before move #127
and White 63.4% after move #127

Apparently move #127 is a bad move which makes Black's winrate decrease from 63.3% to 1 - 63.4% = 36.6%.

@Friday9i

It seems that CUDA 10.1 needs to be installed on Windows:
After helping someone on Discord to make KataGo work on Windows 10, it seems that CUDA 10.1 is needed. He tried with the latest files (including the fresh 400 MB cudnn64_7.dll) and it didn't work. After installing CUDA 10.1 for Windows, it worked!

@diadorak

@PublicStarMoon
I have Black at 66% before 127 and 56% after 127. So the famous move is actually a bad move that costs Black 10%. The strongest move suggested by KataGo is C18, which is the same as Golaxy (I heard).

It seems KataGo thinks White's move 126 is extremely bad: White was 46% before the move and 34% after it. Maybe Inseki's ear reddened not because 127 was that good, but because he realized how bad move 126 was, haha.

@PublicStarMoon

@diadorak

You are right; I made a mistake operating Lizzie with KataGo, and I have fixed it in the original post.


@hzyhhzy

hzyhhzy commented Jul 1, 2019

This often happens during my self-play, especially after it automatically changes models. It seems that data writing gets almost completely stuck. How should I deal with it?

2019-07-01 14:19:59+0800: Started 4500 games
2019-07-01 14:19:59+0800: WARNING: Struggling to keep up writing data, 3459 games enqueued out of 6000 max
2019-07-01 14:20:00+0800: WARNING: Struggling to keep up writing data, 3464 games enqueued out of 6000 max
.
.
.
2019-07-01 14:21:28+0800: WARNING: Struggling to keep up writing data, 3815 games enqueued out of 6000 max
2019-07-01 14:21:29+0800: WARNING: Struggling to keep up writing data, 3820 games enqueued out of 6000 max
2019-07-01 14:21:30+0800: WARNING: Struggling to keep up writing data, 3825 games enqueued out of 6000 max
2019-07-01 14:21:32+0800: WARNING: Struggling to keep up writing data, 3834 games enqueued out of 6000 max
2019-07-01 14:21:34+0800: WARNING: Struggling to keep up writing data, 3844 games enqueued out of 6000 max
2019-07-01 14:21:34+0800: WARNING: Struggling to keep up writing data, 3847 games enqueued out of 6000 max
2019-07-01 14:21:37+0800: WARNING: Struggling to keep up writing data, 3857 games enqueued out of 6000 max
2019-07-01 14:21:39+0800: WARNING: Struggling to keep up writing data, 3865 games enqueued out of 6000 max
2019-07-01 14:21:42+0800: WARNING: Struggling to keep up writing data, 3877 games enqueued out of 6000 max
2019-07-01 14:21:44+0800: Started 5000 games

@hzyhhzy

hzyhhzy commented Jul 1, 2019

Does train.py automatically adjust the learning rate, or should I use "-lr-scale" to adjust it?

@Friday9i

Friday9i commented Jul 1, 2019

An updated note regarding KataGo's installation and its prerequisites on Windows.
It seems that several things are needed:

  • From my understanding, Windows 10
  • CUDA 10 at least; possibly 10.1 is needed
  • A recent version of cudnn64_7.dll (~400 MB; the ~250 MB one provided above doesn't work on my computer and several others)
  • Possibly Visual Studio (a message when installing CUDA says it may not be fully functional without Visual Studio)
  • and a recent Nvidia GPU of course (I guess at least a GTX or an RTX card)

Hope this helps!

PS: KataGo is a bit painful to install, but it is such a nice engine that it justifies some effort :-).
And so strong too: yesterday evening I tested KataGo as White vs Crazy Stone Deep Learning at its highest level (7 dan) with a 4-stone handicap (!) on 19x19 and roughly equal time (~20 s/move): KataGo won convincingly! I used Lizzie and just avoided early 3-3 invasions (by selecting alternative moves).

@ozymandias8

I don't know that it requires Visual Studio. I cannot remember, but I think I just installed the latest VC redist package. I am using alreadydone's build on an NVIDIA RTX GPU.

@Friday9i

Friday9i commented Jul 1, 2019

CUDA seems to be needed, and @kira got a message when installing CUDA saying that Visual Studio may be needed. So it's not 100% certain, but VS may be needed (in line with my wording above).

@ozymandias8

That is a little strange. I run KataGo on a Windows 10 Home PC with a brand new installation on new hardware. I didn't run into any of these issues.

@Friday9i

Friday9i commented Jul 1, 2019

@ozymandias8 Did you have CUDA 10, cuDNN and Visual Studio on your machine before installing KataGo?

@ozymandias8

Definitely not Visual Studio. Although, like I said, I think I needed to install the VC redistributable. The Steam client installed some items so that I could play Sekiro; perhaps there are some CUDA-related files in there.

@Friday9i

Friday9i commented Jul 1, 2019

Oh yes indeed, it's probably the redistributable package of VS that is needed, not VS itself.

@lightvector

@hzyhhzy - The learning rate is constant (except it is lower for the first 5 million samples of training). You can use -lr-scale to adjust it if you need to. If you are doing self-play, the default learning rate is pretty good. In the official released run, KataGo kept -lr-scale 1.0 from the start all the way to around LZ180-ish strength! I lowered it for a couple of days of training right at the end to finetune the final strength, but I probably would have kept it high if I had intended to keep the run going rather than planning to stop at that point.
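As a sketch of the schedule described (the base rate, warmup factor, and function name are hypothetical placeholders, not KataGo's actual values):

```python
def effective_lr(samples_seen, base_lr=1e-4, lr_scale=1.0,
                 warmup_samples=5_000_000, warmup_factor=0.1):
    # Constant learning rate, multiplied by the user-supplied -lr-scale,
    # with a reduced rate for the first ~5 million training samples.
    # base_lr and warmup_factor here are illustrative placeholders.
    factor = warmup_factor if samples_seen < warmup_samples else 1.0
    return base_lr * lr_scale * factor
```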

The "Struggling to keep up writing data" warning is one I have never seen myself; I put it in as a sanity check. Glad to know the sanity check is actually finding something. When you get that warning, if you look at the training data output, has the data writing gotten completely stuck and/or died, actually outputting nothing? Or is it still producing new training data files, just falling behind? If it's still producing but falling behind, do you have a traditional hard drive or an SSD, and are you running other disk-intensive things on the same machine? (For example, the shuffle script is extremely disk-intensive.)

@hzyhhzy
Copy link

hzyhhzy commented Jul 2, 2019

Much slower, not stuck. My drive is a new 860 EVO SSD.

@l1t1
Copy link
Author

l1t1 commented Jul 6, 2019

lightvector/KataGo#36
A bot is testing KataGo on CGOS.

@lightvector
Copy link

Any idea who's running it?

@hzyhhzy - Honestly, I'm not sure what's going on then. Presumably something on your system is slowing it down compared to mine, or the parameters and the number of threads and cores you have are such that the data-writing thread is starved of computation time, or something else. If you can narrow it down to a clear bug or something that I might be able to reproduce, I'd be more than happy to investigate and fix it!

@Splee99
Copy link

Splee99 commented Jul 7, 2019

The trouble with KataGo is that there are too many parameters in the config file. So even when we set the visits to 3200, as with the KataGo bot on CGOS, there are still other parameters that may affect the strength.

@hzyhhzy
Copy link

hzyhhzy commented Jul 8, 2019

Can KataGo perform better in handicap games if I keep adjusting komi to keep the winrate around 50%?

@lightvector
Copy link

Maybe! Nobody has ever tested it! :)

A simple change that also likely makes it better is to center the dynamic score utility not at the current score estimate, but at some weighted average of that and zero. Nobody has ever tested that either, although I went ahead and made the change anyway for the OGS version.
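The centering change described above can be sketched in a few lines. This is a minimal illustration of the idea, not KataGo's actual implementation; the blending weight is a hypothetical parameter.

```python
def dynamic_score_center(score_estimate, weight=0.5):
    """Center the dynamic score utility at a weighted average of the
    current score estimate and zero.

    weight is a hypothetical blending parameter: 1.0 reproduces the old
    behavior (center at the raw estimate), 0.0 centers at zero.
    """
    return weight * score_estimate + (1.0 - weight) * 0.0
```

Centering partway toward zero pulls the utility curve back toward an even score, which plausibly tempers over-aggressive or over-passive play when the current estimate is extreme.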

@poptangtwe
Copy link

Will KataGo play more bad moves if the winrate is below 5% in handicap games? Should we adjust the komi to keep the winrate around 50%?
HZYhHZY and I disagree on this question. I am not in favor of adjusting the komi, while HZYhHZY prefers to raise the winrate by adding komi in handicap games to make KataGo play steadier, and to lower the winrate when it is in a dominant position so that it tries to gain more points. I don't think this approach can benefit the results.

@lightvector
Copy link

@poptangtwe KataGo usually will not play bad moves if the winrate is low, because it also cares about maximizing the expected score. So there is no necessity to adjust komi. When behind, KataGo will already try to catch up points, and when ahead, it will already try to gain more score while still playing fairly safely and solidly.

However, it almost certainly does not play handicap games against kyu-level players as well as a pro-level human would, due to always assuming the opponent is as strong as itself and not understanding what weaker humans would find dynamic or unsettled or complex. (And this isn't simply about obvious trick plays or overplays - for example, if there are two josekis that the bot considers both to be good and correct, but a human finds the first one simple and the second one complex, the bot might not have a preference for the second one, because it does not always have a good understanding of what a human would find simple versus complex.)

It is very possible that adjusting komi or altering the score utility mechanism could make it a little stronger in handicap games - this is untested. There are many possible ways to do such things, and likely some of them work better than others; very likely some are better than the current behavior, but it would need testing. Note, though, that most possibilities (including adjusting the komi) would not address the fundamental issues above - they would at best be minor improvements.

So, in summary: you don't need to do anything special, KataGo handles handicap okay! But if you would like to do some careful statistical testing, you could easily experiment and find some ways to make it still a little better.

@alreadydone
Copy link
Contributor

alreadydone commented Jul 12, 2019

http://m.newsmth.net/article/Weiqi/630476?p=1

星阵训练过一个20B的Resnet系列,到今年2月份停止时,对ELF-v2和Minigo-V15 均领先400elo分,也就是10:1的胜率;对MiniGo-V17 领先300分(考虑到V17使用SEnet,它的实际参数可认为多于20B Resnet)。使用棋谱数量约为750万,显著少于两个对照项目。星阵不是从零开始,但即使加上这个因素,资源上的节省也是明显的。

My translation:

Golaxy trained a series of 20B ResNets; when training stopped in Feb 2019, they led ELF-v2 and MiniGo-V15 by 400 Elo (namely a 10:1 win/loss rate) and led MiniGo-V17 by 300 Elo (considering that V17 uses SENet, its effective parameter count can be regarded as larger than a 20B ResNet). About 7.5M game records were used, significantly fewer than the two reference projects. Golaxy didn't start from zero, but even accounting for this factor, the savings in resources are clear.
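The "400 Elo, namely 10:1" equivalence quoted above follows directly from the standard Elo expected-score formula; a quick check:

```python
def elo_win_prob(elo_diff):
    """Expected win probability for a player rated elo_diff points higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# 400 Elo -> 1/(1 + 10^-1) = 10/11, i.e. a 10:1 win/loss ratio
p400 = elo_win_prob(400)
# 300 Elo -> roughly 0.85, about a 5.6:1 ratio
p300 = elo_win_prob(300)
```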

This is perhaps what we should expect if KataGo 20B is trained up to its ceiling.

By the way I recently found that the MiniGo team has published a paper, see:
https://openreview.net/forum?id=H1eerhIpLV
https://slideslive.com/38915880/minigo-a-case-study-in-reproducing-reinforcement-learning-research

@poptangtwe
Copy link

poptangtwe commented Jul 15, 2019

Are Golaxy's results REAL or FAKE news? I suspect the team is creating a false academic impression. That would be academic fraud.
