KataGo 1.1 released #2431
The 20b weights are as strong as ELF v2.
Great, thanks KataGo team!
Congrats on the great work, and thanks to @lightvector for the release! So we now have an ELF-strength network that plays under any handicap and komi! Training data (once converted) and self-play games could be used to train LZ to play more reasonably at high/low winrates. The ELFv2 training run used 2000 self-play GPUs (V100s) and took "around 16 days of wall clock time, achieving superhuman performance in 9 days". ELFv2 is the 1290th model out of a total of 1500 models, so it was produced around day 13.76, amounting to 27520 V100-days. KataGo's g104 run used 27 V100s for 19 days, amounting to 513 V100-days, and surpassed ELFv2: a ~53.6x reduction in computing resources.
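The V100-day arithmetic above can be sanity-checked in a few lines (numbers as quoted in the comment; the 13.76-day figure assumes models were produced at a uniform rate over the run):

```python
# Back-of-the-envelope check of the compute figures quoted above.
elf_gpus = 2000            # self-play V100s used for ELF OpenGo
elf_total_days = 16        # wall-clock days for the full 1500-model run
elf_model_index = 1290     # ELFv2 is model 1290 of 1500

elf_days = elf_total_days * elf_model_index / 1500      # ~13.76 days
elf_v100_days = elf_gpus * elf_days                     # ~27520

kata_v100_days = 27 * 19                                # 513

speedup = elf_v100_days / kata_v100_days                # ~53.6x
print(round(elf_days, 2), round(elf_v100_days), kata_v100_days, round(speedup, 1))
```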
@alreadydone Yeah, very impressive, 50x faster to reach the same level with equivalent resources!!!
I'll compile a new version when I have a chance. KataGo's 20b256f is the same size as ELFv2, so it should be able to achieve similar inference speed (playouts per second) to LZ-ELFv2 (the architectural differences should have negligible impact on speed). I wonder if that's already the case (for the cuDNN version). Despite innovations in the network architecture and input features, the resulting net is only on par with or slightly above ELFv2, which makes me wonder if that's due to:
I tried to compile with MinGW without CUDA, but it failed.
@alreadydone - It's due to number 1. If I trained it longer, I'm pretty sure it would continue to improve more. Here's a snippet of BayesElo-style ratings computed from some match games I have ongoing between nets from this run and a sampling of Leela Zero nets (1600 visits, LCB). Games played between a wide variety of versions, and between LZ and KataGo as well, to limit nontransitivity and Elo inflation effects.
It's a bit noisy (note how wide the confidence intervals are with so few games to test), but there are no obvious signs of having reached a limit. But I had to stop the run at some point, because GPU power is quite expensive and I want to move on to testing ideas that could potentially make a third run even more cost-efficient and a better use of time. :) Number 3 is certainly possible too, but at least at weaker strengths the ablation run I did back in the paper suggested score maximization helped rather than hurt; that could just as easily be true at strong levels too. It's on the todo list to test a few more things relating to score maximization, but until then, given how it seemingly hasn't actually capped in strength and how well the learning is working overall, I'm not going to worry much about it. @l1t1 - I'm confused - you don't have CUDA, but you're trying to compile it with the CUDA backend?
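For reference, the Elo scale used in such ratings maps win probability to rating difference. A minimal sketch using the standard Elo logistic formula (not the actual BayesElo implementation, which also models priors and draws):

```python
import math

def elo_diff(p_win: float) -> float:
    """Rating difference implied by an expected win probability (standard Elo)."""
    return 400.0 * math.log10(p_win / (1.0 - p_win))

def win_prob(diff: float) -> float:
    """Expected win probability implied by a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 100-Elo edge corresponds to roughly a 64% expected score.
print(round(win_prob(100.0), 3))  # 0.64
```

With only a few hundred games, the standard error on a measured win rate is large, which is why the confidence intervals in such tables are so wide.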
@alreadydone I have already started some work on building an actual version and rebasing your branch to master. See it here: https://github.com/tychota/KataGo/tree/vs
I'm mostly done but got a rather bizarre error: https://gist.github.com/tychota/d3ff1d13ece5ee3a5aec09c2d78487a6; maybe you can have a look?
I installed the NVIDIA CUDA toolkit 10.1 and still got an error.
@l1t1 - yes, as I mentioned in my previous post already, MinGW and CUDA are probably just not compatible. You will have to use Microsoft Visual Studio (which I've never used), or wait for @tychota and @alreadydone to get it working and download their version of the binary, or wait for me to optimize the OpenCL version over the next several weeks.
@tychota @alreadydone - Thanks for all the work in making a Windows version! Poke me if there's anything I can do to help, and/or open an issue on the KataGo repo as needed if there are any more simple and easy things I can do to increase compatibility with MSVC, as a follow-up to the 'pedantic' changes and various other things that I handled earlier.
Compilation was successful: main.zip https://github.com/alreadydone/KataGo/tree/vs
@alreadydone Just tested, and I still get a problem somewhere in my config ;-(. Any idea why it fails like that? It worked 2 months ago for a few days, then without any obvious change (updated NVIDIA drivers maybe???) it started failing like that, and I did not manage to make it work again ;-(.
Just a so it makes the import less specific to your machine. |
VC2017 also has a problem @alreadydone
@l1t1 I don't know how you get CMake to find the packages on Windows; I changed CMakeLists.txt a lot and hard-coded the directories to make it work. @tychota I think some packages are pretty big, so I avoided copying them to save disk space; I don't think you should upload the packages to GitHub or manage them in git... @Friday9i Do you see anything in gtp.log?
I cloned your source code and it failed too @alreadydone
`lz-analyze B 100` gives an error; it just keeps loading and I cannot play the game. Sabaki v0.43.3 / Windows 10 64-bit
@alreadydone Here is the gtp.log (cut short for the last tests, and renamed). I'm also using Sabaki v0.43.3 and the latest Windows 10 (64-bit). Note: since the previous tests, I changed my GPU, upgrading to an RTX 2080 (from a GTX 1080). Could that be one cause of the difficulties (e.g. linked to FP16 and/or Tensor cores)? And a screen copy of Sabaki:
@alreadydone
GeForce RTX graphics card may be the cause of a "connection failed" error. |
The vendors dir is gitignored. But that is a good way to make an environment-nonspecific path. Compare:
@lightvector I don't know if it comes from Sabaki or KataGo (and I was wondering). I'll try the command line late tonight European time (but what should be the command? Do you have an example of what I should type?) |
Thx for the release. Now lizgoban supports KataGo v1.1 and its score/ownership estimations. |
@Friday9i - Maybe type commands like the ones that you see Sabaki sending it? "komi 7.5" and "boardsize 19" and "clear_board" and "genmove B" and so on. The GTP command "list_commands" will show you all the commands a bot supports. See here for the GTP protocol, where it describes what the standard GTP commands are all supposed to do. It really is just text; you can type and communicate with the bot. Edit: @kaorahi - Sweet!
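Since GTP is plain text, a rough illustration of what the engine sends back (a minimal hand-rolled parser, not part of KataGo or Sabaki): a success response starts with `=`, a failure with `?`, and each response is terminated by a blank line.

```python
def parse_gtp_response(raw: str):
    """Parse a single GTP response string.

    GTP responses are plain text terminated by a blank line, e.g.
    '= Q16\n\n' for a successful genmove, '? unknown command\n\n' on failure.
    Returns (success, payload).
    """
    body = raw.strip()
    if body.startswith("="):
        return True, body[1:].strip()
    if body.startswith("?"):
        return False, body[1:].strip()
    raise ValueError("not a GTP response: %r" % raw)

ok, move = parse_gtp_response("= Q16\n\n")
print(ok, move)  # True Q16
```

In practice a GUI like Sabaki just writes command lines to the engine's stdin and reads these responses from stdout.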
ok I'll test extensively tonight, thx a lot |
@lightvector - Pressing the F4 key ... `lz-analyze` works with `B 100`, but ... I do not get onto the board anymore. Sabaki option: `gtp -model 6b.txt -config configs/gtp_example.cfg`
A few tests with the command line `main.exe demoplay -log-file test -model 15b.txt -config configs/gtp19.cfg`; here are the results:
@l1t1 If you want to compile:
If you don't want to compile, all necessary DLLs are included in lightvector/KataGo#2. @Friday9i What commands did you type and what responses did you receive? Why do you use demoplay instead of gtp? I tried Sabaki, and it seems that if you play against KataGo and also enable analysis, then after your move KataGo won't respond; you have to Suspend and Generate Move. BTW, KataGo now supports rectangular boards, but UI support seems to be lacking. Sabaki can show rectangular boards but won't send the appropriate GTP command to KataGo; I requested the feature at SabakiHQ/Sabaki#75. @tychota So whoever wants to compile has to get the packages and put them into the vendors folder manually. I'm lucky that I've gotten the packages through compiling LZ earlier, so I just needed to find the directories where the packages are located. I don't know how to get the packages installed without NuGet; a CMake project in VS seems unable to get NuGet packages automatically, so you likely need to upgrade it into a VS project.
@hzyhhzy - Take a close look at your stdout.txt. If you read through the exception trace, you will find a lot of "OutOfRangeError". If we look here: this suggests it is a completely normal exception that TensorFlow raises to indicate when the current sequence of batches of training data is done. Presumably tf.Estimator catches this as a signal to continue on to the next thing. So keep digging down as the Python trace reports what exceptions were thrown in the code that ran after catching that OutOfRangeError, which is caught and re-thrown by TensorFlow through several layers... until we come to something that is clearly anomalous: Python's `os.rename`. So it seems the problem is that it behaves inconsistently between Linux and Windows, which I had not known: on Linux it replaces existing destination files, while on Windows it fails. Indeed, based on the Python docs link above, it looks like the better function that should behave more portably and do the desired thing is `os.replace`. (Thanks for bugfinding!)
Thanks a lot! I replaced three `os.rename` calls with `os.replace` in train.py (except the one on line 590), replaced line 590 with `shutil.move(savepathtmp, savepath)`, and added `os.makedirs(savepathtmp)` before line 566; then it works on Windows.
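The portability difference discussed above is easy to demonstrate: `os.replace` overwrites an existing destination on every platform, whereas `os.rename` raises `FileExistsError` on Windows when the destination already exists. A minimal sketch of the pattern train.py needs (the filenames here are illustrative):

```python
import os
import tempfile

# Demonstrate the portable overwrite: os.replace overwrites an existing
# destination on every platform, while os.rename would raise
# FileExistsError on Windows because the destination already exists.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "model.ckpt.tmp")
    dst = os.path.join(d, "model.ckpt")
    with open(src, "w") as f:
        f.write("new weights")
    with open(dst, "w") as f:
        f.write("old weights")

    os.replace(src, dst)  # atomic on POSIX; overwrites on Windows too

    with open(dst) as f:
        result = f.read()
    print(result)  # new weights
```

Write-to-temp-then-replace is the standard way to update a checkpoint file without readers ever seeing a half-written version.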
Can train.py only work on one GPU?
Yes, I never implemented multi-GPU training. |
re-analysis of the ear-reddening move with KataGo: https://tieba.baidu.com/p/6178989992 BTW a recent series "Where exactly were humans wrong?" |
Updated: I got it wrong, sorry. @alreadydone Probably you got it wrong: apparently move #127 is a bad move which makes Black's winrate decrease from 63.3% to 36.6% (i.e., 1 − 63.4%).
It seems that on Windows you need to have CUDA 10.1 installed:
@PublicStarMoon It seems KataGo thinks White's move 126 is extremely bad. White was 46% before the move and 34% after it. Maybe Inseki's ear reddened not because 127 was that good, but because he realized how bad move 126 was, haha.
You are right; I made a mistake operating Lizzie with KataGo, and I have fixed it in the original post link.
This often happens in my self-play, especially after it automatically changes models. It seems that writing data has almost completely stalled. How should I deal with it? 2019-07-01 14:19:59+0800: Started 4500 games
Does train.py automatically adjust the learning rate? Or should I use "-lr-scale" to adjust it?
Updated note regarding KataGo's installation and the pre-requisite on Windows
Hope this helps! PS: KataGo is a bit painful to install, but it is such a nice engine that it justifies some effort :-).
I don't know that it requires Visual Studio. I cannot remember, but I think I just installed the latest VC redist package. I am using alreadydone's compile on an NVIDIA RTX GPU.
CUDA seems needed, and @kira had a message when installing CUDA saying that Visual Studio may be needed. Hence it's not 100% sure, but VS may be needed (in line with my wording above).
That is a little strange. I run KataGo on a Windows 10 Home PC with a brand new installation on new hardware. I didn't run into any of these issues. |
@ozymandias8 Did you have CUDA 10, cuDNN, and Visual Studio on your machine before installing KataGo?
Definitely not Visual Studio. Although, like I said, I needed to install the VC redistributable, I think. The Steam Client installed some items so that I could play Sekiro; perhaps there are some CUDA-related files in there.
Oh yes indeed, it's probably the redistributable package of VS that is needed, not VS itself.
@hzyhhzy - The learning rate is constant (except it is lower for the first 5 million samples of training). You can use -lr-scale to adjust it if you need to. If you are doing self-play, the default learning rate is pretty good. In the official released run, KataGo kept -lr-scale 1.0 from the start of the run all the way to reaching around LZ180-ish strength! I lowered it for a couple of days of training right at the end to finetune the final strength, but I probably would have kept it high if I had intended to keep the run going rather than planning to stop at that point. The "Struggling to keep up writing data" error is one I have never seen myself; I put it in there as a sanity check. Glad to know the sanity check is actually finding something. When you get that error, if you look at the training data output, has the data writing gotten completely stuck and/or died and is actually outputting nothing? Or is it still producing new training data files, but just falling behind? If it's still producing but falling behind, do you have a traditional hard drive or an SSD, and are you running other disk-intensive things on the same machine? (For example, the shuffle script is extremely disk intensive.)
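The schedule described above (a constant rate, reduced during the first 5 million samples, and multiplied by -lr-scale) can be sketched roughly as follows. The base rate and warmup reduction here are illustrative placeholder values, not KataGo's actual constants:

```python
def effective_lr(samples_seen: int,
                 base_lr: float = 6e-5,
                 lr_scale: float = 1.0,
                 warmup_samples: int = 5_000_000,
                 warmup_scale: float = 0.2) -> float:
    """Sketch of the schedule described above: constant LR, reduced early on.

    base_lr and warmup_scale are illustrative values, not KataGo's real
    constants; -lr-scale simply multiplies whatever the schedule produces.
    """
    scale = warmup_scale if samples_seen < warmup_samples else 1.0
    return base_lr * scale * lr_scale

print(effective_lr(1_000_000))    # early training: reduced rate
print(effective_lr(10_000_000))   # steady state: full rate
```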
Much slower, not stuck. My hard drive is a new 860evo ssd. |
lightvector/KataGo#36 |
Any idea who's running it? @hzyhhzy - Honestly, I'm not sure what's going on then. Presumably something on your system is slowing it down compared to mine, or the parameters and the number of threads and cores you have are such that the data-writing thread is starved for computation time, or something else. If you can narrow it down to a clear bug or something that I might be able to reproduce, I'd be more than happy to investigate and fix it!
The trouble with KataGo is that there are too many parameters in the config file. So even when we set the visits to 3200, as for the KataGo bot on CGOS, there are still other parameters which may affect the strength.
Can KataGo perform better in handicap games if I keep adjusting the komi to keep the winrate around 50%?
Maybe! Nobody has ever tested it! :) A simple change that also likely makes it better is to center the dynamic score utility not at the current score estimate, but at some weighted average of that and zero. Nobody has ever tested that either, although I went ahead and made the change anyway for the OGS version.
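The weighted-average idea reads as a two-line formula; the weight value here is purely illustrative, not what the OGS version actually uses:

```python
def dynamic_score_center(current_estimate: float, weight: float = 0.75) -> float:
    """Center the dynamic score utility at a weighted average of the
    current score estimate and zero, as suggested above.

    weight=1.0 reproduces centering at the current estimate; smaller
    weights pull the target back toward an even score of 0.
    """
    return weight * current_estimate + (1.0 - weight) * 0.0

print(dynamic_score_center(-20.0))  # -15.0: pulled back toward 0
```

The effect is that a bot far behind (or ahead) on points still aims part of its score utility at an even game, rather than fully anchoring to its current deficit.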
Will KataGo play more bad moves if the winrate is <5% in handicap games? Shall we adjust the komi to keep the winrate around 50%?
@poptangtwe KataGo usually will not play bad moves if the winrate is low, because it also cares about maximizing the expected score. So there is no need to adjust komi. When behind, KataGo will already try to catch up on points, and when ahead, it will already try to gain more score while still playing fairly safely and solidly. However, it almost certainly does not play handicap games against kyu-level players as well as a pro-level human would, due to always assuming the opponent is as strong as itself and not understanding what weaker humans would find dynamic or unsettled or complex. (And this isn't simply about obvious trick plays or overplays - for example, if there are two josekis that the bot considers both to be good and correct, but a human finds the first one simple and the second one complex, the bot might not have a preference for the second one, because it does not always have a good understanding of what a human would find simple versus complex.) It is very possible that adjusting komi or altering the score utility mechanism could make it a little stronger in handicap games - this is untested. There are many possible ways to do such things, and very likely some of them are better than the current behavior, but it would need testing. Note, though, that most possibilities (including adjusting the komi) would not address the fundamental issues above; they would at best be minor improvements. So in summary: you don't need to do anything special, KataGo handles handicap okay! But if you would like to do some careful statistical testing, it could easily be possible to experiment and find some ways to make it still a little better.
http://m.newsmth.net/article/Weiqi/630476?p=1
My translation:
This is maybe what we should expect if KataGo 20B is trained up to its ceiling. By the way I recently found that the MiniGo team has published a paper, see: |
Are the results of Golaxy real or fake? The team is suspected of academically creating a false impression; that would be academic fraud.
https://github.com/lightvector/KataGo/releases