"error":"This version of KataGo is not enabled for distributed for Nvidia GPU Hopper100 #748
Thanks for being interested in contributing! The only versions that should be used to contribute are official releases or the tip of the official repository.

Additionally, the "-dirty" suffix suggests that you are attempting to build while having made local changes to the code in your repo. You need to revert any local changes you might have made to anything, and build only from a completely clean checkout of one of the appropriate versions. And of course, please don't attempt to circumvent this check - it's there as a safeguard against accidentally introducing bugs that could affect the data, and even I sometimes rely on it myself when working with many KataGo versions.

If the only change you made to the code was to edit "contribute_example.cfg", you still need to revert that change, because that file is part of the repo. But for that specific case, you can copy "contribute_example.cfg" to a new file like "contribute.cfg", put your username and password in the copy, and use the copy for the command, leaving the original untouched.
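The workaround above can be sketched in shell (the `sed` edits and the placeholder credentials are my own illustration, not part of the maintainer's instructions):

```shell
# Run from the KataGo build directory, leaving the repo checkout clean.
# Copy the example config; only the copy gets edited.
cp contribute_example.cfg contribute.cfg

# Fill in credentials in the copy (YOUR_USERNAME / YOUR_PASSWORD are placeholders).
sed -i 's/^username = .*/username = YOUR_USERNAME/' contribute.cfg
sed -i 's/^password = .*/password = YOUR_PASSWORD/' contribute.cfg

# Contribute using the copy; contribute_example.cfg stays untouched.
./katago contribute -config contribute.cfg
```

Because `contribute_example.cfg` is never modified, `git status` stays clean and the build no longer carries the "-dirty" suffix.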
Now it's working, and distributed training is ongoing:
What's the web page you sent to me?
It's the exact same page as you screenshotted to me: https://katagotraining.org/contributions/ None of the pages are updated exactly in real time. They're all delayed by some minutes, because it takes the server some time to periodically recompute all the stats.
2023-01-28 09:39:04+0000: Finished game 158 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6921528576-d3107233977/41408EFC2AE46B65.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6921528576-d3107233977/2CEE324989089EF8.npz (29 rows)
Thanks for the report. I have searched for the error online, and I have to wonder whether this is some GPU glitch or something like that. Perhaps the device intermittently fails, or overheats and has to stop, or something like that. Here is a post where people discuss this extensively for a different case - pytorch, reporting a similar pattern of successful running for a long time in pytorch training, then suddenly a failure: pytorch/pytorch#27837

Some people report that setting TDR on Windows to be higher helps, but it sounds to me like this is just patching the symptom rather than the cause - presumably the reason the GPU times out and fails to respond within 2 seconds is still some sort of failure to communicate with the GPU because the GPU has some error, or because it overheats and is forced by the hardware to stall, or something like that? I think I've also heard of cases where the GPU drew too much power than the power supply could provide, although I don't know if that would show up as this kind of error.

I would check any stats you have about whether your GPU temperature is okay when you run too long, whether it's drawing too much power for your system to support, whether KataGo ever gets close to using too much GPU memory when running, etc.
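One way to collect the suggested stats while KataGo runs (an assumption about your setup, not something from the thread) is nvidia-smi's query mode, logging to a CSV you can inspect after a crash:

```shell
# Log timestamp, temperature, power draw, and memory use once per second,
# so a later failure can be correlated with a thermal or power spike.
nvidia-smi \
  --query-gpu=timestamp,temperature.gpu,power.draw,memory.used,memory.total \
  --format=csv -l 1 >> gpu_stats.csv
```

Leave this running in a second terminal alongside `./katago contribute`, then check the last few lines of `gpu_stats.csv` after any crash.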
After upgrading the driver of the GH100 GPU to 530.24 and rebuilding KataGo again, it reports the error below:
What version of CUDA? And what is the output of "nvcc --version"? Normally, compute_35 should automatically be disabled for nvcc versions that do not support it, but it sounds like your cmake is failing to disable it, causing nvcc to complain. If you upgraded CUDA beyond version 11.8, it's possible that you need to delete the CMakeCache.txt and CMakeFiles, so that cmake can re-detect your CUDA version and specify the right architecture list.
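A minimal sketch of the suggested cache reset, assuming you build in the same directory with the same flags used earlier in this thread:

```shell
# From the KataGo build directory: wipe cmake's cache so it re-detects the
# upgraded CUDA toolkit and regenerates the compute architecture list.
rm -rf CMakeCache.txt CMakeFiles/
cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1
make -j"$(nproc)"
```

Deleting only these two cache artifacts is enough; cmake regenerates them on the next configure run.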
@century3 Hi, did you manage to run with the H100? I would appreciate it a lot if you could share KataGo's benchmark results on the H100 :)
What's the command to show KataGo's benchmark results on the H100?
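For reference, KataGo ships a built-in `benchmark` subcommand; a typical invocation looks like the following (the model and config filenames are placeholders for whichever files you actually use):

```shell
# Measures neural net evaluation speed on this GPU and suggests a
# numSearchThreads setting for the given config.
./katago benchmark -model MODEL_FILE.bin.gz -config default_gtp.cfg
```

The summary it prints (visits per second at various thread counts) is the usual thing people share when comparing GPUs.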
I ran into another issue when trying to contribute with an A2 GPU. Below is my config: after git cloning the stable branch, I then issued: I have copied the cuDNN lib and include files to the /usr/local/cuda/lib64 and /include directories. How can I resolve this issue?
@century3 try
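The suggestion above is cut off in the thread. One common approach (an assumption on my part; the variable names below may differ from what KataGo's cmake actually reads) is to point cmake at the cuDNN installation directly instead of copying files into /usr/local/cuda:

```shell
# CUDNN_INCLUDE_DIR / CUDNN_LIBRARY are assumed cmake hint names;
# adjust the paths to wherever cuDNN 8.x is actually unpacked.
cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1 \
  -DCUDNN_INCLUDE_DIR=/opt/cudnn/include \
  -DCUDNN_LIBRARY=/opt/cudnn/lib/libcudnn.so
```

If cmake still fails to find cuDNN, the CMakeCache.txt from the failed run should be deleted before retrying, as described earlier in the thread.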
Configuration:
GPU : Nvidia Hopper 100
Driver: 530.20
CUDA: 11.8
CUDNN: 8.7.0
OS: Ubuntu 20.04
cmake: 3.22.5
==============NVSMI LOG==============
Timestamp : Sun Jan 22 12:47:54 2023
Driver Version : 530.20
CUDA Version : 12.1
Attached GPUs : 1
GPU 00000000:41:00.0
Product Name : NVIDIA H100 PCIe
Product Brand : NVIDIA
Product Architecture : Hopper
First, git clone the repository under the $HOME folder on Ubuntu 20.04.
Build distributed KataGo:
cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1
After KataGo is successfully built, copy it to the Katago/ folder.
Then execute the command below:
./katago contribute -config contribute_example.cfg
It returns the error log:
2023-01-22 12:44:49+0000: Running with following config:
cudaDeviceToUse = 0
maxSimultaneousGames = 16
password =
serverUrl = https://katagotraining.org/
username =
watchOngoingGameInFile = false
watchOngoingGameInFileName = watchgame.txt
2023-01-22 12:44:49+0000: Distributed Self Play Engine starting...
2023-01-22 12:44:49+0000: Attempting to connect to server
2023-01-22 12:44:49+0000: isSSL: true
2023-01-22 12:44:49+0000: host: katagotraining.org
2023-01-22 12:44:49+0000: port: 443
2023-01-22 12:44:49+0000: baseResourcePath: /
2023-01-22 12:44:49+0000: KataGo v1.12.2
2023-01-22 12:44:49+0000: Git revision: 71acccd-dirty
2023-01-22 12:44:49+0000: Running tiny net to sanity-check that GPU is working
2023-01-22 12:44:49+0000: nnRandSeed0 = 2030077357041009691
2023-01-22 12:44:49+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyModel_C5D4DBB5CE8BE51C.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:49+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:54+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:54+0000: Cuda backend thread 0: Model version 9 useFP16 = true useNHWC = true
2023-01-22 12:44:54+0000: Cuda backend thread 0: Model name: rect15-b2c16-s13679744-d94886722
2023-01-22 12:44:56+0000: nnRandSeed0 = 2125985325012519078
2023-01-22 12:44:56+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_4803ACD83B34793D.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:56+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:56+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: nnRandSeed0 = 14882082112687244351
2023-01-22 12:44:56+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_81976840857A114C.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:56+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:56+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: Tiny net sanity check complete
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: --------
2023-01-22 12:44:56+0000: Type 'pause' and hit enter to pause contribute and CPU and GPU usage.
2023-01-22 12:44:56+0000: Type 'quit' and hit enter to begin shutdown, quitting after all current games are done (may take a long while).
2023-01-22 12:44:56+0000: Type 'forcequit' and hit enter to shutdown and quit more quickly, but lose all unfinished game data.
2023-01-22 12:44:56+0000: --------
2023-01-22 12:44:58+0000: ERROR: task loop loop thread failed: Server returned error 400: This version of KataGo is not enabled for distributed. If this exact version was working previously, then changes in the run require a newer version - please update KataGo to the latest version or release. But if this is already the official newest version of KataGo, or you think that not enabling this version is an oversight, please ask server admins to enable the following version hash: 71acccd-dirty-cuda