
"error":"This version of KataGo is not enabled for distributed for Nvidia GPU Hopper100 #748

Open
century3 opened this issue Jan 22, 2023 · 14 comments


@century3

century3 commented Jan 22, 2023

Configuration:
GPU : Nvidia Hopper 100
Driver: 530.20
CUDA: 11.8
CUDNN: 8.7.0
OS: Ubuntu20.04
cmake: 3.22.5

==============NVSMI LOG==============

Timestamp : Sun Jan 22 12:47:54 2023
Driver Version : 530.20
CUDA Version : 12.1

Attached GPUs : 1
GPU 00000000:41:00.0
Product Name : NVIDIA H100 PCIe
Product Brand : NVIDIA
Product Architecture : Hopper

First, I cloned the repository with git into my $HOME folder on Ubuntu 20.04.

Then I built katago with distributed support:
cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1

After katago built successfully, I copied it to the Katago/ folder
and ran the command below:
./katago contribute -config contribute_example.cfg

It returned this error log:

2023-01-22 12:44:49+0000: Running with following config:
cudaDeviceToUse = 0
maxSimultaneousGames = 16
password =
serverUrl = https://katagotraining.org/
username =
watchOngoingGameInFile = false
watchOngoingGameInFileName = watchgame.txt

2023-01-22 12:44:49+0000: Distributed Self Play Engine starting...
2023-01-22 12:44:49+0000: Attempting to connect to server
2023-01-22 12:44:49+0000: isSSL: true
2023-01-22 12:44:49+0000: host: katagotraining.org
2023-01-22 12:44:49+0000: port: 443
2023-01-22 12:44:49+0000: baseResourcePath: /
2023-01-22 12:44:49+0000: KataGo v1.12.2
2023-01-22 12:44:49+0000: Git revision: 71acccd-dirty
2023-01-22 12:44:49+0000: Running tiny net to sanity-check that GPU is working
2023-01-22 12:44:49+0000: nnRandSeed0 = 2030077357041009691
2023-01-22 12:44:49+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyModel_C5D4DBB5CE8BE51C.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:49+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:54+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:54+0000: Cuda backend thread 0: Model version 9 useFP16 = true useNHWC = true
2023-01-22 12:44:54+0000: Cuda backend thread 0: Model name: rect15-b2c16-s13679744-d94886722
2023-01-22 12:44:56+0000: nnRandSeed0 = 2125985325012519078
2023-01-22 12:44:56+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_4803ACD83B34793D.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:56+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:56+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: nnRandSeed0 = 14882082112687244351
2023-01-22 12:44:56+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_81976840857A114C.bin.gz useFP16 auto useNHWC auto
2023-01-22 12:44:56+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-22 12:44:56+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-22 12:44:56+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: Tiny net sanity check complete
2023-01-22 12:44:56+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-22 12:44:56+0000: --------
2023-01-22 12:44:56+0000: Type 'pause' and hit enter to pause contribute and CPU and GPU usage.
2023-01-22 12:44:56+0000: Type 'quit' and hit enter to begin shutdown, quitting after all current games are done (may take a long while).
2023-01-22 12:44:56+0000: Type 'forcequit' and hit enter to shutdown and quit more quickly, but lose all unfinished game data.
2023-01-22 12:44:56+0000: --------
2023-01-22 12:44:58+0000: ERROR: task loop loop thread failed: Server returned error 400: This version of KataGo is not enabled for distributed. If this exact version was working previously, then changes in the run require a newer version - please update KataGo to the latest version or release. But if this is already the official newest version of KataGo, or you think that not enabling this version is an oversight, please ask server admins to enable the following version hash: 71acccd-dirty-cuda
---RESPONSE---------------------

@lightvector
Owner

Thanks for being interested in contributing! The only versions that should be used to contribute are either the tip of stable branch in this repo, OR the tagged version corresponding to the most recent release, v1.12.2 (although v1.12.3 will be out shortly). Please git checkout one of those before building.

Additionally, the "-dirty" suggests that you are attempting to build while having made local changes to the code to your repo. You need to revert all local changes you might have made to anything, and only build using a completely clean checkout from one of the appropriate versions. And of course, please don't attempt to circumvent this check - it's there as a safeguard against accidentally introducing bugs that could affect the data, which even I sometimes rely on myself when working with many katago versions.

If the only change you made to the code was to edit "contribute_example.cfg", because that file is part of the repo, you still need to revert that change. But for that specific case you can copy "contribute_example.cfg" to a new file like "contribute.cfg" and put your username and password in the copy and use the copy for the command, leaving the original untouched.
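
As a rough illustration of the "-dirty" marker described above, the sketch below tests a revision string in the form it takes on the "Git revision:" line of the log. The function name revision_is_dirty is made up for this example and is not part of KataGo; the two sample strings are the revisions that appear in the logs in this thread.

```shell
# Hypothetical helper, not part of KataGo: returns 0 (true) when a
# "Git revision" string carries the "-dirty" suffix, i.e. the binary
# was built from a checkout with local modifications.
revision_is_dirty() {
  case "$1" in
    *-dirty*) return 0 ;;
    *) return 1 ;;
  esac
}

# The two revisions reported in this thread's logs.
for rev in 71acccd-dirty 3d8c8b5; do
  if revision_is_dirty "$rev"; then
    echo "$rev: built from a modified checkout"
  else
    echo "$rev: clean revision string"
  fi
done
```

Before rebuilding, running `git status` inside the checkout is one way to see which tracked files were modified. Putting credentials in a copy such as contribute.cfg, as suggested above, leaves contribute_example.cfg untouched.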

@century3
Author

Now it's working, and distributed training is ongoing:
2023-01-25 13:45:43+0000: Distributed Self Play Engine starting...
2023-01-25 13:45:43+0000: Attempting to connect to server
2023-01-25 13:45:43+0000: isSSL: true
2023-01-25 13:45:43+0000: host: katagotraining.org
2023-01-25 13:45:43+0000: port: 443
2023-01-25 13:45:43+0000: baseResourcePath: /
2023-01-25 13:45:43+0000: KataGo v1.12.2
2023-01-25 13:45:43+0000: Git revision: 3d8c8b5
2023-01-25 13:45:43+0000: Running tiny net to sanity-check that GPU is working
2023-01-25 13:45:43+0000: nnRandSeed0 = 6029478182799605777
2023-01-25 13:45:43+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyModel_E7AF2945E1130EC6.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:45:43+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:45:47+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:45:47+0000: Cuda backend thread 0: Model version 9 useFP16 = true useNHWC = true
2023-01-25 13:45:47+0000: Cuda backend thread 0: Model name: rect15-b2c16-s13679744-d94886722
2023-01-25 13:45:50+0000: nnRandSeed0 = 9906484665874401229
2023-01-25 13:45:50+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_7352E9E9AEFC7A66.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:45:50+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:45:50+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:45:50+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-25 13:45:50+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-25 13:45:50+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-25 13:45:50+0000: nnRandSeed0 = 2040447718285446411
2023-01-25 13:45:50+0000: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_A41A66ABBE1AF8A4.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:45:50+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:45:50+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:45:50+0000: Cuda backend thread 0: Model version 11 useFP16 = true useNHWC = true
2023-01-25 13:45:50+0000: Cuda backend thread 0: Model name: b1c6nbt
2023-01-25 13:45:50+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-25 13:45:50+0000: Tiny net sanity check complete
2023-01-25 13:45:50+0000: GPU 0 finishing, processed 41 rows 21 batches
2023-01-25 13:45:50+0000: --------
2023-01-25 13:45:50+0000: Type 'pause' and hit enter to pause contribute and CPU and GPU usage.
2023-01-25 13:45:50+0000: Type 'quit' and hit enter to begin shutdown, quitting after all current games are done (may take a long while).
2023-01-25 13:45:50+0000: Type 'forcequit' and hit enter to shutdown and quit more quickly, but lose all unfinished game data.
2023-01-25 13:45:50+0000: --------
2023-01-25 13:45:51+0000: Number of nets loaded: selfplay 0 rating 0
2023-01-25 13:45:51+0000: Found new neural net kata1-b60c320-s6896081152-d3100727418
2023-01-25 13:45:55+0000: nnRandSeed0 = 18183437109568991776
2023-01-25 13:45:55+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b60c320-s6896081152-d3100727418.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:45:55+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:46:00+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:46:00+0000: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2023-01-25 13:46:00+0000: Cuda backend thread 0: Model name: kata1-b60c320-s6896081152-d3100727418
2023-01-25 13:46:02+0000: Loaded latest neural net kata1-b60c320-s6896081152-d3100727418 from: katago_contribute/kata1/models/kata1-b60c320-s6896081152-d3100727418.bin.gz
2023-01-25 13:46:02+0000: nnRandSeed0 = 7938906979326446754
2023-01-25 13:46:02+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b60c320-s6896081152-d3100727418.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:46:02+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:46:07+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:46:07+0000: Cuda backend thread 0: Model version 10 useFP16 = false useNHWC = false
2023-01-25 13:46:07+0000: Cuda backend thread 0: Model name: kata1-b60c320-s6896081152-d3100727418
2023-01-25 13:46:17+0000: Testing loaded net
2023-01-25 13:46:20+0000: Maybe predownloading model...
2023-01-25 13:46:46+0000: Testing loaded net okay
2023-01-25 13:46:46+0000: GPU 0 finishing, processed 1330 rows 998 batches
2023-01-25 13:46:46+0000: Loaded new neural net kata1-b60c320-s6896081152-d3100727418
2023-01-25 13:46:46+0000: Starting game 0 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 1 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 2 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 3 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:46+0000: Starting game 4 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 5 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 6 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:46+0000: Starting game 7 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:48+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:48+0000: Starting game 8 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:48+0000: Starting game 9 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:48+0000: Starting game 10 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:48+0000: Starting game 11 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:49+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:49+0000: Starting game 12 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:49+0000: Starting game 13 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:49+0000: Starting game 14 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:49+0000: Starting game 15 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:51+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:51+0000: Starting game 16 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:51+0000: Starting game 17 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:51+0000: Starting game 18 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:51+0000: Starting game 19 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:52+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:52+0000: Starting game 20 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:52+0000: Starting game 21 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:52+0000: Starting game 22 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:52+0000: Starting game 23 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:53+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:53+0000: Starting game 24 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:53+0000: Starting game 25 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:53+0000: Starting game 26 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:53+0000: Starting game 27 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:55+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:55+0000: Starting game 28 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:55+0000: Starting game 29 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:55+0000: Starting game 30 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:55+0000: Starting game 31 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:56+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:56+0000: Starting game 32 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:56+0000: Starting game 33 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:56+0000: Starting game 34 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:56+0000: Starting game 35 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:58+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:58+0000: Starting game 36 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:58+0000: Starting game 37 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:58+0000: Starting game 38 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:58+0000: Starting game 39 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:59+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:46:59+0000: Starting game 40 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:59+0000: Starting game 41 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:59+0000: Starting game 42 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:46:59+0000: Starting game 43 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:01+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:01+0000: Beginning download of model kata1-b40c256-s10939463424-d2666454237
2023-01-25 13:47:02+0000: Downloaded 40633978 / 173552768 bytes for model: https://media.katagotraining.org/uploaded/networks/models/kata1/kata1-b40c256-s10939463424-d2666454237.bin.gz
2023-01-25 13:47:02+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:02+0000: Starting game 44 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:02+0000: Starting game 45 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:02+0000: Starting game 46 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:02+0000: Starting game 47 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:03+0000: Downloaded 99747450 / 173552768 bytes for model: https://media.katagotraining.org/uploaded/networks/models/kata1/kata1-b40c256-s10939463424-d2666454237.bin.gz
2023-01-25 13:47:03+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:03+0000: Starting game 48 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:03+0000: Starting game 49 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:03+0000: Starting game 50 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:03+0000: Starting game 51 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:06+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:06+0000: Starting game 52 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:06+0000: Starting game 53 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:06+0000: Starting game 54 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:06+0000: Starting game 55 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:06+0000: Done downloading 173552768 bytes for model: https://media.katagotraining.org/uploaded/networks/models/kata1/kata1-b40c256-s10939463424-d2666454237.bin.gz
2023-01-25 13:47:06+0000: Beginning download of model kata1-b40c256-s12519813888-d3102712936
2023-01-25 13:47:07+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:07+0000: Starting game 56 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:07+0000: Starting game 57 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:07+0000: Starting game 58 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:07+0000: Starting game 59 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:09+0000: Number of nets loaded: selfplay 1 rating 0
2023-01-25 13:47:09+0000: Starting game 60 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:09+0000: Starting game 61 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:09+0000: Starting game 62 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:09+0000: Starting game 63 (training) (kata1-b60c320-s6896081152-d3100727418)
2023-01-25 13:47:09+0000: Done downloading 173551417 bytes for model: https://media.katagotraining.org/uploaded/networks/models/kata1/kata1-b40c256-s12519813888-d3102712936.bin.gz
2023-01-25 13:47:09+0000: Found new neural net kata1-b40c256-s10939463424-d2666454237
2023-01-25 13:47:10+0000: nnRandSeed0 = 6134301522316129695
2023-01-25 13:47:10+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b40c256-s10939463424-d2666454237.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:47:10+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:47:13+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:47:13+0000: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2023-01-25 13:47:13+0000: Cuda backend thread 0: Model name: kata1-b40c256-s10939463424-d2666454237
2023-01-25 13:47:13+0000: Loaded latest neural net kata1-b40c256-s10939463424-d2666454237 from: katago_contribute/kata1/models/kata1-b40c256-s10939463424-d2666454237.bin.gz
2023-01-25 13:47:13+0000: nnRandSeed0 = 5017116527305297802
2023-01-25 13:47:13+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b40c256-s10939463424-d2666454237.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:47:13+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:47:16+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:47:16+0000: Cuda backend thread 0: Model version 10 useFP16 = false useNHWC = false
2023-01-25 13:47:16+0000: Cuda backend thread 0: Model name: kata1-b40c256-s10939463424-d2666454237
2023-01-25 13:47:16+0000: Testing loaded net
2023-01-25 13:47:48+0000: Testing loaded net okay
2023-01-25 13:47:48+0000: GPU 0 finishing, processed 1330 rows 1109 batches
2023-01-25 13:47:48+0000: Loaded new neural net kata1-b40c256-s10939463424-d2666454237
2023-01-25 13:47:48+0000: Found new neural net kata1-b40c256-s12519813888-d3102712936
2023-01-25 13:47:50+0000: nnRandSeed0 = 1082910773259850591
2023-01-25 13:47:50+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b40c256-s12519813888-d3102712936.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:47:50+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:47:52+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:47:52+0000: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2023-01-25 13:47:52+0000: Cuda backend thread 0: Model name: kata1-b40c256-s12519813888-d3102712936
2023-01-25 13:47:53+0000: Loaded latest neural net kata1-b40c256-s12519813888-d3102712936 from: katago_contribute/kata1/models/kata1-b40c256-s12519813888-d3102712936.bin.gz
2023-01-25 13:47:53+0000: nnRandSeed0 = 13248493777852009589
2023-01-25 13:47:53+0000: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b40c256-s12519813888-d3102712936.bin.gz useFP16 auto useNHWC auto
2023-01-25 13:47:53+0000: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2023-01-25 13:47:55+0000: Cuda backend thread 0: Found GPU NVIDIA H100 PCIe memory 85028896768 compute capability major 9 minor 0
2023-01-25 13:47:55+0000: Cuda backend thread 0: Model version 10 useFP16 = false useNHWC = false
2023-01-25 13:47:55+0000: Cuda backend thread 0: Model name: kata1-b40c256-s12519813888-d3102712936
2023-01-25 13:47:55+0000: Testing loaded net
2023-01-25 13:48:27+0000: Testing loaded net okay
2023-01-25 13:48:28+0000: GPU 0 finishing, processed 1330 rows 1109 batches
2023-01-25 13:48:28+0000: Loaded new neural net kata1-b40c256-s12519813888-d3102712936
2023-01-25 13:48:28+0000: Performance: in the last 157.6 seconds, played 495 moves (3.1/sec) and 180307 nn evals (1144.052253/sec)
2023-01-25 13:55:52+0000: Finished game 56 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6896081152-d3100727418/0BA81358FBCACF77.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6896081152-d3100727418/C97AD869856FB39A.npz (17 rows)
2023-01-25 13:55:52+0000: Starting game 64 (rating) (kata1-b40c256-s12519813888-d3102712936 vs kata1-b40c256-s10939463424-d2666454237)
2023-01-25 13:55:52+0000: Performance: in the last 444.8 seconds, played 3777 moves (8.5/sec) and 898127 nn evals (2019.274833/sec)
2023-01-25 13:58:04+0000: Finished game 62 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6896081152-d3100727418/86F3B3B8D4849C23.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6896081152-d3100727418/5A5D79A4A6C957D0.npz (25 rows)
2023-01-25 13:58:04+0000: Starting game 65 (rating) (kata1-b40c256-s12519813888-d3102712936 vs kata1-b40c256-s10939463424-d2666454237)
2023-01-25 13:58:04+0000: Performance: in the last 132.1 seconds, played 1134 moves (8.6/sec) and 254645 nn evals (1928.342496/sec)
2023-01-25 13:59:05+0000: Finished game 18 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6896081152-d3100727418/BC7B47F70645A76C.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6896081152-d3100727418/67ED37CB8860D018.npz (10 rows)
2023-01-25 13:59:05+0000: Starting game 66 (rating) (kata1-b40c256-s12519813888-d3102712936 vs kata1-b40c256-s10939463424-d2666454237)
2023-01-25 13:59:05+0000: Number of nets loaded: selfplay 1 rating 2
2023-01-25 14:01:03+0000: Finished game 46 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6896081152-d3100727418/815672636C247AAF.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6896081152-d3100727418/C9A081D4CC98E213.npz (28 rows)
2023-01-25 14:01:03+0000: Starting game 67 (rating) (kata1-b40c256-s10939463424-d2666454237 vs kata1-b40c256-s12519813888-d3102712936)
2023-01-25 14:01:03+0000: Performance: in the last 178.9 seconds, played 1502 moves (8.4/sec) and 336779 nn evals (1882.482887/sec)
2023-01-25 14:01:05+0000: Finished game 44 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6896081152-d3100727418/DE85B761370C0C66.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6896081152-d3100727418/D6242BE8A416AF34.npz (26 rows)

@century3
Author

But why can't I see my username in the user list for the last 24 hours? (My username is maverick.)
image

@lightvector
Owner

What do you mean? It's there now.
image

Ah, right, stats on this page may be delayed by on the order of half an hour due to how the update logic on the server works, you might have checked it before that point. I'll see about adding a note for that.

@century3
Author

Which web page did you send me?
This is the only web page I can find: https://katagotraining.org/networks/, and on this page the Elo ratings don't seem to be updated in real time.
image

@lightvector
Owner

It's the exact same page as you screenshotted to me. https://katagotraining.org/contributions/

None of the pages are updated exactly in real time. They're all delayed by some minutes, because it takes the server some time to periodically recompute all the stats.

@century3
Author

2023-01-28 09:39:04+0000: Finished game 158 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b60c320-s6921528576-d3107233977/41408EFC2AE46B65.sgf and training data katago_contribute/kata1/tdata/kata1-b60c320-s6921528576-d3107233977/2CEE324989089EF8.npz (29 rows)
2023-01-28 09:39:04+0000: Starting game 299 (training) (kata1-b60c320-s6921528576-d3107233977)
2023-01-28 09:39:04+0000: Performance: in the last 349.9 seconds, played 1234 moves (3.5/sec) and 306138 nn evals (874.842075/sec)
terminate called after throwing an instance of 'StringError'
terminate called recursively
what(): CUDA Error, for getOutput file /localhome/local-chunmingw/KataGo/cpp/neuralnet/cudabackend.cpp, func cudaMemcpy(inputBuffers->policyResults, buffers->policyBuf, inputBuffers->singlePolicyResultBytes*batchSize, cudaMemcpyDeviceToHost), line 2496, error unspecified launch failure
terminate called recursively
Aborted (core dumped)

@lightvector
Owner

Thanks for the report. I have searched for the error online and I have to wonder whether this is some GPU glitch or something like that. Perhaps the device intermittently fails, or overheats and has to stop, or something like that.

Here is a post where people discuss this extensively for a different case - pytorch, reporting a similar pattern - successful running for a long time in pytorch training, then suddenly a failure: pytorch/pytorch#27837

Some people report setting TDR on Windows to be higher helps, but it sounds to me like this is sort of just patching the symptom rather than the cause - presumably the reason that the GPU times out and fails to respond within 2 seconds is still due to some sort of failure to communicate with the GPU because the GPU has some error, or because it overheats and is forced by the hardware to stall, or something like that? I think I've also heard of cases where GPU drew too much power that could not be supplied by the power supply, although I don't know if that would show up as this kind of error.

I would check on any stats you have about whether your GPU temperature is okay when you run too long, whether it's drawing too much power for your system to support, whether when KataGo is running it ever gets close to using too much GPU memory, etc.
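
One way to collect those stats is to sample nvidia-smi while KataGo runs. The sketch below is only an assumption about how that could be scripted: check_gpu_sample, TEMP_LIMIT_C, and MEM_LIMIT_PCT are names invented here, the limits are illustrative rather than values from NVIDIA or KataGo, and the sample line at the end is made-up data so the snippet runs without a GPU.

```shell
# nvidia-smi query fields for temperature, power draw, and memory use.
GPU_QUERY="temperature.gpu,power.draw,memory.used,memory.total"

# Classify one CSV sample of the form "temp, power, mem_used, mem_total".
# The default limits (85 C, 95% memory) are illustrative assumptions.
check_gpu_sample() {
  echo "$1" | awk -F', *' \
    -v tmax="${TEMP_LIMIT_C:-85}" -v mmax="${MEM_LIMIT_PCT:-95}" '
    {
      pct = 100 * $3 / $4
      state = "ok"
      if ($1 + 0 >= tmax) state = "hot"
      else if (pct >= mmax) state = "memory-high"
      printf "%s temp=%dC power=%sW mem=%.0f%%\n", state, $1, $2, pct
    }'
}

# Live use on a machine with an NVIDIA driver, sampling every 5 seconds:
#   nvidia-smi --query-gpu="$GPU_QUERY" --format=csv,noheader,nounits -l 5 |
#     while read -r line; do check_gpu_sample "$line"; done

# Made-up sample so the snippet runs anywhere.
check_gpu_sample "62, 210.35, 40000, 81559"
```

Logging these samples to a file alongside the KataGo log makes it possible to see what the GPU was doing in the seconds before a crash.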

@century3
Author

After upgrading the driver for the GH100 GPU to 530.24 and rebuilding katago, I get the error below:
local-chunmingw@ipp2-0328:/KataGo/cpp$ make clean
local-chunmingw@ipp2-0328:/KataGo/cpp$ make
[ 0%] Generating program/gitinfoupdated.h
[ 1%] Building CXX object CMakeFiles/katago.dir/core/global.cpp.o
[ 2%] Building CXX object CMakeFiles/katago.dir/core/base64.cpp.o
[ 3%] Building CXX object CMakeFiles/katago.dir/core/bsearch.cpp.o
[ 4%] Building CXX object CMakeFiles/katago.dir/core/commandloop.cpp.o
[ 5%] Building CXX object CMakeFiles/katago.dir/core/config_parser.cpp.o
[ 6%] Building CXX object CMakeFiles/katago.dir/core/datetime.cpp.o
[ 7%] Building CXX object CMakeFiles/katago.dir/core/elo.cpp.o
[ 7%] Building CXX object CMakeFiles/katago.dir/core/fancymath.cpp.o
[ 8%] Building CXX object CMakeFiles/katago.dir/core/fileutils.cpp.o
[ 9%] Building CXX object CMakeFiles/katago.dir/core/hash.cpp.o
[ 10%] Building CXX object CMakeFiles/katago.dir/core/logger.cpp.o
[ 11%] Building CXX object CMakeFiles/katago.dir/core/mainargs.cpp.o
[ 12%] Building CXX object CMakeFiles/katago.dir/core/makedir.cpp.o
[ 13%] Building CXX object CMakeFiles/katago.dir/core/md5.cpp.o
[ 14%] Building CXX object CMakeFiles/katago.dir/core/multithread.cpp.o
[ 15%] Building CXX object CMakeFiles/katago.dir/core/rand.cpp.o
[ 15%] Building CXX object CMakeFiles/katago.dir/core/rand_helpers.cpp.o
[ 16%] Building CXX object CMakeFiles/katago.dir/core/sha2.cpp.o
[ 17%] Building CXX object CMakeFiles/katago.dir/core/test.cpp.o
[ 18%] Building CXX object CMakeFiles/katago.dir/core/threadsafecounter.cpp.o
[ 19%] Building CXX object CMakeFiles/katago.dir/core/threadsafequeue.cpp.o
[ 20%] Building CXX object CMakeFiles/katago.dir/core/threadtest.cpp.o
[ 21%] Building CXX object CMakeFiles/katago.dir/core/timer.cpp.o
[ 22%] Building CXX object CMakeFiles/katago.dir/game/board.cpp.o
[ 23%] Building CXX object CMakeFiles/katago.dir/game/rules.cpp.o
[ 23%] Building CXX object CMakeFiles/katago.dir/game/boardhistory.cpp.o
[ 24%] Building CXX object CMakeFiles/katago.dir/game/graphhash.cpp.o
[ 25%] Building CXX object CMakeFiles/katago.dir/dataio/sgf.cpp.o
[ 26%] Building CXX object CMakeFiles/katago.dir/dataio/numpywrite.cpp.o
[ 27%] Building CXX object CMakeFiles/katago.dir/dataio/trainingwrite.cpp.o
[ 28%] Building CXX object CMakeFiles/katago.dir/dataio/loadmodel.cpp.o
[ 29%] Building CXX object CMakeFiles/katago.dir/dataio/homedata.cpp.o
[ 30%] Building CXX object CMakeFiles/katago.dir/dataio/files.cpp.o
[ 30%] Building CXX object CMakeFiles/katago.dir/neuralnet/nninputs.cpp.o
[ 31%] Building CXX object CMakeFiles/katago.dir/neuralnet/modelversion.cpp.o
[ 32%] Building CXX object CMakeFiles/katago.dir/neuralnet/nneval.cpp.o
[ 33%] Building CXX object CMakeFiles/katago.dir/neuralnet/desc.cpp.o
[ 34%] Building CXX object CMakeFiles/katago.dir/neuralnet/cudabackend.cpp.o
[ 35%] Building CXX object CMakeFiles/katago.dir/neuralnet/cudautils.cpp.o
[ 36%] Building CUDA object CMakeFiles/katago.dir/neuralnet/cudahelpers.cu.o
nvcc fatal : Unsupported gpu architecture 'compute_35'
make[2]: *** [CMakeFiles/katago.dir/build.make:629: CMakeFiles/katago.dir/neuralnet/cudahelpers.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/katago.dir/all] Error 2

@lightvector
Owner

What version of CUDA? And what is the output of "nvcc --version"? Normally, compute_35 should automatically be disabled for nvcc versions that do not support it, but it sounds like your cmake is failing to disable it, causing nvcc to complain. If you upgraded CUDA beyond version 11.8, it's possible that you need to delete the CMakeCache.txt and CMakeFiles, so that cmake can re-detect your CUDA version and specify the right architecture list.
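
A minimal sketch of that cleanup, assuming an in-source build in KataGo's cpp/ directory as in the commands earlier in this thread. The function name reset_cmake_cache is hypothetical; it removes only the two names mentioned above.

```shell
# Hypothetical helper: delete cmake's cached state from a build directory
# so that the next cmake run re-detects the CUDA toolkit.
reset_cmake_cache() {
  build_dir="${1:-.}"
  rm -rf -- "$build_dir/CMakeCache.txt" "$build_dir/CMakeFiles"
}

# Typical sequence from the cpp/ directory (left commented out so that
# sourcing this snippet has no side effects):
#   nvcc --version
#   reset_cmake_cache .
#   cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1
#   make
```

Source files and other build outputs are left alone; only the cache file and the CMakeFiles directory are removed.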

@Vincentwei1021

Vincentwei1021 commented Feb 1, 2023

@century3 Hi, did you manage to run with the H100? I would appreciate it a lot if you could share KataGo's benchmark results with the H100 :)

@century3
Author

century3 commented Feb 3, 2023

what's the command to show Katago's benchmark results with H100?

@century3
Author

century3 commented Feb 3, 2023

I hit another issue when trying to contribute with an A2 GPU. Below is my config:
Configuration:
GPU : Nvidia A2
Driver: 530.24
CUDA: 12.1
CUDNN: 8.7.0
OS: Ubuntu20.04
cmake: 3.22.5

After cloning the stable branch with git, I ran:
xxx@xxx:~/KataGo/cpp$ cmake . -DUSE_BACKEND=CUDA -DBUILD_DISTRIBUTED=1
-- Building 'katago' executable for GTP engine and other tools.
-- -DUSE_BACKEND=CUDA, using CUDA backend.
-- The CUDA compiler identification is NVIDIA 12.1.55
CMake Error at /usr/local/share/cmake-3.22/Modules/CMakeDetermineCUDACompiler.cmake:598 (message):
Failed to find a working CUDA architecture.
Call Stack (most recent call first):
CMakeLists.txt:44 (enable_language)

I have copied the cuDNN lib and include files to the /usr/local/cuda/lib64 and /include directories.
The nvcc version is:
xxx@xxx:~/KataGo/cpp$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jan_24_19:16:32_PST_2023
Cuda compilation tools, release 12.1, V12.1.55
Build cuda_12.1.r12.1/compiler.32345990_0

How can I resolve this issue?

@Vincentwei1021

Vincentwei1021 commented Feb 3, 2023

what's the command to show Katago's benchmark results with H100?

@century3 Try:
<katago-engine path> benchmark -model <model-weight-path> -config <config file path> -visits 100000
For the model, you can use the 18b or 60b version.
