Performance of 0.17 on V100 (Google Compute): 2000 n/s vs 6000 n/s #2335
Comments
You refer to n/s, which is an unreliable metric. A V100 should get something between 2k and 8k n/s, depending on the board and game. It is better to also measure evals/s, which directly shows how fast the GPU is (whereas n/s describes the combination of CPU and GPU). A V100 with 0.17 is probably around 2k evals/s. As for the maximum possible, I will publish a couple of tables about this in a few weeks.
My V100 on Google Cloud is averaging one game per 108 seconds, or 473 ms/move, over the last ~800 games. Not sure what that translates to in n/s, but it is quite a bit faster than my local GTX 1060 6GB. The script I'm using runs two games simultaneously, and the creator of the script claims it is significantly faster than running one game at a time.
@ozymandias8 It translates to 3400 n/s (which is in the expected range).
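For anyone wanting to redo this conversion: a minimal sketch, assuming self-play runs at a fixed ~1600 visits per move (an assumption about the autogtp settings in use; actual visit limits vary by run), so n/s ≈ visits per move divided by seconds per move.

```python
# Rough conversion from average move time to nodes per second (n/s).
# VISITS_PER_MOVE is an assumed self-play visit limit, not a measured value.
VISITS_PER_MOVE = 1600

def nodes_per_second(ms_per_move: float) -> float:
    """Convert average milliseconds per move to approximate n/s."""
    return VISITS_PER_MOVE / (ms_per_move / 1000.0)

print(round(nodes_per_second(473)))   # ≈3383, i.e. the ~3400 n/s quoted above
print(round(nodes_per_second(2700)))  # ≈593, for the 2700 ms/move report below
```

Under that assumption, 473 ms/move lands in the expected 2k–8k n/s range, while 2700 ms/move would be well below it.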
I get 1600 n/s on a V100. Maybe their 6000 n/s is only for the first move, which is 4x faster because of board symmetries? Or maybe it's actually a 4-GPU machine?
I'm sad that my V100 on Google Cloud is averaging 2700 ms/move over the last ~300 games.
Are you using the script from: |
I tried the master branch script, but an error happened. I just copied & pasted the script...
On Facebook [1] someone mentioned getting an average of 6000 n/s on a V100 using network #220.
I set up just such an instance yesterday (6 vCPUs, 1 V100), using Ubuntu 18.04 and CUDA 10, and am "only" getting around 1750 n/s (without "-t"), or at most 2100 n/s (with "-t 16").
Here is the output of "./leelaz -w best-network.gz":
Using OpenCL batch size of 5
Using 10 thread(s).
RNG seed: 11358463697549930105
Leela Zero 0.17 Copyright (C) 2017-2019 Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.
BLAS Core: built-in Eigen 3.3.7 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 10.1.133
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: Tesla V100-SXM2-16GB
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 418.56
Device speed: 1530 MHz
Device cores: 80 CU
Device score: 1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-SXM2-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.
Tensor Core support: Yes.
OpenCL: using fp16/half or tensor core compute support.
Loaded existing SGEMM tuning.
Wavefront/Warp size: 32
Max workgroup size: 1024
Max workgroup dimensions: 1024 1024 64
Setting max tree size to 3736 MiB and cache size to 415 MiB.
Is there anything I could do to also get 6000 n/s? Do others get that performance as well?
[1] https://www.facebook.com/groups/go.igo.weiqi.baduk/permalink/10157283599366514/