Performance of 0.17 on V100 (Google Compute): 2000 n/s vs 6000 n/s #2335

Open
ghost opened this issue Apr 12, 2019 · 7 comments

Comments

@ghost

ghost commented Apr 12, 2019

On Facebook [1] someone mentioned getting an average of 6000 n/s on a V100 using network #220.

I set up just such an instance yesterday (6 vCPUs, 1 V100), using Ubuntu 18.04 and CUDA 10, and am "only" getting around 1750 n/s (without "-t"), or at most 2100 n/s (using "-t 16").

Here is the output of "./leelaz -w best-network.gz":

Using OpenCL batch size of 5
Using 10 thread(s).
RNG seed: 11358463697549930105
Leela Zero 0.17 Copyright (C) 2017-2019 Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

BLAS Core: built-in Eigen 3.3.7 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 10.1.133
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: Tesla V100-SXM2-16GB
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 418.56
Device speed: 1530 MHz
Device cores: 80 CU
Device score: 1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-SXM2-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.
Tensor Core support: Yes.
OpenCL: using fp16/half or tensor core compute support.
Loaded existing SGEMM tuning.
Wavefront/Warp size: 32
Max workgroup size: 1024
Max workgroup dimensions: 1024 1024 64
Setting max tree size to 3736 MiB and cache size to 415 MiB.

Is there anything I could do to also get 6000 n/s? Do others get that performance as well?

[1] https://www.facebook.com/groups/go.igo.weiqi.baduk/permalink/10157283599366514/

@nerai
Contributor

nerai commented Apr 12, 2019

You refer to n/s, which is an unreliable metric. A V100 should get something between 2k and 8k n/s, depending on the board and game.

It is better to also measure evals/s, which directly shows how fast the GPU is (whereas n/s describes the combination of CPU and GPU). A V100 with 0.17 is probably around 2k evals/s. As for the maximum possible, I will publish a couple of tables about this in a few weeks.

@ozymandias8

ozymandias8 commented Apr 12, 2019

My V100 on Google Cloud is averaging one game per 108 seconds, or 473 ms/move, over the last ~800 games. I'm not sure what that translates to in n/s, but it is quite a bit faster than my local GTX 1060 6GB. The script I'm using runs two games simultaneously, and its creator claims that this is significantly faster than running one game at a time.

@nerai
Contributor

nerai commented Apr 13, 2019

@ozymandias8 It translates to 3400 n/s (which is in the expected range)
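
For anyone who wants to redo this kind of conversion, here is a minimal sketch. The thread does not say how the 3400 n/s figure was derived, so the fixed budget of 1600 visits per move used below is an assumption, chosen only because it gives a value close to the one quoted. Visits and nodes are also treated as interchangeable here, which is an approximation. The function names are made up for this example.

```python
# Assumption: every move uses the same fixed search budget. The value of
# 1600 visits per move is NOT stated in the thread; it is used here only
# because it lands close to the 3400 n/s quoted above.
ASSUMED_VISITS_PER_MOVE = 1600


def nodes_per_second(ms_per_move: float,
                     visits_per_move: int = ASSUMED_VISITS_PER_MOVE) -> float:
    """Convert an average move time (ms) into an approximate n/s figure."""
    if ms_per_move <= 0:
        raise ValueError("ms_per_move must be positive")
    return visits_per_move * 1000.0 / ms_per_move


def moves_per_game(seconds_per_game: float, ms_per_move: float) -> float:
    """Average number of moves per game implied by the two timings."""
    if ms_per_move <= 0:
        raise ValueError("ms_per_move must be positive")
    return seconds_per_game * 1000.0 / ms_per_move


if __name__ == "__main__":
    # Timings reported above: 473 ms/move and 108 s/game.
    print(round(nodes_per_second(473)))
    print(round(moves_per_game(108, 473)))
```

Under that assumption, 473 ms/move works out to roughly 3380 n/s and about 228 moves per game. If the script runs two games at once, how the per-move time should be attributed to each game is not clear from the thread, so treat the result as a rough estimate.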

@zhanzhenzhen
Contributor

I get 1600 n/s on a V100. Maybe their 6000 n/s is only for the first move, which is 4x faster because of board symmetries? Or maybe it's actually a 4-GPU machine?

@lonemonkeywithwhiteshell

Sadly, my V100 on Google Cloud is averaging 2700 ms/move over ~300 games.
I don't know what to do...

@ozymandias8

Are you using the script from #1905?

@lonemonkeywithwhiteshell

I tried the script from the master branch, but an error occurred:
No glanceslib. Setting up glanceslib and all other leela-zero packages.
root@instance-fstbrnc:~# exit
logout
Hit:1 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic InRelease
Hit:2 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:3 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic-backports InRelease
Hit:4 http://archive.canonical.com/ubuntu bionic InRelease
Hit:5 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Hit:6 http://security.ubuntu.com/ubuntu bionic-security InRelease
Reading package lists... Done
Hit:1 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic InRelease
Hit:2 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:3 http://asia-east1.gce.archive.ubuntu.com/ubuntu bionic-backports InRelease
Hit:4 http://archive.canonical.com/ubuntu bionic InRelease
Hit:5 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:6 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Reading package lists... Done
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

I just copied and pasted the script...
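
The two `E:` lines in the log mean that another process was holding `/var/lib/dpkg/lock-frontend` when the script tried to install packages. The thread does not say which process that was; an automatic package update running on a freshly started instance is one possible cause, but that is a guess. A minimal sketch of waiting for the lock to become free before re-running the script could look like the following. It assumes the lock is a POSIX `fcntl` lock (which, as far as I know, is how apt and dpkg lock this file) and that the check is run as a user allowed to open the file for writing, typically root. The helper names `lock_is_free` and `wait_for_lock` are made up for this example.

```python
import errno
import fcntl
import os
import time

# Path taken from the error message in the log above.
DPKG_FRONTEND_LOCK = "/var/lib/dpkg/lock-frontend"


def lock_is_free(path: str) -> bool:
    """Return True if no other process holds an fcntl lock on `path`.

    Assumption: the other process uses POSIX fcntl locks on this file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking attempt; fails at once if someone else holds it.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return False
            raise
        # We only wanted to probe, so release the lock again.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def wait_for_lock(path: str = DPKG_FRONTEND_LOCK,
                  timeout_s: float = 300.0,
                  poll_s: float = 5.0) -> bool:
    """Poll until the lock is free; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if lock_is_free(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    if wait_for_lock():
        print("Lock is free; the setup script can be re-run.")
    else:
        print("Lock is still held; check which process is using it.")
```

This only probes the lock and releases it immediately, so another process could still take it again before the setup script runs; retrying the script after a short wait may be just as effective.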
