
CUDA Toolkit version comparison #160

Closed
OmnipotentEntity opened this issue Mar 10, 2020 · 5 comments

@OmnipotentEntity (Contributor) commented Mar 10, 2020

It is possible to compile KataGo against various versions of the CUDA toolkit. I've tested the following combinations of cuDNN and CUDA, and ran a short performance test (just the default benchmark, really) on each.

The system is an otherwise unused (i.e. not connected to a monitor) RTX 2080 Ti with two Xeons. FP16 and NHWC are both enabled. Tcmalloc was turned off for these tests.
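
For reference, underneath the packaging wrapper posted at the end of this thread, each run is just the stock benchmark command. A rough sketch (not the exact invocation), with placeholder config/model paths and the thread counts from the table below:

```sh
# Rough sketch of one benchmark run; -t takes a comma-separated list of
# thread counts to sweep. Config and model paths are placeholders for
# whatever you have locally.
katago benchmark \
  -config cpp/build/gtp_example.cfg \
  -model g170-b20c256x2-s668214784-d222255714.txt.gz \
  -t 1,2,4,6,8,12,16,24
```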

| CUDA (cuDNN) | Threads | visits/s | nnEvals/s | nnBatches/s | avgBatchSize |
|---|---|---|---|---|---|
| v9.2 (v7.2.1) | 1 | 55.55 | 51.03 | 51.03 | 1.00 |
| v9.2 (v7.2.1) | 2 | 56.96 | 51.60 | 51.59 | 1.00 |
| v9.2 (v7.2.1) | 4 | 109.71 | 101.32 | 50.84 | 1.99 |
| v9.2 (v7.2.1) | 6 | 163.24 | 149.89 | 50.31 | 2.98 |
| v9.2 (v7.2.1) | 8 | 205.00 | 188.57 | 47.63 | 3.96 |
| v9.2 (v7.2.1) | 12 | 281.65 | 259.11 | 44.04 | 5.88 |
| v9.2 (v7.2.1) | 16 | 273.25 | 251.76 | 32.22 | 7.81 |
| v9.2 (v7.2.1) | 24 | 351.13 | 325.87 | 28.29 | 11.52 |
| v10.0 (v7.4.2) | 1 | 223.19 | 204.83 | 204.83 | 1.00 |
| v10.0 (v7.4.2) | 2 | 247.29 | 226.27 | 226.11 | 1.00 |
| v10.0 (v7.4.2) | 4 | 483.16 | 444.59 | 223.05 | 1.99 |
| v10.0 (v7.4.2) | 6 | 707.00 | 650.70 | 218.16 | 2.98 |
| v10.0 (v7.4.2) | 8 | 925.52 | 848.44 | 214.26 | 3.96 |
| v10.0 (v7.4.2) | 12 | 1336.94 | 1219.07 | 206.56 | 5.90 |
| v10.0 (v7.4.2) | 16 | 1300.17 | 1192.64 | 152.19 | 7.84 |
| v10.0 (v7.4.2) | 24 | 1758.34 | 1630.36 | 141.22 | 11.54 |
| v10.1 (v7.6.3) | 1 | 226.77 | 209.87 | 209.87 | 1.00 |
| v10.1 (v7.6.3) | 2 | 252.57 | 232.14 | 232.11 | 1.00 |
| v10.1 (v7.6.3) | 4 | 509.19 | 455.03 | 228.02 | 2.00 |
| v10.1 (v7.6.3) | 6 | 721.62 | 664.70 | 223.21 | 2.98 |
| v10.1 (v7.6.3) | 8 | 943.24 | 872.06 | 219.85 | 3.97 |
| v10.1 (v7.6.3) | 12 | 1334.88 | 1239.58 | 210.52 | 5.89 |
| v10.1 (v7.6.3) | 16 | 1322.15 | 1216.54 | 155.58 | 7.82 |
| v10.1 (v7.6.3) | 24 | 1790.61 | 1650.70 | 142.53 | 11.58 |
| v10.2 (v7.6.5) | 1 | 226.49 | 210.04 | 210.04 | 1.00 |
| v10.2 (v7.6.5) | 2 | 252.34 | 232.30 | 232.27 | 1.00 |
| v10.2 (v7.6.5) | 4 | 490.86 | 453.38 | 227.40 | 1.99 |
| v10.2 (v7.6.5) | 6 | 717.23 | 665.29 | 223.46 | 2.98 |
| v10.2 (v7.6.5) | 8 | 938.63 | 869.89 | 219.83 | 3.96 |
| v10.2 (v7.6.5) | 12 | 1346.18 | 1236.79 | 209.31 | 5.91 |
| v10.2 (v7.6.5) | 16 | 1303.60 | 1217.55 | 155.63 | 7.82 |
| v10.2 (v7.6.5) | 24 | 1773.30 | 1655.87 | 143.07 | 11.57 |

Essentially, the performance of all the v10 tests is the same to within statistical noise; v10.1 and v10.2 show consistently, very slightly higher nnEvals/s than v10.0. Avoid using v9.2.

(visits/s seems consistently slightly lower in v10.2, but I'm guessing that has more to do with lucky/unlucky tree reuse than with raw speed, considering the nnEvals/s are essentially the same.)
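
For a rough sense of scale from the table above: at 24 threads, v10.2 reaches 1655.87 nnEvals/s versus 325.87 on v9.2, roughly a 5x gap, whereas v10.1 and v10.2 sit only about 1-2% above v10.0 at the same thread counts.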

@ez4u

I assume this was version 1.3.3, but what size net were you using?

@OmnipotentEntity (Contributor, Author)

20 block.

@iopq (Contributor) commented Mar 11, 2020

I compiled with TCMalloc and my 2060 can do 1050/s with 16 threads on CUDA 10.0

Can you try a test compiling against TCMalloc?

@OmnipotentEntity (Contributor, Author)

Here's more data, based on more threads, more visits (100k), and multiple trials (10 for each setting). I've included the standard deviations and sorted by nnEvals/s. cuDNN versions are the same as last time.

These are with Tcmalloc enabled.

| CUDA Version | Threads | Mean visits/s | Stdev visits/s | Mean nnEvals/s | Stdev nnEvals/s |
|---|---|---|---|---|---|
| 10.2 | 96 | 3169.628 | 60.0304379831135 | 2182.312 | 6.3243968531043 |
| 10.1 | 96 | 3153.832 | 71.3864129306909 | 2182.109 | 6.10491869087717 |
| 10.0 | 96 | 3105.804 | 42.3049824226152 | 2153.035 | 4.06093653675555 |
| 10.1 | 80 | 3114.952 | 43.0756677384241 | 2142.063 | 6.57008041554028 |
| 10.2 | 80 | 3104.528 | 52.5061500736395 | 2140.212 | 6.07243096113726 |
| 10.0 | 80 | 3052.876 | 52.4754550454041 | 2112.515 | 4.53932018894653 |
| 10.1 | 64 | 2984.377 | 36.6537022686659 | 2074.087 | 6.78727248442092 |
| 10.2 | 64 | 2987.512 | 52.6254078474562 | 2073.265 | 8.2001995233178 |
| 10.0 | 64 | 2952.624 | 45.3938449572186 | 2044.12 | 4.87940114722654 |
| 10.1 | 48 | 2825.34 | 48.8043946108681 | 1962.179 | 3.82764055551489 |
| 10.2 | 48 | 2836.354 | 43.4470771808134 | 1960.93 | 6.48974062142189 |
| 10.0 | 48 | 2799.278 | 40.6205891417423 | 1933.83 | 4.17562503638009 |
| 10.1 | 36 | 2650.222 | 45.495723071271 | 1838.507 | 6.10155362728764 |
| 10.2 | 36 | 2661.018 | 35.4982105339285 | 1838.36 | 6.87156783015671 |
| 10.0 | 36 | 2622.164 | 34.6524506749201 | 1815.047 | 9.35553674450472 |

@OmnipotentEntity (Contributor, Author)

For completeness' sake, this is the exact command line used to generate the data:

```sh
for j in $(seq 1 10); do
  for i in "cudatoolkit_10_0" "cudatoolkit_10_1" "cudatoolkit_10_2"; do
    nix-shell -E 'with import "/home/omnipotententity/work/nixpkgs" { }; runCommand "dummy" { buildInputs = [ (katago.override { cudaSupport = true; cudnn = cudnn_'"$i"'; cudatoolkit = '"$i"'; useTcmalloc = true; }) ]; } ""' \
      --run 'katago benchmark -config ~/work/KataGo/cpp/build/gtp_example.cfg -model ~/work/KataGo/cpp/build/g170-b20c256x2-s668214784-d222255714.txt.gz -v 100000 -t 36,48,64,80,96 &> ~/work/KataGo/test-'"$i"'-'"$j"'.log'
  done
done
```

The nix-shell stuff is specific to my OS/package manager; you'll need a very recent version of nixpkgs (to get cudnn_cudatoolkit_10_2), plus the following PR: NixOS/nixpkgs#82082.
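
Turning the ten per-setting logs into means and standard deviations like those in the table above could look roughly like this. This is a hypothetical post-processing sketch, not part of the original runs; the grep patterns assume the benchmark log prints per-thread-count lines containing "nnEvals/s =", so adjust them to the actual output format.

```sh
# Hypothetical aggregation sketch (assumed log format, adjust as needed):
# pull the nnEvals/s figure for one thread count out of each of the ten logs
# and compute the mean and sample standard deviation with awk.
for i in cudatoolkit_10_0 cudatoolkit_10_1 cudatoolkit_10_2; do
  grep -h 'numSearchThreads = 96' ~/work/KataGo/test-"$i"-*.log \
    | grep -o 'nnEvals/s = [0-9.]*' \
    | awk -v label="$i" '{ n++; x = $3; sum += x; sumsq += x * x }
        END { mean = sum / n;
              sd = sqrt((sumsq - n * mean * mean) / (n - 1));
              printf "%s: mean nnEvals/s = %.3f, stdev = %.3f (n = %d)\n", label, mean, sd, n }'
done
```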
