
CUDA Toolkit version comparison #160

Closed
OmnipotentEntity opened this issue Mar 10, 2020 · 5 comments

@OmnipotentEntity (Contributor) commented Mar 10, 2020

It is possible to compile KataGo against various versions of the CUDA toolkit. I've tested the following combinations of cuDNN and CUDA, and ran a short performance test (just the default benchmark, really) on each.

The system is an otherwise unused (i.e. not connected to a monitor) RTX 2080 Ti with two Xeons. FP16 and NHWC are both enabled. Tcmalloc was turned off for these tests.
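
For reference, underneath the packaging wrapper posted at the end of this thread, each run is just the stock benchmark command. A rough sketch (not the exact invocation), with placeholder config/model paths and the thread counts from the table below:

```sh
# Rough sketch of one benchmark run; -t takes a comma-separated list of
# thread counts to sweep. Config and model paths are placeholders for
# whatever you have locally.
katago benchmark \
  -config cpp/build/gtp_example.cfg \
  -model g170-b20c256x2-s668214784-d222255714.txt.gz \
  -t 1,2,4,6,8,12,16,24
```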

| CUDA (cuDNN) | Threads | visits/s | nnEvals/s | nnBatches/s | avgBatchSize |
|---|---|---|---|---|---|
| v9.2 (v7.2.1) | 1 | 55.55 | 51.03 | 51.03 | 1.00 |
| v9.2 (v7.2.1) | 2 | 56.96 | 51.60 | 51.59 | 1.00 |
| v9.2 (v7.2.1) | 4 | 109.71 | 101.32 | 50.84 | 1.99 |
| v9.2 (v7.2.1) | 6 | 163.24 | 149.89 | 50.31 | 2.98 |
| v9.2 (v7.2.1) | 8 | 205.00 | 188.57 | 47.63 | 3.96 |
| v9.2 (v7.2.1) | 12 | 281.65 | 259.11 | 44.04 | 5.88 |
| v9.2 (v7.2.1) | 16 | 273.25 | 251.76 | 32.22 | 7.81 |
| v9.2 (v7.2.1) | 24 | 351.13 | 325.87 | 28.29 | 11.52 |
| v10.0 (v7.4.2) | 1 | 223.19 | 204.83 | 204.83 | 1.00 |
| v10.0 (v7.4.2) | 2 | 247.29 | 226.27 | 226.11 | 1.00 |
| v10.0 (v7.4.2) | 4 | 483.16 | 444.59 | 223.05 | 1.99 |
| v10.0 (v7.4.2) | 6 | 707.00 | 650.70 | 218.16 | 2.98 |
| v10.0 (v7.4.2) | 8 | 925.52 | 848.44 | 214.26 | 3.96 |
| v10.0 (v7.4.2) | 12 | 1336.94 | 1219.07 | 206.56 | 5.90 |
| v10.0 (v7.4.2) | 16 | 1300.17 | 1192.64 | 152.19 | 7.84 |
| v10.0 (v7.4.2) | 24 | 1758.34 | 1630.36 | 141.22 | 11.54 |
| v10.1 (v7.6.3) | 1 | 226.77 | 209.87 | 209.87 | 1.00 |
| v10.1 (v7.6.3) | 2 | 252.57 | 232.14 | 232.11 | 1.00 |
| v10.1 (v7.6.3) | 4 | 509.19 | 455.03 | 228.02 | 2.00 |
| v10.1 (v7.6.3) | 6 | 721.62 | 664.70 | 223.21 | 2.98 |
| v10.1 (v7.6.3) | 8 | 943.24 | 872.06 | 219.85 | 3.97 |
| v10.1 (v7.6.3) | 12 | 1334.88 | 1239.58 | 210.52 | 5.89 |
| v10.1 (v7.6.3) | 16 | 1322.15 | 1216.54 | 155.58 | 7.82 |
| v10.1 (v7.6.3) | 24 | 1790.61 | 1650.70 | 142.53 | 11.58 |
| v10.2 (v7.6.5) | 1 | 226.49 | 210.04 | 210.04 | 1.00 |
| v10.2 (v7.6.5) | 2 | 252.34 | 232.30 | 232.27 | 1.00 |
| v10.2 (v7.6.5) | 4 | 490.86 | 453.38 | 227.40 | 1.99 |
| v10.2 (v7.6.5) | 6 | 717.23 | 665.29 | 223.46 | 2.98 |
| v10.2 (v7.6.5) | 8 | 938.63 | 869.89 | 219.83 | 3.96 |
| v10.2 (v7.6.5) | 12 | 1346.18 | 1236.79 | 209.31 | 5.91 |
| v10.2 (v7.6.5) | 16 | 1303.60 | 1217.55 | 155.63 | 7.82 |
| v10.2 (v7.6.5) | 24 | 1773.30 | 1655.87 | 143.07 | 11.57 |

Essentially, the performance of all the v10 tests is the same to within statistical noise; v10.1 and v10.2 show consistently, very slightly higher nnEvals/s than v10.0. Avoid using v9.2.

(visits/s seems consistently slightly lower in v10.2, but I'm guessing that has more to do with lucky/unlucky tree reuse than with raw speed, considering the nnEvals/s are essentially the same.)
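
For a rough sense of scale from the table above: at 24 threads, v10.2 reaches 1655.87 nnEvals/s versus 325.87 on v9.2, roughly a 5x gap, whereas v10.1 and v10.2 sit only about 1-2% above v10.0 at the same thread counts.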

@ez4u

I assume this was version 1.3.3, but what size net were you using?

@OmnipotentEntity (Contributor, Author)

20 block.

@iopq (Contributor) commented Mar 11, 2020

I compiled with TCMalloc and my 2060 can do 1050/s with 16 threads on CUDA 10.0

Can you try a test compiling against TCMalloc?

@OmnipotentEntity (Contributor, Author)

Here's more data, based on more threads, more visits (100k), and multiple trials (10 for each setting). I've included the standard deviations and sorted by nnEvals/s. cuDNN versions are the same as last time.

These are with Tcmalloc enabled.

| CUDA Version | Threads | Mean visits/s | Stdev visits/s | Mean nnEvals/s | Stdev nnEvals/s |
|---|---|---|---|---|---|
| 10.2 | 96 | 3169.628 | 60.0304379831135 | 2182.312 | 6.3243968531043 |
| 10.1 | 96 | 3153.832 | 71.3864129306909 | 2182.109 | 6.10491869087717 |
| 10.0 | 96 | 3105.804 | 42.3049824226152 | 2153.035 | 4.06093653675555 |
| 10.1 | 80 | 3114.952 | 43.0756677384241 | 2142.063 | 6.57008041554028 |
| 10.2 | 80 | 3104.528 | 52.5061500736395 | 2140.212 | 6.07243096113726 |
| 10.0 | 80 | 3052.876 | 52.4754550454041 | 2112.515 | 4.53932018894653 |
| 10.1 | 64 | 2984.377 | 36.6537022686659 | 2074.087 | 6.78727248442092 |
| 10.2 | 64 | 2987.512 | 52.6254078474562 | 2073.265 | 8.2001995233178 |
| 10.0 | 64 | 2952.624 | 45.3938449572186 | 2044.12 | 4.87940114722654 |
| 10.1 | 48 | 2825.34 | 48.8043946108681 | 1962.179 | 3.82764055551489 |
| 10.2 | 48 | 2836.354 | 43.4470771808134 | 1960.93 | 6.48974062142189 |
| 10.0 | 48 | 2799.278 | 40.6205891417423 | 1933.83 | 4.17562503638009 |
| 10.1 | 36 | 2650.222 | 45.495723071271 | 1838.507 | 6.10155362728764 |
| 10.2 | 36 | 2661.018 | 35.4982105339285 | 1838.36 | 6.87156783015671 |
| 10.0 | 36 | 2622.164 | 34.6524506749201 | 1815.047 | 9.35553674450472 |

@OmnipotentEntity (Contributor, Author)

For completeness' sake, this is the exact command line used to generate the data:

```sh
for j in $(seq 1 10); do
  for i in "cudatoolkit_10_0" "cudatoolkit_10_1" "cudatoolkit_10_2"; do
    nix-shell -E 'with import "/home/omnipotententity/work/nixpkgs" { }; runCommand "dummy" { buildInputs = [ (katago.override { cudaSupport = true; cudnn = cudnn_'"$i"'; cudatoolkit = '"$i"'; useTcmalloc = true; }) ]; } ""' \
      --run 'katago benchmark -config ~/work/KataGo/cpp/build/gtp_example.cfg -model ~/work/KataGo/cpp/build/g170-b20c256x2-s668214784-d222255714.txt.gz -v 100000 -t 36,48,64,80,96 &> ~/work/KataGo/test-'"$i"'-'"$j"'.log'
  done
done
```

The nix-shell stuff is specific to my OS/package manager; you'll need a very recent version of nixpkgs (to get cudnn_cudatoolkit_10_2), plus the following PR: NixOS/nixpkgs#82082.
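
Turning the ten per-setting logs into means and standard deviations like those in the table above could look roughly like this. This is a hypothetical post-processing sketch, not part of the original runs; the grep patterns assume the benchmark log prints per-thread-count lines containing "nnEvals/s =", so adjust them to the actual output format.

```sh
# Hypothetical aggregation sketch (assumed log format, adjust as needed):
# pull the nnEvals/s figure for one thread count out of each of the ten logs
# and compute the mean and sample standard deviation with awk.
for i in cudatoolkit_10_0 cudatoolkit_10_1 cudatoolkit_10_2; do
  grep -h 'numSearchThreads = 96' ~/work/KataGo/test-"$i"-*.log \
    | grep -o 'nnEvals/s = [0-9.]*' \
    | awk -v label="$i" '{ n++; x = $3; sum += x; sumsq += x * x }
        END { mean = sum / n;
              sd = sqrt((sumsq - n * mean * mean) / (n - 1));
              printf "%s: mean nnEvals/s = %.3f, stdev = %.3f (n = %d)\n", label, mean, sd, n }'
done
```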
