Why is the time cost of the OpenCL version less than that of the CUDA version? #48

l1t1 · 2019-07-19T05:56:31Z

D:\KataGo-1.1\katago12bc = 1.2-beta CUDA
D:\KataGo12b\katago = 1.2-beta OpenCL

# Black: KataGo
# BlackCommand: D:\KataGo-1.1\katago12bc gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg
# BlackLabel: KataGo:1.2-beta
# BlackVersion: 1.2-beta
# Date: July 19, 2019 1:33:24 PM CST
# Host: PC
# Komi: 7.5
# Referee: -
# Size: 19
# White: KataGo
# WhiteCommand: D:\KataGo12b\katago gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg
# WhiteLabel: KataGo:1.2-beta
# WhiteVersion: 1.2-beta
# Xml: 0
#
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	302	654	522	0	0	0	
1	B+R	B+R	B+R	0	-	247	523.1	438	0	0	0	
2	B+R	B+R	B+R	0	-	175	485.8	390	0	0	0	
3	B+R	B+R	B+R	0	-	281	689.5	557.9	0	0	0	
4	B+R	B+R	B+R	0	-	225	488.8	398.5	0	0	0	
5	B+R	B+R	B+R	0	-	253	586.4	514.7	0	0	0	
6	B+R	B+R	B+R	0	-	215	472.5	395.5	0	0	0	
7	W+R	W+R	W+R	0	-	208	431.5	374.8	0	0	0	
8	W+R	W+R	W+R	0	-	218	524.6	426.9	0	0	0	
9	W+R	W+R	W+R	0	-	156	305.7	258	0	0	0
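To compare the two builds over the whole match rather than game by game, the TIME_B and TIME_W columns can be summed. The following is a minimal sketch, not part of gogui-twogtp: `parse_twogtp_dat` and `total_times` are hypothetical helper names, the rows are copied from the table above, and the parser assumes whitespace-separated columns with an empty ERR_MSG. Per the commands in the header, Black is the CUDA build and White is the OpenCL build.

```python
def parse_twogtp_dat(text):
    """Return one dict per game from gogui-twogtp result text.

    Assumption: columns are whitespace-separated and ERR_MSG (the last
    column) is empty, so zip() simply drops it for data rows.
    """
    header, games = None, []
    for line in text.splitlines():
        if line.startswith("#GAME"):
            header = line.lstrip("#").split()
        elif line.strip() and not line.startswith("#") and header:
            games.append(dict(zip(header, line.split())))
    return games


def total_times(games):
    """Sum the wall-clock seconds used by Black and by White."""
    black = sum(float(g["TIME_B"]) for g in games)
    white = sum(float(g["TIME_W"]) for g in games)
    return black, white


# Rows copied from the table above (Black = CUDA, White = OpenCL).
SAMPLE = """\
#GAME RES_B RES_W RES_R ALT DUP LEN TIME_B TIME_W CPU_B CPU_W ERR ERR_MSG
0 W+R W+R W+R 0 - 302 654 522 0 0 0
1 B+R B+R B+R 0 - 247 523.1 438 0 0 0
2 B+R B+R B+R 0 - 175 485.8 390 0 0 0
3 B+R B+R B+R 0 - 281 689.5 557.9 0 0 0
4 B+R B+R B+R 0 - 225 488.8 398.5 0 0 0
5 B+R B+R B+R 0 - 253 586.4 514.7 0 0 0
6 B+R B+R B+R 0 - 215 472.5 395.5 0 0 0
7 W+R W+R W+R 0 - 208 431.5 374.8 0 0 0
8 W+R W+R W+R 0 - 218 524.6 426.9 0 0 0
9 W+R W+R W+R 0 - 156 305.7 258 0 0 0
"""

games = parse_twogtp_dat(SAMPLE)
cuda_total, opencl_total = total_times(games)
print(f"CUDA (Black):   {cuda_total:.1f} s")
print(f"OpenCL (White): {opencl_total:.1f} s")
print(f"CUDA / OpenCL:  {cuda_total / opencl_total:.2f}")
```

The ratio of the two totals gives a single number for the match, which is less sensitive to one unusually long game than eyeballing individual rows.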


l1t1 · 2019-07-19T06:24:34Z

D:\tool\go_gui\gogui-twogtp -white "D:\KataGo12b\katago gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg" -black "D:\KataGo-1.1\katago12bc gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg" -games 20 -sgffile katagocl_cuda -auto -komi 7.5

lightvector · 2019-07-20T19:07:37Z

How many searchThreads are you using for each?

l1t1 · 2019-07-20T19:09:45Z

They shared one config, both with 1000 searchThreads.

lightvector · 2019-07-20T19:12:59Z

Ummm... that's a pretty absurd number of search threads. I would not expect good play using that many threads, nor do I have any idea how to think about how performance behaves with that many threads....

l1t1 · 2019-07-20T19:16:06Z

My mistake, it was 1 thread. Max visits was 1000.

lightvector · 2019-07-20T19:35:43Z

Well, 1 thread is probably too few. You can get much better performance on both by increasing threads (the best number of threads for each one may be different).

It's possible that OpenCL will be faster than CUDA, or it's possible that CUDA will be faster than OpenCL, but 1 thread is too few to find out for sure.

l1t1 · 2019-07-22T01:16:06Z

I am testing the 1.2 final version with searchThreads 3 (my CPU has 4 threads):
D:\tool\go_gui\gogui-twogtp -white "D:\KataGo-1.1\katago12_cuda gtp -model d:\model696.txt -config D:\gtp_example.cfg" -black "D:\KataGo-1.1\katago12_opencl gtp -model d:\model696.txt -config D:\gtp_example.cfg" -games 20 -sgffile katago12_cuda_opencl -auto -komi 7.5
I found that most moves show an average batch size of about 1.5, such as:

Root visits: 1002
NN rows: 629
NN batches: 419
NN avg batch size: 1.50119
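The reported average looks like NN rows divided by NN batches. That relationship is inferred from the numbers in this thread, not confirmed from the KataGo source, so treat the check below as a sketch; `parse_stats` is a hypothetical helper for the "key: value" lines shown above.

```python
def parse_stats(text):
    """Parse 'key: value' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = float(value)
    return stats


# Stats copied from the output above (searchThreads 3).
STATS = """\
Root visits: 1002
NN rows: 629
NN batches: 419
NN avg batch size: 1.50119
"""

stats = parse_stats(STATS)
# Assumption: average batch size = positions evaluated / batches sent.
recomputed = stats["NN rows"] / stats["NN batches"]
print(f"reported:   {stats['NN avg batch size']}")
print(f"recomputed: {recomputed:.5f}")
```

If the assumption holds, a small average batch size means most batches carry only one or two positions.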

I also find that the OpenCL version is faster than CUDA.
searchThreads 3

#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	230	378.3	402.8	0	0	0
1	W+R	W+R	W+R	0	-	158	255.9	282.7	0	0	0
2	B+R	B+R	B+R	0	-	181	306.5	324.6	0	0	0

searchThreads 4

Root visits: 1003
NN rows: 200
NN batches: 100
NN avg batch size: 2
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	228	381.3	434.7	0	0	0	
1	W+R	W+R	W+R	0	-	178	265.8	282.3	0	0	0
2	B+R	B+R	B+R	0	-	217	319.4	356.1	0	0	0	
3	B+R	B+R	B+R	0	-	207	362.7	422.4	0	0	0

I also find that the memory used by CUDA (800M) is more than twice that of OpenCL (320M) in Windows Task Manager.
searchThreads 6

Root visits: 1005
NN rows: 815
NN batches: 268
NN avg batch size: 3.04104
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	B+R	B+R	B+R	0	-	265	410.9	442.7	0	0	0

searchThreads 8

Root visits: 1007
NN rows: 915
NN batches: 204
NN avg batch size: 4.48529
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	B+R	B+R	B+R	0	-	345	469.5	534.3	0	0	0	
1	W+R	W+R	W+R	0	-	334	455.2	478.2	0	0	0	
2	W+R	W+R	W+R	0	-	174	281	310.7	0	0	0

The game slows down if there are too many threads.
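Total game time also grows with game length, and the games above range from 158 to 345 moves, so it may help to normalize by LEN before comparing thread counts. A minimal sketch, using the rows from the searchThreads 3, 4, 6 and 8 tables above; `seconds_per_move` is a hypothetical helper, and LEN counts the moves of both sides, so the result is one engine's seconds per game move rather than per own move. Per the gogui-twogtp command above, Black is the OpenCL build and White is the CUDA build.

```python
# (LEN, TIME_B, TIME_W) rows copied from the tables above, keyed by searchThreads.
RUNS = {
    3: [(230, 378.3, 402.8), (158, 255.9, 282.7), (181, 306.5, 324.6)],
    4: [(228, 381.3, 434.7), (178, 265.8, 282.3),
        (217, 319.4, 356.1), (207, 362.7, 422.4)],
    6: [(265, 410.9, 442.7)],
    8: [(345, 469.5, 534.3), (334, 455.2, 478.2), (174, 281.0, 310.7)],
}


def seconds_per_move(rows):
    """Return (Black, White) seconds per game move, pooled over all rows."""
    total_len = sum(length for length, _, _ in rows)
    black = sum(time_b for _, time_b, _ in rows) / total_len
    white = sum(time_w for _, _, time_w in rows) / total_len
    return black, white


per_move = {threads: seconds_per_move(rows) for threads, rows in RUNS.items()}
for threads, (black, white) in sorted(per_move.items()):
    print(f"searchThreads {threads}: "
          f"OpenCL (B) {black:.3f} s/move, CUDA (W) {white:.3f} s/move")
```

With only one to four games per setting the sample is small, so any trend in these figures is indicative at best.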

I used a 10x128 model, same result.
searchThreads 5

Root visits: 1004
NN rows: 508
NN batches: 203
NN avg batch size: 2.50246
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	298	966.7	1001.2	0	0	0	
1	W+R	W+R	W+R	0	-	174	552.4	554.3	0	0	0

192x15 model, the time is nearly equal.
searchThreads 5

Root visits: 1004
NN rows: 498
NN batches: 199
NN avg batch size: 2.50251
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	208	1759.1	1714.1	0	0	0	
1	B+R	B+R	B+R	0	-	171	1601.8	1579	0	0	0	
2	B+R	B+R	B+R	0	-	133	1070.7	1150.9	0	0	0	
3	B+R	B+R	B+R	0	-	167	1507.1	1503.5	0	0	0	
4	B+R	B+R	B+R	0	-	289	2417.3	2449.2	0	0	0	
5	B+R	B+R	B+R	0	-	143	1334.8	1348	0	0	0	
6	W+R	W+R	W+R	0	-	262	2101.5	1997.4	0	0	0	
7	B+R	B+R	B+R	0	-	209	1752.7	1669.9	0	0	0	
8	B+R	B+R	B+R	0	-	173	1411.9	1370.3	0	0	0	
9	W+R	W+R	W+R	0	-	202	1811.8	1754.5	0	0	0	
10	W+R	W+R	W+R	0	-	118	1126.6	1100	0	0	0	
11	B+R	B+R	B+R	0	-	211	1808.9	1799.6	0	0	0	
12	W+R	W+R	W+R	0	-	206	1806.8	1812.2	0	0	0	
13	W+R	W+R	W+R	0	-	198	1898.4	1897.2	0	0	0	
14	B+R	B+R	B+R	0	-	161	1456.9	1485.9	0	0	0	
15	W+R	W+R	W+R	0	-	170	1435.4	1436	0	0	0	
16	W+R	W+R	W+R	0	-	164	1374.9	1369.2	0	0	0

256x20 model, the time is nearly equal.
searchThreads 5

Root visits: 1004
NN rows: 736
NN batches: 292
NN avg batch size: 2.52055
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	220	3865.1	3291.6	0	0	0	
1	B+R	B+R	B+R	0	-	293	5301.3	4823.3	0	0	0
2	W+R	W+R	W+R	0	-	174	4224	3461	0	0	0	
3	B+R	B+R	B+R	0	-	245	4427.6	3943.2	0	0	0	
4	B+R	B+R	B+R	0	-	207	3126.4	2611.7	0	0	0	
5	B+R	B+R	B+R	0	-	157	2905.8	2577.5	0	0	0	
6	B+R	B+R	B+R	0	-	223	3983.2	3412.9	0	0	0	
7	B+R	B+R	B+R	0	-	203	3735.4	3407.1	0	0	0	
8	W+R	W+R	W+R	0	-	248	4551.7	3983	0	0	0	
9	W+R	W+R	W+R	0	-	266	5371.2	4635	0	0	0	
10	B+R	B+R	B+R	0	-	273	4570.7	4022.7	0	0	0

l1t1 · 2019-07-22T03:05:36Z

It seems there is no speedup when threads > 2. Maybe it is limited by my CPU?

l1t1 · 2019-07-23T05:04:36Z

The CUDA version is faster with the 20x256 model.

lightvector · 2019-12-17T04:37:12Z

Going ahead and closing this due to lack of new activity, feel free to comment back if you think there's anything to add or reopen. I don't think it's too surprising that OpenCL can be faster than CUDA. I might spend some time optimizing one or both a little more before the next release. Thanks for posting your stats and results!

l1t1 mentioned this issue Jul 25, 2019

suggest to improve the speed of katago #56

Closed

lightvector closed this as completed Dec 17, 2019
