why the time cost of opencl version is less than cuda version #48

Closed
l1t1 opened this issue Jul 19, 2019 · 10 comments

@l1t1

l1t1 commented Jul 19, 2019

D:\KataGo-1.1\katago12bc = 1.2-beta CUDA
D:\KataGo12b\katago = 1.2-beta OpenCL

# Black: KataGo
# BlackCommand: D:\KataGo-1.1\katago12bc gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg
# BlackLabel: KataGo:1.2-beta
# BlackVersion: 1.2-beta
# Date: July 19, 2019 1:33:24 PM CST
# Host: PC
# Komi: 7.5
# Referee: -
# Size: 19
# White: KataGo
# WhiteCommand: D:\KataGo12b\katago gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg
# WhiteLabel: KataGo:1.2-beta
# WhiteVersion: 1.2-beta
# Xml: 0
#
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	302	654	522	0	0	0	
1	B+R	B+R	B+R	0	-	247	523.1	438	0	0	0	
2	B+R	B+R	B+R	0	-	175	485.8	390	0	0	0	
3	B+R	B+R	B+R	0	-	281	689.5	557.9	0	0	0	
4	B+R	B+R	B+R	0	-	225	488.8	398.5	0	0	0	
5	B+R	B+R	B+R	0	-	253	586.4	514.7	0	0	0	
6	B+R	B+R	B+R	0	-	215	472.5	395.5	0	0	0	
7	W+R	W+R	W+R	0	-	208	431.5	374.8	0	0	0	
8	W+R	W+R	W+R	0	-	218	524.6	426.9	0	0	0	
9	W+R	W+R	W+R	0	-	156	305.7	258	0	0	0
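
To make the comparison easier to read, the per-side times in a result table like the one above can be totalled. The following is a minimal sketch, assuming the column order from the `#GAME ... TIME_B TIME_W ...` header line (LEN at index 6, TIME_B at index 7, TIME_W at index 8); the function name `summarize` is hypothetical, and the sample holds only the first three rows of the table, where Black is the CUDA build and White is the OpenCL build.

```python
def summarize(result_text):
    """Total the LEN, TIME_B and TIME_W columns of twogtp-style result rows."""
    games = moves = 0
    time_b = time_w = 0.0
    for line in result_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the '#'-prefixed header block
        fields = line.split()  # handles tab- or space-separated rows
        games += 1
        moves += int(fields[6])     # LEN
        time_b += float(fields[7])  # TIME_B
        time_w += float(fields[8])  # TIME_W
    return {"games": games, "moves": moves, "time_b": time_b, "time_w": time_w}


# First three rows of the table above (Black = CUDA, White = OpenCL).
SAMPLE = """\
#GAME RES_B RES_W RES_R ALT DUP LEN TIME_B TIME_W CPU_B CPU_W ERR ERR_MSG
0 W+R W+R W+R 0 - 302 654 522 0 0 0
1 B+R B+R B+R 0 - 247 523.1 438 0 0 0
2 B+R B+R B+R 0 - 175 485.8 390 0 0 0
"""

totals = summarize(SAMPLE)
ratio = totals["time_b"] / totals["time_w"]
print(f"games={totals['games']} moves={totals['moves']} "
      f"TIME_B/TIME_W={ratio:.3f}")
```

A ratio above 1 means Black (CUDA in this run) used more time than White (OpenCL) over the same games.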
@l1t1
Author

l1t1 commented Jul 19, 2019

D:\tool\go_gui\gogui-twogtp -white "D:\KataGo12b\katago gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg" -black "D:\KataGo-1.1\katago12bc gtp -model d:\model696.txt -config D:\KataGo12b\gtp_example.cfg" -games 20 -sgffile katagocl_cuda -auto -komi 7.5

@lightvector
Owner

How many searchThreads are you using for each?

@l1t1
Author

l1t1 commented Jul 20, 2019

They share one config; both use 1000 searchThreads.

@lightvector
Owner

Ummm... that's a pretty absurd number of search threads. I would not expect good play using that many threads, nor do I have any idea how to think about how performance behaves with that many threads....

@l1t1
Author

l1t1 commented Jul 20, 2019

My mistake, it was 1 thread. Max visits was 1000.

@lightvector
Owner

Well, 1 thread is probably too few. You can get much better performance on both by increasing threads (the best number of threads for each one may be different).

It's possible that OpenCL will be faster than CUDA, or it's possible that CUDA will be faster than OpenCL, but 1 thread is too few to find out for sure.

@l1t1
Author

l1t1 commented Jul 22, 2019

I am testing the 1.2 final version with searchThreads 3 (my CPU has 4 threads):
D:\tool\go_gui\gogui-twogtp -white "D:\KataGo-1.1\katago12_cuda gtp -model d:\model696.txt -config D:\gtp_example.cfg" -black "D:\KataGo-1.1\katago12_opencl gtp -model d:\model696.txt -config D:\gtp_example.cfg" -games 20 -sgffile katago12_cuda_opencl -auto -komi 7.5
I found that most moves show an average batch size of about 1.5, for example:

Root visits: 1002
NN rows: 629
NN batches: 419
NN avg batch size: 1.50119
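
The reported average batch size appears to be simply NN rows divided by NN batches. The sketch below checks that reading against the four sets of stats posted in this thread; the helper name `avg_batch_size` is hypothetical and the relationship is inferred from the numbers, not taken from KataGo's source.

```python
# (searchThreads, NN rows, NN batches, reported avg batch size), from this thread
REPORTED = [
    (3, 629, 419, 1.50119),
    (4, 200, 100, 2.0),
    (6, 815, 268, 3.04104),
    (8, 915, 204, 4.48529),
]


def avg_batch_size(nn_rows, nn_batches):
    """Average number of positions evaluated per neural-net batch."""
    return nn_rows / nn_batches


for threads, rows, batches, reported in REPORTED:
    computed = avg_batch_size(rows, batches)
    print(f"searchThreads={threads}: computed {computed:.5f}, reported {reported}")
```

In these logs the average batch size comes out at roughly half of searchThreads.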

I also find that the OpenCL version is faster than CUDA.
searchThreads 3

#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	230	378.3	402.8	0	0	0
1	W+R	W+R	W+R	0	-	158	255.9	282.7	0	0	0
2	B+R	B+R	B+R	0	-	181	306.5	324.6	0	0	0

searchThreads 4

Root visits: 1003
NN rows: 200
NN batches: 100
NN avg batch size: 2
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	228	381.3	434.7	0	0	0	
1	W+R	W+R	W+R	0	-	178	265.8	282.3	0	0	0
2	B+R	B+R	B+R	0	-	217	319.4	356.1	0	0	0	
3	B+R	B+R	B+R	0	-	207	362.7	422.4	0	0	0

I also find that the memory used by CUDA (800M) is about 2.5 times that of OpenCL (320M) in Windows Task Manager.

searchThreads 6

Root visits: 1005
NN rows: 815
NN batches: 268
NN avg batch size: 3.04104
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	B+R	B+R	B+R	0	-	265	410.9	442.7	0	0	0

searchThreads 8

Root visits: 1007
NN rows: 915
NN batches: 204
NN avg batch size: 4.48529
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	B+R	B+R	B+R	0	-	345	469.5	534.3	0	0	0	
1	W+R	W+R	W+R	0	-	334	455.2	478.2	0	0	0	
2	W+R	W+R	W+R	0	-	174	281	310.7	0	0	0

The game slows down if there are too many threads.

I also tried a 10x128 model, with the same result.
searchThreads 5

Root visits: 1004
NN rows: 508
NN batches: 203
NN avg batch size: 2.50246
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	298	966.7	1001.2	0	0	0	
1	W+R	W+R	W+R	0	-	174	552.4	554.3	0	0	0	

192x15 model: the time is nearly equal.
searchThreads 5

Root visits: 1004
NN rows: 498
NN batches: 199
NN avg batch size: 2.50251
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	208	1759.1	1714.1	0	0	0	
1	B+R	B+R	B+R	0	-	171	1601.8	1579	0	0	0	
2	B+R	B+R	B+R	0	-	133	1070.7	1150.9	0	0	0	
3	B+R	B+R	B+R	0	-	167	1507.1	1503.5	0	0	0	
4	B+R	B+R	B+R	0	-	289	2417.3	2449.2	0	0	0	
5	B+R	B+R	B+R	0	-	143	1334.8	1348	0	0	0	
6	W+R	W+R	W+R	0	-	262	2101.5	1997.4	0	0	0	
7	B+R	B+R	B+R	0	-	209	1752.7	1669.9	0	0	0	
8	B+R	B+R	B+R	0	-	173	1411.9	1370.3	0	0	0	
9	W+R	W+R	W+R	0	-	202	1811.8	1754.5	0	0	0	
10	W+R	W+R	W+R	0	-	118	1126.6	1100	0	0	0	
11	B+R	B+R	B+R	0	-	211	1808.9	1799.6	0	0	0	
12	W+R	W+R	W+R	0	-	206	1806.8	1812.2	0	0	0	
13	W+R	W+R	W+R	0	-	198	1898.4	1897.2	0	0	0	
14	B+R	B+R	B+R	0	-	161	1456.9	1485.9	0	0	0	
15	W+R	W+R	W+R	0	-	170	1435.4	1436	0	0	0	
16	W+R	W+R	W+R	0	-	164	1374.9	1369.2	0	0	0	

256x20 model: here the times are no longer equal, and White is consistently faster.
searchThreads 5

Root visits: 1004
NN rows: 736
NN batches: 292
NN avg batch size: 2.52055
#GAME	RES_B	RES_W	RES_R	ALT	DUP	LEN	TIME_B	TIME_W	CPU_B	CPU_W	ERR	ERR_MSG
0	W+R	W+R	W+R	0	-	220	3865.1	3291.6	0	0	0	
1	B+R	B+R	B+R	0	-	293	5301.3	4823.3	0	0	0
2	W+R	W+R	W+R	0	-	174	4224	3461	0	0	0	
3	B+R	B+R	B+R	0	-	245	4427.6	3943.2	0	0	0	
4	B+R	B+R	B+R	0	-	207	3126.4	2611.7	0	0	0	
5	B+R	B+R	B+R	0	-	157	2905.8	2577.5	0	0	0	
6	B+R	B+R	B+R	0	-	223	3983.2	3412.9	0	0	0	
7	B+R	B+R	B+R	0	-	203	3735.4	3407.1	0	0	0	
8	W+R	W+R	W+R	0	-	248	4551.7	3983	0	0	0	
9	W+R	W+R	W+R	0	-	266	5371.2	4635	0	0	0	
10	B+R	B+R	B+R	0	-	273	4570.7	4022.7	0	0	0
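
The difference between the two larger models is easier to see as a per-game ratio of TIME_B to TIME_W. This is a small sketch using the first four games of each of the two tables above; it assumes Black is still the OpenCL build and White the CUDA build, as in the gogui-twogtp command posted earlier in this comment. The name `mean_ratio` is hypothetical.

```python
from statistics import mean

# (TIME_B, TIME_W) for games 0-3 of each table above.
# Assumed sides: Black = OpenCL, White = CUDA.
TIMES = {
    "192x15": [(1759.1, 1714.1), (1601.8, 1579.0), (1070.7, 1150.9), (1507.1, 1503.5)],
    "256x20": [(3865.1, 3291.6), (5301.3, 4823.3), (4224.0, 3461.0), (4427.6, 3943.2)],
}


def mean_ratio(pairs):
    """Mean of TIME_B / TIME_W over games; above 1 means Black was slower."""
    return mean(b / w for b, w in pairs)


for model, pairs in TIMES.items():
    print(f"{model}: mean TIME_B/TIME_W = {mean_ratio(pairs):.3f}")
```

For these samples the 192x15 ratio stays close to 1, while the 256x20 ratio is clearly above 1 in every game.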

@l1t1
Author

l1t1 commented Jul 22, 2019

There seems to be no speedup when searchThreads > 2. Maybe it is limited by my CPU?

@l1t1
Author

l1t1 commented Jul 23, 2019

The CUDA version is faster with the 20x256 model.

@lightvector
Owner

Going ahead and closing this due to lack of new activity; feel free to comment back if you think there's anything to add, or reopen. I don't think it's too surprising that OpenCL can be faster than CUDA. I might spend some time optimizing one or both a little more before the next release. Thanks for posting your stats and results!
