Run Leela (OpenCL Acceleration) on Android #1679

Open
uestccokey opened this issue Aug 1, 2018 · 80 comments

@uestccokey
commented Aug 1, 2018

I compiled Leela with OpenCL acceleration support, and it runs on Android.

My test phone is XiaoMi 4:
CPU Snapdragon801(ARMv7a)
GPU Adreno330
Android 6.0.1

But when Leela starts, I get an error right after the tuner finishes, as follows:

1|root@cancro:/data/local # ./leela -g -w 57.gz
./leela -g -w 57.gz
Using 2 thread(s).
RNG seed: 14892193533682884522
Detecting residual layers...v1...64 channels...5 blocks.
Initializing OpenCL (autodetect precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 QUALCOMM build: commit #ec93b97 changeid #I48a9d37399 Date: 11/17/16 Thu Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.BF.1.1.3_RB1.06.00.01.181.013
Platform profile: EMBEDDED_PROFILE
Platform name:    QUALCOMM Snapdragon(TM)
Platform vendor:  QUALCOMM
Device ID:     0
Device name:   QUALCOMM Adreno(TM)
Device type:   GPU
Device vendor: QUALCOMM
Device driver: OpenCL 1.2 QUALCOMM build: commit #ec93b97 changeid #I48a9d37399 Date: 11/17/16 Thu Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.BF.1.1.3_RB1.06.00.01.181.013 Compiler E031.29.00.00
Device speed:  1 MHz
Device cores:  4 CU
Device score:  112
Selected platform: QUALCOMM Snapdragon(TM)
Selected device: QUALCOMM Adreno(TM)
with OpenCL 1.2 capability.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.
(1/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 3.9311 ms (1.9 GFLOPS)
(2/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 2.3685 ms (3.1 GFLOPS)
(3/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.8636 ms (4.0 GFLOPS)
(5/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.6544 ms (4.5 GFLOPS)
(6/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.6239 ms (4.5 GFLOPS)
(14/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.3377 ms (5.5 GFLOPS)
(66/290) KWG=16 KWI=8 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.3036 ms (5.7 GFLOPS)
(131/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 1.0925 ms (6.7 GFLOPS)
Wavefront/Warp size: 128
Max workgroup size: 512
Max workgroup dimensions: 512 512 512 
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 QUALCOMM build: commit #ec93b97 changeid #I48a9d37399 Date: 11/17/16 Thu Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.BF.1.1.3_RB1.06.00.01.181.013
Platform profile: EMBEDDED_PROFILE
Platform name:    QUALCOMM Snapdragon(TM)
Platform vendor:  QUALCOMM
Device ID:     0
Device name:   QUALCOMM Adreno(TM)
Device type:   GPU
Device vendor: QUALCOMM
Device driver: OpenCL 1.2 QUALCOMM build: commit #ec93b97 changeid #I48a9d37399 Date: 11/17/16 Thu Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.BF.1.1.3_RB1.06.00.01.181.013 Compiler E031.29.00.00
Device speed:  1 MHz
Device cores:  4 CU
Device score:  112
Selected platform: QUALCOMM Snapdragon(TM)
Selected device: QUALCOMM Adreno(TM)
with OpenCL 1.2 capability.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.
(1/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 3.5789 ms (2.1 GFLOPS)
(2/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 2.6663 ms (2.8 GFLOPS)
(3/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 2.2868 ms (3.2 GFLOPS)
(4/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 2.2634 ms (3.3 GFLOPS)
(5/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.7092 ms (4.3 GFLOPS)
(66/290) KWG=16 KWI=8 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 1.5332 ms (4.8 GFLOPS)
(125/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 1.3404 ms (5.5 GFLOPS)
(164/290) KWG=32 KWI=8 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 1.1030 ms (6.7 GFLOPS)
Wavefront/Warp size: 128
Max workgroup size: 512
Max workgroup dimensions: 512 512 512 
BLAS Core: ARMV7
Error in convolve3: clEnqueueNDRangeKernel: -54
Error in convolve3: clEnqueueNDRangeKernel: -54
terminating with uncaught exception of type cl::Error: clEnqueueNDRangeKernel
/buildbot/src/android/ndk-release-r17/external/libcxx/../../external/libcxxabi/src/abort_message.cpp:73: abort_message: assertion "terminating with uncaught exception of type cl::Error: clEnqueueNDRangeKernel" failed
[1] + Stopped (signal)     ./leela -g -w 57.gz 
root@cancro:/data/local # 

Can anyone help me?

Here is my compiled Leela binary:
https://drive.google.com/open?id=1sqyenhTNfv-Eshb0yD4LbV0b1ORa8oTO

And here is my app AQGO (with Leela, but without acceleration):
https://drive.google.com/open?id=1mmFpi_ia8SiWoDqINCQfREuw3vXJPB_p
[Screenshot: device-2018-08-01-232346]

@gcp

Member

commented Aug 1, 2018

OpenCL error -54 is:

#define CL_INVALID_WORK_GROUP_SIZE -54
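
For reading logs like the one above, a small lookup from numeric status to symbolic name can help. This is only a sketch: `cl_error_name` is a hypothetical helper, not something in the Leela Zero sources, and it covers just a handful of codes whose values come from the standard `CL/cl.h` header.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Leela Zero): maps a few OpenCL status
// codes that an enqueue call may report to their names in CL/cl.h.
inline std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -48: return "CL_INVALID_KERNEL";
        case -52: return "CL_INVALID_KERNEL_ARGS";
        case -53: return "CL_INVALID_WORK_DIMENSION";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        case -55: return "CL_INVALID_WORK_ITEM_SIZE";
        case -56: return "CL_INVALID_GLOBAL_OFFSET";
        // Anything not listed is reported with its raw number.
        default:  return "unknown OpenCL error " + std::to_string(code);
    }
}
```

Printing `cl_error_name(e.err())` next to the raw number in the "Error in convolve3" message would make the failure readable without looking up the header.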

@gcp

Member

commented Aug 1, 2018

If you can put try/catch blocks around each call to queue.enqueueNDRangeKernel in convolve3, that might help narrow down the exact kernel invocation and parameters that are problematic.
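
A rough sketch of that idea follows. To keep it compilable without the OpenCL headers, `StubClError` is a stand-in for `cl::Error` (which likewise exposes `err()` and `what()`), and `try_enqueue` is a hypothetical wrapper name; neither exists in the Leela Zero sources. In real code the lambda would contain the actual `queue.enqueueNDRangeKernel(...)` call and the catch clause would name `cl::Error`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <string>

// Stand-in for cl::Error so this sketch builds without OpenCL headers.
class StubClError : public std::exception {
public:
    StubClError(int err, const char* what) : m_err(err), m_what(what) {}
    int err() const { return m_err; }
    const char* what() const noexcept override { return m_what; }
private:
    int m_err;
    const char* m_what;
};

// Runs one enqueue step. On failure, returns a message naming the step,
// the failing call and the status code; on success, returns an empty string.
inline std::string try_enqueue(const std::string& step,
                               const std::function<void()>& enqueue) {
    try {
        enqueue();
        return "";
    } catch (const StubClError& e) {
        return "Error in " + step + ": " + e.what() + ": "
               + std::to_string(e.err());
    }
}
```

Wrapping each kernel launch in convolve3 with its own step label (for example `"out_transform_fused_bn_in"`) would show which launch returns -54, and the global and local sizes could be appended to the message in the same way.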

@Ttl

Member

commented Aug 1, 2018

I suspect that the issue is with the out_transform_fused_bn_in kernel trying to use a local size too large to fit on the device. The wavefront size in the logs is 128, so it's trying to use a (2, 128, 1) local size. The resource use of the kernel can heavily limit the local size on this GPU (https://developer.qualcomm.com/forum/qdn-forums/software/adreno-gpu-sdk/33328).

If the problem is with that kernel these changes should fix it:

diff --git a/src/OpenCL.cpp b/src/OpenCL.cpp
index f22278a..0b5d478 100644
--- a/src/OpenCL.cpp
+++ b/src/OpenCL.cpp
@@ -207,9 +207,6 @@ void OpenCL_Network<net_t>::forward(const std::vector<float>& input,
             auto conv_weights = begin(layer.weights);
             auto bn_weights = begin(layer.weights) + 1;
             auto skip_next_in_trans = false;
-            if (niter->is_residual_block) {
-                skip_next_in_trans = true;
-            }
 
             convolve3(opencl_context,
                      layer.channels,
@@ -242,13 +239,10 @@ void OpenCL_Network<net_t>::forward(const std::vector<float>& input,
                       conv1_weights,
                       nullptr,
                       bn1_weights,
-                      skip_in_trans, true, false,
+                      skip_in_trans, false, false,
                       batch_size);
 
             auto skip_next_in_trans = false;
-            if (niter->is_residual_block) {
-                skip_next_in_trans = true;
-            }
             convolve3(opencl_context,
                       layer.channels,
                       layer.outputs,
@@ -259,7 +253,7 @@ void OpenCL_Network<net_t>::forward(const std::vector<float>& input,
                       conv2_weights,
                       &inBuffer,
                       bn2_weights,
-                      true, skip_next_in_trans, true,
+                      false, skip_next_in_trans, true,
                       batch_size);
             skip_in_trans = skip_next_in_trans;
         } else {
@uestccokey

Author

commented Aug 1, 2018

Thanks, I will test it tomorrow. @Ttl @gcp

@gcp

Member

commented Aug 1, 2018

So the underlying issue is that it is running out of register/local RAM to store the working area of even a single local group? Probably no easy way to detect this then. I noticed we don't check m_max_workgroup_size when computing the kernel sizes, but 2 * 128 < 512 anyway.
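
One query that may be relevant here is `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE`, which reports a per-kernel limit that can be lower than the device-wide maximum. Whether the Adreno driver reports a useful value for this kernel is untested. The sketch below only shows the arithmetic of shrinking a local size to such a limit; `LocalSize`, `work_items` and `shrink_to_fit` are hypothetical names, and the limit is passed in as a plain number rather than queried.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using LocalSize = std::array<std::size_t, 3>;

// Total number of work-items in one work-group.
inline std::size_t work_items(const LocalSize& local) {
    return local[0] * local[1] * local[2];
}

// Halves the largest dimension until the work-group fits within kernel_max.
// kernel_max is assumed to be >= 1 and to come from a per-kernel query.
inline LocalSize shrink_to_fit(LocalSize local, std::size_t kernel_max) {
    assert(kernel_max >= 1);
    while (work_items(local) > kernel_max) {
        std::size_t largest = 0;
        for (std::size_t i = 1; i < 3; ++i) {
            if (local[i] > local[largest]) largest = i;
        }
        if (local[largest] <= 1) break;  // nothing left to shrink
        local[largest] /= 2;
    }
    return local;
}
```

With the (2, 128, 1) local size from this thread, a limit of 512 leaves the size unchanged, while a limit of 128 would give (2, 64, 1). A kernel written for a fixed local size would also need its global size and indexing checked after such a change, since OpenCL 1.2 requires the global size to be a multiple of the local size.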

@l1t1

commented Aug 1, 2018

Is AQGO your work? That is nice.

@uestccokey

Author

commented Aug 4, 2018

I've tried it, but it's still not working. Same error: "Error in convolve3: clEnqueueNDRangeKernel: -54"

@22nsuk

commented Aug 8, 2018

Sorry for another question. Can you also upload the latest English version?

@uestccokey

Author

commented Aug 8, 2018

@22nsuk That version supports multiple languages. You should switch the phone's language to English.

@22nsuk

commented Aug 8, 2018

@uestccokey

Author

commented Aug 15, 2018

AQGO has been updated!
1.2.1.7

@l1t1

commented Aug 15, 2018

Looking forward to the new version of AQ: ymgaq/AQ#103

@uestccokey

Author

commented Aug 20, 2018

AQGO has been updated!

  1. Fix many bugs
  2. Support more randomness in the first x moves
  3. The playout limit can now be set from 1 to 3200

1.2.1.9

@uestccokey

Author

commented Aug 31, 2018

AQGO has been updated!

  1. Support Leela analysis
  2. Fix many bugs

1.2.2.1

@uestccokey

Author

commented Sep 8, 2018

AQGO has been updated!

  1. Support Leela tewari
  2. Update Leela engine
  3. Bug fix

1.2.2.2

@uestccokey

Author

commented Sep 16, 2018

AQGO has been updated!

  1. Leela analysis supports variation diagram
  2. Leela watch can see black and white logs
  3. Leela engine updated to latest

1.2.2.4

@l1t1

commented Sep 16, 2018

Is "latest" the next stable version?

What is the okhttp3 directory in the APK used for?

@uestccokey

Author

commented Sep 16, 2018

Yes.
OkHttp3 is a framework for making network requests. It should be a folder that the SDK created by itself.

@uestccokey

Author

commented Sep 25, 2018

AQGO 1.2.2.7 has been updated!

  1. Fix many bugs
  2. Support more board and stone themes
  3. Support a sound when stones are captured
  4. Leela engine updated to 0918

https://www.reddit.com/r/baduk/comments/9is083/aq_go_v1227_version_has_been_released_9_super/

@uestccokey

Author

commented Nov 21, 2018

Ah Q GO 2.0.0 has been released!

A huge update that contains tons of new features.
Strongly recommended!

http://aqgo.ezandroid.cn/AQGO_2.0.0_Github_2018-11-20.apk

@y-ich

commented Dec 11, 2018

@uestccokey san,

When I released my Go AI iOS app recently, several people in Japan asked me about an Android version.
I told them about your app, and they were glad.
Why don't you release your app on Google Play? Your app will reach more people.

And I have a question for you.
Qualcomm has announced the Snapdragon 855.
How many playouts per second do you expect for your app on this processor using the 20b-224c weights (ELF)?

Thanks and cheers.

@uestccokey

Author

commented Dec 11, 2018

@y-ich

Thank you for your recommendation!

Maybe next spring I will put it on Google Play. Before that, I want the program to be as stable and powerful as possible, and there is still a lot of work to do.

On my Samsung S9 (Qualcomm 845), it gets about 7 playouts per second with ELF.
On the Qualcomm 855, I hope it can reach 10 playouts per second.

@y-ich

commented Dec 11, 2018

@uestccokey san,

I will tell people about your plan.

About the Qualcomm 855:
My app achieved 170 playouts per second on the A12X Bionic,
and the Qualcomm 855 should do more.
The performance of the Neural Engine on the A12X Bionic is 5 TOPS, but that of the AI Engine on the Qualcomm 855 is 7 TOPS.
I recommend that you consider TensorFlow Lite instead of OpenCL.
It will use the Neural Networks API and will automatically enable the AI Engine.
I guess that your app will achieve more than 200 playouts per second.

Cheers.

@uestccokey

Author

commented Dec 11, 2018

@y-ich
170 playouts! Amazing! How fast is it on other CPUs?

Thanks for your suggestion!
In fact, the current Leela engine in Ah Q is a pure CPU version, without hardware acceleration.
I also looked into TensorFlow Lite before, but I don't know how to use it with Leela.

For me, maybe the speed is not the most important, because it is already difficult enough for amateur players.

@uestccokey

Author

commented Dec 11, 2018

@y-ich
I just installed your app on an iPhone 6s.
Some Chinese translations are not appropriate. I can provide some help if you need it.😆
In addition, Ah Q is also going to provide a Japanese version. Can you help me?

@y-ich

commented Dec 11, 2018

@uestccokey san, thank you for purchasing my app!

Some users kindly provided speed information.
iPhone 5s, 1.1 N/s
iPhone 6 Plus, 1.4 N/s
iPhone X, 13-15 N/s
iPhone Xs Max, 130-160 N/s
iPad Air 2, 2.6 N/s

And I agree with you about the speed. The only reason I pursued speed is ladder reading.

If you help me with the Chinese in my app, I will really appreciate it.
I am using Google Translate and other tools, but mixing in Go jargon leads them into a strange world...^^;
My e-mail address is new.three.rs@gmail.com.
Or you can access github issues of my app if you like. https://github.com/new3Rs/a_master_of_go/issues
Please tell me any suggestions when you have time.

And of course, it is my pleasure to help with the Japanese for your app!
Thank you!

@uestccokey

Author

commented Dec 11, 2018

OK, I have already sent you an email.😆

@l1t1

commented Dec 30, 2018

what is the changelog of v0.39?

@l1t1

commented Dec 30, 2018

I downloaded it from http://gridmaster.tengen.nl/temp/GridMasterFree_v0.38.apk and still got a cl::Error in clGetPlatformIDs when I removed --cpu-only

@evdwerf

commented Dec 31, 2018

@l1t1

commented Jan 11, 2019

Is this useful?

624bd7d

@gcp

Member

commented Jan 15, 2019

0.37 does not have the option to completely disable timeouts, which can result in GTP not getting an answer in time on the protocol_version command -- which IMO is a design flaw in LZ, but that's not for this thread

@evdwerf Can you file an issue with some proposals?

There's no explicit point defined in the GTP spec when you should initialize (unlike UCI, which deals exactly with your problem case here). protocol_version seems obvious, but if you were intending to follow that up with name and version (very normal) that will now hang. A GUI probably also wants to lz-setoptions before the engine "hangs on load".

Thinking about it and reading the spec, I think the engine has to load the network at the moment "boardsize" is sent, but that only in GTP mode. "boardsize" seems like the only command the GUI really can't get away with sending. "clear_board" too (at least in GTP 2!), but "boardsize" makes more sense as the networks are per-size anyway.

So delaying the network load till "boardsize" when used in GTP mode might be doable.
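
A minimal sketch of that idea, under loud assumptions: `LazyGtpEngine`, `handle` and `load_network` are hypothetical names, the network load is a placeholder that only records the board size, and only three GTP commands are handled. It is not how Leela Zero is structured; it just shows cheap commands answering immediately while the expensive step waits for "boardsize".

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical GTP front end. load_network() stands in for the expensive
// weight loading and OpenCL tuning step.
class LazyGtpEngine {
public:
    // Handles one GTP command line and returns the response text.
    std::string handle(const std::string& line) {
        std::istringstream in(line);
        std::string cmd;
        in >> cmd;
        // Cheap commands answer at once, before any network is loaded.
        if (cmd == "protocol_version") return "= 2";
        if (cmd == "name") return "= Leela Zero";
        if (cmd == "boardsize") {
            int size = 0;
            if (!(in >> size) || size <= 0) return "? unacceptable size";
            // First "boardsize" (or a change of size) triggers the load.
            if (size != m_loaded_size) load_network(size);
            return "=";
        }
        return "? unknown command";
    }

    bool loaded() const { return m_loaded_size != 0; }

private:
    // Placeholder: a real engine would read weights and run the tuner here.
    void load_network(int size) { m_loaded_size = size; }

    int m_loaded_size = 0;  // 0 means no network loaded yet
};
```

With this shape, a GUI can send protocol_version, name and any option-setting commands without waiting on the tuner, and only the reply to "boardsize" is slow.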

@evdwerf

commented Jan 16, 2019

BTW I compiled new versions for the lz-5x64 and lz9x9 packages on my site for GridMaster. They check an additional location in the search for the OpenCL libs, which might help on some 64-bit Android devices.

@l1t1

commented Jan 16, 2019

I updated my 5x64 package with the new one. Even if I delete --cpu-only, it runs the CPU-only version.

@l1t1

commented Jan 17, 2019

@evdwerf

commented Jan 17, 2019

the screenshot is at https://userscloud.com/9sue229lm8ph

You should delete the old version first (press '-' button in the settings), otherwise it might not install the new one (I didn't change the zip file name).

@l1t1

commented Jan 17, 2019

I did as you said.
When I delete --cpu-only, it still runs the CPU-only version.
The screenshots are at https://userscloud.com/n1xxisgz79d5
https://userscloud.com/6cjufgrg1t2t

@evdwerf

commented Jan 17, 2019

Thanks! At least it's not crashing (which it does on my phone on the first genmove, even though the OpenCL tuner phase passes just fine). I guess it simply does not recognize your hardware and then falls back to cpu-only.

@l1t1

commented Jan 17, 2019

I wonder if it shows CPU-only just like the Windows CPU-only version; see #2136

@l1t1

commented Apr 11, 2019

When will the LZ 0.17 Android engine for GridMaster be released, @evdwerf?

@l1t1

commented Apr 20, 2019

@evdwerf

commented Jun 19, 2019

@iopq

commented Jun 22, 2019

@evdwerf it doesn't have time to do the tuning on my phone. It fails with "failed to provide protocol version" because it times out the first time.

@evdwerf

commented Jun 22, 2019

@iopq

commented Jun 24, 2019

In that case the tuner is failing. It does that halfway through the tuning, with no other error message. Probably the GPU is not powerful enough.

@evdwerf

commented Jun 24, 2019

@iopq

commented Jun 25, 2019

Tuner still crashes with the 5x64 version, no other error message

@iopq

commented Jun 25, 2019

Takes approximately 50 seconds

@evdwerf

commented Jun 30, 2019

I'm surprised you see nothing extra with the 64x5 downloads; this suggests that the crash is different for you than what I'm used to. On my phone (an S10e) without '--cpu-only' it looks like this:

[Screenshot: lzOenClCrash]

The interesting message is the segmentation fault that occurred when it called clCreateKernel. I don't know why it happens. The moment when it happens seems random, but when the function is called often enough, eventually it always happens. I'm hoping some other GPUs don't have this issue... Perhaps recent Adrenos are better?

If you run the engine from the command line in a shell you might see a bit more (for the larger downloads, which don't have additional debug code, I can only see the segfault error if I run directly in a shell).

BTW The red line at the bottom is not interesting; it just occurs because the engine crashed without providing the protocol version.

@iopq

commented Jul 1, 2019

Mine gets way further:
[Screenshot: Screenshot_20190702-003331]

@evdwerf

commented Jul 2, 2019

So it dies for some unknown reason... What type of GPU does your device have @iopq ? You might still see a bit more if you run directly in a shell, but I don't know if you have the tools in place to do that.

I would like to know what happens on devices where OpenCL is known to work with Leelaz. @Grant-Tao, @l1t1, perhaps you could give it a try? My solution should be generic (not requiring a different binary for every GPU), but perhaps Android does not allow this (it wouldn't be the first time one of Google's obscure 'security' features causes problems), or it might just be a problem with some specific GPUs.

@iopq

commented Jul 8, 2019

My device is a OnePlus 3 with an Adreno 530
