
Tengine slow on Raspberry Pi 4. #18002

Closed
Qengineering opened this issue Jul 31, 2020 · 11 comments
Labels: category: dnn, category: 3rdparty, optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc)

Comments


Qengineering commented Jul 31, 2020

System information (version)
  • OpenCV => 4.4
  • Operating System / Platform => Raspberry Pi 4 with Raspberry Pi 64-bit OS
  • Compiler => GCC 8.3.0 (aarch64-linux-gnu)
Detailed description

Tengine was installed on the RPi 4 as described below.
OpenCV 4.4 was installed with the CMake flags listed under "Steps to reproduce".

Two different versions were built, one with and one without Tengine.
Several Caffe models were run with the OpenCV dnn module; they all work like a charm.
However, the execution times with Tengine are longer than without the accelerator.
One would expect the opposite.

Raspberry Pi 4 - 2 GB with Raspberry Pi 64-bit OS
Threads: 4 - Clock: 1500 MHz - Time in ms
Hinted by issue #17562, I tried another test without OpenMP. The results were even worse.

Model | no Tengine | Tengine accelerator on | Tengine on, OpenMP off
VGG16 | 2340 | 2650 | 2800
ResNet50 | 1164 | 1165 | 1284
GoogLeNet | 265 | 312 | 340
MobileNetV1_SSD | 199 | 202 | 206

Update:
Raspberry Pi 4 - 2 GB with Raspberry Pi 32-bit OS
With the regular 32-bit Raspbian OS the results are the same, except for VGG16; this time Tengine gives some improvement.
Threads: 4 - Clock: 1500 MHz - Time in ms

Model | no Tengine | Tengine accelerator on
VGG16 | 2960 | 2620
ResNet50 | 1600 | 1727
GoogLeNet | 345 | 415
MobileNetV1_SSD | 266 | 276
Steps to reproduce

Install Tengine on Raspberry Pi 4 with 64-bit OS

sudo apt-get install git cmake
sudo apt-get install libprotobuf-dev protobuf-compiler libboost-all-dev 
sudo apt-get install libgoogle-glog-dev libopenblas-dev

wget -O tengine.zip https://github.com/OAID/Tengine/archive/tengine-opencv.zip
unzip tengine.zip
mv Tengine-tengine-opencv tengine

cd tengine
mkdir build
cd build

cmake -DCONFIG_ARCH_ARM64=ON \
      -DBUILT_IN_OPENCV=ON \
      -DOPENCV_3P_LIB_INSTALL_PATH=/home/pi/tengine/core/lib \
      ..

make -j4
sudo make install

CMake flags for OpenCV on Raspberry Pi 4 with 64-bit OS, with Tengine
Without Tengine the flags are the same, except that -D OPENCV_LIBTENGINE_ROOT_DIR=~/tengine/core and -D WITH_TENGINE=ON are omitted.

cmake -D CMAKE_BUILD_TYPE=RELEASE \
        -D CMAKE_INSTALL_PREFIX=/usr/local \
        -D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
        -D OPENCV_LIBTENGINE_ROOT_DIR=~/tengine/core \
        -D ENABLE_NEON=ON \
        -D BUILD_TIFF=ON \
        -D WITH_FFMPEG=ON \
        -D WITH_GSTREAMER=ON \
        -D WITH_TBB=ON \
        -D BUILD_TBB=ON \
        -D BUILD_TESTS=OFF \
        -D WITH_EIGEN=OFF \
        -D WITH_V4L=ON \
        -D WITH_LIBV4L=ON \
        -D WITH_TENGINE=ON \
        -D OPENCV_ENABLE_NONFREE=ON \
        -D INSTALL_C_EXAMPLES=OFF \
        -D INSTALL_PYTHON_EXAMPLES=OFF \
        -D BUILD_NEW_PYTHON_SUPPORT=ON \
        -D BUILD_opencv_python3=TRUE \
        -D OPENCV_GENERATE_PKGCONFIG=ON \
        -D BUILD_EXAMPLES=OFF ..
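
After configuring, the build and install steps are presumably the same as for Tengine itself (a minimal sketch; these commands are not spelled out in the original report):

make -j4
sudo make install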
Issue submission checklist
  • [x] I report the issue, it's not a question
  • [x] I checked the problem with documentation, FAQ, open issues, answers.opencv.org, Stack Overflow, etc. and have not found a solution
  • [x] I updated to the latest OpenCV version and the issue is still there
  • [x] There is reproducer code and related data files: videos, images, onnx, etc
vpisarev (Contributor) commented:

@liqi-c, could you please look at the issue?

liqi-c (Contributor) commented Aug 21, 2020

@vpisarev OK.
@Qengineering, can you try the auto-built Tengine in OpenCV? That way you only need to configure with -DWITH_TENGINE=ON, and Tengine will be downloaded and built automatically.
The problem may be that the models did not actually run in Tengine, or that they ran on Tengine's reference operators.
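
A minimal configure sketch of this suggestion (~/opencv/build is a placeholder for the OpenCV build directory; the remaining flags from the configuration above can be kept):

cd ~/opencv/build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D WITH_TENGINE=ON \
      ..
make -j4
sudo make install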

liqi-c (Contributor) commented Aug 21, 2020

Or can you add the Tengine configuration log here?
Set PROF_TIME=1 when you run on your board, and send me the run log. Thanks.
@Qengineering
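
A sketch of such a run (./dnn_benchmark is a placeholder for whatever test program runs the models; it is assumed here that PROF_TIME is read from the environment, as in the output posted below):

export PROF_TIME=1                         # ask Tengine to print per-operator timing stats
./dnn_benchmark 2>&1 | tee tengine_prof.log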

Qengineering (Author) commented Aug 23, 2020

As suggested by @liqi-c, I built OpenCV with only the -DWITH_TENGINE=ON flag, to force an automatic download and build of the Tengine accelerator. The results are the same.

Raspberry Pi 4 - 2 GB with Raspberry Pi 64-bit OS
Threads: 4 - Clock: 1500 MHz - Time in ms
Times are averaged over 10 runs, after 2 dummy (warm-up) runs.

By the way, I didn't average the previous outcomes; I just took the lowest times of a few runs.
Calling ResNet50 net.forward() in a loop apparently saves a lot of time.
The other models have more or less the same execution times.

Model | no Tengine | Tengine accelerator on
VGG16 | 2318 | 2606
ResNet50 | 586 | 564
GoogLeNet | 244 | 284
MobileNetV1_SSD | 201 | 198

Below is the debug output of a run with the VGG16 model, with PROF_TIME=1 set.

==== graph0x0:1804289383: time stats by operator: ====
total time: 15184 us, repeat 1
PER RUN: time 15184 us on 173.41 Mfops, RATE: 11420.46 Mfops
0: Convolution used 15184 us (100.00%)


  0 [ 100.00% : 15.184 ms ] Node_idx:    3  Convolution		1x3x224x224 -> 1x64x224x224	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 173.41	Rate: 11420

total accumulated time: 15184 us. roughly [15184] us per run

==== graph0x0:846930886: time stats by operator: ====
total time: 256337 us, repeat 1
PER RUN: time 256337 us on 3699.38 Mfops, RATE: 14431.69 Mfops
0: Convolution used 256337 us (100.00%)


  0 [ 100.00% : 256.337 ms ] Node_idx:    3  Convolution		1x64x224x224 -> 1x64x224x224	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 14432

total accumulated time: 256337 us. roughly [256337] us per run

==== graph0x0:1681692777: time stats by operator: ====
total time: 78830 us, repeat 1
PER RUN: time 78830 us on 1849.69 Mfops, RATE: 23464.27 Mfops
0: Convolution used 78830 us (100.00%)


  0 [ 100.00% : 78.830 ms ] Node_idx:    3  Convolution		1x64x112x112 -> 1x128x112x112	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 23464

total accumulated time: 78830 us. roughly [78830] us per run

==== graph0x0:1714636915: time stats by operator: ====
total time: 209553 us, repeat 1
PER RUN: time 209553 us on 3699.38 Mfops, RATE: 17653.65 Mfops
0: Convolution used 209553 us (100.00%)


  0 [ 100.00% : 209.553 ms ] Node_idx:    3  Convolution		1x128x112x112 -> 1x128x112x112	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 17654

total accumulated time: 209553 us. roughly [209553] us per run

==== graph0x0:1957747793: time stats by operator: ====
total time: 90314 us, repeat 1
PER RUN: time 90314 us on 1849.69 Mfops, RATE: 20480.63 Mfops
0: Convolution used 90314 us (100.00%)


  0 [ 100.00% : 90.314 ms ] Node_idx:    3  Convolution		1x128x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 20481

total accumulated time: 90314 us. roughly [90314] us per run

==== graph0x0:424238335: time stats by operator: ====
total time: 198447 us, repeat 1
PER RUN: time 198447 us on 3699.38 Mfops, RATE: 18641.63 Mfops
0: Convolution used 198447 us (100.00%)


  0 [ 100.00% : 198.447 ms ] Node_idx:    3  Convolution		1x256x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 18642

total accumulated time: 198447 us. roughly [198447] us per run

==== graph0x0:719885386: time stats by operator: ====
total time: 192132 us, repeat 1
PER RUN: time 192132 us on 3699.38 Mfops, RATE: 19254.35 Mfops
0: Convolution used 192132 us (100.00%)


  0 [ 100.00% : 192.132 ms ] Node_idx:    3  Convolution		1x256x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 19254

total accumulated time: 192132 us. roughly [192132] us per run

==== graph0x0:1649760492: time stats by operator: ====
total time: 92359 us, repeat 1
PER RUN: time 92359 us on 1849.69 Mfops, RATE: 20027.16 Mfops
0: Convolution used 92359 us (100.00%)


  0 [ 100.00% : 92.359 ms ] Node_idx:    3  Convolution		1x256x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 20027

total accumulated time: 92359 us. roughly [92359] us per run

==== graph0x0:596516649: time stats by operator: ====
total time: 461690 us, repeat 1
PER RUN: time 461690 us on 3699.38 Mfops, RATE: 8012.68 Mfops
0: Convolution used 461690 us (100.00%)


  0 [ 100.00% : 461.690 ms ] Node_idx:    3  Convolution		1x512x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 8013

total accumulated time: 461690 us. roughly [461690] us per run

==== graph0x0:1189641421: time stats by operator: ====
total time: 498997 us, repeat 1
PER RUN: time 498997 us on 3699.38 Mfops, RATE: 7413.62 Mfops
0: Convolution used 498997 us (100.00%)


  0 [ 100.00% : 498.997 ms ] Node_idx:    3  Convolution		1x512x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 7414

total accumulated time: 498997 us. roughly [498997] us per run

==== graph0x0:1025202362: time stats by operator: ====
total time: 133838 us, repeat 1
PER RUN: time 133838 us on 924.84 Mfops, RATE: 6910.18 Mfops
0: Convolution used 133838 us (100.00%)


  0 [ 100.00% : 133.838 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6910

total accumulated time: 133838 us. roughly [133838] us per run

==== graph0x0:1350490027: time stats by operator: ====
total time: 135337 us, repeat 1
PER RUN: time 135337 us on 924.84 Mfops, RATE: 6833.64 Mfops
0: Convolution used 135337 us (100.00%)


  0 [ 100.00% : 135.337 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6834

total accumulated time: 135337 us. roughly [135337] us per run

==== graph0x0:783368690: time stats by operator: ====
total time: 135179 us, repeat 1
PER RUN: time 135179 us on 924.84 Mfops, RATE: 6841.62 Mfops
0: Convolution used 135179 us (100.00%)


  0 [ 100.00% : 135.179 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6842

total accumulated time: 135179 us. roughly [135179] us per run
669:mosque: 0.0295
Time : 2863 mSec

liqi-c (Contributor) commented Aug 24, 2020

@Qengineering Thanks very much.
Your compilation and usage look correct; I don't know why there is a performance gap. Maybe you can look forward to our next update, which is already in development.
In addition, you can run with export TENGINE_CPU_LIST=0,1,2,3. Try it one more time and see the performance.
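
For example (a sketch; the benchmark binary name is again a placeholder):

export TENGINE_CPU_LIST=0,1,2,3   # pin Tengine worker threads to all four cores
./dnn_benchmark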

Qengineering (Author) commented:

I'm sorry to say that the environment variable TENGINE_CPU_LIST=0,1,2,3 didn't bring any improvement; all timings are the same. Please see the screen dump below.
[screenshot of the timing results]

cricket1 commented:

@Qengineering @liqi-c Has the performance changed in the most recent update?

Qengineering (Author) commented:

@liqi-c I want to do the same test again with OpenCV 4.5 and your new Tengine-Lite. Is setting -DWITH_TENGINE=ON still the best way to incorporate Tengine-Lite in OpenCV?

qwersem commented Jun 21, 2021

@liqi-c @Qengineering Hello, were you able to solve this problem?

qwersem commented Jun 22, 2021

I solved my problem by building Tengine with the option -DTENGINE_OPENMP=ON. See the sketch below.
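
A minimal rebuild sketch of that fix (-DTENGINE_OPENMP=ON is taken from the comment above; it is assumed here that the build directory from the earlier steps is reused and that the previously chosen options are still in the CMake cache):

cd ~/tengine/build
cmake -DTENGINE_OPENMP=ON ..   # re-configure with OpenMP enabled; cached flags are kept
make -j4
sudo make install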

asmorkalov (Contributor) commented:
Tengine support was dropped due to lack of support and performance issues. Closed.
