
Tengine slow on Raspberry Pi 4. #18002

Closed
Qengineering opened this issue Jul 31, 2020 · 11 comments
Labels: category: dnn, category: 3rdparty, optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc)

Comments


Qengineering commented Jul 31, 2020

System information (version)
  • OpenCV => 4.4
  • Operating System / Platform => Raspberry Pi 4 with Raspberry Pi 64-bit OS
  • Compiler => GCC 8.3.0 (aarch64-linux-gnu)
Detailed description

Tengine was installed on the RPi 4 as described below.
OpenCV 4.4 was installed with the CMake flags listed under "Steps to reproduce".

Two different versions were built, one with and one without Tengine.
Several Caffe models were run with the OpenCV dnn module; they all work like a charm.
However, the execution times with Tengine are longer than without the accelerator.
One would expect the opposite.

Raspberry Pi 4 - 2 GB with Raspberry Pi 64-bit OS
Threads: 4 - Clock: 1500 MHz - Time in ms
Hinted by issue #17562, I tried another test without OpenMP. The results were even worse.

Model | no Tengine | Tengine accelerator on | Tengine on, OpenMP off
VGG16 | 2340 | 2650 | 2800
ResNet50 | 1164 | 1165 | 1284
GoogLeNet | 265 | 312 | 340
MobileNetV1_SSD | 199 | 202 | 206

Update:
Raspberry Pi 4 - 2 GB with Raspberry Pi 32-bit OS
With the regular 32-bit Raspbian OS the results are the same, except for VGG16; this time Tengine gives some improvement.
Threads: 4 - Clock: 1500 MHz - Time in ms

Model | no Tengine | Tengine accelerator on
VGG16 | 2960 | 2620
ResNet50 | 1600 | 1727
GoogLeNet | 345 | 415
MobileNetV1_SSD | 266 | 276
Steps to reproduce

Install Tengine on Raspberry Pi 4 with 64-bit OS

sudo apt-get install git cmake
sudo apt-get install libprotobuf-dev protobuf-compiler libboost-all-dev 
sudo apt-get install libgoogle-glog-dev libopenblas-dev

wget -O tengine.zip https://github.com/OAID/Tengine/archive/tengine-opencv.zip
unzip tengine.zip
mv Tengine-tengine-opencv tengine

cd tengine
mkdir build
cd build

cmake -DCONFIG_ARCH_ARM64=ON \
      -DBUILT_IN_OPENCV=ON \
      -DOPENCV_3P_LIB_INSTALL_PATH=/home/pi/tengine/core/lib \
      ..

make -j4
sudo make install

CMake flags for OpenCV on Raspberry Pi 4 with 64-bit OS, with Tengine
Without Tengine the flags are the same, except that -D OPENCV_LIBTENGINE_ROOT_DIR=~/tengine/core and -D WITH_TENGINE=ON are omitted.

cmake -D CMAKE_BUILD_TYPE=RELEASE \
        -D CMAKE_INSTALL_PREFIX=/usr/local \
        -D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
        -D OPENCV_LIBTENGINE_ROOT_DIR=~/tengine/core \
        -D ENABLE_NEON=ON \
        -D BUILD_TIFF=ON \
        -D WITH_FFMPEG=ON \
        -D WITH_GSTREAMER=ON \
        -D WITH_TBB=ON \
        -D BUILD_TBB=ON \
        -D BUILD_TESTS=OFF \
        -D WITH_EIGEN=OFF \
        -D WITH_V4L=ON \
        -D WITH_LIBV4L=ON \
        -D WITH_TENGINE=ON \
        -D OPENCV_ENABLE_NONFREE=ON \
        -D INSTALL_C_EXAMPLES=OFF \
        -D INSTALL_PYTHON_EXAMPLES=OFF \
        -D BUILD_NEW_PYTHON_SUPPORT=ON \
        -D BUILD_opencv_python3=TRUE \
        -D OPENCV_GENERATE_PKGCONFIG=ON \
        -D BUILD_EXAMPLES=OFF ..
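
After configuring, the build and install steps are presumably the same as for Tengine itself (a minimal sketch; these commands are not spelled out in the original report):

make -j4
sudo make install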
Issue submission checklist
  • [x] I report the issue, it's not a question
  • [x] I checked the problem with documentation, FAQ, open issues, answers.opencv.org, Stack Overflow, etc. and have not found a solution
  • [x] I updated to the latest OpenCV version and the issue is still there
  • [x] There is reproducer code and related data files: videos, images, onnx, etc
vpisarev (Contributor) commented:

@liqi-c, could you please look at the issue?

liqi-c (Contributor) commented Aug 21, 2020

@vpisarev OK.
@Qengineering, can you try the auto-built Tengine in OpenCV? That way you only need to configure with -DWITH_TENGINE=ON, and Tengine will be downloaded and built automatically.
The problem may be that the models did not actually run in Tengine, or that they ran on Tengine's reference operators.
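
A minimal configure sketch of this suggestion (~/opencv/build is a placeholder for the OpenCV build directory; the remaining flags from the configuration above can be kept):

cd ~/opencv/build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D WITH_TENGINE=ON \
      ..
make -j4
sudo make install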

liqi-c (Contributor) commented Aug 21, 2020

Or can you add the Tengine configuration log here?
Set PROF_TIME=1 when you run on your board, and send me the run log. Thanks.
@Qengineering
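
A sketch of such a run (./dnn_benchmark is a placeholder for whatever test program runs the models; it is assumed here that PROF_TIME is read from the environment, as in the output posted below):

export PROF_TIME=1                         # ask Tengine to print per-operator timing stats
./dnn_benchmark 2>&1 | tee tengine_prof.log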

Qengineering (Author) commented Aug 23, 2020

As suggested by @liqi-c, I built OpenCV with only the -DWITH_TENGINE=ON flag, to force an automatic download and build of the Tengine accelerator. The results are the same.

Raspberry Pi 4 - 2 GB with Raspberry Pi 64-bit OS
Threads: 4 - Clock: 1500 MHz - Time in ms
Times are averaged over 10 runs, after 2 dummy (warm-up) runs.

By the way, I didn't average the previous outcomes; I just took the lowest times of a few runs.
Calling ResNet50 net.forward() in a loop apparently saves a lot of time.
The other models have more or less the same execution times.

Model | no Tengine | Tengine accelerator on
VGG16 | 2318 | 2606
ResNet50 | 586 | 564
GoogLeNet | 244 | 284
MobileNetV1_SSD | 201 | 198

Below is the debug output of a run with the VGG16 model, with PROF_TIME=1 set.

==== graph0x0:1804289383: time stats by operator: ====
total time: 15184 us, repeat 1
PER RUN: time 15184 us on 173.41 Mfops, RATE: 11420.46 Mfops
0: Convolution used 15184 us (100.00%)


  0 [ 100.00% : 15.184 ms ] Node_idx:    3  Convolution		1x3x224x224 -> 1x64x224x224	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 173.41	Rate: 11420

total accumulated time: 15184 us. roughly [15184] us per run

==== graph0x0:846930886: time stats by operator: ====
total time: 256337 us, repeat 1
PER RUN: time 256337 us on 3699.38 Mfops, RATE: 14431.69 Mfops
0: Convolution used 256337 us (100.00%)


  0 [ 100.00% : 256.337 ms ] Node_idx:    3  Convolution		1x64x224x224 -> 1x64x224x224	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 14432

total accumulated time: 256337 us. roughly [256337] us per run

==== graph0x0:1681692777: time stats by operator: ====
total time: 78830 us, repeat 1
PER RUN: time 78830 us on 1849.69 Mfops, RATE: 23464.27 Mfops
0: Convolution used 78830 us (100.00%)


  0 [ 100.00% : 78.830 ms ] Node_idx:    3  Convolution		1x64x112x112 -> 1x128x112x112	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 23464

total accumulated time: 78830 us. roughly [78830] us per run

==== graph0x0:1714636915: time stats by operator: ====
total time: 209553 us, repeat 1
PER RUN: time 209553 us on 3699.38 Mfops, RATE: 17653.65 Mfops
0: Convolution used 209553 us (100.00%)


  0 [ 100.00% : 209.553 ms ] Node_idx:    3  Convolution		1x128x112x112 -> 1x128x112x112	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 17654

total accumulated time: 209553 us. roughly [209553] us per run

==== graph0x0:1957747793: time stats by operator: ====
total time: 90314 us, repeat 1
PER RUN: time 90314 us on 1849.69 Mfops, RATE: 20480.63 Mfops
0: Convolution used 90314 us (100.00%)


  0 [ 100.00% : 90.314 ms ] Node_idx:    3  Convolution		1x128x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 20481

total accumulated time: 90314 us. roughly [90314] us per run

==== graph0x0:424238335: time stats by operator: ====
total time: 198447 us, repeat 1
PER RUN: time 198447 us on 3699.38 Mfops, RATE: 18641.63 Mfops
0: Convolution used 198447 us (100.00%)


  0 [ 100.00% : 198.447 ms ] Node_idx:    3  Convolution		1x256x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 18642

total accumulated time: 198447 us. roughly [198447] us per run

==== graph0x0:719885386: time stats by operator: ====
total time: 192132 us, repeat 1
PER RUN: time 192132 us on 3699.38 Mfops, RATE: 19254.35 Mfops
0: Convolution used 192132 us (100.00%)


  0 [ 100.00% : 192.132 ms ] Node_idx:    3  Convolution		1x256x56x56 -> 1x256x56x56	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 19254

total accumulated time: 192132 us. roughly [192132] us per run

==== graph0x0:1649760492: time stats by operator: ====
total time: 92359 us, repeat 1
PER RUN: time 92359 us on 1849.69 Mfops, RATE: 20027.16 Mfops
0: Convolution used 92359 us (100.00%)


  0 [ 100.00% : 92.359 ms ] Node_idx:    3  Convolution		1x256x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 1849.69	Rate: 20027

total accumulated time: 92359 us. roughly [92359] us per run

==== graph0x0:596516649: time stats by operator: ====
total time: 461690 us, repeat 1
PER RUN: time 461690 us on 3699.38 Mfops, RATE: 8012.68 Mfops
0: Convolution used 461690 us (100.00%)


  0 [ 100.00% : 461.690 ms ] Node_idx:    3  Convolution		1x512x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 8013

total accumulated time: 461690 us. roughly [461690] us per run

==== graph0x0:1189641421: time stats by operator: ====
total time: 498997 us, repeat 1
PER RUN: time 498997 us on 3699.38 Mfops, RATE: 7413.62 Mfops
0: Convolution used 498997 us (100.00%)


  0 [ 100.00% : 498.997 ms ] Node_idx:    3  Convolution		1x512x28x28 -> 1x512x28x28	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 3699.38	Rate: 7414

total accumulated time: 498997 us. roughly [498997] us per run

==== graph0x0:1025202362: time stats by operator: ====
total time: 133838 us, repeat 1
PER RUN: time 133838 us on 924.84 Mfops, RATE: 6910.18 Mfops
0: Convolution used 133838 us (100.00%)


  0 [ 100.00% : 133.838 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6910

total accumulated time: 133838 us. roughly [133838] us per run

==== graph0x0:1350490027: time stats by operator: ====
total time: 135337 us, repeat 1
PER RUN: time 135337 us on 924.84 Mfops, RATE: 6833.64 Mfops
0: Convolution used 135337 us (100.00%)


  0 [ 100.00% : 135.337 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6834

total accumulated time: 135337 us. roughly [135337] us per run

==== graph0x0:783368690: time stats by operator: ====
total time: 135179 us, repeat 1
PER RUN: time 135179 us on 924.84 Mfops, RATE: 6841.62 Mfops
0: Convolution used 135179 us (100.00%)


  0 [ 100.00% : 135.179 ms ] Node_idx:    3  Convolution		1x512x14x14 -> 1x512x14x14	K: 3x3 | S: 1x1 | P: 1 1 1 1        Mfops: 924.84	Rate: 6842

total accumulated time: 135179 us. roughly [135179] us per run
669:mosque: 0.0295
Time : 2863 mSec

liqi-c (Contributor) commented Aug 24, 2020

@Qengineering Thanks very much.
Your compilation and usage look correct; I don't know why there is a performance gap. Maybe you can look forward to our next update, which is already in development.
In addition, you can run with export TENGINE_CPU_LIST=0,1,2,3. Try it one more time and see the performance.
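
For example (a sketch; the benchmark binary name is again a placeholder):

export TENGINE_CPU_LIST=0,1,2,3   # pin Tengine worker threads to all four cores
./dnn_benchmark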

Qengineering (Author) commented:

I'm sorry to say that the environment variable TENGINE_CPU_LIST=0,1,2,3 didn't bring any improvement; all timings are the same. Please see the screen dump below.
[screenshot of the timing results]

cricket1 commented:

@Qengineering @liqi-c Has the performance changed in the most recent update?

Qengineering (Author) commented:

@liqi-c I want to do the same test again with OpenCV 4.5 and your new Tengine-Lite. Is setting -DWITH_TENGINE=ON still the best way to incorporate Tengine-Lite in OpenCV?

qwersem commented Jun 21, 2021

@liqi-c @Qengineering Hello, were you able to solve this problem?

qwersem commented Jun 22, 2021

I solved my problem by building Tengine with the option -DTENGINE_OPENMP=ON. See the sketch below.
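
A minimal rebuild sketch of that fix (-DTENGINE_OPENMP=ON is taken from the comment above; it is assumed here that the build directory from the earlier steps is reused and that the previously chosen options are still in the CMake cache):

cd ~/tengine/build
cmake -DTENGINE_OPENMP=ON ..   # re-configure with OpenMP enabled; cached flags are kept
make -j4
sudo make install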

asmorkalov (Contributor) commented:
Tengine support was dropped due to lack of support and performance issues. Closed.
